MarkItDownNet

MarkItDownNet is a lightweight .NET library that converts PDFs and images into normalised Markdown with positional metadata. For each processed document the library returns:

Canonical Markdown text
Page information (original width and height)
Line-level bounding boxes
Word-level bounding boxes

Bounding boxes use [x,y,w,h] normalised to [0..1] with a top-left origin.
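
To illustrate the coordinate convention, the sketch below converts a pixel-space box into normalised [x,y,w,h] values and back, using the page width and height. The helper names (Normalise, ToPixels) and the tuple layout are hypothetical and are not part of the MarkItDownNet API.

```csharp
using System;

// Hypothetical helpers illustrating the [x,y,w,h] convention; not library API.
// Pixel boxes and page sizes share a top-left origin, with y growing downward.
static (double X, double Y, double W, double H) Normalise(
    double px, double py, double pw, double ph, double pageWidth, double pageHeight)
{
    if (pageWidth <= 0 || pageHeight <= 0)
        throw new ArgumentOutOfRangeException(nameof(pageWidth), "Page size must be positive.");
    // Horizontal values are divided by the page width, vertical ones by the page height.
    return (px / pageWidth, py / pageHeight, pw / pageWidth, ph / pageHeight);
}

// Inverse mapping: scale a normalised box back to pixels for a given page size.
static (double X, double Y, double W, double H) ToPixels(
    (double X, double Y, double W, double H) box, double pageWidth, double pageHeight)
    => (box.X * pageWidth, box.Y * pageHeight, box.W * pageWidth, box.H * pageHeight);

// A 200x25 px box at (100, 50) on a 1000x500 px page.
var box = Normalise(100, 50, 200, 25, 1000, 500);
var pixels = ToPixels(box, 1000, 500);
Console.WriteLine($"{box.X}, {box.Y}, {box.W}, {box.H}");
Console.WriteLine($"{pixels.X}, {pixels.Y}, {pixels.W}, {pixels.H}");
```

Because the values are relative to the page size, the same box can be projected onto a rasterised page at any DPI by multiplying by that image's width and height.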

FUNSD dataset comparison

A description of the FUNSD dataset comparison tool, the bounding box difference report, and instructions for running it are available in docs/funsd_comparison.md.

Pipeline

PDF -> PdfPig text extraction -> (optional) PDFtoImage rasterisation -> OCR
Image -> OCR
                 |
                 v
            Markdown (Markdig)

If a PDF yields too few native words (fewer than MinimumNativeWordThreshold), its pages are rasterised with PDFtoImage and OCRed with the selected engine.
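
The fallback rule can be sketched roughly as follows. This is not the library's implementation: the function name ChooseWords and the ocr delegate are hypothetical stand-ins, and the strict "fewer than" comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the OCR fallback decision; names are illustrative only.
// nativeWords: words extracted from the PDF text layer.
// ocr: stand-in for "rasterise the pages and run the selected OCR engine".
static IReadOnlyList<string> ChooseWords(
    IReadOnlyList<string> nativeWords,
    int minimumNativeWordThreshold,
    Func<IReadOnlyList<string>> ocr)
{
    // Assumption: OCR is used when the native count is strictly below the threshold.
    return nativeWords.Count < minimumNativeWordThreshold ? ocr() : nativeWords;
}

// A scanned PDF with no text layer falls back to OCR; a digital PDF does not.
var scanned = ChooseWords(Array.Empty<string>(), 5, () => new[] { "from", "ocr" });
var digital = ChooseWords(new[] { "a", "b", "c", "d", "e", "f" }, 5, () => new[] { "from", "ocr" });
Console.WriteLine($"{scanned.Count} {digital.Count}"); // prints "2 6"
```

Passing the OCR step as a delegate keeps the expensive rasterisation lazy: it only runs when the native text layer is judged insufficient.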

Installing .NET

This repository does not rely on the system dotnet. Install the SDK locally using the provided script:

chmod +x ./dotnet-install.sh
./dotnet-install.sh --channel 9.0
~/.dotnet/dotnet --version

Build and Test

All build and test commands must use the locally installed dotnet:

~/.dotnet/dotnet build
~/.dotnet/dotnet test

Tesseract and Leptonica

The library bundles the minimal native dependencies for Linux x64 in runtimes/linux-x64/native and does not require Tesseract or Leptonica to be installed on the system. The Tesseract .NET binding is provided through a local NuGet package (local-packages/Tesseract.5.2.0.nupkg) derived from the charlesw/tesseract repository.

To run OCR you only need to supply the tessdata language files. On Ubuntu 24.04 it is enough to install the packages for the desired languages, for example:

sudo apt-get install -y tesseract-ocr-eng tesseract-ocr-ita tesseract-ocr-osd

Then set OcrDataPath in the options to point to the folder that contains the language data (e.g. /usr/share/tesseract-ocr/5/tessdata).

Usage

var options = new MarkItDownOptions
{
    OcrDataPath = "/usr/share/tesseract-ocr/5/tessdata",
    OcrEngine = OcrEngine.Tesseract, // or OcrEngine.RapidOcr
    OcrLanguage = OcrLanguage.English,
    PdfRasterDpi = 300
};
var converter = new MarkItDownConverter(options);
var result = await converter.ConvertAsync("sample.pdf", "application/pdf");
Console.WriteLine(result.Markdown);

Configuration

MarkItDownOptions exposes run‑time tunables:

OcrEngine – OCR engine to use (Tesseract or RapidOcr)
OcrDataPath – location of Tesseract language data (TESSDATA_PREFIX)
OcrLanguage – language passed to the OCR engine (English, Italian, Latin)
PdfRasterDpi – DPI for rasterising PDFs during OCR fallback
MinimumNativeWordThreshold – minimum number of native words a PDF must yield; below this, the OCR fallback is triggered
NormalizeMarkdown – toggle Markdig normalisation

OCR engine comparison

The sample dataset/training/busta_paga_internet.jpeg was processed with both OCR backends.

Engine	Time (s)	Δ vs Tesseract	Characters	Words	CER vs Tesseract
Tesseract	1.12	–	1181	199	–
RapidOCR	3.68	+229%	1376	177	0.59

Character and word counts are derived from the respective Markdown outputs, and the character error rate (CER) is the normalised Levenshtein distance between the Tesseract and RapidOCR text. On this sample RapidOCR took about 3.3× as long as Tesseract (3.68 s vs 1.12 s, +229%). Timings were collected on Ubuntu 24.04 using Tesseract 5.3.4 and the RapidOCR .NET runtime (BustaPagaNet).
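
For readers unfamiliar with the metric, here is a minimal sketch of a CER computed as a normalised Levenshtein distance. The README does not say which length is used as the normaliser; dividing by the longer string is an assumption made here so that the result stays in [0, 1]. The function names are illustrative and not taken from the repository.

```csharp
using System;

// Classic dynamic-programming Levenshtein distance
// (insertion, deletion and substitution each cost 1), using two rolling rows.
static int Levenshtein(string a, string b)
{
    var prev = new int[b.Length + 1];
    var curr = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        (prev, curr) = (curr, prev); // the row just filled becomes the previous row
    }
    return prev[b.Length];
}

// Assumption: normalise by the longer string so the value lies in [0, 1].
static double Cer(string reference, string hypothesis)
{
    int longest = Math.Max(reference.Length, hypothesis.Length);
    return longest == 0 ? 0.0 : (double)Levenshtein(reference, hypothesis) / longest;
}

int distance = Levenshtein("kitten", "sitting");
double cer = Cer("kitten", "sitting");
Console.WriteLine(distance); // prints 3
Console.WriteLine(cer);
```

Under this definition a CER of 0.59 means that edits amounting to roughly 59% of the longer output's length separate the two engines' text, so the value measures disagreement between engines rather than accuracy against a ground truth.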

Logging

Logging uses Serilog. The library reads standard Serilog settings (see src/MarkItDownNet/appsettings.json for an example) and supports console and rolling file sinks. Set the Serilog__MinimumLevel environment variable to control verbosity.

Testing assets

Tests create a small PDF on the fly, so extraction is verified without external files. OCR-based tests are not executed by default because they require Tesseract data files.

Evaluation

A comparison with Docling ground truth on sample PDFs and TIFFs is available in the Docling comparison report.

Docling's image samples are distributed as TIFF files. The comparison tool converts them to JPEG via BitMiracle.LibTiff.NET and SkiaSharp before passing them to MarkItDownNet:

~/.dotnet/dotnet run --project tools/DoclingComparison/DoclingComparison.csproj docling/tests/data/tiff/2206.01062.tif

Metric	Docling	MarkItDownNet	Difference
Word count	17 344	17 803	+2.65%
Word match rate	100%	99.37%	−0.63%
Markdown similarity	–	73%	–
BBox mean absolute error	0%	10.74%	+10.74%

On these large arXiv samples, MarkItDownNet matched 99.37% of Docling's words, with a 10.74% mean absolute error in bounding boxes.

Docling comparison

The tests project verifies Markdown and bounding box accuracy against the Docling ground truth for ocr_test.pdf.

Item	Docling	MarkItDownNet	Diff %
Markdown	`Docling bundles PDF document conversion to JSON and Markdown in an easy self contained package`	same	0%
BBox X	0.1171	0.1171	0%
BBox Y	0.0915	0.0915	0%
BBox W	0.7312	0.7312	0%
BBox H	0.0902	0.0902	0%

Bounding boxes use normalised [x,y,w,h] coordinates. The test asserts equality within a two decimal tolerance.
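
One plausible reading of "two decimal tolerance" is that both values are rounded to two decimal places before being compared, which is how xUnit's Assert.Equal(expected, actual, precision) overload behaves. The sketch below assumes that reading; the helper name is hypothetical and the actual test code may differ.

```csharp
using System;

// Assumption: "two decimal tolerance" means equality after rounding to 2 places.
static bool EqualToTwoDecimals(double expected, double actual)
    => Math.Round(expected, 2) == Math.Round(actual, 2);

// 0.1171 and 0.1204 both round to 0.12, while 0.1304 rounds to 0.13.
bool close = EqualToTwoDecimals(0.1171, 0.1204);
bool far = EqualToTwoDecimals(0.1171, 0.1304);
Console.WriteLine($"{close} {far}"); // prints "True False"
```

On a page 1000 px wide, a two-decimal comparison treats boxes that differ by a few pixels as equal, which absorbs small floating-point differences.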

Docling data conversion timings

The following timings were captured while converting the PDF, TIFF, and PNG samples from Docling's tests/data directory. The two columns give the time in milliseconds to produce the Markdown text and to serialise the bounding boxes, respectively.

File	Type	Markdown ms	BBox ms
2305.03393v1-pg9-img.png	png	1537.34	52.91
2203.01017v2.pdf	pdf	1147.85	44.90
2206.01062.pdf	pdf	654.79	20.40
2305.03393v1-pg9.pdf	pdf	85.03	0.87
2305.03393v1.pdf	pdf	287.15	16.69
amt_handbook_sample.pdf	pdf	136.57	1.46
code_and_formula.pdf	pdf	49.39	1.85
multi_page.pdf	pdf	63.96	2.85
picture_classification.pdf	pdf	20.78	1.19
redp5110_sampled.pdf	pdf	302.47	12.68
right_to_left_01.pdf	pdf	32.79	0.54
right_to_left_02.pdf	pdf	20.31	0.49
right_to_left_03.pdf	pdf	34.83	0.36
2206.01062.tif	tiff	4007.83	1.80

Type	Avg Markdown ms	Avg BBox ms
png	1537.34	52.91
pdf	236.33	8.69
tiff	4007.83	1.80
Overall	598.65	11.36
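
The summary rows are plain arithmetic means of the per-file rows, and the overall figure averages all 14 files rather than the three per-type means. The sketch below recomputes them from the values copied out of the per-file table; it is an illustration and not the script that produced the report.

```csharp
using System;
using System.Linq;

// Rows copied from the per-file table: (type, Markdown ms, BBox ms).
var rows = new (string Type, double MarkdownMs, double BBoxMs)[]
{
    ("png", 1537.34, 52.91),
    ("pdf", 1147.85, 44.90), ("pdf", 654.79, 20.40), ("pdf", 85.03, 0.87),
    ("pdf", 287.15, 16.69),  ("pdf", 136.57, 1.46),  ("pdf", 49.39, 1.85),
    ("pdf", 63.96, 2.85),    ("pdf", 20.78, 1.19),   ("pdf", 302.47, 12.68),
    ("pdf", 32.79, 0.54),    ("pdf", 20.31, 0.49),   ("pdf", 34.83, 0.36),
    ("tiff", 4007.83, 1.80),
};

// Per-type means, in first-seen order (png, pdf, tiff).
foreach (var g in rows.GroupBy(r => r.Type))
    Console.WriteLine($"{g.Key}\t{g.Average(r => r.MarkdownMs):F2}\t{g.Average(r => r.BBoxMs):F2}");

// The overall mean is taken over every file, so the 12 PDFs dominate it.
double overallMarkdown = rows.Average(r => r.MarkdownMs);
double overallBBox = rows.Average(r => r.BBoxMs);
Console.WriteLine($"Overall\t{overallMarkdown:F2}\t{overallBBox:F2}");
```

Since the PNG and TIFF averages each come from a single file, they are best read as individual measurements rather than representative means.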

Comparison with markitdown timings

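The markitdown project reports Docling dataset timings in seconds. Comparing the published averages shows that MarkItDownNet processes the PDF and PNG samples faster and serialises bounding boxes faster for every type, while its Markdown conversion of the TIFF sample is slower: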

Type	markitdown MD s	markitdown BBox s	MarkItDownNet MD s	MarkItDownNet BBox s
pdf	3.29	5.14	0.24	0.01
png	2.51	5.56	1.54	0.05
tiff	2.57	4.19	4.01	0.00
Overall	3.18	5.10	0.60	0.01

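On these samples, MarkItDownNet completed Markdown conversion roughly an order of magnitude faster for PDFs (0.24 s vs 3.29 s) and produced bounding boxes about two orders of magnitude faster than markitdown (0.01 s vs 5.10 s overall). The exception is Markdown conversion of the TIFF sample, where MarkItDownNet was slower (4.01 s vs 2.57 s).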

MD parity bench

MD generation:

dotnet run --project tools/MarkItDownNet.Cli -- mdgen --txt-dir dataset/validation/_ocr/pytesseract-cli --out-dir dataset/validation/_md --engines markitdown,markitdownnet --python-exe python3

Comparison and report:

dotnet run --project tools/MarkItDownNet.Cli -- mdcompare --md-dir dataset/validation/_md --baseline markitdown --out-json artifacts/mdbench/bench-md.json --out-html artifacts/mdbench/bench-md.html --summary-md artifacts/mdbench/summary-md.md

MD parity STRICT (manifest + hash)

Run the full check (manifest, hashes, and metrics) with:

bash tools/scripts/md_parity_strict.sh

License

MIT
