MarkItDownNet is a lightweight .NET library that converts PDFs and images into normalised Markdown with positional metadata. For each processed document the library returns:
- Canonical Markdown text
- Page information (original width and height)
- Line-level bounding boxes
- Word-level bounding boxes
Bounding boxes use [x,y,w,h] normalised to [0..1] with a top-left origin.
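As an illustration of that convention, the sketch below converts a box given in PDF points (PDF's default user space has a bottom-left origin) into normalised [x,y,w,h] with a top-left origin. `NormalizeBox` and its parameters are hypothetical names chosen for this example; this is not the library's API, and the library's own conversion may differ.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper, not part of MarkItDownNet: maps a box in page units
// (bottom-left origin) to normalised [x, y, w, h] with a top-left origin.
double[] NormalizeBox(double left, double bottom, double width, double height,
                      double pageWidth, double pageHeight)
{
    if (pageWidth <= 0 || pageHeight <= 0)
        throw new ArgumentOutOfRangeException(nameof(pageWidth), "Page dimensions must be positive.");

    double x = left / pageWidth;
    // Flip the vertical axis: distance from the top of the page to the top edge of the box.
    double y = (pageHeight - (bottom + height)) / pageHeight;
    double w = width / pageWidth;
    double h = height / pageHeight;
    return new[] { x, y, w, h };
}

// A 612 x 792 pt page with a 306 x 79.2 pt box whose lower-left corner is at (61.2, 633.6).
double[] box = NormalizeBox(61.2, 633.6, 306, 79.2, 612, 792);
Console.WriteLine(string.Join(", ", box.Select(v => v.ToString("F2", CultureInfo.InvariantCulture))));
// prints "0.10, 0.10, 0.50, 0.10"
```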
A description of the FUNSD dataset comparison tool, the bounding box difference report, and the instructions for running it are available in docs/funsd_comparison.md.
PDF -> PdfPig text extraction -> (optional) PDFtoImage rasterisation -> OCR
Image -> OCR
|
v
Markdown (Markdig)
If a PDF yields too few native words the pages are rasterised with PDFtoImage and OCRed with the selected engine.
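The fallback decision can be pictured roughly as follows. This is a minimal sketch under assumptions: the helper name `ShouldFallBackToOcr`, the whitespace-based word count, the whole-document scope, and the example threshold of 5 are all illustrative, and the library's actual check may be implemented differently.

```csharp
using System;

// Hypothetical helper, not the library's implementation: decides whether the
// natively extracted text is too sparse and the OCR path should be used instead.
bool ShouldFallBackToOcr(string nativeText, int minimumNativeWordThreshold)
{
    // Count whitespace-separated tokens as words (an assumption for this sketch).
    int wordCount = nativeText
        .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
        .Length;
    return wordCount < minimumNativeWordThreshold;
}

// The threshold of 5 is an example value, not a documented default.
Console.WriteLine(ShouldFallBackToOcr("Invoice 2024", 5));                 // True
Console.WriteLine(ShouldFallBackToOcr("one two three four five six", 5)); // False
```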
This repository does not rely on the system dotnet. Install the SDK locally using the provided script:
chmod +x ./dotnet-install.sh
./dotnet-install.sh --channel 9.0
~/.dotnet/dotnet --version
All build and test commands must use the locally installed dotnet:
~/.dotnet/dotnet build
~/.dotnet/dotnet test
The library bundles the minimal native dependencies for Linux x64 in runtimes/linux-x64/native and does not require Tesseract or Leptonica to be installed on the system.
The Tesseract .NET binding is provided through a local NuGet package (local-packages/Tesseract.5.2.0.nupkg) derived from the charlesw/tesseract repository.
To run OCR you only need to supply the tessdata files for the languages you use. On Ubuntu 24.04 it is enough to install the packages for the desired languages, for example:
sudo apt-get install -y tesseract-ocr-eng tesseract-ocr-ita tesseract-ocr-osd
Then set OcrDataPath in the options to the folder that contains the language data (e.g. /usr/share/tesseract-ocr/5/tessdata).
var options = new MarkItDownOptions
{
    OcrDataPath = "/usr/share/tesseract-ocr/5/tessdata",
    OcrEngine = OcrEngine.Tesseract, // or OcrEngine.RapidOcr
    OcrLanguage = OcrLanguage.English,
    PdfRasterDpi = 300
};
var converter = new MarkItDownConverter(options);
var result = await converter.ConvertAsync("sample.pdf", "application/pdf");
Console.WriteLine(result.Markdown);
MarkItDownOptions exposes run-time tunables:
- OcrEngine – OCR engine to use (Tesseract or RapidOcr)
- OcrDataPath – location of Tesseract language data (TESSDATA_PREFIX)
- OcrLanguage – language passed to the OCR engine (English, Italian, Latin)
- PdfRasterDpi – DPI for rasterising PDFs during OCR fallback
- MinimumNativeWordThreshold – minimum number of native words a PDF must yield; below this, OCR is triggered
- NormalizeMarkdown – toggle Markdig normalisation
The sample dataset/training/busta_paga_internet.jpeg was processed with both OCR backends.
| Engine | Time (s) | Δ vs Tesseract | Characters | Words | CER vs Tesseract |
|---|---|---|---|---|---|
| Tesseract | 1.12 | – | 1181 | 199 | – |
| RapidOCR | 3.68 | +229% | 1376 | 177 | 0.59 |
Character and word counts are derived from the respective Markdown outputs, and the character error rate (CER) is the normalised Levenshtein distance between the Tesseract and RapidOCR text. On this sample RapidOCR required about 3.3× the processing time of Tesseract (3.68 s vs 1.12 s, +229%). Timings were collected on Ubuntu 24.04 using Tesseract 5.3.4 and the RapidOCR .NET runtime (BustaPagaNet).
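For readers who want to reproduce a CER figure, here is a minimal sketch of a normalised Levenshtein distance. It assumes the distance is divided by the length of the reference text (here that would be the Tesseract output); the source does not state which normalisation the benchmark used, and the function names are illustrative.

```csharp
using System;
using System.Globalization;

// Classic two-row dynamic programming edit distance (insert, delete, substitute).
int Levenshtein(string a, string b)
{
    var previous = new int[b.Length + 1];
    var current = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) previous[j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        current[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            int deletion = previous[j] + 1;
            int insertion = current[j - 1] + 1;
            current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
        }
        (previous, current) = (current, previous); // reuse the two rows
    }
    return previous[b.Length];
}

// Assumption: normalise by the reference length; other definitions divide by the longer string.
double CharacterErrorRate(string reference, string hypothesis)
{
    if (reference.Length == 0) return hypothesis.Length == 0 ? 0.0 : 1.0;
    return (double)Levenshtein(reference, hypothesis) / reference.Length;
}

double cer = CharacterErrorRate("kitten", "sitting");
Console.WriteLine(cer.ToString("F2", CultureInfo.InvariantCulture)); // prints "0.50"
```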
Logging uses Serilog. The library reads standard Serilog settings (see src/MarkItDownNet/appsettings.json for an example) and supports console and rolling file sinks. Set Serilog__MinimumLevel via environment variables to control verbosity.
Tests create a small PDF on the fly, ensuring that extraction works without external files. OCR-based tests are not executed by default as they require Tesseract data files.
A comparison with Docling ground truth on sample PDFs and TIFFs is available in the Docling comparison report.
Docling's image samples are distributed as TIFF files. The comparison tool converts them to JPEG via BitMiracle.LibTiff.NET and SkiaSharp before passing them to MarkItDownNet:
~/.dotnet/dotnet run --project tools/DoclingComparison/DoclingComparison.csproj docling/tests/data/tiff/2206.01062.tif
| Metric | Docling | MarkItDownNet | Difference |
|---|---|---|---|
| Word count | 17 344 | 17 803 | +2.65% |
| Word match rate | 100% | 99.37% | −0.63% |
| Markdown similarity | – | 73% | – |
| BBox mean absolute error | 0% | 10.74% | +10.74% |
These large arXiv PDFs showed a 99.37% word match rate and a 10.74% mean absolute error in bounding boxes.
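One plausible way to arrive at a bounding box mean absolute error is sketched below. It assumes the boxes have already been paired one-to-one (for example by matching words) and that the error is averaged over all four normalised coordinates; the comparison tool's exact definition is not given here, so treat `BoundingBoxMaePercent` as a hypothetical illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical metric: mean absolute difference over paired [x, y, w, h] boxes,
// expressed as a percentage of the normalised page dimension.
double BoundingBoxMaePercent(double[][] expected, double[][] actual)
{
    if (expected.Length == 0 || expected.Length != actual.Length)
        throw new ArgumentException("Box lists must be non-empty and of equal length.");

    double total = 0;
    for (int i = 0; i < expected.Length; i++)
        for (int k = 0; k < 4; k++)
            total += Math.Abs(expected[i][k] - actual[i][k]);

    return 100.0 * total / (expected.Length * 4);
}

var groundTruth = new[] { new[] { 0.1, 0.1, 0.5, 0.1 } };
var produced = new[] { new[] { 0.2, 0.1, 0.4, 0.1 } };
Console.WriteLine(BoundingBoxMaePercent(groundTruth, produced).ToString("F2", CultureInfo.InvariantCulture));
// prints "5.00"
```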
The tests project verifies Markdown and bounding box accuracy against the Docling ground truth for ocr_test.pdf.
| Item | Docling | MarkItDownNet | Abs. diff | Diff % |
|---|---|---|---|---|
| Markdown | Docling bundles PDF document conversion to JSON and Markdown in an easy self contained package | same | 0 | 0% |
| BBox X | 0.1171 | 0.1171 | 0 | 0% |
| BBox Y | 0.0915 | 0.0915 | 0 | 0% |
| BBox W | 0.7312 | 0.7312 | 0 | 0% |
| BBox H | 0.0902 | 0.0902 | 0 | 0% |
Bounding boxes use normalised [x,y,w,h] coordinates. The test asserts equality within a two decimal tolerance.
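A tolerance check of this kind might look like the sketch below. It assumes "two decimal tolerance" means the values agree after rounding to two decimal places; the actual test may use a different comparison (for instance an absolute difference threshold), and `BoxesMatch` is an illustrative name.

```csharp
using System;

// Hypothetical comparison: two [x, y, w, h] boxes match when every coordinate
// is equal after rounding to the given number of decimal places.
bool BoxesMatch(double[] expected, double[] actual, int decimals = 2)
{
    if (expected.Length != 4 || actual.Length != 4)
        throw new ArgumentException("Boxes must have exactly four coordinates.");

    for (int k = 0; k < 4; k++)
    {
        if (Math.Round(expected[k], decimals) != Math.Round(actual[k], decimals))
            return false;
    }
    return true;
}

// Values from the ocr_test.pdf row above, compared against slightly perturbed boxes.
var docling = new[] { 0.1171, 0.0915, 0.7312, 0.0902 };
Console.WriteLine(BoxesMatch(docling, new[] { 0.1169, 0.0915, 0.7312, 0.0902 })); // True
Console.WriteLine(BoxesMatch(docling, new[] { 0.1271, 0.0915, 0.7312, 0.0902 })); // False
```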
The following timings were captured while converting the PDF, TIFF, and PNG samples from Docling's tests/data directory. Each value represents the time in milliseconds to produce Markdown text and to serialise bounding boxes.
| File | Type | Markdown ms | BBox ms |
|---|---|---|---|
| 2305.03393v1-pg9-img.png | png | 1537.34 | 52.91 |
| 2203.01017v2.pdf | pdf | 1147.85 | 44.90 |
| 2206.01062.pdf | pdf | 654.79 | 20.40 |
| 2305.03393v1-pg9.pdf | pdf | 85.03 | 0.87 |
| 2305.03393v1.pdf | pdf | 287.15 | 16.69 |
| amt_handbook_sample.pdf | pdf | 136.57 | 1.46 |
| code_and_formula.pdf | pdf | 49.39 | 1.85 |
| multi_page.pdf | pdf | 63.96 | 2.85 |
| picture_classification.pdf | pdf | 20.78 | 1.19 |
| redp5110_sampled.pdf | pdf | 302.47 | 12.68 |
| right_to_left_01.pdf | pdf | 32.79 | 0.54 |
| right_to_left_02.pdf | pdf | 20.31 | 0.49 |
| right_to_left_03.pdf | pdf | 34.83 | 0.36 |
| 2206.01062.tif | tiff | 4007.83 | 1.80 |
| Type | Avg Markdown ms | Avg BBox ms |
|---|---|---|
| png | 1537.34 | 52.91 |
| pdf | 236.33 | 8.69 |
| tiff | 4007.83 | 1.80 |
| Overall | 598.65 | 11.36 |
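The per-type and overall averages can be recomputed from the per-file timings with a few lines of LINQ. The snippet below is only a sketch of that arithmetic, using the values listed in the per-file table; the tuple layout and variable names are illustrative.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Per-file timings copied from the table above: (type, Markdown ms, BBox ms).
var timings = new (string Type, double MarkdownMs, double BBoxMs)[]
{
    ("png", 1537.34, 52.91),
    ("pdf", 1147.85, 44.90), ("pdf", 654.79, 20.40), ("pdf", 85.03, 0.87),
    ("pdf", 287.15, 16.69), ("pdf", 136.57, 1.46), ("pdf", 49.39, 1.85),
    ("pdf", 63.96, 2.85), ("pdf", 20.78, 1.19), ("pdf", 302.47, 12.68),
    ("pdf", 32.79, 0.54), ("pdf", 20.31, 0.49), ("pdf", 34.83, 0.36),
    ("tiff", 4007.83, 1.80),
};

string Format(double value) => value.ToString("F2", CultureInfo.InvariantCulture);

// Average each metric within a file type.
foreach (var group in timings.GroupBy(t => t.Type))
{
    Console.WriteLine($"{group.Key}: {Format(group.Average(t => t.MarkdownMs))} ms Markdown, " +
                      $"{Format(group.Average(t => t.BBoxMs))} ms BBox");
}

// The overall row is the plain mean over all 14 files, not a mean of the type averages.
Console.WriteLine($"Overall: {Format(timings.Average(t => t.MarkdownMs))} ms Markdown, " +
                  $"{Format(timings.Average(t => t.BBoxMs))} ms BBox");
```

With these figures the pdf averages come out at 236.33 ms and 8.69 ms, and the overall averages at 598.65 ms and 11.36 ms, matching the table.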
The markitdown project reports Docling dataset timings in seconds. Comparing the published averages shows that MarkItDownNet processes most of these samples substantially faster, the exception being Markdown conversion of the single TIFF sample:
| Type | markitdown MD s | markitdown BBox s | MarkItDownNet MD s | MarkItDownNet BBox s |
|---|---|---|---|---|
| pdf | 3.29 | 5.14 | 0.24 | 0.01 |
| png | 2.51 | 5.56 | 1.54 | 0.05 |
| tiff | 2.57 | 4.19 | 4.01 | 0.00 |
| Overall | 3.18 | 5.10 | 0.60 | 0.01 |
On these samples, MarkItDownNet completed Markdown conversion roughly an order of magnitude faster for PDFs (0.24 s vs 3.29 s) and produced bounding boxes two orders of magnitude quicker than markitdown. Markdown conversion of the TIFF sample was slower (4.01 s vs 2.57 s).
Markdown generation:
dotnet run --project tools/MarkItDownNet.Cli -- mdgen --txt-dir dataset/validation/_ocr/pytesseract-cli --out-dir dataset/validation/_md --engines markitdown,markitdownnet --python-exe python3
Comparison and report:
dotnet run --project tools/MarkItDownNet.Cli -- mdcompare --md-dir dataset/validation/_md --baseline markitdown --out-json artifacts/mdbench/bench-md.json --out-html artifacts/mdbench/bench-md.html --summary-md artifacts/mdbench/summary-md.md
Run the full check (manifest, hashes, and metrics) with:
bash tools/scripts/md_parity_strict.sh
MIT