mapo80/markitdownnet

MarkItDownNet

MarkItDownNet is a lightweight .NET library that converts PDFs and images into normalised Markdown with positional metadata. For each processed document the library returns:

  • Canonical Markdown text
  • Page information (original width and height)
  • Line level bounding boxes
  • Word level bounding boxes

Bounding boxes are expressed as [x, y, w, h], normalised to the range [0..1], with a top-left origin.
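
As a minimal sketch of how a consumer might map a normalised box back to page coordinates: the helper name ToPixels and the 595 x 842 page size are illustrative assumptions, not part of the library's API.

```csharp
using System;

// Hypothetical helper (not part of MarkItDownNet): scales a normalised
// [x, y, w, h] box with a top-left origin to absolute page units.
(double X, double Y, double Width, double Height) ToPixels(
    double[] box, double pageWidth, double pageHeight)
{
    if (box.Length != 4)
        throw new ArgumentException("Expected [x, y, w, h].", nameof(box));

    // x and w scale with the page width, y and h with the page height.
    return (box[0] * pageWidth, box[1] * pageHeight,
            box[2] * pageWidth, box[3] * pageHeight);
}

// Assumed page size: 595 x 842 (an A4 page measured in PDF points).
var pixels = ToPixels(new[] { 0.1171, 0.0915, 0.7312, 0.0902 }, 595, 842);
Console.WriteLine($"x={pixels.X:F1}, y={pixels.Y:F1}, w={pixels.Width:F1}, h={pixels.Height:F1}");
```

Because the origin is top-left, y grows downwards; no flip is needed when drawing onto a raster image, whereas PDF user space (bottom-left origin) would need one.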

FUNSD dataset comparison

A description of the FUNSD dataset comparison tool, the bounding-box difference report, and instructions for running it are available in docs/funsd_comparison.md.

Pipeline

PDF -> PdfPig text extraction -> (optional) PDFtoImage rasterisation -> OCR
Image -> OCR
                 |
                 v
            Markdown (Markdig)

If a PDF yields too few native words (see MinimumNativeWordThreshold under Configuration), the pages are rasterised with PDFtoImage and OCRed with the selected engine.
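
The fallback decision can be pictured roughly as follows. This is only a sketch under stated assumptions: the helper NeedsOcrFallback is hypothetical, words are counted by splitting on whitespace, and the comparison is strict; the library's actual rule may differ.

```csharp
using System;

// Hypothetical helper (not the library's implementation): returns true when
// the natively extracted text has fewer words than the configured threshold.
bool NeedsOcrFallback(string nativeText, int minimumNativeWordThreshold)
{
    // Assumption: a "word" is any whitespace-separated token.
    var separators = new[] { ' ', '\t', '\r', '\n' };
    int wordCount = nativeText
        .Split(separators, StringSplitOptions.RemoveEmptyEntries)
        .Length;
    return wordCount < minimumNativeWordThreshold;
}

Console.WriteLine(NeedsOcrFallback("Invoice 2024", 5));                // True
Console.WriteLine(NeedsOcrFallback("one two three four five six", 5)); // False
```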

Installing .NET

This repository does not rely on the system dotnet. Install the SDK locally using the provided script:

chmod +x ./dotnet-install.sh
./dotnet-install.sh --channel 9.0
~/.dotnet/dotnet --version

Build and Test

All build and test commands must use the locally installed dotnet:

~/.dotnet/dotnet build
~/.dotnet/dotnet test

Tesseract and Leptonica

The library bundles the minimal native dependencies for Linux x64 in runtimes/linux-x64/native and does not require Tesseract or Leptonica to be installed on the system. The .NET binding for Tesseract is provided through a local NuGet package (local-packages/Tesseract.5.2.0.nupkg) derived from the charlesw/tesseract repository.

To run OCR you only need to provide the tessdata language files. On Ubuntu 24.04 it is enough to install the packages for the desired languages, for example:

sudo apt-get install -y tesseract-ocr-eng tesseract-ocr-ita tesseract-ocr-osd

Then set OcrDataPath in the options so that it points to the folder containing the language data (e.g. /usr/share/tesseract-ocr/5/tessdata).

Usage

var options = new MarkItDownOptions
{
    OcrDataPath = "/usr/share/tesseract-ocr/5/tessdata",
    OcrEngine = OcrEngine.Tesseract, // or OcrEngine.RapidOcr
    OcrLanguage = OcrLanguage.English,
    PdfRasterDpi = 300
};
var converter = new MarkItDownConverter(options);
var result = await converter.ConvertAsync("sample.pdf", "application/pdf");
Console.WriteLine(result.Markdown);

Configuration

MarkItDownOptions exposes run‑time tunables:

  • OcrEngine – OCR engine to use (Tesseract or RapidOcr)
  • OcrDataPath – location of Tesseract language data (TESSDATA_PREFIX)
  • OcrLanguage – language passed to the OCR engine (English, Italian, Latin)
  • PdfRasterDpi – DPI for rasterising PDFs during OCR fallback
  • MinimumNativeWordThreshold – minimum number of natively extracted words a PDF must yield; below this, the OCR fallback is triggered
  • NormalizeMarkdown – toggle Markdig normalisation

OCR engine comparison

The sample dataset/training/busta_paga_internet.jpeg was processed with both OCR backends.

Engine     Time (s)  Δ vs Tesseract  Characters  Words  CER vs Tesseract
Tesseract  1.12      -               1181        199    -
RapidOCR   3.68      +229%           1376        177    0.59

Character and word counts are derived from the respective Markdown outputs, and the character error rate (CER) is the normalised Levenshtein distance between the Tesseract and RapidOCR text. On this sample RapidOCR required about 3.3× the processing time of Tesseract (3.68 s versus 1.12 s, +229%). Timings were collected on Ubuntu 24.04 using Tesseract 5.3.4 and the RapidOCR .NET runtime (BustaPagaNet).
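
For readers who want to reproduce figures of this kind, the sketch below shows one way to compute a normalised Levenshtein distance and the relative time difference. The function names are illustrative, and dividing by the reference length is an assumption: the README does not say which normalisation the benchmark used.

```csharp
using System;

// Classic two-row dynamic-programming Levenshtein distance
// (unit cost for insertion, deletion and substitution).
int Levenshtein(string a, string b)
{
    var previous = new int[b.Length + 1];
    var current = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) previous[j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        current[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            current[j] = Math.Min(
                Math.Min(current[j - 1] + 1, previous[j] + 1),
                previous[j - 1] + cost);
        }
        (previous, current) = (current, previous); // reuse the two rows
    }
    return previous[b.Length];
}

// Assumption: CER = edit distance / length of the reference text.
double CharacterErrorRate(string reference, string hypothesis) =>
    reference.Length == 0
        ? (hypothesis.Length == 0 ? 0.0 : 1.0)
        : (double)Levenshtein(reference, hypothesis) / reference.Length;

// Relative difference of a value against a baseline, in percent.
double PercentDelta(double baseline, double value) =>
    (value - baseline) / baseline * 100.0;

Console.WriteLine(Levenshtein("kitten", "sitting"));     // 3
Console.WriteLine(Math.Round(PercentDelta(1.12, 3.68))); // 229
```

With this definition the CER can exceed 1 when the hypothesis is much longer than the reference; dividing by the longer of the two lengths instead keeps it within [0, 1].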

Logging

Logging uses Serilog. The library reads standard Serilog settings (see src/MarkItDownNet/appsettings.json for an example) and supports console and rolling file sinks. Set the Serilog__MinimumLevel environment variable to control verbosity.

Testing assets

Tests create a small PDF on the fly, ensuring that extraction works without external files. OCR-based tests are not executed by default because they require Tesseract data files.

Evaluation

A comparison with Docling ground truth on sample PDFs and TIFFs is available in the Docling comparison report.

Docling's image samples are distributed as TIFF files. The comparison tool converts them to JPEG via BitMiracle.LibTiff.NET and SkiaSharp before passing them to MarkItDownNet:

~/.dotnet/dotnet run --project tools/DoclingComparison/DoclingComparison.csproj docling/tests/data/tiff/2206.01062.tif

Metric                    Docling  MarkItDownNet  Difference
Word count                17,344   17,803         +2.65%
Word match rate           100%     99.37%         −0.63%
BBox mean absolute error  0%       10.74%         +10.74%

Markdown similarity: 73%

These large arXiv PDFs showed a 99.37% word match rate and a 10.74% mean absolute error in bounding boxes.
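
The README does not define how the bounding-box error is computed. One plausible reading, sketched here purely as an assumption, averages the absolute difference over all four normalised coordinates of each matched box pair and expresses it as a percentage. The function name is hypothetical, and each box is assumed to hold exactly four values.

```csharp
using System;
using System.Linq;

// Hypothetical metric (the comparison tool's actual formula is not documented here):
// mean of |expected - actual| over x, y, w, h of every paired box, times 100.
double BBoxMeanAbsoluteErrorPercent(double[][] expected, double[][] actual)
{
    if (expected.Length != actual.Length)
        throw new ArgumentException("Box lists must be paired one-to-one.");
    if (expected.Length == 0) return 0.0;

    double total = expected
        .Zip(actual, (e, a) => e.Zip(a, (x, y) => Math.Abs(x - y)).Sum())
        .Sum();
    return total / (expected.Length * 4) * 100.0;
}

var expectedBoxes = new[] { new[] { 0.10, 0.10, 0.50, 0.05 } };
var actualBoxes = new[] { new[] { 0.12, 0.10, 0.46, 0.05 } };
// Differences are 0.02, 0, 0.04 and 0, so the result is roughly 1.5 (percent).
Console.WriteLine(BBoxMeanAbsoluteErrorPercent(expectedBoxes, actualBoxes));
```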

Docling comparison

The tests project verifies Markdown and bounding box accuracy against the Docling ground truth for ocr_test.pdf.

Item      Docling  MarkItDownNet  Abs. diff  Diff %
Markdown  "Docling bundles PDF document conversion to JSON and Markdown in an easy self contained package"  same  0  0%
BBox X    0.1171   0.1171         0          0%
BBox Y    0.0915   0.0915         0          0%
BBox W    0.7312   0.7312         0          0%
BBox H    0.0902   0.0902         0          0%

Bounding boxes use normalised [x, y, w, h] coordinates. The test asserts equality within a two-decimal tolerance.
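
A tolerance check of this kind could look like the sketch below. It assumes that "two-decimal tolerance" means both values agree after rounding to two decimal places; the test project may instead use an absolute tolerance such as 0.01. The helper name is illustrative.

```csharp
using System;
using System.Linq;

// Hypothetical helper: true when every coordinate of the two boxes
// matches after rounding to two decimal places.
bool EqualWithinTwoDecimals(double[] expected, double[] actual) =>
    expected.Length == actual.Length &&
    expected.Zip(actual, (e, a) => Math.Round(e, 2) == Math.Round(a, 2))
            .All(matches => matches);

var docling = new[] { 0.1171, 0.0915, 0.7312, 0.0902 };
var slightlyOff = new[] { 0.1174, 0.0918, 0.7309, 0.0899 };
var clearlyOff = new[] { 0.1371, 0.0915, 0.7312, 0.0902 };

Console.WriteLine(EqualWithinTwoDecimals(docling, slightlyOff)); // True
Console.WriteLine(EqualWithinTwoDecimals(docling, clearlyOff));  // False
```

Rounding-based comparison can reject values that are very close but fall on opposite sides of a rounding boundary (for example 0.1149 and 0.1151), which is one reason an absolute tolerance is sometimes preferred.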

Docling data conversion timings

The following timings were captured while converting the PDF, TIFF, and PNG samples from Docling's tests/data directory. Each value represents the time in milliseconds to produce Markdown text and to serialise bounding boxes.

File                      Type  Markdown ms  BBox ms
2305.03393v1-pg9-img.png  png   1537.34      52.91
2203.01017v2.pdf          pdf   1147.85      44.90
2206.01062.pdf            pdf   654.79       20.40
2305.03393v1-pg9.pdf      pdf   85.03        0.87
2305.03393v1.pdf          pdf   287.15       16.69
amt_handbook_sample.pdf   pdf   136.57       1.46
code_and_formula.pdf      pdf   49.39        1.85
multi_page.pdf            pdf   63.96        2.85
picture_classification.pdf  pdf  20.78       1.19
redp5110_sampled.pdf      pdf   302.47       12.68
right_to_left_01.pdf      pdf   32.79        0.54
right_to_left_02.pdf      pdf   20.31        0.49
right_to_left_03.pdf      pdf   34.83        0.36
2206.01062.tif            tiff  4007.83      1.80

Type     Avg Markdown ms  Avg BBox ms
png      1537.34          52.91
pdf      236.33           8.69
tiff     4007.83          1.80
Overall  598.65           11.36
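
The per-type averages can be reproduced from the per-file rows with a short LINQ aggregation. This is an illustrative sketch, not the tool that generated the report; the tuple layout and variable names are assumptions, while the numbers are copied from the timings listed above.

```csharp
using System;
using System.Linq;

// Per-file timings copied from the list above: (file, type, Markdown ms, BBox ms).
var timings = new (string File, string Type, double MarkdownMs, double BBoxMs)[]
{
    ("2305.03393v1-pg9-img.png", "png", 1537.34, 52.91),
    ("2203.01017v2.pdf", "pdf", 1147.85, 44.90),
    ("2206.01062.pdf", "pdf", 654.79, 20.40),
    ("2305.03393v1-pg9.pdf", "pdf", 85.03, 0.87),
    ("2305.03393v1.pdf", "pdf", 287.15, 16.69),
    ("amt_handbook_sample.pdf", "pdf", 136.57, 1.46),
    ("code_and_formula.pdf", "pdf", 49.39, 1.85),
    ("multi_page.pdf", "pdf", 63.96, 2.85),
    ("picture_classification.pdf", "pdf", 20.78, 1.19),
    ("redp5110_sampled.pdf", "pdf", 302.47, 12.68),
    ("right_to_left_01.pdf", "pdf", 32.79, 0.54),
    ("right_to_left_02.pdf", "pdf", 20.31, 0.49),
    ("right_to_left_03.pdf", "pdf", 34.83, 0.36),
    ("2206.01062.tif", "tiff", 4007.83, 1.80),
};

// Group by file type and average each timing column, rounded to two decimals.
var byType = timings
    .GroupBy(t => t.Type)
    .Select(g => (Type: g.Key,
                  AvgMarkdownMs: Math.Round(g.Average(t => t.MarkdownMs), 2),
                  AvgBBoxMs: Math.Round(g.Average(t => t.BBoxMs), 2)))
    .ToList();

// Overall averages weight every file equally, so PDFs dominate.
var overallMarkdownMs = Math.Round(timings.Average(t => t.MarkdownMs), 2);
var overallBBoxMs = Math.Round(timings.Average(t => t.BBoxMs), 2);

foreach (var row in byType)
    Console.WriteLine($"{row.Type}: {row.AvgMarkdownMs} ms Markdown, {row.AvgBBoxMs} ms BBox");
Console.WriteLine($"Overall: {overallMarkdownMs} ms Markdown, {overallBBoxMs} ms BBox");
```

Note that the png and tiff averages each rest on a single file, so they are individual measurements rather than representative means.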

Comparison with markitdown timings

The markitdown project reports Docling dataset timings in seconds. Comparing the published averages shows that MarkItDownNet processes most of these samples substantially faster:

Type     markitdown MD s  markitdown BBox s  MarkItDownNet MD s  MarkItDownNet BBox s
pdf      3.29             5.14               0.24                0.01
png      2.51             5.56               1.54                0.05
tiff     2.57             4.19               4.01                0.00
Overall  3.18             5.10               0.60                0.01

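On these samples, MarkItDownNet completed Markdown conversion roughly an order of magnitude faster for PDFs (0.24 s versus 3.29 s) and produced bounding boxes two orders of magnitude quicker than markitdown. The TIFF sample is the exception for Markdown conversion: MarkItDownNet took 4.01 s against 2.57 s for markitdown.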

MD parity bench

Markdown generation:

dotnet run --project tools/MarkItDownNet.Cli -- mdgen --txt-dir dataset/validation/_ocr/pytesseract-cli --out-dir dataset/validation/_md --engines markitdown,markitdownnet --python-exe python3

Comparison and report:

dotnet run --project tools/MarkItDownNet.Cli -- mdcompare --md-dir dataset/validation/_md --baseline markitdown --out-json artifacts/mdbench/bench-md.json --out-html artifacts/mdbench/bench-md.html --summary-md artifacts/mdbench/summary-md.md

MD parity STRICT (manifest + hash)

Run the full check (manifest, hashes and metrics) with:

bash tools/scripts/md_parity_strict.sh

License

MIT
