@sneakybatman (Contributor)
Summary

This PR adds support for configurable word-level confidence score aggregation in text recognition models. Previously, each model aggregated character-level confidence scores into a word-level confidence using either the arithmetic mean or the minimum, with no way for users to customize this behavior.

Motivation

Different use cases may require different confidence aggregation strategies:

  • Arithmetic mean: Good general-purpose default, balances all character confidences
  • Geometric mean: More sensitive to low confidence characters, useful when any low confidence should significantly impact the word score
  • Harmonic mean: Even more conservative, heavily penalizes low confidence characters
  • Minimum: Most conservative approach, word confidence equals weakest character (good for high-precision requirements)
  • Maximum: Most optimistic, useful when you want the best-case confidence
  • Custom callable: Full flexibility for specialized use cases
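
The strategies above can be sketched as a single dispatch function. The names `aggregate_confidence` and `ConfidenceAggregation` follow the PR description, but this body is illustrative, not doctr's actual implementation:

```python
import numpy as np
from typing import Callable, Union

# Type alias mirroring the one described in the PR
ConfidenceAggregation = Union[str, Callable[[np.ndarray], float]]

def aggregate_confidence(probs, method: ConfidenceAggregation = "mean") -> float:
    probs = np.asarray(probs, dtype=float)
    if probs.size == 0:
        return 0.0  # empty word -> zero confidence (assumed convention)
    if callable(method):
        return float(method(probs))
    if method == "mean":
        return float(probs.mean())
    if method == "geometric_mean":
        # exp(mean(log p)); a near-zero confidence drags the result toward 0
        return float(np.exp(np.log(np.clip(probs, 1e-12, None)).mean()))
    if method == "harmonic_mean":
        # n / sum(1/p); penalizes low values even harder
        return float(probs.size / (1.0 / np.clip(probs, 1e-12, None)).sum())
    if method == "min":
        return float(probs.min())
    if method == "max":
        return float(probs.max())
    raise ValueError(f"Unknown aggregation method: {method!r}")
```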

Changes

  • Add aggregate_confidence() utility function in core.py with support for 5 built-in methods plus custom callables
  • Add ConfidenceAggregation type alias for type hints
  • Add confidence_aggregation parameter to RecognitionPostProcessor base class
  • Update all PyTorch PostProcessors: PARSeq, ViTSTR, CRNN, SAR, MASTER, VIPTR
  • Update all TensorFlow PostProcessors: PARSeq, ViTSTR, SAR, MASTER
  • Update remap_preds() for split crop handling to use configurable aggregation
  • Add comprehensive unit tests (20 new test cases)

Usage Example

from doctr.models import recognition

# Use default aggregation (model-specific)
model = recognition.parseq(pretrained=True)

# Or customize at the PostProcessor level
from doctr.models.recognition.parseq.pytorch import PARSeqPostProcessor

# Use geometric mean for more conservative confidence scores
processor = PARSeqPostProcessor(vocab, confidence_aggregation="geometric_mean")

# Use custom aggregation function
import numpy as np
processor = PARSeqPostProcessor(vocab, confidence_aggregation=lambda probs: np.percentile(probs, 25))
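
To see how the choice of method matters, here is a standalone numpy comparison on one word's character confidences (the numbers are made up for illustration; this does not call doctr's API):

```python
import numpy as np

# One low-confidence character (0.40) pulls the aggregates apart
probs = np.array([0.99, 0.95, 0.40])

mean = probs.mean()                        # arithmetic mean, ~0.78
geo = np.exp(np.log(probs).mean())         # geometric mean, ~0.72
harm = probs.size / (1.0 / probs).sum()    # harmonic mean, ~0.66
lo, hi = probs.min(), probs.max()          # 0.40 and 0.99
```

The ordering max ≥ mean ≥ geometric ≥ harmonic ≥ min holds for any input, which is why the harmonic mean and minimum are described above as the more conservative choices.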

Test plan

  • All existing tests pass
  • New unit tests for aggregate_confidence() function cover all 5 methods
  • Tests verify correct handling of edge cases (empty arrays, single values, zeros)
  • Tests verify custom callable support
  • PyTorch postprocessor tests updated and passing
  • TensorFlow postprocessor tests updated and passing
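
The edge cases listed above can be exercised with checks along these lines, shown here for a stand-in geometric-mean branch (the PR's actual tests cover all five methods plus callables):

```python
import numpy as np

def geometric_mean(probs):
    # Stand-in for one branch of aggregate_confidence(), not doctr's code
    probs = np.asarray(probs, dtype=float)
    if probs.size == 0:
        return 0.0  # empty array edge case (assumed convention)
    if (probs == 0).any():
        return 0.0  # any zero-confidence character zeroes the word score
    return float(np.exp(np.log(probs).mean()))
```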

Commit message

Add support for configurable word-level confidence score aggregation
methods in text recognition models. Users can now choose how to
aggregate character-level confidence scores into word-level confidence.

Supported aggregation methods:
- "mean": Arithmetic mean (default for transformer models)
- "geometric_mean": Geometric mean (sensitive to low values)
- "harmonic_mean": Harmonic mean (even more sensitive to low values)
- "min": Minimum confidence (most conservative, default for CTC/attention models)
- "max": Maximum confidence (most optimistic)
- Custom callable: User-defined aggregation function

Changes:
- Add `aggregate_confidence()` utility function in core.py
- Add `confidence_aggregation` parameter to RecognitionPostProcessor
- Update all PyTorch PostProcessors (PARSeq, ViTSTR, CRNN, SAR, MASTER, VIPTR)
- Update all TensorFlow PostProcessors (PARSeq, ViTSTR, SAR, MASTER)
- Update `remap_preds()` for split crop handling
- Add comprehensive unit tests for aggregation methods
- Maintain backward compatibility with sensible defaults per model type