Feature/coreference match by Davidnet · Pull Request #71 · dataiku/kiji-proxy

Davidnet · 2026-01-05T05:33:27Z

This pull request introduces coreference (coref) detection and pronoun substitution support to the PII detection and masking pipeline. The ONNX-based PII detector is extended to output both PII entity and coreference cluster predictions, and the masking service is updated to optionally substitute pronouns based on configuration. Several data structures and service constructors are updated to support these features.

Coreference detection and output:

The ONNXModelDetectorSimple now outputs both PII and coreference logits, processes them to assign detected entities to coreference clusters, and includes coreference cluster information in the DetectorOutput. Entities now have an associated ClusterID. (src/backend/pii/detectors/onnx_model_detector.go, src/backend/pii/detectors/types.go) [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Pronoun substitution support:

Added an EnablePronounSubstitution flag to the config, defaulting to true, to control whether pronoun substitution is performed during masking. (src/backend/config/config.go) [1] [2]
The MaskingService now tracks pronoun and entity replacements with new data structures, and is constructed with a pronoun mapper and config to support this feature. (src/backend/pii/masking_service.go) [1] [2]
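
One detail worth spelling out: a Go `bool` is `false` by default, so a flag that "defaults to true" has to be set explicitly wherever the default config is built. The sketch below shows that pattern with a trimmed `Config` and a single-argument `NewMaskingService`; both are hypothetical stand-ins, and the real constructor also takes other dependencies such as the pronoun mapper.

```go
package main

import "fmt"

// Config is a trimmed, hypothetical stand-in for the struct in config.go.
type Config struct {
	EnablePronounSubstitution bool
}

// DefaultConfig sets the flag explicitly because a zero-value Config
// would leave it false.
func DefaultConfig() Config {
	return Config{EnablePronounSubstitution: true}
}

// MaskingService keeps the config so the pronoun pass can be skipped.
type MaskingService struct {
	cfg Config
}

// NewMaskingService is a simplified constructor (assumed signature).
func NewMaskingService(cfg Config) *MaskingService {
	return &MaskingService{cfg: cfg}
}

// PronounPassEnabled reports whether masking should also rewrite pronouns.
func (s *MaskingService) PronounPassEnabled() bool {
	return s.cfg.EnablePronounSubstitution
}

func main() {
	withDefaults := NewMaskingService(DefaultConfig())
	zeroValue := NewMaskingService(Config{})
	fmt.Println(withDefaults.PronounPassEnabled(), zeroValue.PronounPassEnabled())
}
```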

Testing and deterministic output:

Added a new NewGeneratorServiceWithSeed function to create a deterministic generator for testing purposes. (src/backend/pii/generator_service.go)

Refactoring and utility improvements:

Introduced utility functions and types in the masking service to handle UTF-8 boundaries, overlap checks, and replacement tracking for robust text manipulation. (src/backend/pii/masking_service.go)

These changes lay the groundwork for robust, cluster-aware PII masking and enable optional pronoun substitution for improved privacy and text coherence.

Summary by CodeRabbit

New Features
- Added pronoun substitution capability for PII masking with configurable control
- Implemented PII restoration to recover original content from masked text
- Enhanced PII detection with coreference analysis for entity relationship tracking
Tests
- Comprehensive test suite for masking and restoration workflows

…g by adding a pronoun mapper and updating the masking service.

…ists - Correct "her" mapping to support possessive cases (mapping to "his"/"their" instead of "him"). - Add support for possessive pronoun "hers". - Sync name lists in `DetectGenderFromName` with `FirstNameGenerator` for better coverage.

… enhance masking service for entity and pronoun tracking

coderabbitai · 2026-01-05T05:33:39Z

Important

Review skipped

Draft detected.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR introduces pronoun substitution and coreference resolution capabilities for PII masking. It adds configuration for pronoun substitution, coreference detection in ONNX models, a pronoun mapper for gender-aware substitution, and enhanced masking/restoration logic tracking entity and pronoun replacements with position recovery.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration & Infrastructure `src/backend/config/config.go` | Added `EnablePronounSubstitution` bool field to Config struct, defaulting to true. |
| Coreference Detection `src/backend/pii/detectors/onnx_model_detector.go` | Extended ONNX detector to process coref_logits output tensor alongside PII logits. Added coref label counting, cluster discovery, and entity-to-cluster association via `findClusterForEntity` helper. Introduced `safeUintToInt` for bounds-safe span calculations and updated resource cleanup for new tensor. |
| Type Definitions `src/backend/pii/detectors/types.go` | Added `EntityMention` type with text and position fields. Extended `DetectorOutput` with `CorefClusters map[int][]EntityMention`. Added `ClusterID` field to Entity struct. |
| Deterministic Generation `src/backend/pii/generator_service.go` | Added `NewGeneratorServiceWithSeed(seed int64)` constructor for deterministic RNG seeding (testing). |
| Pronoun Handling Infrastructure `src/backend/pii/pronoun_mapper.go` | New module introducing `PronounGender` enum (Unknown, Male, Female, Neutral), `PronounMapper` type managing gender-to-pronoun mappings, and methods: `MapPronoun`, `GetAllPronouns`, `DetectGenderFromName` with name-substring heuristics and capitalization preservation. |
| Masking & Restoration `src/backend/pii/masking_service.go` | Added `PronounReplacement` and `EntityReplacement` types for tracking substitutions. Expanded `MaskedResult` with entity/pronoun replacements, gender mappings, masked text, and entities. Updated `MaskingService` constructor to accept config. Enhanced `MaskText` to compute per-entity replacements, detect cluster gender changes, apply pronoun substitutions (when enabled), and track restoration data. Introduced `RestorePII`, `restoreEntities`, `reversePronouns` for PII recovery. Added UTF-8 boundary checks and overlap detection helpers. |
| Masking Tests `src/backend/pii/masking_service_test.go` | Comprehensive test suite with mock detector, covering multi-entity masking, UTF-8 resilience, coreference overlaps, pronoun gender switching, restoration, disabled substitution, and end-to-end scenarios. |
| Service Integration `src/backend/proxy/handler.go` | Updated `NewMaskingService` constructor call to pass config parameter. Changed masking flow to build `maskedToOriginal` mappings from `EntityReplacements` instead of `MaskedToOriginal`. |
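
The pronoun mapping described for `pronoun_mapper.go` can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: the role-indexed table, the free-function form of `MapPronoun`, and the lookup order are all assumptions. The lookup order deliberately tries the possessive determiner before the object form, so an ambiguous "her" maps to "his"/"their" rather than "him".

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

type PronounGender int

const (
	GenderUnknown PronounGender = iota
	GenderMale
	GenderFemale
	GenderNeutral
)

// Forms per gender, indexed by role: subject, object, possessive determiner,
// possessive pronoun, reflexive. (Hypothetical table layout.)
var pronounForms = map[PronounGender][5]string{
	GenderMale:    {"he", "him", "his", "his", "himself"},
	GenderFemale:  {"she", "her", "her", "hers", "herself"},
	GenderNeutral: {"they", "them", "their", "theirs", "themselves"},
}

// lookupOrder checks the possessive determiner (2) before the object (1).
var lookupOrder = [5]int{0, 2, 1, 3, 4}

// MapPronoun converts a pronoun between genders, preserving a leading capital.
// Unknown genders and unrecognised words are returned unchanged.
func MapPronoun(pronoun string, from, to PronounGender) string {
	src, okFrom := pronounForms[from]
	dst, okTo := pronounForms[to]
	if !okFrom || !okTo || from == to {
		return pronoun
	}
	lower := strings.ToLower(pronoun)
	for _, role := range lookupOrder {
		if src[role] == lower {
			return matchCase(pronoun, dst[role])
		}
	}
	return pronoun
}

// matchCase capitalises replacement when original starts with an upper-case letter.
func matchCase(original, replacement string) string {
	r, _ := utf8.DecodeRuneInString(original)
	if replacement != "" && unicode.IsUpper(r) {
		return strings.ToUpper(replacement[:1]) + replacement[1:]
	}
	return replacement
}

func main() {
	fmt.Println(MapPronoun("She", GenderFemale, GenderMale))
	fmt.Println(MapPronoun("her", GenderFemale, GenderNeutral))
	fmt.Println(MapPronoun("hers", GenderFemale, GenderMale))
}
```

A word-level table cannot tell object "her" from possessive "her"; resolving that properly would need part-of-speech context, which this sketch does not attempt.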

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Handler as Proxy Handler
    participant MaskSvc as Masking Service
    participant GenSvc as Generator Service
    participant Detector as ONNX Detector
    participant PronMapper as Pronoun Mapper
    
    Client->>Handler: POST /mask (text, config)
    activate Handler
    Handler->>Detector: Detect(text)
    activate Detector
    Detector->>Detector: Process PII & Coref logits
    Detector-->>Handler: DetectorOutput{Entities, CorefClusters}
    deactivate Detector
    
    Handler->>MaskSvc: MaskText(text, entities, corefClusters)
    activate MaskSvc
    
    MaskSvc->>GenSvc: GenerateReplacement(label, entity)
    activate GenSvc
    GenSvc-->>MaskSvc: maskedValue
    deactivate GenSvc
    
    MaskSvc->>MaskSvc: Detect cluster gender<br/>(from coref mentions)
    MaskSvc->>PronMapper: DetectGenderFromName(maskedName)
    activate PronMapper
    PronMapper-->>MaskSvc: detectedGender
    deactivate PronMapper
    
    alt EnablePronounSubstitution
        MaskSvc->>MaskSvc: Identify gender change<br/>(original→masked)
        MaskSvc->>PronMapper: MapPronoun(pronoun, fromGender, toGender)
        activate PronMapper
        PronMapper-->>MaskSvc: substitutedPronoun
        deactivate PronMapper
        MaskSvc->>MaskSvc: Apply pronoun replacements<br/>(UTF-8 boundary safe)
    end
    
    MaskSvc->>MaskSvc: Apply entity replacements<br/>(descending order)
    MaskSvc-->>Handler: MaskedResult{MaskedText, EntityReplacements, PronounReplacements, GenderMappings}
    deactivate MaskSvc
    
    Handler-->>Client: {masked_text, entity_replacements, gender_mappings}
    deactivate Handler
    
    Note over MaskSvc,PronMapper: RestorePII reverses this flow via<br/>restoreEntities + reversePronouns
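
The "descending order" step in the diagram is what keeps byte offsets valid: replacing from the end of the text backwards means earlier spans are never shifted by a later substitution. A minimal sketch of that idea, using a hypothetical `EntityReplacement` shape with `Start`/`End` byte offsets into the original text:

```go
package main

import (
	"fmt"
	"sort"
)

// EntityReplacement is a simplified, assumed shape for a tracked substitution.
type EntityReplacement struct {
	Start    int // byte offset in the original text, inclusive
	End      int // byte offset in the original text, exclusive
	Original string
	Masked   string
}

// applyReplacements substitutes each span with its masked value, working from
// the highest start offset down so untouched offsets stay correct.
func applyReplacements(text string, reps []EntityReplacement) string {
	sorted := append([]EntityReplacement(nil), reps...) // do not mutate the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start > sorted[j].Start })
	for _, r := range sorted {
		if r.Start < 0 || r.Start > r.End || r.End > len(text) {
			continue // skip spans that do not fit the text
		}
		text = text[:r.Start] + r.Masked + text[r.End:]
	}
	return text
}

func main() {
	reps := []EntityReplacement{
		{Start: 0, End: 5, Original: "Alice", Masked: "Margaret"},
		{Start: 10, End: 13, Original: "Bob", Masked: "Li"},
	}
	fmt.Println(applyReplacements("Alice met Bob.", reps))
}
```

This sketch assumes the spans do not overlap; overlapping spans would need to be resolved before this step.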

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

… update entity position handling

Davidnet · 2026-01-05T06:05:31Z

@coderabbitai review

coderabbitai · 2026-01-05T06:05:37Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copilot

Pull request overview

This PR introduces coreference resolution and gender-aware pronoun substitution to the PII masking pipeline. The changes enable the system to detect pronoun references to PII entities, substitute them based on gender when entity names change, and restore the original text including pronouns.

Key changes:

Extended ONNX detector to output coreference clusters alongside PII entities
Added pronoun mapping system with gender detection and substitution logic
Refactored masking service to use position-based tracking for entity and pronoun replacements
Added configuration flag to enable/disable pronoun substitution

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| src/backend/config/config.go | Adds `EnablePronounSubstitution` configuration flag (defaults to true) |
| src/backend/pii/detectors/types.go | Extends detector output with `EntityMention` type and `CorefClusters` map, adds `ClusterID` to entities |
| src/backend/pii/detectors/onnx_model_detector.go | Implements coreference extraction from model output, assigns cluster IDs to entities, adds safe uint-to-int conversion |
| src/backend/pii/pronoun_mapper.go | New file implementing pronoun gender mapping, includes detection of gender from names and pronoun conversion |
| src/backend/pii/masking_service.go | Refactored to track entity/pronoun replacements separately, implements gender-based pronoun substitution and position-based restoration |
| src/backend/pii/generator_service.go | Adds `NewGeneratorServiceWithSeed` for deterministic testing |
| src/backend/proxy/handler.go | Updates masking service constructor call and builds backward-compatible map from entity replacements |
| src/backend/pii/masking_service_test.go | Comprehensive test suite covering masking, restoration, pronoun substitution, UTF-8 handling, and edge cases |

src/backend/pii/pronoun_mapper.go