Skip to content

Feature/coreference match#71

Draft
Davidnet wants to merge 8 commits intomainfrom
feature/coreference-match
Draft

Feature/coreference match#71
Davidnet wants to merge 8 commits intomainfrom
feature/coreference-match

Conversation

@Davidnet
Copy link
Member

@Davidnet Davidnet commented Jan 5, 2026

This pull request introduces coreference (coref) detection and pronoun substitution support to the PII detection and masking pipeline. The ONNX-based PII detector is extended to output both PII entity and coreference cluster predictions, and the masking service is updated to optionally substitute pronouns based on configuration. Several data structures and service constructors are updated to support these features.

Coreference detection and output:

  • The ONNXModelDetectorSimple now outputs both PII and coreference logits, processes them to assign detected entities to coreference clusters, and includes coreference cluster information in the DetectorOutput. Entities now have an associated ClusterID. (src/backend/pii/detectors/onnx_model_detector.go, src/backend/pii/detectors/types.go) [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Pronoun substitution support:

  • Added an EnablePronounSubstitution flag to the config, defaulting to true, to control whether pronoun substitution is performed during masking. (src/backend/config/config.go) [1] [2]
  • The MaskingService now tracks pronoun and entity replacements with new data structures, and is constructed with a pronoun mapper and config to support this feature. (src/backend/pii/masking_service.go) [1] [2]

Testing and deterministic output:

  • Added a new NewGeneratorServiceWithSeed function to create a deterministic generator for testing purposes. (src/backend/pii/generator_service.go)

Refactoring and utility improvements:

  • Introduced utility functions and types in the masking service to handle UTF-8 boundaries, overlap checks, and replacement tracking for robust text manipulation. (src/backend/pii/masking_service.go)

These changes lay the groundwork for robust, cluster-aware PII masking and enable optional pronoun substitution for improved privacy and text coherence.

Summary by CodeRabbit

  • New Features

    • Added pronoun substitution capability for PII masking with configurable control
    • Implemented PII restoration to recover original content from masked text
    • Enhanced PII detection with coreference analysis for entity relationship tracking
  • Tests

    • Comprehensive test suite for masking and restoration workflows

✏️ Tip: You can customize this high-level summary in your review settings.

…g by adding a pronoun mapper and updating the masking service.
…ists

- Correct "her" mapping to support possessive cases (mapping to "his"/"their" instead of "him").
- Add support for possessive pronoun "hers".
- Sync name lists in `DetectGenderFromName` with `FirstNameGenerator` for better coverage.
… enhance masking service for entity and pronoun tracking
@Davidnet Davidnet linked an issue Jan 5, 2026 that may be closed by this pull request
16 tasks
@coderabbitai
Copy link

coderabbitai bot commented Jan 5, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This PR introduces pronoun substitution and coreference resolution capabilities for PII masking. It adds configuration for pronoun substitution, coreference detection in ONNX models, a pronoun mapper for gender-aware substitution, and enhanced masking/restoration logic tracking entity and pronoun replacements with position recovery.

Changes

Cohort / File(s) Summary
Configuration & Infrastructure
src/backend/config/config.go
Added EnablePronounSubstitution bool field to Config struct, defaulting to true.
Coreference Detection
src/backend/pii/detectors/onnx_model_detector.go
Extended ONNX detector to process coref_logits output tensor alongside PII logits. Added coref label counting, cluster discovery, and entity-to-cluster association via findClusterForEntity helper. Introduced safeUintToInt for bounds-safe span calculations and updated resource cleanup for new tensor.
Type Definitions
src/backend/pii/detectors/types.go
Added EntityMention type with text and position fields. Extended DetectorOutput with CorefClusters map[int][]EntityMention. Added ClusterID field to Entity struct.
Deterministic Generation
src/backend/pii/generator_service.go
Added NewGeneratorServiceWithSeed(seed int64) constructor for deterministic RNG seeding (testing).
Pronoun Handling Infrastructure
src/backend/pii/pronoun_mapper.go
New module introducing PronounGender enum (Unknown, Male, Female, Neutral), PronounMapper type managing gender-to-pronoun mappings, and methods: MapPronoun, GetAllPronouns, DetectGenderFromName with name-substring heuristics and capitalization preservation.
Masking & Restoration
src/backend/pii/masking_service.go
Added PronounReplacement and EntityReplacement types for tracking substitutions. Expanded MaskedResult with entity/pronoun replacements, gender mappings, masked text, and entities. Updated MaskingService constructor to accept config. Enhanced MaskText to compute per-entity replacements, detect cluster gender changes, apply pronoun substitutions (when enabled), and track restoration data. Introduced RestorePII, restoreEntities, reversePronouns for PII recovery. Added UTF-8 boundary checks and overlap detection helpers.
Masking Tests
src/backend/pii/masking_service_test.go
Comprehensive test suite with mock detector, covering multi-entity masking, UTF-8 resilience, coreference overlaps, pronoun gender switching, restoration, disabled substitution, and end-to-end scenarios.
Service Integration
src/backend/proxy/handler.go
Updated NewMaskingService constructor call to pass config parameter. Changed masking flow to build maskedToOriginal mappings from EntityReplacements instead of MaskedToOriginal.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Handler as Proxy Handler
    participant MaskSvc as Masking Service
    participant GenSvc as Generator Service
    participant Detector as ONNX Detector
    participant PronMapper as Pronoun Mapper
    
    Client->>Handler: POST /mask (text, config)
    activate Handler
    Handler->>Detector: Detect(text)
    activate Detector
    Detector->>Detector: Process PII & Coref logits
    Detector-->>Handler: DetectorOutput{Entities, CorefClusters}
    deactivate Detector
    
    Handler->>MaskSvc: MaskText(text, entities, corefClusters)
    activate MaskSvc
    
    MaskSvc->>GenSvc: GenerateReplacement(label, entity)
    activate GenSvc
    GenSvc-->>MaskSvc: maskedValue
    deactivate GenSvc
    
    MaskSvc->>MaskSvc: Detect cluster gender<br/>(from coref mentions)
    MaskSvc->>PronMapper: DetectGenderFromName(maskedName)
    activate PronMapper
    PronMapper-->>MaskSvc: detectedGender
    deactivate PronMapper
    
    alt EnablePronounSubstitution
        MaskSvc->>MaskSvc: Identify gender change<br/>(original→masked)
        MaskSvc->>PronMapper: MapPronoun(pronoun, fromGender, toGender)
        activate PronMapper
        PronMapper-->>MaskSvc: substitutedPronoun
        deactivate PronMapper
        MaskSvc->>MaskSvc: Apply pronoun replacements<br/>(UTF-8 boundary safe)
    end
    
    MaskSvc->>MaskSvc: Apply entity replacements<br/>(descending order)
    MaskSvc-->>Handler: MaskedResult{MaskedText, EntityReplacements, PronounReplacements, GenderMappings}
    deactivate MaskSvc
    
    Handler-->>Client: {masked_text, entity_replacements, gender_mappings}
    deactivate Handler
    
    Note over MaskSvc,PronMapper: RestorePII reverses this flow via<br/>restoreEntities + reversePronouns
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 Coreference clusters now bound,
Pronouns dance through gender's round,
Entities masked with care and grace,
Restoration finds each misplaced trace—
A hopping sprint through PII's embrace!


Note

🎁 Summarized by CodeRabbit Free

Your organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login.

Comment @coderabbitai help to get the list of available commands and usage tips.

@Davidnet
Copy link
Member Author

Davidnet commented Jan 5, 2026

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Jan 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces coreference resolution and gender-aware pronoun substitution to the PII masking pipeline. The changes enable the system to detect pronoun references to PII entities, substitute them based on gender when entity names change, and restore the original text including pronouns.

Key changes:

  • Extended ONNX detector to output coreference clusters alongside PII entities
  • Added pronoun mapping system with gender detection and substitution logic
  • Refactored masking service to use position-based tracking for entity and pronoun replacements
  • Added configuration flag to enable/disable pronoun substitution

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
src/backend/config/config.go Adds EnablePronounSubstitution configuration flag (defaults to true)
src/backend/pii/detectors/types.go Extends detector output with EntityMention type and CorefClusters map, adds ClusterID to entities
src/backend/pii/detectors/onnx_model_detector.go Implements coreference extraction from model output, assigns cluster IDs to entities, adds safe uint-to-int conversion
src/backend/pii/pronoun_mapper.go New file implementing pronoun gender mapping, includes detection of gender from names and pronoun conversion
src/backend/pii/masking_service.go Refactored to track entity/pronoun replacements separately, implements gender-based pronoun substitution and position-based restoration
src/backend/pii/generator_service.go Adds NewGeneratorServiceWithSeed for deterministic testing
src/backend/proxy/handler.go Updates masking service constructor call and builds backward-compatible map from entity replacements
src/backend/pii/masking_service_test.go Comprehensive test suite covering masking, restoration, pronoun substitution, UTF-8 handling, and edge cases

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +250 to +254
// Build maskedToOriginal map from EntityReplacements
maskedToOriginal := make(map[string]string)
for _, repl := range result.EntityReplacements {
maskedToOriginal[repl.MaskedText] = repl.OriginalText
}
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building the maskedToOriginal map from EntityReplacements can produce incorrect mappings when multiple entities have the same original text but different masked values. Only the last occurrence will be preserved in the map since map keys must be unique. This could cause restoration issues where some masked entities map to the wrong original text. Consider tracking these mappings differently or using the position-based restoration exclusively.

Suggested change
// Build maskedToOriginal map from EntityReplacements
maskedToOriginal := make(map[string]string)
for _, repl := range result.EntityReplacements {
maskedToOriginal[repl.MaskedText] = repl.OriginalText
}
// Avoid building a maskedToOriginal map from EntityReplacements here, as
// multiple entities can share the same masked or original text, which would
// cause collisions and incorrect mappings in a map keyed by text.
// Instead, rely on the positional information in result.Entities for
// accurate restoration, and return an empty mapping here.
maskedToOriginal := make(map[string]string)

Copilot uses AI. Check for mistakes.
Comment on lines +98 to +102
type Replacement struct {
StartPos int
EndPos int
NewText string
}
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local type Replacement defined inside the MaskText function has the same name as the replacement type defined inside the reversePronouns function (line 306). While these are in different scopes and won't cause compilation errors, using the same name for different types in the same file can be confusing. Consider using more descriptive names like TextReplacement and PronounReplacementPair to improve clarity.

Copilot uses AI. Check for mistakes.
Comment on lines 48 to 57
type MaskedResult struct {
MaskedText string
MaskedToOriginal map[string]string
Entities []detectors.Entity
MaskedText string
EntityReplacements []EntityReplacement // Position-based tracking
Entities []detectors.Entity
PronounReplacements []PronounReplacement
GenderMappings map[int]struct {
OriginalGender PronounGender
MaskedGender PronounGender
}
}
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MaskedResult struct is missing documentation. This is a public type that has been significantly changed with new fields (EntityReplacements, PronounReplacements, GenderMappings). Adding a docstring would help API consumers understand the purpose of each field and how to use them, especially since the structure is more complex now with multiple tracking mechanisms.

Copilot uses AI. Check for mistakes.
Comment on lines +202 to +206
for i := range entityReplacements {
pos := strings.Index(maskedText, entityReplacements[i].MaskedText)
if pos >= 0 {
entityReplacements[i].MaskedStartPos = pos
entityReplacements[i].MaskedEndPos = pos + len(entityReplacements[i].MaskedText)
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using strings.Index to find masked entity positions is unreliable when the same masked text appears multiple times in the output. This will always find the first occurrence, which may not correspond to the correct entity replacement. This could cause incorrect restoration behavior when multiple entities get masked to the same or similar values. Consider using a position-tracking approach during the replacement phase instead of searching afterwards.

Suggested change
for i := range entityReplacements {
pos := strings.Index(maskedText, entityReplacements[i].MaskedText)
if pos >= 0 {
entityReplacements[i].MaskedStartPos = pos
entityReplacements[i].MaskedEndPos = pos + len(entityReplacements[i].MaskedText)
posSearchStart := 0
for i := range entityReplacements {
masked := entityReplacements[i].MaskedText
if masked == "" {
continue
}
relPos := strings.Index(maskedText[posSearchStart:], masked)
if relPos >= 0 {
pos := posSearchStart + relPos
entityReplacements[i].MaskedStartPos = pos
entityReplacements[i].MaskedEndPos = pos + len(masked)
// Advance search start to just after this occurrence to avoid
// matching the same occurrence for subsequent entities.
posSearchStart = entityReplacements[i].MaskedEndPos

Copilot uses AI. Check for mistakes.
Comment on lines +278 to +294
// Sort by position descending to avoid position shifts
sort.Slice(replacements, func(i, j int) bool {
return replacements[i].MaskedStartPos > replacements[j].MaskedStartPos
})

result := text
for _, repl := range replacements {
// Use positions in masked text to replace with original text
if repl.MaskedStartPos >= 0 && repl.MaskedEndPos <= len(result) &&
repl.MaskedStartPos <= repl.MaskedEndPos &&
isValidUTF8Boundary(result, repl.MaskedStartPos) &&
isValidUTF8Boundary(result, repl.MaskedEndPos) {
result = result[:repl.MaskedStartPos] + repl.OriginalText + result[repl.MaskedEndPos:]
}
}

return result
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restoration logic uses positions that were calculated in the masked text, but then applies replacements sequentially which shifts positions. After the first replacement, all subsequent positions become invalid. While the code sorts in descending order to mitigate this, the positions stored in MaskedStartPos and MaskedEndPos were calculated from the final masked text, not accounting for prior replacements during restoration. This could lead to incorrect restoration when multiple entities exist.

Suggested change
// Sort by position descending to avoid position shifts
sort.Slice(replacements, func(i, j int) bool {
return replacements[i].MaskedStartPos > replacements[j].MaskedStartPos
})
result := text
for _, repl := range replacements {
// Use positions in masked text to replace with original text
if repl.MaskedStartPos >= 0 && repl.MaskedEndPos <= len(result) &&
repl.MaskedStartPos <= repl.MaskedEndPos &&
isValidUTF8Boundary(result, repl.MaskedStartPos) &&
isValidUTF8Boundary(result, repl.MaskedEndPos) {
result = result[:repl.MaskedStartPos] + repl.OriginalText + result[repl.MaskedEndPos:]
}
}
return result
// Work from the original masked text and build the result to avoid index shifts
original := text
// Sort by start position ascending so we can rebuild the string from left to right
sort.Slice(replacements, func(i, j int) bool {
return replacements[i].MaskedStartPos < replacements[j].MaskedStartPos
})
var builder strings.Builder
builder.Grow(len(original)) // heuristic; final size may be larger or smaller
lastPos := 0
for _, repl := range replacements {
// Validate positions against the original masked text
if repl.MaskedStartPos < 0 || repl.MaskedEndPos > len(original) ||
repl.MaskedStartPos > repl.MaskedEndPos {
continue
}
// Skip overlapping or out-of-order replacements
if repl.MaskedStartPos < lastPos {
continue
}
// Ensure we are not cutting through the middle of a UTF-8 code point
if !isValidUTF8Boundary(original, repl.MaskedStartPos) ||
!isValidUTF8Boundary(original, repl.MaskedEndPos) {
continue
}
// Copy text between the last position and the start of this replacement
builder.WriteString(original[lastPos:repl.MaskedStartPos])
// Insert the original (unmasked) text
builder.WriteString(repl.OriginalText)
lastPos = repl.MaskedEndPos
}
// Append any remaining text after the last replacement
builder.WriteString(original[lastPos:])
return builder.String()

Copilot uses AI. Check for mistakes.
Comment on lines +122 to +132
for _, male := range maleNames {
if strings.Contains(lowerName, male) {
return GenderMale
}
}

for _, female := range femaleNames {
if strings.Contains(lowerName, female) {
return GenderFemale
}
}
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using strings.Contains for name matching can produce false positives. For example, a name like "Sarahjones" or "Thomasina" would match "sarah" or "tom" respectively, even though these are different names. Consider using exact word matching or comparing against the full name instead of substring matching.

Copilot uses AI. Check for mistakes.
EndPos: safeUintToInt(startOffset[1]),
IsEntity: label != "O",
}
corefClusters[bestCluster] = append(corefClusters[bestCluster], mention)
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coreference clusters are being created at the token level, which can result in duplicate mentions for the same entity when multiple tokens comprise a single word or entity. This could create redundant or overlapping mentions in the coreference clusters. Consider aggregating token-level coref predictions into word or entity-level mentions to avoid duplication.

Suggested change
corefClusters[bestCluster] = append(corefClusters[bestCluster], mention)
// Avoid inserting duplicate mentions for the same span/text into a cluster.
cluster := corefClusters[bestCluster]
if len(cluster) > 0 {
last := cluster[len(cluster)-1]
if last.StartPos == mention.StartPos &&
last.EndPos == mention.EndPos &&
last.Text == mention.Text &&
last.IsEntity == mention.IsEntity {
// Duplicate mention; skip appending.
continue
}
}
cluster = append(cluster, mention)
corefClusters[bestCluster] = cluster

Copilot uses AI. Check for mistakes.
Comment on lines +104 to +111
// GetAllPronouns returns all supported pronouns (sorted by length descending)
func (pm *PronounMapper) GetAllPronouns() []string {
var pronouns []string
for pronoun := range pm.pronounMap {
pronouns = append(pronouns, pronoun)
}
return pronouns
}
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function name GetAllPronouns suggests it returns pronouns sorted by length descending (as stated in the comment), but the actual implementation does not sort the results. The caller in reversePronouns does sort them, but the function comment is misleading. Consider either sorting within this function to match the documentation, or updating the comment to remove the sorting claim.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Integrate Co-reference Detection for Pronoun Substitution in PII Masking

1 participant

Comments