Conversation
…g by adding a pronoun mapper and updating the masking service.
…ists - Correct "her" mapping to support possessive cases (mapping to "his"/"their" instead of "him"). - Add support for possessive pronoun "hers". - Sync name lists in `DetectGenderFromName` with `FirstNameGenerator` for better coverage.
… enhance masking service for entity and pronoun tracking
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughThis PR introduces pronoun substitution and coreference resolution capabilities for PII masking. It adds configuration for pronoun substitution, coreference detection in ONNX models, a pronoun mapper for gender-aware substitution, and enhanced masking/restoration logic tracking entity and pronoun replacements with position recovery. Changes
Sequence Diagram(s)sequenceDiagram
actor Client
participant Handler as Proxy Handler
participant MaskSvc as Masking Service
participant GenSvc as Generator Service
participant Detector as ONNX Detector
participant PronMapper as Pronoun Mapper
Client->>Handler: POST /mask (text, config)
activate Handler
Handler->>Detector: Detect(text)
activate Detector
Detector->>Detector: Process PII & Coref logits
Detector-->>Handler: DetectorOutput{Entities, CorefClusters}
deactivate Detector
Handler->>MaskSvc: MaskText(text, entities, corefClusters)
activate MaskSvc
MaskSvc->>GenSvc: GenerateReplacement(label, entity)
activate GenSvc
GenSvc-->>MaskSvc: maskedValue
deactivate GenSvc
MaskSvc->>MaskSvc: Detect cluster gender<br/>(from coref mentions)
MaskSvc->>PronMapper: DetectGenderFromName(maskedName)
activate PronMapper
PronMapper-->>MaskSvc: detectedGender
deactivate PronMapper
alt EnablePronounSubstitution
MaskSvc->>MaskSvc: Identify gender change<br/>(original→masked)
MaskSvc->>PronMapper: MapPronoun(pronoun, fromGender, toGender)
activate PronMapper
PronMapper-->>MaskSvc: substitutedPronoun
deactivate PronMapper
MaskSvc->>MaskSvc: Apply pronoun replacements<br/>(UTF-8 boundary safe)
end
MaskSvc->>MaskSvc: Apply entity replacements<br/>(descending order)
MaskSvc-->>Handler: MaskedResult{MaskedText, EntityReplacements, PronounReplacements, GenderMappings}
deactivate MaskSvc
Handler-->>Client: {masked_text, entity_replacements, gender_mappings}
deactivate Handler
Note over MaskSvc,PronMapper: RestorePII reverses this flow via<br/>restoreEntities + reversePronouns
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
Note 🎁 Summarized by CodeRabbit FreeYour organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login. Comment |
… update entity position handling
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Pull request overview
This PR introduces coreference resolution and gender-aware pronoun substitution to the PII masking pipeline. The changes enable the system to detect pronoun references to PII entities, substitute them based on gender when entity names change, and restore the original text including pronouns.
Key changes:
- Extended ONNX detector to output coreference clusters alongside PII entities
- Added pronoun mapping system with gender detection and substitution logic
- Refactored masking service to use position-based tracking for entity and pronoun replacements
- Added configuration flag to enable/disable pronoun substitution
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| src/backend/config/config.go | Adds EnablePronounSubstitution configuration flag (defaults to true) |
| src/backend/pii/detectors/types.go | Extends detector output with EntityMention type and CorefClusters map, adds ClusterID to entities |
| src/backend/pii/detectors/onnx_model_detector.go | Implements coreference extraction from model output, assigns cluster IDs to entities, adds safe uint-to-int conversion |
| src/backend/pii/pronoun_mapper.go | New file implementing pronoun gender mapping, includes detection of gender from names and pronoun conversion |
| src/backend/pii/masking_service.go | Refactored to track entity/pronoun replacements separately, implements gender-based pronoun substitution and position-based restoration |
| src/backend/pii/generator_service.go | Adds NewGeneratorServiceWithSeed for deterministic testing |
| src/backend/proxy/handler.go | Updates masking service constructor call and builds backward-compatible map from entity replacements |
| src/backend/pii/masking_service_test.go | Comprehensive test suite covering masking, restoration, pronoun substitution, UTF-8 handling, and edge cases |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Build maskedToOriginal map from EntityReplacements | ||
| maskedToOriginal := make(map[string]string) | ||
| for _, repl := range result.EntityReplacements { | ||
| maskedToOriginal[repl.MaskedText] = repl.OriginalText | ||
| } |
There was a problem hiding this comment.
Building the maskedToOriginal map from EntityReplacements can produce incorrect mappings when multiple entities have the same original text but different masked values. Only the last occurrence will be preserved in the map since map keys must be unique. This could cause restoration issues where some masked entities map to the wrong original text. Consider tracking these mappings differently or using the position-based restoration exclusively.
| // Build maskedToOriginal map from EntityReplacements | |
| maskedToOriginal := make(map[string]string) | |
| for _, repl := range result.EntityReplacements { | |
| maskedToOriginal[repl.MaskedText] = repl.OriginalText | |
| } | |
| // Avoid building a maskedToOriginal map from EntityReplacements here, as | |
| // multiple entities can share the same masked or original text, which would | |
| // cause collisions and incorrect mappings in a map keyed by text. | |
| // Instead, rely on the positional information in result.Entities for | |
| // accurate restoration, and return an empty mapping here. | |
| maskedToOriginal := make(map[string]string) |
| type Replacement struct { | ||
| StartPos int | ||
| EndPos int | ||
| NewText string | ||
| } |
There was a problem hiding this comment.
The local type Replacement defined inside the MaskText function has the same name as the replacement type defined inside the reversePronouns function (line 306). While these are in different scopes and won't cause compilation errors, using the same name for different types in the same file can be confusing. Consider using more descriptive names like TextReplacement and PronounReplacementPair to improve clarity.
| type MaskedResult struct { | ||
| MaskedText string | ||
| MaskedToOriginal map[string]string | ||
| Entities []detectors.Entity | ||
| MaskedText string | ||
| EntityReplacements []EntityReplacement // Position-based tracking | ||
| Entities []detectors.Entity | ||
| PronounReplacements []PronounReplacement | ||
| GenderMappings map[int]struct { | ||
| OriginalGender PronounGender | ||
| MaskedGender PronounGender | ||
| } | ||
| } |
There was a problem hiding this comment.
The MaskedResult struct is missing documentation. This is a public type that has been significantly changed with new fields (EntityReplacements, PronounReplacements, GenderMappings). Adding a docstring would help API consumers understand the purpose of each field and how to use them, especially since the structure is more complex now with multiple tracking mechanisms.
| for i := range entityReplacements { | ||
| pos := strings.Index(maskedText, entityReplacements[i].MaskedText) | ||
| if pos >= 0 { | ||
| entityReplacements[i].MaskedStartPos = pos | ||
| entityReplacements[i].MaskedEndPos = pos + len(entityReplacements[i].MaskedText) |
There was a problem hiding this comment.
Using strings.Index to find masked entity positions is unreliable when the same masked text appears multiple times in the output. This will always find the first occurrence, which may not correspond to the correct entity replacement. This could cause incorrect restoration behavior when multiple entities get masked to the same or similar values. Consider using a position-tracking approach during the replacement phase instead of searching afterwards.
| for i := range entityReplacements { | |
| pos := strings.Index(maskedText, entityReplacements[i].MaskedText) | |
| if pos >= 0 { | |
| entityReplacements[i].MaskedStartPos = pos | |
| entityReplacements[i].MaskedEndPos = pos + len(entityReplacements[i].MaskedText) | |
| posSearchStart := 0 | |
| for i := range entityReplacements { | |
| masked := entityReplacements[i].MaskedText | |
| if masked == "" { | |
| continue | |
| } | |
| relPos := strings.Index(maskedText[posSearchStart:], masked) | |
| if relPos >= 0 { | |
| pos := posSearchStart + relPos | |
| entityReplacements[i].MaskedStartPos = pos | |
| entityReplacements[i].MaskedEndPos = pos + len(masked) | |
| // Advance search start to just after this occurrence to avoid | |
| // matching the same occurrence for subsequent entities. | |
| posSearchStart = entityReplacements[i].MaskedEndPos |
| // Sort by position descending to avoid position shifts | ||
| sort.Slice(replacements, func(i, j int) bool { | ||
| return replacements[i].MaskedStartPos > replacements[j].MaskedStartPos | ||
| }) | ||
|
|
||
| result := text | ||
| for _, repl := range replacements { | ||
| // Use positions in masked text to replace with original text | ||
| if repl.MaskedStartPos >= 0 && repl.MaskedEndPos <= len(result) && | ||
| repl.MaskedStartPos <= repl.MaskedEndPos && | ||
| isValidUTF8Boundary(result, repl.MaskedStartPos) && | ||
| isValidUTF8Boundary(result, repl.MaskedEndPos) { | ||
| result = result[:repl.MaskedStartPos] + repl.OriginalText + result[repl.MaskedEndPos:] | ||
| } | ||
| } | ||
|
|
||
| return result |
There was a problem hiding this comment.
The restoration logic uses positions that were calculated in the masked text, but then applies replacements sequentially which shifts positions. After the first replacement, all subsequent positions become invalid. While the code sorts in descending order to mitigate this, the positions stored in MaskedStartPos and MaskedEndPos were calculated from the final masked text, not accounting for prior replacements during restoration. This could lead to incorrect restoration when multiple entities exist.
| // Sort by position descending to avoid position shifts | |
| sort.Slice(replacements, func(i, j int) bool { | |
| return replacements[i].MaskedStartPos > replacements[j].MaskedStartPos | |
| }) | |
| result := text | |
| for _, repl := range replacements { | |
| // Use positions in masked text to replace with original text | |
| if repl.MaskedStartPos >= 0 && repl.MaskedEndPos <= len(result) && | |
| repl.MaskedStartPos <= repl.MaskedEndPos && | |
| isValidUTF8Boundary(result, repl.MaskedStartPos) && | |
| isValidUTF8Boundary(result, repl.MaskedEndPos) { | |
| result = result[:repl.MaskedStartPos] + repl.OriginalText + result[repl.MaskedEndPos:] | |
| } | |
| } | |
| return result | |
| // Work from the original masked text and build the result to avoid index shifts | |
| original := text | |
| // Sort by start position ascending so we can rebuild the string from left to right | |
| sort.Slice(replacements, func(i, j int) bool { | |
| return replacements[i].MaskedStartPos < replacements[j].MaskedStartPos | |
| }) | |
| var builder strings.Builder | |
| builder.Grow(len(original)) // heuristic; final size may be larger or smaller | |
| lastPos := 0 | |
| for _, repl := range replacements { | |
| // Validate positions against the original masked text | |
| if repl.MaskedStartPos < 0 || repl.MaskedEndPos > len(original) || | |
| repl.MaskedStartPos > repl.MaskedEndPos { | |
| continue | |
| } | |
| // Skip overlapping or out-of-order replacements | |
| if repl.MaskedStartPos < lastPos { | |
| continue | |
| } | |
| // Ensure we are not cutting through the middle of a UTF-8 code point | |
| if !isValidUTF8Boundary(original, repl.MaskedStartPos) || | |
| !isValidUTF8Boundary(original, repl.MaskedEndPos) { | |
| continue | |
| } | |
| // Copy text between the last position and the start of this replacement | |
| builder.WriteString(original[lastPos:repl.MaskedStartPos]) | |
| // Insert the original (unmasked) text | |
| builder.WriteString(repl.OriginalText) | |
| lastPos = repl.MaskedEndPos | |
| } | |
| // Append any remaining text after the last replacement | |
| builder.WriteString(original[lastPos:]) | |
| return builder.String() |
| for _, male := range maleNames { | ||
| if strings.Contains(lowerName, male) { | ||
| return GenderMale | ||
| } | ||
| } | ||
|
|
||
| for _, female := range femaleNames { | ||
| if strings.Contains(lowerName, female) { | ||
| return GenderFemale | ||
| } | ||
| } |
There was a problem hiding this comment.
Using strings.Contains for name matching can produce false positives. For example, a name like "Sarahjones" or "Thomasina" would match "sarah" or "tom" respectively, even though these are different names. Consider using exact word matching or comparing against the full name instead of substring matching.
| EndPos: safeUintToInt(startOffset[1]), | ||
| IsEntity: label != "O", | ||
| } | ||
| corefClusters[bestCluster] = append(corefClusters[bestCluster], mention) |
There was a problem hiding this comment.
Coreference clusters are being created at the token level, which can result in duplicate mentions for the same entity when multiple tokens comprise a single word or entity. This could create redundant or overlapping mentions in the coreference clusters. Consider aggregating token-level coref predictions into word or entity-level mentions to avoid duplication.
| corefClusters[bestCluster] = append(corefClusters[bestCluster], mention) | |
| // Avoid inserting duplicate mentions for the same span/text into a cluster. | |
| cluster := corefClusters[bestCluster] | |
| if len(cluster) > 0 { | |
| last := cluster[len(cluster)-1] | |
| if last.StartPos == mention.StartPos && | |
| last.EndPos == mention.EndPos && | |
| last.Text == mention.Text && | |
| last.IsEntity == mention.IsEntity { | |
| // Duplicate mention; skip appending. | |
| continue | |
| } | |
| } | |
| cluster = append(cluster, mention) | |
| corefClusters[bestCluster] = cluster |
| // GetAllPronouns returns all supported pronouns (sorted by length descending) | ||
| func (pm *PronounMapper) GetAllPronouns() []string { | ||
| var pronouns []string | ||
| for pronoun := range pm.pronounMap { | ||
| pronouns = append(pronouns, pronoun) | ||
| } | ||
| return pronouns | ||
| } |
There was a problem hiding this comment.
The function name GetAllPronouns suggests it returns pronouns sorted by length descending (as stated in the comment), but the actual implementation does not sort the results. The caller in reversePronouns does sort them, but the function comment is misleading. Consider either sorting within this function to match the documentation, or updating the comment to remove the sorting claim.
This pull request introduces coreference (coref) detection and pronoun substitution support to the PII detection and masking pipeline. The ONNX-based PII detector is extended to output both PII entity and coreference cluster predictions, and the masking service is updated to optionally substitute pronouns based on configuration. Several data structures and service constructors are updated to support these features.
Coreference detection and output:
ONNXModelDetectorSimplenow outputs both PII and coreference logits, processes them to assign detected entities to coreference clusters, and includes coreference cluster information in theDetectorOutput. Entities now have an associatedClusterID. (src/backend/pii/detectors/onnx_model_detector.go,src/backend/pii/detectors/types.go) [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]Pronoun substitution support:
EnablePronounSubstitutionflag to the config, defaulting totrue, to control whether pronoun substitution is performed during masking. (src/backend/config/config.go) [1] [2]MaskingServicenow tracks pronoun and entity replacements with new data structures, and is constructed with a pronoun mapper and config to support this feature. (src/backend/pii/masking_service.go) [1] [2]Testing and deterministic output:
NewGeneratorServiceWithSeedfunction to create a deterministic generator for testing purposes. (src/backend/pii/generator_service.go)Refactoring and utility improvements:
src/backend/pii/masking_service.go)These changes lay the groundwork for robust, cluster-aware PII masking and enable optional pronoun substitution for improved privacy and text coherence.
Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.