## Background
Understanding your dataset is the foundation of building trustworthy AI systems. As noted in Why Dataset Cards Should Be Your First Step Toward Trustworthy AI, dataset cards provide transparency and accountability, helping teams identify potential biases and make informed decisions about model training.
Currently, the Yaak proxy has three dataset directories with ~20,000 samples and ~6,000 training samples (note: we stopped generating training-set samples since tokenization now happens as part of the ML training pipeline, so you can ignore that folder), but it lacks comprehensive statistics and documentation about:
- Language distribution: Which languages are represented and how balanced?
- PII type distribution: Are certain PII types over/under-represented?
- Text characteristics: Length distributions, complexity metrics
- Entity density: How many PII entities per text?
- Co-reference patterns: How are entities referenced across texts?
- Data quality: Consistency and completeness metrics
This issue proposes creating an automated statistics generation system and comprehensive dataset card to document the Yaak PII detection dataset.
## Dataset Card Purpose
A dataset card serves as a "nutrition label" for ML data, providing:
- ✅ Enhanced model performance - Understanding data strengths/weaknesses
- ✅ Stronger ethical foundation - Surface biases early
- ✅ Simplified compliance - Documentation for regulatory requirements
- ✅ Improved collaboration - Clear communication across teams
## Implementation Plan
### Phase 1: Statistics Generation Script
**File**: `src/scripts/generate_dataset_stats.py` (new)
Create a comprehensive statistics generation script:
```python
#!/usr/bin/env python3
"""
Generate comprehensive statistics and visualizations for the Yaak PII dataset.

Usage:
    python src/scripts/generate_dataset_stats.py \
        --input model/dataset/reviewed_samples \
        --output docs/dataset_statistics

This script analyzes the dataset and generates:
- Distribution statistics (languages, PII types, text lengths)
- Visualization plots (histograms, bar charts, heatmaps)
- Summary markdown report
- Dataset card in standard format
"""
import argparse
import json
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List
import re

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm


class DatasetStatistics:
    """Compute and store dataset statistics."""

    def __init__(self):
        self.total_samples = 0
        self.pii_type_counts = Counter()
        self.pii_per_sample = []
        self.text_lengths = []
        self.token_counts = []
        self.coref_cluster_counts = []
        self.coref_mention_counts = []
        self.entity_type_counts = Counter()
        self.language_estimates = Counter()
        self.label_value_examples = defaultdict(list)
        self.empty_samples = 0
        self.max_pii_per_sample = 0
        self.samples_with_coreferences = 0

    def analyze_sample(self, sample: Dict[str, Any], sample_id: str):
        """Analyze a single sample and update statistics."""
        self.total_samples += 1

        # Text analysis
        text = sample.get("text", "")
        self.text_lengths.append(len(text))
        self.token_counts.append(len(text.split()))

        # Language detection (simple heuristic based on character sets)
        language = self._estimate_language(text)
        self.language_estimates[language] += 1

        # PII entity analysis
        privacy_mask = sample.get("privacy_mask", [])
        num_pii = len(privacy_mask)
        self.pii_per_sample.append(num_pii)
        self.max_pii_per_sample = max(self.max_pii_per_sample, num_pii)
        if num_pii == 0:
            self.empty_samples += 1
        for entity in privacy_mask:
            label = entity.get("label", "UNKNOWN")
            value = entity.get("value", "")
            self.pii_type_counts[label] += 1
            # Store examples (limit to 5 per type)
            if len(self.label_value_examples[label]) < 5:
                self.label_value_examples[label].append(value)

        # Co-reference analysis
        coreferences = sample.get("coreferences", [])
        num_clusters = len(coreferences)
        self.coref_cluster_counts.append(num_clusters)
        if num_clusters > 0:
            self.samples_with_coreferences += 1
        for coref in coreferences:
            entity_type = coref.get("entity_type", "unknown")
            mentions = coref.get("mentions", [])
            self.entity_type_counts[entity_type] += 1
            self.coref_mention_counts.append(len(mentions))

    def _estimate_language(self, text: str) -> str:
        """Estimate language based on character sets and patterns."""
        if not text:
            return "unknown"
        # Count different character types
        latin_chars = len(re.findall(r'[a-zA-Z]', text))
        cyrillic_chars = len(re.findall(r'[а-яА-ЯёЁ]', text))
        arabic_chars = len(re.findall(r'[\u0600-\u06FF]', text))
        chinese_chars = len(re.findall(r'[\u4e00-\u9fff]', text))
        total_chars = len(re.findall(r'\w', text))
        if total_chars == 0:
            return "unknown"
        # Simple heuristics
        if cyrillic_chars / total_chars > 0.3:
            return "Russian/Cyrillic"
        elif arabic_chars / total_chars > 0.3:
            return "Arabic"
        elif chinese_chars / total_chars > 0.3:
            return "Chinese/CJK"
        elif latin_chars / total_chars > 0.7:
            # Further distinguish common languages
            if re.search(r'\b(the|and|is|in|to|of)\b', text.lower()):
                return "English"
            elif re.search(r'\b(der|die|das|und|ist|in)\b', text.lower()):
                return "German"
            elif re.search(r'\b(le|la|les|et|est|de)\b', text.lower()):
                return "French"
            elif re.search(r'\b(el|la|los|las|y|es|de)\b', text.lower()):
                return "Spanish"
            elif re.search(r'\b(de|het|een|en|is|van)\b', text.lower()):
                return "Dutch"
            else:
                return "Other Latin"
        else:
            return "Mixed/Other"

    def compute_summary_stats(self) -> Dict[str, Any]:
        """Compute summary statistics."""
        return {
            "dataset_overview": {
                "total_samples": self.total_samples,
                "samples_with_pii": self.total_samples - self.empty_samples,
                "samples_without_pii": self.empty_samples,
                "samples_with_coreferences": self.samples_with_coreferences,
            },
            "text_statistics": {
                "avg_text_length": np.mean(self.text_lengths) if self.text_lengths else 0,
                "median_text_length": np.median(self.text_lengths) if self.text_lengths else 0,
                "min_text_length": min(self.text_lengths) if self.text_lengths else 0,
                "max_text_length": max(self.text_lengths) if self.text_lengths else 0,
                "avg_token_count": np.mean(self.token_counts) if self.token_counts else 0,
                "median_token_count": np.median(self.token_counts) if self.token_counts else 0,
            },
            "pii_statistics": {
                "total_pii_entities": sum(self.pii_type_counts.values()),
                "unique_pii_types": len(self.pii_type_counts),
                "avg_pii_per_sample": np.mean(self.pii_per_sample) if self.pii_per_sample else 0,
                "median_pii_per_sample": np.median(self.pii_per_sample) if self.pii_per_sample else 0,
                "max_pii_per_sample": self.max_pii_per_sample,
                "pii_type_distribution": dict(self.pii_type_counts.most_common()),
            },
            "coreference_statistics": {
                "samples_with_coreferences": self.samples_with_coreferences,
                "avg_clusters_per_sample": np.mean(self.coref_cluster_counts) if self.coref_cluster_counts else 0,
                "avg_mentions_per_cluster": np.mean(self.coref_mention_counts) if self.coref_mention_counts else 0,
                "entity_type_distribution": dict(self.entity_type_counts.most_common()),
            },
            "language_distribution": dict(self.language_estimates.most_common()),
        }


class DatasetVisualizer:
    """Create visualizations for dataset statistics."""

    def __init__(self, output_dir: Path):
        self.output_dir = output_dir
        self.output_dir.mkdir(parents=True, exist_ok=True)
        # Set style
        sns.set_style("whitegrid")
        plt.rcParams['figure.figsize'] = (12, 6)

    def plot_pii_type_distribution(self, pii_counts: Dict[str, int]):
        """Plot PII type distribution as horizontal bar chart."""
        fig, ax = plt.subplots(figsize=(12, 8))
        # Sort by count
        sorted_items = sorted(pii_counts.items(), key=lambda x: x[1], reverse=True)
        labels, counts = zip(*sorted_items)
        # Create horizontal bar chart
        y_pos = np.arange(len(labels))
        ax.barh(y_pos, counts, color='steelblue')
        ax.set_yticks(y_pos)
        ax.set_yticklabels(labels)
        ax.invert_yaxis()
        ax.set_xlabel('Count')
        ax.set_title('PII Entity Type Distribution', fontsize=14, fontweight='bold')
        # Add count labels
        for i, v in enumerate(counts):
            ax.text(v + max(counts) * 0.01, i, str(v), va='center')
        plt.tight_layout()
        plt.savefig(self.output_dir / 'pii_type_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()

    def plot_text_length_distribution(self, text_lengths: List[int]):
        """Plot text length distribution as histogram."""
        fig, ax = plt.subplots(figsize=(12, 6))
        ax.hist(text_lengths, bins=50, color='skyblue', edgecolor='black', alpha=0.7)
        ax.set_xlabel('Text Length (characters)')
        ax.set_ylabel('Frequency')
        ax.set_title('Text Length Distribution', fontsize=14, fontweight='bold')
        # Add mean and median lines
        mean_length = np.mean(text_lengths)
        median_length = np.median(text_lengths)
        ax.axvline(mean_length, color='red', linestyle='--', label=f'Mean: {mean_length:.0f}')
        ax.axvline(median_length, color='green', linestyle='--', label=f'Median: {median_length:.0f}')
        ax.legend()
        plt.tight_layout()
        plt.savefig(self.output_dir / 'text_length_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()

    def plot_pii_per_sample_distribution(self, pii_per_sample: List[int]):
        """Plot PII entities per sample distribution."""
        fig, ax = plt.subplots(figsize=(12, 6))
        max_pii = max(pii_per_sample) if pii_per_sample else 0
        bins = range(0, max_pii + 2)
        ax.hist(pii_per_sample, bins=bins, color='coral', edgecolor='black', alpha=0.7)
        ax.set_xlabel('Number of PII Entities per Sample')
        ax.set_ylabel('Frequency')
        ax.set_title('PII Entities per Sample Distribution', fontsize=14, fontweight='bold')
        # Add mean line
        mean_pii = np.mean(pii_per_sample)
        ax.axvline(mean_pii, color='red', linestyle='--', label=f'Mean: {mean_pii:.1f}')
        ax.legend()
        plt.tight_layout()
        plt.savefig(self.output_dir / 'pii_per_sample_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()

    def plot_language_distribution(self, language_counts: Dict[str, int]):
        """Plot language distribution as pie chart."""
        fig, ax = plt.subplots(figsize=(10, 8))
        labels = list(language_counts.keys())
        sizes = list(language_counts.values())
        # Create color palette
        colors = sns.color_palette('Set3', len(labels))
        wedges, texts, autotexts = ax.pie(
            sizes,
            labels=labels,
            autopct='%1.1f%%',
            colors=colors,
            startangle=90
        )
        # Enhance text
        for text in texts:
            text.set_fontsize(10)
        for autotext in autotexts:
            autotext.set_color('white')
            autotext.set_fontweight('bold')
            autotext.set_fontsize(9)
        ax.set_title('Language Distribution (Estimated)', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.savefig(self.output_dir / 'language_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()

    def plot_entity_type_distribution(self, entity_counts: Dict[str, int]):
        """Plot co-reference entity type distribution."""
        if not entity_counts:
            return
        fig, ax = plt.subplots(figsize=(10, 6))
        labels = list(entity_counts.keys())
        counts = list(entity_counts.values())
        x_pos = np.arange(len(labels))
        ax.bar(x_pos, counts, color='lightgreen', edgecolor='black')
        ax.set_xticks(x_pos)
        ax.set_xticklabels(labels, rotation=45, ha='right')
        ax.set_ylabel('Count')
        ax.set_title('Co-reference Entity Type Distribution', fontsize=14, fontweight='bold')
        # Add count labels
        for i, v in enumerate(counts):
            ax.text(i, v + max(counts) * 0.01, str(v), ha='center', va='bottom')
        plt.tight_layout()
        plt.savefig(self.output_dir / 'entity_type_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()


class MarkdownReportGenerator:
    """Generate markdown report with statistics."""

    @staticmethod
    def generate_report(stats_summary: Dict[str, Any], output_path: Path):
        """Generate comprehensive markdown report."""
        report = []
        report.append("# Yaak PII Detection Dataset Statistics\n")
        report.append(f"**Generated:** {datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')} UTC\n")
        report.append("---\n")

        # Dataset Overview
        overview = stats_summary["dataset_overview"]
        report.append("## Dataset Overview\n")
        report.append(f"- **Total Samples:** {overview['total_samples']:,}\n")
        report.append(f"- **Samples with PII:** {overview['samples_with_pii']:,} ({overview['samples_with_pii']/overview['total_samples']*100:.1f}%)\n")
        report.append(f"- **Samples without PII:** {overview['samples_without_pii']:,} ({overview['samples_without_pii']/overview['total_samples']*100:.1f}%)\n")
        report.append(f"- **Samples with Co-references:** {overview['samples_with_coreferences']:,} ({overview['samples_with_coreferences']/overview['total_samples']*100:.1f}%)\n")
        report.append("\n")

        # Text Statistics
        text_stats = stats_summary["text_statistics"]
        report.append("## Text Statistics\n")
        report.append(f"- **Average Text Length:** {text_stats['avg_text_length']:.0f} characters\n")
        report.append(f"- **Median Text Length:** {text_stats['median_text_length']:.0f} characters\n")
        report.append(f"- **Min Text Length:** {text_stats['min_text_length']:,} characters\n")
        report.append(f"- **Max Text Length:** {text_stats['max_text_length']:,} characters\n")
        report.append(f"- **Average Token Count:** {text_stats['avg_token_count']:.0f} tokens\n")
        report.append(f"- **Median Token Count:** {text_stats['median_token_count']:.0f} tokens\n")
        report.append("\n\n")

        # PII Statistics
        pii_stats = stats_summary["pii_statistics"]
        report.append("## PII Entity Statistics\n")
        report.append(f"- **Total PII Entities:** {pii_stats['total_pii_entities']:,}\n")
        report.append(f"- **Unique PII Types:** {pii_stats['unique_pii_types']}\n")
        report.append(f"- **Average PII per Sample:** {pii_stats['avg_pii_per_sample']:.2f}\n")
        report.append(f"- **Median PII per Sample:** {pii_stats['median_pii_per_sample']:.0f}\n")
        report.append(f"- **Max PII per Sample:** {pii_stats['max_pii_per_sample']}\n")
        report.append("\n### PII Type Distribution\n")
        pii_dist = pii_stats["pii_type_distribution"]
        report.append("| PII Type | Count | Percentage |\n")
        report.append("|----------|-------|------------|\n")
        total_pii = sum(pii_dist.values())
        for pii_type, count in sorted(pii_dist.items(), key=lambda x: x[1], reverse=True):
            percentage = (count / total_pii * 100) if total_pii > 0 else 0
            report.append(f"| {pii_type} | {count:,} | {percentage:.1f}% |\n")
        report.append("\n\n")

        # Co-reference Statistics
        coref_stats = stats_summary["coreference_statistics"]
        report.append("## Co-reference Statistics\n")
        report.append(f"- **Samples with Co-references:** {coref_stats['samples_with_coreferences']:,}\n")
        report.append(f"- **Average Clusters per Sample:** {coref_stats['avg_clusters_per_sample']:.2f}\n")
        report.append(f"- **Average Mentions per Cluster:** {coref_stats['avg_mentions_per_cluster']:.2f}\n")
        entity_dist = coref_stats["entity_type_distribution"]
        if entity_dist:
            report.append("\n### Entity Type Distribution\n")
            report.append("| Entity Type | Count |\n")
            report.append("|-------------|-------|\n")
            for entity_type, count in sorted(entity_dist.items(), key=lambda x: x[1], reverse=True):
                report.append(f"| {entity_type} | {count:,} |\n")
        report.append("\n\n")

        # Language Distribution
        lang_dist = stats_summary["language_distribution"]
        report.append("## Language Distribution (Estimated)\n")
        report.append("| Language | Count | Percentage |\n")
        report.append("|----------|-------|------------|\n")
        total_samples = sum(lang_dist.values())
        for lang, count in sorted(lang_dist.items(), key=lambda x: x[1], reverse=True):
            percentage = (count / total_samples * 100) if total_samples > 0 else 0
            report.append(f"| {lang} | {count:,} | {percentage:.1f}% |\n")
        report.append("\n\n")

        # Write report
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(''.join(report))


def analyze_dataset(input_dir: Path, output_dir: Path, limit: int | None = None):
    """
    Analyze dataset and generate statistics.

    Args:
        input_dir: Directory containing JSON samples
        output_dir: Directory to write statistics and visualizations
        limit: Maximum number of files to analyze (None for all)
    """
    output_dir.mkdir(parents=True, exist_ok=True)

    # Initialize statistics collector
    stats = DatasetStatistics()

    # Get all JSON files
    json_files = sorted(input_dir.glob("*.json"))
    if limit:
        json_files = json_files[:limit]
    print(f"Analyzing {len(json_files)} samples from {input_dir}...")

    # Analyze samples
    for json_file in tqdm(json_files, desc="Analyzing samples"):
        try:
            with open(json_file, 'r', encoding='utf-8') as f:
                sample = json.load(f)
            sample_id = json_file.stem
            stats.analyze_sample(sample, sample_id)
        except Exception as e:
            print(f"Error analyzing {json_file}: {e}")
            continue

    # Compute summary statistics
    print("\nComputing summary statistics...")
    summary = stats.compute_summary_stats()

    # Save raw statistics as JSON
    stats_file = output_dir / "statistics.json"
    with open(stats_file, 'w', encoding='utf-8') as f:
        # default=float handles numpy scalars (np.float64), which json cannot serialize natively
        json.dump(summary, f, indent=2, ensure_ascii=False, default=float)
    print(f"✓ Saved statistics to {stats_file}")

    # Generate visualizations
    print("\nGenerating visualizations...")
    visualizer = DatasetVisualizer(output_dir)
    visualizer.plot_pii_type_distribution(summary["pii_statistics"]["pii_type_distribution"])
    visualizer.plot_text_length_distribution(stats.text_lengths)
    visualizer.plot_pii_per_sample_distribution(stats.pii_per_sample)
    visualizer.plot_language_distribution(summary["language_distribution"])
    visualizer.plot_entity_type_distribution(summary["coreference_statistics"]["entity_type_distribution"])
    print("✓ Generated visualizations")

    # Generate markdown report
    print("\nGenerating markdown report...")
    report_file = output_dir / "DATASET_STATISTICS.md"
    MarkdownReportGenerator.generate_report(summary, report_file)
    print(f"✓ Generated report: {report_file}")

    return summary


def main():
    parser = argparse.ArgumentParser(
        description="Generate statistics and visualizations for Yaak PII dataset"
    )
    parser.add_argument(
        "--input",
        type=Path,
        default=Path("model/dataset/reviewed_samples"),
        help="Input directory with JSON samples"
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("docs/dataset_statistics"),
        help="Output directory for statistics and visualizations"
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Limit number of files to analyze (default: analyze all)"
    )
    args = parser.parse_args()

    if not args.input.exists():
        print(f"Error: Input directory {args.input} does not exist")
        return 1

    summary = analyze_dataset(args.input, args.output, args.limit)

    print(f"\n{'='*60}")
    print("Analysis complete!")
    print(f"Total samples analyzed: {summary['dataset_overview']['total_samples']:,}")
    print(f"Output directory: {args.output}")
    print("\nGenerated files:")
    print("  - statistics.json")
    print("  - DATASET_STATISTICS.md")
    print("  - *.png (visualizations)")
    print(f"{'='*60}")
    return 0


if __name__ == "__main__":
    exit(main())
```

Dependencies to add to `pyproject.toml`:
```toml
[project.optional-dependencies]
# Add to existing sections
stats = [
    "matplotlib>=3.5.0",
    "seaborn>=0.12.0",
    "numpy>=1.21.0",
    "tqdm>=4.64.0",  # the script imports tqdm; include it here if it is not already a core dependency
]
```

### Phase 2: Dataset Card Template
**File**: `docs/DATASET_CARD.md` (new)
Create a comprehensive dataset card following industry standards:
# Yaak PII Detection Dataset Card
## Dataset Description
### Dataset Summary
The Yaak PII Detection Dataset is a multilingual collection of text samples annotated with Personally Identifiable Information (PII) entities and co-reference clusters. The dataset is designed to train machine learning models for automatic PII detection and masking in API requests to protect user privacy.
**Key Features:**
- ~20,000 samples across multiple languages
- 20+ PII entity types (names, addresses, IDs, financial info, etc.)
- Co-reference annotations linking entity mentions
- Realistic text scenarios (emails, forms, messages, documents)
### Supported Tasks and Leaderboards
**Primary Task:** Named Entity Recognition (NER) for PII Detection
- Input: Text string
- Output: Sequence of BIO-tagged tokens identifying PII entities (see the sketch below)
**Secondary Task:** Co-reference Resolution
- Input: Text string
- Output: Clusters of entity mentions referring to the same entity
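To make the primary task concrete, here is a minimal sketch of projecting the `privacy_mask` annotations onto BIO tags. It assumes naive whitespace tokenization and exact value matches, so treat it as an illustration rather than the actual training-pipeline tokenization.

```python
from typing import Dict, List


def to_bio_tags(sample: Dict) -> List[str]:
    """Project privacy_mask spans onto whitespace tokens as BIO tags (illustrative only)."""
    tokens = sample["text"].split()
    tags = ["O"] * len(tokens)
    for entity in sample.get("privacy_mask", []):
        value_tokens = entity["value"].split()
        # Naive search for the entity's token sequence inside the text tokens.
        for start in range(len(tokens) - len(value_tokens) + 1):
            window = [t.strip(".,:;!?") for t in tokens[start:start + len(value_tokens)]]
            if window == value_tokens:
                tags[start] = f"B-{entity['label']}"
                for offset in range(1, len(value_tokens)):
                    tags[start + offset] = f"I-{entity['label']}"
                break
    return tags


sample = {
    "text": "Contact John Smith at john@example.com or call 555-1234.",
    "privacy_mask": [
        {"value": "John", "label": "FIRSTNAME"},
        {"value": "Smith", "label": "SURNAME"},
        {"value": "john@example.com", "label": "EMAIL"},
        {"value": "555-1234", "label": "PHONENUMBER"},
    ],
}
print(to_bio_tags(sample))
# ['O', 'B-FIRSTNAME', 'B-SURNAME', 'O', 'B-EMAIL', 'O', 'O', 'B-PHONENUMBER']
```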
### Languages
The dataset includes samples in multiple languages:
- English
- German
- French
- Spanish
- Dutch
- Danish
*See [Dataset Statistics](dataset_statistics/DATASET_STATISTICS.md) for exact distribution.*
---
## Dataset Structure
### Data Instances
A typical data instance looks like:
```json
{
  "text": "Contact John Smith at john@example.com or call 555-1234.",
  "privacy_mask": [
    {"value": "John", "label": "FIRSTNAME"},
    {"value": "Smith", "label": "SURNAME"},
    {"value": "john@example.com", "label": "EMAIL"},
    {"value": "555-1234", "label": "PHONENUMBER"}
  ],
  "coreferences": [
    {
      "cluster_id": 0,
      "mentions": ["John Smith"],
      "entity_type": "person"
    }
  ]
}
```

### Data Fields
- `text` (string): The input text containing PII
- `privacy_mask` (list): List of PII entities
  - `value` (string): The entity text
  - `label` (string): PII type (FIRSTNAME, EMAIL, SSN, etc.)
- `coreferences` (list): Co-reference clusters
  - `cluster_id` (int): Unique cluster identifier
  - `mentions` (list): Text spans referring to the same entity
  - `entity_type` (string): Type of entity (person, organization, location)
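As a quick usage sketch (not part of the proposed tooling), reading one sample file and walking these fields could look like this; the file name is a placeholder:

```python
import json
from pathlib import Path

# Placeholder file name; any file in model/dataset/reviewed_samples/ has this structure.
sample_path = Path("model/dataset/reviewed_samples/sample_000001.json")
sample = json.loads(sample_path.read_text(encoding="utf-8"))

print(sample["text"])
for entity in sample["privacy_mask"]:
    print(f"{entity['label']}: {entity['value']}")
for cluster in sample.get("coreferences", []):
    print(f"cluster {cluster['cluster_id']} ({cluster['entity_type']}): {cluster['mentions']}")
```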
### Data Splits
The dataset is organized into three directories:
- samples/: Raw generated samples (~20,000)
- reviewed_samples/: LLM-reviewed and corrected samples (~20,000)
- training_samples/: Final training data (~6,000)
Note: Training splits are created dynamically during training (90% train, 10% validation).
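A minimal sketch of such a dynamic split, assuming one JSON file per sample and a fixed seed for reproducibility (the actual training code may differ):

```python
import random
from pathlib import Path

files = sorted(Path("model/dataset/reviewed_samples").glob("*.json"))
random.Random(42).shuffle(files)  # fixed seed so the split is reproducible

split = int(len(files) * 0.9)  # 90% train, 10% validation
train_files, val_files = files[:split], files[split:]
print(f"train: {len(train_files)}, validation: {len(val_files)}")
```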
## Dataset Creation
### Curation Rationale
This dataset was created to address the need for privacy-preserving AI systems that can detect and mask PII in API communications. Traditional PII detection systems often fail on:
- Multilingual content
- Complex co-reference patterns
- Domain-specific terminology
- Edge cases and rare PII types
### Source Data
#### Initial Data Collection and Normalization
Generation Method: Synthetic data generation using Large Language Models (LLMs)
Process:
- LLM Generation: OpenAI API generates realistic text samples with PII
- Structured Output: LLM provides JSON with text, entities, and co-references (see the sketch below)
- LLM Review: Second LLM pass reviews and corrects annotations
- Quality Control: Manual spot-checking and validation
Prompts: Carefully crafted prompts instruct the LLM to:
- Generate realistic scenarios (emails, forms, support tickets, etc.)
- Include diverse PII types (4-10 types per sample)
- Vary text complexity and length
- Represent multiple languages
- Include co-reference patterns
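Purely as an illustration of the structured-output step (the repository's real prompts and generation code are not shown here), a request with the `openai` v1 client might look roughly like this; the model name and prompt are placeholders:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Generate a short, realistic support-ticket text containing 4-10 PII entities. "
    "Return JSON with keys 'text', 'privacy_mask' and 'coreferences'."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # request structured JSON output
)
sample = json.loads(response.choices[0].message.content)
print(sample["text"])
```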
#### Who are the source language producers?
Language is synthetically generated by Large Language Models trained on diverse web text. The models are capable of producing natural-sounding text in multiple languages.
### Annotations
#### Annotation process
Automated Annotation:
- Primary annotations are generated by LLM during initial creation
- Review pass by second LLM corrects errors and inconsistencies
- No human annotation for initial dataset (future work: LabelStudio integration)
Annotation Format:
- Direct labels (FIRSTNAME, EMAIL, etc.) without BIO prefixes
- Character-level spans for entity values (see the sketch below)
- Cluster IDs for co-reference grouping
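Because entities are stored as value/label pairs, character-level spans can be recovered by locating each value in the text. A minimal sketch, assuming entities are listed in order of appearance (repeated values are matched left to right):

```python
from typing import Dict, List


def add_char_spans(sample: Dict) -> List[Dict]:
    """Attach (start, end) character offsets to each privacy_mask entry."""
    spans = []
    cursor = 0  # search position; assumes entities appear in order in the text
    for entity in sample.get("privacy_mask", []):
        start = sample["text"].find(entity["value"], cursor)
        if start == -1:  # value not found verbatim; skip rather than guess
            continue
        end = start + len(entity["value"])
        spans.append({**entity, "start": start, "end": end})
        cursor = end  # continue after this match so repeated values map to later occurrences
    return spans
```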
#### Who are the annotators?
- Primary: OpenAI GPT models (gpt-4, gpt-3.5-turbo)
- Future: Human annotators via LabelStudio (planned)
### Personal and Sensitive Information
Important: This dataset contains synthetic PII only. All personal information is artificially generated and does not correspond to real individuals.
PII Types Included:
- Names (first, last)
- Contact info (email, phone, address)
- Identification numbers (SSN, passport, driver license, national ID)
- Financial info (IBAN, credit card)
- Demographics (age, date of birth)
- Passwords (synthetic examples)
## Considerations for Using the Data
### Social Impact of Dataset
Positive Impacts:
- ✅ Enables development of privacy-preserving AI systems
- ✅ Protects users from inadvertent PII exposure to external APIs
- ✅ Supports GDPR and privacy compliance efforts
- ✅ Democratizes access to PII detection technology
Potential Risks:
- ⚠️ A model trained on synthetic data may miss real-world edge cases
- ⚠️ Over-reliance on automated PII detection could create a false sense of security
- ⚠️ Adversarial attacks could attempt to evade detection
### Discussion of Biases
Known Limitations:
- Language Bias: Primary focus on Western European languages; limited Asian/African language representation
- Cultural Bias: PII patterns reflect Western naming conventions and ID formats
- Synthetic Bias: LLM-generated data may not capture full real-world distribution
- Format Bias: Formal text styles over-represented vs. informal/slang
- Recency Bias: Modern communication styles (email, forms) over traditional formats
Mitigation Strategies:
- Diverse prompt engineering to increase variety
- Multi-language support in generation prompts
- Ongoing dataset expansion with human review (LabelStudio)
- Regular evaluation on real-world test cases
### Other Known Limitations
- Synthetic Nature: May not generalize to all real-world scenarios
- Co-reference Complexity: Simple co-reference patterns; may miss complex cross-sentence references
- Context Dependency: Limited contextual reasoning (e.g., "John" might not always be a name)
- Rare PII Types: Under-representation of uncommon PII categories
- Length Constraints: Most samples are short-to-medium length; limited long documents
## Additional Information
### Dataset Curators
Yaak Team - Privacy Proxy Development Team
- Project Lead: [Name]
- Contributors: [List contributors]
### Licensing Information
This dataset is released under the MIT License.
MIT License
Copyright (c) 2024 Yaak Team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
### Citation Information
If you use this dataset in your research, please cite:
```bibtex
@dataset{yaak_pii_dataset_2024,
  title={Yaak PII Detection Dataset},
  author={Yaak Team},
  year={2024},
  publisher={GitHub},
  url={https://github.com/hanneshapke/yaak-proxy}
}
```

### Contributions
We welcome contributions to improve this dataset! Please see:
- Contributing Guidelines
- LabelStudio Setup for human review
- Dataset Generation for synthetic data generation
### Changelog
#### v1.0.0 (2024-11-24)
- Initial release with ~20,000 samples
- Support for 20+ PII types
- Co-reference annotations
- Multi-language support (6 languages)
### Contact
For questions or feedback about this dataset:
- GitHub Issues: https://github.com/hanneshapke/yaak-proxy/issues
- Email: [contact email]
Dataset card inspired by Hugging Face Dataset Cards and Datasheets for Datasets.
### Phase 3: Update Makefile
**File**: `Makefile`
Add targets for statistics generation:
```makefile
# Dataset statistics targets
.PHONY: dataset-stats dataset-stats-quick dataset-card

dataset-stats: ## Generate full dataset statistics and visualizations
	@echo "Generating dataset statistics..."
	pip install -e ".[stats]" > /dev/null 2>&1
	python3 src/scripts/generate_dataset_stats.py \
		--input model/dataset/reviewed_samples \
		--output docs/dataset_statistics
	@echo "✓ Statistics generated in docs/dataset_statistics/"

dataset-stats-quick: ## Generate dataset statistics for 1000 samples (quick test)
	@echo "Generating dataset statistics (1000 samples)..."
	pip install -e ".[stats]" > /dev/null 2>&1
	python3 src/scripts/generate_dataset_stats.py \
		--input model/dataset/reviewed_samples \
		--output docs/dataset_statistics \
		--limit 1000
	@echo "✓ Statistics generated in docs/dataset_statistics/"

dataset-card: ## Open the dataset card in browser
	@open docs/DATASET_CARD.md || xdg-open docs/DATASET_CARD.md || echo "Dataset card: docs/DATASET_CARD.md"
```
## Usage Workflow
### Generate Statistics
```bash
# Quick test with 1000 samples
make dataset-stats-quick

# Full dataset analysis
make dataset-stats

# Custom analysis
python src/scripts/generate_dataset_stats.py \
    --input model/dataset/training_samples \
    --output docs/training_stats
```

### View Results
```bash
# View markdown report
open docs/dataset_statistics/DATASET_STATISTICS.md

# View dataset card
make dataset-card

# View raw statistics JSON
cat docs/dataset_statistics/statistics.json | jq
```

### Output Files
After running, you'll find:
- `docs/dataset_statistics/DATASET_STATISTICS.md` - Comprehensive stats report
- `docs/dataset_statistics/statistics.json` - Raw statistics data
- `docs/dataset_statistics/*.png` - Visualization charts:
  - `pii_type_distribution.png`
  - `text_length_distribution.png`
  - `pii_per_sample_distribution.png`
  - `language_distribution.png`
  - `entity_type_distribution.png`
- `docs/DATASET_CARD.md` - Complete dataset card
## Success Criteria
- Statistics generation script created (`src/scripts/generate_dataset_stats.py`)
- Script successfully analyzes all JSON samples
- Statistics computed:
  - Language distribution
  - PII type distribution
  - Text length statistics
  - PII entities per sample
  - Co-reference statistics
- Visualizations generated:
  - PII type distribution chart
  - Text length histogram
  - PII per sample histogram
  - Language pie chart
  - Entity type bar chart
- Markdown report exported (`DATASET_STATISTICS.md`)
- Dataset card created following industry standards
- Dataset card includes:
  - Dataset description and tasks
  - Data structure documentation
  - Creation methodology
  - Bias discussion
  - Limitations and considerations
  - Usage guidelines
  - Citation information
- Makefile targets added
- Successfully runs on full dataset
- Documentation updated in README
## Future Enhancements
- Interactive Dashboard: Create Streamlit/Plotly dashboard for exploring statistics
- Temporal Analysis: Track dataset evolution over time
- Quality Metrics: Add annotation quality scores and confidence metrics
- Comparison Reports: Compare statistics across different dataset versions
- Anomaly Detection: Flag outlier samples for review
- Export Formats: Support for Hugging Face datasets format
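As a rough sketch of the Hugging Face export idea above, assuming the `datasets` library and the JSON layout described in the dataset card (paths are placeholders):

```python
import json
from pathlib import Path

from datasets import Dataset  # pip install datasets

records = [
    json.loads(p.read_text(encoding="utf-8"))
    for p in sorted(Path("model/dataset/reviewed_samples").glob("*.json"))
]
ds = Dataset.from_list(records)  # features (text, privacy_mask, coreferences) are inferred
ds.save_to_disk("docs/dataset_statistics/hf_dataset")  # or ds.push_to_hub("<org>/<name>")
```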
## References
- Why Dataset Cards Should Be Your First Step Toward Trustworthy AI
- Datasheets for Datasets (Paper)
- Hugging Face Dataset Cards
- Model Cards for Model Reporting
- Data Nutrition Project
## Notes
This is marked as a "good first issue" because:
- Clear implementation steps with complete code examples
- Self-contained task with visible outputs
- Combines data analysis, visualization, and documentation
- Good introduction to dataset understanding and ML best practices
- Immediate value to the project (better dataset documentation)