Generate Dataset Statistics and Dataset Card #32

@hanneshapke

Description

Background

Understanding your dataset is the foundation of building trustworthy AI systems. As noted in Why Dataset Cards Should Be Your First Step Toward Trustworthy AI, dataset cards provide transparency and accountability, helping teams identify potential biases and make informed decisions about model training.

Currently, the Yaak proxy has three dataset directories with ~20,000 samples and ~6,000 training samples (note: we stopped generating training-set samples since tokenization now happens inside the ML training pipeline, so you can ignore that folder), but it lacks comprehensive statistics and documentation covering:

  • Language distribution: Which languages are represented and how balanced?
  • PII type distribution: Are certain PII types over/under-represented?
  • Text characteristics: Length distributions, complexity metrics
  • Entity density: How many PII entities per text?
  • Co-reference patterns: How are entities referenced across texts?
  • Data quality: Consistency and completeness metrics

This issue proposes creating an automated statistics generation system and comprehensive dataset card to document the Yaak PII detection dataset.
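Before the full script below, the core of such an analysis fits in a few stdlib-only lines. This is a minimal sketch of the counting loop, assuming the dataset's JSON schema (a `privacy_mask` list of `{"value", "label"}` entries); the helper names are illustrative, not repo functions.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, Iterable


def iter_samples(sample_dir: Path) -> Iterable[Dict[str, Any]]:
    """Yield parsed sample dicts from a directory of JSON files."""
    for path in sorted(sample_dir.glob("*.json")):
        yield json.loads(path.read_text(encoding="utf-8"))


def pii_label_histogram(samples: Iterable[Dict[str, Any]]) -> Counter:
    """Count how often each PII label appears across all samples."""
    counts: Counter = Counter()
    for sample in samples:
        for entity in sample.get("privacy_mask", []):
            counts[entity.get("label", "UNKNOWN")] += 1
    return counts
```

For the reviewed samples this might be invoked as `pii_label_histogram(iter_samples(Path("model/dataset/reviewed_samples")))`.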


Dataset Card Purpose

A dataset card serves as a "nutrition label" for ML data, providing:

  • Enhanced model performance - Understanding data strengths/weaknesses
  • Stronger ethical foundation - Surfacing biases early
  • Simplified compliance - Documentation for regulatory requirements
  • Improved collaboration - Clear communication across teams


Implementation Plan

Phase 1: Statistics Generation Script

File: src/scripts/generate_dataset_stats.py (new)

Create a comprehensive statistics generation script:

#!/usr/bin/env python3
"""
Generate comprehensive statistics and visualizations for the Yaak PII dataset.

Usage:
    python src/scripts/generate_dataset_stats.py \
        --input model/dataset/reviewed_samples \
        --output docs/dataset_statistics \
        --format markdown

This script analyzes the dataset and generates:
- Distribution statistics (languages, PII types, text lengths)
- Visualization plots (histograms, bar charts, heatmaps)
- Summary markdown report
- Dataset card in standard format
"""

import argparse
import json
import re
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm


class DatasetStatistics:
    """Compute and store dataset statistics."""
    
    def __init__(self):
        self.total_samples = 0
        self.pii_type_counts = Counter()
        self.pii_per_sample = []
        self.text_lengths = []
        self.token_counts = []
        self.coref_cluster_counts = []
        self.coref_mention_counts = []
        self.entity_type_counts = Counter()
        self.language_estimates = Counter()
        self.label_value_examples = defaultdict(list)
        self.empty_samples = 0
        self.max_pii_per_sample = 0
        self.samples_with_coreferences = 0
        
    def analyze_sample(self, sample: Dict[str, Any], sample_id: str):
        """Analyze a single sample and update statistics."""
        self.total_samples += 1
        
        # Text analysis
        text = sample.get("text", "")
        self.text_lengths.append(len(text))
        self.token_counts.append(len(text.split()))
        
        # Language detection (simple heuristic based on character sets)
        language = self._estimate_language(text)
        self.language_estimates[language] += 1
        
        # PII entity analysis
        privacy_mask = sample.get("privacy_mask", [])
        num_pii = len(privacy_mask)
        self.pii_per_sample.append(num_pii)
        self.max_pii_per_sample = max(self.max_pii_per_sample, num_pii)
        
        if num_pii == 0:
            self.empty_samples += 1
        
        for entity in privacy_mask:
            label = entity.get("label", "UNKNOWN")
            value = entity.get("value", "")
            self.pii_type_counts[label] += 1
            
            # Store examples (limit to 5 per type)
            if len(self.label_value_examples[label]) < 5:
                self.label_value_examples[label].append(value)
        
        # Co-reference analysis
        coreferences = sample.get("coreferences", [])
        num_clusters = len(coreferences)
        self.coref_cluster_counts.append(num_clusters)
        
        if num_clusters > 0:
            self.samples_with_coreferences += 1
        
        for coref in coreferences:
            entity_type = coref.get("entity_type", "unknown")
            mentions = coref.get("mentions", [])
            self.entity_type_counts[entity_type] += 1
            self.coref_mention_counts.append(len(mentions))
    
    def _estimate_language(self, text: str) -> str:
        """Estimate language based on character sets and patterns."""
        if not text:
            return "unknown"
        
        # Count different character types
        latin_chars = len(re.findall(r'[a-zA-Z]', text))
        cyrillic_chars = len(re.findall(r'[а-яА-ЯёЁ]', text))
        arabic_chars = len(re.findall(r'[\u0600-\u06FF]', text))
        chinese_chars = len(re.findall(r'[\u4e00-\u9fff]', text))
        
        total_chars = len(re.findall(r'\w', text))
        
        if total_chars == 0:
            return "unknown"
        
        # Simple heuristics
        if cyrillic_chars / total_chars > 0.3:
            return "Russian/Cyrillic"
        elif arabic_chars / total_chars > 0.3:
            return "Arabic"
        elif chinese_chars / total_chars > 0.3:
            return "Chinese/CJK"
        elif latin_chars / total_chars > 0.7:
            # Further distinguish common languages
            if re.search(r'\b(the|and|is|in|to|of)\b', text.lower()):
                return "English"
            elif re.search(r'\b(der|die|das|und|ist|in)\b', text.lower()):
                return "German"
            elif re.search(r'\b(le|la|les|et|est|de)\b', text.lower()):
                return "French"
            elif re.search(r'\b(el|la|los|las|y|es|de)\b', text.lower()):
                return "Spanish"
            elif re.search(r'\b(de|het|een|en|is|van)\b', text.lower()):
                return "Dutch"
            else:
                return "Other Latin"
        else:
            return "Mixed/Other"
    
    def compute_summary_stats(self) -> Dict[str, Any]:
        """Compute summary statistics."""
        return {
            "dataset_overview": {
                "total_samples": self.total_samples,
                "samples_with_pii": self.total_samples - self.empty_samples,
                "samples_without_pii": self.empty_samples,
                "samples_with_coreferences": self.samples_with_coreferences,
            },
            "text_statistics": {
                "avg_text_length": np.mean(self.text_lengths) if self.text_lengths else 0,
                "median_text_length": np.median(self.text_lengths) if self.text_lengths else 0,
                "min_text_length": min(self.text_lengths) if self.text_lengths else 0,
                "max_text_length": max(self.text_lengths) if self.text_lengths else 0,
                "avg_token_count": np.mean(self.token_counts) if self.token_counts else 0,
                "median_token_count": np.median(self.token_counts) if self.token_counts else 0,
            },
            "pii_statistics": {
                "total_pii_entities": sum(self.pii_type_counts.values()),
                "unique_pii_types": len(self.pii_type_counts),
                "avg_pii_per_sample": np.mean(self.pii_per_sample) if self.pii_per_sample else 0,
                "median_pii_per_sample": np.median(self.pii_per_sample) if self.pii_per_sample else 0,
                "max_pii_per_sample": self.max_pii_per_sample,
                "pii_type_distribution": dict(self.pii_type_counts.most_common()),
            },
            "coreference_statistics": {
                "samples_with_coreferences": self.samples_with_coreferences,
                "avg_clusters_per_sample": np.mean(self.coref_cluster_counts) if self.coref_cluster_counts else 0,
                "avg_mentions_per_cluster": np.mean(self.coref_mention_counts) if self.coref_mention_counts else 0,
                "entity_type_distribution": dict(self.entity_type_counts.most_common()),
            },
            "language_distribution": dict(self.language_estimates.most_common()),
        }


class DatasetVisualizer:
    """Create visualizations for dataset statistics."""
    
    def __init__(self, output_dir: Path):
        self.output_dir = output_dir
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # Set style
        sns.set_style("whitegrid")
        plt.rcParams['figure.figsize'] = (12, 6)
    
    def plot_pii_type_distribution(self, pii_counts: Dict[str, int]):
        """Plot PII type distribution as horizontal bar chart."""
        if not pii_counts:
            return
        
        fig, ax = plt.subplots(figsize=(12, 8))
        
        # Sort by count
        sorted_items = sorted(pii_counts.items(), key=lambda x: x[1], reverse=True)
        labels, counts = zip(*sorted_items)
        
        # Create horizontal bar chart
        y_pos = np.arange(len(labels))
        ax.barh(y_pos, counts, color='steelblue')
        ax.set_yticks(y_pos)
        ax.set_yticklabels(labels)
        ax.invert_yaxis()
        ax.set_xlabel('Count')
        ax.set_title('PII Entity Type Distribution', fontsize=14, fontweight='bold')
        
        # Add count labels
        for i, v in enumerate(counts):
            ax.text(v + max(counts)*0.01, i, str(v), va='center')
        
        plt.tight_layout()
        plt.savefig(self.output_dir / 'pii_type_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()
    
    def plot_text_length_distribution(self, text_lengths: List[int]):
        """Plot text length distribution as histogram."""
        fig, ax = plt.subplots(figsize=(12, 6))
        
        ax.hist(text_lengths, bins=50, color='skyblue', edgecolor='black', alpha=0.7)
        ax.set_xlabel('Text Length (characters)')
        ax.set_ylabel('Frequency')
        ax.set_title('Text Length Distribution', fontsize=14, fontweight='bold')
        
        # Add mean and median lines
        mean_length = np.mean(text_lengths)
        median_length = np.median(text_lengths)
        ax.axvline(mean_length, color='red', linestyle='--', label=f'Mean: {mean_length:.0f}')
        ax.axvline(median_length, color='green', linestyle='--', label=f'Median: {median_length:.0f}')
        ax.legend()
        
        plt.tight_layout()
        plt.savefig(self.output_dir / 'text_length_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()
    
    def plot_pii_per_sample_distribution(self, pii_per_sample: List[int]):
        """Plot PII entities per sample distribution."""
        fig, ax = plt.subplots(figsize=(12, 6))
        
        max_pii = max(pii_per_sample) if pii_per_sample else 0
        bins = range(0, max_pii + 2)
        
        ax.hist(pii_per_sample, bins=bins, color='coral', edgecolor='black', alpha=0.7)
        ax.set_xlabel('Number of PII Entities per Sample')
        ax.set_ylabel('Frequency')
        ax.set_title('PII Entities per Sample Distribution', fontsize=14, fontweight='bold')
        
        # Add mean line
        mean_pii = np.mean(pii_per_sample)
        ax.axvline(mean_pii, color='red', linestyle='--', label=f'Mean: {mean_pii:.1f}')
        ax.legend()
        
        plt.tight_layout()
        plt.savefig(self.output_dir / 'pii_per_sample_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()
    
    def plot_language_distribution(self, language_counts: Dict[str, int]):
        """Plot language distribution as pie chart."""
        fig, ax = plt.subplots(figsize=(10, 8))
        
        labels = list(language_counts.keys())
        sizes = list(language_counts.values())
        
        # Create color palette
        colors = sns.color_palette('Set3', len(labels))
        
        wedges, texts, autotexts = ax.pie(
            sizes, 
            labels=labels, 
            autopct='%1.1f%%',
            colors=colors,
            startangle=90
        )
        
        # Enhance text
        for text in texts:
            text.set_fontsize(10)
        for autotext in autotexts:
            autotext.set_color('white')
            autotext.set_fontweight('bold')
            autotext.set_fontsize(9)
        
        ax.set_title('Language Distribution (Estimated)', fontsize=14, fontweight='bold')
        
        plt.tight_layout()
        plt.savefig(self.output_dir / 'language_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()
    
    def plot_entity_type_distribution(self, entity_counts: Dict[str, int]):
        """Plot co-reference entity type distribution."""
        if not entity_counts:
            return
        
        fig, ax = plt.subplots(figsize=(10, 6))
        
        labels = list(entity_counts.keys())
        counts = list(entity_counts.values())
        
        x_pos = np.arange(len(labels))
        ax.bar(x_pos, counts, color='lightgreen', edgecolor='black')
        ax.set_xticks(x_pos)
        ax.set_xticklabels(labels, rotation=45, ha='right')
        ax.set_ylabel('Count')
        ax.set_title('Co-reference Entity Type Distribution', fontsize=14, fontweight='bold')
        
        # Add count labels
        for i, v in enumerate(counts):
            ax.text(i, v + max(counts)*0.01, str(v), ha='center', va='bottom')
        
        plt.tight_layout()
        plt.savefig(self.output_dir / 'entity_type_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()


class MarkdownReportGenerator:
    """Generate markdown report with statistics."""
    
    @staticmethod
    def generate_report(stats_summary: Dict[str, Any], output_path: Path):
        """Generate comprehensive markdown report."""
        
        report = []
        report.append("# Yaak PII Detection Dataset Statistics\n")
        report.append(f"**Generated:** {datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')} UTC\n")
        report.append("---\n")
        
        # Dataset Overview
        overview = stats_summary["dataset_overview"]
        total = overview["total_samples"] or 1  # guard against division by zero on empty runs
        report.append("## Dataset Overview\n")
        report.append(f"- **Total Samples:** {overview['total_samples']:,}\n")
        report.append(f"- **Samples with PII:** {overview['samples_with_pii']:,} ({overview['samples_with_pii']/total*100:.1f}%)\n")
        report.append(f"- **Samples without PII:** {overview['samples_without_pii']:,} ({overview['samples_without_pii']/total*100:.1f}%)\n")
        report.append(f"- **Samples with Co-references:** {overview['samples_with_coreferences']:,} ({overview['samples_with_coreferences']/total*100:.1f}%)\n")
        report.append("\n")
        
        # Text Statistics
        text_stats = stats_summary["text_statistics"]
        report.append("## Text Statistics\n")
        report.append(f"- **Average Text Length:** {text_stats['avg_text_length']:.0f} characters\n")
        report.append(f"- **Median Text Length:** {text_stats['median_text_length']:.0f} characters\n")
        report.append(f"- **Min Text Length:** {text_stats['min_text_length']:,} characters\n")
        report.append(f"- **Max Text Length:** {text_stats['max_text_length']:,} characters\n")
        report.append(f"- **Average Token Count:** {text_stats['avg_token_count']:.0f} tokens\n")
        report.append(f"- **Median Token Count:** {text_stats['median_token_count']:.0f} tokens\n")
        report.append("\n![Text Length Distribution](text_length_distribution.png)\n")
        
        # PII Statistics
        pii_stats = stats_summary["pii_statistics"]
        report.append("## PII Entity Statistics\n")
        report.append(f"- **Total PII Entities:** {pii_stats['total_pii_entities']:,}\n")
        report.append(f"- **Unique PII Types:** {pii_stats['unique_pii_types']}\n")
        report.append(f"- **Average PII per Sample:** {pii_stats['avg_pii_per_sample']:.2f}\n")
        report.append(f"- **Median PII per Sample:** {pii_stats['median_pii_per_sample']:.0f}\n")
        report.append(f"- **Max PII per Sample:** {pii_stats['max_pii_per_sample']}\n")
        report.append("\n### PII Type Distribution\n")
        
        pii_dist = pii_stats["pii_type_distribution"]
        report.append("| PII Type | Count | Percentage |\n")
        report.append("|----------|-------|------------|\n")
        total_pii = sum(pii_dist.values())
        for pii_type, count in sorted(pii_dist.items(), key=lambda x: x[1], reverse=True):
            percentage = (count / total_pii * 100) if total_pii > 0 else 0
            report.append(f"| {pii_type} | {count:,} | {percentage:.1f}% |\n")
        
        report.append("\n![PII Type Distribution](pii_type_distribution.png)\n")
        report.append("\n![PII per Sample Distribution](pii_per_sample_distribution.png)\n")
        
        # Co-reference Statistics
        coref_stats = stats_summary["coreference_statistics"]
        report.append("## Co-reference Statistics\n")
        report.append(f"- **Samples with Co-references:** {coref_stats['samples_with_coreferences']:,}\n")
        report.append(f"- **Average Clusters per Sample:** {coref_stats['avg_clusters_per_sample']:.2f}\n")
        report.append(f"- **Average Mentions per Cluster:** {coref_stats['avg_mentions_per_cluster']:.2f}\n")
        
        entity_dist = coref_stats["entity_type_distribution"]
        if entity_dist:
            report.append("\n### Entity Type Distribution\n")
            report.append("| Entity Type | Count |\n")
            report.append("|-------------|-------|\n")
            for entity_type, count in sorted(entity_dist.items(), key=lambda x: x[1], reverse=True):
                report.append(f"| {entity_type} | {count:,} |\n")
            report.append("\n![Entity Type Distribution](entity_type_distribution.png)\n")
        
        # Language Distribution
        lang_dist = stats_summary["language_distribution"]
        report.append("## Language Distribution (Estimated)\n")
        report.append("| Language | Count | Percentage |\n")
        report.append("|----------|-------|------------|\n")
        total_samples = sum(lang_dist.values())
        for lang, count in sorted(lang_dist.items(), key=lambda x: x[1], reverse=True):
            percentage = (count / total_samples * 100) if total_samples > 0 else 0
            report.append(f"| {lang} | {count:,} | {percentage:.1f}% |\n")
        
        report.append("\n![Language Distribution](language_distribution.png)\n")
        
        # Write report
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(''.join(report))


def analyze_dataset(input_dir: Path, output_dir: Path, limit: int | None = None):
    """
    Analyze dataset and generate statistics.
    
    Args:
        input_dir: Directory containing JSON samples
        output_dir: Directory to write statistics and visualizations
        limit: Maximum number of files to analyze (None for all)
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Initialize statistics collector
    stats = DatasetStatistics()
    
    # Get all JSON files
    json_files = sorted(input_dir.glob("*.json"))
    
    if limit:
        json_files = json_files[:limit]
    
    print(f"Analyzing {len(json_files)} samples from {input_dir}...")
    
    # Analyze samples
    for json_file in tqdm(json_files, desc="Analyzing samples"):
        try:
            with open(json_file, 'r', encoding='utf-8') as f:
                sample = json.load(f)
            
            sample_id = json_file.stem
            stats.analyze_sample(sample, sample_id)
            
        except Exception as e:
            print(f"Error analyzing {json_file}: {e}")
            continue
    
    # Compute summary statistics
    print("\nComputing summary statistics...")
    summary = stats.compute_summary_stats()
    
    # Save raw statistics as JSON
    stats_file = output_dir / "statistics.json"
    with open(stats_file, 'w', encoding='utf-8') as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    print(f"✓ Saved statistics to {stats_file}")
    
    # Generate visualizations
    print("\nGenerating visualizations...")
    visualizer = DatasetVisualizer(output_dir)
    visualizer.plot_pii_type_distribution(summary["pii_statistics"]["pii_type_distribution"])
    visualizer.plot_text_length_distribution(stats.text_lengths)
    visualizer.plot_pii_per_sample_distribution(stats.pii_per_sample)
    visualizer.plot_language_distribution(summary["language_distribution"])
    visualizer.plot_entity_type_distribution(summary["coreference_statistics"]["entity_type_distribution"])
    print("✓ Generated visualizations")
    
    # Generate markdown report
    print("\nGenerating markdown report...")
    report_file = output_dir / "DATASET_STATISTICS.md"
    MarkdownReportGenerator.generate_report(summary, report_file)
    print(f"✓ Generated report: {report_file}")
    
    return summary


def main():
    parser = argparse.ArgumentParser(
        description="Generate statistics and visualizations for Yaak PII dataset"
    )
    parser.add_argument(
        "--input",
        type=Path,
        default=Path("model/dataset/reviewed_samples"),
        help="Input directory with JSON samples"
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("docs/dataset_statistics"),
        help="Output directory for statistics and visualizations"
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Limit number of files to analyze (default: analyze all)"
    )
    
    args = parser.parse_args()
    
    if not args.input.exists():
        print(f"Error: Input directory {args.input} does not exist")
        return 1
    
    summary = analyze_dataset(args.input, args.output, args.limit)
    
    print(f"\n{'='*60}")
    print("Analysis complete!")
    print(f"Total samples analyzed: {summary['dataset_overview']['total_samples']:,}")
    print(f"Output directory: {args.output}")
    print("\nGenerated files:")
    print("  - statistics.json")
    print("  - DATASET_STATISTICS.md")
    print("  - *.png (visualizations)")
    print(f"{'='*60}")
    
    return 0


if __name__ == "__main__":
    exit(main())

Dependencies to add to `pyproject.toml` (note: the script also uses `tqdm`, so it must be available here or in the main dependencies):

```toml
[project.optional-dependencies]
# Add to existing sections
stats = [
    "matplotlib>=3.5.0",
    "seaborn>=0.12.0",
    "numpy>=1.21.0",
    "tqdm>=4.64.0",
]
```

Phase 2: Dataset Card Template

File: docs/DATASET_CARD.md (new)

Create a comprehensive dataset card following industry standards:

# Yaak PII Detection Dataset Card

## Dataset Description

### Dataset Summary

The Yaak PII Detection Dataset is a multilingual collection of text samples annotated with Personally Identifiable Information (PII) entities and co-reference clusters. The dataset is designed to train machine learning models for automatic PII detection and masking in API requests to protect user privacy.

**Key Features:**
- ~20,000 samples across multiple languages
- 20+ PII entity types (names, addresses, IDs, financial info, etc.)
- Co-reference annotations linking entity mentions
- Realistic text scenarios (emails, forms, messages, documents)

### Supported Tasks and Leaderboards

**Primary Task:** Named Entity Recognition (NER) for PII Detection
- Input: Text string
- Output: Sequence of BIO-tagged tokens identifying PII entities

**Secondary Task:** Co-reference Resolution
- Input: Text string
- Output: Clusters of entity mentions referring to the same entity
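The BIO formulation above can be illustrated with a toy tagger. This sketch assumes whitespace tokenization and entity values that align with token boundaries, which is a simplification of whatever the real training pipeline does; `to_bio_tags` is illustrative, not a repo function.

```python
from typing import Dict, List


def to_bio_tags(text: str, privacy_mask: List[Dict[str, str]]) -> List[str]:
    """Produce one BIO tag per whitespace token, e.g. B-EMAIL / I-EMAIL / O."""
    tokens = text.split()
    tags = ["O"] * len(tokens)
    for entity in privacy_mask:
        entity_tokens = entity["value"].split()
        label = entity["label"]
        # Scan for the entity's token sequence; tag first match B-, rest I-.
        # (Repeated values and sub-token matches are ignored in this sketch.)
        for i in range(len(tokens) - len(entity_tokens) + 1):
            window = [t.strip(".,;:!?") for t in tokens[i:i + len(entity_tokens)]]
            if window == entity_tokens:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + len(entity_tokens)):
                    tags[j] = f"I-{label}"
                break
    return tags
```

For the example instance below, `to_bio_tags` maps "John" to `B-FIRSTNAME`, "Smith" to `B-SURNAME`, and the email token to `B-EMAIL`, with `O` everywhere else.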

### Languages

The dataset includes samples in multiple languages:
- English
- German
- French
- Spanish
- Dutch
- Danish

*See [Dataset Statistics](dataset_statistics/DATASET_STATISTICS.md) for exact distribution.*

---

## Dataset Structure

### Data Instances

A typical data instance looks like:

```json
{
  "text": "Contact John Smith at john@example.com or call 555-1234.",
  "privacy_mask": [
    {"value": "John", "label": "FIRSTNAME"},
    {"value": "Smith", "label": "SURNAME"},
    {"value": "john@example.com", "label": "EMAIL"},
    {"value": "555-1234", "label": "PHONENUMBER"}
  ],
  "coreferences": [
    {
      "cluster_id": 0,
      "mentions": ["John Smith"],
      "entity_type": "person"
    }
  ]
}
```

### Data Fields

- `text` (string): The input text containing PII
- `privacy_mask` (list): List of PII entities
  - `value` (string): The entity text
  - `label` (string): PII type (FIRSTNAME, EMAIL, SSN, etc.)
- `coreferences` (list): Co-reference clusters
  - `cluster_id` (int): Unique cluster identifier
  - `mentions` (list): Text spans referring to the same entity
  - `entity_type` (string): Type of entity (person, organization, location)
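A structural check over these fields can be sketched as follows; `validate_sample` is illustrative, not part of the repo, and only tests the invariants listed above (fields present, entity values actually occurring in the text).

```python
from typing import Any, Dict, List


def validate_sample(sample: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the sample looks well-formed."""
    errors: List[str] = []
    text = sample.get("text")
    if not isinstance(text, str) or not text:
        errors.append("missing or empty 'text'")
        text = ""
    for i, ent in enumerate(sample.get("privacy_mask", [])):
        if not ent.get("value") or not ent.get("label"):
            errors.append(f"privacy_mask[{i}]: missing 'value' or 'label'")
        elif ent["value"] not in text:
            errors.append(f"privacy_mask[{i}]: value not found in text")
    for i, coref in enumerate(sample.get("coreferences", [])):
        if "cluster_id" not in coref or not coref.get("mentions"):
            errors.append(f"coreferences[{i}]: missing 'cluster_id' or 'mentions'")
    return errors
```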

### Data Splits

The dataset is organized into three directories:

- `samples/`: Raw generated samples (~20,000)
- `reviewed_samples/`: LLM-reviewed and corrected samples (~20,000)
- `training_samples/`: Final training data (~6,000)

**Note:** Training splits are created dynamically during training (90% train, 10% validation).
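The dynamic 90/10 split might look like the sketch below; the function name, seed, and id handling are assumptions, not the repo's actual implementation.

```python
import random
from typing import List, Sequence, Tuple


def train_val_split(sample_ids: Sequence[str], val_fraction: float = 0.1,
                    seed: int = 42) -> Tuple[List[str], List[str]]:
    """Deterministically shuffle ids, then cut off the last val_fraction for validation."""
    ids = sorted(sample_ids)          # canonical order so the seed is meaningful
    random.Random(seed).shuffle(ids)  # reproducible shuffle across runs
    cut = int(round(len(ids) * (1 - val_fraction)))
    return ids[:cut], ids[cut:]
```

A fixed seed keeps the split stable between training runs, so validation scores stay comparable.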


## Dataset Creation

### Curation Rationale

This dataset was created to address the need for privacy-preserving AI systems that can detect and mask PII in API communications. Traditional PII detection systems often fail on:

- Multilingual content
- Complex co-reference patterns
- Domain-specific terminology
- Edge cases and rare PII types

### Source Data

#### Initial Data Collection and Normalization

**Generation Method:** Synthetic data generation using Large Language Models (LLMs)

**Process:**

1. **LLM Generation:** OpenAI API generates realistic text samples with PII
2. **Structured Output:** LLM provides JSON with text, entities, and co-references
3. **LLM Review:** Second LLM pass reviews and corrects annotations
4. **Quality Control:** Manual spot-checking and validation

**Prompts:** Carefully crafted prompts instruct the LLM to:

- Generate realistic scenarios (emails, forms, support tickets, etc.)
- Include diverse PII types (4-10 types per sample)
- Vary text complexity and length
- Represent multiple languages
- Include co-reference patterns
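A request under those instructions could be assembled as below. This is a hypothetical sketch: the actual prompt wording and model configuration used for generation are not part of this issue, so everything here is illustrative.

```python
from typing import Dict, List


def build_generation_messages(language: str, pii_types: List[str]) -> List[Dict[str, str]]:
    """Build a chat-style message list asking for one structured-output sample."""
    system = (
        "You generate realistic text samples containing PII. "
        "Return JSON with keys: text, privacy_mask, coreferences."
    )
    user = (
        f"Write a short {language} support-ticket message that naturally "
        f"includes these PII types: {', '.join(pii_types)}."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

Keeping prompt construction in a pure function like this makes it easy to vary scenario, language, and PII mix per sample.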

#### Who are the source language producers?

Language is synthetically generated by Large Language Models trained on diverse web text. The models can produce natural-sounding text in multiple languages.

### Annotations

#### Annotation process

**Automated Annotation:**

- Primary annotations are generated by the LLM during initial creation
- A review pass by a second LLM corrects errors and inconsistencies
- No human annotation for the initial dataset (future work: LabelStudio integration)

**Annotation Format:**

- Direct labels (FIRSTNAME, EMAIL, etc.) without BIO prefixes
- Character-level spans for entity values
- Cluster IDs for co-reference grouping
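Converting the direct-label format to character-level spans can be sketched as below; `char_spans` is illustrative, and the find-from-cursor approach assumes entities are listed in left-to-right textual order (so repeated values resolve to successive occurrences).

```python
from typing import Any, Dict, List


def char_spans(text: str, privacy_mask: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Map each {"value", "label"} entry to a {"label", "start", "end"} span."""
    spans: List[Dict[str, Any]] = []
    cursor = 0
    for ent in privacy_mask:
        start = text.find(ent["value"], cursor)
        if start == -1:
            continue  # value not present verbatim; skip rather than guess
        end = start + len(ent["value"])
        spans.append({"label": ent["label"], "start": start, "end": end})
        cursor = end
    return spans
```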

#### Who are the annotators?

- **Primary:** OpenAI GPT models (gpt-4, gpt-3.5-turbo)
- **Future:** Human annotators via LabelStudio (planned)

### Personal and Sensitive Information

**Important:** This dataset contains synthetic PII only. All personal information is artificially generated and does not correspond to real individuals.

**PII Types Included:**

- Names (first, last)
- Contact info (email, phone, address)
- Identification numbers (SSN, passport, driver's license, national ID)
- Financial info (IBAN, credit card)
- Demographics (age, date of birth)
- Passwords (synthetic examples)

## Considerations for Using the Data

### Social Impact of Dataset

**Positive Impacts:**

- ✅ Enables development of privacy-preserving AI systems
- ✅ Protects users from inadvertent PII exposure to external APIs
- ✅ Supports GDPR and privacy compliance efforts
- ✅ Democratizes access to PII detection technology

**Potential Risks:**

- ⚠️ A model trained on synthetic data may miss real-world edge cases
- ⚠️ Over-reliance on automated PII detection could create a false sense of security
- ⚠️ Adversarial attacks could attempt to evade detection

### Discussion of Biases

**Known Limitations:**

1. **Language Bias:** Primary focus on Western European languages; limited Asian/African language representation
2. **Cultural Bias:** PII patterns reflect Western naming conventions and ID formats
3. **Synthetic Bias:** LLM-generated data may not capture the full real-world distribution
4. **Format Bias:** Formal text styles are over-represented vs. informal/slang
5. **Recency Bias:** Modern communication styles (email, forms) over traditional formats

**Mitigation Strategies:**

- Diverse prompt engineering to increase variety
- Multi-language support in generation prompts
- Ongoing dataset expansion with human review (LabelStudio)
- Regular evaluation on real-world test cases

### Other Known Limitations

1. **Synthetic Nature:** May not generalize to all real-world scenarios
2. **Co-reference Complexity:** Simple co-reference patterns; may miss complex cross-sentence references
3. **Context Dependency:** Limited contextual reasoning (e.g., "John" might not always be a name)
4. **Rare PII Types:** Under-representation of uncommon PII categories
5. **Length Constraints:** Most samples are short-to-medium length; limited long documents

## Additional Information

### Dataset Curators

**Yaak Team** - Privacy Proxy Development Team

- Project Lead: [Name]
- Contributors: [List contributors]

### Licensing Information

This dataset is released under the MIT License.

```text
MIT License

Copyright (c) 2024 Yaak Team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
```

### Citation Information

If you use this dataset in your research, please cite:

```bibtex
@dataset{yaak_pii_dataset_2024,
  title={Yaak PII Detection Dataset},
  author={Yaak Team},
  year={2024},
  publisher={GitHub},
  url={https://github.com/hanneshapke/yaak-proxy}
}
```

### Contributions

We welcome contributions to improve this dataset! Please see:

### Changelog

**v1.0.0 (2024-11-24)**

- Initial release with ~20,000 samples
- Support for 20+ PII types
- Co-reference annotations
- Multi-language support (6 languages)

### Contact

For questions or feedback about this dataset:


Dataset card inspired by Hugging Face Dataset Cards and Datasheets for Datasets.


### Phase 3: Update Makefile

**File**: `Makefile`

Add targets for statistics generation:

```makefile
# Dataset statistics targets
.PHONY: dataset-stats dataset-stats-quick dataset-card

dataset-stats:  ## Generate full dataset statistics and visualizations
	@echo "Generating dataset statistics..."
	pip install -e ".[stats]" > /dev/null 2>&1
	python3 src/scripts/generate_dataset_stats.py \
		--input model/dataset/reviewed_samples \
		--output docs/dataset_statistics
	@echo "✓ Statistics generated in docs/dataset_statistics/"

dataset-stats-quick:  ## Generate dataset statistics for 1000 samples (quick test)
	@echo "Generating dataset statistics (1000 samples)..."
	pip install -e ".[stats]" > /dev/null 2>&1
	python3 src/scripts/generate_dataset_stats.py \
		--input model/dataset/reviewed_samples \
		--output docs/dataset_statistics \
		--limit 1000
	@echo "✓ Statistics generated in docs/dataset_statistics/"

dataset-card:  ## Open the dataset card in browser
	@open docs/DATASET_CARD.md || xdg-open docs/DATASET_CARD.md || echo "Dataset card: docs/DATASET_CARD.md"
```

Usage Workflow

Generate Statistics

```bash
# Quick test with 1000 samples
make dataset-stats-quick

# Full dataset analysis
make dataset-stats

# Custom analysis
python src/scripts/generate_dataset_stats.py \
    --input model/dataset/training_samples \
    --output docs/training_stats
```

View Results

```bash
# View markdown report
open docs/dataset_statistics/DATASET_STATISTICS.md

# View dataset card
make dataset-card

# View raw statistics JSON
jq . docs/dataset_statistics/statistics.json
```

Output Files

After running, you'll find:

  • docs/dataset_statistics/DATASET_STATISTICS.md - Comprehensive stats report
  • docs/dataset_statistics/statistics.json - Raw statistics data
  • docs/dataset_statistics/*.png - Visualization charts:
    • pii_type_distribution.png
    • text_length_distribution.png
    • pii_per_sample_distribution.png
    • language_distribution.png
    • entity_type_distribution.png
  • docs/DATASET_CARD.md - Complete dataset card

Success Criteria

  • Statistics generation script created (src/scripts/generate_dataset_stats.py)
  • Script successfully analyzes all JSON samples
  • Statistics computed:
    • Language distribution
    • PII type distribution
    • Text length statistics
    • PII entities per sample
    • Co-reference statistics
  • Visualizations generated:
    • PII type distribution chart
    • Text length histogram
    • PII per sample histogram
    • Language pie chart
    • Entity type bar chart
  • Markdown report exported (DATASET_STATISTICS.md)
  • Dataset card created following industry standards
  • Dataset card includes:
    • Dataset description and tasks
    • Data structure documentation
    • Creation methodology
    • Bias discussion
    • Limitations and considerations
    • Usage guidelines
    • Citation information
  • Makefile targets added
  • Successfully runs on full dataset
  • Documentation updated in README

Future Enhancements

  1. Interactive Dashboard: Create Streamlit/Plotly dashboard for exploring statistics
  2. Temporal Analysis: Track dataset evolution over time
  3. Quality Metrics: Add annotation quality scores and confidence metrics
  4. Comparison Reports: Compare statistics across different dataset versions
  5. Anomaly Detection: Flag outlier samples for review
  6. Export Formats: Support for Hugging Face datasets format

Notes

This is marked as a "good first issue" because:

  • Clear implementation steps with complete code examples
  • Self-contained task with visible outputs
  • Combines data analysis, visualization, and documentation
  • Good introduction to dataset understanding and ML best practices
  • Immediate value to the project (better dataset documentation)
