@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 14% (0.14x) speedup for clean in unstructured/cleaners/core.py

⏱️ Runtime : 1.98 milliseconds → 1.74 milliseconds (best of 43 runs)

📝 Explanation and details

The optimized code achieves a 14% speedup by pre-compiling regex patterns at module level instead of passing pattern strings to re.sub on every function call. This is a classic Python optimization: it removes the per-call overhead of resolving a pattern string to a compiled pattern.

Key Optimization:
Three regex patterns are now compiled once at module initialization:

  • _WHITESPACE_CHARS_RE = re.compile(r"[\xa0\n]")
  • _MULTIPLE_SPACES_RE = re.compile(r"[ ]{2,}")
  • _DASHES_RE = re.compile(r"[-\u2013]")
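A minimal sketch of how the module-level patterns might be used. The three pattern names and regexes come from the list above; the helper bodies are assumptions reconstructed for illustration and may differ from the actual code in unstructured/cleaners/core.py.

```python
import re

# Compiled once at import time (names and patterns as listed above).
_WHITESPACE_CHARS_RE = re.compile(r"[\xa0\n]")
_MULTIPLE_SPACES_RE = re.compile(r"[ ]{2,}")
_DASHES_RE = re.compile(r"[-\u2013]")


def clean_extra_whitespace(text: str) -> str:
    # Assumed body: turn NBSPs and newlines into spaces,
    # then collapse runs of two or more spaces into one.
    cleaned = _WHITESPACE_CHARS_RE.sub(" ", text)
    cleaned = _MULTIPLE_SPACES_RE.sub(" ", cleaned)
    return cleaned.strip()


def clean_dashes(text: str) -> str:
    # Assumed body: replace each hyphen or en dash with a space.
    return _DASHES_RE.sub(" ", text).strip()
```

Because the patterns are module-level constants, the helpers keep the same signatures and callers do not need to change.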

Why This Works:
When you call re.sub(pattern, repl, string) with a string pattern, Python must:

  1. Look up the pattern string in the re module's internal cache of compiled patterns
  2. On a cache miss, parse the pattern and compile it into an internal pattern object
  3. Execute the match/substitution

By pre-compiling, the work in steps 1-2 is done once at import time instead of on every function call, so each call goes straight to the compiled pattern's sub method. The line profiler data confirms this:

  • clean_extra_whitespace: Dropped from 2.01ms to 1.25ms (38% less time) - the two regex operations now use pre-compiled patterns
  • clean_dashes: Dropped from 0.94ms to 0.57ms (40% less time) - single regex now pre-compiled
  • Overall clean function: Dropped from 4.26ms to 3.20ms (25% less time in line profiler)
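The per-call overhead described above can be measured in isolation with timeit. This is a rough sketch, not the benchmark Codeflash ran: PATTERN matches the dash regex from this PR, while TEXT, the function names, and the iteration count are made up for illustration.

```python
import re
import timeit

PATTERN = r"[-\u2013]"          # same regex as _DASHES_RE
COMPILED = re.compile(PATTERN)  # compiled once, up front
TEXT = "Hello-World\u2013Text"  # hypothetical sample input


def with_string_pattern() -> str:
    # re.sub has to resolve the pattern string on every call.
    return re.sub(PATTERN, " ", TEXT)


def with_compiled_pattern() -> str:
    # The compiled pattern's sub method skips that step.
    return COMPILED.sub(" ", TEXT)


def measure(number: int = 200_000) -> tuple[float, float]:
    """Return total seconds for `number` calls of each variant."""
    string_time = timeit.timeit(with_string_pattern, number=number)
    compiled_time = timeit.timeit(with_compiled_pattern, number=number)
    return string_time, compiled_time


if __name__ == "__main__":
    string_time, compiled_time = measure()
    print(f"string pattern:   {string_time:.3f}s")
    print(f"compiled pattern: {compiled_time:.3f}s")
```

Absolute timings depend on the Python version and machine, so no expected output is shown here.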

Test Results Show:
The optimization particularly benefits scenarios with:

  • Whitespace cleaning (39-55% faster): test_extra_whitespace_collapsed_and_nbsp_handled, test_clean_extra_whitespace_double_spaces
  • Dash replacement (35-49% faster): test_dashes_and_endash_replaced_by_space, test_clean_dashes_hyphen
  • Combined operations (27-58% faster): test_combined_flags_order_and_behavior, test_empty_string_and_only_punctuation_edge_cases
  • Large-scale text (8-22% faster): test_large_scale_performance_and_correctness_under_limits, test_clean_maximum_consecutive_same_char
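The percentages in this report appear to follow two different conventions. The sketch below is inferred from the numbers themselves rather than from Codeflash documentation: the headline and per-test figures match a speedup relative to the optimized time, while the line profiler figures match a reduction relative to the original time.

```python
def speedup_pct(original: float, optimized: float) -> float:
    """Speedup relative to the optimized runtime, in percent."""
    return (original - optimized) / optimized * 100


def time_reduction_pct(original: float, optimized: float) -> float:
    """Runtime saved relative to the original runtime, in percent."""
    return (original - optimized) / original * 100


if __name__ == "__main__":
    # Headline runtime: 1.98 ms before, 1.74 ms after.
    print(f"{speedup_pct(1.98, 1.74):.1f}% speedup")
    # Line profiler, clean_extra_whitespace: 2.01 ms before, 1.25 ms after.
    print(f"{time_reduction_pct(2.01, 1.25):.1f}% less time")
    print(f"{speedup_pct(2.01, 1.25):.1f}% speedup")
```

For the headline runtime this gives about 13.8%, which rounds to the reported 14%. For clean_extra_whitespace the same data is about 37.8% less time, or about 60.8% under the speedup convention, so the two kinds of figure should not be compared directly. Per-test values can differ slightly from the report because the displayed timings are rounded.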

Impact on Production:
Text cleaning functions are typically called in tight loops during document processing pipelines. Even though individual calls save only microseconds, when processing thousands of documents with millions of text fragments, these savings compound significantly. The optimization is purely internal - no API changes, no behavioral differences - making it a safe performance win.
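One way to spot-check the "no behavioral differences" claim is to run both variants over a batch of fragments and collect any disagreements. This is a sketch under assumptions: clean_dashes_before, clean_dashes_after, find_mismatches, and FRAGMENTS are hypothetical names, and the function bodies are reconstructed from the regex listed in this PR rather than copied from the library.

```python
import re
from typing import Callable, Iterable

_DASHES_RE = re.compile(r"[-\u2013]")


def clean_dashes_before(text: str) -> str:
    # Hypothetical pre-change variant: pattern passed as a string.
    return re.sub(r"[-\u2013]", " ", text).strip()


def clean_dashes_after(text: str) -> str:
    # Hypothetical post-change variant: pre-compiled pattern.
    return _DASHES_RE.sub(" ", text).strip()


def find_mismatches(
    before: Callable[[str], str],
    after: Callable[[str], str],
    fragments: Iterable[str],
) -> list[str]:
    """Return every fragment on which the two variants disagree."""
    return [f for f in fragments if before(f) != after(f)]


# A made-up batch standing in for a document pipeline's text fragments.
FRAGMENTS = ["", "-", "Hello-World", "Hello\u2013World", "-Text-", "no dashes"] * 1000

if __name__ == "__main__":
    mismatches = find_mismatches(clean_dashes_before, clean_dashes_after, FRAGMENTS)
    print(f"{len(mismatches)} mismatches in {len(FRAGMENTS)} fragments")
```

An empty result is evidence only for the sampled inputs. The regression and concolic tests in the report below serve the same purpose with broader coverage.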

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 45 Passed |
| 🌀 Generated Regression Tests | 64 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 2 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| cleaners/test_core.py::test_clean | 103μs | 79.2μs | 30.6%✅ |
| cleaners/test_core.py::test_clean_bullets | 21.5μs | 22.2μs | -2.98%⚠️ |
| cleaners/test_core.py::test_clean_dashes | 15.9μs | 11.8μs | 35.1%✅ |
| cleaners/test_core.py::test_clean_extra_whitespace | 20.4μs | 14.3μs | 42.5%✅ |
| cleaners/test_core.py::test_clean_trailing_punctuation | 8.83μs | 9.07μs | -2.63%⚠️ |

🌀 Generated Regression Tests
from __future__ import annotations

# imports
from unstructured.cleaners.core import clean


def test_basic_no_flags_strips_whitespace_only():
    # Basic scenario: when no flags are set, clean should only strip leading/trailing whitespace.
    input_text = "   Hello World   "  # leading and trailing spaces present
    codeflash_output = clean(input_text)
    result = codeflash_output  # 1.49μs -> 1.46μs (2.40% faster)


def test_lowercase_applied_before_other_operations():
    # Lowercasing should be applied first per implementation.
    # We verify that lowercase happens before trailing punctuation removal.
    input_text = "  HELLO WORLD!...  "  # contains uppercase and trailing punctuation
    # Enable lowercase and trailing_punctuation flags
    codeflash_output = clean(input_text, lowercase=True, trailing_punctuation=True)
    result = codeflash_output  # 3.15μs -> 3.23μs (2.66% slower)


def test_extra_whitespace_collapsed_and_nbsp_handled():
    # Ensure non-breaking spaces and newlines collapse into single spaces and multiple spaces collapse.
    input_text = "ITEM 1.\xa0\xa0\n    BUSINESS"  # contains NBSPs and newline and many spaces
    codeflash_output = clean(input_text, extra_whitespace=True)
    result = codeflash_output  # 11.8μs -> 8.46μs (39.4% faster)


def test_dashes_and_endash_replaced_by_space():
    # Both ASCII hyphen '-' and EN DASH '\u2013' should be replaced with spaces.
    input_text = "ITEM-1\u2013SECOND -THIRD"
    codeflash_output = clean(input_text, dashes=True)
    result = codeflash_output  # 9.29μs -> 6.72μs (38.4% faster)


def test_bullets_removed_only_at_start():
    # Bullets should be removed only when they appear at the start of the string.
    bullet_input = "●  This is an excellent point!"
    # bullets=True should strip the leading bullet and its following whitespace
    codeflash_output = clean(bullet_input, bullets=True)
    result = codeflash_output  # 6.84μs -> 6.94μs (1.44% slower)
    # If bullet appears in the middle it should remain untouched (except for final strip)
    middle_bullet = "This is a ● bullet in the middle"
    codeflash_output = clean(middle_bullet, bullets=True)
    result_middle = codeflash_output  # 1.69μs -> 1.65μs (2.18% faster)
    # Ensure leading bullet without trailing text yields empty string after cleaning
    lone_bullet = "•"
    codeflash_output = clean(lone_bullet, bullets=True)
    result_lone = codeflash_output  # 2.41μs -> 2.12μs (13.7% faster)


def test_trailing_punctuation_removal_only_for_specified_chars():
    # trailing_punctuation removes only .,;: characters at the end, not other punctuation like ! or ?
    input_text = "Hello there!???:;;.."
    # After rstrip of ".,;:" only the trailing .,;: characters are removed but '!' and '?' remain
    codeflash_output = clean(input_text, trailing_punctuation=True)
    result = codeflash_output  # 2.52μs -> 2.49μs (1.33% faster)


def test_combined_flags_order_and_behavior():
    # Test a combination of flags to ensure order in implementation is respected:
    # lowercase -> trailing_punctuation -> dashes -> extra_whitespace -> bullets
    input_text = "  -HELLO-WORLD!..  "
    # Enable lowercase, trailing punctuation, and dashes
    codeflash_output = clean(input_text, lowercase=True, trailing_punctuation=True, dashes=True)
    result = codeflash_output  # 9.44μs -> 7.01μs (34.6% faster)


def test_empty_string_and_only_punctuation_edge_cases():
    # Empty string should return empty string regardless of flags.
    codeflash_output = clean(
        "",
        extra_whitespace=True,
        dashes=True,
        bullets=True,
        trailing_punctuation=True,
        lowercase=True,
    )  # 10.8μs -> 6.85μs (57.6% faster)
    # With trailing_punctuation=True only the punctuation after the last run of spaces is stripped;
    # rstrip stops at the spaces, so the leading "....;;;:::" run is expected to remain
    punctuations = "....;;;:::   ...,,,"
    codeflash_output = clean(punctuations, trailing_punctuation=True)
    result = codeflash_output  # 1.70μs -> 1.69μs (0.532% faster)


def test_bullets_does_not_remove_internal_bullets():
    # Ensure that bullets() only affects leading bullets and leaves other bullets intact.
    text = "This is a test ● with an internal bullet."
    # Enable bullets removal: should not remove the internal bullet
    codeflash_output = clean(text, bullets=True)
    result = codeflash_output  # 4.06μs -> 4.14μs (1.88% slower)
    # If the bullet is at the very start with whitespace, it should be removed
    leading = "•\xa0Leading bullet"
    codeflash_output = clean(leading, bullets=True)  # 4.78μs -> 4.78μs (0.105% slower)


def test_large_scale_performance_and_correctness_under_limits():
    # Large scale test but respecting the constraints: use fewer than 1000 iterations/elements.
    # Build a repetitive string that contains dashes, EN DASHs, NBSPs, and newlines that must be cleaned.
    parts = []
    count = 500  # under the 1000 element constraint
    for i in range(count):  # this loop is permissible (<=1000)
        # Create pairs "Word{i}-Next{i}" separated by multiple spaces and NBSP/newline to stress extra_whitespace
        parts.append(f"Word{i}-Next{i}  \xa0\n")
    large_input = "".join(parts)
    # Apply cleaning for dashes and extra_whitespace
    codeflash_output = clean(large_input, dashes=True, extra_whitespace=True)
    result = codeflash_output  # 653μs -> 603μs (8.39% faster)
    # Validate token count: each original pair yields two tokens -> total tokens = count * 2
    tokens = result.split(" ")
    # Filter out any accidental empty tokens (shouldn't be any after extra_whitespace cleaning)
    tokens = [t for t in tokens if t]


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.cleaners.core import clean


def test_clean_no_options():
    """Test clean function with no cleaning options enabled (default behavior)."""
    # Input text should be returned as-is with only strip() applied
    text = "  Hello World  "
    codeflash_output = clean(text)
    result = codeflash_output  # 1.41μs -> 1.44μs (2.16% slower)


def test_clean_empty_string():
    """Test clean function with empty string input."""
    # Empty string should remain empty after cleaning
    codeflash_output = clean("")
    result = codeflash_output  # 1.15μs -> 1.12μs (2.50% faster)


def test_clean_whitespace_only():
    """Test clean function with whitespace-only input."""
    # String with only whitespace should become empty after strip
    codeflash_output = clean("   \t\n  ")
    result = codeflash_output  # 1.23μs -> 1.33μs (7.81% slower)


def test_clean_lowercase_simple():
    """Test clean function with lowercase option enabled."""
    # Text should be converted to lowercase
    text = "Hello World"
    codeflash_output = clean(text, lowercase=True)
    result = codeflash_output  # 2.31μs -> 2.23μs (3.45% faster)


def test_clean_lowercase_with_uppercase():
    """Test clean function with mixed case text and lowercase enabled."""
    # Mixed case should be fully converted to lowercase
    text = "MiXeD CaSe TeXt"
    codeflash_output = clean(text, lowercase=True)
    result = codeflash_output  # 2.18μs -> 2.14μs (1.92% faster)


def test_clean_trailing_punctuation_single():
    """Test clean function with trailing punctuation option."""
    # Trailing period should be removed
    text = "Hello World."
    codeflash_output = clean(text, trailing_punctuation=True)
    result = codeflash_output  # 2.56μs -> 2.55μs (0.196% faster)


def test_clean_trailing_punctuation_multiple():
    """Test clean function with multiple trailing punctuation marks."""
    # Multiple trailing punctuation marks should be removed
    text = "Hello World.,:;"
    codeflash_output = clean(text, trailing_punctuation=True)
    result = codeflash_output  # 2.54μs -> 2.60μs (2.12% slower)


def test_clean_trailing_punctuation_no_trailing():
    """Test clean function with trailing punctuation option but no punctuation present."""
    # Text without trailing punctuation should remain unchanged
    text = "Hello World"
    codeflash_output = clean(text, trailing_punctuation=True)
    result = codeflash_output  # 2.39μs -> 2.51μs (5.02% slower)


def test_clean_extra_whitespace_double_spaces():
    """Test clean function with extra whitespace option."""
    # Multiple spaces should be reduced to single space
    text = "Hello  World"
    codeflash_output = clean(text, extra_whitespace=True)
    result = codeflash_output  # 9.77μs -> 6.55μs (49.3% faster)


def test_clean_extra_whitespace_tabs():
    """Test clean function with extra whitespace including non-breaking spaces."""
    # Non-breaking space (\\xa0) should be converted to regular space
    text = "Hello\xa0World"
    codeflash_output = clean(text, extra_whitespace=True)
    result = codeflash_output  # 9.92μs -> 6.39μs (55.3% faster)


def test_clean_extra_whitespace_newlines():
    """Test clean function with extra whitespace including newlines."""
    # Newlines should be converted to spaces
    text = "Hello\nWorld"
    codeflash_output = clean(text, extra_whitespace=True)
    result = codeflash_output  # 9.61μs -> 6.21μs (54.8% faster)


def test_clean_dashes_hyphen():
    """Test clean function with dashes option."""
    # Regular hyphen should be converted to space
    text = "Hello-World"
    codeflash_output = clean(text, dashes=True)
    result = codeflash_output  # 7.76μs -> 5.22μs (48.6% faster)


def test_clean_dashes_en_dash():
    """Test clean function with en dash (unicode \\u2013)."""
    # EN DASH should be converted to space
    text = "Hello\u2013World"
    codeflash_output = clean(text, dashes=True)
    result = codeflash_output  # 8.14μs -> 5.51μs (47.9% faster)


def test_clean_dashes_multiple():
    """Test clean function with multiple dashes."""
    # Multiple dashes should each be converted to space
    text = "Hello--World"
    codeflash_output = clean(text, dashes=True)
    result = codeflash_output  # 7.93μs -> 5.34μs (48.4% faster)


def test_clean_bullets_basic():
    """Test clean function with bullets option."""
    # Unicode bullet at start should be removed
    text = "● This is text"
    codeflash_output = clean(text, bullets=True)
    result = codeflash_output  # 6.70μs -> 6.68μs (0.269% faster)


def test_clean_bullets_no_bullet():
    """Test clean function with bullets option but no bullet present."""
    # Text without bullet should remain unchanged
    text = "This is text"
    codeflash_output = clean(text, bullets=True)
    result = codeflash_output  # 3.92μs -> 3.97μs (1.16% slower)


def test_clean_multiple_options_combined():
    """Test clean function with multiple options enabled together."""
    # All options applied in order: lowercase, trailing_punctuation, dashes, extra_whitespace, bullets
    text = "  ●  HELLO-WORLD.  "
    codeflash_output = clean(
        text,
        lowercase=True,
        trailing_punctuation=True,
        dashes=True,
        extra_whitespace=True,
        bullets=True,
    )
    result = codeflash_output  # 18.2μs -> 14.3μs (27.4% faster)


def test_clean_order_lowercase_first():
    """Test that lowercase is applied before trailing punctuation."""
    # Uppercase letters should be lowercased, then punctuation removed
    text = "HELLO."
    codeflash_output = clean(text, lowercase=True, trailing_punctuation=True)
    result = codeflash_output  # 2.89μs -> 2.91μs (0.824% slower)


def test_clean_special_characters_preserved():
    """Test that special characters not targeted by cleaning are preserved."""
    # Characters like ! @ # $ % etc should remain
    text = "Hello! @World #$%"
    codeflash_output = clean(text)
    result = codeflash_output  # 1.20μs -> 1.26μs (4.92% slower)


def test_clean_single_character():
    """Test clean function with single character input."""
    # Single character should be processed normally
    codeflash_output = clean("A")
    result = codeflash_output  # 1.23μs -> 1.23μs (0.487% slower)


def test_clean_single_character_lowercase():
    """Test clean function with single character and lowercase enabled."""
    # Single uppercase character should become lowercase
    codeflash_output = clean("A", lowercase=True)
    result = codeflash_output  # 2.09μs -> 2.08μs (0.770% faster)


def test_clean_only_punctuation():
    """Test clean function with text containing only punctuation."""
    # Only punctuation should be handled based on trailing_punctuation option
    text = ".,:;"
    codeflash_output = clean(text, trailing_punctuation=True)
    result = codeflash_output  # 2.52μs -> 2.59μs (2.78% slower)


def test_clean_digits_preserved():
    """Test that numeric digits are preserved during cleaning."""
    # Digits should not be affected by any cleaning operation
    text = "Hello 12345"
    codeflash_output = clean(text, lowercase=True, extra_whitespace=True)
    result = codeflash_output  # 9.64μs -> 6.24μs (54.6% faster)


def test_clean_unicode_text():
    """Test clean function with unicode characters."""
    # Unicode letters should be preserved
    text = "Café"
    codeflash_output = clean(text, lowercase=True)
    result = codeflash_output  # 2.70μs -> 2.78μs (2.88% slower)


def test_clean_very_long_whitespace():
    """Test clean function with many consecutive spaces."""
    # Long sequence of spaces should be reduced to single space
    text = "Hello" + " " * 100 + "World"
    codeflash_output = clean(text, extra_whitespace=True)
    result = codeflash_output  # 10.7μs -> 7.38μs (44.5% faster)


def test_clean_mixed_whitespace_types():
    """Test clean function with mixed whitespace types."""
    # Newlines and NBSPs become spaces and runs of spaces collapse; the tab is not matched by the whitespace patterns
    text = "Hello \t\n\xa0 World"
    codeflash_output = clean(text, extra_whitespace=True)
    result = codeflash_output  # 10.9μs -> 7.65μs (42.8% faster)


def test_clean_bullet_only():
    """Test clean function with only a bullet character."""
    # Single bullet should be removed, leaving empty string after strip
    text = "●"
    codeflash_output = clean(text, bullets=True)
    result = codeflash_output  # 5.90μs -> 6.19μs (4.69% slower)


def test_clean_bullet_with_spaces():
    """Test clean function with bullet followed by spaces."""
    # Bullet and surrounding spaces should be cleaned
    text = "●    "
    codeflash_output = clean(text, bullets=True)
    result = codeflash_output  # 6.62μs -> 6.84μs (3.30% slower)


def test_clean_dash_at_boundaries():
    """Test clean function with dashes at string boundaries."""
    # Leading and trailing dashes should become spaces
    text = "-Hello-World-"
    codeflash_output = clean(text, dashes=True)
    result = codeflash_output  # 8.35μs -> 6.16μs (35.5% faster)


def test_clean_trailing_punctuation_internal_period():
    """Test that internal periods are not removed."""
    # Only trailing punctuation should be removed, not internal
    text = "Hello. World."
    codeflash_output = clean(text, trailing_punctuation=True)
    result = codeflash_output  # 2.54μs -> 2.60μs (2.27% slower)


def test_clean_trailing_punctuation_mixed_order():
    """Test trailing punctuation with mixed punctuation types."""
    # All trailing punctuation types should be removed
    text = "Hello.,:;:,."
    codeflash_output = clean(text, trailing_punctuation=True)
    result = codeflash_output  # 2.54μs -> 2.67μs (4.87% slower)


def test_clean_lowercase_numbers_and_special():
    """Test lowercase with text containing numbers and special characters."""
    # Lowercase should only affect letters, not digits or special chars
    text = "ABC123!@#"
    codeflash_output = clean(text, lowercase=True)
    result = codeflash_output  # 2.09μs -> 2.03μs (3.00% faster)


def test_clean_bullets_different_types():
    """Test bullets cleaning with text that might look like bullets."""
    # Only actual unicode bullets at start should trigger removal
    text = "This has • character inside"
    codeflash_output = clean(text, bullets=True)
    result = codeflash_output  # 4.19μs -> 4.09μs (2.64% faster)


def test_clean_empty_after_cleaning():
    """Test that aggressive cleaning can result in empty string."""
    # Multiple cleaning options might result in empty output
    text = "●.,:;"
    codeflash_output = clean(text, trailing_punctuation=True, bullets=True, extra_whitespace=True)
    result = codeflash_output  # 12.3μs -> 8.90μs (38.1% faster)


def test_clean_strip_final_application():
    """Test that final strip() is applied after all cleaning."""
    # Leading/trailing whitespace from cleaning operations should be stripped
    text = "-Text-"
    codeflash_output = clean(text, dashes=True)
    result = codeflash_output  # 8.09μs -> 5.41μs (49.6% faster)


def test_clean_large_text_no_options():
    """Test clean function with large text and no cleaning options."""
    # Generate large text that should be processed efficiently
    large_text = "Word " * 500  # 2500 characters
    codeflash_output = clean(large_text)
    result = codeflash_output  # 1.89μs -> 2.04μs (7.17% slower)


def test_clean_large_text_with_extra_whitespace():
    """Test clean function with large text containing extra whitespace."""
    # Large text with many whitespace variations
    large_text = "Hello  \n  World\t\xa0Multiple  " * 50  # ~1350 characters
    codeflash_output = clean(large_text, extra_whitespace=True)
    result = codeflash_output  # 90.7μs -> 79.0μs (14.8% faster)


def test_clean_large_text_with_dashes():
    """Test clean function with large text containing many dashes."""
    # Large text with mixed dash types
    large_text = "Hello-World\u2013Text" * 100  # 1600 characters
    codeflash_output = clean(large_text, dashes=True)
    result = codeflash_output  # 64.8μs -> 57.8μs (12.0% faster)


def test_clean_large_text_with_trailing_punctuation():
    """Test clean function with large text having trailing punctuation."""
    # Text with many trailing punctuation marks
    large_text = "Sentence.,:;" * 100  # ~1200 characters
    codeflash_output = clean(large_text, trailing_punctuation=True)
    result = codeflash_output  # 3.10μs -> 3.05μs (1.61% faster)


def test_clean_large_text_lowercase():
    """Test clean function with large uppercase text."""
    # Large text with all uppercase
    large_text = "HELLO WORLD " * 200  # 2400 characters
    codeflash_output = clean(large_text, lowercase=True)
    result = codeflash_output  # 6.39μs -> 6.56μs (2.58% slower)


def test_clean_large_text_all_options():
    """Test clean function with large text and all options enabled."""
    # Complex text with multiple cleaning needs
    large_text = "  ●  HELLO-WORLD\n\n.,:;  " * 50  # ~1200 characters
    codeflash_output = clean(
        large_text,
        lowercase=True,
        trailing_punctuation=True,
        dashes=True,
        extra_whitespace=True,
        bullets=True,
    )
    result = codeflash_output  # 123μs -> 116μs (6.67% faster)


def test_clean_large_text_repeated_patterns():
    """Test clean function with large text of repeated patterns."""
    # Large text with repeating pattern
    pattern = "Hello.    WORLD-●\n"
    large_text = pattern * 100  # ~1800 characters
    codeflash_output = clean(
        large_text,
        lowercase=True,
        trailing_punctuation=True,
        dashes=True,
        extra_whitespace=True,
        bullets=True,
    )
    result = codeflash_output  # 150μs -> 138μs (9.19% faster)


def test_clean_large_text_mixed_content():
    """Test clean function with large text containing mixed content types."""
    # Mix of letters, numbers, special characters, and whitespace
    base_text = "Item 1: Price $99.99 (20% off) - In Stock\n\n"
    large_text = base_text * 50  # 2150 characters
    codeflash_output = clean(large_text, extra_whitespace=True)
    result = codeflash_output  # 92.9μs -> 80.5μs (15.4% faster)


def test_clean_large_text_unicode_bullets_sequence():
    """Test clean function with large text containing unicode bullets."""
    # Multiple lines with bullets
    large_text = "● Point 1\n● Point 2\n● Point 3\n" * 50  # 1500 characters
    codeflash_output = clean(large_text, bullets=True, extra_whitespace=True)
    result = codeflash_output  # 83.4μs -> 73.5μs (13.5% faster)


def test_clean_alternating_clean_options():
    """Test clean function with text alternating cleaning needs."""
    # Text with different cleaning needs in different sections
    text_parts = ["HELLO.  " * 20, "-WORLD-" * 20, "●TEXT●" * 20]
    large_text = "".join(text_parts)  # 420 characters
    codeflash_output = clean(
        large_text,
        lowercase=True,
        trailing_punctuation=True,
        dashes=True,
        extra_whitespace=True,
        bullets=True,
    )
    result = codeflash_output  # 47.2μs -> 41.3μs (14.3% faster)


def test_clean_maximum_consecutive_same_char():
    """Test clean function with maximum consecutive same characters."""
    # Large sequence of same character repeated
    large_text = "a" * 500 + " " * 500 + "b" * 500  # 1500 characters
    codeflash_output = clean(large_text, extra_whitespace=True)
    result = codeflash_output  # 39.4μs -> 32.3μs (21.9% faster)


def test_clean_large_punctuation_heavy_text():
    """Test clean function with text heavy on punctuation."""
    # Text with many punctuation marks throughout
    large_text = "Hello... world... " * 50  # 900 characters
    codeflash_output = clean(large_text, trailing_punctuation=True, extra_whitespace=True)
    result = codeflash_output  # 33.6μs -> 26.7μs (26.0% faster)


def test_clean_performance_scaling():
    """Test that clean function performs reasonably with increasing size."""
    # Test with progressively larger texts to ensure no exponential slowdown
    for size in [100, 200, 400]:
        large_text = "Word " * size
        codeflash_output = clean(
            large_text,
            lowercase=True,
            trailing_punctuation=True,
            dashes=True,
            extra_whitespace=True,
            bullets=True,
        )
        result = codeflash_output  # 140μs -> 118μs (19.0% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.cleaners.core import clean


def test_clean():
    clean(
        "",
        extra_whitespace=True,
        dashes=False,
        bullets=True,
        trailing_punctuation=True,
        lowercase=False,
    )


def test_clean_2():
    clean(
        "-",
        extra_whitespace=False,
        dashes=True,
        bullets=False,
        trailing_punctuation=False,
        lowercase=True,
    )

🔎 Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_xdo_puqm/tmppw0yo7eh/test_concolic_coverage.py::test_clean | 9.54μs | 6.34μs | 50.6%✅ |
| codeflash_concolic_xdo_puqm/tmppw0yo7eh/test_concolic_coverage.py::test_clean_2 | 7.66μs | 5.36μs | 43.0%✅ |

To edit these changes git checkout codeflash/optimize-clean-mkrwcfq7 and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 05:56
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" and "🎯 Quality: High" labels Jan 24, 2026