@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 104% (1.04x) speedup for clean_prefix in unstructured/cleaners/core.py

⏱️ Runtime : 1.50 milliseconds → 735 microseconds (best of 32 runs)

📝 Explanation and details

The optimized code achieves a 104% speedup (about 2x faster) by skipping regex compilation and matching in the common case where the pattern is a plain literal string rather than a true regular expression.

Key Optimization Strategy

What changed:

  1. Fast-path for empty patterns: Immediately returns the text (possibly lstripped) without any regex processing.
  2. Literal string detection: Checks if the pattern contains regex metacharacters (.^$*+?{}[]\\|()). If not, treats it as a literal string.
  3. Direct string operations for literals: Uses str.startswith() (case-sensitive) or str.casefold() comparison (case-insensitive) with simple slicing instead of regex substitution.
  4. Compiled regex fallback: Only for true regex patterns, compiles the pattern once and uses compiled.sub() instead of re.sub().
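
The four steps above can be sketched roughly as follows. This is a hypothetical re-creation, not the code from this PR: the name `clean_prefix_sketch`, the `_REGEX_METACHARS` constant, and the assumption that the original implementation is `re.sub(rf"^{pattern}", "", text, flags=flags)` followed by an optional `lstrip()` are all illustrative.

```python
import re

# Assumed metacharacter set, taken from the list in step 2.
_REGEX_METACHARS = frozenset(".^$*+?{}[]\\|()")


def clean_prefix_sketch(
    text: str, pattern: str, ignore_case: bool = False, strip: bool = True
) -> str:
    """Illustrative fast-path prefix removal (hypothetical, not the PR's code)."""
    # Step 1: an empty pattern removes nothing, so only the optional lstrip applies.
    if not pattern:
        return text.lstrip() if strip else text

    # Step 2: treat the pattern as a literal if it has no regex metacharacters.
    if _REGEX_METACHARS.isdisjoint(pattern):
        # Step 3: plain string comparison and slicing, no regex engine involved.
        plen = len(pattern)
        if ignore_case:
            matched = text[:plen].casefold() == pattern.casefold()
        else:
            matched = text.startswith(pattern)
        cleaned = text[plen:] if matched else text
    else:
        # Step 4: real regex, anchored at the start as in the assumed original.
        flags = re.IGNORECASE if ignore_case else 0
        compiled = re.compile(rf"^{pattern}", flags)
        cleaned = compiled.sub("", text)

    return cleaned.lstrip() if strip else cleaned
```

One design point worth checking in any real implementation: `str.casefold()` and `re.IGNORECASE` do not necessarily agree on every Unicode string, so the case-insensitive literal path may need extra tests beyond ASCII inputs.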

Why This Is Faster

Line profiler reveals the bottleneck: In the original code, 99.7% of the time (80 ms out of 80.2 ms) was spent in the single re.sub() call. This is because:

  • re.sub() has to resolve the pattern string to a compiled pattern on every call (a lookup in the re module's internal cache, or a full compile on a cache miss)
  • Even simple literal patterns like "PREFIX" go through the full regex matching machinery
  • This fixed overhead dominates for short strings and simple patterns

The optimization eliminates this overhead by:

  • Literal prefix removal: text.startswith(pattern) followed by text[plen:] uses plain string operations instead of the regex engine; in the tests below this is about 40-60% faster on short inputs and up to 1894% faster on 10,000-character inputs
  • Empty pattern handling: Returns immediately without any pattern processing (329-477% faster in the tests below)
  • Single regex compilation: When a regex is needed, the pattern is compiled once and the compiled object is used directly
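
A small timing script can make the difference concrete. This is a rough sketch rather than the benchmark Codeflash ran: the helper names `with_regex` and `with_startswith` are made up for illustration, and the absolute numbers will vary by machine and Python version.

```python
import re
import timeit

# A literal prefix followed by a long body, similar in shape to the large-text tests.
text = "PREFIX" + "x" * 10_000


def with_regex() -> str:
    # Anchored regex substitution (assumed to mirror the original approach).
    return re.sub(r"^PREFIX", "", text)


def with_startswith() -> str:
    # Plain string check and slice, as in the literal fast path.
    return text[len("PREFIX"):] if text.startswith("PREFIX") else text


# Both approaches must produce the same result before timing means anything.
assert with_regex() == with_startswith()

regex_seconds = timeit.timeit(with_regex, number=2_000)
literal_seconds = timeit.timeit(with_startswith, number=2_000)
print(f"re.sub:     {regex_seconds:.4f}s for 2,000 calls")
print(f"startswith: {literal_seconds:.4f}s for 2,000 calls")
```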

Performance by Test Case Type

Faster cases (up to 1900% faster):

  • Empty patterns: 329-477% speedup (test_empty_pattern_only_strips_if_strip_true, test_basic_empty_pattern, test_edge_empty_and_empty)
  • Large texts (5,000 or more characters) with literal patterns: roughly 490-1900% speedup (test_large_scale_very_long_text, test_large_scale_long_text_no_match)
  • Simple literal prefixes: 40-60% speedup (most basic tests)
  • Case-insensitive literal matching: 53-62% speedup using casefold()

Slower cases (14-84% slower):

  • True regex patterns with metacharacters on short inputs: 14-39% slower due to added literal-detection overhead and regex compilation
  • Very long literal patterns (500+ chars): 53-84% slower due to character-by-character metacharacter checking

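The optimization trades a moderate penalty on regex patterns and very long patterns for large gains on literal strings, which make up most of the generated regression tests. Note that the one existing unit test, cleaners/test_core.py::test_clean_prefix, runs 24.0% slower (35.4μs → 46.5μs).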
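
To see which patterns would land on which path, here is a minimal sketch of the literal-detection check. The function name `looks_literal` and the constant are hypothetical; only the metacharacter list comes from the description above.

```python
# Assumed metacharacter set, copied from the description of the literal check.
REGEX_METACHARS = frozenset(".^$*+?{}[]\\|()")


def looks_literal(pattern: str) -> bool:
    """Return True if the pattern contains none of the assumed metacharacters."""
    return REGEX_METACHARS.isdisjoint(pattern)


# Patterns taken from the generated regression tests.
for pattern in ["PREFIX", "pfx_", r"\d+-", r"ERROR|WARNING", r"a\.txt", r"PREFIX\-"]:
    path = "literal fast path" if looks_literal(pattern) else "regex fallback"
    print(f"{pattern!r:>18} -> {path}")
```

Note that an escaped literal such as `a\.txt` or `PREFIX\-` contains a backslash, so under this check it takes the regex fallback even though it only matches one fixed string.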

Impact Assessment

Because call-site information (function_references) is not available, the optimization's value depends on how often clean_prefix() is called and with what kind of pattern:

  • High-frequency hot paths: The 2x average speedup will significantly reduce cumulative overhead, especially if most calls use literal patterns (common for prefix removal in text processing pipelines)
  • Large text processing: Tests show roughly 490-1900% speedups for texts of 5,000-10,000 characters with literal patterns
  • Batch operations: The optimization compounds when processing many documents

The optimization is behavior-preserving on the tested inputs: all 72 existing and 76 generated tests pass with identical output, and the regex fallback ensures backward compatibility for complex patterns.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 72 Passed |
| 🌀 Generated Regression Tests | 76 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| cleaners/test_core.py::test_clean_prefix | 35.4μs | 46.5μs | -24.0%⚠️ |

🌀 Generated Regression Tests
from __future__ import annotations

# imports
from unstructured.cleaners.core import clean_prefix


def test_basic_exact_prefix_removal():
    # Basic scenario: exact string prefix at the start should be removed.
    text = "prefix_content"
    codeflash_output = clean_prefix(text, "prefix_")
    result = codeflash_output  # 7.93μs -> 5.08μs (56.1% faster)


def test_no_match_but_strip_true_removes_leading_whitespace():
    # If the pattern does not match, the function should still honor strip=True and remove leading whitespace.
    text = "   no_match_here"
    # pattern 'xyz' does not match the start, so prefix removal does nothing, but lstrip should remove leading spaces
    codeflash_output = clean_prefix(text, "xyz", ignore_case=False, strip=True)
    result = codeflash_output  # 7.49μs -> 4.97μs (50.9% faster)


def test_ignore_case_true_and_false():
    # Case-insensitive removal should remove mixed-case prefixes when ignore_case=True
    text = "HeLLo world"
    codeflash_output = clean_prefix(text, "hello", ignore_case=True)
    result_ci = codeflash_output  # 9.15μs -> 5.98μs (53.1% faster)

    # When ignore_case=False the same pattern should not match and only leading whitespace removed by strip
    text2 = "HeLLo world"
    codeflash_output = clean_prefix(text2, "hello", ignore_case=False, strip=True)
    result_cs = codeflash_output  # 3.78μs -> 2.73μs (38.3% faster)


def test_regex_numeric_prefix():
    # Pattern can be a regex. Remove leading digits followed by a dash.
    text = "12345-abcde"
    codeflash_output = clean_prefix(text, r"\d+-")
    result = codeflash_output  # 7.70μs -> 10.3μs (25.3% slower)


def test_group_alternation_pattern():
    # Pattern with alternation should remove either 'foo_' or 'bar_' at the start.
    text1 = "foo_baz"
    text2 = "bar_baz"
    pattern = r"(?:foo|bar)_"
    codeflash_output = clean_prefix(text1, pattern)  # 7.31μs -> 9.91μs (26.2% slower)
    codeflash_output = clean_prefix(text2, pattern)  # 2.29μs -> 3.72μs (38.6% slower)


def test_strip_false_preserves_leading_whitespace():
    # When strip=False, leading whitespace after prefix removal should be preserved.
    text = "prefix_   content"
    # Remove 'prefix_' but do not strip leading whitespace afterwards
    codeflash_output = clean_prefix(text, "prefix_", strip=False)
    result = codeflash_output  # 7.12μs -> 5.25μs (35.7% faster)


def test_empty_pattern_only_strips_if_strip_true():
    # Empty pattern anchored to start will match empty string at start; re.sub will do nothing effectively,
    # but strip=True should still remove leading whitespace.
    text = "   content"
    codeflash_output = clean_prefix(text, "", strip=True)
    result_strip = codeflash_output  # 7.90μs -> 1.84μs (329% faster)
    # If strip=False and pattern is empty, leading whitespace should be preserved.
    codeflash_output = clean_prefix(text, "", strip=False)
    result_no_strip = codeflash_output  # 2.78μs -> 556ns (401% faster)


def test_empty_text_returns_empty():
    # Function should handle empty input text gracefully.
    codeflash_output = clean_prefix("", "anything")  # 5.63μs -> 4.80μs (17.5% faster)


def test_non_start_match_not_removed():
    # Pattern that appears but not at the start should not be removed because function anchors the pattern with ^.
    text = "xabc"
    codeflash_output = clean_prefix(text, "abc", strip=True)
    result = codeflash_output  # 6.98μs -> 4.95μs (40.8% faster)


def test_trailing_whitespace_preserved_and_only_leading_stripped():
    # Ensure that only leading whitespace is removed when strip=True, trailing whitespace remains intact.
    text = "prefix_  content  "  # two spaces before 'content' and two after
    codeflash_output = clean_prefix(text, "prefix_", strip=True)
    result = codeflash_output  # 7.66μs -> 5.45μs (40.5% faster)


def test_pattern_removes_whitespace_using_regex():
    # Test when the pattern itself is a regex matching whitespace at the start.
    text = "     indented"
    # Pattern removes one or more whitespace chars at the start; strip=True will then lstrip (no-op)
    codeflash_output = clean_prefix(text, r"\s+")
    result = codeflash_output  # 7.62μs -> 10.1μs (24.5% slower)


def test_large_scale_prefix_removal_many_repeats():
    # Large-scale scenario: use many repetitions but keep loops and elements under limits.
    # Create a long string with many repeated 'pfx_' segments; the function anchors pattern at start and should remove only the first.
    repeats = 500  # under the 1000 limit
    repeated_prefix = "pfx_"
    body = "DATA" * 20  # moderate-sized body to make the string sizable
    text = repeated_prefix * repeats + body  # string starts with repeated 'pfx_' occurrences
    # Pattern only removes the first occurrence because of ^ anchor and exact pattern 'pfx_'
    codeflash_output = clean_prefix(text, "pfx_")
    result = codeflash_output  # 25.6μs -> 5.25μs (388% faster)
    # Construct expected: remove only the first 'pfx_' (one occurrence), leaving repeats-1 copies at the start
    expected = repeated_prefix * (repeats - 1) + body
    assert result == expected


def test_regex_quantifier_prefix_removal_to_remove_multiple_repeats():
    # If the caller wants to remove multiple repeated prefixes they can supply a regex quantifier in pattern.
    repeats = 300
    repeated_prefix = "pre_"
    body = "X"
    text = repeated_prefix * repeats + body
    # Pattern uses + quantifier to remove all consecutive 'pre_' occurrences at the start
    codeflash_output = clean_prefix(text, r"(?:pre_)+")
    result = codeflash_output  # 15.9μs -> 18.5μs (14.4% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
from unstructured.cleaners.core import clean_prefix


def test_basic_prefix_removal_simple_string():
    """Test removing a simple literal string prefix."""
    # Given a text with a simple prefix and a matching pattern
    text = "Hello World"
    pattern = "Hello"
    # When we clean the prefix
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 8.03μs -> 4.91μs (63.6% faster)


def test_basic_prefix_removal_with_numbers():
    """Test removing a numeric prefix."""
    text = "123 items in stock"
    pattern = "123"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.55μs -> 4.89μs (54.2% faster)


def test_basic_prefix_removal_empty_after_clean():
    """Test when removing prefix leaves only whitespace."""
    text = "PREFIX   "
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.31μs -> 5.11μs (43.1% faster)


def test_basic_prefix_no_match():
    """Test when pattern doesn't match any prefix."""
    text = "Hello World"
    pattern = "Goodbye"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 6.79μs -> 4.86μs (39.6% faster)


def test_basic_prefix_regex_pattern_dot():
    """Test removing a regex pattern with dot metacharacter."""
    text = "a.txt file here"
    pattern = r"a\.txt"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.35μs -> 10.5μs (29.7% slower)


def test_basic_prefix_regex_pattern_star():
    """Test removing a regex pattern with star quantifier."""
    text = "aaab content"
    pattern = r"a*"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.68μs -> 10.6μs (27.3% slower)


def test_basic_strip_parameter_true():
    """Test that strip=True removes leading whitespace."""
    text = "PREFIX    extra content"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=True)
    result = codeflash_output  # 7.87μs -> 5.74μs (37.1% faster)


def test_basic_strip_parameter_false():
    """Test that strip=False preserves leading whitespace."""
    text = "PREFIX    extra content"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=False)
    result = codeflash_output  # 7.48μs -> 5.13μs (45.9% faster)


def test_basic_ignore_case_true():
    """Test case-insensitive prefix removal."""
    text = "HELLO world"
    pattern = "hello"
    codeflash_output = clean_prefix(text, pattern, ignore_case=True)
    result = codeflash_output  # 9.32μs -> 5.99μs (55.4% faster)


def test_basic_ignore_case_false():
    """Test case-sensitive prefix removal (default behavior)."""
    text = "HELLO world"
    pattern = "hello"
    codeflash_output = clean_prefix(text, pattern, ignore_case=False)
    result = codeflash_output  # 7.25μs -> 5.05μs (43.5% faster)


def test_basic_empty_pattern():
    """Test with an empty pattern."""
    text = "Hello World"
    pattern = ""
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.47μs -> 1.42μs (428% faster)


def test_basic_empty_text():
    """Test with empty text input."""
    text = ""
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 5.70μs -> 4.67μs (22.0% faster)


def test_basic_exact_match_entire_text():
    """Test when prefix matches the entire text."""
    text = "PREFIX"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 6.81μs -> 4.92μs (38.5% faster)


def test_edge_special_regex_characters_unescaped():
    """Test pattern with escaped special regex characters."""
    text = "[TEST] content here"
    pattern = r"\[TEST\]"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.74μs -> 10.1μs (23.5% slower)


def test_edge_multiple_spaces_after_prefix():
    """Test multiple consecutive spaces after prefix removal."""
    text = "PREFIX     lots of spaces"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=True)
    result = codeflash_output  # 8.06μs -> 5.63μs (43.3% faster)


def test_edge_single_character_prefix():
    """Test removing a single character prefix."""
    text = "a some text"
    pattern = "a"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.87μs -> 4.83μs (62.9% faster)


def test_edge_single_character_text():
    """Test with single character text."""
    text = "a"
    pattern = "a"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 6.67μs -> 4.26μs (56.6% faster)


def test_edge_prefix_with_special_characters():
    """Test prefix containing special characters."""
    text = "!@#$ content"
    pattern = r"!@#\$"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.71μs -> 10.7μs (27.8% slower)


def test_edge_pattern_longer_than_text():
    """Test when pattern is longer than the text."""
    text = "ab"
    pattern = "abcdef"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 5.61μs -> 4.72μs (18.9% faster)


def test_edge_whitespace_only_text():
    """Test with whitespace-only text."""
    text = "    "
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=True)
    result = codeflash_output  # 6.31μs -> 5.18μs (21.9% faster)


def test_edge_whitespace_only_after_removal():
    """Test when only whitespace remains after prefix removal."""
    text = "PREFIX\n\t  "
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=True)
    result = codeflash_output  # 7.38μs -> 5.58μs (32.3% faster)


def test_edge_case_mixed_case_ignore_case_true():
    """Test mixed case with ignore_case=True."""
    text = "PrEfIx test"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, ignore_case=True)
    result = codeflash_output  # 9.25μs -> 5.96μs (55.0% faster)


def test_edge_case_mixed_case_ignore_case_false():
    """Test mixed case with ignore_case=False (exact match required)."""
    text = "PrEfIx test"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, ignore_case=False)
    result = codeflash_output  # 7.26μs -> 5.11μs (42.1% faster)


def test_edge_regex_alternation_pattern():
    """Test regex pattern with alternation."""
    text = "ERROR: something failed"
    pattern = r"ERROR|WARNING"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 8.32μs -> 10.5μs (20.4% slower)


def test_edge_regex_alternation_pattern_second_option():
    """Test regex alternation matching second option."""
    text = "WARNING: something failed"
    pattern = r"ERROR|WARNING"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 8.10μs -> 10.3μs (21.3% slower)


def test_edge_regex_character_class_pattern():
    """Test regex character class pattern."""
    text = "5 apples and 10 oranges"
    pattern = r"\d+"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 8.10μs -> 10.5μs (22.9% slower)


def test_edge_regex_with_question_mark():
    """Test regex pattern with optional character (?)."""
    text = "colors in the sky"
    pattern = r"colou?rs"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.92μs -> 10.3μs (23.3% slower)


def test_edge_regex_with_question_mark_no_match():
    """Test that the literal pattern 'color' (no question mark) does not match text starting with 'colour'."""
    text = "colour in the sky"
    pattern = r"color"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.10μs -> 4.55μs (55.8% faster)


def test_edge_newline_in_text():
    """Test text containing newline character."""
    text = "PREFIX\nnext line"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=False)
    result = codeflash_output  # 7.47μs -> 4.97μs (50.4% faster)


def test_edge_newline_in_text_with_strip():
    """Test text with newline, strip=True."""
    text = "PREFIX\nnext line"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=True)
    result = codeflash_output  # 7.92μs -> 5.53μs (43.4% faster)


def test_edge_tab_characters():
    """Test text with tab characters."""
    text = "PREFIX\t\tcontent"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=False)
    result = codeflash_output  # 7.57μs -> 5.09μs (48.7% faster)


def test_edge_unicode_characters():
    """Test with unicode characters."""
    text = "Préfixé content"
    pattern = "Préfixé"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 8.43μs -> 5.70μs (47.9% faster)


def test_edge_unicode_prefix_pattern():
    """Test unicode pattern matching."""
    text = "café menu"
    pattern = "café"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 8.16μs -> 5.19μs (57.1% faster)


def test_edge_empty_and_empty():
    """Test both text and pattern are empty."""
    text = ""
    pattern = ""
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.16μs -> 1.24μs (477% faster)


def test_edge_regex_dot_matches_any_char():
    """Test regex dot pattern."""
    text = "Xfoo bar"
    pattern = r".foo"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.67μs -> 10.4μs (25.9% slower)


def test_edge_regex_start_anchor_implicit():
    """Test that pattern implicitly anchors at start."""
    text = "prefix and PREFIX more"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.08μs -> 4.61μs (53.4% faster)


def test_edge_strip_parameter_false_preserves_spaces():
    """Test strip=False with significant spacing."""
    text = "A     many spaces"
    pattern = "A"
    codeflash_output = clean_prefix(text, pattern, strip=False)
    result = codeflash_output  # 7.72μs -> 4.80μs (60.9% faster)


def test_edge_pattern_with_pipe_character():
    """Test pattern containing pipe as alternation."""
    text = "cat food"
    pattern = r"cat|dog"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.92μs -> 10.7μs (26.0% slower)


def test_edge_pattern_with_plus_quantifier():
    """Test pattern with plus quantifier (+)."""
    text = "aaaaab content"
    pattern = r"a+"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.78μs -> 10.5μs (25.6% slower)


def test_edge_pattern_with_curly_quantifier():
    """Test pattern with curly brace quantifier."""
    text = "aaab content"
    pattern = r"a{3}"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.69μs -> 10.6μs (27.4% slower)


def test_edge_pattern_with_question_quantifier_zero():
    """Test pattern with question mark matching zero times."""
    text = "content"
    pattern = r"x?"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.96μs -> 10.2μs (22.1% slower)


def test_edge_all_parameters_combined():
    """Test with all parameters set to non-default values."""
    text = "HELLO    World"
    pattern = "hello"
    codeflash_output = clean_prefix(text, pattern, ignore_case=True, strip=True)
    result = codeflash_output  # 9.98μs -> 6.15μs (62.3% faster)


def test_large_scale_very_long_text():
    """Test with very long text."""
    # Create a long text with a simple prefix
    prefix = "PREFIX"
    long_suffix = "x" * 10000
    text = prefix + long_suffix
    pattern = prefix
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 95.4μs -> 5.88μs (1522% faster)


def test_large_scale_very_long_pattern():
    """Test with very long pattern."""
    # Create a very long pattern that matches
    long_pattern = "a" * 500
    text = long_pattern + " content"
    codeflash_output = clean_prefix(text, pattern=long_pattern)
    result = codeflash_output  # 10.1μs -> 21.5μs (52.8% slower)


def test_large_scale_very_long_text_with_spaces():
    """Test very long text with many spaces after prefix."""
    text = "PREFIX" + " " * 5000 + "content"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=True)
    result = codeflash_output  # 55.1μs -> 9.39μs (487% faster)


def test_large_scale_long_text_no_match():
    """Test large text where pattern doesn't match."""
    # Create a long text that doesn't match the pattern
    text = "x" * 10000
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 92.9μs -> 4.66μs (1894% faster)


def test_large_scale_regex_alternation_many_options():
    """Test regex with many alternation options."""
    # Create a pattern with many options
    pattern = r"ERROR|WARNING|INFO|DEBUG|TRACE|FATAL|CRITICAL|ALERT"
    text = "CRITICAL: System failure"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 8.63μs -> 10.8μs (19.7% slower)


def test_large_scale_repeated_pattern_in_text():
    """Test text containing repeated pattern (but only prefix removed)."""
    # Pattern appears multiple times, but only first should be removed
    text = "test test test result"
    pattern = "test"
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 7.62μs -> 4.92μs (54.8% faster)


def test_large_scale_long_unicode_text():
    """Test with long unicode text."""
    # Create long unicode text
    unicode_prefix = "Préfixé"
    unicode_content = "ñ" * 5000
    text = unicode_prefix + unicode_content
    pattern = unicode_prefix
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 51.7μs -> 6.19μs (736% faster)


def test_large_scale_complex_regex_on_long_text():
    """Test complex regex pattern on long text."""
    # Complex pattern that matches specific structure
    pattern = r"^\[LOG\]"
    text = "[LOG] " + "x" * 9000 + " end"
    codeflash_output = clean_prefix(text, pattern, strip=True)
    result = codeflash_output  # 87.0μs -> 12.0μs (624% faster)


def test_large_scale_many_leading_spaces():
    """Test handling many leading spaces with strip=True."""
    text = " " * 1000 + "PREFIX content"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=True)
    result = codeflash_output  # 16.6μs -> 6.12μs (171% faster)


def test_large_scale_many_leading_spaces_strip_false():
    """Test handling many leading spaces with strip=False."""
    text = " " * 1000 + "PREFIX content"
    pattern = "PREFIX"
    codeflash_output = clean_prefix(text, pattern, strip=False)
    result = codeflash_output  # 15.7μs -> 4.69μs (235% faster)


def test_large_scale_regex_with_backref_and_long_text():
    """Test a plain literal pattern on large text (despite the test name, no backreference is used)."""
    # The pattern contains no regex metacharacters
    pattern = r"LOG:"
    text = "LOG: " + "a" * 8000 + " b" * 500
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 86.4μs -> 6.33μs (1265% faster)


def test_large_scale_ignore_case_on_long_text():
    """Test ignore_case on large case-varied text."""
    pattern = "DEBUG"
    text = "debug " + "x" * 9000
    codeflash_output = clean_prefix(text, pattern, ignore_case=True)
    result = codeflash_output  # 88.2μs -> 7.24μs (1118% faster)


def test_large_scale_strip_false_preserve_structure():
    """Test strip=False preserves exact spacing on large text."""
    prefix = "PREFIX"
    spaces = "   " * 100  # 300 spaces
    content = "a" * 5000
    text = prefix + spaces + content
    codeflash_output = clean_prefix(text, pattern=prefix, strip=False)
    result = codeflash_output  # 53.9μs -> 5.72μs (842% faster)


def test_large_scale_character_class_performance():
    """Test character class pattern performance."""
    # Pattern matches multiple characters
    pattern = r"[a-zA-Z]+"
    text = "abcdefghijklmnopqrstuvwxyz" + "0" * 9000
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 86.5μs -> 11.6μs (646% faster)


def test_large_scale_multiple_pattern_variations():
    """Test different pattern styles on same large text."""
    text = "PREFIX-" + "x" * 8000

    # Test with simple prefix
    codeflash_output = clean_prefix(text, "PREFIX-")
    result1 = codeflash_output  # 77.3μs -> 5.49μs (1307% faster)

    # Test with regex escape
    codeflash_output = clean_prefix(text, r"PREFIX\-")
    result2 = codeflash_output  # 72.7μs -> 9.17μs (693% faster)


def test_large_scale_empty_result_large_prefix():
    """Test when large prefix removes entire text."""
    large_prefix = "x" * 5000
    text = large_prefix
    pattern = large_prefix
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 27.4μs -> 174μs (84.3% slower)


def test_large_scale_consecutive_removals_concept():
    """Test pattern matching at start of large text."""
    # While not actually running consecutive removals,
    # verify the pattern-once behavior on large text
    pattern = "PREFIX"
    text = pattern + pattern + "x" * 8000
    codeflash_output = clean_prefix(text, pattern)
    result = codeflash_output  # 77.7μs -> 5.51μs (1311% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-clean_prefix-mkrvyyun and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 05:45
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: Medium Optimization Quality according to Codeflash labels Jan 24, 2026