@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 33% (0.33x) speedup for clean_extra_whitespace_with_index_run in unstructured/cleaners/core.py

⏱️ Runtime : 8.77 milliseconds → 6.58 milliseconds (best of 111 runs)

📝 Explanation and details

The optimized code achieves a 33% speedup through three key optimizations:

1. Module-Level Precompilation (~0.6ms savings)

  • Original: Regex and translation table recreated on every function call
    • re.sub(r"([ ]{2,})", ...) recompiles regex each time (~1.4ms in profiler)
    • Translation dict rebuilt each call (~0.15ms)
  • Optimized: _MULTI_SPACE_RE and _TRANSLATE_TABLE defined once at module level
    • Regex compilation: 1.4ms → 0.78ms (44% faster)
    • Translation overhead eliminated entirely

2. Iterator-Based Loop (~1.5ms savings)

  • Original: while loop with manual indexing
    • Two string index operations per iteration: text[original_index] and cleaned_text[cleaned_index]
    • Set membership check: c_orig in ws_chars on hot path
  • Optimized: for c_orig in txt with local bindings
    • Eliminates one index operation per iteration (text[original_index])
    • Character iteration is faster in Python than explicit indexing
    • Local variable bindings (txt, ct) reduce global lookups
    • Direct character comparisons (c_orig == "\xa0") faster than set membership

3. Why It Matters

Looking at function_references, this function is called inside a loop processing PDF pages during text extraction:

for _text in _text_snippets:
    _text, moved_indices = clean_extra_whitespace_with_index_run(_text)

This is a hot path in PDF processing where:

  • Each page may contain dozens of text snippets
  • Large PDFs could invoke this function thousands of times
  • The 33% speedup compounds across all invocations

4. Test Performance Characteristics

The optimization shows strongest gains on inputs with:

  • Many consecutive spaces: test_many_consecutive_spaces (51.7% faster)
  • Mixed whitespace types: test_mixed_space_types_many (55.4% faster)
  • Large texts with space-heavy regions: test_large_text_with_many_spaces_between_words (92.9% faster)

These match real-world PDF text extraction patterns where OCR or formatting often introduces irregular whitespace.
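
The shape of these gains can be reproduced with a stand-alone micro-benchmark (an illustrative sketch, not code from this PR; note that `re.sub` with a string pattern still hits re's internal cache, so the measured gap reflects per-call cache lookups and table construction rather than full recompilation):

```python
import re
import timeit

# Module-level objects, built once (mirrors the optimized version).
_MULTI_SPACE_RE = re.compile(r"[ ]{2,}")
_TRANSLATE_TABLE = str.maketrans({"\xa0": " ", "\n": " "})

def per_call(text: str) -> str:
    # Pattern resolved through re's cache and table rebuilt on every call.
    table = {ord("\xa0"): ord(" "), ord("\n"): ord(" ")}
    return re.sub(r"([ ]{2,})", " ", text.translate(table)).strip()

def precompiled(text: str) -> str:
    # Reuses the module-level pattern object and translation table.
    return _MULTI_SPACE_RE.sub(" ", text.translate(_TRANSLATE_TABLE)).strip()

sample = "word" + " " * 50 + "word"
for fn in (per_call, precompiled):
    t = timeit.timeit(lambda: fn(sample), number=20_000)
    print(f"{fn.__name__}: {t:.3f}s")
```

Both variants return identical output; only the per-call overhead differs, which is why the gap widens on short, frequent calls.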

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 74 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 1 Passed
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests
import re

import numpy as np

from unstructured.cleaners.core import clean_extra_whitespace_with_index_run


# Helper to compute expected cleaned text and moved indices using the same cleaning rules
# but a separate mapping procedure. This keeps expectations explicit and deterministic.
def _compute_expected_for_test(text: str):
    # Apply the same normalization rules as the implementation:
    #  - translate NBSP (\xa0) and newline (\n) to a normal space
    #  - collapse runs of two or more spaces into a single space
    #  - strip leading/trailing spaces
    translate_table = {ord("\xa0"): ord(" "), ord("\n"): ord(" ")}
    cleaned = text.translate(translate_table)
    cleaned = re.sub(r"([ ]{2,})", " ", cleaned)
    cleaned = cleaned.strip()

    orig = text
    ws_chars = {"\xa0", "\n"}

    mapped_distances = []  # distances for each cleaned character position

    orig_pos = 0
    # For each character in cleaned text find the first matching original char
    # at or after the last matched original position.
    for clean_i, c_clean in enumerate(cleaned):
        # Advance orig_pos until we find a matching character
        # (match or whitespace -> space mapping).
        while orig_pos < len(orig):
            c_orig = orig[orig_pos]
            if c_orig == c_clean or (c_orig in ws_chars and c_clean == " "):
                # distance is how many original chars were skipped so far:
                mapped_distances.append(orig_pos - clean_i)
                # consume this original character for subsequent matches
                orig_pos += 1
                break
            # otherwise this original char is effectively removed/skipped
            orig_pos += 1
        else:
            # If we ran out of original characters unexpectedly,
            # mimic function behavior by assuming no further movement
            mapped_distances.append(orig_pos - clean_i)

    # After mapping cleaned characters, remaining positions in the returned array
    # are filled with the final 'distance' value (or 0 if there were no cleaned chars).
    final_distance = mapped_distances[-1] if mapped_distances else 0
    # The implementation returns an array whose length equals the original text length.
    full_moved = mapped_distances + [final_distance] * (len(orig) - len(mapped_distances))
    # Return expected cleaned string and numpy array (float dtype as implementation uses zeros())
    return cleaned, np.array(full_moved, dtype=float)
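
# As a concrete, hand-checked illustration of the mapping rules above
# (illustrative only, not part of the generated suite):
_example_original = "ITEM 1.     BUSINESS"  # five spaces between the parts
_example_cleaned = "ITEM 1. BUSINESS"       # four of them collapse away
# "B" sits at original index 12 but cleaned index 8: it moved 4 left,
# while characters before the collapsed run do not move at all.
assert _example_original.index("B") - _example_cleaned.index("B") == 4
assert _example_original.index("M") - _example_cleaned.index("M") == 0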


def test_basic_single_run_spaces():
    # Input with a run of multiple spaces in the middle.
    original = "ITEM 1." + "     " + "BUSINESS"  # five spaces between the parts
    # Call the real function under test
    cleaned, moved = clean_extra_whitespace_with_index_run(
        original
    )  # 25.6μs -> 21.1μs (21.5% faster)
    # Compute expected values using independent helper
    expected_cleaned, expected_moved = _compute_expected_for_test(original)
    assert cleaned == expected_cleaned
    assert np.array_equal(moved, expected_moved)


def test_handles_nbsp_and_newline_and_unicode():
    # Include NBSP (\xa0), newline (\n), and a unicode character to ensure
    # translation to space works and non-space unicode are preserved.
    original = "α" + "\xa0" * 3 + "\n" + "β" + "  " + "γ"  # NBSPs and newline will become spaces
    cleaned, moved = clean_extra_whitespace_with_index_run(
        original
    )  # 24.9μs -> 20.6μs (21.1% faster)

    expected_cleaned, expected_moved = _compute_expected_for_test(original)
    assert cleaned == expected_cleaned
    assert np.array_equal(moved, expected_moved)


def test_no_extra_whitespace_preserved_and_zero_distances():
    # A string with only single spaces; nothing should be changed,
    # and all leading moved distances should be zero.
    original = "This is a well formed sentence."
    cleaned, moved = clean_extra_whitespace_with_index_run(
        original
    )  # 27.5μs -> 23.2μs (18.6% faster)

    expected_cleaned, expected_moved = _compute_expected_for_test(original)
    assert cleaned == expected_cleaned
    assert np.array_equal(moved, expected_moved)


def test_empty_string_returns_empty_and_empty_array():
    # When input is empty, expect an empty cleaned string and an empty numpy array
    original = ""
    cleaned, moved = clean_extra_whitespace_with_index_run(
        original
    )  # 15.2μs -> 11.6μs (30.8% faster)
    assert cleaned == ""
    assert len(moved) == 0


def test_only_whitespace_becomes_empty_and_distances_zero():
    # Input containing only spaces/newlines/nbsp; after cleaning it becomes empty string.
    original = "   \n\xa0  "  # mixture of spaces, newline, NBSP
    cleaned, moved = clean_extra_whitespace_with_index_run(
        original
    )  # 17.8μs -> 14.8μs (20.5% faster)

    expected_cleaned, expected_moved = _compute_expected_for_test(original)
    assert cleaned == expected_cleaned
    assert np.array_equal(moved, expected_moved)


def test_leading_and_trailing_spaces_stripped_and_distances_recorded():
    # Leading and trailing spaces should be removed by strip(), so cleaned_text shorter.
    original = "   Leading and trailing   "
    cleaned, moved = clean_extra_whitespace_with_index_run(
        original
    )  # 26.3μs -> 21.9μs (20.0% faster)

    expected_cleaned, expected_moved = _compute_expected_for_test(original)
    assert cleaned == expected_cleaned
    assert np.array_equal(moved, expected_moved)


def test_large_scale_patterned_input_correctness_and_efficiency():
    # Build a deterministic large input (well under 1000 chars): repeated pattern with varied spaces.
    parts = []
    # create 200 components; each component length small; total string < 1000
    for i in range(200):
        # alternate number of spaces to ensure many collapse events
        spaces = " " * ((i % 5) + 1)  # 1..5 spaces
        parts.append("W" + spaces + str(i))
    original = " ".join(parts)  # also includes a single separator space between parts

    # Call the function under test
    cleaned, moved = clean_extra_whitespace_with_index_run(
        original
    )  # 450μs -> 344μs (30.8% faster)

    # Compute expected values
    expected_cleaned, expected_moved = _compute_expected_for_test(original)
    assert cleaned == expected_cleaned
    assert np.array_equal(moved, expected_moved)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
from unstructured.cleaners.core import clean_extra_whitespace_with_index_run

# BASIC TEST CASES
# These tests verify fundamental functionality under normal conditions


def test_single_space_between_words():
    """Test that text with single spaces between words remains unchanged."""
    text = "hello world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 22.3μs -> 18.8μs (18.8% faster)


def test_multiple_spaces_between_words():
    """Test that multiple spaces between words are collapsed to single space."""
    text = "hello     world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 23.8μs -> 20.0μs (19.3% faster)


def test_leading_spaces_removed():
    """Test that leading spaces are stripped from the cleaned text."""
    text = "   hello"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 21.9μs -> 17.8μs (22.9% faster)


def test_trailing_spaces_removed():
    """Test that trailing spaces are stripped from the cleaned text."""
    text = "hello   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 20.6μs -> 16.9μs (21.8% faster)


def test_nonbreaking_space_converted_to_space():
    """Test that non-breaking spaces (\\xa0) are converted to regular spaces."""
    text = "hello\xa0world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 22.5μs -> 18.4μs (22.4% faster)


def test_newline_converted_to_space():
    """Test that newlines (\\n) are converted to regular spaces."""
    text = "hello\nworld"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 22.1μs -> 18.4μs (19.9% faster)


def test_mixed_whitespace_types():
    """Test that mixture of regular spaces, non-breaking spaces, and newlines are handled."""
    text = "hello \xa0 \n world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 24.8μs -> 20.4μs (21.9% faster)


def test_example_from_docstring():
    """Test the exact example provided in the docstring."""
    text = "ITEM 1.     BUSINESS"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 25.6μs -> 21.0μs (22.0% faster)


def test_no_whitespace_text():
    """Test that text with no whitespace returns unchanged."""
    text = "hello"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 19.7μs -> 16.2μs (21.3% faster)


def test_single_word():
    """Test that a single word with no spaces is handled correctly."""
    text = "word"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 19.6μs -> 15.9μs (22.9% faster)


def test_indices_length_matches_original_text():
    """Test that returned indices array length always matches original text length."""
    test_cases = [
        "hello world",
        "hello  world",
        "   leading",
        "trailing   ",
        "multiple    spaces    here",
    ]
    for text in test_cases:
        _, indices = clean_extra_whitespace_with_index_run(text)  # 58.2μs -> 47.1μs (23.6% faster)


def test_indices_are_non_negative():
    """Test that all distance values in indices array are non-negative."""
    test_cases = [
        "hello world",
        "hello     world",
        "   multiple   spaces   ",
    ]
    for text in test_cases:
        _, indices = clean_extra_whitespace_with_index_run(text)  # 44.3μs -> 36.5μs (21.3% faster)


def test_indices_are_increasing_or_same():
    """Test that indices values are monotonically increasing (never decrease)."""
    text = "hello     world"
    _, indices = clean_extra_whitespace_with_index_run(text)  # 23.6μs -> 19.8μs (19.2% faster)
    for i in range(len(indices) - 1):
        assert indices[i] <= indices[i + 1]


# EDGE CASE TEST CASES
# These tests evaluate behavior under extreme or unusual conditions


def test_empty_string():
    """Test that empty string is handled without error."""
    text = ""
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 15.4μs -> 11.6μs (32.3% faster)


def test_only_spaces():
    """Test that string containing only spaces becomes empty after cleaning."""
    text = "     "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 17.3μs -> 14.2μs (21.7% faster)


def test_only_newlines():
    """Test that string containing only newlines becomes empty after cleaning."""
    text = "\n\n\n"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 17.0μs -> 13.9μs (21.8% faster)


def test_only_nonbreaking_spaces():
    """Test that string containing only non-breaking spaces becomes empty."""
    text = "\xa0\xa0\xa0"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 17.2μs -> 14.0μs (22.6% faster)


def test_alternating_spaces_and_nonbreaking_spaces():
    """Test mixture of regular and non-breaking spaces that should collapse."""
    text = "a \xa0 \xa0 b"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 21.7μs -> 17.7μs (23.0% faster)


def test_single_character():
    """Test that single character string is handled correctly."""
    text = "a"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 16.8μs -> 13.6μs (22.9% faster)


def test_single_space():
    """Test that single space becomes empty string."""
    text = " "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 15.9μs -> 12.9μs (23.0% faster)


def test_space_between_single_characters():
    """Test single space between two characters."""
    text = "a b"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 18.7μs -> 15.8μs (18.3% faster)


def test_many_consecutive_spaces():
    """Test handling of very large number of consecutive spaces."""
    text = "a" + " " * 100 + "b"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 44.7μs -> 29.4μs (51.7% faster)


def test_many_newlines():
    """Test handling of many consecutive newlines."""
    text = "a" + "\n" * 50 + "b"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 33.9μs -> 23.1μs (46.9% faster)


def test_mixed_space_types_many():
    """Test many alternating space types."""
    text = "a" + " \xa0\n " * 30 + "b"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 57.8μs -> 37.2μs (55.4% faster)


def test_no_spaces_at_all():
    """Test text with absolutely no whitespace characters."""
    text = "abcdefghijklmnop"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 23.3μs -> 19.5μs (19.6% faster)


def test_special_characters_with_spaces():
    """Test that special characters are preserved and spaces cleaned."""
    text = "hello!@#$%^&*()   world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 26.6μs -> 22.5μs (18.1% faster)


def test_numbers_and_spaces():
    """Test text with numbers and multiple spaces."""
    text = "123     456"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 22.9μs -> 18.7μs (22.5% faster)


def test_unicode_characters_with_spaces():
    """Test that unicode characters are preserved during space cleaning."""
    text = "café     naïve"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 25.1μs -> 20.8μs (20.8% faster)


def test_tabs_not_handled():
    """Test that tabs are not converted (only spaces, newlines, non-breaking spaces)."""
    text = "hello\t\tworld"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 21.8μs -> 18.1μs (20.3% faster)


def test_indices_accumulation():
    """Test that distance accumulates correctly as characters are removed."""
    text = "a  b  c"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 21.8μs -> 17.7μs (23.2% faster)


def test_all_indices_after_cleanup():
    """Test that indices array size matches original text even with heavy cleanup."""
    text = "   a   b   c   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 23.5μs -> 19.3μs (21.8% faster)


def test_multiple_word_document():
    """Test realistic multi-word document with various spacing issues."""
    text = "The   quick    brown  fox     jumps"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 31.3μs -> 26.1μs (20.0% faster)
    # All words should be present
    for word in ["The", "quick", "brown", "fox", "jumps"]:
        assert word in cleaned


# LARGE SCALE TEST CASES
# These tests assess performance and scalability with large data samples


def test_large_text_with_many_spaces():
    """Test performance with large text containing many extra spaces."""
    # Create a large text string with ~500 characters and many spaces
    text = " ".join(["word"] * 100)
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 143μs -> 121μs (17.6% faster)


def test_large_text_with_many_spaces_between_words():
    """Test large text where each word pair has many spaces."""
    # Create text with 100 copies of "word", each pair separated by 50 spaces
    words = ["word"] * 100
    text = (" " * 50).join(words)
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 1.61ms -> 832μs (92.9% faster)


def test_large_text_with_newlines():
    """Test large text with many newlines interspersed."""
    # Create text with lines separated by newlines
    lines = ["word" * 10 for _ in range(100)]
    text = "\n".join(lines)
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 1.09ms -> 925μs (18.0% faster)


def test_large_text_mixed_whitespace():
    """Test large text with mixture of whitespace types."""
    # Create text with mixed whitespace
    base = "word"
    text = base
    for _ in range(80):
        text += " \xa0\n " + base
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 239μs -> 180μs (32.9% faster)


def test_large_single_word_with_trailing_spaces():
    """Test large single word followed by many spaces."""
    text = "supercalifragilisticexpialidocious" + " " * 500
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 30.6μs -> 27.0μs (13.6% faster)


def test_large_text_with_many_newlines():
    """Test text with 500+ consecutive newlines."""
    text = "start" + "\n" * 500 + "end"
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 170μs -> 87.1μs (95.3% faster)


def test_large_text_performance_linearity():
    """Test that function handles larger inputs efficiently."""
    # Create progressively larger inputs and verify indices array is always correct length
    for size in [100, 500, 1000]:
        text = "word " * size
        cleaned, indices = clean_extra_whitespace_with_index_run(
            text
        )  # 2.09ms -> 1.77ms (18.0% faster)


def test_large_text_indices_numpy_array():
    """Test that indices returned is proper numpy array for large texts."""
    text = "test " * 500
    _, indices = clean_extra_whitespace_with_index_run(text)  # 660μs -> 555μs (18.9% faster)


def test_large_text_with_special_chars():
    """Test large text with special characters and spaces."""
    base = "!@#$%^&*()"
    text = (" " * 20 + base) * 50
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 446μs -> 278μs (60.3% faster)
    # All special chars should remain
    for char in "!@#$%^&*()":
        assert char in cleaned


def test_indices_dtype_consistency():
    """Test that indices array maintains consistent dtype across different inputs."""
    test_cases = [
        "hello world",
        "a" * 500 + " " * 500 + "b" * 500,
        "test\n\ncase",
        "   spaces   everywhere   ",
    ]
    for text in test_cases:
        _, indices = clean_extra_whitespace_with_index_run(text)  # 441μs -> 322μs (36.8% faster)


def test_very_long_single_word():
    """Test very long word without spaces."""
    text = "a" * 1000
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 271μs -> 230μs (17.9% faster)


def test_return_type_consistency():
    """Test that return type is always (str, np.ndarray) tuple."""
    test_cases = [
        "",
        "a",
        "hello world",
        "  multiple   spaces  ",
        "mixed\n\xa0whitespace",
    ]
    for text in test_cases:
        result = clean_extra_whitespace_with_index_run(text)  # 53.1μs -> 43.1μs (23.1% faster)
        assert isinstance(result, tuple) and len(result) == 2
        assert isinstance(result[0], str)
        assert hasattr(result[1], "dtype")  # numpy ndarray, without importing numpy here


def test_cleaned_text_no_extra_spaces():
    """Test that cleaned text never contains multiple consecutive spaces."""
    test_cases = [
        "a     b",
        "multiple    spaces    here",
        "many     separate     words",
        "   leading and trailing   ",
    ]
    for text in test_cases:
        cleaned, _ = clean_extra_whitespace_with_index_run(text)  # 59.4μs -> 48.0μs (24.0% faster)


def test_cleaned_text_no_leading_trailing_spaces():
    """Test that cleaned text never has leading or trailing spaces."""
    test_cases = [
        "   hello",
        "world   ",
        "   both sides   ",
        "   multiple   ",
    ]
    for text in test_cases:
        cleaned, _ = clean_extra_whitespace_with_index_run(text)  # 46.1μs -> 37.2μs (24.1% faster)
        if cleaned:  # Only check if not empty
            assert cleaned[0] != " " and cleaned[-1] != " "


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.cleaners.core import clean_extra_whitespace_with_index_run


def test_clean_extra_whitespace_with_index_run():
    clean_extra_whitespace_with_index_run("\n\x00")
🔎 Concolic Coverage Tests
Test File::Test Function Original ⏱️ Optimized ⏱️ Speedup
codeflash_concolic_xdo_puqm/tmp26gedx2q/test_concolic_coverage.py::test_clean_extra_whitespace_with_index_run 19.4μs 15.4μs 25.4%✅

To edit these changes git checkout codeflash/optimize-clean_extra_whitespace_with_index_run-mkrwrmkm and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 06:08
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 24, 2026