⚡️ Speed up function clean_extra_whitespace_with_index_run by 33%
#260
📄 33% (0.33x) speedup for clean_extra_whitespace_with_index_run in unstructured/cleaners/core.py
⏱️ Runtime: 8.77 milliseconds → 6.58 milliseconds (best of 111 runs)
📝 Explanation and details
The optimized code achieves a 33% speedup through three key optimizations:
1. Module-Level Precompilation (~0.6ms savings)
- Before: `re.sub(r"([ ]{2,})", ...)` pays the pattern lookup/compile overhead on every call (~1.4ms in profiler)
- After: `_MULTI_SPACE_RE` and `_TRANSLATE_TABLE` are defined once at module level

2. Iterator-Based Loop (~1.5ms savings)
- Before: `while` loop with manual indexing
  - Repeated indexing of `text[original_index]` and `cleaned_text[cleaned_index]`
  - `c_orig in ws_chars` on the hot path
- After: `for c_orig in txt` with local bindings
  - Iteration avoids repeated indexing (`text[original_index]`)
  - Local bindings (`txt`, `ct`) reduce global lookups
  - Equality check (`c_orig == "\xa0"`) is faster than set membership

3. Why It Matters
Looking at function_references, this function is called inside a loop that processes PDF pages during text extraction, which makes it a hot path in PDF processing.
4. Test Performance Characteristics
The optimization shows its strongest gains in these tests:
- test_many_consecutive_spaces (51.7% faster)
- test_mixed_space_types_many (55.4% faster)
- test_large_text_with_many_spaces_between_words (92.9% faster)

These match real-world PDF text extraction patterns, where OCR or formatting often introduces irregular whitespace.
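As a rough illustration of the two optimizations described in sections 1 and 2, the sketch below precompiles the pattern and translation table at module level and walks the original text with a plain `for` loop. It is not the code from this PR: `_MULTI_SPACE_RE` and `_TRANSLATE_TABLE` borrow the names mentioned above, but their exact definitions here are assumptions, and `clean_whitespace_sketch` and `count_dropped_chars_sketch` are hypothetical helpers that only mimic the general shape of the function.

```python
import re

# Built once at import time instead of on every call.
# The names follow the PR description; the definitions are assumptions.
_MULTI_SPACE_RE = re.compile(r" {2,}")
_TRANSLATE_TABLE = str.maketrans({"\xa0": " ", "\n": " "})


def clean_whitespace_sketch(text: str) -> str:
    """Map non-breaking spaces and newlines to spaces, collapse runs, strip ends."""
    cleaned = text.translate(_TRANSLATE_TABLE)
    cleaned = _MULTI_SPACE_RE.sub(" ", cleaned)
    return cleaned.strip()


def count_dropped_chars_sketch(text: str) -> int:
    """Count characters of `text` that have no counterpart in the cleaned text.

    Hypothetical helper: it only demonstrates the iterator-based loop with
    local bindings, not the index bookkeeping of the real function.
    """
    ct = clean_whitespace_sketch(text)  # local binding for the cleaned text
    n = len(ct)
    cleaned_index = 0
    dropped = 0
    for c_orig in text:  # iterate directly instead of indexing text[i]
        if cleaned_index < n:
            c_clean = ct[cleaned_index]
            # Plain equality checks instead of set membership or re.match.
            if c_orig == c_clean or (
                c_clean == " " and (c_orig == "\xa0" or c_orig == "\n")
            ):
                cleaned_index += 1
                continue
        dropped += 1
    return dropped


sample = "  a  b\xa0\xa0c  "
print(repr(clean_whitespace_sketch(sample)))
print(count_dropped_chars_sketch(sample))
```

To compare this against a per-call `re.sub` variant on whitespace-heavy input, the standard-library `timeit` module is one option; the figures quoted in this PR come from Codeflash's own benchmark runs, not from this sketch.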
✅ Correctness verification report:
🌀 Generated Regression Tests
🔎 Concolic Coverage Tests
codeflash_concolic_xdo_puqm/tmp26gedx2q/test_concolic_coverage.py::test_clean_extra_whitespace_with_index_run

To edit these changes, git checkout codeflash/optimize-clean_extra_whitespace_with_index_run-mkrwrmkm and push.