⚡️ Speed up function clean_prefix by 104%
#257
📄 104% (1.04x) speedup for `clean_prefix` in `unstructured/cleaners/core.py`
⏱️ Runtime: 1.50 milliseconds → 735 microseconds (best of 32 runs)
📝 Explanation and details
The optimized code achieves a 104% speedup (about 2x faster) by avoiding expensive regex compilation and matching in the common case where the pattern is a simple literal string rather than a true regular expression.
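Key Optimization Strategy
What changed:
- Checks whether the pattern contains any regex metacharacters (`.^$*+?{}[]\\|()`). If not, treats it as a literal string.
- For literal patterns, uses `str.startswith()` (case-sensitive) or a `str.casefold()` comparison (case-insensitive) with simple slicing instead of regex substitution.
- For regex patterns, uses `compiled.sub()` instead of `re.sub()`.
Why This Is Faster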
Line profiler reveals the bottleneck: in the original code, 99.7% of the time (80 ms out of 80.2 ms) was spent in the single `re.sub()` call. This is because:
- Even literal patterns such as `"PREFIX"` go through the full regex matching machinery
The optimization eliminates this overhead by:
- `text.startswith(pattern)` followed by `text[plen:]` is orders of magnitude faster than regex matching (pure string operations vs. state machine execution)
Performance by Test Case Type
Best improvements (100-1900% faster):
- `test_basic_empty_pattern`, `test_edge_empty_and_empty`
- `test_large_scale_very_long_text`, `test_large_scale_long_text_no_match`
- Cases involving `casefold()`
Slower cases (14-85% slower):
The optimization trades a small penalty on regex patterns for massive gains on literal strings, which represent the majority of real-world usage patterns based on the test suite.
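The strategy described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the name `clean_prefix_sketch`, the `_REGEX_META` constant, and the assumed signature (`text`, `pattern`, `ignore_case`, `strip`) are hypothetical, and the sketch assumes that a `casefold()` comparison agrees with `re.IGNORECASE` for the inputs of interest (true for ordinary ASCII text, not guaranteed for every Unicode string).

```python
import re

# Metacharacters that mark a pattern as a real regex (list taken from the
# description above). The constant name is a hypothetical choice.
_REGEX_META = frozenset(".^$*+?{}[]\\|()")


def clean_prefix_sketch(
    text: str, pattern: str, ignore_case: bool = False, strip: bool = True
) -> str:
    """Hypothetical sketch of a literal-prefix fast path with a regex fallback."""
    if not any(ch in _REGEX_META for ch in pattern):
        # Literal pattern: compare the prefix directly and slice it off,
        # without involving the regex engine.
        plen = len(pattern)
        if ignore_case:
            # Assumption: casefold() equality matches re.IGNORECASE here.
            matched = text[:plen].casefold() == pattern.casefold()
        else:
            matched = text.startswith(pattern)
        cleaned = text[plen:] if matched else text
    else:
        # True regex: anchor at the start and substitute via a compiled pattern.
        flags = re.IGNORECASE if ignore_case else 0
        compiled = re.compile(rf"^{pattern}", flags)
        cleaned = compiled.sub("", text)
    return cleaned.lstrip() if strip else cleaned


# Literal pattern takes the fast path; a pattern with metacharacters falls back.
literal_result = clean_prefix_sketch("SUMMARY: A great paper", "SUMMARY:")
regex_result = clean_prefix_sketch("(1) first item", r"\(\d+\)")
```

Here `literal_result` is `"A great paper"` and `regex_result` is `"first item"`. A pattern that contains any metacharacter, even an escaped one, takes the regex branch, which is consistent with the slowdown reported for regex patterns: those calls pay for the metacharacter scan on top of the regex work.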
Impact Assessment
Since `function_references` is not available, the optimization's value depends on how often `clean_prefix()` is called.
The optimization is safe and behavior-preserving: all test cases pass with identical output, and the regex fallback ensures backward compatibility for complex patterns.
✅ Correctness verification report:
⚙️ Existing Unit Tests
`cleaners/test_core.py::test_clean_prefix`
🌀 Generated Regression Tests
To edit these changes, run `git checkout codeflash/optimize-clean_prefix-mkrvyyun` and push.