⚡️ Speed up function calculate_percent_missing_text by 18%
#271
📄 18% (0.18x) speedup for `calculate_percent_missing_text` in `unstructured/metrics/text_extraction.py`

⏱️ Runtime: 7.27 milliseconds → 6.14 milliseconds (best of 45 runs)

📝 Explanation and details
The optimized code achieves an 18% speedup through two key algorithmic improvements in the `bag_of_words` function, plus a dictionary lookup optimization in `calculate_percent_missing_text`:

1. Replaced nested while-loop with single-pass enumeration
The original code used a manual while-loop with complex index manipulation (`i`, `j`) to scan through words, including an inner while-loop to concatenate consecutive single-character tokens. This approach required:

- `len(words)` calls in the loop conditions
- manual index resets (`i = j`)
- building `incorrect_word` strings that were often discarded

The optimized version uses Python's `enumerate()` for a single pass with direct indexing, eliminating the nested-loop overhead and string concatenation entirely, as in the sketch below.
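For concreteness, here is a minimal sketch contrasting the two loop shapes. The function and variable names (`scan_with_while`, `scan_with_enumerate`, `words`, `counts`) are illustrative stand-ins, not the library's actual `bag_of_words` implementation.

```python
# Illustrative skeletons only; the counting logic is simplified.

def scan_with_while(words):
    counts = {}
    i = 0
    while i < len(words):                      # len(words) re-evaluated each iteration
        if len(words[i]) == 1:
            j = i + 1
            incorrect_word = words[i]
            while j < len(words) and len(words[j]) == 1:
                incorrect_word += words[j]     # concatenation built up, then often discarded
                j += 1
            i = j                              # manual index reset (i = j)
            continue
        counts[words[i]] = counts.get(words[i], 0) + 1
        i += 1
    return counts


def scan_with_enumerate(words):
    counts = {}
    for i, word in enumerate(words):           # single pass, direct indexing
        if len(word) == 1:
            continue                           # single characters handled locally (see item 2)
        counts[word] = counts.get(word, 0) + 1
    return counts
```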
2. Streamlined single-character word detection

Instead of scanning ahead to concatenate consecutive single-character tokens (which the original logic then mostly rejected), the optimized code makes local adjacency checks:

- `prev_single = i > 0 and len(words[i - 1]) == 1`
- `next_single = i + 1 < n and len(words[i + 1]) == 1`

This processes only isolated single alphanumeric characters, matching the original behavior while avoiding string-building overhead. A small sketch of this check follows.
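The helper below illustrates how such a check decides whether a single-character token should be counted. The function name is hypothetical and does not exist in the library; only the two adjacency tests mirror the lines quoted above.

```python
def keep_isolated_single_char(words, i):
    """Return True if words[i] is a lone single alphanumeric character.

    Hypothetical helper for illustration; the adjacency tests match the
    quoted optimization, but the surrounding function is not library code.
    """
    word = words[i]
    if len(word) != 1 or not word.isalnum():
        return False
    n = len(words)
    prev_single = i > 0 and len(words[i - 1]) == 1
    next_single = i + 1 < n and len(words[i + 1]) == 1
    return not prev_single and not next_single
```

For example, in `['a', 'quick', 'fix']` the leading `'a'` is kept, while in `['a', 'b', 'quick']` both stray characters are skipped.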
3. Dictionary lookup optimization in `calculate_percent_missing_text`

Replaced the `if source_word not in output_bow` check followed by a separate dictionary access with `output_bow.get(source_word, 0)`, reducing dictionary lookups from two to one per iteration, as sketched below.
Performance impact based on workloads:

From the line profiler, the original code spent 20.9% of its time in the while-loop condition and 18.6% checking word lengths. The optimized version spends 22.1% on enumeration (which covers both iteration and indexing) and 20.8% on length checks, but against a smaller total runtime, so control-flow overhead drops in absolute terms.
The test results show the optimization is particularly effective for:

- `test_large_identical_texts`, `test_large_repeated_words_some_missing`
- `test_large_scale_half_missing`, `test_large_text_unique_words`

Since `calculate_percent_missing_text` is called in `_process_document` for document evaluation, this optimization directly benefits document-processing pipelines where text extraction quality metrics are computed. The function is in a hot path during batch document evaluation, making the 18% improvement particularly valuable when processing large document sets.

✅ Correctness verification report:
⚙️ Existing Unit Tests
- `metrics/test_text_extraction.py::test_calculate_percent_missing_text`

🌀 Generated Regression Tests

🔎 Concolic Coverage Tests
- `codeflash_concolic_xdo_puqm/tmpj3sc54h5/test_concolic_coverage.py::test_calculate_percent_missing_text`

To edit these changes, run `git checkout codeflash/optimize-calculate_percent_missing_text-mks2sdnk` and push.