@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 18% (0.18x) speedup for calculate_percent_missing_text in unstructured/metrics/text_extraction.py

⏱️ Runtime : 7.27 milliseconds → 6.14 milliseconds (best of 45 runs)
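
The headline figure can be reproduced from the two runtimes. This is a minimal sketch, assuming speedup is defined as original time divided by optimized time, minus one. That definition is inferred from the numbers in this report, and `speedup_percent` is a hypothetical helper name.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup of `optimized` over `original`, in percent.

    Assumed definition: (original / optimized - 1) * 100.
    A negative value means the optimized version is slower.
    """
    return (original / optimized - 1.0) * 100.0


# Runtimes in milliseconds, taken from the line above.
overall = speedup_percent(7.27, 6.14)
print(f"{overall:.1f}%")  # 18.4%
```

The per-test percentages further down are presumably computed from unrounded timings, so recomputing them from the displayed microsecond values may differ in the last digit.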

📝 Explanation and details

The optimized code achieves an 18% speedup through two algorithmic improvements in the bag_of_words function and one dictionary-lookup change in calculate_percent_missing_text:

1. Replaced nested while-loop with single-pass enumeration

The original code used a manual while-loop with complex index manipulation (i, j) to scan through words, including an inner while-loop to concatenate consecutive single-character tokens. This approach required:

  • Repeated len(words) calls in loop conditions
  • Manual index incrementing and jumping (i = j)
  • Building intermediate incorrect_word strings that were often discarded

The optimized version uses Python's enumerate() for a single pass with direct indexing, eliminating the nested loop overhead and string concatenation entirely.

2. Streamlined single-character word detection

Instead of scanning ahead to concatenate consecutive single-character tokens (which the original logic then mostly rejected), the optimized code makes local adjacency checks:

  • prev_single = i > 0 and len(words[i - 1]) == 1
  • next_single = i + 1 < n and len(words[i + 1]) == 1

This only processes isolated single alphanumeric characters, matching the original behavior while avoiding string building overhead.
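
The equivalence claimed above can be sketched as follows. This is a minimal reconstruction from the description in this PR, not the code in `unstructured`: the names `bag_of_words_scan` and `bag_of_words_local` are hypothetical, and tokenization is simplified to `str.split()` on text that is assumed to be normalized already.

```python
from collections import Counter


def bag_of_words_scan(words: list[str]) -> Counter:
    """Look-ahead variant: concatenate each run of single-character tokens."""
    bow: Counter = Counter()
    i = 0
    while i < len(words):
        if len(words[i]) > 1:
            bow[words[i]] += 1
            i += 1
            continue
        # Collect the whole run of adjacent single-character tokens.
        j = i
        run = ""
        while j < len(words) and len(words[j]) == 1:
            run += words[j]
            j += 1
        # Keep the token only when the run is one lone alphanumeric character.
        if len(run) == 1 and run.isalnum():
            bow[run] += 1
        i = j  # jump past the run
    return bow


def bag_of_words_local(words: list[str]) -> Counter:
    """Single-pass variant: inspect only the immediate neighbours."""
    bow: Counter = Counter()
    n = len(words)
    for i, word in enumerate(words):
        if len(word) > 1:
            bow[word] += 1
            continue
        prev_single = i > 0 and len(words[i - 1]) == 1
        next_single = i + 1 < n and len(words[i + 1]) == 1
        # A single character counts only if neither neighbour is also single.
        if not prev_single and not next_single and word.isalnum():
            bow[word] += 1
    return bow


# Both variants should agree on runs, isolated characters, and empty input.
for text in ["h e l l o", "this is a test", "v2 2 3 3 3", "a", ""]:
    tokens = text.split()
    assert bag_of_words_scan(tokens) == bag_of_words_local(tokens)
```

A run of length one is exactly a single-character token with no single-character neighbour, which is why the local check can replace the look-ahead scan without building the intermediate string.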

3. Dictionary lookup optimization in calculate_percent_missing_text

Replaced the if source_word not in output_bow check followed by separate dictionary access with output_bow.get(source_word, 0), reducing dictionary lookups from two to one per iteration.
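
The lookup change can be illustrated in isolation. The sketch below is an assumption-laden reconstruction, not the library's code: the function names are hypothetical, and the `max(..., 0)` clamp is assumed from the tests in this report, which state that extra occurrences in the output are not penalized.

```python
def count_missing_two_lookups(source_bow: dict[str, int], output_bow: dict[str, int]) -> int:
    """Membership test followed by a separate indexing operation."""
    missing = 0
    for word, source_count in source_bow.items():
        if word not in output_bow:  # first lookup
            missing += source_count
        else:
            output_count = output_bow[word]  # second lookup
            missing += max(source_count - output_count, 0)
    return missing


def count_missing_get(source_bow: dict[str, int], output_bow: dict[str, int]) -> int:
    """Single lookup per word via dict.get with a default of 0."""
    missing = 0
    for word, source_count in source_bow.items():
        output_count = output_bow.get(word, 0)
        missing += max(source_count - output_count, 0)
    return missing


source_bow = {"hello": 3, "world": 1}
output_bow = {"hello": 2, "extra": 5}

# One "hello" and the only "world" are missing; "extra" is ignored.
assert count_missing_two_lookups(source_bow, output_bow) == 2
assert count_missing_get(source_bow, output_bow) == 2
```

With a default of 0, an absent word contributes its full source count, so the explicit membership branch becomes unnecessary.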

Performance impact based on workloads:

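From the line profiler, the original code spent 20.9% of its time in the while-loop condition and 18.6% checking word lengths. In the optimized version the corresponding shares are 22.1% for enumeration (which covers both iteration and indexing) and 20.8% for length checks. These shares are slightly higher, but they are fractions of a shorter total runtime, and the inner look-ahead loop with its string concatenation is gone entirely, which is where the net reduction in control-flow overhead comes from.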

The test results show the optimization is particularly effective for:

  • Large texts with repeated words: 31-42% faster (e.g., test_large_identical_texts, test_large_repeated_words_some_missing)
  • Documents with many unique words: 9-25% faster (e.g., test_large_text_with_variety, test_large_text_unique_words)
  • Text with minimal single-character tokens that would have triggered the expensive concatenation path

Since calculate_percent_missing_text is called in _process_document for document evaluation, this optimization directly benefits document processing pipelines where text extraction quality metrics are computed. The function is in a hot path during batch document evaluation, making the 18% improvement particularly valuable when processing large document sets.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 16 Passed |
| 🌀 Generated Regression Tests | 61 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| metrics/test_text_extraction.py::test_calculate_percent_missing_text | 181μs | 177μs | 2.44% ✅ |

🌀 Generated Regression Tests

from unstructured.metrics.text_extraction import calculate_percent_missing_text


def test_exact_match_is_zero_percent_missing():
    # When output exactly matches source (case-insensitive and punctuation removed by internals),
    # there should be 0.0 missing text.
    source = "Hello world"
    output = "hello world"
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 33.9μs -> 33.2μs (2.20% faster)


def test_output_has_extra_words_not_penalized():
    # If output contains all source words and has extra words, percent missing should still be 0.0
    source = "alpha beta"
    output = "alpha beta gamma gamma gamma"
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 35.8μs -> 34.3μs (4.52% faster)


def test_missing_words_counting_with_repeats():
    # Every token here is a single character adjacent to another single character, so by the
    # isolated-single-character rule described above bag_of_words keeps none of them.
    # The source bag is therefore empty and the expected result is 0.0, not 2/4 = 0.5.
    source = "a b b c"
    output = "a b"
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 28.0μs -> 28.4μs (1.29% slower)


def test_empty_source_returns_zero():
    # If the source is empty, nothing can be missing by definition -> 0.0
    codeflash_output = calculate_percent_missing_text(output="anything here", source="")
    result = codeflash_output  # 26.9μs -> 26.2μs (2.57% faster)


def test_none_inputs_are_treated_as_empty_strings():
    # Passing None for either argument should be treated as empty string (prepare_str behavior)
    codeflash_output = calculate_percent_missing_text(output=None, source=None)
    result_both_none = codeflash_output  # 22.1μs -> 22.2μs (0.373% slower)

    codeflash_output = calculate_percent_missing_text(output=None, source="some text")
    result_none_output = codeflash_output  # 23.8μs -> 24.3μs (1.84% slower)


def test_spaced_out_word_in_source_is_ignored_by_bow_logic():
    # The implementation's bag_of_words ignores multi-letter sequences formed by single-letter tokens
    # e.g., "h e l l o" in the source becomes a run of single-letter tokens that the current algorithm
    # does not add to the bag. Therefore, a source that consists solely of a spaced-out word is treated
    # as empty and yields 0.0 missing text.
    source = "h e l l o"
    output = ""  # output has nothing
    # Since the source's spaced-out word is ignored by bag_of_words, total_source_word_count == 0 -> returns 0.0
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 27.5μs -> 27.1μs (1.43% faster)


def test_punctuation_and_case_are_normalized_in_bow():
    # Punctuation should be removed and comparisons are case-insensitive due to lowercasing in bag_of_words.
    source = "Hello, World!"
    output = "hello world"
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 34.6μs -> 33.6μs (3.03% faster)


def test_single_character_alphanumeric_words_are_handled():
    # Single-character alphanumeric tokens (digits or letters) are counted only when isolated.
    # Here "2 3 3 3" and "2 3 3" are runs of adjacent single characters, so only "v2" is counted:
    # source bag {v2: 1}, output bag {v2: 1} -> nothing missing -> expected 0.0.
    source = "v2 2 3 3 3"
    output = "v2 2 3 3"
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 35.0μs -> 35.0μs (0.140% faster)


def test_rounding_to_three_decimal_places():
    # Intended to exercise rounding to three decimal places (1/3 -> 0.333).
    # Note: "a a a" is a run of adjacent single-character tokens, which bag_of_words drops,
    # so the source bag is empty and this input is expected to yield 0.0.
    source = "a a a"
    output = "a a"
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 27.7μs -> 27.6μs (0.478% faster)


def test_large_scale_half_missing():
    # Build a sizeable source with 500 unique tokens, and output that contains exactly the first half.
    # Missing should be exactly 0.5 (250/500).
    n = 500  # number of unique tokens in the source
    # Create distinct tokens word0 ... word499
    source_tokens = [f"word{i}" for i in range(n)]
    output_tokens = [f"word{i}" for i in range(n // 2)]  # first half present in output
    source = " ".join(source_tokens)
    output = " ".join(output_tokens)
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 414μs -> 382μs (8.27% faster)


def test_output_duplication_exceeding_source_counts_not_penalized():
    # If the output duplicates a word more times than in the source, it should not reduce the percentage
    # (i.e., duplication in output does not penalize; missing is based only on source counts).
    source = "repeat repeat unique"
    # output duplicates 'repeat' many times but still has 'unique' -> nothing missing.
    output = "repeat repeat repeat repeat unique"
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 36.2μs -> 34.8μs (3.92% faster)


def test_output_missing_all_source_words_returns_one():
    # When the output contains none of the source words, the fraction should be 1.0 (100% missing).
    source = "one two three four"
    output = "alpha beta"
    codeflash_output = calculate_percent_missing_text(output=output, source=source)
    result = codeflash_output  # 34.4μs -> 34.1μs (0.915% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.metrics.text_extraction import calculate_percent_missing_text


class TestCalculatePercentMissingTextBasic:
    """Basic test cases for calculate_percent_missing_text function."""

    def test_identical_texts_no_missing(self):
        """Test that identical output and source texts result in 0% missing."""
        output = "hello world"
        source = "hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 34.0μs -> 33.1μs (2.88% faster)

    def test_output_contains_all_source_words_plus_extra(self):
        """Test that output with extra words still shows 0% missing."""
        output = "hello world foo bar"
        source = "hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 34.4μs -> 33.3μs (3.36% faster)

    def test_single_word_missing_completely(self):
        """Test when one word is completely missing from output."""
        output = "hello"
        source = "hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 32.2μs -> 32.2μs (0.267% faster)

    def test_both_words_missing(self):
        """Test when output is empty but source has content."""
        output = ""
        source = "hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 31.2μs -> 31.2μs (0.016% faster)

    def test_output_none_source_has_content(self):
        """Test when output is None but source has content."""
        output = None
        source = "hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 31.6μs -> 31.4μs (0.708% faster)

    def test_source_none_output_has_content(self):
        """Test when source is None but output has content."""
        output = "hello world"
        source = None
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 26.4μs -> 25.4μs (3.98% faster)

    def test_both_none(self):
        """Test when both output and source are None."""
        output = None
        source = None
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 21.8μs -> 21.6μs (0.860% faster)

    def test_empty_strings(self):
        """Test when both output and source are empty strings."""
        output = ""
        source = ""
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 21.4μs -> 21.6μs (0.988% slower)

    def test_case_insensitive_matching(self):
        """Test that word matching is case-insensitive."""
        output = "HELLO WORLD"
        source = "hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 33.7μs -> 32.7μs (3.02% faster)

    def test_partial_word_missing(self):
        """Test when some instances of a word are missing but not all."""
        output = "hello hello world"
        source = "hello hello hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 35.5μs -> 33.9μs (4.83% faster)

    def test_one_word_repeated_completely_missing(self):
        """Test when all instances of a word are missing."""
        output = "foo bar"
        source = "hello hello hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 34.1μs -> 33.3μs (2.25% faster)

    def test_punctuation_handling(self):
        """Test that punctuation is properly handled."""
        output = "hello world"
        source = "hello, world!"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 34.1μs -> 32.8μs (3.85% faster)

    def test_apostrophe_preserved_in_words(self):
        """Test that apostrophes within words are preserved."""
        output = "don't worry"
        source = "don't worry"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 33.0μs -> 32.3μs (2.36% faster)

    def test_single_character_words_skipped(self):
        """Test that a run of adjacent single-character words in the output is ignored."""
        output = "a b c hello"
        source = "hello"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 33.4μs -> 32.4μs (2.92% faster)


class TestCalculatePercentMissingTextEdge:
    """Edge case tests for calculate_percent_missing_text function."""

    def test_whitespace_variations(self):
        """Test that different whitespace is handled consistently."""
        output = "hello    world"
        source = "hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 33.2μs -> 32.8μs (1.33% faster)

    def test_newlines_and_tabs(self):
        """Test that newlines and tabs are treated as whitespace."""
        output = "hello\nworld\tthere"
        source = "hello world there"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 35.0μs -> 33.9μs (3.18% faster)

    def test_unicode_characters(self):
        """Test handling of unicode characters."""
        output = "café résumé"
        source = "café résumé"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 35.3μs -> 34.7μs (1.78% faster)

    def test_unicode_bullets_removed(self):
        """Test that unicode bullets are properly removed."""
        output = "this is great"
        source = "● this is great"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 40.0μs -> 38.5μs (3.86% faster)

    def test_source_only_whitespace(self):
        """Test when source contains only whitespace."""
        output = "hello world"
        source = "   "
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 26.0μs -> 24.9μs (4.46% faster)

    def test_output_only_whitespace(self):
        """Test when output contains only whitespace but source has words."""
        output = "   "
        source = "hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 30.9μs -> 31.3μs (1.25% slower)

    def test_numbers_as_words(self):
        """Test that numbers are treated as valid words."""
        output = "hello 123"
        source = "hello 123"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 32.8μs -> 32.2μs (2.09% faster)

    def test_numbers_missing(self):
        """Test when numeric content is missing."""
        output = "hello"
        source = "hello 123"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 32.4μs -> 32.0μs (1.29% faster)

    def test_hyphenated_words_preserved(self):
        """Test that hyphenated words are preserved (hyphen not removed)."""
        output = "well-known"
        source = "well-known"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 30.8μs -> 30.6μs (0.789% faster)

    def test_very_long_word(self):
        """Test handling of very long words."""
        long_word = "a" * 100
        output = long_word
        source = long_word
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 31.6μs -> 30.2μs (4.49% faster)

    def test_many_short_words(self):
        """Test handling many short (length 1) words."""
        output = ""
        source = "a b c d e"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 26.9μs -> 26.9μs (0.201% slower)

    def test_mixed_case_with_numbers(self):
        """Test mixed case and numbers together."""
        output = "Test123 Hello456"
        source = "test123 hello456"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 34.4μs -> 32.9μs (4.46% faster)

    def test_special_punctuation_with_apostrophe(self):
        """Test special handling where apostrophe is preserved."""
        output = "it's don't won't"
        source = "it's don't won't"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 34.7μs -> 33.9μs (2.30% faster)

    def test_result_bounded_to_one(self):
        """Test that result is never greater than 1.0 (100%)."""
        output = ""
        source = "this is a test"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 33.7μs -> 33.9μs (0.552% slower)

    def test_result_bounded_to_zero(self):
        """Test that result is never less than 0.0 (0%)."""
        output = "this is a test with extra words"
        source = "this is a test"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 39.5μs -> 37.5μs (5.28% faster)

    def test_empty_string_whitespace_equivalence(self):
        """Test that empty string and whitespace-only strings are equivalent."""
        output1 = ""
        output2 = "   \n\t  "
        source = "hello"
        codeflash_output = calculate_percent_missing_text(output1, source)
        result1 = codeflash_output  # 29.7μs -> 29.8μs (0.171% slower)
        codeflash_output = calculate_percent_missing_text(output2, source)
        result2 = codeflash_output  # 17.7μs -> 18.2μs (2.46% slower)

    def test_rounding_to_three_decimal_places(self):
        """Test rounding to 3 decimal places (all tokens are adjacent single characters, so 0.0 is expected)."""
        output = "a b"
        source = "a b c"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 27.0μs -> 27.2μs (0.930% slower)

    def test_exact_half_missing(self):
        """Test when exactly 50% is missing."""
        output = "hello"
        source = "hello world"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 32.7μs -> 32.3μs (1.32% faster)

    def test_one_third_missing(self):
        """Test one-third missing (all tokens are adjacent single characters, so 0.0 is expected)."""
        output = "a b"
        source = "a b c"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 27.2μs -> 27.3μs (0.377% slower)

    def test_two_thirds_missing(self):
        """Test two-thirds missing (the source tokens are adjacent single characters, so 0.0 is expected)."""
        output = "a"
        source = "a b c"
        codeflash_output = calculate_percent_missing_text(output, source)
        result = codeflash_output  # 27.6μs -> 27.3μs (1.14% faster)


class TestCalculatePercentMissingTextLargeScale:
    """Large scale test cases for performance and scalability."""

    def test_large_identical_texts(self):
        """Test performance with large identical texts."""
        words = ["word"] * 500
        text = " ".join(words)
        codeflash_output = calculate_percent_missing_text(text, text)
        result = codeflash_output  # 412μs -> 291μs (41.7% faster)

    def test_large_text_all_missing(self):
        """Test performance when large source text is completely missing from output."""
        source_words = ["word"] * 500
        source_text = " ".join(source_words)
        output_text = ""
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 213μs -> 160μs (33.1% faster)

    def test_large_text_partial_missing(self):
        """Test performance when half of large text is missing."""
        source_words = ["word"] * 1000
        source_text = " ".join(source_words)
        output_words = ["word"] * 500
        output_text = " ".join(output_words)
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 605μs -> 440μs (37.5% faster)

    def test_large_text_with_variety(self):
        """Test performance with large text containing many different words."""
        source_words = [f"word{i}" for i in range(500)]
        source_text = " ".join(source_words)
        output_words = [f"word{i}" for i in range(250)]
        output_text = " ".join(output_words)
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 409μs -> 374μs (9.57% faster)

    def test_large_text_with_repetition(self):
        """Test performance with large text containing repeated words."""
        unique_words = ["apple", "banana", "cherry", "date", "elderberry"]
        source_words = unique_words * 100
        source_text = " ".join(source_words)
        output_words = unique_words * 50
        output_text = " ".join(output_words)
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 322μs -> 241μs (33.8% faster)

    def test_large_source_small_output(self):
        """Test when source is much larger than output."""
        source_words = [f"word{i}" for i in range(500)]
        source_text = " ".join(source_words)
        output_text = "word0"
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 258μs -> 286μs (9.71% slower)

    def test_large_output_small_source(self):
        """Test when output is much larger than source."""
        source_text = "hello world"
        output_words = [f"word{i}" for i in range(500)]
        output_words.extend(["hello", "world"])
        output_text = " ".join(output_words)
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 214μs -> 162μs (31.5% faster)

    def test_large_text_minimal_difference(self):
        """Test large texts with minimal difference."""
        base_words = ["word"] * 500
        source_text = " ".join(base_words)
        output_text = " ".join(base_words[:-1])
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 393μs -> 288μs (36.4% faster)

    def test_large_text_unique_words(self):
        """Test large text where every word is unique."""
        source_words = [f"unique{i}" for i in range(500)]
        source_text = " ".join(source_words)
        output_words = [f"unique{i}" for i in range(500)]
        output_text = " ".join(output_words)
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 564μs -> 451μs (25.3% faster)

    def test_large_text_none_of_unique_words_present(self):
        """Test large text where none of the unique words are present."""
        source_words = [f"source{i}" for i in range(500)]
        source_text = " ".join(source_words)
        output_words = [f"output{i}" for i in range(500)]
        output_text = " ".join(output_words)
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 452μs -> 438μs (3.38% faster)

    def test_large_text_with_punctuation(self):
        """Test large text with extensive punctuation."""
        base_text = "hello, world! this is a test. it should work. " * 50
        codeflash_output = calculate_percent_missing_text(base_text, base_text)
        result = codeflash_output  # 419μs -> 312μs (33.9% faster)

    def test_large_repeated_words_some_missing(self):
        """Test large text with repeated words where some instances are missing."""
        source_words = ["test"] * 1000
        source_text = " ".join(source_words)
        output_words = ["test"] * 500
        output_text = " ".join(output_words)
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 580μs -> 420μs (38.3% faster)

    def test_large_text_mixed_case_and_punctuation(self):
        """Test large text with mixed case and extensive punctuation."""
        base_words = ["Hello", "World", "Test", "Case"] * 100
        source_text = ", ".join(base_words)
        output_text = ", ".join(base_words)
        codeflash_output = calculate_percent_missing_text(output_text, source_text)
        result = codeflash_output  # 339μs -> 258μs (31.1% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.metrics.text_extraction import calculate_percent_missing_text


def test_calculate_percent_missing_text():
    calculate_percent_missing_text("", "")

🔎 Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_xdo_puqm/tmpj3sc54h5/test_concolic_coverage.py::test_calculate_percent_missing_text | 21.8μs | 21.9μs | -0.192% ⚠️ |

To edit these changes, `git checkout codeflash/optimize-calculate_percent_missing_text-mks2sdnk` and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 08:56
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 24, 2026