⚡️ Speed up function `_uniquity_file` by 406% #276

codeflash-ai · 2026-01-24T09:55:17Z

📄 406% (4.06x) speedup for `_uniquity_file` in `unstructured/metrics/utils.py`

⏱️ Runtime : 8.79 milliseconds → 1.74 milliseconds (best of 125 runs)
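
The headline percentage appears to follow from the two runtimes as `(original / optimized - 1) * 100`. That formula is inferred from the numbers, not stated in this PR, and the helper name `percent_faster` is hypothetical. A minimal sketch of the arithmetic:

```python
def percent_faster(original: float, optimized: float) -> float:
    """Relative speedup in percent: (original / optimized - 1) * 100."""
    return (original / optimized - 1.0) * 100.0


# Runtimes from the summary line, both in milliseconds.
headline = percent_faster(8.79, 1.74)
print(f"{headline:.1f}% faster")  # prints "405.2% faster"
```

The rounded runtimes give about 405%, so the reported 406% was presumably computed from unrounded timings.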

📝 Explanation and details

The optimization replaces expensive regex operations with fast string operations, delivering a 406% speedup by eliminating the main performance bottlenecks in the original code.

Key Performance Changes:

1. **Eliminated regex pattern matching** - The original code used `re.match(pattern, f)` on every file in the list, then sorted matching files with a regex-based key function (`_sorting_key`). The optimized version uses simple string operations (`startswith()`, `endswith()`, `isdigit()`) to identify matching files, which are significantly faster than compiling and executing regex patterns.
2. **Removed redundant sorting** - The original sorted all matching files by `_sorting_key`, then extracted numbers from them again using `re.search()`. The optimized version directly extracts numbers during the initial scan and sorts only the integer list (which is much smaller than the file list).
3. **Single-pass extraction** - Instead of two regex passes (once for filtering, once for number extraction), the optimized code extracts numbers in a single pass using string slicing and `isdigit()`.
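
The single-pass scan described in these points can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `extract_suffix_numbers` is hypothetical, and it swaps in `isdecimal()` for `isdigit()` because `isdecimal()` only accepts strings that `int()` can parse.

```python
from typing import List


def extract_suffix_numbers(file_list: List[str], target_filename: str) -> List[int]:
    """Collect N from names shaped like '<stem> (N).<ext>' using string operations only."""
    # Split on the last dot; a name without a dot raises ValueError on unpacking.
    stem, ext = target_filename.rsplit(".", 1)
    prefix = stem + " ("
    suffix = ")." + ext

    numbers = []
    for name in file_list:
        if name.startswith(prefix) and name.endswith(suffix):
            # Whatever sits between the prefix and the suffix must be the number.
            middle = name[len(prefix):len(name) - len(suffix)]
            if middle.isdecimal():  # assumption: swapped in for isdigit()
                numbers.append(int(middle))

    # Only the integers are sorted, never the filenames themselves.
    numbers.sort()
    return numbers


print(extract_suffix_numbers(
    ["data (2).csv", "data (10).csv", "data (1).csv", "irrelevant.csv"],
    "data.csv",
))  # prints [1, 2, 10]
```

The base file itself (`data.csv`) never matches the prefix, so it contributes no number, which is consistent with the gap search starting at 1.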

Why This Works:

- String methods like `startswith()` and `endswith()` are implemented in C and operate on contiguous memory, so they can be 10-20x faster than regex for simple prefix/suffix checks, although the exact ratio depends on the pattern and the input
- `isdigit()` is faster than regex `\d+` pattern matching for validating numeric strings
- Sorting a list of integers is much cheaper than sorting strings with a custom key function that calls regex
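
A quick way to check these claims locally is a small `timeit` comparison. The sketch below is an assumption-laden harness, not a benchmark from this PR: the pattern, the file names, and the function names are made up for illustration, and the measured ratio will vary by machine and Python version.

```python
import re
import timeit

# 499 numbered duplicates plus the base file, similar in shape to the large tests.
names = [f"document ({i}).txt" for i in range(1, 500)] + ["document.txt"]
pattern = re.compile(r"^document(?: \((\d+)\))?\.txt$")


def count_with_regex() -> int:
    # Precompiled pattern, so this measures matching only, not compilation.
    return sum(1 for n in names if pattern.match(n))


def count_with_str_methods() -> int:
    # Prefix/suffix check only; it does not validate the number in between.
    return sum(1 for n in names if n.startswith("document") and n.endswith(".txt"))


# Both approaches must agree before their timings are worth comparing.
assert count_with_regex() == count_with_str_methods() == 500

regex_s = timeit.timeit(count_with_regex, number=200)
str_s = timeit.timeit(count_with_str_methods, number=200)
print(f"regex: {regex_s:.4f}s  str methods: {str_s:.4f}s  ratio: {regex_s / str_s:.1f}x")
```

Because the string check here skips number validation, it slightly favors the string side; a stricter comparison would include the `isdigit()` step as well.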

Impact Based on Context:
The function is called by `_get_non_duplicated_filename()`, which uses `os.listdir()`, so it runs during file operations that may involve many files. The test results show the largest improvements on long file lists (roughly 437-446% faster with about 500 files), which makes this optimization particularly valuable when:

- Processing directories with many duplicate files (common in data processing pipelines)
- Generating unique filenames in batch operations
- Working with archival or versioned file systems

The optimization preserves exact behavior including edge cases (files with numbers in names, special characters, multiple dots) while being most effective on larger file lists where regex overhead compounds.
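
One corner case worth checking against the "exact behavior" claim is how the digit validators differ on non-ASCII input. The sketch below compares regex `\d+`, `str.isdigit()`, `str.isdecimal()`, and `int()` on a few sample strings. The helper names are hypothetical, and whether the optimized function is actually affected depends on code not shown here.

```python
import re


def regex_accepts(text: str) -> bool:
    """True when the whole string matches \\d+ (Unicode decimal digits)."""
    return re.fullmatch(r"\d+", text) is not None


def int_accepts(text: str) -> bool:
    """True when int() can parse the string."""
    try:
        int(text)
        return True
    except ValueError:
        return False


samples = ["7", "01", "1000", "", "²"]
for s in samples:
    print(repr(s), regex_accepts(s), s.isdigit(), s.isdecimal(), int_accepts(s))

# On these samples, \d+ agrees with isdecimal().
assert all(regex_accepts(s) == s.isdecimal() for s in samples)
# isdigit() accepts the superscript two, which int() rejects.
assert "²".isdigit() and not int_accepts("²")
```

If the optimized code calls `int()` on everything `isdigit()` accepts, a name such as `document (²).txt` could raise `ValueError` where the regex version would simply skip the file; using `isdecimal()` would avoid that.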

✅ Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | ✅ 2 Passed |
| 🌀 Generated Regression Tests | ✅ 51 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Click to see Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|--------------------------|-------------|--------------|---------|
| `metrics/test_utils.py::test_uniquity_file` | 26.0μs | 7.83μs | 232%✅ |

🌀 Click to see Generated Regression Tests

import pytest  # used for our unit tests

from unstructured.metrics.utils import _uniquity_file


def test_basic_no_duplicates_returns_one():
    # Basic: when no files in the directory match the target, the first available suffix is (1).
    file_list = ["other.txt", "another.doc"]  # no "report.txt" present
    codeflash_output = _uniquity_file(file_list, "report.txt")
    result = codeflash_output  # 12.7μs -> 4.62μs (176% faster)


def test_basic_existing_base_name_gets_one():
    # Basic: when the exact filename and its (1) variant both exist, the next free suffix is returned.
    file_list = ["summary.txt", "summary (1).txt"]  # base exists and also (1)
    # Since (1) is already taken, the smallest missing suffix here is (2)
    codeflash_output = _uniquity_file(file_list, "summary.txt")
    result = codeflash_output  # 22.7μs -> 6.61μs (244% faster)


def test_basic_unsorted_input_with_various_numbers():
    # Basic: input list is unsorted; sorting should be determined by numeric values within parentheses.
    file_list = ["data (2).csv", "data (10).csv", "data (1).csv", "irrelevant.csv"]
    # Expect the smallest missing positive integer not present: 1 and 2 present, so return (3).
    codeflash_output = _uniquity_file(file_list, "data.csv")
    result = codeflash_output  # 27.8μs -> 8.52μs (226% faster)


def test_numbers_in_base_filename_are_handled():
    # Edge: base filename contains digits; those digits should not be mistaken for the "suffix" number.
    # Example: "file2.txt" is the base name; duplicates like "file2 (1).txt" should be recognized.
    file_list = ["file2.txt", "file2 (1).txt", "file2 (3).txt"]
    # Since (1) exists, (2) is missing -> expect (2)
    codeflash_output = _uniquity_file(file_list, "file2.txt")
    result = codeflash_output  # 26.2μs -> 7.57μs (246% faster)


def test_files_with_extra_text_do_not_match_pattern():
    # Edge: filenames that include the base name but additional text before the parentheses should not match.
    file_list = [
        "note.txt",
        "note draft (1).txt",  # should NOT match because it's not exactly "note (1).txt"
        "note (1).txt",  # only this should match
    ]
    codeflash_output = _uniquity_file(file_list, "note.txt")
    result = codeflash_output  # 23.1μs -> 6.59μs (250% faster)


def test_multi_dot_filenames_handle_last_extension_only():
    # Edge: filenames with multiple dots should treat the last dot as the extension separator.
    file_list = ["archive.tar.gz", "archive.tar (1).gz", "archive.tar (2).gz"]
    # For target "archive.tar.gz", both parenthesized forms are relevant and (1),(2) exist -> expect (3)
    codeflash_output = _uniquity_file(file_list, "archive.tar.gz")
    result = codeflash_output  # 27.7μs -> 7.50μs (270% faster)


def test_no_extension_raises_value_error():
    # Edge: If the target filename does not contain a dot, rsplit(".", 1) returns a single item,
    # so unpacking it into (name, extension) raises ValueError.
    # The function does not catch this, so we assert that the error is propagated.
    file_list = ["file", "file (1)"]
    with pytest.raises(ValueError):
        _uniquity_file(file_list, "file")  # 4.79μs -> 4.60μs (4.13% faster)


def test_missing_middle_number_returns_first_gap():
    # Edge: when numbers are not consecutive, the function must return the smallest missing number.
    # Create duplicates 1..10 but intentionally omit 5
    file_list = [f"gaptest ({i}).txt" for i in range(1, 11) if i != 5]
    # Include also the base file without parentheses
    file_list.append("gaptest.txt")
    # The first missing positive integer among the parenthesized suffixes is 5
    codeflash_output = _uniquity_file(file_list, "gaptest.txt")
    result = codeflash_output  # 47.9μs -> 12.0μs (299% faster)


def test_ignores_files_with_different_extensions_or_formats():
    # Edge: similar names but different extensions or extra suffixes must be ignored.
    file_list = [
        "project.txt",
        "project (1).doc",  # different extension, ignore
        "project (1).txt",  # relevant
        "project(1).txt",  # missing space before '(', pattern requires a space -> ignore
    ]
    # Only "project (1).txt" matches, so expect (2)
    codeflash_output = _uniquity_file(file_list, "project.txt")
    result = codeflash_output  # 24.5μs -> 7.28μs (237% faster)


def test_large_scale_sequential_existing():
    # Large scale: create many sequential duplicate filenames up to 500 to test scalability.
    # Keep the number under 1000 as required.
    n = 500
    file_list = [f"bigfile ({i}).log" for i in range(1, n + 1)]
    # Also include the base filename without parentheses
    file_list.append("bigfile.log")
    # The smallest missing positive integer should be n+1 (501)
    codeflash_output = _uniquity_file(file_list, "bigfile.log")
    result = codeflash_output  # 1.59ms -> 295μs (437% faster)


def test_large_scale_with_gap_near_end():
    # Large scale: many items but a single missing near the end should be found efficiently.
    # Create 300 entries but omit number 257 to check that the function returns the correct small gap.
    n = 300
    missing = 257
    file_list = [f"scale ({i}).data" for i in range(1, n + 1) if i != missing]
    # Add base file too
    file_list.append("scale.data")
    codeflash_output = _uniquity_file(file_list, "scale.data")
    result = codeflash_output  # 938μs -> 178μs (426% faster)


def test_complex_unordered_and_mixed_contents():
    # Combination: input contains many unrelated filenames and an unordered set of relevant ones.
    file_list = [
        "other (1).txt",
        "random.doc",
        "target (10).md",
        "target (2).md",
        "target.md",
        "target (1).md",
        "target (3).md",
        "target (5).md",
        "target (4).md",
    ]
    # Numbers present for target.md are 1,2,3,4,5,10 -> smallest missing is 6
    codeflash_output = _uniquity_file(file_list, "target.md")
    result = codeflash_output  # 40.8μs -> 10.4μs (291% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from unstructured.metrics.utils import _uniquity_file


class TestBasicFunctionality:
    """Test the fundamental functionality of _uniquity_file under normal conditions."""

    def test_empty_file_list_returns_one(self):
        """Test that an empty file list results in filename (1).ext."""
        # When no files exist, the first duplicate should be (1)
        codeflash_output = _uniquity_file([], "document.txt")
        result = codeflash_output  # 7.69μs -> 3.71μs (107% faster)

    def test_single_file_no_duplicates(self):
        """Test with a single file that doesn't match the target filename."""
        # File list has unrelated files, so result should still be (1)
        codeflash_output = _uniquity_file(["other.txt"], "document.txt")
        result = codeflash_output  # 11.2μs -> 4.47μs (150% faster)

    def test_existing_file_with_no_duplicates(self):
        """Test when target file exists but no numbered duplicates exist."""
        # Only the original file exists, so next should be (1)
        codeflash_output = _uniquity_file(["document.txt"], "document.txt")
        result = codeflash_output  # 16.9μs -> 3.96μs (326% faster)

    def test_file_with_single_duplicate(self):
        """Test when one duplicate already exists."""
        # File list has document (1).txt, so next should be (2)
        codeflash_output = _uniquity_file(["document.txt", "document (1).txt"], "document.txt")
        result = codeflash_output  # 22.5μs -> 6.55μs (244% faster)

    def test_file_with_sequential_duplicates(self):
        """Test with sequential duplicates (1), (2), (3)."""
        # Files exist: original, (1), (2), (3), so next should be (4)
        file_list = ["document.txt", "document (1).txt", "document (2).txt", "document (3).txt"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 29.4μs -> 8.04μs (265% faster)

    def test_different_file_extensions(self):
        """Test that the function correctly handles different file extensions."""
        # Should preserve the .pdf extension
        codeflash_output = _uniquity_file(["report.pdf", "report (1).pdf"], "report.pdf")
        result = codeflash_output  # 22.5μs -> 6.49μs (246% faster)

    def test_filename_with_dots_in_name(self):
        """Test filename that contains dots (but not as extension separator)."""
        # Filename with dots like "my.document.txt" should work correctly
        file_list = ["my.document.txt", "my.document (1).txt"]
        codeflash_output = _uniquity_file(file_list, "my.document.txt")
        result = codeflash_output  # 23.9μs -> 6.49μs (269% faster)

    def test_filename_with_spaces(self):
        """Test filename that contains spaces."""
        # Filename with spaces should be handled correctly
        file_list = ["my file.txt", "my file (1).txt"]
        codeflash_output = _uniquity_file(file_list, "my file.txt")
        result = codeflash_output  # 23.2μs -> 6.59μs (253% faster)

    def test_filename_with_special_characters(self):
        """Test filename with special regex characters like brackets, parentheses."""
        # Filename with special characters should be escaped properly
        file_list = ["[test].txt", "[test] (1).txt"]
        codeflash_output = _uniquity_file(file_list, "[test].txt")
        result = codeflash_output  # 23.3μs -> 6.57μs (254% faster)

    def test_uppercase_extension(self):
        """Test that uppercase file extensions are preserved."""
        # Extension case should be preserved
        codeflash_output = _uniquity_file(["document.TXT"], "document.TXT")
        result = codeflash_output  # 17.1μs -> 3.95μs (332% faster)


class TestEdgeCases:
    """Test the function's behavior under extreme or unusual conditions."""

    def test_gap_in_numbering_sequence(self):
        """Test when there's a gap in the numbering sequence."""
        # Files: original, (1), (3) - missing (2), so (2) should be returned
        file_list = ["document.txt", "document (1).txt", "document (3).txt"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 26.2μs -> 7.59μs (245% faster)

    def test_large_gap_in_numbering(self):
        """Test with a large gap in numbering sequence."""
        # Files: original, (1), (100) - should return (2) as first available
        file_list = ["document.txt", "document (1).txt", "document (100).txt"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 26.8μs -> 7.82μs (243% faster)

    def test_non_sequential_numbers(self):
        """Test with non-sequential numbering from the start."""
        # Files only have (2), (3), (4) - should return (1) as it's first available
        file_list = ["document (2).txt", "document (3).txt", "document (4).txt"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 26.6μs -> 7.74μs (244% faster)

    def test_unordered_file_list(self):
        """Test that function correctly processes unordered file list."""
        # Files in random order: should still find correct next number
        file_list = ["document (3).txt", "document.txt", "document (1).txt"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 26.7μs -> 7.39μs (261% faster)

    def test_unrelated_files_in_list(self):
        """Test with many unrelated files in the file list."""
        # List has unrelated files that shouldn't affect the result
        file_list = ["other.txt", "document.txt", "another.pdf", "document (1).txt", "test.txt"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 24.3μs -> 7.40μs (229% faster)

    def test_similar_filename_different_extension(self):
        """Test that files with same name but different extension don't interfere."""
        # document.txt and document.pdf should not interfere with each other
        file_list = ["document.txt", "document.pdf", "document (1).pdf"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 19.0μs -> 5.30μs (258% faster)

    def test_filename_that_is_substring_of_another(self):
        """Test when filename is substring of another filename."""
        # "doc.txt" should not match "document.txt" or "mydoc.txt"
        file_list = ["document.txt", "mydoc.txt", "doc.txt"]
        codeflash_output = _uniquity_file(file_list, "doc.txt")
        result = codeflash_output  # 18.0μs -> 4.80μs (275% faster)

    def test_very_long_filename(self):
        """Test with a very long filename."""
        # Long filename should work correctly
        long_name = "a" * 100
        codeflash_output = _uniquity_file([f"{long_name}.txt"], f"{long_name}.txt")
        result = codeflash_output  # 18.7μs -> 4.06μs (360% faster)

    def test_filename_with_numbers_in_name(self):
        """Test filename that already contains numbers (not in pattern)."""
        # "document123.txt" with duplicates should work correctly
        file_list = ["document123.txt", "document123 (1).txt"]
        codeflash_output = _uniquity_file(file_list, "document123.txt")
        result = codeflash_output  # 24.1μs -> 6.66μs (262% faster)

    def test_extension_with_multiple_dots(self):
        """Test file with multiple dots in name."""
        # "archive.tar.gz" should be treated as name "archive.tar" and extension "gz"
        file_list = ["archive.tar.gz"]
        codeflash_output = _uniquity_file(file_list, "archive.tar.gz")
        result = codeflash_output  # 17.4μs -> 4.07μs (329% faster)

    def test_single_character_filename(self):
        """Test with single character filename."""
        # Single character filename should work
        codeflash_output = _uniquity_file(["a.txt", "a (1).txt"], "a.txt")
        result = codeflash_output  # 21.6μs -> 6.45μs (235% faster)

    def test_single_character_extension(self):
        """Test with single character extension."""
        # Single character extension should work
        codeflash_output = _uniquity_file(["document.x"], "document.x")
        result = codeflash_output  # 16.4μs -> 4.02μs (308% faster)

    def test_case_sensitive_matching(self):
        """Test that matching is case-sensitive."""
        # "Document.txt" and "document.txt" should be treated as different
        file_list = ["Document.txt", "document.txt"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 17.6μs -> 4.78μs (268% faster)

    def test_regex_special_characters_in_filename(self):
        """Test filename with regex metacharacters."""
        # Filename with regex metacharacters like . * + ? [ ] ( ) etc.
        file_list = ["file.+test.txt", "file.+test (1).txt"]
        codeflash_output = _uniquity_file(file_list, "file.+test.txt")
        result = codeflash_output  # 23.4μs -> 6.65μs (251% faster)

    def test_zero_in_filename_number(self):
        """Test that numbers starting with zero are handled correctly."""
        # Files with leading zeros: (01), (02) should be treated as numbers 1, 2
        file_list = ["document.txt", "document (01).txt", "document (02).txt"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 27.0μs -> 7.80μs (245% faster)

    def test_mixed_naming_with_and_without_numbers(self):
        """Test files with various numbering patterns mixed."""
        # Complex mix of files with gaps and sequences
        file_list = ["file.txt", "file (1).txt", "file (2).txt", "file (5).txt"]
        codeflash_output = _uniquity_file(file_list, "file.txt")
        result = codeflash_output  # 29.0μs -> 8.28μs (250% faster)

    def test_empty_filename_before_extension(self):
        """Test edge case of just extension with no real name."""
        # This is an edge case - ".txt" as target filename
        codeflash_output = _uniquity_file([".txt"], ".txt")
        result = codeflash_output  # 15.9μs -> 3.65μs (337% faster)

    def test_filename_ending_with_parenthesis(self):
        """Test filename whose base name ends with non-numeric text in parentheses."""
        # Filename like "meeting(archived).txt" should not confuse the numbering
        file_list = ["meeting(archived).txt"]
        codeflash_output = _uniquity_file(file_list, "meeting(archived).txt")
        result = codeflash_output  # 18.6μs -> 3.97μs (368% faster)


class TestLargeScale:
    """Test function's performance and scalability with larger data samples."""

    def test_many_sequential_duplicates(self):
        """Test with a large number of sequential duplicates."""
        # Create file list with 99 sequential duplicates, (1) through (99)
        file_list = ["document.txt"]
        for i in range(1, 100):
            file_list.append(f"document ({i}).txt")
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 326μs -> 63.1μs (417% faster)

    def test_many_files_with_gap(self):
        """Test finding first gap in large list."""
        # Create list with many duplicates but missing (50)
        file_list = ["document.txt"]
        for i in range(1, 100):
            if i != 50:
                file_list.append(f"document ({i}).txt")
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 318μs -> 60.4μs (427% faster)

    def test_large_unsorted_list(self):
        """Test with large unsorted file list."""
        # Create large unsorted list of various files
        file_list = ["other1.txt", "other2.txt", "other3.txt"]
        for i in range(1, 100):
            file_list.append(f"document ({i}).txt")
        # Add original and some unrelated files
        file_list.extend(["document.txt", "unrelated.pdf", "unrelated.doc"])
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 328μs -> 64.1μs (413% faster)

    def test_large_list_with_many_similar_filenames(self):
        """Test performance with many similar but different filenames."""
        # Create list with many similar filenames
        file_list = []
        for i in range(100):
            file_list.append(f"document{i}.txt")
        file_list.append("document.txt")
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 81.9μs -> 16.3μs (403% faster)

    def test_large_numbers_in_sequence(self):
        """Test with very large sequential numbers."""
        # Test with numbers going up to 499
        file_list = ["document.txt"]
        for i in range(1, 500):
            file_list.append(f"document ({i}).txt")
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 1.60ms -> 296μs (439% faster)

    def test_large_random_numbers(self):
        """Test with large non-sequential numbers."""
        # File list has original plus duplicates with large random-like numbers
        file_list = ["document.txt", "document (1).txt", "document (2).txt", "document (1000).txt"]
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 30.3μs -> 8.39μs (261% faster)

    def test_large_list_finding_first_available(self):
        """Test large list where first available number is early in sequence."""
        # Create list: (2), (3), (4), ..., (99) - missing (1)
        file_list = []
        for i in range(2, 100):
            file_list.append(f"document ({i}).txt")
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 313μs -> 57.3μs (447% faster)

    def test_performance_with_500_files(self):
        """Test function performance with 500 files in list."""
        # Create a realistic large file list
        file_list = []
        for i in range(500):
            if i == 0:
                file_list.append("document.txt")
            else:
                file_list.append(f"document ({i}).txt")
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 1.59ms -> 292μs (446% faster)

    def test_many_unrelated_files_mixed_in(self):
        """Test with many unrelated files mixed with target duplicates."""
        # Create list with 500 unrelated files and 49 numbered target duplicates
        file_list = []
        for i in range(500):
            file_list.append(f"file{i}.txt")
        file_list.append("document.txt")
        for i in range(1, 50):
            file_list.append(f"document ({i}).txt")
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 453μs -> 90.9μs (399% faster)

    def test_large_gap_at_beginning(self):
        """Test with large gap at the beginning of numbering."""
        # Files: (100), (101), (102), ..., (149) - should return (1)
        file_list = []
        for i in range(100, 150):
            file_list.append(f"document ({i}).txt")
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 170μs -> 32.5μs (423% faster)

    def test_alternating_gaps(self):
        """Test with multiple alternating gaps."""
        # Files: (1), (3), (5), (7), ... (99) - should return (2) as first gap
        file_list = ["document.txt"]
        for i in range(1, 100, 2):
            file_list.append(f"document ({i}).txt")
        codeflash_output = _uniquity_file(file_list, "document.txt")
        result = codeflash_output  # 171μs -> 32.7μs (425% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_uniquity_file-mks4vsp5` and push.


codeflash-ai bot requested a review from aseembits93 January 24, 2026 09:55

codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Jan 24, 2026
