@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 298% (2.98x) speedup for _rename_aggregated_columns in unstructured/metrics/utils.py

⏱️ Runtime: 11.3 milliseconds → 2.84 milliseconds (best of 73 runs)
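The percentage in the title is consistent with measuring the time saved relative to the optimized runtime. A minimal sketch of that arithmetic, assuming this is the formula behind the reported figures (the helper name `speedup_percent` is hypothetical and not part of Codeflash or unstructured):

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Time saved, expressed as a percentage of the optimized runtime."""
    return (original - optimized) / optimized * 100


# Overall runtime reported above, in milliseconds.
print(round(speedup_percent(11.3, 2.84)))  # 298
# Empty-DataFrame test from the generated tests, in microseconds.
print(round(speedup_percent(295, 26.3)))  # 1022
```

Under this reading, a 298% speedup means the original takes about 3.98 times as long as the optimized version (11.3 / 2.84).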

📝 Explanation and details

The optimized code achieves a 298% speedup by avoiding pandas' heavyweight DataFrame.rename() machinery when possible. Here's why it's faster:

Key Optimization

Early exit on non-matching DataFrames: The optimized version checks whether any rename_map key exists among the DataFrame's columns before renaming anything. In the common case where none of the mapped column names (_mean, _stdev, _pstdev, _count) is present, it immediately returns a shallow copy without invoking pandas' rename logic. The keys are matched as exact column names, not as suffixes, so a column such as value_mean does not count as a match.
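The early-exit idea can be sketched as follows. This is a reconstruction from the description above, not the code in the PR: the function name `rename_aggregated_columns_sketch` and the constant `RENAME_MAP` are hypothetical, and the mapping's target names are inferred from the test comments.

```python
import pandas as pd

# Keys as listed in the explanation; values inferred from the generated tests (assumption).
RENAME_MAP = {"_mean": "mean", "_stdev": "stdev", "_pstdev": "pstdev", "_count": "count"}


def rename_aggregated_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Rename exact-match aggregate columns, skipping DataFrame.rename entirely."""
    columns = df.columns
    # Fast path: no column label equals a mapping key, so return a shallow copy.
    if not any(key in columns for key in RENAME_MAP):
        return df.copy(deep=False)
    # Slow path: build the new labels directly and assign them to a shallow copy.
    renamed = df.copy(deep=False)
    renamed.columns = [RENAME_MAP.get(col, col) for col in columns]
    return renamed


df = pd.DataFrame({"_mean": [1, 2], "value_mean": [3, 4]})
print(list(rename_aggregated_columns_sketch(df).columns))  # ['mean', 'value_mean']
```

Assigning to `renamed.columns` replaces the labels on the copy only, so the input DataFrame keeps its original column names.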

Performance Benefits

  1. Avoided overhead: df.rename(columns=...) internally performs extensive validation, index alignment, and creates multiple intermediate data structures even when no columns need renaming. The optimized version bypasses this entirely for non-matching cases.

  2. Selective column construction: When a match is found, it builds a new column list using a simple list comprehension and directly assigns it to df2.columns. In the generated tests this path is about twice as fast as pandas' rename machinery on small DataFrames.

  3. Test results validate the approach:

    • Empty DataFrames: 1022-1038% speedup (e.g., 295μs → 26.3μs), since the early exit skips rename's overhead entirely
    • No matching columns, small DataFrames: 324-360% speedup across the tests, all taking the early exit path
    • No matching columns, wide DataFrames (200-800 columns): 415-980% speedup (e.g., 1.01ms → 93.5μs); a DataFrame with numeric column names shows 605%
    • DataFrames with mapped columns: still 104-116% speedup on small DataFrames when renaming is required, and 55.5% with about 500 columns
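The overhead described in the list above can be checked locally with a small timing harness. This is a sketch, not the benchmark Codeflash ran: it only compares `DataFrame.rename` against a shallow copy on a DataFrame with no matching columns, and the absolute numbers depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Keys as listed in the explanation; target names inferred from the test comments (assumption).
rename_map = {"_mean": "mean", "_stdev": "stdev", "_pstdev": "pstdev", "_count": "count"}

# No column name matches a mapping key, so both calls return the same content.
df = pd.DataFrame({"a": [10], "b": [20]})

runs = 1000
rename_seconds = timeit.timeit(lambda: df.rename(columns=rename_map), number=runs)
copy_seconds = timeit.timeit(lambda: df.copy(deep=False), number=runs)

print(f"rename:       {rename_seconds / runs * 1e6:.1f} μs per call")
print(f"shallow copy: {copy_seconds / runs * 1e6:.1f} μs per call")
```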

Impact on Production Workload

Based on the function_references, this function is called within get_mean_grouping(), a metrics aggregation pipeline that processes grouped DataFrames. The optimization particularly benefits scenarios where:

  • GroupBy operations produce DataFrames without the exact mapping keys (e.g., columns like "value_mean" instead of "_mean")
  • Multiple aggregations are performed in loops (the function is called once per agg_field)
  • The evaluation pipeline processes many small to medium DataFrames repeatedly
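The first bullet depends on how pandas applies the mapping: `DataFrame.rename(columns=...)` with a dict looks up whole column labels, so a prefixed name such as `value_mean` is not a key. A small illustration using the column names from one of the generated tests (the mapping's target names are inferred from the test comments):

```python
import pandas as pd

rename_map = {"_mean": "mean", "_stdev": "stdev", "_pstdev": "pstdev", "_count": "count"}

df = pd.DataFrame({"value_mean": [1], "_mean_mean": [2], "_count": [3]})
renamed = df.rename(columns=rename_map)

# Only the label that equals a key exactly is renamed.
print(list(renamed.columns))  # ['value_mean', '_mean_mean', 'count']
```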

The roughly 3-10x speedup for non-matching cases should make this step of the metrics pipeline noticeably faster whenever column names do not match the mapping keys. The matching case is not slowed down; it still runs about twice as fast in the small-DataFrame tests.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 41 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests
import pandas as pd  # used to create DataFrames for testing

# imports
from unstructured.metrics.utils import _rename_aggregated_columns


def test_basic_renaming_of_known_suffix_columns():
    # Basic case: columns that exactly match mapping keys should be renamed.
    df = pd.DataFrame({"_mean": [1, 2], "_stdev": [3, 4], "other": [5, 6]})
    # Call the function under test
    codeflash_output = _rename_aggregated_columns(df)
    renamed = codeflash_output  # 230μs -> 107μs (114% faster)


def test_no_matching_columns_returns_equivalent_dataframe():
    # Edge: DataFrame with no columns that match the mapping keys should be equal to the input in content.
    df = pd.DataFrame({"a": [10], "b": [20]})
    codeflash_output = _rename_aggregated_columns(df)
    renamed = codeflash_output  # 230μs -> 53.1μs (334% faster)


def test_exact_match_only_not_suffix_or_substring():
    # Edge: mapping only applies to exact column names, not to substrings or suffixes.
    df = pd.DataFrame({"value_mean": [1], "_mean_mean": [2], "_count": [3]})
    codeflash_output = _rename_aggregated_columns(df)
    renamed = codeflash_output  # 230μs -> 113μs (104% faster)


def test_duplicate_target_column_names_after_rename_are_preserved():
    # Edge: If a column already named "mean" exists alongside "_mean", rename will produce duplicate column names.
    # Create DataFrame with two distinct columns that will collide after renaming
    df = pd.DataFrame({"_mean": [100], "mean": [200], "_stdev": [3]})
    codeflash_output = _rename_aggregated_columns(df)
    renamed = codeflash_output  # 230μs -> 106μs (116% faster)


def test_non_string_column_names_are_untouched():
    # Edge: Non-string column names should be left unchanged (mapping keys are strings)
    df = pd.DataFrame({1: [7, 8], "_stdev": [9, 10]})
    codeflash_output = _rename_aggregated_columns(df)
    renamed = codeflash_output  # 235μs -> 111μs (111% faster)


def test_empty_dataframe_behavior():
    # Edge: An empty DataFrame (no columns, no rows) returns a new empty DataFrame
    df = pd.DataFrame()
    codeflash_output = _rename_aggregated_columns(df)
    renamed = codeflash_output  # 295μs -> 26.3μs (1022% faster)


def test_pstdev_and_count_are_mapped_correctly():
    # Basic: ensure all mappings are honored, including "_pstdev" and "_count"
    df = pd.DataFrame({"_pstdev": [0.5], "_count": [42], "keep": ["x"]})
    codeflash_output = _rename_aggregated_columns(df)
    renamed = codeflash_output  # 242μs -> 115μs (109% faster)


def test_large_scale_with_many_columns_and_some_mappings():
    # Large-scale: create a DataFrame with many columns (but < 1000 as required).
    # Include a few columns that match the mapping keys among many unrelated columns.
    num_extra = 500  # safely under the 1000 elements threshold
    # Create many column names that won't match the rename map
    extra_cols = [f"col_{i}" for i in range(num_extra)]
    # Add the mapped columns in various positions
    cols = (
        extra_cols[:250] + ["_mean"] + extra_cols[250:400] + ["_stdev", "_count"] + extra_cols[400:]
    )
    # Create simple data: 5 rows of increasing integers for simplicity
    data = {c: list(range(5)) for c in cols}
    df = pd.DataFrame(data)
    # Apply rename function
    codeflash_output = _rename_aggregated_columns(df)
    renamed = codeflash_output  # 445μs -> 286μs (55.5% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd

from unstructured.metrics.utils import _rename_aggregated_columns


def test_basic_rename_single_mean_column():
    """Test that 'value_mean' is left unchanged: only the exact name '_mean' is mapped."""
    df = pd.DataFrame({"value_mean": [1.0, 2.0, 3.0]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 228μs -> 53.8μs (324% faster)
    expected = pd.DataFrame({"value_mean": [1.0, 2.0, 3.0]})
    pd.testing.assert_frame_equal(result, expected)


def test_basic_rename_single_stdev_column():
    """Test that 'metric_stdev' is left unchanged: only the exact name '_stdev' is mapped."""
    df = pd.DataFrame({"metric_stdev": [0.1, 0.2, 0.3]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 230μs -> 53.1μs (333% faster)


def test_basic_rename_single_pstdev_column():
    """Test that 'data_pstdev' is left unchanged: only the exact name '_pstdev' is mapped."""
    df = pd.DataFrame({"data_pstdev": [0.5, 0.6, 0.7]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 229μs -> 52.8μs (335% faster)


def test_basic_rename_single_count_column():
    """Test that 'items_count' is left unchanged: only the exact name '_count' is mapped."""
    df = pd.DataFrame({"items_count": [10, 20, 30]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 227μs -> 52.7μs (332% faster)


def test_basic_rename_multiple_aggregated_columns():
    """Test that several prefixed aggregate columns are all left unchanged."""
    df = pd.DataFrame(
        {"value_mean": [1.0, 2.0], "value_stdev": [0.1, 0.2], "value_count": [10, 20]}
    )
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 237μs -> 54.5μs (335% faster)


def test_basic_rename_all_aggregation_types():
    """Test that prefixed columns for all four aggregation types are left unchanged."""
    df = pd.DataFrame(
        {
            "metric_mean": [1.5],
            "metric_stdev": [0.2],
            "metric_pstdev": [0.15],
            "metric_count": [100],
        }
    )
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 250μs -> 54.5μs (360% faster)


def test_basic_no_aggregated_columns():
    """Test that non-aggregated columns remain unchanged."""
    df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 235μs -> 54.2μs (335% faster)


def test_basic_mixed_columns():
    """Test DataFrame with both aggregated and non-aggregated columns."""
    df = pd.DataFrame({"id": [1, 2], "value_mean": [10.0, 20.0], "name": ["a", "b"]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 239μs -> 56.2μs (327% faster)


def test_basic_preserves_data_integrity():
    """Test that data values are preserved during renaming."""
    df = pd.DataFrame({"score_mean": [100, 200, 300], "score_count": [10, 20, 30]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 227μs -> 52.7μs (332% faster)


def test_edge_empty_dataframe():
    """Test renaming columns in an empty DataFrame."""
    df = pd.DataFrame()
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 294μs -> 25.9μs (1038% faster)


def test_edge_empty_dataframe_with_aggregated_column_names():
    """Test empty DataFrame with aggregated column names but no rows."""
    df = pd.DataFrame(columns=["value_mean", "value_stdev", "value_count"])
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 226μs -> 50.7μs (348% faster)


def test_edge_column_name_only_suffix():
    """Test column named only with the suffix (e.g., '_mean' as column name)."""
    df = pd.DataFrame({"_mean": [1, 2, 3]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 229μs -> 107μs (114% faster)


def test_edge_column_name_with_multiple_suffixes():
    """Test column name containing multiple aggregation suffixes."""
    df = pd.DataFrame({"value_mean_stdev": [1, 2, 3]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 228μs -> 53.1μs (331% faster)


def test_edge_partial_suffix_match():
    """Test that partial suffix matches don't cause incorrect renaming."""
    df = pd.DataFrame({"mean_value": [1, 2, 3]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 229μs -> 53.2μs (332% faster)


def test_edge_case_sensitive_suffix():
    """Test that suffix matching is case-sensitive."""
    df = pd.DataFrame({"value_MEAN": [1, 2, 3], "value_Mean": [4, 5, 6]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 229μs -> 53.2μs (332% faster)


def test_edge_single_row_dataframe():
    """Test renaming with a DataFrame containing only one row."""
    df = pd.DataFrame({"metric_mean": [42.0]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 228μs -> 52.4μs (336% faster)


def test_edge_large_number_of_rows():
    """Test renaming with a DataFrame containing many rows."""
    df = pd.DataFrame({"value_mean": range(1000)})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 232μs -> 53.1μs (338% faster)


def test_edge_unicode_column_names():
    """Test columns with unicode characters in names."""
    df = pd.DataFrame({"métrique_mean": [1.0], "значение_stdev": [0.5]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 229μs -> 53.4μs (330% faster)


def test_edge_special_characters_in_column_names():
    """Test columns with special characters (not suffixes)."""
    df = pd.DataFrame({"value@#_mean": [1.0], "data$%_count": [10]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 235μs -> 54.1μs (335% faster)


def test_edge_numeric_column_names():
    """Test with numeric column names."""
    df = pd.DataFrame({1: [1], 2: [2], 3: [3]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 332μs -> 47.1μs (605% faster)


def test_edge_whitespace_in_column_names():
    """Test columns with whitespace in names."""
    df = pd.DataFrame({"value with spaces_mean": [1.0]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 229μs -> 52.4μs (338% faster)


def test_edge_multiple_identical_aggregation_suffixes():
    """Test multiple columns sharing the '_mean' suffix; none is an exact key, so none is renamed."""
    df = pd.DataFrame({"metric1_mean": [1.0], "metric2_mean": [2.0], "metric3_mean": [3.0]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 230μs -> 52.9μs (336% faster)


def test_edge_dataframe_with_nan_values():
    """Test that NaN values are preserved during renaming."""
    df = pd.DataFrame({"value_mean": [1.0, float("nan"), 3.0], "value_count": [10, 20, 30]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 235μs -> 54.4μs (332% faster)


def test_edge_dataframe_with_null_like_values():
    """Test with None and NaN values."""
    df = pd.DataFrame({"score_mean": [100.0, None, 300.0], "score_stdev": [10.0, 20.0, None]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 228μs -> 52.8μs (333% faster)


def test_edge_very_long_column_names():
    """Test with extremely long column names."""
    long_name = "a" * 500 + "_mean"
    df = pd.DataFrame({long_name: [1.0]})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 230μs -> 52.9μs (336% faster)
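    # The long name only ends with "_mean"; it is not an exact key, so it is unchanged.
    assert list(result.columns) == [long_name]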


def test_edge_duplicate_column_names_before_rename():
    """Test DataFrame with duplicate column names before renaming."""
    df = pd.DataFrame([[1, 2], [3, 4]], columns=["value_mean", "value_mean"])
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 257μs -> 83.8μs (207% faster)


def test_large_scale_many_aggregated_columns():
    """Test renaming performance with many aggregated columns."""
    # Create a DataFrame with 500 different metric columns
    data = {f"metric_{i}_mean": [float(i)] * 100 for i in range(500)}
    df = pd.DataFrame(data)
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 471μs -> 83.4μs (465% faster)


def test_large_scale_many_rows_single_column():
    """Test renaming with a DataFrame containing many rows but single column."""
    df = pd.DataFrame({"measurement_mean": range(1000)})
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 233μs -> 53.0μs (339% faster)


def test_large_scale_large_mixed_dataframe():
    """Test renaming on a large DataFrame with mixed aggregated and non-aggregated columns."""
    data = {"id": range(1000), "name": [f"item_{i}" for i in range(1000)]}
    # Add 50 aggregated columns
    for i in range(50):
        data[f"metric_{i}_mean"] = [float(i)] * 1000
        data[f"metric_{i}_stdev"] = [0.1 * i] * 1000

    df = pd.DataFrame(data)
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 379μs -> 69.4μs (447% faster)


def test_large_scale_dataframe_with_various_dtypes():
    """Test renaming with DataFrame containing various data types."""
    df = pd.DataFrame(
        {
            "int_mean": [1, 2, 3] * 100,
            "float_stdev": [1.1, 2.2, 3.3] * 100,
            "str_count": ["a", "b", "c"] * 100,
            "bool_pstdev": [True, False] * 150,
        }
    )
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 261μs -> 58.6μs (345% faster)


def test_large_scale_all_column_types_combined():
    """Test with DataFrame containing all four aggregation types for multiple metrics."""
    metrics = ["metric_a", "metric_b", "metric_c", "metric_d", "metric_e"]
    aggregations = ["mean", "stdev", "pstdev", "count"]
    data = {}

    for metric in metrics:
        for agg in aggregations:
            col_name = f"{metric}_{agg}"
            data[col_name] = [float(len(data))] * 200

    df = pd.DataFrame(data)
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 248μs -> 56.4μs (341% faster)

    # Names such as "metric_a_mean" are not exact mapping keys, so every column name is unchanged
    assert list(result.columns) == list(df.columns)


def test_large_scale_performance_many_operations():
    """Test that function performs efficiently with large dataset."""
    # Create a DataFrame with 800 columns and 500 rows
    data = {}
    for i in range(400):
        data[f"col_{i}_mean"] = range(500)
        data[f"col_{i}_count"] = range(500)

    df = pd.DataFrame(data)
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 1.01ms -> 93.5μs (980% faster)


def test_large_scale_dataframe_with_duplicates():
    """Test renaming on DataFrame with many duplicate aggregated suffixes."""
    # Create many columns that all use _mean suffix
    data = {f"measurement_{i}_mean": [i * 10.0] * 300 for i in range(200)}
    df = pd.DataFrame(data)
    codeflash_output = _rename_aggregated_columns(df)
    result = codeflash_output  # 366μs -> 71.1μs (415% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-_rename_aggregated_columns-mks41qqa and push.


@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 09:31
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 24, 2026