@codeflash-ai codeflash-ai bot commented Jan 31, 2026

📄 8,871% (88.71x) speedup for find_last_node in src/algorithms/graph.py

⏱️ Runtime: 11.3 milliseconds → 126 microseconds (best of 809 runs)

📝 Explanation and details

The optimized code achieves an 88x speedup by eliminating redundant work through a simple algorithmic improvement:

Original approach (O(N×M) complexity):

  • For each node, iterates through all edges to check if that node appears as a source
  • With 500 nodes and 499 edges, this performs up to ~250,000 comparisons (500 × 499); all() stops at the first matching edge, so a linear chain needs roughly half that
  • Uses nested iteration inside a generator expression: all(e["source"] != n["id"] for e in edges) runs for every node candidate
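As a point of reference, the original approach described above can be sketched as follows. This is a reconstruction from the description, not the file's actual contents: the name `find_last_node_original` and the type aliases are assumptions made for illustration.

```python
from typing import Any, Optional

# Hypothetical aliases for readability; the real module may not define them.
Node = dict[str, Any]
Edge = dict[str, Any]


def find_last_node_original(nodes: list[Node], edges: list[Edge]) -> Optional[Node]:
    """Return the first node that never appears as an edge source, else None."""
    # For every candidate node the edge list is scanned again;
    # all() stops as soon as one edge has this node as its source.
    return next(
        (n for n in nodes if all(e["source"] != n["id"] for e in edges)),
        None,
    )


nodes = [{"id": "a"}, {"id": "b"}]
edges = [{"source": "a", "target": "b"}]
print(find_last_node_original(nodes, edges))  # {'id': 'b'}
```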

Optimized approach (O(N+M) complexity):

  • Pre-builds a set of all source IDs once with {e["source"] for e in edges}
  • Then performs fast O(1) hash lookups (n["id"] not in sources) for each node
  • With 500 nodes and 499 edges, this performs only ~999 operations (499 + 500)
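A minimal sketch of the set-based version, assuming it is written as a one-off set comprehension followed by a single pass over the nodes. The name `find_last_node_optimized` is hypothetical, and the actual change in `src/algorithms/graph.py` may differ in detail.

```python
from typing import Any, Optional


def find_last_node_optimized(
    nodes: list[dict[str, Any]], edges: list[dict[str, Any]]
) -> Optional[dict[str, Any]]:
    """Return the first node that never appears as an edge source, else None."""
    # One pass over the edges collects every source id.
    sources = {e["source"] for e in edges}
    # One pass over the nodes; each membership test is O(1) on average.
    return next((n for n in nodes if n["id"] not in sources), None)


nodes = [{"id": 10}, {"id": 20}, {"id": 30}]
edges = [{"source": 10, "target": 20}]
print(find_last_node_optimized(nodes, edges))  # {'id': 20}
```

In this sketch every `source` value must be hashable, because it is stored in a set.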

Why this matters:
The test results show dramatic improvements for larger inputs:

  • test_large_scale_many_nodes_and_edges (500 nodes): 17,540% faster (4.56ms → 25.8μs)
  • test_large_linear_chain (500 nodes): 17,456% faster (4.53ms → 25.8μs)
  • test_large_multiple_endpoint_graph (101 nodes, 100 edges): 3,151% faster (203μs → 6.25μs)

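Smaller graphs (1-10 nodes with a handful of edges) still see roughly 65-137% speedups, because the per-node rescan of the edge list is replaced by a single set lookup. Tests with an empty node list gain less (23-60%).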
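To reproduce a comparison of this kind locally, a small `timeit` harness along the following lines could be used. Both function bodies are reconstructions from the description (the `_original` and `_optimized` names are assumptions), and absolute timings will vary by machine and Python version.

```python
import timeit


def find_last_node_original(nodes, edges):
    # Reconstructed nested scan: edges are rescanned for every node.
    return next((n for n in nodes if all(e["source"] != n["id"] for e in edges)), None)


def find_last_node_optimized(nodes, edges):
    # Reconstructed set-based version: one pass over edges, one over nodes.
    sources = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in sources), None)


# Same shape as test_large_scale_many_nodes_and_edges: a 500-node linear chain.
size = 500
nodes = [{"id": i} for i in range(size)]
edges = [{"source": i, "target": i + 1} for i in range(size - 1)]

runs = 20
t_orig = timeit.timeit(lambda: find_last_node_original(nodes, edges), number=runs) / runs
t_opt = timeit.timeit(lambda: find_last_node_optimized(nodes, edges), number=runs) / runs

print(f"original:  {t_orig * 1e6:10.1f} µs per call")
print(f"optimized: {t_opt * 1e6:10.1f} µs per call")
```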

Behavioral preservation:

  • Returns the first non-source node in iteration order (identical to original)
  • Raises KeyError for malformed edges missing "source" key (same as original)
  • Handles the edge cases covered by the generated tests identically: empty inputs, falsy IDs, type-sensitive matching, duplicates
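One way to spot-check this kind of equivalence is a seeded randomized comparison over small well-formed graphs. The sketch below is illustrative only: both implementations are reconstructions from the description, and `random_graph` and `check_equivalence` are hypothetical helpers, not part of the PR.

```python
import random


def find_last_node_original(nodes, edges):
    # Reconstructed nested scan.
    return next((n for n in nodes if all(e["source"] != n["id"] for e in edges)), None)


def find_last_node_optimized(nodes, edges):
    # Reconstructed set-based version.
    sources = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in sources), None)


def random_graph(rng, max_nodes=8, max_edges=12):
    """Build a small graph with integer ids and well-formed edges."""
    nodes = [{"id": i} for i in range(rng.randint(0, max_nodes))]
    # Sources may reference ids that are absent from `nodes`.
    edges = [
        {"source": rng.randint(0, max_nodes), "target": rng.randint(0, max_nodes)}
        for _ in range(rng.randint(0, max_edges))
    ]
    return nodes, edges


def check_equivalence(trials=1000, seed=0):
    """Return True if both versions agree on every generated graph."""
    rng = random.Random(seed)
    for _ in range(trials):
        nodes, edges = random_graph(rng)
        if find_last_node_original(nodes, edges) != find_last_node_optimized(nodes, edges):
            return False
    return True


print(check_equivalence())
```

The check only covers well-formed edges with hashable ids. Under the reconstruction assumed here, inputs outside that scope can differ: with an empty node list and an edge lacking `"source"`, the nested scan never touches the edge and returns `None`, while an eagerly built set raises `KeyError`.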

The optimization is particularly valuable for graph analysis workflows where this function might be called repeatedly on moderate-to-large graphs: the original's runtime grows with the product of node and edge counts while the optimized version grows with their sum, so the speedup increases with input size.
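To make the scaling concrete, a rough operation-count model can be tabulated for linear chains of different sizes. The two helper functions are hypothetical and count only the dominant operations; the model ignores constant factors and the short-circuiting of `all()`, so it is not expected to match the measured speedups.

```python
def worst_case_comparisons(n_nodes: int, n_edges: int) -> int:
    """Upper bound for the nested scan: every node checks every edge."""
    return n_nodes * n_edges


def set_based_operations(n_nodes: int, n_edges: int) -> int:
    """One set insertion per edge plus one membership test per node."""
    return n_edges + n_nodes


for n in (10, 100, 500, 1000):
    m = n - 1  # linear chain: every node except the last has one outgoing edge
    nested = worst_case_comparisons(n, m)
    hashed = set_based_operations(n, m)
    print(f"N={n:>4}  nested<={nested:>7}  set-based={hashed:>5}  ratio~{nested / hashed:.1f}")
```

In this model the nested count grows quadratically with N, while the ratio between the two counts grows roughly linearly.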

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 39 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from src.algorithms.graph import find_last_node


def test_basic_single_last_node():
    # Basic scenario: a simple two-node flow where 'a' points to 'b'.
    nodes = [{"id": "a"}, {"id": "b"}]  # two nodes in order
    edges = [{"source": "a", "target": "b"}]  # one edge from 'a' to 'b'
    # Expect the node with id 'b' because it has no outgoing edges.
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.54μs -> 708ns (118% faster)


def test_no_edges_returns_first_node():
    # Edge case: when there are no edges, every node has no outgoing edges.
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]  # multiple nodes
    edges = []  # empty list of edges
    # The implementation uses next over nodes, so the first node should be returned.
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.04μs -> 500ns (108% faster)


def test_no_nodes_returns_none():
    # Edge case: empty nodes list should yield None (nothing to choose from).
    nodes = []  # no nodes
    edges = [{"source": "x"}]  # edges do not matter when nodes is empty
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 666ns -> 541ns (23.1% faster)


def test_multiple_candidates_returns_first_candidate():
    # Order sensitivity: if multiple nodes have no outgoing edges, the first such node is returned.
    nodes = [{"id": 10}, {"id": 20}, {"id": 30}]
    edges = [{"source": 10, "target": 20}]  # only node 10 has an outgoing edge
    # Nodes 20 and 30 have no outgoing edges; the function should pick the first of those (20).
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.58μs -> 750ns (111% faster)


def test_type_sensitive_matching():
    # Types must match exactly: string '1' should not be considered equal to integer 1.
    nodes = [{"id": 1}, {"id": "1"}]
    edges = [{"source": "1"}]  # only matches the string id, not the integer id
    # The integer-id node (1) has no outgoing edges because '1' != 1, so it should be returned.
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.29μs -> 625ns (107% faster)


def test_falsy_ids_are_handled_correctly():
    # Node ids that are falsy (0, empty string) must still be compared correctly.
    nodes = [{"id": 0}, {"id": ""}, {"id": None}]
    # An edge from source 0 means node with id 0 has outgoing edge; the next falsy id '' should be returned.
    edges = [{"source": 0}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.67μs -> 833ns (100% faster)


def test_edge_missing_source_key_raises_keyerror():
    # If an edge dict does not contain the 'source' key, accessing e["source"] should raise a KeyError.
    nodes = [{"id": "a"}]
    edges = [{"target": "b"}]  # malformed edge lacking 'source'
    with pytest.raises(KeyError):
        # The function's implementation uses e["source"] directly, so KeyError is expected.
        find_last_node(nodes, edges) # 1.46μs -> 833ns (75.0% faster)


def test_duplicate_edge_sources_do_not_affect_result():
    # Duplicate edges from the same source should not change the outcome.
    nodes = [{"id": "alpha"}, {"id": "omega"}]
    edges = [{"source": "alpha"}, {"source": "alpha"}, {"source": "alpha"}]  # repeated duplicates
    # 'omega' has no outgoing edges, so it should be returned regardless of duplicate sources.
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.71μs -> 792ns (116% faster)


def test_large_scale_many_nodes_and_edges():
    # Large-scale test within limits (<1000 elements): create 500 nodes.
    size = 500  # keep well under 1000 as requested
    # Create nodes with ids 0..499 as separate dicts to preserve identity and ordering.
    nodes = [{"id": i} for i in range(size)]
    # Create edges such that every node except the last (id=size-1) has an outgoing edge.
    edges = [{"source": i, "target": i + 1} for i in range(size - 1)]
    # The node with id size-1 should be the only node without an outgoing edge.
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 4.56ms -> 25.8μs (17540% faster)


def test_nodes_with_non_hashable_like_ids_are_supported():
    # Use tuple ids (hashable) to ensure id types other than int/str work.
    nodes = [{"id": (1, 2)}, {"id": (3, 4)}]
    edges = [{"source": (1, 2)}]  # edge from (1,2) to somewhere else
    # The tuple (3,4) should be returned as it has no outgoing edges.
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.62μs -> 792ns (105% faster)


def test_returns_first_when_all_nodes_have_no_outgoing_edges():
    # When no edges refer to any node's id, the first node should be returned.
    nodes = [{"id": "x"}, {"id": "y"}, {"id": "z"}]
    # edges reference unrelated ids
    edges = [{"source": "other1"}, {"source": "other2"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.46μs -> 750ns (94.4% faster)


def test_edge_list_with_unrelated_keys_still_requires_source_key():
    # Even if edges contain many other keys, 'source' must exist for each edge otherwise KeyError.
    nodes = [{"id": "n"}]
    edges_good = [{"source": "x", "meta": 1}]  # well-formed edge
    # Should not raise and should return the first node because no edges have source == 'n'
    codeflash_output = find_last_node(nodes, edges_good); result = codeflash_output # 1.33μs -> 625ns (113% faster)
    # Now an edge missing 'source' among other well-formed edges should cause KeyError.
    edges_mixed = [{"source": "x"}, {"meta": "no source here"}]
    with pytest.raises(KeyError):
        find_last_node(nodes, edges_mixed) # 1.25μs -> 750ns (66.7% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from src.algorithms.graph import find_last_node


def test_single_node_no_edges():
    """Test finding the last node in a graph with a single node and no edges."""
    nodes = [{"id": 1, "name": "A"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.04μs -> 500ns (108% faster)


def test_linear_chain_three_nodes():
    """Test finding the last node in a linear chain: A -> B -> C."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"},
        {"id": 3, "name": "C"}
    ]
    edges = [
        {"source": 1, "target": 2},
        {"source": 2, "target": 3}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.88μs -> 833ns (125% faster)


def test_linear_chain_two_nodes():
    """Test finding the last node in a chain of two nodes: A -> B."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"}
    ]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.50μs -> 667ns (125% faster)


def test_multiple_branches_single_endpoint():
    """Test finding the last node when multiple nodes converge to a single endpoint."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"},
        {"id": 3, "name": "C"}
    ]
    edges = [
        {"source": 1, "target": 3},
        {"source": 2, "target": 3}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.83μs -> 792ns (131% faster)


def test_single_node_with_self_loop():
    """Test behavior when a node has a self-referencing edge."""
    nodes = [{"id": 1, "name": "A"}]
    edges = [{"source": 1, "target": 1}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.17μs -> 625ns (86.7% faster)


def test_diamond_graph():
    """Test finding the last node in a diamond-shaped graph."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"},
        {"id": 3, "name": "C"},
        {"id": 4, "name": "D"}
    ]
    edges = [
        {"source": 1, "target": 2},
        {"source": 1, "target": 3},
        {"source": 2, "target": 4},
        {"source": 3, "target": 4}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.29μs -> 1.00μs (129% faster)


def test_nodes_with_extra_attributes():
    """Test that find_last_node works with nodes containing various attributes."""
    nodes = [
        {"id": 1, "name": "A", "type": "start", "value": 100},
        {"id": 2, "name": "B", "type": "middle", "value": 200},
        {"id": 3, "name": "C", "type": "end", "value": 300}
    ]
    edges = [
        {"source": 1, "target": 2},
        {"source": 2, "target": 3}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.88μs -> 792ns (137% faster)


def test_empty_nodes_list():
    """Test behavior when the nodes list is empty."""
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 666ns -> 417ns (59.7% faster)


def test_empty_edges_list_multiple_nodes():
    """Test that the first node is returned when there are no edges."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"},
        {"id": 3, "name": "C"}
    ]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.04μs -> 542ns (92.1% faster)


def test_all_nodes_are_sources():
    """Test when all nodes have outgoing edges (no node is a sink)."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"}
    ]
    edges = [
        {"source": 1, "target": 2},
        {"source": 2, "target": 1}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.58μs -> 708ns (124% faster)


def test_cyclic_graph():
    """Test finding the last node in a cyclic graph (A -> B -> C -> A)."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"},
        {"id": 3, "name": "C"}
    ]
    edges = [
        {"source": 1, "target": 2},
        {"source": 2, "target": 3},
        {"source": 3, "target": 1}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.88μs -> 792ns (137% faster)


def test_node_not_in_edges():
    """Test when a node exists but is not referenced in any edges."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"},
        {"id": 3, "name": "C"}
    ]
    edges = [
        {"source": 1, "target": 2}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.50μs -> 709ns (112% faster)


def test_edge_referencing_nonexistent_node():
    """Test that edges can reference node IDs that may not exist in the nodes list."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"}
    ]
    edges = [
        {"source": 1, "target": 99}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.50μs -> 750ns (100% faster)


def test_node_with_id_zero():
    """Test that node with id 0 is handled correctly."""
    nodes = [
        {"id": 0, "name": "Start"},
        {"id": 1, "name": "End"}
    ]
    edges = [
        {"source": 0, "target": 1}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.50μs -> 750ns (100% faster)


def test_node_with_negative_id():
    """Test that nodes with negative IDs are handled correctly."""
    nodes = [
        {"id": -1, "name": "A"},
        {"id": -2, "name": "B"}
    ]
    edges = [
        {"source": -1, "target": -2}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.46μs -> 875ns (66.7% faster)


def test_node_with_string_id():
    """Test that nodes with string IDs work correctly."""
    nodes = [
        {"id": "start", "name": "A"},
        {"id": "end", "name": "B"}
    ]
    edges = [
        {"source": "start", "target": "end"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.54μs -> 750ns (106% faster)


def test_duplicate_edges():
    """Test that duplicate edges don't affect the result."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"}
    ]
    edges = [
        {"source": 1, "target": 2},
        {"source": 1, "target": 2}  # Duplicate edge
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.54μs -> 792ns (94.7% faster)


def test_edge_with_extra_attributes():
    """Test that edges with extra attributes are handled correctly."""
    nodes = [
        {"id": 1, "name": "A"},
        {"id": 2, "name": "B"}
    ]
    edges = [
        {"source": 1, "target": 2, "weight": 10, "label": "connection"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.46μs -> 708ns (106% faster)


def test_large_linear_chain():
    """Test finding the last node in a large linear chain."""
    # Create a chain of 500 nodes: 1 -> 2 -> 3 -> ... -> 500
    node_count = 500
    nodes = [{"id": i, "name": f"Node_{i}"} for i in range(1, node_count + 1)]
    edges = [{"source": i, "target": i + 1} for i in range(1, node_count)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 4.53ms -> 25.8μs (17456% faster)


def test_large_tree_structure():
    """Test finding the last node in a large tree structure (binary tree)."""
    # Create a binary tree with multiple leaves
    nodes = [{"id": i, "name": f"Node_{i}"} for i in range(1, 100)]
    edges = []
    for i in range(1, 50):  # Internal nodes
        edges.append({"source": i, "target": 2 * i})  # Left child
        edges.append({"source": i, "target": 2 * i + 1})  # Right child
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 100μs -> 4.42μs (2182% faster)


def test_large_multiple_endpoint_graph():
    """Test finding the last node in a graph with many sources converging to one endpoint."""
    # Create 100 source nodes all connecting to node 101
    nodes = [{"id": i, "name": f"Node_{i}"} for i in range(1, 102)]
    edges = [{"source": i, "target": 101} for i in range(1, 101)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 203μs -> 6.25μs (3151% faster)


def test_large_graph_with_disconnected_components():
    """Test finding the last node when graph has multiple disconnected components."""
    # Create two disconnected chains: (1->2->3) and (4->5->6)
    nodes = [
        {"id": 1, "name": "A1"},
        {"id": 2, "name": "A2"},
        {"id": 3, "name": "A3"},
        {"id": 4, "name": "B1"},
        {"id": 5, "name": "B2"},
        {"id": 6, "name": "B3"}
    ]
    edges = [
        {"source": 1, "target": 2},
        {"source": 2, "target": 3},
        {"source": 4, "target": 5},
        {"source": 5, "target": 6}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.92μs -> 875ns (119% faster)


def test_large_wide_graph():
    """Test finding the last node in a wide graph with many nodes at same level."""
    # Create a graph where node 1 connects to 200 nodes, and none of those connect further
    nodes = [{"id": i, "name": f"Node_{i}"} for i in range(1, 202)]
    edges = [{"source": 1, "target": i} for i in range(2, 202)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 8.92μs -> 4.21μs (112% faster)


def test_large_complex_dag():
    """Test finding the last node in a large directed acyclic graph."""
    # Create a DAG with multiple paths converging to a single sink
    nodes = [{"id": i, "name": f"Node_{i}"} for i in range(1, 51)]
    edges = []
    # Connect nodes 1-24 to node 25
    for i in range(1, 25):
        edges.append({"source": i, "target": 25})
    # Connect node 25 to nodes 26-49
    for i in range(26, 50):
        edges.append({"source": 25, "target": i})
    # Connect all nodes 26-49 to node 50
    for i in range(26, 50):
        edges.append({"source": i, "target": 50})
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 76.4μs -> 3.83μs (1893% faster)


def test_performance_large_edges_small_nodes():
    """Test performance with many edges but relatively few nodes."""
    # Create 10 nodes with many edges between them
    nodes = [{"id": i, "name": f"Node_{i}"} for i in range(1, 11)]
    edges = []
    # Create many edges (but maintain acyclic property)
    for i in range(1, 10):
        for j in range(i + 1, 11):
            edges.append({"source": i, "target": j})
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 13.7μs -> 2.17μs (533% faster)


def test_nodes_with_numeric_string_ids():
    """Test with a large number of nodes using numeric string IDs."""
    node_count = 300
    nodes = [{"id": str(i), "name": f"Node_{i}"} for i in range(1, node_count + 1)]
    edges = [{"source": str(i), "target": str(i + 1)} for i in range(1, node_count)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.73ms -> 30.7μs (5540% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
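The generated tests above capture each return value in `codeflash_output` so the tool can compare the original and optimized runs; they contain no explicit assertions. A hand-written variant with plain `assert` statements, which pytest collects as-is, might look like the sketch below. The `find_last_node` defined here is a stand-in reconstructed from the description so the snippet runs on its own; in the repository it would be imported from `src.algorithms.graph` instead.

```python
def find_last_node(nodes, edges):
    # Stand-in for src.algorithms.graph.find_last_node (assumed behavior).
    sources = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in sources), None)


def test_basic_single_last_node():
    nodes = [{"id": "a"}, {"id": "b"}]
    edges = [{"source": "a", "target": "b"}]
    # 'b' has no outgoing edge, so it is the last node.
    assert find_last_node(nodes, edges) == {"id": "b"}


def test_falsy_ids_are_handled_correctly():
    nodes = [{"id": 0}, {"id": ""}, {"id": None}]
    edges = [{"source": 0}]
    # 0 is a source; the empty-string id is the first non-source node.
    assert find_last_node(nodes, edges) == {"id": ""}


def test_all_nodes_are_sources():
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 1}]
    # Every node has an outgoing edge, so there is no last node.
    assert find_last_node(nodes, edges) is None


if __name__ == "__main__":
    test_basic_single_last_node()
    test_falsy_ids_are_handled_correctly()
    test_all_nodes_are_sources()
    print("all assertions passed")
```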

To edit these changes, run `git checkout codeflash/optimize-find_last_node-ml2evagh`, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from KRRT7 January 31, 2026 14:32
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jan 31, 2026