@dayesouza
Contributor

Description

This PR implements Phase 1 of the storage migration plan to support streaming row-by-row operations for larger datasets with reduced memory pressure. It introduces the TableProvider abstraction layer that sits between GraphRAG workflows and the underlying storage mechanism, enabling future database backends and streaming operations.

All indexing workflows have been migrated from direct Storage calls to use TableProvider.read_dataframe() and write_dataframe() methods, while maintaining backward compatibility through wrapper functions.

Proposed Changes

Core Infrastructure

  • New: TableProvider abstract base class in graphrag-storage/tables/

    • read_dataframe(table_name) - read entire table as DataFrame
    • write_dataframe(table_name, df) - write entire table as DataFrame
    • has_dataframe(table_name) - check if table exists
    • find_tables() - list all available tables
  • New: ParquetTableProvider implementation

    • Wraps underlying Storage instance (file/blob/cosmos)
    • Converts between DataFrames and Parquet format
    • Handles BytesIO streaming for in-memory operations
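For reviewers who want a concrete picture of the interface listed above, here is a minimal sketch, not the code in this PR. The four method names come from the description. Everything else is an assumption made for illustration: `InMemoryStorage` and its `get`/`set`/`has`/`keys` methods are a hypothetical stand-in for the real Storage class, the `.parquet` key suffix is a guess, and so is the choice to make `find_tables()` synchronous while the other methods are async. The Parquet round trip needs `pyarrow` installed.

```python
from abc import ABC, abstractmethod
from io import BytesIO

import pandas as pd


class TableProvider(ABC):
    """Table-level interface between workflows and the storage backend."""

    @abstractmethod
    async def read_dataframe(self, table_name: str) -> pd.DataFrame: ...

    @abstractmethod
    async def write_dataframe(self, table_name: str, df: pd.DataFrame) -> None: ...

    @abstractmethod
    async def has_dataframe(self, table_name: str) -> bool: ...

    @abstractmethod
    def find_tables(self) -> list[str]: ...


class InMemoryStorage:
    """Hypothetical stand-in for Storage (file/blob/cosmos): a dict of bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    async def get(self, key: str) -> bytes:
        return self._blobs[key]

    async def set(self, key: str, value: bytes) -> None:
        self._blobs[key] = value

    async def has(self, key: str) -> bool:
        return key in self._blobs

    def keys(self) -> list[str]:
        return list(self._blobs)


class ParquetTableProvider(TableProvider):
    """Stores each table as one Parquet blob in the wrapped storage."""

    def __init__(self, storage: InMemoryStorage) -> None:
        self._storage = storage

    @staticmethod
    def _key(table_name: str) -> str:
        # Assumed naming scheme: one "<table>.parquet" blob per table.
        return f"{table_name}.parquet"

    async def read_dataframe(self, table_name: str) -> pd.DataFrame:
        key = self._key(table_name)
        if not await self._storage.has(key):
            raise ValueError(f"Table not found: {table_name}")
        # Wrap the raw bytes in BytesIO so pandas can read them like a file.
        return pd.read_parquet(BytesIO(await self._storage.get(key)))

    async def write_dataframe(self, table_name: str, df: pd.DataFrame) -> None:
        # Serialize into an in-memory buffer, then hand the bytes to storage.
        buffer = BytesIO()
        df.to_parquet(buffer)
        await self._storage.set(self._key(table_name), buffer.getvalue())

    async def has_dataframe(self, table_name: str) -> bool:
        return await self._storage.has(self._key(table_name))

    def find_tables(self) -> list[str]:
        return sorted(
            key.removesuffix(".parquet")
            for key in self._storage.keys()
            if key.endswith(".parquet")
        )
```

Because workflows only see `TableProvider`, a database-backed or streaming provider could later replace `ParquetTableProvider` without touching workflow code, which is the motivation stated in the description.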

Context Changes

  • Updated PipelineRunContext to include:

    • input_table_provider: ParquetTableProvider
    • output_table_provider: ParquetTableProvider
    • previous_table_provider: ParquetTableProvider | None (for update mode)
  • Renamed previous_storage → previous_table_provider for consistency

  • Note: output_storage remains in context for JSON files (stats.json, context.json) and GraphML snapshots

Pattern change:

# Before
documents = await load_table_from_storage("documents", context.output_storage)
await write_table_to_storage(output, "text_units", context.output_storage)

# After
documents = await context.output_table_provider.read_dataframe("documents")
await context.output_table_provider.write_dataframe("text_units", output)
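
To show how the context fields and the new call pattern fit together, here is a self-contained toy sketch. It is illustrative only: `DictTableProvider` is a hypothetical in-memory stand-in for ParquetTableProvider, the `PipelineRunContext` below carries only the three provider fields named in this PR (the real context has more), and `create_text_units` with its column rename is an invented transformation, not the actual workflow.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


class DictTableProvider:
    """Hypothetical in-memory provider with the read/write methods used above."""

    def __init__(self) -> None:
        self._tables: dict[str, pd.DataFrame] = {}

    async def read_dataframe(self, table_name: str) -> pd.DataFrame:
        # Return a copy so callers cannot mutate the stored table.
        return self._tables[table_name].copy()

    async def write_dataframe(self, table_name: str, df: pd.DataFrame) -> None:
        self._tables[table_name] = df.copy()


@dataclass
class PipelineRunContext:
    """Reduced context: only the provider fields listed in this PR."""

    input_table_provider: DictTableProvider
    output_table_provider: DictTableProvider
    # Assumed to be populated only when running in update mode.
    previous_table_provider: Optional[DictTableProvider] = None


async def create_text_units(context: PipelineRunContext) -> pd.DataFrame:
    """Toy workflow: read one table, derive another, write it back."""
    documents = await context.output_table_provider.read_dataframe("documents")
    # Invented transformation, just to have something to write.
    output = documents.rename(columns={"id": "document_id"})[["document_id", "text"]]
    await context.output_table_provider.write_dataframe("text_units", output)
    return output
```

The workflow never touches a storage object directly, so swapping the provider in the context is enough to redirect where tables are read from and written to.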

@dayesouza dayesouza requested a review from a team as a code owner January 26, 2026 22:50
Collaborator

@natoverse natoverse left a comment

Be sure to check the docs/examples_notebooks as well.

@dayesouza dayesouza requested a review from natoverse January 27, 2026 14:38
Base automatically changed from v3/main to main January 27, 2026 18:23