add TableProvider to enable future row-by-row streaming #2189

dayesouza · 2026-01-26T22:50:43Z

Description

This PR implements Phase 1 of the storage migration plan to support streaming row-by-row operations for larger datasets with reduced memory pressure. It introduces the TableProvider abstraction layer that sits between GraphRAG workflows and the underlying storage mechanism, enabling future database backends and streaming operations.

All indexing workflows have been migrated from direct Storage calls to the TableProvider.read_dataframe() and write_dataframe() methods. Backward compatibility is maintained through wrapper functions.

Proposed Changes

Core Infrastructure

New: TableProvider abstract base class in graphrag-storage/tables/
- read_dataframe(table_name) - read entire table as DataFrame
- write_dataframe(table_name, df) - write entire table as DataFrame
- has_dataframe(table_name) - check if table exists
- find_tables() - list all available tables

New: ParquetTableProvider implementation
- Wraps underlying Storage instance (file/blob/cosmos)
- Converts between DataFrames and Parquet format
- Handles BytesIO streaming for in-memory operations
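
To make the interface listed above concrete, here is a minimal sketch of what such an abstraction could look like. It is not the code from this PR: the method names come from the description, but the signatures, the choice of async versus sync for each method, and the InMemoryTableProvider class are assumptions made for illustration. A plain list of dict rows stands in for a pandas DataFrame so the sketch has no third-party dependencies.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List

# Assumption: a list of row dicts stands in for pandas.DataFrame.
Frame = List[Dict[str, Any]]


class TableProvider(ABC):
    """Hypothetical shape of the interface, based on the four listed methods."""

    @abstractmethod
    async def read_dataframe(self, table_name: str) -> Frame:
        """Read an entire table."""

    @abstractmethod
    async def write_dataframe(self, table_name: str, df: Frame) -> None:
        """Write an entire table, replacing any existing one."""

    @abstractmethod
    async def has_dataframe(self, table_name: str) -> bool:
        """Check whether a table exists."""

    @abstractmethod
    def find_tables(self) -> List[str]:
        """List all available table names."""


class InMemoryTableProvider(TableProvider):
    """Illustrative backend (not part of the PR) that keeps tables in a dict."""

    def __init__(self) -> None:
        self._tables: Dict[str, Frame] = {}

    async def read_dataframe(self, table_name: str) -> Frame:
        if table_name not in self._tables:
            raise KeyError(f"table not found: {table_name}")
        # Return copies so callers cannot mutate the stored rows.
        return [dict(row) for row in self._tables[table_name]]

    async def write_dataframe(self, table_name: str, df: Frame) -> None:
        self._tables[table_name] = [dict(row) for row in df]

    async def has_dataframe(self, table_name: str) -> bool:
        return table_name in self._tables

    def find_tables(self) -> List[str]:
        return sorted(self._tables)


if __name__ == "__main__":
    async def demo() -> None:
        provider = InMemoryTableProvider()
        await provider.write_dataframe("documents", [{"id": 1, "text": "hello"}])
        print(await provider.has_dataframe("documents"))
        print(provider.find_tables())

    asyncio.run(demo())
```

Because workflows would depend only on the abstract class, a backend that streams rows or talks to a database could later be swapped in without touching workflow code, which is the stated motivation for this phase.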

Context Changes

Updated PipelineRunContext to include:
- input_table_provider: ParquetTableProvider
- output_table_provider: ParquetTableProvider
- previous_table_provider: ParquetTableProvider | None (for update mode)

Renamed previous_storage → previous_table_provider for consistency
Note: output_storage remains in context for JSON files (stats.json, context.json) and GraphML snapshots
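
The context fields described above could be modeled along these lines. This is a hedged sketch only: PipelineRunContextSketch, StubTableProvider, and is_update_run are hypothetical names invented for illustration, and the real PipelineRunContext likely carries additional fields that are not shown here.

```python
from dataclasses import dataclass
from typing import Any, Optional


class StubTableProvider:
    """Hypothetical placeholder for ParquetTableProvider."""

    def __init__(self, label: str) -> None:
        self.label = label


@dataclass
class PipelineRunContextSketch:
    """Illustrative subset of the context fields named in this PR."""

    input_table_provider: StubTableProvider
    output_table_provider: StubTableProvider
    # Only present in update mode, per the description above.
    previous_table_provider: Optional[StubTableProvider] = None
    # Still used for JSON files and GraphML snapshots.
    output_storage: Any = None


def is_update_run(context: PipelineRunContextSketch) -> bool:
    """Assumed helper: treat a non-None previous provider as update mode."""
    return context.previous_table_provider is not None
```

Keeping previous_table_provider optional means workflows that only run during a full index never need to handle it, while update workflows can check for it explicitly.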

Pattern change:

# Before
documents = await load_table_from_storage("documents", context.output_storage)
await write_table_to_storage(output, "text_units", context.output_storage)

# After
documents = await context.output_table_provider.read_dataframe("documents")
await context.output_table_provider.write_dataframe("text_units", output)
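
The description mentions wrapper functions that keep the old call style working. One way such wrappers might delegate to a provider is sketched below. The function names and argument order match the "Before" snippet, but the bodies are assumptions, as are the DictStorage and StorageBackedTableProvider stand-ins. Parquet serialization is omitted so the sketch stays dependency-free.

```python
import asyncio
from typing import Any, Dict


class DictStorage:
    """Hypothetical stand-in for a Storage instance (file/blob/cosmos)."""

    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}

    async def get(self, key: str) -> Any:
        return self._items[key]

    async def set(self, key: str, value: Any) -> None:
        self._items[key] = value


class StorageBackedTableProvider:
    """Illustrative provider wrapping a storage object.

    A real implementation would convert to and from Parquet bytes here.
    """

    def __init__(self, storage: DictStorage) -> None:
        self._storage = storage

    async def read_dataframe(self, table_name: str) -> Any:
        return await self._storage.get(f"{table_name}.parquet")

    async def write_dataframe(self, table_name: str, df: Any) -> None:
        await self._storage.set(f"{table_name}.parquet", df)


async def load_table_from_storage(name: str, storage: DictStorage) -> Any:
    """Old-style reader that delegates to the provider."""
    return await StorageBackedTableProvider(storage).read_dataframe(name)


async def write_table_to_storage(table: Any, name: str, storage: DictStorage) -> None:
    """Old-style writer; note the (table, name) order versus (name, df)."""
    await StorageBackedTableProvider(storage).write_dataframe(name, table)
```

One detail to watch when migrating call sites: the old writer takes the table first and the name second, while write_dataframe takes the name first, so the arguments swap positions.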

natoverse

Be sure to check docs/examples_notebooks as well.

.semversioner/next-release/minor-20260126224712110537.json

packages/graphrag-storage/graphrag_storage/tables/__init__.py

packages/graphrag-storage/graphrag_storage/__init__.py

packages/graphrag/graphrag/utils/storage.py

dayesouza requested a review from a team as a code owner January 26, 2026 22:50

natoverse reviewed Jan 26, 2026


dayesouza requested a review from natoverse January 27, 2026 14:38

Base automatically changed from v3/main to main January 27, 2026 18:23

dayesouza added 7 commits January 27, 2026 21:00

write dataframe

befab25

changed some workflows

5a0e5ea

1a

277977e

add fixed files

6c2e04f

add versioning

0ccc8d6

add patch and remove utility

5018871

pr changes

914fb63

dayesouza force-pushed the tablesp branch from 7cfc2e6 to 914fb63 Compare January 27, 2026 21:57

Merge branch 'main' into tablesp

fc106a3
