[bug] Fix batch INSERT_OVERWRITE replacecommits dropping adds in HudiDataFileExtractor#816

Open
vinishjail97 wants to merge 1 commit into apache:main from vinishjail97:fix-insert-overwrite-batch-replacecommits

@vinishjail97
Contributor

Problem

When XTable processes multiple INSERT_OVERWRITE replacecommit instants in a single batch (e.g., commit A replaces the initial file groups, then commit B replaces A's file groups on the same partition), getUpdatesToPartitionForReplaceCommit uses getReplacedFileGroupsBeforeOrOn(A.timestamp) to find replaced file groups. This excludes A's own file groups because they were replaced by B (and B's timestamp > A's), and getAllFileGroups also excludes them (globally marked replaced). The result: replacecommit A emits 0 adds, causing downstream format sync failures.

Fix

Change getReplacedFileGroupsBeforeOrOn to getAllReplacedFileGroups in the replace commit handler. This is consistent with how getUpdatesToPartition already handles regular commits (where all file groups across the timeline are considered). The newFileIds and replacedFileIds sets derived from each commit's write stats already ensure only the correct files are emitted as adds or removes for that specific commit.
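To make the before/after behavior concrete, here is a minimal, self-contained sketch of the scenario described above. It is a toy model, not the XTable or Hudi code: the `FileGroup` class, the `replacedAt` field, the helper methods, and the file IDs (`fg-initial`, `fg-A`, `fg-B`) are all hypothetical stand-ins that only mimic the filtering semantics described in this PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReplaceCommitSketch {
  /** Hypothetical file group: replacedAt is null while the group is still live. */
  static final class FileGroup {
    final String fileId;
    final Long replacedAt;

    FileGroup(String fileId, Long replacedAt) {
      this.fileId = fileId;
      this.replacedAt = replacedAt;
    }
  }

  // Assumed timeline: initial write (t=1), replacecommit A (t=2), replacecommit B (t=3).
  // A replaces the initial group at t=2; B replaces A's group at t=3.
  static final List<FileGroup> TIMELINE = Arrays.asList(
      new FileGroup("fg-initial", 2L),
      new FileGroup("fg-A", 3L),
      new FileGroup("fg-B", null));

  // Stand-in for getAllFileGroups: replaced groups are excluded.
  public static Stream<FileGroup> liveFileGroups() {
    return TIMELINE.stream().filter(fg -> fg.replacedAt == null);
  }

  // Stand-in for getReplacedFileGroupsBeforeOrOn(timestamp).
  public static Stream<FileGroup> replacedBeforeOrOn(long timestamp) {
    return TIMELINE.stream()
        .filter(fg -> fg.replacedAt != null && fg.replacedAt <= timestamp);
  }

  // Stand-in for getAllReplacedFileGroups.
  public static Stream<FileGroup> allReplaced() {
    return TIMELINE.stream().filter(fg -> fg.replacedAt != null);
  }

  /** Computes the adds emitted for replacecommit A (t=2) under either lookup. */
  public static List<String> addsForCommitA(boolean useAllReplaced) {
    // Sets that would be derived from commit A's write stats.
    Set<String> newFileIds = Collections.singleton("fg-A");
    Set<String> replacedFileIds = Collections.singleton("fg-initial");

    Stream<FileGroup> replaced = useAllReplaced ? allReplaced() : replacedBeforeOrOn(2L);
    return Stream.concat(liveFileGroups(), replaced)
        .map(fg -> fg.fileId)
        // A candidate is an add only if this commit wrote it and did not replace it.
        .filter(id -> !replacedFileIds.contains(id) && newFileIds.contains(id))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println("before fix: " + addsForCommitA(false));
    System.out.println("after fix:  " + addsForCommitA(true));
  }
}
```

In this toy model the time-bounded lookup never surfaces `fg-A` (it was replaced at t=3, after A), so commit A yields no adds. The unbounded lookup does surface it, and the `newFileIds` filter keeps `fg-B` and any other commit's files out of A's result.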

Tests

  • Added insertOverwrite() helper to TestSparkHudiTable
  • Added testMultipleInsertOverwriteOnSamePartitions integration test in ITHudiConversionSource that reproduces the bug and validates the fix
  • New test fails without the fix (replacecommit A has expected: <[...parquet]> but was: <[]>) and passes with it

…leExtractor

When XTable processes multiple INSERT_OVERWRITE replacecommit instants in a single
batch (e.g., A replaces initial, B replaces A on the same partition),
getUpdatesToPartitionForReplaceCommit uses getReplacedFileGroupsBeforeOrOn(A.timestamp)
which excludes A's file groups (replaced by B, and B > A). Combined with
getAllFileGroups also excluding them (globally marked replaced), replacecommit A
emits 0 adds — causing downstream integrity failures.

Fix: change getReplacedFileGroupsBeforeOrOn -> getAllReplacedFileGroups in the
replace commit handler, consistent with how getUpdatesToPartition already handles
regular commits. The newFileIds/replacedFileIds sets from the commit's write stats
already ensure only the correct files are added or removed.

Added insertOverwrite() test helper to TestSparkHudiTable and an integration test
testMultipleInsertOverwriteOnSamePartitions that reproduces and validates the fix.
@vinishjail97 vinishjail97 changed the title Fix batch INSERT_OVERWRITE replacecommits dropping adds in HudiDataFileExtractor [bug] Fix batch INSERT_OVERWRITE replacecommits dropping adds in HudiDataFileExtractor Mar 10, 2026