Fix BloomFilter buffer incompatibility between Spark and Comet #3003
Conversation
Handle Spark's full serialization format (12-byte header + bits) in merge_filter() to support Spark partial / Comet final execution. The fix automatically detects the format and extracts bits data accordingly. Fixes apache#2889
Thanks @Shekharrajak. Looks like you need to run

Done. Please help trigger the workflow. Thanks!

@Shekharrajak there are compilation errors

@andygrove, it is now looking fine locally. Is there a way to run all the workflow checks locally, so we can make sure everything is fine before running the workflow on GitHub?
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##              main    #3003    +/-   ##
============================================
- Coverage     56.12%   54.57%   -1.56%
- Complexity      976     1261     +285
============================================
  Files           119      167      +48
  Lines         11743    15556    +3813
  Branches       2251     2584     +333
============================================
+ Hits           6591     8490    +1899
- Misses         4012     5844    +1832
- Partials       1140     1222      +82

☔ View full report in Codecov by Sentry.
I think in the compilation error case that should be pretty reproducible locally. I definitely recommend running
In the absence of any new tests, it feels like we should be relaxing a fallback constraint in operators.scala or modifying existing tests to exercise this behavior. Otherwise I suspect we're still falling back. @andygrove do you recall where we might want to make changes to test this behavior? |
I think this is the condition: https://github.com/apache/datafusion-comet/blob/main/spark/src/main/scala/org/apache/spark/sql/comet/operators.scala#L1074 |
mbutrovich left a comment:
Thanks @Shekharrajak! Can you remove any relevant fallbacks and/or modify tests so we know that we're exercising this behavior?
Can we consider adding a pre-commit hook?

Done, please trigger the workflow.
// Check if the incoming bloom filter has compatible size
let incoming_bits_size = bits_end - bits_start;
if incoming_bits_size != expected_bits_size {
    panic!(
Can we use CometError::Internal(String) instead of panic!? (You'll need to return a Result)
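A minimal sketch of that suggestion, assuming the CometError::Internal(String) variant named above; the helper function and error message here are illustrative, not the PR's actual diff:

#[derive(Debug)]
enum CometError {
    Internal(String),
}

// Return an error to the caller instead of aborting the process with panic!.
fn check_bits_size(expected_bits_size: usize, incoming_bits_size: usize) -> Result<(), CometError> {
    if incoming_bits_size != expected_bits_size {
        return Err(CometError::Internal(format!(
            "BloomFilter merge: expected {expected_bits_size} bytes of bits, got {incoming_bits_size}"
        )));
    }
    Ok(())
}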
let expected_bits_size = self.bits.byte_size();
const SPARK_HEADER_SIZE: usize = 12; // version (4) + num_hash_functions (4) + num_words (4)

let bits_data = if other.len() >= SPARK_HEADER_SIZE {
Should this be strictly greater than SPARK_HEADER_SIZE?
let version = i32::from_be_bytes([
    other[0], other[1], other[2], other[3],
]);
if version == SPARK_BLOOM_FILTER_VERSION_1 {
Is this sufficient to ensure that this is a spark bloom filter? Isn't there a chance the starting 4 bytes of the Comet bloom filter might match the pattern?
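A stricter check could combine the version tag with the exact expected length, addressing both this comment and the length question above. A sketch, assuming Spark's V1 version tag is the integer 1 (the helper name is hypothetical):

const SPARK_HEADER_SIZE: usize = 12;
const SPARK_BLOOM_FILTER_VERSION_1: i32 = 1; // assumption: Spark's Version.V1

// Only treat the buffer as Spark's full format when the length is exactly
// header + bits AND the leading four bytes decode to the V1 tag, making an
// accidental match on a bits-only Comet buffer far less likely.
fn is_spark_format(other: &[u8], expected_bits_size: usize) -> bool {
    other.len() == SPARK_HEADER_SIZE + expected_bits_size
        && i32::from_be_bytes([other[0], other[1], other[2], other[3]])
            == SPARK_BLOOM_FILTER_VERSION_1
}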
Rationale for this change

- Spark's serialize() returns the full format: a 12-byte header (version + numHashFunctions + numWords) followed by the bits data.
- Comet's state_as_bytes() returns the bits data only.
- When a Spark partial aggregate sends the full format, Comet's merge_filter() expects bits-only data, causing a mismatch.

Ref: https://github.com/apache/spark/blob/master/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java#L99
Ref: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L219

Spark format: BloomFilterImpl.writeTo() writes the version and numHashFunctions (4+4 bytes), then BitArray.writeTo() writes numWords (4 bytes) followed by the bits; a header-reading sketch follows below.
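As a rough illustration of that layout, a sketch that reads the three big-endian header ints, per the linked Spark sources (the helper name is hypothetical):

// Read Spark's 12-byte header: version, numHashFunctions, and numWords,
// each a big-endian i32 written by Java's DataOutputStream, ahead of the bits.
fn parse_spark_header(buf: &[u8]) -> Option<(i32, i32, i32)> {
    if buf.len() < 12 {
        return None;
    }
    let be = |s: &[u8]| i32::from_be_bytes([s[0], s[1], s[2], s[3]]);
    Some((be(&buf[0..4]), be(&buf[4..8]), be(&buf[8..12])))
}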
What changes are included in this PR?

- Detects the Spark format (buffer size = 12 + expected_bits_size).
- Extracts the bits data by skipping the 12-byte header if it is the Spark format.
- Returns the bits as-is if it is the Comet format.

A condensed sketch of this detection logic follows below.
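A minimal, self-contained sketch of that logic; the function name and signature are illustrative rather than the PR's exact merge_filter() code:

// Return the bits portion of an incoming serialized bloom filter, stripping
// Spark's 12-byte header when the full serialization format is detected.
fn extract_bits(other: &[u8], expected_bits_size: usize) -> &[u8] {
    const SPARK_HEADER_SIZE: usize = 12; // version (4) + num_hash_functions (4) + num_words (4)
    if other.len() == SPARK_HEADER_SIZE + expected_bits_size {
        // Spark's full format: skip the header, keep only the bits.
        &other[SPARK_HEADER_SIZE..]
    } else {
        // Comet's state_as_bytes() format: bits only, use as-is.
        other
    }
}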
How are these changes tested?
Spark SQL test