[VL] Support multiple segments per partition in columnar shuffle#11722

Open
guowangy wants to merge 23 commits into apache:main from guowangy:partition-multi-segments

Conversation


@guowangy guowangy commented Mar 9, 2026

What changes are proposed in this pull request?

Introduces multi-segment-per-partition support in the Velox backend columnar shuffle writer, enabling incremental flushing of partition data to the final data file during processing. This reduces peak memory usage without requiring full in-memory buffering or temporary spill files. The implementation reduces total TPC-H (SF6T) latency by ~16% with sort-based shuffle under low memory capacity on a 2-socket Xeon 6960P system.

New index file format (ColumnarIndexShuffleBlockResolver)

Extends IndexShuffleBlockResolver with a new index format supporting multiple (offset, length) segments per partition:

[Partition Index: (N+1) × 8-byte big-endian offsets]
[Segment Data: per-partition list of (data_offset, length) pairs, each 8 bytes]
[1-byte end marker]  ← distinguishes from legacy format (size always multiple of 8)
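To make the layout concrete, here is a minimal Python sketch of a writer and reader for this index format. It assumes the partition-index entries are absolute file offsets delimiting each partition's pair list; that detail, and all function names, are assumptions for illustration — the real resolver is Java code in this PR.

```python
import os
import struct

def write_index(path, segments_per_partition):
    """Write a multi-segment index file in the layout sketched above.

    segments_per_partition: one list of (data_offset, length) pairs per
    partition. The exact field layout is an assumption based on the PR
    description, not the actual ColumnarIndexShuffleBlockResolver code.
    """
    n = len(segments_per_partition)
    with open(path, "wb") as f:
        # Partition index: (N+1) big-endian 8-byte offsets. Entry i points at
        # the start of partition i's segment list (absolute file offset), so
        # consecutive entries delimit each partition's pairs.
        offset = (n + 1) * 8
        offsets = [offset]
        for segs in segments_per_partition:
            offset += len(segs) * 16  # each (data_offset, length) pair is 2x8 bytes
            offsets.append(offset)
        for off in offsets:
            f.write(struct.pack(">q", off))
        # Segment data: per-partition list of (data_offset, length) pairs.
        for segs in segments_per_partition:
            for data_offset, length in segs:
                f.write(struct.pack(">qq", data_offset, length))
        # 1-byte end marker: the file size is no longer a multiple of 8,
        # which is how the new format is told apart from the legacy one.
        f.write(b"\x01")

def is_multi_segment_index(path):
    # Legacy index files are a pure sequence of 8-byte longs.
    return os.path.getsize(path) % 8 != 0

def read_partition_segments(path, partition_id):
    with open(path, "rb") as f:
        data = f.read()
    start, end = struct.unpack_from(">qq", data, partition_id * 8)
    return [struct.unpack_from(">qq", data, pos) for pos in range(start, end, 16)]
```

The end-marker trick works because both the legacy header and every (offset, length) entry are 8-byte aligned, so a legacy file's size is always a multiple of 8.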

ColumnarShuffleManager now uses this resolver. Multi-segment mode activates only when external shuffle service, push-based shuffle, and dictionary encoding are all disabled (dictionary encoding requires all batches to be complete before writing).

New I/O abstractions

  • FileSegmentsInputStream — InputStream over non-contiguous (offset, size) file segments; supports zero-copy native reads via read(destAddress, maxSize)
  • FileSegmentsManagedBuffer — ManagedBuffer backed by discontiguous segments; supports nioByteBuffer(), createInputStream(), convertToNetty()
  • DiscontiguousFileRegion — Netty FileRegion mapping a logical range to multiple physical segments for zero-copy network transfer
  • LowCopyFileSegmentsJniByteInputStream — zero-copy JNI wrapper over FileSegmentsInputStream; wired into JniByteInputStreams.create()
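The core idea of the first abstraction — one logical stream over scattered file ranges — can be sketched in Python. The real FileSegmentsInputStream is a Java class, so the name and read behavior here are only loosely mirrored, not the actual API:

```python
class FileSegmentsInputStream:
    """Sketch of a stream over non-contiguous (offset, size) file segments.

    Illustrative only: the real FileSegmentsInputStream in this PR is Java;
    this mirrors its described behavior, not its actual interface.
    """
    def __init__(self, path, segments):
        self._f = open(path, "rb")
        self._segments = list(segments)  # [(offset, size), ...] in read order
        self._idx = 0                    # index of the current segment
        self._pos = 0                    # bytes already consumed within it

    def read(self, max_size=-1):
        """Read up to max_size bytes (all remaining if negative),
        transparently crossing segment boundaries."""
        out = bytearray()
        while self._idx < len(self._segments):
            off, size = self._segments[self._idx]
            remaining = size - self._pos
            if remaining == 0:
                self._idx += 1           # current segment exhausted: advance
                self._pos = 0
                continue
            want = remaining if max_size < 0 else min(remaining, max_size - len(out))
            if want <= 0:
                break
            self._f.seek(off + self._pos)
            chunk = self._f.read(want)
            out += chunk
            self._pos += len(chunk)
        return bytes(out)

    def close(self):
        self._f.close()
```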

C++ LocalPartitionWriter changes

  • usePartitionMultipleSegments_ flag + partitionSegments_ vector tracking (start, length) per partition
  • flushCachedPayloads() — incremental flush after each hashEvict
  • writeMemoryPayload() — direct write to final data file during sortEvict
  • writeIndexFile() — serializes the new index at stop time
  • PayloadCache::writeIncremental() — flushes completed (non-active) partitions without touching the in-use partition
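The bookkeeping these pieces imply can be sketched in Python (all names illustrative; the real code is the C++ LocalPartitionWriter): each flushed payload is appended to the final data file and its (start, length) is recorded under its partition, which is exactly what the index file serializes at stop time.

```python
class MultiSegmentWriter:
    """Sketch of the (start, length) bookkeeping described above.

    Mirrors the idea of partitionSegments_ / flushCachedPayloads() in the
    C++ LocalPartitionWriter; names here are illustrative, not the real API.
    """
    def __init__(self, data_file, num_partitions):
        self._f = open(data_file, "wb")
        # One list of (start, length) segments per partition.
        self.partition_segments = [[] for _ in range(num_partitions)]

    def flush_payload(self, partition_id, payload):
        # Write the payload straight to the final data file and record
        # where it landed, instead of caching or spilling it.
        start = self._f.tell()
        self._f.write(payload)
        self.partition_segments[partition_id].append((start, len(payload)))

    def close(self):
        self._f.close()
```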

JNI/JVM wiring

LocalPartitionWriterJniWrapper and JniWrapper.cc accept a new optional indexFile parameter; ColumnarShuffleWriter passes the temp index file path when multi-segment mode is active.

How was this patch tested?

New unit test suites:

  • ColumnarIndexShuffleBlockResolverSuite — index format read/write, format detection, multi-segment block lookup
  • FileSegmentsInputStreamSuite — sequential reads, multi-segment traversal, skip, zero-copy native reads
  • FileSegmentsManagedBufferSuite — nioByteBuffer, createInputStream, convertToNetty, EOF and mmap edge cases
  • DiscontiguousFileRegionSuite — Netty transfer across discontiguous segments, lazy open
  • LowCopyFileSegmentsJniByteInputStreamTest — JNI wrapper correctness for ByteInputStream

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Mar 9, 2026

github-actions bot commented Mar 9, 2026

Run Gluten Clickhouse CI on x86

@zhouyuan zhouyuan requested a review from marin-ma March 10, 2026 15:43
@marin-ma (Contributor) left a comment:


@guowangy Thanks for contributing this feature. Please check my comments below.

#endif
}

arrow::Status LocalPartitionWriter::writeIndexFile() {

Can you add some C++ unit tests for the multi-segment partition write?


if (usePartitionMultipleSegments_) {
// If multiple segments per partition is enabled, write directly to the final data file.
RETURN_NOT_OK(writeMemoryPayload(partitionId, std::move(inMemoryPayload)));

Can you explain a bit more on how this reduces memory usage? It looks like the memory still only gets reclaimed on OOM and spilling.

@guowangy (Author) replied:

The memory-usage reduction does not apply to sortEvict, since sortEvict is usually triggered by a spill.
But it does apply to hashEvict, because payloads no longer need to be cached in memory.

RETURN_NOT_OK(payloadCache_->cache(partitionId, std::move(payload)));
}
if (usePartitionMultipleSegments_) {
RETURN_NOT_OK(flushCachedPayloads());
@marin-ma (Contributor) replied:

hashEvict is not only called for spilling. When the evictType is kCache, it tries to cache as much payload in memory as possible to reduce spilling.

And when the evictType is kSpill, the data is written to a spilled data file. The two evict types can exist in the same job. Is evictType == kSpill properly handled for multi-segment writes?

@guowangy (Author) replied:

Yes, evictType == kSpill is properly handled.
When usePartitionMultipleSegments_ is enabled, the caching mechanism of payloadCache_ is not used: payloads are flushed directly to disk, so we don't need to distinguish between kCache and kSpill.
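The control flow described in this exchange can be sketched as follows; the function and flag names are assumptions mirroring the C++ identifiers, not real Gluten APIs:

```python
def evict(payload, evict_type, use_partition_multiple_segments,
          write_direct, cache, spill):
    """Illustrative evict dispatch: with multi-segment mode enabled,
    both kCache and kSpill evictions flush straight to the final file."""
    if use_partition_multiple_segments:
        return write_direct(payload)   # direct write; no interim spill file
    if evict_type == "kCache":
        return cache(payload)          # keep in memory to reduce spilling
    return spill(payload)              # kSpill: write a temporary spill file
```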

@marin-ma (Contributor):

The implementation reduces total TPC-H (SF6T) latency by ~16% with sort-based shuffle under low memory capacity on a 2-socket Xeon 6960P system.

Can you explain where this improvement mainly comes from?

Currently we follow the same file layout as vanilla Spark, keeping each partition's output contiguous. I think one major benefit of this design is that it reduces small random disk I/O on the shuffle reader side. If memory is tight, spills are triggered more frequently, making it more likely to produce small output blocks for each partition. In this case, this design will not be I/O friendly.

@guowangy (Author) replied:

The implementation reduces total TPC-H (SF6T) latency by ~16% with sort-based shuffle under low memory capacity on a 2-socket Xeon 6960P system.

Can you explain where this improvement mainly comes from?

Currently we follow the same file layout as vanilla Spark, keeping each partition's output contiguous. I think one major benefit of this design is that it reduces small random disk I/O on the shuffle reader side. If memory is tight, spills are triggered more frequently, making it more likely to produce small output blocks for each partition. In this case, this design will not be I/O friendly.

In the existing design, if a spill happens, interim blocks of partition data are saved to disk; at the end of the shuffle write, the interim data is reloaded from disk and written out as the final partition data. This optimization avoids those interim write/reload operations: when a spill happens, data is saved directly as final partition blocks.

As for I/O friendliness: in the existing design, if spills are triggered frequently, there are also many small interim blocks saved to disk, and reloading them and packing them into the final partition data is not I/O friendly either. This PR does not make I/O friendliness worse; it just moves that situation from the shuffle writer to the reader.
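Under the simplified assumption that, in the old design, every spilled byte is written once to an interim spill file and once more to the final data file (ignoring reads), the write-amplification difference argued above can be sketched as:

```python
def bytes_written_existing(spill_sizes):
    # Existing design (hypothetical accounting): each spilled byte goes to an
    # interim file, is reloaded, and is then rewritten into the final file.
    return 2 * sum(spill_sizes)

def bytes_written_multi_segment(spill_sizes):
    # Multi-segment design: each byte is written once, directly to the
    # final data file as its own (offset, length) segment.
    return sum(spill_sizes)
```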

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions github-actions bot added the DOCS label Mar 11, 2026
@marin-ma (Contributor):

@guowangy I drew a graph based on my understanding. Please correct me if I'm wrong: the existing design has only one random I/O access to one mapper output per reducer, but the new design has more random I/O accesses when reading all the segments from one mapper per reducer.



guowangy commented Mar 11, 2026

@guowangy I drew a graph based on my understanding. Please correct me if I'm wrong: the existing design has only one random I/O access to one mapper output per reducer, but the new design has more random I/O accesses when reading all the segments from one mapper per reducer.

Yes, for shuffle reader, that's true.

@marin-ma (Contributor):

@guowangy In general, random I/O is considered a bottleneck in shuffle, and that is what so many remote shuffle service projects and solutions like Celeborn and Uniffle are aimed at. A remote shuffle service usually coalesces the shuffle outputs on the mapper side to reduce random I/O access. However, the design in this PR seems to go in the opposite direction, since it may introduce more random I/O during reads.

Directly writing the segments to the data file would make the partition writer logic simpler, but we intentionally didn't choose that approach based on the above consideration. I'm not sure whether your test ran on a single node or on a cluster. If it was on a single node where disk I/O is not a bottleneck, the solution may not be practical in real use cases.

Besides, based on our experience, an external shuffle service is usually enabled in real production environments because it provides better stability when an executor process goes down; it's more of a must-have feature that the shuffle framework should support.

@marin-ma (Contributor):

cc: @FelixYBW
