disable wrong parallel decompression for LOAD DATA #23620

Merged
mergify[bot] merged 5 commits into matrixorigin:main from robll-v1:fix/load on Jan 29, 2026

@robll-v1
Collaborator

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #23618

What this PR does / why we need it:

Summary

Parallel LOAD DATA on compressed files is unsafe: the current implementation splits the input by byte offsets and then tries to locate line boundaries in the compressed stream. This can cut rows in the middle, misalign columns, and ultimately write corrupted values (e.g., invalid DATE/DECIMAL). The corruption surfaces later as panics during query output (e.g., Date.ToBytes out-of-range).

Root Cause

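ReadFileOffset computes offsets as Offset = previous + batchSize and then applies getTailSize to the compressed reader. This is not a valid line-boundary strategy for compressed streams, because a byte offset in the compressed file does not correspond to a row boundary in the decompressed data. The result is row corruption.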

Fix (medium-cost path)

  • Keep parallel write, but disable parallel splitting for compressed files.
  • For compressed data, the load is handled by a single reader (proper decompression + parsing), and the insert stage still runs in parallel.
  • Avoid the “string vector + cast projection” path for compressed files to prevent misalignment.

Tests

  • Added a unit test that asserts ReadFileOffset rejects compressed input.
  • This locks in the rejection of the unsafe path and guards against regression.
  • Note: If we later refactor to truly support parallel splitting for compressed files (e.g., with a splittable format), this test must be updated or removed.

Behavior After Fix

  • Compressed LOAD DATA no longer panics.
  • Parallelism is preserved on the write side; read-side splitting is disabled for compressed inputs.


@mergify mergify bot merged commit 1ac8002 into matrixorigin:main Jan 29, 2026
23 of 24 checks passed
@mergify mergify bot removed the queued label Jan 29, 2026

Labels

kind/bug (Something isn't working)
size/M (Denotes a PR that changes [100,499] lines)
