Skip to content

Reservoir sampling for high-depth pileup#588

Open
stein-andrew wants to merge 1 commit intonanoporetech:masterfrom
stein-andrew:reservoir-sampling
Open

Reservoir sampling for high-depth pileup#588
stein-andrew wants to merge 1 commit intonanoporetech:masterfrom
stein-andrew:reservoir-sampling

Conversation

@stein-andrew
Copy link

Problem:
When --high-depth is used, positions that reach max_depth stop accepting new reads. Because reads are iterated in BAM coordinate sort order (leftmost mapping position first), this biases each genomic position toward reads whose alignments begin further upstream, systematically excluding reads that map further downstream even though they cover the same position. This issue was first noticed when working with a high-depth synthetic oligonucleotide dataset.

Solution:
Replace first-come-first-served depth limiting with reservoir sampling. Once a position reaches max_depth, each subsequent read is accepted with probability max_depth/n_seen, giving every read covering a position an equal chance of being included regardless of BAM iteration order.

Changes (all in modkit-core/src/pileup/pileup_processor.rs):

  • Added ReservoirState struct with reservoir sampling logic
  • DnaPileupWorker: reservoir check replaces reached_max_depth in pileup loop when CHECK_DEPTH=true
  • GenericPileupWorker/Tally2: reservoir sampling in add_call() when total_cov >= max_coverage
  • Behavior unchanged when depth is below max_depth

Problem:
When --high-depth is used, positions that reach max_depth stop
accepting new reads. Because reads are iterated in BAM coordinate
sort order (leftmost mapping position first), this biases each
genomic position toward reads whose alignments begin further
upstream, systematically excluding reads that map further
downstream even though they cover the same position. This issue was
first noticed when working with a high-depth synthetic oligonucleotide
dataset.

Solution:
Replace first-come-first-served depth limiting with reservoir
sampling. Once a position reaches max_depth, each subsequent read
is accepted with probability max_depth/n_seen, giving every read
covering a position an equal chance of being included regardless
of BAM iteration order.

Changes (all in modkit-core/src/pileup/pileup_processor.rs):
- Added ReservoirState struct with reservoir sampling logic
- DnaPileupWorker: reservoir check replaces reached_max_depth
  in pileup loop when CHECK_DEPTH=true
- GenericPileupWorker/Tally2: reservoir sampling in add_call()
  when total_cov >= max_coverage
- Behavior unchanged when depth is below max_depth
@ArtRand
Copy link
Contributor

ArtRand commented Mar 2, 2026

Hello @stein-andrew,

Thank you for this PR! Let me give it a few minutes to review and test and I'll be back with some feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants