Reservoir sampling for high-depth pileup by stein-andrew · Pull Request #588 · nanoporetech/modkit

stein-andrew · 2026-02-28T16:16:24Z

Problem:
When --high-depth is used, positions that reach max_depth stop accepting new reads. Because reads are iterated in BAM coordinate sort order (leftmost mapping position first), this biases each genomic position toward reads whose alignments begin further upstream, systematically excluding reads that map further downstream even though they cover the same position. This issue was first noticed when working with a high-depth synthetic oligonucleotide dataset.

Solution:
Replace first-come-first-served depth limiting with reservoir sampling. Once a position reaches max_depth, each subsequent read is accepted with probability max_depth/n_seen, giving every read covering a position an equal chance of being included regardless of BAM iteration order.

Changes (all in modkit-core/src/pileup/pileup_processor.rs):

Added ReservoirState struct with reservoir sampling logic
DnaPileupWorker: reservoir check replaces reached_max_depth in pileup loop when CHECK_DEPTH=true
GenericPileupWorker/Tally2: reservoir sampling in add_call() when total_cov >= max_coverage
Behavior unchanged when depth is below max_depth

Problem: When --high-depth is used, positions that reach max_depth stop accepting new reads. Because reads are iterated in BAM coordinate sort order (leftmost mapping position first), this biases each genomic position toward reads whose alignments begin further upstream, systematically excluding reads that map further downstream even though they cover the same position. This issue was first noticed when working with a high-depth synthetic oligonucleotide dataset. Solution: Replace first-come-first-served depth limiting with reservoir sampling. Once a position reaches max_depth, each subsequent read is accepted with probability max_depth/n_seen, giving every read covering a position an equal chance of being included regardless of BAM iteration order. Changes (all in modkit-core/src/pileup/pileup_processor.rs): - Added ReservoirState struct with reservoir sampling logic - DnaPileupWorker: reservoir check replaces reached_max_depth in pileup loop when CHECK_DEPTH=true - GenericPileupWorker/Tally2: reservoir sampling in add_call() when total_cov >= max_coverage - Behavior unchanged when depth is below max_depth

ArtRand · 2026-03-02T23:01:45Z

Hello @stein-andrew,

Thank you for this PR! Let me give it a few minutes to review and test and I'll be back with some feedback.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reservoir sampling for high-depth pileup#588

Reservoir sampling for high-depth pileup#588
stein-andrew wants to merge 1 commit intonanoporetech:masterfrom
stein-andrew:reservoir-sampling

stein-andrew commented Feb 28, 2026

Uh oh!

ArtRand commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stein-andrew commented Feb 28, 2026

Uh oh!

ArtRand commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants