
perf: cut train-loop sync/retrace overheads in OneVision Encoder#93

Open
Luodian wants to merge 5 commits into main from dev/perf-train-hotpath

Conversation


@Luodian Luodian commented Feb 10, 2026

summary

  • fix grad-accum step boundary + scheduler stepping alignment
  • cut per-step sync points in training/train.py (remove hot-path .item() / host sync patterns)
  • add safer compile control: --compile_backend {auto,none,inductor,aot_eager,eager}
  • in auto, disable compile on mixed dali_type to avoid retrace storms
  • cache RoPE frequency grids and skip dense-path identity gather in encoder forward
  • reduce dataloader overhead in data_v2* wrappers (drop redundant .cuda(), raise prefetch/thread defaults)
  • reduce per-step frame-sampling allocation pressure in residual branch
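The hot-path sync removal above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: instead of calling `loss.item()` every step (which forces a GPU→CPU sync), the running loss is kept as a tensor and read back only once per logging interval. `train_steps` and `LOG_INTERVAL` are invented names for this sketch.

```python
# Hypothetical sketch of avoiding per-step .item() host syncs.
import torch

LOG_INTERVAL = 4  # assumed logging cadence for illustration


def train_steps(losses):
    """losses: iterable of scalar loss tensors produced by the model."""
    running = torch.zeros(())  # would live on the GPU in real training code
    logged = []
    for step, loss in enumerate(losses, start=1):
        running += loss.detach()  # tensor-tensor add: no host sync
        if step % LOG_INTERVAL == 0:
            # single GPU->CPU sync per interval instead of per step
            logged.append(running.item() / LOG_INTERVAL)
            running.zero_()
    return logged
```

The same pattern applies to any per-step scalar (grad norms, accuracy counters): keep it on-device and materialize it on the host only when logging.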

why

  • less GPU<->CPU sync in hot path
  • fewer dynamic graph recompiles under mixed input signatures
  • less repeated tensor construction / memory traffic per step
  • better input pipeline overlap for image branches
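The `--compile_backend` behavior described in the summary could look roughly like the helper below. The flag values come from this PR; the resolver function, its signature, and the mixed-`dali_type` check are assumptions for illustration only.

```python
# Hypothetical resolver for the --compile_backend flag.
def resolve_compile_backend(choice, dali_types):
    """Map {auto,none,inductor,aot_eager,eager} to a torch.compile backend name,
    or None when compilation should be skipped entirely."""
    if choice == "none":
        return None
    if choice == "auto":
        # Mixed dali_type values imply mixed input signatures; compiling there
        # would trigger repeated retraces, so 'auto' disables compile.
        if len(set(dali_types)) > 1:
            return None
        return "inductor"
    return choice  # explicit backend: inductor / aot_eager / eager
```

With a resolver like this, the training script would wrap the model in `torch.compile(model, backend=...)` only when the result is not `None`.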

replay & exact-match notes

  • This PR is performance-oriented and does not guarantee byte-exact identity against historical runs under default settings.
  • For closest parity with previous behavior, run with:
    • --compile_backend none (disable torch.compile wrapper)
  • If you need strict checkpoint-comparison parity too, keep all non-performance knobs unchanged (dataset list/order, random seeds, worker layout, and DALI environment/config).
  • Note: if your environment previously used the old data_v2.py / data_v2_ocr.py default of prefetch_queue_depth=1, this PR changes it to 3; this can change data-delivery timing while leaving sample values unchanged.
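Why a deeper prefetch queue shifts timing but not values can be shown with a minimal stdlib analogue of `prefetch_queue_depth` (this is an illustration, not DALI code): a bounded queue lets the producer run ahead of the consumer, yet samples are still delivered in order.

```python
# Stdlib sketch: a bounded prefetch queue preserves sample order at any depth.
import queue
import threading


def prefetch(samples, depth):
    """Deliver samples through a producer thread with a bounded queue of size
    `depth` (the prefetch_queue_depth analogue). Order is preserved regardless
    of depth; only how far the producer runs ahead changes."""
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer():
        for s in samples:
            q.put(s)
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    out = []
    while True:
        item = q.get()
        if item is sentinel:
            break
        out.append(item)
    return out
```

Running this with depth 1 and depth 3 yields identical output sequences, which is why the default change affects pipeline overlap but not sample values.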

validation

  • python3 -m py_compile training/train.py onevision_encoder/modeling_onevision_encoder.py dataloader/data_v2.py dataloader/data_v2_ocr.py dataloader/data_v2_multi_res.py

