Conversation

@chen2021673
This PR introduces a comprehensive precision checking system for debugging numerical accuracy issues in distributed training:

Core Features:

  • Two-level precision checking (module-level and function-level)
  • Command-line flags: --precision_check, --precision_check_all_ranks
  • Extensible hook system for Functions, Modules, and Tensors
  • Automatic FP32 reference computation for validation

Hook System:

  • Forward/backward pre/post hooks for Functions and Modules
  • Tensor gradient hooks for inspection
  • Unified hook type definitions to reduce code duplication

Implementation:

  • PrecisionChecker utility with configurable check levels
  • Integration with autograd Function and nn::Module
  • Support for distributed training (per-rank checking)
  • Detailed logging to precision_check_rank_[N].log files

Documentation:

  • docs/hook_mechanism.md - Hook system architecture
  • docs/precision_checker_guide.md - Usage guide

Testing:

  • test/hook/test_hook.cc - Hook functionality tests
  • test/hook/test_precision_check.cc - Precision checker tests

chen2021673 and others added 4 commits January 13, 2026 10:10
…omprehensive docs

- Add PrecisionCheckConfig and PrecisionCheckContext for better state management
- Refactor precision checker to use context-based architecture
- Add comprehensive documentation (hook_mechanism.md, precision_checker_guide.md)
- Add test cases for hook system and precision checking
- Update CMakeLists.txt to include new test targets
- Improve command-line flag handling in examples

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Unify Function and Module hook infrastructure into common/hook.h
- Remove duplicated HookHandle and HookHandleImpl classes
- Update precision_checker_guide.md and hook_mechanism.md
int pp_rank = 0;

// Set thread-local global rank
nn::parallel::global::thread_global_rank = rank.GlobalRank();
Author

We should look into whether there is a more elegant replacement for this global variable later.

This commit fixes the issue where only rank 0 generated precision check
log files when running with tensor parallelism. The root cause was that
GetLogStream() used process-global static variables, causing all threads
in a single process to share the same log file handle.

Changes:
- Add thread_global_rank thread-local variable to track per-thread rank
- Convert GetLogStream() and TableHeaderPrinted() to use thread_local storage
- Set thread_global_rank in Train() function for each thread
- Move baseline output (key|md5 format) into table format branch to avoid
  duplicate output in simple format
- Add directory creation and error handling for log file opening

With these changes, each thread now creates its own log file based on
its global rank (process_rank * nthread_per_process + thread_rank).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
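The per-thread logging scheme this commit describes can be sketched as follows. The names `thread_global_rank`, `GetLogStream`, and the rank formula come from the commit message; the namespace and everything else here are assumptions for illustration:

```cpp
#include <fstream>
#include <sstream>
#include <string>

namespace precision_check_sketch {  // hypothetical namespace

// Set once per worker thread (e.g. at the top of Train()).
thread_local int thread_global_rank = 0;

// global rank = process_rank * nthread_per_process + thread_rank
inline int GlobalRank(int process_rank, int nthread_per_process, int thread_rank) {
    return process_rank * nthread_per_process + thread_rank;
}

inline std::string LogFileName() {
    std::ostringstream name;
    name << "precision_check_rank_" << thread_global_rank << ".log";
    return name.str();
}

// thread_local (not a process-global static): each thread opens its own
// log file, which is exactly the bug this commit fixes.
inline std::ofstream &GetLogStream() {
    thread_local std::ofstream stream(LogFileName(), std::ios::app);
    return stream;
}

}  // namespace precision_check_sketch
```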
Add tools/compare_loss.py to automate end-to-end loss comparison between
two log directories, eliminating manual verification overhead as test cases
scale up.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@@ -1,6 +1,8 @@
#pragma once

#include <functional>
Contributor

I don't see any substantive change in this file; are the newly added forward declarations and headers actually used anywhere?

void Init(int threads_per_process, int tensor_parallel_size, bool sequence_parallel_enabled,
int pipeline_parallel_size, int virtual_pipeline_parallel_size);
int pipeline_parallel_size, int virtual_pipeline_parallel_size,
const utils::PrecisionCheckConfig &precision_config = utils::PrecisionCheckConfig());
Collaborator

Avoid default values in function signatures wherever possible; otherwise every parameter added later also has to carry a default.
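One way to follow this advice is a forwarding overload instead of a defaulted parameter. A minimal sketch, with a stand-in `PrecisionCheckConfig` and a trimmed-down `Init` signature (the real one takes more parameters); `last_baseline_path` exists only so the sketch is observable:

```cpp
#include <string>

// Stand-in for the real utils::PrecisionCheckConfig.
namespace utils { struct PrecisionCheckConfig { std::string baseline_path; }; }

std::string last_baseline_path;  // illustration only: records what Init saw

// Primary overload: no defaults, so parameters can be appended freely later.
void Init(int pipeline_parallel_size, int virtual_pipeline_parallel_size,
          const utils::PrecisionCheckConfig &precision_config) {
    (void)pipeline_parallel_size;
    (void)virtual_pipeline_parallel_size;
    last_baseline_path = precision_config.baseline_path;
}

// Convenience overload for callers that don't care about precision checking;
// it simply forwards a default-constructed config.
void Init(int pipeline_parallel_size, int virtual_pipeline_parallel_size) {
    Init(pipeline_parallel_size, virtual_pipeline_parallel_size,
         utils::PrecisionCheckConfig());
}
```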


Layout layout_;
PrecisionCheckLevel precision_check_level_ = PrecisionCheckLevel::NONE;
utils::PrecisionCheckConfig precision_check_config_;
Collaborator

const could be added here.

inline void InitAllEnv(int nthread_per_process, int tensor_parallel_size, bool sequence_parallel_enabled,
int pipeline_parallel_size, int virtual_pipeline_parallel) {
int pipeline_parallel_size, int virtual_pipeline_parallel,
const utils::PrecisionCheckConfig &precision_config = utils::PrecisionCheckConfig()) {
Collaborator

Don't use a default value here.

std::string baseline_path = ""; // baseline file path for comparison

// Parse from "key=value,key=value" string
static PrecisionCheckConfig Parse(const std::string &config_str) {
Collaborator

Put the implementation in the .cc file.
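The split might look like the sketch below: declaration in the header, definition in the .cc. Only `baseline_path` and the `key=value,key=value` format appear in the diff above; the parsing details and any other keys are assumptions:

```cpp
#include <sstream>
#include <string>

// --- precision_check_config.h (sketch): declaration only ---
struct PrecisionCheckConfig {
    std::string baseline_path;  // baseline file path for comparison
    static PrecisionCheckConfig Parse(const std::string &config_str);
};

// --- precision_check_config.cc (sketch): out-of-line definition ---
PrecisionCheckConfig PrecisionCheckConfig::Parse(const std::string &config_str) {
    PrecisionCheckConfig config;
    std::istringstream stream(config_str);
    std::string item;
    // Split "key=value,key=value" on commas, then each item on the first '='.
    while (std::getline(stream, item, ',')) {
        const auto eq = item.find('=');
        if (eq == std::string::npos) continue;  // skip malformed entries
        const std::string key = item.substr(0, eq);
        const std::string value = item.substr(eq + 1);
        if (key == "baseline_path") config.baseline_path = value;
        // ... other keys would be handled here ...
    }
    return config;
}
```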

Collaborator


Put the implementation in the .cc file.

namespace {

// Simple MD5 implementation
class MD5 {
Collaborator

Comparing tensor MD5s directly, instead of checking abs/rel diff against a tolerance, makes exact matches hard to achieve and gives no sense of how large the difference is, doesn't it?
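The tolerance-based alternative suggested here is the usual allclose-style criterion: pass when every element pair is within `atol + rtol * |expected|`, and also report the maximum deviation so the size of the gap is visible. A sketch on flat float buffers (all names and thresholds are placeholders, not the project's API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct DiffReport {
    bool all_close = true;
    double max_abs_diff = 0.0;   // worst absolute deviation seen
    double max_rel_diff = 0.0;   // worst relative deviation seen
};

// Unlike an MD5 comparison, this tolerates small rounding differences
// and quantifies how far apart the tensors actually are.
DiffReport CompareTensors(const std::vector<float> &actual,
                          const std::vector<float> &expected,
                          double atol, double rtol) {
    DiffReport report;
    const std::size_t n = std::min(actual.size(), expected.size());
    for (std::size_t i = 0; i < n; ++i) {
        const double a = actual[i], e = expected[i];
        const double abs_diff = std::fabs(a - e);
        const double denom = std::max(std::fabs(e), 1e-12);  // guard div-by-zero
        report.max_abs_diff = std::max(report.max_abs_diff, abs_diff);
        report.max_rel_diff = std::max(report.max_rel_diff, abs_diff / denom);
        if (abs_diff > atol + rtol * std::fabs(e)) report.all_close = false;
    }
    return report;
}
```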

Collaborator

@kilinchange kilinchange Jan 19, 2026

The precision_checker configuration shouldn't live under parallel; I suggest moving it elsewhere, e.g. under utils. (This can wait for the next PR, together with the global module hook change.)

utils::PrecisionChecker::RegisterForModule(this);
precision_check_registered_ = true;
}
}
Collaborator


https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.module.register_module_forward_hook.html#torch-nn-modules-module-register-module-forward-hook

We should have a mechanism for registering global module hooks. What precision_checker does today is effectively register a global module hook, so precision_checker should call such a global-hook registration interface directly (for example, InitAllEnv could decide whether to register a global precision_check hook based on the precision argument it receives). (This can wait for the next PR.)
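A minimal registry in the spirit of torch's `register_module_forward_hook` (hooks fire for every module's forward pass, not just modules they were registered on) might look like this. All names here are hypothetical, not the project's API:

```cpp
#include <functional>
#include <unordered_map>
#include <utility>

class Module;  // stands in for nn::Module

// A global forward hook observes every module's forward pass.
using GlobalForwardHook = std::function<void(Module *)>;

class GlobalHookRegistry {
public:
    static GlobalHookRegistry &Instance() {
        static GlobalHookRegistry registry;
        return registry;
    }

    // Returns a handle id that can later be passed to Remove().
    int RegisterForwardHook(GlobalForwardHook hook) {
        const int id = next_id_++;
        hooks_[id] = std::move(hook);
        return id;
    }

    void Remove(int id) { hooks_.erase(id); }

    // Module::Forward would call this once per forward pass.
    void RunForwardHooks(Module *module) {
        for (auto &entry : hooks_) entry.second(module);
    }

private:
    int next_id_ = 0;
    std::unordered_map<int, GlobalForwardHook> hooks_;
};
```

Under this design, the precision checker becomes just one client: something like InitAllEnv registering a precision-check hook when the flag is set, instead of each module registering itself.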

}

// Register backward hooks on output tensors' grad_fn
if (!backward_pre_hooks_.empty() || !backward_post_hooks_.empty()) {
Collaborator


Since it's already written this way, wrap this condition in UNLIKELY.
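If the codebase doesn't already define it, UNLIKELY is conventionally a branch-prediction hint wrapped around `__builtin_expect`; a sketch (the project's own macro may differ, and C++20 code could use the `[[unlikely]]` attribute instead):

```cpp
// Hints to the compiler that the condition is rarely true, so the hot
// path (no backward hooks registered) stays on the fall-through branch.
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#define UNLIKELY(x) (x)  // no-op fallback for other compilers
#endif

// Usage at the call site from the snippet above:
// if (UNLIKELY(!backward_pre_hooks_.empty() || !backward_post_hooks_.empty())) {
//     ...
// }
```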

sys.exit(1 if total_mismatches > 0 else 0)

if __name__ == '__main__':
    main()
No newline at end of file
Collaborator

Add a newline at the end of the file.
