
[Training Camp] Learning rate scheduler implementation #113

Open
littleotherut wants to merge 19 commits into InfiniTensor:master from littleotherut:lr_scheduler

Conversation

@littleotherut

No description provided.

kinorw and others added 19 commits March 3, 2026 14:42
…base class

- Change Step() to virtual with default implementation
- Add pure virtual ComputeLR() for subclasses to implement
- Adapt test helpers (IdentityScheduler, LinearDecayScheduler) to implement ComputeLR() instead of Step()
- All existing tests pass without behavioral changes

BREAKING CHANGE: Subclasses must implement ComputeLR() instead of Step().
…t and update all tests to use Create<T>() factory method.
…entialLR

- enhance LRScheduler with chained and closed-form learning rate methods
- adapt methods (Step, InitialStep, GetClosedFormLR, GetChainedFormLR) to match PyTorch's design
- add tests for consistency
- refactor LinearLR: add end_factor and rename this class
- add SequentialLR InitialStep and UndoChildInitialSteps

BREAKING CHANGE: Subclasses must implement GetClosedFormLR() instead of ComputeLR(). Use LinearLR instead of LinearwarmupLR.
- Add LRSchedulerConfig struct with parameters for all basic schedulers (constant, linear, step)
- Add CreateLRScheduler() factory function
- Support automatic warmup wrapping via SequentialLR when warmup_steps > 0
- Adapt test files
- Add gflags: --lr_scheduler, --warmup_steps, --step_size, --gamma, --start_factor, --end_factor, --lr_total_iters, --total_steps
- Replace nullptr scheduler with factory-created scheduler
- Move scheduler.Step() after optimizer.Step() in both DP and PP paths
- Replace hardcoded FLAGS_learning_rate in log with scheduler->GetLR()
size_t used_mb = 0, reserved_mb = 0;
std::tie(used_mb, reserved_mb) = impl->GetMemPoolPeakMB(device);

const float current_lr = scheduler ? scheduler->GetLR() : static_cast<float>(FLAGS_learning_rate);
Contributor

The scheduler has already been Step()-ed earlier, so here GetLR() semantically returns the LR to be used on the *next* step; what we want to print is the LR actually used on each step, so this logic needs to be revised. The same applies in the llama3 main.cc.

size_t used_mb = 0, reserved_mb = 0;
std::tie(used_mb, reserved_mb) = impl->GetMemPoolPeakMB(device);

const float current_lr = scheduler ? scheduler->GetLR() : static_cast<float>(FLAGS_learning_rate);
Contributor

Same issue here.

std::vector<std::shared_ptr<Tensor>> params_;
float learning_rate_ = 0.0f;
float initial_learning_rate_ = 0.0f;
bool initial_lr_set_ = false;
Contributor

This part is somewhat redundant. The optimizer only needs to store learning_rate_, representing the current learning rate; there is no need for extra initial-LR state. Semantically, the initial learning rate can live solely in the LR scheduler (which you effectively already do, via the scheduler's base_lr).


std::shared_ptr<Optimizer> optimizer_;
int64_t last_step_;
float current_lr_;
Contributor

current_lr_ also seems a bit redundant. Semantically, current_lr_ and optimizer_->GetLearningRate() should be equal at all times, yet in your design the two are stored separately and used interchangeably (on a full read, current_lr_ looks like a copy of optimizer_->GetLearningRate()). Numerically you have handled this correctly, but the design is likely to cause ambiguity for whoever extends it later.

I suggest keeping a single source of truth for the "current learning rate": either track it entirely via optimizer_->GetLearningRate() and drop current_lr_ from the LR scheduler, or have the scheduler track it and set it back into the optimizer after each computation. Personally I think the former is more appropriate.


void LRScheduler::ApplyLR(float lr) {
current_lr_ = lr;
optimizer_->SetLearningRate(current_lr_);
Contributor

Following on from the above: on one hand your design calls optimizer_->SetLearningRate(current_lr_);, and on the other it does current_lr_ = optimizer_->GetLearningRate();. The two can blur which is the cause and which the effect, so I recommend keeping the semantics consistent in the design.

scheduler->Step();
}

current_lr_ = optimizer_->GetLearningRate();
Contributor

Following on from the above: on one hand your design calls optimizer_->SetLearningRate(current_lr_);, and on the other it does current_lr_ = optimizer_->GetLearningRate();. The two can blur which is the cause and which the effect, so I recommend keeping the semantics consistent in the design.

} else if (last_step_ < total_iters_) {
return lr;
} else if (last_step_ == total_iters_) {
return lr / factor_;
Contributor

Since some hyperparameter values are passed in by CLI users, invalid-input checks are needed. Here, for example, factor should be in the (0, 1) range; otherwise an invalid value could cause division by zero. The torch implementation also checks this in the constructor, see: https://github.com/pytorch/pytorch/blob/08840d08a02eead8edf22406a53e5691c9a89c9a/torch/optim/lr_scheduler.py#L813

Also, from what I can see, StepLR does not check step_size > 0, and LinearLR does not check its two factors or total_iters, among others. I suggest auditing the whole PR for this.

void LoadState(const StateDict &state) override;

protected:
float GetClosedFormLR() const override { return current_lr_; }
Contributor

The semantics here are not quite right. I looked closely at the torch implementation: if GetClosedFormLR corresponds to torch's get_closed_form_lr interface, it is meant to be a function that, given base_lr, last_step, and the other hyperparameters, computes the current LR from a formula. Although this is numerically equal to the current_lr you return, logically the code should not just return the cached current_lr_; it should provide the actual computation.

Also, torch's _get_closed_form_lr interface is ultimately used by step(int epoch): if an LRScheduler subclass implements _get_closed_form_lr, it supports closed-form step-skipping semantics, and step(epoch) computes the current LR directly from that function. torch's SequentialLR subclass does not implement it.

Given that your GetClosedFormLR is declared as a virtual function that all subclasses must implement, I suggest adding a // FIXME comment here noting this: for now, returning the current LR is a temporary hack, not a closed-form computation.

};

} // namespace lr_schedulers
} // namespace infini_train
\ No newline at end of file
Contributor

Formatting convention: files need a trailing newline at end of file; several other files later in the PR have the same issue.
