- BASIC — 【鱼书】 ("the fish book", *Deep Learning from Scratch*), reinforcement learning volume: slides and notes
- DRL — 【王树森】Deep Reinforcement Learning: slides and notes
- Hands-on-RL — 【俞勇 et al.】*Hands-on Reinforcement Learning*
- OpenAI reinforcement learning handbook — official site
- 李宏毅 — Reinforcement Learning: PPO [video link]
- 李宏毅 — Reinforcement Learning, 2025 [video link]
- An accessible walkthrough of PPO: theory and source-code explained
- easy-rl — online version
- Mathematical-RL — [【赵世钰】Mathematical Foundations of Reinforcement Learning](https://www.bilibili.com/video/BV1sd4y167NS/?)
- RLHF-huggingface
- cleanrl — original repo
- joyrl — a code ecosystem for getting started with reinforcement learning
- notes-on-reinforcement-learning — online reading link
- RL algorithm implementations — DRL-code-pytorch
- Reproducing DeepSeek-R1
- deepspeed-chat

- The k3 estimator: Approximating KL Divergence
- [Recommended] Why does online RL mitigate catastrophic forgetting? Why Online Reinforcement Learning Forgets Less
- [for LLM] When your KL-divergence regularizer is effectively unprotected: On a few pitfalls in KL divergence gradient estimation for RL
- Analyzes the gradient properties of the two mainstream KL estimators (k1 and k3) under two placements (in the reward vs. in the loss): A Comedy of Estimators: On KL Regularization in RL Training of LLMs
- [Entropy] Focuses on macroscopic, global "policy entropy": The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- TODO: Exploration should be selective rather than blind and global: Rethinking Entropy Regularization in Large Reasoning Models
- Suppressing model hallucination with behaviorally calibrated RL: Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
- TODO: Why do large models hallucinate? Why Language Models Hallucinate
- First-order approximation: Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
- [Clips the entire response rather than individual tokens] GSPO: Group Sequence Policy Optimization
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Pre-training, mid-training, and RL-based post-training: On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- RLVR fine-tuning essentially learns "off-principal" components, whereas SFT updates the "principal" ones: The Path Not Taken: RLVR Provably Learns Off the Principals
- TODO: Principal weights: LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning
- Is there a predictable pattern to RL scaling? The Art of Scaling Reinforcement Learning Compute for LLMs
- Tool-integrated reasoning: ToolRL: Reward is All Tool Learning Needs
- http://udlbook.github.io/udlbook — the principles behind deep learning algorithms
- https://github.com/changyeyu/LLM-RL-Visualized — illustrated guide to LLM algorithms
- https://www.rethink.fun — core LLM techniques and applications
- Ray — data processing, training, inference, and deployment for large models with Ray; Ray RLlib GitHub repo
- An illustrated walkthrough of OpenRLHF's Ray-based distributed training pipeline
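The k1 vs. k3 KL estimators referenced above ("Approximating KL Divergence" and "A Comedy of Estimators") are easy to try numerically. A minimal sketch, assuming we draw samples from q and estimate KL(q‖p) for two unit-variance Gaussians where the true KL has a closed form; all variable names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# q = N(0, 1), p = N(0.5, 1); closed-form KL(q || p) = 0.5 * (mu_q - mu_p)^2
mu_q, mu_p = 0.0, 0.5
true_kl = 0.5 * (mu_q - mu_p) ** 2

x = rng.normal(mu_q, 1.0, size=200_000)  # samples from q

# log density ratio: log r = log p(x) - log q(x)
log_r = -0.5 * (x - mu_p) ** 2 + 0.5 * (x - mu_q) ** 2
r = np.exp(log_r)

# k1 = -log r: unbiased for KL(q || p), but can have high variance
k1_est = np.mean(-log_r)

# k3 = (r - 1) - log r: also unbiased (E_q[r] = 1), always >= 0,
# and typically much lower variance when p and q are close
k3_est = np.mean((r - 1.0) - log_r)

print(f"true={true_kl:.4f}  k1={k1_est:.4f}  k3={k3_est:.4f}")
```

Both estimates should land near the analytic value of 0.125 at this sample size; the practical difference (and the subject of the "Comedy of Estimators" paper) shows up in gradient variance when the estimator is placed in the reward versus the loss during LLM RL training.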