Paper speed-reading notes (revised June 2025): what are the open questions?
Table of contents:
- Contrastive learning as goal-conditioned reinforcement learning (Contrastive RL NCE)
- C-Learning: Learning to Achieve Goals via Recursive Classification
  - 01 Some formal definitions
  - 02 Problems that the C-learning authors claim HER-style methods have
  - 03 Plain C-learning
  - 04 Recursion-based offline C-learning
  - 05 The practical algorithm
- Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement
- Can a MISL Fly? Analysis and Ingredients for Mutual Information Skill Learning
- A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards, Demonstrations, or Subgoals
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- Skill Expansion and Composition in Parameter Space
Contrastive learning as goal-conditioned reinforcement learning (Contrastive RL NCE)
arXiv: https://arxiv.org/abs/2206.07568
Website: https://ben-eysenbach.github.io/contrastive_rl/
Talk: https://www.youtube.com/watch?v=5_eGcprfw60
Source: an article I stumbled across. NeurIPS 2022, 173 citations. The first author is Benjamin Eysenbach, who was also first author of DIAYN; I haven't read his work closely, but his papers seem to be of high quality.
Main content:
The main idea seems very simple. In goal-conditioned RL, where we want the agent to reach a target state \(s_g\), apply contrastive learning directly: maintain a similarity function \(f(s, a, s_g)\) whose value is larger the more likely \((s, a)\) is to eventually reach the state \(s_g\).
Concretely, \(f(s, a, s_g)\) is computed as the inner product of the embeddings \(\phi(s,a)\) and \(\psi(s_g)\). The (maximized) loss function is \(L = \log\sigma(f(s, a, s_f^+)) + \log(1 - \sigma(f(s, a, s_f^-)))\), where the positive \(s_f^+ \sim p^\pi(s_f \mid s,a)\) is sampled from the policy's discounted state-occupancy measure starting from \((s,a)\), and the negative \(s_f^-\) is a state sampled at random from the marginal \(p(s_f)\).
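The objective above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `W_phi` and `W_psi` are hypothetical linear encoders standing in for the learned networks \(\phi\) and \(\psi\), and the "positive"/"negative" states are just random vectors here.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_EMB = 4, 8
W_phi = rng.normal(size=(2 * D_IN, D_EMB)) * 0.1  # stand-in encoder for (s, a)
W_psi = rng.normal(size=(D_IN, D_EMB)) * 0.1      # stand-in encoder for s_g

def f(s, a, s_g):
    """Critic f(s, a, s_g) = <phi(s, a), psi(s_g)> (inner product of embeddings)."""
    phi = np.concatenate([s, a]) @ W_phi
    psi = s_g @ W_psi
    return phi @ psi

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def nce_loss(s, a, s_pos, s_neg):
    """Binary NCE objective: log sigma(f+) + log(1 - sigma(f-)).
    Uses log(1 - sigma(x)) = log_sigmoid(-x)."""
    return log_sigmoid(f(s, a, s_pos)) + log_sigmoid(-f(s, a, s_neg))

s, a = rng.normal(size=D_IN), rng.normal(size=D_IN)
s_pos, s_neg = rng.normal(size=D_IN), rng.normal(size=D_IN)
loss = nce_loss(s, a, s_pos, s_neg)  # a log-likelihood, always <= 0
```

In practice one would maximize `loss` (or minimize its negation) over batches, with in-batch negatives.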
The theory guarantees that, under a given policy \(\pi\), the \(f^*(s, a, s_g)\) obtained by optimizing the NCE loss above relates to the Q function as \(\exp(f^*(s, a, s_g)) = \frac{1}{p(s_g)}Q^\pi_{s_g}(s,a)\). This means we can obtain the optimal policy simply by alternating between 1. optimizing \(f\), and 2. choosing actions with higher \(f\) values.
(This is very reminiscent of temporal distance.)
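Step 2 of the alternation ("choose actions with higher \(f\)") can be sketched as scoring candidate actions against the goal under a frozen critic and taking the argmax. The critic here is a hypothetical random linear stand-in, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
W = rng.normal(size=(3 * D,))  # toy critic weights over concat(s, a, s_g)

def f(s, a, s_g):
    """Stand-in critic f(s, a, s_g); in the real method this is the learned NCE critic."""
    return np.concatenate([s, a, s_g]) @ W

def greedy_action(s, s_g, candidates):
    """Policy-improvement step: pick the candidate action maximizing f(s, a, s_g)."""
    scores = [f(s, a, s_g) for a in candidates]
    return candidates[int(np.argmax(scores))]

s, s_g = rng.normal(size=D), rng.normal(size=D)
candidates = [rng.normal(size=D) for _ in range(16)]
a_star = greedy_action(s, s_g, candidates)
best = f(s, a_star, s_g)
```

With continuous actions, the actual algorithm trains an actor against the critic rather than enumerating candidates; the argmax over samples is just the simplest version of the same idea.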
Some formal definitions:
In Goal-Conditioned RL (GCRL), the reward is defined as the probability of reaching the goal at the next step: \(r_g(s,a) = (1-\gamma)P(s'=s_g \mid s,a)\).
With this reward, the Q function becomes exactly the discounted state occupancy: \(Q^\pi_{s_g}(s,a) = p^\pi(s_+ = s_g \mid s,a) = (1-\gamma)\,\mathbb E_{\pi}\sum_{t=0}^\infty \gamma^t P(s_{t+1}=s_g \mid s_t,a_t)\).
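This identity is easy to verify numerically. Below is a toy check (all numbers illustrative) on a 2-state Markov chain with the policy folded into the transition matrix: the value function under the reward \(r_g(s) = (1-\gamma)P(s'=s_g \mid s)\), obtained from the Bellman equation, should equal the discounted occupancy \((1-\gamma)\sum_t \gamma^t P(s_{t+1}=s_g \mid s_0)\) computed by unrolling:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])  # P[s, s'] under a fixed policy
g = 1                        # goal state s_g

# Reward: probability of reaching the goal on the very next step.
r = (1 - gamma) * P[:, g]

# Value function from the Bellman equation V = r + gamma * P @ V.
V = np.linalg.solve(np.eye(2) - gamma * P, r)

# Discounted state occupancy: (1-gamma) * sum_t gamma^t * P(s_{t+1} = s_g | s_0).
occ = np.zeros(2)
Pt = P.copy()
for t in range(2000):
    occ += (1 - gamma) * gamma**t * Pt[:, g]
    Pt = Pt @ P
# V and occ coincide, confirming Q^pi = occupancy under this reward.
```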
The theory is in Appendix B; it felt like it said essentially nothing.
