Paper speed-reading notes (revised June 2025): what are the open questions?
Table of contents:
- Contrastive learning as goal-conditioned reinforcement learning (Contrastive RL NCE)
- C-Learning: Learning to Achieve Goals via Recursive Classification
  - 01 Some formal definitions
  - 02 Problems that the C-learning authors claim HER-style methods have
  - 03 Plain C-learning
  - 04 Recursion-based offline C-learning
  - 05 The practical algorithm
- Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement
- Can a MISL Fly? Analysis and Ingredients for Mutual Information Skill Learning
- A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards, Demonstrations, or Subgoals
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- Skill Expansion and Composition in Parameter Space
Contrastive learning as goal-conditioned reinforcement learning (Contrastive RL NCE)
arXiv: https://arxiv.org/abs/2206.07568
Website: https://ben-eysenbach.github.io/contrastive_rl/
Talk: https://www.youtube.com/watch?v=5_eGcprfw60
Source: an article I stumbled across. NeurIPS 2022, 173 citations. The first author is Benjamin Eysenbach, who was also first author of DIAYN; I haven't read his work closely, but his papers seem to be of high quality.
Main content:
The main idea seems very simple. In goal-conditioned RL, where we want the agent to reach a target state \(s_g\), apply contrastive learning directly: maintain a similarity function \(f(s, a, s_g)\) whose value is larger the more likely \((s, a)\) is to eventually reach the state \(s_g\).
Concretely, \(f(s, a, s_g)\) is computed as the inner product of the embeddings \(\phi(s,a)\) and \(\psi(s_g)\). The (maximized) loss function is \(L = \log\sigma(f(s, a, s_f^+)) + \log(1 - \sigma(f(s, a, s_f^-)))\), where the positive \(s_f^+ \sim p^\pi(s_f \mid s,a)\) is sampled from the policy's discounted state-occupancy measure starting from \((s,a)\), and the negative \(s_f^-\) is a state sampled at random from the marginal \(p(s_f)\).
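The objective above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `W_phi` and `W_psi` are hypothetical linear encoders standing in for the learned networks \(\phi\) and \(\psi\), and the "positive"/"negative" states are just random vectors here.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_EMB = 4, 8
W_phi = rng.normal(size=(2 * D_IN, D_EMB)) * 0.1  # stand-in encoder for (s, a)
W_psi = rng.normal(size=(D_IN, D_EMB)) * 0.1      # stand-in encoder for s_g

def f(s, a, s_g):
    """Critic f(s, a, s_g) = <phi(s, a), psi(s_g)> (inner product of embeddings)."""
    phi = np.concatenate([s, a]) @ W_phi
    psi = s_g @ W_psi
    return phi @ psi

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def nce_loss(s, a, s_pos, s_neg):
    """Binary NCE objective: log sigma(f+) + log(1 - sigma(f-)).
    Uses log(1 - sigma(x)) = log_sigmoid(-x)."""
    return log_sigmoid(f(s, a, s_pos)) + log_sigmoid(-f(s, a, s_neg))

s, a = rng.normal(size=D_IN), rng.normal(size=D_IN)
s_pos, s_neg = rng.normal(size=D_IN), rng.normal(size=D_IN)
loss = nce_loss(s, a, s_pos, s_neg)  # a log-likelihood, always <= 0
```

In practice one would maximize `loss` (or minimize its negation) over batches, with in-batch negatives.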
The theory guarantees that, under a given policy \(\pi\), the \(f^*(s, a, s_g)\) obtained by optimizing the NCE loss above relates to the Q function as \(\exp(f^*(s, a, s_g)) = \frac{1}{p(s_g)}Q^\pi_{s_g}(s,a)\). This means we can obtain the optimal policy simply by alternating between 1. optimizing \(f\), and 2. choosing actions with higher \(f\) values.
(This is very reminiscent of temporal distance.)
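Step 2 of the alternation ("choose actions with higher \(f\)") can be sketched as scoring candidate actions against the goal under a frozen critic and taking the argmax. The critic here is a hypothetical random linear stand-in, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
W = rng.normal(size=(3 * D,))  # toy critic weights over concat(s, a, s_g)

def f(s, a, s_g):
    """Stand-in critic f(s, a, s_g); in the real method this is the learned NCE critic."""
    return np.concatenate([s, a, s_g]) @ W

def greedy_action(s, s_g, candidates):
    """Policy-improvement step: pick the candidate action maximizing f(s, a, s_g)."""
    scores = [f(s, a, s_g) for a in candidates]
    return candidates[int(np.argmax(scores))]

s, s_g = rng.normal(size=D), rng.normal(size=D)
candidates = [rng.normal(size=D) for _ in range(16)]
a_star = greedy_action(s, s_g, candidates)
best = f(s, a_star, s_g)
```

With continuous actions, the actual algorithm trains an actor against the critic rather than enumerating candidates; the argmax over samples is just the simplest version of the same idea.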
Some formal definitions:
In Goal-Conditioned RL (GCRL), the reward is defined as the probability of reaching the goal at the next step: \(r_g(s,a) = (1-\gamma)P(s'=s_g \mid s,a)\).
With this reward, the Q function becomes exactly the discounted state occupancy: \(Q^\pi_{s_g}(s,a) = p^\pi(s_+ = s_g \mid s,a) = (1-\gamma)\,\mathbb E_{\pi}\sum_{t=0}^\infty \gamma^t P(s_{t+1}=s_g \mid s_t,a_t)\).
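This identity is easy to verify numerically. Below is a toy check (all numbers illustrative) on a 2-state Markov chain with the policy folded into the transition matrix: the value function under the reward \(r_g(s) = (1-\gamma)P(s'=s_g \mid s)\), obtained from the Bellman equation, should equal the discounted occupancy \((1-\gamma)\sum_t \gamma^t P(s_{t+1}=s_g \mid s_0)\) computed by unrolling:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])  # P[s, s'] under a fixed policy
g = 1                        # goal state s_g

# Reward: probability of reaching the goal on the very next step.
r = (1 - gamma) * P[:, g]

# Value function from the Bellman equation V = r + gamma * P @ V.
V = np.linalg.solve(np.eye(2) - gamma * P, r)

# Discounted state occupancy: (1-gamma) * sum_t gamma^t * P(s_{t+1} = s_g | s_0).
occ = np.zeros(2)
Pt = P.copy()
for t in range(2000):
    occ += (1 - gamma) * gamma**t * Pt[:, g]
    Pt = Pt @ P
# V and occ coincide, confirming Q^pi = occupancy under this reward.
```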
The theory is in Appendix B; it felt like it said essentially nothing.
