What topics do the first paper speed-reading notes of December 2025 cover?

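Contents
Offline Multitask Representation Learning for Reinforcement Learning
1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Lipschitz-constrained Unsupervised Skill Discovery (LSD)
Controllability-Aware Unsupervised Skill Discovery (CSD)

Offline Multitask Representation Learning for Reinforcement Learning

arxiv: https://arxiv.org/abs/2403.11574

Source: NeurIPS 2024. A purely theoretical paper with no code.

Main content: offline meta RL in low-rank MDPs, where the transition factorizes as \(P(s,a,s') = \phi(s,a)^\top \psi(s')\). Given offline datasets from N tasks that share a common representation, the goal is to learn a single unified \(\phi\) and then adapt to a given new task (for example, one where only transitions are available and no reward, or other settings).

Purely theoretical work; I could not follow most of it (and it does not seem related to my current work).

1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities

arxiv: https://arxiv.org/abs/2503.14858

Source: NeurIPS 2025 best paper, from Benjamin Eysenbach's group. I have read it before: Paper speed-reading notes | 2025.06.

The paper mainly tells an RL scaling-law story.

Kimi's one-sentence summary: in self-supervised goal-conditioned RL, growing the network from the usual 4 layers to 64-1024 layers, combined with residual connections and similar techniques, improves performance by 2x to 50x and lets new, complex behaviors emerge on their own (for example, a humanoid going from crawling → walking upright → vaulting over walls).

Benjamin Eysenbach's self-supervised GCRL: learn a goal-conditioned policy \(\pi(a|s,g)\) that takes the current state s and a goal g (usually a future state) as input and outputs an action a. It uses Contrastive RL, which turns the RL problem into a contrastive learning problem.

The critic learns a metric: if the state-action pair (s,a) can reach the goal g, their embeddings should be close, otherwise far apart. The score is \(f(s,a,g) = -\|\phi(s,a) - \psi(g)\|^2\); with the minus sign, a larger f means a smaller distance, which is what the loss and the actor objective below require. Based on this idea, the critic is trained with the InfoNCE loss: positive samples \((s_i,a_i,g_i)\) are drawn from the same trajectory, negative goals \(g_j\) from different trajectories, and \(\mathcal{L} = -\log \frac{\exp(f(s_i,a_i,g_i))}{\sum_j \exp(f(s_i,a_i,g_j))}\).

The actor \(\pi_\theta(a|s,g)\) directly maximizes the critic output \(f(s,a,g)\), that is, it picks the action that makes the embedding distance between \(\phi(s,a)\) and \(\psi(g)\) smaller.

Related work: Benjamin Eysenbach's Contrastive RL (speed-read notes), Quasimetric RL (speed-read notes), which feels very similar in spirit, and related temporal-distance work (ETD).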
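The contrastive critic described above can be illustrated with a small sketch. This is not the paper's implementation: it is a pure-Python toy that assumes the critic score is the negative squared distance between embeddings (so a higher score means "closer"), treats the other goals in the batch as the negatives, and uses hypothetical names (`neg_sq_dist`, `info_nce_loss`) and hand-written embedding vectors in place of learned networks \(\phi\) and \(\psi\).

```python
import math


def neg_sq_dist(u, v):
    """Assumed critic score: f = -||u - v||^2 for two embedding vectors."""
    return -sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def info_nce_loss(sa_embs, goal_embs):
    """Mean InfoNCE loss over a batch.

    sa_embs[i] stands in for phi(s_i, a_i) and goal_embs[i] for psi(g_i).
    Pair (i, i) is the positive; every goal j != i serves as a negative.
    """
    total = 0.0
    for i, sa in enumerate(sa_embs):
        logits = [neg_sq_dist(sa, g) for g in goal_embs]
        # Numerically stable log-sum-exp over all goals in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax probability assigned to the positive goal.
        total += -(logits[i] - log_z)
    return total / len(sa_embs)


# Toy embeddings (made up for illustration, not learned).
sa = [[0.0, 0.0], [3.0, 3.0]]
aligned = [[0.1, 0.0], [3.0, 2.9]]   # each goal sits near its own (s, a)
swapped = [aligned[1], aligned[0]]   # each goal sits near the other (s, a)

loss_aligned = info_nce_loss(sa, aligned)
loss_swapped = info_nce_loss(sa, swapped)
print(loss_aligned, loss_swapped)
```

When each goal embedding lies next to its own state-action embedding the loss is close to zero, and it grows when the pairing is swapped. In the same picture, an actor would pick the action whose \(\phi(s,a)\) gives the highest score for the commanded goal.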
