What is in the paper speed-reading notes for the second week of December 2025?

Contents
  Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning
  Some offline RL work using labeled data / expert demos + unlabeled data
  Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning
  (U2O RL) Unsupervised-to-Online Reinforcement Learning
  (HILP) Foundation policies with hilbert representations
  Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills
  OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning
  Parrot: Data-driven behavioral priors for reinforcement learning
  (ICVF) Reinforcement learning from passive data via latent intentions
  Become a proficient player with limited data through watching pure videos
  (PEX) Policy expansion for bridging offline-to-online reinforcement learning
  Cup: Critic-guided policy reuse

Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning

arxiv: https://arxiv.org/abs/2302.08738
Source: stumbled across it by chance; AAAI 2023.
Main content: proposes two unsupervised / self-supervised techniques for PbRL that make use of unlabelled data online.
1. Treat every unlabelled segment as human-preferred, take [R1 R2 ... RH] as the reward vector, and do contrastive learning through a somewhat mysterious triplet loss.
2. Encourage the distances between state embeddings in the reward model (I did not look closely at what these are) to match temporal distance, trained with an MSE loss.
I did not read it carefully.

Some offline RL work using labeled data / expert demos + unlabeled data

Besides CDS and UDS, there are also:

The Provable Benefits of Unsupervised Data Sharing for Offline Reinforcement Learning, https://arxiv.org/abs/2302.13493 , ICLR 2023, work by a senior labmate. It seems quite theoretical; I have not read it.

CLUE: Calibrated Latent Guidance for Offline Reinforcement Learning, https://arxiv.org/abs/2104.07749 , CoRL 2023: Calibrated Latent Guidance learns a latent representation of state-action pairs with a CVAE, but uses a key regularizer to force all expert-data embeddings to collapse to the origin (mean / variance ≈ 0). Expert behavior is thus "tied" to a single point in latent space, and any sample's distance to that point naturally forms a task-oriented intrinsic reward: the more expert-like, the higher the reward. No adversarial training and no temporal modeling are needed; distance is the reward.

🥑 This paper also wants to label rewards using distances in latent space. There seems to be no theory behind it; it feels purely heuristic.
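The "distance is the reward" idea in the CLUE note above can be illustrated with a small sketch. This is not the paper's implementation: it assumes some encoder (e.g. a CVAE) has already mapped state-action pairs to latent vectors, and the function name `latent_distance_reward`, the scale `c`, and the exponential form of the reward are all hypothetical choices made here for illustration.

```python
import numpy as np

def latent_distance_reward(z, z_expert, c=1.0):
    """Intrinsic reward that grows as a latent z gets closer to the expert point.

    z        : array of shape (N, D), latents of unlabeled state-action pairs
    z_expert : array of shape (D,), the single point expert latents collapse to
    c        : hypothetical temperature controlling how fast reward decays
    """
    z = np.asarray(z, dtype=float)
    z_expert = np.asarray(z_expert, dtype=float)
    # Squared Euclidean distance of every sample to the expert point.
    sq_dist = np.sum((z - z_expert) ** 2, axis=-1)
    # Assumed reward shape: 1 at the expert point, decaying towards 0 far away.
    return np.exp(-c * sq_dist)

# Toy latents (made up): expert embeddings sit tightly around the origin.
expert_latents = np.array([[0.01, -0.02], [-0.01, 0.02]])
z_expert = expert_latents.mean(axis=0)

# Toy unlabeled latents at increasing distance from the expert point.
unlabeled_latents = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
rewards = latent_distance_reward(unlabeled_latents, z_expert)
```

The rewards computed this way could then relabel the unlabeled transitions before running any offline RL algorithm; samples whose latents sit closer to the expert point receive a higher reward.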
