Source: "We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints." (Abstract, p. 1)
Source: "We introduce a new benchmark, IFBENCH, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints." (Abstract, p. 1)
Source: "Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following." (Abstract, p. 1)
Analysis: The abstract notes that the authors designed constraint verification modules and showed that reinforcement learning with verifiable rewards (RLVR) significantly improves a model's ability to follow instructions.
Source: "The task of precise instruction following (IF) evaluates a language model’s ability to perform a task t, such as summarization or creative writing, while adhering to one or more output constraints c, which can be automatically verified." (Section 2, p. 2)
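To make "automatically verified" concrete, the sketch below shows what a verification function for a single output constraint could look like. This is illustrative only, not the paper's code; the keyword-frequency constraint and the function name are hypothetical.

```python
import re

def verify_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """Return True if `keyword` appears at least `min_count` times in `response`.

    Hypothetical example of a constraint verification function: it checks only
    constraint adherence, not whether the response solves the underlying task.
    """
    pattern = rf"\b{re.escape(keyword)}\b"
    occurrences = len(re.findall(pattern, response, flags=re.IGNORECASE))
    return occurrences >= min_count

# Example constraint: "mention the word 'renewable' at least 3 times"
print(verify_keyword_frequency(
    "Renewable energy is growing; renewable sources are cheap, and renewable jobs follow.",
    "renewable", 3))  # True
```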
Source: "2. 29 new training constraints and verification functions, IFTRAIN, to enable simple data creation that improves instruction following performance." (Section 2, p. 2)
Source: "The prompts for verifiable IF training are created by combining an instruction from a public SFT dataset with a constraint from either the IFEval taxonomy (under Apache 2.0 license) or our new unseen training constraint taxonomy (which is separate from the constraints in IFBENCH)." (Section 3, Data, p. 4)
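As a rough sketch of this data-creation step, a constraint template with sampled variables is appended to an existing SFT instruction. The templates and helper below are hypothetical; the real IFEval/IFTRAIN taxonomies are much larger and each template ships with a matching verification function.

```python
import random

# Hypothetical constraint templates with variable slots (not the paper's actual templates).
CONSTRAINT_TEMPLATES = [
    "Your response must contain at least {n} sentences.",
    "Include the keyword '{kw}' at least {n} times.",
    "Do not use any commas in your response.",
]

def build_if_prompt(sft_instruction: str) -> str:
    """Append a randomly sampled, filled-in constraint to a base SFT instruction."""
    template = random.choice(CONSTRAINT_TEMPLATES)
    constraint = template.format(n=random.randint(2, 8), kw=random.choice(["ocean", "budget"]))
    return f"{sft_instruction}\n\n{constraint}"

print(build_if_prompt("Write a short story about a lighthouse keeper."))
```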
Source: "We use GRPO [18] to optimize the following objective: Specifically, we train a policy with GRPO and outcome supervision…" (Section 3, Training, p. 4)
Analysis: The paper states that the authors train the policy with GRPO (Group Relative Policy Optimization) under outcome supervision, i.e., each sampled output is scored according to whether it correctly satisfies the constraints.
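For intuition, here is a minimal sketch of outcome supervision with a group-relative baseline in the spirit of GRPO. The helper names are illustrative and this is not the authors' implementation; the full GRPO objective additionally involves a clipped policy-gradient ratio and a KL penalty that are not shown here.

```python
from statistics import mean, stdev

def outcome_rewards(completions, verifiers):
    """Score each sampled completion 1.0 if it satisfies every verifiable constraint, else 0.0."""
    return [1.0 if all(verify(c) for verify in verifiers) else 0.0 for c in completions]

def group_relative_advantages(rewards):
    """GRPO-style advantage: each reward minus the group mean, divided by the group std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Example: a group of 4 completions sampled for one prompt, 3 of which pass verification.
print(group_relative_advantages([1.0, 0.0, 1.0, 1.0]))
```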
D. It improves the model's performance on both in-domain (seen) and out-of-domain (unseen) constraints ✅
Correct answer: D
Source: "We find that training on a combination of constraints improves both in-domain and out-of-domain performance. …training on up to 5 or 6 constraints still leads to better generalization on these benchmarks." (Section 4.1, p. 5)
B. Training on a WIDER RANGE of constraint variables than the test set performs comparably to, and often even better than, training on the SAME RANGE ✅
C. Training on a DIFFERENT RANGE of variables, fully disjoint from the test set, yields the highest performance
D. The variable range has no significant effect on model performance
Correct answer: B
Source: "Interestingly, though, training on a wider variable range, performs comparably and often even better than training on the same range. This suggests that training on a diverse set of constraint variables improves generalization for in-domain constraints performance." (Section 4.3, p. 6)
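For intuition about what "variable range" means here: each constraint template has slots (such as a word count N), and the SAME / WIDER / DIFFERENT RANGE setups differ in which values those slots are sampled from during training. A hypothetical sketch, with ranges invented purely for illustration:

```python
import random

def sample_word_count_constraint(setup: str) -> str:
    """Sample the variable slot of an 'at least N words' constraint under different range setups."""
    if setup == "same_range":      # same values the test constraints use (illustrative numbers)
        n = random.randint(100, 300)
    elif setup == "wider_range":   # superset of the test range (illustrative numbers)
        n = random.randint(10, 1000)
    else:                          # "different_range": disjoint from the test range
        n = random.randint(1000, 2000)
    return f"Answer with at least {n} words."

print(sample_word_count_constraint("wider_range"))
```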
Source: "The results in Tab. 5 show that despite training on the same prompts and starting from the same model, GRPO training with IF verifiable rewards consistently outperforms the model trained with DPO on IFEval and IFBENCH." (Section 4.6, Experiments and Results, p. 8)
C. Starting training from a base model, with a dedicated reasoning chat template, reaches nearly the same IFEval performance as starting from an instruction-tuned model ✅
D. Training from a base model degrades performance on all benchmarks
Correct answer: C
Source: "In Table 6, we find that IF-RLVR training a base model leads to nearly the same IFEval performance as when using an instruct policy. IF-RLVR training from base, with a reasoning chat template, results in better generalization on the out-of-domain IFBENCH." (Section 4.7, p. 8)
Source: "Beyond improving precise IF, we also see that RLVR trained models exhibit different instruction following behaviors compared to their non reinforcement trained counterparts. … IF-RLVR trained models tend to prioritize the constraint – future work will explore how to blend these behaviors with refined training recipes." (Section 2, p. 2)
Source: "1. A new, unseen and challenging precise instruction following benchmark, IFBENCH³, with 58 new constraints and corresponding verification functions." (Section 2, p. 2)
Source: "Removing the constraints from the LENGTH CONSTRAINT and the KEYWORDS categories harms IFEval performance the most, while removing constraints from the CHANGE CASES and DETECTABLE FORMAT categories barely affect performance…" (Section 4.4, p. 6)
Source: "While GRPO training with verifiable rewards for precise IF is great at teaching LLMs to follow output constraints, it can sometimes result in models that over-prioritize the constraint over the full instruction. An example of such an output is given in Figure 8. This could also be called over-optimization." (Section 5, Mitigating Reward Hacking, p. 10)
Source: "We propose adding a general reward model (RM) signal to the verifiable reward. The intuition is that while the verifiable reward checks for the adherence to the output constraint, the general reward model provides signal for whether the response answers the prompt." (Section 5, Mitigating Reward Hacking, p. 10)
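A minimal sketch of how such a combined signal could be computed; the weighting scheme and function names are assumptions made for illustration, since the quoted passage only states that an RM signal is added to the verifiable reward.

```python
def combined_reward(prompt, response, verifiers, reward_model, alpha=0.5):
    """Blend a binary verifiable-constraint reward with a general reward-model score.

    The verifiable part checks constraint adherence; the RM part scores whether
    the response actually answers the prompt, discouraging constraint-only outputs.
    """
    verifiable = 1.0 if all(verify(response) for verify in verifiers) else 0.0
    rm_score = reward_model(prompt, response)  # assumed to return a scalar, e.g. in [0, 1]
    return verifiable + alpha * rm_score

# Usage (hypothetical): combined_reward(p, r, [lambda text: "," not in text], my_reward_model)
```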
C. Within a single prompt, different parts of the instruction (e.g., system prompt vs. user prompt, task vs. constraint) carry different priorities, and the model must learn to weigh them against each other ✅
D. A framework for decomposing complex instructions into simple subtasks
Correct answer: C
Source: "The notion of an “instruction hierarchy” can be used to prioritize either the relative ranking of following system versus user prompts in a query along with how to prioritize different pieces of a request relative to eachother [25]." (Section 5, p. 9)
Source: "We define a taxonomy of constraint templates, which we split into training and test constraints to prevent contamination. …IFTRAIN consists of 29 new, unseen, verifiable constraints… IFBENCH consists of 58 new verifiable constraints…" (Section 3, pp. 3–4)
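As a small illustration of keeping the two taxonomies disjoint (the template identifiers below are invented for the example; IFTRAIN and IFBENCH contain 29 and 58 real constraint templates, respectively):

```python
# Hypothetical constraint-template identifiers, used only to illustrate the split.
IFTRAIN_TEMPLATES = {"keyword_frequency", "sentence_count", "no_commas"}
IFBENCH_TEMPLATES = {"palindrome_word", "nth_word_capitalized", "alphabetical_paragraphs"}

# Contamination check: no training constraint may also appear in the benchmark.
assert IFTRAIN_TEMPLATES.isdisjoint(IFBENCH_TEMPLATES), "train/test constraint overlap"
```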
Learning Objectives
Through carefully designed multiple-choice questions checked against the source text, this guide helps learners master the core points of the paper "Generalizing Verifiable Instruction Following": the problem it raises, the proposed solution, and the experimental conclusions.
How to Use
Read each question carefully and choose the answer you think is correct, then compare your choice against the "Source" and "Analysis" sections to understand the point in depth. Combining active recall with immediate feedback in this way is an effective way to strengthen retention and understanding.
Questions and Analysis