Source text: 「We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints.」 (From: Abstract, p. 1)
Source text: 「We introduce a new benchmark, IFBENCH, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints.」 (From: Abstract, p. 1)
C. Built by collecting feedback from real users and manually writing constraints for core skills, then combining them with user prompts from WildChat. ✅
D. Automatically extracted from mathematical reasoning datasets such as GSM8K.
Correct answer: C
Source text: 「The new constraints we introduce were created manually – sourced by collecting feedback from LM users… To create the final test prompts, we add instantiated constraints to unseen, i.e. held out from release, prompts from WildChat [29].」 (From: Section 2, p. 3)
Source text: 「Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following.」 (From: Abstract, p. 1)
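As an illustration of what such a constraint verification module can look like, here is a minimal sketch in Python. The constraint choices and function names are assumptions for this guide, not the paper's actual code; what matters is that every constraint has a deterministic check that can be turned into a binary reward for RLVR.

```python
import re

def verify_word_count(response: str, min_words: int, max_words: int) -> bool:
    """Check that the number of words falls inside [min_words, max_words]."""
    n_words = len(response.split())
    return min_words <= n_words <= max_words

def verify_keyword_frequency(response: str, keyword: str, at_least: int) -> bool:
    """Check that `keyword` appears at least `at_least` times (case-insensitive)."""
    return len(re.findall(re.escape(keyword), response, flags=re.IGNORECASE)) >= at_least

def verifiable_reward(response: str) -> float:
    """Binary reward for RLVR: 1.0 only if every constraint check passes."""
    checks = [
        verify_word_count(response, min_words=50, max_words=120),
        verify_keyword_frequency(response, keyword="summer", at_least=3),
    ]
    return 1.0 if all(checks) else 0.0
```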
Source text: 「Our results indicate that RLVR with Group Relative Policy Optimization (GRPO) [18] and data augmentation leads to significant performance increases on old IF benchmarks and IFBENCH.」 (From: Section 1, p. 2)
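A rough sketch of how GRPO consumes such binary rewards, assuming the standard group-relative advantage formulation from the GRPO literature (this is not code released with the paper): several responses are sampled per prompt, each is scored by the verifier, and each response's advantage is its reward standardized against the group's mean and standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each sampled response's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: eight responses sampled for one prompt, scored by the constraint verifier.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Responses that satisfied the constraints receive positive advantages, the rest
# negative ones; the policy gradient then pushes the model toward the former.
```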
Source text: 「As displayed in Figure 2, training on a bigger combination of constraints leads to better performance… Also on the out-of-domain benchmark IFBENCH, the best performance is achieved when training on more than one constraint per instance (Table 1).」 (From: Section 4.1, p. 5)
Source text: 「Following (verifiable) output constraints can stand in conflict with following the main task mentioned in the instruction… This indicates that the base policy models are better at following general instructions, while IF-RLVR trained models are better at following the constraints.」 (From: Section 5, p. 9)
Source text: 「We propose adding a general reward model (RM) signal to the verifiable reward. The intuition is that while the verifiable reward checks for the adherence to the output constraint, the general reward model provides signal for whether the response answers the prompt.」 (From: Section 5, p. 10)
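A minimal sketch of such a combined reward, assuming a simple weighted blend (the quote does not specify the exact combination rule, and `reward_model_score` is a hypothetical placeholder for a general reward model):

```python
def combined_reward(prompt, response, constraint_checks, reward_model_score, alpha=0.5):
    """Blend a binary verifiable reward with a general reward-model score.

    constraint_checks: callables mapping a response to True/False.
    reward_model_score: hypothetical callable returning a quality score in [0, 1].
    alpha: weight on the verifiable part (illustrative value, not from the paper).
    """
    verifiable = 1.0 if all(check(response) for check in constraint_checks) else 0.0
    rm_score = reward_model_score(prompt, response)
    return alpha * verifiable + (1.0 - alpha) * rm_score
```

The intuition carries over directly: the verifiable term rewards constraint adherence, while the reward-model term keeps the response tied to the actual task in the prompt.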
Source text: 「We also see that targeted RLVR training for IF slightly harms other downstream evaluations, such as AlpacaEval 2, while staying comparable on others, such as GSM8K, MMLU and BBH.」 (From: Section 4.5, p. 7)
Source text: 「The results in Tab. 5 show that despite training on the same prompts and starting from the same model, GRPO training with IF verifiable rewards consistently outperforms the model trained with DPO on IFEval and IFBENCH.」 (From: Section 4.6, p. 8)
Source text: 「IFTRAIN consists of 29 new, unseen, verifiable constraints, with their corresponding verification functions. This more than doubles the current set of train constraint types.」 (From: Section 3, p. 4)
C. Training on a wider variable range that covers the test range performs comparably to, and often even better than, training on the same range. ✅
D. The variable range has no effect on model performance.
Correct answer: C
Source text: 「Interestingly, though, training on a wider variable range, performs comparably and often even better than training on the same range. This suggests that training on a diverse set of constraint variables improves generalization for in-domain constraints performance.」 (From: Section 4.3, p. 6)
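Here, the "variable range" refers to the values a constraint template is instantiated with (e.g., how many times a keyword must appear). A small sketch, with made-up ranges, of what "same range" versus "wider range" instantiation during training could look like:

```python
import random

# Illustrative ranges only; the paper's actual constraint variables differ.
TEST_RANGE = range(3, 6)          # e.g., "use the keyword at least N times", N in [3, 5]
WIDER_TRAIN_RANGE = range(1, 11)  # training draws N from a broader interval covering the test range

def instantiate_constraint(n_times: int, keyword: str = "summer") -> str:
    return f"Use the word '{keyword}' at least {n_times} times in your answer."

same_range_example = instantiate_constraint(random.choice(TEST_RANGE))
wider_range_example = instantiate_constraint(random.choice(WIDER_TRAIN_RANGE))
```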
Source text: 「In Table 6, we find that IF-RLVR training a base model leads to nearly the same IFEval performance as when using an instruct policy. IF-RLVR training from base… results in better generalization on the out-of-domain IFBENCH.」 (From: Section 4.7, pp. 8-9)
Source text: 「The notion of an “instruction hierarchy” can be used to prioritize either the relative ranking of following system versus user prompts in a query along with how to prioritize different pieces of a request relative to eachother [25].」 (From: Section 5, p. 9)
Explanation: The notion of an "instruction hierarchy" is cited to show that, when a model faces a complex prompt (e.g., "write a poem about summer in which every word starts with a vowel"), it must internally decide how to prioritize the different parts of the instruction (the main task of writing the poem vs. the constraint that every word start with a vowel). IF-RLVR-trained models tend to assign very high priority to the constraint, which is exactly what produces the observed "reward hacking" behavior.
C. The user first poses a task, the model answers, and then in a third turn the user introduces a constraint and asks the model to rewrite its previous answer to satisfy it. ✅
D. The model must generate a complete multi-turn dialogue script in a single response.
Correct answer: C
Source text: 「Multi-turn: c is isolated from t in three turns. The first “user” prompt consist of a general prompt with task t and the second turn is an “assistant”‘s response to t, r1. The third turn (“user”) asks to rewrite r1 to comply with a constraint c.」 (From: Section 2, pp. 3-4)
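Concretely, the three-turn layout described in this quote can be represented as an ordinary chat message list; the task and constraint texts below are illustrative placeholders, not items from IFBENCH:

```python
# Multi-turn variant: the constraint c appears only in the third turn, which asks
# the model to rewrite its earlier answer r1 so that it complies with c.
multi_turn_example = [
    {"role": "user", "content": "Write a short paragraph about summer."},                 # task t
    {"role": "assistant", "content": "Summer is the warmest season of the year..."},      # response r1
    {"role": "user", "content": "Rewrite your answer so that it contains exactly three sentences."},  # constraint c
]
```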
Source text: 「We exclusively focus on verifiable constraints, which is limiting, as many constraints used by users in the wild are constraints that do not have an easily verifiable ground truth.」 (From: Section 7, p. 11)
Learning Objectives
Through carefully designed multiple-choice questions paired with passages from the original paper, this guide helps learners master the core concepts, research findings, and key terminology of the paper "Generalizing Verifiable Instruction Following".
How to Use
Read each question carefully, choose the answer you think is most appropriate, and then check the source text and explanation to deepen your understanding.
Questions and Explanations