Source: "We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints." (Abstract, p. 1)
Source: "We introduce a new benchmark, IFBENCH, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints." (Abstract, p. 1)
A. Reinforcement Learning with Validated Responses
B. Reward Learning with Verified Rules
C. Reinforcement Learning with Verifiable Rewards
D. Reasoning Learning via Verified Rewards
Correct answer: C
Source: "Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following." (Abstract, p. 1)
Explanation: RLVR stands for "Reinforcement Learning with Verifiable Rewards." The core idea is that, because instruction constraints can be verified automatically by a program (for example, checking whether the output contains 5 paragraphs), the model's output can be given an exact, verifiable reward signal (1 if the constraint is followed, 0 otherwise), which is then used for reinforcement learning.
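To make the reward construction concrete, here is a minimal sketch of a constraint verifier used as a binary RLVR reward. The paragraph-counting rule, the function names, and the blank-line definition of a paragraph are illustrative assumptions, not the paper's released verification code.

```python
import re

def verify_num_paragraphs(response: str, num_paragraphs: int = 5) -> bool:
    """Hypothetical verifier: the response must contain exactly
    `num_paragraphs` paragraphs separated by blank lines."""
    paragraphs = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    return len(paragraphs) == num_paragraphs

def verifiable_reward(response: str) -> float:
    """Binary RLVR-style reward: 1.0 if the constraint is followed, else 0.0."""
    return 1.0 if verify_num_paragraphs(response) else 0.0

# A three-paragraph response fails the five-paragraph constraint.
print(verifiable_reward("Intro.\n\nBody.\n\nConclusion."))  # 0.0
```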
Knowledge point: The role of IFTRAIN. Question: What role does the IFTRAIN dataset released with the paper play? Options:
A. It is an evaluation benchmark of 58 constraints used to test models' final performance.
B. It is a dataset meant specifically for training, containing 29 new, hand-annotated training constraints together with their verification functions.
C. It is an older, now-deprecated version of IFBENCH.
D. It is a dedicated test set for comparing the DPO and GRPO training methods.
Correct answer: B
Source: "In addition to IFBENCH, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code." and "IFTRAIN consists of 29 new, unseen, verifiable constraints, with their corresponding verification functions." (Abstract, p. 1 & Section 3, p. 4)
Source: "Following (verifiable) output constraints can stand in conflict with following the main task mentioned in the instruction and a model has to trade-off between completing the task while also adhering to the constraint." (Section 5, p. 9)
Source: "We propose adding a general reward model (RM) signal to the verifiable reward. The intuition is that while the verifiable reward checks for the adherence to the output constraint, the general reward model provides signal for whether the response answers the prompt." (Section 5, p. 10)
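A hedged sketch of one way the two signals could be combined is below. The additive form, the `rm_weight` parameter, and the callable interfaces are assumptions; the quote does not specify how the verifiable reward and the reward-model score are merged.

```python
from typing import Callable

def combined_reward(prompt: str,
                    response: str,
                    constraint_verifier: Callable[[str], bool],
                    reward_model: Callable[[str, str], float],
                    rm_weight: float = 1.0) -> float:
    """Sketch: add a general reward-model score (does the response answer the
    prompt?) to the binary verifiable reward (does it follow the constraint?)."""
    verifiable = 1.0 if constraint_verifier(response) else 0.0
    rm_score = reward_model(prompt, response)  # assumed to return a scalar score
    return verifiable + rm_weight * rm_score
```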
C. Increasing the number of constraints per training example (e.g., up to 5 or 6) improves performance both in-domain (IFEval) and out-of-domain (IFBENCH).
D. Increasing the number of constraints only helps on IFBENCH and hurts IFEval performance.
Correct answer: C
Source: "We find that training on a combination of constraints improves both in-domain and out-of-domain performance. …training on up to 5 or 6 constraints still leads to better generalization on these benchmarks." (Section 4.1, p. 5)
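To illustrate what "training on a combination of constraints" could look like at the data level, here is a hypothetical sketch that attaches several constraints to one base instruction. The constraint pool contents and the appending format are invented for illustration only.

```python
import random

def attach_constraints(instruction: str,
                       constraint_pool: list[str],
                       max_constraints: int = 5) -> str:
    """Append a random number of verifiable constraints to one instruction."""
    k = random.randint(1, min(max_constraints, len(constraint_pool)))
    chosen = random.sample(constraint_pool, k=k)
    return instruction + "\n" + "\n".join(chosen)

pool = [
    "Your response must contain exactly 4 paragraphs.",
    "Include the keyword 'lighthouse' at least twice.",
    "Do not use any commas.",
    "End your response with the phrase 'That is all.'",
    "Write your answer in all lowercase letters.",
]
print(attach_constraints("Write a short story about a storm at sea.", pool))
```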
Source: "The results in Tab. 5 show that despite training on the same prompts and starting from the same model, GRPO training with IF verifiable rewards consistently outperforms the model trained with DPO on IFEval and IFBENCH." (Section 4.6, p. 8)
C. IF-RLVR training on a base model ends up with nearly the same IFEval performance as training on an instruction-tuned model.
D. After training, the base model does better on IFBENCH but worse on IFEval.
Correct answer: C
Source: "In Table 6, we find that IF-RLVR training a base model leads to nearly the same IFEval performance as when using an instruct policy." (Section 4.7, p. 8)
C. They combine unseen constraints with unseen prompts from WildChat to build the test instances.
D. They manually review every test output to filter out possibly contaminated results.
Correct answer: C
Source: "By combining unseen prompts with unseen constraints, we prevent accidental train-test contamination and can appropriately evaluate language models’ abilities to generalize…" (Section 2, p. 3)
C. Training on a wider variable range that also covers the test range performs comparably to, and sometimes even better than, training on the same range.
D. The variable range has no significant effect on the model's ability to generalize.
Correct answer: C
Source: "Interestingly, though, training on a wider variable range, performs comparably and often even better than training on the same range. This suggests that training on a diverse set of constraint variables improves generalization…" (Section 4.3, p. 6)
C. Performance stays comparable on some tasks (e.g., GSM8K, MMLU) but is slightly hurt on others (e.g., AlpacaEval 2).
D. There is no effect on downstream tasks at all.
Correct answer: C
Source: "We also see that targeted RLVR training for IF slightly harms other downstream evaluations, such as AlpacaEval 2, while staying comparable on others, such as GSM8K, MMLU and BBH." (Section 4.5, p. 7)
C. "Strict" verifies the raw output directly, while "loose" first cleans the output (e.g., removing leading/trailing filler lines or font modifiers) before verification.
D. "Strict" is judged by human annotators, while "loose" is scored automatically by a model.
Correct answer: C
Source: "…we compute both strict and loose accuracy, where the strict accuracy verifies if the constraint is followed correctly, and the loose accuracy additionally cleans the model’s output by removing first/last lines and certain font modifiers." (Section 2, p. 3)
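A rough sketch of the strict/loose distinction is below. The released evaluation code may differ (IFEval-style loose scoring typically tries several cleaned variants); this simplified helper only drops the first and last lines and strips common markdown font modifiers.

```python
def loose_clean(response: str) -> str:
    """Simplified 'loose' preprocessing: drop the first and last lines and
    strip common markdown font modifiers before re-running the verifier."""
    lines = response.strip().split("\n")
    if len(lines) > 2:
        lines = lines[1:-1]
    cleaned = "\n".join(lines)
    for marker in ("**", "__", "*", "_"):
        cleaned = cleaned.replace(marker, "")
    return cleaned

def strict_and_loose(response: str, verifier) -> tuple[bool, bool]:
    """Strict: verify the raw output. Loose: also count a pass on the cleaned output."""
    strict = verifier(response)
    loose = strict or verifier(loose_clean(response))
    return strict, loose
```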
B. They combine instructions from a public SFT dataset (TÜLU-3-SFT) with constraints from IFEval or IFTRAIN.
C. They train directly on the IFBENCH test set to reach the best performance.
D. They extract prompts from logs of real user conversations with chatbots.
Correct answer: B
Source: "The prompts for verifiable IF training are created by combining an instruction from a public SFT dataset with a constraint from either the IFEval taxonomy … or our new unseen training constraint taxonomy (IFTRAIN). We randomly sample prompts from TÜLU-3-SFT…" (Section 3, Data, p. 4)
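As an illustration of the quoted construction, the sketch below pairs one instruction with one verifiable constraint. The example instructions and constraint templates are placeholders; in the paper the instructions are sampled from TÜLU-3-SFT and the constraints come from the IFEval or IFTRAIN taxonomies.

```python
import random

# Placeholder pools standing in for TÜLU-3-SFT instructions and the
# IFEval / IFTRAIN constraint taxonomies.
instructions = [
    "Summarize the article below in your own words.",
    "Explain how photosynthesis works to a ten-year-old.",
]
constraints = [
    "Your answer must contain exactly {n} bullet points.",
    "Wrap your entire answer in double quotation marks.",
]

def build_training_prompt() -> str:
    """Combine one sampled instruction with one sampled verifiable constraint."""
    instruction = random.choice(instructions)
    constraint = random.choice(constraints).format(n=random.randint(2, 6))
    return f"{instruction}\n{constraint}"

print(build_training_prompt())
```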
Source: "We also find that most LLMs get the same easy instances right and the same hard instances wrong, which makes the creation of preference pairs more difficult." (Section 4.6, p. 8)
Knowledge point: The design goal of IFTRAIN. Question: IFTRAIN is designed to teach the "basic building blocks" of constraints. According to the paper, what does this mean concretely? Options:
A. Teaching the model the basic syntax of programming languages.
B. Teaching the model different versions of copying tasks, such as copying specific spans or copying and then editing the input.
C. Teaching the model basic mathematical axioms and theorems.
D. Teaching the model translation rules between different languages.
Correct answer: B
Source: "The constraints were created to capture the basic building blocks of classic constraints. For example, to teach the model to copy better from the input, we create different versions of copying tasks, such as copying spans or copying and editing of the input." (Section 3, p. 4)
Source: "Each word in your response must start with the next letter of the alphabet, looping back to ‘A’ after ‘Z’." (Appendix A, Table 15, instruction 'alphabet')
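As an example of how such a constraint can be checked programmatically, here is a hypothetical verifier for the quoted "alphabet" constraint. The actually released verification function may handle edge cases (punctuation, required starting letter, empty output) differently.

```python
import string

def verify_alphabet_constraint(response: str) -> bool:
    """Hypothetical verifier: each word must start with the next letter of
    the alphabet, wrapping around from 'Z' back to 'A'."""
    words = response.split()
    if not words:
        return False
    letters = string.ascii_lowercase
    start = letters.find(words[0].lower()[0])
    if start == -1:
        return False
    for i, word in enumerate(words):
        if not word.lower().startswith(letters[(start + i) % 26]):
            return False
    return True

print(verify_alphabet_constraint("Apples bring colorful delights"))  # True
print(verify_alphabet_constraint("Zebras always bounce"))            # True (Z -> A -> B)
```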
C. The "multi-turn" setting: the model first responds to a general task, and in the next turn the user asks it to rewrite its answer to satisfy a new constraint.
D. A "parallel" setting: the model is given several independent tasks and constraints at the same time.
Correct answer: C
Source: "Multi-turn: c is isolated from t in three turns. The first “user” prompt consist of a general prompt with task t and the second turn is an “assistant”‘s response to t, r1. The third turn (“user”) asks to rewrite r₁ to comply with a constraint c." (Section 2, pp. 3-4)
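The sketch below builds the three-turn chat structure described in the quote. The message-dict format and the wording of the final rewrite request are assumptions made for illustration.

```python
def build_multi_turn(task: str, first_response: str, constraint: str) -> list[dict]:
    """Three-turn setup: the constraint c is kept separate from the task t and
    is only introduced in the final user turn, which asks for a rewrite of r1."""
    return [
        {"role": "user", "content": task},
        {"role": "assistant", "content": first_response},
        {"role": "user", "content": (
            "Please rewrite your previous answer so that it complies with "
            f"the following constraint: {constraint}")},
    ]

messages = build_multi_turn(
    task="Describe the water cycle.",
    first_response="Water evaporates, condenses into clouds, and falls as rain.",
    constraint="Your rewritten answer must contain exactly three sentences.",
)
```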
B. Most models overfit on existing instruction-following benchmarks; generalization is the key challenge, and IFBENCH and RLVR provide paths for evaluating and improving it.
C. DPO is the best method for improving models' instruction-following ability.
D. Instruction-following ability can only be fundamentally improved by scaling up model parameters.
Correct answer: B
Source: "We create IFBENCH, a challenging and unseen benchmark to evaluate precise, verifiable instruction following. We show that most models overfit on a small set of constraints and that generalization is difficult. … We conclude with recommendations for improved constraint following abilities…" (Section 7, Conclusion and Limitations, p. 11)
Learning objectives
Through carefully designed multiple-choice questions paired with quotes from the source paper, this material helps learners master the core ideas behind evaluating and improving the precise instruction-following abilities of large language models.
Usage notes
Read each question carefully and check the explanation against the source quote. This material is based on the paper "Generalizing Verifiable Instruction Following" and aims to build a thorough understanding of the problem it raises, the solution it proposes, and its experimental conclusions.
Questions and explanations