Source quote: "In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form." (Abstract, p. 1)
Source quote: "…where humour arises from the interpretation of the punning word (i.e., “dyed”) with a phonetically similar word with different meaning (i.e., “died”) in the case of heterographic puns…" (Introduction, p. 1)
Source quote: "On the other hand, much of the humour encountered online… is based on contemporary topical knowledge of evolving pop-culture phenomena and news events, where a full appreciation of the humour relies on potentially esoteric knowledge, rather than common-sense…" (Introduction, p. 1)
Source quote: "To assess the quality of joke explanations generated by the models, we propose a scoring rubric with two core criteria, accuracy and completeness, each evaluated on a 6-point scale (0-5)." (4.3 Evaluation Criteria, p. 5)
Source quote: "When comparing model performance, GPT-4o consistently outperforms all others, demonstrating the highest accuracy and completeness scores across joke types." (5 Human Evaluation, p. 5)
Knowledge point: explanation success rate. Question: In this study, what criterion must a model's joke explanation meet to be counted as a "good" quality explanation? Options:
A. Scores of 4 or above on both the Accuracy and Completeness criteria.
B. An explanation longer than 100 words.
C. A full score of 5 on at least one evaluation criterion.
D. A "pass" verdict from the LLM judge.
Correct answer: A
Source quote: "…we present a binary categorisation of explanations, treating scores of 4 or above on both criteria as being “good” quality explanations…" (5.1 Explanation Success Rate, p. 6)
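A minimal sketch of this binary rule, combining the 0-5 rubric from 4.3 with the threshold above (the function and variable names are my own, not from the paper):

```python
def is_good_explanation(accuracy: int, completeness: int) -> bool:
    """An explanation counts as 'good' only if BOTH rubric scores,
    each on the paper's 0-5 scale, are 4 or above."""
    assert 0 <= accuracy <= 5 and 0 <= completeness <= 5
    return accuracy >= 4 and completeness >= 4

# Example: is_good_explanation(4, 5) -> True
#          is_good_explanation(5, 3) -> False (completeness below 4)
```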
Source quote: "H2. Heterographic puns will be more difficult for models to explain than homographic puns due to the former’s reliance on phonetic similarity, which is not explicitly encoded in orthographic text." (4.1 Hypotheses, p. 4)
Source quote: "The challenge likely stems from the additional complexity introduced by needing to recognise the phonetic similarity of the pun word whilst being trained on orthographic tokens." (5.1 Explanation Success Rate, p. 6)
Source quote: "Furthermore, we also observe that no existing state-of-the-art models can consistently explain jokes of any type correctly." (9 Conclusion, p. 8)
Knowledge point: the "Tide Pod Challenge" case study. Question: In the case study on the "Tide Pod Challenge", what was the main difference between the large models (e.g., GPT-4o) and the smaller models in their explanations? Options:
A. The smaller models provided longer, more detailed explanations.
B. The large models successfully inferred the joke's connection to the "Tide Pod Challenge" internet phenomenon, while the smaller models did not.
C. None of the models understood the background of this joke.
D. The smaller models judged the joke unfunny, while the large models found it humorous.
Correct answer: B
Source quote: "Interestingly, all of the full-size models inferred the reference to this challenge in their explanations. However, the smaller counterparts consistently omitted this information or presented misinterpretations…" (7 Case Study, p. 7)
Source quote: "H4. Larger model variants will perform better than smaller variants, particularly for topical humour, due to being able to store larger amounts of information in their parameters…" (4.1 Hypotheses, p. 4)
Source quote: "Regarding joke difficulty, the results align with the predicted ordering. Homographic jokes are the easiest to explain (β = 0.583, p < 0.001), while non-topical (β = -0.511, p < 0.001) and topical jokes (β = -0.574, p < 0.001) are significantly harder." (p. 7, bottom of left column)
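The β values in this quote are regression coefficients by joke type. The paper's exact model specification is not reproduced here, so the following is only a hedged sketch, assuming a logistic regression of the binary "good explanation" label (defined above) on joke-type indicators, with heterographic puns as the reference category:

```latex
% Hypothetical specification: the logistic link and the heterographic-pun
% reference category are assumptions, not quoted from the paper.
\[
\log\frac{P(\text{good})}{1-P(\text{good})}
  = \beta_0
  + 0.583\,x_{\text{homographic}}
  - 0.511\,x_{\text{non-topical}}
  - 0.574\,x_{\text{topical}}
\]
% Each x is a 0/1 indicator of the joke's type.
```

Under this reading, a homographic joke raises the odds of a good explanation relative to a heterographic pun, while both Reddit categories lower them, matching the quoted ordering.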
Knowledge point: the LLM judge. Question: To check the reliability of the human evaluation, the researchers additionally applied an "LLM as a Judge" paradigm. Which model did they choose as the judge? Options:
A. GPT-4o
B. Llama 3.1-70B-Instruct
C. Qwen2.5-72B-Instruct
D. Gemini 1.5 Pro
Correct answer: C
Source quote: "Across the whole dataset, we additionally leverage the LLM as a Judge paradigm, using Qwen2.5-72B-Instruct, finding agreement for accuracy…" (4.3 Evaluation Criteria, p. 5)
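A minimal sketch of what such a judging call could look like, using the Hugging Face transformers chat pipeline. Only the judge model name comes from the paper; the prompt wording and the output format are my own assumptions:

```python
# Sketch of an LLM-as-a-Judge call. The prompt text and score format are
# assumptions; only the judge model name (Qwen2.5-72B-Instruct) is from the paper.
from transformers import pipeline

judge = pipeline("text-generation", model="Qwen/Qwen2.5-72B-Instruct",
                 device_map="auto")  # a 72B model needs multiple GPUs

def judge_explanation(joke: str, reference: str, candidate: str) -> str:
    messages = [{
        "role": "user",
        "content": (
            "Grade the candidate explanation of the joke against the reference.\n"
            f"Joke: {joke}\n"
            f"Reference explanation: {reference}\n"
            f"Candidate explanation: {candidate}\n"
            "Reply as 'accuracy=<0-5> completeness=<0-5>'."
        ),
    }]
    result = judge(messages, max_new_tokens=32)
    return result[0]["generated_text"][-1]["content"]  # assistant reply
```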
Source quote: "H3. Topical humour will be more difficult for models to explain than non-topical Reddit humour, due to the former’s reliance on subtle references to contemporary pop culture and events, rather than common sense reasoning and general knowledge." (4.1 Hypotheses, p. 4)
Source quote: "GPT-4o Mini, for example, explains the joke by suggesting that “teenagers are known for being messy eaters”, which caused increased sales." (p. 8, top of left column)
Source quote: "We present a novel, balanced dataset of 600 jokes, categorised into traditional puns…, non-topical Reddit jokes…, and topical Reddit jokes… Each joke is paired with a high-quality, succinct, human-authored reference explanation…" (p. 2, left column)
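To make the dataset's shape concrete, here is a hypothetical record layout; the field names and category strings are my own, since the quote only describes the joke sources and the paired reference explanations:

```python
# Hypothetical record layout for the 600-joke dataset described above.
from dataclasses import dataclass

@dataclass
class JokeRecord:
    joke: str                   # the joke text itself
    category: str               # e.g. "pun (heterographic)", "pun (homographic)",
                                # "non-topical Reddit", or "topical Reddit"
    reference_explanation: str  # succinct, human-authored gold explanation
```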
Source quote: "Homographic jokes, where a word has multiple meanings but identical spelling, consistently yield the highest proportion of successful explanations across nearly all models." (5.1 Explanation Success Rate, p. 6)
Reference: Loakman, T., Thorne, W., & Lin, C. (2025). Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes. arXiv:2507.13335v1 [cs.CL].