Source quote: "Natural language processing (NLP) relies heavily on autoregressive (AR) models, especially when it comes to sequence generation tasks. These models successfully capture the conditional dependencies inherent in language by generating text by predicting each token based on the order of preceding tokens." (from: Discrete Diffusion Models for Language Generation.pdf, p. 21)
Source quote: "The primary objective during training is to minimize the loss function that quantifies the difference between the original data and the model's reconstruction at each denoising step. Through this optimization process, the model progressively improves its generative capability by learning a high-fidelity reverse diffusion path." (from: Discrete Diffusion Models for Language Generation.pdf, p. 22)
Knowledge point: Forward diffusion process  Question: What role does the forward diffusion process play in a diffusion model?  Options:
A. Directly generating new sequences
B. Optimizing model parameters
C. Training the denoising network
D. Incrementally adding noise to the data ✅
Correct answer: D
Source quote: "In the forward process, noise is incrementally added to the input data across a series of timesteps, gradually transforming the data into a noise-like distribution." (from: Discrete Diffusion Models for Language Generation.pdf, p. 24)
Explanation: The forward process progressively transforms the original data into a noise-like distribution; it is a Markov process that the model learns to reverse, providing the foundation for backward denoising.
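Illustration: the forward corruption described above can be sketched in a few lines. This is a minimal sketch assuming an absorbing-state ("mask") corruption with a per-step noise schedule `betas`; the function and variable names are placeholders, not the thesis's implementation.

```python
import torch

def forward_mask_noise(tokens: torch.LongTensor, t: int, betas: list, mask_id: int) -> torch.LongTensor:
    """Corrupt a token sequence up to step t by independently replacing tokens
    with the [MASK] state, following an absorbing-state noise schedule."""
    # Probability that a token is still un-masked after t steps:
    # each step keeps a token with probability (1 - beta_s).
    keep_prob = 1.0
    for s in range(t):
        keep_prob *= (1.0 - betas[s])
    corrupted = tokens.clone()
    # Independently decide, per position, whether the token has been absorbed.
    absorbed = torch.rand_like(tokens, dtype=torch.float) > keep_prob
    corrupted[absorbed] = mask_id
    return corrupted

# Example: a 5-token sequence with a flat schedule over T = 10 steps.
T = 10
betas = [0.1] * T
x0 = torch.tensor([12, 7, 42, 3, 99])
xt = forward_mask_noise(x0, t=5, betas=betas, mask_id=100)
```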
Knowledge point: Backward diffusion process  Question: What is the main goal of the backward diffusion process?  Options:
A. Adding more noise
B. Recovering the original data from noise ✅
C. Computing embedding vectors
D. Generating tokens sequentially
Correct answer: B
Source quote: "We can demonstrate, from a statistical standpoint, that the training objective to diffusion models involves minimizing a loss function derived from the Variational Lower Bound (VLB) [10]. This objective enables the model to learn the denoising process in a step-by-step manner by approximating the reverse transitions from…" (from: Discrete Diffusion Models for Language Generation.pdf, p. 23)
Explanation: The backward process learns to reverse the forward noise addition, generating data samples from the noise distribution through step-by-step denoising.
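Illustration: in the usual diffusion notation (which may differ from the thesis's exact symbols), the fixed forward chain and the learned reverse chain factorize as

```latex
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),
\qquad
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)
```

Training fits each reverse transition $p_\theta(x_{t-1} \mid x_t)$ to the corresponding forward posterior $q(x_{t-1} \mid x_t, x_0)$, which is the step-by-step, VLB-derived objective the quote refers to.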
Knowledge point: Bits Per Token (BPT)  Question: What does the Bits Per Token (BPT) metric measure?  Options:
A. The model's compression efficiency ✅
B. Generation speed
C. The level of added noise
D. Sequence length
Correct answer: A
Source quote: "In the evaluation of language models, Bits Per Token (BPT) is a fundamental metric that quantifies the average number of bits required to encode each token in a given sequence." (from: Discrete Diffusion Models for Language Generation.pdf, p. 18)
Source quote: "Perplexity (PPL) (Section 2.5), an indicator of the model's ability to predict unseen data." (from: Discrete Diffusion Models for Language Generation.pdf, p. 17)
Source quote: "Negative Log-Likelihood (NLL) (Section 2.6), a loss-based evaluation metric derived from the log-likelihood of the predicted distribution." (from: Discrete Diffusion Models for Language Generation.pdf, p. 17)
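Illustration: the three metrics quoted above are tightly related. In the standard formulation (assuming a per-token average NLL in natural log, not necessarily the thesis's exact definition):

```latex
\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(w_i \mid w_{<i}),
\qquad
\mathrm{PPL} = \exp(\mathrm{NLL}),
\qquad
\mathrm{BPT} = \frac{\mathrm{NLL}}{\ln 2}
```

so a lower BPT corresponds directly to a more compressive encoding of each token.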
Source quote: "Composed of full Wikipedia articles, WikiText-103 is particularly suitable for evaluating models capable of capturing long-range dependencies in text [5]." (from: Discrete Diffusion Models for Language Generation.pdf, p. 32)
Source quote: "Subword Tokenization: The sentence is divided into smaller units (subwords) rather than whole words, which helps handle rare words and improves model generalization." (from: Discrete Diffusion Models for Language Generation.pdf, p. 34)
Explanation: Tokenization converts text into tokens the model can process, helping it handle rare words and improving generalization.
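Illustration: a concrete subword-tokenization sketch using the Hugging Face GPT-2 BPE tokenizer; the choice of tokenizer here is an assumption for the example, not necessarily the one used in the thesis.

```python
# Subword tokenization example with the GPT-2 BPE tokenizer (illustrative only).
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
text = "Diffusion models denoise token sequences."
subwords = tok.tokenize(text)   # rare words are split into smaller, known pieces
token_ids = tok.encode(text)    # integer IDs that the model actually consumes
print(subwords)
print(token_ids)
```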
Knowledge point: Embedding vectors  Question: What is the function of embedding vectors in the model input?  Options:
A. Adding noise
B. Converting token IDs into multi-dimensional vector representations ✅
C. Computing perplexity
D. Parallel denoising
Correct answer: B
Source quote: "Embedding Layer: This is typically a lookup table (e.g., a matrix of embeddings) that returns the embedding vector for a specific token ID." (from: Discrete Diffusion Models for Language Generation.pdf, p. 35)
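Illustration: a minimal sketch of such a lookup-table embedding layer in PyTorch; the vocabulary size and embedding dimension below are placeholders.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_257, 768              # placeholder sizes
embedding = nn.Embedding(vocab_size, d_model)  # lookup table: one learned row per token ID

token_ids = torch.tensor([[15496, 995]])       # shape (batch, seq_len)
vectors = embedding(token_ids)                 # shape (1, 2, 768): one d_model-dim vector per token
```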
Source quote: "This section introduces the transformer architecture [15], illustrated in Figure 7.1, which serves as the foundational structure for all three autoregressive models discussed later in this work." (from: Discrete Diffusion Models for Language Generation.pdf, p. 59)
Explanation: The Transformer uses self-attention, overcoming the vanishing-gradient problem of RNNs and supporting parallel computation.
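Illustration: the parallelism mentioned above comes from self-attention, where every position attends to every other position in one matrix operation. A compact sketch of standard scaled dot-product attention (not the thesis's code):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k, v: tensors of shape (batch, seq_len, d_k); all positions are scored in parallel."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:  # AR models mask out future positions
        seq_len = q.size(-2)
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```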
Knowledge point: Exposure bias in AR models  Question: What is exposure bias in autoregressive models?  Options:
A. Too much noise
B. A mismatch caused by conditioning on ground-truth prefixes during training but on generated prefixes during inference ✅
C. Embedding dimensionality that is too high
D. Too few denoising steps
Correct answer: B
Source quote: "While autoregressive models have set the state-of-the-art in natural language generation, however, they suffer from well-known limitations, including exposure bias and restricted support for parallel computation during inference." (from: Discrete Diffusion Models for Language Generation.pdf, p. 14)
Explanation: Exposure bias refers to the inconsistency between the training and inference contexts, which can lead to error accumulation.
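Illustration: a schematic contrast of the two regimes, assuming `model` is any AR language model that maps a token tensor of shape (batch, seq_len) to per-position logits; this is an illustrative sketch, not the thesis's training code.

```python
import torch
import torch.nn.functional as F

def training_step(model, gold_tokens):
    """Teacher forcing: every position is predicted from the GOLD prefix."""
    logits = model(gold_tokens[:, :-1])                                # (batch, seq_len-1, vocab)
    return F.cross_entropy(logits.transpose(1, 2), gold_tokens[:, 1:])

@torch.no_grad()
def generate(model, start_tokens, steps):
    """Free-running decoding: every token is predicted from the model's OWN prefix,
    so an early mistake is fed back in and can compound (exposure bias)."""
    tokens = start_tokens
    for _ in range(steps):
        logits = model(tokens)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```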
Knowledge point: Parallel generation in diffusion models  Question: What parallelism advantage do diffusion models have during generation?  Options:
A. Predicting each token sequentially
B. Parallelism only during training
C. Ordering of noise addition
D. Generating all tokens simultaneously ✅
Correct answer: D
Source quote: "Unlike AR models, diffusion models generate entire sequences in parallel during inference, offering potential advantages in sampling flexibility, controllability, and robustness against exposure bias." (from: Discrete Diffusion Models for Language Generation.pdf, p. 13)
Explanation: Diffusion models allow parallel denoising, improving inference efficiency and overcoming the sequential constraint of AR generation.
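Illustration: in contrast to the token-by-token loop in the previous sketch, a single reverse-diffusion step can update every masked position at once. A minimal sketch, assuming `denoiser` returns per-position logits and `mask_id` marks corrupted tokens:

```python
import torch

@torch.no_grad()
def reverse_step(denoiser, x_t, mask_id):
    """One reverse step: the denoiser scores ALL positions in a single forward pass,
    and every currently-masked token is filled in simultaneously."""
    logits = denoiser(x_t)                     # (batch, seq_len, vocab)
    proposal = logits.argmax(dim=-1)           # greedy fill-in, for illustration only
    masked = x_t.eq(mask_id)
    return torch.where(masked, proposal, x_t)  # unmasked tokens are left unchanged
```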
Knowledge point: Masking in D3PM  Question: What is the masking mechanism in D3PM used for?  Options:
A. Simulating noise via transitions into a mask state ✅
B. Sequential generation
C. Computing embeddings
D. Evaluating perplexity
Correct answer: A
Source quote: "As an illustrative example, assume $f_1 = 0.3$. The corresponding $Q$ matrix becomes: … the token ‘Saman’ (one-hot vector $[1,0,0,0,0]$) remains itself with a probability of 0.7, while it transitions to the mask state with a probability of 0.3." (from: Discrete Diffusion Models for Language Generation.pdf, p. 25)
Source quote: "The transition matrix at each step, of size $(m+1)\times(m+1)$, is given by: $Q = (1-f)I \dots$" (from: Discrete Diffusion Models for Language Generation.pdf, p. 25)
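Illustration: the quoted transition matrix can be constructed directly. The sketch below follows the quoted form $Q = (1-f)I$ with the remaining mass routed to an absorbing mask state; placing the mask state at the last index is an assumption of this example.

```python
import numpy as np

def absorbing_transition_matrix(m: int, f: float) -> np.ndarray:
    """(m+1) x (m+1) matrix: each real token keeps its identity with probability 1-f
    and transitions to the [MASK] state (last index, assumed here) with probability f;
    the mask state is absorbing."""
    Q = (1.0 - f) * np.eye(m + 1)
    Q[:, m] += f        # route the remaining probability mass to the mask column
    Q[m, m] = 1.0       # once masked, always masked
    return Q

Q1 = absorbing_transition_matrix(m=4, f=0.3)
assert np.allclose(Q1.sum(axis=1), 1.0)   # every row is a valid distribution
# A one-hot token such as [1,0,0,0,0] @ Q1 stays itself with probability 0.7 and
# moves to the mask state with probability 0.3, matching the quoted example.
```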
Source quote: "We can demonstrate, from a statistical standpoint, that the training objective to diffusion models involves minimizing a loss function derived from the Variational Lower Bound (VLB) [10]." (from: Discrete Diffusion Models for Language Generation.pdf, p. 23)
Explanation: The VLB provides an optimizable bound that approximates maximum-likelihood estimation and guides the training of the denoising model.
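Illustration: the VLB the quote refers to decomposes, in the standard notation (which may differ from the thesis's symbols), into a prior-matching term, per-step denoising KL terms, and a reconstruction term:

```latex
\mathcal{L}_{\mathrm{VLB}} =
\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{\mathcal{L}_T}
+ \sum_{t=2}^{T} \underbrace{\mathbb{E}_{q}\, D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{\mathcal{L}_{t-1}}
- \underbrace{\mathbb{E}_{q}\, \log p_\theta(x_0 \mid x_1)}_{\mathcal{L}_0}
```

Minimizing each KL term teaches the model one denoising transition, which is the step-by-step learning the quote describes.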
Knowledge point: GPT-2 as baseline  Question: What role does GPT-2 play as a baseline in the experiments?  Options:
A. A diffusion-model variant
B. Validating the performance of the autoregressive model ✅
C. Dataset preprocessing
D. Computing the noise matrix
Correct answer: B
Source quote: "To enhance the reliability and confidence of our experimental results, we implement an additional validation step by training a GPT-2 model under the same dataset and training conditions." (from: Discrete Diffusion Models for Language Generation.pdf, p. 35)
Explanation: GPT-2 serves as a strong baseline used to cross-validate the AR model's performance and ensure the robustness of the results.
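Illustration: a hedged sketch of such a validation run using the Hugging Face Trainer; the toy corpus and hyperparameters below are placeholders and do not reflect the thesis's actual training configuration.

```python
# Minimal GPT-2 baseline training sketch (placeholder data and hyperparameters).
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 has no pad token by default

texts = ["Diffusion models denoise text.", "Autoregressive models predict tokens."]
train_dataset = [tokenizer(t) for t in texts]  # toy stand-in for the real tokenized corpus

model = GPT2LMHeadModel.from_pretrained("gpt2")
args = TrainingArguments(output_dir="gpt2-baseline",
                         per_device_train_batch_size=2,
                         num_train_epochs=1,
                         logging_steps=10)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Trainer(model=model, args=args,
        train_dataset=train_dataset, data_collator=collator).train()
```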
Knowledge point: Training loss  Question: What is the significance of the training loss in model evaluation?  Options:
A. It is only used for generation speed
B. It reflects model convergence and learning progress ✅
C. It sets the level of added noise
D. It selects the embedding dimension
Correct answer: B
Source quote: "The training loss graphs… The global step metric represents the cumulative number of optimization steps taken during training…" (from: Discrete Diffusion Models for Language Generation.pdf, p. 39)
Source quote: "In terms of inference speed, both AR and D3PM models demonstr… D3PM slightly outperforms the AR model in batch generation rate, which implies its potential advantage in parallel generation settings." (from: Discrete Diffusion Models for Language Generation.pdf, p. 46)
Source quote: "For the successful execution of model training and evaluation, high-performance computing (HPC) resources provided by the university were utilized. Due to the computational demands of large-scale language models, my personal workstation was insufficient…" (from: Discrete Diffusion Models for Language Generation.pdf, p. 60)