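What Is the Difference Between Attention in the Encoder and the Decoder? | High-Frequency Interview Questions Explained

60-Second Answer Template

Self-attention in the Encoder can usually see every token in the input sequence, so it models context bidirectionally. Self-attention in the Decoder applies a causal mask: each position can attend only to the tokens already generated, never to future ones. In an Encoder-Decoder model, the Decoder also has cross-attention, where Q comes from the Decoder's current state and K/V come from the Encoder output, so that generation can align with the input and retrieve information from it. The three differ in where Q/K/V come from, in how they are masked, and in what they are for.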

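Topic tested: visibility range

Difficulty: real interview question

Answer goal: explain the method, the trade-offs, and the likely follow-up questions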

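In-Depth Analysis

The Encoder attends to the full input

Encoder self-attention serves understanding tasks, so each position can usually see the entire input. It can therefore use both left and right context at once and build a more complete representation of the input.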

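The Decoder attends to past outputs

During autoregressive generation the Decoder must not see future tokens ahead of time, so its self-attention uses a causal mask. Each position aggregates only the content generated before it (and itself), which keeps training consistent with inference.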
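The causal mask described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it is single-head, omits the learned Q/K/V projections, and the names `softmax` and `causal_self_attention` are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x):
    """Single-head self-attention with a causal mask (hypothetical helper,
    no learned projections: Q = K = V = x)."""
    seq_len, d_model = x.shape
    scores = x @ x.T / np.sqrt(d_model)  # (seq_len, seq_len)
    # True where key position j lies in the future of query position i (j > i).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    # Future positions get -inf, so softmax assigns them weight 0.
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 target-side tokens, model width 8
out, weights = causal_self_attention(x)
print(np.round(weights, 2))  # entries above the diagonal are 0
```

Each row of `weights` sums to 1 and is zero above the diagonal, so position i mixes only positions 0..i. The first row puts all of its weight on the first token.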

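Cross-attention connects input and output

In Encoder-Decoder architectures for tasks such as translation and summarization, the Decoder's cross-attention uses the Decoder hidden states as Q and the Encoder output as K/V, so each generation step can align with the relevant parts of the input.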
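A small sketch may make the shapes concrete. It assumes a single head and random projection matrices; `cross_attention`, `w_q`, `w_k`, and `w_v` are illustrative names, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper for illustration)."""
    q = decoder_states @ w_q   # (tgt_len, d_k)  queries from the target side
    k = encoder_outputs @ w_k  # (src_len, d_k)  keys from the source side
    v = encoder_outputs @ w_v  # (src_len, d_v)  values from the source side
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (tgt_len, src_len)
    # No causal mask here: every target position may read every source token.
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 5, 3, 8
encoder_outputs = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v)
print(context.shape, weights.shape)  # (3, 8) (3, 5)
```

The weight matrix has one row per target position and one column per source token, which is why it is often read as a soft alignment between output and input.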

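Q/K/V come from different sources

In Encoder self-attention, Q, K, and V all come from the input representation. In the Decoder's masked self-attention, Q, K, and V come from the target-side history. In cross-attention, Q comes from the target side while K and V come from the source side.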
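One way to see the distinction is that a single attention routine can cover all three cases, with only the inputs and the mask changing. The sketch below assumes no learned projections and a single head; `attention`, `source`, and `target` are hypothetical names chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_input, key_value_input, mask=None):
    """Scaled dot-product attention; `mask` is True where attending is allowed."""
    d = query_input.shape[-1]
    scores = query_input @ key_value_input.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ key_value_input

rng = np.random.default_rng(0)
source = rng.normal(size=(5, 8))  # stand-in for source-side representations
target = rng.normal(size=(3, 8))  # stand-in for target-side representations
causal = np.tril(np.ones((3, 3), dtype=bool))  # position i may see j <= i

enc_out = attention(source, source)                # Q, K, V all from the source
dec_self = attention(target, target, mask=causal)  # Q, K, V from target history
dec_cross = attention(dec_self, enc_out)           # Q from target, K/V from source
```

In a real Transformer each of the three modules has its own learned projection weights, so they are separate layers even though the computation has the same form.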

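The goals differ

The Encoder leans toward understanding and representation, the Decoder toward step-by-step generation, and cross-attention toward reading information during conditional generation. Distinguishing these three gives a more complete answer than only saying whether a mask is present.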

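Common Mistakes

Do not just say that the Encoder and the Decoder both use attention; distinguish the mask and the source of information.
Do not merge Decoder self-attention and cross-attention into a single module.
Do not overlook consistency between training and inference; the purpose of the causal mask is precisely to prevent future information from leaking.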

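Interviewer Follow-Up Questions

Why can't the Decoder see future tokens?

At generation time the future tokens do not exist yet. During training the full target sequence is available, so the mask must hide the future there as well; otherwise the model learns to rely on information it will not have at inference, and its behavior at inference diverges from training.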
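The leakage argument can be checked numerically: change only the last token and compare the outputs at the earlier positions with and without the causal mask. This is a rough sketch with a projection-free, single-head `self_attention` helper invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal):
    """Projection-free self-attention; `causal=True` hides future positions."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    return softmax(scores, axis=-1) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
x_changed = x.copy()
x_changed[-1] = rng.normal(size=8)  # replace only the final ("future") token

# Compare outputs at all positions except the last one.
prefix_same_causal = np.allclose(
    self_attention(x, causal=True)[:-1], self_attention(x_changed, causal=True)[:-1]
)
prefix_same_full = np.allclose(
    self_attention(x, causal=False)[:-1], self_attention(x_changed, causal=False)[:-1]
)
print(prefix_same_causal, prefix_same_full)
```

With the causal mask the earlier positions are unaffected by the changed token, by construction. Without the mask they shift, which is the leak the mask is there to prevent.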

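Do decoder-only large models still have cross-attention?

A plain decoder-only language model usually has no encoder-decoder cross-attention, but multimodal or conditional-generation architectures may add attention over external representations.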

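Where do Q, K, and V come from in cross-attention?

Q comes from the Decoder's current state, and K and V come from the Encoder output, so the target side can read source-side information on demand.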