Luong Attention

2022-05-07

Attention architecture

Luong attention comes in two variants, local and global; this post focuses on global attention. The figure below shows the global attention architecture:

[Figure: global_attention]

Notation:

$\bar{h}_s$: encoder_output
$h_t$: decoder_output
$a_t(s)$: attn_weights
$c_t$: context vector
$\tilde{h}_t$: attentional hidden state

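Taking Chinese-English translation as an example, the architecture breaks down into the following computation steps:

attn_weights computation
Compute the attention weights $a_t$ from $h_t$ and $\bar{h}_s$.
Context vector computation
Use $a_t$ to take a weighted average of $\bar{h}_s$, giving the context vector $c_t$.
Attentional hidden state computation
Join $c_t$ and $h_t$ with $\operatorname{cat}$, then apply a $\operatorname{linear}$ transform followed by $\operatorname{tanh}$ to obtain $\tilde{h}_t$.
Prediction
Use $\tilde{h}_t$ to make the prediction and obtain the final result.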
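
For reference, the four steps above can be written compactly as equations. This is a sketch in the notation of this post; the weight names $W_a$, $W_c$ and $W_s$ are not used elsewhere in the post and are introduced here only for illustration, and the score is written in the bilinear ("general") form, which is an assumption that matches a bias-free linear transform followed by a batched matrix product.

$$
\begin{aligned}
\operatorname{score}(h_t, \bar{h}_s) &= h_t^{\top} W_a \bar{h}_s \\
a_t(s) &= \frac{\exp\left(\operatorname{score}(h_t, \bar{h}_s)\right)}{\sum_{s'} \exp\left(\operatorname{score}(h_t, \bar{h}_{s'})\right)} \\
c_t &= \sum_{s} a_t(s)\, \bar{h}_s \\
\tilde{h}_t &= \tanh\left(W_c [c_t; h_t]\right) \\
p(y_t \mid y_{<t}, x) &= \operatorname{softmax}\left(W_s \tilde{h}_t\right)
\end{aligned}
$$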

Attention computation flow

The global attention computation flow is shown in the figure below:

[Figure: Luong_Attention]

The computation steps are as follows:

Encoder
- step1: process the input with an RNN (e.g. LSTM) to obtain encoder_output:
  $\bar{h}_s: \operatorname{[batch\_size, input\_len, enc\_hidden\_size]}$
- step2: so that the later $\operatorname{bmm}$ is shape-compatible, apply a $\operatorname{linear}$ transform that maps enc_hidden_size to dec_hidden_size:
  $\bar{h}_s: \operatorname{[batch\_size, input\_len, dec\_hidden\_size]}$
- step3: transpose(1,2)
  $\bar{h}_s: \operatorname{[batch\_size, dec\_hidden\_size, input\_len]}$
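
The encoder-side shape changes can be checked with a small NumPy sketch. Everything here is an assumption made for illustration: the sizes are arbitrary, a random array stands in for the real RNN output, and the weight `W_a` and bias `b_a` are hypothetical names for the parameters of the linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
batch_size, input_len = 2, 5
enc_hidden_size, dec_hidden_size = 8, 6

# Stand-in for the RNN/LSTM output: random values instead of a real encoder.
encoder_output = rng.standard_normal((batch_size, input_len, enc_hidden_size))

# Linear map enc_hidden_size -> dec_hidden_size.
# W_a and b_a are random/zero here; in a real model they are learned.
W_a = rng.standard_normal((enc_hidden_size, dec_hidden_size))
b_a = np.zeros(dec_hidden_size)
projected = encoder_output @ W_a + b_a  # [batch_size, input_len, dec_hidden_size]

# Swap the last two axes, the NumPy counterpart of transpose(1, 2).
keys = projected.transpose(0, 2, 1)     # [batch_size, dec_hidden_size, input_len]

print(projected.shape, keys.shape)
```

After the transpose, the last axis of decoder_output (dec_hidden_size) lines up with the middle axis of `keys`, which is what a batched matrix product requires.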
Decoder
- RNN
  - step4: process with an RNN (e.g. LSTM) to obtain decoder_output:
    $h_t: \operatorname{[batch\_size, output\_len, dec\_hidden\_size]}$
- Attention
  - step5: score $h_t$ against $\bar{h}_s$ (the projected, transposed version from step3):
    $\operatorname{score}(h_t, \bar{h}_s)=\operatorname{bmm}(h_t, \bar{h}_s): \operatorname{[batch\_size, output\_len, input\_len]}$
  - step6: apply $\operatorname{softmax}$ over the input_len dimension to obtain attn_weights:
    $a_t(s) = \operatorname{align}(h_t, \bar{h}_s)=\operatorname{softmax}\left(\operatorname{score}(h_t, \bar{h}_s)\right): \operatorname{[batch\_size, output\_len, input\_len]}$
  - step7: use $a_t(s)$ to take a weighted average of encoder_output. As the enc_hidden_size in the shape shows, this is the original, unprojected encoder_output:
    $c_t=a_t \bar{h}_s: \operatorname{[batch\_size, output\_len, enc\_hidden\_size]}$
- Attentional hidden state
  - step8: a $\operatorname{cat}$ operation "merges" the context vector into the original decoder_output:
    $[c_t;h_t]: \operatorname{[batch\_size, output\_len, enc\_hidden\_size+dec\_hidden\_size]}$
    To apply the subsequent $\operatorname{linear}$ transform, its shape needs to be adjusted, giving:
    $[c_t;h_t]: \operatorname{[batch\_size \times output\_len, enc\_hidden\_size+dec\_hidden\_size]}$
  - step9: apply a $\operatorname{linear}$ transform to $[c_t;h_t]$:
    $\operatorname{linear}([c_t;h_t]): \operatorname{[batch\_size \times output\_len, dec\_hidden\_size]}$
  - step10: apply a $\operatorname{tanh}$ transform:
    $\operatorname{tanh}(\operatorname{linear}([c_t;h_t])): \operatorname{[batch\_size \times output\_len, dec\_hidden\_size]}$
  - step11: reshape from 2-D back to 3-D to obtain the final attentional hidden state:
    $\tilde{h}_t: \operatorname{[batch\_size, output\_len, dec\_hidden\_size]}$
- Predict
  - step12: apply a $\operatorname{linear}$ transform to $\tilde{h}_t$:
    $\operatorname{linear}(\tilde{h}_t): \operatorname{[batch\_size, output\_len, vocab\_size]}$
  - step13: apply $\operatorname{softmax}$ to obtain the final predicted probabilities:
    $\operatorname{softmax}(\operatorname{linear}(\tilde{h}_t)): \operatorname{[batch\_size, output\_len, vocab\_size]}$
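
The decoder-side flow, from scoring through to the predicted probabilities, can be traced end to end with a small NumPy sketch. It is a shape-checking illustration under stated assumptions, not a trained model: random arrays stand in for the encoder and decoder RNN outputs, the sizes are arbitrary, the weight names `W_a`, `W_c` and `W_s` are hypothetical, biases are omitted, and `@` plays the role of `bmm`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
batch_size, input_len, output_len = 2, 5, 4
enc_hidden_size, dec_hidden_size, vocab_size = 8, 6, 10

# Random stand-ins for the encoder and decoder RNN outputs.
encoder_output = rng.standard_normal((batch_size, input_len, enc_hidden_size))
decoder_output = rng.standard_normal((batch_size, output_len, dec_hidden_size))

# Randomly initialised weights; in a real model these are learned.
W_a = rng.standard_normal((enc_hidden_size, dec_hidden_size))
W_c = rng.standard_normal((enc_hidden_size + dec_hidden_size, dec_hidden_size))
W_s = rng.standard_normal((dec_hidden_size, vocab_size))

# Project and transpose the encoder output so it can be multiplied with decoder_output.
keys = (encoder_output @ W_a).transpose(0, 2, 1)  # [batch, dec_hidden, input_len]

# Score every decoder position against every encoder position.
score = decoder_output @ keys                      # [batch, output_len, input_len]

# Normalise over the input positions to get attn_weights.
attn_weights = softmax(score, axis=-1)             # [batch, output_len, input_len]

# Weighted average of the original (unprojected) encoder_output.
context = attn_weights @ encoder_output            # [batch, output_len, enc_hidden]

# Concatenate, flatten to 2-D, apply linear + tanh, then restore the 3-D shape.
concat = np.concatenate([context, decoder_output], axis=-1)
flat = concat.reshape(batch_size * output_len, -1)
attn_hidden = np.tanh(flat @ W_c)
attn_hidden = attn_hidden.reshape(batch_size, output_len, dec_hidden_size)

# Project to the vocabulary and normalise into probabilities.
logits = attn_hidden @ W_s                         # [batch, output_len, vocab_size]
probs = softmax(logits, axis=-1)

print(attn_weights.shape, context.shape, attn_hidden.shape, probs.shape)
```

The explicit flatten and restore mirrors the reshape described in the steps; NumPy's `@` would also broadcast over the 3-D array directly, so the reshape is kept here only to follow the post's shapes one by one.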

References

Lesson 7: Seq2Seq and Attention (julyedu.com)