masks
For attention masking, PyTorch's nn.MultiheadAttention accepts either float or boolean masks.
There was a bug with float masks that sometimes caused NaN values to be generated:

- regression - nn.MultiheadAttention does not respect adding of floating point mask to attention for the fast path · Issue #107084 · pytorch/pytorch
- TransformerEncoderLayer fast path predicts NaN when provided attention bias · Issue #118628 · pytorch/pytorch
- Disable nn.MHA fastpath for floating point masks by mikaylagawarecki · Pull Request #107641 · pytorch/pytorch
So I’m using boolean masks instead. Note: PyTorch converts 1 to True and 0 to False when casting to a boolean tensor.
From the docs for MultiheadAttention.forward():

- key_padding_mask – If specified, a mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e. treat as “padding”). For unbatched query, shape should be (S). Binary and float masks are supported. For a binary mask, a True value indicates that the corresponding key value will be ignored for the purpose of attention. For a float mask, it will be directly added to the corresponding key value.
- attn_mask – If specified, a 2D or 3D mask preventing attention to certain positions. Must be of shape (L, S) or (N·num_heads, L, S), where N is the batch size, L is the target sequence length, and S is the source sequence length. A 2D mask will be broadcasted across the batch while a 3D mask allows for a different mask for each entry in the batch. Binary and float masks are supported. For a binary mask, a True value indicates that the corresponding position is not allowed to attend. For a float mask, the mask values will be added to the attention weight. If both attn_mask and key_padding_mask are supplied, their types should match.
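To make the boolean-mask convention concrete, here is a minimal sketch (not part of this module; the shapes and sizes are invented for the example) of calling nn.MultiheadAttention with a boolean key_padding_mask and a boolean causal attn_mask:

```python
import torch
import torch.nn as nn

# Example sizes: batch N=2, sequence length S=L=4, embed_dim=8, 2 heads.
N, S, E, H = 2, 4, 8, 2
mha = nn.MultiheadAttention(embed_dim=E, num_heads=H, batch_first=True)

x = torch.randn(N, S, E)

# key_padding_mask: (N, S), True marks key positions to ignore (padding).
key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
key_padding_mask[:, -1] = True  # pretend the last position is padding

# attn_mask: (L, S), True blocks attention to that position (causal mask).
attn_mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)

out, weights = mha(x, x, x,
                   key_padding_mask=key_padding_mask,
                   attn_mask=attn_mask)
print(out.shape)  # torch.Size([2, 4, 8])
```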
make_dummy_input
make_dummy_input (total_seq_len, nattn, batch_size)
create_masks
create_masks (input_seq, target_seq, device='cuda')
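No docstring is shown for create_masks. A plausible sketch, assuming it simply combines the padding and lookahead helpers documented below (the return values and exact behavior here are an assumption, not the module's actual code):

```python
import torch

def create_masks(input_seq, target_seq, device='cuda'):
    # Sketch only -- assumes create_masks builds an encoder padding mask,
    # a decoder padding mask, and a lookahead mask from the helpers below.
    enc_padding_mask = create_padding_mask(input_seq).to(device)        # (batch, src_len)
    dec_padding_mask = create_padding_mask(target_seq).to(device)       # (batch, tgt_len)
    lookahead_mask = create_lookahead_mask(target_seq.shape[1]).to(device)  # (tgt_len, tgt_len)
    return enc_padding_mask, dec_padding_mask, lookahead_mask
```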
create_lookahead_mask
create_lookahead_mask (seq_len)
Create an attention mask, with rows representing target position and columns representing source position.
For row=i, column=j, mask[i][j] is ‘True’ if the decoder must ignore position j when processing position i.
The result is an upper triangular matrix (excluding the diagonal), with ‘True’ for any j > i.
:param seq_len: sequence length
:return: (seq_len, seq_len)
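The description maps directly onto torch.triu with diagonal=1. A minimal sketch of such a mask (an illustration of the behavior described above, not necessarily the module's exact implementation):

```python
import torch

def create_lookahead_mask(seq_len):
    # True above the diagonal (j > i): positions the decoder must ignore.
    # False on and below the diagonal: positions it may attend to.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(create_lookahead_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```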
create_padding_mask
create_padding_mask (seq)
In seq, the 5th entry in the last dimension is the padding column, which will be 1 if the row is padding.
Convert to a boolean tensor, indicating ‘True’ for entries that are padding and should be ignored.
:param seq: (batch_size, seq_len, 5)
:return: (batch_size, seq_len)
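Given that description, a minimal sketch of the padding mask (again an illustration of the described behavior, not necessarily the module's exact code):

```python
import torch

def create_padding_mask(seq):
    # seq: (batch_size, seq_len, 5); the 5th entry (index 4) is 1 for padding rows.
    # Returns a boolean (batch_size, seq_len) mask, True where the row is padding.
    return seq[..., 4].bool()

seq = torch.zeros(2, 3, 5)
seq[0, 2, 4] = 1  # mark one row as padding
print(create_padding_mask(seq))
# tensor([[False, False,  True],
#         [False, False, False]])
```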