masks
For attention masking, PyTorch's nn.MultiheadAttention accepts either float or boolean masks.
There was a bug with float masks that sometimes caused NaN values to be generated:

- regression - nn.MultiheadAttention does not respect adding of floating point mask to attention for the fast path · Issue #107084 · pytorch/pytorch
- TransformerEncoderLayer fast path predicts NaN when provided attention bias · Issue #118628 · pytorch/pytorch
- Disable nn.MHA fastpath for floating point masks by mikaylagawarecki · Pull Request #107641 · pytorch/pytorch
So I’m using boolean masks instead. Note: PyTorch converts 1 to True and 0 to False when casting to a boolean tensor.
From the docs for MultiheadAttention.forward():

- key_padding_mask – If specified, a mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e. treat as “padding”). For unbatched query, shape should be (S). Binary and float masks are supported. For a binary mask, a True value indicates that the corresponding key value will be ignored for the purpose of attention. For a float mask, it will be directly added to the corresponding key value.
- attn_mask – If specified, a 2D or 3D mask preventing attention to certain positions. Must be of shape (L, S) or (N·num_heads, L, S), where N is the batch size, L is the target sequence length, and S is the source sequence length. A 2D mask will be broadcasted across the batch while a 3D mask allows for a different mask for each entry in the batch. Binary and float masks are supported. For a binary mask, a True value indicates that the corresponding position is not allowed to attend. For a float mask, the mask values will be added to the attention weight. If both attn_mask and key_padding_mask are supplied, their types should match.
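To make the boolean-mask convention concrete, here is a minimal sketch (not part of this module; the shapes and sizes are invented for the example) of calling nn.MultiheadAttention with a boolean key_padding_mask and a boolean causal attn_mask:

```python
import torch
import torch.nn as nn

# Example sizes: batch N=2, sequence length S=L=4, embed_dim=8, 2 heads.
N, S, E, H = 2, 4, 8, 2
mha = nn.MultiheadAttention(embed_dim=E, num_heads=H, batch_first=True)

x = torch.randn(N, S, E)

# key_padding_mask: (N, S), True marks key positions to ignore (padding).
key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
key_padding_mask[:, -1] = True  # pretend the last position is padding

# attn_mask: (L, S), True blocks attention to that position (causal mask).
attn_mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)

out, weights = mha(x, x, x,
                   key_padding_mask=key_padding_mask,
                   attn_mask=attn_mask)
print(out.shape)  # torch.Size([2, 4, 8])
```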
make_dummy_input
make_dummy_input (total_seq_len, nattn, batch_size)
create_masks
create_masks (input_seq, target_seq, device='cuda')
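No docstring is shown for create_masks. A plausible sketch, assuming it simply combines the padding and lookahead helpers documented below (the return values and exact behavior here are an assumption, not the module's actual code):

```python
import torch

def create_masks(input_seq, target_seq, device='cuda'):
    # Sketch only -- assumes create_masks builds an encoder padding mask,
    # a decoder padding mask, and a lookahead mask from the helpers below.
    enc_padding_mask = create_padding_mask(input_seq).to(device)        # (batch, src_len)
    dec_padding_mask = create_padding_mask(target_seq).to(device)       # (batch, tgt_len)
    lookahead_mask = create_lookahead_mask(target_seq.shape[1]).to(device)  # (tgt_len, tgt_len)
    return enc_padding_mask, dec_padding_mask, lookahead_mask
```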
create_lookahead_mask
create_lookahead_mask (seq_len)
Create an attention mask, with rows representing target position and columns representing source position.
For row=i, column=j, mask[i][j] is ‘True’ if the decoder must ignore position j when processing position i.
The result is an upper triangular matrix (excluding the diagonal), with ‘True’ for any j > i.
:param seq_len: sequence length
:return: (seq_len, seq_len)
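The description maps directly onto torch.triu with diagonal=1. A minimal sketch of such a mask (an illustration of the behavior described above, not necessarily the module's exact implementation):

```python
import torch

def create_lookahead_mask(seq_len):
    # True above the diagonal (j > i): positions the decoder must ignore.
    # False on and below the diagonal: positions it may attend to.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(create_lookahead_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```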
create_padding_mask
create_padding_mask (seq)
In seq, the 5th entry in the last dimension is the padding column, which will be 1 if the row is padding.
Convert to a boolean tensor, indicating ‘True’ for entries that are padding and should be ignored.
:param seq: (batch_size, seq_len, 5)
:return: (batch_size, seq_len)
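Given that description, a minimal sketch of the padding mask (again an illustration of the described behavior, not necessarily the module's exact code):

```python
import torch

def create_padding_mask(seq):
    # seq: (batch_size, seq_len, 5); the 5th entry (index 4) is 1 for padding rows.
    # Returns a boolean (batch_size, seq_len) mask, True where the row is padding.
    return seq[..., 4].bool()

seq = torch.zeros(2, 3, 5)
seq[0, 2, 4] = 1  # mark one row as padding
print(create_padding_mask(seq))
# tensor([[False, False,  True],
#         [False, False, False]])
```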