What is the primary purpose of the self-attention mechanism in a Transformer model?
Trap 1: To generate token embeddings in parallel
Processing tokens in parallel is a property of the Transformer architecture as a whole, not the specific purpose of self-attention.
Trap 2: To reduce the dimensionality of token embeddings
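Dimensionality reduction is not the role of self-attention; in the standard Transformer, a self-attention layer's output keeps the model dimension of its input.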
Trap 3: To encode positional information of tokens
Positional encoding, not self-attention, provides positional information.
- A
To generate token embeddings in parallel
Why wrong: Processing tokens in parallel is a property of the Transformer architecture as a whole, not the specific purpose of self-attention.
- B
To reduce the dimensionality of token embeddings
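Why wrong: Dimensionality reduction is not the role of self-attention; in the standard Transformer, a self-attention layer's output keeps the model dimension of its input.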
- C
To encode positional information of tokens
Why wrong: Positional encoding, not self-attention, provides positional information.
- D
To compute a weighted sum of all token representations based on pairwise relevance
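Self-attention computes attention scores between all pairs of tokens, normalizes them with a softmax, and uses the resulting weights to aggregate the tokens' value vectors into a new representation for each token.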
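The weighted-sum behavior described in option D can be illustrated with a minimal sketch of single-head scaled dot-product attention in plain Python. It assumes no masking and, to stay short, uses the toy input `X` directly as queries, keys, and values; a real layer would first apply learned projection matrices. The names `softmax`, `self_attention`, and `X` are illustrative and not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention without masking.

    queries, keys: lists of n vectors of length d_k
    values: list of n vectors of length d_v
    Returns (outputs, weights), one row per query token.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Pairwise relevance of this token to every token, itself included.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of all value vectors, one output component at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three tokens with 2-dimensional representations.
# Assumption: Q = K = V = X (identity projections) purely for illustration.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, X, X)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of all three token vectors. This is the "weighted sum based on pairwise relevance" that distinguishes option D from the distractors.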