
Commit 01bff6b

Update on "Refactor attention v2"
Pull attention creation out of Transformer/TransformerBlock and instead pass the layers into Transformer. The motivation is to customize the linear layers in attention for LoRA (e.g., make wq a LoraLinear instead of a regular Linear). In the next diff (D73517350), we pull wq, wk, wv, wo out of the attention and pass those in as well. This lets us customize attention parameters without passing in ModelArgs and doing the customization deep inside attention.py.

I think this modularizes our attention/transformer components, though it also means users have to do some more work to construct the attention layers and pass them to Transformer. It follows the torchtune structure more closely, e.g. https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama3_2/_component_builders.py#L221

Differential Revision: [D73538697](https://our.internmc.facebook.com/intern/diff/D73538697/)

[ghstack-poisoned]
2 parents: 6dd31a8 + 32a40b0
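
For context, here is a minimal caller-side sketch of the wiring this change introduces: the blocks are built up front (here via the legacy `from_type` path shown in the diff below) and handed to `Transformer`, instead of `Transformer` constructing attention internally. The import paths, the `Rope(params)` constructor, the `n_layers` field, and `ModelArgs()` having usable defaults are assumptions about the surrounding code, not something this diff guarantees.

```python
import torch.nn as nn

# Assumed import paths, based on the examples/models/llama layout.
from executorch.examples.models.llama.llama_transformer import (
    Transformer,
    TransformerBlock,
)
from executorch.examples.models.llama.model_args import ModelArgs
from executorch.examples.models.llama.rope import Rope

params = ModelArgs()  # assumes default fields; real callers fill in the checkpoint config
rope = Rope(params)   # assumed constructor: rotary-embedding helper built from the args

# Build every block up front. from_type is the legacy path shown in the diff:
# it resolves params.attention_type through the ATTENTION_REGISTRY.
layers = nn.ModuleList(
    TransformerBlock.from_type(layer_id, params, rope)
    for layer_id in range(params.n_layers)  # n_layers assumed to exist on ModelArgs
)

# Transformer no longer builds its own attention/blocks; it just consumes them.
model = Transformer(params, layers, rope)
```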

File tree

1 file changed: +23 −0 lines


examples/models/llama/llama_transformer.py

Lines changed: 23 additions & 0 deletions
@@ -85,6 +85,14 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:
 
 class TransformerBlock(nn.Module):
     def __init__(self, args: ModelArgs, attention: Attention):
+        """
+        Transformer block with support for pre-norm and post-norm.
+        Args:
+            args (ModelArgs): model configuration parameters.
+            attention (Attention): attention object to use in the transformer
+                block. See `attention.py` for types of attention. Make sure
+                the attention type is registered in the ATTENTION_REGISTRY.
+        """
         super().__init__()
         self.use_kv_cache = args.use_kv_cache
         self.n_heads = args.n_heads
@@ -100,6 +108,13 @@ def __init__(self, args: ModelArgs, attention: Attention):
 
     @classmethod
     def from_type(cls, layer_id, args, rope) -> "TransformerBlock":
+        """
+        Create a TransformerBlock with the legacy constructor.
+        Args:
+            layer_id (int): the index of the layer.
+            args (ModelArgs): model configuration parameters.
+            rope (Rope): the rope object to use for rotary embeddings.
+        """
         if args.attention_type not in ATTENTION_REGISTRY:
             raise ValueError(
                 f"Unknown attention type: {args.attention_type}. "
@@ -124,6 +139,14 @@ def forward(self, x, freqs_cos, freqs_sin, attn_options: ForwardOptions): # x:
 
 class Transformer(nn.Module):
     def __init__(self, params: ModelArgs, layers: nn.ModuleList, rope: Rope):
+        """
+        Transformer model.
+        Args:
+            params (ModelArgs): model configuration parameters.
+            layers (nn.ModuleList): list of transformer blocks - see the
+                `TransformerBlock` type above.
+            rope (Rope): the rope object to use for rotary embeddings.
+        """
         super().__init__()
         self.params = params
         self.vocab_size = params.vocab_size
