diff --git a/.opencode/skills/code-generation/SKILL.md b/.opencode/skills/code-generation/SKILL.md
new file mode 100644
index 0000000..d267373
--- /dev/null
+++ b/.opencode/skills/code-generation/SKILL.md
@@ -0,0 +1,201 @@
+---
+name: code-generation
+description: Use when generating PyTorch code from paper analysis to ensure correct mapping from paper to code
+---
+
+# Code Generation from Papers
+
+## Overview
+
+Guidelines for translating paper descriptions into working PyTorch code.
+
+**Announce at start:** "I'm using the code-generation skill to ensure accurate paper-to-code translation."
+
+## Core Principles
+
+1. **Traceability**: Every code block should reference the paper section or equation it implements
+2. **Testability**: Write code that can be unit tested
+3. **Readability**: Prefer clarity over cleverness
+4. **Modularity**: One component per file
+
+## Paper-to-Code Mapping
+
+### Architecture Diagrams → nn.Module
+
+| Diagram Element | PyTorch Equivalent |
+|-----------------|-------------------|
+| Box/Block | nn.Module subclass |
+| Arrow | forward() call chain |
+| Split | Multiple outputs / tuple |
+| Merge | torch.cat / torch.add |
+| Skip connection | Residual addition |
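+
+As a rough illustration of the table above, the sketch below wires a split, a merge by concatenation, and a skip connection into one module. The class name `SplitMergeBlock` and its layer names are hypothetical and not taken from any particular paper.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class SplitMergeBlock(nn.Module):
+    """Hypothetical block: the input splits into two branches that are merged again."""
+
+    def __init__(self, dim: int):
+        super().__init__()
+        # Each box in the diagram becomes its own submodule.
+        self.branch_a = nn.Linear(dim, dim)
+        self.branch_b = nn.Linear(dim, dim)
+        # Concatenation doubles the feature size, so project back to dim.
+        self.merge = nn.Linear(2 * dim, dim)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        a = self.branch_a(x)                # split: first arrow
+        b = self.branch_b(x)                # split: second arrow
+        merged = torch.cat([a, b], dim=-1)  # merge by concatenation
+        return x + self.merge(merged)       # skip connection as residual addition
+
+
+block = SplitMergeBlock(dim=16)
+out = block(torch.randn(4, 10, 16))
+```
+
+If the diagram merges by addition instead, `a + b` replaces `torch.cat` and the projection layer is no longer needed to restore the feature size.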
+
+### Equations → Tensor Operations
+
+| Notation | PyTorch |
+|----------|---------|
+| $Wx + b$ | `nn.Linear(in, out)` |
+| $\sigma(x)$ | `torch.sigmoid(x)` or `nn.Sigmoid()` |
+| $\text{softmax}(x)$ | `F.softmax(x, dim=-1)` |
+| $\|x\|_2$ | `torch.linalg.vector_norm(x, ord=2)` (`torch.norm` is deprecated) |
+| $x \odot y$ | `x * y` (element-wise) |
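+| $x^T y$ | `torch.dot(x, y)` (1-D vectors) or `x.T @ y` (2-D matrices) |
+| $\sum_i$ | `torch.sum(x, dim=d)` (`d` is the tensor axis that index $i$ runs over) |
+| $\mathbb{E}[x]$ | `torch.mean(x)` (sample estimate over the batch) |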
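+
+To show how several rows combine, here is a minimal sketch for a made-up gating equation, $y = \sigma(Wx + b) \odot x$, followed by a per-row L2 norm. The equation and the tensor sizes are illustrative assumptions, not from a specific paper.
+
+```python
+import torch
+import torch.nn as nn
+
+torch.manual_seed(0)
+x = torch.randn(8, 32)            # batch of 8 vectors with 32 features
+linear = nn.Linear(32, 32)        # W x + b
+
+gate = torch.sigmoid(linear(x))   # sigma(W x + b)
+y = gate * x                      # element-wise product with x
+
+# ||y||_2 for each row: reduce over the feature axis only.
+norms = torch.linalg.vector_norm(y, ord=2, dim=-1)
+
+# Sum over the feature index: dim names the axis, not the index symbol.
+total = y.sum(dim=-1)
+```
+
+Passing `dim` explicitly matters here: without it, the norm and the sum reduce over the whole tensor and silently mix batch elements.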
+
+### Loss Functions
+
+| Paper Description | PyTorch |
+|-------------------|---------|
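+| Cross-entropy | `nn.CrossEntropyLoss()` (expects raw logits, not softmax output) |
+| MSE / L2 | `nn.MSELoss()` |
+| L1 | `nn.L1Loss()` |
+| BCE | `nn.BCEWithLogitsLoss()` on logits, or `nn.BCELoss()` on probabilities |
+| KL divergence | `nn.KLDivLoss(reduction="batchmean")` (input must be log-probabilities) |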
+| Custom | Subclass or functional |
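+
+The input conventions of these losses are a frequent source of silent bugs. The sketch below checks each built-in loss against a hand-written version on random data; the tensor sizes are arbitrary assumptions.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+torch.manual_seed(0)
+
+# Cross-entropy takes raw logits and applies log-softmax internally.
+logits = torch.randn(4, 5)
+targets = torch.tensor([0, 2, 1, 4])
+ce = nn.CrossEntropyLoss()(logits, targets)
+manual_ce = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
+
+# KLDivLoss takes log-probabilities as input and probabilities as target.
+log_q = F.log_softmax(torch.randn(4, 5), dim=-1)
+p = F.softmax(torch.randn(4, 5), dim=-1)
+kl = nn.KLDivLoss(reduction="batchmean")(log_q, p)
+manual_kl = (p * (p.log() - log_q)).sum(dim=-1).mean()
+
+# BCEWithLogitsLoss on logits matches BCELoss on sigmoid probabilities.
+bin_logits = torch.randn(4)
+bin_targets = torch.tensor([0.0, 1.0, 1.0, 0.0])
+bce_from_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
+bce_from_probs = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)
+```
+
+When a paper writes $\mathrm{KL}(p \,\|\, q)$, `p` is the target and `log q` is the input, which is the reverse of the argument order many readers expect.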
+
+## Code Structure Template
+
+```python
+"""
+{component_name}.py
+
+Implements {what} from "{paper_title}" ({year})
+
+Paper Reference:
+- Section: {section_number}
+- Equation: ({equation_number})
+- Figure: {figure_number}
+
+Author: Auto-generated for paper replication
+"""
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from typing import Optional, Tuple, List
+
+
+class {ComponentName}(nn.Module):
+    """
+    {One-line description}
+    
+    From paper: "{exact quote or paraphrase}"
+    
+    Args:
+        {param1}: {description} (paper: {where specified})
+        {param2}: {description}
+    
+    Shape:
+        - Input: {shape description}
+        - Output: {shape description}
+    
+    Example:
+        >>> layer = {ComponentName}(dim=512)
+        >>> x = torch.randn(32, 100, 512)
+        >>> out = layer(x)
+        >>> out.shape
+        torch.Size([32, 100, 512])
+    """
+    
+    def __init__(
+        self,
+        {param1}: {type},
+        {param2}: {type} = {default},
+    ):
+        super().__init__()
+        
+        # Paper Section X.Y: "{description}"
+        self.layer1 = nn.Linear(...)
+        
+        # Equation (N): ...
+        self.layer2 = nn.LayerNorm(...)
+        
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """
+        Forward pass implementing Equation (N).
+        
+        Args:
+            x: Input tensor of shape (batch, seq, dim)
+            
+        Returns:
+            Output tensor of shape (batch, seq, dim)
+        """
+        # Step 1: ... (Eq. N, first term)
+        h = self.layer1(x)
+        
+        # Step 2: ... (Eq. N, second term)
+        out = self.layer2(h)
+        
+        return out
+```
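+
+For reference, one possible filled-in instance of the template is sketched below. The component `ProjectAndNorm` is hypothetical and stands in for whatever the paper defines; paper references are omitted because no real paper is being cited.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ProjectAndNorm(nn.Module):
+    """
+    Hypothetical component: linear projection followed by layer normalization.
+
+    Args:
+        dim: Feature dimension of input and output.
+        eps: LayerNorm epsilon (assumed; PyTorch default).
+
+    Shape:
+        - Input: (batch, seq, dim)
+        - Output: (batch, seq, dim)
+    """
+
+    def __init__(self, dim: int, eps: float = 1e-5):
+        super().__init__()
+        self.layer1 = nn.Linear(dim, dim)
+        self.layer2 = nn.LayerNorm(dim, eps=eps)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        # Step 1: project the features.
+        h = self.layer1(x)
+        # Step 2: normalize over the last dimension.
+        return self.layer2(h)
+
+
+layer = ProjectAndNorm(dim=512)
+out = layer(torch.randn(32, 100, 512))
+```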
+
+## Common Patterns
+
+### Residual Connection
+
+```python
+# Paper: "We add a residual connection"
+out = self.sublayer(x) + x
+```
+
+### Layer Normalization
+
+```python
+# Paper: "Pre-LN Transformer" (normalize inside the residual branch)
+x = x + self.attention(self.norm(x))
+
+# Paper: "Post-LN Transformer"
+x = x + self.attention(x)
+x = self.norm(x)
+```
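+
+The two orderings differ only in where the normalization sits relative to the residual addition. A minimal runnable sketch, assuming a generic `sublayer` (here a plain `nn.Linear` stands in for attention) and hypothetical class names:
+
+```python
+import torch
+import torch.nn as nn
+
+
+class PreLNBlock(nn.Module):
+    """Pre-LN: normalize the sublayer input; the residual path stays unnormalized."""
+
+    def __init__(self, dim: int, sublayer: nn.Module):
+        super().__init__()
+        self.norm = nn.LayerNorm(dim)
+        self.sublayer = sublayer
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return x + self.sublayer(self.norm(x))
+
+
+class PostLNBlock(nn.Module):
+    """Post-LN: add the residual first, then normalize the sum."""
+
+    def __init__(self, dim: int, sublayer: nn.Module):
+        super().__init__()
+        self.norm = nn.LayerNorm(dim)
+        self.sublayer = sublayer
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.norm(x + self.sublayer(x))
+
+
+torch.manual_seed(0)
+dim = 16
+x = torch.randn(2, 5, dim)
+pre = PreLNBlock(dim, nn.Linear(dim, dim))
+post = PostLNBlock(dim, nn.Linear(dim, dim))
+pre_out = pre(x)
+post_out = post(x)
+```
+
+Only the Post-LN output is normalized, so its per-token mean is close to zero at initialization, while the Pre-LN output keeps the scale of `x`.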
+
+### Multi-Head Attention
+
+```python
+# Paper: "Standard multi-head attention with h heads"
+self.attention = nn.MultiheadAttention(
+    embed_dim=d_model,
+    num_heads=h,
+    dropout=dropout,
+    batch_first=True,
+)
+```
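+
+Calling the module is where mistakes tend to happen: `nn.MultiheadAttention` takes separate query, key, and value arguments and returns a tuple. A short self-attention sketch, with illustrative sizes chosen here as assumptions:
+
+```python
+import torch
+import torch.nn as nn
+
+d_model, h = 64, 8  # d_model must be divisible by h
+attention = nn.MultiheadAttention(
+    embed_dim=d_model,
+    num_heads=h,
+    dropout=0.0,
+    batch_first=True,
+)
+
+x = torch.randn(2, 10, d_model)  # (batch, seq, d_model) because batch_first=True
+
+# Self-attention: the same tensor serves as query, key, and value.
+out, weights = attention(x, x, x, need_weights=True)
+```
+
+Here `out` has the same shape as `x`, and `weights` holds the attention map averaged over heads, with shape (batch, seq, seq).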
+
+### Custom Activation
+
+```python
+# Paper: "We use GELU activation"
+x = F.gelu(x)
+
+# Paper: "We use Swish/SiLU activation"
+x = F.silu(x)
+```
+
+## Handling Ambiguity
+
+When the paper is unclear:
+
+1. **Check code repository** if available
+2. **Follow common practice** for the architecture type
+3. **Document assumption** in code comment
+4. **Add TODO** for verification
+
+```python
+# TODO: Paper unclear on initialization. Using PyTorch default.
+# See: https://github.com/paper/repo for reference implementation
+self.linear = nn.Linear(in_dim, out_dim)
+```
+
+## Verification Checklist
+
+Before completing a module:
+
+- [ ] All equations implemented
+- [ ] Shapes documented and verified
+- [ ] Paper references in comments
+- [ ] Type hints complete
+- [ ] Example in docstring works
+- [ ] No hardcoded dimensions (use params)
+- [ ] Gradient flow verified (no in-place ops breaking autograd)
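+
+Two of the checklist items, shape verification and gradient flow, can be automated with a small helper. The sketch below is one possible approach; `check_module` is a hypothetical helper name, and the model is a stand-in.
+
+```python
+import torch
+import torch.nn as nn
+
+
+def check_module(module: nn.Module, x: torch.Tensor, expected_shape: tuple) -> None:
+    """Assert the output shape and that every parameter receives a finite gradient."""
+    out = module(x)
+    assert tuple(out.shape) == expected_shape, f"got {tuple(out.shape)}"
+
+    # Backpropagate a scalar so autograd errors from in-place ops surface here.
+    out.sum().backward()
+    for name, param in module.named_parameters():
+        assert param.grad is not None, f"no gradient for {name}"
+        assert torch.isfinite(param.grad).all(), f"non-finite gradient for {name}"
+
+
+model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))
+check_module(model, torch.randn(4, 10, 16), (4, 10, 16))
+checked = True
+```
+
+A parameter with `grad is None` after the backward pass usually means it is defined in `__init__` but never used in `forward`.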