hc 5d5aee1f83 refactor: improve verification workflow with visual comparison

Major changes:
- paper-image-extractor: Generate reference_plots.py for visual verification
- paper-director: Add image understanding checkpoint with side-by-side comparison
- paper-analyzer: Add data source labeling with reliability levels
- code-writer: Change from TDD to VDD (Verification-Driven Development)
- test-runner: Generate comparison reports with images and explanations
- verification skill: Add difference classification system
- code-generation skill: Emphasize result independence

Key principles:
- Code results are authoritative, paper values are references
- Differences are expected and documented, not bugs to fix
- Visual comparison prioritized over exact numerical match
- Tests verify sanity (shape, gradient, range), not exact values

2026-03-31 19:55:36 +08:00


name	description
code-generation	Use when generating PyTorch code from paper analysis to ensure correct mapping from paper to code

Code Generation from Papers

Overview

Guidelines for translating paper descriptions into working PyTorch code.

Announce at start: "I'm using the code-generation skill to ensure accurate paper-to-code translation."

Core Principles

Traceability: Every code block should reference the paper section or equation it implements
Testability: Write code that can be unit tested
Readability: Prefer clarity over cleverness
Modularity: One component per file
Independence: Code logic is based on the paper's methodology, NOT reverse-engineered from expected outputs

Critical: Result Independence

The code must implement the paper's described method, not be reverse-engineered to match reference values.

DO NOT:

# WRONG: Using values from reference_plots.py as targets
expected_accuracy = 0.952  # Copied from paper figure
assert abs(accuracy - expected_accuracy) < 0.01  # This defeats the purpose

DO:

# CORRECT: Implement the method, let results be what they are
# Paper Section 4.1: "We use Adam with lr=1e-4"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Run training, record actual results
accuracy = evaluate(model, test_loader)
# This accuracy is authoritative - compare with paper in report
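Tests can still guard against broken code without asserting the paper's numbers. The following is a minimal sketch of such a sanity check, assuming a classifier that returns logits; `sanity_check_classifier` and `toy_model` are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn


def sanity_check_classifier(model: nn.Module, batch: torch.Tensor, num_classes: int) -> dict:
    """Check properties that must hold whatever the paper reports."""
    logits = model(batch)
    probs = torch.softmax(logits, dim=-1)
    ones = torch.ones(batch.shape[0])
    return {
        # Shape: one score per class for every example
        "shape_ok": logits.shape == (batch.shape[0], num_classes),
        # No NaN or Inf anywhere in the output
        "finite_ok": bool(torch.isfinite(logits).all()),
        # Probabilities lie in [0, 1] and each row sums to 1
        "range_ok": bool(((probs >= 0) & (probs <= 1)).all()),
        "sums_to_one": bool(torch.allclose(probs.sum(dim=-1), ones, atol=1e-5)),
    }


torch.manual_seed(0)
toy_model = nn.Linear(16, 10)  # hypothetical stand-in for a real model
checks = sanity_check_classifier(toy_model, torch.randn(4, 16), num_classes=10)
assert all(checks.values()), checks
```

None of these assertions mention a value read from the paper, so they stay valid even when the replicated result differs from the published one.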

Reference Values Are For Comparison Only

Values from image_understanding.md and reference_plots.py should:

Be used in the final report for comparison
NOT be used as assertion targets in tests
NOT influence implementation decisions
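One possible way to keep reference values confined to the report is to store them next to the measured values and only format them, never assert on them. This is a sketch under assumptions: `ComparisonRow` and `format_report` are hypothetical helpers, and the measured value below is made up for illustration (0.952 is the example reference value used earlier in this document).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComparisonRow:
    metric: str
    measured: float   # authoritative: produced by running the code
    reference: float  # read from the paper, shown for context only

    @property
    def abs_diff(self) -> float:
        return abs(self.measured - self.reference)


def format_report(rows: List[ComparisonRow]) -> str:
    """Render a plain-text comparison; differences are reported, not asserted."""
    lines = ["metric | measured | paper reference | abs diff"]
    for r in rows:
        lines.append(
            f"{r.metric} | {r.measured:.4f} | {r.reference:.4f} | {r.abs_diff:.4f}"
        )
    return "\n".join(lines)


# The measured value here is illustrative, not a real result.
rows = [ComparisonRow("accuracy", measured=0.931, reference=0.952)]
report = format_report(rows)
print(report)
```

The difference ends up in the report as a documented observation, which matches the principle that differences are expected rather than bugs to fix.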

Paper-to-Code Mapping

Architecture Diagrams → nn.Module

Diagram Element	PyTorch Equivalent
Box/Block	nn.Module subclass
Arrow	forward() call chain
Split	Multiple outputs / tuple
Merge	torch.cat / torch.add
Skip connection	Residual addition
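As an illustration of the table, here is a sketch of a made-up diagram (one input split into two branches, merged by concatenation, then added back to the input) written as an `nn.Module`. `TwoBranchBlock` is a hypothetical block, not taken from any particular paper.

```python
import torch
import torch.nn as nn


class TwoBranchBlock(nn.Module):
    """Hypothetical block: split -> two boxes -> merge (concat) -> skip connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.branch_a = nn.Linear(dim, dim)     # box on the first branch
        self.branch_b = nn.Linear(dim, dim)     # box on the second branch
        self.project = nn.Linear(2 * dim, dim)  # box after the merge

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split: the same input feeds both branches
        a = self.branch_a(x)
        b = self.branch_b(x)
        # Merge: concatenate along the feature axis
        merged = torch.cat([a, b], dim=-1)
        # Skip connection: residual addition with the block input
        return x + self.project(merged)


block = TwoBranchBlock(dim=8)
out = block(torch.randn(2, 5, 8))
print(out.shape)
```

Each arrow in the diagram corresponds to one assignment in `forward()`, which makes the code easy to check against the figure.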

Equations → Tensor Operations

Notation	PyTorch
`Wx + b`	`nn.Linear(in_features, out_features)`
`\sigma(x)`	`torch.sigmoid(x)` or `nn.Sigmoid()`
`\text{softmax}(x)`	`F.softmax(x, dim=-1)`
`\|x\|_2`	`torch.linalg.vector_norm(x, ord=2)` (`torch.norm` is deprecated)
`x \odot y`	`x * y` (element-wise)
`x^T y`	`torch.dot(x, y)` or `x @ y` for 1-D vectors; `x.T @ y` for 2-D matrices
`\sum_i`	`torch.sum(x, dim=d)`, where `d` is the tensor axis indexed by `i`
`\mathbb{E}[x]`	`torch.mean(x)`
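To show several rows of the table working together, here is a sketch that evaluates a made-up gating equation, y = σ(Wx + b) ⊙ x, on a batch of vectors. The equation and the variable names are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)      # batch of 4 vectors of width 8
linear = nn.Linear(8, 8)   # W x + b

gate = torch.sigmoid(linear(x))  # \sigma(Wx + b)
y = gate * x                     # element-wise product, \odot

probs = F.softmax(x, dim=-1)                     # softmax over the feature axis
l2 = torch.linalg.vector_norm(x, ord=2, dim=-1)  # \|x\|_2 for each row
inner = (x * y).sum(dim=-1)                      # batched x^T y, one scalar per row
mean = x.mean()                                  # sample estimate of E[x]

print(y.shape, probs.shape, l2.shape, inner.shape)
```

Passing `dim` explicitly matters for batched tensors: without it, reductions such as the norm collapse the batch axis as well.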

Loss Functions

Paper Description	PyTorch
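Cross-entropy	`nn.CrossEntropyLoss()` (takes raw logits; applies log-softmax internally)
MSE / L2	`nn.MSELoss()`
L1	`nn.L1Loss()`
BCE	`nn.BCEWithLogitsLoss()` on logits, or `nn.BCELoss()` on probabilities
KL divergence	`nn.KLDivLoss()` (input must be log-probabilities)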
Custom	Subclass or functional
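The input conventions of these losses are a common source of silent errors, so a short sketch may help. It assumes random toy tensors; `WeightedSumLoss` and its `lam` weight are hypothetical and stand in for whatever combined objective a paper defines.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)
targets = torch.rand(4, 3)  # soft targets in [0, 1]

# BCE: the "WithLogits" variant applies the sigmoid itself
bce_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
bce_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# KL divergence: input is log-probabilities, target is probabilities
log_p = F.log_softmax(logits, dim=-1)
q = F.softmax(torch.randn(4, 3), dim=-1)
kl = nn.KLDivLoss(reduction="batchmean")(log_p, q)


class WeightedSumLoss(nn.Module):
    """Hypothetical custom loss: L = L_mse + lam * L_l1."""

    def __init__(self, lam: float = 0.1):
        super().__init__()
        self.lam = lam

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(pred, target) + self.lam * F.l1_loss(pred, target)


custom = WeightedSumLoss(lam=0.5)(logits, targets)
print(bce_from_logits.item(), bce_from_probs.item(), kl.item(), custom.item())
```

The two BCE values agree up to floating-point error, which is a cheap way to confirm that a model's output convention matches the loss it is paired with.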

Code Structure Template

"""
{component_name}.py

Implements {what} from "{paper_title}" ({year})

Paper Reference:
- Section: {section_number}
- Equation: ({equation_number})
- Figure: {figure_number}

Author: Auto-generated for paper replication
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple, List


class {ComponentName}(nn.Module):
    """
    {One-line description}
    
    From paper: "{exact quote or paraphrase}"
    
    Args:
        {param1}: {description} (paper: {where specified})
        {param2}: {description}
    
    Shape:
        - Input: {shape description}
        - Output: {shape description}
    
    Example:
        >>> layer = {ComponentName}(dim=512)
        >>> x = torch.randn(32, 100, 512)
        >>> out = layer(x)
        >>> out.shape
        torch.Size([32, 100, 512])
    """
    
    def __init__(
        self,
        {param1}: {type},
        {param2}: {type} = {default},
    ):
        super().__init__()
        
        # Paper Section X.Y: "{description}"
        self.layer1 = nn.Linear(...)
        
        # Equation (N): ...
        self.layer2 = nn.LayerNorm(...)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass implementing Equation (N).
        
        Args:
            x: Input tensor of shape (batch, seq, dim)
            
        Returns:
            Output tensor of shape (batch, seq, dim)
        """
        # Step 1: ... (Eq. N, first term)
        h = self.layer1(x)
        
        # Step 2: ... (Eq. N, second term)
        out = self.layer2(h)
        
        return out
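For reference, here is one way the template might look when filled in. The sketch uses the position-wise feed-forward network from "Attention Is All You Need" (2017), Section 3.3, Equation (2); the file name, class name, and attribute names are assumptions, and dropout is omitted to keep the example short.

```python
"""
feed_forward.py

Implements the position-wise feed-forward network from
"Attention Is All You Need" (2017)

Paper Reference:
- Section: 3.3
- Equation: (2)
"""

import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionwiseFeedForward(nn.Module):
    """
    Two linear maps with a ReLU in between, applied at every position.

    From paper: FFN(x) = max(0, x W1 + b1) W2 + b2

    Args:
        d_model: Input and output width (paper: 512)
        d_ff: Inner-layer width (paper: 2048)

    Shape:
        - Input: (batch, seq, d_model)
        - Output: (batch, seq, d_model)
    """

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        # Equation (2), inner term: x W1 + b1
        self.w1 = nn.Linear(d_model, d_ff)
        # Equation (2), outer term: (.) W2 + b2
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step 1: max(0, x W1 + b1)
        h = F.relu(self.w1(x))
        # Step 2: project back to d_model
        return self.w2(h)


layer = PositionwiseFeedForward(d_model=16, d_ff=32)
out = layer(torch.randn(2, 5, 16))
print(out.shape)
```

Widths are constructor arguments rather than literals in the body, so the same module can be tested at a small size and run at the paper's size.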

Common Patterns

Residual Connection

# Paper: "We add a residual connection"
out = self.sublayer(x) + x

Layer Normalization

# Paper: "Pre-LN Transformer" (normalize inside the residual branch)
x = x + self.attention(self.norm(x))

# Paper: "Post-LN Transformer"
x = x + self.attention(x)
x = self.norm(x)
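The two orderings are easy to confuse, so here is a sketch of both as small wrapper modules with the residual written out explicitly. `PreLNBlock` and `PostLNBlock` are hypothetical names, and a plain `nn.Linear` stands in for the attention sublayer.

```python
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """Pre-LN: normalize the sublayer input; the residual path is left untouched."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))


class PostLNBlock(nn.Module):
    """Post-LN: add the residual first, then normalize the sum."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))


torch.manual_seed(0)
x = torch.randn(2, 5, 8)
pre_out = PreLNBlock(8, nn.Linear(8, 8))(x)
post_out = PostLNBlock(8, nn.Linear(8, 8))(x)
print(pre_out.shape, post_out.shape)
```

A quick way to tell them apart in a test: the Post-LN output is normalized over the feature axis, while the Pre-LN output generally is not.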

Multi-Head Attention

# Paper: "Standard multi-head attention with h heads"
self.attention = nn.MultiheadAttention(
    embed_dim=d_model,
    num_heads=h,
    dropout=dropout,
    batch_first=True,
)
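A short usage sketch for the module configured above, assuming self-attention (query, key, and value are the same tensor) and small illustrative sizes. Note that the call returns a tuple, so the output must be unpacked.

```python
import torch
import torch.nn as nn

d_model, h = 64, 8  # illustrative sizes; take the real ones from the paper

attention = nn.MultiheadAttention(
    embed_dim=d_model,
    num_heads=h,
    dropout=0.0,
    batch_first=True,  # tensors are (batch, seq, dim)
)

x = torch.randn(2, 10, d_model)

# Self-attention: the same tensor serves as query, key, and value
out, weights = attention(x, x, x, need_weights=True)

print(out.shape)      # (batch, seq, d_model)
print(weights.shape)  # (batch, seq, seq), averaged over heads by default
```

With `batch_first=False` (the default) the module expects `(seq, batch, dim)`, which is worth checking against the paper's notation before wiring it in.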

Custom Activation

# Paper: "We use GELU activation"
x = F.gelu(x)

# Paper: "We use Swish/SiLU activation"
x = F.silu(x)
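When a paper only names the activation, it can help to confirm that the library call matches the formula. The sketch below checks SiLU and exact GELU against their closed forms and also evaluates the tanh approximation of GELU that some implementations use; which variant a given paper intends is an assumption to document.

```python
import math

import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=13)

# SiLU / Swish: x * sigmoid(x)
silu_manual = x * torch.sigmoid(x)

# Exact GELU: x * Phi(x), with Phi the standard normal CDF
gelu_manual = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

# Tanh approximation, used by some reference implementations
gelu_tanh = F.gelu(x, approximate="tanh")

max_gap = (F.gelu(x) - gelu_tanh).abs().max().item()
print(f"max |exact - tanh| on the grid: {max_gap:.2e}")
```

The gap between the two GELU variants is small but not zero, so the chosen variant belongs in a code comment next to the paper reference.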

Handling Ambiguity

When the paper is unclear:

Check the official code repository, if one is available
Follow common practice for the architecture type
Document the assumption in a code comment
Add a TODO for later verification

# TODO: Paper unclear on initialization. Using PyTorch default.
# See: https://github.com/paper/repo for reference implementation
self.linear = nn.Linear(in_dim, out_dim)
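One possible way to keep an assumption visible and easy to revisit is to expose it as an explicit argument instead of burying it in the constructor. This is only a sketch; `make_linear` and its `init` option are hypothetical.

```python
import torch.nn as nn


def make_linear(in_dim: int, out_dim: int, init: str = "default") -> nn.Linear:
    """
    Build a linear layer with an explicit initialization choice.

    ASSUMPTION: the paper does not specify initialization, so "default"
    (PyTorch's built-in scheme) is used unless the caller overrides it.
    """
    layer = nn.Linear(in_dim, out_dim)
    if init == "xavier":
        nn.init.xavier_uniform_(layer.weight)
        nn.init.zeros_(layer.bias)
    elif init != "default":
        raise ValueError(f"unknown init scheme: {init!r}")
    return layer


# TODO: revisit once a reference implementation is available
linear = make_linear(4, 3)
linear_xavier = make_linear(4, 3, init="xavier")
print(linear.weight.shape, linear_xavier.bias)
```

If the reference implementation later turns out to use a different scheme, the change is a single argument at the call site, and the final report can state which option was used.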

Verification Checklist

Before completing a module:

All equations implemented
Shapes documented and verified
Paper references in comments
Type hints complete
Example in docstring works
No hardcoded dimensions (use params)
Gradient flow verified (no in-place ops breaking autograd)
No reference values hardcoded as expected outputs
Implementation based on paper method, not reverse-engineered from results
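The gradient-flow item can be checked mechanically. The following is a minimal sketch, assuming the module returns a single tensor; `params_without_grad` is a hypothetical helper that lists parameters which received no gradient, or a non-finite one, after a backward pass.

```python
from typing import List

import torch
import torch.nn as nn


def params_without_grad(module: nn.Module, example_input: torch.Tensor) -> List[str]:
    """Return names of trainable parameters with a missing or non-finite gradient."""
    # Clear stale gradients so unused parameters show up as None
    for p in module.parameters():
        p.grad = None

    # Any scalar works as a probe; the sum touches every output element
    module(example_input).sum().backward()

    return [
        name
        for name, p in module.named_parameters()
        if p.requires_grad and (p.grad is None or not bool(torch.isfinite(p.grad).all()))
    ]


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
missing = params_without_grad(model, torch.randn(3, 4))
print(missing)
```

An empty list means every trainable parameter took part in the backward pass; a non-empty list points at layers that are defined but never used in `forward()`, or at a numerically unstable step.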
