feat(skills): add verification skill

Replication result verification methodology.
2026-03-31 17:43:41 +08:00 · 2026-03-31 17:43:41 +08:00 · 849cfe5409
commit 849cfe5409
parent 06282c7314
1 changed files with 190 additions and 0 deletions
--- a/.opencode/skills/verification/SKILL.md
+++ b/.opencode/skills/verification/SKILL.md
@ -0,0 +1,190 @@
+---
+name: verification
+description: Use when verifying replication results against paper's reported values
+---
+
+# Replication Verification
+
+## Overview
+
+Systematic approach to verifying that replicated code produces results matching the original paper.
+
+**Announce at start:** "I'm using the verification skill to validate replication accuracy."
+
+## Verification Levels
+
+### Level 1: Code Correctness
+- Unit tests pass
+- No runtime errors
+- Gradient flow works
+
+### Level 2: Behavioral Match
+- Output shapes correct
+- Value ranges reasonable
+- Edge cases handled
+
+### Level 3: Numerical Match
+- Results within tolerance of paper
+- Trends match (even if absolute values differ)
+- Statistical significance considered
+
+## Test Design for Replication
+
+### Shape Tests
+
+```python
+def test_model_output_shape():
+    """Verify model produces correct output shape per paper."""
+    model = MyModel(config)
+    x = torch.randn(batch_size, seq_len, input_dim)
+    out = model(x)
+    
+    # Paper Section 3.2: "Output dimension is 512"
+    assert out.shape == (batch_size, seq_len, 512)
+```
+
+### Value Range Tests
+
+```python
+def test_attention_weights_sum():
+    """Attention weights should sum to 1 (paper Eq. 3)."""
+    model = AttentionLayer(config)
+    x = torch.randn(batch_size, seq_len, dim)
+    _, attn_weights = model(x, return_attention=True)
+    
+    # Softmax output sums to 1
+    assert torch.allclose(attn_weights.sum(dim=-1), torch.ones(batch_size, seq_len))
+```
+
+### Gradient Tests
+
+```python
+def test_gradient_flow():
+    """Verify gradients flow through all parameters."""
+    model = MyModel(config)
+    x = torch.randn(batch_size, input_dim, requires_grad=True)
+    out = model(x)
+    loss = out.sum()
+    loss.backward()
+    
+    for name, param in model.named_parameters():
+        assert param.grad is not None, f"No gradient for {name}"
+        assert not torch.isnan(param.grad).any(), f"NaN gradient for {name}"
+```
+
+### Numerical Match Tests
+
+```python
+def test_loss_value_reasonable():
+    """Loss should be in expected range per paper Figure 2."""
+    model = MyModel(config)
+    # ... setup ...
+    
+    loss = compute_loss(model, data)
+    
+    # Paper reports initial loss ~2.3 (cross-entropy on 10 classes)
+    assert 2.0 < loss.item() < 3.0, f"Initial loss {loss.item()} outside expected range"
+```
+
+## Comparison Methodology
+
+### Absolute Comparison
+
+```python
+def compare_absolute(paper_value: float, our_value: float, tolerance: float = 0.01):
+    """Compare with absolute tolerance."""
+    diff = abs(paper_value - our_value)
+    return diff <= tolerance, diff
+```
+
+### Relative Comparison
+
+```python
+def compare_relative(paper_value: float, our_value: float, tolerance: float = 0.05):
+    """Compare with relative tolerance (5% default)."""
+    if paper_value == 0:
+        return our_value == 0, abs(our_value)
+    relative_diff = abs(paper_value - our_value) / abs(paper_value)
+    return relative_diff <= tolerance, relative_diff
+```
+
+### Statistical Comparison
+
+```python
+def compare_with_variance(
+    paper_mean: float,
+    paper_std: float,
+    our_values: List[float],
+    confidence: float = 0.95,
+):
+    """Compare considering paper's reported variance."""
+    our_mean = np.mean(our_values)
+    our_std = np.std(our_values)
+    
+    # Check if means are within 2 standard deviations
+    combined_std = np.sqrt(paper_std**2 + our_std**2)
+    z_score = abs(paper_mean - our_mean) / combined_std
+    
+    return z_score < 2.0, z_score
+```
+
+## Common Difference Sources
+
+### Acceptable Differences
+
+| Source | Typical Impact | Mitigation |
+|--------|---------------|------------|
+| Random seed | 1-2% | Run multiple seeds |
+| Floating point | < 0.1% | Use float64 for verification |
+| Framework differences | 1-3% | Document and accept |
+| Hardware differences | 0.5-1% | Note in report |
+
+### Concerning Differences
+
+| Source | Typical Impact | Action |
+|--------|---------------|--------|
+| Wrong architecture | > 10% | Review code vs paper |
+| Wrong hyperparameters | 5-20% | Verify all settings |
+| Data preprocessing | Variable | Match paper exactly |
+| Evaluation protocol | Variable | Check train/val/test split |
+
+## Verification Checklist
+
+### Before Comparison
+
+- [ ] Seeds set for reproducibility
+- [ ] Same evaluation data as paper
+- [ ] Same preprocessing pipeline
+- [ ] Same evaluation metrics
+
+### During Comparison
+
+- [ ] Run multiple times with different seeds
+- [ ] Record mean and standard deviation
+- [ ] Compare trends, not just final values
+- [ ] Check intermediate checkpoints if available
+
+### After Comparison
+
+- [ ] Document all differences
+- [ ] Explain likely causes
+- [ ] Determine if differences are acceptable
+- [ ] Suggest improvements if needed
+
+## Report Template
+
+```markdown
+## Verification Result: {Metric Name}
+
+**Paper Value**: {value} ± {std}
+**Our Value**: {value} ± {std}
+**Difference**: {absolute} ({relative}%)
+
+**Status**: MATCH | ACCEPTABLE | INVESTIGATE | MISMATCH
+
+**Analysis**:
+{explanation of difference}
+
+**Confidence**: {HIGH | MEDIUM | LOW}
+{reasoning for confidence level}
+```