PaperTool/.opencode/skills/verification/SKILL.md
hc 5d5aee1f83 refactor: improve verification workflow with visual comparison
Major changes:
- paper-image-extractor: Generate reference_plots.py for visual verification
- paper-director: Add image understanding checkpoint with side-by-side comparison
- paper-analyzer: Add data source labeling with reliability levels
- code-writer: Change from TDD to VDD (Verification-Driven Development)
- test-runner: Generate comparison reports with images and explanations
- verification skill: Add difference classification system
- code-generation skill: Emphasize result independence

Key principles:
- Code results are authoritative, paper values are references
- Differences are expected and documented, not bugs to fix
- Visual comparison prioritized over exact numerical match
- Tests verify sanity (shape, gradient, range), not exact values
2026-03-31 19:55:36 +08:00


---
name: verification
description: Use when verifying replication results against paper's reported values
---
# Replication Verification
## Overview
Systematic approach to verifying that replicated code produces results comparable to the original paper. **Note**: Exact matches are rare; the goal is verifiable, explainable results.
**Announce at start:** "I'm using the verification skill to validate replication accuracy."
## Core Philosophy
1. **Code results are authoritative** - Our implementation's output is ground truth
2. **Paper values are references** - Used for comparison, not as test assertions
3. **Differences require explanations** - Not fixes (unless clearly buggy)
4. **Visual comparison over numerical** - Trends matter more than exact values
## Difference Classification System
| Status | Symbol | Criteria | Action |
|--------|--------|----------|--------|
| MATCH | ✅ | < 2% difference | Document, no action needed |
| ACCEPTABLE | ⚠️ | 2-10% difference | Document with brief explanation |
| EXPLAINABLE | 📝 | > 10%, cause identified | Document cause thoroughly |
| INVESTIGATE | 🔍 | > 10%, cause unknown | Review implementation |
| PAPER_ISSUE | 📄 | Our results more reasonable | Document evidence |
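
The thresholds in the table can be applied mechanically. The sketch below is one possible reading, not part of the skill itself: the function name `classify_difference` and the two boolean flags are assumptions introduced for illustration, and the `PAPER_ISSUE` status is treated as a reviewer's judgment rather than something derived from the numbers.

```python
def classify_difference(
    paper_value: float,
    our_value: float,
    cause_identified: bool = False,
    paper_suspect: bool = False,
) -> str:
    """Map a paper/replication pair to a status from the classification table.

    cause_identified and paper_suspect are hypothetical flags set by the reviewer.
    """
    if paper_suspect:
        # PAPER_ISSUE rests on documented evidence, not on a numeric threshold.
        return "PAPER_ISSUE"
    if paper_value == 0:
        # Relative difference is undefined at zero; only an exact zero counts as a match.
        rel = 0.0 if our_value == 0 else float("inf")
    else:
        rel = abs(our_value - paper_value) / abs(paper_value)
    if rel < 0.02:
        return "MATCH"
    if rel <= 0.10:
        return "ACCEPTABLE"
    # Above 10% the status depends on whether the cause is known.
    return "EXPLAINABLE" if cause_identified else "INVESTIGATE"


print(classify_difference(0.90, 0.85))  # relative difference of about 5.6%
```

Keeping the flags as explicit arguments forces the caller to state whether a cause has been identified, which matches the table's requirement that large differences are either explained or investigated.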
## Verification Levels
### Level 1: Code Correctness
- Unit tests pass
- No runtime errors
- Gradient flow works
### Level 2: Behavioral Match
- Output shapes correct
- Value ranges reasonable
- Edge cases handled
### Level 3: Numerical Match
- Results within tolerance of paper
- Trends match (even if absolute values differ)
- Statistical significance considered
## Test Design for Replication
### Shape Tests
```python
import torch


def test_model_output_shape():
    """Verify model produces correct output shape per paper."""
    # MyModel, config, and the size constants are placeholders for the replication's own definitions
    model = MyModel(config)
    x = torch.randn(batch_size, seq_len, input_dim)
    out = model(x)
    # Paper Section 3.2: "Output dimension is 512"
    assert out.shape == (batch_size, seq_len, 512)
```
### Value Range Tests
```python
import torch


def test_attention_weights_sum():
    """Attention weights should sum to 1 (paper Eq. 3)."""
    model = AttentionLayer(config)
    x = torch.randn(batch_size, seq_len, dim)
    _, attn_weights = model(x, return_attention=True)
    # Softmax output sums to 1 along the last (key) dimension
    row_sums = attn_weights.sum(dim=-1)
    # ones_like matches shape, dtype, and device, including any head dimension
    assert torch.allclose(row_sums, torch.ones_like(row_sums))
```
### Gradient Tests
```python
import torch


def test_gradient_flow():
    """Verify gradients flow through all trainable parameters."""
    model = MyModel(config)
    x = torch.randn(batch_size, input_dim, requires_grad=True)
    out = model(x)
    loss = out.sum()
    loss.backward()
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen parameters never receive gradients
        assert param.grad is not None, f"No gradient for {name}"
        assert not torch.isnan(param.grad).any(), f"NaN gradient for {name}"
```
### Numerical Match Tests
```python
def test_loss_value_reasonable():
    """Loss should be in expected range per paper Figure 2."""
    model = MyModel(config)
    # ... setup ...
    loss = compute_loss(model, data)
    # Paper reports initial loss ~2.3 (cross-entropy on 10 classes, ln(10) ≈ 2.303)
    assert 2.0 < loss.item() < 3.0, f"Initial loss {loss.item()} outside expected range"
```
## Comparison Methodology
### Absolute Comparison
```python
def compare_absolute(paper_value: float, our_value: float, tolerance: float = 0.01):
    """Compare with absolute tolerance."""
    diff = abs(paper_value - our_value)
    return diff <= tolerance, diff
```
### Relative Comparison
```python
def compare_relative(paper_value: float, our_value: float, tolerance: float = 0.05):
    """Compare with relative tolerance (5% default)."""
    if paper_value == 0:
        # Relative difference is undefined at zero; fall back to the absolute value
        return our_value == 0, abs(our_value)
    relative_diff = abs(paper_value - our_value) / abs(paper_value)
    return relative_diff <= tolerance, relative_diff
```
### Statistical Comparison
```python
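from statistics import NormalDist
from typing import List, Tuple

import numpy as np


def compare_with_variance(
    paper_mean: float,
    paper_std: float,
    our_values: List[float],
    confidence: float = 0.95,
) -> Tuple[bool, float]:
    """Compare considering paper's reported variance."""
    our_mean = np.mean(our_values)
    our_std = np.std(our_values)
    # Two-sided z threshold for the requested confidence (about 1.96 at 95%)
    z_threshold = NormalDist().inv_cdf(0.5 + confidence / 2)
    combined_std = np.sqrt(paper_std**2 + our_std**2)
    if combined_std == 0:
        # No variance on either side: only an exact match passes
        matches = bool(our_mean == paper_mean)
        return matches, 0.0 if matches else float("inf")
    z_score = abs(paper_mean - our_mean) / combined_std
    return bool(z_score < z_threshold), float(z_score)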
```
## Common Difference Sources
### Acceptable Differences
| Source | Typical Impact | Mitigation |
|--------|---------------|------------|
| Random seed | 1-2% | Run multiple seeds |
| Floating point | < 0.1% | Use float64 for verification |
| Framework differences | 1-3% | Document and accept |
| Hardware differences | 0.5-1% | Note in report |
### Concerning Differences
| Source | Typical Impact | Action |
|--------|---------------|--------|
| Wrong architecture | > 10% | Review code vs paper |
| Wrong hyperparameters | 5-20% | Verify all settings |
| Data preprocessing | Variable | Match paper exactly |
| Evaluation protocol | Variable | Check train/val/test split |
## Verification Checklist
### Before Comparison
- [ ] Seeds set for reproducibility
- [ ] Same evaluation data as paper
- [ ] Same preprocessing pipeline
- [ ] Same evaluation metrics
### During Comparison
- [ ] Run multiple times with different seeds
- [ ] Record mean and standard deviation
- [ ] Compare trends, not just final values
- [ ] Check intermediate checkpoints if available
### After Comparison
- [ ] Document all differences
- [ ] Explain likely causes
- [ ] Determine if differences are acceptable
- [ ] Suggest improvements if needed
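
The "run multiple times with different seeds" and "record mean and standard deviation" items can be sketched as a small harness. Everything named here is an assumption for illustration: `run_with_seeds` is a hypothetical helper, and `toy_experiment` is a stand-in that returns a noisy number instead of training a model. A real run would also seed the framework itself, for example with `torch.manual_seed(seed)`.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def run_with_seeds(
    run_once: Callable[[int], float],
    seeds: Sequence[int] = (0, 1, 2, 3, 4),
) -> Tuple[float, float]:
    """Run an experiment once per seed and return (mean, sample std)."""
    results = [run_once(seed) for seed in seeds]
    mean = statistics.mean(results)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(results) if len(results) > 1 else 0.0
    return mean, std


def toy_experiment(seed: int) -> float:
    """Hypothetical stand-in for a real training and evaluation run."""
    rng = random.Random(seed)  # local RNG, so runs do not share global state
    return 0.90 + rng.uniform(-0.01, 0.01)


mean, std = run_with_seeds(toy_experiment)
print(f"{mean:.4f} ± {std:.4f}")
```

Passing the seed into the experiment function, rather than seeding globally, makes each run reproducible on its own and keeps the list of seeds visible in the report.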
## Report Template
```markdown
## Verification Result: {Metric Name}
**Paper Value**: {value} ± {std} (Source: {figure/table/text})
**Our Value**: {value} ± {std}
**Difference**: {absolute} ({relative}%)
**Status**: MATCH | ACCEPTABLE | EXPLAINABLE | INVESTIGATE | PAPER_ISSUE
**Analysis**:
{explanation of difference - required for all non-MATCH statuses}
**Confidence**: {HIGH | MEDIUM | LOW}
{reasoning for confidence level}
```
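
Filling the template by hand is error-prone, so a small renderer may help. The following is a sketch under stated assumptions: `render_report` is a hypothetical helper, the example values are invented for illustration, and the rule that non-MATCH statuses need an analysis is taken from the template's own comment.

```python
def render_report(
    metric: str,
    paper_value: float,
    paper_std: float,
    our_value: float,
    our_std: float,
    source: str,
    status: str,
    analysis: str,
    confidence: str,
    reasoning: str,
) -> str:
    """Render one verification result in the report template's layout."""
    if status != "MATCH" and not analysis.strip():
        raise ValueError("analysis is required for all non-MATCH statuses")
    absolute = our_value - paper_value
    # Relative difference is undefined when the paper value is zero.
    relative = f"{100 * absolute / abs(paper_value):+.1f}%" if paper_value != 0 else "n/a"
    lines = [
        f"## Verification Result: {metric}",
        f"**Paper Value**: {paper_value} ± {paper_std} (Source: {source})",
        f"**Our Value**: {our_value} ± {our_std}",
        f"**Difference**: {absolute:+.4g} ({relative})",
        f"**Status**: {status}",
        "**Analysis**:",
        analysis,
        f"**Confidence**: {confidence}",
        reasoning,
    ]
    return "\n".join(lines)


# Invented example values, for illustration only.
print(render_report(
    "Top-1 accuracy", 0.76, 0.01, 0.74, 0.012, "Table 2",
    "ACCEPTABLE", "Augmentation pipeline differs from the paper.",
    "MEDIUM", "Only three seeds were run.",
))
```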
## Visual Comparison Guidelines
### Side-by-Side Figure Comparison
Always present figures in side-by-side format:
```markdown
| Paper Reference | Our Replication |
|-----------------|-----------------|
| ![](ref_fig.png) | ![](our_fig.png) |
```
### What to Compare
1. **Trends**: Does the curve go up/down at the same places?
2. **Shape**: Is the overall shape similar?
3. **Key points**: Do peaks/valleys occur at similar locations?
4. **Scale**: Are values in the same order of magnitude?
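
When the underlying data for both curves is available, the four checks above can be backed by rough numbers. This is a minimal sketch, assuming both curves are sampled on the same x grid and are neither constant nor all zero; `compare_curves` and its dictionary keys are hypothetical names, and the numbers support a visual judgment rather than replace it.

```python
import numpy as np


def compare_curves(ref, ours) -> dict:
    """Summarise trend, shape, peak, and scale agreement between two curves."""
    ref = np.asarray(ref, dtype=float)
    ours = np.asarray(ours, dtype=float)
    if ref.ndim != 1 or ref.shape != ours.shape or ref.size < 3:
        raise ValueError("expected two 1-D curves of equal length >= 3")
    # Trends: fraction of steps where both curves move in the same direction.
    same_direction = np.sign(np.diff(ref)) == np.sign(np.diff(ours))
    # Shape: Pearson correlation, insensitive to offset and positive scaling.
    shape_correlation = np.corrcoef(ref, ours)[0, 1]
    # Key points: index distance between the two global maxima.
    peak_shift = abs(int(np.argmax(ref)) - int(np.argmax(ours)))
    # Scale: gap between the largest magnitudes, in decades (1.0 = factor of 10).
    scale_gap = abs(np.log10(np.max(np.abs(ours)) / np.max(np.abs(ref))))
    return {
        "trend_agreement": float(np.mean(same_direction)),
        "shape_correlation": float(shape_correlation),
        "peak_shift": peak_shift,
        "scale_gap_decades": float(scale_gap),
    }


x = np.linspace(0, 2 * np.pi, 50)
reference = np.sin(x)
replication = 1.1 * np.sin(x) + 0.05  # slightly scaled and offset copy
print(compare_curves(reference, replication))
```

A low `trend_agreement` or a `scale_gap_decades` near 1 or above would point toward the "investigate" cases, while an offset or mild rescaling leaves all four numbers close to their ideal values.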
### Acceptable vs Unacceptable Differences
**Acceptable** (document and move on):
- Curve shifted slightly up/down (offset)
- Slightly faster/slower convergence
- Small noise differences
**Unacceptable** (investigate):
- Opposite trends (going up vs down)
- Completely different shapes
- Order of magnitude differences
- Missing features (e.g., expected oscillation absent)
## Common Difference Sources
### Expected Differences (ACCEPTABLE)
| Source | Typical Impact | Mitigation |
|--------|---------------|------------|
| Random seed | 1-3% | Run multiple seeds, report mean±std |
| Floating point | < 0.1% | Use float64 for verification |
| Framework differences | 1-5% | Document framework version |
| Hardware differences | 0.5-2% | Note in report |
| Batch size changes | 2-10% | Adjust LR proportionally |
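
One common reading of "adjust LR proportionally" is the linear scaling heuristic, where the learning rate is multiplied by the ratio of the new batch size to the paper's batch size. The sketch below assumes that reading; `scale_learning_rate` is a hypothetical helper and the numbers are invented. The heuristic does not hold for every optimizer or batch-size regime, so the scaled value should be recorded in the report as a deviation from the paper.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate linearly with the batch size."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Invented example: paper trains with batch size 256, replication hardware fits only 64.
replication_lr = scale_learning_rate(base_lr=0.1, base_batch_size=256, new_batch_size=64)
print(replication_lr)
```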
### Concerning Differences (INVESTIGATE)
| Source | Typical Impact | Action |
|--------|---------------|--------|
| Wrong architecture | > 10% | Review code vs paper |
| Wrong hyperparameters | 5-20% | Verify all settings |
| Data preprocessing | Variable | Match paper exactly |
| Bug in implementation | Variable | Debug systematically |
### Paper Issues (PAPER_ISSUE)
Sometimes the paper contains errors. Signs include:
- Results that violate mathematical constraints
- Impossible performance claims
- Inconsistencies between text and figures
- Known errata
Document the evidence thoroughly when claiming a paper issue.
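
The first sign, results that violate mathematical constraints, is the easiest to check mechanically. A minimal sketch follows, assuming a small hand-written table of bounds: `BOUNDS`, the metric kinds, and the example values are all hypothetical, and a real check would list the constraints that apply to the paper at hand.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical bounds per metric kind; extend for the paper being replicated.
BOUNDS: Dict[str, Tuple[float, float]] = {
    "accuracy": (0.0, 1.0),
    "probability": (0.0, 1.0),
    "cross_entropy": (0.0, math.inf),
}


def find_constraint_violations(reported: Dict[str, Tuple[str, float]]) -> List[str]:
    """Return a message for each reported value outside its valid range.

    reported maps a label (e.g. a table cell) to (metric kind, value).
    """
    violations = []
    for label, (kind, value) in reported.items():
        low, high = BOUNDS[kind]
        if math.isnan(value) or not (low <= value <= high):
            violations.append(f"{label}: {kind}={value} outside [{low}, {high}]")
    return violations


# Invented values for illustration only.
paper_values = {
    "Table 3, row 2": ("accuracy", 1.04),
    "Figure 2, epoch 0": ("cross_entropy", 2.3),
}
for message in find_constraint_violations(paper_values):
    print(message)
```

Each message can be pasted into the report as part of the evidence for a `PAPER_ISSUE` status.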