---
name: verification
description: Use when verifying replication results against paper's reported values
---

# Replication Verification

## Overview

Systematic approach to verifying that replicated code produces results comparable to the original paper. **Note**: Exact matches are rare; the goal is verifiable, explainable results.

**Announce at start:** "I'm using the verification skill to validate replication accuracy."

## Core Philosophy

1. **Code results are authoritative** - Our implementation's output is ground truth
2. **Paper values are references** - Used for comparison, not as test assertions
3. **Differences require explanations** - Not fixes (unless clearly buggy)
4. **Visual comparison over numerical** - Trends matter more than exact values

## Difference Classification System

| Status | Symbol | Criteria | Action |
|--------|--------|----------|--------|
| MATCH | ✅ | < 2% difference | Document, no action needed |
| ACCEPTABLE | ⚠️ | 2-10% difference | Document with brief explanation |
| EXPLAINABLE | 📝 | > 10%, cause identified | Document cause thoroughly |
| INVESTIGATE | 🔍 | > 10%, cause unknown | Review implementation |
| PAPER_ISSUE | 📄 | Our results more reasonable | Document evidence |
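
The table can be applied mechanically once the relative difference is known. The sketch below is one possible encoding, not a fixed part of this skill: `classify_difference`, `cause_identified`, and `paper_suspect` are hypothetical names, and assigning a difference of exactly 10% to ACCEPTABLE is an assumption, since the table does not say which side that boundary falls on.

```python
def classify_difference(
    paper_value: float,
    our_value: float,
    cause_identified: bool = False,
    paper_suspect: bool = False,
) -> str:
    """Map a paper/replication pair to a status from the classification table."""
    if paper_suspect:
        # Set by the reviewer when our result is the more reasonable one.
        return "PAPER_ISSUE"

    if paper_value == 0:
        # Relative difference is undefined; only an exact zero counts as a match.
        rel = 0.0 if our_value == 0 else float("inf")
    else:
        rel = abs(our_value - paper_value) / abs(paper_value)

    if rel < 0.02:
        return "MATCH"
    if rel <= 0.10:  # assumption: exactly 10% still counts as ACCEPTABLE
        return "ACCEPTABLE"
    return "EXPLAINABLE" if cause_identified else "INVESTIGATE"
```

For example, a paper value of 100 against a replicated 105 is ACCEPTABLE, while 100 against 120 is INVESTIGATE until a cause is identified.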

## Verification Levels

### Level 1: Code Correctness
- Unit tests pass
- No runtime errors
- Gradient flow works

### Level 2: Behavioral Match
- Output shapes correct
- Value ranges reasonable
- Edge cases handled

### Level 3: Numerical Match
- Results within tolerance of paper
- Trends match (even if absolute values differ)
- Statistical significance considered

## Test Design for Replication

### Shape Tests

```python
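import torch

# Illustrative sizes; replace with the values from the paper's setup.
batch_size, seq_len, input_dim = 4, 16, 64


def test_model_output_shape():
    """Verify model produces correct output shape per paper."""
    model = MyModel(config)  # MyModel and config come from the replication code
    x = torch.randn(batch_size, seq_len, input_dim)
    out = model(x)

    # Paper Section 3.2: "Output dimension is 512"
    assert out.shape == (batch_size, seq_len, 512)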
```

### Value Range Tests

```python
def test_attention_weights_sum():
    """Attention weights should sum to 1 (paper Eq. 3)."""
    model = AttentionLayer(config)
    x = torch.randn(batch_size, seq_len, dim)
    _, attn_weights = model(x, return_attention=True)

    # Softmax is taken over the last (key) dimension, so each row sums to 1.
    # ones_like keeps the check valid for any leading dims (e.g. a heads axis).
    row_sums = attn_weights.sum(dim=-1)
    assert torch.allclose(row_sums, torch.ones_like(row_sums))
```

### Gradient Tests

```python
def test_gradient_flow():
    """Verify gradients flow through all trainable parameters."""
    model = MyModel(config)
    x = torch.randn(batch_size, seq_len, input_dim)  # same input shape as the shape test
    out = model(x)
    loss = out.sum()
    loss.backward()

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen parameters never receive a gradient
        assert param.grad is not None, f"No gradient for {name}"
        # isfinite catches both NaN and inf gradients
        assert torch.isfinite(param.grad).all(), f"Non-finite gradient for {name}"
```

### Numerical Match Tests

```python
def test_loss_value_reasonable():
    """Loss should be in expected range per paper Figure 2."""
    model = MyModel(config)
    # ... setup ...

    loss = compute_loss(model, data)

    # Paper reports initial loss ~2.3 (cross-entropy on 10 classes;
    # a uniform prediction gives ln(10) ≈ 2.303)
    assert 2.0 < loss.item() < 3.0, f"Initial loss {loss.item()} outside expected range"
```

## Comparison Methodology

### Absolute Comparison

```python
def compare_absolute(paper_value: float, our_value: float, tolerance: float = 0.01):
    """Compare with absolute tolerance."""
    diff = abs(paper_value - our_value)
    return diff <= tolerance, diff
```

### Relative Comparison

```python
def compare_relative(paper_value: float, our_value: float, tolerance: float = 0.05):
    """Compare with relative tolerance (5% default)."""
    if paper_value == 0:
        return our_value == 0, abs(our_value)
    relative_diff = abs(paper_value - our_value) / abs(paper_value)
    return relative_diff <= tolerance, relative_diff
```

### Statistical Comparison

```python
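from statistics import NormalDist
from typing import List, Tuple

import numpy as np


def compare_with_variance(
    paper_mean: float,
    paper_std: float,
    our_values: List[float],
    confidence: float = 0.95,
) -> Tuple[bool, float]:
    """Compare considering the paper's reported variance."""
    our_mean = float(np.mean(our_values))
    # Sample standard deviation (ddof=1); it needs at least two runs.
    our_std = float(np.std(our_values, ddof=1)) if len(our_values) > 1 else 0.0

    # Two-sided critical value for the requested confidence (about 1.96 for 0.95).
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)

    combined_std = float(np.sqrt(paper_std**2 + our_std**2))
    if combined_std == 0:
        # No spread on either side: only identical means can match.
        same = our_mean == paper_mean
        return same, 0.0 if same else float("inf")

    z_score = abs(paper_mean - our_mean) / combined_std
    return z_score < z_crit, z_score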
```

## Common Difference Sources

### Acceptable Differences

| Source | Typical Impact | Mitigation |
|--------|---------------|------------|
| Random seed | 1-3% | Run multiple seeds |
| Floating point | < 0.1% | Use float64 for verification |
| Framework differences | 1-5% | Document and accept |
| Hardware differences | 0.5-2% | Note in report |

### Concerning Differences

| Source | Typical Impact | Action |
|--------|---------------|--------|
| Wrong architecture | > 10% | Review code vs paper |
| Wrong hyperparameters | 5-20% | Verify all settings |
| Data preprocessing | Variable | Match paper exactly |
| Evaluation protocol | Variable | Check train/val/test split |

## Verification Checklist

### Before Comparison

- [ ] Seeds set for reproducibility
- [ ] Same evaluation data as paper
- [ ] Same preprocessing pipeline
- [ ] Same evaluation metrics
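
For the first checklist item, a minimal seeding sketch is shown here. `set_seed` is a hypothetical helper name; it seeds the standard library and, only if they are installed, NumPy and PyTorch. It does not make GPU kernels deterministic, which may need additional framework settings.

```python
import os
import random


def set_seed(seed: int) -> None:
    """Seed every RNG that is available in the current environment."""
    # PYTHONHASHSEED only affects interpreters started after this point;
    # the running process fixed its hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
```

Calling `set_seed` with the same value before each run should make repeated draws from these generators identical.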

### During Comparison

- [ ] Run multiple times with different seeds
- [ ] Record mean and standard deviation
- [ ] Compare trends, not just final values
- [ ] Check intermediate checkpoints if available
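
Recording the mean and standard deviation across seeds can be as small as the sketch below. `RunSummary` and `summarize_runs` are hypothetical names, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption about how the spread should be reported.

```python
import statistics
from typing import NamedTuple, Sequence


class RunSummary(NamedTuple):
    n: int
    mean: float
    std: float


def summarize_runs(values: Sequence[float]) -> RunSummary:
    """Aggregate one metric over several seeded runs."""
    if not values:
        raise ValueError("At least one run is required")
    mean = statistics.fmean(values)
    # The sample standard deviation needs two or more runs.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return RunSummary(n=len(values), mean=mean, std=std)
```

A single run yields a standard deviation of 0.0, which should be read as "unknown spread" rather than "no spread".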

### After Comparison

- [ ] Document all differences
- [ ] Explain likely causes
- [ ] Determine if differences are acceptable
- [ ] Suggest improvements if needed

## Report Template

```markdown
## Verification Result: {Metric Name}

**Paper Value**: {value} ± {std} (Source: {figure/table/text})
**Our Value**: {value} ± {std}
**Difference**: {absolute} ({relative}%)

**Status**: MATCH | ACCEPTABLE | EXPLAINABLE | INVESTIGATE | PAPER_ISSUE

**Analysis**:
{explanation of difference - required for all non-MATCH statuses}

**Confidence**: {HIGH | MEDIUM | LOW}
{reasoning for confidence level}
```
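
When many metrics need reports, the template can be filled programmatically. This is a sketch under assumptions: `render_report` and its parameter names are hypothetical, and the number formatting (four significant digits for the absolute difference, two decimals for the percentage) is an illustrative choice.

```python
STATUSES = ("MATCH", "ACCEPTABLE", "EXPLAINABLE", "INVESTIGATE", "PAPER_ISSUE")


def render_report(
    metric: str,
    paper_value: float,
    paper_std: float,
    our_value: float,
    our_std: float,
    source: str,
    status: str,
    analysis: str,
    confidence: str,
    reasoning: str,
) -> str:
    """Fill the report template for one metric."""
    if status not in STATUSES:
        raise ValueError(f"Unknown status: {status}")
    if status != "MATCH" and not analysis.strip():
        raise ValueError("Analysis is required for all non-MATCH statuses")

    absolute = abs(our_value - paper_value)
    # Relative difference is undefined when the paper value is zero.
    relative = f"{100 * absolute / abs(paper_value):.2f}%" if paper_value else "n/a"

    return "\n".join([
        f"## Verification Result: {metric}",
        "",
        f"**Paper Value**: {paper_value} ± {paper_std} (Source: {source})",
        f"**Our Value**: {our_value} ± {our_std}",
        f"**Difference**: {absolute:.4g} ({relative})",
        "",
        f"**Status**: {status}",
        "",
        "**Analysis**:",
        analysis,
        "",
        f"**Confidence**: {confidence}",
        reasoning,
    ])
```

Rejecting an empty analysis for non-MATCH statuses enforces the template's own requirement at the point where the report is generated.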

## Visual Comparison Guidelines

### Side-by-Side Figure Comparison

Always present figures in side-by-side format:

```markdown
| Paper Reference | Our Replication |
|-----------------|-----------------|
| ![](ref_fig.png) | ![](our_fig.png) |
```

### What to Compare

1. **Trends**: Does the curve go up/down at the same places?
2. **Shape**: Is the overall shape similar?
3. **Key points**: Do peaks/valleys occur at similar locations?
4. **Scale**: Are values in the same order of magnitude?
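
Where the paper's curve can be digitized, the trend and key-point questions can also be checked numerically. The sketch below assumes both curves are sampled at the same x positions; `trend_agreement` and `peak_index` are hypothetical helpers and complement visual inspection rather than replace it.

```python
from typing import List, Sequence


def _directions(curve: Sequence[float]) -> List[int]:
    """Sign of each step: +1 rising, -1 falling, 0 flat."""
    return [(b > a) - (b < a) for a, b in zip(curve, curve[1:])]


def trend_agreement(paper_curve: Sequence[float], our_curve: Sequence[float]) -> float:
    """Fraction of steps where both curves move in the same direction."""
    if len(paper_curve) != len(our_curve):
        raise ValueError("Curves must be sampled at the same points")
    if len(paper_curve) < 2:
        raise ValueError("At least two points are required")
    paper_dirs = _directions(paper_curve)
    our_dirs = _directions(our_curve)
    matches = sum(p == o for p, o in zip(paper_dirs, our_dirs))
    return matches / len(paper_dirs)


def peak_index(curve: Sequence[float]) -> int:
    """Index of the largest value (first occurrence on ties)."""
    return max(range(len(curve)), key=lambda i: curve[i])
```

An agreement near 1.0 with peaks at nearby indices supports a trend match; what counts as "near" depends on the sampling density and is left to the reviewer.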

### Acceptable vs Unacceptable Differences

**Acceptable** (document and move on):
- Curve shifted slightly up/down (offset)
- Slightly faster/slower convergence
- Small noise differences

**Unacceptable** (investigate):
- Opposite trends (going up vs down)
- Completely different shapes
- Order of magnitude differences
- Missing features (e.g., expected oscillation absent)
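
Two of the unacceptable cases, opposite trends and order-of-magnitude gaps, are simple to flag automatically. This is a rough sketch: `curve_red_flags` is a hypothetical name, the overall trend is judged only from the first and last points, and scale is taken as the largest absolute value of each curve. Differences in shape and missing features still need a visual check.

```python
import math
from typing import List, Sequence


def curve_red_flags(paper_curve: Sequence[float], our_curve: Sequence[float]) -> List[str]:
    """Return the names of any coarse mismatches between two curves."""
    flags = []

    # Overall direction: compare the net change from first to last point.
    paper_delta = paper_curve[-1] - paper_curve[0]
    our_delta = our_curve[-1] - our_curve[0]
    if paper_delta * our_delta < 0:
        flags.append("opposite overall trend")

    # Scale: a factor of 10 or more between the largest magnitudes.
    paper_scale = max(abs(v) for v in paper_curve)
    our_scale = max(abs(v) for v in our_curve)
    if paper_scale > 0 and our_scale > 0:
        if abs(math.log10(our_scale / paper_scale)) >= 1.0:
            flags.append("order-of-magnitude scale gap")

    return flags
```

An empty list means only that neither coarse check fired, not that the curves match.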

## Difference Sources by Status

### Expected Differences (ACCEPTABLE)

| Source | Typical Impact | Mitigation |
|--------|---------------|------------|
| Random seed | 1-3% | Run multiple seeds, report mean±std |
| Floating point | < 0.1% | Use float64 for verification |
| Framework differences | 1-5% | Document framework version |
| Hardware differences | 0.5-2% | Note in report |
| Batch size changes | 2-10% | Adjust LR proportionally |
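
For the batch size row, "adjust LR proportionally" is read here as the linear scaling heuristic, which is an assumption: the paper being replicated may use a different rule (square-root scaling, for instance) or need a warmup period. `scale_learning_rate` is a hypothetical helper.

```python
def scale_learning_rate(paper_lr: float, paper_batch_size: int, our_batch_size: int) -> float:
    """Scale the learning rate linearly with the batch size ratio."""
    if paper_batch_size <= 0 or our_batch_size <= 0:
        raise ValueError("Batch sizes must be positive")
    return paper_lr * our_batch_size / paper_batch_size
```

For example, a paper setting of learning rate 0.1 at batch size 256 becomes 0.025 at batch size 64. Record both the original and the adjusted value in the report so the change is traceable.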

### Concerning Differences (INVESTIGATE)

| Source | Typical Impact | Action |
|--------|---------------|--------|
| Wrong architecture | > 10% | Review code vs paper |
| Wrong hyperparameters | 5-20% | Verify all settings |
| Data preprocessing | Variable | Match paper exactly |
| Bug in implementation | Variable | Debug systematically |

### Paper Issues (PAPER_ISSUE)

Sometimes the paper contains errors. Signs include:
- Results that violate mathematical constraints
- Impossible performance claims
- Inconsistencies between text and figures
- Known errata

Document evidence thoroughly if claiming paper issue.
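
The first sign, results that violate mathematical constraints, can be screened for with a simple range check. In this sketch `find_constraint_violations` is a hypothetical helper, and the valid ranges are supplied by the reviewer because they depend on each metric's definition.

```python
from typing import Dict, List, Tuple


def find_constraint_violations(
    reported: Dict[str, float],
    bounds: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Return a message for every reported value outside its valid range."""
    violations = []
    for name, value in reported.items():
        if name not in bounds:
            continue  # no known constraint for this metric
        low, high = bounds[name]
        # Written as a negated range test so NaN is flagged as well.
        if not (low <= value <= high):
            violations.append(f"{name}={value} outside [{low}, {high}]")
    return violations


# Hypothetical reported values, for illustration only.
paper_table = {"accuracy": 1.04, "f1": 0.88}
valid_ranges = {"accuracy": (0.0, 1.0), "f1": (0.0, 1.0)}
print(find_constraint_violations(paper_table, valid_ranges))
```

A violation found this way is evidence to cite in the report, not proof on its own; a typo or a percentage written as a fraction can produce the same symptom.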