PaperTool/.opencode/skills/verification/SKILL.md
2026-03-31 19:55:36 +08:00


name: verification
description: Use when verifying replication results against paper's reported values

Replication Verification

Overview

Systematic approach to verifying that replicated code produces results comparable to the original paper. Note: Exact matches are rare; the goal is verifiable, explainable results.

Announce at start: "I'm using the verification skill to validate replication accuracy."

Core Philosophy

  1. Code results are authoritative - Our implementation's output is ground truth
  2. Paper values are references - Used for comparison, not as test assertions
  3. Differences require explanations - Not fixes (unless clearly buggy)
  4. Visual comparison over numerical - Trends matter more than exact values

Difference Classification System

| Status | Symbol | Criteria | Action |
|--------|--------|----------|--------|
| MATCH | | < 2% difference | Document, no action needed |
| ACCEPTABLE | ⚠️ | 2-10% difference | Document with brief explanation |
| EXPLAINABLE | 📝 | > 10%, cause identified | Document cause thoroughly |
| INVESTIGATE | 🔍 | > 10%, cause unknown | Review implementation |
| PAPER_ISSUE | 📄 | Our results more reasonable | Document evidence |
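
The table above can be sketched as a small classifier. This is a minimal illustration, not part of the skill itself: the names `Status` and `classify_difference`, the two boolean flags, and the choice to treat exactly 2% and exactly 10% as ACCEPTABLE are all assumptions made for the example.

```python
from enum import Enum


class Status(Enum):
    MATCH = "MATCH"
    ACCEPTABLE = "ACCEPTABLE"
    EXPLAINABLE = "EXPLAINABLE"
    INVESTIGATE = "INVESTIGATE"
    PAPER_ISSUE = "PAPER_ISSUE"


def classify_difference(
    paper_value: float,
    our_value: float,
    cause_identified: bool = False,
    paper_suspect: bool = False,
) -> Status:
    """Map a paper/replication pair onto the classification table."""
    if paper_suspect:
        # PAPER_ISSUE is a judgement call, so the caller has to assert it.
        return Status.PAPER_ISSUE
    if paper_value == 0:
        raise ValueError("relative difference is undefined for a zero paper value")

    rel_pct = abs(our_value - paper_value) / abs(paper_value) * 100.0
    if rel_pct < 2.0:
        return Status.MATCH
    if rel_pct <= 10.0:  # assumption: both boundaries count as ACCEPTABLE
        return Status.ACCEPTABLE
    # Above 10% the status depends on whether a cause has been found.
    return Status.EXPLAINABLE if cause_identified else Status.INVESTIGATE
```

Keeping `paper_suspect` as an explicit argument reflects that no threshold can decide whether the paper itself is wrong; that call still needs documented evidence.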

Verification Levels

Level 1: Code Correctness

  • Unit tests pass
  • No runtime errors
  • Gradient flow works

Level 2: Behavioral Match

  • Output shapes correct
  • Value ranges reasonable
  • Edge cases handled

Level 3: Numerical Match

  • Results within tolerance of paper
  • Trends match (even if absolute values differ)
  • Statistical significance considered

Test Design for Replication

Shape Tests

def test_model_output_shape():
    """Verify model produces correct output shape per paper."""
    model = MyModel(config)
    x = torch.randn(batch_size, seq_len, input_dim)
    out = model(x)
    
    # Paper Section 3.2: "Output dimension is 512"
    assert out.shape == (batch_size, seq_len, 512)

Value Range Tests

def test_attention_weights_sum():
    """Attention weights should sum to 1 (paper Eq. 3)."""
    model = AttentionLayer(config)
    x = torch.randn(batch_size, seq_len, dim)
    _, attn_weights = model(x, return_attention=True)

    # Softmax over the last dimension: every row sums to 1, whatever the
    # leading dimensions are (for example, with or without a heads axis).
    row_sums = attn_weights.sum(dim=-1)
    assert torch.allclose(row_sums, torch.ones_like(row_sums))

Gradient Tests

def test_gradient_flow():
    """Verify gradients flow through all trainable parameters."""
    model = MyModel(config)
    # Same input layout as the shape test.
    x = torch.randn(batch_size, seq_len, input_dim, requires_grad=True)
    out = model(x)
    loss = out.sum()
    loss.backward()

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen parameters never receive a gradient
        assert param.grad is not None, f"No gradient for {name}"
        assert torch.isfinite(param.grad).all(), f"Non-finite gradient for {name}"

Numerical Match Tests

def test_loss_value_reasonable():
    """Loss should be in expected range per paper Figure 2."""
    model = MyModel(config)
    # ... setup ...
    
    loss = compute_loss(model, data)
    
    # Paper reports initial loss ~2.3 (cross-entropy on 10 classes)
    assert 2.0 < loss.item() < 3.0, f"Initial loss {loss.item()} outside expected range"

Comparison Methodology

Absolute Comparison

def compare_absolute(paper_value: float, our_value: float, tolerance: float = 0.01):
    """Compare with absolute tolerance."""
    diff = abs(paper_value - our_value)
    return diff <= tolerance, diff

Relative Comparison

def compare_relative(paper_value: float, our_value: float, tolerance: float = 0.05):
    """Compare with relative tolerance (5% default)."""
    if paper_value == 0:
        return our_value == 0, abs(our_value)
    relative_diff = abs(paper_value - our_value) / abs(paper_value)
    return relative_diff <= tolerance, relative_diff
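
Relative tolerance breaks down for values near zero, while absolute tolerance ignores scale. One possible way to combine the two is the standard library's `math.isclose`, sketched below. The helper name `compare_combined` and its default tolerances are assumptions for illustration. Note that `math.isclose` scales the relative tolerance by the larger of the two magnitudes rather than by the paper value, so it is slightly more lenient than `compare_relative`.

```python
import math


def compare_combined(
    paper_value: float,
    our_value: float,
    rel_tol: float = 0.05,
    abs_tol: float = 0.01,
) -> bool:
    """Pass if either the relative or the absolute tolerance is satisfied."""
    return math.isclose(paper_value, our_value, rel_tol=rel_tol, abs_tol=abs_tol)


# A paper value of exactly 0 is handled by the absolute tolerance.
print(compare_combined(0.0, 0.005))
# Larger values are governed by the relative tolerance.
print(compare_combined(76.2, 75.1))
```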

Statistical Comparison

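import math
from statistics import NormalDist
from typing import List

import numpy as np


def compare_with_variance(
    paper_mean: float,
    paper_std: float,
    our_values: List[float],
    confidence: float = 0.95,
):
    """Compare means, accounting for the paper's reported variance and ours."""
    our_mean = float(np.mean(our_values))
    # Sample standard deviation (ddof=1); undefined for a single run, so use 0.
    our_std = float(np.std(our_values, ddof=1)) if len(our_values) > 1 else 0.0

    combined_std = math.sqrt(paper_std**2 + our_std**2)
    if combined_std == 0.0:
        # No variance on either side: only an exact match can pass.
        exact = our_mean == paper_mean
        return exact, 0.0 if exact else math.inf

    # Two-sided z threshold for the requested confidence (about 1.96 for 0.95).
    z_threshold = NormalDist().inv_cdf(0.5 + confidence / 2)
    z_score = abs(paper_mean - our_mean) / combined_std
    return z_score < z_threshold, z_score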

Common Difference Sources

Acceptable Differences

| Source | Typical Impact | Mitigation |
|--------|----------------|------------|
| Random seed | 1-2% | Run multiple seeds |
| Floating point | < 0.1% | Use float64 for verification |
| Framework differences | 1-3% | Document and accept |
| Hardware differences | 0.5-1% | Note in report |

Concerning Differences

| Source | Typical Impact | Action |
|--------|----------------|--------|
| Wrong architecture | > 10% | Review code vs paper |
| Wrong hyperparameters | 5-20% | Verify all settings |
| Data preprocessing | Variable | Match paper exactly |
| Evaluation protocol | Variable | Check train/val/test split |

Verification Checklist

Before Comparison

  • Seeds set for reproducibility
  • Same evaluation data as paper
  • Same preprocessing pipeline
  • Same evaluation metrics

During Comparison

  • Run multiple times with different seeds
  • Record mean and standard deviation
  • Compare trends, not just final values
  • Check intermediate checkpoints if available
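
The first two items of this checklist can be sketched as a small aggregation helper. This is an illustration under assumptions: `run_over_seeds` is a hypothetical name, and `fake_experiment` is a stand-in for a real training and evaluation run that returns a single metric.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def run_over_seeds(
    run_experiment: Callable[[int], float],
    seeds: Sequence[int],
) -> Tuple[float, float]:
    """Run one experiment per seed and return (mean, sample std) of the metric."""
    values = [run_experiment(seed) for seed in seeds]
    mean = statistics.mean(values)
    # The sample standard deviation needs at least two runs.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def fake_experiment(seed: int) -> float:
    """Hypothetical stand-in: a noisy metric centred on 0.75."""
    rng = random.Random(seed)  # seeded, so each run is reproducible
    return 0.75 + rng.gauss(0.0, 0.01)


mean, std = run_over_seeds(fake_experiment, seeds=[0, 1, 2, 3, 4])
print(f"{mean:.4f} ± {std:.4f}")
```

The resulting mean and standard deviation are the values that would go into the "Our Value" line of the report.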

After Comparison

  • Document all differences
  • Explain likely causes
  • Determine if differences are acceptable
  • Suggest improvements if needed

Report Template

## Verification Result: {Metric Name}

**Paper Value**: {value} ± {std} (Source: {figure/table/text})
**Our Value**: {value} ± {std}
**Difference**: {absolute} ({relative}%)

**Status**: MATCH | ACCEPTABLE | EXPLAINABLE | INVESTIGATE | PAPER_ISSUE

**Analysis**:
{explanation of difference - required for all non-MATCH statuses}

**Confidence**: {HIGH | MEDIUM | LOW}
{reasoning for confidence level}
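
Filling in the template by hand is error-prone, so a rendering helper along the following lines may be useful. It is a sketch only: `render_report` and its parameter names are assumptions, and the number formatting is one arbitrary choice.

```python
from typing import Optional


def render_report(
    metric: str,
    paper_value: float,
    our_value: float,
    status: str,
    analysis: str,
    confidence: str,
    confidence_reason: str,
    paper_std: Optional[float] = None,
    our_std: Optional[float] = None,
    source: str = "unspecified",
) -> str:
    """Render one verification result in the template's layout."""
    if status != "MATCH" and not analysis.strip():
        # The template requires an explanation for every non-MATCH status.
        raise ValueError("analysis is required for non-MATCH statuses")

    def fmt(value: float, std: Optional[float]) -> str:
        return f"{value:g} ± {std:g}" if std is not None else f"{value:g}"

    diff = our_value - paper_value
    rel = f"{diff / abs(paper_value) * 100:+.1f}%" if paper_value != 0 else "n/a"
    return "\n".join([
        f"## Verification Result: {metric}",
        "",
        f"**Paper Value**: {fmt(paper_value, paper_std)} (Source: {source})",
        f"**Our Value**: {fmt(our_value, our_std)}",
        f"**Difference**: {diff:+g} ({rel})",
        "",
        f"**Status**: {status}",
        "",
        "**Analysis**:",
        analysis,
        "",
        f"**Confidence**: {confidence}",
        confidence_reason,
    ])
```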

Visual Comparison Guidelines

Side-by-Side Figure Comparison

Always present figures in side-by-side format:

| Paper Reference | Our Replication |
|-----------------|-----------------|
| ![](ref_fig.png) | ![](our_fig.png) |

What to Compare

  1. Trends: Does the curve go up/down at the same places?
  2. Shape: Is the overall shape similar?
  3. Key points: Do peaks/valleys occur at similar locations?
  4. Scale: Are values in the same order of magnitude?
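
When both curves are available as numbers (for example, points digitized from the paper's figure and our own logged values), the trend, key-point, and scale checks in this list can be roughly quantified. The sketch below assumes both curves are sampled at the same x positions; all function names are hypothetical, and it supplements rather than replaces looking at the figures.

```python
import math
from typing import Sequence


def direction_agreement(ref: Sequence[float], ours: Sequence[float]) -> float:
    """Fraction of steps where both curves move in the same direction."""
    if len(ref) != len(ours) or len(ref) < 2:
        raise ValueError("curves must have the same length, at least 2")

    def sign(x: float) -> int:
        return (x > 0) - (x < 0)

    same = sum(
        sign(ref[i + 1] - ref[i]) == sign(ours[i + 1] - ours[i])
        for i in range(len(ref) - 1)
    )
    return same / (len(ref) - 1)


def peak_offset(ref: Sequence[float], ours: Sequence[float]) -> int:
    """Distance, in samples, between the maxima of the two curves."""
    ref_peak = max(range(len(ref)), key=lambda i: ref[i])
    our_peak = max(range(len(ours)), key=lambda i: ours[i])
    return abs(ref_peak - our_peak)


def same_order_of_magnitude(ref_value: float, our_value: float) -> bool:
    """True if the two values differ by less than a factor of 10."""
    if ref_value == 0 or our_value == 0:
        return ref_value == our_value
    return abs(math.log10(abs(our_value) / abs(ref_value))) < 1.0
```

A low `direction_agreement` or a failed `same_order_of_magnitude` check corresponds to the kinds of difference that call for investigation; a small constant offset leaves all three checks unaffected.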

Acceptable vs Unacceptable Differences

Acceptable (document and move on):

  • Curve shifted slightly up/down (offset)
  • Slightly faster/slower convergence
  • Small noise differences

Unacceptable (investigate):

  • Opposite trends (going up vs down)
  • Completely different shapes
  • Order of magnitude differences
  • Missing features (e.g., expected oscillation absent)

Common Difference Sources

Expected Differences (ACCEPTABLE)

| Source | Typical Impact | Mitigation |
|--------|----------------|------------|
| Random seed | 1-3% | Run multiple seeds, report mean±std |
| Floating point | < 0.1% | Use float64 for verification |
| Framework differences | 1-5% | Document framework version |
| Hardware differences | 0.5-2% | Note in report |
| Batch size changes | 2-10% | Adjust LR proportionally |
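
For the batch-size row, "adjust LR proportionally" can be read as the linear scaling rule, $\text{lr}_{\text{new}} = \text{lr}_{\text{base}} \cdot B_{\text{new}} / B_{\text{base}}$. The sketch below assumes that reading; the function name is hypothetical, and a paper that prescribes a different schedule takes precedence.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate linearly with the batch size."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Paper trains with batch size 256 at lr 0.1; we can only fit 64 per step.
print(scale_learning_rate(0.1, base_batch_size=256, new_batch_size=64))
```

Whatever value is used, recording both batch sizes and both learning rates in the report keeps the deviation from the paper explainable.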

Concerning Differences (INVESTIGATE)

| Source | Typical Impact | Action |
|--------|----------------|--------|
| Wrong architecture | > 10% | Review code vs paper |
| Wrong hyperparameters | 5-20% | Verify all settings |
| Data preprocessing | Variable | Match paper exactly |
| Bug in implementation | Variable | Debug systematically |

Paper Issues (PAPER_ISSUE)

Sometimes the paper contains errors. Signs include:

  • Results that violate mathematical constraints
  • Impossible performance claims
  • Inconsistencies between text and figures
  • Known errata

Document the evidence thoroughly before claiming a paper issue.
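
Some of these signs can be checked mechanically. The sketch below covers only the first one, mathematical constraints, using two illustrative rules: accuracies reported in percent must lie in [0, 100], and a reported probability distribution must sum to 1. The function name, the rules chosen, and the tolerance are assumptions, not an exhaustive screen.

```python
import math
from typing import Dict, List, Sequence


def check_reported_values(
    accuracies_pct: Dict[str, float],
    probabilities: Sequence[float] = (),
) -> List[str]:
    """Return human-readable constraint violations; an empty list means none found."""
    problems = []
    for name, value in accuracies_pct.items():
        if not 0.0 <= value <= 100.0:
            problems.append(f"{name}: accuracy {value}% is outside [0, 100]")
    # Allow for rounding in the paper's printed values.
    if probabilities and not math.isclose(sum(probabilities), 1.0, abs_tol=1e-3):
        problems.append(f"probabilities sum to {sum(probabilities):g}, expected 1")
    return problems
```

Each returned string can go directly into the Analysis field of a PAPER_ISSUE report as part of the evidence.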