PaperTool/.opencode/skills/verification/SKILL.md
hc 849cfe5409 feat(skills): add verification skill
Replication result verification methodology.
2026-03-31 17:43:41 +08:00

name: verification
description: Use when verifying replication results against the paper's reported values

Replication Verification

Overview

Systematic approach to verifying that replicated code produces results matching the original paper.

Announce at start: "I'm using the verification skill to validate replication accuracy."

Verification Levels

Level 1: Code Correctness

  • Unit tests pass
  • No runtime errors
  • Gradient flow works

Level 2: Behavioral Match

  • Output shapes correct
  • Value ranges reasonable
  • Edge cases handled

Level 3: Numerical Match

  • Results within tolerance of paper
  • Trends match (even if absolute values differ)
  • Statistical significance considered

Test Design for Replication

Shape Tests

import torch

def test_model_output_shape():
    """Verify model produces correct output shape per paper."""
    # MyModel, config, and the dimension names come from the replication code under test
    model = MyModel(config)
    x = torch.randn(batch_size, seq_len, input_dim)
    out = model(x)

    # Paper Section 3.2: "Output dimension is 512"
    assert out.shape == (batch_size, seq_len, 512)

Value Range Tests

def test_attention_weights_sum():
    """Attention weights should sum to 1 over the last axis (paper Eq. 3)."""
    model = AttentionLayer(config)
    x = torch.randn(batch_size, seq_len, dim)
    _, attn_weights = model(x, return_attention=True)

    # Softmax output sums to 1; ones_like keeps the check valid if a head dimension is present
    row_sums = attn_weights.sum(dim=-1)
    assert torch.allclose(row_sums, torch.ones_like(row_sums), atol=1e-5)

Gradient Tests

def test_gradient_flow():
    """Verify gradients flow through all trainable parameters."""
    model = MyModel(config)
    # Same input layout as the shape test: (batch, sequence, features)
    x = torch.randn(batch_size, seq_len, input_dim)
    out = model(x)
    loss = out.sum()
    loss.backward()

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen parameters legitimately have no gradient
        assert param.grad is not None, f"No gradient for {name}"
        assert torch.isfinite(param.grad).all(), f"Non-finite gradient for {name}"

Numerical Match Tests

def test_loss_value_reasonable():
    """Loss should be in expected range per paper Figure 2."""
    model = MyModel(config)
    # ... setup ...
    
    loss = compute_loss(model, data)
    
    # Paper reports initial loss ~2.3 (cross-entropy on 10 classes)
    assert 2.0 < loss.item() < 3.0, f"Initial loss {loss.item()} outside expected range"
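The ~2.3 figure in the test above is what natural-log cross-entropy gives when a classifier predicts a uniform distribution over 10 classes, i.e. ln(10). A minimal sketch of that sanity value follows; `expected_initial_loss` is a hypothetical helper name, and it assumes the loss uses the natural logarithm and the model starts close to uniform predictions.

```python
import math


def expected_initial_loss(num_classes: int) -> float:
    """Cross-entropy (natural log) of a uniform prediction over num_classes."""
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    # -log(1 / K) == log(K)
    return math.log(num_classes)


baseline = expected_initial_loss(10)
print(f"{baseline:.4f}")  # prints 2.3026
```

Deriving the expected range from the number of classes, rather than hard-coding it, keeps the test meaningful when the replication targets a different dataset.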

Comparison Methodology

Absolute Comparison

def compare_absolute(paper_value: float, our_value: float, tolerance: float = 0.01):
    """Compare with absolute tolerance."""
    diff = abs(paper_value - our_value)
    return diff <= tolerance, diff

Relative Comparison

def compare_relative(paper_value: float, our_value: float, tolerance: float = 0.05):
    """Compare with relative tolerance (5% default)."""
    if paper_value == 0:
        return our_value == 0, abs(our_value)
    relative_diff = abs(paper_value - our_value) / abs(paper_value)
    return relative_diff <= tolerance, relative_diff
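Absolute tolerance tends to suit metrics near zero, while relative tolerance suits larger values. One possible way to combine the two is the standard library's `math.isclose`, sketched below. Note that `math.isclose` scales `rel_tol` by the larger of the two magnitudes rather than by the paper value, so it is slightly more lenient than `compare_relative`. The name `within_tolerance` and the example numbers are illustrative assumptions only.

```python
import math


def within_tolerance(
    paper_value: float,
    our_value: float,
    rel_tol: float = 0.05,
    abs_tol: float = 0.01,
) -> bool:
    """Pass if the values agree under either the relative or the absolute tolerance."""
    return math.isclose(paper_value, our_value, rel_tol=rel_tol, abs_tol=abs_tol)


# Illustrative values, not taken from any paper
close_accuracy = within_tolerance(0.912, 0.905)  # small gap on an accuracy-like metric
near_zero = within_tolerance(0.0, 0.004)         # paper value of zero: only abs_tol can pass
large_gap = within_tolerance(35.2, 31.0)         # gap exceeds both tolerances
print(close_accuracy, near_zero, large_gap)      # prints True True False
```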

Statistical Comparison

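from statistics import NormalDist
from typing import List

import numpy as np

def compare_with_variance(
    paper_mean: float,
    paper_std: float,
    our_values: List[float],
    confidence: float = 0.95,
):
    """Compare means, using the combined standard deviation as the spread of the difference."""
    our_mean = float(np.mean(our_values))
    # Sample standard deviation (ddof=1); undefined for a single run, so fall back to 0
    our_std = float(np.std(our_values, ddof=1)) if len(our_values) > 1 else 0.0

    # Two-sided z threshold for the requested confidence (about 1.96 at 0.95)
    z_threshold = NormalDist().inv_cdf((1 + confidence) / 2)
    combined_std = float(np.sqrt(paper_std**2 + our_std**2))
    if combined_std == 0:
        exact = our_mean == paper_mean  # no spread on either side: only an exact match passes
        return exact, 0.0 if exact else float("inf")
    z_score = abs(paper_mean - our_mean) / combined_std

    return z_score < z_threshold, z_score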
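The z-score above divides by the combined standard deviation, which treats each mean as if it were a single observation. When the paper also states how many runs its mean is based on, a comparison using standard errors is usually sharper. The sketch below assumes that count is available as `paper_n`; the function name and the fixed threshold are assumptions, and with only a handful of runs a t-test would be the more careful choice.

```python
import math
import statistics
from typing import Sequence, Tuple


def compare_means_with_standard_error(
    paper_mean: float,
    paper_std: float,
    paper_n: int,
    our_values: Sequence[float],
    z_threshold: float = 1.96,
) -> Tuple[bool, float]:
    """z-test on the difference of two means using their standard errors."""
    if paper_n < 1 or len(our_values) < 2:
        raise ValueError("need paper_n >= 1 and at least two of our runs")
    our_mean = statistics.fmean(our_values)
    our_std = statistics.stdev(our_values)  # sample std (n - 1 denominator)
    # Standard error of the difference between the two means
    se = math.sqrt(paper_std**2 / paper_n + our_std**2 / len(our_values))
    if se == 0.0:
        exact = our_mean == paper_mean
        return exact, 0.0 if exact else math.inf
    z = abs(paper_mean - our_mean) / se
    return z < z_threshold, z


# Illustrative numbers only: three of our runs against a paper mean over three runs
ok, z = compare_means_with_standard_error(0.912, 0.01, 3, [0.90, 0.91, 0.92])
```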

Common Difference Sources

Acceptable Differences

| Source | Typical Impact | Mitigation |
|--------|----------------|------------|
| Random seed | 1-2% | Run multiple seeds |
| Floating point | < 0.1% | Use float64 for verification |
| Framework differences | 1-3% | Document and accept |
| Hardware differences | 0.5-1% | Note in report |

Concerning Differences

| Source | Typical Impact | Action |
|--------|----------------|--------|
| Wrong architecture | > 10% | Review code vs paper |
| Wrong hyperparameters | 5-20% | Verify all settings |
| Data preprocessing | Variable | Match paper exactly |
| Evaluation protocol | Variable | Check train/val/test split |

Verification Checklist

Before Comparison

  • Seeds set for reproducibility
  • Same evaluation data as paper
  • Same preprocessing pipeline
  • Same evaluation metrics
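For the seeding item above, a small helper along these lines is a common pattern. It is a sketch under the assumption that NumPy and PyTorch may or may not be installed; `set_seed` is a hypothetical name. Seeding alone does not guarantee bitwise-identical results on GPU.

```python
import random


def set_seed(seed: int) -> None:
    """Seed the stdlib RNG and, if installed, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)  # legacy global NumPy RNG
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed(0)
first_draw = random.random()
set_seed(0)
second_draw = random.random()  # same value as first_draw after reseeding
```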

During Comparison

  • Run multiple times with different seeds
  • Record mean and standard deviation
  • Compare trends, not just final values
  • Check intermediate checkpoints if available
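The first two items above (multiple seeds, mean and standard deviation) can be sketched as a small aggregation loop. `run_over_seeds` and `fake_run` are hypothetical names; `fake_run` is only a deterministic stand-in for a real train-and-evaluate call and produces made-up numbers.

```python
import random
import statistics
from typing import Callable, Iterable, List, Tuple


def run_over_seeds(
    run_fn: Callable[[int], float],
    seeds: Iterable[int],
) -> Tuple[float, float, List[float]]:
    """Call run_fn once per seed and return (mean, sample std, raw values)."""
    values = [run_fn(seed) for seed in seeds]
    if not values:
        raise ValueError("at least one seed is required")
    mean = statistics.fmean(values)
    # Sample std needs at least two runs; report 0.0 for a single run
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std, values


def fake_run(seed: int) -> float:
    """Stand-in for a real run: a made-up metric that depends only on the seed."""
    return 0.90 + random.Random(seed).uniform(-0.01, 0.01)


mean, std, values = run_over_seeds(fake_run, [0, 1, 2])
print(f"{mean:.4f} ± {std:.4f} over {len(values)} seeds")
```

Keeping the raw per-seed values alongside the summary makes it possible to feed them into a statistical comparison later without rerunning anything.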

After Comparison

  • Document all differences
  • Explain likely causes
  • Determine if differences are acceptable
  • Suggest improvements if needed

Report Template

## Verification Result: {Metric Name}

**Paper Value**: {value} ± {std}
**Our Value**: {value} ± {std}
**Difference**: {absolute} ({relative}%)

**Status**: MATCH | ACCEPTABLE | INVESTIGATE | MISMATCH

**Analysis**:
{explanation of difference}

**Confidence**: {HIGH | MEDIUM | LOW}
{reasoning for confidence level}
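Filling in the template by hand is error-prone, so a small renderer can help. The sketch below is one possible approach; `render_report`, its parameter names, and the four-decimal formatting are assumptions rather than part of the skill. The status and confidence are passed in by the caller because the template does not define thresholds for them.

```python
def render_report(
    metric: str,
    paper_value: float,
    paper_std: float,
    our_value: float,
    our_std: float,
    status: str,
    analysis: str,
    confidence: str,
    confidence_reason: str,
) -> str:
    """Fill the verification report template with computed differences."""
    if status not in {"MATCH", "ACCEPTABLE", "INVESTIGATE", "MISMATCH"}:
        raise ValueError(f"unknown status: {status}")
    if confidence not in {"HIGH", "MEDIUM", "LOW"}:
        raise ValueError(f"unknown confidence: {confidence}")
    absolute = abs(paper_value - our_value)
    # Relative difference is undefined when the paper value is zero
    relative = f"{100 * absolute / abs(paper_value):.2f}%" if paper_value != 0 else "n/a"
    lines = [
        f"## Verification Result: {metric}",
        "",
        f"**Paper Value**: {paper_value:.4f} ± {paper_std:.4f}",
        f"**Our Value**: {our_value:.4f} ± {our_std:.4f}",
        f"**Difference**: {absolute:.4f} ({relative})",
        "",
        f"**Status**: {status}",
        "",
        "**Analysis**:",
        analysis,
        "",
        f"**Confidence**: {confidence}",
        confidence_reason,
    ]
    return "\n".join(lines)


# Illustrative inputs only; these numbers are not from any paper
report = render_report(
    "Top-1 accuracy", 0.912, 0.003, 0.905, 0.004,
    "ACCEPTABLE", "Gap is within seed-to-seed variation.",
    "MEDIUM", "Only three seeds were run.",
)
print(report)
```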