hc 5d5aee1f83 refactor: improve verification workflow with visual comparison

Major changes:
- paper-image-extractor: Generate reference_plots.py for visual verification
- paper-director: Add image understanding checkpoint with side-by-side comparison
- paper-analyzer: Add data source labeling with reliability levels
- code-writer: Change from TDD to VDD (Verification-Driven Development)
- test-runner: Generate comparison reports with images and explanations
- verification skill: Add difference classification system
- code-generation skill: Emphasize result independence

Key principles:
- Code results are authoritative, paper values are references
- Differences are expected and documented, not bugs to fix
- Visual comparison prioritized over exact numerical match
- Tests verify sanity (shape, gradient, range), not exact values

2026-03-31 19:55:36 +08:00


| name | description |
|------|-------------|
| verification | Use when verifying replication results against paper's reported values |

Replication Verification

Overview

Systematic approach to verifying that replicated code produces results comparable to the original paper. Note: Exact matches are rare; the goal is verifiable, explainable results.

Announce at start: "I'm using the verification skill to validate replication accuracy."

Core Philosophy

Code results are authoritative - Our implementation's output is ground truth
Paper values are references - Used for comparison, not as test assertions
Differences require explanations - Not fixes (unless clearly buggy)
Visual comparison over numerical - Trends matter more than exact values

Difference Classification System

| Status | Symbol | Criteria | Action |
|--------|--------|----------|--------|
| MATCH | ✅ | < 2% difference | Document, no action needed |
| ACCEPTABLE | ⚠️ | 2-10% difference | Document with brief explanation |
| EXPLAINABLE | 📝 | > 10%, cause identified | Document cause thoroughly |
| INVESTIGATE | 🔍 | > 10%, cause unknown | Review implementation |
| PAPER_ISSUE | 📄 | Our results more reasonable | Document evidence |
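
The thresholds in the table can be applied mechanically once the relative difference is known. The following is a minimal sketch, not part of the skill itself: the `Status` enum, the `classify_difference` function, and the `cause_identified` and `paper_suspect` flags are hypothetical names introduced for illustration. It assumes the percentages are relative differences against the paper value and that exactly 10% still counts as ACCEPTABLE, which the table leaves open.

```python
from enum import Enum


class Status(Enum):
    MATCH = "MATCH"
    ACCEPTABLE = "ACCEPTABLE"
    EXPLAINABLE = "EXPLAINABLE"
    INVESTIGATE = "INVESTIGATE"
    PAPER_ISSUE = "PAPER_ISSUE"


def classify_difference(
    paper_value: float,
    our_value: float,
    cause_identified: bool = False,
    paper_suspect: bool = False,
) -> Status:
    """Map a paper/replication pair to a status from the classification table."""
    if paper_suspect:
        # PAPER_ISSUE is a judgment backed by evidence, not a numeric threshold
        return Status.PAPER_ISSUE
    if paper_value == 0:
        # Relative difference is undefined; treat any nonzero gap as large
        rel_diff = 0.0 if our_value == 0 else float("inf")
    else:
        rel_diff = abs(our_value - paper_value) / abs(paper_value)
    if rel_diff < 0.02:
        return Status.MATCH
    if rel_diff <= 0.10:
        return Status.ACCEPTABLE
    # Above 10% the status depends on whether the cause is understood
    return Status.EXPLAINABLE if cause_identified else Status.INVESTIGATE
```

The two boolean flags are set by the reviewer; the function only automates the percentage bands.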

Verification Levels

Level 1: Code Correctness

Unit tests pass
No runtime errors
Gradient flow works

Level 2: Behavioral Match

Output shapes correct
Value ranges reasonable
Edge cases handled

Level 3: Numerical Match

Results within tolerance of paper
Trends match (even if absolute values differ)
Statistical significance considered

Test Design for Replication

Shape Tests

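import torch

# MyModel, config, batch_size, seq_len and input_dim are supplied by the
# project under test (for example as module-level constants or fixtures).


def test_model_output_shape():
    """Verify model produces correct output shape per paper."""
    model = MyModel(config)
    x = torch.randn(batch_size, seq_len, input_dim)
    out = model(x)

    # Paper Section 3.2: "Output dimension is 512"
    assert out.shape == (batch_size, seq_len, 512)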

Value Range Tests

def test_attention_weights_sum():
    """Attention weights should sum to 1 (paper Eq. 3)."""
    model = AttentionLayer(config)
    x = torch.randn(batch_size, seq_len, dim)
    _, attn_weights = model(x, return_attention=True)
    
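    # Softmax output sums to 1 along the last dimension; ones_like keeps the
    # check valid for any leading shape (e.g. an extra heads dimension)
    row_sums = attn_weights.sum(dim=-1)
    assert torch.allclose(row_sums, torch.ones_like(row_sums), atol=1e-6)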

Gradient Tests

def test_gradient_flow():
    """Verify gradients flow through all parameters."""
    model = MyModel(config)
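    # Same input shape as in the shape test
    x = torch.randn(batch_size, seq_len, input_dim, requires_grad=True)
    out = model(x)
    loss = out.sum()
    loss.backward()

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # intentionally frozen parameters receive no gradient
        assert param.grad is not None, f"No gradient for {name}"
        assert torch.isfinite(param.grad).all(), f"Non-finite gradient for {name}"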

Numerical Match Tests

def test_loss_value_reasonable():
    """Loss should be in expected range per paper Figure 2."""
    model = MyModel(config)
    # ... setup ...
    
    loss = compute_loss(model, data)
    
    # Paper reports initial loss ~2.3 (cross-entropy on 10 classes)
    assert 2.0 < loss.item() < 3.0, f"Initial loss {loss.item()} outside expected range"
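
The ~2.3 figure in the comment above is what a classifier with no learned preference would produce: predicting a uniform distribution over C classes gives a cross-entropy of ln(C). As a small sketch (the helper name `uniform_cross_entropy` is hypothetical), the expected range for an untrained model can be derived rather than copied from a figure:

```python
import math


def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over num_classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return math.log(num_classes)


print(round(uniform_cross_entropy(10), 4))  # prints 2.3026
```

This applies only to the initial loss with natural-log cross-entropy; later loss values have no such closed form.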

Comparison Methodology

Absolute Comparison

def compare_absolute(paper_value: float, our_value: float, tolerance: float = 0.01):
    """Compare with absolute tolerance."""
    diff = abs(paper_value - our_value)
    return diff <= tolerance, diff

Relative Comparison

def compare_relative(paper_value: float, our_value: float, tolerance: float = 0.05):
    """Compare with relative tolerance (5% default)."""
    if paper_value == 0:
        return our_value == 0, abs(our_value)
    relative_diff = abs(paper_value - our_value) / abs(paper_value)
    return relative_diff <= tolerance, relative_diff

Statistical Comparison

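from statistics import NormalDist
from typing import List

import numpy as np


def compare_with_variance(
    paper_mean: float,
    paper_std: float,
    our_values: List[float],
    confidence: float = 0.95,
):
    """Compare considering paper's reported variance."""
    our_mean = np.mean(our_values)
    our_std = np.std(our_values)

    # Two-sided z threshold for the requested confidence (about 1.96 for 0.95)
    z_threshold = NormalDist().inv_cdf((1 + confidence) / 2)

    combined_std = np.sqrt(paper_std**2 + our_std**2)
    diff = abs(paper_mean - our_mean)
    if combined_std == 0:
        # No variance on either side: only an exact match passes
        return diff == 0, 0.0 if diff == 0 else float("inf")
    z_score = diff / combined_std

    return z_score < z_threshold, z_score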

Common Difference Sources

Acceptable Differences

| Source | Typical Impact | Mitigation |
|--------|----------------|------------|
| Random seed | 1-2% | Run multiple seeds |
| Floating point | < 0.1% | Use float64 for verification |
| Framework differences | 1-3% | Document and accept |
| Hardware differences | 0.5-1% | Note in report |

Concerning Differences

| Source | Typical Impact | Action |
|--------|----------------|--------|
| Wrong architecture | > 10% | Review code vs paper |
| Wrong hyperparameters | 5-20% | Verify all settings |
| Data preprocessing | Variable | Match paper exactly |
| Evaluation protocol | Variable | Check train/val/test split |

Verification Checklist

Before Comparison

Seeds set for reproducibility
Same evaluation data as paper
Same preprocessing pipeline
Same evaluation metrics
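
For the first item, a seeding helper might look like the sketch below. The function name `set_seed` is an assumption, and NumPy and PyTorch are seeded only if they happen to be installed. Seeding alone does not guarantee bit-identical results across hardware or library versions.

```python
import random


def set_seed(seed: int) -> None:
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
```

Calling `set_seed` at the start of every run, and recording the seed in the report, makes it possible to reproduce a specific run later.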

During Comparison

Run multiple times with different seeds
Record mean and standard deviation
Compare trends, not just final values
Check intermediate checkpoints if available
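
The first two items can be combined in a small aggregation helper. This is a sketch under assumptions: `summarize_runs` and `run_fn` are hypothetical names, and `run_fn` stands in for whatever trains and evaluates the model for one seed and returns a single metric.

```python
import statistics
from typing import Callable, Sequence, Tuple


def summarize_runs(
    run_fn: Callable[[int], float],
    seeds: Sequence[int],
) -> Tuple[float, float]:
    """Run run_fn once per seed and return (mean, sample std) of the metric."""
    if not seeds:
        raise ValueError("at least one seed is required")
    values = [run_fn(seed) for seed in seeds]
    mean = statistics.mean(values)
    # The sample standard deviation needs at least two runs
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std
```

The resulting mean and standard deviation are the values that go into the "Our Value" line of the report.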

After Comparison

Document all differences
Explain likely causes
Determine if differences are acceptable
Suggest improvements if needed

Report Template

## Verification Result: {Metric Name}

**Paper Value**: {value} ± {std} (Source: {figure/table/text})
**Our Value**: {value} ± {std}
**Difference**: {absolute} ({relative}%)

**Status**: MATCH | ACCEPTABLE | EXPLAINABLE | INVESTIGATE | PAPER_ISSUE

**Analysis**:
{explanation of difference - required for all non-MATCH statuses}

**Confidence**: {HIGH | MEDIUM | LOW}
{reasoning for confidence level}
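
One way to fill this template programmatically is sketched below. The `VerificationResult` dataclass and `render_report` function are hypothetical names, and the example numbers are placeholders for illustration, not results from any paper.

```python
from dataclasses import dataclass


@dataclass
class VerificationResult:
    metric: str
    paper_value: float
    paper_std: float
    source: str
    our_value: float
    our_std: float
    status: str
    analysis: str
    confidence: str
    confidence_reason: str


def render_report(r: VerificationResult) -> str:
    """Fill the report template with one metric's comparison."""
    absolute = abs(r.our_value - r.paper_value)
    # Relative difference is undefined when the paper value is zero
    if r.paper_value != 0:
        relative = f"{100 * absolute / abs(r.paper_value):.2f}%"
    else:
        relative = "n/a"
    lines = [
        f"## Verification Result: {r.metric}",
        "",
        f"**Paper Value**: {r.paper_value} ± {r.paper_std} (Source: {r.source})",
        f"**Our Value**: {r.our_value} ± {r.our_std}",
        f"**Difference**: {absolute:.4g} ({relative})",
        "",
        f"**Status**: {r.status}",
        "",
        "**Analysis**:",
        r.analysis,
        "",
        f"**Confidence**: {r.confidence}",
        r.confidence_reason,
    ]
    return "\n".join(lines)


# Placeholder values for illustration only
example = VerificationResult(
    metric="Top-1 accuracy",
    paper_value=0.90,
    paper_std=0.01,
    source="Table 2",
    our_value=0.87,
    our_std=0.02,
    status="ACCEPTABLE",
    analysis="Difference is within the range expected from seed variance.",
    confidence="MEDIUM",
    confidence_reason="Only three seeds were run.",
)
print(render_report(example))
```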

Visual Comparison Guidelines

Side-by-Side Figure Comparison

Always present figures in side-by-side format:

| Paper Reference | Our Replication |
|-----------------|-----------------|
| ![](ref_fig.png) | ![](our_fig.png) |

What to Compare

Trends: Does the curve go up/down at the same places?
Shape: Is the overall shape similar?
Key points: Do peaks/valleys occur at similar locations?
Scale: Are values in the same order of magnitude?

Acceptable vs Unacceptable Differences

Acceptable (document and move on):

Curve shifted slightly up/down (offset)
Slightly faster/slower convergence
Small noise differences

Unacceptable (investigate):

Opposite trends (going up vs down)
Completely different shapes
Order of magnitude differences
Missing features (e.g., expected oscillation absent)
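
Visual judgment can be backed by rough numeric proxies. The sketch below is an assumption about how one might do this, not a prescribed method: `compare_curves` is a hypothetical helper that uses Pearson correlation as a proxy for trend, the ratio of peak magnitudes for scale, and the index of the maximum for key-point location. It assumes both curves are sampled on the same x-grid, and it does not detect missing features such as an absent oscillation.

```python
import math
from typing import Dict, Sequence


def compare_curves(paper: Sequence[float], ours: Sequence[float]) -> Dict[str, object]:
    """Compute rough numeric proxies for trend, scale and peak location."""
    n = len(paper)
    if n != len(ours) or n < 2:
        raise ValueError("curves must share the same x-grid with at least 2 points")

    # Trend: Pearson correlation (negative means opposite trends)
    mean_p = sum(paper) / n
    mean_o = sum(ours) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(paper, ours))
    var_p = sum((p - mean_p) ** 2 for p in paper)
    var_o = sum((o - mean_o) ** 2 for o in ours)
    denom = math.sqrt(var_p * var_o)
    correlation = cov / denom if denom > 0 else 0.0

    # Scale: gap between the largest magnitudes, in orders of magnitude
    max_p = max(abs(p) for p in paper)
    max_o = max(abs(o) for o in ours)
    if max_p > 0 and max_o > 0:
        magnitude_gap = abs(math.log10(max_o / max_p))
    else:
        magnitude_gap = 0.0 if max_p == max_o else float("inf")

    # Key points: how far apart (in samples) the two peaks sit
    peak_p = max(range(n), key=lambda i: paper[i])
    peak_o = max(range(n), key=lambda i: ours[i])

    return {
        "correlation": correlation,
        "magnitude_gap": magnitude_gap,
        "peak_shift": abs(peak_p - peak_o),
        # Opposite trends or an order-of-magnitude gap call for investigation
        "investigate": correlation < 0 or magnitude_gap >= 1.0,
    }
```

A small offset or slightly different convergence speed leaves the correlation high and the magnitude gap near zero, so those cases are not flagged.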

Common Difference Sources

Expected Differences (ACCEPTABLE)

| Source | Typical Impact | Mitigation |
|--------|----------------|------------|
| Random seed | 1-3% | Run multiple seeds, report mean±std |
| Floating point | < 0.1% | Use float64 for verification |
| Framework differences | 1-5% | Document framework version |
| Hardware differences | 0.5-2% | Note in report |
| Batch size changes | 2-10% | Adjust LR proportionally |
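
"Adjust LR proportionally" in the last row can be read as the linear scaling heuristic, which keeps the ratio of learning rate to batch size constant. The sketch below assumes that reading; the function name `scale_learning_rate` is hypothetical. The heuristic is a starting point and may not hold for very large batch sizes or for every optimizer.

```python
def scale_learning_rate(
    paper_lr: float,
    paper_batch_size: int,
    our_batch_size: int,
) -> float:
    """Linear scaling heuristic: keep lr / batch_size constant."""
    if paper_batch_size <= 0 or our_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return paper_lr * our_batch_size / paper_batch_size
```

For example, if a paper trains with learning rate 0.1 at batch size 256 and the replication can only fit batch size 64, the scaled learning rate is 0.025. Any such change should be recorded in the report as a deviation from the paper's settings.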

Concerning Differences (INVESTIGATE)

| Source | Typical Impact | Action |
|--------|----------------|--------|
| Wrong architecture | > 10% | Review code vs paper |
| Wrong hyperparameters | 5-20% | Verify all settings |
| Data preprocessing | Variable | Match paper exactly |
| Bug in implementation | Variable | Debug systematically |

Paper Issues (PAPER_ISSUE)

Sometimes the paper contains errors. Signs include:

Results that violate mathematical constraints
Impossible performance claims
Inconsistencies between text and figures
Known errata

Document evidence thoroughly if claiming paper issue.
