- Use English for structural headers (Role, Workflow, Constraints)
- Use Chinese for business logic and detailed explanations
- Consistent formatting across all 6 agents:
- paper-director.md
- paper-analyzer.md
- paper-image-extractor.md
- code-writer.md
- test-runner.md
- result-verifier.md
Root cause: test-runner was giving overly optimistic results due to:
1. Context bias - knew the implementation, tended to defend it
2. No actual visual comparison - just wrote 'ACCEPTABLE' without looking
3. No structural validation - accepted 35x scale differences as 'acceptable'
Solution:
- New result-verifier agent that performs blind visual comparison
- Strict pass/fail criteria for structural validation
- Updated test-runner to use result-verifier for each figure
- Clear guidelines: structural mismatches = FAIL, not ACCEPTABLE
Test result: verifier correctly identified Fig3 as FAIL with 7 specific issues:
- Wrong X-axis variable (channels vs power)
- Wrong Y-axis scale (5x difference)
- Wrong curve count (5 vs 4)
- etc.