Root cause: test-runner was giving overly optimistic results due to:
1. Context bias - knew the implementation, tended to defend it
2. No actual visual comparison - just wrote 'ACCEPTABLE' without looking
3. No structural validation - accepted 35x scale differences as 'acceptable'

Solution:
- New result-verifier agent that performs blind visual comparison
- Strict pass/fail criteria for structural validation
- Updated test-runner to use result-verifier for each figure
- Clear guidelines: structural mismatches = FAIL, not ACCEPTABLE

Test result: verifier correctly identified Fig3 as FAIL with 7 specific issues:
- Wrong X-axis variable (channels vs power)
- Wrong Y-axis scale (5x difference)
- Wrong curve count (5 vs 4)
- etc.
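
The "structural mismatches = FAIL" rule could be sketched roughly as below. This is only an illustration, not the result-verifier's actual implementation: `FigureSummary`, `verify_structure`, and the `max_scale_ratio` threshold of 2.0 are hypothetical names and values, and it assumes each figure has already been reduced to a small structural summary (X-axis variable, Y-axis maximum, curve count). The sample values are loosely modelled on the Fig3 issues listed above; which side counts as the reference is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FigureSummary:
    """Hypothetical structural summary of one figure."""
    x_variable: str   # quantity plotted on the X axis
    y_max: float      # largest Y value shown, used as a proxy for axis scale
    curve_count: int  # number of curves in the figure


def verify_structure(reference, candidate, max_scale_ratio=2.0):
    """Return (verdict, issues). Any structural mismatch yields FAIL."""
    issues = []

    if reference.x_variable != candidate.x_variable:
        issues.append(
            f"Wrong X-axis variable ({candidate.x_variable} vs {reference.x_variable})"
        )

    # Compare Y scales as a ratio >= 1 so the check is symmetric.
    low, high = sorted([abs(reference.y_max), abs(candidate.y_max)])
    if low == 0:
        if high != 0:
            issues.append("Wrong Y-axis scale (one figure has zero range)")
    elif high / low > max_scale_ratio:
        issues.append(f"Wrong Y-axis scale ({high / low:.1f}x difference)")

    if reference.curve_count != candidate.curve_count:
        issues.append(
            f"Wrong curve count ({candidate.curve_count} vs {reference.curve_count})"
        )

    # No 'ACCEPTABLE' middle ground: the verdict is strictly PASS or FAIL.
    return ("FAIL" if issues else "PASS", issues)


# Illustrative values only, not measured data.
reference = FigureSummary(x_variable="power", y_max=1.0, curve_count=4)
candidate = FigureSummary(x_variable="channels", y_max=5.0, curve_count=5)
verdict, issues = verify_structure(reference, candidate)
```

Returning the list of issues alongside the verdict keeps the report specific, in the spirit of the "7 specific issues" noted for Fig3.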
28 lines
511 B
TOML
[project]
name = "resource-allocation"
version = "0.1.0"
description = "Replication of semantic-aware resource allocation"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.0.0",
    "numpy>=1.23.0",
    "matplotlib>=3.6.0",
    "scipy>=1.9.0",
    "tqdm>=4.65.0"
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0.0"
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src"]

[tool.pytest.ini_options]
pythonpath = ["."]