Initial commit: add project materials and code

This commit is contained in:
hc 2026-02-28 16:17:42 +08:00
commit 5efb877df7
198 changed files with 17541 additions and 0 deletions

.gitattributes vendored Normal file

@@ -0,0 +1,3 @@
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text

.gitignore vendored Normal file

@@ -0,0 +1,24 @@
# Python
__pycache__/
*.py[cod]
*.pyd
.pytest_cache/
.mypy_cache/
.ruff_cache/
# Virtual environments
.venv/
venv/
ENV/
env/
# Editors/IDEs
.vscode/
.idea/
# OS files
.DS_Store
Thumbs.db
# Logs
*.log

Binary file not shown.

code/API.md Normal file

@@ -0,0 +1,599 @@
# API Reference
This document describes the public classes and functions of the Co-MADDPG project in detail.
---
## Table of Contents
1. [Environment Module envs/](#1-environment-module-envs)
   - [ChannelModel](#channelmodel)
   - [SemanticModule](#semanticmodule)
   - [WirelessEnv](#wirelessenv)
2. [Algorithm Module agents/](#2-algorithm-module-agents)
   - [Actor](#actor)
   - [Critic](#critic)
   - [OUNoise](#ounoise)
   - [ReplayBuffer](#replaybuffer)
   - [CoMADDPG](#comaddpg)
3. [Baseline Module baselines/](#3-baseline-module-baselines)
   - [Common Interface](#common-interface)
   - [Baseline Differences](#baseline-differences)
4. [Utility Module utils/](#4-utility-module-utils)
   - [metrics.py](#metricspy)
   - [visualization.py](#visualizationpy)
5. [Entry Scripts](#5-entry-scripts)
   - [train.py](#trainpy)
   - [evaluate.py](#evaluatepy)
---
## 1. Environment Module envs/
### ChannelModel
**File**: `envs/channel_model.py`
3GPP Urban Micro NLOS channel model; responsible for path-loss computation, complex channel-gain generation, and SNR calculation.
```python
class ChannelModel:
    def __init__(self, config: dict) -> None
```
| Parameter | Type | Description |
|---|---|---|
| `config` | dict | Full configuration dict; must contain `config["env"]["carrier_freq"]`, `config["env"]["noise_psd"]`, `config["env"]["subcarrier_spacing"]` |
#### Methods
**`path_loss(distance) -> float`**
Computes the 3GPP UMi NLOS path loss.
| Parameter | Type | Description |
|---|---|---|
| `distance` | float / np.ndarray | Transmitter-receiver distance (m) |
| **Returns** | float / np.ndarray | Path loss (dB) |
Formula: `PL(d) = 36.7·log₁₀(d) + 22.7 + 26·log₁₀(fc)`
---
**`generate_channel(distances, num_subcarriers) -> np.ndarray`**
Generates the complex channel-gain matrix.
| Parameter | Type | Description |
|---|---|---|
| `distances` | np.ndarray (K,) | Distance of each user |
| `num_subcarriers` | int | Number of subcarriers N |
| **Returns** | np.ndarray (K, N) | Complex channel gains `h_{k,n} ~ CN(0, 10^{-PL/10})` |
---
**`compute_snr(channel_gains, power_alloc, noise_power) -> np.ndarray`**
Computes the per-user, per-subcarrier SNR.
| Parameter | Type | Description |
|---|---|---|
| `channel_gains` | np.ndarray (K, N) | Complex channel gains |
| `power_alloc` | np.ndarray (K, N) | Power-allocation matrix (W) |
| `noise_power` | float | Per-subcarrier noise power σ² (W) |
| **Returns** | np.ndarray (K, N) | SNR (linear scale) |
Formula: `γ_{k,n} = p_{k,n} · |h_{k,n}|² / σ²`
---
**`noise_power` (property) -> float**
Per-subcarrier thermal noise power (W).
Formula: `σ² = 10^{(N₀_dBm - 30)/10} · Δf`
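Taken together, the path-loss, noise-power, and SNR formulas above can be checked with a small stdlib-only sketch. Function names and the example power value (0.015 W per subcarrier) are illustrative, not the module's actual internals:

```python
import math
import random

def path_loss_db(d_m, fc_ghz=3.5):
    # PL(d) = 36.7·log10(d) + 22.7 + 26·log10(fc), fc in GHz
    return 36.7 * math.log10(d_m) + 22.7 + 26.0 * math.log10(fc_ghz)

def noise_power_w(noise_psd_dbm=-174.0, delta_f_hz=156_250.0):
    # σ² = 10^((N0_dBm - 30)/10) · Δf, converting dBm/Hz to W/Hz
    return 10 ** ((noise_psd_dbm - 30.0) / 10.0) * delta_f_hz

def rayleigh_gain(pl_db):
    # h ~ CN(0, 10^(-PL/10)): circularly symmetric complex Gaussian,
    # so each of the real/imaginary parts carries half the variance
    sigma = math.sqrt(10 ** (-pl_db / 10.0) / 2.0)
    return complex(random.gauss(0.0, sigma), random.gauss(0.0, sigma))

def snr_linear(p_w, h, sigma2_w):
    # γ = p·|h|²/σ²
    return p_w * abs(h) ** 2 / sigma2_w

pl = path_loss_db(100.0)      # path loss at 100 m and 3.5 GHz, ≈ 110.2 dB
sigma2 = noise_power_w()      # per-subcarrier noise power, ≈ 6.22e-16 W
gamma = snr_linear(0.015, rayleigh_gain(pl), sigma2)
```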
---
### SemanticModule
**File**: `envs/semantic_module.py`
Semantic communication quality module; computes SSim and semantic QoE.
```python
class SemanticModule:
    def __init__(self, config: dict) -> None
```
| Parameter | Type | Description |
|---|---|---|
| `config` | dict | Must contain `config["env"]["rho_max"]`, `rho_min`, `w1`, `w2` |
#### Methods
**`compute_ssim(avg_snr, rho) -> float`**
Computes the semantic similarity index.
| Parameter | Type | Description |
|---|---|---|
| `avg_snr` | float / np.ndarray | Average SNR (linear scale) |
| `rho` | float | Compression ratio ρ ∈ [ρ_min, ρ_max] |
| **Returns** | float / np.ndarray | SSim ∈ [0, 1] |
Formula: `φ(γ̄, ρ) = 1 - exp(-a(ρ)·γ̄^{b(ρ)})`, where `a(ρ) = 0.8/(ρ+0.1)`, `b(ρ) = 0.6+0.2·ρ`
---
**`compute_avg_snr(snr_per_subcarrier, allocation_mask) -> float`**
Computes the average SNR over the allocated subcarriers.
| Parameter | Type | Description |
|---|---|---|
| `snr_per_subcarrier` | np.ndarray | SNR of every subcarrier |
| `allocation_mask` | np.ndarray | Binary mask (1 = allocated) |
| **Returns** | float | Average SNR (0.0 when nothing is allocated) |
---
**`compute_semantic_qoe(ssim, rho, w1=None, w2=None, rho_max=None) -> float`**
Computes the QoE of a semantic user.
| Parameter | Type | Description |
|---|---|---|
| `ssim` | float | Semantic similarity ∈ [0, 1] |
| `rho` | float | Compression ratio |
| `w1`, `w2` | float, optional | Weights (defaults come from the config) |
| `rho_max` | float, optional | Maximum compression ratio (default from the config) |
| **Returns** | float | QoE ∈ [0, 1] |
Formula: `QoE_s = w1·SSim + w2·(1 - ρ/ρ_max)`
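As a sanity check, the SSim and QoE formulas above can be evaluated in a few lines of plain Python. The standalone functions below are a sketch with the documented default weights baked in, not the actual `SemanticModule` methods:

```python
import math

def compute_ssim(avg_snr, rho):
    # φ(γ̄, ρ) = 1 - exp(-a(ρ)·γ̄^b(ρ)), a(ρ) = 0.8/(ρ+0.1), b(ρ) = 0.6 + 0.2·ρ
    a = 0.8 / (rho + 0.1)
    b = 0.6 + 0.2 * rho
    return 1.0 - math.exp(-a * avg_snr ** b)

def compute_semantic_qoe(ssim, rho, w1=0.7, w2=0.3, rho_max=1.0):
    # QoE_s = w1·SSim + w2·(1 - ρ/ρ_max): reward fidelity and compression
    return w1 * ssim + w2 * (1.0 - rho / rho_max)

ssim = compute_ssim(avg_snr=10.0, rho=0.5)   # higher SNR pushes SSim toward 1
qoe = compute_semantic_qoe(ssim, rho=0.5)
```

Note the two limits: `compute_ssim(0.0, rho)` is exactly 0 (no signal, no semantic similarity), and SSim saturates at 1 as the average SNR grows.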
---
### WirelessEnv
**File**: `envs/wireless_env.py`
Gym-style wireless resource-allocation environment; manages channel state, executes actions, and computes QoE.
```python
class WirelessEnv:
    def __init__(self, config: dict)
```
| Attribute | Type | Description |
|---|---|---|
| `obs_dim` | int (property) | Observation dimension = N + 4 |
| `act_dim` | int (property) | Action dimension = 3 |
| `N` | int | Number of subcarriers |
| `K_s`, `K_b`, `K` | int | Number of semantic / traditional / total users |
#### Methods
**`reset() -> (obs_s, obs_b)`**
Resets the environment: re-randomizes user distances, channels, and auxiliary parameters.
| Returns | Type | Description |
|---|---|---|
| `obs_s` | np.ndarray (obs_dim,) | Semantic agent observation (float32) |
| `obs_b` | np.ndarray (obs_dim,) | Traditional agent observation (float32) |
---
**`step(action_s, action_b) -> (obs_s, obs_b, reward_s, reward_b, done, info)`**
Executes one step.
| Parameter | Type | Description |
|---|---|---|
| `action_s` | np.ndarray (3,) | Semantic agent action [sub_frac, power_frac, rho] |
| `action_b` | np.ndarray (3,) | Traditional agent action [sub_frac, power_frac, _] |
| Returns | Type | Description |
|---|---|---|
| `obs_s`, `obs_b` | np.ndarray | New observations |
| `reward_s`, `reward_b` | float | Each group's average QoE, used as the base reward |
| `done` | bool | Whether max_steps has been reached |
| `info` | dict | Detailed information (see table below) |
**Contents of the info dict:**
| Key | Type | Description |
|---|---|---|
| `qoe_semantic` | float | Average QoE of the semantic group |
| `qoe_traditional` | float | Average QoE of the traditional group |
| `qoe_sys` | float | System-wide average QoE |
| `qoe_list` | list[float] | Per-user QoE |
| `rates` | list[float] | Traditional-user rates (bps) |
| `ssim_values` | list[float] | SSim values of the semantic users |
| `rate_satisfaction` | float | Fraction of users meeting their rate requirement ∈ [0, 1] |
| `rho` | float | Compression ratio actually used |
| `n_sub_s`, `n_sub_b` | int | Number of subcarriers allocated to each group |
---
## 2. Algorithm Module agents/
### Actor
**File**: `agents/actor.py`
Deterministic policy network; outputs continuous actions in [0, 1].
```python
class Actor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_sizes: list = [256, 256, 128])
```
**`forward(obs) -> torch.Tensor`**
| Parameter | Type | Description |
|---|---|---|
| `obs` | Tensor (batch, obs_dim) | Observation |
| **Returns** | Tensor (batch, act_dim) | Action ∈ [0, 1], via `(tanh(x) + 1) / 2` |
---
### Critic
**File**: `agents/critic.py`
Joint Q-value network (CTDE); takes the observations and actions of all agents as input.
```python
class Critic(nn.Module):
    def __init__(self, obs_dim_total: int, act_dim_total: int, hidden_sizes: list = [512, 512, 256])
```
- `obs_dim_total` = obs_dim × 2 = 136
- `act_dim_total` = act_dim × 2 = 6
- Total input dimension = 142
**`forward(obs, act) -> torch.Tensor`**
| Parameter | Type | Description |
|---|---|---|
| `obs` | Tensor (batch, obs_dim_total) | Joint observation concat(obs_s, obs_b) |
| `act` | Tensor (batch, act_dim_total) | Joint action concat(act_s, act_b) |
| **Returns** | Tensor (batch, 1) | Q-value |
---
### OUNoise
**File**: `agents/noise.py`
Ornstein-Uhlenbeck exploration noise with linear sigma decay.
```python
class OUNoise:
    def __init__(self, size: int, mu: float = 0.0, theta: float = 0.15,
                 sigma_init: float = 0.2, sigma_min: float = 0.01, decay_period: int = 5000)
```
| Parameter | Description |
|---|---|
| `size` | Noise dimension (= act_dim = 3) |
| `theta` | Mean-reversion rate (default 0.15) |
| `sigma_init` | Initial standard deviation (default 0.2) |
| `sigma_min` | Minimum standard deviation (default 0.01) |
| `decay_period` | Linear decay period (default 5000 episodes) |
#### Methods
| Method | Description |
|---|---|
| `reset()` | Resets the noise state to μ |
| `sample() -> np.ndarray` | Samples one OU noise step |
| `decay_sigma(episode)` | Linearly decays sigma: `σ = max(σ_min, σ_init - (σ_init - σ_min) · episode / decay_period)` |
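The OU process and its decay schedule can be sketched with the standard library alone. `OUNoiseSketch` is an illustrative stand-in for the real `OUNoise` class (whose `sample()` returns a numpy array):

```python
import random

class OUNoiseSketch:
    """Minimal Ornstein-Uhlenbeck noise with linear sigma decay."""

    def __init__(self, size=3, mu=0.0, theta=0.15,
                 sigma_init=0.2, sigma_min=0.01, decay_period=5000):
        self.size, self.mu, self.theta = size, mu, theta
        self.sigma_init, self.sigma_min = sigma_init, sigma_min
        self.decay_period = decay_period
        self.sigma = sigma_init
        self.reset()

    def reset(self):
        # Reset the internal state to the mean μ
        self.state = [self.mu] * self.size

    def sample(self):
        # Per dimension: x ← x + θ·(μ - x) + σ·N(0, 1)
        self.state = [x + self.theta * (self.mu - x)
                      + self.sigma * random.gauss(0.0, 1.0)
                      for x in self.state]
        return list(self.state)

    def decay_sigma(self, episode):
        # σ = max(σ_min, σ_init - (σ_init - σ_min)·episode/decay_period)
        frac = episode / self.decay_period
        self.sigma = max(self.sigma_min,
                         self.sigma_init - (self.sigma_init - self.sigma_min) * frac)

noise = OUNoiseSketch()
noise.decay_sigma(2500)   # halfway through the decay period → σ = 0.105
```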
---
### ReplayBuffer
**File**: `agents/replay_buffer.py`
9-field experience replay buffer.
```python
class ReplayBuffer:
    def __init__(self, capacity: int = 100000)
```
#### Methods
**`push(obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, done)`**
Stores one transition. All arguments are numpy arrays or floats.
**`sample(batch_size) -> dict`**
Randomly samples a batch of transitions.
| Returned key | Type | Shape |
|---|---|---|
| `obs_s` | np.ndarray | (batch, obs_dim) |
| `obs_b` | np.ndarray | (batch, obs_dim) |
| `act_s` | np.ndarray | (batch, act_dim) |
| `act_b` | np.ndarray | (batch, act_dim) |
| `rew_s` | np.ndarray | (batch, 1) |
| `rew_b` | np.ndarray | (batch, 1) |
| `next_obs_s` | np.ndarray | (batch, obs_dim) |
| `next_obs_b` | np.ndarray | (batch, obs_dim) |
| `done` | np.ndarray | (batch, 1) |
**`__len__() -> int`**: number of transitions currently stored.
---
### CoMADDPG
**File**: `agents/co_maddpg.py`
Main Co-MADDPG algorithm, implementing the Stackelberg Leader-Follower update.
```python
class CoMADDPG:
    def __init__(self, config: dict)
```
| Key attribute | Type | Description |
|---|---|---|
| `actor_s`, `actor_b` | Actor | Semantic / traditional Actor networks |
| `critic_s`, `critic_b` | Critic | Semantic / traditional Critic networks |
| `actor_s_target`, ... | Actor/Critic | Target networks (4 in total) |
| `noise_s`, `noise_b` | OUNoise | Exploration noise |
| `buffer` | ReplayBuffer | Experience replay |
| `current_lambda` | float | Current λ(t) value |
| `device` | torch.device | Compute device |
#### Methods
**`select_action(obs_s, obs_b, explore=True) -> (act_s, act_b)`**
| Parameter | Description |
|---|---|
| `obs_s`, `obs_b` | np.ndarray (obs_dim,) — each agent's observation |
| `explore` | bool — whether to add OU noise |
| **Returns** | tuple(np.ndarray, np.ndarray) — actions ∈ [0, 1]³ |
---
**`compute_rewards(info) -> (rew_s, rew_b)`**
Computes the mixed rewards from the info dict. Updates `self.current_lambda` internally.
| Parameter | Description |
|---|---|
| `info` | dict — from env.step() |
| **Returns** | tuple(float, float) — mixed rewards |
Reward formulas:
```
r_coop_i = coop_self·qoe_i + coop_other·qoe_j + coop_sys·qoe_sys
r_comp_i = comp_self·qoe_i + comp_sys·qoe_sys
r_i = λ·r_coop_i + (1-λ)·r_comp_i
```
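Assuming the default weights and λ(t) parameters from the configuration section (β=5.0, Q_th=0.6, coop weights 0.5/0.3/0.2, comp weights 0.8/0.2), the reward mixing can be sketched as plain functions; `mixed_rewards` is illustrative, not the class method itself:

```python
import math

def compute_lambda(qoe_sys, beta=5.0, q_threshold=0.6):
    # λ(t) = sigmoid(β·(QoE_sys - Q_th)): more cooperative as system QoE rises
    return 1.0 / (1.0 + math.exp(-beta * (qoe_sys - q_threshold)))

def mixed_rewards(qoe_s, qoe_b, qoe_sys,
                  coop_self=0.5, coop_other=0.3, coop_sys=0.2,
                  comp_self=0.8, comp_sys=0.2):
    lam = compute_lambda(qoe_sys)

    def mix(qoe_i, qoe_j):
        # r_i = λ·r_coop_i + (1-λ)·r_comp_i
        r_coop = coop_self * qoe_i + coop_other * qoe_j + coop_sys * qoe_sys
        r_comp = comp_self * qoe_i + comp_sys * qoe_sys
        return lam * r_coop + (1.0 - lam) * r_comp

    return mix(qoe_s, qoe_b), mix(qoe_b, qoe_s)

r_s, r_b = mixed_rewards(qoe_s=0.8, qoe_b=0.6, qoe_sys=0.7)
```

Two useful checks: at `qoe_sys = Q_th` the sigmoid gives exactly λ = 0.5, and when both agents have identical QoE the cooperative and competitive terms coincide, so the mixing weight has no effect.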
---
**`update() -> dict`**
Performs the Stackelberg update. Returns a dict of losses.
| Returned key | Description |
|---|---|
| `critic_loss_b` | Follower Critic loss |
| `actor_loss_b` | Follower Actor loss |
| `critic_loss_s` | Leader Critic loss |
| `actor_loss_s` | Leader Actor loss |
| `lambda` | Current λ(t) |
---
**`save(path)` / `load(path)`**
Saves/loads all network parameters to/from the given directory.
| File | Contents |
|---|---|
| `model_s.pth` | Actor S + Critic S + their target networks |
| `model_b.pth` | Actor B + Critic B + their target networks |
---
## 3. Baseline Module baselines/
### Common Interface
All 7 baselines implement the same interface as CoMADDPG:
```python
def __init__(self, config: dict)
def select_action(obs_s, obs_b, explore=True) -> (act_s, act_b)
def compute_rewards(info) -> (rew_s, rew_b)
def update() -> dict or None
def save(path)
def load(path)
# Attributes
self.buffer: ReplayBuffer   # or equivalent
self.noise_s: OUNoise       # only some baselines (used by the hasattr check in train.py)
self.noise_b: OUNoise
```
### Baseline Differences
| Baseline class | File | λ | Update scheme | Critic | Special classes |
|---|---|---|---|---|---|
| `PureCooperative` | `pure_coop.py` | fixed 1.0 | Simultaneous | Joint | — |
| `PureCompetitive` | `pure_comp.py` | fixed 0.0 | Simultaneous | Joint | — |
| `FixedLambda` | `fixed_lambda.py` | fixed 0.5 | Stackelberg | Joint | — |
| `IndependentDDPG` | `iddpg.py` | 0.0 | Simultaneous | Independent | `IndependentCritic` |
| `SingleAgentDQN` | `single_dqn.py` | 0.5 | N/A (centralized) | Centralized | `DQNNet`, `DQNReplayBuffer`, `EpsilonAdapter` |
| `EqualAllocation` | `equal_alloc.py` | 0.5 | N/A (no learning) | None | `DummyBuffer` |
| `SemanticOnly` | `semantic_only.py` | 1.0 | N/A (single policy) | Single | `SemanticCritic`, `SemanticBuffer` |
#### Notes
**SingleAgentDQN**: 48 discrete actions = 4 (sub_levels) × 4 (power_levels) × 3 (rho_levels). Uses `EpsilonAdapter` to emulate the `noise_s.decay_sigma()` interface.
**EqualAllocation**: no learning; always outputs `[0.5, 0.5, 0.5]`. `DummyBuffer` provides `push()` and `__len__()` but stores nothing.
**IndependentDDPG**: `IndependentCritic` takes a single agent's `(obs, act)` instead of the joint input, ablating the effect of CTDE.
---
## 4. Utility Module utils/
### metrics.py
**File**: `utils/metrics.py`
#### Functions
**`jain_fairness(values) -> float`**
Jain's fairness index. `J = (Σx_i)² / (n·Σx_i²)`, range [1/n, 1].
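A direct transcription of the index (illustrative, not the module's exact implementation; the zero-denominator fallback here is an assumption):

```python
def jain_fairness(values):
    # J = (Σx_i)² / (n·Σx_i²); J = 1 when all values are equal,
    # and falls toward 1/n as one user dominates.
    n = len(values)
    total = sum(values)
    squares = sum(x * x for x in values)
    return (total * total) / (n * squares) if squares > 0 else 0.0

jain_fairness([0.8, 0.8, 0.8])   # perfectly fair allocation → 1.0
jain_fairness([1.0, 0.0, 0.0])   # one user takes everything → 1/3
```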
---
**`rate_satisfaction(rates, min_rate) -> float`**
Rate-satisfaction ratio: the fraction of users with `R_k ≥ R_req`.
---
**`compute_system_qoe(qoe_list) -> float`**
System-level QoE = the mean QoE over all users.
---
**`compute_lambda(qoe_sys, beta=5.0, q_threshold=0.6) -> float`**
Dynamic cooperation weight. `λ = 1 / (1 + exp(-β·(QoE_sys - Q_th)))`
---
**`compute_mixed_reward(qoe_s, qoe_b, qoe_sys, lam, reward_config) -> (float, float)`**
Computes the mixed reward. `r_i = λ·r_coop_i + (1-λ)·r_comp_i`
---
**`moving_average(data, window) -> np.ndarray`**
Moving-average smoothing.
---
### visualization.py
**File**: `utils/visualization.py`
IEEE-style plotting utilities, covering the 12 figures of Section VII in the paper.
```python
class Plotter:
    def __init__(self, save_dir: str = "results/figures")
```
#### ALGO_STYLES
Built-in style dict assigning a color, marker, and line style to each of the 8 algorithms:
```python
ALGO_STYLES = {
    "Co-MADDPG": {"color": "#E74C3C", "marker": "o", "linestyle": "-"},
    "PureCooperative": {"color": "#3498DB", "marker": "s", "linestyle": "--"},
    "PureCompetitive": {"color": "#2ECC71", "marker": "^", "linestyle": "--"},
    ...
}
```
#### Plotting methods
| Method | Figure | Argument |
|---|---|---|
| `plot_convergence(data)` | Fig.2 | `{algo: [episode_rewards]}` |
| `plot_qoe_vs_snr(data)` | Fig.3 | `{algo: {snr: qoe}}` |
| `plot_fairness_vs_snr(data)` | Fig.4 | `{algo: {snr: fairness}}` |
| `plot_qoe_vs_users(data)` | Fig.5 | `{algo: {n_users: qoe}}` |
| `plot_rate_satisfaction(data)` | Fig.6 | `{algo: {n_users: rate_sat}}` |
| `plot_lambda_trajectory(lambdas)` | Fig.7 | `[λ_1, λ_2, ...]` |
| `plot_lambda_qoe_scatter(lambdas, qoes)` | Fig.8 | two equal-length lists |
| `plot_qoe_vs_semantic_ratio(data)` | Fig.9 | `{algo: {ratio: qoe}}` |
| `plot_ablation(data)` | Fig.10 | `{algo: qoe_mean}` |
| `plot_beta_sensitivity(data)` | Fig.11 | `{beta: qoe}` |
| `plot_qth_sensitivity(data)` | Fig.12 | `{qth: qoe}` |
All methods automatically save a PNG (300 DPI) to `save_dir`.
---
## 5. Entry Scripts
### train.py
Training entry point with CLI arguments.
```bash
python train.py [--algo ALGO] [--config PATH] [--episodes N] [--steps N] [--seed N]
```
| Argument | Default | Description |
|---|---|---|
| `--algo` | `co_maddpg` | Algorithm name (`co_maddpg`, `pure_coop`, `all`, etc.) |
| `--config` | `configs/default.yaml` | Path to the config file |
| `--episodes` | from config (5000) | Number of training episodes |
| `--steps` | from config (200) | Steps per episode |
| `--seed` | from config (42) | Random seed |
**Key functions:**
- `load_config(path)` — loads the YAML config
- `get_algorithm(name, config)` — factory function returning an algorithm instance
- `train_single(algo_name, config)` — trains a single algorithm
- `train_all(config)` — trains all 8 algorithms
---
### evaluate.py
Evaluation entry point; runs 8 scenarios and produces 12+ figures.
```bash
python evaluate.py [--results_dir PATH] [--config PATH]
```
**The 8 evaluation scenarios:**
| # | Function | Description |
|---|---|---|
| 1 | `scenario_convergence()` | Plots the training convergence curves |
| 2 | `scenario_qoe_vs_snr()` | Sweeps SNR (by adjusting noise_psd) |
| 3 | `scenario_fairness_vs_snr()` | Fairness at different SNRs |
| 4 | `scenario_qoe_vs_users()` | Sweeps the number of users |
| 5 | `scenario_rate_satisfaction()` | Rate satisfaction at different user counts |
| 6 | `scenario_lambda_dynamics()` | λ(t) evolution over time |
| 7 | `scenario_ablation()` | Ablation comparison |
| 8 | `scenario_sensitivity()` | β and Q_th sensitivity |
---
## Type Conventions
| Convention | Description |
|---|---|
| All observations/actions | numpy float32 |
| Neural-network inputs | torch.FloatTensor (converted automatically) |
| Config parameters | loaded from YAML, original types preserved |
| Rewards/QoE | Python float |
| Channel gains | numpy complex128 |
| Boolean done | Python bool |

code/ARCHITECTURE.md Normal file

@@ -0,0 +1,342 @@
# Architecture Document
## System Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ train.py (entry point)                                          │
│ parse CLI → load config → init env + algorithm → training loop  │
│ → save model                                                    │
└──────────┬──────────────────────────────┬───────────────────────┘
           │                              │
           ▼                              ▼
┌─────────────────────┐        ┌──────────────────────────┐
│ envs/ (environment) │        │ agents/ + baselines/     │
│                     │        │ (algorithm layer)        │
│ WirelessEnv         │◄──────►│ CoMADDPG / 7 baselines   │
│ ├─ ChannelModel     │  obs,  │ ├─ Actor (policy net)    │
│ ├─ SemanticModule   │ reward,│ ├─ Critic (value net)    │
│ └─ step/reset       │  done  │ ├─ ReplayBuffer          │
└─────────────────────┘        │ └─ OUNoise               │
                               └──────────────────────────┘
                               ┌──────────────────────────┐
                               │ utils/ (utility layer)   │
                               │ metrics.py (metrics)     │
                               │ visualization.py (plots) │
                               └──────────────────────────┘
                               ┌──────────────────────────┐
                               │ evaluate.py (evaluation) │
                               │ 8 scenarios, 12+ figures │
                               └──────────────────────────┘
```
---
---
## Module Dependencies
```
configs/default.yaml
  ├──► envs/channel_model.py    (reads env.carrier_freq, env.noise_psd, env.subcarrier_spacing)
  ├──► envs/semantic_module.py  (reads env.rho_max, env.rho_min, env.w1, env.w2)
  ├──► envs/wireless_env.py     (reads env.* and training.max_steps)
  │      ├── uses ChannelModel
  │      └── uses SemanticModule
  ├──► agents/co_maddpg.py      (reads env.num_subcarriers, training.*, network.*, reward.*)
  │      ├── uses Actor (agents/actor.py)
  │      ├── uses Critic (agents/critic.py)
  │      ├── uses ReplayBuffer (agents/replay_buffer.py)
  │      └── uses OUNoise (agents/noise.py)
  ├──► baselines/*.py           (each baseline reuses Actor, Critic, ReplayBuffer, OUNoise)
  └──► utils/metrics.py         (reads reward.* weights)
utils/visualization.py          (standalone, no config dependency)
```
---
---
## Data Flow
### Training Loop (Single Episode)
```
┌─── Episode start ───────────────────┐
│                                     │
▼                                     │
env.reset()                           │
  → (obs_s, obs_b)                    │
│                                     │
┌───► Step loop (200 steps) ◄────┐    │
│                                │    │
│  agent.select_action           │    │
│    (obs_s) → act_s             │    │
│    (obs_b) → act_b             │    │
│                                │    │
│  env.step(act_s, act_b)        │    │
│    → (obs_s', obs_b',          │    │
│       rew_s, rew_b,            │    │
│       done, info)              │    │
│                                │    │
│  agent.compute_rewards         │    │
│    (info) → (r_s, r_b)         │    │
│                                │    │
│  buffer.push(                  │    │
│    obs_s, obs_b,               │    │
│    act_s, act_b,               │    │
│    r_s, r_b,                   │    │
│    obs_s', obs_b',             │    │
│    done)                       │    │
│                                │    │
│  agent.update()                │    │
│                                │    │
│  not done ─────────────────────┘    │
│                                     │
└─── done ─── noise.decay() ──────────┘
      save model + log
```
### Environment step() Internal Flow (8 Steps)
```
Action Decode Subcarrier Alloc Power Alloc
(act_s, act_b) → Greedy by channel → Equal within group
│ │ │
▼ ▼ ▼
n_sub_s, n_sub_b sem_subs, trad_subs power_matrix (K×N)
│ │
▼ ▼
┌─── SNR = p·|h|²/σ² ───┐
│ │
┌─────────┴──────┐ ┌──────────┴──────┐
│ Traditional │ │ Semantic │
│ Rate = Σ Δf· │ │ avg_SNR → │
│ log₂(1+γ) │ │ SSim(γ̄, ρ) → │
│ QoE_b = min │ │ QoE_s = w1·SSim │
│ (R/R_req, 1) │ │ + w2·(1-ρ/ρ_max)│
└───────┬────────┘ └────────┬─────────┘
│ │
└────────┬───────────────┘
QoE_sys = mean(all QoE)
Regenerate channel (block fading)
Build (obs_s', obs_b', rew_s, rew_b, done, info)
```
---
## Co-MADDPG Stackelberg Update Mechanism
This is the core novelty of the algorithm. The update order reflects the Leader-Follower game structure:
```
┌─────────────────────────────────────────────────────────────────┐
│ Stackelberg Update                                              │
│                                                                 │
│ PHASE 1: Update the Follower (Agent B) first                    │
│ ┌───────────────────────────────────────────────────────┐       │
│ │ 1. Critic B: L_B = (Q_B(s,a) - y_B)²                  │       │
│ │    y_B = r_B + γ·Q_B'(s', a_s'_target, a_b'_target)   │       │
│ │ 2. Actor B: max Q_B(s, a_s_current, π_B(o_b))         │       │
│ │ 3. Soft update: θ_B_target ← τ·θ_B + (1-τ)·θ_B_tgt    │       │
│ └───────────────────────────────────────────────────────┘       │
│                      │                                          │
│                      ▼ (B's policy is now updated)              │
│                                                                 │
│ PHASE 2: Update the Leader (Agent S) with B's best response     │
│ ┌───────────────────────────────────────────────────────┐       │
│ │ 1. Get B's best response: a_b_br = π_B_updated(o_b)   │       │
│ │    .detach() — no gradient flows back into B          │       │
│ │ 2. Critic S: L_S = (Q_S(s, a) - y_S)²                 │       │
│ │ 3. Actor S: max Q_S(s, π_S(o_s), a_b_br)              │       │
│ │    The Leader optimizes against the Follower's        │       │
│ │    best response                                      │       │
│ │ 4. Soft update: θ_S_target ← τ·θ_S + (1-τ)·θ_S_tgt    │       │
│ └───────────────────────────────────────────────────────┘       │
│                                                                 │
│ PHASE 3: Update the dynamic λ                                   │
│ ┌───────────────────────────────────────────────────────┐       │
│ │ λ(t) = sigmoid(β · (QoE_sys - Q_th))                  │       │
│ │ β=5.0 controls the switching steepness,               │       │
│ │ Q_th=0.6 is the switching threshold                   │       │
│ └───────────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────┘
```
---
---
## Reward Computation Flow
```
env.step() outputs the info dict
        ┌─────────┴─────────┐
        │ qoe_semantic      │
        │ qoe_traditional   │
        │ qoe_sys           │
        └─────────┬─────────┘
    agent.compute_rewards(info)
        ┌────────────┴────────────┐
        ▼                         ▼
 r_coop_s =                r_coop_b =
   0.5·qoe_s +               0.5·qoe_b +
   0.3·qoe_b +               0.3·qoe_s +
   0.2·qoe_sys               0.2·qoe_sys
        │                         │
 r_comp_s =                r_comp_b =
   0.8·qoe_s +               0.8·qoe_b +
   0.2·qoe_sys               0.2·qoe_sys
        │                         │
        └────────────┬────────────┘
    λ = sigmoid(β·(qoe_sys - Q_th))
    r_s = λ·r_coop_s + (1-λ)·r_comp_s
    r_b = λ·r_coop_b + (1-λ)·r_comp_b
```
---
---
## Observation & Action Spaces
### Observation Space (obs_dim = N + 4 = 68)
| Dimension | Contents (semantic Agent S) | Contents (traditional Agent B) |
|---|---|---|
| [0 : N] | Normalized average channel power of semantic users | Normalized average channel power of traditional users |
| [N] | qoe_avg_s (rolling-average QoE) | qoe_avg_b (rolling-average QoE) |
| [N+1] | content_sensitivity | business_priority |
| [N+2] | alloc_s (current subcarrier-allocation share) | alloc_b (current subcarrier-allocation share) |
| [N+3] | load_s (traffic load) | load_b (traffic load) |
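The layout above can be made concrete with a small sketch that assembles a 68-dimensional observation vector. Names and values here are illustrative; the real layout lives in `envs/wireless_env.py`:

```python
N = 64  # number of subcarriers

def build_obs(channel_power, qoe_avg, aux, alloc_share, load):
    # Layout: [0:N] normalized channel power, [N] rolling-average QoE,
    # [N+1] auxiliary attribute (content_sensitivity or business_priority),
    # [N+2] subcarrier-allocation share, [N+3] traffic load.
    assert len(channel_power) == N
    return list(channel_power) + [qoe_avg, aux, alloc_share, load]

obs_s = build_obs([0.1] * N, qoe_avg=0.75, aux=0.5, alloc_share=0.5, load=0.3)
# len(obs_s) == N + 4 == 68, matching obs_dim
```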
### Action Space (act_dim = 3, continuous in [0,1])
| Dimension | Meaning (semantic Agent S) | Meaning (traditional Agent B) |
|---|---|---|
| [0] | Requested subcarrier share n_sub_frac | Requested subcarrier share n_sub_frac |
| [1] | Power-allocation share p_frac | Power-allocation share p_frac |
| [2] | Compression ratio ρ (mapped to [ρ_min, ρ_max]) | Redundant parameter (unused) |
---
## Network Architecture
### Actor Network
```
Input: obs (68,)
├─ Linear(68 → 256) + ReLU
├─ Linear(256 → 256) + ReLU
├─ Linear(256 → 128) + ReLU
├─ Linear(128 → 3)
└─ (Tanh + 1) / 2 → output ∈ [0, 1]³
```
### Critic Network (Joint, CTDE)
```
Input: concat(obs_s, obs_b, act_s, act_b) = (142,)
├─ Linear(142 → 512) + ReLU
├─ Linear(512 → 512) + ReLU
├─ Linear(512 → 256) + ReLU
└─ Linear(256 → 1) → Q-value (scalar)
```
---
## Replay Buffer
9-field transitions:
```
Transition = (obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, done)
              (68,)  (68,)  (3,)   (3,)   (1,)   (1,)   (68,)       (68,)       (1,)
```
- Capacity: 100,000 transitions
- Sampling: uniform random, batch_size = 256
- Storage: deque with FIFO eviction
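A minimal stdlib sketch of this buffer; `ReplayBufferSketch` is illustrative (the real class stores numpy arrays and returns stacked batches):

```python
import random
from collections import deque

class ReplayBufferSketch:
    """Minimal 9-field replay buffer: deque storage with FIFO eviction."""

    FIELDS = ("obs_s", "obs_b", "act_s", "act_b",
              "rew_s", "rew_b", "next_obs_s", "next_obs_b", "done")

    def __init__(self, capacity=100_000):
        # deque(maxlen=...) silently drops the oldest entry when full (FIFO)
        self.storage = deque(maxlen=capacity)

    def push(self, *transition):
        assert len(transition) == len(self.FIELDS)
        self.storage.append(transition)

    def sample(self, batch_size):
        batch = random.sample(list(self.storage), batch_size)
        # Transpose the list of transitions into a dict of per-field batches
        return {name: [t[i] for t in batch] for i, name in enumerate(self.FIELDS)}

    def __len__(self):
        return len(self.storage)

buf = ReplayBufferSketch(capacity=3)
for step in range(5):
    buf.push([0.0], [0.0], [0.5], [0.5], 0.1, 0.1, [0.0], [0.0], False)
# Only the 3 most recent transitions remain after 5 pushes
```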
---
## Agent Interface Contract
All 8 algorithms must implement the following interface to be compatible with the training loop in `train.py`:
```python
class AgentInterface:
    # Must expose one of these attributes
    self.buffer: ReplayBuffer          # checked first
    self.replay_buffer: ReplayBuffer   # fallback
    # Action selection
    def select_action(obs_s, obs_b, explore=True) -> (act_s, act_b)
    # Reward computation
    def compute_rewards(info) -> (rew_s, rew_b)
    # Network update
    def update() -> dict or None
    # Save/load model
    def save(path)
    def load(path)
    # Noise decay (optional; train.py checks via hasattr)
    self.noise_s: OUNoise   # must provide a decay_sigma(episode) method
    self.noise_b: OUNoise
```
---
## Channel Model
Based on the 3GPP Urban Micro (UMi) NLOS model:
```
Distance:     d ~ U(50, 500) m
Path loss:    PL(d) = 36.7·log₁₀(d) + 22.7 + 26·log₁₀(fc)  [dB]
Channel gain: h_{k,n} ~ CN(0, 10^{-PL/10})  (Rayleigh fading)
Noise power:  σ² = 10^{(N₀_dBm - 30)/10} · Δf  [W]
SNR:          γ_{k,n} = p_{k,n} · |h_{k,n}|² / σ²
```
- Block fading: the channel is regenerated at every step
- K_s = 3 semantic users + K_b = 3 traditional users = 6 users
- N = 64 OFDM subcarriers
- Subcarrier allocation: greedy (semantic users first)
- Power allocation: equal split within each group
---
## Design Decisions
### 1. Why Stackelberg rather than Nash?
- Stackelberg fits the heterogeneous setting: semantic users (Leader) decide first, traditional users (Follower) respond optimally
- Guarantees equilibrium existence (Theorems 1-2, proved in the paper)
### 2. Why a sigmoid for λ(t)?
- Continuous and differentiable, suitable for gradient-based training
- β controls the switching steepness, Q_th the switching point
- High system QoE biases toward cooperation (λ→1), low QoE toward competition (λ→0)
### 3. Why do observations include 4 extra dimensions?
- Channel information alone is not enough: agents need the current QoE level, traffic load, and allocation state
- The extra information lets agents make more context-aware decisions
### 4. Why a joint Critic (CTDE)?
- Centralized training can access all information, mitigating non-stationarity
- Decentralized execution uses only each agent's Actor, reducing communication overhead
### 5. Why do semantic users get subcarriers first?
- Reflects the Leader's first-mover advantage
- Consistent with the Stackelberg game structure

code/README.md Normal file

@@ -0,0 +1,286 @@
# Co-MADDPG: Cooperative-Competitive Multi-Agent Resource Allocation for Semantic-Traditional Hybrid Wireless Communication
---
## Project Overview
This project implements the Co-MADDPG algorithm — a multi-agent deep reinforcement learning framework based on Stackelberg game dynamics and dynamic cooperation-competition switching for OFDMA wireless resource allocation in semantic-traditional hybrid communication systems.
### Key Innovations
1. **Coopetition Game Modeling**: models the resource competition between semantic users (Leader) and traditional users (Follower) as a Stackelberg game
2. **Dynamic λ(t) Switching**: `λ(t) = sigmoid(β·(QoE_sys - Q_th))` adaptively switches between cooperation and competition based on system QoE
3. **Heterogeneous QoE**: semantic users are scored by SSim + compression ratio, traditional users by rate satisfaction
4. **CTDE Architecture**: centralized training with decentralized execution, using joint Critic networks
### Target Venue
IEEE Transactions on Communications (TCOM)
---
---
## Requirements
### Python Version
- Python 3.8+
### Dependencies
```bash
pip install numpy torch pyyaml matplotlib
```
| Library | Version | Purpose |
|---|---|---|
| `numpy` | ≥1.20 | Numerical computation, channel modeling |
| `torch` | ≥1.10 (CPU or GPU) | Neural network training |
| `pyyaml` | ≥5.0 | Configuration file loading |
| `matplotlib` | ≥3.4 | IEEE-style plotting |
### Hardware Recommendations
| Scenario | Configuration |
|---|---|
| Smoke test | CPU, 2-5 episodes, ~30 s |
| Short training | CPU/GPU, 100-500 episodes, ~10-60 min |
| Full training | GPU (CUDA), 5000 episodes, ~2-8 h |
---
---
## Quick Start
### 1. Clone
```bash
git clone <repo-url>
cd SemantiCommunication/code
```
### 2. Smoke Test
```bash
# Train Co-MADDPG for 2 episodes to verify the code path
python train.py --algo co_maddpg --episodes 2 --steps 10
# Train all 8 algorithms for 2 episodes each
python train.py --algo all --episodes 2 --steps 10
```
### 3. Full Training
```bash
# Train a single algorithm (run the main algorithm first)
python train.py --algo co_maddpg --episodes 5000
# Train all 8 algorithms
python train.py --algo all --episodes 5000
# Specify a config file
python train.py --algo co_maddpg --config configs/default.yaml --episodes 5000
```
### 4. Evaluation & Plotting
```bash
# Run all 8 evaluation scenarios and produce 12+ figures
python evaluate.py
# Specify the results directory
python evaluate.py --results_dir results/
```
---
---
## Supported Algorithms
| # | Algorithm | CLI name | λ | Update | Critic type | Purpose |
|---|---|---|---|---|---|---|
| 1 | **Co-MADDPG** | `co_maddpg` | dynamic | Stackelberg | Joint (CTDE) | Proposed |
| 2 | PureCooperative | `pure_coop` | 1.0 | Simultaneous | Joint | Ablate competition |
| 3 | PureCompetitive | `pure_comp` | 0.0 | Simultaneous | Joint | Ablate cooperation |
| 4 | FixedLambda | `fixed_lambda` | 0.5 | Stackelberg | Joint | Ablate dynamic λ |
| 5 | IDDPG | `iddpg` | 0.0 | Simultaneous | Independent | Ablate CTDE |
| 6 | SingleAgentDQN | `single_dqn` | 0.5 | N/A | Centralized | Non-MARL baseline |
| 7 | EqualAllocation | `equal_alloc` | 0.5 | N/A | None | Lower bound |
| 8 | SemanticOnly | `semantic_only` | 1.0 | N/A | Single | Ablate heterogeneity |
---
---
## Project Structure
```
SemantiCommunication/code/
├── configs/                 # Configuration
│   ├── __init__.py
│   └── default.yaml         # Main config (hyperparameters, env parameters)
├── envs/                    # Environment modules
│   ├── __init__.py
│   ├── channel_model.py     # 3GPP channel model (Eq.5-8)
│   ├── semantic_module.py   # Semantic similarity (SSim)
│   └── wireless_env.py      # Gym-like wireless environment
├── agents/                  # Core algorithm
│   ├── __init__.py
│   ├── actor.py             # Actor network FC→Tanh→[0,1]
│   ├── critic.py            # Critic network (joint Q-value)
│   ├── noise.py             # OU exploration noise
│   ├── replay_buffer.py     # 9-field replay buffer
│   └── co_maddpg.py         # Main Co-MADDPG algorithm (★)
├── baselines/               # 7 baseline algorithms
│   ├── __init__.py
│   ├── pure_coop.py         # λ=1, purely cooperative
│   ├── pure_comp.py         # λ=0, purely competitive
│   ├── fixed_lambda.py      # λ=0.5, fixed
│   ├── iddpg.py             # Independent DDPG (no CTDE)
│   ├── single_dqn.py        # Centralized DQN
│   ├── equal_alloc.py       # Equal allocation
│   └── semantic_only.py     # Semantic-only DDPG
├── utils/                   # Utility modules
│   ├── __init__.py
│   ├── metrics.py           # Metrics (Jain fairness, λ, rewards)
│   └── visualization.py     # IEEE-style plotting (12 figure types)
├── train.py                 # Training entry point (★)
├── evaluate.py              # Evaluation entry point (★)
├── README.md                # This file
├── ARCHITECTURE.md          # Architecture document
├── API.md                   # API reference
└── results/                 # Training output directory
```
---
---
## Configuration
The config file lives at `configs/default.yaml` and has 4 main sections:
### env (environment parameters)
| Parameter | Default | Description |
|---|---|---|
| `num_subcarriers` | 64 | Number of OFDMA subcarriers N |
| `bandwidth` | 10.0e+6 | System bandwidth (Hz) |
| `subcarrier_spacing` | 156250.0 | Subcarrier spacing Δf (Hz) |
| `max_power` | 1.0 | Maximum transmit power (W) |
| `noise_psd` | -174 | Noise power spectral density (dBm/Hz) |
| `carrier_freq` | 3.5 | Carrier frequency (GHz) |
| `num_semantic_users` | 3 | Number of semantic users K_s |
| `num_traditional_users` | 3 | Number of traditional users K_b |
| `min_rate_req` | 5.0e+5 | Minimum rate requirement of traditional users (bps) |
| `rho_max` / `rho_min` | 1.0 / 0.05 | Compression-ratio range |
| `w1` / `w2` | 0.7 / 0.3 | Semantic QoE weights |
### training (training parameters)
| Parameter | Default | Description |
|---|---|---|
| `max_episodes` | 5000 | Maximum training episodes |
| `max_steps` | 200 | Maximum steps per episode |
| `batch_size` | 256 | Batch size |
| `buffer_capacity` | 100000 | Replay-buffer capacity |
| `actor_lr` / `critic_lr` | 1e-4 / 3e-4 | Learning rates |
| `gamma` | 0.95 | Discount factor |
| `tau` | 0.01 | Soft-update coefficient |
| `beta` | 5.0 | Steepness of the λ(t) sigmoid |
| `q_threshold` | 0.6 | λ(t) switching threshold Q_th |
### network (network architecture)
| Parameter | Default | Description |
|---|---|---|
| `actor_hidden` | [256, 256, 128] | Actor hidden layers |
| `critic_hidden` | [512, 512, 256] | Critic hidden layers |
### reward (reward weights)
| Parameter | Default | Description |
|---|---|---|
| `coop_self` / `coop_other` / `coop_sys` | 0.5 / 0.3 / 0.2 | Cooperative reward weights |
| `comp_self` / `comp_sys` | 0.8 / 0.2 | Competitive reward weights |
---
---
## Key Formulas
| Formula | Expression | Paper Eq. |
|---|---|---|
| Path loss | `PL(d) = 36.7·log₁₀(d) + 22.7 + 26·log₁₀(fc)` | Eq.(5) |
| Channel gain | `h_{k,n} ~ CN(0, 10^{-PL/10})` | Eq.(6) |
| Noise power | `σ² = 10^{(N₀_dBm-30)/10} · Δf` | Eq.(7) |
| SNR | `γ_{k,n} = p_{k,n} · \|h_{k,n}\|² / σ²` | Eq.(8) |
| Semantic similarity (SSim) | `φ(γ̄,ρ) = 1 - exp(-a(ρ)·γ̄^{b(ρ)})` | — |
| Semantic QoE | `QoE_s = 0.7·SSim + 0.3·(1-ρ/ρ_max)` | — |
| Traditional QoE | `QoE_b = min(R_k/R_req, 1)` | — |
| Dynamic λ | `λ(t) = sigmoid(β·(QoE_sys - Q_th))` | — |
| Mixed reward | `r_i = λ·r_coop + (1-λ)·r_comp` | — |
---
---
## Evaluation Scenarios
`evaluate.py` contains 8 evaluation scenarios, covering the 12 figures of Section VII in the paper:
| # | Scenario | Figure | Description |
|---|---|---|---|
| 1 | Convergence | Fig.2 | Convergence-curve comparison |
| 2 | QoE vs SNR | Fig.3 | System QoE at different SNRs |
| 3 | Fairness vs SNR | Fig.4 | Jain fairness at different SNRs |
| 4 | QoE vs Users | Fig.5 | Scalability in the number of users |
| 5 | Rate Satisfaction vs Users | Fig.6 | Rate satisfaction |
| 6 | Lambda Trajectory | Fig.7-8 | λ(t) trajectory and scatter plot |
| 7 | Ablation Study | Fig.10 | Ablation bar chart |
| 8 | Sensitivity | Fig.11-12 | β and Q_th sensitivity analysis |
---
---
## Output Files
Training and evaluation write their outputs to the `results/` directory:
```
results/
├── <algo_name>/
│   ├── model_s.pth            # Semantic-agent model weights
│   ├── model_b.pth            # Traditional-agent model weights
│   ├── training_log.json      # Training-metrics log
│   └── config_snapshot.yaml   # Config snapshot taken at training time
├── figures/
│   ├── fig02_convergence.png
│   ├── fig03_qoe_vs_snr.png
│   ├── ...
│   └── fig12_qth_sensitivity.png
└── evaluation_results.json    # Aggregated evaluation data
```
---
---
## Known Issues & Notes
1. **YAML scientific notation**: use the `5.0e+5` form (not `500.0e3`); otherwise `yaml.safe_load()` parses the value as a string, since PyYAML only recognizes floats whose exponent carries an explicit sign
2. **Smoke-test QoE values**: in a 2-episode smoke test all algorithms produce similar QoE (~0.7-0.9) because the networks are barely trained; full training (5000 episodes) is needed before clear differences appear
3. **GPU acceleration**: CUDA is auto-detected by default; CPU training is slower but fully functional
4. **Random seed**: defaults to seed=42, configurable in the config file
---
## Citation
To cite this work, please refer to the paper:
> Co-MADDPG: Cooperative-Competitive Multi-Agent Resource Allocation for Semantic-Traditional Hybrid Wireless Communication
The paper is located in the `../paper/` directory.
---
## License
MIT License

code/agents/__init__.py Normal file

@@ -0,0 +1,6 @@
"""Agent modules for Co-MADDPG wireless resource allocation."""
from .noise import OUNoise
from .replay_buffer import ReplayBuffer
__all__ = ["OUNoise", "ReplayBuffer"]

code/agents/actor.py Normal file

@@ -0,0 +1,61 @@
"""
Actor Network for Wireless Resource Allocation

This file defines the Actor network architecture for the Co-MADDPG project.
The Actor maps local observations to deterministic resource allocation actions.

Network Architecture:
    FC(obs_dim → 256 → 256 → 128 → act_dim)
Output Mapping: (Tanh + 1) / 2 ∈ [0, 1]

Reference: Section 3.2.1 Actor-Critic Structure in the project paper.
"""
import torch
import torch.nn as nn
class Actor(nn.Module):
    """
    Actor network mapping observations to deterministic actions in [0, 1].

    Architecture: FC(obs_dim → 256 → 256 → 128 → act_dim)
    Paper Ref: Section 3.2.1 - Policy Network implementation.

    Args:
        obs_dim (int): Dimension of the observation space.
        act_dim (int): Dimension of the action space.
        hidden_sizes (list): Sizes of the three hidden layers (default: [256, 256, 128]).
    """
    def __init__(self, obs_dim, act_dim, hidden_sizes=[256, 256, 128]):
        super(Actor, self).__init__()
        # Ensure exactly 3 hidden layers as per the model design
        assert len(hidden_sizes) == 3, "Actor requires exactly 3 hidden layer sizes"
        # Feedforward network: FC(obs_dim → 256 → 256 → 128 → act_dim)
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_sizes[0]),
            nn.ReLU(),
            nn.Linear(hidden_sizes[0], hidden_sizes[1]),
            nn.ReLU(),
            nn.Linear(hidden_sizes[1], hidden_sizes[2]),
            nn.ReLU(),
            nn.Linear(hidden_sizes[2], act_dim)
        )

    def forward(self, obs):
        """
        Forward pass of the Actor network.

        Args:
            obs (torch.Tensor): Local observation tensor.

        Returns:
            torch.Tensor: Actions mapped to the [0, 1] range.
        """
        out = self.net(obs)
        # (tanh(out) + 1) / 2 maps the raw output into [0, 1]
        return (torch.tanh(out) + 1.0) / 2.0

code/agents/co_maddpg.py Normal file

@@ -0,0 +1,376 @@
"""
Co-MADDPG Algorithm for Wireless Resource Allocation / 无线资源分配中的 Co-MADDPG 算法
This file implements the Cooperative Multi-Agent Deep Deterministic Policy Gradient (Co-MADDPG) algorithm.
It features a Leader-Follower (Stackelberg) update structure for semantic and traditional agents.
本文档实现了协作式多智能体深度确定性策略梯度 (Co-MADDPG) 算法
该算法针对语义智能体和传统智能体采用了领导者-跟随者Stackelberg更新结构
Key Components / 关键组件:
- Actor-Critic Architecture / Actor-Critic 架构
- Stackelberg Update / Stackelberg 更新 (Follower update first, then Leader uses Follower's best response)
- Dynamic Cooperation Weight / 动态协作权重: λ(t) = sigmoid(β*(QoE_sys - Q_th))
- Mixed Reward / 混合奖励: r_i = λ*r_coop + (1-λ)*r_comp
- Soft Update / 软更新: θ_target ← τ*θ + (1-τ)*θ_target
Reference / 参考文献: Section 3.2 Leader-Follower Game and Co-MADDPG in the project paper.
"""
import os
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from agents.actor import Actor
from agents.critic import Critic
from agents.replay_buffer import ReplayBuffer
from agents.noise import OUNoise
class CoMADDPG:
"""
Co-MADDPG Algorithm featuring Leader-Follower updating structure.
具有领导者-跟随者更新结构的 Co-MADDPG 算法
Agent S: Semantic Agent (Leader) / 语义智能体领导者
Agent B: Traditional/Bit-stream Agent (Follower) / 传统/比特流智能体跟随者
Paper Ref / 论文参考: Section 3.2 - Co-MADDPG Implementation details.
"""
def __init__(self, config):
self.config = config
# Dimensions derived from config / 从配置中提取维度信息
self.obs_dim = config['env']['num_subcarriers'] + 4
self.act_dim = 3
# The critic observes joint states and actions / Critic 观察联合状态和动作
obs_dim_total = self.obs_dim * 2
act_dim_total = self.act_dim * 2
# Select device automatically (CUDA if available, else CPU) / 自动选择设备CUDA 或 CPU
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Hyperparameters / 超参数设置
train_cfg = config.get('training', {})
self.gamma = train_cfg.get('gamma', 0.95)
self.tau = train_cfg.get('tau', 0.01)
self.beta = train_cfg.get('beta', 5.0)
self.q_threshold = train_cfg.get('q_threshold', 0.6)
self.batch_size = train_cfg.get('batch_size', 256)
actor_lr = train_cfg.get('actor_lr', 1e-4)
critic_lr = train_cfg.get('critic_lr', 3e-4)
buffer_capacity = train_cfg.get('buffer_capacity', 100000)
# Network configurations / 网络配置项
net_cfg = config.get('network', {})
actor_hidden = net_cfg.get('actor_hidden', [256, 256, 128])
critic_hidden = net_cfg.get('critic_hidden', [512, 512, 256])
# Create Actor Networks / 创建 Actor 网络
self.actor_s = Actor(self.obs_dim, self.act_dim, actor_hidden).to(self.device)
self.actor_b = Actor(self.obs_dim, self.act_dim, actor_hidden).to(self.device)
# Create Actor Target Networks / 创建 Actor 目标网络
self.actor_s_target = Actor(self.obs_dim, self.act_dim, actor_hidden).to(self.device)
self.actor_b_target = Actor(self.obs_dim, self.act_dim, actor_hidden).to(self.device)
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
# Create Critic Networks / 创建 Critic 网络
self.critic_s = Critic(obs_dim_total, act_dim_total, critic_hidden).to(self.device)
self.critic_b = Critic(obs_dim_total, act_dim_total, critic_hidden).to(self.device)
# Create Critic Target Networks / 创建 Critic 目标网络
self.critic_s_target = Critic(obs_dim_total, act_dim_total, critic_hidden).to(self.device)
self.critic_b_target = Critic(obs_dim_total, act_dim_total, critic_hidden).to(self.device)
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
# Optimizers / 优化器设置
self.actor_optimizer_s = optim.Adam(self.actor_s.parameters(), lr=actor_lr)
self.actor_optimizer_b = optim.Adam(self.actor_b.parameters(), lr=actor_lr)
self.critic_optimizer_s = optim.Adam(self.critic_s.parameters(), lr=critic_lr)
self.critic_optimizer_b = optim.Adam(self.critic_b.parameters(), lr=critic_lr)
# MSE Loss for critics / Critic 使用的均方误差损失函数
self.critic_loss_fn = nn.MSELoss()
# Replay Buffer / 经验回放池
self.replay_buffer = ReplayBuffer(buffer_capacity)
# Ornstein-Uhlenbeck noise / OU 探索噪声
ou_sigma = train_cfg.get('ou_sigma_init', 0.2)
ou_theta = train_cfg.get('ou_theta', 0.15)
self.noise_s = OUNoise(self.act_dim, theta=ou_theta, sigma_init=ou_sigma)
self.noise_b = OUNoise(self.act_dim, theta=ou_theta, sigma_init=ou_sigma)
def select_action(self, obs_s, obs_b, explore=True):
"""
Determines the actions using the actor networks, with optional OU exploration noise.
使用 Actor 网络确定动作可选择性添加 OU 探索噪声
Args / 参数:
obs_s, obs_b: Observations for agents S and B. / 智能体 S B 的观测值
explore (bool): Whether to add noise for exploration. / 是否添加探索噪声
Returns / 返回:
tuple: (act_s, act_b) actions for each agent. / 每个智能体的动作 (act_s, act_b)
"""
self.actor_s.eval()
self.actor_b.eval()
with torch.no_grad():
obs_s_t = torch.FloatTensor(obs_s).unsqueeze(0).to(self.device)
obs_b_t = torch.FloatTensor(obs_b).unsqueeze(0).to(self.device)
act_s = self.actor_s(obs_s_t).cpu().numpy().squeeze(0)
act_b = self.actor_b(obs_b_t).cpu().numpy().squeeze(0)
self.actor_s.train()
self.actor_b.train()
# Apply OU noise if exploration is enabled / 如果开启探索,则添加 OU 噪声
if explore:
act_s += self.noise_s.sample()
act_b += self.noise_b.sample()
# Formula / 公式: act ∈ [0, 1]
# Clip mapping bounds as enforced by the (tanh + 1)/2 activation in Actor / 按照 Actor 中的激活函数限制动作范围到 [0, 1]
act_s = np.clip(act_s, 0.0, 1.0)
act_b = np.clip(act_b, 0.0, 1.0)
return act_s, act_b
def compute_lambda(self, qoe_sys):
"""
Compute dynamic cooperation weight λ(t). / 计算动态协作权重 λ(t)
Formula / 公式: λ(t) = sigmoid(β * (QoE_sys - Q_th))
Args / 参数:
qoe_sys (float): Current system QoE. / 当前系统 QoE
Returns / 返回:
float: Cooperation weight λ(t) ∈ [0, 1]. / 协作权重 λ(t)
"""
return 1.0 / (1.0 + np.exp(-self.beta * (qoe_sys - self.q_threshold)))
def compute_rewards(self, qoe_s, qoe_b, qoe_sys):
"""
Compute joint dynamically weighted rewards based on the λ cooperation factor.
基于 λ 协作因子计算动态加权的联合奖励
Formula / 公式: r_i = λ * r_coop_i + (1 - λ) * r_comp_i
Args / 参数:
qoe_s, qoe_b, qoe_sys: QoE values for semantic, traditional, and system levels. / 语义层传统层和系统层的 QoE
Returns / 返回:
tuple: (r_s, r_b, lambda_val) final mixed rewards and the cooperation weight. / 最终混合奖励与协作权重
"""
lambda_val = self.compute_lambda(qoe_sys)
rew_cfg = self.config.get('reward', {})
coop_self = rew_cfg.get('coop_self', 0.5)
coop_other = rew_cfg.get('coop_other', 0.3)
coop_sys = rew_cfg.get('coop_sys', 0.2)
comp_self = rew_cfg.get('comp_self', 0.8)
comp_sys = rew_cfg.get('comp_sys', 0.2)
# Cooperative logic (shared benefit mindset) / 协作逻辑(共同利益导向)
# Formula / 公式: r_coop_i = 0.5*qoe_i + 0.3*qoe_j + 0.2*qoe_sys
r_coop_s = coop_self * qoe_s + coop_other * qoe_b + coop_sys * qoe_sys
r_coop_b = coop_self * qoe_b + coop_other * qoe_s + coop_sys * qoe_sys
# Competitive logic (individual maximization mindset) / 竞争逻辑(个体利益导向)
# Formula / 公式: r_comp_i = 0.8*qoe_i + 0.2*qoe_sys
r_comp_s = comp_self * qoe_s + comp_sys * qoe_sys
r_comp_b = comp_self * qoe_b + comp_sys * qoe_sys
# Dynamically balanced reward (mix based on System QoE state vs threshold) / 动态平衡奖励(基于系统 QoE 状态与阈值的混合)
r_s = lambda_val * r_coop_s + (1.0 - lambda_val) * r_comp_s
r_b = lambda_val * r_coop_b + (1.0 - lambda_val) * r_comp_b
return r_s, r_b, lambda_val
def update(self):
"""
Perform one gradient update following the Leader-Follower (Stackelberg) sequence.
按照领导者-跟随者Stackelberg顺序执行一次梯度更新
Update Order / 更新顺序:
1. Update Follower (Agent B) Critic & Actor / 更新跟随者智能体 B Critic Actor
2. Update Leader (Agent S) Critic & Actor / 更新领导者智能体 S Critic Actor
Returns / 返回:
tuple: (critic_loss_s, critic_loss_b, actor_loss_s, actor_loss_b) or None if buffer not ready. / 各项损失值
"""
if len(self.replay_buffer) < self.batch_size:
return None
# Sample batch from replay buffer / 从回放池中采样批次数据
batch = self.replay_buffer.sample(self.batch_size)
# Destructure standardized tuple. Assumes order:
# (obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, dones)
obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, dones = batch
obs_s = torch.FloatTensor(obs_s).to(self.device)
obs_b = torch.FloatTensor(obs_b).to(self.device)
act_s = torch.FloatTensor(act_s).to(self.device)
act_b = torch.FloatTensor(act_b).to(self.device)
rew_s = torch.FloatTensor(rew_s).unsqueeze(1).to(self.device)
rew_b = torch.FloatTensor(rew_b).unsqueeze(1).to(self.device)
next_obs_s = torch.FloatTensor(next_obs_s).to(self.device)
next_obs_b = torch.FloatTensor(next_obs_b).to(self.device)
dones = torch.FloatTensor(dones).unsqueeze(1).to(self.device)
# Construct joint states & actions for centralized critic / 构建用于集中式 Critic 的联合状态和动作空间
obs_all = torch.cat([obs_s, obs_b], dim=1)
next_obs_all = torch.cat([next_obs_s, next_obs_b], dim=1)
act_all = torch.cat([act_s, act_b], dim=1)
# Target actions for next state / 计算下一状态的目标动作值
with torch.no_grad():
next_act_s_target = self.actor_s_target(next_obs_s)
next_act_b_target = self.actor_b_target(next_obs_b)
next_act_all_target = torch.cat([next_act_s_target, next_act_b_target], dim=1)
# =====================================================================
# PHASE 1: Update Follower (Agent B) FIRST / 第一阶段:首先更新跟随者 (智能体 B)
# Stackelberg methodology / Stackelberg 方法论: Follower responds to Leader's action / 跟随者响应领导者的动作
# =====================================================================
# Update Critic B / 更新智能体 B 的 Critic
with torch.no_grad():
target_q_b_next = self.critic_b_target(next_obs_all, next_act_all_target)
target_q_b = rew_b + self.gamma * (1.0 - dones) * target_q_b_next
current_q_b = self.critic_b(obs_all, act_all)
critic_loss_b = self.critic_loss_fn(current_q_b, target_q_b)
self.critic_optimizer_b.zero_grad()
critic_loss_b.backward()
self.critic_optimizer_b.step()
# Update Actor B / 更新智能体 B 的 Actor
# Loss: -mean(critic_b(obs_all, [act_s_from_buffer, actor_b(obs_b)]))
# In Phase 1, the follower assumes leader's action from replay buffer / 在第一阶段,跟随者假定领导者的动作为回放池中的动作
new_act_b = self.actor_b(obs_b)
act_all_for_b = torch.cat([act_s, new_act_b], dim=1)
actor_loss_b = -self.critic_b(obs_all, act_all_for_b).mean()
self.actor_optimizer_b.zero_grad()
actor_loss_b.backward()
self.actor_optimizer_b.step()
# =====================================================================
# PHASE 2: Update Leader (Agent S) with UPDATED Follower / 第二阶段:基于更新后的跟随者更新领导者 (智能体 S)
# Leader S uses Follower B's best response / 领导者 S 利用跟随者 B 的最佳响应函数
# =====================================================================
# Update Critic S / 更新智能体 S 的 Critic
with torch.no_grad():
target_q_s_next = self.critic_s_target(next_obs_all, next_act_all_target)
target_q_s = rew_s + self.gamma * (1.0 - dones) * target_q_s_next
current_q_s = self.critic_s(obs_all, act_all)
critic_loss_s = self.critic_loss_fn(current_q_s, target_q_s)
self.critic_optimizer_s.zero_grad()
critic_loss_s.backward()
self.critic_optimizer_s.step()
# Update Actor S / 更新智能体 S 的 Actor
# KEY / 核心逻辑: Use newly updated actor_b(obs_b).detach() as follower's assumed action / 使用刚更新的 actor_b(obs_b).detach() 作为跟随者的预估动作
# This represents the Leader's knowledge of the Follower's best response / 这代表了领导者对跟随者最佳响应的认知
new_act_s = self.actor_s(obs_s)
updated_act_b_detached = self.actor_b(obs_b).detach()
act_all_for_s = torch.cat([new_act_s, updated_act_b_detached], dim=1)
actor_loss_s = -self.critic_s(obs_all, act_all_for_s).mean()
self.actor_optimizer_s.zero_grad()
actor_loss_s.backward()
self.actor_optimizer_s.step()
# =====================================================================
# Target Networks Soft Update / 目标网络软更新
# Formula / 公式: θ_target ← τ * θ + (1 - τ) * θ_target
# =====================================================================
self.soft_update(self.actor_s_target, self.actor_s, self.tau)
self.soft_update(self.actor_b_target, self.actor_b, self.tau)
self.soft_update(self.critic_s_target, self.critic_s, self.tau)
self.soft_update(self.critic_b_target, self.critic_b, self.tau)
return critic_loss_s.item(), critic_loss_b.item(), actor_loss_s.item(), actor_loss_b.item()
def soft_update(self, target, source, tau):
"""
Polyak averaging for target network parameters. / 目标网络参数的 Polyak 平均软更新
Args / 参数:
target: Target network. / 目标网络
source: Source network. / 源网络
tau (float): Soft update interpolation factor. / 软更新插值因子 τ
"""
for target_param, source_param in zip(target.parameters(), source.parameters()):
target_param.data.copy_(tau * source_param.data + (1.0 - tau) * target_param.data)
def save(self, path):
"""
Saves all 4 network state_dicts and optimizers. / 保存所有 4 个网络的权重和优化器状态
Args / 参数:
path (str): File path to save the checkpoint. / 保存检查点的文件路径
"""
os.makedirs(os.path.dirname(path), exist_ok=True)
torch.save({
'actor_s': self.actor_s.state_dict(),
'actor_b': self.actor_b.state_dict(),
'critic_s': self.critic_s.state_dict(),
'critic_b': self.critic_b.state_dict(),
'actor_optimizer_s': self.actor_optimizer_s.state_dict(),
'actor_optimizer_b': self.actor_optimizer_b.state_dict(),
'critic_optimizer_s': self.critic_optimizer_s.state_dict(),
'critic_optimizer_b': self.critic_optimizer_b.state_dict(),
}, path)
def load(self, path):
"""
Loads all 4 networks and optimizer parameters from saved states. / 从保存的状态加载所有 4 个网络和优化器参数
Args / 参数:
path (str): File path of the checkpoint to load. / 要加载的检查点文件路径
"""
checkpoint = torch.load(path, map_location=self.device)
self.actor_s.load_state_dict(checkpoint['actor_s'])
self.actor_b.load_state_dict(checkpoint['actor_b'])
self.critic_s.load_state_dict(checkpoint['critic_s'])
self.critic_b.load_state_dict(checkpoint['critic_b'])
self.actor_optimizer_s.load_state_dict(checkpoint['actor_optimizer_s'])
self.actor_optimizer_b.load_state_dict(checkpoint['actor_optimizer_b'])
self.critic_optimizer_s.load_state_dict(checkpoint['critic_optimizer_s'])
self.critic_optimizer_b.load_state_dict(checkpoint['critic_optimizer_b'])
# Hard sync the target networks after loading
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
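The dynamic reward mixing implemented by `compute_lambda` and `compute_rewards` can be reproduced standalone. A minimal sketch, assuming the config fallback weights (coop 0.5/0.3/0.2, comp 0.8/0.2, β = 5, Q_th = 0.6); `mixed_reward` is an illustrative helper, not a repo function:

```python
import math

def mixed_reward(qoe_s, qoe_b, qoe_sys, beta=5.0, q_th=0.6):
    """Mirror compute_lambda + compute_rewards for agent S with the default weights."""
    lam = 1.0 / (1.0 + math.exp(-beta * (qoe_sys - q_th)))  # lambda(t)
    r_coop_s = 0.5 * qoe_s + 0.3 * qoe_b + 0.2 * qoe_sys    # cooperative term
    r_comp_s = 0.8 * qoe_s + 0.2 * qoe_sys                  # competitive term
    return lam * r_coop_s + (1.0 - lam) * r_comp_s, lam

# High system QoE pushes lambda toward 1 (cooperative regime);
# low system QoE pushes it toward 0 (competitive regime).
_, lam_hi = mixed_reward(0.8, 0.7, qoe_sys=0.9)
_, lam_lo = mixed_reward(0.8, 0.7, qoe_sys=0.2)
assert lam_hi > 0.5 > lam_lo
assert mixed_reward(0.5, 0.5, qoe_sys=0.6)[1] == 0.5  # at Q_th, lambda is exactly 0.5
```

The sigmoid makes the transition between regimes smooth rather than a hard switch, which is what distinguishes Co-MADDPG from the FixedLambda baseline below.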

63
code/agents/critic.py Normal file
View File

@ -0,0 +1,63 @@
"""
Critic Network for Wireless Resource Allocation / 无线资源分配中的 Critic 网络
This file defines the Critic network architecture for the Co-MADDPG project.
The Critic estimates the joint Q-value based on the global observations and actions of all agents.
本文档定义了 Co-MADDPG 项目中的 Critic 网络架构
Critic 网络基于所有智能体的全局观测和动作来估算联合 Q
Network Architecture / 网络架构:
FC(obs_dim_total + act_dim_total → 512 → 512 → 256 → 1)
Input / 输入: Concatenated observations and actions / 拼接后的观测与动作
Reference / 参考文献: Section 3.2.1 Actor-Critic Structure in the project paper.
"""
import torch
import torch.nn as nn
class Critic(nn.Module):
"""
Critic network for assessing the value of joint actions given joint observations.
Critic 网络用于在给定联合观测的情况下评估联合动作的价值
Architecture / 架构: FC(obs_dim_total + act_dim_total → 512 → 512 → 256 → 1)
Paper Ref / 论文参考: Section 3.2.1 - Centralized Critic implementation.
Args / 参数:
obs_dim_total (int): Total dimension of concatenated observations. / 所有智能体拼接后的总观测维度
act_dim_total (int): Total dimension of concatenated actions. / 所有智能体拼接后的总动作维度
hidden_sizes (list): Sizes of the three hidden layers (default: [512, 512, 256]). / 三个隐藏层的维度默认[512, 512, 256]
"""
def __init__(self, obs_dim_total, act_dim_total, hidden_sizes=[512, 512, 256]):
super(Critic, self).__init__()
# Ensure exactly 3 hidden layers as per model design / 确保按照模型设计包含恰好 3 个隐藏层
assert len(hidden_sizes) == 3, "Critic requires exactly 3 hidden layer sizes"
# Define the feedforward neural network / 定义前馈神经网络
# FC(obs_dim_total + act_dim_total → 512 → 512 → 256 → 1)
self.net = nn.Sequential(
nn.Linear(obs_dim_total + act_dim_total, hidden_sizes[0]),
nn.ReLU(),
nn.Linear(hidden_sizes[0], hidden_sizes[1]),
nn.ReLU(),
nn.Linear(hidden_sizes[1], hidden_sizes[2]),
nn.ReLU(),
nn.Linear(hidden_sizes[2], 1)
)
def forward(self, obs_all, act_all):
"""
Forward pass for the Critic network. / Critic 网络的前向传播
Args / 参数:
obs_all (torch.Tensor): The concatenated joint observation tensor. / 拼接后的联合观测张量
act_all (torch.Tensor): The concatenated joint action tensor. / 拼接后的联合动作张量
Returns / 返回:
torch.Tensor: Scalar Q-value evaluation. / 标量 Q 值评估结果
"""
# Formula / 公式: x = [obs_total, act_total]
# Concatenate joint states and actions together for input / 将联合状态和动作拼接作为输入
x = torch.cat([obs_all, act_all], dim=1)
# Pass the concatenated input through the network / 将拼接后的输入传入网络
return self.net(x)
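The centralized Critic's input construction can be sketched with plain numpy shapes. A minimal sketch, assuming hypothetical per-agent dimensions (`obs_dim = 20`, i.e. 16 subcarriers + 4, and `act_dim = 3`) — these numbers are illustrative:

```python
import numpy as np

# Hypothetical per-agent dimensions: obs_dim = 20, act_dim = 3, two agents.
batch, obs_dim, act_dim = 8, 20, 3
obs_all = np.random.randn(batch, 2 * obs_dim)   # concatenated [obs_s, obs_b]
act_all = np.random.randn(batch, 2 * act_dim)   # concatenated [act_s, act_b]

# Mirrors torch.cat([obs_all, act_all], dim=1) in Critic.forward:
# the first linear layer therefore has in_features = 2*obs_dim + 2*act_dim = 46.
x = np.concatenate([obs_all, act_all], axis=1)
assert x.shape == (batch, 2 * obs_dim + 2 * act_dim)  # (8, 46)
```

This is why `CoMADDPG.__init__` constructs the critics with `obs_dim_total = obs_dim * 2` and `act_dim_total = act_dim * 2`.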

74
code/agents/noise.py Normal file
View File

@ -0,0 +1,74 @@
"""
Ornstein-Uhlenbeck (OU) Exploration Noise / OU 探索噪声
This file implements the OU noise process for continuous action exploration.
The noise is temporally correlated and features linear sigma decay over training.
本文档实现了用于连续动作探索的 OU 噪声过程
该噪声具有时间相关性并在训练过程中具有线性标准差sigma衰减特性
Formula / 公式: dx = θ(μ - x)dt + σdW
Decay / 衰减: Linear sigma decay over specified decay period. / 在指定的衰减周期内线性衰减 sigma
Reference / 参考文献: Section 3.2.2 Exploration Mechanism in the project paper.
"""
import numpy as np
class OUNoise:
"""Ornstein-Uhlenbeck process for temporally correlated exploration noise.
用于生成具有时间相关性的探索噪声的 Ornstein-Uhlenbeck 过程
Formula / 公式: dx = θ(μ - x)dt + σdW
Args / 参数:
action_dim (int): Dimensionality of the action space. / 动作空间的维度
mu (float): Long-term mean of the process. / 过程的长期均值 μ
theta (float): Mean-reversion rate. / 均值回归速率 θ
sigma_init (float): Initial standard deviation. / 初始标准差 σ
sigma_min (float): Minimum standard deviation after decay. / 衰减后的最小标准差
decay_period (int): Number of episodes over which sigma decays linearly. / sigma 线性衰减的总回合数
"""
def __init__(self, action_dim: int, mu: float = 0.0, theta: float = 0.15,
sigma_init: float = 0.2, sigma_min: float = 0.01,
decay_period: int = 5000):
# Initialize noise parameters and state / 初始化噪声参数和状态
self.action_dim = action_dim
self.mu = mu
self.theta = theta
self.sigma_init = sigma_init
self.sigma_min = sigma_min
self.sigma = sigma_init
self.decay_period = decay_period
self.state = np.full(action_dim, mu, dtype=np.float64)
def reset(self):
"""
Reset the internal state to the mean. / 将内部状态重置为均值 μ
"""
self.state = np.full(self.action_dim, self.mu, dtype=np.float64)
def decay_sigma(self, episode: int):
"""Linearly decay sigma from sigma_init to sigma_min over decay_period.
在衰减周期内 sigma 从初始值线性衰减到最小值
Args / 参数:
episode (int): Current episode number. / 当前回合数
"""
# Calculate decay fraction / 计算衰减比例
frac = min(1.0, episode / max(1, self.decay_period))
# Linear decay formula / 线性衰减公式: σ = σ_init + frac * (σ_min - σ_init)
self.sigma = self.sigma_init + frac * (self.sigma_min - self.sigma_init)
def sample(self) -> np.ndarray:
"""
Generate a noise sample via the OU process. / 通过 OU 过程生成噪声样本
Returns / 返回:
np.ndarray: Noise vector of shape (action_dim,). / 形状为 (action_dim,) 的噪声向量
"""
# OU Formula / OU 公式: dx = θ * (μ - x) + σ * N(0,1)
dx = (self.theta * (self.mu - self.state)
+ self.sigma * np.random.randn(self.action_dim))
# Update state / 更新状态: x = x + dx
self.state = self.state + dx
return self.state.copy()
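A standalone numpy re-run of the OU recursion and the linear sigma decay, using the class defaults (μ = 0, θ = 0.15, σ_init = 0.2, σ_min = 0.01, decay_period = 5000); the seed and step counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, mu, sigma = 0.15, 0.0, 0.2

# OU recursion: x <- x + theta*(mu - x) + sigma*N(0, 1), one 3-dim sample per step.
x = np.zeros(3)
trace = []
for _ in range(1000):
    x = x + theta * (mu - x) + sigma * rng.standard_normal(3)
    trace.append(x.copy())
trace = np.array(trace)
assert trace.shape == (1000, 3)
assert abs(trace[-200:].mean()) < 0.5  # mean reversion keeps the long-run average near mu

# Linear sigma decay at the halfway episode: sigma = sigma_init + frac*(sigma_min - sigma_init)
sigma_init, sigma_min, decay_period = 0.2, 0.01, 5000
frac = min(1.0, 2500 / decay_period)
sigma_mid = sigma_init + frac * (sigma_min - sigma_init)
assert abs(sigma_mid - 0.105) < 1e-12
```

Unlike i.i.d. Gaussian noise, consecutive OU samples are correlated, which produces smoother exploration trajectories in the continuous action space.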

92
code/agents/replay_buffer.py Normal file
View File

@ -0,0 +1,92 @@
"""
Experience Replay Buffer for Multi-Agent RL / 多智能体强化学习的经验回放池
This file implements a fixed-size replay buffer to store and sample transitions.
Each transition contains observations, actions, and rewards for both semantic and traditional agents.
本文档实现了一个固定大小的回放池用于存储和采样状态转移
每个状态转移包含语义智能体和传统智能体的观测动作及奖励
Storage Format / 存储格式: 9-field transitions / 9 字段状态转移
(obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, done)
Reference / 参考文献: Section 3.2.3 Experience Replay in the project paper.
"""
import random
from collections import deque
import numpy as np
class ReplayBuffer:
"""Fixed-size experience replay buffer for two-agent transitions.
用于双智能体状态转移的固定大小经验回放池
Stores transitions of the form / 存储如下形式的状态转移:
(obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, done)
Args / 参数:
capacity (int): Maximum number of transitions to store. / 存储转换的最大数量
"""
def __init__(self, capacity: int):
# Initialize the buffer as a double-ended queue with a maximum length / 将回放池初始化为具有最大长度的双端队列
self.buffer = deque(maxlen=capacity)
def push(self, obs_s, obs_b, act_s, act_b, rew_s, rew_b,
next_obs_s, next_obs_b, done=False):
"""
Store a single transition into the buffer. / 将单次状态转移存入回放池
Args / 参数:
obs_s, obs_b: Observations for Semantic and Traditional agents. / 语义智能体与传统智能体的观测
act_s, act_b: Actions taken by each agent. / 各个智能体采取的动作
rew_s, rew_b: Rewards received by each agent. / 各个智能体获得的奖励
next_obs_s, next_obs_b: Next observations. / 下一个状态的观测
done (bool): Whether the episode ended. / 回合是否结束
"""
# Append the 9-field transition to the deque / 将 9 字段的状态转移添加到队列中
self.buffer.append((
np.asarray(obs_s, dtype=np.float32),
np.asarray(obs_b, dtype=np.float32),
np.asarray(act_s, dtype=np.float32),
np.asarray(act_b, dtype=np.float32),
float(rew_s),
float(rew_b),
np.asarray(next_obs_s, dtype=np.float32),
np.asarray(next_obs_b, dtype=np.float32),
float(done),
))
def sample(self, batch_size: int):
"""
Sample a random batch of transitions for training. / 随机采样一批状态转移用于训练
Args / 参数:
batch_size (int): Number of transitions to sample. / 采样数量
Returns / 返回:
tuple of np.ndarray: (obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, dones)
Each array has shape (batch_size, ...). / 每个数组的形状均为 (batch_size, ...)
"""
# Randomly select 'batch_size' samples from the buffer / 从回放池中随机选择 batch_size 个样本
batch = random.sample(self.buffer, batch_size)
# Unzip the batch into separate components / 将采样到的批次拆解为独立的组件
obs_s, obs_b, act_s, act_b, rew_s, rew_b, \
next_obs_s, next_obs_b, dones = zip(*batch)
# Convert each component to a numpy array / 将各组件转换为 numpy 数组
return (
np.array(obs_s),
np.array(obs_b),
np.array(act_s),
np.array(act_b),
np.array(rew_s),
np.array(rew_b),
np.array(next_obs_s),
np.array(next_obs_b),
np.array(dones),
)
def __len__(self) -> int:
"""
Return the current size of the buffer. / 返回回放池的当前大小
"""
return len(self.buffer)
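A minimal round-trip sketch of the 9-field storage format, using a plain `deque` the same way the class does (dimensions are illustrative: `obs_dim = 20`, `act_dim = 3`):

```python
import random
from collections import deque
import numpy as np

buf = deque(maxlen=100)  # fixed capacity: oldest transitions are evicted first
for _ in range(10):
    buf.append((
        np.zeros(20, np.float32), np.zeros(20, np.float32),   # obs_s, obs_b
        np.zeros(3, np.float32), np.zeros(3, np.float32),     # act_s, act_b
        0.0, 0.0,                                             # rew_s, rew_b
        np.zeros(20, np.float32), np.zeros(20, np.float32),   # next_obs_s, next_obs_b
        0.0,                                                  # done flag
    ))

# Sample and unzip exactly as ReplayBuffer.sample does.
batch = random.sample(buf, 4)
obs_s, obs_b, act_s, act_b, rew_s, rew_b, nobs_s, nobs_b, dones = \
    (np.array(field) for field in zip(*batch))
assert obs_s.shape == (4, 20) and act_b.shape == (4, 3)
assert rew_s.shape == (4,) and dones.shape == (4,)
```

The per-field arrays come out batched along axis 0, which is the shape `CoMADDPG.update` expects before converting to tensors.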

12
code/baselines/__init__.py Normal file
View File

@ -0,0 +1,12 @@
from .pure_coop import PureCooperative
from .pure_comp import PureCompetitive
from .single_dqn import SingleAgentDQN
from .iddpg import IndependentDDPG
from .fixed_lambda import FixedLambda
from .equal_alloc import EqualAllocation
from .semantic_only import SemanticOnly
__all__ = [
"PureCooperative", "PureCompetitive", "SingleAgentDQN",
"IndependentDDPG", "FixedLambda", "EqualAllocation", "SemanticOnly",
]

101
code/baselines/equal_alloc.py Normal file
View File

@ -0,0 +1,101 @@
"""
Baseline: EqualAllocation (等额分配基线)
=====================================
Purpose (lower bound):
- This baseline represents a simple heuristic approach with no learning involved.
- It serves as a lower bound for performance comparison, showing the system behavior under a naive, fixed resource allocation strategy.
- 目的(性能下限):该基线代表了一种不涉及学习的简单启发式方法。它作为性能对比的下限,展示了在朴素的固定资源分配策略下系统的表现。
Difference from Co-MADDPG:
1. Learning: No learning vs deep reinforcement learning.
2. Action Selection: Always fixed at [0.5, 0.5, 0.5] for all resource parameters (subcarrier fraction, power, m_param).
与 Co-MADDPG 的区别:
- 学习机制:无学习 vs 深度强化学习
- 动作选择:所有资源参数(子载波比例、功率、m 参数)始终固定为 [0.5, 0.5, 0.5]
Contribution:
- Contributes to performance baseline tables as the "Random/Fixed" comparison point.
- 贡献:作为“随机/固定”对比点用于性能基准表
"""
import numpy as np
class DummyBuffer:
"""
Dummy replay buffer that satisfies train.py's push/len interface.
满足 train.py push/len 接口要求的虚拟重放池
"""
def push(self, *args):
# Do nothing as no learning is performed
# 不执行任何操作,因为没有学习过程
pass
def __len__(self):
# Always return 0 to indicate no samples available
# 始终返回 0表示没有可用样本
return 0
class EqualAllocation:
"""
EqualAllocation algorithm implementation.
等额分配算法实现
"""
def __init__(self, config):
# Initialize with configuration and a dummy buffer
# 使用配置和虚拟重放池进行初始化
self.config = config
self.replay_buffer = DummyBuffer()
def select_action(self, obs_s, obs_b, explore=True):
"""
Always return a fixed action [0.5, 0.5, 0.5].
始终返回固定动作 [0.5, 0.5, 0.5]
"""
return np.array([0.5, 0.5, 0.5], dtype=np.float32), \
np.array([0.5, 0.5, 0.5], dtype=np.float32)
def compute_rewards(self, qoe_s, qoe_b, qoe_sys):
"""
Compute rewards using a fixed λ=0.5 for consistency in monitoring.
使用固定 λ=0.5 计算奖励以保持监测的一致性
Formula: Balanced combination of coop and comp components.
公式说明协作项与竞争项的平衡组合
"""
lam = 0.5
rew_cfg = self.config.get('reward', {})
coop_self = rew_cfg.get('coop_self', 0.5)
coop_other = rew_cfg.get('coop_other', 0.3)
coop_sys = rew_cfg.get('coop_sys', 0.2)
comp_self = rew_cfg.get('comp_self', 0.8)
comp_sys = rew_cfg.get('comp_sys', 0.2)
# Compute reward components for S
# 计算 S 的奖励组成部分
r_coop_s = coop_self * qoe_s + coop_other * qoe_b + coop_sys * qoe_sys
r_comp_s = comp_self * qoe_s + comp_sys * qoe_sys
r_s = lam * r_coop_s + (1 - lam) * r_comp_s
# Compute reward components for B
# 计算 B 的奖励组成部分
r_coop_b = coop_self * qoe_b + coop_other * qoe_s + coop_sys * qoe_sys
r_comp_b = comp_self * qoe_b + comp_sys * qoe_sys
r_b = lam * r_coop_b + (1 - lam) * r_comp_b
return r_s, r_b, lam
def update(self):
"""
No update performed in heuristic baseline.
启发式基线中不执行更新
"""
return None
def save(self, path):
"""No state to save."""
pass
def load(self, path):
"""No state to load."""
pass
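The fixed λ = 0.5 reward used for monitoring reduces to simple arithmetic. A worked example with the fallback weights (QoE values are illustrative):

```python
# Illustrative QoE values for agent S, agent B, and the system.
qoe_s, qoe_b, qoe_sys = 0.7, 0.5, 0.6

# Cooperative and competitive components for S (fallback weights 0.5/0.3/0.2 and 0.8/0.2):
r_coop_s = 0.5 * qoe_s + 0.3 * qoe_b + 0.2 * qoe_sys   # = 0.62
r_comp_s = 0.8 * qoe_s + 0.2 * qoe_sys                 # = 0.68

# Fixed lambda = 0.5 takes the plain average of the two components.
r_s = 0.5 * r_coop_s + 0.5 * r_comp_s                  # = 0.65
assert abs(r_s - 0.65) < 1e-9
```

Since the action is constant, this reward stream only reflects channel dynamics, which is exactly what makes EqualAllocation a useful lower bound.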

280
code/baselines/fixed_lambda.py Normal file
View File

@ -0,0 +1,280 @@
"""
Baseline: FixedLambda (固定 λ 基线)
=====================================
Purpose (ablation):
- This baseline is used to evaluate the benefit of the dynamic lambda switching mechanism in Co-MADDPG.
- It fixes λ at a constant value (0.5), balancing cooperation and competition equally throughout training.
- 目的(消融实验):该基线用于评估 Co-MADDPG 中动态 λ 切换机制的收益。它将 λ 固定为常数0.5),在整个训练过程中平衡协作与竞争。
Difference from Co-MADDPG:
1. Lambda (λ): Fixed at 0.5, whereas Co-MADDPG dynamically adjusts λ based on system state.
2. Update Order: Retains the Stackelberg update order (follower B first, then leader S), same as Co-MADDPG.
与 Co-MADDPG 的区别:
- Lambda (λ): 固定为 0.5,而 Co-MADDPG 根据系统状态动态调整 λ
- 更新顺序:保留了 Stackelberg 博弈更新顺序(先更新从属者 B再更新主导者 S与 Co-MADDPG 一致
Contribution:
- Contributes to performance sensitivity analysis regarding the choice of λ and shows why a fixed balance is suboptimal.
- 贡献:用于关于 λ 选择的性能敏感性分析,展示为什么固定比例的平衡并非最优
"""
import os
import numpy as np
import torch
import torch.nn.functional as F
from agents.actor import Actor
from agents.critic import Critic
from agents.replay_buffer import ReplayBuffer
from agents.noise import OUNoise
class FixedLambda:
"""
FixedLambda algorithm implementation.
固定 λ 算法实现
"""
def __init__(self, config):
# Initialize configuration and device
# 初始化配置和设备
self.config = config
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Hyperparameters: Gamma, Tau, Batch Size, and Fixed λ=0.5
# 超参数:折扣因子、软更新系数、批量大小以及固定 λ=0.5
self.gamma = config['training']['gamma']
self.tau = config['training']['tau']
self.batch_size = config['training']['batch_size']
self.fixed_lambda = 0.5
# Dimensions: State and Action
# 维度信息:状态与动作
self.obs_dim = config['env']['num_subcarriers'] + 4
self.act_dim = 3
# Actor networks and their target networks
# Actor 网络及其目标网络
hidden_a = config['network']['actor_hidden']
hidden_c = config['network']['critic_hidden']
self.actor_s = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_b = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_s_target = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_b_target = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
# Joint Critics for Centralized Training
# 用于中心化训练的联合 Critic
obs_total = self.obs_dim * 2
act_total = self.act_dim * 2
self.critic_s = Critic(obs_total, act_total, hidden_c).to(self.device)
self.critic_b = Critic(obs_total, act_total, hidden_c).to(self.device)
self.critic_s_target = Critic(obs_total, act_total, hidden_c).to(self.device)
self.critic_b_target = Critic(obs_total, act_total, hidden_c).to(self.device)
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
# Optimizers for actors and critics
# Actor 与 Critic 的优化器
self.actor_s_optimizer = torch.optim.Adam(self.actor_s.parameters(), lr=config['training']['actor_lr'])
self.actor_b_optimizer = torch.optim.Adam(self.actor_b.parameters(), lr=config['training']['actor_lr'])
self.critic_s_optimizer = torch.optim.Adam(self.critic_s.parameters(), lr=config['training']['critic_lr'])
self.critic_b_optimizer = torch.optim.Adam(self.critic_b.parameters(), lr=config['training']['critic_lr'])
# Experience Replay and OU Noise for exploration
# 经验重放池与用于探索的 OU 噪声
self.replay_buffer = ReplayBuffer(config['training']['buffer_capacity'])
self.noise_s = OUNoise(self.act_dim, theta=config['training']['ou_theta'],
sigma_init=config['training']['ou_sigma_init'],
sigma_min=config['training']['ou_sigma_min'])
self.noise_b = OUNoise(self.act_dim, theta=config['training']['ou_theta'],
sigma_init=config['training']['ou_sigma_init'],
sigma_min=config['training']['ou_sigma_min'])
def select_action(self, obs_s, obs_b, explore=True):
"""
Select actions for both agents given observations.
根据观察结果为两个智能体选择动作
"""
self.actor_s.eval()
self.actor_b.eval()
with torch.no_grad():
obs_s_t = torch.FloatTensor(obs_s).unsqueeze(0).to(self.device)
obs_b_t = torch.FloatTensor(obs_b).unsqueeze(0).to(self.device)
act_s = self.actor_s(obs_s_t).cpu().numpy()[0]
act_b = self.actor_b(obs_b_t).cpu().numpy()[0]
self.actor_s.train()
self.actor_b.train()
if explore:
# Add noise during training exploration
# 训练探索期间增加噪声
act_s = np.clip(act_s + self.noise_s.sample(), 0.0, 1.0)
act_b = np.clip(act_b + self.noise_b.sample(), 0.0, 1.0)
else:
act_s = np.clip(act_s, 0.0, 1.0)
act_b = np.clip(act_b, 0.0, 1.0)
return act_s, act_b
def compute_rewards(self, qoe_s, qoe_b, qoe_sys):
"""
Compute rewards with fixed λ=0.5.
使用固定 λ=0.5 计算奖励
Formula: r_i = 0.5 * r_coop + 0.5 * r_comp
公式说明奖励是协作项与竞争项的等权之和
"""
lam = self.fixed_lambda
rew_cfg = self.config.get('reward', {})
coop_self = rew_cfg.get('coop_self', 0.5)
coop_other = rew_cfg.get('coop_other', 0.3)
coop_sys = rew_cfg.get('coop_sys', 0.2)
comp_self = rew_cfg.get('comp_self', 0.8)
comp_sys = rew_cfg.get('comp_sys', 0.2)
# Compute Cooperative and Competitive components for S
# 计算 S 的协作与竞争组成部分
r_coop_s = coop_self * qoe_s + coop_other * qoe_b + coop_sys * qoe_sys
r_comp_s = comp_self * qoe_s + comp_sys * qoe_sys
r_s = lam * r_coop_s + (1 - lam) * r_comp_s
# Compute Cooperative and Competitive components for B
# 计算 B 的协作与竞争组成部分
r_coop_b = coop_self * qoe_b + coop_other * qoe_s + coop_sys * qoe_sys
r_comp_b = comp_self * qoe_b + comp_sys * qoe_sys
r_b = lam * r_coop_b + (1 - lam) * r_comp_b
return r_s, r_b, lam
def update(self):
"""
Update networks using Stackelberg update order.
使用 Stackelberg 博弈顺序更新网络
Order: Follower B updates first, then Leader S updates considering B's response.
顺序从属者 B 先更新随后主导者 S 在考虑 B 的响应后进行更新
"""
if len(self.replay_buffer) < self.batch_size:
return None
# Sample from replay buffer
# 从经验池采样
obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, dones = \
self.replay_buffer.sample(self.batch_size)
# Convert to tensors
# 转换为张量
obs_s = torch.FloatTensor(obs_s).to(self.device)
obs_b = torch.FloatTensor(obs_b).to(self.device)
act_s = torch.FloatTensor(act_s).to(self.device)
act_b = torch.FloatTensor(act_b).to(self.device)
rew_s = torch.FloatTensor(rew_s).unsqueeze(1).to(self.device)
rew_b = torch.FloatTensor(rew_b).unsqueeze(1).to(self.device)
next_obs_s = torch.FloatTensor(next_obs_s).to(self.device)
next_obs_b = torch.FloatTensor(next_obs_b).to(self.device)
dones = torch.FloatTensor(dones).unsqueeze(1).to(self.device)
# Centralized observation and next observation
# 中心化观察与下一状态观察
joint_obs = torch.cat([obs_s, obs_b], dim=1)
joint_next_obs = torch.cat([next_obs_s, next_obs_b], dim=1)
joint_act = torch.cat([act_s, act_b], dim=1)
# Compute targets for critics
# 计算 Critic 的目标值
with torch.no_grad():
next_act_s = self.actor_s_target(next_obs_s)
next_act_b = self.actor_b_target(next_obs_b)
joint_next_act = torch.cat([next_act_s, next_act_b], dim=1)
target_q_s = rew_s + self.gamma * (1 - dones) * self.critic_s_target(joint_next_obs, joint_next_act)
target_q_b = rew_b + self.gamma * (1 - dones) * self.critic_b_target(joint_next_obs, joint_next_act)
# --- Stackelberg: update follower B first ---
# --- Stackelberg 博弈:首先更新从属者 B ---
# Update Critic B
# 更新 Critic B
current_q_b = self.critic_b(joint_obs, joint_act)
critic_loss_b = F.mse_loss(current_q_b, target_q_b)
self.critic_b_optimizer.zero_grad()
critic_loss_b.backward()
self.critic_b_optimizer.step()
# Update Actor B (Follower)
# 更新 Actor B (从属者)
new_act_b = self.actor_b(obs_b)
actor_loss_b = -self.critic_b(joint_obs, torch.cat([act_s, new_act_b], dim=1)).mean()
self.actor_b_optimizer.zero_grad()
actor_loss_b.backward()
self.actor_b_optimizer.step()
# --- Then update leader S ---
# --- 然后更新主导者 S ---
# Re-compute follower's best response for leader's critic update
# 为主导者的 Critic 更新重新计算从属者的最佳响应
with torch.no_grad():
act_b_br = self.actor_b(obs_b)
joint_act_leader = torch.cat([act_s, act_b_br], dim=1)
# Update Critic S
# 更新 Critic S
current_q_s = self.critic_s(joint_obs, joint_act_leader)
critic_loss_s = F.mse_loss(current_q_s, target_q_s)
self.critic_s_optimizer.zero_grad()
critic_loss_s.backward()
self.critic_s_optimizer.step()
# Update Actor S (Leader) considering Follower's best response
# 考虑从属者的最佳响应,更新 Actor S (主导者)
with torch.no_grad():
act_b_br2 = self.actor_b(obs_b)
new_act_s = self.actor_s(obs_s)
actor_loss_s = -self.critic_s(joint_obs, torch.cat([new_act_s, act_b_br2], dim=1)).mean()
self.actor_s_optimizer.zero_grad()
actor_loss_s.backward()
self.actor_s_optimizer.step()
# Soft update target networks
# 目标网络软更新
for target, source in [
(self.critic_s_target, self.critic_s),
(self.critic_b_target, self.critic_b),
(self.actor_s_target, self.actor_s),
(self.actor_b_target, self.actor_b),
]:
for tp, sp in zip(target.parameters(), source.parameters()):
tp.data.copy_(self.tau * sp.data + (1.0 - self.tau) * tp.data)
return {
'actor_loss_s': actor_loss_s.item(),
'actor_loss_b': actor_loss_b.item(),
'critic_loss_s': critic_loss_s.item(),
'critic_loss_b': critic_loss_b.item(),
}
def save(self, path):
"""
Save models to disk.
将模型保存至磁盘
"""
os.makedirs(path, exist_ok=True)
torch.save(self.actor_s.state_dict(), os.path.join(path, "actor_s.pth"))
torch.save(self.actor_b.state_dict(), os.path.join(path, "actor_b.pth"))
torch.save(self.critic_s.state_dict(), os.path.join(path, "critic_s.pth"))
torch.save(self.critic_b.state_dict(), os.path.join(path, "critic_b.pth"))
def load(self, path):
"""
Load models from disk.
从磁盘加载模型
"""
self.actor_s.load_state_dict(torch.load(os.path.join(path, "actor_s.pth"), map_location=self.device))
self.actor_b.load_state_dict(torch.load(os.path.join(path, "actor_b.pth"), map_location=self.device))
self.critic_s.load_state_dict(torch.load(os.path.join(path, "critic_s.pth"), map_location=self.device))
self.critic_b.load_state_dict(torch.load(os.path.join(path, "critic_b.pth"), map_location=self.device))
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
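The `compute_rewards` blending above can be checked in isolation. A minimal pure-Python sketch using the default coefficients shown in the code (coop 0.5/0.3/0.2, comp 0.8/0.2) and the fixed λ=0.5; the QoE values are made-up inputs for illustration:

```python
def blended_reward(qoe_self, qoe_other, qoe_sys, lam=0.5):
    """Mirror of compute_rewards for one agent: r = lam*r_coop + (1-lam)*r_comp."""
    r_coop = 0.5 * qoe_self + 0.3 * qoe_other + 0.2 * qoe_sys
    r_comp = 0.8 * qoe_self + 0.2 * qoe_sys
    return lam * r_coop + (1 - lam) * r_comp

r_s = blended_reward(0.8, 0.6, 0.7)  # agent S: r_coop=0.72, r_comp=0.78, r_s ≈ 0.75
r_b = blended_reward(0.6, 0.8, 0.7)  # agent B (self/other swapped): r_b ≈ 0.65
```

Note the symmetry: S and B use the same formula with the self/other QoE roles exchanged, exactly as in the paired code paths above.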

code/baselines/iddpg.py Normal file

@ -0,0 +1,266 @@
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from agents.actor import Actor
from agents.replay_buffer import ReplayBuffer
from agents.noise import OUNoise
"""
Baseline: IndependentDDPG (独立 DDPG 基线)
=====================================
Purpose (ablation):
- This baseline removes the Centralized Training Decentralized Execution (CTDE) component.
- It is used to demonstrate the necessity of joint critics that observe other agents' actions for stable training in MARL.
- 目的消融实验该基线移除了中心化训练分布式执行CTDE组件用于证明在多智能体强化学习中引入能观察其他智能体动作的联合 Critic 对维持训练稳定性的必要性
Difference from Co-MADDPG:
1. Critic Type: An independent critic is used, which takes only the local observation and local action (obs_i, act_i) as input.
2. Update Order: Simultaneous independent updates for both agents.
3. Co-MADDPG 的区别
- Critic 类型使用独立 Critic其输入仅包含局部观察与局部动作 (obs_i, act_i)
- 更新顺序两个智能体同时进行独立的更新
Contribution:
- Contributes to ablation studies showing how centralized critics mitigate non-stationarity issues.
- 贡献用于消融实验展示中心化 Critic 如何缓解非平稳性Non-stationarity问题
"""
class IndependentCritic(nn.Module):
"""
IndependentCritic that takes only a single agent's observation and action.
独立 Critic仅接收单个智能体的观察与动作
"""
def __init__(self, obs_dim, act_dim, hidden_sizes=[512, 512, 256]):
super().__init__()
assert len(hidden_sizes) == 3
self.net = nn.Sequential(
nn.Linear(obs_dim + act_dim, hidden_sizes[0]),
nn.ReLU(),
nn.Linear(hidden_sizes[0], hidden_sizes[1]),
nn.ReLU(),
nn.Linear(hidden_sizes[1], hidden_sizes[2]),
nn.ReLU(),
nn.Linear(hidden_sizes[2], 1),
)
def forward(self, obs, act):
# Concatenate local observation and local action
# 拼接局部观察与局部动作
x = torch.cat([obs, act], dim=1)
return self.net(x)
class IndependentDDPG:
"""
IndependentDDPG algorithm implementation.
独立 DDPG 算法实现
"""
def __init__(self, config):
# Initialize configuration and device
# 初始化配置和设备
self.config = config
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Hyperparameters
# 超参数
self.gamma = config['training']['gamma']
self.tau = config['training']['tau']
self.batch_size = config['training']['batch_size']
# Dimensions
# 维度
self.obs_dim = config['env']['num_subcarriers'] + 4
self.act_dim = 3
# Hidden layer configurations
# 隐藏层配置
hidden_a = config['network']['actor_hidden']
hidden_c = config['network']['critic_hidden']
# Agent S: Local Actor and Independent Critic
# 智能体 S局部 Actor 与独立 Critic
self.actor_s = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_s_target = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.critic_s = IndependentCritic(self.obs_dim, self.act_dim, hidden_c).to(self.device)
self.critic_s_target = IndependentCritic(self.obs_dim, self.act_dim, hidden_c).to(self.device)
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
# Agent B: Local Actor and Independent Critic
# 智能体 B局部 Actor 与独立 Critic
self.actor_b = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_b_target = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
self.critic_b = IndependentCritic(self.obs_dim, self.act_dim, hidden_c).to(self.device)
self.critic_b_target = IndependentCritic(self.obs_dim, self.act_dim, hidden_c).to(self.device)
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
# Optimizers
# 优化器
self.actor_s_optimizer = torch.optim.Adam(self.actor_s.parameters(), lr=config['training']['actor_lr'])
self.actor_b_optimizer = torch.optim.Adam(self.actor_b.parameters(), lr=config['training']['actor_lr'])
self.critic_s_optimizer = torch.optim.Adam(self.critic_s.parameters(), lr=config['training']['critic_lr'])
self.critic_b_optimizer = torch.optim.Adam(self.critic_b.parameters(), lr=config['training']['critic_lr'])
# Shared replay buffer
# 共享重放池
self.replay_buffer = ReplayBuffer(config['training']['buffer_capacity'])
# Noise for exploration
# 探索噪声
self.noise_s = OUNoise(self.act_dim, theta=config['training']['ou_theta'],
sigma_init=config['training']['ou_sigma_init'],
sigma_min=config['training']['ou_sigma_min'])
self.noise_b = OUNoise(self.act_dim, theta=config['training']['ou_theta'],
sigma_init=config['training']['ou_sigma_init'],
sigma_min=config['training']['ou_sigma_min'])
def select_action(self, obs_s, obs_b, explore=True):
"""
Select actions for both agents.
为两个智能体选择动作
"""
self.actor_s.eval()
self.actor_b.eval()
with torch.no_grad():
obs_s_t = torch.FloatTensor(obs_s).unsqueeze(0).to(self.device)
obs_b_t = torch.FloatTensor(obs_b).unsqueeze(0).to(self.device)
act_s = self.actor_s(obs_s_t).cpu().numpy()[0]
act_b = self.actor_b(obs_b_t).cpu().numpy()[0]
self.actor_s.train()
self.actor_b.train()
if explore:
# Apply OU noise
# 应用 OU 噪声
act_s = np.clip(act_s + self.noise_s.sample(), 0.0, 1.0)
act_b = np.clip(act_b + self.noise_b.sample(), 0.0, 1.0)
else:
act_s = np.clip(act_s, 0.0, 1.0)
act_b = np.clip(act_b, 0.0, 1.0)
return act_s, act_b
def compute_rewards(self, qoe_s, qoe_b, qoe_sys):
"""
Compute rewards based on independent competitive behavior (λ=0).
基于独立的竞争行为计算奖励 (λ=0)
Formula: r_i = comp_self * qoe_i + comp_sys * qoe_sys
公式说明独立模式下默认为纯竞争每个智能体仅优化自身效用及系统整体惩罚
"""
lam = 0.0
r_s = self.config['reward']['comp_self'] * qoe_s + self.config['reward']['comp_sys'] * qoe_sys
r_b = self.config['reward']['comp_self'] * qoe_b + self.config['reward']['comp_sys'] * qoe_sys
return r_s, r_b, lam
def update(self):
"""
Update each agent independently and simultaneously.
独立且同步地更新每个智能体
"""
if len(self.replay_buffer) < self.batch_size:
return None
# Sample batch
# 采样批量数据
obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, dones = \
self.replay_buffer.sample(self.batch_size)
# To tensors
# 转换为张量
obs_s = torch.FloatTensor(obs_s).to(self.device)
obs_b = torch.FloatTensor(obs_b).to(self.device)
act_s = torch.FloatTensor(act_s).to(self.device)
act_b = torch.FloatTensor(act_b).to(self.device)
rew_s = torch.FloatTensor(rew_s).unsqueeze(1).to(self.device)
rew_b = torch.FloatTensor(rew_b).unsqueeze(1).to(self.device)
next_obs_s = torch.FloatTensor(next_obs_s).to(self.device)
next_obs_b = torch.FloatTensor(next_obs_b).to(self.device)
dones = torch.FloatTensor(dones).unsqueeze(1).to(self.device)
# --- Update Agent S (independent) ---
# --- 独立更新智能体 S ---
with torch.no_grad():
# Critic target only uses local next observation and action
# Critic 目标仅使用局部下一状态观察与动作
next_act_s = self.actor_s_target(next_obs_s)
target_q_s = rew_s + self.gamma * (1 - dones) * self.critic_s_target(next_obs_s, next_act_s)
current_q_s = self.critic_s(obs_s, act_s)
critic_loss_s = F.mse_loss(current_q_s, target_q_s)
self.critic_s_optimizer.zero_grad()
critic_loss_s.backward()
self.critic_s_optimizer.step()
new_act_s = self.actor_s(obs_s)
actor_loss_s = -self.critic_s(obs_s, new_act_s).mean()
self.actor_s_optimizer.zero_grad()
actor_loss_s.backward()
self.actor_s_optimizer.step()
# --- Update Agent B (independent) ---
# --- 独立更新智能体 B ---
with torch.no_grad():
# Critic target only uses local next observation and action
# Critic 目标仅使用局部下一状态观察与动作
next_act_b = self.actor_b_target(next_obs_b)
target_q_b = rew_b + self.gamma * (1 - dones) * self.critic_b_target(next_obs_b, next_act_b)
current_q_b = self.critic_b(obs_b, act_b)
critic_loss_b = F.mse_loss(current_q_b, target_q_b)
self.critic_b_optimizer.zero_grad()
critic_loss_b.backward()
self.critic_b_optimizer.step()
new_act_b = self.actor_b(obs_b)
actor_loss_b = -self.critic_b(obs_b, new_act_b).mean()
self.actor_b_optimizer.zero_grad()
actor_loss_b.backward()
self.actor_b_optimizer.step()
# Soft update targets for both agents
# 软更新两个智能体的目标网络
for target, source in [
(self.critic_s_target, self.critic_s),
(self.critic_b_target, self.critic_b),
(self.actor_s_target, self.actor_s),
(self.actor_b_target, self.actor_b),
]:
for tp, sp in zip(target.parameters(), source.parameters()):
tp.data.copy_(self.tau * sp.data + (1.0 - self.tau) * tp.data)
return {
'actor_loss_s': actor_loss_s.item(),
'actor_loss_b': actor_loss_b.item(),
'critic_loss_s': critic_loss_s.item(),
'critic_loss_b': critic_loss_b.item(),
}
def save(self, path):
"""
Save models.
保存模型
"""
os.makedirs(path, exist_ok=True)
torch.save(self.actor_s.state_dict(), os.path.join(path, "actor_s.pth"))
torch.save(self.actor_b.state_dict(), os.path.join(path, "actor_b.pth"))
torch.save(self.critic_s.state_dict(), os.path.join(path, "critic_s.pth"))
torch.save(self.critic_b.state_dict(), os.path.join(path, "critic_b.pth"))
def load(self, path):
"""
Load models.
加载模型
"""
self.actor_s.load_state_dict(torch.load(os.path.join(path, "actor_s.pth"), map_location=self.device))
self.actor_b.load_state_dict(torch.load(os.path.join(path, "actor_b.pth"), map_location=self.device))
self.critic_s.load_state_dict(torch.load(os.path.join(path, "critic_s.pth"), map_location=self.device))
self.critic_b.load_state_dict(torch.load(os.path.join(path, "critic_b.pth"), map_location=self.device))
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
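Every baseline here ends its `update` with the same soft-update loop, i.e. Polyak averaging of target parameters. A minimal sketch over plain Python lists (τ=0.005 is illustrative; the actual value comes from `config['training']['tau']`):

```python
def soft_update(target_params, source_params, tau=0.005):
    """Polyak averaging: theta_tgt <- tau * theta_src + (1 - tau) * theta_tgt."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]

tgt = [0.0, 1.0]
src = [1.0, 0.0]
tgt = soft_update(tgt, src)  # target drifts slowly toward source: ≈ [0.005, 0.995]
```

With a small τ the target networks trail the online networks, which is what keeps the bootstrapped critic targets stable.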

code/baselines/pure_comp.py Normal file

@ -0,0 +1,245 @@
import os
import torch
import torch.nn.functional as F
import numpy as np
from agents.actor import Actor
from agents.critic import Critic
from agents.replay_buffer import ReplayBuffer
from agents.noise import OUNoise
"""
Baseline: PureCompetitive (纯竞争基线)
=====================================
Purpose (ablation):
- This baseline removes the cooperative component from the MADDPG framework.
- It serves as an ablation study to demonstrate that pure competition (λ=0) leads to resource wastage and suboptimal system-wide utility.
- 目的消融实验该基线移除了 MADDPG 框架中的协作成分作为消融实验用于证明纯竞争模式λ=0会导致资源浪费和系统级效用降低
Difference from Co-MADDPG:
1. Lambda (λ): Fixed at 0.0 (pure competition), whereas Co-MADDPG uses dynamic λ.
2. Update Order: Uses simultaneous updates for both actors, whereas Co-MADDPG uses Stackelberg update order.
3. Co-MADDPG 的区别
- Lambda (λ): 固定为 0.0纯竞争 Co-MADDPG 使用动态 λ
- 更新顺序两个参与者同时更新Simultaneous Update Co-MADDPG 使用 Stackelberg 博弈更新顺序
Contribution:
- Contributes to comparison figures showing the "Price of Anarchy" in resource allocation.
- 贡献用于对比图表展示资源分配中的无政府代价
"""
class PureCompetitive:
"""
PureCompetitive algorithm implementation.
纯竞争算法实现
"""
def __init__(self, config):
# Initialize configuration and device
# 初始化配置和设备
self.config = config
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Hyperparameters: Gamma (discount), Tau (soft update), Batch Size
# 超参数:折扣因子、软更新系数、批量大小
self.gamma = config['training']['gamma']
self.tau = config['training']['tau']
self.batch_size = config['training']['batch_size']
# Dimensions: State (subcarriers + 4), Action (3)
# 维度:状态(子载波 + 4、动作3
self.obs_dim = config['env']['num_subcarriers'] + 4
self.act_dim = 3
# Agents: Semantic (s) and Traditional (b) actors and target networks
# 智能体:语义 (s) 与 传统 (b) 参与者的 Actor 及其目标网络
self.actor_s = Actor(self.obs_dim, self.act_dim, config['network']['actor_hidden']).to(self.device)
self.actor_b = Actor(self.obs_dim, self.act_dim, config['network']['actor_hidden']).to(self.device)
self.actor_s_target = Actor(self.obs_dim, self.act_dim, config['network']['actor_hidden']).to(self.device)
self.actor_b_target = Actor(self.obs_dim, self.act_dim, config['network']['actor_hidden']).to(self.device)
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
# Joint Critics: Uses Centralized Training (obs_dim*2, act_dim*2)
# 联合 Critic使用中心化训练输入为两体观察与动作的并集
self.critic_s = Critic(self.obs_dim*2, self.act_dim*2, config['network']['critic_hidden']).to(self.device)
self.critic_b = Critic(self.obs_dim*2, self.act_dim*2, config['network']['critic_hidden']).to(self.device)
self.critic_s_target = Critic(self.obs_dim*2, self.act_dim*2, config['network']['critic_hidden']).to(self.device)
self.critic_b_target = Critic(self.obs_dim*2, self.act_dim*2, config['network']['critic_hidden']).to(self.device)
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
# Optimizers for all networks
# 所有网络的优化器
self.actor_s_optimizer = torch.optim.Adam(self.actor_s.parameters(), lr=config['training']['actor_lr'])
self.actor_b_optimizer = torch.optim.Adam(self.actor_b.parameters(), lr=config['training']['actor_lr'])
self.critic_s_optimizer = torch.optim.Adam(self.critic_s.parameters(), lr=config['training']['critic_lr'])
self.critic_b_optimizer = torch.optim.Adam(self.critic_b.parameters(), lr=config['training']['critic_lr'])
# Experience Replay and Noise for exploration
# 经验重放池与用于探索的噪声
self.replay_buffer = ReplayBuffer(config['training']['buffer_capacity'])
self.noise_s = OUNoise(self.act_dim, theta=config['training']['ou_theta'], sigma_init=config['training']['ou_sigma_init'], sigma_min=config['training']['ou_sigma_min'])
self.noise_b = OUNoise(self.act_dim, theta=config['training']['ou_theta'], sigma_init=config['training']['ou_sigma_init'], sigma_min=config['training']['ou_sigma_min'])
def select_action(self, obs_s, obs_b, explore=True):
"""
Select actions for both agents given observations.
根据观察结果为两个智能体选择动作
"""
obs_s = torch.FloatTensor(obs_s).unsqueeze(0).to(self.device)
obs_b = torch.FloatTensor(obs_b).unsqueeze(0).to(self.device)
self.actor_s.eval()
self.actor_b.eval()
with torch.no_grad():
# Forward pass through actors
# Actor 前向传播
act_s = self.actor_s(obs_s).cpu().numpy()[0]
act_b = self.actor_b(obs_b).cpu().numpy()[0]
self.actor_s.train()
self.actor_b.train()
if explore:
# Apply OU noise for exploration
# 应用 OU 噪声进行探索
act_s = np.clip(act_s + self.noise_s.sample(), 0.0, 1.0)
act_b = np.clip(act_b + self.noise_b.sample(), 0.0, 1.0)
else:
# Clip into the valid action range even without noise, matching the other baselines
# 即使不加噪声也裁剪到合法动作范围,与其他基线保持一致
act_s = np.clip(act_s, 0.0, 1.0)
act_b = np.clip(act_b, 0.0, 1.0)
return act_s, act_b
def compute_rewards(self, qoe_s, qoe_b, qoe_sys):
"""
Compute rewards based on pure competition (λ=0).
基于纯竞争计算奖励 (λ=0)
Formula: r_i = comp_self * qoe_i + comp_sys * qoe_sys
公式说明由于 λ=0奖励完全由竞争项组成仅考虑自身 QoE 以及系统总 QoE 的惩罚项
"""
lam = 0.0
r_s = self.config['reward']['comp_self'] * qoe_s + self.config['reward']['comp_sys'] * qoe_sys
r_b = self.config['reward']['comp_self'] * qoe_b + self.config['reward']['comp_sys'] * qoe_sys
return r_s, r_b, lam
def update(self):
"""
Update the networks using sampled experiences.
使用采样的经验更新网络
Update order: Simultaneous updates (both actors update based on current policy of the other).
更新顺序同时更新两个 Actor 基于对方当前的策略进行更新
"""
if len(self.replay_buffer) < self.batch_size:
return None
# Sample batch from replay buffer
# 从重放池采样批量数据
obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, dones = self.replay_buffer.sample(self.batch_size)
# Convert to tensors
# 转换为张量
obs_s = torch.FloatTensor(obs_s).to(self.device)
obs_b = torch.FloatTensor(obs_b).to(self.device)
act_s = torch.FloatTensor(act_s).to(self.device)
act_b = torch.FloatTensor(act_b).to(self.device)
rew_s = torch.FloatTensor(rew_s).unsqueeze(1).to(self.device)
rew_b = torch.FloatTensor(rew_b).unsqueeze(1).to(self.device)
next_obs_s = torch.FloatTensor(next_obs_s).to(self.device)
next_obs_b = torch.FloatTensor(next_obs_b).to(self.device)
dones = torch.FloatTensor(dones).unsqueeze(1).to(self.device)
# Centralized observations and actions
# 中心化观察与动作
joint_obs = torch.cat([obs_s, obs_b], dim=1)
joint_next_obs = torch.cat([next_obs_s, next_obs_b], dim=1)
joint_act = torch.cat([act_s, act_b], dim=1)
# 1. Critics Update (1. Critic 更新)
with torch.no_grad():
# Get target actions for next state
# 获取下一状态的目标动作
next_act_s = self.actor_s_target(next_obs_s)
next_act_b = self.actor_b_target(next_obs_b)
joint_next_act = torch.cat([next_act_s, next_act_b], dim=1)
# Compute target Q values
# 计算目标 Q 值
target_q_s = rew_s + self.gamma * (1 - dones) * self.critic_s_target(joint_next_obs, joint_next_act)
target_q_b = rew_b + self.gamma * (1 - dones) * self.critic_b_target(joint_next_obs, joint_next_act)
# Compute current Q values and MSE loss
# 计算当前 Q 值与均方误差损失
current_q_s = self.critic_s(joint_obs, joint_act)
current_q_b = self.critic_b(joint_obs, joint_act)
critic_loss_s = F.mse_loss(current_q_s, target_q_s)
critic_loss_b = F.mse_loss(current_q_b, target_q_b)
# Backpropagation for critics
# Critic 的反向传播
self.critic_s_optimizer.zero_grad()
critic_loss_s.backward()
self.critic_s_optimizer.step()
self.critic_b_optimizer.zero_grad()
critic_loss_b.backward()
self.critic_b_optimizer.step()
# 2. Actors Update (Simultaneous) (2. Actor 更新 - 同时进行)
new_act_s = self.actor_s(obs_s)
new_act_b = self.actor_b(obs_b)
# Calculate policy loss using joint critic
# 使用联合 Critic 计算策略损失
actor_loss_s = -self.critic_s(joint_obs, torch.cat([new_act_s, act_b], dim=1)).mean()
actor_loss_b = -self.critic_b(joint_obs, torch.cat([act_s, new_act_b], dim=1)).mean()
# Backpropagation for actors
# Actor 的反向传播
self.actor_s_optimizer.zero_grad()
actor_loss_s.backward()
self.actor_s_optimizer.step()
self.actor_b_optimizer.zero_grad()
actor_loss_b.backward()
self.actor_b_optimizer.step()
# 3. Soft Target Networks Update (3. 目标网络软更新)
for target_param, param in zip(self.critic_s_target.parameters(), self.critic_s.parameters()):
target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)
for target_param, param in zip(self.critic_b_target.parameters(), self.critic_b.parameters()):
target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)
for target_param, param in zip(self.actor_s_target.parameters(), self.actor_s.parameters()):
target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)
for target_param, param in zip(self.actor_b_target.parameters(), self.actor_b.parameters()):
target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)
return {
'actor_loss_s': actor_loss_s.item(),
'actor_loss_b': actor_loss_b.item(),
'critic_loss_s': critic_loss_s.item(),
'critic_loss_b': critic_loss_b.item()
}
def save(self, path):
"""
Save models to disk.
将模型保存至磁盘
"""
os.makedirs(path, exist_ok=True)
torch.save(self.actor_s.state_dict(), os.path.join(path, "actor_s.pth"))
torch.save(self.actor_b.state_dict(), os.path.join(path, "actor_b.pth"))
torch.save(self.critic_s.state_dict(), os.path.join(path, "critic_s.pth"))
torch.save(self.critic_b.state_dict(), os.path.join(path, "critic_b.pth"))
def load(self, path):
"""
Load models from disk.
从磁盘加载模型
"""
self.actor_s.load_state_dict(torch.load(os.path.join(path, "actor_s.pth"), map_location=self.device))
self.actor_b.load_state_dict(torch.load(os.path.join(path, "actor_b.pth"), map_location=self.device))
self.critic_s.load_state_dict(torch.load(os.path.join(path, "critic_s.pth"), map_location=self.device))
self.critic_b.load_state_dict(torch.load(os.path.join(path, "critic_b.pth"), map_location=self.device))
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
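All the critic updates in these files regress toward the same bootstrapped target. The scalar form, sketched standalone (γ=0.99 is illustrative; the code reads it from `config['training']['gamma']`):

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Critic regression target: r + gamma * (1 - done) * Q'(s', a')."""
    return reward + gamma * (1.0 - done) * next_q

mid_episode = td_target(1.0, 2.0, done=0.0)  # bootstraps from the target critic
terminal = td_target(1.0, 2.0, done=1.0)     # (1 - done) masks the bootstrap term
```

This is why `dones` is unsqueezed to shape `(batch, 1)` before use: it must broadcast against the `(batch, 1)` Q-value column.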

code/baselines/pure_coop.py Normal file

@ -0,0 +1,245 @@
import os
import torch
import torch.nn.functional as F
import numpy as np
from agents.actor import Actor
from agents.critic import Critic
from agents.replay_buffer import ReplayBuffer
from agents.noise import OUNoise
"""
Baseline: PureCooperative (纯协作基线)
=====================================
Purpose (ablation):
- This baseline removes the competitive component from the MADDPG framework.
- It serves as an ablation study to demonstrate the necessity of competitive modeling (λ < 1) for system performance.
- 目的消融实验该基线移除了 MADDPG 框架中的竞争成分作为消融实验用于证明在系统中引入竞争建模λ < 1对性能提升的必要性
Difference from Co-MADDPG:
1. Lambda (λ): Fixed at 1.0 (pure cooperation), whereas Co-MADDPG uses dynamic λ.
2. Update Order: Uses simultaneous updates for both actors, whereas Co-MADDPG uses Stackelberg update order.
3. Co-MADDPG 的区别
- Lambda (λ): 固定为 1.0纯协作 Co-MADDPG 使用动态 λ
- 更新顺序两个参与者同时更新Simultaneous Update Co-MADDPG 使用 Stackelberg 博弈更新顺序
Contribution:
- Contributes to performance comparison figures and tables (e.g., convergence speed and final QoE) to show how pure cooperation handles resource conflicts.
- 贡献用于性能对比图表如收敛速度和最终 QoE展示纯协作模式在处理资源冲突时的表现
"""
class PureCooperative:
"""
PureCooperative algorithm implementation.
纯协作算法实现
"""
def __init__(self, config):
# Initialize configuration and device
# 初始化配置和设备
self.config = config
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Hyperparameters: Gamma (discount), Tau (soft update), Batch Size
# 超参数:折扣因子、软更新系数、批量大小
self.gamma = config['training']['gamma']
self.tau = config['training']['tau']
self.batch_size = config['training']['batch_size']
# Dimensions: State (subcarriers + 4), Action (3)
# 维度:状态(子载波 + 4、动作3
self.obs_dim = config['env']['num_subcarriers'] + 4
self.act_dim = 3
# Agents: Semantic (s) and Traditional (b) actors and target networks
# 智能体:语义 (s) 与 传统 (b) 参与者的 Actor 及其目标网络
self.actor_s = Actor(self.obs_dim, self.act_dim, config['network']['actor_hidden']).to(self.device)
self.actor_b = Actor(self.obs_dim, self.act_dim, config['network']['actor_hidden']).to(self.device)
self.actor_s_target = Actor(self.obs_dim, self.act_dim, config['network']['actor_hidden']).to(self.device)
self.actor_b_target = Actor(self.obs_dim, self.act_dim, config['network']['actor_hidden']).to(self.device)
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
# Joint Critics: Uses Centralized Training (obs_dim*2, act_dim*2)
# 联合 Critic使用中心化训练输入为两体观察与动作的并集
self.critic_s = Critic(self.obs_dim*2, self.act_dim*2, config['network']['critic_hidden']).to(self.device)
self.critic_b = Critic(self.obs_dim*2, self.act_dim*2, config['network']['critic_hidden']).to(self.device)
self.critic_s_target = Critic(self.obs_dim*2, self.act_dim*2, config['network']['critic_hidden']).to(self.device)
self.critic_b_target = Critic(self.obs_dim*2, self.act_dim*2, config['network']['critic_hidden']).to(self.device)
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
# Optimizers for all networks
# 所有网络的优化器
self.actor_s_optimizer = torch.optim.Adam(self.actor_s.parameters(), lr=config['training']['actor_lr'])
self.actor_b_optimizer = torch.optim.Adam(self.actor_b.parameters(), lr=config['training']['actor_lr'])
self.critic_s_optimizer = torch.optim.Adam(self.critic_s.parameters(), lr=config['training']['critic_lr'])
self.critic_b_optimizer = torch.optim.Adam(self.critic_b.parameters(), lr=config['training']['critic_lr'])
# Experience Replay and Noise for exploration
# 经验重放池与用于探索的噪声
self.replay_buffer = ReplayBuffer(config['training']['buffer_capacity'])
self.noise_s = OUNoise(self.act_dim, theta=config['training']['ou_theta'], sigma_init=config['training']['ou_sigma_init'], sigma_min=config['training']['ou_sigma_min'])
self.noise_b = OUNoise(self.act_dim, theta=config['training']['ou_theta'], sigma_init=config['training']['ou_sigma_init'], sigma_min=config['training']['ou_sigma_min'])
def select_action(self, obs_s, obs_b, explore=True):
"""
Select actions for both agents given observations.
根据观察结果为两个智能体选择动作
"""
obs_s = torch.FloatTensor(obs_s).unsqueeze(0).to(self.device)
obs_b = torch.FloatTensor(obs_b).unsqueeze(0).to(self.device)
self.actor_s.eval()
self.actor_b.eval()
with torch.no_grad():
# Forward pass through actors
# Actor 前向传播
act_s = self.actor_s(obs_s).cpu().numpy()[0]
act_b = self.actor_b(obs_b).cpu().numpy()[0]
self.actor_s.train()
self.actor_b.train()
if explore:
# Apply OU noise for exploration
# 应用 OU 噪声进行探索
act_s = np.clip(act_s + self.noise_s.sample(), 0.0, 1.0)
act_b = np.clip(act_b + self.noise_b.sample(), 0.0, 1.0)
else:
# Clip into the valid action range even without noise, matching the other baselines
# 即使不加噪声也裁剪到合法动作范围,与其他基线保持一致
act_s = np.clip(act_s, 0.0, 1.0)
act_b = np.clip(act_b, 0.0, 1.0)
return act_s, act_b
def compute_rewards(self, qoe_s, qoe_b, qoe_sys):
"""
Compute rewards based on pure cooperation (λ=1).
基于纯协作计算奖励 (λ=1)
Formula: r_i = coop_self * qoe_i + coop_other * qoe_j + coop_sys * qoe_sys
公式说明由于 λ=1奖励完全由协作项组成考虑自身 QoE对方 QoE 以及系统总 QoE
"""
lam = 1.0
r_s = self.config['reward']['coop_self'] * qoe_s + self.config['reward']['coop_other'] * qoe_b + self.config['reward']['coop_sys'] * qoe_sys
r_b = self.config['reward']['coop_self'] * qoe_b + self.config['reward']['coop_other'] * qoe_s + self.config['reward']['coop_sys'] * qoe_sys
return r_s, r_b, lam
def update(self):
"""
Update the networks using sampled experiences.
使用采样的经验更新网络
Update order: Simultaneous updates (both actors update based on current policy of the other).
更新顺序同时更新两个 Actor 基于对方当前的策略进行更新
"""
if len(self.replay_buffer) < self.batch_size:
return None
# Sample batch from replay buffer
# 从重放池采样批量数据
obs_s, obs_b, act_s, act_b, rew_s, rew_b, next_obs_s, next_obs_b, dones = self.replay_buffer.sample(self.batch_size)
# Convert to tensors
# 转换为张量
obs_s = torch.FloatTensor(obs_s).to(self.device)
obs_b = torch.FloatTensor(obs_b).to(self.device)
act_s = torch.FloatTensor(act_s).to(self.device)
act_b = torch.FloatTensor(act_b).to(self.device)
rew_s = torch.FloatTensor(rew_s).unsqueeze(1).to(self.device)
rew_b = torch.FloatTensor(rew_b).unsqueeze(1).to(self.device)
next_obs_s = torch.FloatTensor(next_obs_s).to(self.device)
next_obs_b = torch.FloatTensor(next_obs_b).to(self.device)
dones = torch.FloatTensor(dones).unsqueeze(1).to(self.device)
# Centralized observations and actions
# 中心化观察与动作
joint_obs = torch.cat([obs_s, obs_b], dim=1)
joint_next_obs = torch.cat([next_obs_s, next_obs_b], dim=1)
joint_act = torch.cat([act_s, act_b], dim=1)
# 1. Critics Update (1. Critic 更新)
with torch.no_grad():
# Get target actions for next state
# 获取下一状态的目标动作
next_act_s = self.actor_s_target(next_obs_s)
next_act_b = self.actor_b_target(next_obs_b)
joint_next_act = torch.cat([next_act_s, next_act_b], dim=1)
# Compute target Q values
# 计算目标 Q 值
target_q_s = rew_s + self.gamma * (1 - dones) * self.critic_s_target(joint_next_obs, joint_next_act)
target_q_b = rew_b + self.gamma * (1 - dones) * self.critic_b_target(joint_next_obs, joint_next_act)
# Compute current Q values and MSE loss
# 计算当前 Q 值与均方误差损失
current_q_s = self.critic_s(joint_obs, joint_act)
current_q_b = self.critic_b(joint_obs, joint_act)
critic_loss_s = F.mse_loss(current_q_s, target_q_s)
critic_loss_b = F.mse_loss(current_q_b, target_q_b)
# Backpropagation for critics
# Critic 的反向传播
self.critic_s_optimizer.zero_grad()
critic_loss_s.backward()
self.critic_s_optimizer.step()
self.critic_b_optimizer.zero_grad()
critic_loss_b.backward()
self.critic_b_optimizer.step()
# 2. Actors Update (Simultaneous) (2. Actor 更新 - 同时进行)
new_act_s = self.actor_s(obs_s)
new_act_b = self.actor_b(obs_b)
# Calculate policy loss using joint critic
# 使用联合 Critic 计算策略损失
actor_loss_s = -self.critic_s(joint_obs, torch.cat([new_act_s, act_b], dim=1)).mean()
actor_loss_b = -self.critic_b(joint_obs, torch.cat([act_s, new_act_b], dim=1)).mean()
# Backpropagation for actors
# Actor 的反向传播
self.actor_s_optimizer.zero_grad()
actor_loss_s.backward()
self.actor_s_optimizer.step()
self.actor_b_optimizer.zero_grad()
actor_loss_b.backward()
self.actor_b_optimizer.step()
# 3. Soft Target Networks Update (3. 目标网络软更新)
for target_param, param in zip(self.critic_s_target.parameters(), self.critic_s.parameters()):
target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)
for target_param, param in zip(self.critic_b_target.parameters(), self.critic_b.parameters()):
target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)
for target_param, param in zip(self.actor_s_target.parameters(), self.actor_s.parameters()):
target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)
for target_param, param in zip(self.actor_b_target.parameters(), self.actor_b.parameters()):
target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)
return {
'actor_loss_s': actor_loss_s.item(),
'actor_loss_b': actor_loss_b.item(),
'critic_loss_s': critic_loss_s.item(),
'critic_loss_b': critic_loss_b.item()
}
def save(self, path):
"""
Save models to disk.
将模型保存至磁盘
"""
os.makedirs(path, exist_ok=True)
torch.save(self.actor_s.state_dict(), os.path.join(path, "actor_s.pth"))
torch.save(self.actor_b.state_dict(), os.path.join(path, "actor_b.pth"))
torch.save(self.critic_s.state_dict(), os.path.join(path, "critic_s.pth"))
torch.save(self.critic_b.state_dict(), os.path.join(path, "critic_b.pth"))
def load(self, path):
"""
Load models from disk.
从磁盘加载模型
"""
self.actor_s.load_state_dict(torch.load(os.path.join(path, "actor_s.pth"), map_location=self.device))
self.actor_b.load_state_dict(torch.load(os.path.join(path, "actor_b.pth"), map_location=self.device))
self.critic_s.load_state_dict(torch.load(os.path.join(path, "critic_s.pth"), map_location=self.device))
self.critic_b.load_state_dict(torch.load(os.path.join(path, "critic_b.pth"), map_location=self.device))
self.actor_s_target.load_state_dict(self.actor_s.state_dict())
self.actor_b_target.load_state_dict(self.actor_b.state_dict())
self.critic_s_target.load_state_dict(self.critic_s.state_dict())
self.critic_b_target.load_state_dict(self.critic_b.state_dict())
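
The four soft-update loops in `update()` above all apply the same Polyak averaging rule θ′ ← τ·θ + (1 − τ)·θ′ (with τ = 0.01 from the config). A minimal numpy sketch of the rule, using plain arrays in place of `nn.Module` parameters (the `soft_update` helper name is illustrative, not part of the project code):

```python
import numpy as np

# Polyak averaging, the rule used by every soft-update loop:
# theta_target <- tau * theta_source + (1 - tau) * theta_target
def soft_update(target_params, source_params, tau):
    for tp, sp in zip(target_params, source_params):
        tp *= (1.0 - tau)   # in-place, mirrors target_param.data.copy_(...)
        tp += tau * sp

rng = np.random.default_rng(0)
source = [rng.standard_normal((4, 2)), rng.standard_normal(2)]
target = [rng.standard_normal((4, 2)), rng.standard_normal(2)]
expected = [0.01 * s + 0.99 * t for s, t in zip(source, target)]
soft_update(target, source, tau=0.01)
assert all(np.allclose(t, e) for t, e in zip(target, expected))
```

With τ this small, the target networks track the online networks slowly, which stabilizes the bootstrapped targets used in the critic losses.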

View File

@ -0,0 +1,238 @@
import os
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from agents.actor import Actor
from agents.noise import OUNoise
"""
Baseline: SemanticOnly (仅语义基线)
=====================================
Purpose (ablation):
- This baseline removes the heterogeneous treatment of different user groups.
- It treats all users as semantic users and uses a single DDPG policy to control both groups.
- It serves as an ablation study to demonstrate the benefit of having heterogeneous, specialized policies for semantic vs. traditional users.
- 目的消融实验该基线移除了对不同用户组的异构处理它将所有用户视为语义用户并使用单一的 DDPG 策略同时控制两个用户组作为消融实验用于证明为语义用户和传统用户分别设计专门的异构策略的收益
Difference from Co-MADDPG:
1. Heterogeneity: Homogeneous policy (all semantic) vs Heterogeneous policies.
2. Architecture: Single DDPG agent for both groups vs Multi-agent (Co-MADDPG).
3. Co-MADDPG 的区别
- 异构性同构策略全部视为语义用户 vs 异构策略
- 架构 DDPG 智能体控制两组 vs 多智能体 (Co-MADDPG)
Contribution:
- Contributes to performance analysis regarding user heterogeneity and specialized resource allocation.
- 贡献用于关于用户异构性和专门化资源分配的性能分析
"""
class SemanticCritic(nn.Module):
"""
Single-agent critic: observation + action Q-value.
单智能体 Critic观察 + 动作 Q
"""
def __init__(self, obs_dim, act_dim, hidden_sizes=[256, 256, 128]):
super().__init__()
assert len(hidden_sizes) == 3
self.net = nn.Sequential(
nn.Linear(obs_dim + act_dim, hidden_sizes[0]),
nn.ReLU(),
nn.Linear(hidden_sizes[0], hidden_sizes[1]),
nn.ReLU(),
nn.Linear(hidden_sizes[1], hidden_sizes[2]),
nn.ReLU(),
nn.Linear(hidden_sizes[2], 1),
)
def forward(self, obs, act):
# Forward pass for single agent
# 单智能体前向传播
return self.net(torch.cat([obs, act], dim=1))
class SemanticBuffer:
"""
Replay buffer for SemanticOnly baseline.
仅语义基线的重放池
Wrapper that accepts the 9-arg multi-agent push but stores single-agent transitions.
接收多智能体 9 参数 push 请求,但内部存储单智能体转换数据。
"""
def __init__(self, capacity):
self.buffer = deque(maxlen=capacity)
def push(self, obs_s, obs_b, act_s, act_b, rew_s, rew_b,
next_obs_s, next_obs_b, done=False):
"""
Store only semantic agent's observation/action and average reward.
仅存储语义智能体的观察/动作以及平均奖励
"""
self.buffer.append((
np.asarray(obs_s, dtype=np.float32),
np.asarray(act_s, dtype=np.float32),
float(0.5 * (rew_s + rew_b)),
np.asarray(next_obs_s, dtype=np.float32),
float(done),
))
def sample(self, batch_size):
"""Sample batch."""
batch = random.sample(self.buffer, batch_size)
obs, act, rew, next_obs, dones = zip(*batch)
return (np.array(obs), np.array(act), np.array(rew, dtype=np.float32),
np.array(next_obs), np.array(dones, dtype=np.float32))
def __len__(self):
return len(self.buffer)
class SemanticOnly:
"""
SemanticOnly algorithm implementation.
仅语义算法实现
"""
def __init__(self, config):
# Initialize configuration and device
# 初始化配置和设备
self.config = config
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Hyperparameters
# 超参数
self.gamma = config['training']['gamma']
self.tau = config['training']['tau']
self.batch_size = config['training']['batch_size']
# Dimensions
# 维度
self.obs_dim = config['env']['num_subcarriers'] + 4
self.act_dim = 3
# Network configurations
# 网络配置
hidden_a = config['network']['actor_hidden']
critic_hidden = [256, 256, 128]
# Single Actor and Critic policy
# 单一 Actor 与 Critic 策略
self.actor = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_target = Actor(self.obs_dim, self.act_dim, hidden_a).to(self.device)
self.actor_target.load_state_dict(self.actor.state_dict())
self.critic = SemanticCritic(self.obs_dim, self.act_dim, critic_hidden).to(self.device)
self.critic_target = SemanticCritic(self.obs_dim, self.act_dim, critic_hidden).to(self.device)
self.critic_target.load_state_dict(self.critic.state_dict())
# Optimizers
# 优化器
self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=config['training']['actor_lr'])
self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=config['training']['critic_lr'])
# Buffer and Noise
# 重放池与噪声
self.replay_buffer = SemanticBuffer(config['training']['buffer_capacity'])
self.noise_s = OUNoise(self.act_dim, theta=config['training']['ou_theta'],
sigma_init=config['training']['ou_sigma_init'],
sigma_min=config['training']['ou_sigma_min'])
# Alias for compatibility with training loop
# 与训练循环兼容的别名
self.noise_b = self.noise_s
def select_action(self, obs_s, obs_b, explore=True):
"""
Select actions for both groups using the same policy.
使用相同策略为两组用户选择动作
"""
self.actor.eval()
with torch.no_grad():
obs_t = torch.FloatTensor(obs_s).unsqueeze(0).to(self.device)
act = self.actor(obs_t).cpu().numpy()[0]
self.actor.train()
if explore:
# Apply OU noise
# 应用 OU 噪声
act = np.clip(act + self.noise_s.sample(), 0.0, 1.0)
else:
act = np.clip(act, 0.0, 1.0)
# Return the same action for both groups
# 为两组用户返回相同的动作
return act.copy(), act.copy()
def compute_rewards(self, qoe_s, qoe_b, qoe_sys):
"""
Compute rewards assuming full cooperation (λ=1).
假设完全协作 (λ=1) 计算奖励
Formula: r = 0.5 * (qoe_s + qoe_b)
Since every user is treated as semantic, the objective is the overall QoE.
公式说明:由于全部视为语义用户,目标是最大化整体 QoE。
"""
lam = 1.0
r = 0.5 * (qoe_s + qoe_b)
return r, r, lam
def update(self):
"""
Update the single DDPG agent.
更新单个 DDPG 智能体
"""
if len(self.replay_buffer) < self.batch_size:
return None
# Sample from buffer
# 从重放池采样
obs, act, rew, next_obs, dones = self.replay_buffer.sample(self.batch_size)
# To tensors
# 转换为张量
obs_t = torch.FloatTensor(obs).to(self.device)
act_t = torch.FloatTensor(act).to(self.device)
rew_t = torch.FloatTensor(rew).unsqueeze(1).to(self.device)
next_obs_t = torch.FloatTensor(next_obs).to(self.device)
dones_t = torch.FloatTensor(dones).unsqueeze(1).to(self.device)
# 1. Critic update (1. Critic 更新)
with torch.no_grad():
next_act = self.actor_target(next_obs_t)
target_q = rew_t + self.gamma * (1 - dones_t) * self.critic_target(next_obs_t, next_act)
current_q = self.critic(obs_t, act_t)
critic_loss = F.mse_loss(current_q, target_q)
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()
# 2. Actor update (2. Actor 更新)
new_act = self.actor(obs_t)
actor_loss = -self.critic(obs_t, new_act).mean()
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
# 3. Soft update targets (3. 目标网络软更新)
for target, source in [
(self.critic_target, self.critic),
(self.actor_target, self.actor),
]:
for tp, sp in zip(target.parameters(), source.parameters()):
tp.data.copy_(self.tau * sp.data + (1.0 - self.tau) * tp.data)
return {'actor_loss': actor_loss.item(), 'critic_loss': critic_loss.item()}
def save(self, path):
"""Save models."""
os.makedirs(path, exist_ok=True)
torch.save(self.actor.state_dict(), os.path.join(path, "actor.pth"))
torch.save(self.critic.state_dict(), os.path.join(path, "critic.pth"))
def load(self, path):
"""Load models."""
self.actor.load_state_dict(torch.load(os.path.join(path, "actor.pth"), map_location=self.device))
self.critic.load_state_dict(torch.load(os.path.join(path, "critic.pth"), map_location=self.device))
self.actor_target.load_state_dict(self.actor.state_dict())
self.critic_target.load_state_dict(self.critic.state_dict())
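
The `SemanticBuffer.push` above keeps the multi-agent call signature used by the shared training loop but silently reduces it to a single-agent transition: only the semantic observation/action survive, and the stored reward is the mean of the two per-group rewards. A standalone sketch of that reduction (a simplified re-implementation for illustration, not the project class):

```python
from collections import deque

import numpy as np

class MiniSemanticBuffer:
    """Accepts the 9-argument multi-agent push, stores single-agent tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, obs_s, obs_b, act_s, act_b, rew_s, rew_b,
             next_obs_s, next_obs_b, done=False):
        # Keep only the semantic agent's view; average the two rewards.
        self.buffer.append((
            np.asarray(obs_s, dtype=np.float32),
            np.asarray(act_s, dtype=np.float32),
            float(0.5 * (rew_s + rew_b)),
            np.asarray(next_obs_s, dtype=np.float32),
            float(done),
        ))

buf = MiniSemanticBuffer(capacity=10)
buf.push([0.1, 0.2], [0.3, 0.4], [1.0, 0.0, 0.5], [0.0, 1.0, 0.5],
         rew_s=1.0, rew_b=0.5, next_obs_s=[0.2, 0.3], next_obs_b=[0.4, 0.5])
_, _, reward, _, done = buf.buffer[0]
assert reward == 0.75 and done == 0.0  # mean of 1.0 and 0.5
```

Keeping the 9-argument signature means `train.py` can push transitions identically for Co-MADDPG and this baseline; only the storage format differs.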

View File

@ -0,0 +1,296 @@
import os
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
"""
Baseline: SingleAgentDQN (单智能体 DQN 基线)
=====================================
Purpose (non-MARL baseline):
- This baseline represents a traditional single-agent approach to the resource allocation problem.
- It uses a centralized DQN that controls both groups by discretizing the continuous action space.
- 目的非多智能体基线该基线代表了解决资源分配问题的传统单智能体方法它使用中心化 DQN通过对连续动作空间进行离散化同时控制两个用户组
Difference from Co-MADDPG:
1. Algorithm Class: Non-MARL (DQN) vs MARL (Co-MADDPG).
2. Action Space: Discrete (48 actions) vs Continuous.
3. Architecture: Centralized control vs Decentralized execution with CTDE.
4. Exploration: Epsilon-greedy vs OU Noise.
5. Co-MADDPG 的区别
- 算法类别非多智能体 (DQN) vs 多智能体 (Co-MADDPG)
- 动作空间离散48 种动作组合 vs 连续
- 架构中心化控制 vs CTDE 架构下的分布式执行
- 探索机制ε-greedy vs OU 噪声
Contribution:
- Contributes to performance tables showing the limitations of discretization and centralized control in complex multi-user scenarios.
- 贡献用于性能表展示在复杂多用户场景下动作离散化和中心化控制的局限性
"""
# ---- Discrete action mapping (离散动作映射) ----
# 4 levels for subcarrier fraction, 4 for power fraction, 3 for m_param
# 子载波比例 4 级,功率比例 4 级m 参数 3 级
N_SUB_LEVELS = [0.25, 0.5, 0.75, 1.0]
P_FRAC_LEVELS = [0.25, 0.5, 0.75, 1.0]
M_PARAM_LEVELS = [0.33, 0.66, 1.0]
NUM_ACTIONS = len(N_SUB_LEVELS) * len(P_FRAC_LEVELS) * len(M_PARAM_LEVELS) # 48 combinations
# Build lookup table: index -> (n_sub_frac, p_frac, m_param)
# 构建查找表:索引 -> (子载波比例, 功率比例, m 参数)
_ACTION_TABLE = []
for n in N_SUB_LEVELS:
for p in P_FRAC_LEVELS:
for m in M_PARAM_LEVELS:
_ACTION_TABLE.append(np.array([n, p, m], dtype=np.float32))
class DQNNet(nn.Module):
"""
Simple Fully Connected Q-network.
简单的全连接 Q 网络
"""
def __init__(self, state_dim, num_actions):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, num_actions),
)
def forward(self, x):
"""Map state to Q-values for each discrete action."""
return self.net(x)
class DQNReplayBuffer:
"""
Wrapper buffer for SingleAgentDQN.
单智能体 DQN 的封装重放池
Accepts the multi-agent 9-argument signature but stores transitions suitable for DQN.
接收多智能体的 9 参数签名,但内部存储适合 DQN 的转换数据。
"""
def __init__(self, capacity):
self.buffer = deque(maxlen=capacity)
self._last_action_s_idx = 0
self._last_action_b_idx = 0
def set_last_actions(self, idx_s, idx_b):
"""Store the discrete action indices used."""
self._last_action_s_idx = idx_s
self._last_action_b_idx = idx_b
def push(self, obs_s, obs_b, act_s, act_b, rew_s, rew_b,
next_obs_s, next_obs_b, done=False):
"""
Store multi-agent step as a single-agent transition.
将多智能体步骤作为单智能体转换存储
"""
# Concatenate observations for centralized state
# 拼接观察值以形成中心化状态
state = np.concatenate([np.asarray(obs_s, dtype=np.float32),
np.asarray(obs_b, dtype=np.float32)])
next_state = np.concatenate([np.asarray(next_obs_s, dtype=np.float32),
np.asarray(next_obs_b, dtype=np.float32)])
# Average rewards for single-agent scalar reward
# 对奖励求平均以获得单智能体标量奖励
reward = 0.5 * (float(rew_s) + float(rew_b))
self.buffer.append((state, self._last_action_s_idx, self._last_action_b_idx,
reward, next_state, float(done)))
def sample(self, batch_size):
"""Sample a batch of transitions."""
batch = random.sample(self.buffer, batch_size)
states, a_s, a_b, rewards, next_states, dones = zip(*batch)
return (np.array(states), np.array(a_s), np.array(a_b),
np.array(rewards, dtype=np.float32),
np.array(next_states), np.array(dones, dtype=np.float32))
def __len__(self):
return len(self.buffer)
class SingleAgentDQN:
"""
SingleAgentDQN algorithm implementation.
单智能体 DQN 算法实现
"""
def __init__(self, config):
# Initialize configuration and device
# 初始化配置和设备
self.config = config
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Hyperparameters
# 超参数
self.gamma = config['training']['gamma']
self.batch_size = config['training']['batch_size']
self.tau = config['training']['tau']
# Dimensions: concatenated state
# 维度:拼接后的状态
self.obs_dim = config['env']['num_subcarriers'] + 4
self.state_dim = self.obs_dim * 2
self.num_actions = NUM_ACTIONS
# Two DQN heads: one for semantic (s) actions, one for traditional (b) actions
# 两个 DQN 头:一个用于语义动作 (s),一个用于传统动作 (b)
self.q_net_s = DQNNet(self.state_dim, self.num_actions).to(self.device)
self.q_net_b = DQNNet(self.state_dim, self.num_actions).to(self.device)
self.q_target_s = DQNNet(self.state_dim, self.num_actions).to(self.device)
self.q_target_b = DQNNet(self.state_dim, self.num_actions).to(self.device)
self.q_target_s.load_state_dict(self.q_net_s.state_dict())
self.q_target_b.load_state_dict(self.q_net_b.state_dict())
# Optimizers
# 优化器
lr = config['training'].get('actor_lr', 1e-4)
self.optimizer_s = torch.optim.Adam(self.q_net_s.parameters(), lr=lr)
self.optimizer_b = torch.optim.Adam(self.q_net_b.parameters(), lr=lr)
# Epsilon-greedy exploration parameters
# ε-greedy 探索参数
self.epsilon = 1.0
self.epsilon_min = 0.01
self.epsilon_decay_episodes = 3000
# Specialized Replay Buffer
# 专用的重放池
self.replay_buffer = DQNReplayBuffer(config['training']['buffer_capacity'])
# Discrete action index tracking
# 离散动作索引追踪
self._last_action_s_idx = 0
self._last_action_b_idx = 0
# EpsilonAdapter: exposes a decay_sigma method so train.py's existing
# noise-decay call also drives epsilon decay for this baseline
# EpsilonAdapter:暴露 decay_sigma 方法,使 train.py 现有的噪声衰减调用同样驱动本基线的 ε 衰减
self.noise_s = type('EpsilonAdapter', (), {
'decay_sigma': lambda _, ep: self._decay_epsilon(ep)
})()
def select_action(self, obs_s, obs_b, explore=True):
"""
Select discrete actions using epsilon-greedy policy.
使用 ε-greedy 策略选择离散动作
"""
state = np.concatenate([obs_s, obs_b]).astype(np.float32)
state_t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
if explore and random.random() < self.epsilon:
# Random exploration
# 随机探索
idx_s = random.randrange(self.num_actions)
idx_b = random.randrange(self.num_actions)
else:
# Exploit learned Q-values
# 利用已学习的 Q 值
self.q_net_s.eval()
self.q_net_b.eval()
with torch.no_grad():
q_s = self.q_net_s(state_t)
q_b = self.q_net_b(state_t)
self.q_net_s.train()
self.q_net_b.train()
idx_s = q_s.argmax(dim=1).item()
idx_b = q_b.argmax(dim=1).item()
# Update last indices for the buffer push
# 更新用于存入重放池的最后索引
self._last_action_s_idx = idx_s
self._last_action_b_idx = idx_b
self.replay_buffer.set_last_actions(idx_s, idx_b)
# Return continuous actions from lookup table
# 从查找表中返回对应的连续动作
return _ACTION_TABLE[idx_s].copy(), _ACTION_TABLE[idx_b].copy()
def compute_rewards(self, qoe_s, qoe_b, qoe_sys):
"""
Compute scalar reward for single agent.
为单智能体计算标量奖励
Formula: r = 0.5 * (qoe_s + qoe_b)
Since a single agent controls both groups, the reward is the mean of the two groups' QoE.
公式说明:由于是单智能体控制,全局奖励取两组用户 QoE 的均值。
"""
lam = 0.5
r = 0.5 * (qoe_s + qoe_b)
return r, r, lam
def update(self):
"""
Update the Q-networks.
更新 Q 网络
"""
if len(self.replay_buffer) < self.batch_size:
return None
# Sample batch
# 采样批量数据
states, a_s, a_b, rewards, next_states, dones = \
self.replay_buffer.sample(self.batch_size)
# To tensors
# 转换为张量
states_t = torch.FloatTensor(states).to(self.device)
next_states_t = torch.FloatTensor(next_states).to(self.device)
rewards_t = torch.FloatTensor(rewards).unsqueeze(1).to(self.device)
dones_t = torch.FloatTensor(dones).unsqueeze(1).to(self.device)
a_s_t = torch.LongTensor(a_s).unsqueeze(1).to(self.device)
a_b_t = torch.LongTensor(a_b).unsqueeze(1).to(self.device)
# 1. Update Semantic Head (1. 更新语义分支)
q_values_s = self.q_net_s(states_t).gather(1, a_s_t)
with torch.no_grad():
next_q_s = self.q_target_s(next_states_t).max(1, keepdim=True)[0]
target_s = rewards_t + self.gamma * (1 - dones_t) * next_q_s
loss_s = F.mse_loss(q_values_s, target_s)
self.optimizer_s.zero_grad()
loss_s.backward()
self.optimizer_s.step()
# 2. Update Traditional Head (2. 更新传统分支)
q_values_b = self.q_net_b(states_t).gather(1, a_b_t)
with torch.no_grad():
next_q_b = self.q_target_b(next_states_t).max(1, keepdim=True)[0]
target_b = rewards_t + self.gamma * (1 - dones_t) * next_q_b
loss_b = F.mse_loss(q_values_b, target_b)
self.optimizer_b.zero_grad()
loss_b.backward()
self.optimizer_b.step()
# 3. Soft update target networks (3. 目标网络软更新)
for target, source in [
(self.q_target_s, self.q_net_s),
(self.q_target_b, self.q_net_b),
]:
for tp, sp in zip(target.parameters(), source.parameters()):
tp.data.copy_(self.tau * sp.data + (1.0 - self.tau) * tp.data)
return {'loss_s': loss_s.item(), 'loss_b': loss_b.item()}
def _decay_epsilon(self, episode):
"""
Decay epsilon over episodes.
随训练轮数衰减 ε
"""
frac = min(1.0, episode / max(1, self.epsilon_decay_episodes))
self.epsilon = self.epsilon + frac * (self.epsilon_min - self.epsilon)
def save(self, path):
"""Save Q-nets."""
os.makedirs(path, exist_ok=True)
torch.save(self.q_net_s.state_dict(), os.path.join(path, "q_net_s.pth"))
torch.save(self.q_net_b.state_dict(), os.path.join(path, "q_net_b.pth"))
def load(self, path):
"""Load Q-nets."""
self.q_net_s.load_state_dict(torch.load(os.path.join(path, "q_net_s.pth"), map_location=self.device))
self.q_net_b.load_state_dict(torch.load(os.path.join(path, "q_net_b.pth"), map_location=self.device))
self.q_target_s.load_state_dict(self.q_net_s.state_dict())
self.q_target_b.load_state_dict(self.q_net_b.state_dict())
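
The lookup table built above is the Cartesian product 4 × 4 × 3 = 48, enumerated with `n` outermost and `m` innermost, so a flat DQN index decodes as `(n_idx * 4 + p_idx) * 3 + m_idx`. A standalone sketch of the mapping and its inverse (the `index_to_action` / `action_to_index` helper names are illustrative, not part of the project code):

```python
import numpy as np

N_SUB_LEVELS = [0.25, 0.5, 0.75, 1.0]
P_FRAC_LEVELS = [0.25, 0.5, 0.75, 1.0]
M_PARAM_LEVELS = [0.33, 0.66, 1.0]

# index -> (n_sub_frac, p_frac, m_param); n outermost, m innermost
ACTION_TABLE = [np.array([n, p, m], dtype=np.float32)
                for n in N_SUB_LEVELS
                for p in P_FRAC_LEVELS
                for m in M_PARAM_LEVELS]

def index_to_action(idx):
    """Decode a flat Q-head argmax index into a continuous triple."""
    return ACTION_TABLE[idx].copy()

def action_to_index(n_idx, p_idx, m_idx):
    """Recover the flat index from per-dimension level indices."""
    return (n_idx * len(P_FRAC_LEVELS) + p_idx) * len(M_PARAM_LEVELS) + m_idx

assert len(ACTION_TABLE) == 48
# Level indices (2, 1, 0) -> levels (0.75, 0.5, 0.33)
idx = action_to_index(2, 1, 0)
assert idx == 27
assert np.allclose(index_to_action(idx), [0.75, 0.5, 0.33])
```

This coarse 48-point grid is exactly what the baseline is meant to expose: the continuous policies of Co-MADDPG can place mass anywhere in [0, 1]³, while the DQN is restricted to these combinations.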

1
code/configs/__init__.py Normal file
View File

@ -0,0 +1 @@
# configs/__init__.py

81
code/configs/default.yaml Normal file
View File

@ -0,0 +1,81 @@
# =============================================================================
# Co-MADDPG Wireless Resource Allocation — Default Configuration
# =============================================================================
# All hyperparameters follow the paper's specifications for semantic-aware
# cooperative multi-agent resource allocation in OFDMA systems.
# =============================================================================
env:
# OFDMA system parameters
num_subcarriers: 64 # N: total number of OFDM subcarriers
bandwidth: 10.0e+6 # B: total system bandwidth (Hz)
subcarrier_spacing: 156250.0 # Δf: subcarrier spacing (Hz), B/N
max_power: 1.0 # P_max: maximum transmit power per user (W)
noise_psd: -174 # N0: noise power spectral density (dBm/Hz)
carrier_freq: 3.5 # f_c: carrier frequency (GHz)
# Cell geometry
min_distance: 50 # d_min: minimum BS-user distance (m)
max_distance: 500 # d_max: maximum BS-user distance (m)
# User configuration
num_semantic_users: 3 # K_s: number of semantic communication users
num_traditional_users: 3 # K_b: number of traditional bit-rate users
# QoS constraints
min_rate_req: 5.0e+5 # R_min: minimum rate requirement for traditional users (bps)
# Semantic compression ratio bounds
rho_max: 1.0 # ρ_max: maximum compression ratio (no compression)
rho_min: 0.05 # ρ_min: minimum compression ratio
# QoE weighting factors
w1: 0.7 # w1: weight for semantic similarity (SSIM)
w2: 0.3 # w2: weight for compression efficiency
training:
# Episode configuration
max_episodes: 5000 # total training episodes
max_steps: 200 # maximum steps per episode
# Replay buffer and sampling
batch_size: 256 # mini-batch size for gradient updates
buffer_capacity: 100000 # replay buffer capacity
# Learning rates
actor_lr: 1.0e-4 # actor network learning rate
critic_lr: 3.0e-4 # critic network learning rate
# Discount and soft-update
gamma: 0.95 # discount factor γ
tau: 0.01 # soft target update rate τ
# Ornstein-Uhlenbeck exploration noise
ou_sigma_init: 0.2 # initial noise standard deviation
ou_sigma_min: 0.01 # minimum noise standard deviation
ou_theta: 0.15 # OU mean-reversion rate θ
# Cooperative mechanism parameters
beta: 5.0 # β: cooperation benefit scaling factor
q_threshold: 0.6 # Q-value threshold for cooperation mode switch
update_interval: 5 # target network update interval (episodes)
# Reproducibility
seed: 42
network:
# Actor network hidden layer dimensions
actor_hidden: [256, 256, 128]
# Critic network hidden layer dimensions
critic_hidden: [512, 512, 256]
reward:
# Cooperative mode reward weights
coop_self: 0.5 # α_self: weight on own reward (cooperative)
coop_other: 0.3 # α_other: weight on other agents' reward
coop_sys: 0.2 # α_sys: weight on system-level reward
# Competitive mode reward weights
comp_self: 0.8 # α_self: weight on own reward (competitive)
comp_sys: 0.2 # α_sys: weight on system-level reward
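
Several entries above are coupled rather than independent: Δf equals B/N, and together with `noise_psd` it determines the per-subcarrier noise power σ² = N₀·Δf used by the channel model (Eq. 7). A quick sanity check of those relationships, with a plain dict standing in for the parsed YAML:

```python
env = {
    "num_subcarriers": 64,
    "bandwidth": 10.0e6,             # B (Hz)
    "subcarrier_spacing": 156250.0,  # Δf (Hz)
    "noise_psd": -174,               # N0 (dBm/Hz)
}

# Δf = B / N
assert env["subcarrier_spacing"] == env["bandwidth"] / env["num_subcarriers"]

# σ² = N0_linear · Δf, converting N0 from dBm/Hz to W/Hz first (Eq. 7)
n0_linear = 10.0 ** ((env["noise_psd"] - 30.0) / 10.0)
noise_power = n0_linear * env["subcarrier_spacing"]
assert abs(noise_power - 6.22e-16) < 1e-18   # ≈ 6.22e-16 W per subcarrier
```

If `num_subcarriers` or `bandwidth` is changed, `subcarrier_spacing` must be updated to match, since the config stores the derived value explicitly rather than computing it.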

6
code/envs/__init__.py Normal file
View File

@ -0,0 +1,6 @@
"""Environment modules for Co-MADDPG wireless resource allocation."""
from .channel_model import ChannelModel
from .semantic_module import SemanticModule
__all__ = ["ChannelModel", "SemanticModule"]

197
code/envs/channel_model.py Normal file
View File

@ -0,0 +1,197 @@
"""
无线资源分配信道模型 / Channel model for OFDMA wireless resource allocation.
该模块实现了 3GPP 风格的路径损耗模型和多用户 OFDMA 下行链路系统的复信道增益生成
所有公式遵循论文中的公式 (5)-(8)
This module implements the 3GPP-style path loss model and complex channel gain generation
for a multi-user OFDMA downlink system. All formulas follow the paper's equations (5)(8).
作者/Author: Sisyphus-Junior
日期/Date: 2026-02-28
论文引用/Paper Reference: Co-MADDPG based Resource Allocation for Semantic Communication
依赖/Dependencies: numpy
"""
import numpy as np
class ChannelModel:
"""
多用户 OFDMA 系统的频率选择性信道模型
Frequency-selective channel model for multi-user OFDMA systems.
生成包含距离相关路径损耗和瑞利衰落的每子载波复信道增益并计算每子载波的信噪比 (SNR)
Generates per-subcarrier complex channel gains incorporating distance-dependent
path loss with Rayleigh fading, and computes per-subcarrier SNR values.
Parameters
----------
config : dict
完整的配置字典必须包含 "env" 部分且具有 carrier_freq, noise_psd subcarrier_spacing
Full configuration dictionary (must contain an "env" section with keys
"carrier_freq", "noise_psd", and "subcarrier_spacing").
"""
def __init__(self, config: dict) -> None:
# 初始化环境配置 / Initialize environment configurations
self.config = config
env = config["env"]
# 载波频率 (GHz) / Carrier frequency in GHz
self._carrier_freq_ghz: float = env["carrier_freq"]
# 噪声功率谱密度 (dBm/Hz) / Noise power spectral density in dBm/Hz
self._noise_psd_dbm: float = env["noise_psd"]
# 子载波间隔 (Hz) / Subcarrier spacing in Hz
self._subcarrier_spacing: float = env["subcarrier_spacing"]
# ------------------------------------------------------------------
# 路径损耗 / Path loss
# ------------------------------------------------------------------
def path_loss(self, distance: float) -> float:
"""
计算与距离相关的路径损耗 (dB)
Compute distance-dependent path loss in dB.
使用 3GPP Urban Micro (UMi) NLOS 模型 (公式 5):
Uses the 3GPP Urban Micro (UMi) NLOS model (Eq. 5):
PL(d) = 36.7 * log10(d) + 22.7 + 26 * log10(fc)
其中 d 的单位为米fc 的单位为 GHz
where *d* is in metres and *fc* is in GHz.
Parameters
----------
distance : float or np.ndarray
收发机之间的距离单位为米
Transmitterreceiver distance(s) in metres.
Returns
-------
float or np.ndarray
路径损耗值单位为 dB
Path loss value(s) in dB.
"""
fc = self._carrier_freq_ghz
# 应用 3GPP UMi NLOS 公式 / Apply 3GPP UMi NLOS formula - Eq.(5)
return 36.7 * np.log10(distance) + 22.7 + 26.0 * np.log10(fc)
# ------------------------------------------------------------------
# 信道生成 / Channel generation
# ------------------------------------------------------------------
def generate_channel(
self, distances: np.ndarray, num_subcarriers: int
) -> np.ndarray:
"""
生成所有用户和子载波的复信道增益
Generate complex channel gains for all users and subcarriers.
每个元素 h_{k,n} 服从复高斯分布 CN(0, 10^{-PL/10}) (公式 6)
独立循环对称复高斯分布其方差等于线性尺度的逆路径损耗
Each element h_{k,n} is drawn from CN(0, 10^{-PL/10}) (Eq. 6), i.e.
independent circularly-symmetric complex Gaussian with variance
equal to the linear-scale inverse path loss.
Parameters
----------
distances : array_like, shape (K,)
每个用户距离基站的距离
Distance of each user from the base station (metres).
num_subcarriers : int
OFDM 子载波数量 N
Number of OFDM subcarriers *N*.
Returns
-------
np.ndarray, shape (K, N)
复信道增益矩阵
Complex channel gain matrix.
"""
distances = np.asarray(distances, dtype=np.float64)
K = len(distances)
N = num_subcarriers
# 每用户路径损耗 -> 线性尺度信道方差 / Per-user path loss -> linear-scale channel variance
pl_db = self.path_loss(distances) # (K,)
# 方差 = 10^(-PL/10) / Variance = 10^(-PL/10) - Eq.(6)
variance = 10.0 ** (-pl_db / 10.0) # (K,)
variance = variance.reshape(K, 1) # (K, 1) 用于广播 / for broadcasting
# 复高斯:每个分量服从 N(0, var/2) / Complex Gaussian: each component ~ N(0, var/2)
std = np.sqrt(variance / 2.0)
# 生成实部和虚部 / Generate real and imaginary parts
real_part = np.random.randn(K, N) * std
imag_part = np.random.randn(K, N) * std
# 返回复增益 / Return complex gains
return real_part + 1j * imag_part
# ------------------------------------------------------------------
# SNR 计算 / SNR computation
# ------------------------------------------------------------------
def compute_snr(
self,
channel_gains: np.ndarray,
power_alloc: np.ndarray,
noise_power: float,
) -> np.ndarray:
"""
计算每个用户的每子载波信噪比 (SNR)
Compute per-subcarrier SNR for every user.
γ_{k,n} = p_{k,n} * |h_{k,n}|² / σ² (公式 8)
γ_{k,n} = p_{k,n} · |h_{k,n}|² / σ² (Eq. 8)
Parameters
----------
channel_gains : np.ndarray, shape (K, N)
复信道增益矩阵
Complex channel gain matrix.
power_alloc : np.ndarray, shape (K, N)
每个用户在每个子载波上分配的功率瓦特
Power allocated by each user on each subcarrier (Watts).
noise_power : float
每子载波的噪声功率 σ²瓦特
Noise power σ² per subcarrier (Watts).
Returns
-------
np.ndarray, shape (K, N)
SNR 线性尺度
SNR values (linear scale).
"""
# 计算 SNR = 功率 * 增益平方 / 噪声 / Compute SNR = Power * Gain^2 / Noise - Eq.(8)
return power_alloc * (np.abs(channel_gains) ** 2) / noise_power
# ------------------------------------------------------------------
# 噪声功率属性 / Noise power property
# ------------------------------------------------------------------
@property
def noise_power(self) -> float:
"""
每子载波的热噪声功率 (瓦特)
Thermal noise power per subcarrier (Watts).
σ² = N₀ * Δf (公式 7)
σ² = N₀ · Δf (Eq. 7)
其中 N₀ 是从 dBm/Hz 转换为线性 (W/Hz) 的噪声功率谱密度
where N₀ is the noise PSD converted from dBm/Hz to linear (W/Hz):
N₀_linear = 10^((N₀_dBm - 30) / 10)
Returns
-------
float
噪声功率瓦特
Noise power in Watts.
"""
n0_dbm = self._noise_psd_dbm
delta_f = self._subcarrier_spacing
# 转换为线性功率谱密度 / Convert to linear PSD - Eq.(7)
n0_linear = 10.0 ** ((n0_dbm - 30.0) / 10.0)
# 计算总噪声功率 / Compute total noise power
return n0_linear * delta_f
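
Eqs. (5)–(6) above can be checked numerically: at d = 100 m and fc = 3.5 GHz the UMi NLOS path loss is about 110.2 dB, and the channel variance 10^(−PL/10) matches the empirical second moment of the generated gains. A minimal standalone sketch (numpy only, mirroring the formulas rather than importing `ChannelModel`):

```python
import numpy as np

def path_loss_db(d, fc_ghz):
    """3GPP UMi NLOS path loss, Eq. (5): d in metres, fc in GHz."""
    return 36.7 * np.log10(d) + 22.7 + 26.0 * np.log10(fc_ghz)

pl = path_loss_db(100.0, 3.5)   # a user 100 m from the BS
assert abs(pl - 110.25) < 0.05  # ≈ 110.2 dB

# Eq. (6): h ~ CN(0, 10^(-PL/10)); the empirical second moment of the
# generated gains should match that variance.
rng = np.random.default_rng(42)
var = 10.0 ** (-pl / 10.0)
std = np.sqrt(var / 2.0)                  # each real/imag component ~ N(0, var/2)
h = std * rng.standard_normal(200_000) + 1j * std * rng.standard_normal(200_000)
assert abs(np.mean(np.abs(h) ** 2) - var) / var < 0.02
```

Splitting the variance evenly between real and imaginary parts is what makes `E[|h|²]` equal the linear-scale inverse path loss, i.e. Rayleigh-faded magnitude with the intended average power.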

View File

@ -0,0 +1,156 @@
"""
语义通信模块 / Semantic communication module for Co-MADDPG.
实现基于 DeepSC 经验曲线的语义相似度 (SSim) 计算以及语义通信用户的 QoE 计算
Implements semantic similarity (SSim) computation based on empirical DeepSC curves,
and QoE calculation for semantic communication users.
作者/Author: Sisyphus-Junior
日期/Date: 2026-02-28
论文引用/Paper Reference: Co-MADDPG based Resource Allocation for Semantic Communication
依赖/Dependencies: numpy
"""
import numpy as np
class SemanticModule:
"""
语义通信质量模块
Semantic communication quality module.
根据平均 SNR 和压缩率使用受 DeepSC 文献启发的经验拟合曲线计算语义相似度指数 (SSim)
Computes semantic similarity index (SSim) from average SNR and compression
ratio, using empirical fitting curves inspired by DeepSC literature.
Parameters
----------
config : dict
完整的配置字典必须包含 "env" 部分且具有 rho_max, rho_min, w1, w2
Full configuration dictionary (must contain an "env" section
with keys "rho_max", "rho_min", "w1", "w2").
"""
def __init__(self, config: dict) -> None:
# 初始化语义参数 / Initialize semantic parameters
self.config = config
env = config["env"]
# 最大压缩率 / Maximum compression ratio ρ_max
self.rho_max = env.get("rho_max", 1.0)
# 最小压缩率 / Minimum compression ratio ρ_min
self.rho_min = env.get("rho_min", 0.05)
# QoE 权重 1 / QoE Weight w1 (SSim weight)
self.w1 = env.get("w1", 0.7)
# QoE 权重 2 / QoE Weight w2 (Resource efficiency weight)
self.w2 = env.get("w2", 0.3)
@staticmethod
def _a(rho: float) -> float:
"""经验曲线参数 a(ρ) = 0.8 / (ρ + 0.1)。 / Empirical curve parameter a(ρ) = 0.8 / (ρ + 0.1)."""
return 0.8 / (rho + 0.1)
@staticmethod
def _b(rho: float) -> float:
"""经验曲线参数 b(ρ) = 0.6 + 0.2 * ρ。 / Empirical curve parameter b(ρ) = 0.6 + 0.2 * ρ."""
return 0.6 + 0.2 * rho
def compute_ssim(self, avg_snr, rho: float):
"""
计算语义相似度指数 (SSim)
Compute semantic similarity index (SSim).
φ(γ̄, ρ) = 1 - exp(-a(ρ) * γ̄^{b(ρ)}) (公式参考 SSim 章节)
φ(γ̄, ρ) = 1 - exp(-a(ρ) * γ̄^{b(ρ)}) (Refer to SSim section in paper)
Parameters
----------
avg_snr : float or np.ndarray
线性尺度的平均 SNR(而非 dB)。
Average SNR in linear scale (not dB).
rho : float
压缩率 ρ ∈ [ρ_min, ρ_max]。
Compression ratio ρ ∈ [ρ_min, ρ_max].
Returns
-------
float or np.ndarray
[0, 1] 范围内的语义相似度
Semantic similarity in [0, 1].
"""
# 防止 SNR 过小导致数值错误 / Avoid numerical errors with small SNR
avg_snr = np.maximum(avg_snr, 1e-10)
# 获取经验参数 a 和 b / Get empirical parameters a and b
a = self._a(rho)
b = self._b(rho)
# 计算 SSim 公式 / Compute SSim formula
return 1.0 - np.exp(-a * np.power(avg_snr, b))
def compute_avg_snr(self, snr_per_subcarrier: np.ndarray,
allocation_mask: np.ndarray) -> float:
"""
计算已分配子载波上的平均 SNR
Compute average SNR over allocated subcarriers.
Parameters
----------
snr_per_subcarrier : np.ndarray
所有子载波的 SNR 线性尺度
SNR values for all subcarriers (linear scale).
allocation_mask : np.ndarray
指示已分配子载波的二进制掩码
Binary mask indicating allocated subcarriers.
Returns
-------
float
已分配子载波的平均 SNR;若未分配则返回 0。
Mean SNR over allocated subcarriers; 0 if none allocated.
"""
# 提取已分配子载波的 SNR / Extract SNR for allocated subcarriers
allocated = snr_per_subcarrier[allocation_mask > 0]
# 如果没有子载波被分配 / If no subcarriers are allocated
if len(allocated) == 0:
return 0.0
# 返回平均值 / Return mean value
return float(np.mean(allocated))
def compute_semantic_qoe(self, ssim: float, rho: float,
w1: float = None, w2: float = None,
rho_max: float = None) -> float:
"""
计算语义通信用户的 QoE
Compute QoE for a semantic communication user.
QoE_s = w1 * SSim + w2 * (1 - ρ / ρ_max) (公式参考 QoE_s)
QoE_s = w1 * SSim + w2 * (1 - ρ / ρ_max) (Refer to QoE_s formula)
Parameters
----------
ssim : float
[0, 1] 范围内的语义相似度指数
Semantic similarity index in [0, 1].
rho : float
使用的压缩率
Compression ratio used.
w1, w2 : float, optional
权重默认为配置中的实例值
Weights (defaults to instance values from config).
rho_max : float, optional
最大压缩率默认为配置中的值
Maximum compression ratio (default from config).
Returns
-------
float
[0, 1] 范围内的 QoE
QoE value in [0, 1].
"""
# 使用默认值或输入值 / Use default or input values
if w1 is None:
w1 = self.w1
if w2 is None:
w2 = self.w2
if rho_max is None:
rho_max = self.rho_max
# 计算语义 QoE / Calculate semantic QoE
return float(w1 * ssim + w2 * (1.0 - rho / rho_max))
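
The empirical curve and QoE definitions above can be exercised end to end: φ(γ̄, ρ) = 1 − exp(−a(ρ)·γ̄^{b(ρ)}) with a(ρ) = 0.8/(ρ + 0.1) and b(ρ) = 0.6 + 0.2ρ, then QoE_s = w1·SSim + w2·(1 − ρ/ρ_max). A standalone numpy sketch (re-implementation for illustration, not an import of `SemanticModule`):

```python
import numpy as np

def ssim(avg_snr, rho):
    """φ(γ̄, ρ) = 1 − exp(−a(ρ)·γ̄^b(ρ)); avg_snr in linear scale."""
    avg_snr = np.maximum(avg_snr, 1e-10)   # guard against zero SNR
    a = 0.8 / (rho + 0.1)
    b = 0.6 + 0.2 * rho
    return 1.0 - np.exp(-a * np.power(avg_snr, b))

def semantic_qoe(s, rho, w1=0.7, w2=0.3, rho_max=1.0):
    """QoE_s = w1·SSim + w2·(1 − ρ/ρ_max), with the default config weights."""
    return w1 * s + w2 * (1.0 - rho / rho_max)

snrs = np.array([0.1, 1.0, 10.0, 100.0])   # linear-scale SNR points
vals = ssim(snrs, rho=0.5)
assert np.all((0.0 <= vals) & (vals <= 1.0))
assert np.all(np.diff(vals) > 0)            # SSim increases with SNR
q = semantic_qoe(vals[-1], rho=0.5)
assert 0.0 <= q <= 1.0
```

Note the second QoE term rewards stronger compression (smaller ρ), while SSim itself degrades as ρ shrinks, so the weighted sum encodes the accuracy-versus-efficiency trade-off the semantic agent must learn.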

336
code/envs/wireless_env.py Normal file
View File

@ -0,0 +1,336 @@
"""
无线资源分配环境 / Main Gym-like environment for wireless resource allocation.
该模块实现了一个用于语义和传统用户共存系统的无线资源分配环境
它通过 Gym 风格的 reset/step 接口处理子载波分配、功率控制和压缩率优化。
This module implements a wireless resource allocation environment for systems
with coexisting semantic and traditional users. It handles subcarrier allocation,
power control, and compression ratio optimization via a Gym-like reset/step interface.
作者/Author: Sisyphus-Junior
日期/Date: 2026-02-28
论文引用/Paper Reference: Co-MADDPG based Resource Allocation for Semantic Communication
依赖/Dependencies: numpy, envs.channel_model, envs.semantic_module
"""
import numpy as np
from envs.channel_model import ChannelModel
from envs.semantic_module import SemanticModule
class WirelessEnv:
"""
语义与传统通信共存环境
Wireless environment with semantic and traditional communication.
管理信道状态执行动作并计算系统范围内的 QoE
Manages channel states, executes actions, and computes system-wide QoE.
Parameters
----------
config : dict
包含 'env' 'training' 部分的配置字典
Configuration dictionary containing 'env' and 'training' sections.
"""
def __init__(self, config):
# 提取环境和训练配置 / Extract environment and training configs
env_config = config['env']
train_config = config['training']
# 核心系统参数 / Core system parameters
self.N = env_config['num_subcarriers'] # 子载波数量 N / Number of subcarriers
self.K_s = env_config['num_semantic_users'] # 语义用户数 / Number of semantic users
self.K_b = env_config['num_traditional_users'] # 传统用户数 / Number of traditional users
self.K = self.K_s + self.K_b # 总用户数 / Total number of users
# 物理层参数 / Physical layer parameters
self.P_max = env_config['max_power'] # 最大总发射功率 / Maximum total transmit power
self.R_req = env_config['min_rate_req'] # 传统用户最小速率需求 / Min rate requirement for traditional users
self.delta_f = env_config['subcarrier_spacing'] # 子载波间隔 / Subcarrier spacing
self.rho_min = env_config['rho_min'] # 最小压缩率 / Minimum compression ratio
self.rho_max = env_config['rho_max'] # 最大压缩率 / Maximum compression ratio
self.w1 = env_config['w1'] # 语义 QoE 权重 1 / Semantic QoE weight 1
self.w2 = env_config['w2'] # 语义 QoE 权重 2 / Semantic QoE weight 2
# 距离限制 / Distance limits
self.min_d = env_config.get('min_distance', 50.0)
self.max_d = env_config.get('max_distance', 500.0)
# 训练步数控制 / Training step control
self.max_steps = train_config['max_steps']
self.step_count = 0
# 初始化模型 / Initialize models
self.channel_model = ChannelModel(config)
self.semantic_module = SemanticModule(config)
# 初始状态变量 / Initial state variables
self.distances = np.zeros(self.K) # 用户距离 / User distances
self.channel_gains = np.zeros((self.K, self.N), dtype=complex) # 复信道增益 / Complex channel gains
self.content_sensitivity = 0.5 # 内容敏感度 / Content sensitivity
self.business_priority = 0.5 # 业务优先级 / Business priority
self.load_s = 0.5 # 语义流量负载 / Semantic traffic load
self.load_b = 0.5 # 传统流量负载 / Traditional traffic load
self.alloc_s = 0.0 # 语义子载波分配比例 / Semantic subcarrier allocation fraction
self.alloc_b = 0.0 # 传统子载波分配比例 / Traditional subcarrier allocation fraction
self.qoe_avg_s = 0.0 # 语义平均 QoE / Rolling average semantic QoE
self.qoe_avg_b = 0.0 # 传统平均 QoE / Rolling average traditional QoE
@property
def obs_dim(self):
"""观察维度: 子载波 (N) + 4 个额外特征。 / Observation dimension: Subcarriers (N) + 4 extra features."""
return self.N + 4
@property
def act_dim(self):
"""动作维度: 子载波比例, 功率比例, [语义: 压缩率]。 / Action dimension: Subcarrier fraction, Power fraction, [Semantic: Compression ratio]."""
return 3
def reset(self):
"""
重置环境状态
Reset environment state.
Returns
-------
tuple
(语义智能体观察, 传统智能体观察)
(semantic_observation, traditional_observation).
"""
# 在 [min_distance, max_distance] 内随机分配用户距离 / Random user distances in [min_distance, max_distance]
self.distances = np.random.uniform(self.min_d, self.max_d, size=self.K)
# 生成信道 (形状: K x N 复数) / Generate channel (shape: K x N complex) - Eq.(6)
self.channel_gains = self.channel_model.generate_channel(self.distances, self.N)
self.step_count = 0
# 随机设置观察参数 / Random params for observation
self.content_sensitivity = np.random.uniform(0.3, 0.8)
self.business_priority = np.random.uniform(0.3, 0.8)
self.load_s = np.random.uniform(0.2, 0.8)
self.load_b = np.random.uniform(0.2, 0.8)
# 重置分配比例和移动平均值 / Reset allocations and moving averages
self.alloc_s = 0.0
self.alloc_b = 0.0
self.qoe_avg_s = 0.0
self.qoe_avg_b = 0.0
# 获取初始观察 / Get initial observations
obs_s = self._get_observation('semantic')
obs_b = self._get_observation('traditional')
return obs_s, obs_b
def _get_observation(self, agent_type):
"""
构造智能体的观察向量
Construct observation vector for agents.
Parameters
----------
agent_type : str
'semantic' 'traditional'
'semantic' or 'traditional'.
Returns
-------
np.ndarray
归一化后的观察向量
Normalized observation vector.
"""
if agent_type == 'semantic':
# 语义用户索引范围 / Semantic user indices range
user_indices = range(self.K_b, self.K)
if len(user_indices) > 0:
# 计算平均信道增益平方 (功率) / Mean channel power
channel_power = np.mean(np.abs(self.channel_gains[user_indices])**2, axis=0)
else:
channel_power = np.zeros(self.N)
# 归一化信道功率 / Normalize channel power
channel_norm = channel_power / (np.max(channel_power) + 1e-10)
# 拼接额外特征 / Concatenate extra features
obs = np.concatenate([channel_norm,
[self.qoe_avg_s, self.content_sensitivity, self.alloc_s, self.load_s]])
else: # 传统 / traditional
# 传统用户索引范围 / Traditional user indices range
user_indices = range(0, self.K_b)
if len(user_indices) > 0:
# 计算平均信道功率 / Mean channel power
channel_power = np.mean(np.abs(self.channel_gains[user_indices])**2, axis=0)
else:
channel_power = np.zeros(self.N)
# 归一化信道功率 / Normalize channel power
channel_norm = channel_power / (np.max(channel_power) + 1e-10)
# 拼接额外特征 / Concatenate extra features
obs = np.concatenate([channel_norm,
[self.qoe_avg_b, self.business_priority, self.alloc_b, self.load_b]])
# 返回 32位浮点型观察 / Return float32 observation
return obs.astype(np.float32)
def step(self, action_s, action_b):
"""
执行一个时间步
Execute a single environment step.
Parameters
----------
action_s : np.ndarray
语义智能体动作 [子载波比例, 功率比例, 压缩率]
Semantic agent action [sub_fraction, power_fraction, compression_ratio].
action_b : np.ndarray
传统智能体动作 [子载波比例, 功率比例, 冗余参数]
Traditional agent action [sub_fraction, power_fraction, redundant_param].
Returns
-------
tuple
(obs_s, obs_b, reward_s, reward_b, done, info).
"""
self.step_count += 1
# 1. 解码动作 / Decode actions
# 计算子载波分配数量 / Compute number of subcarriers
n_sub_s = max(1, int(round(action_s[0] * self.N)))
n_sub_b = max(1, int(round(action_b[0] * self.N)))
# 按比例限制总子载波数量,并保证两组各至少 1 个 / Proportionally clip total subcarriers, keeping at least 1 per group
if n_sub_s + n_sub_b > self.N:
total = n_sub_s + n_sub_b
n_sub_s = min(self.N - 1, max(1, int(round(n_sub_s * self.N / total))))
n_sub_b = self.N - n_sub_s
# 计算功率分配 / Compute power allocation
p_s = action_s[1] * self.P_max
p_b = action_b[1] * self.P_max
# 限制总功率 / Limit total power
if p_s + p_b > self.P_max:
total_p = p_s + p_b
p_s = p_s * self.P_max / total_p
p_b = p_b * self.P_max / total_p
# 解码语义压缩率 / Decode semantic compression ratio
rho = action_s[2] * (self.rho_max - self.rho_min) + self.rho_min
# 2. 分配子载波 (基于信道质量的贪婪算法) / Allocate subcarriers (greedy by channel quality)
# 计算两组用户的平均信道质量 / Mean channel quality for both groups
sem_channel = np.mean(np.abs(self.channel_gains[self.K_b:])**2, axis=0) if self.K_s > 0 else np.zeros(self.N)
trad_channel = np.mean(np.abs(self.channel_gains[:self.K_b])**2, axis=0) if self.K_b > 0 else np.zeros(self.N)
# 语义用户优先挑选最好的子载波 / Semantic users pick best subcarriers first
all_subs = np.arange(self.N)
sem_sorted = np.argsort(-sem_channel)
sem_subs = sem_sorted[:n_sub_s]
# 剩余子载波给传统用户 / Remaining subcarriers for traditional users
remaining = np.setdiff1d(all_subs, sem_subs)
if len(remaining) >= n_sub_b:
trad_quality = trad_channel[remaining]
best_idx = np.argsort(-trad_quality)[:n_sub_b]
trad_subs = remaining[best_idx]
else:
trad_subs = remaining
n_sub_b = len(trad_subs)
# 3. 功率分配 (组内均分) / Power allocation (equal within group)
noise_power = self.channel_model.noise_power
# 分配矩阵和功率矩阵 / Allocation and power matrices
alloc_matrix = np.zeros((self.K, self.N))
power_matrix = np.zeros((self.K, self.N))
# 在 K_s 个用户中循环分配语义子载波 / Distribute semantic subcarriers among K_s users round-robin
for i, k in enumerate(range(self.K_b, self.K)):
user_subs = sem_subs[i::max(1, self.K_s)]
if len(user_subs) > 0:
alloc_matrix[k, user_subs] = 1
power_matrix[k, user_subs] = p_s / max(n_sub_s, 1)
# 在 K_b 个用户中循环分配传统子载波 / Distribute traditional subcarriers among K_b users round-robin
for i, k in enumerate(range(0, self.K_b)):
user_subs = trad_subs[i::max(1, self.K_b)]
if len(user_subs) > 0:
alloc_matrix[k, user_subs] = 1
power_matrix[k, user_subs] = p_b / max(n_sub_b, 1)
# 4. 计算 SNR / Compute SNR - Eq.(8)
snr_matrix = self.channel_model.compute_snr(self.channel_gains, power_matrix, noise_power)
# 5. 计算每个用户的 QoE / Compute QoE for each user
qoe_list = []
rates = []
ssim_values = []
# 传统用户 QoE 计算 / Traditional users QoE computation - Eq.(QoE_b)
for k in range(self.K_b):
user_subs = np.where(alloc_matrix[k] > 0)[0]
if len(user_subs) == 0:
rate_k = 0.0
else:
# 香农速率,对已分配子载波求和 / Shannon rate over allocated subcarriers: R_k = Σ Δf · log2(1 + γ_{k,n})
rate_k = np.sum(self.delta_f * np.log2(1 + snr_matrix[k, user_subs]))
rates.append(rate_k)
# QoE_b = min(R_k / R_req, 1),封顶为 1 / rate satisfaction capped at 1
qoe_k = min(rate_k / self.R_req, 1.0)
qoe_list.append(qoe_k)
# 语义用户 QoE 计算 / Semantic users QoE computation - Eq.(QoE_s)
for k in range(self.K_b, self.K):
user_subs = np.where(alloc_matrix[k] > 0)[0]
if len(user_subs) == 0:
ssim_k = 0.0
else:
avg_snr = np.mean(snr_matrix[k, user_subs])
# 计算语义相似度 / Compute SSim - Eq. (SSim)
ssim_k = self.semantic_module.compute_ssim(avg_snr, rho)
ssim_values.append(float(ssim_k))
# 计算语义 QoE / Compute semantic QoE
qoe_k = self.semantic_module.compute_semantic_qoe(ssim_k, rho, self.w1, self.w2, self.rho_max)
qoe_list.append(qoe_k)
# 6. 系统平均 QoE / System QoE
qoe_sys = np.mean(qoe_list) if len(qoe_list) > 0 else 0.0
qoe_s = np.mean(qoe_list[self.K_b:]) if self.K_s > 0 else 0.0
qoe_b = np.mean(qoe_list[:self.K_b]) if self.K_b > 0 else 0.0
# 更新滚动平均值 / Update rolling averages
alpha_smooth = 0.1
self.qoe_avg_s = alpha_smooth * qoe_s + (1 - alpha_smooth) * self.qoe_avg_s
self.qoe_avg_b = alpha_smooth * qoe_b + (1 - alpha_smooth) * self.qoe_avg_b
# 记录当前分配比例 / Record current allocation ratios
self.alloc_s = n_sub_s / self.N
self.alloc_b = n_sub_b / self.N
# 7. 为下一步生成新信道 (块衰落) / Regenerate channel for next step (block fading)
self.channel_gains = self.channel_model.generate_channel(self.distances, self.N)
# 8. 构造输出数据 / Build output
obs_s = self._get_observation('semantic')
obs_b = self._get_observation('traditional')
done = (self.step_count >= self.max_steps)
# 计算速率满足度 / Compute rate satisfaction for traditional users
if len(rates) > 0:
rate_satisfaction = float(np.mean([1.0 if r >= self.R_req else 0.0 for r in rates]))
else:
rate_satisfaction = 1.0
# 构造信息字典 / Construct info dictionary
info = {
'qoe_semantic': qoe_s,
'qoe_traditional': qoe_b,
'qoe_sys': qoe_sys,
'qoe_list': qoe_list,
'rates': rates,
'ssim_values': ssim_values,
'rate_satisfaction': rate_satisfaction,
'rho': rho,
'n_sub_s': n_sub_s,
'n_sub_b': n_sub_b,
}
# 返回结果 (奖励值设为各自的平均 QoE) / Return results (rewards set to respective mean QoEs)
return obs_s, obs_b, qoe_s, qoe_b, done, info
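Step 2 of `step()` above (greedy subcarrier assignment) can be exercised in isolation. A self-contained sketch, with `greedy_split` a hypothetical helper rather than part of the class:

```python
import numpy as np

def greedy_split(sem_power, trad_power, n_s, n_b):
    """Semantic group takes its n_s strongest subcarriers first; the
    traditional group then takes its n_b best among what remains,
    mirroring the order used in WirelessEnv.step()."""
    sem_subs = np.argsort(-sem_power)[:n_s]
    remaining = np.setdiff1d(np.arange(len(sem_power)), sem_subs)
    trad_subs = remaining[np.argsort(-trad_power[remaining])[:n_b]]
    return sem_subs, trad_subs

sem_p = np.array([0.9, 0.1, 0.8, 0.4])
trad_p = np.array([0.2, 0.7, 0.6, 0.5])
sem_subs, trad_subs = greedy_split(sem_p, trad_p, n_s=2, n_b=2)
# Semantic claims subcarriers {0, 2}; traditional is left to pick from {1, 3}
```

Because the semantic group picks first, it is favoured whenever both groups prefer the same subcarriers; the clipping step earlier guarantees n_s + n_b ≤ N, so the remainder is always large enough.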

577
code/evaluate.py Normal file
View File

@ -0,0 +1,577 @@
#!/usr/bin/env python3
"""
Co-MADDPG Evaluation & Figure Generation | Co-MADDPG 评估与图表生成
This script evaluates trained models across various network scenarios and
generates the 12 primary figures for the research paper. It covers robustness
tests (SNR), scalability (User Load), and internal dynamics (Lambda).
本脚本在各种网络场景下评估已训练的模型并为研究论文生成 12 张主要图表
它涵盖了鲁棒性测试 (SNR)可扩展性 (用户负载) 和内部动态 (Lambda)
Scenarios Documented:
1. Convergence / 收敛性 (Fig 2)
2. SNR Sensitivity / SNR 敏感性 (Fig 3, 4)
3. User Load Scalability / 用户负载可扩展性 (Fig 5, 6)
4. Dynamic Lambda Trajectory / 动态 Lambda 轨迹 (Fig 7, 8)
5. Semantic-Traditional Ratio / 语义-传统比例 (Fig 9)
6. Component Ablation / 组件消融实验 (Fig 10)
7. Beta Parameter Sensitivity / Beta 参数敏感性 (Fig 11)
8. Q_th Threshold Sensitivity / Q_th 阈值敏感性 (Fig 12)
Reference:
- Section VII: Experimental Results
"""
import os
import sys
import argparse
import json
import yaml
import numpy as np
import torch
from pathlib import Path
from copy import deepcopy
PROJECT_ROOT = Path(__file__).parent
sys.path.insert(0, str(PROJECT_ROOT))
from envs.wireless_env import WirelessEnv
from agents.co_maddpg import CoMADDPG
from baselines.pure_coop import PureCooperative
from baselines.pure_comp import PureCompetitive
from baselines.single_dqn import SingleAgentDQN
from baselines.iddpg import IndependentDDPG
from baselines.fixed_lambda import FixedLambda
from baselines.equal_alloc import EqualAllocation
from baselines.semantic_only import SemanticOnly
from utils.metrics import jain_fairness, rate_satisfaction, compute_system_qoe, moving_average
from utils.visualization import Plotter
# Mapping internal keys to display names and classes
# 将内部键映射到显示名称和类
ALGO_MAP = {
'co_maddpg': ('Co-MADDPG', CoMADDPG),
'pure_coop': ('Pure Cooperative', PureCooperative),
'pure_comp': ('Pure Competitive', PureCompetitive),
'single_dqn': ('Single-Agent DQN', SingleAgentDQN),
'iddpg': ('IDDPG', IndependentDDPG),
'fixed_lambda': ('Fixed λ=0.5', FixedLambda),
'equal_alloc': ('Equal Allocation', EqualAllocation),
'semantic_only': ('Semantic-Only', SemanticOnly),
}
def load_config(config_path: str) -> dict:
"""Load YAML configuration file. | 加载 YAML 配置文件。"""
with open(config_path, 'r', encoding='utf-8') as f:
return yaml.safe_load(f)
def evaluate_episode(env, agent, config, num_episodes=10):
"""
Run evaluation episodes and return average metrics.
执行评估回合并返回平均指标
Parameters
----------
env : WirelessEnv
The wireless environment instance. | 无线环境实例
agent : BaseAgent
The trained agent model. | 已训练的智能体模型
config : dict
Configuration parameters. | 配置参数
num_episodes : int
Number of episodes to average over. | 用于计算平均值的回合数。
Returns
-------
dict
Mean metrics across episodes: QoE (system/semantic/traditional),
fairness, rate satisfaction, and λ statistics.
各回合的平均指标:QoE(系统/语义/传统)、公平性、速率满足度及 λ 统计量。
"""
max_steps = config['training']['max_steps']
all_qoe_sys = []
all_qoe_s = []
all_qoe_b = []
all_fairness = []
all_rate_sat = []
all_lambda = []
all_rates = []
for _ in range(num_episodes):
obs_s, obs_b = env.reset()
ep_qoe_sys = []
ep_qoe_s = []
ep_qoe_b = []
ep_fairness = []
ep_rate_sat = []
ep_lambda = []
for step in range(max_steps):
# Deterministic action selection (no exploration noise)
# 确定性动作选择(无探索噪声)
act_s, act_b = agent.select_action(obs_s, obs_b, explore=False)
next_obs_s, next_obs_b, qoe_s, qoe_b, done, info = env.step(act_s, act_b)
qoe_sys = info['qoe_sys']
# Get lambda if applicable | 获取 lambda(如果适用)
if hasattr(agent, 'compute_lambda'):
lambda_val = agent.compute_lambda(qoe_sys)
else:
lambda_val = 0.5
# Accumulate per-step metrics so episode statistics are true means,
# not last-step snapshots | 按步累积指标,使回合统计为真正的平均值而非末步快照
ep_qoe_sys.append(qoe_sys)
ep_qoe_s.append(info['qoe_semantic'])
ep_qoe_b.append(info['qoe_traditional'])
ep_fairness.append(jain_fairness(info['qoe_list']))
ep_rate_sat.append(info['rate_satisfaction'])
ep_lambda.append(lambda_val)
obs_s = next_obs_s
obs_b = next_obs_b
if done:
break
# Calculate episode means | 计算回合平均值
all_qoe_sys.append(np.mean(ep_qoe_sys))
all_qoe_s.append(np.mean(ep_qoe_s))
all_qoe_b.append(np.mean(ep_qoe_b))
all_fairness.append(np.mean(ep_fairness))
all_rate_sat.append(np.mean(ep_rate_sat))
all_lambda.append(np.mean(ep_lambda))
all_rates.extend(info['rates'])
return {
'qoe_sys': np.mean(all_qoe_sys),
'qoe_sys_std': np.std(all_qoe_sys),
'qoe_semantic': np.mean(all_qoe_s),
'qoe_traditional': np.mean(all_qoe_b),
'fairness': np.mean(all_fairness),
'rate_satisfaction': np.mean(all_rate_sat),
'avg_lambda': np.mean(all_lambda),
'lambda_trajectory': all_lambda,
}
# ============================================================
# Scenario 1: Convergence (Fig 2)
# ============================================================
def scenario_convergence(results_dir: str, save_dir: str):
"""
Generate convergence curves from training history.
根据训练历史生成收敛曲线
Loads JSON history files for each algorithm and plots system QoE.
加载每个算法的 JSON 历史文件并绘制系统 QoE
"""
print("\n[Scenario 1] Convergence curves (Fig 2)")
plotter = Plotter()
data_dict = {}
for algo_key, (display_name, _) in ALGO_MAP.items():
history_path = os.path.join(results_dir, f'{algo_key}_history.json')
if os.path.exists(history_path):
with open(history_path, 'r') as f:
history = json.load(f)
if 'episode_qoe_sys' in history:
data_dict[display_name] = history['episode_qoe_sys']
if data_dict:
plotter.plot_convergence(data_dict, os.path.join(save_dir, 'fig2_convergence'))
print(f" Saved fig2_convergence")
else:
print(" No training history found. Run training first.")
# ============================================================
# Scenario 2: QoE vs SNR (Fig 3, 4)
# ============================================================
def scenario_snr(config: dict, results_dir: str, save_dir: str, num_eval=5):
"""
Evaluate performance across different SNR levels.
在不同 SNR 水平下评估性能
Simulation Method: Adjusts noise PSD to achieve target SNR (0 to 30 dB).
仿真方法调整噪声功率谱密度 (PSD) 以达到目标 SNR0 30 dB
"""
print("\n[Scenario 2] QoE vs SNR (Fig 3, 4)")
plotter = Plotter()
snr_levels_db = np.arange(0, 31, 5) # 0, 5, 10, 15, 20, 25, 30
qoe_data = {}
fairness_data = {}
for algo_key, (display_name, AlgoClass) in ALGO_MAP.items():
qoe_vals = []
fair_vals = []
for snr_db in snr_levels_db:
# Modify noise PSD to achieve target SNR | 修改噪声 PSD 以达到目标 SNR
test_config = deepcopy(config)
# SNR = Signal_Power - Noise_Power. Adjusting noise_psd shifts SNR.
# SNR = 信号功率 - 噪声功率。调整 noise_psd 会改变 SNR。
snr_offset = snr_db - 15 # 15 dB is roughly the baseline SNR | 15 dB 大约是基准 SNR
test_config['env']['noise_psd'] = -174 - snr_offset
env = WirelessEnv(test_config)
agent = AlgoClass(test_config)
# Load trained model weights | 加载已训练的模型权重
model_path = os.path.join(results_dir, f'{algo_key}_best.pt')
if os.path.exists(model_path) and hasattr(agent, 'load'):
try:
agent.load(model_path)
except Exception:
pass
result = evaluate_episode(env, agent, test_config, num_episodes=num_eval)
qoe_vals.append(result['qoe_sys'])
fair_vals.append(result['fairness'])
qoe_data[display_name] = qoe_vals
fairness_data[display_name] = fair_vals
print(f" {display_name}: QoE range [{min(qoe_vals):.3f}, {max(qoe_vals):.3f}]")
plotter.plot_qoe_vs_snr(qoe_data, os.path.join(save_dir, 'fig3_qoe_vs_snr'))
plotter.plot_fairness_vs_snr(fairness_data, os.path.join(save_dir, 'fig4_fairness_vs_snr'))
print(f" Saved fig3, fig4")
return {'snr_levels': snr_levels_db.tolist(), 'qoe': qoe_data, 'fairness': fairness_data}
# ============================================================
# Scenario 3: QoE vs User Load (Fig 5, 6)
# ============================================================
def scenario_user_load(config: dict, results_dir: str, save_dir: str, num_eval=5):
"""
Evaluate performance with different user counts.
评估不同用户数量下的性能
Simulation Method: Varies total user count K from 4 to 12, split between S and B.
仿真方法将总用户数 K 4 12 之间变化在语义 (S) 和传统 (B) 用户之间分配
"""
print("\n[Scenario 3] QoE vs User Load (Fig 5, 6)")
plotter = Plotter()
user_counts = [4, 6, 8, 10, 12] # Total K | 总用户数 K
qoe_data = {}
rate_sat_data = {}
for algo_key, (display_name, AlgoClass) in ALGO_MAP.items():
qoe_vals = []
rate_vals = []
for k_total in user_counts:
test_config = deepcopy(config)
# Distribute users equally between types | 在不同类型之间平均分配用户
k_s = k_total // 2
k_b = k_total - k_s
test_config['env']['num_semantic_users'] = k_s
test_config['env']['num_traditional_users'] = k_b
env = WirelessEnv(test_config)
agent = AlgoClass(test_config)
model_path = os.path.join(results_dir, f'{algo_key}_best.pt')
if os.path.exists(model_path) and hasattr(agent, 'load'):
try:
agent.load(model_path)
except Exception:
pass
result = evaluate_episode(env, agent, test_config, num_episodes=num_eval)
qoe_vals.append(result['qoe_sys'])
rate_vals.append(result['rate_satisfaction'])
qoe_data[display_name] = qoe_vals
rate_sat_data[display_name] = rate_vals
print(f" {display_name}: QoE range [{min(qoe_vals):.3f}, {max(qoe_vals):.3f}]")
plotter.plot_qoe_vs_users(qoe_data, os.path.join(save_dir, 'fig5_qoe_vs_users'))
plotter.plot_rate_satisfaction_vs_users(rate_sat_data, os.path.join(save_dir, 'fig6_rate_sat_vs_users'))
print(f" Saved fig5, fig6")
# ============================================================
# Scenario 4: Lambda Dynamics (Fig 7, 8)
# ============================================================
def scenario_lambda_dynamics(config: dict, results_dir: str, save_dir: str):
"""
Analyze dynamic λ switching behavior of Co-MADDPG.
分析 Co-MADDPG 的动态 λ 切换行为
"""
print("\n[Scenario 4] Lambda Dynamics (Fig 7, 8)")
plotter = Plotter()
env = WirelessEnv(config)
agent = CoMADDPG(config)
model_path = os.path.join(results_dir, 'co_maddpg_best.pt')
if os.path.exists(model_path):
try:
agent.load(model_path)
except Exception:
pass
# Run one episode and collect λ trajectory | 执行一个回合并收集 λ 轨迹
obs_s, obs_b = env.reset()
lambda_vals = []
qoe_vals = []
for step in range(config['training']['max_steps']):
act_s, act_b = agent.select_action(obs_s, obs_b, explore=False)
next_obs_s, next_obs_b, qoe_s, qoe_b, done, info = env.step(act_s, act_b)
qoe_sys = info['qoe_sys']
lambda_val = agent.compute_lambda(qoe_sys)
lambda_vals.append(float(lambda_val))
qoe_vals.append(float(qoe_sys))
obs_s, obs_b = next_obs_s, next_obs_b
if done:
break
plotter.plot_lambda_trajectory(lambda_vals, os.path.join(save_dir, 'fig7_lambda_trajectory'))
plotter.plot_lambda_qoe_scatter(lambda_vals, qoe_vals, os.path.join(save_dir, 'fig8_lambda_qoe_scatter'))
print(f" Saved fig7, fig8")
# ============================================================
# Scenario 5: Semantic/Traditional Ratio (Fig 9)
# ============================================================
def scenario_user_ratio(config: dict, results_dir: str, save_dir: str, num_eval=5):
"""
Evaluate with different semantic/traditional user ratios.
评估不同语义/传统用户比例下的性能
Studies the impact as semantic communication becomes more prevalent.
研究语义通信变得更加普遍时的影响
"""
print("\n[Scenario 5] User Ratio Analysis (Fig 9)")
plotter = Plotter()
total_users = 6
ratios = [0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0] # semantic fraction | 语义用户占比
qoe_data = {}
for algo_key, (display_name, AlgoClass) in ALGO_MAP.items():
qoe_vals = []
for ratio in ratios:
# Map ratio to discrete integer counts | 将比例映射为离散整数计数
k_s = max(0, min(total_users, int(round(ratio * total_users))))
k_b = total_users - k_s
# Ensure at least one user of each type for the hybrid environment;
# the endpoint ratios 0.0 and 1.0 are therefore clamped to 1/6 and 5/6.
# 确保混合环境中每种类型至少有一个用户;因此端点比例 0.0 和 1.0 实际被收紧为 1/6 和 5/6。
if k_s == 0:
k_s, k_b = 1, total_users - 1
if k_b == 0:
k_b, k_s = 1, total_users - 1
test_config = deepcopy(config)
test_config['env']['num_semantic_users'] = k_s
test_config['env']['num_traditional_users'] = k_b
env = WirelessEnv(test_config)
agent = AlgoClass(test_config)
model_path = os.path.join(results_dir, f'{algo_key}_best.pt')
if os.path.exists(model_path) and hasattr(agent, 'load'):
try:
agent.load(model_path)
except Exception:
pass
result = evaluate_episode(env, agent, test_config, num_episodes=num_eval)
qoe_vals.append(result['qoe_sys'])
qoe_data[display_name] = qoe_vals
plotter.plot_qoe_vs_ratio(qoe_data, ratios, os.path.join(save_dir, 'fig9_qoe_vs_ratio'))
print(f" Saved fig9")
# ============================================================
# Scenario 6: Ablation Study (Fig 10)
# ============================================================
def scenario_ablation(config: dict, results_dir: str, save_dir: str, num_eval=5):
"""
Run ablation study comparing core components.
运行消融实验比较核心组件
Ablation Mapping:
- w/o Stackelberg: Pure Cooperative (simultaneous update) | Stackelberg纯协作同步更新
- w/o Dynamic λ: Fixed Lambda (λ=0.5) | 无动态 λ固定 Lambda (λ=0.5)
- w/o Cooperation: Pure Competitive (λ=0) | 无协作纯竞争 (λ=0)
- w/o CTDE: IDDPG (Independent Critics) | CTDEIDDPG独立评论家
"""
print("\n[Scenario 6] Ablation Study (Fig 10)")
plotter = Plotter()
ablation_keys = {
'Co-MADDPG (Full)': 'co_maddpg',
'w/o Stackelberg': 'pure_coop',
'w/o Dynamic λ': 'fixed_lambda',
'w/o Cooperation': 'pure_comp',
'w/o CTDE': 'iddpg',
}
ablation_data = {}
for label, algo_key in ablation_keys.items():
history_path = os.path.join(results_dir, f'{algo_key}_history.json')
if os.path.exists(history_path):
with open(history_path, 'r') as f:
history = json.load(f)
# Average of last 500 episodes for stability | 为保证稳定性取最后 500 回合的平均值
qoe_series = history.get('episode_qoe_sys', [])
if len(qoe_series) >= 500:
ablation_data[label] = np.mean(qoe_series[-500:])
elif len(qoe_series) > 0:
# 取最后 20% 的回合,至少 1 个 / Mean of the last 20% of episodes, at least one
# (注意:当 len < 5 时 len//5 为 0,而 [-0:] 会切出整个列表 / note: [-0:] would slice the whole list)
ablation_data[label] = np.mean(qoe_series[-max(1, len(qoe_series) // 5):])
else:
ablation_data[label] = 0.0
else:
# Fallback to direct evaluation if history missing | 如果历史记录缺失,则回退到直接评估
env = WirelessEnv(config)
AlgoClass = ALGO_MAP[algo_key][1]
agent = AlgoClass(config)
model_path = os.path.join(results_dir, f'{algo_key}_best.pt')
if os.path.exists(model_path) and hasattr(agent, 'load'):
try:
agent.load(model_path)
except Exception:
pass
result = evaluate_episode(env, agent, config, num_episodes=num_eval)
ablation_data[label] = result['qoe_sys']
plotter.plot_ablation(ablation_data, os.path.join(save_dir, 'fig10_ablation'))
print(f" Saved fig10")
# ============================================================
# Scenario 7: β Sensitivity (Fig 11)
# ============================================================
def scenario_beta_sensitivity(config: dict, results_dir: str, save_dir: str, num_eval=5):
"""
Evaluate sensitivity to the β parameter in the sigmoid function.
评估 Sigmoid 函数中 β 参数的敏感性
β controls the steepness of switching between competition and cooperation.
β 控制竞争与协作之间切换的陡峭程度
"""
print("\n[Scenario 7] β Sensitivity (Fig 11)")
plotter = Plotter()
betas = [1, 3, 5, 7, 10]
qoe_data = {}
for beta in betas:
test_config = deepcopy(config)
test_config['training']['beta'] = float(beta)
env = WirelessEnv(test_config)
agent = CoMADDPG(test_config)
model_path = os.path.join(results_dir, 'co_maddpg_best.pt')
if os.path.exists(model_path):
try:
agent.load(model_path)
except Exception:
pass
result = evaluate_episode(env, agent, test_config, num_episodes=num_eval)
qoe_data[f'β={beta}'] = result['qoe_sys']
print(f" β={beta}: QoE_sys={result['qoe_sys']:.4f}")
plotter.plot_beta_sensitivity(qoe_data, betas, os.path.join(save_dir, 'fig11_beta_sensitivity'))
print(f" Saved fig11")
# ============================================================
# Scenario 8: Q_th Sensitivity (Fig 12)
# ============================================================
def scenario_qth_sensitivity(config: dict, results_dir: str, save_dir: str, num_eval=5):
"""
Evaluate sensitivity to the Q_th threshold parameter.
评估 Q_th 阈值参数的敏感性
Q_th is the target QoE level below which cooperation is triggered.
Q_th 是触发协作的目标 QoE 水平
"""
print("\n[Scenario 8] Q_th Sensitivity (Fig 12)")
plotter = Plotter()
qths = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
qoe_data = {}
for qth in qths:
test_config = deepcopy(config)
test_config['training']['q_threshold'] = float(qth)
env = WirelessEnv(test_config)
agent = CoMADDPG(test_config)
model_path = os.path.join(results_dir, 'co_maddpg_best.pt')
if os.path.exists(model_path):
try:
agent.load(model_path)
except Exception:
pass
result = evaluate_episode(env, agent, test_config, num_episodes=num_eval)
qoe_data[f'Q_th={qth}'] = result['qoe_sys']
print(f" Q_th={qth}: QoE_sys={result['qoe_sys']:.4f}")
plotter.plot_qth_sensitivity(qoe_data, qths, os.path.join(save_dir, 'fig12_qth_sensitivity'))
print(f" Saved fig12")
# ============================================================
# Run all scenarios | 执行所有场景
# ============================================================
def run_all_scenarios(config: dict, results_dir: str, save_dir: str):
"""Run all evaluation scenarios and generate all figures. | 执行所有评估场景并生成所有图表。"""
os.makedirs(save_dir, exist_ok=True)
scenario_convergence(results_dir, save_dir)
scenario_snr(config, results_dir, save_dir)
scenario_user_load(config, results_dir, save_dir)
scenario_lambda_dynamics(config, results_dir, save_dir)
scenario_user_ratio(config, results_dir, save_dir)
scenario_ablation(config, results_dir, save_dir)
scenario_beta_sensitivity(config, results_dir, save_dir)
scenario_qth_sensitivity(config, results_dir, save_dir)
print(f"\nAll figures saved to: {save_dir}")
def main():
"""Main entry point for evaluation. | 评估主入口。"""
parser = argparse.ArgumentParser(description='Co-MADDPG Evaluation')
parser.add_argument('--config', type=str, default='configs/default.yaml',
help='Path to config YAML')
parser.add_argument('--results_dir', type=str, required=True,
help='Directory with trained models and history')
parser.add_argument('--save_dir', type=str, default=None,
help='Directory to save figures (default: results_dir/figures)')
parser.add_argument('--scenario', type=str, default='all',
choices=['all', 'convergence', 'snr', 'user_load',
'lambda', 'ratio', 'ablation', 'beta', 'qth'],
help='Evaluation scenario to run')
parser.add_argument('--num_eval', type=int, default=10,
help='Number of evaluation episodes per setting')
args = parser.parse_args()
config_path = os.path.join(PROJECT_ROOT, args.config)
config = load_config(config_path)
save_dir = args.save_dir or os.path.join(args.results_dir, 'figures')
os.makedirs(save_dir, exist_ok=True)
# Dispatch to specific scenario | 分派到特定场景
scenario_map = {
'all': lambda: run_all_scenarios(config, args.results_dir, save_dir),
'convergence': lambda: scenario_convergence(args.results_dir, save_dir),
'snr': lambda: scenario_snr(config, args.results_dir, save_dir, args.num_eval),
'user_load': lambda: scenario_user_load(config, args.results_dir, save_dir, args.num_eval),
'lambda': lambda: scenario_lambda_dynamics(config, args.results_dir, save_dir),
'ratio': lambda: scenario_user_ratio(config, args.results_dir, save_dir, args.num_eval),
'ablation': lambda: scenario_ablation(config, args.results_dir, save_dir, args.num_eval),
'beta': lambda: scenario_beta_sensitivity(config, args.results_dir, save_dir, args.num_eval),
'qth': lambda: scenario_qth_sensitivity(config, args.results_dir, save_dir, args.num_eval),
}
scenario_map[args.scenario]()
print("\nEvaluation complete!")
if __name__ == '__main__':
main()
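Scenarios 7 and 8 sweep `beta` and `q_threshold`, which shape the agent's λ function; `compute_lambda` itself lives in `agents/co_maddpg.py` and is not shown in this file. As a hedged illustration only (the exact functional form and sign convention used by the agent are assumptions here), a sigmoid gate centred at Q_th shows why β controls the switching sharpness:

```python
import math

def sigmoid_gate(qoe_sys: float, q_th: float = 0.6, beta: float = 5.0) -> float:
    """Smooth 0..1 switch centred at Q_th; larger beta makes the
    competition/cooperation transition sharper (illustrative only)."""
    return 1.0 / (1.0 + math.exp(-beta * (qoe_sys - q_th)))

# At the threshold the gate is exactly 0.5 regardless of beta;
# away from it, a larger beta pushes the gate toward 0 or 1.
at_threshold = sigmoid_gate(0.6)        # 0.5 for any beta
soft = sigmoid_gate(0.8, beta=1.0)      # gentle slope
sharp = sigmoid_gate(0.8, beta=10.0)    # near-saturated
```

Whatever the agent's actual convention, the sweep logic in scenarios 7 and 8 only relies on β setting the slope at Q_th and Q_th setting the midpoint.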

View File

@ -0,0 +1,47 @@
env:
bandwidth: 10.0e6
carrier_freq: 3.5
max_distance: 500
max_power: 1.0
min_distance: 50
min_rate_req: 500.0e3
noise_psd: -174
num_semantic_users: 3
num_subcarriers: 64
num_traditional_users: 3
rho_max: 1.0
rho_min: 0.05
subcarrier_spacing: 156250.0
w1: 0.7
w2: 0.3
network:
actor_hidden:
- 256
- 256
- 128
critic_hidden:
- 512
- 512
- 256
reward:
comp_self: 0.8
comp_sys: 0.2
coop_other: 0.3
coop_self: 0.5
coop_sys: 0.2
training:
actor_lr: 0.0001
batch_size: 256
beta: 5.0
buffer_capacity: 100000
critic_lr: 0.0003
gamma: 0.95
max_episodes: 3
max_steps: 10
ou_sigma_init: 0.2
ou_sigma_min: 0.01
ou_theta: 0.15
q_threshold: 0.6
seed: 42
tau: 0.01
update_interval: 5
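A quick consistency check on the `env` section above: the 64 subcarriers at 156.25 kHz spacing exactly tile the 10 MHz bandwidth, and, assuming the usual per-subcarrier noise model N0 = noise_psd + 10·log10(Δf) (the ChannelModel internals are not reproduced here), the thermal noise floor per subcarrier is about −122 dBm:

```python
import math

num_subcarriers = 64
subcarrier_spacing = 156250.0   # Hz, from the config above
noise_psd_dbm = -174.0          # dBm/Hz, thermal noise PSD

bandwidth = num_subcarriers * subcarrier_spacing  # matches 'bandwidth: 10.0e6'
noise_per_subcarrier_dbm = noise_psd_dbm + 10 * math.log10(subcarrier_spacing)
# noise_per_subcarrier_dbm ≈ -122.06 dBm
```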

BIN
code/results/run_20260228_153837/co_maddpg_best.pt (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_153837/co_maddpg_final.pt (Stored with Git LFS) Normal file

Binary file not shown.

View File

@ -0,0 +1,43 @@
{
"episode_qoe_sys": [
0.7113027844231694,
0.6344297213112167,
0.7739924098489253
],
"episode_qoe_semantic": [
0.4226055688463388,
0.2688594426224332,
0.5479848196978504
],
"episode_qoe_traditional": [
1.0,
1.0,
1.0
],
"episode_lambda": [
0.6207547547571596,
0.5395092957443972,
0.6895618004798005
],
"episode_fairness": [
0.7661071139341958,
0.7078592991985423,
0.8700568715234
],
"episode_rate_satisfaction": [
1.0,
1.0,
1.0
],
"episode_reward_s": [
5.776441792194522,
4.562596418470474,
6.7501691441542295
],
"episode_reward_b": [
8.449613896268866,
8.125998007753859,
8.729679052824276
],
"training_time": 0.04235124588012695
}
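A quick consistency check on the history above: with 3 semantic and 3 traditional users, `episode_qoe_sys` is the plain per-user mean, i.e. the midpoint of the two group means (the file-to-algorithm pairing is inferred from the neighbouring checkpoint names):

```python
# First episode of the history above
qoe_semantic = 0.4226055688463388
qoe_traditional = 1.0

# With equal group sizes (3 + 3 users), the system QoE is the midpoint
qoe_sys = (qoe_semantic + qoe_traditional) / 2
# Reproduces episode_qoe_sys[0] = 0.7113027844231694
```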

View File

@ -0,0 +1,47 @@
env:
bandwidth: 10000000.0
carrier_freq: 3.5
max_distance: 500
max_power: 1.0
min_distance: 50
min_rate_req: 500000.0
noise_psd: -174
num_semantic_users: 3
num_subcarriers: 64
num_traditional_users: 3
rho_max: 1.0
rho_min: 0.05
subcarrier_spacing: 156250.0
w1: 0.7
w2: 0.3
network:
actor_hidden:
- 256
- 256
- 128
critic_hidden:
- 512
- 512
- 256
reward:
comp_self: 0.8
comp_sys: 0.2
coop_other: 0.3
coop_self: 0.5
coop_sys: 0.2
training:
actor_lr: 0.0001
batch_size: 256
beta: 5.0
buffer_capacity: 100000
critic_lr: 0.0003
gamma: 0.95
max_episodes: 3
max_steps: 10
ou_sigma_init: 0.2
ou_sigma_min: 0.01
ou_theta: 0.15
q_threshold: 0.6
seed: 42
tau: 0.01
update_interval: 5

View File

@ -0,0 +1,47 @@
env:
bandwidth: 10000000.0
carrier_freq: 3.5
max_distance: 500
max_power: 1.0
min_distance: 50
min_rate_req: 500000.0
noise_psd: -174
num_semantic_users: 3
num_subcarriers: 64
num_traditional_users: 3
rho_max: 1.0
rho_min: 0.05
subcarrier_spacing: 156250.0
w1: 0.7
w2: 0.3
network:
actor_hidden:
- 256
- 256
- 128
critic_hidden:
- 512
- 512
- 256
reward:
comp_self: 0.8
comp_sys: 0.2
coop_other: 0.3
coop_self: 0.5
coop_sys: 0.2
training:
actor_lr: 0.0001
batch_size: 256
beta: 5.0
buffer_capacity: 100000
critic_lr: 0.0003
gamma: 0.95
max_episodes: 3
max_steps: 10
ou_sigma_init: 0.2
ou_sigma_min: 0.01
ou_theta: 0.15
q_threshold: 0.6
seed: 42
tau: 0.01
update_interval: 5

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@ -0,0 +1,43 @@
{
"episode_qoe_sys": [
0.7113027826418427,
0.634429719402772,
0.7739924051936342
],
"episode_qoe_semantic": [
0.42260556528368554,
0.2688594388055441,
0.5479848103872687
],
"episode_qoe_traditional": [
1.0,
1.0,
1.0
],
"episode_lambda": [
1.0,
1.0,
1.0
],
"episode_fairness": [
0.7661071135594882,
0.7078592958131911,
0.8700568677872864
],
"episode_rate_satisfaction": [
1.0,
1.0,
1.0
],
"episode_reward_s": [
6.535633391702112,
5.613156632833266,
7.2879088623236115
],
"episode_reward_b": [
7.690422261134742,
7.075437755222176,
8.191939241549074
],
"training_time": 0.05865025520324707
}
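This history runs with `episode_lambda` pinned at 1.0, i.e. the fully cooperative regime (the file-to-algorithm pairing is inferred). Assuming the per-step reward is the linear combination given by the config's `reward` block (coop_self=0.5, coop_other=0.3, coop_sys=0.2) over `max_steps: 10`, the episode reward sums reconstruct exactly from the episode-mean QoEs, because a linear reward makes the sum equal steps times the mean:

```python
# Episode-mean QoEs from the first episode above
q_s, q_b, q_sys = 0.42260556528368554, 1.0, 0.7113027826418427
steps = 10  # max_steps in the committed config

# Cooperative reward: own QoE, partner QoE, and system QoE, weighted per config
reward_s = steps * (0.5 * q_s + 0.3 * q_b + 0.2 * q_sys)
reward_b = steps * (0.5 * q_b + 0.3 * q_s + 0.2 * q_sys)
# reward_s ≈ 6.5356 and reward_b ≈ 7.6904, matching episode_reward_s[0]
# and episode_reward_b[0] above
```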

View File

@ -0,0 +1,47 @@
env:
bandwidth: 10000000.0
carrier_freq: 3.5
max_distance: 500
max_power: 1.0
min_distance: 50
min_rate_req: 500000.0
noise_psd: -174
num_semantic_users: 3
num_subcarriers: 64
num_traditional_users: 3
rho_max: 1.0
rho_min: 0.05
subcarrier_spacing: 156250.0
w1: 0.7
w2: 0.3
network:
actor_hidden:
- 256
- 256
- 128
critic_hidden:
- 512
- 512
- 256
reward:
comp_self: 0.8
comp_sys: 0.2
coop_other: 0.3
coop_self: 0.5
coop_sys: 0.2
training:
actor_lr: 0.0001
batch_size: 256
beta: 5.0
buffer_capacity: 100000
critic_lr: 0.0003
gamma: 0.95
max_episodes: 3
max_steps: 10
ou_sigma_init: 0.2
ou_sigma_min: 0.01
ou_theta: 0.15
q_threshold: 0.6
seed: 42
tau: 0.01
update_interval: 5

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

@@ -0,0 +1,43 @@
{
"episode_qoe_sys": [
0.7113027826418427,
0.634429719402772,
0.7739924051936342
],
"episode_qoe_semantic": [
0.42260556528368554,
0.2688594388055441,
0.5479848103872687
],
"episode_qoe_traditional": [
1.0,
1.0,
1.0
],
"episode_lambda": [
0.0,
0.0,
0.0
],
"episode_fairness": [
0.7661071135594882,
0.7078592958131911,
0.8700568677872864
],
"episode_rate_satisfaction": [
1.0,
1.0,
1.0
],
"episode_reward_s": [
4.80345008755317,
3.4197349492498965,
5.931863293485418
],
"episode_reward_b": [
9.422605565283686,
9.268859438805546,
9.54798481038727
],
"training_time": 0.050787925720214844
}

@@ -0,0 +1,47 @@
env:
bandwidth: 10000000.0
carrier_freq: 3.5
max_distance: 500
max_power: 1.0
min_distance: 50
min_rate_req: 500000.0
noise_psd: -174
num_semantic_users: 3
num_subcarriers: 64
num_traditional_users: 3
rho_max: 1.0
rho_min: 0.05
subcarrier_spacing: 156250.0
w1: 0.7
w2: 0.3
network:
actor_hidden:
- 256
- 256
- 128
critic_hidden:
- 512
- 512
- 256
reward:
comp_self: 0.8
comp_sys: 0.2
coop_other: 0.3
coop_self: 0.5
coop_sys: 0.2
training:
actor_lr: 0.0001
batch_size: 256
beta: 5.0
buffer_capacity: 100000
critic_lr: 0.0003
gamma: 0.95
max_episodes: 3
max_steps: 10
ou_sigma_init: 0.2
ou_sigma_min: 0.01
ou_theta: 0.15
q_threshold: 0.6
seed: 42
tau: 0.01
update_interval: 5

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

@@ -0,0 +1,43 @@
{
"episode_qoe_sys": [
0.7113027826418427,
0.634429719402772,
0.7739924051936342
],
"episode_qoe_semantic": [
0.42260556528368554,
0.2688594388055441,
0.5479848103872687
],
"episode_qoe_traditional": [
1.0,
1.0,
1.0
],
"episode_lambda": [
0.5,
0.5,
0.5
],
"episode_fairness": [
0.7661071135594882,
0.7078592958131911,
0.8700568677872864
],
"episode_rate_satisfaction": [
1.0,
1.0,
1.0
],
"episode_reward_s": [
5.669541739627641,
4.516445791041581,
6.609886077904513
],
"episode_reward_b": [
8.556513913209214,
8.17214859701386,
8.86996202596817
],
"training_time": 0.04902958869934082
}

@@ -0,0 +1,47 @@
env:
bandwidth: 10000000.0
carrier_freq: 3.5
max_distance: 500
max_power: 1.0
min_distance: 50
min_rate_req: 500000.0
noise_psd: -174
num_semantic_users: 3
num_subcarriers: 64
num_traditional_users: 3
rho_max: 1.0
rho_min: 0.05
subcarrier_spacing: 156250.0
w1: 0.7
w2: 0.3
network:
actor_hidden:
- 256
- 256
- 128
critic_hidden:
- 512
- 512
- 256
reward:
comp_self: 0.8
comp_sys: 0.2
coop_other: 0.3
coop_self: 0.5
coop_sys: 0.2
training:
actor_lr: 0.0001
batch_size: 256
beta: 5.0
buffer_capacity: 100000
critic_lr: 0.0003
gamma: 0.95
max_episodes: 3
max_steps: 10
ou_sigma_init: 0.2
ou_sigma_min: 0.01
ou_theta: 0.15
q_threshold: 0.6
seed: 42
tau: 0.01
update_interval: 5

BIN
code/results/run_20260228_153901/iddpg_best.pt/actor_b.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_153901/iddpg_best.pt/actor_s.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_153901/iddpg_best.pt/critic_b.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_153901/iddpg_best.pt/critic_s.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_153901/iddpg_final.pt/actor_b.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_153901/iddpg_final.pt/actor_s.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_153901/iddpg_final.pt/critic_b.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_153901/iddpg_final.pt/critic_s.pth (Stored with Git LFS) Normal file

Binary file not shown.

@@ -0,0 +1,43 @@
{
"episode_qoe_sys": [
0.7113507545159694,
0.634429719402772,
0.7743815650859412
],
"episode_qoe_semantic": [
0.4227015090319389,
0.2688594388055441,
0.5487631301718825
],
"episode_qoe_traditional": [
1.0,
1.0,
1.0
],
"episode_lambda": [
0.0,
0.0,
0.0
],
"episode_fairness": [
0.766128164912615,
0.7078592958131911,
0.8703367376079599
],
"episode_rate_satisfaction": [
1.0,
1.0,
1.0
],
"episode_reward_s": [
4.804313581287451,
3.4197349492498965,
5.938868171546943
],
"episode_reward_b": [
9.422701509031938,
9.268859438805546,
9.548763130171885
],
"training_time": 0.04243969917297363
}

@@ -0,0 +1,47 @@
env:
bandwidth: 10000000.0
carrier_freq: 3.5
max_distance: 500
max_power: 1.0
min_distance: 50
min_rate_req: 500000.0
noise_psd: -174
num_semantic_users: 3
num_subcarriers: 64
num_traditional_users: 3
rho_max: 1.0
rho_min: 0.05
subcarrier_spacing: 156250.0
w1: 0.7
w2: 0.3
network:
actor_hidden:
- 256
- 256
- 128
critic_hidden:
- 512
- 512
- 256
reward:
comp_self: 0.8
comp_sys: 0.2
coop_other: 0.3
coop_self: 0.5
coop_sys: 0.2
training:
actor_lr: 0.0001
batch_size: 256
beta: 5.0
buffer_capacity: 100000
critic_lr: 0.0003
gamma: 0.95
max_episodes: 3
max_steps: 10
ou_sigma_init: 0.2
ou_sigma_min: 0.01
ou_theta: 0.15
q_threshold: 0.6
seed: 42
tau: 0.01
update_interval: 5

@@ -0,0 +1,43 @@
{
"episode_qoe_sys": [
0.9155535371015582,
0.9138688645096937,
0.9056880368178429
],
"episode_qoe_semantic": [
0.8311070742031162,
0.8277377290193872,
0.8113760736356859
],
"episode_qoe_traditional": [
1.0,
1.0,
1.0
],
"episode_lambda": [
0.5,
0.5,
0.5
],
"episode_fairness": [
0.9913232844941204,
0.9909702012153767,
0.9878097137456644
],
"episode_rate_satisfaction": [
1.0,
1.0,
1.0
],
"episode_reward_s": [
8.733303056523374,
8.708032967645405,
8.585320552267643
],
"episode_reward_b": [
9.57776768550779,
9.569344322548467,
9.528440184089215
],
"training_time": 0.008931398391723633
}

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

@@ -0,0 +1,43 @@
{
"episode_qoe_sys": [
0.8968791602798086,
0.862955282639947,
0.8855898874949111
],
"episode_qoe_semantic": [
0.793758320559617,
0.7259105652798941,
0.7711797749898219
],
"episode_qoe_traditional": [
1.0,
1.0,
1.0
],
"episode_lambda": [
0.5,
0.5,
0.5
],
"episode_fairness": [
0.9823216140378461,
0.9653094801712315,
0.977852788570174
],
"episode_rate_satisfaction": [
1.0,
1.0,
1.0
],
"episode_reward_s": [
8.968791602798085,
8.62955282639947,
8.85589887494911
],
"episode_reward_b": [
8.968791602798085,
8.62955282639947,
8.85589887494911
],
"training_time": 0.026282548904418945
}

@@ -0,0 +1,47 @@
env:
bandwidth: 10000000.0
carrier_freq: 3.5
max_distance: 500
max_power: 1.0
min_distance: 50
min_rate_req: 500000.0
noise_psd: -174
num_semantic_users: 3
num_subcarriers: 64
num_traditional_users: 3
rho_max: 1.0
rho_min: 0.05
subcarrier_spacing: 156250.0
w1: 0.7
w2: 0.3
network:
actor_hidden:
- 256
- 256
- 128
critic_hidden:
- 512
- 512
- 256
reward:
comp_self: 0.8
comp_sys: 0.2
coop_other: 0.3
coop_self: 0.5
coop_sys: 0.2
training:
actor_lr: 0.0001
batch_size: 256
beta: 5.0
buffer_capacity: 100000
critic_lr: 0.0003
gamma: 0.95
max_episodes: 3
max_steps: 10
ou_sigma_init: 0.2
ou_sigma_min: 0.01
ou_theta: 0.15
q_threshold: 0.6
seed: 42
tau: 0.01
update_interval: 5

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

@@ -0,0 +1,43 @@
{
"episode_qoe_sys": [
0.9476448587100288,
0.9923984342804163,
0.7789848763042758
],
"episode_qoe_semantic": [
0.896932899256876,
0.9847968685608324,
0.7648825601976259
],
"episode_qoe_traditional": [
0.9983568181631816,
1.0,
0.7930871924109255
],
"episode_lambda": [
1.0,
1.0,
1.0
],
"episode_fairness": [
0.9942327960366917,
0.9999411716548007,
0.8905328191108769
],
"episode_rate_satisfaction": [
0.9666666666666666,
1.0,
0.7666666666666666
],
"episode_reward_s": [
9.476448587100288,
9.923984342804163,
7.789848763042759
],
"episode_reward_b": [
9.476448587100288,
9.923984342804163,
7.789848763042759
],
"training_time": 0.031549930572509766
}

BIN
code/results/run_20260228_154150/co_maddpg_best.pt (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_154150/co_maddpg_final.pt (Stored with Git LFS) Normal file

Binary file not shown.

@@ -0,0 +1,819 @@
{
"episode_qoe_sys": [
0.7482231308526839,
0.8309082233561463,
0.8084170820757554,
0.7090661422500526,
0.7194826664883084,
0.6245194124588949,
0.7412642678718226,
0.7221448044513399,
0.7616440464228234,
0.7643558815553954,
0.6572493473713608,
0.721670593949549,
0.8822724023489592,
0.9435396352401472,
0.9325469563439573,
0.938130047220731,
0.8636689001753414,
0.95773008362579,
0.875634562616548,
0.9578399247781135,
0.927812144330949,
0.9698146492596956,
0.9086419845169996,
0.9600391162164482,
0.8819534906972089,
0.9705150137636998,
0.9817752467613294,
0.9244882556828192,
0.9591994798601222,
0.9567062474565199,
0.9612473818655292,
0.9489192169510843,
0.9115611001535519,
0.9696553416677196,
0.9355053443615821,
0.9413769693090847,
0.9543642718155892,
0.933331057191906,
0.9336041737014251,
0.9604106194733352,
0.9493491198733224,
0.9357083780290258,
0.9027962081393558,
0.9713731990207785,
0.954643593061007,
0.9662689614556443,
0.9686350404402433,
0.9676491602261227,
0.9778232788592959,
0.9087213895548492,
0.969287759304687,
0.981932590271949,
0.9474101652260117,
0.9623562990072777,
0.9805665598136514,
0.9331272921499283,
0.8304356491064996,
0.9793316972058003,
0.9090033676712256,
0.9747586605171652,
0.9403938437409362,
0.9667584914513386,
0.9359455160996325,
0.9557410564679502,
0.9597904917192871,
0.9486689840734478,
0.9293529262170188,
0.966203688479699,
0.9669558504123625,
0.9525303671383566,
0.975497262291949,
0.9876639371029221,
0.9698650321922837,
0.980560180983099,
0.9134075369766181,
0.9616650872368651,
0.9570056880926423,
0.969389305006777,
0.9843389928732023,
0.9182309941219532,
0.9368243124765431,
0.9343432886045736,
0.9353967218354095,
0.9535820909511235,
0.9625949822273279,
0.9164164469644897,
0.8952548217642696,
0.9246997270424372,
0.9737879436258754,
0.9575701007311774,
0.9572732612193962,
0.9662254002356427,
0.8937236297098302,
0.906247835587458,
0.9012673014356085,
0.9701107895336493,
0.9200143620653359,
0.8229309600483448,
0.7912501938404811,
0.9298436984090547
],
"episode_qoe_semantic": [
0.5386190737468054,
0.6825896901941588,
0.7444487178417911,
0.5757792493885436,
0.6787930315981019,
0.5986785176522087,
0.6479199531676869,
0.5526496652477322,
0.7266452879302231,
0.6130892447787215,
0.8144986947427216,
0.5833411878990981,
0.7649890809441174,
0.8870792704802943,
0.8678067756263826,
0.8762600944414619,
0.7940044670173492,
0.9287935005849133,
0.8924571062173975,
0.9356798495562267,
0.8556242886618981,
0.9396292985193915,
0.8172839690339992,
0.9200782324328957,
0.9045604093248163,
0.9572517751128246,
0.9635504935226588,
0.8522064262188362,
0.9183989597202443,
0.9134124949130397,
0.9291562484403696,
0.9045051005688349,
0.8231222003071039,
0.9393106833354392,
0.8710106887231642,
0.8827539386181692,
0.9088270208690225,
0.8666621143838121,
0.8672083474028498,
0.9208212389466705,
0.9178438008745937,
0.8728069332495135,
0.839933904057521,
0.942746398041557,
0.9092871861220141,
0.9325379229112888,
0.9684385906490054,
0.9352983204522449,
0.9559614240495207,
0.8605450294166949,
0.9554146758149128,
0.9638651805438978,
0.8968995496571779,
0.9304760349063185,
0.9611331196273027,
0.8662545842998567,
0.6608712982129993,
0.9586633944116002,
0.851340068675784,
0.9591310877960931,
0.9071086512942736,
0.9584505903221894,
0.9118910321992645,
0.9114821129359002,
0.9195809834385742,
0.8973379681468953,
0.8587058524340375,
0.9337539267984504,
0.9374069000696267,
0.905060734276713,
0.954279696047815,
0.975327874205844,
0.9397300643845672,
0.9611203619661981,
0.8268150739532364,
0.9233301744737298,
0.9140113761852845,
0.9387786100135539,
0.9686818971010805,
0.8417440582243283,
0.8736486249530862,
0.8686865772091467,
0.8974601103374859,
0.9071641819022466,
0.9251899644546556,
0.8328328939289794,
0.8771763101952059,
0.888685826147213,
0.9475758872517507,
0.938096489004522,
0.9278798557721253,
0.9591174671379519,
0.8836767640997718,
0.8658290045082492,
0.802534602871217,
0.9402215790672989,
0.9224410630049446,
0.9262775668313314,
0.8330185534925567,
0.9059670530501951
],
"episode_qoe_traditional": [
0.9578271879585623,
0.979226756518134,
0.8723854463097195,
0.8423530351115613,
0.7601723013785147,
0.6503603072655811,
0.8346085825759586,
0.8916399436549475,
0.7966428049154236,
0.915622518332069,
0.5,
0.86,
0.9995557237538009,
1.0,
0.997287137061532,
1.0,
0.9333333333333332,
0.9866666666666667,
0.8588120190156985,
0.98,
1.0,
1.0,
1.0,
1.0,
0.8593465720696014,
0.9837782524145748,
1.0,
0.996770085146802,
1.0,
1.0,
0.9933385152906886,
0.9933333333333333,
1.0,
1.0,
1.0,
1.0,
0.9999015227621558,
1.0,
1.0,
1.0,
0.9808544388720511,
0.9986098228085382,
0.9656585122211907,
1.0,
1.0,
1.0,
0.9688314902314811,
1.0,
0.9996851336690711,
0.9568977496930029,
0.983160842794461,
1.0,
0.9979207807948456,
0.994236563108237,
1.0,
1.0,
1.0,
1.0,
0.9666666666666667,
0.9903862332382372,
0.9736790361875985,
0.9750663925804878,
0.96,
1.0,
1.0,
1.0,
1.0,
0.9986534501609475,
0.9965048007550985,
1.0,
0.9967148285360831,
1.0,
1.0,
1.0,
1.0,
1.0,
1.0,
1.0,
0.9999960886453242,
0.9947179300195782,
1.0,
1.0,
0.9733333333333333,
1.0,
1.0,
1.0,
0.9133333333333333,
0.960713627937661,
1.0,
0.9770437124578328,
0.9866666666666667,
0.9733333333333333,
0.9037704953198883,
0.9466666666666668,
1.0,
1.0,
0.9175876611257278,
0.7195843532653585,
0.7494818341884059,
0.9537203437679147
],
"episode_lambda": [
0.6622670842692696,
0.7500482804667972,
0.712824052857097,
0.6195217465028277,
0.6246417035096472,
0.5200353938173463,
0.6540794380013576,
0.636338735783167,
0.6697885704126283,
0.6673772741695139,
0.550696898489669,
0.6419142428124175,
0.7970499706714043,
0.8443467429477454,
0.8360455337178382,
0.8426261647409603,
0.7756630288388194,
0.8528887866697177,
0.7712979007575007,
0.8515188194536911,
0.831982178309649,
0.8630587804878813,
0.8155616597992922,
0.8523052738948976,
0.777859159494738,
0.8628422365463099,
0.8703896401735133,
0.832147474615274,
0.8567125261812731,
0.8540584100991033,
0.8574796597896875,
0.8486073293034732,
0.8167214240717635,
0.8631445596172717,
0.8322489724890078,
0.8427957411813948,
0.8513288328971008,
0.83610420114875,
0.8346301407086982,
0.8555261772562746,
0.8488973898417297,
0.8382545611006229,
0.8106296421557165,
0.8631865700905925,
0.8514182712928211,
0.8589408395624666,
0.8583146940820728,
0.8610052588591282,
0.8673828567667745,
0.8136839515280165,
0.8595398670114222,
0.8706188770848126,
0.8466916541391454,
0.8582676205072488,
0.8695099040790814,
0.8382045510895932,
0.7500335685307956,
0.8690725917312673,
0.8090861429003792,
0.8656024432527438,
0.8351962073195872,
0.8565593037036073,
0.8331092982129398,
0.8529304768203586,
0.8558437350530216,
0.8469989863348539,
0.8338577639421785,
0.8574191637814199,
0.8613039309955337,
0.8513387912574415,
0.8664755306835339,
0.8739801091298787,
0.8631668101358624,
0.8698988380111281,
0.8215561666200218,
0.8573165867294444,
0.85076266547904,
0.861400442043415,
0.8717986455959276,
0.824437184508494,
0.8386643789964086,
0.8397715181377856,
0.8335028065217736,
0.8505127668173627,
0.857986440413312,
0.8243548273329769,
0.7932473560299084,
0.8226255312242924,
0.8657210409510292,
0.8504455784938111,
0.8516167341861332,
0.8570833545359118,
0.7920852203092764,
0.8101063207989906,
0.8130926271378134,
0.8628189629073196,
0.813934169535405,
0.7239675536814041,
0.697161418562565,
0.8294543427076831
],
"episode_fairness": [
0.8365546847460921,
0.9190493291539967,
0.8767738415960451,
0.8071548015889092,
0.825637115285233,
0.7091905519987111,
0.8440418038129003,
0.7974779147314435,
0.851609602820255,
0.8315070925609628,
0.7393591232441391,
0.8363205523817605,
0.9651736168759403,
0.991424731180848,
0.9866162194641283,
0.9930357228040573,
0.9394680107957133,
0.9895327325697711,
0.9245676404350065,
0.987910098266364,
0.9856983743490316,
0.9975037560685449,
0.9760507650255529,
0.9823000832688425,
0.9310703470514532,
0.9939744912478992,
0.9989586243549191,
0.9880979521952423,
0.9969769597575652,
0.9949971208059558,
0.9961833484771023,
0.990731251436096,
0.9672001410911061,
0.9981150029000626,
0.9737714761736297,
0.989633671768632,
0.9913528156558986,
0.98326220746304,
0.9785599919206249,
0.9931555003464905,
0.9918503250812633,
0.9859206817942338,
0.9622335027623293,
0.9959586040162285,
0.9920632033512357,
0.9927084111797256,
0.9845087014139241,
0.9961820894324529,
0.997570322761397,
0.9722129676965667,
0.9914992500833033,
0.9992520130255573,
0.9894337847245858,
0.9959853761029812,
0.9983634058781189,
0.990423946983358,
0.9386654920729426,
0.9988165984692309,
0.953969784720989,
0.9950127558283774,
0.9664745267970803,
0.9866207281794412,
0.9735376760704572,
0.9920040183202619,
0.9945369124345247,
0.9892864380977443,
0.9846458819186641,
0.9897527469459123,
0.9970213836436224,
0.9934854754616879,
0.9973232263785293,
0.9995320648236641,
0.9975759579905328,
0.9992302951767392,
0.9766872225095262,
0.994365998770187,
0.9823785122070894,
0.9938573115267891,
0.9985216182723887,
0.9782907554686366,
0.986673543946389,
0.9900992602114989,
0.9787362624469867,
0.9922257657716247,
0.9956499851515308,
0.9834955996009246,
0.9357723855514355,
0.9664917058444696,
0.9979421290209091,
0.9855222437903479,
0.9870800073423273,
0.9854481134757457,
0.9397208713676178,
0.9652651197884685,
0.9733074839151556,
0.995610929300921,
0.9485385709284205,
0.8559766762591738,
0.8548966452833797,
0.9686293218606085
],
"episode_rate_satisfaction": [
0.9466666666666668,
0.9733333333333333,
0.8666666666666667,
0.8333333333333333,
0.68,
0.5933333333333334,
0.82,
0.8866666666666667,
0.7933333333333334,
0.9133333333333333,
0.5,
0.86,
0.9933333333333333,
1.0,
0.98,
1.0,
0.9333333333333332,
0.9866666666666667,
0.8333333333333333,
0.98,
1.0,
1.0,
1.0,
1.0,
0.8133333333333332,
0.96,
1.0,
0.9866666666666666,
1.0,
1.0,
0.9733333333333334,
0.9933333333333333,
1.0,
1.0,
1.0,
1.0,
0.9933333333333333,
1.0,
1.0,
1.0,
0.9533333333333335,
0.9866666666666666,
0.96,
1.0,
1.0,
1.0,
0.9666666666666666,
1.0,
0.9933333333333333,
0.9333333333333332,
0.98,
1.0,
0.98,
0.9866666666666666,
1.0,
1.0,
1.0,
1.0,
0.9666666666666667,
0.9733333333333333,
0.96,
0.9733333333333333,
0.96,
1.0,
1.0,
1.0,
1.0,
0.98,
0.98,
1.0,
0.9933333333333333,
1.0,
1.0,
1.0,
1.0,
1.0,
1.0,
1.0,
0.9933333333333333,
0.98,
1.0,
1.0,
0.9733333333333333,
1.0,
1.0,
1.0,
0.9133333333333333,
0.9333333333333332,
1.0,
0.9666666666666666,
0.9866666666666667,
0.9733333333333333,
0.9,
0.9466666666666668,
1.0,
1.0,
0.9066666666666666,
0.7133333333333334,
0.7266666666666667,
0.9533333333333333
],
"episode_reward_s": [
32.92383809440045,
38.84142532446255,
39.35637621048818,
33.07791352656574,
35.646210709806844,
31.18422146051695,
35.46536098394087,
32.65580211798654,
38.381744393426914,
34.941406424115705,
38.51242888182305,
33.89074410673731,
42.0506590087602,
46.282565423167455,
45.58336583191295,
45.96564685961806,
42.192963393599115,
47.49666238964523,
44.56877145020918,
47.71015188821474,
45.198230723548036,
48.048246036239085,
43.84526480983841,
47.300356963558094,
45.20638878358768,
48.34769531041928,
48.82699860713653,
45.084045188208066,
47.36021198696687,
47.17476820456252,
47.58827205532081,
46.77030916407793,
43.96637506070994,
48.041371093237736,
45.54614070290883,
46.13934763862364,
46.9976740018387,
45.5722055506037,
45.53325189744269,
47.39979573269473,
47.01031877519191,
45.752946543812186,
44.17447857435883,
48.132430155850216,
47.01165338649684,
47.7725229962693,
48.5468174990822,
47.89221472556719,
48.56183376636242,
44.868427322906776,
48.32321435725977,
48.839759821494496,
46.56077070131724,
47.64514827434208,
48.74531046499078,
45.61160357468347,
38.30599666214984,
48.671281265623506,
44.37724726011491,
48.51998361495961,
46.52964288990172,
48.359191337333144,
46.69730174811404,
47.10170879335486,
47.37218108827143,
46.60505915926487,
45.32149665650905,
47.741738354794364,
47.912722190171124,
46.90016054953645,
48.46717237096394,
49.20999546691045,
48.053197160200156,
48.75259152456535,
44.227435588996926,
47.49857467520549,
47.11214878132527,
47.98339873037769,
48.99083793601537,
44.61620329488045,
45.8075475819309,
45.70824657400885,
46.27662379954107,
46.93775305680158,
47.56567107661146,
44.44637503058194,
44.61310640779463,
45.7268104779848,
48.31129057048788,
47.64982087078567,
47.44713963216457,
48.32337400294084,
44.84925917166148,
44.87998311830005,
43.377047635109356,
48.05872898172292,
46.267529073103596,
43.78131195170364,
41.00391972731087,
46.24856144321394
],
"episode_reward_b": [
41.89847499086794,
44.249397011152084,
41.485331997087336,
37.828700698439526,
36.302055939023994,
31.267719785372527,
38.66106580324139,
39.55867832714744,
37.782660248855414,
41.494181731423836,
27.212505855313015,
38.276315288217596,
46.176581226135724,
48.07139810084726,
47.67132980248276,
47.84735786245503,
44.173926623935024,
48.27634597293376,
42.99468481144563,
48.07384058959662,
47.582983709546866,
48.93321888973049,
47.018933641861565,
48.70355465808671,
42.9889602861332,
48.703806065950694,
49.35052606899642,
47.36478038007383,
48.55973599904535,
48.49585654108949,
48.5364661312321,
48.12161253103048,
47.18973495464524,
48.92416307353423,
48.004393733249366,
47.99834929228482,
48.438753179720216,
47.76090016858691,
47.8271654726998,
48.6412662146388,
47.92459321214032,
47.817891259090416,
46.105142239576736,
49.00488974622761,
48.45270591960388,
48.854373149295135,
48.31668654494212,
48.872701297045055,
49.22049411956717,
46.0037116325781,
48.60556157320892,
49.3534992057004,
48.18024582128393,
48.59048162638569,
49.31134551637437,
47.70112564030935,
44.73756824850012,
49.26188845495652,
46.52308950700766,
48.95588243675687,
47.50974148419189,
48.31665780780074,
46.8972498618492,
48.472396853440145,
48.60686808365729,
48.26183924807991,
47.61379596519284,
48.87863049317556,
48.78286285106514,
48.3528761642992,
49.08255385823096,
49.55639824338176,
48.933306059028205,
49.30342657374457,
47.11331810866491,
48.667934048481015,
48.588420027938966,
48.9555317703,
49.443061351304884,
47.20689611731486,
47.874883665723395,
47.7260822864485,
47.26304838399987,
48.420456038310746,
48.69382714612132,
47.19526966586705,
44.912375768632344,
46.74316222625892,
49.06750379209967,
48.107189202332094,
48.280186489775026,
48.299166020623446,
44.52310379932152,
45.744800440445744,
46.74968250845148,
48.95234997164203,
45.73390713343002,
38.51178405313085,
38.121099656737286,
46.73580839769153
],
"training_time": 37.96440362930298
}

@@ -0,0 +1,47 @@
env:
bandwidth: 10000000.0
carrier_freq: 3.5
max_distance: 500
max_power: 1.0
min_distance: 50
min_rate_req: 500000.0
noise_psd: -174
num_semantic_users: 3
num_subcarriers: 64
num_traditional_users: 3
rho_max: 1.0
rho_min: 0.05
subcarrier_spacing: 156250.0
w1: 0.7
w2: 0.3
network:
actor_hidden:
- 256
- 256
- 128
critic_hidden:
- 512
- 512
- 256
reward:
comp_self: 0.8
comp_sys: 0.2
coop_other: 0.3
coop_self: 0.5
coop_sys: 0.2
training:
actor_lr: 0.0001
batch_size: 256
beta: 5.0
buffer_capacity: 100000
critic_lr: 0.0003
gamma: 0.95
max_episodes: 100
max_steps: 50
ou_sigma_init: 0.2
ou_sigma_min: 0.01
ou_theta: 0.15
q_threshold: 0.6
seed: 42
tau: 0.01
update_interval: 5

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

BIN
code/results/run_20260228_154150/iddpg_best.pt/actor_b.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_154150/iddpg_best.pt/actor_s.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_154150/iddpg_best.pt/critic_b.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_154150/iddpg_best.pt/critic_s.pth (Stored with Git LFS) Normal file

Binary file not shown.

BIN
code/results/run_20260228_154150/iddpg_final.pt/actor_b.pth (Stored with Git LFS) Normal file

Binary file not shown.

Some files were not shown because too many files have changed in this diff.
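The per-run metrics above are plain JSON, so they can be aggregated without any project code. The sketch below is a minimal, hedged example: the metrics are copied verbatim from the 3-episode smoke run shown earlier in this diff (the on-disk filename of that JSON file is not visible here, so reading it from disk is left out and the dict is embedded directly). It also checks an observable property of this particular run: each `episode_qoe_sys` value equals the unweighted mean of the corresponding semantic and traditional QoE values.

```python
import json
from statistics import mean

# Metrics copied verbatim from one of the results files in this commit
# (the 3-episode smoke run). The source filename is not shown in the diff,
# so the data is embedded inline rather than loaded from a path.
metrics = {
    "episode_qoe_sys": [0.7113027826418427, 0.634429719402772, 0.7739924051936342],
    "episode_qoe_semantic": [0.42260556528368554, 0.2688594388055441, 0.5479848103872687],
    "episode_qoe_traditional": [1.0, 1.0, 1.0],
    "episode_fairness": [0.7661071135594882, 0.7078592958131911, 0.8700568677872864],
}

# Round-trip through JSON, as if the dict had been read from a results file.
metrics = json.loads(json.dumps(metrics))

avg_qoe = mean(metrics["episode_qoe_sys"])
print(f"mean system QoE over {len(metrics['episode_qoe_sys'])} episodes: {avg_qoe:.4f}")
# → mean system QoE over 3 episodes: 0.7066

# Observed in this run: system QoE is the unweighted mean of the
# semantic-user and traditional-user QoE for each episode.
for s, t, sys_qoe in zip(metrics["episode_qoe_semantic"],
                         metrics["episode_qoe_traditional"],
                         metrics["episode_qoe_sys"]):
    assert abs((s + t) / 2 - sys_qoe) < 1e-9
```

The same pattern extends to the longer 100-episode run: point `json.load` at each run directory's metrics file and compare `episode_qoe_sys`, `episode_fairness`, and `episode_rate_satisfaction` across algorithms.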