Early-Settled Advantage Q-Learning for Finite-Horizon MDP Regret
- Task ID: math.ucb_q_learning_regret
- Domain: math
- Subdomain: reinforcement_learning
- Status: test
- Tags: reinforcement_learning, q_learning, UCB, LCB, regret, finite_horizon_mdp, reference_advantage, early_settled_reference, model_free
Public Summary
This page is generated from task metadata and selected public-safe excerpts.
Example B1 Prompt Excerpt
# Task
You are given `data/problem_setup.json`, which describes several hidden finite-horizon tabular episodic MDPs only by their public dimensions: `S`, `A`, `H`, `K`, `initial_state`, and reward range. The transition kernels and reward tables are not provided. The evaluator will instantiate the hidden environments and interact with your code online for `K` episodes per case.
Write `analysis.py` implementing a memory-efficient model-free online learner following the main line of Q-EarlySettled-Advantage from the revised UCB paper:
- Maintain finite-horizon tables `Q[h, s, a]`, `V[h, s]`, and visit counts `N[h, s, a]`.
- Maintain at least one optimistic upper-confidence sequence and one pessimistic lower-confidence sequence, with corresponding `V_upper` and `V_lower` values.
- Maintain a reference value `V_ref` and update it early in learning; stop/lock/freeze the reference once the current optimistic value is close enough to the lower-confidence value.
- Use reference-advantage decomposition: estimate the reference contribution and the advantage contribution `V - V_ref` separately.
- Track running means/second moments or variance-like quantities for the reference and advantage terms, and use them in a confidence bonus.
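
The bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the evaluator's expected solution and not the paper's exact algorithm. The class name `EarlySettledSketch`, the `act`/`update` interface, rewards normalized to [0, 1], the learning rate `(H + 1) / (H + n)`, the constant `c_b`, the freeze threshold `gap`, and the simplified Bernstein-style bonus are all hypothetical choices made for the example.

```python
import numpy as np


class EarlySettledSketch:
    """Simplified reference-advantage Q-learner; all constants are assumptions."""

    def __init__(self, S, A, H, K, delta=0.01, c_b=1.0, gap=1.0):
        self.H, self.c_b, self.gap = H, c_b, gap
        self.log_term = float(np.log(S * A * H * K / delta))
        opt = (H - np.arange(H)).astype(float)  # optimistic start: H - h
        self.N = np.zeros((H, S, A), dtype=int)
        self.Q = np.tile(opt[:, None, None], (1, S, A))
        self.Q_ucb = self.Q.copy()
        self.Q_ref = self.Q.copy()
        self.Q_lcb = np.zeros((H, S, A))
        # Value tables carry an extra all-zero row for the step after h = H - 1.
        self.V = np.zeros((H + 1, S))  # plays the role of V_upper in this sketch
        self.V[:H] = opt[:, None]
        self.V_lower = np.zeros((H + 1, S))
        self.V_ref = self.V.copy()
        self.ref_open = np.ones((H, S), dtype=bool)  # False once the reference is frozen
        # Running first and second moments of the reference and advantage terms.
        self.mu_ref = np.zeros((H, S, A))
        self.m2_ref = np.zeros((H, S, A))
        self.mu_adv = np.zeros((H, S, A))
        self.m2_adv = np.zeros((H, S, A))

    def act(self, h, s):
        # Greedy with respect to the optimistic Q table.
        return int(np.argmax(self.Q[h, s]))

    def update(self, h, s, a, r, s_next):
        i = (h, s, a)
        self.N[i] += 1
        n = int(self.N[i])
        eta = (self.H + 1) / (self.H + n)  # assumed learning-rate schedule
        L = self.log_term
        hoeffding = self.c_b * np.sqrt(self.H ** 3 * L / n)
        v_up = self.V[h + 1, s_next]
        v_low = self.V_lower[h + 1, s_next]
        v_ref = self.V_ref[h + 1, s_next]

        # Optimistic and pessimistic sequences with a Hoeffding-style bonus.
        self.Q_ucb[i] = (1 - eta) * self.Q_ucb[i] + eta * (r + v_up + hoeffding)
        self.Q_lcb[i] = (1 - eta) * self.Q_lcb[i] + eta * (r + v_low - hoeffding)

        # Reference term: plain running average. Advantage term: eta-weighted average.
        adv = v_up - v_ref
        self.mu_ref[i] += (v_ref - self.mu_ref[i]) / n
        self.m2_ref[i] += (v_ref ** 2 - self.m2_ref[i]) / n
        self.mu_adv[i] = (1 - eta) * self.mu_adv[i] + eta * adv
        self.m2_adv[i] = (1 - eta) * self.m2_adv[i] + eta * adv ** 2
        var_ref = max(self.m2_ref[i] - self.mu_ref[i] ** 2, 0.0)
        var_adv = max(self.m2_adv[i] - self.mu_adv[i] ** 2, 0.0)
        # Simplified variance-aware bonus plus an assumed lower-order term.
        bonus = self.c_b * (np.sqrt(var_ref * L / n)
                            + np.sqrt(self.H * var_adv * L / n)
                            + self.H ** 2 * L / n ** 0.75)

        # Reference-advantage update, then keep the smallest optimistic estimate.
        self.Q_ref[i] = (1 - eta) * self.Q_ref[i] + eta * (r + adv + self.mu_ref[i] + bonus)
        self.Q[i] = min(self.Q_ref[i], self.Q_ucb[i], self.Q[i])
        self.V[h, s] = self.Q[h, s].max()
        self.V_lower[h, s] = max(self.Q_lcb[h, s].max(), self.V_lower[h, s])

        # Early settlement: track V while the gap is large, then freeze for good.
        if self.ref_open[h, s]:
            self.V_ref[h, s] = self.V[h, s]
            if self.V[h, s] - self.V_lower[h, s] <= self.gap:
                self.ref_open[h, s] = False
```

In this sketch the learner stores only `O(SAH)` numbers and never builds a transition model. Taking the minimum over `Q_ref`, `Q_ucb`, and the previous `Q` keeps `V` non-increasing, and the running maximum keeps `V_lower` non-decreasing, so once the gap falls below the threshold the frozen `V_ref[h, s]` no longer changes.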
Notes
- This page is a generated site artifact.
- Higher-level prompt details and internal benchmark specifics may remain intentionally undisclosed.