Early-Settled Advantage Q-Learning for Finite-Horizon MDP Regret

Task ID: math.ucb_q_learning_regret
Domain: math
Subdomain: reinforcement_learning
Status: test
Tags: reinforcement_learning, q_learning, UCB, LCB, regret, finite_horizon_mdp, reference_advantage, early_settled_reference, model_free

Public Summary

This page is generated from task metadata and selected public-safe excerpts.

Example B1 Prompt Excerpt

# Task
You are given `data/problem_setup.json`, which describes several hidden finite-horizon tabular episodic MDPs only by their public dimensions: `S`, `A`, `H`, `K`, `initial_state`, and reward range. The transition kernels and reward tables are not provided. The evaluator will instantiate the hidden environments and interact with your code online for `K` episodes per case.
Write `analysis.py` implementing a memory-efficient, model-free online learner that follows the main structure of Q-EarlySettled-Advantage from the revised UCB paper:
- Maintain finite-horizon tables `Q[h, s, a]`, `V[h, s]`, and visit counts `N[h, s, a]`.
- Maintain at least one optimistic upper-confidence sequence and one pessimistic lower-confidence sequence, with corresponding `V_upper` and `V_lower` values.
- Maintain a reference value `V_ref` and update it early in learning; freeze the reference once the current optimistic value is close enough to the lower-confidence value.
- Use reference-advantage decomposition: estimate the reference contribution and the advantage contribution `V - V_ref` separately.
- Track running means/second moments or variance-like quantities for the reference and advantage terms, and use them in a confidence bonus.
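The requirements above can be illustrated with a minimal sketch. It is not the paper's exact algorithm or the expected `analysis.py`: the class name `EarlySettledSketch`, the helper `run_demo`, the bonus constant `c`, the freeze threshold `beta`, the bonus formulas, and the assumption that rewards lie in `[0, 1]` are all hypothetical choices made for illustration.

```python
import numpy as np


class EarlySettledSketch:
    """Simplified reference-advantage learner; constants and bonuses are assumptions."""

    def __init__(self, S, A, H, c=1.0, beta=1.0):
        self.H, self.c, self.beta = H, c, beta
        left = (H - np.arange(H + 1)).astype(float)        # max return from step h (rewards in [0, 1])
        self.V = np.tile(left[:, None], (1, S))            # optimistic values, shape (H + 1, S)
        self.V_lower = np.zeros((H + 1, S))                # pessimistic values
        self.V_ref = self.V.copy()                         # reference values
        self.frozen = np.zeros((H + 1, S), dtype=bool)     # True once V_ref is settled
        self.Q = np.tile(left[:H, None, None], (1, S, A))  # combined optimistic estimate
        self.Q_ucb, self.Q_ra = self.Q.copy(), self.Q.copy()
        self.Q_lcb = np.zeros((H, S, A))
        self.N = np.zeros((H, S, A), dtype=int)
        # Running first and second moments of the reference and advantage terms.
        self.m_ref, self.m2_ref, self.m_adv, self.m2_adv = (
            np.zeros((H, S, A)) for _ in range(4)
        )

    def act(self, h, s):
        return int(np.argmax(self.Q[h, s]))                # greedy w.r.t. the optimistic Q

    def update(self, h, s, a, r, s_next):
        i = (h, s, a)
        self.N[i] += 1
        n = self.N[i]
        eta = (self.H + 1) / (self.H + n)                  # rescaled learning rate
        b = self.c * np.sqrt(self.H ** 3 / n)              # Hoeffding-style bonus (assumed form)
        v_next, ref = self.V[h + 1, s_next], self.V_ref[h + 1, s_next]
        adv = v_next - ref                                 # advantage term V - V_ref

        # Plain optimistic (UCB) and pessimistic (LCB) Q-learning updates.
        self.Q_ucb[i] = (1 - eta) * self.Q_ucb[i] + eta * (r + v_next + b)
        self.Q_lcb[i] = (1 - eta) * self.Q_lcb[i] + eta * (r + self.V_lower[h + 1, s_next] - b)

        # Moments: uniform average for the reference, eta-weighted for the advantage.
        self.m_ref[i] += (ref - self.m_ref[i]) / n
        self.m2_ref[i] += (ref ** 2 - self.m2_ref[i]) / n
        self.m_adv[i] = (1 - eta) * self.m_adv[i] + eta * adv
        self.m2_adv[i] = (1 - eta) * self.m2_adv[i] + eta * adv ** 2
        var_ref = max(self.m2_ref[i] - self.m_ref[i] ** 2, 0.0)
        var_adv = max(self.m2_adv[i] - self.m_adv[i] ** 2, 0.0)
        # Variance-aware bonus (assumed form, not the paper's constants).
        b_ra = self.c * (np.sqrt(var_ref / n) + np.sqrt(self.H * var_adv / n))
        b_ra += self.c * self.H ** 2 / n ** 0.75
        self.Q_ra[i] = (1 - eta) * self.Q_ra[i] + eta * (r + adv + self.m_ref[i] + b_ra)

        # Keep the tightest optimistic estimate, then refresh the value tables.
        self.Q[i] = min(self.Q[i], self.Q_ucb[i], self.Q_ra[i])
        self.V[h, s] = self.Q[h, s].max()
        self.V_lower[h, s] = max(self.V_lower[h, s], self.Q_lcb[h, s].max())
        if not self.frozen[h, s]:
            # Track V until the upper/lower gap is small, then settle the reference.
            self.V_ref[h, s] = self.V[h, s]
            self.frozen[h, s] = self.V[h, s] - self.V_lower[h, s] <= self.beta


def run_demo(S=3, A=2, H=3, K=200, seed=0):
    """Run the sketch on a small random MDP (a stand-in for a hidden environment)."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(S), size=(H, S, A))          # transitions, shape (H, S, A, S)
    R = rng.random((H, S, A))                              # deterministic rewards in [0, 1)
    agent = EarlySettledSketch(S, A, H)
    for _ in range(K):
        s = 0
        for h in range(H):
            a = agent.act(h, s)
            s_next = int(rng.choice(S, p=P[h, s, a]))
            agent.update(h, s, a, R[h, s, a], s_next)
            s = s_next
    return agent
```

Taking the minimum over the previous estimate, the UCB estimate, and the reference-advantage estimate keeps `Q` non-increasing, so `V_ref` can only be settled at a value no larger than its initial optimistic guess. The memory footprint is a fixed number of `H x S x A` tables, with no transition model stored.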

Notes

This page is a generated site artifact.
Higher-level prompt details and internal benchmark specifics may remain intentionally undisclosed.