Comparing changes

base repository: pythonAI/reinforcement-learning
base: 2fe6984
head repository: rlcode/reinforcement-learning
compare: 3d421f6
  • 9 commits
  • 103 files changed
  • 3 contributors

Commits on May 17, 2026

  1. Modernize: PyTorch + gymnasium, 9 algorithms (rlcode#122)

    * Modernize toolchain: uv + PyTorch 2.11 + gymnasium 1.2
    
    Replace 2017-era Keras/TF1 requirements.txt with pyproject.toml managed
    by uv. Python pinned to 3.11. Adds .venv to gitignore.
    
    * Prune to 9 algorithms and renumber folders
    
    Keep: policy/value iteration, SARSA, Q-learning, deep-SARSA, REINFORCE
    (grid-world); DQN, A2C, PPO (cartpole). Drop Monte Carlo, DDQN, A3C,
    PER, dueling, mountaincar, pong, breakout variants. Also drop save_model
    and save_graph directories per revival design.
    
    * Convert deep RL algorithms to PyTorch + gymnasium
    
    - Deep SARSA, REINFORCE (grid-world): Keras MLP -> nn.Module
    - DQN, A2C (cartpole): Keras -> PyTorch, gym -> gymnasium 5-tuple API
    - PPO (cartpole): new CleanRL-style single-file implementation
    
    Smoke-tested cartpole agents: DQN and A2C reach 500, PPO learns to 200+.
    
    * Add paper citations and equations to deep RL algorithm comments
    
    Each algorithm file now has a top docstring with the paper reference and
    the core update equation, plus inline comments explaining the why behind
    key lines (target detach, advantage normalization, ratio clipping, etc.).
    
    Also downgrade Pillow to 10.x to avoid a tkinter binding issue on macOS
    with Pillow 12 (the grid-world envs still use tkinter for now).
    
    * Add shared pygame grid-world env (used by SARSA and Q-learning)
    
    Cross-platform replacement for tkinter — pygame ships SDL via wheels so
    the same env works on Mac/Windows/Linux with no system tcl-tk install.
    Shapes are drawn as pygame primitives instead of PIL-loaded PNGs.
    
    This commit only touches SARSA and Q-learning (Type B: static grid,
    tabular Q-values overlaid on cells).  Deep-SARSA, REINFORCE and
    policy/value iteration will follow in separate commits.
    
    * Inline gridworld import in agent files, drop pointless environment.py shim
    
    * Drop stale per-folder .python-version pins (3.5.0 from 2017)
    
    * Ignore .python-version (pyproject.toml owns Python version pin)
    
    * Port Deep SARSA and REINFORCE to pygame DynamicEnv
    
    Add DynamicEnv to gridworld.py: 5x5 grid with 3 obstacles moving
    horizontally every other step, 15-dim relative state encoding, goal-only
    termination, optional per-step penalty (REINFORCE uses -0.1, Deep SARSA
    uses 0).  Action mapping is (up, down, right, left) to match the
    original deep-grid-world code.
    
    render_mode=None disables display for headless training/tests.
    
    * Show episode/score HUD and obstacle-hit flash in DynamicEnv
    
    - Top bar: "Episode: N    Score: X.X"
    - Agent flashes red and a floating "-1" appears for 4 frames when it
      lands on a moving obstacle, so the penalty is visible during runs.
    
    * Port policy/value iteration to pygame GraphicDisplay
    
    Add PolicyEnv (pure MDP data) and GraphicDisplay (pygame button-driven
    viewer) to gridworld.py.  The display takes (label, handler) button
    slots so each algorithm's main script wires up its own actions:
      - Policy iter:  Evaluate / Improve / Move / Reset
      - Value  iter:  Calculate / Print Policy / Move / Clear
    
    Click handling: pygame event loop dispatches mouse clicks to the
    handler whose button rect contains the cursor.  show_values overlays
    V(s) text; show_arrows draws policy arrows; move_along_policy
    animates the agent along greedy actions.
    
    Removes the per-folder environment.py files — all grid-world envs now
    live in 1-grid-world/gridworld.py (~580 lines total vs ~1200 before).
    
    * Restore full names policy_iteration/value_iteration (drop pi/vi abbreviations)
    
    * Simplify policy/value iteration main blocks
    
    - Drop display_ref dict hack: handlers close over `display` directly now
      that buttons can be assigned after construction.
    - Drop manual eval/improve counters; on_move just runs whatever policy
      the agent currently has (random initially, sharpens after Improve).
    - on_reset re-runs the agent's __init__ instead of reaching into its
      fields to reset value_table / policy_table by hand.
    
    * Shrink gridworld.py from 643 to 515 lines via shared draw helpers
    
    - Lift pump_events, draw_grid, draw_square/circle/triangle, cell_center
      to module level so all three classes (Env, DynamicEnv, GraphicDisplay)
      share the same primitives instead of each inlining its own.
    - Move `import math` to top, drop unused _clock fields and the redundant
      _check_boundary helper, tighten docstrings, and inline a few one-shot
      intermediates that were just adding noise.
    - No behavior changes: Env / DynamicEnv / PolicyEnv sanity checks still
      pass and DQN smoke run still converges.
    
    * Compress gridworld.py further: 515 -> 344 lines
    
    Inline most per-class draw helpers, replace the print_value_all/
    show_values/show_arrows setter pattern with simple public attributes
    where it didn't sacrifice the API, and tighten the lazy display init
    into a tiny _open helper.  Behavior unchanged.
    
    * Flatten directory layout: one file per algorithm
    
    Move each algorithm's single .py up to its category folder and drop
    the sub-folder layer.  Layout is now:
    
      1-grid-world/
        1-policy_iteration.py    4-q_learning.py
        2-value_iteration.py     5-deep_sarsa.py
        3-sarsa.py               6-reinforce.py
        gridworld.py             (shared env module)
    
      2-cartpole/
        1-dqn.py    2-a2c.py    3-ppo.py
    
    Also delete 1-grid-world/img/ — the pygame-based env draws all shapes
    as primitives, so the PNG sprites are no longer used.
    
    With gridworld.py now next to the agents that import it, the
    sys.path.insert dance in each agent file goes away — they just do
    `from gridworld import ...` like a normal sibling module.
    
    * Fix policy-arrow head direction (was opening forward like a Y)
    
    * Stop move_along_policy at the goal cell
    
    Policy iteration's get_action returns a float (0.0) sentinel for the
    terminal state, which crashes enumerate().  The original tkinter loop
    guarded with a len() check on policy_table — emulate that by checking
    the reward grid before calling the picker.
    
    * Fill agent square; draw V text and arrows over it
    
    * Disable buttons until their prerequisites have been clicked
    
    GraphicDisplay now tracks per-label click counts (display.clicks dict,
    accessible via display.click_count(label)).  Button tuples accept an
    optional third element — a zero-arg predicate returning bool — and
    the dispatcher skips clicks on disabled buttons while the renderer
    greys them out.
    
    In policy iteration: Improve enabled only after Evaluate; Move only
    after Improve.  In value iteration: Print Policy after Calculate;
    Move after Print Policy.  Reset / Clear stay always enabled and wipe
    the click counter on the way out.
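
    The enable/disable rule above can be sketched without pygame. This is a
    minimal illustration, not the code in gridworld.py: the Dispatcher class
    and its method names are hypothetical, and only the (label, handler,
    optional predicate) tuple shape and the per-label click counts are taken
    from the commit message.

```python
from collections import defaultdict


class Dispatcher:
    """Hypothetical stand-in for GraphicDisplay's click handling (no pygame)."""

    def __init__(self):
        self.clicks = defaultdict(int)  # per-label click counts
        self.buttons = []               # (label, handler[, enabled_predicate])

    def click_count(self, label):
        return self.clicks[label]

    def is_enabled(self, button):
        # A button without a third element is always enabled.
        return len(button) < 3 or bool(button[2]())

    def click(self, label):
        """Dispatch a click; return True if the handler ran."""
        for button in self.buttons:
            if button[0] == label:
                if not self.is_enabled(button):
                    return False  # disabled buttons swallow the click
                self.clicks[label] += 1
                button[1]()
                return True
        return False

    def reset(self):
        self.clicks.clear()


log = []
d = Dispatcher()
d.buttons = [
    ("Evaluate", lambda: log.append("evaluate")),
    ("Improve", lambda: log.append("improve"), lambda: d.click_count("Evaluate") > 0),
    ("Move", lambda: log.append("move"), lambda: d.click_count("Improve") > 0),
    ("Reset", d.reset),
]

d.click("Improve")   # ignored: Evaluate has not been clicked yet
d.click("Evaluate")
d.click("Improve")
d.click("Move")
print(log)
```

    Because each predicate is a closure evaluated at click time, a renderer
    could call the same is_enabled check every frame to decide which buttons
    to grey out.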
    
    * Document the intended button workflow in policy/value iteration mains
    
    * Add visual press feedback to GraphicDisplay buttons
    
    Briefly tint the clicked button blue and offset its text by a pixel
    for 6 frames after a click, so users get tactile confirmation that
    their click registered.
    
    * Add episode/score HUD to static Env (SARSA, Q-learning)
    
    Same pattern as DynamicEnv: 32px dark bar at the top showing the
    current episode number and cumulative score.  Episode count goes up
    on reset (after the first reset); score resets with each episode.
    
    * Static Env HUD: show Episode, Steps, Last (terminal reward)
    
    The cumulative score was always 0 mid-episode (rewards arrive only at
    terminals) and got wiped on reset before it could be read.  Replace
    with current-episode step count plus the last episode's terminal
    reward, which persists across resets.
    
    * Static Env HUD: round Steps to nearest 5, rename Last -> Last Score
    
    * Enlarge DynamicEnv grid to match static Env and add Last Score HUD
    
    DYN_UNIT 50 -> 100 so the dynamic env window is the same scale as the
    static one.  Add last_score (previous episode's terminal score) to the
    HUD so users can compare across episodes without watching the terminal.
    
    * Drop gridworld_changing.png — doesn't match current 5x5 / 3-obstacle env
    
    * Cartpole: add RENDER toggle and save trained model on exit
    
    Each cartpole script now has two constants at the top:
      RENDER     — True to open a pygame window during training (much slower)
      SAVE_PATH  — where to torch.save the trained network on exit
    
    Models are saved both on early-stop (recent mean reward > 490) and on
    hitting EPISODES, so a successful run always leaves a checkpoint behind.
    *.pt added to .gitignore so checkpoints don't get committed.
    
    * Cartpole: RENDER/TEST env vars instead of a file-level constant
    
    RENDER=1 toggles the pygame window during training.  TEST=1 loads the
    saved checkpoint and runs inference forever (no learning, no gradient
    steps); TEST implies RENDER since the point is to watch.
    
    Usage:
      uv run python 1-dqn.py           # headless train, save .pt on exit
      RENDER=1 uv run python 1-dqn.py  # train with the window open (slow)
      TEST=1   uv run python 1-dqn.py  # replay the saved trained policy
    
    * Cartpole: --render / --test CLI flags instead of env vars
    
    argparse is more discoverable (each script supports --help) and reads
    better than RENDER=1 sigils.  Usage:
      uv run python 1-dqn.py            # headless train
      uv run python 1-dqn.py --render   # train with the window open (slow)
      uv run python 1-dqn.py --test     # replay the saved trained policy
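
    A minimal sketch of the flag parsing, assuming the CLI version keeps the
    earlier rule that test mode implies rendering. The parse_args name matches
    the helper mentioned later in this history, but the body below is an
    illustration rather than the repository's implementation.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train or replay a CartPole agent.")
    parser.add_argument("--render", action="store_true",
                        help="open the pygame window during training (slow)")
    parser.add_argument("--test", action="store_true",
                        help="load the saved checkpoint and replay the policy")
    args = parser.parse_args(argv)
    # Assumption carried over from the env-var version: --test implies --render.
    args.render = args.render or args.test
    return args


if __name__ == "__main__":
    # An explicit argv keeps the demo independent of the real command line.
    print(parse_args(["--test"]))
```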
    
    * Extract shared cartpole CLI/test helpers into cartpole.py
    
    argparse, env construction, the pygame-QUIT poll and the test-mode
    episode loop were duplicated across the three algorithm files.  Lift
    them into 2-cartpole/cartpole.py with parse_args / make_env /
    quit_if_window_closed / run_test_loop, and have each algorithm import
    just what it needs.  Each algo still owns its own checkpoint format
    and the action picker it passes into run_test_loop.
    
    * Rename gridworld.py / cartpole.py -> env.py for consistency
    
    Both folders now have an env.py — the import line in every algorithm
    file becomes a uniform 'from env import ...' regardless of category.
    
    * Cartpole training: poll pygame QUIT each step too
    
    So the X button works during --render training runs, not only in
    --test mode.  The poll is a no-op when no display has been initialized
    (checked via pygame.display.get_init()), keeping headless runs fast.
    
    * Bump PPO rollout to 1024 steps and total updates to 1500
    
    256-step rollouts gave noisy GAE estimates and PPO oscillated around
    200 on CartPole.  Larger rollouts stabilize the advantage estimate at
    the cost of ~4x wall time, which is the right trade for a learning
    demo.
    
    * PPO stability: orthogonal init + correct truncation handling
    
    Two CleanRL-standard fixes that were missing:
    
    1. Orthogonal weight init with gain=sqrt(2) on the tanh trunk, 0.01 on
       the policy head (keeps the initial action distribution near uniform),
       1.0 on the value head.  This is the single biggest stability win for
       PPO on small MLPs.
    
    2. The GAE done mask now uses *terminated* only, not terminated|truncated.
       When CartPole-v1 truncates at 500 steps the world hasn't actually
       ended, so V(s') should be bootstrapped instead of zeroed — otherwise
       the agent learns to fear the time limit and value collapses near the
       end of every successful episode.
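
    The second fix can be sketched in plain Python. This is an illustration
    under stated assumptions, not the PPO file itself: the gae function and
    its next_values argument are hypothetical, and next_values[t] is assumed
    to hold V(s') for the observation actually reached at step t, including
    the final observation of a truncated episode.

```python
def gae(rewards, values, next_values, terminated, truncated, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout, as plain lists.

    next_values[t] is V(s_{t+1}); on a truncated step it must be the value of
    the final observation, taken before the env is reset.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Only a true terminal state has zero future value.
        bootstrap = 0.0 if terminated[t] else next_values[t]
        delta = rewards[t] + gamma * bootstrap - values[t]
        # Either kind of episode end stops the advantage from leaking
        # across the reset into the next episode.
        if terminated[t] or truncated[t]:
            running = delta
        else:
            running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# One-step rollouts with gamma = 0.5 so the arithmetic is exact.
print(gae([1.0], [0.5], [2.0], [False], [True], gamma=0.5))   # truncated: bootstraps
print(gae([1.0], [0.5], [2.0], [True], [False], gamma=0.5))   # terminated: no bootstrap
```

    The two calls differ only in which flag is set: the truncated step keeps
    the gamma * V(s') term in its TD error, while the terminated step drops it.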
    
    * Cartpole: kr-v2 style reward shaping (+0.1 / -1) across all three algos
    
    Match the rlcode-kr-v2 reference: per-step reward is +0.1, the failure
    step pays -1, and successful 500-step episodes get +0.1 on every step.
    Well-scaled magnitudes (vs. our previous +1 / -100) help PPO especially,
    which has no reward normalization and is sensitive to large rewards.
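
    A small worked example of the shaped returns, assuming the failure step
    is the one where the env reports terminated. Both helper names are
    hypothetical.

```python
def shape_reward(terminated):
    """Hypothetical helper: -1 on the failure step, +0.1 on every other step."""
    return -1.0 if terminated else 0.1


def episode_return(num_steps, failed):
    """Sum of shaped rewards; a failed episode ends on a terminated step."""
    total = 0.0
    for step in range(1, num_steps + 1):
        total += shape_reward(failed and step == num_steps)
    return total


print(round(episode_return(500, failed=False), 6))  # full-length episode
print(round(episode_return(10, failed=True), 6))    # pole falls on step 10
```

    Under this shaping a full 500-step episode sums to about 50, and an
    episode that fails on step 10 sums to about -0.1, so returns stay within
    a small range.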
    
    PPO: re-enabled `done = terminated or truncated` for the GAE mask now
    that the shaping itself distinguishes success vs failure — the earlier
    "terminated only" change wasn't paired with a proper V(s') bootstrap on
    truncation and was probably destabilizing the value learning.
    
    * Rewrite README; drop per-folder READMEs
    
    Single concise README at the repo root: 9-algorithm table, uv setup,
    run commands.  Per-folder READMEs were stubs with broken image links
    to the old folder layout, so it was easier to delete them.
    
    * Restore original description; move modernization notes to an Updates section
    
    * Add 3-atari/ — DQN and PPO with GPU support
    
    env.py provides --env breakout|pong selection, standard Atari
    preprocessing (frame-skip 4, 84x84 grayscale, frame-stack 4), and a
    pick_device helper that prefers CUDA, then MPS, then CPU.
    
    Both algorithms use the Nature CNN backbone, DeepMind-standard reward
    clipping (sign), and the same --render / --test CLI as cartpole.
    
    Defaults are tuned to be runnable on a laptop rather than to hit
    paper-quality scores: TOTAL_FRAMES = 1M, smaller replay buffer for
    DQN, single-env rollout for PPO.  Bump these for serious training.
    
    * Switch to opencv-python-headless to silence SDL2 dylib clash on macOS
    
    pygame and opencv-python both ship their own libSDL2-2.0.0.dylib, and
    macOS warns loudly when both define the same Objective-C classes.
    The headless variant drops the GUI/SDL bits but keeps cv2's resize,
    which is all gymnasium's AtariPreprocessing actually needs.
    
    * Update DQN batch size from 32 to 64 for improved training performance
    dnddnjs authored May 17, 2026
    Commit 9b759c0
  2. Use loop break instead of sys.exit() on early stop (rlcode#123)

    sys.exit() raises SystemExit which short-circuits env.close() and any
    outer cleanup, can't be unit-tested, and kills the kernel under
    Jupyter/IPython. Replace with a `solved` flag (DQN/A2C, nested loops)
    or a plain break (PPO, single loop) so the function returns normally
    and the final save runs through the same path as the EPISODES-exhausted
    case.
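
    The nested-loop variant can be sketched as follows. The train function is
    a toy stand-in: solve_at replaces the real early-stop test and on_close
    stands in for env.close() and the final save.

```python
def train(max_episodes, max_steps, solve_at, on_close):
    """Toy nested training loop; on_close stands in for env.close() and the save."""
    solved = False
    steps_run = 0
    for episode in range(max_episodes):
        for step in range(max_steps):
            steps_run += 1
            if steps_run >= solve_at:  # stand-in for the early-stop condition
                solved = True
                break                  # leaves the inner loop only
        if solved:
            break                      # leaves the outer loop as well
    on_close()                         # always reached, unlike after sys.exit()
    return solved, steps_run


closed = []
result = train(max_episodes=5, max_steps=10, solve_at=23,
               on_close=lambda: closed.append(True))
print(result, closed)
```

    Since the function returns normally, a unit test can assert on its return
    value and on the cleanup having run, which is not possible once
    SystemExit has been raised.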
    
    Credit to @AlexisBogroff in rlcode#84 for flagging the pattern; applied to
    the current PyTorch tree.
    dnddnjs authored May 17, 2026
    Commit 565a1fe
  3. Add FireResetEnv and enable terminal_on_life_loss for Atari (rlcode#124)

    Breakout requires pressing FIRE to launch the ball after reset / life loss;
    AtariPreprocessing only does NOOPs, so the agent wastes frames waiting for a
    random FIRE. FireResetEnv presses it automatically on reset (applied only to
    games whose action set contains FIRE, so Pong is unaffected).
    
    terminal_on_life_loss=True is the Nature DQN / CleanRL convention: each life
    becomes its own episode so credit for the death-causing action isn't diluted
    across hundreds of steps. Reported returns will look ~5x smaller on Breakout
    since they now reflect per-life score, not full-game score.
    dnddnjs authored May 17, 2026
    Commit b495daa
  4. Vectorize Atari PPO with 8 SyncVectorEnvs (rlcode#125)

    * Vectorize Atari PPO with 8 SyncVectorEnvs
    
    Add make_vec_env helper in env.py and have PPO use 8 parallel envs
    (ROLLOUT_STEPS 1024->128 keeps the per-update batch at 1024). Standard
    CleanRL convention; improves sample efficiency and GPU utilization since
    forwards now batch across envs. DQN keeps using single-env make_env.
    
    * Bump Atari PPO TOTAL_FRAMES to 5M
    
    1M was too short to see PPO Atari really take off (Breakout plateaus
    around per-game ~20 there). 5M lets the curve get well past the initial
    ramp; bump further if needed.
    dnddnjs authored May 17, 2026
    Commit 1ad59be
  5. Add optional W&B logging to Atari DQN and PPO (rlcode#126)

    * Add optional W&B logging to Atari PPO
    
    New --wandb CLI flag (in shared parse_args) initializes a wandb run and
    logs recent_mean_return, policy_loss, value_loss, and entropy keyed by
    global_step. Off by default so non-wandb users see no behavior change.
    wandb added to dependencies via uv.
    
    * Gitignore wandb run dir and document --wandb usage
    
    Add wandb/ to .gitignore so local run artifacts don't get committed, and
    add a short Logging section to the README covering 'wandb login' and
    the --wandb flag. The flag is opt-in and per-user (W&B login is tied to
    your own account), so contributors who skip it don't need the package
    at runtime.
    
    * Add optional W&B logging to Atari DQN
    
    Same --wandb pattern as PPO: logs recent_mean_return, epsilon, last loss,
    and buffer size every 10k frames into project rl-atari-dqn. Off by
    default so non-wandb users see no change. README updated to cover both
    Atari scripts.
    dnddnjs authored May 17, 2026
    Commit cbb8e9d

Commits on May 24, 2026

  1. Atari fixes + benchmarks: memory, life-loss, per-game metric, README table (rlcode#128)
    
    * PPO tuning: LR anneal, value clipping, per-minibatch adv norm, 10M frames
    
    Three of CleanRL's 'PPO 37 details' that were missing — flagged when the
    5M and 10M Breakout runs both plateaued at per-game ~75 with entropy
    stuck around 0.8 (policy wasn't sharpening, clip rarely activating):
    
    - Linear LR anneal from 2.5e-4 -> 0 across the run; lets late updates
      fine-tune instead of bouncing.
    - Value-function loss clipping around the old prediction (CLIP_COEF),
      matching the policy clipping range; stabilizes value targets.
    - Advantage normalization moved inside the minibatch loop instead of
      once per batch.
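
    The three details can be sketched with scalar Python so the arithmetic is
    easy to check. All function names are hypothetical, the real code would
    operate on tensors, and normalize uses the population standard deviation
    as a simplifying assumption.

```python
def annealed_lr(update, num_updates, base_lr=2.5e-4):
    """Linear decay from base_lr at update 0 toward 0 at num_updates."""
    return base_lr * (1.0 - update / num_updates)


def clipped_value_loss(new_value, old_value, target, clip_coef):
    """Per-sample clipped value loss (scalar version for illustration)."""
    # Keep the new prediction within clip_coef of the rollout-time prediction.
    delta = max(-clip_coef, min(clip_coef, new_value - old_value))
    clipped = old_value + delta
    # Take the worse of the two errors, so clipping never lowers the loss.
    return 0.5 * max((new_value - target) ** 2, (clipped - target) ** 2)


def normalize(advantages, eps=1e-8):
    """Zero-mean, unit-std advantages; meant to be called once per minibatch."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    return [(a - mean) / (var ** 0.5 + eps) for a in advantages]


print(annealed_lr(0, 100), annealed_lr(50, 100))
print(clipped_value_loss(new_value=2.0, old_value=1.0, target=3.0, clip_coef=0.1))
print(normalize([1.0, 3.0]))
```

    Normalizing inside the minibatch loop means each gradient step sees
    advantages with its own zero mean and unit scale, rather than statistics
    shared across the whole rollout batch.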
    
    Also bumps TOTAL_FRAMES 5M -> 10M to match the CleanRL Atari budget so
    runs are directly comparable to their published curves. lr now logged
    to wandb so the anneal is visible.
    
    * Atari: shrink DQN buffer 4x, fix life-loss reset, track per-game returns
    
    - ReplayBuffer stores single frames and stacks 4 at sample time (~28GB -> ~7GB).
    - LifeLossTerminalEnv signals terminal on life loss but defers real reset to
      game-over, so noop_max + FIRE no longer fire every life and GAE/Q chains
      break only at the right boundary.
    - DQN: BATCH_SIZE 64 -> 32, TARGET_UPDATE_EVERY 2500 -> 250 train steps
      (~1k env frames), EPSILON_END 0.1 -> 0.01.
    - Log per-life and per-game returns separately (DQN and PPO).
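
    The buffer figures follow from simple arithmetic, assuming 84x84 uint8
    frames and a capacity of 1M entries (the capacity is an assumption for
    this example, and buffer_bytes is a hypothetical helper).

```python
FRAME_BYTES = 84 * 84  # one grayscale uint8 frame after preprocessing


def buffer_bytes(capacity, frames_per_entry):
    """Observation storage only; actions, rewards and flags are negligible."""
    return capacity * frames_per_entry * FRAME_BYTES


for label, capacity, frames in [
    ("stacked, 1M entries", 1_000_000, 4),
    ("single frames, 1M entries", 1_000_000, 1),
]:
    print(f"{label}: {buffer_bytes(capacity, frames) / 1e9:.1f} GB")
```

    Storing one frame per entry and rebuilding the 4-frame stack at sample
    time removes the fourfold duplication, since consecutive stacks share
    three of their four frames.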
    
    * Atari: README benchmarks table + DQN 500k buffer for 8GB Macs
    
    - README: add Atari to algorithms list, new Benchmarks section with hardware
      notes, per-algo row (params, train time, score, RAM, CPU/GPU, W&B report).
    - DQN buffer 1M -> 500k (~3.5GB) so a 1M-capacity run stops swapping on
      8GB unified memory.
    - moviepy added for the local eval/recording script.
    - .gitignore: exclude scripts/ and docs/ (local-only working dirs).
    
    * Ignore local logs/ directory
    dnddnjs authored May 24, 2026
    Commit 54ffaeb

Commits on Jun 6, 2026

  1. 4-atari-hard: PPO+RND on Montezuma's Revenge + benchmark (rlcode#130)

    * Ignore local CLAUDE.md collaboration notes
    
    * 4-atari-hard: PPO + RND scaffold for Montezuma's Revenge
    
    New chapter for hard-exploration Atari. PPO with Random Network
    Distillation (Burda et al., 2018) as the curiosity bonus.
    
    - env.py: ALE/MontezumaRevenge-v5 (and pitfall, private_eye) with the
      standard Atari preprocessing, no FireResetEnv, no LifeLossTerminalEnv
      (uninterrupted episodes so intrinsic returns can chain across deaths).
    - 1-ppo-rnd.py: two-value-head ActorCritic, RND target/predictor with
      LeakyReLU, single-frame normalized input clipped to [-5, 5], obs RMS
      seeded by 50 rollouts of a random agent, intrinsic reward scaled by
      running std of discounted intrinsic returns, dual GAE (extrinsic
      episodic + intrinsic non-episodic), predictor updated on 25% of each
      minibatch, combined advantage A = 2*A_ext + 1*A_int.
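
    Three of the pieces listed above, sketched on plain floats. The function
    names are hypothetical, and the running statistics are passed in as
    arguments rather than tracked, so this only shows the arithmetic, not the
    training loop.

```python
def normalize_obs(pixel, mean, std, clip=5.0):
    """Whiten one value with running statistics, then clip to [-clip, clip]."""
    z = (pixel - mean) / std
    return max(-clip, min(clip, z))


def scale_intrinsic(rewards, return_std, eps=1e-8):
    """Divide raw prediction errors by the running std of intrinsic returns."""
    return [r / (return_std + eps) for r in rewards]


def combine_advantages(adv_ext, adv_int, ext_coef=2.0, int_coef=1.0):
    """A = ext_coef * A_ext + int_coef * A_int, element-wise."""
    return [ext_coef * e + int_coef * i for e, i in zip(adv_ext, adv_int)]


print(normalize_obs(200.0, mean=100.0, std=10.0))
print(combine_advantages([1.0, -1.0], [0.5, 0.5]))
```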
    
    Not run end-to-end yet. Sanity-checked static shapes and module wiring.
    
    * 4-atari-hard: add envpool count-based exploration
    
    * 4-atari-hard: PPO+RND on Montezuma's Revenge + benchmark
    
    PPO+RND made reproducible and resumable. Shared run plumbing (seed,
    metrics.jsonl, periodic/milestone/best checkpoints, resume, final summary)
    lives in env.py's RunLogger, keeping the algorithm file focused. 512 parallel
    envs crack the first-key bottleneck (128 envs never scored in 50M); final mean
    per-game return ~3120 @ 65M steps, single seed (M4 Max), above the paper PPO
    baseline (2497). Adds a README benchmark row. Count-based exploration is
    deferred to a later PR (not yet trained/benchmarked).
    
    * README: link the Montezuma PPO+RND W&B report
    
    ---------
    
    Co-authored-by: soyoung park <ssoyyoung.p@gmail.com>
    dnddnjs and ssoyyoung authored Jun 6, 2026
    Commit e590444
  2. Bump pillow from 10.4.0 to 12.2.0 (rlcode#127)

    Bumps [pillow](https://github.com/python-pillow/Pillow) from 10.4.0 to 12.2.0.
    - [Release notes](https://github.com/python-pillow/Pillow/releases)
    - [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
    - [Commits](python-pillow/Pillow@10.4.0...12.2.0)
    
    ---
    updated-dependencies:
    - dependency-name: pillow
      dependency-version: 12.2.0
      dependency-type: indirect
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Jun 6, 2026
    Commit 5331fbf

Commits on Jun 12, 2026

  1. 4-atari-hard: Go-Explore on Montezuma's Revenge (exploration + robustification) (rlcode#133)
    
    * 6-atari-go-explore: Go-Explore Phase 1 (exploration) for Montezuma
    
    Restore-based archive exploration (Ecoffet et al. 2019/2021), no neural
    net: 11x8x9 downscaled-frame cells, 1/sqrt(seen+1) selection, repeated
    random actions (p=0.95), raw-score accept rule, virtual DONE cell, global
    experience log with prev_id chains (demo source for robustification),
    12-worker spawn pool over raw gymnasium ALE clone/restore.
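
    The selection rule and the repeated random actions can be sketched with
    the standard library. The function names are hypothetical, and the
    archive, emulator restore and worker pool are left out.

```python
import math
import random


def selection_weights(seen_counts):
    """Weight each archive cell by 1 / sqrt(seen + 1)."""
    return [1.0 / math.sqrt(seen + 1) for seen in seen_counts]


def pick_cell(cells, seen_counts, rng):
    """Sample one cell; rarely chosen cells are favoured."""
    return rng.choices(cells, weights=selection_weights(seen_counts), k=1)[0]


def explore_actions(num_steps, num_actions, rng, repeat_prob=0.95):
    """Random actions, repeating the previous one with probability repeat_prob."""
    actions = []
    action = rng.randrange(num_actions)
    for _ in range(num_steps):
        # After the first step, resample only with probability 1 - repeat_prob.
        if actions and rng.random() >= repeat_prob:
            action = rng.randrange(num_actions)
        actions.append(action)
    return actions


rng = random.Random(0)
print(selection_weights([0, 3, 8]))
print(pick_cell(["a", "b", "c"], [0, 3, 8], rng))
print(explore_actions(20, num_actions=18, rng=rng))
```

    A cell seen 3 times gets half the weight of an unseen cell and a cell
    seen 8 times gets a third, so selection pressure decays slowly rather
    than cutting off frequently visited cells.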
    
    Run contract: --seed/--total-frames/--run-dir/--ckpt-every/--resume;
    explog flushed as compressed chunks; checkpoint = archive+log+RNG at
    batch boundaries. Smoke: 23k steps/s aggregate, first key at 100k steps.
    
    * 6-atari-go-explore: resolve flushed explog chunks from the resumed run's dir
    
    Cross-run-dir resume (harness relaunches into a fresh run dir) could not
    see chunks flushed by the original run; chunk lookup now falls back to
    the ancestor run's explog dir and resume fails loudly if any chunk is
    unreachable.
    
    * Move Go-Explore into 4-atari-hard alongside PPO+RND
    
    Same hard-exploration domain, two paradigms side by side: 1-ppo-rnd.py
    (gradient + intrinsic reward, envpool) and 2-go-explore.py (archive +
    emulator restore, raw ALE). Go-Explore keeps its own plumbing in
    env_go_explore.py since the two stacks share nothing.
    
    * 4-atari-hard/2-go-explore: use sampling-time captures in the archive walk
    
    A result earlier in the same batch can replace a cell; walking a later
    result against the cell's CURRENT score/trajectory stitched actions
    executed from the old state onto the new prefix, fabricating scores no
    single playthrough achieved. sample() now freezes snapshot/score/
    trajectory per pick and the walk uses the capture — matching the
    official Go-Explore, which ships these values inside each task.
    Caught by publish-time demo replay verification (score mismatch).
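
    A minimal sketch of the capture idea, with hypothetical Task and Archive
    classes. The emulator snapshot is omitted, and scores and trajectories
    are toy values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Values captured when a cell is picked; later archive updates cannot change them."""
    cell_key: str
    score: int
    trajectory: tuple


class Archive:
    def __init__(self):
        self.cells = {}  # cell_key -> (score, trajectory)

    def sample(self, cell_key):
        score, trajectory = self.cells[cell_key]
        return Task(cell_key, score, tuple(trajectory))

    def walk(self, task, actions, gained, new_key):
        """Record a result relative to the task's capture, not the live cell."""
        score = task.score + gained
        trajectory = task.trajectory + tuple(actions)
        best = self.cells.get(new_key)
        if best is None or score > best[0]:
            self.cells[new_key] = (score, trajectory)
        return score, trajectory


archive = Archive()
archive.cells["A"] = (100, ("up",))
task = archive.sample("A")                            # picked from the old state of A
archive.cells["A"] = (300, ("left", "left", "jump"))  # A is replaced mid-batch
print(archive.walk(task, ["right"], gained=50, new_key="B"))
```

    Reading the live cell inside walk would instead have produced a score of
    350 on a trajectory no playthrough executed, which is the kind of
    mismatch the replay verification caught.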
    
    * 4-atari-hard: Go-Explore robustification (backward algorithm) — demo extract + GRU PPO
    
    extract_demo.py: pull the best Phase-1 trajectory from the GE checkpoint +
    experience log, replay-verify it reproduces the archived score (31,000),
    truncate after the last reward, save actions/rewards/periodic ALE states.
    
    env_robustify.py: ReplayResetEnv (episodes restore to a demo point and play
    forward under sticky actions; success = raw score >= demo return; lag/success
    kills) + ResetManager curriculum (starting points march backward as the agent
    matches the demo, forward-cumsum move rule per atari-reset, nudge forward on
    collapse).
    
    3-robustify.py: recurrent (GRU) PPO over N restore-capable ALE envs, truncated
    BPTT with done-masked state, advantage chains cut at artificial success resets,
    periodic from-reset sticky eval -> final.json. Single-machine scaled port of
    openai/atari-reset; SIL/multi-demo/autoscale are off-by-default flags. Runs
    end-to-end; curriculum logic covered by harness T0.
    
    * 4-atari-hard/3-robustify: eval honors game_over + faithful resume RNG
    
    Preflight audit found the from-reset eval reused ReplayResetEnv with the
    training-curriculum kills active, so eval episodes were cut by lag/success-kill
    before game_over — value_mean reported a key-but-slower-than-demo policy as ~0.
    Disable both kills in evaluate() and cap at the standard 18000-frame Montezuma
    episode (4500 agent steps) so eval runs from reset to a real game_over.
    
    Also checkpoint torch + per-env RNG state and restore them on resume (was global
    numpy RNG only), so the kill->resume contract is faithful for the deterministic
    streams (MPS policy sampling has no bit determinism).
    
    * 4-atari-hard/extract_demo: --max-rewards to truncate at the Kth reward
    
    Cutting the demo just after the first reward (--max-rewards 1) yields a short
    first-key-only demo (~250 actions vs ~5300), a far shorter horizon for the
    robustification backward curriculum to bootstrap on.
    
    * 4-atari-hard/3-robustify: --ent-coef flag for entropy tuning
    
    The first-key robustification curriculum plateaus where as_good_as_demo caps ~0.34:
    the policy commits before reliably executing the demo suffix under sticky actions.
    Expose the entropy bonus as a flag (default unchanged) to test whether more
    exploration breaks the plateau.
    
    * README: Go-Explore robustification — single-machine negative result
    
    Document the backward-algorithm robustification (3-robustify.py): the curriculum
    bootstraps with a first-key demo + 128 envs but plateaus ~22% of the way, with no
    from-reset score on a single machine. Honest negative result, single seed, no
    benchmark row claimed.
    
    * README: consolidate Montezuma into one table (RND + Go-Explore exploration + robustification)
    
    Merge the three Montezuma results into a single table with a protocol column,
    trim the prose to one note. Restores the exploration row (31,000, replay-verified)
    that lived only on the superseded rlcode#132 branch.
    
    ---------
    
    Co-authored-by: soyoung park <ssoyyoung.p@gmail.com>
    dnddnjs and ssoyyoung authored Jun 12, 2026
    Commit 3d421f6