Comparing changes

* Modernize toolchain: uv + PyTorch 2.11 + gymnasium 1.2
  Replace 2017-era Keras/TF1 requirements.txt with pyproject.toml managed by uv. Python pinned to 3.11. Adds .venv to gitignore.
* Prune to 9 algorithms and renumber folders
  Keep: policy/value iteration, SARSA, Q-learning, deep-SARSA, REINFORCE (grid-world); DQN, A2C, PPO (cartpole). Drop Monte Carlo, DDQN, A3C, PER, dueling, mountaincar, pong, breakout variants. Also drop save_model and save_graph directories per revival design.
* Convert deep RL algorithms to PyTorch + gymnasium
  - Deep SARSA, REINFORCE (grid-world): Keras MLP -> nn.Module
  - DQN, A2C (cartpole): Keras -> PyTorch, gym -> gymnasium 5-tuple API
  - PPO (cartpole): new CleanRL-style single-file implementation
  Smoke-tested cartpole agents: DQN and A2C reach 500, PPO learns to 200+.
* Add paper citations and equations to deep RL algorithm comments
  Each algorithm file now has a top docstring with the paper reference and the core update equation, plus inline comments explaining the why behind key lines (target detach, advantage normalization, ratio clipping, etc.). Also downgrade Pillow to 10.x to avoid a tkinter binding issue on macOS with Pillow 12 (the grid-world envs still use tkinter for now).
* Add shared pygame grid-world env (used by SARSA and Q-learning)
  Cross-platform replacement for tkinter: pygame ships SDL via wheels so the same env works on Mac/Windows/Linux with no system tcl-tk install. Shapes are drawn as pygame primitives instead of PIL-loaded PNGs. This commit only touches SARSA and Q-learning (Type B: static grid, tabular Q-values overlaid on cells). Deep-SARSA, REINFORCE and policy/value iteration will follow in separate commits.
* Inline gridworld import in agent files, drop pointless environment.py shim
* Drop stale per-folder .python-version pins (3.5.0 from 2017)
* Ignore .python-version (pyproject.toml owns Python version pin)
* Port Deep SARSA and REINFORCE to pygame DynamicEnv
  Add DynamicEnv to gridworld.py: 5x5 grid with 3 obstacles moving horizontally every other step, 15-dim relative state encoding, goal-only termination, optional per-step penalty (REINFORCE uses -0.1, Deep SARSA uses 0). Action mapping is (up, down, right, left) to match the original deep-grid-world code. render_mode=None disables display for headless training/tests.
* Show episode/score HUD and obstacle-hit flash in DynamicEnv
  - Top bar: "Episode: N Score: X.X"
  - Agent flashes red and a floating "-1" appears for 4 frames when it lands on a moving obstacle, so the penalty is visible during runs.
* Port policy/value iteration to pygame GraphicDisplay
  Add PolicyEnv (pure MDP data) and GraphicDisplay (pygame button-driven viewer) to gridworld.py. The display takes (label, handler) button slots so each algorithm's main script wires up its own actions:
  - Policy iter: Evaluate / Improve / Move / Reset
  - Value iter: Calculate / Print Policy / Move / Clear
  Click handling: pygame event loop dispatches mouse clicks to the handler whose button rect contains the cursor. show_values overlays V(s) text; show_arrows draws policy arrows; move_along_policy animates the agent along greedy actions.
  Removes the per-folder environment.py files: all grid-world envs now live in 1-grid-world/gridworld.py (~580 lines total vs ~1200 before).
* Restore full names policy_iteration/value_iteration (drop pi/vi abbreviations)
* Simplify policy/value iteration main blocks
  - Drop display_ref dict hack: handlers close over `display` directly now that buttons can be assigned after construction.
  - Drop manual eval/improve counters; on_move just runs whatever policy the agent currently has (random initially, sharpens after Improve).
  - on_reset re-runs the agent's __init__ instead of reaching into its fields to reset value_table / policy_table by hand.
* Shrink gridworld.py from 643 to 515 lines via shared draw helpers
  - Lift pump_events, draw_grid, draw_square/circle/triangle, cell_center to module level so all three classes (Env, DynamicEnv, GraphicDisplay) share the same primitives instead of each inlining its own.
  - Move `import math` to top, drop unused _clock fields and the redundant _check_boundary helper, tighten docstrings, and inline a few one-shot intermediates that were just adding noise.
  - No behavior changes: Env / DynamicEnv / PolicyEnv sanity checks still pass and DQN smoke run still converges.
* Compress gridworld.py further: 515 -> 344 lines
  Inline most per-class draw helpers, replace the print_value_all/show_values/show_arrows setter pattern with simple public attributes where it didn't sacrifice the API, and tighten the lazy display init into a tiny _open helper. Behavior unchanged.
* Flatten directory layout: one file per algorithm
  Move each algorithm's single .py up to its category folder and drop the sub-folder layer. Layout is now:
    1-grid-world/
      1-policy_iteration.py   4-q_learning.py
      2-value_iteration.py    5-deep_sarsa.py
      3-sarsa.py              6-reinforce.py
      gridworld.py            (shared env module)
    2-cartpole/
      1-dqn.py  2-a2c.py  3-ppo.py
  Also delete 1-grid-world/img/: the pygame-based env draws all shapes as primitives, so the PNG sprites are no longer used. With gridworld.py now next to the agents that import it, the sys.path.insert dance in each agent file goes away; they just do `from gridworld import ...` like a normal sibling module.
* Fix policy-arrow head direction (was opening forward like a Y)
* Stop move_along_policy at the goal cell
  Policy iteration's get_action returns a float (0.0) sentinel for the terminal state, which crashes enumerate(). The original tkinter loop guarded with a len() check on policy_table; emulate that by checking the reward grid before calling the picker.
* Fill agent square; draw V text and arrows over it
* Disable buttons until their prerequisites have been clicked
  GraphicDisplay now tracks per-label click counts (display.clicks dict, accessible via display.click_count(label)). Button tuples accept an optional third element, a zero-arg predicate returning bool, and the dispatcher skips clicks on disabled buttons while the renderer greys them out. In policy iteration: Improve enabled only after Evaluate; Move only after Improve. In value iteration: Print Policy after Calculate; Move after Print Policy. Reset / Clear stay always enabled and wipe the click counter on the way out.
* Document the intended button workflow in policy/value iteration mains
* Add visual press feedback to GraphicDisplay buttons
  Briefly tint the clicked button blue and offset its text by a pixel for 6 frames after a click, so users get visible confirmation that their click registered.
* Add episode/score HUD to static Env (SARSA, Q-learning)
  Same pattern as DynamicEnv: 32px dark bar at the top showing the current episode number and cumulative score. Episode count goes up on reset (after the first reset); score resets with each episode.
* Static Env HUD: show Episode, Steps, Last (terminal reward)
  The cumulative score was always 0 in mid-episode (rewards only at terminals) and got wiped on reset before it could be read. Replace with current-episode step count plus the last episode's terminal reward, which persists across resets.
* Static Env HUD: round Steps to nearest 5, rename Last -> Last Score
* Enlarge DynamicEnv grid to match static Env and add Last Score HUD
  DYN_UNIT 50 -> 100 so the dynamic env window is the same scale as the static one. Add last_score (previous episode's terminal score) to the HUD so users can compare across episodes without watching the terminal.
* Drop gridworld_changing.png: doesn't match current 5x5 / 3-obstacle env
* Cartpole: add RENDER toggle and save trained model on exit
  Each cartpole script now has two constants at the top:
    RENDER: True to open a pygame window during training (much slower)
    SAVE_PATH: where to torch.save the trained network on exit
  Models are saved both on early-stop (recent mean reward > 490) and on hitting EPISODES, so a successful run always leaves a checkpoint behind. *.pt added to .gitignore so checkpoints don't get committed.
* Cartpole: RENDER/TEST env vars instead of a file-level constant
  RENDER=1 toggles the pygame window during training. TEST=1 loads the saved checkpoint and runs inference forever (no learning, no gradient steps); TEST implies RENDER since the point is to watch.
  Usage:
    uv run python 1-dqn.py             # headless train, save .pt on exit
    RENDER=1 uv run python 1-dqn.py    # train with the window open (slow)
    TEST=1 uv run python 1-dqn.py      # replay the saved trained policy
* Cartpole: --render / --test CLI flags instead of env vars
  argparse is more discoverable (each script supports --help) and reads better than RENDER=1 sigils.
  Usage:
    uv run python 1-dqn.py             # headless train
    uv run python 1-dqn.py --render    # train with the window open (slow)
    uv run python 1-dqn.py --test      # replay the saved trained policy
* Extract shared cartpole CLI/test helpers into cartpole.py
  argparse, env construction, the pygame-QUIT poll and the test-mode episode loop were duplicated across the three algorithm files. Lift them into 2-cartpole/cartpole.py with parse_args / make_env / quit_if_window_closed / run_test_loop, and have each algorithm import just what it needs. Each algo still owns its own checkpoint format and the action picker it passes into run_test_loop.
* Rename gridworld.py / cartpole.py -> env.py for consistency
  Both folders now have an env.py, so the import line in every algorithm file becomes a uniform 'from env import ...' regardless of category.
* Cartpole training: poll pygame QUIT each step too
  So the X button works during --render training runs, not only in --test mode. The poll is a no-op when no display has been initialized (checked via pygame.display.get_init()), keeping headless runs fast.
* Bump PPO rollout to 1024 steps and total updates to 1500
  256-step rollouts gave noisy GAE estimates and PPO oscillated around 200 on CartPole. Larger rollouts stabilize the advantage estimate at the cost of ~4x wall time, which is the right trade for a learning demo.
* PPO stability: orthogonal init + correct truncation handling
  Two CleanRL-standard fixes that were missing:
  1. Orthogonal weight init with gain=sqrt(2) on the tanh trunk, 0.01 on the policy head (keeps the initial action distribution near uniform), 1.0 on the value head. This is the single biggest stability win for PPO on small MLPs.
  2. The GAE done mask now uses *terminated* only, not terminated|truncated. When CartPole-v1 truncates at 500 steps the world hasn't actually ended, so V(s') should be bootstrapped instead of zeroed; otherwise the agent learns to fear the time limit and value collapses near the end of every successful episode.
* Cartpole: kr-v2 style reward shaping (+0.1 / -1) across all three algos
  Match the rlcode-kr-v2 reference: per-step reward is +0.1, the failure step pays -1, and successful 500-step episodes get +0.1 on every step. Well-scaled magnitudes (vs. our previous +1 / -100) help PPO especially, which has no reward normalization and is sensitive to large rewards.
  PPO: re-enabled `done = terminated or truncated` for the GAE mask now that the shaping itself distinguishes success vs failure. The earlier "terminated only" change wasn't paired with a proper V(s') bootstrap on truncation and was probably destabilizing the value learning.
* Rewrite README; drop per-folder READMEs
  Single concise README at the repo root: 9-algorithm table, uv setup, run commands.
  Per-folder READMEs were stubs with broken image links to the old folder layout, so it was easier to just delete them.
* Restore original description; move modernization notes to an Updates section
* Add 3-atari/: DQN and PPO with GPU support
  env.py provides --env breakout|pong selection, standard Atari preprocessing (frame-skip 4, 84x84 grayscale, frame-stack 4), and a pick_device helper that prefers CUDA, then MPS, then CPU. Both algorithms use the Nature CNN backbone, DeepMind-standard reward clipping (sign), and the same --render / --test CLI as cartpole.
  Defaults are tuned to be runnable on a laptop rather than to hit paper-quality scores: TOTAL_FRAMES = 1M, smaller replay buffer for DQN, single-env rollout for PPO. Bump these for serious training.
* Switch to opencv-python-headless to silence SDL2 dylib clash on macOS
  pygame and opencv-python both ship their own libSDL2-2.0.0.dylib, and macOS warns loudly when both define the same Objective-C classes. The headless variant drops the GUI/SDL bits but keeps cv2's resize, which is all gymnasium's AtariPreprocessing actually needs.
* Update DQN batch size from 32 to 64 for improved training performance
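
The truncation-handling discussion in the two PPO commits above comes down to which steps zero out the bootstrap term in GAE. The following is a minimal, dependency-free sketch of that idea, not the repository's code: the function name `gae`, its argument names, and the default `gamma`/`lam` values are assumptions chosen for illustration.

```python
def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (illustrative sketch).

    dones[t] is True when V(s_{t+1}) must NOT be bootstrapped after step t.
    Passing only `terminated` flags keeps the bootstrap on time-limit
    truncation; passing `terminated or truncated` zeroes it in both cases.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        mask = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
    return advantages


# Hypothetical 3-step rollout whose last step hits the time limit.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
last_value = 0.5

# Mask built from `terminated` only: the truncated step still bootstraps.
adv_bootstrapped = gae(rewards, values, last_value, dones=[False, False, False])
# Mask built from `terminated or truncated`: the bootstrap is zeroed.
adv_zeroed = gae(rewards, values, last_value, dones=[False, False, True])
```

Note that a faithful truncation fix also needs the value of the true next observation rather than the post-reset one, which is the missing pairing the later commit mentions; this sketch sidesteps that by taking `last_value` as an argument.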

@AlexisBogroff

sys.exit() raises SystemExit which short-circuits env.close() and any outer cleanup, can't be unit-tested, and kills the kernel under Jupyter/IPython. Replace with a `solved` flag (DQN/A2C, nested loops) or a plain break (PPO, single loop) so the function returns normally and the final save runs through the same path as the EPISODES-exhausted case. Credit to @AlexisBogroff in rlcode#84 for flagging the pattern; applied to the current PyTorch tree.
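
A minimal sketch of the `solved` flag pattern described above, assuming a nested episode/step loop. Everything here is a stand-in: `train`, `episode_lengths`, the `window` size, and the `log` list replace the real environment, `torch.save`, and `env.close()` calls, which are not reproduced.

```python
def train(episode_lengths, target=490.0, window=3):
    """Leave nested loops via a flag so cleanup always runs (illustrative)."""
    recent = []
    solved = False
    log = []
    episode = -1
    for episode, length in enumerate(episode_lengths):
        score = 0
        while True:                       # inner per-step loop
            score += 1                    # stand-in for env.step(...)
            if score >= length:           # stand-in for "episode ended"
                recent = (recent + [score])[-window:]
                if len(recent) == window and sum(recent) / window > target:
                    solved = True         # remember why we are leaving
                break                     # exits only the inner loop
        if solved:
            break                         # exits the outer loop as well
    # Both exit paths (solved early, or episodes exhausted) reach this point.
    log.append("saved")                   # stand-in for torch.save(...)
    log.append("closed")                  # stand-in for env.close()
    return episode, solved, log


early = train([100, 500, 500, 500, 500])
exhausted = train([100, 100])
```

Because the function returns normally in both cases, a unit test or a notebook cell can call it and inspect the result, which is not possible when the early-stop path raises SystemExit.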

Breakout requires pressing FIRE to launch the ball after reset / life loss; AtariPreprocessing only does NOOPs, so the agent wastes frames waiting for a random FIRE. FireResetEnv presses it automatically on reset (applied only to games whose action set contains FIRE, so Pong is unaffected). terminal_on_life_loss=True is the Nature DQN / CleanRL convention: each life becomes its own episode so credit for the death-causing action isn't diluted across hundreds of steps. Reported returns will look ~5x smaller on Breakout since they now reflect per-life score, not full-game score.
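
The FIRE-on-reset behaviour described above can be sketched without any emulator. This is not the repository's FireResetEnv: the class name `FireOnReset`, the `ToyGame` stub, and the assumption that the wrapped object exposes `get_action_meanings()` directly are all illustrative (with a real ALE environment that method is normally reached through the unwrapped env).

```python
class FireOnReset:
    """Press FIRE once after every reset, if the game has a FIRE action."""

    def __init__(self, env):
        self.env = env
        meanings = env.get_action_meanings()
        # None means "this game has no FIRE action, so do nothing extra".
        self.fire_action = meanings.index("FIRE") if "FIRE" in meanings else None

    def reset(self):
        obs, info = self.env.reset()
        if self.fire_action is not None:
            obs, _, terminated, truncated, info = self.env.step(self.fire_action)
            if terminated or truncated:   # extremely unlikely, but stay safe
                obs, info = self.env.reset()
        return obs, info

    def step(self, action):
        return self.env.step(action)


class ToyGame:
    """Tiny stand-in that records actions using the gymnasium 5-tuple shape."""

    def __init__(self, meanings):
        self.meanings = meanings
        self.actions_taken = []

    def get_action_meanings(self):
        return self.meanings

    def reset(self):
        self.actions_taken.clear()
        return 0, {}

    def step(self, action):
        self.actions_taken.append(action)
        return len(self.actions_taken), 0.0, False, False, {}


with_fire = ToyGame(["NOOP", "FIRE", "RIGHT", "LEFT"])
FireOnReset(with_fire).reset()

without_fire = ToyGame(["NOOP", "UP", "DOWN"])
FireOnReset(without_fire).reset()
```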

* Vectorize Atari PPO with 8 SyncVectorEnvs
  Add make_vec_env helper in env.py and have PPO use 8 parallel envs (ROLLOUT_STEPS 1024->128 keeps the per-update batch at 1024). Standard CleanRL convention; improves sample efficiency and GPU utilization since forwards now batch across envs. DQN keeps using single-env make_env.
* Bump Atari PPO TOTAL_FRAMES to 5M
  1M was too short to see PPO Atari really take off (Breakout plateaus around per-game ~20 there). 5M lets the curve get well past the initial ramp; bump further if needed.
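
The batch arithmetic in the vectorization commit can be checked in a few lines. The constant names mirror those in the commit message, but `NUM_ENVS` is an assumed name, and how the scripts count TOTAL_FRAMES (agent steps versus emulator frames) is not stated, so the update count below simply treats it as a number of collected transitions.

```python
NUM_ENVS = 8                 # parallel environments (assumed constant name)
ROLLOUT_STEPS = 128          # steps collected per env before each update
TOTAL_FRAMES = 5_000_000     # budget from the follow-up commit

# Transitions gathered per PPO update: unchanged from the old 1 x 1024 setup.
batch_size = NUM_ENVS * ROLLOUT_STEPS

# Number of full updates that fit in the budget under the assumption above.
num_updates = TOTAL_FRAMES // batch_size

print(batch_size, num_updates)
```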

* Add optional W&B logging to Atari PPO
  New --wandb CLI flag (in shared parse_args) initializes a wandb run and logs recent_mean_return, policy_loss, value_loss, and entropy keyed by global_step. Off by default so non-wandb users see no behavior change. wandb added to dependencies via uv.
* Gitignore wandb run dir and document --wandb usage
  Add wandb/ to .gitignore so local run artifacts don't get committed, and add a short Logging section to the README covering 'wandb login' and the --wandb flag. The flag is opt-in and per-user (W&B login is tied to your own account), so contributors who skip it don't need the package at runtime.
* Add optional W&B logging to Atari DQN
  Same --wandb pattern as PPO: logs recent_mean_return, epsilon, last loss, and buffer size every 10k frames into project rl-atari-dqn. Off by default so non-wandb users see no change. README updated to cover both Atari scripts.
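
One way to keep a logging dependency optional at runtime, as the commits above describe, is to import it only when the flag is set. This is a sketch under assumptions: `make_logger` and the project name `rl-demo` are hypothetical, and the repository's shared parse_args may be structured differently. The only wandb calls used are `wandb.init(project=...)` and `wandb.log(data, step=...)`.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--wandb", action="store_true",
                        help="log metrics to Weights & Biases (opt-in)")
    return parser.parse_args(argv)


def make_logger(use_wandb, project="rl-demo"):
    """Return a log(metrics, step) callable; a no-op unless W&B is enabled."""
    if not use_wandb:
        return lambda metrics, step: None
    import wandb  # imported lazily so the package is only needed with --wandb
    wandb.init(project=project)
    return lambda metrics, step: wandb.log(metrics, step=step)


args = parse_args([])                    # default: flag absent, logging off
log = make_logger(args.wandb)
log({"recent_mean_return": 12.5}, step=10_000)   # silently ignored
```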

…table (rlcode#128)
* PPO tuning: LR anneal, value clipping, per-minibatch adv norm, 10M frames
  Three of CleanRL's 'PPO 37 details' that were missing, flagged when the 5M and 10M Breakout runs both plateaued at per-game ~75 with entropy stuck around 0.8 (policy wasn't sharpening, clip rarely activating):
  - Linear LR anneal from 2.5e-4 -> 0 across the run; lets late updates fine-tune instead of bouncing.
  - Value-function loss clipping around the old prediction (CLIP_COEF), matching the policy clipping range; stabilizes value targets.
  - Advantage normalization moved inside the minibatch loop instead of once per batch.
  Also bumps TOTAL_FRAMES 5M -> 10M to match the CleanRL Atari budget so runs are directly comparable to their published curves. lr now logged to wandb so the anneal is visible.
* Atari: shrink DQN buffer 4x, fix life-loss reset, track per-game returns
  - ReplayBuffer stores single frames and stacks 4 at sample time (~28GB -> ~7GB).
  - LifeLossTerminalEnv signals terminal on life loss but defers real reset to game-over, so noop_max + FIRE no longer fire every life and GAE/Q chains break only at the right boundary.
  - DQN: BATCH_SIZE 64 -> 32, TARGET_UPDATE_EVERY 2500 -> 250 train steps (~1k env frames), EPSILON_END 0.1 -> 0.01.
  - Log per-life and per-game returns separately (DQN and PPO).
* Atari: README benchmarks table + DQN 500k buffer for 8GB Macs
  - README: add Atari to algorithms list, new Benchmarks section with hardware notes, per-algo row (params, train time, score, RAM, CPU/GPU, W&B report).
  - DQN buffer 1M -> 500k (~3.5GB) so a 1M-capacity run stops swapping on 8GB unified memory.
  - moviepy added for the local eval/recording script.
  - .gitignore: exclude scripts/ and docs/ (local-only working dirs).
* Ignore local logs/ directory
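
Two of the PPO details named above, the linear learning-rate anneal and the clipped value loss, can be written for scalars in plain Python. This is a sketch, not the repository's implementation: the function names are hypothetical, and `clip_coef=0.1` is an assumed value since the commit only names the constant CLIP_COEF.

```python
def annealed_lr(update, num_updates, initial_lr=2.5e-4):
    """Linear decay: initial_lr at update 0, reaching 0 at update == num_updates."""
    frac = 1.0 - update / num_updates
    return frac * initial_lr


def clipped_value_loss(new_value, old_value, target_return, clip_coef=0.1):
    """Per-sample value loss with clipping around the old prediction.

    The new prediction may move at most clip_coef away from the old one;
    the loss is the larger (more pessimistic) of the clipped and unclipped
    squared errors, scaled by 0.5.
    """
    unclipped = (new_value - target_return) ** 2
    step = max(-clip_coef, min(clip_coef, new_value - old_value))
    clipped = (old_value + step - target_return) ** 2
    return 0.5 * max(unclipped, clipped)


lr_start = annealed_lr(0, 100)
lr_mid = annealed_lr(50, 100)
lr_end = annealed_lr(100, 100)

# The new value jumps straight to the target, but the loss still charges it
# as if it had moved only clip_coef from the old prediction.
loss_big_jump = clipped_value_loss(new_value=1.0, old_value=0.5, target_return=1.0)
# A small move stays inside the clip range, so both terms agree.
loss_small_move = clipped_value_loss(new_value=0.55, old_value=0.5, target_return=1.0)
```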

* Ignore local CLAUDE.md collaboration notes
* 4-atari-hard: PPO + RND scaffold for Montezuma's Revenge
  New chapter for hard-exploration Atari. PPO with Random Network Distillation (Burda et al., 2018) as the curiosity bonus.
  - env.py: ALE/MontezumaRevenge-v5 (and pitfall, private_eye) with the standard Atari preprocessing, no FireResetEnv, no LifeLossTerminalEnv (uninterrupted episodes so intrinsic returns can chain across deaths).
  - 1-ppo-rnd.py: two-value-head ActorCritic, RND target/predictor with LeakyReLU, single-frame normalized input clipped to [-5, 5], obs RMS seeded by 50 rollouts of a random agent, intrinsic reward scaled by running std of discounted intrinsic returns, dual GAE (extrinsic episodic + intrinsic non-episodic), predictor updated on 25% of each minibatch, combined advantage A = 2*A_ext + 1*A_int.
  Not run end-to-end yet. Sanity-checked static shapes and module wiring.
* 4-atari-hard: add envpool count-based exploration
* 4-atari-hard: PPO+RND on Montezuma's Revenge + benchmark
  PPO+RND made reproducible and resumable. Shared run plumbing (seed, metrics.jsonl, periodic/milestone/best checkpoints, resume, final summary) lives in env.py's RunLogger, keeping the algorithm file focused. 512 parallel envs crack the first-key bottleneck (128 envs never scored in 50M); final mean per-game return ~3120 @ 65M steps, single seed (M4 Max), above the paper PPO baseline (2497). Adds a README benchmark row. Count-based exploration is deferred to a later PR (not yet trained/benchmarked).
* README: link the Montezuma PPO+RND W&B report
---------
Co-authored-by: soyoung park <ssoyyoung.p@gmail.com>
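
The combined-advantage formula A = 2*A_ext + 1*A_int from the scaffold commit, together with the prediction-error bonus that RND is built on, can be illustrated with plain lists. This is a sketch under assumptions: the function names are hypothetical, the feature vectors are made-up numbers, and using the mean squared error over features is one common form of the bonus rather than a statement about what 1-ppo-rnd.py does.

```python
def intrinsic_reward(target_features, predicted_features):
    """Mean squared error between the fixed target net and the predictor.

    Observations the predictor has seen often give a small error (low bonus);
    novel observations give a large error (high bonus).
    """
    n = len(target_features)
    return sum((t - p) ** 2 for t, p in zip(target_features, predicted_features)) / n


def combine_advantages(adv_ext, adv_int, ext_coef=2.0, int_coef=1.0):
    """A = ext_coef * A_ext + int_coef * A_int, elementwise over a rollout."""
    return [ext_coef * e + int_coef * i for e, i in zip(adv_ext, adv_int)]


# Made-up feature vectors: a familiar state and a novel one.
familiar_bonus = intrinsic_reward([0.2, -0.1, 0.4], [0.2, -0.1, 0.3])
novel_bonus = intrinsic_reward([0.2, -0.1, 0.4], [1.0, 0.9, -0.6])

combined = combine_advantages(adv_ext=[1.0, -0.5], adv_int=[0.2, 0.4])
```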

Bumps [pillow](https://github.com/python-pillow/Pillow) from 10.4.0 to 12.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](python-pillow/Pillow@10.4.0...12.2.0)
---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 12.2.0
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…ification) (rlcode#133)
* Ignore local CLAUDE.md collaboration notes
* 4-atari-hard: PPO + RND scaffold for Montezuma's Revenge
  New chapter for hard-exploration Atari. PPO with Random Network Distillation (Burda et al., 2018) as the curiosity bonus.
  - env.py: ALE/MontezumaRevenge-v5 (and pitfall, private_eye) with the standard Atari preprocessing, no FireResetEnv, no LifeLossTerminalEnv (uninterrupted episodes so intrinsic returns can chain across deaths).
  - 1-ppo-rnd.py: two-value-head ActorCritic, RND target/predictor with LeakyReLU, single-frame normalized input clipped to [-5, 5], obs RMS seeded by 50 rollouts of a random agent, intrinsic reward scaled by running std of discounted intrinsic returns, dual GAE (extrinsic episodic + intrinsic non-episodic), predictor updated on 25% of each minibatch, combined advantage A = 2*A_ext + 1*A_int.
  Not run end-to-end yet. Sanity-checked static shapes and module wiring.
* 4-atari-hard: add envpool count-based exploration
* 4-atari-hard: PPO+RND on Montezuma's Revenge + benchmark
  PPO+RND made reproducible and resumable. Shared run plumbing (seed, metrics.jsonl, periodic/milestone/best checkpoints, resume, final summary) lives in env.py's RunLogger, keeping the algorithm file focused. 512 parallel envs crack the first-key bottleneck (128 envs never scored in 50M); final mean per-game return ~3120 @ 65M steps, single seed (M4 Max), above the paper PPO baseline (2497). Adds a README benchmark row. Count-based exploration is deferred to a later PR (not yet trained/benchmarked).
* README: link the Montezuma PPO+RND W&B report
* 6-atari-go-explore: Go-Explore Phase 1 (exploration) for Montezuma
  Restore-based archive exploration (Ecoffet et al. 2019/2021), no neural net: 11x8x9 downscaled-frame cells, 1/sqrt(seen+1) selection, repeated random actions (p=0.95), raw-score accept rule, virtual DONE cell, global experience log with prev_id chains (demo source for robustification), 12-worker spawn pool over raw gymnasium ALE clone/restore.
  Run contract: --seed/--total-frames/--run-dir/--ckpt-every/--resume; explog flushed as compressed chunks; checkpoint = archive+log+RNG at batch boundaries.
  Smoke: 23k steps/s aggregate, first key at 100k steps.
* 6-atari-go-explore: resolve flushed explog chunks from the resumed run's dir
  Cross-run-dir resume (harness relaunches into a fresh run dir) could not see chunks flushed by the original run; chunk lookup now falls back to the ancestor run's explog dir and resume fails loudly if any chunk is unreachable.
* Move Go-Explore into 4-atari-hard alongside PPO+RND
  Same hard-exploration domain, two paradigms side by side: 1-ppo-rnd.py (gradient + intrinsic reward, envpool) and 2-go-explore.py (archive + emulator restore, raw ALE). Go-Explore keeps its own plumbing in env_go_explore.py since the two stacks share nothing.
* 4-atari-hard/2-go-explore: use sampling-time captures in the archive walk
  A result earlier in the same batch can replace a cell; walking a later result against the cell's CURRENT score/trajectory stitched actions executed from the old state onto the new prefix, fabricating scores no single playthrough achieved. sample() now freezes snapshot/score/trajectory per pick and the walk uses the capture, matching the official Go-Explore, which ships these values inside each task. Caught by publish-time demo replay verification (score mismatch).
* 4-atari-hard: Go-Explore robustification (backward algorithm): demo extract + GRU PPO
  extract_demo.py: pull the best Phase-1 trajectory from the GE checkpoint + experience log, replay-verify it reproduces the archived score (31,000), truncate after the last reward, save actions/rewards/periodic ALE states.
  env_robustify.py: ReplayResetEnv (episodes restore to a demo point and play forward under sticky actions; success = raw score >= demo return; lag/success kills) + ResetManager curriculum (starting points march backward as the agent matches the demo, forward-cumsum move rule per atari-reset, nudge forward on collapse).
  3-robustify.py: recurrent (GRU) PPO over N restore-capable ALE envs, truncated BPTT with done-masked state, advantage chains cut at artificial success resets, periodic from-reset sticky eval -> final.json.
  Single-machine scaled port of openai/atari-reset; SIL/multi-demo/autoscale are off-by-default flags. Runs end-to-end; curriculum logic covered by harness T0.
* 4-atari-hard/3-robustify: eval honors game_over + faithful resume RNG
  Preflight audit found the from-reset eval reused ReplayResetEnv with the training-curriculum kills active, so eval episodes were cut by lag/success-kill before game_over; value_mean reported a key-but-slower-than-demo policy as ~0. Disable both kills in evaluate() and cap at the standard 18000-frame Montezuma episode (4500 agent steps) so eval runs from reset to a real game_over. Also checkpoint torch + per-env RNG state and restore them on resume (was global numpy RNG only), so the kill->resume contract is faithful for the deterministic streams (MPS policy sampling has no bit determinism).
* 4-atari-hard/extract_demo: --max-rewards to truncate at the Kth reward
  Cutting the demo just after the first reward (--max-rewards 1) yields a short first-key-only demo (~250 actions vs ~5300), a far shorter horizon for the robustification backward curriculum to bootstrap on.
* 4-atari-hard/3-robustify: --ent-coef flag for entropy tuning
  The first-key robustification curriculum plateaus where as_good_as_demo caps ~0.34: the policy commits before reliably executing the demo suffix under sticky actions. Expose the entropy bonus as a flag (default unchanged) to test whether more exploration breaks the plateau.
* README: Go-Explore robustification, single-machine negative result
  Document the backward-algorithm robustification (3-robustify.py): the curriculum bootstraps with a first-key demo + 128 envs but plateaus ~22% of the way, with no from-reset score on a single machine. Honest negative result, single seed, no benchmark row claimed.
* README: consolidate Montezuma into one table (RND + Go-Explore exploration + robustification)
  Merge the three Montezuma results into a single table with a protocol column, trim the prose to one note. Restores the exploration row (31,000, replay-verified) that lived only on the superseded rlcode#132 branch.
---------
Co-authored-by: soyoung park <ssoyyoung.p@gmail.com>
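
The 1/sqrt(seen+1) selection rule named in the Go-Explore Phase 1 commit can be sketched as a weighted draw over archive cells. The function names, the dictionary layout, and the toy cell keys are assumptions for illustration; what exactly counts as "seen" in 2-go-explore.py is not specified here.

```python
import math
import random


def selection_weights(seen_counts):
    """Weight each archive cell by 1/sqrt(seen + 1): rarer cells score higher."""
    return {cell: 1.0 / math.sqrt(seen + 1) for cell, seen in seen_counts.items()}


def pick_cell(seen_counts, rng):
    """Draw one cell to restore and explore from, proportionally to its weight."""
    weights = selection_weights(seen_counts)
    cells = list(weights)
    return rng.choices(cells, weights=[weights[c] for c in cells], k=1)[0]


# Toy archive: cell key -> how often the cell has been seen so far.
archive_seen = {"start_room": 8, "ladder": 3, "key_ledge": 0}
weights = selection_weights(archive_seen)

rng = random.Random(0)          # seeded so repeated runs draw the same cells
chosen = pick_cell(archive_seen, rng)
```

With these counts the never-seen cell gets weight 1, the cell seen 3 times gets 1/2, and the cell seen 8 times gets 1/3, so exploration is biased toward the frontier without ever excluding well-visited cells.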

Commits on May 17, 2026

Commits on May 24, 2026

Commits on Jun 6, 2026

Commits on Jun 12, 2026
