Update data-analysis materials for "Python for Data Analysis" refresh by realpython-bot · Pull Request #779 · realpython/materials

realpython-bot · 2026-06-05T11:58:02Z

Syncs the data-analysis/ materials with the refreshed Python for Data Analysis tutorial and its updated dependency stack.

Dependencies

Added data-analysis/requirements.txt pinning the versions from the tutorial update:

pandas==3.0.3
matplotlib==3.10.9
scikit-learn==1.9.0
openpyxl==3.1.5
pyarrow==24.0.0
lxml==6.1.1

(The tutorial targets Python 3.14.)
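
For anyone reproducing the verification locally, the pins above can be compared against the active environment. This is a minimal sketch using only the standard library, not part of the PR itself; the helper names `parse_pins` and `find_mismatches` are hypothetical, and the pin list is copied inline rather than read from data-analysis/requirements.txt.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements list in this PR description.
PINS = """\
pandas==3.0.3
matplotlib==3.10.9
scikit-learn==1.9.0
openpyxl==3.1.5
pyarrow==24.0.0
lxml==6.1.1
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, pinned = line.partition("==")
        pins[name.strip()] = pinned.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for packages that differ.

    A package that is not installed is reported with installed=None.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, installed) in find_mismatches(parse_pins(PINS)).items():
        print(f"{name}: pinned {pinned}, installed {installed or 'not installed'}")
```

An empty result from `find_mismatches` would suggest the environment matches the pinned stack; it does not check the Python version itself.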

Code changes (both notebooks)

Currency cleanup regex now uses a raw string and also strips whitespace: .replace("[$,]", "", regex=True) → .replace(r"[$,\s]", "", regex=True). The source CSV stores figures with surrounding spaces (" $1,000.00 "), so this makes the cleanup explicit rather than relying on astype() to trim.
film_length suffix removal now removes the leading space too: .str.removesuffix("mins") → .str.removesuffix(" mins").
read_html() now passes a browser User-Agent via storage_options={"User-Agent": "Mozilla/5.0"}, since Wikipedia now returns HTTP 403 Forbidden without one (findings notebook).
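
The two string cleanups above can be illustrated on single values with the standard library. This is a sketch of the same patterns, not the notebook code: `re.sub` stands in for the regex-based `.replace(..., regex=True)`, `str.removesuffix` stands in for `.str.removesuffix`, the helper names are hypothetical, and the `"110 mins"` sample is an assumed value (only `" $1,000.00 "` comes from the description).

```python
import re


def clean_currency(raw: str) -> float:
    """Strip dollar signs, thousands separators, and whitespace, then convert."""
    # Inside a character class, "$" is a literal dollar sign, and "\s"
    # matches the surrounding spaces the source CSV stores.
    return float(re.sub(r"[$,\s]", "", raw))


def clean_film_length(raw: str) -> int:
    """Drop the ' mins' suffix, including its leading space, then convert."""
    return int(raw.removesuffix(" mins"))


print(clean_currency(" $1,000.00 "))   # value format quoted in the PR
print(clean_film_length("110 mins"))   # hypothetical sample value

# The old suffix leaves a trailing space behind, which the later
# conversion had to trim implicitly.
print(repr("110 mins".removesuffix("mins")))
```

Removing the whitespace and the leading space at the cleanup step means the later type conversion no longer has to tolerate padded strings.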

Regenerated artifact

james_bond_data_cleansed.csv was regenerated under pandas 3.0. In pandas 3.0, .combine_first() no longer alphabetically sorts the result columns, so the cleansed file now preserves the logical source column order. The data values are unchanged — only the column order differs.

Verification

Ran the full pipeline end-to-end on the pinned versions (Python 3.14, pandas 3.0.3, scikit-learn 1.9.0, matplotlib 3.10.9):

Cleansing reproduces the dataset (25 rows, 0 nulls; typos and outliers fixed).
Regression analysis still yields R-squared 0.79, best fit y = 1.6637x - 4.9276.
film_length stats: min 106, max 163, mean 128.28, std 12.94 — matching the tutorial output.

🤖 Generated with Claude Code

Sync the materials with the refreshed "Python for Data Analysis" tutorial and its updated dependencies (pandas 3.0.3, matplotlib 3.10.9, scikit-learn 1.9.0, openpyxl 3.1.5, pyarrow 24.0.0, lxml 6.1.1, Python 3.14).

Code changes in both notebooks:

- Currency cleanup regex now uses a raw string and strips whitespace: .replace("[$,]", ...) -> .replace(r"[$,\s]", ...), matching the source data, which has spaces inside the quoted figures (" $1,000.00 ").
- film_length suffix removal now strips the leading space too: .str.removesuffix("mins") -> .str.removesuffix(" mins").
- read_html() now sends a browser User-Agent via storage_options, since Wikipedia returns HTTP 403 without it.

Add data-analysis/requirements.txt pinning the tutorial's dependencies.

Regenerate james_bond_data_cleansed.csv: under pandas 3.0, .combine_first() no longer alphabetically sorts the result columns, so the cleansed file now preserves the logical source column order. Data values are unchanged.

Verified end-to-end on the pinned versions (Python 3.14): cleansing reproduces the dataset and the regression analysis still yields R-squared 0.79 with film-length stats min 106 / max 163 / mean 128.28 / std 12.94.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

martin-martin · 2026-06-05T12:11:37Z

Closed in favor of #780

martin-martin closed this Jun 5, 2026

realpython-bot wants to merge 1 commit into realpython:master from realpython-bot:update-python-for-data-analysis