Update data-analysis materials for "Python for Data Analysis" refresh #779

Closed
realpython-bot wants to merge 1 commit into realpython:master from realpython-bot:update-python-for-data-analysis

Conversation

@realpython-bot
Contributor

Syncs the data-analysis/ materials with the refreshed Python for Data Analysis tutorial and its updated dependency stack.

Dependencies

Added data-analysis/requirements.txt pinning the versions from the tutorial update:

pandas==3.0.3
matplotlib==3.10.9
scikit-learn==1.9.0
openpyxl==3.1.5
pyarrow==24.0.0
lxml==6.1.1

(The tutorial targets Python 3.14.)

Code changes (both notebooks)

  • Currency cleanup regex now uses a raw string and also strips whitespace: .replace("[$,]", "", regex=True) -> .replace(r"[$,\s]", "", regex=True). The source CSV stores figures with surrounding spaces (" $1,000.00 "), so this makes the cleanup explicit rather than relying on astype() to trim.
  • film_length suffix removal now removes the leading space too: .str.removesuffix("mins") -> .str.removesuffix(" mins").
  • read_html() now passes a browser User-Agent via storage_options={"User-Agent": "Mozilla/5.0"}, since Wikipedia now returns HTTP 403 Forbidden without one (findings notebook).
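
The effect of the three changes above can be illustrated without pandas. This is a rough sketch, not the notebook code: it applies the same regex pattern with `re.sub`, uses the built-in `str.removesuffix()` that the `.str` accessor mirrors, and builds a `urllib.request.Request` to show the header that pandas is documented to forward from `storage_options` for HTTP(S) URLs. The helper names, the sample film length, and the example.org URL are assumptions for illustration only; no network request is made.

```python
import re
from urllib.request import Request

# Same character class as the updated notebook pattern: dollar signs,
# thousands separators, and any whitespace.
CURRENCY_NOISE = re.compile(r"[$,\s]")


def clean_currency(raw: str) -> float:
    """Strip '$', ',' and whitespace, then convert to float."""
    return float(CURRENCY_NOISE.sub("", raw))


def clean_film_length(raw: str) -> int:
    """Drop the ' mins' suffix, including its leading space."""
    return int(raw.removesuffix(" mins"))


# Sample figure in the format quoted in the PR description.
print(clean_currency(" $1,000.00 "))

# Hypothetical film length value; the old suffix leaves a trailing space.
print(repr("110 mins".removesuffix("mins")))
print(clean_film_length("110 mins"))

# Placeholder URL (assumption); the request is constructed but never sent.
request = Request("https://example.org/", headers={"User-Agent": "Mozilla/5.0"})
print(request.get_header("User-agent"))
```

Note that `urllib` normalizes header names with `str.capitalize()`, which is why the lookup key is `"User-agent"`.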

Regenerated artifact

  • james_bond_data_cleansed.csv was regenerated under pandas 3.0. In pandas 3.0, .combine_first() no longer alphabetically sorts the result columns, so the cleansed file now preserves the logical source column order. The data values are unchanged — only the column order differs.
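
One way to confirm that only the column order changed is to compare the old and new files row by row while ignoring column order. The sketch below uses the standard library `csv` module; the function name and the two in-memory samples are made up for illustration and are not taken from james_bond_data_cleansed.csv.

```python
import csv
import io


def same_data_ignoring_column_order(old, new) -> bool:
    """Return True if both CSV streams hold the same rows, column order aside.

    Each row is read as a dict keyed by header name, and dict equality
    does not depend on key order. Row order must still match.
    """
    old_rows = list(csv.DictReader(old))
    new_rows = list(csv.DictReader(new))
    return old_rows == new_rows


# Made-up sample data: same values, different column order.
OLD_CSV = "film_length,title\n110,Film A\n125,Film B\n"
NEW_CSV = "title,film_length\nFilm A,110\nFilm B,125\n"

print(same_data_ignoring_column_order(io.StringIO(OLD_CSV), io.StringIO(NEW_CSV)))
```

For real files, the same function accepts handles opened with `open(path, newline="")`. Values are compared as strings, so a formatting change such as `110` versus `110.0` would be reported as a difference.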

Verification

Ran the full pipeline end-to-end on the pinned versions (Python 3.14, pandas 3.0.3, scikit-learn 1.9.0, matplotlib 3.10.9):

  • Cleansing reproduces the dataset (25 rows, 0 nulls; typos and outliers fixed).
  • Regression analysis still yields R-squared 0.79, best fit y = 1.6637x - 4.9276.
  • film_length stats: min 106, max 163, mean 128.28, std 12.94 — matching the tutorial output.

🤖 Generated with Claude Code

Sync the materials with the refreshed "Python for Data Analysis" tutorial
and its updated dependencies (pandas 3.0.3, matplotlib 3.10.9,
scikit-learn 1.9.0, openpyxl 3.1.5, pyarrow 24.0.0, lxml 6.1.1, Python 3.14).

Code changes in both notebooks:
- Currency cleanup regex now uses a raw string and strips whitespace:
  .replace("[$,]", ...) -> .replace(r"[$,\s]", ...), matching the
  source data, which has spaces inside the quoted figures (" $1,000.00 ").
- film_length suffix removal now strips the leading space too:
  .str.removesuffix("mins") -> .str.removesuffix(" mins").
- read_html() now sends a browser User-Agent via storage_options, since
  Wikipedia returns HTTP 403 without it.

Add data-analysis/requirements.txt pinning the tutorial's dependencies.

Regenerate james_bond_data_cleansed.csv: under pandas 3.0, .combine_first()
no longer alphabetically sorts the result columns, so the cleansed file now
preserves the logical source column order. Data values are unchanged.

Verified end-to-end on the pinned versions (Python 3.14): cleansing
reproduces the dataset and the regression analysis still yields R-squared
0.79 with film-length stats min 106 / max 163 / mean 128.28 / std 12.94.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@martin-martin
Contributor

Closed in favor of #780

