Skip to content

Eliminate duplicated calculations and unnecessary work for linear regression#25922

Merged
rhettinger merged 10 commits into
python:mainfrom
rhettinger:linear_regresssion_inline
May 6, 2021
Merged

Eliminate duplicated calculations and unnecessary work for linear regression#25922
rhettinger merged 10 commits into
python:mainfrom
rhettinger:linear_regresssion_inline

Conversation

@rhettinger
Copy link
Copy Markdown
Contributor

@rhettinger rhettinger commented May 5, 2021

The current code, while pretty, does repeated calculations and unnecessary work:

  • covariance() and variance() both divide by n - 1 which is thrown away in the slope calculation. This also causes two unnecessary roundings.

  • covariance(x,y) and variance(x) both compute fmean(x). This doesn't need to be done twice.

  • variance(x) uses the extremely slow internal _ss(), _sum(), and _convert() functions whose purpose is to preserve type information. However, that type information is thrown away by linear_regression(x, y) which always returns a pair of floats:

    >>> from statistics import linear_regression
    >>> from fractions import Fraction as F
    >>> linear_regression([F(1,2), F(2,3)], [F(5,7), F(8,9)])
    LinearRegression(intercept=0.19047619047619047, slope=1.0476190476190477)
  • the intercept calculation makes two more redundant fmean() calls that are unnecessary.

  • The inlined code makes the actual calculation more clear. It matches this typical presentation: slope = s_{x,y} / s^2_x

rhettinger and others added 10 commits March 15, 2021 21:12
.
Merge branch 'master' of github.com:python/cpython
.
Merge branch 'master' of github.com:python/cpython
.
Merge branch 'master' of github.com:python/cpython
.
Merge branch 'master' of github.com:python/cpython
Merge branch 'main' of github.com:python/cpython into main
@rhettinger rhettinger requested a review from pablogsal May 5, 2021 17:16
@rhettinger rhettinger added needs backport to 3.10 only security fixes skip issue skip news performance Performance or resource usage labels May 5, 2021
@rhettinger rhettinger changed the title Inline the calculations for linear regression Eliminate duplicated calculations and unnecessary work for linear regression May 5, 2021
Copy link
Copy Markdown
Member

@pablogsal pablogsal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment thread Lib/statistics.py
x, y = regressor, dependent_variable
xbar = fsum(x) / n
ybar = fsum(y) / n
sxy = fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: isn't the generator + zip going to make it slightly slower?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was an existing line take from covariance(). I think it is the fastest way the run this computation.

@rhettinger rhettinger merged commit 55b78ce into python:main May 6, 2021
@miss-islington
Copy link
Copy Markdown
Contributor

Thanks @rhettinger for the PR 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖

@miss-islington
Copy link
Copy Markdown
Contributor

Sorry @rhettinger, I had trouble checking out the 3.10 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 55b78ce3c4e23abe4f27bf16d7968f8851532e47 3.10

@rhettinger rhettinger added needs backport to 3.10 only security fixes and removed needs backport to 3.10 only security fixes labels May 6, 2021
@miss-islington
Copy link
Copy Markdown
Contributor

Thanks @rhettinger for the PR 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖

@bedevere-bot bedevere-bot removed the needs backport to 3.10 only security fixes label May 6, 2021
@bedevere-bot
Copy link
Copy Markdown

GH-25945 is a backport of this pull request to the 3.10 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 6, 2021
…ression (pythonGH-25922)

(cherry picked from commit 55b78ce)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance or resource usage skip issue skip news

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants