Eliminate duplicated calculations and unnecessary work for linear regression by rhettinger · Pull Request #25922 · python/cpython

rhettinger · 2021-05-05T17:16:51Z

The current code, while pretty, does repeated calculations and unnecessary work:

covariance() and variance() both divide by n - 1, a factor that cancels out in the slope calculation. The two divisions also cause two unnecessary roundings.
covariance(x, y) and variance(x) both compute fmean(x). This doesn't need to be done twice.
variance(x) uses the extremely slow internal _ss(), _sum(), and _convert() functions, whose purpose is to preserve type information. However, that type information is thrown away by linear_regression(x, y), which always returns a pair of floats:

    >>> from statistics import linear_regression
    >>> from fractions import Fraction as F
    >>> linear_regression([F(1,2), F(2,3)], [F(5,7), F(8,9)])
    LinearRegression(intercept=0.19047619047619047, slope=1.0476190476190477)

The intercept calculation makes two more redundant fmean() calls.
The inlined code makes the actual calculation clearer. It matches this typical presentation: slope = s_{x,y} / s^2_x
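To make the points above concrete, here is a minimal sketch of the inlined calculation. The first four assignments mirror the lines quoted in the review thread; the function name `inlined_linear_regression`, the variable `sxx`, and the error handling are assumptions added for illustration and are not taken from the merged patch.

```python
from math import fsum


def inlined_linear_regression(regressor, dependent_variable):
    """Hypothetical helper: ordinary least squares in a single inlined pass.

    Returns (slope, intercept) as plain floats.
    """
    n = len(regressor)
    if len(dependent_variable) != n:
        raise ValueError("both inputs must have the same number of data points")
    if n < 2:
        raise ValueError("at least two data points are required")

    x, y = regressor, dependent_variable
    # Each mean is computed exactly once and reused below.
    xbar = fsum(x) / n
    ybar = fsum(y) / n
    # Unscaled sums: the 1 / (n - 1) factor of covariance() and variance()
    # would cancel in the ratio, so it is never applied.
    sxy = fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = fsum((xi - xbar) ** 2 for xi in x)  # assumed name for the x sum of squares
    slope = sxy / sxx  # raises ZeroDivisionError if x is constant
    intercept = ybar - slope * xbar
    return slope, intercept


# Points on the line y = 2x + 1.
slope, intercept = inlined_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

Because everything goes through fsum() and float division, the result is a pair of floats regardless of the input types, which is consistent with the Fraction example above.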

pablogsal

LGTM!

pablogsal · 2021-05-06T13:58:38Z

+    x, y = regressor, dependent_variable
+    xbar = fsum(x) / n
+    ybar = fsum(y) / n
+    sxy = fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))


Question: isn't the generator + zip going to make it slightly slower?

That was an existing line taken from covariance(). I think it is the fastest way to run this computation.
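One way to check the question empirically is a small timeit comparison. The sketch below is not from the PR: the alternative formulations (`sxy_indexed`, `sxy_list`) and the data sizes are assumptions chosen for illustration, and the timings will vary by machine and Python version, so no results are claimed here.

```python
import random
from math import fsum
from timeit import timeit

random.seed(0)
n = 1_000
x = [random.random() for _ in range(n)]
y = [random.random() for _ in range(n)]
xbar = fsum(x) / n
ybar = fsum(y) / n


def sxy_gen_zip():
    # The form used in the patch: generator expression over zip().
    return fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))


def sxy_indexed():
    # Hypothetical alternative: index into both lists.
    return fsum((x[i] - xbar) * (y[i] - ybar) for i in range(n))


def sxy_list():
    # Hypothetical alternative: build a full list before summing.
    return fsum([(xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)])


if __name__ == "__main__":
    for fn in (sxy_gen_zip, sxy_indexed, sxy_list):
        print(f"{fn.__name__}: {timeit(fn, number=200):.4f} s")
```

All three variants compute the same sum, so any difference in the printed timings reflects iteration overhead only.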

miss-islington · 2021-05-06T14:43:16Z

Thanks @rhettinger for the PR 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖

miss-islington · 2021-05-06T14:43:18Z

Sorry @rhettinger, I had trouble checking out the 3.10 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 55b78ce3c4e23abe4f27bf16d7968f8851532e47 3.10

miss-islington · 2021-05-06T14:44:10Z

Thanks @rhettinger for the PR 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖

bedevere-bot · 2021-05-06T14:44:18Z

GH-25945 is a backport of this pull request to the 3.10 branch.

…ression (pythonGH-25922) (cherry picked from commit 55b78ce) Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>

…ression (GH-25922) (GH-25945)

rhettinger and others added 10 commits March 15, 2021 21:12

Merge pull request #1 from python/master

bbd2da9

Update to 15 March

Merge branch 'master' of github.com:python/cpython

74bdf1b

Merge branch 'master' of github.com:python/cpython

6c53f1a

.

a487c4f

Merge branch 'master' of github.com:python/cpython

.

eb56423

Merge branch 'master' of github.com:python/cpython

.

cc7ba06

Merge branch 'master' of github.com:python/cpython

.

d024dd0

Merge branch 'master' of github.com:python/cpython

merge

b10f912

Merge branch 'main' of github.com:python/cpython into main

Avoid repeated calculations, unnecessary scaling, and type preservation

c6dcb38

Improve variable names

125a09f

rhettinger requested a review from pablogsal May 5, 2021 17:16

bedevere-bot added the awaiting core review label May 5, 2021

the-knights-who-say-ni added the CLA signed label May 5, 2021

rhettinger added the needs backport to 3.10, skip issue, skip news, and performance labels May 5, 2021

rhettinger changed the title ~~Inline the calculations for linear regression~~ Eliminate duplicated calculations and unnecessary work for linear regression May 5, 2021

pablogsal approved these changes May 6, 2021


bedevere-bot added awaiting merge and removed awaiting core review labels May 6, 2021

pablogsal reviewed May 6, 2021


rhettinger merged commit 55b78ce into python:main May 6, 2021

bedevere-bot removed the awaiting merge label May 6, 2021

miss-islington assigned rhettinger May 6, 2021

rhettinger added and removed the needs backport to 3.10 label May 6, 2021

bedevere-bot removed the needs backport to 3.10 label May 6, 2021

rhettinger pushed a commit that referenced this pull request May 6, 2021

Eliminate duplicated calculations and unnecessary work for linear reg…

8e3cb61

…ression (GH-25922) (GH-25945)

pablogsal mentioned this pull request Jul 18, 2021

[3.10] Correct the order of regen-abidump #27228

Closed

rhettinger merged 10 commits into python:main from rhettinger:linear_regresssion_inline


5 participants
