6_Statistics

Statistics

There are 2 sets of tools for statistics:

Descriptive Statistics: It is used to identify important elements in a dataset and describe the dataset.
Inferential Statistics: It explain those relationships via other elements.

            Statistics
     Descriptive     Inferential
Univariate                Hypothesis Testing 
Bivariate                 Model Fitting
Multivariate

Descriptive statistics help us summarize the data.

It is the first step in exploratory data analysis.
It helps us understand the data.
It help you detect outlier.
Plan to prepare data.
It also helps us in feature engineering.

          Descriptive
Univariate  Bivariate  Multivariate

Univariate: It involves single variable.

Bivariate: It involves 2 variables, examples: correlation and covariance.

Multivariate: It includes multiple variables, examples: corrilation matrix and covariance matrix.

Lets understand Univariate Statistics

It is divided into 3 parts:

Measures of Frequency

How often a particular value occur in data.
It includes frequency tables and histogram.
A histogram is a visualization where we plot the values, bucketized if needed on the x-axis and on the y-axis we have count of record.

Central Tendency

Measure of central tendency for a variable try to determine one value that best represents your data.
Common measures include average or the mean of your data, the median value and the mode.

Other measures of central tendency

Geomatric mean
Harmonic mean

They are used with ratios or rates

Mean

It is the single best value to represent your data. It not actually be the data point itself.
It considers every point available in your data.
It can be computed on discrete data as well as continuous data.

Discrete data can be numeric value that take on values from a predetermine subset such as star ratings.

Continuous data can take on any value in a range.

Mean is extremly sensitive to the presence of outlier.

Median

It is less influenced by the outlier.
The median is that value in your data such that 50% of the dataset is either side of it.
For calculating mean you have to sort the data in ascending or decending order and then use the middle element.
It the data points are even then average the middle 2.
The median might be the better measure of central tendency than the mean because it is much more robust to the presence of outliers.
Like mean, Median does not consider very data point available in the dataset.
The median can be computed on ordinal data, data that is not numeric but can be sorted like rating like good bad ugly very ugly.

Mode

The value that occurs most frequently in the data, if you plot a histogram, the highest bar represents the mode.
Examples - casting votes in an election
It is used with catagorical data, variables that take on a fixed set of values from a predetermined range, examples of catagorical data includes, days of a week, month of the year, makes of cars.
Mode dont need to be unique.
Mode is not great for continuous data that can take on any value within the range.
Contiuous data needs to be descretized before you can perform mode computation.

Dispersion

It tells you how your dataset is spread out.

Range(max-min) of your variable. It is sensitive to outliers. - Range completely ignores the mean, thats why variance comes in.
Inter Quartile range: Difference between the values at the 75th percentile of your data and the 25th percentile of your data.
Standard deviation and variance

Variance

Not only tells us how the numbers jumps around but also understands where your numbers are clustered.
It also considers the mean.
Variance is the second most important number to summarize your datapoints.
Variance of the data is simply the sum of the squares of the mean deviations divided by the number of data points you have in your data.

Bessels Correction - put n-1 in the denominator while computing variance

Name		Name	Last commit message	Last commit date
parent directory ..
.ipynb_checkpoints		.ipynb_checkpoints
datasets		datasets
38_Statistics.ipynb		38_Statistics.ipynb
39_Statistics.ipynb		39_Statistics.ipynb
40_Statistics.ipynb		40_Statistics.ipynb
41_Statistics .ipynb		41_Statistics .ipynb
42_Statistics Probability.ipynb		42_Statistics Probability.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Statistics

Lets understand Univariate Statistics

Mean

Mean is extremly sensitive to the presence of outlier.

Median

Mode

Variance

FilesExpand file tree

6_Statistics

Directory actions

More options

Directory actions

More options

Latest commit

History

6_Statistics

Folders and files

parent directory

README.md

Statistics

Lets understand Univariate Statistics

Mean

Mean is extremly sensitive to the presence of outlier.

Median

Mode

Variance