Analyze Data with Python - Simple Statistics


Understanding Variance and Standard Deviation

What is Standard Deviation?

Standard deviation measures how spread out the values in a dataset are. It tells us how much the data varies from the average. To find it, we calculate the square root of the variance. The result is in the same units as the original data.


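import numpy as np
values = np.array([1, 3, 4, 2, 6, 3, 4, 5])
# population standard deviation (default, ddof=0 divides by n)
std_dev = np.std(values)
# sample standard deviation (ddof=1 divides by n - 1)
std_dev_sample = np.std(values, ddof=1)
print(std_dev)         # Output: 1.5
print(std_dev_sample)  # Output: 1.60 (approx.)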

How to Calculate Variance in Python

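Variance measures how much the data values differ from the average. In Python, you can calculate it using the NumPy `var()` function. By default it returns the population variance (dividing by n); pass `ddof=1` to get the sample variance (dividing by n - 1).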


import numpy as np
values = np.array([1, 3, 4, 2, 6, 3, 4, 5])
# calculate the sample variance of values (ddof=1 divides by n - 1)
variance_sample = np.var(values, ddof=1)
print(variance_sample)  # Output: 2.57 (approx.)

Understanding Standard Deviation Units

Standard deviation gives context to the average of a dataset. For example, the dataset [3, 5, 10, 14] has a mean of 8.0 and a (population) standard deviation of about 4.3. Because 14 is 6 units above the mean, it lies more than one standard deviation away from the average.
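As a minimal sketch of the example above, the distance of each value from the mean can be expressed in standard deviations. The variable names are illustrative, and the population form of `np.std()` (the default) is assumed because it matches the 4.3 figure.

```python
import numpy as np

data = np.array([3, 5, 10, 14])
mean = np.mean(data)      # 8.0
std_dev = np.std(data)    # population standard deviation, about 4.3

# Distance of each value from the mean, measured in standard deviations
distances = (data - mean) / std_dev

for value, distance in zip(data, distances):
    print(f"{value}: {distance:+.2f} standard deviations from the mean")
```

A distance larger than 1 in absolute value marks a value that sits more than one standard deviation from the mean.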

Calculating Standard Deviation in Python

To find the standard deviation in Python, use the NumPy `std()` function. It provides a simple way to understand how data values spread out from the average.

What Does Variance Tell Us?

Variance shows how spread out the data is. A high variance means the data points are far from the mean. A variance of 0 means all values are the same.
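A small illustration of these statements, using made-up arrays chosen only for the demonstration:

```python
import numpy as np

identical = np.array([5, 5, 5, 5])   # every value is the same
tight = np.array([9, 10, 11])        # values close to the mean of 10
spread = np.array([0, 10, 20])       # same mean, values far from it

print(np.var(identical))                 # 0.0
print(np.var(tight) < np.var(spread))    # True
```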

What is Variance?

Variance measures the average of the squared differences from the mean. It shows how much the data varies. The result is in squared units of the original data.
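The definition can be checked step by step against NumPy. This is a sketch using the same `values` array as the earlier examples; the intermediate variable names are illustrative.

```python
import numpy as np

values = np.array([1, 3, 4, 2, 6, 3, 4, 5])

mean = values.mean()                      # 3.5
squared_diffs = (values - mean) ** 2      # squared differences from the mean

# Population variance: average of the squared differences (divide by n)
variance_population = squared_diffs.mean()

# Sample variance: divide by n - 1 instead of n
variance_sample = squared_diffs.sum() / (len(values) - 1)

print(variance_population)                       # 2.25
print(np.isclose(variance_population, np.var(values)))        # True
print(np.isclose(variance_sample, np.var(values, ddof=1)))    # True
```

Taking the square root of either result gives the corresponding standard deviation, which is back in the original units.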

Quartiles, Quantiles, and Interquartile Range

Understanding Quantiles

Quantiles are the cut points that divide a dataset into equal-sized parts. For example, splitting the data into 4 equal parts takes three cut points; these particular quantiles are called quartiles.


import numpy as np
data = [1, 3, 5, 9, 20]
Q1 = np.quantile(data, 0.25)
Q2 = np.quantile(data, 0.5)  # Median
Q3 = np.quantile(data, 0.75)
print(f"Q1: {Q1}, Q2 (Median): {Q2}, Q3: {Q3}")
# Output: Q1: 3.0, Q2 (Median): 5.0, Q3: 9.0

What are Quartiles?

Quartiles split the data into four equal parts. The three points dividing these parts are called quartiles. For example, Q1, Q2 (median), and Q3 are quartiles.

Using NumPy’s quantile() Function

In Python, you can use `numpy.quantile()` to find values that divide the data into parts. For example, `numpy.quantile(data, 0.25)` gives the value at the first quartile.
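`numpy.quantile()` also accepts a list of fractions, so several cut points can be requested in one call. A short sketch, reusing the small dataset from the earlier example:

```python
import numpy as np

data = [1, 3, 5, 9, 20]

# Request Q1, the median, and Q3 in a single call
quartiles = np.quantile(data, [0.25, 0.5, 0.75])
print(quartiles)  # [3. 5. 9.]

# np.percentile takes the same cut point on a 0-100 scale
q1_as_percentile = np.percentile(data, 25)
print(q1_as_percentile == quartiles[0])  # True
```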

Quantiles and Group Sizes

If you have n quantiles, your dataset will be split into n+1 groups of equal size.
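One way to see this is to count how many values fall between the cut points. The sketch below assumes an illustrative dataset of the integers 1 to 100 and uses three quantiles, which should give four groups.

```python
import numpy as np

data = np.arange(1, 101)  # illustrative data: 1, 2, ..., 100

# Three quantiles (n = 3) ...
cut_points = np.quantile(data, [0.25, 0.5, 0.75])

# ... assign each value to the group it falls in (0, 1, 2, or 3)
group_index = np.digitize(data, cut_points)
group_sizes = np.bincount(group_index)

print(len(group_sizes))  # 4 groups (n + 1)
print(group_sizes)       # [25 25 25 25]
```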

What is the Median?

The median is the middle value of a dataset, splitting it into two halves. It is also called the 50th percentile or second quartile.
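A brief sketch with `np.median()`, using two small illustrative lists. With an even number of values, NumPy averages the two middle values.

```python
import numpy as np

odd_count = [1, 3, 5, 9, 20]
even_count = [1, 3, 5, 9]

print(np.median(odd_count))   # 5.0 (the middle value)
print(np.median(even_count))  # 4.0 (average of 3 and 5)

# The median is the same as the 0.5 quantile
print(np.median(odd_count) == np.quantile(odd_count, 0.5))  # True
```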

Understanding Interquartile Range

The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1), that is, IQR = Q3 - Q1. It measures the width of the range that contains the middle 50% of the data.
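A minimal sketch of computing the IQR as Q3 minus Q1 with `numpy.quantile()`, reusing the small dataset from the quantile example:

```python
import numpy as np

data = [1, 3, 5, 9, 20]

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1

print(iqr)  # 6.0
```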

Why IQR is Useful

The IQR is useful because it is largely unaffected by extreme values (outliers). It describes the spread of the middle 50% of the data, which makes it a more robust measure than the full range.
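The effect of an outlier can be sketched by comparing the IQR with the full range (maximum minus minimum). The data and the helper function `iqr()` below are hypothetical, written only for this illustration.

```python
import numpy as np

def iqr(values):
    """Illustrative helper: Q3 minus Q1."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    return q3 - q1

clean = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10])
with_outlier = np.array([2, 4, 4, 5, 6, 7, 8, 9, 1000])  # last value replaced

# The full range changes dramatically ...
print(clean.max() - clean.min())                # 8
print(with_outlier.max() - with_outlier.min())  # 998

# ... while the IQR stays the same
print(iqr(clean), iqr(with_outlier))            # 4.0 4.0
```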

Sample Mean vs. Population Mean and P-Values

Understanding Hypothesis Test Errors

A Type I error occurs when we incorrectly reject a true null hypothesis (a false positive). The acceptable rate for this error, called the significance level, is usually set to 0.05 (5%) or 0.01 (1%).

Type II Errors

A Type II error occurs when we fail to reject a false null hypothesis (a false negative). This means we miss detecting an effect that is actually there.
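Both error rates can be estimated by simulation. The sketch below is an illustration under stated assumptions, not a prescribed method: it assumes normally distributed data with a known standard deviation of 1, a two-sided z-test of the null hypothesis "mean = 0" at the 0.05 level, and arbitrary choices for the sample size, number of trials, and true effect (0.5). The function `rejection_rate()` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 30, 5000
critical_z = 1.96  # two-sided cutoff for a 0.05 significance level

def rejection_rate(true_mean):
    """Fraction of simulated samples in which H0 (mean = 0) is rejected."""
    samples = rng.normal(loc=true_mean, scale=1.0, size=(trials, n))
    z = samples.mean(axis=1) * np.sqrt(n)  # z-statistic with sigma = 1
    return np.mean(np.abs(z) > critical_z)

# H0 is true: every rejection is a Type I error (expected rate near 0.05)
type_i_rate = rejection_rate(0.0)

# H0 is false (true mean is 0.5): every non-rejection is a Type II error
type_ii_rate = 1 - rejection_rate(0.5)

print(f"Type I error rate:  {type_i_rate:.3f}")
print(f"Type II error rate: {type_ii_rate:.3f}")
```

Lowering the significance level reduces Type I errors but, for a fixed sample size, tends to increase Type II errors.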

What is the Central Limit Theorem?

The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution (provided it has a finite variance).
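A simulation can make this concrete. The sketch below assumes a strongly skewed population (an exponential distribution with mean 1) and arbitrary choices for the sample size and number of samples. The `skewness()` helper is illustrative; a value near 0 indicates a symmetric, bell-like shape.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_size, n_samples = 50, 5000

# Each row is one sample drawn from a skewed (exponential) population
samples = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

def skewness(x):
    """Illustrative helper: third standardized moment."""
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.std(x) ** 3

print(f"Skewness of raw data:     {skewness(samples.ravel()):.2f}")
print(f"Skewness of sample means: {skewness(sample_means):.2f}")

# The sample means cluster around the population mean (1.0), with a
# spread close to sigma / sqrt(sample_size)
print(f"Mean of sample means: {sample_means.mean():.3f}")
print(f"Std of sample means:  {sample_means.std():.3f}")
```

The sample means are far less skewed than the raw data, even though the underlying population is not normal.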

Understanding P-Values

In statistics, a p-value is the probability of observing results at least as extreme as those in our sample, assuming the null hypothesis is true. A small p-value means the data would be unlikely under the null hypothesis, so it indicates stronger evidence against it.
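As a minimal sketch, a two-sided p-value for a one-sample z-test can be computed with the standard library alone. All numbers below are hypothetical, chosen only for illustration, and the population standard deviation is assumed to be known.

```python
import math

# Hypothetical inputs (not taken from any real dataset)
population_mean = 50.0   # mean claimed by the null hypothesis
population_std = 5.0     # assumed known
sample_mean = 52.0
n = 25

# How many standard errors the sample mean is from the hypothesized mean
z = (sample_mean - population_mean) / (population_std / math.sqrt(n))

# Two-sided p-value: probability of a z-statistic at least this extreme
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 2), round(p_value, 4))  # 2.0 0.0455
```

With a significance level of 0.05, this p-value would lead to rejecting the null hypothesis; with a level of 0.01, it would not.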
