Data Distributions in Statistics
This post plots and compares data distributions commonly used in statistics and data analysis.
The code below generates the plots. It uses scipy.stats (https://docs.scipy.org/doc/scipy/reference/stats.html) for the distributions, matplotlib for plotting, and seaborn for plot styling.
Properties of distributions
Distributions are commonly described by three kinds of properties: central tendency, dispersion, and shape.
Central tendency
Central tendency measures where the center of a distribution lies. Common measures include:
- Mean (μ or x̄): The arithmetic average of all values in the distribution.
- Median (M or x̃): The middle value when the data is ordered from least to greatest.
- Mode (Mo): The most frequently occurring value in the distribution.
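The three measures above can be computed directly with NumPy. The following is a minimal sketch; the sample in `data` is hypothetical and chosen only so that the mean, median, and mode all differ.

```python
import numpy as np

# Hypothetical sample, used only for illustration
data = np.array([2, 3, 3, 5, 7, 8, 8, 8, 10])

mean = data.mean()        # arithmetic average
median = np.median(data)  # middle value of the sorted data

# Mode: the value with the highest count
values, counts = np.unique(data, return_counts=True)
mode = values[counts.argmax()]

print(f"mean={mean:.2f}, median={median:.1f}, mode={mode}")
# prints: mean=6.00, median=7.0, mode=8
```

In this sample the mean lies below the median, which in turn lies below the mode; this ordering often accompanies a longer tail on the left.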
Dispersion
Dispersion measures how spread out the values in a distribution are. Common measures include:
- Range (R): The difference between the maximum and minimum values.
- Variance (σ² or s²): The average of the squared differences from the mean. The sample variance s² divides by n - 1 rather than n.
- Standard Deviation (σ or s): The square root of the variance, giving a typical distance of values from the mean in the same units as the data.
- Interquartile Range (IQR or Q₃ - Q₁): The range between the first quartile (25th percentile) and the third quartile (75th percentile), representing the middle 50% of the data.
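As a rough illustration of the dispersion measures above, the sketch below computes each of them with NumPy on a hypothetical sample. The `ddof` argument switches between the population form (divide by n) and the sample form (divide by n - 1) of the variance and standard deviation.

```python
import numpy as np

# Hypothetical sample, used only for illustration
data = np.array([2, 3, 3, 5, 7, 8, 8, 8, 10])

data_range = data.max() - data.min()    # maximum minus minimum
var_pop = data.var(ddof=0)              # population variance: divide by n
var_sample = data.var(ddof=1)           # sample variance: divide by n - 1
std_sample = data.std(ddof=1)           # square root of the sample variance
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1                           # spread of the middle 50% of the data

print(f"range={data_range}, var_pop={var_pop:.3f}, var_sample={var_sample:.3f}")
print(f"std_sample={std_sample:.3f}, IQR={iqr}")
```

Note that NumPy defaults to `ddof=0`, while pandas defaults to `ddof=1`, so the same data can give slightly different variances depending on the library.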
Shape
Shape describes the overall form of the distribution, including its symmetry and the presence of tails. Common measures include:
- Skewness (γ₁ or skew): A measure of the asymmetry of the distribution. Positive skew indicates a longer tail on the right side, while negative skew indicates a longer tail on the left side.
- Kurtosis (γ₂ or kurt): A measure of the "tailedness" of the distribution. Higher kurtosis indicates more frequent extreme deviations from the mean.
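To make the two shape measures concrete, the sketch below asks scipy.stats for the theoretical skewness and kurtosis of a symmetric distribution (normal) and a right-skewed one (exponential). The choice of the exponential distribution as the contrast case is an assumption made for illustration. Note that `stats(moments='sk')` reports excess kurtosis, so the normal distribution comes out as 0 rather than 3.

```python
from scipy.stats import norm, expon

# Theoretical skewness ('s') and excess kurtosis ('k') of each distribution
norm_skew, norm_kurt = (float(v) for v in norm.stats(moments='sk'))
expon_skew, expon_kurt = (float(v) for v in expon.stats(moments='sk'))

print(f"normal:      skew={norm_skew:.1f}, excess kurtosis={norm_kurt:.1f}")
print(f"exponential: skew={expon_skew:.1f}, excess kurtosis={expon_kurt:.1f}")
# prints:
# normal:      skew=0.0, excess kurtosis=0.0
# exponential: skew=2.0, excess kurtosis=6.0
```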
# Import libraries
import numpy as np
from scipy.stats import norm, skewnorm
import matplotlib.pyplot as plt
import seaborn as sns
# Set seaborn style for better-looking plots
sns.set_style("whitegrid")
# Create x values ranging from -10 to 10 with 500 points
x = np.linspace(-10, 10, 500)
# ===== PLOT 1: Normal Distributions with Different Standard Deviations =====
plt.figure(figsize=(12, 6))
# Loop through 4 different standard deviation values
for std in [1, 2, 3, 4]:
    # Create a normal distribution with mean=0 and current std
    dist = norm(loc=0, scale=std)
    # Calculate the probability density function (PDF) values
    y = dist.pdf(x)
    # Plot the distribution curve
    plt.plot(x, y, label=f'σ={std}', linewidth=2)
    # Shade the area within ±1 standard deviation from the mean
    x_shade = x[(x >= -std) & (x <= std)]  # Select x values within ±std
    y_shade = dist.pdf(x_shade)  # Get corresponding y values
    plt.fill_between(x_shade, y_shade, alpha=0.2)  # Fill with transparency
# Label the x-axis
plt.xlabel('x')
# Label the y-axis
plt.ylabel('Probability Density')
# Add title to the plot
plt.title('Normal Distributions (μ=0) with Increasing Standard Deviations')
# Display the legend
plt.legend()
# Show the plot
plt.show()
# ===== PLOT 2: How Skewness Changes the Shape =====
plt.figure(figsize=(12, 6))
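# Create x values for skewed distributions
# The non-positive α values used below put the long tail on the left, so the range extends further left than right
x_skew = np.linspace(-10, 5, 500)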
# Loop through different skewness parameter values
# Negative values = left skew, Positive values = right skew
for skewness in [-5, -2, -1, -0.5, -0.25, 0]:
    # Create a skew-normal distribution
    # a = skewness parameter, loc = location, scale = spread
    dist = skewnorm(a=skewness, loc=0, scale=2)
    # Calculate PDF values
    y = dist.pdf(x_skew)
    # Get the actual mean of this distribution
    mean = dist.mean()
    # Get the actual skewness statistic (converted to a plain float for formatting)
    actual_skew = float(dist.stats(moments='s'))
    # Plot the distribution; mathtext draws the subscript, so no SUBSCRIPT ONE glyph is needed from the font
    plt.plot(x_skew, y, label=rf'α={skewness}, μ={mean:.2f}, $\gamma_1$={actual_skew:.2f}', linewidth=2)
    # Add a vertical line at the mean to show how it shifts
    # plt.axvline(mean, linestyle='--', linewidth=1, alpha=0.5)
# Label the axes
plt.xlabel('x')
plt.ylabel('Probability Density')
# Add title explaining what we're showing
plt.title('How Skewness Parameter (α) Changes the Shape of the Distribution')
# Display legend
plt.legend()
# Show the plot
plt.show()
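The mean and skewness shown in the legend of the second plot are not the same as the shape parameter α passed to skewnorm. As a sketch of how they relate, the code below compares scipy's values against the standard closed-form expressions for the skew-normal distribution. The helper name `skewnorm_mean_and_skew` is hypothetical and introduced only for this illustration.

```python
import numpy as np
from scipy.stats import skewnorm

def skewnorm_mean_and_skew(a, loc=0.0, scale=1.0):
    """Closed-form mean and skewness of a skew-normal distribution (illustrative helper)."""
    delta = a / np.sqrt(1 + a**2)   # maps the shape parameter into (-1, 1)
    m = delta * np.sqrt(2 / np.pi)  # mean of the standardized distribution
    mean = loc + scale * m
    skew = (4 - np.pi) / 2 * m**3 / (1 - m**2) ** 1.5
    return mean, skew

for a in [-5, -2, -1, -0.5, -0.25, 0]:
    dist = skewnorm(a=a, loc=0, scale=2)
    mean, skew = skewnorm_mean_and_skew(a, loc=0, scale=2)
    # The closed-form values should match what scipy reports
    assert np.isclose(mean, dist.mean())
    assert np.isclose(skew, float(dist.stats(moments='s')))
    print(f"a={a}: mean={mean:.2f}, skewness={skew:.2f}")
```

Because the standardized mean is bounded, the skewness of a skew-normal distribution stays below roughly 1 in absolute value no matter how large α becomes, which is why the curves for α=-2 and α=-5 look more alike than the parameter values suggest.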
Normal Distribution
A normal continuous random variable, available in scipy.stats as norm. The normal distribution is also known as the Gaussian distribution.
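As a quick check of what the norm object computes, the sketch below evaluates the normal density by hand and compares it with `norm.pdf`, then uses the CDF to measure how much probability lies within 1, 2, and 3 standard deviations of the mean. The values of `mu` and `sigma` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 2.0  # arbitrary illustrative parameters
dist = norm(loc=mu, scale=sigma)

# Normal density written out explicitly, compared with scipy's implementation
x = np.linspace(-10, 10, 5)
pdf_manual = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
assert np.allclose(pdf_manual, dist.pdf(x))

# Probability mass within k standard deviations of the mean
within = {k: dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"P(|X - mu| <= {k} sigma) = {p:.4f}")
# prints:
# P(|X - mu| <= 1 sigma) = 0.6827
# P(|X - mu| <= 2 sigma) = 0.9545
# P(|X - mu| <= 3 sigma) = 0.9973
```

These proportions do not depend on the chosen `mu` and `sigma`, so a band of ±1 standard deviation around the mean always covers about 68% of the area under a normal curve.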