Kickstart ML with Python snippets

Introduction to Hypothesis Testing with Python

Hypothesis Testing is a statistical method used to make decisions about a population parameter based on sample data. It involves formulating two competing hypotheses: the null hypothesis and the alternative hypothesis.

Key Concepts

1. Null Hypothesis (H0)

Definition: The null hypothesis is a statement that there is no effect or no difference, and it represents the status quo or a baseline condition. It is what we seek to test against.
Example: There is no difference in the mean test scores of two groups of students.

2. Alternative Hypothesis (H1 or Ha)

Definition: The alternative hypothesis is a statement that indicates the presence of an effect or a difference. It is the claim we look for evidence to support; a test can provide evidence for it but cannot prove it.
Example: There is a difference in the mean test scores of two groups of students.
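
To make the student-score example concrete, here is a minimal sketch of how the two hypotheses could be tested with scipy. The scores are synthetic, and the group names, means, and sizes are assumptions chosen only for illustration. Welch's t-test is used here as one reasonable choice, not the only one.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for two groups of students (synthetic data).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=70, scale=10, size=40)
group_b = rng.normal(loc=75, scale=10, size=40)

# H0: the two groups have the same mean score.
# H1: the two groups have different mean scores (two-sided).
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```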

Steps in Hypothesis Testing

Formulate Hypotheses:
- Define the null hypothesis (H0) and the alternative hypothesis (H1).
Select Significance Level (α):
- Choose a significance level, commonly set at 0.05, which represents a 5% risk of rejecting the null hypothesis when it is true.
Choose the Appropriate Test:
- Select a statistical test based on the data and the hypotheses. Common tests include t-tests, chi-square tests, and ANOVA.
Calculate Test Statistic:
- Compute the test statistic using sample data.
Determine the p-value:
- The p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one computed from the sample.
Make a Decision:
- Compare the p-value to the significance level (α). If the p-value is less than α, reject the null hypothesis; otherwise, fail to reject it.
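
The steps above can be sketched by hand for a two-sided one-sample t-test. The helper name one_sample_t_test and the measurements are hypothetical; in practice a library routine such as scipy's ttest_1samp does the same calculation.

```python
import numpy as np
from scipy import stats

def one_sample_t_test(sample, mu0, alpha=0.05):
    """Two-sided one-sample t-test computed step by step (illustrative helper)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Test statistic: distance between the sample mean and mu0,
    # measured in units of the standard error of the mean.
    standard_error = sample.std(ddof=1) / np.sqrt(n)
    t_stat = (sample.mean() - mu0) / standard_error
    # p-value: probability under H0 of a |t| at least this large (df = n - 1).
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    # Decision: reject H0 only if the p-value is below alpha.
    reject = bool(p_value < alpha)
    return t_stat, p_value, reject

measurements = [5.1, 4.9, 5.6, 5.8, 4.7, 5.3, 5.9, 5.2]  # hypothetical data
t_stat, p_value, reject = one_sample_t_test(measurements, mu0=5.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```

Writing the calculation out this way shows that the p-value depends only on the test statistic and the degrees of freedom.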

Practical Example in Python

Let's walk through an example of hypothesis testing using Python and the scipy library.

Example: One-Sample t-Test

Suppose we want to test whether the population from which a sample was drawn has a mean equal to a known (hypothesized) value.

Import Libraries:

import numpy as np
from scipy import stats

Generate Sample Data:

# Generate a sample dataset
np.random.seed(0)
sample_data = np.random.normal(loc=5, scale=2, size=30)

# Known population mean
population_mean = 5

Formulate Hypotheses:

Null Hypothesis (H0): The mean of the population from which the sample was drawn is equal to the known population mean (5).
Alternative Hypothesis (H1): The mean of the population from which the sample was drawn is not equal to the known population mean (5).

Select Significance Level (α):

alpha = 0.05

Perform One-Sample t-Test:

# Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample_data, population_mean)

print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")

Make a Decision:

# Compare p-value to significance level
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Data Generation:
- We draw 30 values from a normal distribution with mean 5 and standard deviation 2, and set the population mean to test against to 5.
Hypotheses Formulation:
- We define our null and alternative hypotheses based on the problem statement.
Significance Level:
- We choose a significance level (α) of 0.05, which is a common threshold in hypothesis testing.
Performing the Test:
- We use the ttest_1samp function from the scipy.stats module to perform a one-sample t-test.
- The test returns a t-statistic and a p-value.
Decision Making:
- We compare the p-value to our significance level to decide whether to reject or fail to reject the null hypothesis.

Practical Advice

Choosing the Right Test:
- Use t-tests for comparing one or two means, chi-square tests for categorical (count) data, and ANOVA for comparing the means of more than two groups.
- Ensure that the assumptions of the chosen test (e.g., normality, independence) are met.
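
As a rough illustration of the other tests mentioned above, the sketch below runs a chi-square test of independence, a one-way ANOVA, and a Shapiro-Wilk normality check with scipy.stats. The contingency table and the three groups are hypothetical, and Shapiro-Wilk is only one of several ways to examine the normality assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Chi-square test of independence on a hypothetical 2x2 contingency table
# (rows: group, columns: outcome counts).
observed = np.array([[30, 10],
                     [20, 20]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square: statistic = {chi2:.3f}, p = {chi2_p:.4f}, dof = {dof}")

# One-way ANOVA comparing the means of three hypothetical groups.
g1 = rng.normal(10.0, 2.0, size=25)
g2 = rng.normal(10.5, 2.0, size=25)
g3 = rng.normal(12.0, 2.0, size=25)
f_stat, anova_p = stats.f_oneway(g1, g2, g3)
print(f"ANOVA: F = {f_stat:.3f}, p = {anova_p:.4f}")

# Shapiro-Wilk test as one rough check of the normality assumption.
shapiro_stat, shapiro_p = stats.shapiro(g1)
print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.4f}")
```
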
Interpreting p-values:
- A p-value less than the significance level (α) means the data are unlikely enough under the null hypothesis that we reject it. It measures neither the size of the effect nor the probability that the null hypothesis is true.
- A p-value greater than or equal to α indicates insufficient evidence to reject the null hypothesis.
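
One way to build intuition for α is to simulate many experiments in which the null hypothesis is true and count how often it is rejected anyway. The sketch below does this for the one-sample t-test; the number of experiments and the seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 2000

# Each row is one experiment of 30 observations drawn with true mean 5,
# so H0 (mean == 5) is true in every experiment.
samples = rng.normal(loc=5, scale=2, size=(n_experiments, 30))
_, p_values = stats.ttest_1samp(samples, popmean=5, axis=1)

# Fraction of experiments in which H0 is (wrongly) rejected.
false_positive_rate = np.mean(p_values < alpha)
print(f"Share of experiments rejecting H0: {false_positive_rate:.3f}")
```

The printed share should land close to α, which is the sense in which α is the risk of rejecting a true null hypothesis.
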
Avoiding Common Pitfalls:
- Do not confuse failing to reject the null hypothesis with accepting it. It simply means there isn't enough evidence against it.
- Ensure the sample size is adequate to detect a meaningful effect.
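
To see why sample size matters, the following sketch estimates by simulation how often a one-sample t-test detects a real difference at several sample sizes. The true mean, the standard deviation, and the sample sizes are assumptions picked for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05
true_mean, null_mean, sd = 5.5, 5.0, 2.0  # hypothetical true effect of 0.5 units
n_simulations = 2000

def estimated_power(n):
    """Fraction of simulated samples of size n in which H0 is rejected."""
    samples = rng.normal(loc=true_mean, scale=sd, size=(n_simulations, n))
    _, p_values = stats.ttest_1samp(samples, popmean=null_mean, axis=1)
    return np.mean(p_values < alpha)

power_by_n = {n: estimated_power(n) for n in (10, 30, 100, 300)}
for n, power in power_by_n.items():
    print(f"n = {n:>3}: estimated power = {power:.2f}")
```

With small samples the test often fails to reject the null hypothesis even though the effect is real, which is one reason failing to reject is not the same as accepting.
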
Reporting Results:
- Clearly state the hypotheses, the test used, the test statistic, the p-value, and the decision.
- Provide context and implications of the findings.
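
A report along these lines can be assembled directly from the test output. The helper report_one_sample_t below is hypothetical and its format is only one possible convention. It reuses the sample from the one-sample t-test example.

```python
import numpy as np
from scipy import stats

def report_one_sample_t(sample, mu0, alpha=0.05):
    """Return a one-line summary of a two-sided one-sample t-test (illustrative helper)."""
    sample = np.asarray(sample, dtype=float)
    t_stat, p_value = stats.ttest_1samp(sample, mu0)
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return (
        f"One-sample t-test, H0: mean = {mu0}, H1: mean != {mu0}; "
        f"n = {sample.size}, sample mean = {sample.mean():.2f}, "
        f"t({sample.size - 1}) = {t_stat:.2f}, p = {p_value:.3f}, "
        f"alpha = {alpha}: {decision}."
    )

# Same data as in the one-sample t-test example.
np.random.seed(0)
sample_data = np.random.normal(loc=5, scale=2, size=30)
summary = report_one_sample_t(sample_data, mu0=5)
print(summary)
```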
