Hypothesis Testing with Prompt AI

Becca Weng - December 12th, 2023

Hypothesis testing is a key part of any data scientist's or data analyst's toolkit. As the name suggests, this branch of statistics lets you test hypotheses about your data. In this example, we will walk through a two-sample t-test (not to be confused with a two-sided t-test; more on this later) and provide additional vocabulary and resources for other kinds of hypothesis tests.

The Data and Notebook

For this tutorial, I'll be using a classic dataset, penguins, which I've loaded from seaborn. The dataset is quite small and records each penguin's species, island, bill length, bill depth, flipper length, body mass, and sex.

You can follow along on the canvas below, and you can open it and fork it if you would like your own copy:

Prompt 1: Data Cleaning

For the following example, I decided to isolate two groups for our two samples: the penguins from Biscoe island and the penguins from Dream island. I click on the canvas, as you can see in the video, select Prompt, and type in the following prompt.

Subset for Biscoe and Dream island penguins
# Filter the penguin data for records where the island is either 'Biscoe' or 'Dream'
some_penguins = penguins[(penguins['island'] == 'Biscoe') | (penguins['island'] == 'Dream')]
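
The same filter can also be written with `Series.isin`, which stays readable if more islands are added later. This is only a sketch: it uses a tiny hand-made DataFrame called `toy_penguins` (the rows are invented for illustration, not taken from the real penguins data) so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for the penguins data; these rows are made up for illustration.
toy_penguins = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Biscoe", "Dream"],
    "body_mass_g": [4500.0, 3700.0, 3800.0, 5000.0, 3600.0],
})

# Keep only rows whose island appears in the list of islands we care about.
islands_of_interest = ["Biscoe", "Dream"]
toy_subset = toy_penguins[toy_penguins["island"].isin(islands_of_interest)]

# Sanity check: count the rows per island after filtering.
island_counts = toy_subset["island"].value_counts()
print(island_counts)
```

Checking the per-island counts after filtering is a quick way to confirm that only the intended groups remain.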

Prompt 2: Exploratory Data Analysis

Next, to examine the groups further, I'll take a look at the distribution of body mass for each group of penguins.

Use a box plot for the distribution of body mass by island.
import seaborn as sns

# Create a box plot to visualize the distribution of body mass by island using the 'some_penguins' dataframe
sns.boxplot(x='island', y='body_mass_g', data=some_penguins)

Code Output:

Based on the chart, we can see that the Biscoe penguins generally have a greater body mass than the Dream penguins. Next, we can check whether this difference is statistically significant using a t-test.
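
A numeric summary can be a useful companion to the box plot. The sketch below uses invented values rather than the real penguins data; with the tutorial's data, the same `groupby` call could be applied to `some_penguins` instead of the hypothetical `toy_penguins`.

```python
import pandas as pd

# Made-up body masses for illustration only (not the real penguins data).
toy_penguins = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Dream"],
    "body_mass_g": [4500.0, 5000.0, 4700.0, 3700.0, 3600.0, 3800.0],
})

# Count, mean, and median body mass for each island.
summary = toy_penguins.groupby("island")["body_mass_g"].agg(["count", "mean", "median"])
print(summary)
```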

Prompt 3: Two-sample t-test

As we are running a hypothesis test, we need to state the null and alternative hypotheses:

$H_0: \text{The means of the distributions underlying the two samples are the same.} \newline H_1: \text{The means of the distributions underlying the two samples are NOT the same.} \newline \newline H_0: \mu_{\text{Biscoe}} = \mu_{\text{Dream}} \newline H_1: \mu_{\text{Biscoe}} \neq \mu_{\text{Dream}}$

In less statistically accurate but perhaps more meaningful terms, the null hypothesis is that the average body mass of the Biscoe penguins is the same as the average body mass of the Dream penguins. The alternative hypothesis is that the average body mass of the Biscoe penguins is NOT the same as the average body mass of the Dream penguins.

Now I can ask Prompt to run the two-sample t-test. There are a few popular libraries that have functions for hypothesis tests. For today's example, I'm using SciPy:

Use scipy to run a two-sample t-test comparing body mass by island. State the null and alternative hypotheses being tested.
Before running the t-test, drop rows with missing data.

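The second instruction handles missing data. With SciPy's default settings, a single missing value (NaN) in either sample makes ttest_ind return NaN for both the statistic and the p-value. You could impute the missing values with a reasonable estimate instead, but in this case I've chosen to simply drop the incomplete rows. Note that dropna() with no arguments removes rows that have a missing value in any column, not just body mass.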
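
To illustrate the trade-off between dropping and imputing, here is a minimal sketch on a made-up DataFrame (the `toy` rows are invented, not real penguin measurements). It shows how one NaN affects `ttest_ind`, two ways of dropping rows, and one possible imputation strategy (filling with the island's mean, which is an assumption for illustration rather than a recommendation).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up rows for illustration; NaN / None mark missing values.
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Dream"],
    "body_mass_g": [4500.0, np.nan, 5000.0, 3700.0, 3600.0, 3800.0],
    "sex": ["Male", "Female", None, "Female", "Male", "Female"],
})

biscoe = toy.loc[toy["island"] == "Biscoe", "body_mass_g"]
dream = toy.loc[toy["island"] == "Dream", "body_mass_g"]

# With the default nan_policy='propagate', one NaN makes the result NaN.
raw_result = stats.ttest_ind(biscoe, dream)

# Option 1: drop rows with a missing value in ANY column (plain dropna()).
dropped_any = toy.dropna()

# Option 2: drop rows only when the column being tested is missing.
dropped_mass = toy.dropna(subset=["body_mass_g"])

# Option 3: impute missing body mass with the mean of the same island.
island_means = toy.groupby("island")["body_mass_g"].transform("mean")
imputed = toy.assign(body_mass_g=toy["body_mass_g"].fillna(island_means))

print(raw_result.pvalue, len(dropped_any), len(dropped_mass), len(imputed))
```

Option 2 keeps more rows than option 1 whenever the missing values sit in columns the test never uses, such as `sex` here.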

# Import the necessary package 'scipy.stats' to perform the two-sample t-test
from scipy import stats

# Drop rows with missing data before running the t-test
some_penguins_cleaned = some_penguins.dropna()

# Perform a two-sample t-test using 'scipy.stats.ttest_ind' to compare the body mass of penguins by island
biscoe_mass = some_penguins_cleaned.loc[some_penguins_cleaned['island'] == 'Biscoe', 'body_mass_g']
dream_mass = some_penguins_cleaned.loc[some_penguins_cleaned['island'] == 'Dream', 'body_mass_g']
t_stat, p_value = stats.ttest_ind(biscoe_mass, dream_mass)
t_stat, p_value

Code Output:

(12.771236008672712, 8.023975698647641e-30)

From the results, we have a t-statistic of about 12.77 and a p-value of about 8.02 × 10^-30. This p-value is far smaller than 0.05, the most commonly used significance level, so we reject the null hypothesis that the two underlying populations have the same mean body mass. Because the t-statistic is positive and the Biscoe sample was passed first, the data indicate that Biscoe penguins are heavier on average.
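
To make these numbers less of a black box, here is a sketch of what `ttest_ind` computes with its default settings (a pooled-variance Student's t-test) and how the decision rule is applied. The sample values and `alpha = 0.05` are assumptions for illustration, not the penguin data. Welch's variant (`equal_var=False`) is included because it does not assume equal variances; the tutorial itself uses the default.

```python
import numpy as np
from scipy import stats

# Made-up samples for illustration only (not the real penguins data).
group_a = np.array([4500.0, 5000.0, 4700.0, 4900.0, 5200.0])
group_b = np.array([3700.0, 3600.0, 3800.0, 3500.0, 3900.0])

n_a, n_b = len(group_a), len(group_b)
# Sample variances use ddof=1 (divide by n - 1).
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)

# Pooled variance, then the t-statistic for the difference in means.
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom.
dof = n_a + n_b - 2
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# SciPy's default (equal_var=True) should agree with the manual calculation.
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)

# Welch's t-test drops the equal-variance assumption.
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_scipy < alpha else "fail to reject H0"
print(t_manual, t_scipy, p_manual, p_scipy, decision)
```

The sign of the statistic follows the argument order: a positive value means the first sample has the larger mean.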

Key Terms and Vocabulary

To cap this post, I wanted to offer some terms and definitions for your reference in case you're interested in other kinds of hypothesis testing. This list of terms is not exhaustive, but if you’re interested, you can check out our post on how to use Python to run one-sample t-tests and two-sample t-tests for more.

• T-test: A statistical test used to determine whether there is a significant difference between the means of two groups, or between the mean of one group and a reference value.
• One-sample vs. two-sample t-test
  • One-sample t-test: Compares the mean of a single sample to a known or hypothesized value (for example, a population mean, or zero).
  • Two-sample t-test: Compares the means of two independent samples to assess whether they come from populations with different mean values.
• One-sided vs. two-sided t-test
  • One-sided (or one-tailed) t-test: Tests whether there is a significant difference in one specific direction (e.g., greater than or less than) between the population means.
  • Two-sided (or two-tailed) t-test: Tests whether there is a significant difference between the population means without specifying the direction of the difference.
• P-value: A measure used in hypothesis testing that indicates the strength of evidence against the null hypothesis. It is the probability of observing a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true.
• Confidence Interval: It is a range of values constructed from sample data that is likely to contain the true population parameter. For example, a 95% confidence interval implies that if the study were repeated many times and new confidence intervals were computed, approximately 95% of these intervals would contain the true population parameter.
• Type I Error: Occurs when the null hypothesis is incorrectly rejected when it is actually true. It signifies the conclusion that there is a significant effect or difference when there isn't one (false positive).
• Type II Error: Occurs when the null hypothesis is incorrectly not rejected when it is actually false. It means failing to detect a true effect or difference (false negative).
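
To make a few of these terms concrete, the sketch below runs a one-sample test, a one-sided two-sample test, and computes a 95% confidence interval for a mean. The data and the reference value of 4000 g are invented for illustration, and the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Made-up samples for illustration only.
group_a = np.array([4500.0, 5000.0, 4700.0, 4900.0, 5200.0])
group_b = np.array([3700.0, 3600.0, 3800.0, 3500.0, 3900.0])

# One-sample t-test: does the mean of group_a differ from a hypothetical 4000 g?
t_one, p_one = stats.ttest_1samp(group_a, popmean=4000.0)

# Two-sided two-sample test (the default): are the two means different?
t_two, p_two_sided = stats.ttest_ind(group_a, group_b)

# One-sided two-sample test: is the mean of group_a GREATER than that of group_b?
_, p_greater = stats.ttest_ind(group_a, group_b, alternative="greater")

# 95% confidence interval for the mean of group_a, based on the t distribution.
mean_a = group_a.mean()
sem_a = stats.sem(group_a)  # standard error of the mean (uses ddof=1)
ci_low, ci_high = stats.t.interval(0.95, len(group_a) - 1, loc=mean_a, scale=sem_a)

print(p_one, p_two_sided, p_greater, (ci_low, ci_high))
```

Because the observed difference here lies in the hypothesized direction, the one-sided p-value is half the two-sided one.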