A t-test is a hypothesis test used to compare the mean of a population to a given value (one-sample t-test) or to compare the means of two populations (two-sample t-test). In this post, we’ll show you how to conduct a two-sample t-test in Python using the SciPy library. We’ll cover the basic syntax and a few key arguments you can use to further configure your hypothesis test.
For this example, we’re using a dataset about Adidas sales in the United States.
Because the t-test is a statistical test, the data has to meet certain assumptions before we can have high confidence in the results. For example:
- Both datasets should have equivalent variance
- Each value should be independent of other values in the dataset
- The data should be normally distributed
- The data should be continuous
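As a rough sketch of how you might check two of these assumptions, the snippet below runs Levene's test (equal variance) and the Shapiro-Wilk test (normality) from SciPy. The samples are randomly generated stand-ins, not the Adidas data, and the variable names are ours. These tests are only one possible way to check the assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two samples (hypothetical data, not the Adidas dataset)
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=45, scale=10, size=200)
sample_b = rng.normal(loc=50, scale=10, size=200)

# Levene's test: the null hypothesis is that both samples have equal variance
levene_stat, levene_p = stats.levene(sample_a, sample_b)

# Shapiro-Wilk test: the null hypothesis is that a sample is normally distributed
shapiro_a = stats.shapiro(sample_a)
shapiro_b = stats.shapiro(sample_b)

print(f"Levene p-value: {levene_p:.3f}")
print(f"Shapiro-Wilk p-values: {shapiro_a.pvalue:.3f}, {shapiro_b.pvalue:.3f}")
```

A small p-value from either test suggests the corresponding assumption may not hold for your data.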
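Assuming we’ve met these criteria, we need to establish a null and an alternative hypothesis to run the test:
- Null hypothesis: the means of the two underlying distributions are the same
- Alternative hypothesis: the means of the two underlying distributions are not the same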
To reject the null hypothesis at the 5% significance level (often described as a 95% confidence level), the test needs to yield a p-value of less than 0.05. While you can calculate the test statistic and p-value by hand, SciPy has a convenient function, ttest_ind(), which will run a two-sample t-test for you.
The function takes 2 key arguments, a and b, which represent the two samples you are comparing. In this case, we're comparing the price per unit for sales in the Northeast and West regions. We might expect that the prices per unit are not meaningfully different.
from scipy import stats

ne_lst = list(northeast["Price per Unit"])
w_lst = list(west["Price per Unit"])

# Run two sample t-test
# Assumes equal variance. You can also set `equal_var = False`
# In that case, the function will run a Welch's t-test
stats.ttest_ind(ne_lst, w_lst)
From the results of the t-test, we can see the p-value is very small, ~1.08e-16. This means we can reject the null hypothesis that the means of the underlying distributions are the same.
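To make the mechanics concrete, here is a minimal sketch that runs ttest_ind() on synthetic data and then recomputes the pooled-variance t-statistic and two-sided p-value by hand. The numbers are randomly generated (the means loosely echo this example, the spread is made up), and all variable names are ours rather than part of the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic samples (hypothetical values, not the Adidas dataset)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=46.7, scale=14, size=300)
group_b = rng.normal(loc=49.9, scale=14, size=300)

# Standard two-sample t-test (assumes equal variance)
result = stats.ttest_ind(group_a, group_b)

# Welch's t-test (does not assume equal variance)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Same calculation by hand, using the pooled sample variance
n_a, n_b = len(group_a), len(group_b)
dof = n_a + n_b - 2
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / dof
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)  # two-sided p-value

# Compare the p-value against the chosen significance level
alpha = 0.05
reject_null = result.pvalue < alpha

print(f"SciPy:  t = {result.statistic:.4f}, p = {result.pvalue:.4g}")
print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4g}")
print(f"Welch:  t = {welch.statistic:.4f}, p = {welch.pvalue:.4g}")
```

The result object exposes the test statistic and p-value as the attributes statistic and pvalue, so you can compare the p-value to your significance level directly in code.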
Let's do a quick gut check to see whether the results make sense. First, we can plot the underlying distributions:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

# Subset data
northeast = df[df["Region"] == "Northeast"]
west = df[df["Region"] == "West"]

# Plot data
fig, ax1 = plt.subplots(nrows=1, ncols=1)
hist1 = sns.histplot(ax=ax1, x=northeast["Price per Unit"], label="Northeast")
hist2 = sns.histplot(ax=ax1, x=west["Price per Unit"], label="West")
plt.legend()
plt.show()
Based on the graph, we can see that the distribution of prices for the West region has more weight at higher prices than the distribution of prices for the Northeast region.
Calculating the average for the two samples shows this as well: the average price for the Northeast is $46.69, while the average price for the West is $49.94.
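One way to compute per-region averages like these is a pandas groupby. The sketch below uses a tiny made-up DataFrame called df_demo; only the column names "Region" and "Price per Unit" come from this example, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for the sales data; only the column names
# "Region" and "Price per Unit" match the dataset used in this post
df_demo = pd.DataFrame({
    "Region": ["Northeast", "Northeast", "Northeast", "West", "West", "West"],
    "Price per Unit": [40.0, 45.0, 50.0, 45.0, 50.0, 55.0],
})

# Average price per unit for each region
mean_by_region = df_demo.groupby("Region")["Price per Unit"].mean()
print(mean_by_region)
```

With the real data, the same groupby on df would give the average for every region at once.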
ttest_ind() also accepts several optional arguments. If you have a small sample size, for example, you can use the permutations argument, which runs a permutation test on your data. The two samples are pooled together, and each value is randomly assigned to group a or b. The t-statistic is calculated for this shuffled data, and the process is repeated many times. You can then compare the t-statistic for the observed data with the distribution of simulated t-statistics.
If you’re running a permutation test and want reproducible results, it’s advised to set the random_state argument. It does not matter what value you set random_state to, just that it is set to some fixed value.
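The procedure described above can be sketched by hand with NumPy. This is an illustration of the idea only: permutation_ttest is a hypothetical helper we wrote for this post, the two small samples are made up, and SciPy's own implementation may differ in its details. The handling of these arguments has also changed across SciPy releases, so check the documentation for your installed version.

```python
import numpy as np
from scipy import stats

def permutation_ttest(a, b, n_resamples=2000, seed=0):
    """Hand-rolled two-sided permutation test on the t-statistic (illustrative helper)."""
    rng = np.random.default_rng(seed)  # a fixed seed makes the shuffles reproducible
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # t-statistic for the data as actually observed
    observed = stats.ttest_ind(a, b).statistic

    # Pool the samples, then repeatedly reassign values to the two groups at random
    pooled = np.concatenate([a, b])
    simulated = np.empty(n_resamples)
    for i in range(n_resamples):
        shuffled = rng.permutation(pooled)
        simulated[i] = stats.ttest_ind(shuffled[:len(a)], shuffled[len(a):]).statistic

    # Share of simulated statistics at least as extreme as the observed one
    # (the +1 terms count the observed arrangement itself)
    p_value = (np.sum(np.abs(simulated) >= abs(observed)) + 1) / (n_resamples + 1)
    return observed, p_value

# Two small made-up samples
small_a = [12.1, 14.3, 13.8, 15.2, 12.9, 14.7]
small_b = [15.9, 17.2, 16.4, 18.1, 15.5, 17.8]

obs, p_perm = permutation_ttest(small_a, small_b)
print(f"observed t = {obs:.3f}, permutation p-value = {p_perm:.4f}")
```

Because the seed is fixed, calling the helper again with the same inputs returns the same p-value, which is the role random_state plays in ttest_ind().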
The alternative argument defines what kind of two-sample t-test you are running: one-sided or two-sided. It accepts three values:
- ‘two-sided’: the default, a two-sided t-test. The alternative hypothesis states that the means of the distributions are unequal
- ‘less’: a one-sided t-test. The alternative hypothesis states that the first sample comes from a distribution whose mean is less than that of the distribution underlying the second sample
- ‘greater’: a one-sided t-test. The alternative hypothesis states that the first sample comes from a distribution whose mean is greater than that of the distribution underlying the second sample
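As a quick illustration of the three options, the sketch below runs the same comparison with each value of alternative. The data is synthetic (randomly generated, not the Adidas dataset) and the variable names are ours.

```python
import numpy as np
from scipy import stats

# Synthetic samples (hypothetical values, not the Adidas dataset)
rng = np.random.default_rng(7)
first = rng.normal(loc=46.7, scale=14, size=150)
second = rng.normal(loc=49.9, scale=14, size=150)

# Same data, three different alternative hypotheses
p_two_sided = stats.ttest_ind(first, second, alternative="two-sided").pvalue
p_less = stats.ttest_ind(first, second, alternative="less").pvalue
p_greater = stats.ttest_ind(first, second, alternative="greater").pvalue

print(f"two-sided: {p_two_sided:.4g}")
print(f"less:      {p_less:.4g}")
print(f"greater:   {p_greater:.4g}")
```

For the standard t-test, the two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them, so choosing a one-sided alternative in the direction of the observed difference halves the p-value.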