
Statistically, Coda should not have been a strong contender to win best picture at the Oscars.
The small, independent film had just three nominations but walked away from the 94th Academy Awards with trophies for each of them - including the big one, best picture. – Steven McIntosh, BBC
Statistically, when an entertainment column uses the word “statistically,” it is in no way relying on any statistics. In its Oscars recap, the BBC boldly proclaimed that Coda should not have won.
But the article got my engines running: could I produce a simple model to predict Academy Award winners based on historical performance?
Part 0: A layman’s guide to awards, by a layman, as it relates to this analysis
I do things to data for a living; I don’t cover an entertainment beat, so I’ll leave the specifics of awards season to the domain experts.
But in short, the Oscars are the most prestigious award for movies. They are decided by the votes of the Academy of Motion Picture Arts and Sciences. The Academy is largely made up of people working in film and entertainment, so its membership overlaps with the voting bodies of many other awards, and its choices may differ from critics’ reactions.
The Oscars are also one of the last major ceremonies of awards season, which makes them a fun target to predict from earlier awards. The other two “prestige” awards during the season are the Golden Globes and the BAFTAs, presented by the British Academy. For this predictive exercise, I tacked on one more: the SAG Awards, presented by the Screen Actors Guild, since SAG (a union) is largely a subset of the Oscar voting body.
There are many more awards, but these are the big ones, and I’ve linked everything below, so you can always take what I’ve provided and build something even bigger.
Part 1: Data assembly and transformation
Assembling data
First, I needed to assemble the data. The Google query “movie awards winners .csv” unfortunately did not turn up results. Fortunately, the internet’s favorite bot, ChatGPT (and its Python API cousin), let me pull results together fairly quickly.
I took data from 2001 to present for each of the awards listed above.
import openai

# Collect data on BAFTA awards via the (legacy) OpenAI Completions API
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="name the winners of the BAFTAS from 2001 for the following categories, Best Film, Best Director, Best Actor, Best Actress, Best Supporting Actor, Best Supporting Actress",
    temperature=1,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)
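The completion comes back as free text, so the next step was parsing it into rows. A minimal sketch of that step, assuming the model answers one category per line in a “Category: Film” format (that line format, and the df_bafta name, are illustrative assumptions):

import re
import pandas as pd

# The legacy Completions API returns the answer as plain text
raw_text = response["choices"][0]["text"]

# Assumed line format: "Best Film: <winning film>", one category per line
rows = []
for line in raw_text.strip().splitlines():
    match = re.match(r"\s*(?P<category>[^:]+):\s*(?P<film>.+)", line)
    if match:
        rows.append({"Category": match.group("category").strip(),
                     "Film": match.group("film").strip()})
df_bafta = pd.DataFrame(rows)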
Unfortunately, I had to manually fill in values for the last two years, since the bot was only trained on data through 2021.
The bigger problem, and listen up folks, is that the completion algorithms like to hallucinate winners.
- It had problems keeping a movie’s release year and its award year (the subsequent year) straight, so some movies would show up twice
- It just loved big movies that maybe had enough sentences written about them that they appeared to win
I did my best, for you, the reader, to clean this data up; a sketch of one such fix follows. I know that the Oscar winners are in fact correct. But if one or two other awards got wrongly categorized, well, the source material is all linked below, so feel free to extend the work.
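For instance, a hedged sketch of de-duplicating films the bot assigned to both their release year and their award year (the column names match the pivot below; which year’s record to keep is an assumption about how the duplicates looked):

# Keep a single record per (film, category) when the bot listed a film twice
df2 = df2.sort_values('Year').drop_duplicates(
    subset=['FilmCleansed', 'CategoryCleansed'], keep='first')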
Cleaning and organizing
I wanted to produce a table in which every award win would be a binary:
- 0: you did not win the award
- 1: you did win the award
Then it would just be a simple classification exercise: regress the Oscar results on all the other award results.
I did some of the cleaning and preprocessing in Excel and some in Python. Then a simple pivot made my data ready to go.
# Pivot to one row per (film, year), with a 0/1 column per award category
df_pivot = df2.pivot_table(index=['FilmCleansed', 'Year'],
                           columns='CategoryCleansed',
                           fill_value=0,
                           aggfunc='size').reset_index()
Output: one row per (film, year) pair, with a binary column for each award category.

Part 1.5: Data exploration
Just so no one thinks I skipped it: I did compute correlations and make bar charts! They were pretty, but this article is already too long.
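For the curious, a minimal sketch of that exploration, assuming the pivoted table from Part 1 (matplotlib here is an assumption, not necessarily what I used):

import matplotlib.pyplot as plt

# Correlation of every award column with the Oscars Best Picture outcome
award_cols = [c for c in df_pivot.columns if c not in ('FilmCleansed', 'Year')]
corr = df_pivot[award_cols].corr()['Oscars Best Picture'].drop('Oscars Best Picture')

# Bar chart of which awards track the eventual Oscar winner most closely
corr.sort_values().plot(kind='barh', title='Correlation with Oscars Best Picture')
plt.tight_layout()
plt.show()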

Part 2: The Machine Learning Part
So I cheated. I used an AutoML tool to tune my hyperparameters, but that is because I work at a software company with an AutoML widget in our Python notebooks.
But if you wanted to search by hand, you could try something like this (and I am not trying to over-engineer this one):
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

## Get a subset of years to be the holdout
# df_model is the pivoted award table from Part 1
distinct_years = df_model['Year'].unique()
subset_years = distinct_years[:int(len(distinct_years) * 0.25)]
# Force the most recent year into the holdout so it gets scored
subset_years = np.append(subset_years, 2022)
df_subset = df_model[df_model['Year'].isin(subset_years)]
df_remainder = df_model[~df_model['Year'].isin(subset_years)]

# "All the other columns": every award indicator except the target and identifiers
feature_cols = [c for c in df_model.columns
                if c not in ('FilmCleansed', 'Year', 'Oscars Best Picture')]
y_train = df_remainder["Oscars Best Picture"]
X_train = df_remainder[feature_cols]
y_test = df_subset["Oscars Best Picture"]
X_test = df_subset[feature_cols]

# Set search parameters
parameters = {'learning_rate': [0.01, 0.02, 0.03],
              'n_estimators': [25, 55, 100],
              'max_depth': [1, 5, 10],
              'min_child_weight': [1, 2, 3],
              'subsample': [0.5, 0.7, 1],
              'colsample_bytree': [1]}

xgb_model = xgb.XGBClassifier(objective='binary:logistic', seed=123)
clf = RandomizedSearchCV(xgb_model, parameters, cv=5, n_jobs=4, verbose=1, scoring='f1')
clf.fit(X_train, y_train)

# Print best parameters and collect per-candidate CV scores
print(str(clf.best_params_).replace("'", '"'))
search_results = pd.concat([pd.DataFrame(clf.cv_results_["params"]),
                            pd.DataFrame(clf.cv_results_["mean_test_score"], columns=["F1"])],
                           axis=1)
It turned out that the AutoML hyperparameters were close to what RandomizedSearchCV could get me, with some added benefit from a non-0.5 probability threshold and a few other bells and whistles. When I discuss values in writing, they will be based on predictions from the AutoML-trained Python pipeline… but they should largely align with the simpler hand-trained model above as well.
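If you want to reproduce the thresholding trick by hand, here is a minimal sketch: sweep a grid of probability cutoffs and keep the F1-maximizing one. (Properly you would tune this on a validation split rather than the final holdout; this is my reconstruction, not what the AutoML tool does internally.)

from sklearn.metrics import f1_score

# Probability of winning Best Picture for each holdout film
probs = clf.predict_proba(X_test)[:, 1]

# Sweep candidate cutoffs and keep the one with the best F1
best_threshold, best_f1 = 0.5, 0.0
for threshold in np.arange(0.05, 0.95, 0.05):
    f1 = f1_score(y_test, (probs >= threshold).astype(int))
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1
print(f"Best threshold: {best_threshold:.2f}, holdout F1: {best_f1:.3f}")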
We reached a holdout F1 of 0.714, which is quite good. And since this is a fun project, I avoided the FOMO of endlessly chasing a higher score.

Part 3: Best Picture
The top features that mattered were performances at the SAG Awards and the Golden Globes. Why? My uneducated guess is that the close alignment between the SAG membership and the Academy voters creates a tight relationship, while the Golden Globes may offer a perspective orthogonal to the artist-centric Academy and SAG. The BAFTAs do not show up as influential for any of the later categories either.
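That reading comes from the model’s feature importances. With the hand-rolled version from Part 2, you could pull the same view like this (a sketch, reusing clf and X_train from above):

# Rank predictors by importance from the best cross-validated model
importances = pd.Series(clf.best_estimator_.feature_importances_,
                        index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))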
Remembering that I am a data scientist and not an economist, I forge on without worrying too much about causality.

So who wins? Everything Everywhere All at Once (EEAAO) takes first place, with a grand probability of 46%.
The model really likes its win at the SAG Awards. However, The Fabelmans ranks highly because, historically, dramas tend to fare better than comedies in the coveted Best Picture category. There is nothing the Academy likes more than a film about film, so if someone is giving you 100:1 odds, maybe give it a shot.
Otherwise, you might note that 46% isn’t actually greater than 50% ... which is exactly why the classification threshold sits somewhat below 0.5.
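For reference, the rankings above come straight from predicted probabilities. A sketch of producing the leaderboard for the most recent award year in the table (the 46% figure is from the AutoML pipeline, so the hand-trained model will give similar but not identical numbers):

# Score the latest year's films and rank them by predicted win probability
latest = df_model[df_model['Year'] == df_model['Year'].max()]
leaderboard = latest[['FilmCleansed']].assign(
    win_probability=clf.predict_proba(latest[feature_cols])[:, 1]
).sort_values('win_probability', ascending=False)
print(leaderboard)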


2022 Best Picture: Coda Revisited
Let’s revisit last year’s win by Coda for a minute, and just call out everyone who was surprised. By our model’s standards, Coda was the favorite as well. Even with far fewer nominations than competitors like The Power of the Dog, it looked the most promising of the year’s field.


Just in the interest of fairness, we would have missed a year earlier: the model preferred The Trial of the Chicago 7 to Nomadland for the 2021 Best Picture Oscar … but neither was a strong enough contender to clear the probability threshold and be predicted outright as a winner.

Part 4: Best Actor in a Leading Role
The model here once again shunned the BAFTAs and put all of the weight on the SAG Awards and the Golden Globes’ Drama category. The Globes’ Musical/Comedy category showed a listless sign of life, but otherwise no other category meaningfully mattered.

So with the SAG Awards and Golden Globes split, Colin Farrell in The Banshees of Inisherin lags, since he won the Golden Globe for Best Actor in a comedy rather than a drama (though contextually the film might be viewed with the prestige of a drama). Let’s call it a tossup, with Brendan Fraser in The Whale coming out ahead.
But it is close! Colin Farrell or Austin Butler in Elvis could both take this still.

Part 5: Best Actress in a Leading Role
The women’s category tells a similar story, with an added halo from winning the Golden Globes’ Drama award. Unlikely to be a deciding factor, but likely to push some probabilities around the tree branches, the BAFTAs continue to show no additive predictive power, which is at least consistent, if not fully explained.

Historically, the model would have called 8 of the last 10 winners in this category, so we are doing well.
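That hit rate comes from a simple backtest. A hedged sketch of one way to compute it, refitting with one year left out at a time (the 'Oscars Best Actress' column name and the leave-one-year-out procedure are my reconstruction, not necessarily the exact setup):

# Leave-one-year-out backtest over the last 10 award years
target = 'Oscars Best Actress'  # assumed column name for this category's table
predictors = [c for c in df_model.columns
              if c not in ('FilmCleansed', 'Year') and not c.startswith('Oscars')]
hits = 0
recent_years = sorted(df_model['Year'].unique())[-10:]
for year in recent_years:
    train = df_model[df_model['Year'] != year]
    test = df_model[df_model['Year'] == year]
    model = xgb.XGBClassifier(objective='binary:logistic', seed=123)
    model.fit(train[predictors], train[target])
    # Treat the highest-probability nominee as the predicted winner
    probs = model.predict_proba(test[predictors])[:, 1]
    if test.iloc[int(probs.argmax())][target] == 1:
        hits += 1
print(f"Called {hits} of the last {len(recent_years)} winners")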
Here, Cate Blanchett in Tár loses out purely because her win came at the Golden Globes, which history values a little less.

Pundits have Tár winning, but history favors the winner of the SAG Awards over the Golden Globes.
Concluding Thoughts
- Would adding more awards help predictive power? It probably would not hurt, but my guess is that it might add noise without adding too much accuracy.
- The other thing to note is the overall lower probabilities: voters tend to split across awards, and no year is quite like the last. Humans are capricious, and somewhat hard to predict.
Finally, models do help us manage our own recency and confirmation biases and point out trends in the data! If the BBC had asked me last year, they might have guessed right!
About
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.