Using the Reddit Python API to Generate Datasets

Paul Yang - October 14th, 2022

Use the Reddit Python API to create datasets for exploring your favorite topics.

Reddit is a rich data source that is easy to mine. Upvotes give a rough signal of how users feel about a post or comment, and the wide range of communities means you can gather data on almost any subject.

With the example below, you can either pull comments and posts from the Pushshift database alone, or also authenticate with OAuth, connect to the Reddit API, and pull in the live scores for those comments and posts.

To use the “PRAW” (Python Reddit API Wrapper) package, you will need a Reddit account and must first create an app at https://www.reddit.com/prefs/apps
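Once the app exists, you will have a client ID and secret to manage. One way to keep them out of a shared notebook is to read them from environment variables. The sketch below is only an illustration: the helper name `load_reddit_credentials` and the `REDDIT_*` variable names are hypothetical, not something PRAW or Einblick defines.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect keyword arguments for praw.Reddit from environment variables.

    The REDDIT_* variable names are hypothetical; use whatever names you prefer.
    """
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "username": "REDDIT_USERNAME",
        "password": "REDDIT_PASSWORD",
    }
    # Fail early with a readable message if anything is unset or empty
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    creds = {key: env[var] for key, var in required.items()}
    # The user agent is optional here and falls back to a default string
    creds["user_agent"] = env.get("REDDIT_USER_AGENT", "testscript by user of Einblick")
    return creds
```

The returned dictionary uses the same keyword names as the PRAW example later in this post, so it could be passed along as `praw.Reddit(**load_reddit_credentials())`.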

Pushshift information can be found here: https://reddit-api.readthedocs.io/

How to use the Reddit API with Python Reddit Wrapper to download comments, posts WITHOUT scores

!pip install pmaw ## We need the pmaw package https://pypi.org/project/pmaw/

import pandas as pd
from pmaw import PushshiftAPI
import os

api = PushshiftAPI()  ## Note: without authenticating, we cannot get the actual count of upvotes

## Set the time range for the search (epoch seconds)
## Note: a datetime created without tzinfo is interpreted in your local time zone
import datetime as dt
before = int(dt.datetime(2022,4,1,0,0).timestamp())
after = int(dt.datetime(2021,4,1,0,0).timestamp())

subreddit="funny"   ## Which subreddit (optional)
limit=1000          ## Set the limits of how many to return

## Call the API for submissions
submissions = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after)
# Turn into a dataframe for use
df = pd.DataFrame(submissions)
print(df.head())

## Call the API for comments
comments = api.search_comments(q="dog", subreddit=subreddit, limit=limit)
df2 = pd.DataFrame(comments)
print(df2.head())


## Einblick does not support complex objects such as dicts or structs, so cast every column to string
df = df.astype(str)
df2 = df2.astype(str)
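Casting the whole dataframe with `astype(str)` also turns numeric columns into text. If you would rather keep numbers as numbers, one alternative is to convert only the columns that actually hold dicts or lists. This is a minimal sketch, assuming JSON strings are acceptable wherever the data is headed; the helper `stringify_nested_columns` and the sample rows are hypothetical and only loosely shaped like Pushshift results.

```python
import json

import pandas as pd


def stringify_nested_columns(frame):
    """Return a copy where dict/list cells become JSON strings.

    Columns without nested values keep their original dtype.
    """
    out = frame.copy()
    for col in out.columns:
        has_nested = out[col].map(lambda v: isinstance(v, (dict, list))).any()
        if has_nested:
            out[col] = out[col].map(
                lambda v: json.dumps(v) if isinstance(v, (dict, list)) else v
            )
    return out


# Hypothetical rows, for illustration only
sample = pd.DataFrame([
    {"id": "a1", "num_comments": 12, "media": {"type": "image"}},
    {"id": "b2", "num_comments": 3, "media": None},
])

flat = stringify_nested_columns(sample)
print(flat.dtypes)  # num_comments stays numeric; media holds JSON strings
```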

How to use the Reddit API with Python Reddit Wrapper to download comments, posts and scores

!pip install praw   ## This time, we bring in praw, which is a wrapper for Reddit API
!pip install pmaw 

import pandas as pd
import praw
from pmaw import PushshiftAPI

## You need to create an app first here: https://www.reddit.com/prefs/apps
## Then, we authenticate with Reddit OAuth using PRAW
reddit = praw.Reddit(
    client_id="APP",                ## This is your app ID
    client_secret="SEC",            ## This is your app secret
    username="UN",                  ## This is your Reddit username
    password="PW",                  ## This is your Reddit password
    user_agent="testscript by user of Einblick"     ## Name the useragent
)

print(reddit.user.me()) ## check you're logged in correctly 

## Pass praw to Pushshift
api_praw = PushshiftAPI(praw=reddit)

## Set the time range for the search
import datetime as dt
before = int(dt.datetime(2022,4,1,0,0).timestamp())
after = int(dt.datetime(2021,4,1,0,0).timestamp())

subreddit="nba"     ## Which subreddit (optional)
limit=10000           ## Set the limits of how many to return

## Call the API for submissions
submissions = api_praw.search_submissions(q="lebron", subreddit=subreddit, limit=limit, before=before, after = after)
df = pd.DataFrame(submissions)
print(df.head())

## Call the API for comments
comments = api_praw.search_comments(q="lebron", subreddit=subreddit, limit=limit, before=before, after = after)
df2 = pd.DataFrame(comments)
print(df2.head())

## Einblick does not support complex objects such as dicts or structs, so cast every column to string
df = df.astype(str)
df2 = df2.astype(str)
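With scores available, a natural next step is to rank posts and look at how scores change over time. The sketch below assumes the dataframe has `title`, `score`, and `created_utc` columns and that they are strings after the `astype(str)` cast; the rows are made up for illustration and are not real Reddit data.

```python
import pandas as pd

# Hypothetical rows standing in for the dataframe built above
posts = pd.DataFrame({
    "title": ["post a", "post b", "post c"],
    "score": ["15", "240", "7"],  # strings, as left by astype(str)
    "created_utc": ["1633046400", "1640995200", "1646092800"],
})

# Restore numeric scores and turn epoch seconds into UTC timestamps
posts["score"] = pd.to_numeric(posts["score"], errors="coerce")
posts["created"] = pd.to_datetime(
    pd.to_numeric(posts["created_utc"]), unit="s", utc=True
)

# Highest-scoring posts first
top = posts.sort_values("score", ascending=False).head(2)

# Average score per calendar month (UTC)
monthly_mean = posts.groupby(posts["created"].dt.strftime("%Y-%m"))["score"].mean()

print(top[["title", "score"]])
print(monthly_mean)
```

Converting `created_utc` with `utc=True` keeps the timestamps unambiguous, which matters if you later compare them against the `before` and `after` bounds used in the search.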

