Using the Reddit Python API to Generate Datasets

Paul Yang - October 14th, 2022

Use the Reddit Python API to create datasets for exploring your favorite topics.

Reddit is a rich data source that is easy to mine. Upvotes give a rough signal of how users feel about a post or comment, and the wide range of communities means you can find data on almost any subject.

With the example below, you can either pull comments and posts from the Pushshift database alone, or additionally use OAuth to connect to the Reddit API and pull in the live scores for those comments and posts.

To use the “PRAW” (Python Reddit API Wrapper) package, you will first need a Reddit account and a registered app: https://www.reddit.com/prefs/apps
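The app ID and secret created on that page are credentials, so you may prefer not to paste them directly into a notebook. Here is a minimal sketch, assuming they are stored in environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical and are not something PRAW itself requires:

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}


def load_reddit_credentials(env=os.environ):
    """Return a dict of credentials, raising if any variable is unset or empty."""
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in REQUIRED_VARS.items()}
```

The returned dict uses the same keyword names that `praw.Reddit(...)` takes, so it can be unpacked into that call alongside a `user_agent` string.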

Pushshift information can be found here: https://reddit-api.readthedocs.io/

How to use the Reddit API with Python Reddit Wrapper to download comments and posts WITHOUT scores

!pip install pmaw ## We need the pmaw package https://pypi.org/project/pmaw/

import pandas as pd
from pmaw import PushshiftAPI
import os

api = PushshiftAPI() ##Note! We cannot get the actual count of upvotes without authenticating 

## Set the time range for the search
import datetime as dt
before = int(dt.datetime(2022,4,1,0,0).timestamp())
after = int(dt.datetime(2021,4,1,0,0).timestamp())

subreddit="funny"   ## Which subreddit (optional)
limit=1000          ## Set the limits of how many to return

## Call the API for submissions
submissions = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after)
# Turn into a dataframe for use
df = pd.DataFrame(submissions)
print(df.head())

## Call the API for comments
comments = api.search_comments(q="dog", subreddit=subreddit, limit=1000)
df2 = pd.DataFrame(comments)
print(df2.head())


## Einblick does not support complex objects like dicts, structs, etc... 
df = df.astype(str)
df2 = df2.astype(str)
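One caveat about the time range above: `timestamp()` on a naive `datetime` is interpreted in the local time zone of the machine running the code, so the same script can return slightly different windows on different machines. Here is a minimal sketch, assuming you want the window pinned to UTC. The helper name `utc_epoch` is made up for illustration:

```python
import datetime as dt


def utc_epoch(year, month, day):
    """Epoch seconds for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


before = utc_epoch(2022, 4, 1)   # 1648771200
after = utc_epoch(2021, 4, 1)    # 1617235200
```

These values can be passed as `before` and `after` in the same way as in the calls above.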

How to use the Reddit API with Python Reddit Wrapper to download comments, posts, and scores

!pip install praw   ## This time, we bring in praw, which is a wrapper for Reddit API
!pip install pmaw 

import pandas as pd   ## Needed for the DataFrames below
import praw
from pmaw import PushshiftAPI

## You need to create an app first here: https://www.reddit.com/prefs/apps
## Then, we authenticate with Reddit OAuth using PRAW
reddit = praw.Reddit(
    client_id="APP",                ## This is your app ID
    client_secret="SEC",            ## This is your app secret
    username="UN",                  ## This is your Reddit username
    password="PW",                  ## This is your Reddit password
    user_agent="testscript by user of Einblick"     ## Name the useragent
)

print(reddit.user.me()) ## check you're logged in correctly 

## Pass praw to Pushshift
api_praw = PushshiftAPI(praw=reddit)

## Set the time range for the search
import datetime as dt
before = int(dt.datetime(2022,4,1,0,0).timestamp())
after = int(dt.datetime(2021,4,1,0,0).timestamp())

subreddit="nba"     ## Which subreddit (optional)
limit=10000           ## Set the limits of how many to return

## Call the API for submissions
submissions = api_praw.search_submissions(q="lebron", subreddit=subreddit, limit=limit, before=before, after = after)
df = pd.DataFrame(submissions)
print(df.head())

## Call the API for comments
comments = api_praw.search_comments(q="lebron", subreddit=subreddit, limit=limit, before=before, after = after)
df2 = pd.DataFrame(comments)
print(df2.head())

## Einblick does not support complex objects like dicts, structs, etc... 
df = df.astype(str)
df2 = df2.astype(str)
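Because `astype(str)` turns every column into text, numeric fields need converting back before you can sort or aggregate on them. Here is a minimal sketch, assuming the frame carries `score` and `created_utc` columns, as Reddit records normally do. The three rows below are made-up stand-ins for real API results:

```python
import pandas as pd

# Made-up rows standing in for API results after df.astype(str)
sample = pd.DataFrame(
    {
        "title": ["post a", "post b", "post c"],
        "score": ["40", "250", "7"],
        "created_utc": ["1648771200", "1617235200", "1630000000"],
    }
)

# Restore numeric and datetime types from the string columns
sample["score"] = pd.to_numeric(sample["score"])
sample["created"] = pd.to_datetime(
    pd.to_numeric(sample["created_utc"]), unit="s", utc=True
)

# Highest-scoring posts first
top = sample.sort_values("score", ascending=False).reset_index(drop=True)
print(top[["title", "score", "created"]])
```

Alternatively, doing this kind of sorting before the `astype(str)` step avoids the round trip through strings.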
