BeautifulSoup: Python web scraping library

Paul Yang - January 6th, 2023

Here's a fast, easy, and practical example of using Python web scraping to make life good!

My Motivation

Yesterday, my preferred wine store ran a 20% off sale on 1,000s of items (shoutout to Astor Wines in NYC).

I am no expert on wines, so I outsource all of my judgment to Vivino's crowdsourced ratings. But rather than searching one by one, I wanted to use Vivino's "scan wine list" feature: if I could get all of the wines on sale into one long list, it would iterate through and return all the scores for me.

Also, I am not sponsored, I swear (but if anyone did want to make me an affiliate, find me at py@einblick.ai)

This is what it looked like scanning the wine list off my Einblick workspace:

What I Did: Web Scraping with BeautifulSoup and requests

The requests package is one of the most commonly used Python packages, and is used to make HTTP requests. In plainer words, it's a utility package used to grab the HTML code from a given URL.
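To make that concrete, here is a minimal sketch of a single request (using example.com as a stand-in URL, not the actual wine-shop page):

```python
import requests

# example.com is a stand-in URL for illustration, not the wine shop's page
url = "https://example.com"
response = requests.get(url, timeout=10)

print(response.status_code)  # 200 on success
print(response.text[:80])    # the page's raw HTML, as a string
```

The `response.text` attribute is what we will hand off to the HTML parser below.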

First, we go ahead and set up a for-loop, making one request with the requests package for each of the pages the items are listed on. Each request returns a response object whose body is essentially the HTML code of that page.

Then, we feed that response object into BeautifulSoup's HTML parser. Since the beginning of time, using RegEx to parse HTML has not been advised. BeautifulSoup can do many things that you might want, including, but not limited to:

  • Formatting HTML code
  • Grabbing specific elements by the element name
  • Returning major elements like page title
  • Grabbing text

In our case, we notice that the wine name is in a class named "item-name", so we ask BeautifulSoup to grab all instances of that class with the find_all function. Each page here lists 12 wines, which we then loop through and add to our data frame.

Finally, we pull that dataframe onto the canvas in Einblick (or you can do it wherever!) and we have made ourselves a "wine list" of sorts.

Note: Always abide by terms of use! (e.g. don't scrape Google results and get your company IP blacklisted)

For instance, I used Vivino's own wine-menu scanning option rather than scraping Vivino, since they do not like people scraping their data (it is one of their main assets and competitive advantages). As for Astor, I hoped that my slow, few-page scraping, which ended in actual sales, wouldn't annoy the wine shop too much.

Canvas Example

Code Example:

!pip install requests 
!pip install bs4

import requests
from bs4 import BeautifulSoup
import pandas as pd 
import time

# Grabbing the low cost natural wines on sale! 
url = "https://www.astorwines.com/WineSearchResult.aspx?p=1&search=Advanced&searchtype=Contains&srt=4&term=&cat=1&pricerange=2&organicall=True&srt=0&Page="

# Collect item names in a list, then build the data frame at the end
# (DataFrame.append was removed in pandas 2.0)
rows = []

# I knew there were ~20 pages of wines so I went through them one page at a time
for i in range(1, 21):
    print(url + str(i))  # just for reference

    response = requests.get(url + str(i))  # add the page # to the URL

    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all elements with the class "item-name"
    items = soup.find_all(class_="item-name")

    # Extract the text from each element and print it
    for item in items:
        print(item.text.strip())
        rows.append({'items': item.text.strip()})

    time.sleep(1)  # don't look like a bot attacking every ms

df = pd.DataFrame(rows, columns=['items'])

About

Einblick is an agile data science platform that provides data scientists with a collaborative workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick customers include Cisco, DARPA, Fuji, NetApp and USDA. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.
