Here's a fast, easy, and practical example of using Python web scraping to make life good!
My Motivation
Yesterday, my preferred wine store ran a 20%-off sale on thousands of items (shoutout to Astor Wines in NYC).
I am no expert on wines, so I outsource all of my judgment to Vivino's crowdsourced ratings. Rather than searching one by one, I wanted to use Vivino's "scan wine list" feature: if I could get all of the wines on sale into one long list, it would iterate through them and return all the scores for me.
Also, I am not sponsored, I swear (but if anyone did want to make me an affiliate, find me at py@einblick.ai).
This is what it looked like scanning the wine list off my Einblick workspace:

What I Did: Web Scraping with BeautifulSoup and requests
The requests package is one of the most commonly used Python packages, and is used to make HTTP requests. In plainer words, it's a utility package used to grab the HTML code from a given URL.
First, we set up a for-loop that makes one request with the requests package for each page the items are listed on. Each request returns a response, which essentially contains the HTML code of that page.
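As a rough sketch of that request loop, here is one way it could be wrapped into reusable helpers. The names page_url and fetch_pages, the example.com URL, the 10-second timeout, and the one-second delay are all illustrative assumptions rather than part of my original script.

```python
import time

import requests


def page_url(base_url: str, page: int) -> str:
    """Append the page number to a URL that ends in 'Page='."""
    return f"{base_url}{page}"


def fetch_pages(base_url: str, first: int, last: int, delay: float = 1.0) -> list[str]:
    """Fetch pages first..last (inclusive) and return each page's HTML as a string."""
    pages = []
    with requests.Session() as session:  # reuse one connection across requests
        for page in range(first, last + 1):
            response = session.get(page_url(base_url, page), timeout=10)
            response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
            pages.append(response.text)
            time.sleep(delay)  # pause between requests
    return pages
```

Calling fetch_pages("https://example.com/search?Page=", 1, 3) would request pages 1, 2, and 3 in turn. The timeout and raise_for_status() calls are there so a hung or failed request stops the loop instead of silently producing an empty page.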
Then, we feed that response object into BeautifulSoup's HTML parser. Since the beginning of time, using RegEx to parse HTML has not been advised. BeautifulSoup can do many things that you might want, including, but not limited to:
- Formatting HTML code
- Grabbing specific elements by the element name
- Returning major elements like page title
- Grabbing text
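To make that list concrete, here is a minimal sketch that runs each of those operations on a small made-up HTML snippet (the markup and variable names are assumptions for illustration, not taken from the wine store's site).

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html = """
<html>
  <head><title>Wine Sale</title></head>
  <body>
    <h1>On sale today</h1>
    <a href="/wine/1">Gamay</a>
    <a href="/wine/2">Riesling</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()                         # formatted (re-indented) HTML as a string
links = [a["href"] for a in soup.find_all("a")]  # elements grabbed by tag name
title = soup.title.string                        # the page title
heading = soup.h1.get_text(strip=True)           # text inside an element

print(title, links, heading)
```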
In our case, we notice that the wine name is in an element with the class "item-name", so we ask BeautifulSoup to grab all elements with that class using the find_all function. Each page here has 12 wines, which we then loop through and add to our data frame.
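Here is a small sketch of that step on made-up markup. The surrounding tags, the "item-teaser" and "price" classes, and the wine names are invented for illustration, and the real page's structure may differ; only the "item-name" class comes from the page I scraped.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup; only the "item-name" class mirrors the real page
html = """
<div class="item-teaser">
  <h2 class="item-name"> Domaine Exemple Gamay 2021 </h2>
  <span class="price">$18.96</span>
</div>
<div class="item-teaser">
  <h2 class="item-name">Weingut Beispiel Riesling 2020</h2>
  <span class="price">$21.96</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python
names = [tag.get_text(strip=True) for tag in soup.find_all(class_="item-name")]

# One column named "items", one row per wine
df = pd.DataFrame({"items": names})
print(df)
```

Using get_text(strip=True) rather than .text trims stray whitespace around each name, which is handy if the list is going to be read by another tool afterwards.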
Finally, we pull that dataframe onto the canvas in Einblick (or you can do it wherever!) and we have made ourselves a "wine list" of sorts.
Note: Always abide by terms of use! (e.g. don't scrape Google results and get your company IP blacklisted)
For instance, I used Vivino's wine menu option because they do not like people scraping their data (given it is one of their main assets and competitive advantages). I hoped that my slow-speed few-page scraping that ended in material sales wouldn't annoy the wine shop too much though.
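One machine-readable hint about what a site is comfortable with is its robots.txt file. It is not a substitute for reading the terms of use, but it is easy to check from Python's standard library. The sketch below parses a made-up robots.txt (the rules and the example.com URLs are assumptions for illustration, not Astor Wines' or Vivino's actual policy).

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real one is served at https://<site>/robots.txt
robots_txt = """
User-agent: *
Disallow: /checkout/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL
can_search = parser.can_fetch("*", "https://example.com/search?Page=1")
can_checkout = parser.can_fetch("*", "https://example.com/checkout/cart")

# Seconds the site asks crawlers to wait between requests (None if unspecified)
delay = parser.crawl_delay("*")

print(can_search, can_checkout, delay)
```

Against a live site, you could instead call parser.set_url(...) followed by parser.read() to download the file, and then use the returned delay as the argument to time.sleep between page requests.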
Canvas Example
Code Example:
!pip install requests
!pip install bs4
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
# Grabbing the low cost natural wines on sale!
url = "https://www.astorwines.com/WineSearchResult.aspx?p=1&search=Advanced&searchtype=Contains&srt=4&term=&cat=1&pricerange=2&organicall=True&srt=0&Page="
# Collect the wine names in a plain list (DataFrame.append was removed in pandas 2.0)
names = []
# I knew there were ~20 pages of wines so I went through them one page at a time
# (range(1, 20) covers pages 1 through 19; raise the upper bound if there are more)
for i in range(1, 20):
    print(url + str(i))  # just for reference
    response = requests.get(url + str(i))  # add the page # to URL
    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")
    # Find all elements with the class "item-name"
    items = soup.find_all(class_="item-name")
    # Extract the text from each element, print it, and store it
    for item in items:
        print(item.text)
        names.append(item.text)
    time.sleep(1)  # don't look like a bot attacking every ms
# Build the data frame once, after all pages have been scraped
df = pd.DataFrame({'items': names})