web scraping | The Aspiring Roboticist

Buffalo Trace bourbon

Bourbon has become a hot commodity, and as it takes years to make, the supply can’t quickly ramp up to match demand. Buffalo Trace is one of many brands that have become quite popular, which makes it hard to find. The clerk at a local Virginia ABC store told my wife that they get a small shipment in and it “flows out like a river”, and is gone in a day. My wife found that you could check the inventory of local stores online, and asked me to write a script to check it. Starting tomorrow morning, the “Buffalo Hunter” script will run once a day, check the inventory at our two closest stores, and send her a text if there are any bottles in stock.

Version 1 was a fun 1-day project. I had to learn some new tricks, as the page uses client-side JavaScript and I hadn’t used Twilio to send texts before, but I got it all working well. Some time later I decided to port it to the cloud, using the AWS Lambda service, which had a short but steep learning curve.

The website uses JavaScript to generate a dynamic page, so I couldn’t simply use something like Beautiful Soup to parse the HTML. Instead I used Selenium, driving a headless Chrome browser on my local version and switching over to PhantomJS on AWS. I switched to PhantomJS because you need executables compiled to run under AWS; I found a precompiled version of PhantomJS on the web, and didn’t find the same for Chrome.
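
To illustrate why parsing the raw download isn't enough, here is a minimal sketch using only Python's standard library. The two HTML strings are invented for illustration and are not copied from the ABC site; the only detail taken from the real page is the td cell with data-title="Inventory" that the script further down looks for. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class InventoryCellFinder(HTMLParser):
    """Collect the text of the first <td data-title="Inventory"> cell."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        is_inventory = tag == 'td' and dict(attrs).get('data-title') == 'Inventory'
        if is_inventory and self.value is None:
            self._inside = True
            self.value = ''

    def handle_endtag(self, tag):
        if tag == 'td':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.value += data


def find_inventory(html):
    """Return the inventory cell's text, or None if the cell is not in the markup."""
    finder = InventoryCellFinder()
    finder.feed(html)
    finder.close()
    return finder.value.strip() if finder.value is not None else None


# Hypothetical markup: what a plain HTTP download might contain before any script runs...
RAW_HTML = '<div id="app"></div><script src="app.js"></script>'
# ...and what a browser might hold after the page's JavaScript has filled in the table.
RENDERED_HTML = '<div id="app"><table><tr><td data-title="Inventory">3</td></tr></table></div>'

print(find_inventory(RAW_HTML))       # None
print(find_inventory(RENDERED_HTML))  # 3
```

A static parser only ever sees the first string, which is why a real browser engine has to render the page before anything can be scraped from it.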

There was one other “gotcha” I ran into. I use Windows. While I had found a correctly compiled PhantomJS executable, when I zipped it along with the other files to upload, it lost its permissions settings. I could have booted into Linux, but instead I installed the Linux subsystem that’s available for Windows 10 and used bash to zip the files up. That ended up working fine. You also need to change the directory for the PhantomJS log to the /tmp/ folder, which AWS gives you write access to.
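
For anyone who would rather stay in Windows, one possible alternative (not what I did, and not something I have tested against Lambda) is to build the zip from Python and record the Unix permission bits explicitly. This is a minimal sketch; build_package, add_with_mode, and the file names in the usage note are hypothetical.

```python
import stat
import zipfile


def add_with_mode(zf, src_path, arcname, mode):
    """Add a file to an open ZipFile, recording explicit Unix permission bits."""
    info = zipfile.ZipInfo(arcname)
    info.create_system = 3  # 3 marks the entry as created on Unix
    # The high 16 bits of external_attr hold the Unix file type and mode.
    info.external_attr = (stat.S_IFREG | mode) << 16
    info.compress_type = zipfile.ZIP_DEFLATED
    with open(src_path, 'rb') as src:
        zf.writestr(info, src.read())


def build_package(zip_path, sources, executables=()):
    """Write a zip; sources maps archive names to local paths.

    Archive names listed in executables are stored as 0o755, everything else as 0o644.
    """
    with zipfile.ZipFile(zip_path, 'w') as zf:
        for arcname, src_path in sources.items():
            mode = 0o755 if arcname in executables else 0o644
            add_with_mode(zf, src_path, arcname, mode)
```

A call such as build_package('package.zip', {'phantomjs': 'phantomjs', 'handler.py': 'handler.py'}, executables=('phantomjs',)) would then store the binary with its execute bit set, which should survive being unpacked on a Linux host.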

In version 1, I handed off the final processed web page to Beautiful Soup because I hadn’t used Selenium’s parsing before, and I had used Beautiful Soup’s. You can easily hand off the processed page from Selenium to Beautiful Soup (see the commented-out line that starts with page2Soup in the code below). When I moved to Amazon, I also figured out how to do the page scraping in Selenium, so I didn’t need Beautiful Soup any more. The concept’s simple, but I didn’t find a good reference for find_element_by_css_selector in Python, so it took a little trial and error. If you’re interested, here’s the version of the code that runs on AWS:

Buffalo Hunter.py


import logging
import datetime
import time
# from bs4 import BeautifulSoup
from selenium import webdriver
from twilio.rest import Client

accountSID='SID Here'
authToken = 'token here'
stores = {'219': 'Old Courthouse' , '231': 'Maple Ave.'}

# options = webdriver.ChromeOptions()
# options.add_argument('headless')
# driver = webdriver.Chrome('c:/program files (x86)/chromedriver.exe')
driver = webdriver.PhantomJS(executable_path="/var/task/phantomjs", service_log_path='/tmp/ghostdriver.log')

def myhandler(event, context):
	try:
		results = ''
		success = 0
		for store in stores:
			driver.get('https://www.abc.virginia.gov/stores/'+store)
			make_my_store = driver.find_element_by_id('make-this-my-store')
			make_my_store.click()
			time.sleep(5)
			driver.get('https://www.abc.virginia.gov/products/bourbon/buffalo-trace-bourbon#/product?productSize=0')
			time.sleep(5)
			element = driver.find_element_by_css_selector('td[data-title="Inventory"]')
			# page2Soup = BeautifulSoup(driver.page_source, 'lxml')
			# element = page2Soup.find("td", {"data-title": "Inventory"})
			inventory_value = element.text
			# '<>' only works in Python 2; '!=' works in both Python 2 and 3
			if inventory_value != '0': success = 1
			results = results + stores[store] + ' has ' + inventory_value + ' bottles of Buffalo Trace. '
		driver.close()
		driver.quit()
		# Send a text if either store reports a non-zero inventory
		if success == 1:
			results = 'Success! ' + results
			twilioCli = Client(accountSID, authToken)
			myTwilioNumber = 'myPhoneNumberHere'
			destinationCellNumber = 'destinationCellNumberHere'
			message = twilioCli.messages.create(body=results,from_=myTwilioNumber, to=destinationCellNumber)
	except Exception as e:
		logging.error('%s Error while checking inventory', datetime.datetime.now(), exc_info=e)
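
The decide-and-format step in the handler can also be pulled out into a small function that is easy to test without a browser or a Twilio account. This is only a sketch, not the code that runs on AWS: build_alert is a hypothetical name, and unlike the listing above it assumes that anything other than a positive whole number (a blank cell, for example) means "not in stock".

```python
def build_alert(inventory_by_store, product='Buffalo Trace'):
    """Return the text to send, or None if no store reports any bottles.

    inventory_by_store maps a store name to the raw text scraped from the page.
    """
    parts = []
    in_stock = False
    for store_name, raw in inventory_by_store.items():
        text = raw.strip()
        # Assumption: text that is not a whole number counts as zero bottles.
        try:
            count = int(text)
        except ValueError:
            count = 0
        if count > 0:
            in_stock = True
        parts.append('%s has %s bottles of %s.' % (store_name, text, product))
    return ('Success! ' + ' '.join(parts)) if in_stock else None


print(build_alert({'Old Courthouse': '0', 'Maple Ave.': '3'}))
print(build_alert({'Old Courthouse': '0', 'Maple Ave.': '0'}))  # None
```

The handler would then only need to collect element.text for each store into a dictionary and send a text when build_alert returns something other than None.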


Using Geek Power for Good: Better Living Through Code Edition