Sentiment Analysis and Web Charts with Python

I’m mostly retired, but still consult a bit in my field: Intelligent Transportation Systems (the use of advanced sensors, communications and information processing systems to improve ground transportation). The field includes connected and automated vehicles. And out of intellectual curiosity, I’ve also been learning about sentiment analysis. So I decided to combine these interests by tracking the change in sentiment over time, about automated vehicles, using tweets as the source material. And while I was at it, I figured I’d do the same thing for electric vehicles, and publicly display the results on two web pages.

One of the common, basic approaches to sentiment analysis is to determine whether the general tone of a sample of text is positive, negative, or neutral. That is the approach that I’ve taken. To determine this, one can use a rules-based approach, where certain words and word combinations are assigned tones and intensity (e.g., “hate” is very negative, “cool” is positive, “not cool” is negative). The scores for each sentiment word are combined to get an overall sentiment for the text sample. The other approach is to use machine learning algorithms to determine the sentiment. This approach can be more flexible, but requires a lot of training data, and the training data must be tagged by humans. In addition, multiple humans should review each sample, as studies have shown that humans can disagree with one another 20% of the time when assigning sentiment to a text sample.

I decided to use tweets posted on Twitter as the source material, since there are a large number of new entries every day and Twitter provides a free API that allows one to sample thousands of tweets per day, filtered by keywords and other parameters.

Once each tweet is assigned a sentiment value, I look at several metrics based on the total numbers of positive, negative, and neutral tweets. In addition, I extract and save the 100 most common 1 word and 2-word combinations found in the set of tweets for the day. The results are then used to generate several metrics, plots, and displays for the web pages.

The overall system for data collection, analysis, and display is shown…

Swim diagram for this project analyzing the sentiment of tweets about AVs and EVs and publishing the results on a web page

Getting and Filtering Tweets

The first step is to get the relevant tweets and clean and filter them. Two almost identical programs are used, one for AV tweets and one for EV tweets. The reason these are not combined into a single program is that this would exceed the 15 minute access limit of Twitter’s API. Instead, each is run once, about an hour apart, during the middle of the night eastern standard time.

Twitter’s free “Essentials” access level lets one search through a sample of tweets that were posted in the last 7 days. In order to access the twitter API, one must first register. Then, I used tweepy, a free python library for accessing the Twitter API. The search query API provides for a large number of query parameters. For this project, I searched for tweets containing any of a list of key words and that were not retweets, replies, or quote tweets (i.e., only original tweets are included).

The search terms were selected to try to broadly collect relevant tweets, but to avoid accidentally capturing a significant number of irrelevant ones. For example, the AV search terms include “self driving,” “automated vehicles,” and “automated shuttles” but does not include “Tesla,” as many tweets with the word Tesla are about the stock price or other aspects of the company, not about autonomous vehicles.

The next steps filtered out empty tweets and remove URLs from the tweet contents. During development, manual inspection revealed a large number of identical or nearly identical tweets, mostly originating from bots that pick up and tweet news stories. Therefore a filter was set up to keep only one sample from these duplicates and near-duplicates. To remove exact duplicates I simply used the drop_duplicates method in pandas. Removing near duplicates proved more problematic. The basic approach is to assess the similarity between pairs of tweets, and if they are sufficiently similar, remove one of them. Unfortunately this involves comparing every tweet with every other tweet, and there are thousands of tweets. I found that efficiently iterating through the tweets required the use of pandas’ itertuples method. This was at least an order of magnitude faster than using iterrows or items methods. Even so, the first similarly library I tried to use, SequenceMatcher, took over 10 minutes to perform all the comparisons! In the end, I used the Levenshtein.normalized_similarity method from the RapidFuzz library. This brought the runtime down to seconds.

Sentiment Analysis

The filtered and cleaned up tweets were now each analyzed to determine the sentiment of the tweets. I experimented with a number of approaches and settled on using VADER, an open-sourced, rule-based tool for sentiment analysis, written in Python. VADER was specifically developed to analyze short social media posts, such as tweets. In addition, it runs very quickly, which is useful as I analyze thousands of tweets at a time on each of the two subjects.

I made two very minor changes to the VADER code. First, as reported in the “issues” on GitHub, some words are listed more than once in it’s dictionary, with differing sentiment values. For these, I used my best judgement on which to keep and discarded the duplicates. In addition, VADER lets the user add their own words to its dictionary of sentiments. Based on manual review of a large number of tweets, some additional words tailored to the subject matter were added and assigned sentiments:

  • “advances”: 1.2
  • “woot”: 1.8
  • “dystopia”: -2.5
  • “dystopian”: -2.5
  • “against”: -0.9
  • “disaster”: -2.5

Sentiment Metrics

Donut chart showing percentages of positive, negative, and neutral tweets
Donut Chart showing the current day’s sentiments by percentage.

Once the sentiment of each tweet is determined, the results are aggregated. The total numbers of negative, positive, and neutral counts are calculated and stored in an AWS S3 storage bucket. Using matplotlib’s pyplot, a donut graph is created showing the percentages for each type of sentiment, and the graph is also stored in an S3 bucket for display on the web pages.

Two indices are also calculated: the ratio of the number of positive to negative tweets, ignoring neutral tweets, and the index, which is the average value, with each negative tweets counting as -1, neutral tweets as 0, and positive tweets as +1. These scores are put into an HTML fragment and transferred to the web server for incorporation into the two sentiment indices web pages.

Word Cloud

Word Cloud showing the 100 most common 1 and 2 word combinations

In addition to calculating the sentiment indices, the 100 most common single and two-word combinations are determined for the current days tweets. A word cloud image is also generated. Both of these are done using the WordCloud library. Both the word list and the Word Cloud are written out to one of the S3 buckets.

Time Series Charts with Bokeh

After both of the sentiment analysis programs are run each day, a single additional program is run. This program has two functions, each of which operate separately on the AV and EV data. First, it produces two time series plots of the sentiment indices, using the data stored in one of the S3 buckets. Second, it uses the most common words stored in S3 bucket and compares them with the list that was stored in the bucket 7 days ago. Words that appear in the most common 100 for the current day, but not the day a week ago are determined (“hot” words), as well as those that were on the list 7 days ago, but not today (“not” words). A subset of the list is then formatted into an HTML table for display on the web pages.

Bokeh is used to generate the two time series plots. I used bokeh rather than matplotlib because I wanted some interactivity on the web page, and Bokeh is a Python library for creating interactive visualizations for web browsers. The actual web scripts that Bokeh generates are in javascript.

This was my first time using Bokeh. It wasn’t hard to figure out how to generate the basic time series plots that I wanted, or to add the “hover” tool to allow one to see the value of particular data points. However, I had some trouble figuring out how to generate the results for use on a web page and then incorporating them in the page. Bokeh plots can require a server, for complex interactivity involving changing data or plots, or can be embedded as standalone plots. For this application, I used it in standalone mode. There are four modes that can be used for this, and as a beginner, I freely confess I don’t understand them well.

The file_html method returns a complete HTML document that embeds the Bokeh documents. That wouldn’t be appropriate for this application, as the Bokeh plots are embedded within other web pages. The json_item method “returns a JSON block that can be used to embed standalone Bokeh content.” I played around with this a bit, but I don’t really understand it. The components method returns “Return HTML components to embed a Bokeh plot. The data for the plot is stored directly in the returned HTML.” I didn’t explore this one at all. Finally, the autoload_static method returns “JavaScript code and a script tag that can be used to embed Bokeh Plots. The data for the plot is stored directly in the returned JavaScript code.” This is what I used. The method returns a tuple consisting of JavaScript code to be saved and a <script> tag to load it. The <script> tab is placed at the appropriate location in your HTML.

There are several ways to embed the plots in your HTML. I used a server side include, and the section of code looks like this:

            <div class="bokeh">
            <!--#include file="AV_ratio.tag" -->
            </div>

You can see the two plots in the left column of the AV Sentiment Index web page. Static screenshots are shown below:

Time series plots generated using Bokeh

Initial Observations

Figure from Observing the Effect of a Crash on Twitter Sentiment: Early Results from Time Series Data, showing the correlation between news of a crash and a sharp rise in negative tweets regarding AVs

Shortly after I began to capture data, the sentiment regarding automated vehicles dipped sharply for about a week, with the number of negative tweets rising sharply and the positive to negative ratio dropping 2 standard deviations below the mean. In checking what was happening, news had just broken of a major multi-vehicle crash involving a vehicle in self-driving mode. This news was clearly reflected in the negative tweets, with “hot” words that frequently occurred, but had been missing the previous week such as “Bay Bridge,” “Tesla full,” “eight car,” “eight vehicle,” and “vehicle crash.”

Since then, other news events can be clearly observed to influence the tweets. You can read more about this in the short write up, Observing the Effect of a Crash on Twitter Sentiment: Early Results from Time Series Data.

Simplify Your Life: Using a Relay Module

Simple relay

A Relay

Today’s blog is on a simple topic, but one that comes up a lot, and  may be of use, especially to beginners. Many projects involve controlling a higher voltage device or devices based on an external event triggered by a sensor, such as a PIR. And to control the higher voltage device, a relay is used. Along with the relay, you’ll need a transistor to provide adequate current to drive the relay, a diode to protect from reverse currents when the relay switches off, and probably a resistor to limit the current draw. This is simple enough, but another option is to use a relay module. The module combines a relay along with these other components into one simple package.

Depending upon your application, you may also put some sort of timer circuit, or a microcontroller such as an Arduino, or a single board computer (SBC) like a Raspberry Pi between the sensor output and the relay input. This lets you implement more complex logic than simply having the relay echo the output from the sensor. For example, one could have a light toggle on or off each time the sensor is triggered. Or, when triggered, one could have a device stay on for a set amount of time before turning off. Or even  more complex actions. But if you don’t need the added logic, you can use the sensor directly, without a microcontroller or SBC.

The output voltage from the sensor or controller board must be high enough to trigger the relay, but voltage alone is not enough. The driving device must also be capable of supplying enough current. This is often not the case for either sensors or computer boards. They lack the power to drive many relays.

Relay and driver circuit, showing transistor, diode, and resistor

Relay and Driver Circuit

The solution is to use a transistor. The sensor or board switches on and off the larger current flow through the transistor, and this, in turn, drives the relay. In addition, one needs a resistor to limit the current drawn from the sensor or controller and a “flyback” diode to protect both devices and the transistor from reverse currents when the relay turns off. This is shown in the figure, taken from Circuit Digest’s Arduino Relay Control Tutorial.

You can learn more about a relay and the relay driver circuit in the article How Electrical Relays Work at Circuit Basics.

 

Relay Module Board

Relay Module

But rather than using discrete components, there’s another option which is often quicker and easier: the relay module. A relay module combines all of these elements into a single board. It may include additional features, such as an LED to indicate when the relay is triggered. In this way, you only need your sensor, optionally a controller for more complex logic, and the relay module. No discreet diodes, transistors, or resistors required. I’m using the one discussed in this article, along with a PIR sensor, to trigger the “Try Me” switch on a Halloween prop. Although the signal output for the PIR is 3.3V and the relay is a 5V input signal relay, it seems to work just fine. As you can see, it’s the same relay, mounted onto a circuit board along with the transistor, resistor, diode, and indicator LED. In addition, the signal side inputs are connected to 3 male pins for easier connections. The pins are spaced so that they will fit into a standard breadboard or for connecting a standard 3-wire servo cable.

So the next time you need to trigger a higher voltage AC or DC device, reach for a relay module.

Wireless Microphone Using Two Raspberry Pi’s (Updated 5/4/2021)

I’m in the process of integrating wireless audio input and  into my Yorick the Mimic project. This involves adding microphone input and wireless transmission into the sensor cap and then integrating the movement controller with a modified version of Chatter Pi that takes the transmitted audio as an input.

In order to get started, I first put together bare bones transmitter and receive programs. I’m using Python, along with PyAudio, which I also used in Chatter Pi, to process the audio on both ends. I’m using UDP to send the data packets contain the audio. I saw some examples using TCP, but it seemed to me that UDP was better suited to real-time audio. If anyone knows more on which is the better approach, please post a comment.

PyAudio

The code runs fine, but generates a continuing stream of

ALSA lib pcm.c:8424:(snd_pcm_recover) underrun occurred

warning messages. This doesn’t interfere with the program’s operations, but if anyone knows why I’m getting them and/or how to eliminate the warnings, I’d appreciate your letting me know.

PyAudio has two modes, a blocking mode, where each call to pyaudio.Stream.write() or pyaudio.Stream.read() blocks until all the given/requested frames have been played/recorded and a non-blocking mode where a callback function is launched in a separate thread, so that processing can continue in the program calling it, and the thread ends when the current chunk of audio is processed. The gist with my code uses the non-blocking mode using the callback function and two different versions of the receiver, one using blocking mode and one using non-blocking. You need to be careful when using the non-blocking mode that the callback function does not include anything really time consuming, like file reading and writing. If it does, it can’t finish before the next chunk of audio is ready and you get clipping or worse problems.

As is well-known, the audio jack output on a Pi produces low volume, poor quality audio. A USB speaker works much better. However at least for the speaker I’m using, I need to use the ALSAMixer to control the volume. If I touch the speaker icon in the GUI, it produces no sound if set to anything other than the maximum volume. Again, if anyone knows why and how to fix this, please add a comment.

By design, both the xmit and rcv programs run forever once started. There’s one other feature beyond the bare bones basics. The receive program has a Boolean variable named EFFECTS. If set to True, the Sox library is used to deepen the pitch and add a bit of reverb before sending the audio to the speaker.

Hopefully this project will help others with similar needs.