Opinion Mining with Sklearn

Classify American senators with Tweepy

January 13, 2018 nschaetti

Introduction

Today we spend a large part of our time on social networks, blogs and other instant messaging services. These services all have in common that they produce a large amount of data of various kinds, like text, pictures, music and videos. In this datanami (tsunami of data), text takes a large share, as we are naturally language-driven organisms.

Furthermore, social networks based on textual data like Twitter are now a major arena for communication and marketing, especially for promoting and displaying political ideas and movements. In this week's tutorial, we will work with textual data from Twitter to classify the Twitter profiles of American senators according to their political affiliation.

Let’s get started!

Create your data set

We start with data acquisition. For that we need to create a Twitter application and use it to access Twitter profiles and download tweets. You can refer to last week's tutorial here to learn step by step how to create a Twitter application. Once done, we will use four packages in our code: tweepy to access our Twitter application, json to write and encode JSON files, codecs to read and write Unicode files, and time for temporal functions.

 
#!/usr/bin/env python 
# -*- coding: utf-8 -*- 
# 

# Imports 
import tweepy 
import json 
import codecs 
import time 

We now need a list of Twitter usernames of senators from the Republican and Democratic parties.

# Senators by party
users = dict()
users['democrats'] = [u"HillaryClinton", u"BillClinton", u"BarackObama", 
                      u"SenBobCasey", u"SenGillibrand", u"SenBillNelson",
                      u"SenFeinstein", u"SenSchumer", u"SenMarkey", 
                      u"SenatorHeitkamp", u"SenSanders",
                      u"SenatorTomUdall", u"SenBennetCO", 
                      u"SenWhitehouse", u"SenWarren"]
users['republicans'] = [u"realDonaldTrump", u"marcorubio", 
                        u"JeffFlake", u"RandPaul", u"lisamurkowski", u"SenToomey",
                        u"SenJohnMcCain", u"tedcruz", u"ChuckGrassley", 
                        u"SenatorCollins", u"SenDeanHeller", u"JerryMoran",
                        u"SenatorTimScott", u"SenTomCotton", u"senrobportman"]

As already shown in the last tutorial on machine learning and social networks, we create four variables holding our Twitter API credentials.

# Twitter API credentials
consumer_key = "..."
consumer_secret = "..."
access_key = "..."
access_secret = "..."

The OAuthHandler function of the tweepy package is used with the consumer key and secret to get the auth object. We then use the set_access_token function to tell the Twitter API the access key and secret related to our application.

# Authentification
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)

# Get API
api = tweepy.API(auth)

Thanks to tweepy's API() function, called with auth as argument, we retrieve the api object that gives access to a large range of Twitter functions. We then proceed to the creation of the data set. We divide our data in two parts: the training set, used to train our classifier, and the test set, used to predict the affiliation of senators the classifier has never seen before. This will allow us to evaluate whether our classifier's performance is acceptable.
We start by creating four lists: train_X, which will contain the training data, meaning the last 200 tweets of each senator; train_Y, which contains each senator's label (republican/democrat); and finally test_name and test_X, which respectively contain the test senators' names and tweets.

# Data set
train_X = list()
train_Y = list()
test_name = list()
test_X = list()

Now comes the time to download the data we need. We browse the lists of Democratic and Republican senators and, for each username in these lists, we use the user_timeline() function of the API object to retrieve the last 200 tweets of that user (200 is a Twitter limit).
Each retrieved tweet is appended to a string variable named tweet_text, which collects all the tweets of a single user. Once we have all the user's tweets side by side in tweet_text, we add this data to the train_X list and the corresponding class label to the train_Y list.

# For republicans and democrats
for c in ["democrats", "republicans"]:
    print(u"Downloading tweets for class \"{}\"".format(c))
    # For each senator of this party
    for user in users[c]:
        print(u"Downloading tweets for user \"{}\"".format(user))
        tweet_text = u""
        # Get statuses
        for status in api.user_timeline(screen_name=user, count=200):
            tweet_text += unicode(status.text)
        # end for
        train_X.append(tweet_text)
        train_Y.append(c)
        time.sleep(60)
    # end for
# end for

We do the same for the test part of the data set. We download all the tweets of each test senator and add them to the tweet_text variable. Once done, we add the data to the test_X list and the senator's username to the test_name list.
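
For the test part, we pick a few accounts that are not in the training set; here we use the four accounts whose predictions appear at the end of this article.

# Test users (the four accounts used for the predictions shown below)
test = [u"SenJohnThune", u"SenatorEnzi", u"SteveFarleyAZ", u"JanetBewley4WI"]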

# For each test user
for user in test:
    tweet_text = u""
    for status in api.user_timeline(screen_name=user, count=200):
        tweet_text += unicode(status.text)
    # end for
    test_name.append(user)
    test_X.append(tweet_text)
# end for

Congratulations! You have created the data set and it is now time to write it to a JSON file. For this we use the dump() function of the json package, with a dictionary as parameter where the entry 'X' contains the training data, the entry 'Y' the training labels, 'test_X' the test data and 'test_name' the test senators' names.

json.dump({'X': train_X, 'Y': train_Y, 'test_X': test_X, 'test_name': test_name},
          codecs.open(u"dataset.json", 'wb', encoding='utf-8'))

With more than 30 senators and a one-minute break between each of them, count on about half an hour to download all the data. Once finished, you will find the file dataset.json in your directory, ready to serve as training data for your future classifier.
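
Note that instead of sleeping 60 seconds after each user, you can also ask tweepy to handle rate limits for you. A minimal sketch, assuming your tweepy version supports the wait_on_rate_limit option:

# Let tweepy wait automatically when the Twitter rate limit is reached
api = tweepy.API(auth, wait_on_rate_limit=True)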

Create the classifier

To create our classifier, we will use sklearn, the machine learning toolkit of the SciPy ecosystem. We import, as before, the json and codecs modules to read our JSON dataset, together with a few useful Sklearn objects.

# Imports
import json
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

We load our data set, this time with the load() function and the opening mode ‘rb’ (read).

# Open file
dataset = json.load(codecs.open(u"dataset.json", 'rb', encoding='utf-8'))

The next step when designing a machine learning model is to decide which features you want to extract from the data and feed to your classifier. This is a fundamental step as it largely determines future performance. When working with text, features can be sentence length, vocabulary, or the frequencies of character sequences.
Here we will use a classical feature of natural language processing: a vector of word frequencies, where each entry is the frequency in the text of a specific word of the vocabulary. Our vector is a bit more specific, as we use a TF-IDF vector, an even more powerful model that I will introduce in a future theoretical article.
For now, let's use the CountVectorizer object, which computes count vectors from raw text.

 
# Count vector 
count_vec = CountVectorizer(ngram_range=(1, 1)) 
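
To get an intuition of what CountVectorizer produces, here is a tiny illustrative example on a two-document toy corpus (the sentences are invented for the sake of the example):

# Toy example: turn two short texts into count vectors
toy_corpus = [u"tax cuts for families", u"health care for families"]
toy_vec = CountVectorizer(ngram_range=(1, 1))
toy_counts = toy_vec.fit_transform(toy_corpus)
print(toy_vec.vocabulary_)   # word -> column index
print(toy_counts.toarray())  # one row per document, one column per word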

The output count vectors can be transformed into a term-frequency inverse-document-frequency representation, where each value represents the term frequency modulated by the document frequency. The complete model is called TF-IDF and will be the subject of a theoretical article. Let's now use the TfidfTransformer() object.

 
# TF-IDF transformer 
tfidf_transformer = TfidfTransformer()
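
Roughly speaking, the transformer multiplies each count by an inverse-document-frequency factor, so words that appear in almost every document get a low weight while distinctive words get a high weight. A short sketch on a made-up toy corpus:

# Toy example: re-weight count vectors by inverse document frequency
toy_corpus = [u"tax cuts for families", u"health care for families"]
toy_counts = CountVectorizer(ngram_range=(1, 1)).fit_transform(toy_corpus)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)
# Words shared by both documents (like "for" and "families") get lower weights
print(toy_tfidf.toarray())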

We can now create our classifier, a multinomial naive Bayes classifier, with the Sklearn object MultinomialNB.

# Classifier 
classifier = MultinomialNB() 

Sklearn gives us a very practical object, Pipeline. It allows us to put the three preceding objects into a pipeline which we can train and use for prediction with the common fit() and predict() functions.

# Pipeline 
text_clf = Pipeline([('vec', count_vec),
                     ('tfidf', tfidf_transformer), 
                     ('clf', classifier)]) 

The fit() function is used to train our model. We only need to pass as arguments the training data (the tweets' text) and the corresponding labels (republicans/democrats).

 
# Train on entire dataset 
text_clf.fit(dataset['X'], dataset['Y']) 

Once our model is trained, let's use it to predict the political party of our test senators. We iterate through the list of test senators, call the predict() method with their tweets' text as argument to compute the prediction, and print the result with print().

# For each test user
for index, user in enumerate(dataset['test_name']):
    print(u"Prediction class for {}".format(user))

    # Predict
    prediction = text_clf.predict([dataset['test_X'][index]])

    # Print prediction
    print(u"{} is predicted as {}".format(user, prediction))
# end for

Here is the result.

Prediction class for SenJohnThune
SenJohnThune is predicted as [u'republicans']
Prediction class for SenatorEnzi
SenatorEnzi is predicted as [u'republicans']
Prediction class for SteveFarleyAZ
SteveFarleyAZ is predicted as [u'democrats']
Prediction class for JanetBewley4WI
JanetBewley4WI is predicted as [u'democrats']

As you can see, the first two senators are classified as Republicans and the last two as Democrats, which is correct. Obviously, our training set is confined to a very small set of users, American senators, and we tested it on only a few samples. A bigger experiment would be to take many more politically affiliated Twitter accounts, train different models and test them with a more robust methodology.
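
As a first step in that direction, you could at least cross-validate the pipeline on the training set instead of relying on four hand-picked test accounts. A minimal sketch, assuming the dataset.json file created earlier:

# Sketch: 5-fold cross-validation of the same pipeline on the training set
import json
import codecs
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

dataset = json.load(codecs.open(u"dataset.json", 'rb', encoding='utf-8'))
text_clf = Pipeline([('vec', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
scores = cross_val_score(text_clf, dataset['X'], dataset['Y'], cv=5)
print(u"Accuracy: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std()))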

Meanwhile, you can find this code on GitHub and a video tutorial in French on YouTube.

Nils Schaetti is a doctoral researcher in Switzerland specialised in machine learning and artificial intelligence.
