Using IBM Watson NLU to Analyse Twitter IPL Content

This article will cover the following:

  1. Capturing tweets matching certain hashtags (you may apply other filters if you want to).
  2. Saving these tweets in a text file for later analysis.
  3. Reading these tweets and analysing them.
  4. Saving the results of step 3 in a CSV file.
import settingsTwitter
import tweepy

class StreamListener(tweepy.StreamListener):

    def on_status(self, status):
        # Ignore plain retweets; quoted retweets with added text still pass.
        if not status.retweeted:
            with open("#DDvKXIP.txt", 'a') as f:
                f.write(status.text + '\n')

    def on_error(self, status_code):
        # Disconnect on HTTP 420 (rate limiting) to avoid a longer ban.
        if status_code == 420:
            return False

auth = tweepy.OAuthHandler(settingsTwitter.TWITTER_APP_KEY, settingsTwitter.TWITTER_APP_SECRET)
auth.set_access_token(settingsTwitter.TWITTER_KEY, settingsTwitter.TWITTER_SECRET)
api = tweepy.API(auth)
stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(languages=["en"], track=["#DDvKXIP", "#KXIPvDD"])

settingsTwitter is the file in which my Twitter auth credentials are stored.
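For reference, settingsTwitter is just a plain Python module. The variable names below are the ones the streaming script imports; the values are placeholders, not real keys:

```python
# settingsTwitter.py -- Twitter credentials as plain module-level
# constants. The values below are placeholders; substitute the keys
# generated for your own Twitter app.
TWITTER_APP_KEY = "your-consumer-key"
TWITTER_APP_SECRET = "your-consumer-secret"
TWITTER_KEY = "your-access-token"
TWITTER_SECRET = "your-access-token-secret"
```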

I am ignoring plain retweets; however, retweets with some additional status text (quoted tweets) are accepted.

# -*- coding: utf-8 -*-
import re
import csv
import json
import settingsWatson
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 import Features, EntitiesOptions, KeywordsOptions, CategoriesOptions
import os
import time
natural_language_understanding = NaturalLanguageUnderstandingV1(
    # the version date and credential attribute names here are placeholders
    version='2018-03-16',
    username=settingsWatson.WATSON_USERNAME,
    password=settingsWatson.WATSON_PASSWORD)
filename = '#DDvKXIP.txt'
file = open(filename, 'r')
st_results = os.stat(filename)
st_size = st_results[6]
file.seek(st_size)  # start reading from the tail of the file
tweets = []
while True:
    where = file.tell()
    line = file.readline()
    if not line:
        print("no line found, waiting for 1 second")
        time.sleep(1)
        file.seek(where)
    elif re.search('[a-zA-Z]', line):
        print("-----------------------------")
        print("the line is: ")
        print(line)
        print("-----------------------------")
        response = natural_language_understanding.analyze(
            text=line,
            features=Features(entities=EntitiesOptions(),
                              keywords=KeywordsOptions(),
                              categories=CategoriesOptions()))
        response["tweet"] = line
        print(json.dumps(response, indent=2))
        with open('#DDvKXIP.csv', 'a') as csv_file:
            writer = csv.writer(csv_file)
            for key, value in response.items():
                writer.writerow([key, value])
    else:
        print("--------------------------------")
        print("found a line without any alphabet in it, hence not considering.")
        print(line)
        print("--------------------------------")

You might have read your logs (be it of an application, a server, etc.) and might have used tail -f to follow the tail of a log file. Similarly, the code above reads the #DDvKXIP.txt file from the tail end, processes the tweets one by one, and saves the results in #DDvKXIP.csv.

settingsWatson has my auth keys. Using these keys (which are generated after creating a developer account on IBM Cloud) enables me to use Watson.
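As with settingsTwitter, this is just a plain Python module. The original file is not shown in the article, so both the variable names and values below are my own placeholders:

```python
# settingsWatson.py -- IBM Watson credentials. Names and values are
# placeholders; use the credentials from your own IBM Cloud NLU
# service instance.
WATSON_USERNAME = "your-watson-username"
WATSON_PASSWORD = "your-watson-password"
```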

if re.search('[a-zA-Z]', line):

The above if condition ensures that the line read from the txt file is not just a newline (not containing only \n) and does not contain only whitespace, so that only tweets with some actual written content, something Natural Language Understanding can comprehend, are processed.
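A quick illustration of that filter (the sample lines here are made up for demonstration): any line containing at least one Latin letter passes, while blank lines and pure punctuation are skipped.

```python
import re

# Sample lines: a real tweet, a bare newline, whitespace, punctuation.
lines = ["Great catch by Rahul!\n", "\n", "   \n", "!!! 123 !!!\n"]

# Keep only lines with at least one alphabetic character.
kept = [line for line in lines if re.search('[a-zA-Z]', line)]
print(kept)  # only the first line survives
```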

A few loopholes:

This module streams tweets faster than Watson can process them. For example, if the match ends at hour X, the processing of the tweets generated (assuming the stream is switched off as soon as the match ends) continues until roughly hour X+3.

The stream accepts English tweets only; however, Hindi written in Roman script also slips through, which is nothing but noise in my signal.

Some results:

Size of tweets saved.
Size of processed data (and it keeps growing).

Thanks for the read!

Code + Data.
