Merouane Benthameur | Pre-processing Twitter Data using Python

Pre-processing Twitter Data using Python

2017-12-22 | 10 min read

The main goal of collecting Twitter data is to make sense of it, whether to gain insights for a decision, to validate a theory, or simply to explore, as we are doing in this tutorial.

The data we receive from Twitter comes as raw JSON, and the unstructured text inside it is hard to analyze.
Before we start our analysis, we need to do some work on the dataset. This step, known as data pre-processing or staging, consists of passing the data through several processes for cleansing, consolidation and validation. In this article, we will see some techniques to clean the text using Python.

Let’s first take a look at what a tweet looks like in JSON format:

					    
{
	"created_at": "Sun Dec 10 18:52:35 +0000 2017",
	"id": 939930864649154561,
	"id_str": "939930864649154561",
	"text": "@rossoneroanto Worst transfer market strategy of all time... WINGERS WINGERS WINGERS",
	"truncated": false,
	"entities": {
		"hashtags": [],
		"symbols": [],
		"user_mentions": [
			{
				"screen_name": "rossoneroanto",
				"name": "Cutrogol",
				"id": 1000589126,
				"id_str": "1000589126",
				"indices": [
					0,
					14
				]
			}
		],
		"urls": []
	},
	"metadata": {
		"iso_language_code": "en",
		"result_type": "recent"
	},
	"source": "Twitter for iPhone",
	"in_reply_to_status_id": 939930639649923072,
	"in_reply_to_status_id_str": "939930639649923072",
	"in_reply_to_user_id": 1000589126,
	"in_reply_to_user_id_str": "1000589126",
	"in_reply_to_screen_name": "rossoneroanto",
	"user": {
		"id": 709385918,
		"id_str": "709385918",
		"name": "ACMirabelli",
		"screen_name": "Safa_Jad",
		"location": "Montreal, Quebec",
		"description": "The sleeping giant has awoken \ud83d\udd34\u26ab\ufe0f #ForzaMilan",
		"url": null,
		"entities": {
			"description": {
				"urls": []
			}
		},
		"protected": false,
		"followers_count": 486,
		"friends_count": 171,
		"listed_count": 20,
		"created_at": "Sat Jul 21 18:33:03 +0000 2012",
		"favourites_count": 9149,
		"utc_offset": null,
		"time_zone": null,
		"geo_enabled": false,
		"verified": false,
		"statuses_count": 14006,
		"lang": "en",
		"contributors_enabled": false,
		"is_translator": false,
		"is_translation_enabled": false,
		"profile_background_color": "C0DEED",
		"profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
		"profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
		"profile_background_tile": false,
		"profile_image_url": "http://pbs.twimg.com/profile_images/878202978964058112/VDJUlFg6_normal.jpg",
		"profile_image_url_https": "https://pbs.twimg.com/profile_images/878202978964058112/VDJUlFg6_normal.jpg",
		"profile_banner_url": "https://pbs.twimg.com/profile_banners/709385918/1477185548",
		"profile_link_color": "1DA1F2",
		"profile_sidebar_border_color": "C0DEED",
		"profile_sidebar_fill_color": "DDEEF6",
		"profile_text_color": "333333",
		"profile_use_background_image": true,
		"has_extended_profile": false,
		"default_profile": true,
		"default_profile_image": false,
		"following": false,
		"follow_request_sent": false,
		"notifications": false,
		"translator_type": "none"
	},
	"geo": null,
	"coordinates": null,
	"place": null,
	"contributors": null,
	"is_quote_status": false,
	"retweet_count": 0,
	"favorite_count": 0,
	"favorited": false,
	"retweeted": false,
	"lang": "en"
}

As you can see from the tweet above, a single tweet carries a lot of data and metadata, which is good from an analytical point of view because it gives us information that allows some interesting analysis.

Some interesting key attributes:

text: the content of the tweet itself
created_at: creation date of the tweet
id: tweet unique identifier
user: the author’s full profile
followers_count: number of followers of the user (nested under user)
friends_count: number of accounts the user follows (nested under user)
lang: the language of the tweet (e.g. “en” for English)
retweet_count: number of retweets
entities: list of entities like URLs, @-mentions, hashtags and symbols
place, coordinates, geo: geo-location information if available
in_reply_to_user_id: user ID if the tweet is a reply to a specific user
in_reply_to_status_id: status ID if the tweet is a reply to a specific status
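
As a rough illustration, the attributes listed above can be read from a parsed tweet as follows. The sample is a trimmed copy of the tweet shown earlier; the summarize_tweet helper and the keys of the dictionary it returns are hypothetical names chosen for this sketch, not part of any Twitter library.

```python
import json
from datetime import datetime

# Trimmed copy of the sample tweet shown earlier (most keys omitted).
raw_tweet = """
{
  "created_at": "Sun Dec 10 18:52:35 +0000 2017",
  "id": 939930864649154561,
  "text": "@rossoneroanto Worst transfer market strategy of all time... WINGERS WINGERS WINGERS",
  "entities": {"hashtags": [], "user_mentions": [{"screen_name": "rossoneroanto"}], "urls": []},
  "in_reply_to_user_id": 1000589126,
  "user": {"screen_name": "Safa_Jad", "followers_count": 486, "friends_count": 171},
  "retweet_count": 0,
  "lang": "en"
}
"""

def summarize_tweet(tweet):
    """Hypothetical helper: pick out the attributes discussed above."""
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    return {
        "id": tweet["id"],
        "text": tweet["text"],
        # Timestamp format used in the sample: "Sun Dec 10 18:52:35 +0000 2017"
        "created_at": datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"),
        "author": user.get("screen_name"),
        "followers": user.get("followers_count"),  # nested under "user"
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "is_reply": tweet.get("in_reply_to_user_id") is not None,
        "lang": tweet.get("lang"),
    }

summary = summarize_tweet(json.loads(raw_tweet))
print(summary["author"], summary["followers"], summary["created_at"].year)
```

Using .get() for the optional keys keeps the helper from failing on tweets where a field is missing or null.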

Almost all of these attributes give us additional context for analyzing the data. However, you probably don’t need all of them to conduct an analysis; it depends on the question you’re trying to answer. For example, if you want to know who is the most followed user, or who’s discussing with whom, it’s clear that you need to focus on specific keys (user, followers_count, in_reply_to_user_id, in_reply_to_status_id…). When it comes to the text key (the content of the tweet itself), we can do a lot of interesting analysis, e.g. finding the most popular hashtag, word frequencies, sentiment analysis, etc. But as I mentioned before, text cleansing is a must prior to any analysis you want to conduct.
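
To make this concrete, here is a small sketch that answers three of those questions using only a handful of keys. The three tweets are made-up, heavily trimmed records used purely for illustration, and the variable names are assumptions of this sketch.

```python
from collections import Counter

# Made-up, heavily trimmed tweets: only the keys needed for the questions below.
tweets = [
    {"user": {"screen_name": "alice", "followers_count": 120},
     "in_reply_to_screen_name": "bob",
     "entities": {"hashtags": [{"text": "ForzaMilan"}]}},
    {"user": {"screen_name": "bob", "followers_count": 950},
     "in_reply_to_screen_name": None,
     "entities": {"hashtags": [{"text": "ForzaMilan"}, {"text": "SerieA"}]}},
    {"user": {"screen_name": "carol", "followers_count": 40},
     "in_reply_to_screen_name": "bob",
     "entities": {"hashtags": []}},
]

# Who is the most followed user?
most_followed = max(tweets, key=lambda t: t["user"]["followers_count"])["user"]["screen_name"]

# Who is replying to whom? (author, recipient) pairs, skipping non-replies.
replies = [(t["user"]["screen_name"], t["in_reply_to_screen_name"])
           for t in tweets if t["in_reply_to_screen_name"]]

# Which hashtag is the most popular? Lowercase so case variants are counted together.
hashtag_counts = Counter(h["text"].lower() for t in tweets for h in t["entities"]["hashtags"])

print(most_followed)                  # bob
print(replies)                        # [('alice', 'bob'), ('carol', 'bob')]
print(hashtag_counts.most_common(1))  # [('forzamilan', 2)]
```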

Cleaning Twitter text

Since we are interested in analysing the text of the tweet, we will start with one of the most important techniques: tokenisation, which means breaking the text down into words. The main idea is to split a big chunk of text into small pieces called tokens (words) to make it easier to analyze. In the Python world, there is a popular library for processing human language text called NLTK (Natural Language Toolkit).
Let’s see an example of tokenizing a tweet:

					
from nltk.tokenize import word_tokenize

# word_tokenize example
tweet = '@Merouane_Benth: This is just a tweet example! #NLTK :) http://www.twitter.com'
print(word_tokenize(tweet))

# result 
['@', 'Merouane_Benth', ':', 'This', 'is', 'just', 'a', 'tweet', 'example', '!', '#', 'NLTK', ':', ')', 'http', ':', '//www.twitter.com']

The word_tokenize function from NLTK makes the job very easy. It splits the text into words (tokens). However, if you inspect the output, you will notice some odd results: the emoticon, the @mention, the #hashtag and the URL are each split into multiple tokens. The reason is that Twitter data poses some challenges because of the nature of its language, and word_tokenize doesn’t capture these aspects out of the box. The good news is that NLTK has a tokenizer designed for Twitter data. Let’s run the same example with the TweetTokenizer class:

					
from nltk.tokenize import TweetTokenizer

tweet = '@Merouane_Benth: This is just a tweet example! #NLTK :) http://www.twitter.com'
# TweetTokenizer example
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(tweet)
print(tokens)

# result 
['@Merouane_Benth', ':', 'This', 'is', 'just', 'a', 'tweet', 'example', '!', '#NLTK', ':)', 'http://www.twitter.com']

As you can see, @mentions, #hashtags, emoticons and URLs are now kept as individual tokens. We’ve seen how to tokenize tweets, but we still have some useless punctuation to remove, and we need to do so without damaging the @mentions, #hashtags, emoticons and URLs. Let’s see how to do it:

					
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re


# Patterns for the tokens that must survive punctuation removal
emoticons_str = r"""
	(?:
		[<>]?
		[:;=8]                     # eyes
		[\-o\*\']?                 # optional nose
		[\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
		|
		[\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
		[\-o\*\']?                 # optional nose
		[:;=8]                     # eyes
		[<>]?
		|
		<3                         # heart
	)"""
regex_str = [
	emoticons_str,
	r'(?:@[\w_]+)', # @mentions
	r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hashtags
	r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+'] # URLs

# re.VERBOSE allows the whitespace and comments inside emoticons_str
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)

tweet_tokens_list = ['@Merouane_Benth', ':', 'This', 'is', 'just', 'a', 'tweet', 'example', '!', '#NLTK', ':)', 'http://www.twitter.com']
clean_token_list = []

for element in tweet_tokens_list:
	# keep emoticons, @mentions, #hashtags and URLs untouched
	if tokens_re.search(element):
		clean_token_list.append(element)
	# otherwise keep the token only if it does not start with punctuation
	elif not re.match(r'[^\w\s]', element):
		clean_token_list.append(element)

print(clean_token_list)

# result
['@Merouane_Benth', 'This', 'is', 'just', 'a', 'tweet', 'example', '#NLTK', ':)', 'http://www.twitter.com']

In my code I’ve used regex patterns similar to those in the source code of TweetTokenizer, instead of applying a simple regex filter like r'[^\w\s]' directly. The reason behind this tweak is to keep the symbols of the @mentions, #hashtags and emoticons; otherwise the filter would remove them, and we don’t want that.
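
To see what goes wrong with the simple filter, the sketch below applies r'[^\w\s]' directly to the tokens from the example. It uses only the standard re module, and the variable names are chosen for this illustration.

```python
import re

tokens = ['@Merouane_Benth', ':', 'This', 'is', 'just', 'a', 'tweet',
          'example', '!', '#NLTK', ':)', 'http://www.twitter.com']

# Naive approach: delete every character that is neither a word character nor whitespace.
naive = [re.sub(r'[^\w\s]', '', token) for token in tokens]
# Drop the tokens that became empty (pure punctuation such as ':' and '!').
naive = [token for token in naive if token]

print(naive)
# ['Merouane_Benth', 'This', 'is', 'just', 'a', 'tweet', 'example', 'NLTK', 'httpwwwtwittercom']
```

The mention and the hashtag lose their markers, the emoticon disappears, and the URL is mangled, which is why the special patterns are checked before any punctuation is removed.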

Stop words are words that do not contribute to the deeper meaning of the phrase, e.g. “the”, “a”, and “is”. For some applications like text classification, it may make sense to remove them. The NLTK library provides a list of commonly agreed upon stop words for a variety of languages.
You can load the list this way:

					
# Loading stop words (run nltk.download('stopwords') once if the corpus is missing)
from nltk.corpus import stopwords

stop_words_english = stopwords.words('english')
stop_words_french = stopwords.words('french')
print(' English stop words:', stop_words_english, '\n', 'French stop words:', stop_words_french)

#Result
English stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 
 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 
 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 
 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 
 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 
 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 
 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there',
 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 
 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 
 "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
 "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 
 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"] 

French stop words: ['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je',
 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 
 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 
 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 
 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 
 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 
 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 
 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 
 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent']

# Filtering out stop words (tokens are lowercased to match the lowercase lists)
words = [w.lower() for w in clean_token_list if w.lower() not in stop_words_english]

I’ve loaded the lists for both French and English. Notice that all the stop words are lowercase, and that contractions such as "don't" keep their apostrophe. To filter our tokens against these lists, we need to make sure the tokens are prepared the same way, which is why the code converts them to lowercase before filtering out the stop words. At this point we have covered some basics of cleaning text, and I’ll stop the process here. You can now apply what we’ve seen so far to the tweets you previously saved.
You can find the code in my Git repository.
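
The filtering step can also be wrapped into a small reusable function. The sketch below is self-contained so it runs without NLTK: SAMPLE_STOP_WORDS is a tiny hand-picked subset standing in for NLTK's English list, and remove_stop_words is a hypothetical helper name.

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer.
SAMPLE_STOP_WORDS = {'this', 'is', 'just', 'a', 'the', 'and'}

def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase each token and drop it if it is a stop word."""
    # A set makes each membership test fast, even with a long stop word list.
    stop_words = set(stop_words)
    return [t.lower() for t in tokens if t.lower() not in stop_words]

clean_token_list = ['@Merouane_Benth', 'This', 'is', 'just', 'a', 'tweet',
                    'example', '#NLTK', ':)', 'http://www.twitter.com']
print(remove_stop_words(clean_token_list))
# ['@merouane_benth', 'tweet', 'example', '#nltk', ':)', 'http://www.twitter.com']
```

Note that lowercasing also changes mentions, hashtags and URLs. If their original casing matters for your analysis, one option is to compare the lowercased token against the list but keep the original token in the output.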

Summary

Additional text cleaning considerations:

What we’ve seen in this blog post is just the basics of cleaning raw text; in practice, this process can get a lot more complex. Here is a short list of additional considerations when cleaning text:

Stemming: reducing a token (word) to its root, e.g. ‘cleaning’ becomes ‘clean’
Locate and correct common typos and misspellings
Handling numbers and dates inside the text
Extracting text from markup like HTML or XML, or from PDF and other document formats
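
Two of these considerations can be sketched with the standard library alone. The example below strips HTML markup and replaces numbers with a placeholder; strip_markup, normalize_numbers, the <NUM> placeholder and the sample string are all assumptions made up for this illustration, and real data will likely need more careful rules.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_markup(html_text):
    """Hypothetical helper: return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.parts)

def normalize_numbers(text, placeholder='<NUM>'):
    """Hypothetical helper: replace standalone integers and decimals with a placeholder."""
    return re.sub(r'\b\d+(?:[.,]\d+)*\b', placeholder, text)

# Made-up sample fragment
sample = '<p>Milan signed <b>3</b> wingers for 42.5 million &amp; sold 1</p>'
text = strip_markup(sample)
print(text)                     # Milan signed 3 wingers for 42.5 million & sold 1
print(normalize_numbers(text))  # Milan signed <NUM> wingers for <NUM> million & sold <NUM>
```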

In this blog post we’ve seen the structure of a tweet and demonstrated how to clean its text, including some basic techniques for dealing with the particular nature of Twitter data. After these pre-processing steps, our dataset is ready for some interesting analysis, which I’ll cover in the next blog posts.