Twitter Profile Descriptions Word Cloud from 2017 Dump using Python 3

Just a quick post to display a word cloud I created as a demo for my Adv Programming students using the Python WordCloud module. I was surprised to find that “Founder” was the top occurring word in the profile descriptions. What does this tell us?

I first had to write a PHP script to grab all of the tweet objects that had been captured from the Twitter Search API. I then stored all the unique author profiles in a separate table from the tweets, which gave me approximately 3.4 million unique Twitter users who tweeted about science. Of these 3.4 million, approximately 2.8 million users had at least some characters in their Twitter profile description field.

To return results from the description field in my MySQL table of author profiles, I used the following query, which strips hard returns and tabs from the profile descriptions:

SELECT REPLACE(REPLACE(REPLACE(TRIM(`description`), '\r', ' '), '\n', ' '), '\t', ' ') 
FROM `profiles` 
WHERE TRIM(`description`)!='';
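If you'd rather do the same cleanup on the Python side, the logic of that query can be sketched like this (the clean_description helper and the sample strings are mine for illustration, not from the original script):

```python
def clean_description(description):
    """Mimic the SQL: TRIM the text, then replace carriage returns,
    newlines, and tabs with single spaces."""
    cleaned = description.strip()
    for ch in ('\r', '\n', '\t'):
        cleaned = cleaned.replace(ch, ' ')
    return cleaned

# made-up sample profile descriptions
descriptions = ["Founder of things\r\nCoffee addict", "\tScience fan\t", "   "]
# keep only non-empty descriptions, like the WHERE TRIM(`description`)!='' clause
cleaned = [clean_description(d) for d in descriptions if d.strip()]
```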

Next, we have the actual Python 3 code I used to create the WordCloud from the 2.8 million user descriptions. You’ll note I added a few extra terms to the STOPWORDS list because I ran this multiple times and found additional terms I wanted removed from the final version.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Feb 10 16:19:56 2019
@author: tdbowman
"""
import io
import csv
import numpy as np
from wordcloud import WordCloud, STOPWORDS
from os import path
from PIL import Image

# current directory
currdir = path.dirname(__file__)

def create_wordcloud(text):

    # use cloud.png as mask for word cloud
    mask = np.array(Image.open(path.join(currdir, "cloud.png")))
    # create set of stopwords
    stop_words = ["https", "co", "RT", "del", "http",
                  "tweet", "tweets", "twitter", "en", "el", "us", "et",
                  "lo", "will", "ex", "de", "la", "rts"] + list(STOPWORDS)
    # create wordcloud object
    wc = WordCloud(background_color="white",
                   mask=mask,
                   stopwords=set(stop_words))
    # generate wordcloud
    wc.generate(text)
    # save wordcloud
    wc.to_file(path.join(currdir, "wc.png"))

if __name__ == "__main__":

    # Grab text from file and join it into one big string
    with io.open('all_descriptions.csv', 'r', encoding='utf-8') as f:
        reader = csv.reader(x.replace('\0', '') for x in f)
        text = ','.join(row[0] for row in reader if row)
    # generate wordcloud
    create_wordcloud(text)

It could use some cleanup and the image could be higher resolution, but it’s a good example for the students of how to use Python to create a word cloud.

PHP + jQuery + Twitter API => Performing a Basic Search Using OAuth and the REST API v1.1


To get started, you’ll need access to PHP 5.x on a web server.  I’m currently working on an Apache server with the default installation of PHP 5.3.  This should be available on most hosting services, especially those setups featuring open source software (as opposed to Microsoft’s .NET framework). In addition, I’m using a Postgres database on the back end to store the information I’m scraping and extracting (you can just as easily use MySQL).  If you want to run this code on your local machine, download WAMP, MAMP, XAMPP, or another flavor of server/language/database package.

TWITTER API, OAuth, & PHP twitteroauth Library

First, familiarize yourself with the Twitter Developer Website.  If you want to skip right to the API, check out the REST API v1.1 documentation.  To test a search, go to the Twitter Search page and type in a search term; try typing #BigData in the query field to search for the BigData hashtag.  You’ll be presented with a GUI version of the results.  If you want to do the same thing programmatically and return data in JSON format, you’ll need to use the REST API search query… and you must be authenticated to do this.  To create credentials for the search query, you must create an OAuth profile on the Twitter Developer Website, from which you can retrieve your ACCESS TOKEN and ACCESS SECRET.  Luckily, we can use the PHP twitteroauth library to connect to Twitter’s API and start writing code.  At this point you’ll need to set up your OAuth profile with Twitter, download the PHP twitteroauth library, edit the proper files to add your TOKEN and SECRET, and ensure all the files are on your web server in the appropriate place.


I’m assuming you have set up the OAuth profile on Twitter and that you’ve downloaded the PHP twitteroauth library.  I like to create an “app_tokens.php” file containing my CONSUMER_KEY, CONSUMER_SECRET, USER_TOKEN, and USER_SECRET information assigned to variables; this way I can include it anywhere I need it.
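If you were doing the same thing in Python rather than PHP, the analogue of that pattern is a tiny module of constants; the placeholder values below are obviously not real credentials:

```python
# app_tokens.py -- keep this file out of version control
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
USER_TOKEN = "your-user-token"
USER_SECRET = "your-user-secret"
```

Any script that needs to authenticate can then do `from app_tokens import CONSUMER_KEY, CONSUMER_SECRET, USER_TOKEN, USER_SECRET`.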

Now that we have our authorization credentials, we are ready to use tmhOAuth as the middleman to send a request to Twitter’s API.  Let’s say we want to perform the same search we did above, but this time we don’t want a GUI version of the data… instead we want JSON data back so that we can easily add it to a database.  We need to find out what command the Twitter API expects and what value to pass it; for our example, we’ll use the Twitter API’s search query.  We can pass it several different parameters, but we’ll start with the most basic and use the q query parameter.  We want to pass the parameter the value “#BigData”, but we need to convert the pound sign (#) to its URL-encoded version, %23, giving us q=%23BigData.
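If you want to check the encoding for yourself, Python’s standard library will happily build that query string (this is a sketch, not the post’s original PHP; the ‘recent’ value for result_type is my assumption, and count=30 follows the per-request figure quoted later):

```python
from urllib.parse import urlencode, quote

# '#BigData' must be percent-encoded: '#' becomes %23
params = {'q': '#BigData', 'count': 30, 'result_type': 'recent'}
query_string = urlencode(params)
print(query_string)
```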

This request will use the REST API v1.1 and return JSON data.  We are passing the search a parameter of q=>’%23BigData’, which translates to searching for the hashtag “#BigData” (without the quotes).  We are also passing the ‘count’ and ‘result_type’ parameters (for more info on the other parameters, see the documentation).  Lastly, we need to get the response back from Twitter and output it; if we have an error, we need to output that too.  Using the twitteroauth library’s examples, I wrote code to check the response code and print either the JSON data or the error.
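The shape of that response handling, sketched in Python rather than the post’s PHP (handle_response and the sample payload are illustrative, though “statuses” is the actual top-level key the v1.1 search endpoint uses):

```python
import json

def handle_response(response_code, response_body):
    """Return the parsed JSON on success, or raise on an API error."""
    if response_code != 200:
        raise RuntimeError("Twitter API error, HTTP %s" % response_code)
    return json.loads(response_body)

# illustrative payload shaped like a v1.1 search result (not real data)
sample = '{"statuses": [{"text": "Playing with #bigdata today"}]}'
data = handle_response(200, sample)
```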

That code receives two pieces of data from the Twitter API: the response code and the response data.  The response code indicates whether we have errors.  The response data holds the JSON data that we received from the query.  The first result of my JSON data (yours won’t contain the same information, but it will have a similar structure) included the full tweet object.

In the returned JSON data, you’ll see a key titled “text” and the value assigned to it; this is the content of the tweet, and you can clearly see that it contains the hashtag #bigdata.  So we now know the code works and we can programmatically query Twitter.  When you examine the Twitter API documentation, you will find that we can make 450 requests every 15 minutes; this will of course not get us ALL the tweets using the hashtag “#bigdata”, but at 30 results per request it gives us a useful sample of 13,500 tweets every 15 minutes.


Web Scraping Using PHP and jQuery

I was asked by a friend to write code that would scrape a DLP website’s content of letters to use in an academic study (the website’s copyright allows for the non-commercial use of the data).  I’d not tried this before and was excited by the challenge, especially considering I’m becoming more involved in “big data” studies and I need to understand how one might go about developing web scraping programs. I started with the programming languages I know best:  PHP & jQuery.  And yes, I know that there are better programming languages available for web scraping.  I’ve used PERL, Python, JAVA, and other programming languages in the past, but I’m currently much more versed in PHP than anything else! If I had been unable to quickly build this in PHP, then of course I’d have turned to Python or PERL; but in the end I was able to write some code and it worked. I’m happy with the results and so was my friend.

First, I had to figure out what PHP had under the hood that would allow me to load URLs and retrieve information. I did some searching via Google and figured out the best option was to use the cURL library.  The cURL lib allows one to connect to a variety of servers and protocols and was perfect for my needs; don’t forget to check your PHP install to see if you have the cURL library installed and activated.  A quick search on cURL and PHP turned up a custom get_data() function that I could edit to suit my needs.
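For comparison, here’s roughly what that fetch helper looks like in Python with only the standard library (a sketch of the idea; the original helper used PHP’s cURL functions):

```python
from urllib.request import urlopen

def get_data(url, timeout=30):
    """Fetch a URL and return its body as text -- roughly what the
    cURL-based get_data() helper did in the PHP version."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read()
    return body.decode('utf-8', errors='replace')
```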

Next I needed a way to grab specific DOM elements from the pages being scraped; I needed to find a <span> tag with a specific attribute whose value contained both a function name and a URL.  I am very familiar with the jQuery and CSS3 selector syntax that lets one find specific DOM elements using patterns.  Lo and behold, I discovered that someone had developed a PHP class to do similar things, named “simplehtmldom”.  I downloaded simplehtmldom from SourceForge, read the documentation, and created code that would find my elements and return the URLs I needed.
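In Python, the standard library’s html.parser can play the role simplehtmldom played here; the span attribute name (‘onclick’) and the loadLetter value below are made up for illustration:

```python
from html.parser import HTMLParser

class SpanAttrFinder(HTMLParser):
    """Collect the value of a given attribute from every <span> tag,
    similar to what simplehtmldom's selectors did in the PHP version."""
    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            for name, value in attrs:
                if name == self.attr_name:
                    self.values.append(value)

finder = SpanAttrFinder('onclick')
finder.feed('<p><span onclick="loadLetter(/letters/42.html)">Open</span></p>')
```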

Now I have the actual URLs, in an array, from which I want to get a copy of the data.  I need to loop through the $links array and use cURL once again to get the data.  While looping through the array, I check whether the URL points to an HTML file or a PDF file (my only two options in this case).  If it is an HTML file, I use the get_data() function to grab the data and use PHP file commands to write/create a file in a local directory to store it. If it’s a PDF, I need different cURL commands to grab the data and create a PDF file locally.
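The same loop, sketched in Python (save_links and the letter_NNN file naming are mine; the original did this with PHP file commands and cURL):

```python
import os
from urllib.request import urlopen

def save_links(links, out_dir):
    """Loop over the scraped URLs, writing HTML as text and PDFs as
    raw bytes -- the same branch the PHP version made."""
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(links):
        if url.lower().endswith('.pdf'):
            # PDFs are binary: write raw bytes
            filename = os.path.join(out_dir, 'letter_%03d.pdf' % i)
            with urlopen(url) as resp, open(filename, 'wb') as out:
                out.write(resp.read())
        else:
            # HTML: decode and write as UTF-8 text
            filename = os.path.join(out_dir, 'letter_%03d.html' % i)
            with urlopen(url) as resp, open(filename, 'w', encoding='utf-8') as out:
                out.write(resp.read().decode('utf-8', errors='replace'))
```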

That’s it for the scraping engine!

Now we need a way to pass a start and end value (in increments of 50, maxing out at 4000) to the PHP scraping engine.  I know there are many ways to tackle this, and I specifically considered executing the code from a terminal, in a CRON job, or from a browser.  I again went with my strengths and chose to use an AJAX call via jQuery.  I created another file and included the most recent jQuery engine.  I then created a recursive jQuery function that would make an AJAX POST call to the PHP engine, pause for 5 seconds, and then do it again. The function accepts four parameters: url, start, increment, and end.
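The pacing logic of that recursive function, independent of jQuery, is easy to sketch (batch_ranges is my name; a real driver would POST each window to the PHP engine and sleep 5 seconds between calls):

```python
def batch_ranges(start, increment, end):
    """Yield (start, stop) windows the way the recursive jQuery function
    stepped through the results, e.g. increments of 50 up to 4000."""
    while start < end:
        yield start, min(start + increment, end)
        start += increment
        # a real driver would pause here, e.g. time.sleep(5)
```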

Put this all together and we have a basic web scraper that does a satisfactory job of iterating through search results and grabbing copies of HTML and PDF files and storing them locally.  I was excited to get it finished using my familiar PHP and jQuery languages and it was a nice exercise to think this problem through logically.  Again, I’m SURE there are better, more efficient ways of doing this… but I’m happy and my friend is happy.

Fun times.