Harvesting Images from a Site for Study

Recently I needed to write code that would allow me to harvest images from a site where the images were displayed 10 per page over n number of pages.  I wanted to set it up so that I could start it and let it run over time, harvesting images as it went.  This immediately meant I’d be working with PHP and jQuery using AJAX.  I’ve written another post titled Web Scraping Using PHP and jQuery about this type of AJAX script, and I needed to use what I’d learned there to implement this new scraping engine.

The reason I’m writing this code is so that we can start a few studies on selfies.

I ended up using a bookmarklet, which is a JavaScript snippet that you can add to a browser as a bookmark. We had assistants visit the pages where the images were stored and click on the bookmark to harvest images and metadata associated with each image. While it is a bit cumbersome, it was the easiest and quickest way to start collecting the images for our project. I wrote the code such that the JavaScript would speak to the PHP code and the PHP code would handle all the heavy lifting (saving image and scraping page for metadata). The images and text files were automatically saved to a Dropbox folder using the Dropbox API.
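For context, a bookmarklet is just a javascript: URL saved as a bookmark.  The actual harvesting code isn’t reproduced in this post, but a minimal sketch of the pattern might look like the following; the harvester.js file name and example.com domain are placeholders, not the real script.

javascript:(function () {
  // Inject the harvesting script into the current page; that script makes the
  // AJAX calls to the PHP back end, which saves the images and metadata.
  // "harvester.js" and "example.com" are placeholder names for illustration.
  var s = document.createElement('script');
  s.src = 'https://example.com/harvester.js?nocache=' + new Date().getTime();
  document.body.appendChild(s);
})();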

PHP + jQuery + Twitter API => Performing a Basic Search Using OAuth and the REST API v1.1

INTRO

To get started, you’ll need access to PHP 5.x on a web server.  I’m currently working on an Apache server with the default installation of PHP 5.3.  This should be available on most hosting services, especially those setups featuring open source software (as opposed to Microsoft’s .NET framework). In addition, I’m using a Postgres database on the back end to store the information I’m scraping and extracting (you can just as easily use MySQL).  If you want to run this code on your local machine, download WAMP, MAMP, XAMPP, or another flavor of server/language/database package.

TWITTER API, OAuth, & PHP twitteroauth Library

First, familiarize yourself with the Twitter Developer Website.  If you want to skip right to the API, check out the REST API v1.1 documentation.  To test a search, go to the Twitter Search page and type in a search term; try typing #BigData in the query field to search for the BigData hashtag.  You’ll be presented with a GUI version of the results.  If you want to do the same thing programmatically and get the data back in JSON format, you’ll need to use the REST API search query… and you must be authenticated to do so.  To create credentials for the search query, you must create an OAuth profile; visit https://dev.twitter.com/docs/auth/tokens-devtwittercom to retrieve your ACCESS TOKEN and ACCESS SECRET.  Luckily, we can use the PHP twitteroauth library to connect to Twitter’s API and start writing code (here’s an example of the code you’ll need:  https://dev.twitter.com/docs/auth/oauth/single-user-with-examples#php).  At this point you’ll need to set up your OAuth profile with Twitter, download the PHP twitteroauth library, edit the proper files to add your TOKEN and SECRET, and make sure all the files are in the appropriate place on your web server.

PERFORMING A SEARCH & RETRIEVING JSON DATA

I’m assuming you have set up the OAuth profile on Twitter and that you’ve downloaded the PHP twitteroauth library.  I like to create an “app_tokens.php” file containing my CONSUMER_KEY, CONSUMER_SECRET, USER_TOKEN, and USER_SECRET information assigned to variables; this way I can include it anywhere I need it.
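For reference, a minimal app_tokens.php might look something like this; the variable names are my own, and the values are placeholders you replace with the keys from your Twitter OAuth profile.

<?php
// app_tokens.php -- credentials from your Twitter app's OAuth profile.
// Variable names here are illustrative; use whatever your code expects.
$consumer_key    = 'YOUR_CONSUMER_KEY';
$consumer_secret = 'YOUR_CONSUMER_SECRET';
$user_token      = 'YOUR_ACCESS_TOKEN';
$user_secret     = 'YOUR_ACCESS_TOKEN_SECRET';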

Now that we have our authorization credentials, we are ready to use tmhOAuth as the middleman to send a request to Twitter’s API.  Let’s say we want to perform the same search we did above, but this time we don’t want a GUI version of the data… instead we want JSON data back so that we can easily add it to a database.  We need to find out what endpoint the Twitter API expects and pass it a value; for our example, the Twitter API search query is simply https://api.twitter.com/1.1/search/tweets.json.  We can pass it several different parameters, but we’ll start with the most basic and use the q query parameter.  We want to pass that parameter the value “#BigData”, but we need to convert the pound sign (#) to its URL-encoded version => %23… Our code then looks like this:
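The original snippet isn’t shown here, but a sketch of the request using tmhOAuth would look roughly like this (app_tokens.php is the credentials file described above; the count and result_type values are just illustrative choices):

<?php
require 'app_tokens.php';
require 'tmhOAuth.php';

// Build the OAuth client with the credentials from app_tokens.php.
$tmhOAuth = new tmhOAuth(array(
    'consumer_key'    => $consumer_key,
    'consumer_secret' => $consumer_secret,
    'user_token'      => $user_token,
    'user_secret'     => $user_secret
));

// Ask the v1.1 search endpoint for tweets containing #BigData (%23 == #).
// request() returns the HTTP response code.
$code = $tmhOAuth->request('GET', $tmhOAuth->url('1.1/search/tweets'), array(
    'q'           => '%23BigData',
    'count'       => 30,        // illustrative value
    'result_type' => 'recent'   // illustrative value
));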

This request will use the REST API v1.1 and return JSON data.  We are passing the search a parameter of q=>’%23BigData’, which translates to searching for the hashtag “#BigData” (without the quotes).  We are also passing the ‘count’ and ‘result_type’ parameters (for more info on the other parameters, see the documentation).  Lastly, we need to get the response back from Twitter and output it; if we have an error, we need to output that too.  Based on the twitteroauth library’s examples, I know I need the following code:
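Continuing the sketch above (again an approximation rather than the original code), the response handling follows the pattern from the library’s examples:

// $code is the HTTP status returned by request() in the sketch above.
if ($code == 200) {
    // Success: output the raw JSON Twitter sent back.
    echo $tmhOAuth->response['response'];
} else {
    // Something went wrong: output the error code and whatever Twitter returned.
    echo 'Error ' . $code . ': ' . $tmhOAuth->response['response'];
}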

The above code receives two pieces of data from the Twitter API:  the response code and the response data.  The response code indicates whether we have errors.  The response data holds the JSON data that we received from the query.  The first result of my JSON data (yours won’t contain the same information, but it will have a similar structure) looks like this:
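My actual results aren’t reproduced here, but abridged and with placeholder values, a single status in the v1.1 search response has roughly this shape:

{
  "statuses": [
    {
      "created_at": "Mon Jan 01 00:00:00 +0000 2013",
      "id_str": "000000000000000000",
      "text": "Placeholder tweet text mentioning the #bigdata hashtag ...",
      "user": {
        "id_str": "00000000",
        "screen_name": "placeholder_user",
        ...
      },
      ...
    },
    ...
  ],
  "search_metadata": { ... }
}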

If you look at the JSON data above, you’ll see a key titled “text” and the value assigned to it; this is the content of the tweet, and you can clearly see that it contains the hashtag #bigdata.  So we now know the code works and we can programmatically query Twitter.  When you examine the Twitter API documentation you will find that we can make 450 requests every 15 minutes; this will of course not get us ALL the tweets using the hashtag “#bigdata”, but at 30 results per request it will give us a useful sample of up to 13,500 tweets every 15 minutes.

Cheers.

Web Scraping Using PHP and jQuery

I was asked by a friend to write code that would scrape letters from a DLP website for use in an academic study (the website’s copyright allows for the non-commercial use of the data).  I’d not tried this before and was excited by the challenge, especially considering I’m becoming more involved in “big data” studies and I need to understand how one might go about developing web scraping programs.  I started with the programming languages I know best:  PHP & jQuery.  And yes, I know that there are better programming languages available for writing web scraping code.  I’ve used PERL, Python, JAVA, and other programming languages in the past, but I’m currently much more versed in PHP than anything else!  If I had been unable to build this quickly in PHP, then of course I’d have turned to Python or PERL; but in the end I was able to write some code and it worked.  I’m happy with the results and so was my friend.

First, I had to figure out what PHP had under the hood that would allow me to load URLs and retrieve information. I did some searching via Google and figured out the best option was to use the cURL library (http://php.net/manual/en/book.curl.php).  The cURL lib allows one to connect to a variety of servers and protocols and was perfect for my needs; don’t forget to check your PHP install to see if you have the cURL library installed and activated.  I did a quick search on cURL and PHP and came across http://www.digimantra.com/technology/php/get-data-from-a-url-using-curl-php/ where I found a custom function that I thought I could edit to suit my needs:
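The helper I ended up with was essentially the following; the timeout and other cURL options can be tweaked to suit.

<?php
// Fetch the contents of a URL and return it as a string.
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the data instead of echoing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}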

Next I needed a way to grab specific DOM elements from the pages being scraped; I needed to find a <span> tag that had a specific attribute containing a value that was both a function name and a URL.  I am very familiar with the jQuery and CSS3 selector syntax that allows one to find specific DOM elements using patterns.  Lo and behold, I discovered that someone had developed a PHP class to do similar things, named “simplehtmldom” (http://sourceforge.net/projects/simplehtmldom/).  I downloaded simplehtmldom from SourceForge, read the documentation, and created code that would find my elements and return the URLs I needed.
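The site-specific selector isn’t shown in this post, so the sketch below pretends the function name and URL live in an onclick attribute on the <span>; adjust the selector and the regular expression to whatever the real markup uses.  $page_url stands in for the search-results page being scraped.

<?php
include 'simple_html_dom.php';

// Fetch the results page with the cURL helper above, then parse it.
$html  = str_get_html(get_data($page_url));
$links = array();

// Hypothetical selector: grab every <span> carrying an onclick attribute
// and pull the URL out of the attribute's value.
foreach ($html->find('span[onclick]') as $span) {
    if (preg_match('/https?:\/\/[^\'")]+/', $span->onclick, $match)) {
        $links[] = $match[0];
    }
}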

Now I have an array of the actual URLs whose data I want to copy.  I need to loop through the $links array and use cURL once again to get the data.  While I’m looping through the array I need to check whether the URL is pointing to an HTML file or a PDF file (my only two options in this case).  If it is an HTML file, I use the get_data() function to grab the data and PHP file commands to write a file in a local directory to store it.  If it’s a PDF, I need to use different cURL commands to grab the data and create a PDF file locally.
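A sketch of that loop, reusing get_data() from above; the downloads/ directory and the politeness delay are my own choices.

<?php
foreach ($links as $url) {
    // Use the last segment of the URL's path as the local file name.
    $filename = 'downloads/' . basename(parse_url($url, PHP_URL_PATH));

    if (strtolower(substr($url, -4)) === '.pdf') {
        // PDFs get slightly different cURL handling: stream the binary
        // response straight into a local file handle.
        $fp = fopen($filename, 'wb');
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FILE, $fp);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_exec($ch);
        curl_close($ch);
        fclose($fp);
    } else {
        // HTML pages go through get_data() and are written out as text.
        file_put_contents($filename, get_data($url));
    }

    sleep(1); // be polite to the server between requests
}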

That’s it for the scraping engine!

Now we need a way to pass start and end values (in increments of 50, maxing out at 4000) to the PHP scraping engine.  I know there are many ways to tackle this, and I specifically considered executing the code from a terminal, in a CRON job, or from a browser.  I again went with my strengths and chose to use an AJAX call via jQuery.  I created another file and included the most recent jQuery engine.  I then created a recursive jQuery function that would make an AJAX POST call to the PHP engine, pause for 5 seconds, and then do it again.  The function accepts four parameters: url, start, increment, and end.
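A sketch of that recursive caller is below; scraper.php and the POST field names are placeholders for whatever the real PHP engine expects.

// Recursively POST to the PHP engine in batches, pausing 5 seconds between calls.
function runScraper(url, start, increment, end) {
    if (start > end) { return; } // finished the whole range

    $.post(url, { start: start, increment: increment }, function (response) {
        // Wait 5 seconds, then call ourselves again for the next batch.
        setTimeout(function () {
            runScraper(url, start + increment, increment, end);
        }, 5000);
    });
}

// Kick it off: batches of 50, up to 4000.
runScraper('scraper.php', 0, 50, 4000);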

Put this all together and we have a basic web scraper that does a satisfactory job of iterating through search results and grabbing copies of HTML and PDF files and storing them locally.  I was excited to get it finished using my familiar PHP and jQuery languages and it was a nice exercise to think this problem through logically.  Again, I’m SURE there are better, more efficient ways of doing this… but I’m happy and my friend is happy.

Fun times.