Web Scraping Using PHP and jQuery

I was asked by a friend to write code that would scrape a DLP website's content of letters for use in an academic study (the website's copyright allows for non-commercial use of the data). I'd not tried this before and was excited by the challenge, especially since I'm becoming more involved in "big data" studies and need to understand how one might go about developing web scraping programs. I started with the programming languages I know best: PHP and jQuery. And yes, I know there are better languages available for web scraping. I've used Perl, Python, Java, and other languages in the past, but I'm currently much more versed in PHP than anything else! If I had been unable to build this quickly in PHP, then of course I'd have turned to Python or Perl; but in the end I was able to write some code and it worked. I'm happy with the results, and so was my friend.

First, I had to figure out what PHP had under the hood that would allow me to load URLs and retrieve information. Some searching via Google suggested the best option was the cURL library (http://php.net/manual/en/book.curl.php). cURL lets one connect to a variety of servers and protocols and was perfect for my needs (don't forget to check your PHP install to confirm the cURL extension is installed and activated). A quick search on cURL and PHP led me to http://www.digimantra.com/technology/php/get-data-from-a-url-using-curl-php/, where I found a custom function I thought I could edit to suit my needs:
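The helper from that post boils down to something like the sketch below; the exact code there may differ, and the redirect-following and timeout options are additions of mine:

```php
<?php
// Fetch the contents of a URL with cURL and return it as a string.
// A sketch of the kind of get_data() helper described above.
function get_data($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
```

CURLOPT_RETURNTRANSFER is the important option here: without it, curl_exec() prints the response instead of handing it back to you.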

Next I needed a way to grab specific DOM elements from the pages being scraped; I had to find a <span> tag with a specific attribute whose value contained both a function name and a URL. I'm very familiar with the jQuery and CSS3 selector syntax that lets one find specific DOM elements using patterns. Lo and behold, I discovered that someone had developed a PHP class to do similar things, named "simplehtmldom" (http://sourceforge.net/projects/simplehtmldom/). I downloaded simplehtmldom from SourceForge, read the documentation, and wrote code that would find my elements and return the URLs I needed.
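As a sketch of that step: with simplehtmldom loaded (include 'simple_html_dom.php'), a jQuery-style selector such as $dom->find('span[onclick]') returns every <span> carrying that attribute. The attribute name ("onclick"), the function name, and the URL format below are all hypothetical stand-ins for whatever the real site used:

```php
<?php
// Pull a URL out of an attribute value like
//   "openLetter('http://example.org/letters/123.html')"
// The attribute format here is a made-up example, not the real site's.
function extract_url_from_attr($attr_value)
{
    if (preg_match("#'(https?://[^']+)'#", $attr_value, $m)) {
        return $m[1];
    }
    return null; // no URL found in this attribute
}

// Usage sketch (requires simplehtmldom):
// $dom = str_get_html($page_html);
// foreach ($dom->find('span[onclick]') as $span) {
//     $links[] = extract_url_from_attr($span->onclick);
// }
```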

Now I have the actual URLs I want to copy data from in an array. I need to loop through the $links array and use cURL once again to get the data. While looping through the array, I check whether each URL points to an HTML file or a PDF file (my only two options in this case). If it's an HTML file, I use the get_data() function to grab the data and PHP file commands to write a file in a local directory to store it. If it's a PDF, I need different cURL commands to grab the data and create a PDF file locally.
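The loop might look something like this sketch. The output directory name and the rule for telling HTML from PDF (by file extension) are assumptions on my part, and get_data() is the cURL helper from earlier:

```php
<?php
// Decide whether a scraped URL points at a PDF — here, by its extension.
// (An assumption; one could also check the Content-Type response header.)
function is_pdf_url($url)
{
    $path = parse_url($url, PHP_URL_PATH);
    return strtolower(pathinfo($path, PATHINFO_EXTENSION)) === 'pdf';
}

// Loop over the scraped links and save each one locally.
function save_links(array $links, $dir = 'scraped')
{
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);
    }
    foreach ($links as $url) {
        $name = basename(parse_url($url, PHP_URL_PATH));
        if (is_pdf_url($url)) {
            // PDF: stream the raw response bytes straight to a file on disk.
            $fp = fopen("$dir/$name", 'wb');
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_FILE, $fp); // write directly to $fp
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_exec($ch);
            curl_close($ch);
            fclose($fp);
        } else {
            // HTML: grab it as a string and write it out.
            file_put_contents("$dir/$name", get_data($url));
        }
    }
}
```

CURLOPT_FILE is the "different cURL command" for the PDF case: it streams the response into an open file handle, which avoids holding a large binary in memory as a string.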

That’s it for the scraping engine!

Now we need a way to pass a start and end value (in increments of 50, maxing out at 4000) to the PHP scraping engine. I know there are many ways to tackle this; I specifically considered executing the code from a terminal, in a cron job, or from a browser. I again went with my strengths and chose an AJAX call via jQuery. I created another file and included the most recent jQuery build. I then wrote a recursive jQuery function that makes an AJAX POST call to the PHP engine, pauses for 5 seconds, and then does it again. The function accepts four parameters: url, start, increment, and end.
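A sketch of that driver script is below. It assumes the page has loaded jQuery and that the PHP engine (here "engine.php", a hypothetical filename) reads the batch's "start" value from POST data:

```javascript
// Recursively POST one batch at a time to the PHP scraping engine:
// send the current start value, wait 5 seconds after the response,
// then recurse with start advanced by the increment, until start
// passes end (4000 in this case).
function scrapeRange(url, start, increment, end) {
  if (start > end) {
    return; // all batches done
  }
  $.post(url, { start: start }, function () {
    // Pause 5 seconds between batches to go easy on the server,
    // then kick off the next one.
    setTimeout(function () {
      scrapeRange(url, start + increment, increment, end);
    }, 5000);
  });
}

// Kick it off: batches of 50, up to 4000.
// scrapeRange('engine.php', 0, 50, 4000);
```

Because the recursive call sits inside the $.post() success callback, each batch waits for the previous one to finish before the 5-second pause starts, so requests never pile up on the server.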

Put this all together and we have a basic web scraper that does a satisfactory job of iterating through search results, grabbing copies of HTML and PDF files, and storing them locally. I was excited to get it finished using my familiar PHP and jQuery, and it was a nice exercise to think the problem through logically. Again, I'm SURE there are better, more efficient ways of doing this… but I'm happy and my friend is happy.

Fun times.