Scholarly Communication, ‘Altmetrics’, and Social Theory

In a recent book chapter (currently under review), my colleagues and I discuss the application of citation theories and social theories to popular media and social media metrics (so-called altmetrics) being collected by sites like Altmetric.com, ImpactStory.org, and Plum Analytics. These metrics are being used by libraries, publishers, universities, and other organizations to measure scholarly impact. It is an interesting area of research in that it helps us understand how scholarly work is consumed and disseminated in social media (and thus, presumably, by an audience outside the academy).

I come to this research having dabbled in many different areas of study, beginning with neuropsychology (as an undergraduate); human-computer interaction, information architecture, and web design (as a master’s student); and finally social informatics (at the beginning of my Ph.D.), digital humanities (middle of my Ph.D.), and scholarly communication and sociology (thesis work). I believe this indirect path has allowed me to consider research questions from different perspectives and to apply various theoretical and methodological lenses to the same problem (as is the case for many Information Science graduates). It’s also a path that has allowed me to contribute to the data collection side of this work, as I’ve written several programs that have assisted in the collection and storage of huge amounts of data (hundreds of millions of tweets, publication records, etc.) on scholarly (and other) activities. These experiences have allowed me to contribute to the book chapter mentioned above and to several articles and presentations, and they continue to help me understand scholarly communication in social and popular media venues.

I’m looking forward to finalizing my thesis and to continuing to examine these social and scholarly communication issues in my current research position at UdeM and in a permanent faculty position with future colleagues.

Harvesting Images from a Site for Study

Recently I needed to write code to harvest images from a site where the images were displayed 10 per page over n pages. I wanted to set it up so that I could start it and let it run over time, harvesting images as it went. This immediately meant I’d be working with PHP and jQuery using AJAX. I’ve written another post titled Web Scraping Using PHP and jQuery about this type of AJAX script, and I needed to apply what I’d learned there to implement this new scraping engine.

The reason I’m writing this code is so that we can start a few studies on selfies.

I ended up using a bookmarklet, which is a JavaScript snippet that you can add to a browser as a bookmark. We had assistants visit the pages where the images were displayed and click the bookmarklet to harvest the images and the metadata associated with each image. While a bit cumbersome, it was the easiest and quickest way to start collecting images for our project. I wrote the code such that the JavaScript would speak to the PHP code, and the PHP code would handle all the heavy lifting (saving the image and scraping the page for metadata). The images and text files were automatically saved to a Dropbox folder using the Dropbox API.
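To give a sense of the server side, here is a minimal, hypothetical sketch of the kind of PHP endpoint the bookmarklet could POST to; the parameter names, file paths, and extension handling are placeholders rather than the project’s actual code, and the Dropbox step is only noted in a comment.

    <?php
    // harvest.php -- hypothetical endpoint the bookmarklet POSTs to.
    // Expects: img_url (the image to download) and meta (metadata scraped from the page).
    $imgUrl = isset($_POST['img_url']) ? $_POST['img_url'] : '';
    $meta   = isset($_POST['meta'])    ? $_POST['meta']    : '';

    if ($imgUrl === '') {
        exit('No image URL supplied');
    }

    // Download the image itself with cURL.
    $ch = curl_init($imgUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $imgData = curl_exec($ch);
    curl_close($ch);

    // Save the image and its metadata side by side (assumes harvested/ exists and is writable).
    // In the actual project, the saved files were then pushed to Dropbox via its API.
    $base = 'harvested/' . md5($imgUrl);
    file_put_contents($base . '.jpg', $imgData);
    file_put_contents($base . '.txt', $meta);

    echo 'saved';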

PHP + jQuery + Twitter API => Performing a Basic Search Using OAuth and the REST API v1.1

INTRO

To get started, you’ll need access to PHP 5.x on a web server.  I’m currently working on an Apache server with the default installation of PHP 5.3.  This should be available on most hosting services, especially those setups featuring open source software (as opposed to Microsoft’s .NET framework). In addition, I’m using a Postgres database on the back end to store the information I’m scraping and extracting (you can just as easily use MySQL).  If you want to run this code on your local machine, download WAMP, MAMP, XAMPP, or another flavor of server/language/database package.

TWITTER API, OAuth, & PHP twitteroauth Library

First, familiarize yourself with the Twitter Developer Website.  If you want to skip right to the API, check out the REST API v1.1 documentation.  To test a search, go to the Twitter Search page and type in a search term; try typing #BigData in the query field to search for the BigData hashtag.  You’ll be presented with a GUI version of the results.  If you want to do the same thing programmatically and get the data back in JSON format, you’ll need to use the REST API search query… and you must be authenticated to do this.  To create credentials for the search query, you must create an OAuth profile; so visit https://dev.twitter.com/docs/auth/tokens-devtwittercom to retrieve your ACCESS TOKEN and ACCESS SECRET.  Luckily, we can use the PHP twitteroauth library to connect to Twitter’s API and start writing code (here’s an example of the code you’ll need:  https://dev.twitter.com/docs/auth/oauth/single-user-with-examples#php).  At this point you’ll need to set up your OAuth profile with Twitter, download the PHP twitteroauth library, add your TOKEN and SECRET to it, and make sure all of the files are in the appropriate place on your web server.

PERFORMING A SEARCH & RETRIEVING JSON DATA

I’m assuming you have set up the OAuth profile on Twitter and downloaded the PHP twitteroauth library.  I like to create an “app_tokens.php” file containing my CONSUMER_KEY, CONSUMER_SECRET, USER_TOKEN, and USER_SECRET information assigned to variables; this way I can include it anywhere I need it.
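A minimal app_tokens.php might look like the following; the variable names are just my convention, and the placeholder values come from your application’s page on the Twitter developer site:

    <?php
    // app_tokens.php -- keep this file out of any public repository.
    // Replace the placeholders with the keys and tokens from your OAuth profile.
    $consumer_key    = 'YOUR_CONSUMER_KEY';
    $consumer_secret = 'YOUR_CONSUMER_SECRET';
    $user_token      = 'YOUR_ACCESS_TOKEN';
    $user_secret     = 'YOUR_ACCESS_TOKEN_SECRET';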

Now that we have our authorization credentials, we are ready to use tmhOAuth as the middleman to send a request to Twitter’s API.  Let’s say we want to perform the same search we did above, but this time we don’t want a GUI version of the data… instead we want JSON data back so that we can easily add it to a database.  We need to find out what endpoint the Twitter API expects and pass it a value; for our example, the Twitter API search query is simply https://api.twitter.com/1.1/search/tweets.json.  We can pass it several different parameters, but we’ll start with the most basic and use the q query parameter.  We want to pass the parameter the value “#BigData”, but we need to convert the pound sign (#) to its URL-encoded version, %23.  Our code then looks like this:
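(What follows is a sketch that assumes the tmhOAuth flavor of the twitteroauth library, which is what Twitter’s single-user PHP example points to; if you’re using a different fork, the class and method names will differ slightly.)

    <?php
    require 'tmhOAuth.php';     // the twitteroauth/tmhOAuth library
    require 'app_tokens.php';   // the credentials file created above

    // Build the OAuth connection using our credentials.
    $tmhOAuth = new tmhOAuth(array(
        'consumer_key'    => $consumer_key,
        'consumer_secret' => $consumer_secret,
        'user_token'      => $user_token,
        'user_secret'     => $user_secret,
    ));

    // Ask the v1.1 REST API for recent tweets matching #BigData.
    // Note: some versions of the library URL-encode parameter values for you,
    // in which case you would pass '#BigData' rather than '%23BigData'.
    $code = $tmhOAuth->request('GET', $tmhOAuth->url('1.1/search/tweets'), array(
        'q'           => '%23BigData',
        'count'       => 30,
        'result_type' => 'recent',
    ));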

This request uses the REST API v1.1 and returns JSON data.  We are passing the search a parameter of q => ’%23BigData’, which translates to searching for the hashtag “#BigData” (without the quotes).  We are also passing the ‘count’ and ‘result_type’ parameters (for more info on the other parameters, see the documentation).  Lastly, we need to get the response back from Twitter and output it; if we have an error, we need to output that too.  Using the twitteroauth library’s examples, I know I need the following code:
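(Continuing the sketch above: tmhOAuth returns the HTTP status code from request() and stores the raw body in its response array, so the handling looks roughly like this.)

    // $code is the HTTP status code returned by the request() call above.
    if ($code == 200) {
        // Success: the raw JSON body is in the response array.
        $json   = $tmhOAuth->response['response'];
        $tweets = json_decode($json, true);   // decode to a PHP array for the database
        echo $json;
    } else {
        // Something went wrong: output the error Twitter sent back.
        echo 'Error code: ' . $code . "\n";
        echo $tmhOAuth->response['response'];
    }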

The above code receives two pieces of data from the Twitter API:  the response code and the response data. The response code indicates whether we have errors.  The response data holds the JSON data that we received from the query.  The first result of my JSON data (yours won’t contain the same information, but it will have a similar structure) looks like this:
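(The snippet below is a heavily abridged, hypothetical illustration of the response shape rather than my actual output: the field names are Twitter’s v1.1 search fields, but the values are placeholders.)

    {
      "statuses": [
        {
          "created_at": "Mon Sep 30 12:34:56 +0000 2013",
          "id_str": "000000000000000000",
          "text": "Example tweet text mentioning #bigdata ...",
          "user": { "screen_name": "example_user" },
          "entities": { "hashtags": [ { "text": "bigdata" } ] }
        }
      ],
      "search_metadata": { "count": 30, "query": "%23BigData" }
    }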

If you look at the JSON data above, you’ll see a key titled “text” and the value assigned to it; this is the content of the tweet, and you can clearly see that it contains the hashtag #bigdata.  So we now know the code works and we can programmatically query Twitter.  When you examine the Twitter API documentation, you’ll find that we can make 450 requests every 15 minutes; this of course won’t get us ALL the tweets using the hashtag “#bigdata”, but at 30 results per request it gives us a useful sample of up to 13,500 tweets every 15 minutes.

Cheers.

Web Scraping Using PHP and jQuery

I was asked by a friend to write code that would scrape letters from a DLP website for use in an academic study (the website’s copyright allows for non-commercial use of the data).  I’d not tried this before and was excited by the challenge, especially considering I’m becoming more involved in “big data” studies and need to understand how one might go about developing web scraping programs. I started with the programming languages I know best:  PHP and jQuery.  And yes, I know there are better programming languages available for web scraping.  I’ve used Perl, Python, Java, and other languages in the past, but I’m currently much more versed in PHP than anything else! If I had been unable to build this quickly in PHP, then of course I’d have turned to Python or Perl; but in the end I was able to write some code and it worked. I’m happy with the results and so was my friend.

First, I had to figure out what PHP had under the hood that would allow me to load URLs and retrieve information. I did some searching via Google and decided the best option was the cURL library (http://php.net/manual/en/book.curl.php).  The cURL library lets one connect to a variety of servers and protocols and was perfect for my needs; don’t forget to check your PHP install to confirm the cURL extension is installed and enabled.  A quick search on cURL and PHP led me to http://www.digimantra.com/technology/php/get-data-from-a-url-using-curl-php/, where I found a custom function that I thought I could edit to suit my needs:
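(The function in question is a small cURL wrapper along these lines; this is a sketch of that tutorial’s widely copied get_data() helper rather than a verbatim copy.)

    <?php
    // Fetch the contents of a URL with cURL and return it as a string.
    function get_data($url) {
        $ch = curl_init();
        $timeout = 5;                                   // connection timeout in seconds
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);    // return the data instead of printing it
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }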

Next I needed a way to grab specific DOM elements from the pages being scraped; I needed to find a <span> tag with a specific attribute whose value contained both a function name and a URL.  I am very familiar with the jQuery and CSS3 selector syntax that lets one find specific DOM elements using patterns.  Lo and behold, I discovered that someone had developed a PHP class to do similar things, named “simplehtmldom” (http://sourceforge.net/projects/simplehtmldom/).  I downloaded simplehtmldom from SourceForge, read the documentation, and wrote code that would find my elements and return the URLs I needed.
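The element-finding step looked roughly like the sketch below; the results-page URL, the onclick attribute, and the regular expression are hypothetical stand-ins for the site-specific details, and get_data() is the cURL helper from above.

    <?php
    require_once 'simple_html_dom.php';   // the simplehtmldom library from SourceForge

    $links = array();
    $page  = get_data('http://example.org/letters?start=0');   // hypothetical results page

    // Parse the fetched HTML and walk the <span> tags carrying the target attribute.
    $html = str_get_html($page);
    foreach ($html->find('span[onclick]') as $span) {
        // The attribute value mixes a function name and a URL; keep just the URL.
        if (preg_match('/https?:\/\/[^\'")]+/', $span->onclick, $match)) {
            $links[] = $match[0];
        }
    }
    $html->clear();   // free the DOM object to keep memory use down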

Now I have an array ($links) holding the actual URLs whose data I want to copy.  I need to loop through the $links array and use cURL once again to fetch each one.  While looping through the array, I check whether the URL points to an HTML file or a PDF file (my only two options in this case).  If it is an HTML file, I use the get_data() function to grab the data and PHP file commands to write a file in a local directory to store it. If it’s a PDF, I use slightly different cURL options to grab the binary data and create a PDF file locally.
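A condensed sketch of that loop follows; the directory names are placeholders, and it assumes the $links array and get_data() function from the snippets above.

    <?php
    foreach ($links as $url) {
        $filename = basename(parse_url($url, PHP_URL_PATH));

        if (preg_match('/\.pdf$/i', $url)) {
            // PDF: fetch the raw bytes and write them out as a .pdf file.
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
            $pdf = curl_exec($ch);
            curl_close($ch);
            file_put_contents('pdfs/' . $filename, $pdf);
        } else {
            // HTML: reuse get_data() and save the markup locally.
            file_put_contents('html/' . $filename, get_data($url));
        }

        sleep(1);   // be polite to the server between requests
    }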

That’s it for the scraping engine!

Now we need a way to pass start and end values (in increments of 50, maxing out at 4,000) to the PHP scraping engine.  I know there are many ways to tackle this, and I specifically considered executing the code from a terminal, in a cron job, or from a browser.  I again went with my strengths and chose an AJAX call via jQuery.  I created another file and included the most recent jQuery library.  I then created a recursive jQuery function that makes an AJAX POST call to the PHP engine, pauses for 5 seconds, and then does it again. The function accepts four parameters: url, start, increment, and end.
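A reconstruction of that recursive caller is below; the scrape.php filename is a placeholder, and the parameter names come from the description above rather than the exact original code.

    // Recursively POST to the PHP scraping engine in batches,
    // pausing 5 seconds between calls until we pass the end value.
    function scrapeBatch(url, start, increment, end) {
        if (start >= end) {
            console.log('Finished scraping.');
            return;
        }
        $.post(url, { start: start, end: start + increment }, function (response) {
            console.log('Scraped records ' + start + ' to ' + (start + increment));
            // Wait 5 seconds, then request the next batch.
            setTimeout(function () {
                scrapeBatch(url, start + increment, increment, end);
            }, 5000);
        });
    }

    // Kick it off: records 0 through 4000 in increments of 50.
    scrapeBatch('scrape.php', 0, 50, 4000);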

Put this all together and we have a basic web scraper that does a satisfactory job of iterating through search results and grabbing copies of HTML and PDF files and storing them locally.  I was excited to get it finished using my familiar PHP and jQuery languages and it was a nice exercise to think this problem through logically.  Again, I’m SURE there are better, more efficient ways of doing this… but I’m happy and my friend is happy.

Fun times.


Scientometrics, Scholarly Communication, and Big Data… oh my!

[Wordle word cloud of “The Digital Humanities and Humanities Computing: An Introduction” by S. Schreibman, R. Siemens, & J. Unsworth]

I’ve started the 2012 fall semester with a new G.A. position working for Dr. Cassidy Sugimoto on a grant titled Cascades, Islands, or Streams? Time, Topic, and Scholarly Activities in Humanities and Social Science Research.  The grant was awarded through the NEH and the Office of Digital Humanities as part of the Digging Into Data challenge. The official grant description reads:

This project will examine topic lifecycles across heterogeneous corpora, including not only scholarly and scientific literature, but also social networks, blogs, and other materials. While the growth of large-scale datasets has enabled examination within scientific datasets, there is little research that looks across datasets. The team will analyze the importance of various scholarly activities for creating, sustaining, and propelling new knowledge; compare and triangulate the results of topic analysis methods; and develop transparent and accessible tools. This work should identify which scholarly activities are indicative of emerging areas and identify datasets that should no longer be marginalized, but built into understandings and measurements of scholarship.

I’m extremely excited about this G.A. position! It will allow me to study, record, and understand communication, connections, and behavior in social network sites (SNS) using a different set of tools and theories. It will also let me draw on my semester as an A.I. for Dr. John Walsh’s S657 Digital Humanities course and my past five years of experience working in the digital humanities (DH) realm, developing and designing DH websites and tools for the Chymistry of Isaac Newton project, TILE, and other projects.

It’s an ideal opportunity, and I’m lucky I asked to audit the Ph.D. version of Dr. Sugimoto’s Scholarly Communication course.  I had been interested in scholarly communication beforehand, after examining how Erving Goffman was cited in a subset of information science (IS) literature (a large portion of the discourse simply cited him ceremoniously), and I became much more involved with the subject as the course progressed. The course and its discussions opened my eyes to vast possibilities beyond the simple “citation count,” and I became quite interested in the discourse. When Dr. Sugimoto later approached me with the possibility of working together, I jumped at the opportunity.

I came to the project with both excitement and fear; fear primarily because I felt a bit out of my comfort zone, as the scholarly communication discourse was still relatively new to me. The other faculty and students working on the project have been fantastic, and I’m positive I’ll benefit in a variety of ways from the experience. The first part of my assistantship will involve working with a small group of Ph.D. and master’s students tasked with examining network characteristics of the DH community across a variety of sources, including various social network tools/sites, journals, books, and listservs, to name just a few. Some in the DH community have visualized and discussed characteristics of the DH populace (Cleo, Melissa Terras, and Alex Reid, to list just a few), and our group hopes to add to this picture by examining other sources for characteristics common to the DH community.


Integrating Social Capital, Impression Management, and Privacy

At this time, my dissertation looks likely to involve integrating theories and frameworks related to social capital, impression management, and privacy. Impression management is the easiest theory for me to deal with because I have a strong attraction to Goffman’s dramaturgical framework. While I know that his is not the only framework or theory relating to impression management, at this point in time it is the one to which I’m attracted. The big three of the social capital world, based on my own readings, are Bourdieu, Coleman, and Putnam. I tend to lean toward Coleman’s definition of social capital because it is less egocentric; Bourdieu’s definition and explanation are extremely egocentric, while Putnam’s views operate at a macro level and encompass large organizations and groups. Privacy research is more of a grab bag of ideas and theories, composed of a variety of definitions and understandings.

I’m attempting to integrate these theories to explain behaviors in the domain of social networking sites (SNS). SNSs are an extremely hot topic in academia and in the popular press. Facebook is pushing 700 million users and Google is now entering the realm with Google+. The SNS environment is an extremely interesting domain to investigate because

    1. SNSs such as Facebook have so many users
    2. The computer-mediated environment presents us with different challenges during interaction than face-to-face settings do
    3. It is a relatively new domain of study and has continued to show growth over the past 7+ years
    4. Popular media has made blanket statements regarding interaction, privacy, and safety within the SNS environment that academicians have set out to examine
    5. Web 2.0 technologies have allowed for a variety of media to be transmitted through the SNS frameworks that affect self-presentation, social capital maintenance, and privacy

While the SNS domain is ripe for investigation, I also feel that a successful integration of these theories will allow us to examine a variety of computer-mediated environments.

So far I’ve explained that I want to integrate impression management, social capital, and privacy theories and frameworks to investigate computer-mediated environments such as SNS. I’ve left out a crucial component… WHAT will I be investigating? This is the ultimate question and what is giving me the hardest time at the moment.

“Indiana University project releases more of Sir Isaac Newton’s alchemy manuscripts” from IUB Newsroom

Story from http://newsinfo.iu.edu/web/page/normal/19929.html

"Star regulus of antimony" was produced by the Chymistry of Isaac Newton project following directions written by Newton.

The Chymistry of Isaac Newton project at Indiana University Bloomington has released digital editions of 30 previously unedited manuscripts written around 300 years ago by the great British scientist Sir Isaac Newton, the founder of modern physics.

The project, devoted to the editing and exposition of Newton’s work involving alchemy, the dream of transmuting base metals into gold, is directed by William R. Newman, Ruth N. Halls Professor of History and Philosophy of Science in the IU College of Arts and Sciences.

Look for a new Newton site design shortly!

Publicity on Newton Project, Moving on to TILE

SLIS News Story: http://www.slis.indiana.edu/news/story.php?story_id=2186

We have some new publicity for the Chymistry of Isaac Newton project. IU did a press release on the launch of new manuscripts, the new website design, and new features… and I get some credit for redesigning the website. I’ve had a really great time working on the Newton Project with Dr. John A. Walsh, Dr. William Newman, Wally Hooper, and the rest of the crew. We’ve had some good times. Now I’m moving on to the TILE project (Text-Image Linking Environment), where I’ll be working with Doug Reside, Dot Porter, Melissa Terras, and Dr. John A. Walsh.

TILE Project Release: http://www.slis.indiana.edu/news/story.php?story_id=1985