Lessons from 2:AM

Last week I was lucky enough to attend the 2:AM Conference in Amsterdam. The conference focused on altmetrics, a type of metric typically calculated from scholarly communication events captured in online contexts (e.g., events on Twitter, Mendeley, Wikipedia, etc.). For some time I’ve been critical of the term “altmetrics” because I had taken it to mean “alternative to citations,” but after this conference I’m not so confident in my previous position. Altmetrics is an umbrella term that we use to describe the type of research we are doing (at least those of us who research these things); it is a buzzword that others use to talk about scholarly communication in online contexts; it is a term the media has used; it is currently used by organizations, libraries, universities, and companies to promote scientific work; and it has become a term that somehow represents the potential for measuring impact outside of the academic machine (other than scientific impact). While it has been criticised many times in the past for being the wrong term, I am not sure there is a more appropriate one… and that is fine. We have had suggestions including social media metrics (Haustein, Larivière, Thelwall, Amyot, & Peters, 2014), complimetrics (complementary metrics) (Adie, 2014), influmetrics (influence metrics) (Cronin & Weaver, 1995; Rousseau & Ye, 2013), and, more traditionally, webometrics (Almind & Ingwersen, 1997), to name just a few, but these do not seem to be any better and also do not seem to possess that something that “alt”metrics seems to possess.

I dabble in linguistics, and I believe that words are of vital importance to our ability to understand and discuss the same phenomenon (especially in science), which is why I was so adamant that “altmetrics” was the wrong term to be using. But then I took another look at the altmetrics manifesto (the 5th anniversary of this important object was celebrated at the conference) and reevaluated my position in light of my accumulated knowledge of the field, what I learned at this conference, and a closer reading of the manifesto itself. I came to the realization that altmetrics is fine when you think of it as an “alternative means of measuring scholarly communication.”

The conference venue was great, as we were housed at the Amsterdam Science Park, a sprawling complex on the eastern side of Amsterdam. There were quite a few attendees, and the presentations and workshop were informative and thought-provoking. Many of the primary data providers, publishing companies, metrics providers, and others in this field sent representatives, including Jason Priem (impactstory.org), Euan Adie (altmetric.com), William Gunn (mendeley.com), Greg Gordon (ssrn.com), Martin Fenner (niso.org), and Geoff Bilder (crossref.org). In addition, the four authors of the altmetrics manifesto were in attendance to celebrate its 5th anniversary: Jason Priem, Dario Taraborelli, Paul Groth, and Cameron Neylon. I was able to speak with both Jason and Cameron, and they were engaging, down-to-earth people who are great scholars and excited by the future of scholarly communication (I wasn’t able to speak with Paul or Dario at such length).

What I gleaned from 2:AM was that an ongoing discussion, from multiple perspectives, is taking place regarding the ability of altmetrics to measure impact, the types of impact there might be for scholarly communication, and the importance of trust when considering the reasons behind altmetric events. In addition, I am looking forward to being part of a group (formed at the “theories” conference breakout session) that will write a white paper describing and defining common terms used in altmetric research, for the purpose of allowing others outside of our community to understand and contribute to the ongoing work in the field. I also learned that many in the field had read our book chapter (arXiv:1502.05701) on applying citation and social theories to the understanding of altmetric events; they were very supportive of our efforts to put forth this first attempt at developing a framework for understanding altmetric events. Yet we all know that much more work needs to be done, and hopefully this white paper will be a nice step in that direction.

What I also learned from listening to Jason Priem, Dario Taraborelli, Paul Groth, and Cameron Neylon was that our group has somewhat ignored an important component of the manifesto: its framing of altmetric events as a type of “filter” for scholarly research and communication:

No one can read everything. We rely on filters to make sense of the scholarly literature, but the narrow, traditional filters are being swamped. However, the growth of new, online scholarly tools allows us to make new filters; these altmetrics reflect the broad, rapid impact of scholarship in this burgeoning ecosystem. We call for more tools and research based on altmetrics. (Priem, Taraborelli, Groth, & Neylon, 2010, para. 1)

This is an important aspect that I, too, simply took for granted and something I need to reflect on as my understanding of these phenomena continues to grow and change.


References

Priem, J., Taraborelli, D., Groth, P., & Neylon, C. (2010). Altmetrics: A manifesto. October 26, 2010. Retrieved from http://altmetrics.org/manifesto


Boundaries

The current state of scholarly communication is in flux as various avenues for the consumption and dissemination of ideas, discussions, and research continue to be developed and then adopted by scholars. These contexts offer different affordances (Gibson, 1977), or possibilities for action provided by the platform, and different types of networks through which a scholar can view content and interact. Affordances found in these environments typically include the ability to share information in a particular way (e.g., tweets, Facebook posts, blog posts, comments, links or media, hashtags), consume information, create a profile (public, private, or mixed), and connect with other users of the platform. Various types of networks are represented across these platforms, including blogs (external-facing networks), Facebook and Twitter (social networks), and Wikipedia (interconnected networks). A scholar can present herself on these online platforms along a continuum ranging from personal to professional.

Problems can arise from interacting within these online contexts because the information is disseminated to a vast, unknown audience; it is archivable; it is searchable; and it can be copied and removed from the context in which it was originally published (boyd, 2006). This can prove damaging to the reputation of a scholar and can lead to shame, punishment, or dismissal, as recent examples show. In one example, a scholar who had been offered a tenure-track position at the University of Illinois, Urbana-Champaign had this offer rescinded after several tweets made by the individual were deemed anti-Semitic in nature by the university board (Jaschik, 2014). In another example, a professor from the University of New Mexico was put on probation and given counseling after tweeting an offensive remark about Ph.D. applicants (Ingeno, 2013). There have been other examples of these types of infractions on Facebook and on blogs.

Before the rise of these massive online networks, scholars already found it difficult to manage the boundaries between their personal and professional lives. The introduction of online contexts in which a person can interact with vast audiences exacerbates the situation for scholars, who (often) are already maintaining a tenuous balance between their personal and professional identities through their time spent mentoring and teaching students in and out of the classroom. The boundaries between personal and professional are changing; what were once considered personal interactions outside the classroom have now been thrust into the spotlight, partly because of the new networks in which scholars interact. This relationship between the changing personal and professional boundaries of self-presentation and the size of the network and proximity of its nodes has not been adequately discussed.

Goffman (1959) discussed the acts of self-presentation and impression management in his social research as acting out a particular role for an audience and maintaining that role across time. These acts rely on various elements, including social norms, rules, and context, to be effective. You could interpret Goffman’s writing as suggesting that he considered the network and its significance to people in their day-to-day lives, as he noted later (Goffman, 1961, p. 127) that “[w]hen seen up close, the individual, bringing together in various ways all the connections that he has in life, becomes a blur.” He knew that boundary maintenance was a crucial component of self-presentation and impression management, as he divided the act of self-presentation into three different regions: the front stage, the back stage, and the outside region. What he did not directly speak to was the actual size of the network and the influence this would have on the boundaries between these regions.

[Figure: A graphical representation of Goffman’s self-presentation framework]

Related to this, Mehra, Kilduff, and Brass (2001, p. 131) argued that while a large network “can enable the individual to access numerous others for information and other resources,” they warned that “[p]eople who interact with numerous others in organizations run the risk of running short of time and other resources.” In addition to the time and resources required to maintain large networks, scholars run the risk of further blurring the boundaries between their personal and professional selves. I want to further investigate this relationship between networks, self-presentation and impression management, and the blurring of the personal and the professional.


References

boyd, d. (2006). Friends, Friendsters, and MySpace Top 8: Writing Community Into Being on Social Network Sites. First Monday, 11(12), 1–15. Retrieved from http://www.firstmonday.org/issues/issue11_12/boyd/index.html

Gibson, J. J. (1977). The Theory of Affordances. In R. Shaw & J. Bransford (Eds.), Perceiving, Acting, and Knowing: Toward an Ecological Psychology (pp. 127–143). Hillsdale, NJ: Lawrence Erlbaum.

Goffman, E. (1959). The Presentation of Self in Everyday Life. New York: Anchor.

Goffman, E. (1961). Encounters: Two studies in the sociology of interaction. Indianapolis: The Bobbs-Merrill Company, Inc.

Ingeno, L. (2013, June 14). Outrage over professor’s Twitter post on obese students. Inside Higher Ed. Retrieved from https://www.insidehighered.com/news/2013/06/04/outrage-over-professors-twitter-post-obese-students

Jaschik, S. (2014, August). Out of a job. Inside Higher Ed. Retrieved from https://www.insidehighered.com/news/2014/08/06/u-illinois-apparently-revokes-job-offer-controversial-scholar

Mehra, A., Kilduff, M., & Brass, D. J. (2001). The social networks of high and low self-monitors: Implications for workplace performance. Administrative Science Quarterly, 46(1), 121–146.

The ecosystem of science

I’ve been thinking a lot about what Science really means to me and what the philosophers of science have said about the system of science. I love Newton’s famous notion about “standing on the shoulders of giants,” but I don’t necessarily see it in that way… especially in my line of research investigating altmetrics and scholarly communication.

It’s a blustery evening in Finland and I am watching the trees bend and shed leaves in the strong breeze while thinking about this. It seems to me that the system of science resembles an ecosystem in which we try to make our lives meaningful and to shed light on our surroundings. We do, of course, use the work of others to view things through their eyes, but I don’t see myself standing on their shoulders and reaching for the stars. Instead I see myself as a small sapling, struggling for nourishment in a vast forest. At the same time, I view those before me, especially those marvelous minds from which I borrow, as large trees that shade me from the sun and break the harsh winds blowing over me. I see the trees of Goffman and Gibson, of Heidegger and Kant, and on and on, in my part of the forest. These solid, long standing trees protect me and nourish me, allowing me to grow and to become a tree myself.

As scholarly communication and science have changed, so too has the ecosystem. We are no longer simply aspiring to be the trees that provide the root system of science; we are also trying to spread and have an impact outside our forests. I feel like we are now flowering trees, making pollen that can be carried away to the farthest fields in the hope of having an impact on our surroundings. We have evolved to make use of the technologies that have become a part of our world, to attract the attention of others so that they can carry our pollen away. A large part of this new technology and ecosystem is the internet, specifically social media and other online sources of information. Social media users are the bees that we need to spread our pollen, our information, outside of our isolated forests. What the bees are doing with this information, we don’t yet know. But what we do know is that they can spread it faster and farther than ever before.

Through my work I hope we can figure out where our information is being spread and what kinds of impact we are having on society.

It. Is. Done.

I have finally finished my Ph.D. Yay. I graduated from the School of Informatics and Computing, Indiana University, Bloomington at the end of July 2015.

After seven years of contemplating social structures, norms, behaviors, communication, and the ways in which people use the affordances of social media, I successfully defended my thesis in front of four of my peers and a handful of students in May 2015, then made the required minor revisions and formatting changes and submitted the final version of the document to the graduate school at the beginning of July 2015.

It has been a long, rewarding journey and I am happy that I completed it. I have been able to travel around the world, move to two countries, and meet some extraordinary scholars, travelers, and neighbors. It’s been quite an adventure, one which I hope continues as I progress in my career as an academic. Thank you to everyone for the support and love throughout this process.

I’m now in Finland working with great scholars and looking to improve my abilities as a scholar, researcher, teacher, and coworker.

Kiitos!

Scholarly Communication, ‘Altmetrics’, and Social Theory

In a recent book chapter (that is currently under review), my colleagues and I discuss the application of citation theories and social theories to popular media and social media metrics (so-called altmetrics) being collected by sites like Altmetric.com, ImpactStory.org, and Plum Analytics. These metrics are being used by organizations such as libraries, publishers, universities, and others to measure scholarly impact. It is an interesting area of research in that it helps us understand how scholarly work is being consumed and disseminated in social media (and thus presumably to an audience outside of the academy).

I come to this research having dabbled in many different areas of study, beginning with neuropsychology (as an undergraduate), then human-computer interaction, information architecture, and web design (as a master’s student), and finally social informatics (at the beginning of my Ph.D.), digital humanities (in the middle of my Ph.D.), and scholarly communication and sociology (in my thesis work). I believe this indirect path has allowed me to consider research questions from different perspectives and to apply various theoretical and methodological lenses to the same problem (as is the case for many Information Science graduates). It’s also a path that has allowed me to contribute to the data collection aspect of this work, as I’ve written several programs that have assisted in the collection and storage of huge amounts of data (hundreds of millions of tweets, publication records, etc.) on scholarly (and other) activities. These experiences have allowed me to contribute to the book chapter mentioned above and to several articles and presentations, and they continue to help me understand scholarly communication in social and popular media venues.

I’m looking forward to finalizing my thesis and to continuing to examine these social and scholarly communication issues, both in my current research position at UdeM and in a permanent faculty position with future colleagues.

Harvesting Images from Site for Study

Recently I needed to write code that would allow me to harvest images from a site where the images were displayed 10 per page over n pages. I wanted to set it up so that I could start it and let it run over time, harvesting images as it went. This immediately meant I’d be working with PHP and jQuery using AJAX. I’ve written another post titled Web Scraping Using PHP and jQuery about this type of AJAX script, and I needed to use what I’d learned there to implement this new scraping engine.

The reason I’m writing this code is so that we can start a few studies on selfies.

I ended up using a bookmarklet, which is a JavaScript snippet that you can add to a browser as a bookmark. We had assistants visit the pages where the images were stored and click on the bookmark to harvest the images and the metadata associated with each image. While it is a bit cumbersome, it was the easiest and quickest way to start collecting the images for our project. I wrote the code such that the JavaScript would talk to the PHP code, and the PHP code would handle all the heavy lifting (saving the image and scraping the page for metadata). The images and text files were automatically saved to a Dropbox folder using the Dropbox API.
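A very rough sketch of the bookmarklet side is below (harvest.php is a stand-in name for the collector script on my server, and this assumes the collector accepts cross-origin requests; the real PHP side handled the downloading, metadata scraping, and Dropbox upload):

    // Bookmarklet sketch: POST every image URL on the current page to a PHP collector.
    javascript:(function () {
      var endpoint = 'http://example.org/harvest.php';   // hypothetical collector script
      var images = document.querySelectorAll('img');
      Array.prototype.forEach.call(images, function (img) {
        var xhr = new XMLHttpRequest();
        xhr.open('POST', endpoint, true);
        xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
        // Send the image URL plus the page it came from; the PHP side does the rest.
        xhr.send('img=' + encodeURIComponent(img.src) +
                 '&page=' + encodeURIComponent(location.href));
      });
    })();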

PHP + jQuery + Twitter API => Performing a Basic Search Using OAuth and the REST API v1.1

INTRO

To get started, you’ll need access to PHP 5.x on a web server.  I’m currently working on an Apache server with the default installation of PHP 5.3.  This should be available on most hosting services, especially those setups featuring open source software (as opposed to Microsoft’s .NET framework). In addition, I’m using a Postgres database on the back end to store the information I’m scraping and extracting (you can just as easily use MySQL).  If you want to run this code on your local machine, download WAMP, MAMP, XAMPP, or another flavor of server/language/database package.

TWITTER API, OAuth, & PHP twitteroauth Library

First, familiarize yourself with the Twitter Developer Website. If you want to skip right to the API, check out the REST API v1.1 documentation. To test a search, go to the Twitter Search page and type in a search term; try typing #BigData in the query field to search for the BigData hashtag. You’ll be presented with a GUI version of the results. If you want to do the same thing programmatically and get the data back in JSON format, you’ll need to use the REST API search query… and you must be authenticated to do this. To create credentials for the search query, you must create an OAuth profile, so visit https://dev.twitter.com/docs/auth/tokens-devtwittercom to retrieve your ACCESS TOKEN and ACCESS SECRET. Luckily, we can use the PHP twitteroauth library to connect to Twitter’s API and start writing code (here’s an example of the code you’ll need: https://dev.twitter.com/docs/auth/oauth/single-user-with-examples#php). At this point you’ll need to set up your OAuth profile with Twitter, download the PHP twitteroauth library, edit the proper files to add your TOKEN and SECRET to the library, and ensure all the files are on your web server in the appropriate place.

PERFORMING A SEARCH & RETRIEVING JSON DATA

I’m assuming you have set up the OAuth profile on Twitter and that you’ve downloaded the PHP twitteroauth library. I like to create an “app_tokens.php” file containing my CONSUMER_KEY, CONSUMER_SECRET, USER_TOKEN, and USER_SECRET information assigned to variables; this way I can include it anywhere I need it.
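Something like this is all that file needs (a sketch with placeholder values; the variable names just have to match whatever your connection code expects):

    <?php
    // app_tokens.php -- placeholder credentials; keep this file out of version control.
    $consumer_key    = 'YOUR_CONSUMER_KEY';
    $consumer_secret = 'YOUR_CONSUMER_SECRET';
    $user_token      = 'YOUR_ACCESS_TOKEN';
    $user_secret     = 'YOUR_ACCESS_SECRET';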

Now that we have our authorization credentials, we are ready to use tmhOAuth as the middleman to send a request to Twitter’s API. Let’s say we want to perform the same search we did above, but this time we don’t want a GUI version of the data… instead we want JSON data back so that we can easily add it to a database. We need to find out what command the Twitter API expects and pass it a value; for our example, the Twitter API search query is simply https://api.twitter.com/1.1/search/tweets.json. We can pass it several different parameters, but we’ll start with the most basic and use the q query parameter. We want to pass it the value “#BigData”, but we need to convert the pound sign (#) to its URL-encoded version, %23. Our code then looks something like this:
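Roughly, assuming the stock tmhOAuth library and the app_tokens.php variables from above:

    <?php
    require 'app_tokens.php';
    require 'tmhOAuth.php';

    // Build an authenticated connection from the tokens defined in app_tokens.php.
    $connection = new tmhOAuth(array(
        'consumer_key'    => $consumer_key,
        'consumer_secret' => $consumer_secret,
        'user_token'      => $user_token,
        'user_secret'     => $user_secret,
    ));

    // Ask the v1.1 search endpoint for recent tweets containing #BigData.
    // (Depending on the library version, you may pass the raw '#BigData' and let
    // the library handle the URL encoding for you.)
    $code = $connection->request(
        'GET',
        $connection->url('1.1/search/tweets'),
        array(
            'q'           => '%23BigData',
            'count'       => 30,
            'result_type' => 'recent',
        )
    );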

This request will use the REST API v1.1 and return JSON data. We are passing the search a parameter of q => '%23BigData', which translates to searching for the hashtag #BigData. We are also passing the 'count' and 'result_type' parameters (for more info on the other parameters, see the documentation). Lastly, we need to get the response back from Twitter and output it; if we have an error, we need to output that too. Based on the twitteroauth library’s examples, I know I need to have something like the following code:
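Continuing from the request above, in sketch form:

    // $code holds the HTTP status; the body of the reply sits in the response array.
    if ($code == 200) {
        $json = $connection->response['response'];
        echo $json;
        // From here the data can be decoded and written to Postgres/MySQL:
        // $results = json_decode($json, true);
    } else {
        // Something went wrong -- show the status code and the error body.
        echo 'Error ' . $code . ': ' . $connection->response['response'];
    }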

The above code receives two pieces of data from the Twitter API: the response code and the response data. The response code indicates whether we have errors. The response data holds the JSON data that we received from the query. The first result of my JSON data (yours won’t contain the same information, but it will have a similar structure) looks like this:
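My actual results aren’t reproduced here, but a v1.1 search response has roughly this shape (placeholder values, with most fields omitted):

    {
      "statuses": [
        {
          "created_at": "Mon Jan 01 12:00:00 +0000 2014",
          "id_str": "000000000000000000",
          "text": "Example tweet text mentioning #bigdata ...",
          "user": { "screen_name": "example_user" },
          "entities": { "hashtags": [ { "text": "bigdata" } ] }
        }
      ],
      "search_metadata": { "count": 30, "query": "%23BigData" }
    }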

If you look at the JSON data above, you’ll see a key titled “text” and the value assigned to it; this is the content of the tweet, and you can clearly see that it contains the hashtag #bigdata. So we now know the code works and we can programmatically query Twitter. When you examine the Twitter API documentation you will find that we can make 450 requests every 15 minutes; this will of course not get us ALL the tweets using the hashtag #bigdata, but at 30 results per request it will give us a useful sample of up to 13,500 tweets every 15 minutes.

Cheers.

Web Scraping Using PHP and jQuery

I was asked by a friend to write code that would scrape a DLP website’s content of letters for use in an academic study (the website’s copyright allows for the non-commercial use of the data). I’d not tried this before and was excited by the challenge, especially considering I’m becoming more involved in “big data” studies and I need to understand how one might go about developing web scraping programs. I started with the programming languages I know best: PHP & jQuery. And yes, I know that there are better programming languages available for web scraping. I’ve used Perl, Python, Java, and other programming languages in the past, but I’m currently much more versed in PHP than anything else! If I had been unable to quickly build this in PHP, then of course I’d have turned to Python or Perl; but in the end I was able to write some code and it worked. I’m happy with the results and so was my friend.

First, I had to figure out what PHP had under the hood that would allow me to load URLs and retrieve information. I did some searching via Google and figured out the best option was to use the cURL library (http://php.net/manual/en/book.curl.php). The cURL lib allows one to connect to a variety of servers and protocols and was perfect for my needs; don’t forget to check your PHP install to see whether you have the cURL library installed and activated. I did a quick search on cURL and PHP and came across http://www.digimantra.com/technology/php/get-data-from-a-url-using-curl-php/, where I found a custom function that I thought I could edit to suit my needs:
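Lightly adapted, the helper looks roughly like this (the timeout value is arbitrary):

    <?php
    // Fetch the contents of a URL with cURL and return it as a string.
    function get_data($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);   // follow redirects
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);  // give up if the server never answers
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }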

Next I needed a way to grab specific DOM elements from the pages being scraped; I needed to find a <span> tag that had a specific attribute containing a value that was both a function name and a URL. I am very familiar with the jQuery and CSS3 syntax that allows one to find specific DOM elements using patterns. Lo and behold, I discovered that someone had developed a PHP class to do similar things named “simplehtmldom” (http://sourceforge.net/projects/simplehtmldom/). I downloaded simplehtmldom from SourceForge, read the documentation, and created code that would find my elements and return the URLs I needed.
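In sketch form, with 'onclick' and the openLetter(...) pattern standing in for the site’s real attribute and function name:

    <?php
    require 'simple_html_dom.php';

    $links = array();
    // $page_url points at one page of search results; get_data() is defined above.
    $html = str_get_html(get_data($page_url));

    foreach ($html->find('span[onclick]') as $span) {
        // The attribute value looked something like: openLetter('http://example.org/letter/123')
        if (preg_match("/'(https?:\/\/[^']+)'/", $span->onclick, $match)) {
            $links[] = $match[1];   // keep only the URL portion
        }
    }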

Now I have the actual URLs from which I want to copy data stored in an array. I need to loop through the $links array and use cURL once again to get the data. While I’m looping through the array, I need to check whether the URL points to an HTML file or a PDF file (my only two options in this case). If it is an HTML file, I use the get_data() function to grab the data and use PHP file commands to write a file in a local directory to store it. If it’s a PDF, I need to use different cURL commands to grab the data and create a PDF file locally.
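A rough sketch of that loop ($links comes from the snippet above, and 'letters/' is an arbitrary local directory):

    <?php
    foreach ($links as $i => $url) {
        if (preg_match('/\.pdf$/i', $url)) {
            // PDF: stream the raw bytes straight into a local file in binary mode.
            $fp = fopen("letters/letter_$i.pdf", 'wb');
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_FILE, $fp);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
            curl_exec($ch);
            curl_close($ch);
            fclose($fp);
        } else {
            // HTML: reuse get_data() and write the markup to a local file.
            file_put_contents("letters/letter_$i.html", get_data($url));
        }
        sleep(1);   // be polite to the server between requests
    }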

That’s it for the scraping engine!

Now we need a way to pass start and end values (in increments of 50, maxing out at 4,000) to the PHP scraping engine. I know there are many ways to tackle this, and I specifically considered executing the code from a terminal, in a CRON job, or from a browser. I again went with my strengths and chose to use an AJAX call via jQuery. I created another file and included the most recent jQuery library. I then created a recursive jQuery function that would make an AJAX POST call to the PHP engine, pause for 5 seconds, and then do it again. The function accepts four parameters: url, start, increment, and end.
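A sketch of that recursive caller (scraper.php is a stand-in name for the PHP engine):

    // Recursively POST to the scraping engine, 50 results at a time, pausing
    // 5 seconds between calls, until the end value is reached.
    function scrapeBatch(url, start, increment, end) {
        if (start > end) { return; }   // all done
        $.post(url, { start: start, end: start + increment }, function () {
            setTimeout(function () {
                scrapeBatch(url, start + increment, increment, end);
            }, 5000);
        });
    }

    // Kick it off: results 0 through 4000 in blocks of 50.
    scrapeBatch('scraper.php', 0, 50, 4000);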

Put this all together and we have a basic web scraper that does a satisfactory job of iterating through search results, grabbing copies of HTML and PDF files, and storing them locally. I was excited to get it finished using my familiar PHP and jQuery, and it was a nice exercise to think this problem through logically. Again, I’m SURE there are better, more efficient ways of doing this… but I’m happy and my friend is happy.

Fun times.