Top 200 @Mentions and #Hashtags used in tweets from the 2017 Altmetric.com dump as Word Clouds using Python

A follow-up to the Twitter profile description word cloud… I’ve created a hashtag word cloud from the 19.2 million hashtags used in the tweets collected by Altmetric.com, along with a matching @mention word cloud.

Top 200 Hashtags used in tweets collected by Altmetric.com
Top 200 @Mentions used in tweets collected by Altmetric.com

The Python code is VERY similar to the profile description word cloud code; however, we have to turn off the ‘collocations’ option in the WordCloud constructor to make it work as expected.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Feb 10 16:19:56 2019

@author: tdbowman
"""
import io
import csv
import numpy as np
from wordcloud import WordCloud, STOPWORDS
from os import path
from PIL import Image


# current directory
currdir = path.dirname(__file__)

# from https://github.com/nikhilkumarsingh/wordcloud-example/blob/master/mywc.py
def create_wordcloud(text):

    # use cloud.png as mask for word cloud
    mask = np.array(Image.open(path.join(currdir, "cloud.png")))
    # no custom stopword list is needed for the hashtag cloud
    
    # create wordcloud object
    wc = WordCloud(collocations=False,
                   background_color="white",
                   max_words=200, 
                   mask=mask,
                   width=1334,
                   height=945)
    	
    # generate wordcloud
    wc.generate(text)
    # save wordcloud
    wc.to_file(path.join(currdir, "wc_hashtags.png"))
    
if __name__ == "__main__":

    # Read the hashtags from the CSV and join them into one long string
    with io.open('hashtags.csv', 'r', encoding='utf-8') as f:
        # strip any NUL bytes that would otherwise break the csv reader
        reader = csv.reader(x.replace('\0', '') for x in f)
        hashtag_text = ','.join([row[0] for row in reader if row])

    # generate wordcloud
    create_wordcloud(hashtag_text)
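
Why collocations? By default, WordCloud also counts pairs of adjacent words (collocations), and since the script joins every hashtag into one long string, the default setting would invent hashtag pairs that never actually appeared together in a tweet. Here is a minimal sketch of the difference, using made-up hashtag text rather than the real Altmetric.com data:

from wordcloud import WordCloud

# Made-up hashtag text (not the real data): "openaccess altmetrics" appears
# as an adjacent pair over and over, just like joined hashtags would.
text = ("openaccess altmetrics " * 25) + "scholcomm"

wc_default = WordCloud().generate(text)                    # collocations=True by default
wc_no_pairs = WordCloud(collocations=False).generate(text)

# words_ maps each token (and, by default, each retained word pair) to a relative frequency
print(sorted(wc_default.words_))    # typically includes the pair 'openaccess altmetrics'
print(sorted(wc_no_pairs.words_))   # single hashtags only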

Twitter Profile Descriptions Word Cloud from 2017 Altmetric.com Dump using Python 3

Just a quick post to display a word cloud I created with the Python WordCloud module as a demo for my Advanced Programming students. I was surprised to find that “Founder” was the most frequently occurring word in the profile descriptions. What does this tell us?

I first had to write a PHP script to grab, from the Twitter Search API, all the tweet objects that were captured by Altmetric.com. I then stored the unique author profiles in a separate table from the tweets, which gave me approximately 3.4 million unique Twitter users who tweeted about science as captured by Altmetric.com. Of these 3.4 million, approximately 2.8 million users had at least some text in their Twitter profile description field.

To pull the descriptions out of my MySQL table of author profiles, I used the following query, which also strips hard returns and tabs from each description.

SELECT REPLACE(REPLACE(REPLACE(TRIM(`description`), '\r', ' '), '\n', ' '), '\t', ' ') 
FROM `profiles` 
WHERE TRIM(`description`)!='';
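
The query output just needs to end up in a CSV file (all_descriptions.csv below) that the Python script can read. If you want to do that step in Python too, something like the following sketch would work; note that the pymysql package and the connection details here are placeholders, not necessarily what I used:

#!/usr/bin/env python3
# Sketch only: dump the query results to all_descriptions.csv for the script below.
import csv
import pymysql
import pymysql.cursors

QUERY = r"""
SELECT REPLACE(REPLACE(REPLACE(TRIM(`description`), '\r', ' '), '\n', ' '), '\t', ' ')
FROM `profiles`
WHERE TRIM(`description`)!=''
"""

# Placeholder connection details -- replace with your own
conn = pymysql.connect(host="localhost", user="me", password="secret",
                       database="altmetric", charset="utf8mb4",
                       cursorclass=pymysql.cursors.SSCursor)  # stream rows instead of loading ~2.8M at once
try:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        with open("all_descriptions.csv", "w", encoding="utf-8", newline="") as out:
            writer = csv.writer(out)
            for (description,) in cur:
                writer.writerow([description])
finally:
    conn.close()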

Next, we have the actual Python 3 code I used to create the word cloud from the 2.8 million user descriptions. You’ll note I added a few extra terms to the STOPWORDS list; I ran this multiple times and found terms I wanted to remove from the final version.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Feb 10 16:19:56 2019
@author: tdbowman
"""
import io
import csv
import numpy as np
from wordcloud import WordCloud, STOPWORDS
from os import path
from PIL import Image


# current directory
currdir = path.dirname(__file__)

# from https://github.com/nikhilkumarsingh/wordcloud-example/blob/master/mywc.py
def create_wordcloud(text):

    # use cloud.png as mask for word cloud
    mask = np.array(Image.open(path.join(currdir, "cloud.png")))
    # stopword list: extra terms I spotted after a few runs, plus the WordCloud defaults
    stop_words = ["https", "co", "RT", "del", "http", 
                  "tweet", "tweets", "twitter", "en", "el", "us", "et",
                  "lo", "will", "ex", "de", "la", "rts"] + list(STOPWORDS)
    
    # create wordcloud object
    wc = WordCloud(background_color="white",
                   max_words=200,
                   mask=mask,
                   stopwords=stop_words)
    	
    # generate wordcloud
    wc.generate(text)
    # save wordcloud
    wc.to_file(path.join(currdir, "wc.png"))
    
if __name__ == "__main__":

    # Read the descriptions from the CSV and join them into one long string
    with io.open('all_descriptions.csv', 'r', encoding='utf-8') as f:
        # strip any NUL bytes that would otherwise break the csv reader
        reader = csv.reader(x.replace('\0', '') for x in f)
        description_text = ','.join([row[0] for row in reader if row])

    # generate wordcloud
    create_wordcloud(description_text)

It could use some cleanup and the image could be higher resolution, but it’s a good example for the students of how to use Python to create a word cloud.
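
If you do want a sharper image, the easiest knob is the WordCloud scale parameter, which renders the same layout at larger pixel dimensions (when a mask is supplied, the width and height arguments are ignored, so a genuinely finer layout needs a higher-resolution cloud.png). A rough sketch, not the exact settings I used:

import numpy as np
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

# Same mask as before; scale=2 doubles the pixel dimensions of the saved PNG.
# The layout itself is still computed on the mask-sized canvas, so for a truly
# finer layout you would need a larger cloud.png mask image.
mask = np.array(Image.open("cloud.png"))

wc = WordCloud(background_color="white",
               max_words=200,
               mask=mask,
               stopwords=list(STOPWORDS),
               scale=2)

wc.generate("replace this with the joined descriptions text")
wc.to_file("wc_highres.png")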

Lessons from 2:AM

Last week I was lucky enough to attend the 2:AM Conference in Amsterdam. The conference was focused on altmetrics, a type of metric typically calculated from scholarly communication events captured in online contexts (e.g., events on Twitter, Mendeley, Wikipedia, etc.). For some time I’ve been critical of the term “altmetrics” because I had taken it to mean “alternative to citations,” but after this conference I’m not so confident in my previous position. Altmetrics is an umbrella term that we use to describe the kind of research we are doing (at least those of us who research these things); it is a buzzword that others use to talk about scholarly communication in online contexts; it is a term the media has used; it is currently used by organizations, libraries, universities, and companies to promote scientific work; and it has somehow come to represent the potential for measuring impact outside of the academic machine (other than scientific impact). While it has been criticised many times in the past for being the wrong term, I am not sure there is a more appropriate one… and that is fine. We have had suggestions including social media metrics (Haustein, Larivière, Thelwall, Amyot, & Peters, 2014), complimetrics (complementary metrics) (Adie, 2014), influmetrics (influence metrics) (Cronin & Weaver, 1995; Rousseau & Ye, 2013), and, more traditionally, webometrics (Almind & Ingwersen, 1997), to name just a few, but these do not seem to be any better, nor do they possess that certain something that “alt”metrics seems to have.

I dabble in linguistics, and I believe that words are of vital importance to our ability to understand and discuss the same phenomenon (especially in science), which is why I was so adamant that “altmetrics” was the wrong term to be using. But then I took another look at the altmetrics manifesto (the 5th anniversary of this important object was celebrated at the conference) and reevaluated my position based on my accumulated knowledge of the field, what I learned at this conference, and a closer inspection of the manifesto itself. I came to the realization that altmetrics is fine when you think of it as an “alternative means of measuring scholarly communication.”

The conference venue was great; we were housed at the Amsterdam Science Park, a sprawling complex on the eastern side of Amsterdam. There were quite a few attendees, and the presentations and workshop were informative and thought-provoking. Many of the primary data providers, publishing companies, metrics providers, and others in this field sent representatives, including Jason Priem (impactstory.org), Euan Adie (altmetric.com), William Gunn (mendeley.com), Greg Gordon (ssrn.com), Martin Fenner (niso.org), and Geoff Bilder (crossref.org). In addition, the four authors of the altmetrics manifesto were in attendance to celebrate its 5th anniversary: Jason Priem, Dario Taraborelli, Paul Groth, and Cameron Neylon. I was able to speak with both Jason and Cameron, and they were engaging, down-to-earth people who are great scholars and excited by the future of scholarly communication (I wasn’t able to speak with Paul or Dario at such length).

What I gleaned from 2:AM was that an ongoing discussion, from multiple perspectives, is taking place regarding the ability of altmetrics to measure impact, the types of impact scholarly communication might have, and the importance of trust when considering the reasons behind altmetric events. In addition, I am looking forward to being part of a group (formed at the “theories” conference breakout session) that will write a white paper describing and defining common terms used in altmetric research, so that others outside our community can understand and contribute to the ongoing work in the field. I also learned that many in the field had read our book chapter (arXiv:1502.05701) on applying citation and social theories to the understanding of altmetric events; they were very supportive of our efforts to put forth this first attempt at developing a framework for understanding altmetric events. Yet we all know that much more work needs to be done, and hopefully this white paper will be a nice step in that direction.

What I also learned from listening to Jason Priem, Dario Taraborelli, Paul Groth, and Cameron Neylon was that our group has somewhat ignored an important component of the manifesto, namely the idea of altmetric events as a type of “filter” for scholarly research and communication:

No one can read everything. We rely on filters to make sense of the scholarly literature, but the narrow, traditional filters are being swamped. However, the growth of new, online scholarly tools allows us to make new filters; these altmetrics reflect the broad, rapid impact of scholarship in this burgeoning ecosystem. We call for more tools and research based on altmetrics. (Priem, Taraborelli, Groth, & Neylon, 2010, para. 1)

This is an important aspect that I, too, simply took for granted, and something I need to reflect on as my understanding of these phenomena continues to grow and change.


References

Priem, J., Taraborelli, D., Groth, P., & Neylon, C. (2010, October 26). Altmetrics: A manifesto. Retrieved from http://altmetrics.org/manifesto