We Are Working On It is a representation of 1.9 million scientific journal articles related to cancer research from 1980 to 2010, accompanied by audio interviews with researchers from Memorial Sloan Kettering Cancer Center in New York. We Are Working On It was my final project for the Data Representation and Collective Storytelling classes at ITP.
Background and Objectives
This year a lot of my work at ITP has involved the use of large data sets to tell or discover personal stories. We Are Working On It is a continuation of this theme. Unfortunately my family, like many families, has experienced the devastating effects of cancer across multiple generations. I wanted to use available data and accessible tools to investigate how the amount and type of cancer research has changed during my lifetime. I also wanted to look behind the data and meet with cancer research scientists, talk about their work and understand who they are and what they do.
My goal with this project was to combine both sources - data and audio recordings of conversations with researchers - into one coherent timeline.
Retrieving 1.9 Million Scientific Journal Article Abstracts
The first part of the project involved identifying a source of scientific data, then writing an application to retrieve, store, parse and display this data in a timeline. I decided to use the PubMed scientific journal database provided by the National Center for Biotechnology Information (NCBI) as my primary source of data because of the range and depth of available data, and the fact that this data is accessible through a very solid API.
My Processing application first passed a search term (“cancer” and related keywords) into the NCBI's search utility and retrieved scientific journal abstracts as XML - the amount of data was pretty immense (it took over a day to complete the full search).
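As a rough illustration of what a paged search request against NCBI can look like, here is a minimal sketch in Python (not the original Processing code). It assumes the public E-utilities ESearch endpoint and its standard query parameters; the helper names `build_search_url` and `page_offsets` are hypothetical. ESearch returns matching article IDs, so the abstracts themselves would be fetched in a second step.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, year, retstart=0, retmax=500):
    """Build an ESearch URL for one page of PubMed results published in `year`."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",    # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
        "retstart": retstart,  # offset of the first result on this page
        "retmax": retmax,      # number of results per page
    }
    return ESEARCH_URL + "?" + urlencode(params)


def page_offsets(total, page_size):
    """Offsets needed to walk `total` results in pages of `page_size`."""
    return list(range(0, total, page_size))


# Build the request for the first page of 1980 results (no network call here).
url = build_search_url("cancer", 1980)
offsets = page_offsets(1200, 500)
```

Splitting the search by year and by page keeps each request small, which matters when the full run takes more than a day and may need to be resumed part-way through.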
One of the key challenges was to then interpret and store all this data - as you might expect, the hygiene of the data gets worse the further back in time we look. My application had to construct useful objects (journals, articles, research centres, scientists etc.) from the data and then store these objects in a local database (MongoDB) - where it would be much quicker to run follow-up queries.
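To give a sense of what constructing objects from the XML involves, here is a small Python sketch. The sample XML is hand-written and only loosely modelled on PubMed's format, and the `Article` and `Author` classes are hypothetical stand-ins for the objects my application built. Missing fields become empty strings rather than errors, which is one way to cope with patchy older records.

```python
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass, field

# Hand-written sample, loosely modelled on PubMed XML; not real data.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0000001</PMID>
      <Article>
        <Journal><Title>Hypothetical Journal of Oncology</Title></Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><ForeName>Sam</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


@dataclass
class Author:
    first_name: str
    last_name: str


@dataclass
class Article:
    pmid: str
    title: str
    journal: str
    authors: list = field(default_factory=list)


def parse_articles(xml_text):
    """Turn an XML string into a list of Article objects."""
    root = ET.fromstring(xml_text)
    articles = []
    for node in root.iter("PubmedArticle"):
        authors = [
            Author(
                first_name=a.findtext("ForeName", default=""),
                last_name=a.findtext("LastName", default=""),
            )
            for a in node.iter("Author")
        ]
        articles.append(
            Article(
                pmid=node.findtext(".//PMID", default=""),
                title=node.findtext(".//ArticleTitle", default=""),
                journal=node.findtext(".//Journal/Title", default=""),
                authors=authors,
            )
        )
    return articles


# Plain dictionaries are a convenient shape for a document database.
documents = [asdict(article) for article in parse_articles(SAMPLE_XML)]
```

Each dictionary in `documents` could then be handed to a MongoDB driver for storage; that step is left out here so the sketch runs without a database.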
Building a Timeline of Cancer Research
The next step of the process involved the presentation of the data in the form of a timeline. I spent a lot of time thinking about how the aesthetic of the representation of the data could be sensitive to the subject matter, informative and elegant - just enough design to tell a story without getting in the way. I knew from early in the process that it would be incredibly difficult to take all this scientific information and tell an accessible story - I decided to start with a smaller story about the number of scientists working on cancer research and the volume of work they produce. This seemed to fit well with the parallel track of the project where I was meeting with scientists.
I developed a graphic design that shows the number of scientists and research centres that were publishing work related to cancer research in a number of different journals. I wanted this quantitative data to read almost like a sentence from the top of the graphic down to the bottom.
I also wanted to take a number of article titles from each year and use them to give a flavour of the type of work being done; I used a lexical analysis technique to weight the terms used most frequently in each year, picking articles that scored highly on this representative scale. A better way of doing this would have been to use the most cited articles from each year, but this wasn't data I had access to.
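One simple way to do this kind of weighting is sketched below in Python. It is an assumption-laden illustration rather than the exact technique I used: each term is weighted by how many of that year's titles contain it, and each title is scored by the average weight of its terms. The stopword list, the function names and the sample titles are all made up for the example.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, enough for the example.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(title):
    """Lower-case a title and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS]


def rank_titles(titles, top_n=3):
    """Return the `top_n` titles whose terms are most common across `titles`."""
    # Weight of a term = number of titles in which it appears.
    weights = Counter()
    for title in titles:
        weights.update(set(tokenize(title)))

    def score(title):
        terms = tokenize(title)
        if not terms:
            return 0.0
        # Average, so long titles are not favoured just for being long.
        return sum(weights[t] for t in terms) / len(terms)

    return sorted(titles, key=score, reverse=True)[:top_n]


# Made-up titles standing in for one year of data.
titles_for_year = [
    "Gene expression in breast cancer",
    "Breast cancer therapy",
    "Notes on laboratory safety",
]
ranked = rank_titles(titles_for_year)
```

Averaging rather than summing the weights is a design choice: it rewards titles made mostly of that year's common vocabulary instead of titles that are simply long.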
I have spoken about my belief that all data representation that purports to represent “the truth” needs to address the source and quality of its data. In each graphic generated by my application I make a statement about any data that didn't meet quality checks (an example of a failed check would be where a scientist had a first but no last name, or where an article had no title).
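The two example checks above can be sketched in a few lines of Python. This is an illustration under assumptions: records are plain dictionaries with `title` and `authors` keys, and the function names and sample records are hypothetical.

```python
def quality_issues(article):
    """Return a list of human-readable problems found in one article record."""
    issues = []
    if not (article.get("title") or "").strip():
        issues.append("missing title")
    for author in article.get("authors") or []:
        first = (author.get("first_name") or "").strip()
        last = (author.get("last_name") or "").strip()
        if first and not last:
            issues.append("author with first name but no last name")
    return issues


def summarize(articles):
    """Build the statement that accompanies a graphic."""
    failed = [a for a in articles if quality_issues(a)]
    return f"{len(failed)} of {len(articles)} records did not meet quality checks"


# Made-up records: one clean, one without a title, one with an incomplete author.
records = [
    {"title": "An example title", "authors": [{"first_name": "Jane", "last_name": "Doe"}]},
    {"title": "", "authors": [{"first_name": "Jane", "last_name": "Doe"}]},
    {"title": "Another title", "authors": [{"first_name": "Sam", "last_name": ""}]},
]
statement = summarize(records)
```

Keeping the failed records in the count, rather than silently dropping them, is what lets the graphic say how much of the data it could not trust.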
The application generated a number of PDF images (showing data for 1980, 1990, 2000, 2010) which were then brought into Adobe Illustrator for final typesetting and small tweaks.
The graphics were printed on archival paper and framed. When the framed graphics hang next to each other they show the change in the data being represented over time - with more time it would be great to print and hang each year, but this would have been very expensive. You can see the final graphics (and the many failures along the way) on Flickr.
Interviews at Memorial Sloan Kettering Cancer Center
I spent two afternoons at MSKCC's Powell research laboratory. I interviewed two scientists, Laura Eccles and Ciara O'Driscoll (from England and Ireland respectively), about their work.
I prepared for the interviews by doing some basic desk research about MSKCC and the work being done there. I developed a short interview script that focused on the scientists' backgrounds, how they came to be working in cancer research, their day to day work and their hopes for the future.
Laura and Ciara were incredibly forthcoming about their lives and their work; I came away from the conversations with them feeling immensely grateful for their hard work, but also much more aware of the day to day frustrations involved in research - it is not an easy life by any stretch of the imagination.
I have embedded two short excerpts from the interviews here - this audio was played alongside the printed graphics at the ITP show in May 2011.
What I Learned
This project was technically challenging and personally draining at times - but I learned a huge amount. I feel now as though I would be much better prepared for a project involving the manipulation of a huge amount of data - there are just sensible ways to do this stuff, and it took a lot of trial and error to hit on a good approach.
Aside from the technical challenges, I believe that I answered the questions that I set out to answer at the beginning - even if I only scratched the surface of the potential to elaborate on this answer.
I know now that there have never been more people involved in the search for better treatments and potential cures for cancer - that isn't something I knew before this project. There are over six times as many people working in cancer-related fields as there were in 1980. I also know that this international network of researchers publishes more work than ever - and that a considerable amount of this work is available to anybody who wants to look for it through bodies like NCBI.
I got to meet two of the people working directly on cancer research - I learned that they are deeply motivated to do the work that they do, but that they are not superheroes. They go to work, work is frustrating, sometimes they make progress and most of the time they don't. It is deeply heartening to have had a tiny glimpse into the day to day work of these people - it makes the data really come alive for me because I can start to picture all the other people grafting away on this impossible problem around the world.
I am proud of this project.