One of the central concerns in our course is how the collection, organization, and analysis of information lay the foundation for the knowledge we then produce from it. As information, in this class alone we’ve considered books, manuscripts, images, tweets… In “Middlewhere: Landscapes of Library Logistics,” Professor Mattern takes us to BookOps, a centralized sorting, cataloging, and distribution facility that serves local libraries across New York City. (1) We also get a glimpse into the workings of the Research Collections and Preservation Consortium, or ReCAP, which connects NYPL’s patrons to Princeton’s and Columbia’s resources and vice versa. We learned that if the software systems underlying NYPL, Columbia, and Princeton were mutually intelligible, ReCAP would be much more robust, because patrons could run “common searches” across all three catalogs. This is a question of interoperability, and it is also the central concern of my subject today, the Breast Cancer Campaign Tissue Bank, to which I will return after addressing the larger topic of biobanking.
I’m interested in the collection, organization, and analysis of biological information, and that’s what led me to look at biobanks, which are organizations that “collect, store, and oversee the distribution of specimens and data” for institutional, non-profit, or commercial purposes. (2) Biobanks form an important part of the infrastructure for today’s population health research and personalized medicine, or precision medicine, initiatives. I see many overlapping concerns between biobanks and libraries in terms of their infrastructure for collection, organization, and research, including the problem of interoperability. The word “biobank” itself has no fixed definition; a biobank may also be called a “biorepository,” “specimen bank,” “tissue bank,” or “bio-library.” Basically, a biobank stores biological information, ranging from physical tissue samples to genomic data to various forms of electronic medical records. (2)
In 2009, biobanks were named one of TIME magazine’s “10 Ideas Changing the World Right Now,” but the practice of collecting, organizing, and analyzing biological material had begun long before 2009. (3) So what changed? The TIME piece itself offers some clues. It cited several European countries’ efforts to build their own “national biobanks.” It also mentioned deCODE, an Icelandic commercial genetics company that has reportedly collected DNA from over 100,000 Icelanders, which is about 30% of Iceland’s entire population.
Founded in 1996, deCODE preceded most public and private population-wide biobanking initiatives, such as the UK Biobank or 23andMe, by almost a decade. The decade from 1996 to 2006 seems to mark the maturation and stabilization of the technology for mass DNA collection and sequencing. This diagram shows how, shortly after 2007, the cost of sequencing a genome started to decline sharply. (4) This is the turning point at which population genomics shifts from a technology problem to a collection and analysis problem.
Biobanks do not only store genomic data, of course. Depending on its type and purpose, a biobank may collect your blood or your permission to access your electronic medical records held elsewhere; it may ask you to take various physical or psychological tests, and it may ask volunteers to come back months or years later for follow-up tests. The purpose of biobanks, large or small, is typically to advance research by bringing together multiple forms of data on a huge scale. But if analyzing genetic data (finding correlations between genes and diseases) is already complicated, analyzing multiple forms of data together is far more so. In their Data & Society report “Fairness in Precision Medicine,” Kadija Ferryman and Mikaela Pitcan quote a computer scientist who calls genetic data “low-hanging fruit,” as “the methods of collecting and analyzing genetic data are more established than for other kinds of data (such as wearables data), or for analyzing multiple types of data together.” (5)
My case study today is the Breast Cancer Campaign Tissue Bank in the UK, hereafter the BCC Tissue Bank. The UK is home to one of the earliest and biggest national biobanking initiatives, simply called the UK Biobank. In the U.S., there is the “All of Us” initiative, previously known as the Precision Medicine Initiative. I chose to present on the BCC Tissue Bank because 1) it was the subject of a really neat research paper I found; (6) and 2) unlike the UK Biobank or the All of Us initiative, (7) the BCC Tissue Bank is not a physical biobank that recruits volunteers and collects and stores their samples; it is meant to be a network with the specific goal of solving some of the interoperability issues that concern biobank-based research.
Around 2010, breast cancer researchers in the UK identified specific knowledge gaps in breast cancer research. To fill those gaps they needed “high-quality and clinically annotated samples,” which are hard to obtain because relevant samples are spread across different biobanks, and therefore across different software systems with different terminologies and standards. The BCC Tissue Bank was created in 2010 as an attempt to solve this problem by creating “a single web portal from which researchers could source and request samples from across the network using the terms agreed to in the data standard.” The BCC Tissue Bank, therefore, is built to be a networked information library. (6)
To facilitate data collection between systems, the BCC Tissue Bank decided to create a “plug-in,” called the “Node,” to be installed at each individual biobank. The researchers call this a “federated” approach: it preserves the autonomy and variability of regional biobanks, as opposed to a centralized approach that requires every bank to use the same system, which would inevitably force biobanks already using a different system into a massive transfer of data.
Data collection in the case of the BCC Tissue Bank actually means data uploads, which can take a number of forms:
- Direct Input.
Biobanks can enter data directly, through the web portal, into a centralized database run by the BCC Tissue Bank. This has the benefit that the data vocabulary is automatically aligned with the BCC Tissue Bank’s vocabulary. While biobanks without a robust data infrastructure might in theory choose this option, most biobanks already have their own elaborate information systems, so entering the same data separately into a completely different system would prove cumbersome and unrealistic.
- Spreadsheets.
Spreadsheets are exported from a biobank’s own system and then imported into the BCC Tissue Bank system. Spreadsheets allow for mass data transfer, but as anyone with experience migrating datasets across systems knows, cleaning up the spreadsheets so that information from one system is legible to another can be very complicated and time-consuming.
- Automated JSON Push.
Biobanks can use JSON “to automate the push of data from their biobanks’ data systems into the Node.” This is obviously the preferred method for the BCC Tissue Bank, as it eliminates the periodic labor involved in upload via spreadsheet or direct entry.
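To make the difference between the spreadsheet route and the automated JSON route more concrete, here is a minimal Python sketch. It is only an illustration: the column names, the sample values, the `biobank-a` identifier, and the shape of the JSON document are all invented here, and the paper does not specify what the Node actually expects. The sketch also stops short of sending anything over the network.

```python
import csv
import io
import json

# Hypothetical spreadsheet export from a local biobank system.
# Column names and values are invented for illustration only.
EXPORT_CSV = """sample_id,tissue_type,menopausal_status
S-001,tumour,postmenopausal
S-002,normal,premenopausal
"""


def rows_from_export(text):
    """Parse a CSV export into a list of dicts, one per sample row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_push_payload(biobank_id, rows):
    """Wrap sample rows in a JSON document that a scheduled job could push.

    The envelope ("biobank", "samples") is an assumed structure,
    not the Node's real format.
    """
    return json.dumps({"biobank": biobank_id, "samples": rows}, indent=2)


payload = build_push_payload("biobank-a", rows_from_export(EXPORT_CSV))
print(payload)
```

With a spreadsheet, someone has to repeat the export and clean-up by hand for every upload; with a push, a script along these lines could in principle run on a schedule, which is the periodic labor the JSON option is meant to remove.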
To ensure that the data pushes through smoothly, however, there is still the problem of database-by-database or regional variation in how a term is used. For that, the Node has a “mapping” module, which maps each term used by the central system onto the local term used by the individual biobank. Once the central and local terms are mapped, researchers can search the BCC Tissue Bank’s web portal using the central terms while the local biobanks continue to use whatever terms they have been using. Here’s an example of how a central term like “post-menopausal” is mapped onto a local system that records “postmenopausal” without the hyphen.
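The mapping idea can be sketched in a few lines of Python. Everything named here is hypothetical: the biobank identifiers, the sample records, the `menopausal_status` field, and the second biobank’s local spelling are made up, and the real Node’s mapping module is certainly more elaborate. The sketch only shows the principle that a search is phrased in the central term and translated into each biobank’s local term at that biobank.

```python
# Hypothetical per-biobank mapping tables: central term -> local term.
NODE_MAPPINGS = {
    "biobank-a": {"post-menopausal": "postmenopausal"},
    "biobank-b": {"post-menopausal": "Post Menopausal"},
}

# Hypothetical local records, each stored in its biobank's own vocabulary.
LOCAL_SAMPLES = {
    "biobank-a": [
        {"sample_id": "A-1", "menopausal_status": "postmenopausal"},
        {"sample_id": "A-2", "menopausal_status": "premenopausal"},
    ],
    "biobank-b": [
        {"sample_id": "B-1", "menopausal_status": "Post Menopausal"},
    ],
}


def to_local_term(biobank_id, central_term):
    """Translate a central term into a biobank's local term.

    Terms without a mapping are passed through unchanged.
    """
    return NODE_MAPPINGS.get(biobank_id, {}).get(central_term, central_term)


def search(central_term, field="menopausal_status"):
    """Search every biobank with one central term, translating per biobank."""
    hits = []
    for biobank_id, samples in LOCAL_SAMPLES.items():
        local_term = to_local_term(biobank_id, central_term)
        hits.extend(s["sample_id"] for s in samples if s[field] == local_term)
    return hits


print(search("post-menopausal"))  # ['A-1', 'B-1']
```

The point of the design is that neither biobank has to rename anything in its own records; only the mapping table at each Node needs to be maintained.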
So here’s a summary of the main approaches the BCC Tissue Bank took to increase interoperability between the different data systems used by individual biobanks. Similar to how BookOps and ReCAP are meant to facilitate the logistics of running a distributed network, the BCC Tissue Bank is a project that seeks to centralize information spread across a distributed network and embedded in varying standards and definitions. The BCC Tissue Bank is still up and running, although the researchers note that the method biobanks prefer in practice is still the spreadsheet, as there are just too many technical and regulatory issues with the automatic data push option. This shows how entrenched infrastructure can impede the adoption of a new technology and thus influence the trajectory of the technological medium itself. It reminds me of how computing technology co-evolved with punch cards for a good while before punch cards finally became history. It also makes me curious about claims that blockchain technology will change the way medical records are accessed and shared.
- Shannon Mattern, “Middlewhere: Landscapes of Library Logistics,” Urban Omnibus (June 24, 2015)
- Boyer, Gregory J. et al. “Biobanks in the United States: How to Identify an Undefined and Rapidly Evolving Population.” Biopreservation and Biobanking 10.6 (2012): 511–517. PMC. Web. 2 Oct. 2018.
- Alice Park, “Biobank, 10 Ideas Changing the World Right Now,” TIME, March 12, 2009, http://content.time.com/time/specials/packages/article/0,28804,1884779_1884782_1884766,00.html
- Editorial Team, “The Past, Present and Future of Genome Sequencing,” Labiotech.eu, April 9, 2018, https://labiotech.eu/features/genome-sequencing-review-projects/
- Kadija Ferryman and Mikaela Pitcan, Fairness in Precision Medicine (Data & Society, February 2018)
- Quinlan PR, Groves M, Jordan LB, et al. The informatics challenges facing biobanks: A perspective from a United Kingdom biobanking network. Biopreserv Biobank 2015;13:336–370
- All of Us Initiative, National Institutes of Health (NIH), allofus.nih.gov