Processing: Archives

First, a digression: it was only after years of studying and being exposed to various Asian and Romance languages that I came to really think about the enormous European bias built into the designations “language” and “dialect.” The former carries connotations of higher status or prestige, while the latter denotes something more local and less important. Languages are worthy of preservation; dialects are not. Romance languages can be mutually intelligible and yet be considered distinct “languages,” while Chinese “dialects” can be mutually unintelligible and yet never achieve the status of “language.” The main difference is perhaps that the distinct Romance languages have distinct (if similar) written forms, while the distinct Chinese dialects share the same written text.

Diana Taylor’s article not only challenges the privileging of the written over the spoken and points out its Eurocentric bias, but also argues for a programme of “performance studies” that takes non-textual modes of communication, such as singing, dancing, and other forms of performance, as serious conduits of meaning. As Ann Stoler’s article argues, the written archive as a repository of meaning is not neutral or inert; history is not waiting in the archives to be discovered by historians. Neither the archive nor the “repertoire” of performance-based production of knowledge and meaning is neutral or inert. And yet the written archive has become the dominant mode of communicating knowledge because its material properties make long-term storage and retrieval possible. Our claims to knowledge, and our very capacity to know, therefore hinge on the materialities of the archive, all the way down to the techniques of language itself. Taylor is against the disciplinary divisions among dance, music, and theater, but it also appears that we should re-incorporate these non-textual modes of communication into the core curriculum of liberal arts education in general.

Classify/Think/Know

Foucault, Perec, and Drabinski are all concerned with how a priori classifications not only delineate and form the basis and limits of thinking (much as the refined descriptions of snow in Eskimo languages allow people to think about and perceive snow differently), but can also produce new realities (not just represent them) out of the tension between their framing (containing the world in categories) and their inevitable overflows. These are very productive frameworks for thinking about classification and knowledge.

When it comes to artificial intelligence and neural networks, Kate Crawford explains how, depending on how an AI system is “trained,” different patterns of bias may emerge as a result, which contradicts how AI evangelists like to sell AI as a post-human, “objective” technology. This is, of course, a very relevant and important issue, but I do wonder if the more existential question revolves around the fact that, beneath how humans understand the classifications and biases of AI systems, the neural network is more fundamentally like the “culture entirely devoted to the ordering of space, but one that does not distribute the multiplicity of existing things into any of the categories that make it possible for us to name, speak, and think.”

In other words, even though the outcomes of neural-network calculations are made legible to humans, the precise logic behind them appears enigmatic even to the developers of the AI. AI seems to produce confounding results all the time, and so far AI scientists seem unable to explain why, except by calling for ever more data. The fact that humans have created a machine whose logic is foreign even to the developers who made it seems like a major philosophical as well as practical problem.

Application: Breast Cancer Campaign Tissue Bank, A Case Study in Building a Biobank Network

One of the central concerns of our course is how the collection, organization, and analysis of information lays the foundation for the knowledge we then produce from it. By information, in this class alone we’ve considered books, manuscripts, images, tweets… In “Middlewhere: Landscapes of Library Logistics,” Professor Mattern takes us to BookOps, a centralized sorting, cataloging, and distribution facility that serves the local libraries distributed all across New York City. (1) We also get a glimpse into the workings of the Research Collections and Preservation Consortium, or ReCAP, which connects NYPL’s patrons to Princeton’s and Columbia’s resources and vice versa. We learned that if the underlying software that operates NYPL, Columbia, and Princeton were mutually intelligible, ReCAP would be much more robust, allowing patrons to run “common searches” across all three catalogs. This is a question of interoperability, and it is also the central concern of my subject today, the Breast Cancer Campaign Tissue Bank, to which I will return after addressing the larger topic of biobanking.

I’m interested in the collection, organization, and analysis of biological information, and that’s what led me to look at biobanks, which are organizations that “collect, store, and oversee the distribution of specimens and data” for institutional, non-profit, or commercial purposes. (2) Biobanks form an important part of the infrastructure for today’s population health research and personalized (or precision) medicine initiatives. I see many overlapping concerns between biobanks and libraries in terms of their infrastructures for collection, organization, and research, including the problem of interoperability. The word “biobank” itself has no concrete definition; biobanks are also called “biorepositories,” “specimen banks,” “tissue banks,” or “bio-libraries.” Basically, a biobank stores biological information, ranging from physical tissue samples to genomic data to various forms of electronic medical records. (2)

In 2009, the biobank was named one of TIME magazine’s “10 Ideas Changing the World Right Now,” but the practice of collecting, organizing, and then analyzing biological material began long before 2009. (3) So what changed? The TIME piece itself offers some clues. The 2009 article cited several European countries’ efforts to build their own “national biobanks.” It also mentioned deCODE, an Icelandic commercial genetics company that had, reportedly, collected DNA from over 100,000 Icelandic individuals, about 30% of Iceland’s entire population.

DeCODE was founded in 1996; it preceded most public and private population-wide biobanking initiatives, such as the UK Biobank or 23andMe, by almost a decade. The decade from 1996 to 2006 seems to mark the maturation and stabilization of the technology of mass DNA collection and sequencing. Cost curves for genome sequencing show that, shortly after 2007, the cost of sequencing a genome began to decline sharply. (4) This is the turning point at which population genomics shifted from a technology problem to a collection and analysis problem.

Biobanks do not only store genomic data, of course. Depending on its type and purpose, a biobank may collect your blood or your permission to access your electronic medical records held elsewhere; it may ask you to perform various physical or psychological tests; it may ask volunteers to come back months or years later for follow-up tests. The purpose of biobanks, large or small, is typically to advance research by bringing together multiple forms of data on a huge scale. But if analyzing genetic data–finding correlations between genes and diseases–is not complicated enough, analyzing multiple forms of data together is infinitely more complicated. In Kadija Ferryman and Mikaela Pitcan’s Data & Society report on “Fairness in Precision Medicine,” they quote a computer scientist calling genetic data “low-hanging fruit,” as “the methods of collecting and analyzing genetic data are more established than for other kinds of data (such as wearables data), or for analyzing multiple types of data together.” (5)

My case study today is the Breast Cancer Campaign Tissue Bank in the UK, hereafter the BCC Tissue Bank. The UK is home to one of the earliest and biggest national biobanking initiatives, simply called the UK Biobank. In the U.S., there is the “All of Us” initiative, previously known as the Precision Medicine Initiative. I chose to present on the BCC Tissue Bank because 1) it was the subject of a really neat research paper I found; (6) and 2) unlike the UK Biobank or the All of Us initiative, (7) the BCC Tissue Bank is not an actual physical biobank that recruits, collects, and stores samples from volunteers; it is a network with the specific goal of solving some of the interoperability issues that beset biobank-based research.

Around 2010, breast cancer researchers in the UK identified specific knowledge gaps in breast cancer research, and to fill them they needed “high-quality and clinically annotated samples.” This is challenging because relevant samples are spread out across different biobanks, and therefore across different software systems with different terminologies and standards. The BCC Tissue Bank was created in 2010 to solve this problem by building “a single web portal from which researchers could source and request samples from across the network using the terms agreed to in the data standard.” The BCC Tissue Bank, in other words, is built to be a networked information library. (6)

To facilitate data collection across systems, the BCC Tissue Bank decided to create a “plug-in,” called the “Node,” to be installed at each individual biobank. The researchers call this a “federated” approach that preserves the autonomy and variability of regional biobanks, as opposed to a centralized approach that mandates that every bank use the same system, which would inevitably require massive data transfers from every biobank already using a different system.
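The paper describes this architecture at a high level rather than in code, but the federated idea can be sketched in a few lines of Python: a central portal fans a query out to a Node at each biobank, and each Node answers from its own local system. Everything here (names, fields, the toy search logic) is a hypothetical illustration, not the BCC Tissue Bank’s actual software:

```python
# Minimal sketch of a federated biobank network: the central portal never
# touches local databases directly; it only talks to Node adapters.
from dataclasses import dataclass

@dataclass
class Node:
    """A plug-in sitting in front of one biobank's local data system."""
    biobank_name: str
    records: list  # stand-in for the biobank's own database

    def search(self, term: str) -> list:
        # Each Node answers from its local store; the local system itself
        # stays autonomous and unchanged.
        return [r for r in self.records if term in r.get("diagnosis", "")]

def portal_search(nodes: list, term: str) -> dict:
    """The central web portal: one query, fanned out across the network."""
    return {node.biobank_name: node.search(term) for node in nodes}

nodes = [
    Node("Biobank A", [{"sample_id": "A-001", "diagnosis": "invasive ductal carcinoma"}]),
    Node("Biobank B", [{"sample_id": "B-107", "diagnosis": "ductal carcinoma in situ"}]),
]
print(portal_search(nodes, "ductal carcinoma"))
```

The design choice is the same one ReCAP faces: federation trades some central control for the ability to leave each member’s existing system in place.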

Data collection in the case of the BCC Tissue Bank actually means data uploads, which can take a number of forms:

  1. Direct Input.

The first option is to input data directly into a centralized database run by the BCC Tissue Bank: biobanks enter information straight into the web portal. This has the benefit that the data vocabulary is automatically aligned with the BCC Tissue Bank’s vocabulary. While some biobanks without a robust data infrastructure might choose this option, most biobanks already have their own elaborate information systems, so inputting data separately into a completely different system would prove cumbersome and unrealistic.

  2. Spreadsheets.

Spreadsheets are exported out of one system and then imported into the BCC Tissue Bank system. Spreadsheets allow for mass data transfer, but as anyone who has experience migrating datasets across systems knows, cleaning up the spreadsheets so that information from one system is legible to another can be very complicated and time-consuming (a sketch of this clean-up step follows the list below).

  3. Using JavaScript Object Notation (JSON).

Biobanks can use JSON “to automate the push of data from their biobanks’ data systems into the Node.” This is obviously the preferred method for the BCC Tissue Bank, as it eliminates the recurring labor involved in uploading via spreadsheet or direct entry.
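To make option 3 concrete: the paper does not publish the Node’s actual schema or endpoints, so the payload shape, field names, and URL below are all hypothetical. This is just a minimal Python sketch of what an automated push might look like:

```python
# Sketch of an automated JSON push from a biobank's system to its Node.
# Endpoint, fields, and values are invented for illustration.
import json
import urllib.request

payload = {
    "biobank_id": "example-biobank-01",
    "samples": [{
        "sample_id": "A-001",
        "tissue_type": "breast",
        "menopausal_status": "post-menopausal",
        "collected": "2015-06-02",
    }],
}

req = urllib.request.Request(
    "https://node.example.org/api/samples",  # hypothetical Node endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)
```

In practice, a script like this would run on a schedule from inside the biobank’s own system, which is exactly what removes the recurring manual labor.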
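And circling back to option 2: the “clean-up” involved in the spreadsheet route is largely vocabulary translation. Here is a minimal sketch, again with invented column names and term mappings rather than the actual BCC data standard, of normalizing a local export before import:

```python
# Sketch of spreadsheet clean-up: rename local columns and translate local
# terms into the central vocabulary. Mappings here are hypothetical.
import csv

COLUMN_MAP = {"PatientStatus": "menopausal_status"}  # local -> central column
VALUE_MAP = {"postmenopausal": "post-menopausal"}    # local -> central term

def normalize_row(row):
    """Return a row keyed and valued in the central vocabulary."""
    out = {}
    for col, value in row.items():
        central_col = COLUMN_MAP.get(col, col)
        out[central_col] = VALUE_MAP.get(value.strip().lower(), value.strip())
    return out

with open("local_export.csv", newline="") as src, \
     open("tissue_bank_upload.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=[COLUMN_MAP.get(c, c) for c in reader.fieldnames])
    writer.writeheader()
    for row in reader:
        writer.writerow(normalize_row(row))
```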

To ensure that the data push goes through smoothly, however, there is still the problem of database-by-database, or regional, variation in how a term is used. For that, the Node has a module for “mapping,” which maps the terms used by the central system onto the local terms used by individual biobanks. Once the relationship between a local and a central term is established, or mapped, researchers can perform searches on the BCC Tissue Bank’s web portal using the central terms while the local biobanks continue to use whatever terms they have been using. One example: the central term “post-menopausal” can be mapped onto a local system that records “postmenopausal” without the hyphen.
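Reduced to its simplest form, the mapping module is a translation table between local and central vocabularies. The sketch below is a toy illustration of the idea, not the Node’s actual implementation, using the post-menopausal/postmenopausal example from above:

```python
# Toy version of the Node's mapping step: local terms are translated to
# central terms at query time. The table itself is invented.
LOCAL_TO_CENTRAL = {
    "postmenopausal": "post-menopausal",
    "pre menopausal": "pre-menopausal",
}

def to_central(local_term):
    """Translate a local term to the central vocabulary (identity if unmapped)."""
    return LOCAL_TO_CENTRAL.get(local_term.strip().lower(), local_term)

local_records = [{"sample_id": "B-107", "status": "postmenopausal"}]

# A portal search for the central term still finds the locally-coded record:
hits = [r for r in local_records if to_central(r["status"]) == "post-menopausal"]
print(hits)  # -> [{'sample_id': 'B-107', 'status': 'postmenopausal'}]
```

The point of this design is that the translation work happens once, at the Node, instead of being re-done by every researcher for every search.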

So here is a summary of the main approaches the BCC Tissue Bank took to increase interoperability among the different data systems used by individual biobanks. Similar to how BookOps and ReCAP facilitate the logistics of running a distributed network, the BCC Tissue Bank is a project that seeks to centralize information spread across a distributed network and embedded in varying standards and definitions. The BCC Tissue Bank is still up and running, of course, although the researchers note that the preferred method of data transfer is still spreadsheets, as there are simply too many technical and regulatory issues with the automated data-push option. This shows how ingrained infrastructure can impede the adoption of a new technology and thus influence the trajectory of the technological medium itself. It reminds me of how computing co-evolved with punch cards for a good while before punch cards finally became history. It also makes me curious about the claims that blockchain technology will change the way medical records are accessed and shared.

 

References

  1. Shannon Mattern, “Middlewhere: Landscapes of Library Logistics,” Urban Omnibus, June 24, 2015.
  2. Gregory J. Boyer et al., “Biobanks in the United States: How to Identify an Undefined and Rapidly Evolving Population,” Biopreservation and Biobanking 10.6 (2012): 511–517.
  3. Alice Park, “Biobank, 10 Ideas Changing the World Right Now,” TIME, March 12, 2009, http://content.time.com/time/specials/packages/article/0,28804,1884779_1884782_1884766,00.html
  4. Editorial Team, “The Past, Present and Future of Genome Sequencing,” Labiotech.eu, April 9, 2018, https://labiotech.eu/features/genome-sequencing-review-projects/
  5. Kadija Ferryman and Mikaela Pitcan, Fairness in Precision Medicine (Data & Society, February 2018).
  6. Philip R. Quinlan et al., “The Informatics Challenges Facing Biobanks: A Perspective from a United Kingdom Biobanking Network,” Biopreservation and Biobanking 13.5 (2015): 363–370.
  7. All of Us Research Program, National Institutes of Health (NIH), allofus.nih.gov

Processing: Library Lineages

Matthew Battles’ sweeping history of the library, which is in turn a history of the organization and consumption of knowledge, provides a fascinating survey of the roles libraries have served: as custodians of wisdom, status symbols, objects of conspicuous consumption, and, more recently, spaces of gathering for purposes of community, literacy, and access to information.

As public libraries today become more and more what Mattern calls “a network of integrated, mutually reinforcing, evolving infrastructures — in particular, architectural, technological, social, epistemological and ethical infrastructures,” university libraries seem to retain the elite, research-oriented quality that characterized most libraries before the proliferation of mass-produced books. Big cities like NYC aside, most public libraries nowadays do not hold much scholarly literature at all, let alone offer access to academic databases. I can’t help but think that this university/public divide is deeply problematic: it makes independent scholarship without university affiliation much more difficult (and expensive), while public libraries are stretched thin serving the community’s needs for baseline English literacy as well as digital literacy.

Processing: Ecologies of Information (Week 2)

This week we read about three different metaphors for digital information systems: 1) infrastructure; 2) the stack; 3) the commons. Star, for example, is concerned with the design of information systems through standards, protocols, and categories that may be biased but become deeply entrenched and embedded in the network once adopted. Bratton wrestles with emerging governmentalities that transcend physical and sovereign boundaries, and whose actors include private corporations as well as states. Bratton proposes the idea of the Black Stack, an “image of a totality” that directly contrasts with Star’s metaphor of information as infrastructure.

While Star uses the metaphor of infrastructure to map out potential sites of research into information systems, it’s not clear what action items follow from the metaphor of the “Stack,” or even from the idea of “Cloud platforms.” The specific referents are companies like Google and Amazon, which provide services that are very much grounded in, and powered by, very physical infrastructure and human labor. Their sheer wealth and influence do distort traditional geopolitics; see, for example, Amazon’s enormous leverage over the cities competing to host its new headquarters. But the Stack-like verticality of relationships doesn’t appear to be useful for visualizing its components (so as to identify sites of research and potential intervention). As Chris Watterston’s famous sticker says:

There is no cloud, it's just someone else's computer.