Voluminous problems

Back in 2011, I visited an exhibition by photographer Erik Kessels at Foam, a photography museum in Amsterdam. It was an invitation to wander through rooms full of unordered mounds of printed photographs – every photo that had been uploaded to Flickr within a 24-hour period. At that time, the daily upload was around 1 million images. This was just before the total ubiquity of smartphones, before Facebook acquired Instagram, around the time photo sharing was becoming a core component of communication. As a mechanism for appreciating the volume of images generated, the exhibition was both memorable and formidable.

Photo: Erik Kessels, http://www.kesselskramer.com/exhibitions/24-hrs-of-photos

According to some stats from earlier this year: 300 million photos are uploaded to Facebook every day, 95 million images and videos are uploaded to Instagram, and a total of 4.7 trillion photos were stored digitally by the end of 2017. And of course, the growth curve is exponential.

Photo: Erik Kessels, http://www.kesselskramer.com/exhibitions/24-hrs-of-photos

When I think back to those rooms, and how daunting the volume was even then, I viscerally feel Tagg’s comment about “the danger of being entirely submerged if the other cameras follow suit and the stream becomes a deluge.” There is no solution to the challenge of ordering and archiving such image-based data that does not involve some “archiving machine,” whether we submit to the logic of a filing cabinet or of photo recognition AI.

Spiegelman’s “Words: worth a thousand” seems a quaint account of the problem of ordering pre-digital images. The senior librarian’s comment that the “indexing will become more rational when we go to digital storage” seems a radical simplification of whose definition of “rational” will have deciding power over ordering.

Perhaps it is the time of the semester, and having to deal with my own problems of “overaccumulation” of information, but surrendering to the convenient tyranny of AI suddenly seems to make sense. Yes, all ordering, codifying, and archiving will “make us ask what we have lost of our being to archival machines” – but there was a “certain lack of precision” in human ordering of pre-digital photographs too. We have never been in control of our data; some machines just give us the sense that we are.

epistemological + political subjects

Stoler and Taylor both contextualize archives in the power imbalances and systemic oppressions imposed by colonization and its enduring legacies. Taylor separates the archive from the repertoire, the latter for embodied and lived forms of knowledge. She argues that Western epistemologies are founded on the equation of writing with memory and knowledge. The dominance of language and writing means these mechanisms for knowledge come to stand in for meaning itself. Yet the division is not perfectly clean, with written knowledge archived and embodied knowledge kept only in the repertoire. Even through colonization, writing did not entirely displace embodied practice, and colonizers brought (and enforced) their own embodied practices too. But writing became the dominant mechanism of legitimization over other epistemic and mnemonic systems. This is why, as Stoler argues, we should think of the archive as the supreme technology of the late nineteenth-century imperial state.

So what happens to our understanding of “knowledge” (and whose knowledge) as we have moved from the dominance of technologies based on written texts to digital ones? Videos and images are not the same as performances and acts, though we may treat them as alike. Robertson identifies this tension in the digital archiving of lesbian porn/erotic publications from 1984-2004. This material was both documented and embodied, though it was very likely intended for a small and time-bound audience of queer zine readers, before anyone could imagine reverse image search engines and permanent googleability.

Foucault reminds us again that the realm of thinkable thoughts available to any one of us is limited by our discourses and our ability to determine something as knowledge. The archive functions both as the “law of what can be said” and as the rules by which these delineated “unique events” accumulate into understood patterns. We can neither exhaustively archive all of a given society, culture, or time period, nor adequately describe our own from within. What gets recorded in archival institutions passes for collective memory on the side of the powerful, while memories that do not serve them fall into “structural amnesia” (Edward Evan Evans-Pritchard). History reminds us of the importance of looking to the form of the archive, as well as its content. In my research on voice recognition technologies, looking to the mechanisms of digital archiving itself hopefully provides one avenue to do this.

The Good Life: the utopian “no place” of email archives

Susan Breakell makes the point that “to archive” was not originally used as a verb; rather, the word became one around the same time as the entrance of the PC into our homes and lives. We use the word both to mean to store records, and to store electronic information that we no longer regularly use. Zielinski highlights that “the archive serves to organize mental and enforced orders in the shape of appropriate structure and to preserve, with a tremendous amount of effort, the memory of past orders.” And from Mattern we see that archives demonstrate the interconnected technological, social, intellectual, and architectural infrastructures they require. This embodiment is entwined with certain politics and epistemologies, and takes place in large part through aesthetics.

“The Good Life” is a project by artists Tega Brain and Sam Lavigne, a piece of archival performance art that positions your email inbox as the stage. It is based on a portion of the emails sent between Enron employees from the late 1990s to the early 2000s. This large-scale archive of emails was the first of its kind, and was the training database used for many early natural language processing (NLP) algorithms – including most current spam filters, and early versions of Siri. By allowing your inbox to be hijacked for a period of time of your choosing (between 7 and 28 years), you too can embody “The Good Life” of white collar, mainly white, mainly male, corporate workers (and some criminals) through the language-based architectures of late 90s corporate culture. In doing so, we can all explore the enduring nature and wide usage of digital archives, “the impulse to archive” against “the right to be forgotten,” the inescapability of bias in training data sets, and the aesthetic of emails, the poetry, and the “rational” world order of this corporate elite.

Enron started out as an energy company. Based in Houston, Texas, it was considered “America’s most innovative company” for six years in a row. It employed 20,000 people, and in 2000, the year before it collapsed, it claimed revenues of $101 billion. It embodied a vision of American corporate success, constantly scaling and growing, moving from energy into creating new financial instruments, from trading to investments in broadband. Right before its collapse, it was in partnership with Blockbuster to stream movies online – it could have been Netflix. In 2001, its stock price collapsed, and in the fallout, the company and its executives were found to have been involved in price fixing, misrepresentation of earnings, institutionalized accounting fraud, and generally corrupt business practices. When it declared bankruptcy, the filing was the largest in American history.

As a consequence, the Federal Energy Regulatory Commission (FERC) acquired the company’s data, including the massive archive of emails that had been sent to, from, and between employees – 1.6 million emails in total. After complaints, some of these emails were removed from the archive. We can consider this a form of selecting and curating the archive, though as Breakell notes, “any selection process is problematic.” One hundred employees were given 10 days to search through and remove personal emails (of their coworkers, their friends, their family members, their children). These workers were told to search for terms like “social security number,” “credit card number,” and “divorce.” However, as you can still find emails sent between divorcing spouses and flirting coworkers through The Good Life’s database, it’s clear many of these searches were not particularly effective. The resulting archive of 500,000 emails was the first large-scale archive of its kind to be made publicly available. It is still one of the only large public email collections that’s easily and freely accessible online.

As Hal Foster writes, “no place” is the literal meaning of “utopia.” The project’s name, “The Good Life,” speaks to Foster and Breakell’s point – in the “no place” of the archive, we see the archival impulse go further: we can imagine “possible scenarios of alternative social relations.” To fully experience “The Good Life,” you can opt in for your own email inbox to receive a slightly reduced version of the archive. You can have 225,000 emails in total sent to your inbox, in the order in which they were originally sent and with equivalent time-spacing. Originally the project provided the option to have the emails sent over 5 days, 30 days, or 1 year, but these tiers had to be canceled because the emails kept getting blacklisted as spam. Given that modern spam filters were originally built off this database of emails, this seems ironic. Now, your options are to sign up to receive the emails every day for 7, 14 or 28 years.
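To get a feel for why the shorter tiers tripped spam filters, here is a rough sketch of the arithmetic and of what “equivalent time-spacing” could mean in practice. The artists’ actual scheduling code is not described here, so everything below is an assumption for illustration: the function names are hypothetical, a year is taken as 365 days, and the spacing is modeled as a simple linear rescaling of the original send times onto the replay window.

```python
from datetime import datetime, timedelta

TOTAL_EMAILS = 225_000  # size of the reduced archive, from the project description


def emails_per_day(total, days):
    """Average daily volume if `total` emails are spread over `days` days."""
    return total / days


def replay_schedule(sent_times, start, duration):
    """Hypothetical replay: keep the original order and relative spacing,
    but rescale the whole span linearly onto [start, start + duration]."""
    ordered = sorted(sent_times)
    if not ordered:
        return []
    first, last = ordered[0], ordered[-1]
    span = (last - first).total_seconds()
    schedule = []
    for sent in ordered:
        # Position of this email within the original span, from 0.0 to 1.0.
        fraction = (sent - first).total_seconds() / span if span else 0.0
        schedule.append(start + timedelta(seconds=fraction * duration.total_seconds()))
    return schedule


if __name__ == "__main__":
    # Assumed tier lengths in days (365-day years).
    tiers = {"5 days": 5, "30 days": 30, "1 year": 365,
             "7 years": 7 * 365, "14 years": 14 * 365, "28 years": 28 * 365}
    for name, days in tiers.items():
        print(f"{name}: about {emails_per_day(TOTAL_EMAILS, days):,.0f} emails per day")
```

Under these assumptions the 5-day tier averages 45,000 emails a day and the 30-day tier 7,500, while the 7-year tier drops to roughly 88 a day and the 28-year tier to roughly 22, which is much closer to the volume of an ordinary busy inbox.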

Beyond an examination of the banality and volume of email even from its earliest usage, the project brings into play a much deeper critical commentary on contemporary digital archives, perhaps especially unintentional ones. First, we can consider the political, social, and cultural architecture of an archive, the importance of archives, and the enduring legacy of this particular database. Finn Brunton notes, “the FERC had unintentionally produced a remarkable object: the public and private mailing activities of 158 people in the upper echelons of a major corporation, frozen in place like the ruins of Pompeii for future researchers.” As the first of its kind, it has been used to train spam filters, email recognition technologies like prioritization rules in your inbox, fraud detection, counterterrorism operations, and the analysis of workplace behavioral patterns. The hegemonic ordering of an archive that Zielinski writes about is very much alive and enduring. There is a good chance that at least something on your phone is running off software that used this archive as its training database.

It matters, then, that the users from whom this archive was generated were a particularly narrow group of people. This archive was used to build NLP algorithms because it was assumed to be representative of how people use email. But algorithms are only as good as the data provided, even or perhaps especially when they operate on a large scale. As we discussed last week, biased inputs can generate and embed biased outputs in both allocation and representation. What cause for concern does it give us that so much of the epistemic scaffolding of our current information management systems is built off the corporate (and at least somewhat corrupt) working elite of the 1990s and early 2000s? On the other hand, as artist Mimi Onuoha has pointed out, many of our current datasets are built off the personal data of those who have no choice, or limited choice, but to sign away their data, typically the structurally disadvantaged. This archive then offers a rare view into a group of users normally afforded more “privacy” than most people.

However, it is clear there was still a personal cost. The scrubbing of the archive did not clear out, for example, a named husband and wife emailing each other as their divorce proceeded. Employees in the 1990s may not have been aware that their emails would ever resurface, least of all for public perusal. It is likely that corporate practice has changed since then with increased awareness of the permanence of emails – the concept of “huddling” in corporate culture today is to take something offline, to communicate without leaving a digital trace. And even though we all know on an abstract level that email is not private, most of us today would still be deeply uncomfortable with our emails being publicly available in a searchable format with our names attached to them. While it is clear that this email database deeply embodies the archival impulse, it also speaks to the right to be forgotten.

We might ask, though, whether that right is realistic. We are now all contributing to digital archives many orders of magnitude larger than the 500,000 emails of the Enron database. Every email, click, like, hover over a link, and many other forms of our digital footprints are now collected by the biggest (and some not so big) corporate players in the world. What machines are ultimately being trained off datasets produced by our digital labors, and what implications does this have for both material and immaterial orders? Is anarchive even possible in this terrain? The artists suggest that by rendering your inbox into a timewarp between 1998 and the present day, you subvert your email provider’s algorithm’s ability to make accurate sense of your data. Their “service obfuscates your personal emails, and it breaks the machine learning’s algorithms for understanding you.” They add: the real benefit is that it also makes it impossible for you to use your email.

Though there is a strong case to be made for examining the material infrastructure required to enable email technologies, for most people, emails appear largely through immaterial means. And yet, clearly they too operate at an aesthetic level. The Good Life’s commitment to replicating the Enron employees’ experience is achieved through the Windows 95 interface. And while we might imagine email as standardized communication, the variation in content is analogous to Zielinski’s write-up of VALIE EXPORT’s work. Formally similar frames can bring to the forefront the heterogeneity of what is contained, in this case in emails. In teaching the machines how to “think” through human language, this archive shows a range of human communications. Granted, this range is limited both by being explicitly written content (which differs greatly from human speech, for example) and by the narrow collection of humans whose “labor” was used to generate it.

Mattern writes of a critical reviewer of an early article who was at pains to point out that highlighting the aesthetic experience might suggest that poetry is devoid of “intellectual or political engagement” and might fail to acknowledge that “poets even think rationally.” Given the current political debates about whose speech is considered “rational” and “unemotional,” I thought it was telling that artist Constant Dullaart and NYU data scientist Leon Yin created an experiment with Brain and Lavigne’s project – a predictive text generator based on the Enron corpus. When the generator was fed a “poem” (itself found in the Enron database), it emulated the speech patterns of the emails to create this rather poetic response:

 …I put my arms in front of me

The company, that Enron companies,

the service of the company

so the company

so the company seedness.

 

And went to pull her nearer

To the CIO,

The CCPM

Please no California

Thanks company

And the company

So the company.

 

And realized that my new best friend

Business conceding the company

so the company

so the company

so the companies seedness.

 

Was nothing but a mirror

Of the company

So the company.
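How Dullaart and Yin’s generator works is not described here, so the following is only an illustration of the general idea of predictive text, not their method. One of the simplest approaches is a word-level Markov chain, which picks each next word from the words that followed the current one in the training text. The toy corpus below is invented for the example and is not Enron data; all function names are hypothetical.

```python
import random
from collections import defaultdict


def build_model(text):
    """Map each word to the list of words that followed it in the text."""
    words = text.lower().split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def generate(model, seed_word, length, rng):
    """Walk the chain from seed_word, choosing each next word at random
    in proportion to how often it followed the current word."""
    word = seed_word
    output = [word]
    for _ in range(length - 1):
        followers = model.get(word)
        if not followers:  # the word was never followed by anything: stop
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Invented stand-in for a corporate email corpus (an assumption, not real data).
toy_corpus = (
    "please review the company policy so the company can "
    "thank the company for the service of the company"
)

if __name__ == "__main__":
    model = build_model(toy_corpus)
    print(generate(model, "the", 8, random.Random(0)))
```

In this toy corpus “the” is followed by “company” four times out of five, so a model like this keeps circling back to the same phrase. That is one plausible reading of why the response above returns again and again to “so the company.”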

A feature, not a bug?

Perec’s awkward foray into thinking through categorization of snow using the problematic term “Eskimos” is an own-goal. He highlights the power imbalances and reductionist processes inherent in any classification system even as he muses on the limitations of codification: which comes first, the thought or the classification?

But he writes that it is “so very tempting to want to distribute the world in terms of a single code.” Is he universalizing this tendency? Just as bias is a feature, not a bug, of artificial intelligence, can human intelligence, or even human language, function without some kind of categorization? Foucault writes of the unease we feel at the disorder of inappropriately linked ideas, a disorder that is not just incongruous but one “in which fragments of a large number of possible orders glitter separately in the dimension.” Do we all cling to systems of order to stave off the queasiness of unbounded uncertainty? And if so, is this what keeps systems of control in knowledge organization, and their unequal human consequences, in place? When we swap higher ideals of justice for (the pretense of) utopian order, what dystopias have we signed up for?

Where Drabinski highlights that critiques of classification and categorization have a long history and are not new conversations, Crawford situates why this matters so crucially in our current context. Critical theorists may have been pondering these questions for a while, but data practitioners, whose work has huge implications for us all, have embraced codification and run with it. The scale and scope are unprecedented. Machine learning algorithms function through and inevitably embed biases on matters of both allocation and representation, the latter having more insidious and long-ranging impacts. It matters at an epistemological level, then, that categories are increasingly applied in an attempt to naturalize and essentialize that which we consider social, relational, and cultural.

Crawford’s presentation highlights an example of bias in machine learning and how it makes its way into everyday systems. When Google is used to translate “he is a nurse, she is a doctor” into a language without gendered pronouns, such as Turkish, and the translation is then reversed, the AI automatically equates nurse with “she” and doctor with “he.”

The nature of the archive

The archive invokes oppositional questions. Is it about power, control of narrative, history, access, memory? All entail the question of “who” – whose power, whose control, whose absence from the record. Yet, isn’t it also “under siege” (Manoff, 13)? Should we protect it from fetishization or from quiet elimination? Derrida shows us that the apparent psychological drive to record also involves its own kind of production. Technological advances and corporate interests vie with the material demands of archive-keeping. The Peel Archive pages remind us of archivists’ work to render even the workings of the archive less obscure. We see the physical space required, the impossibility of digitizing everything, and that choices have to be made. Should those choices be automated or remain subject to human biases and error? Even those who wouldn’t describe themselves as post-modernists recognize the lack of certain objectivity in the historical record. No wonder the archive is a subject of contention.

The pervasiveness of data in the digital age obscures how easily it can be obliterated from the record; information obsolescence is as much a problem as information saturation. It is fitting that we are investigating this concept in an interdisciplinary class, since these questions invoke the meaning of the boundaries between disciplines.