4 Creation of an Application to Capture the Metadata of Big Data

Given the complicated and multifaceted ways marginalia can be read and understood, how their complex materialities require close observation and tactile engagement with the page, and how the definition of what ‘counts’ as marginalia can be so contested, an application designed to identify and find marginal annotations might seem a foolish endeavour. If we shift perspective for a moment, from the margins of our books to the margins of our planet, a similar kind of problem can be seen for archaeologists who study human settlements in space (admittedly, there is only one such settlement at present, the International Space Station). That is to say, the problem is one of identifying the interesting elements in a collection of materials where we cannot physically examine the ‘real’ materials. A problem at present for space archaeologists is that they study the margins of lived lives, the detritus of life carried from Earth beyond its stratosphere, yet NASA, like other space agencies, does not permit archaeologists to become astronauts.48 So, the archaeologists interested in how life is lived on the ISS can only work from photographs taken by those who were able to become astronauts. For the archaeologists to study these photographs, they needed a way of annotating the images so that larger patterns could be deduced. This context, alongside discussions of metadata and machine learning within cultural heritage institutions, is what framed the first technical component of my MRE.

My application for identifying annotations to create a training corpus is a further adaptation of an application I created for the International Space Station Archaeology Project (ISSAP) to support the needs of the archaeologists working on that project.49 The ISSAP version of the application is used to analyze photographs of the living quarters that ISSAP received from the International Space Station. The project sought to understand how astronauts use the space of the habitation modules by tracking the small items of daily use across the station as they appeared and disappeared in photographs.50 The archaeologists annotate photos with the eventual goal of creating an ‘automatic archaeologist’ that can analyze spatial patterns of material culture in the photos. The tool I created for ISSAP is a rewritten version of the more general-purpose Visual Geometry Group Image Annotator developed at the University of Oxford; the original tool was not suitable for the project because it could not be used collaboratively and had no structure for automatically recording metadata as annotations are generated.51 In general, image annotators are used in computer vision work to create datasets for training classification- or detection-based machine learning models to recognize items of interest. The researcher annotates the image, and the machine learning model learns to look for the features which have been marked as important. A side effect of this approach is that the metadata the researcher creates while producing these annotations becomes divorced from the original metadata of the images. In the study of book marginalia, this is equivalent to cutting the marginalia out of the pages and analyzing only those select segments without any thought for the documents the marginalia came from.

In this new version of the application, which I named RocketAnnotator as a gesture towards its galactic origins, I sought to expand the collaboration functionality as well as the ability to import existing metadata, both from the archive and directly from the images to be annotated. With these features, new metadata can be produced alongside the original object metadata common in archives, allowing the tool to become not only an image annotator but also a way to reference and track the creation of training data for machine learning projects. Additionally, all metadata is easily searchable both from within the application and through the structured output file, allowing researchers to easily reference specific data points and scholars to browse the data behind the output presented to them.

In 2016, book historian Ryan Cordell called for more robust methods of describing digital artifacts bibliographically within the context of using digitised archives. Research which makes use of these digital objects often fails to account for the sources, technologies, and social realities of the objects’ creation in ways that would make their affordances and limitations more readily visible and available for critique.52 Likewise, in conceptualizing digital archives as sources of data in their book Data Feminism, Catherine D’Ignazio and Lauren Klein continuously emphasize the necessity of further context at all stages of working with “data”, from acquisition to analysis, because the context in which data is situated is essential to the ultimate “framing and communication of results” formed through its use.53 How a digital object is catalogued within the archive becomes how that object is situated when it is used as data; thus, when bibliographic records offer only an incomplete account of the digitised object, the results of research using this data are also incomplete, as the researcher is not provided with all the information necessary to understand the object in its entirety. This line of thought signals that for any tool designed with the use of digital artifacts and collections in mind, it is vital to include ways that this lost data can be drawn out and made accessible to the user.

When archival materials are integrated into research using more traditional methods of historical inquiry, the subject being analyzed tends to be singular – focused on the work of one individual or the content of one collection. This in turn makes questions about the affordances and limitations of the sources more manageable to answer without extensive organization, since there is cohesion across sources. Comparatively, the large amount of data needed for machine learning methods often makes these questions difficult to answer at a microscopic level, because of both the diversity of data drawn from multiple sources and the archival metadata omitted during the process of data collection. The scale of the dataset produced leads such detailed information to be perceived as unnecessary, a mode of thought which carries over even into projects with humanistic foundations.

Yet the details which go into this first step of building a machine learning model are vital to understanding its influences and limitations; the foundations which machines learn with and from are human, meaning that they contain “human subjectivities, biases, and distortions” like all other works created by humans.54 In order to produce meaningful output using either analog or automated research methods, such as machine learning for identification, it is vital to interrogate the input that went into the making of the method being applied, asking about the social, cultural, historical, institutional, and material conditions under which that input was produced, as well as about the identities of the people who created it.55

4.1 Technical Overview

When using digitised cultural heritage collections as data, each step of the process holds the possibility of introducing unspoken assumptions or hidden transformations of the data. The digital historian must therefore write with both reproducibility and transparency in mind, so that the reader can verify and trust the results and conclusions. Even historians engaging in methods they would not consider digital make similar transformations as they convert historical information into their notes and writing, although these transformations are not nearly so apparent; as Ian Milligan stated in his piece documenting the research practices of historians in the archive during this period of technological shifts, “we are all digital historians now.”56 Thus, before delving into the details of its creation, it is important to briefly consider the technical foundation of the application in order to understand the very first considerations that shaped how the end product took form.

The code for this application was written using SvelteKit, a framework for building web applications with Svelte, a compiler that extends JavaScript with its own component syntax.57 When the application is compiled – the step in the web development process where the source code is converted into what is displayed on a web page – the code is turned into highly efficient vanilla JavaScript, resulting in faster performance than traditional frameworks that ship additional runtime layers. It is this performance advantage, as well as its ease of use, that led me to select SvelteKit for this task: as an image annotator, the application has users uploading and interacting with files, as well as populating files with data, which can be computationally heavy tasks, so a performant framework was a must.

To transform the SvelteKit web application into a desktop application, I used Electron, an open-source software framework that allows developers to build cross-platform desktop applications using web technologies such as HTML, CSS, and JavaScript. Electron powers many popular desktop applications such as Discord, Slack, Notion, and even the application in which I am writing right now, Visual Studio Code. It functions by running the web application’s code inside a bundled Chromium browser engine, essentially turning the application into a dedicated browser designed to do the singular task which the web application’s code instructs it to do. Due to its popularity, Electron has extensive documentation and a large community, making it a good choice for developers such as myself who have little experience creating desktop-based software. The most significant limitation Electron presents is that, by using web technologies to build desktop applications, it can have slightly higher memory consumption and a larger application size compared to “native” applications – software developed specifically for a particular operating system or platform, which can therefore be more heavily optimized because it is built with specific system parameters in mind. Alternatives to Electron which focus on reducing memory consumption and application size have emerged in recent years, the most notable being Tauri; however, these optimize the application by using whatever browser engine the operating system comes pre-installed with rather than installing a dedicated one. For example, if a user opens the application on an Apple device, it will run using Safari’s browser engine. Each browser engine has different development standards and ways of displaying information, so if the application relies on the engine of the user’s device, the developer ultimately has little control over how the application appears and functions outside of the operating system they develop on, unless they have access to multiple machines for testing, as well as the resources to tailor a version of the application to each browser engine.
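To make this arrangement concrete, the sketch below shows roughly how an Electron ‘main process’ wraps a compiled web application in its own window; the file names, window dimensions, and folder layout are illustrative assumptions, not RocketAnnotator’s actual configuration.

```javascript
// main.js – a minimal Electron entry point (illustrative, not the application's actual code).
const path = require('path');
const { app, BrowserWindow } = require('electron');

function createWindow() {
  // Each window is backed by the Chromium engine that Electron bundles.
  const win = new BrowserWindow({
    width: 1280,
    height: 800,
    webPreferences: {
      // A preload script is the usual bridge between the web code and Node/file-system APIs.
      preload: path.join(__dirname, 'preload.js')
    }
  });

  // Load the compiled SvelteKit output as a local file rather than a remote URL.
  win.loadFile(path.join(__dirname, 'build', 'index.html'));
}

app.whenReady().then(createWindow);
```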

I chose to make my application a desktop tool rather than publish it as a website to ensure both ease of collaboration and use. The application functions by creating and updating a project save file, which is simply a JSON file that contains all data surrounding the images selected for annotation, as well as any annotations drawn upon each image. JSON files, being a form of structured data, are both human readable and easy for a machine to manipulate in a consistent way. Additionally, they are small in size which makes them easily shareable. These factors combined have made the JSON file format popular which has resulted in many tools that can make them useable even outside of the application. When used in a project with collaborative annotation needs, the project save file can be placed in a shared code repository such as GitHub and versions can be managed using Git. This method also adds a level to transparency to a project using my application, as each step of annotating images and changes are being recorded through each push to the repository, assuming the repository is public. As a security measure, web browsers are not allowed access to a user’s file system; when uploading a file to a website, a temporary fake path to the selected file is generated, and this is either used to make a copy of the file which is then stored on the website’s server (for example, Google Drive), or temporarily stored then discarded once the web page is closed, the latter of these options being very resource intensive if uploading a large number of files. Electron applications allow access to the file system, since although it makes use of a browser engine, this engine is installed and run locally on the user’s device rather than being connected to the World Wide Web. When starting a new or existing project with my application, the user is first prompted to select the folder of images which they want to annotate. This establishes a path to where the images are on the user’s computer since this does not get saved in the project save file, as paths to where the images are located are unique to each device. If the user wants to open an existing project, they will also be prompted to select the project save JSON file. The path to this file will also be saved so that when the user manually saves their project, this same file will be updated. Access to the file system also allows for the application to autosave the project save file, so should anything go wrong, the user will lose at most ten minutes of work. In summary, Electron ensures that the functional, visual, and file-based user experience is universal when using this tool.
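As a rough illustration of the file-system access described above, the following sketch shows how an Electron main process might prompt for the image folder and autosave the project JSON every ten minutes. The helper names are hypothetical and the interval simply mirrors the description; the actual implementation in RocketAnnotator may differ.

```javascript
// Illustrative sketch of folder selection and autosave in an Electron main process.
const fs = require('fs/promises');
const { dialog } = require('electron');

// Ask the user for the folder of images to annotate; the returned path stays on
// this device and is never written into the project save file.
async function chooseImageFolder() {
  const result = await dialog.showOpenDialog({ properties: ['openDirectory'] });
  return result.canceled ? null : result.filePaths[0];
}

// Write the current project state (images plus their annotations) to the chosen
// save location as human-readable JSON.
async function saveProject(savePath, project) {
  await fs.writeFile(savePath, JSON.stringify(project, null, 2), 'utf-8');
}

// Autosave every ten minutes, so at most ten minutes of work can be lost.
function startAutosave(savePath, getProject) {
  return setInterval(() => saveProject(savePath, getProject()), 10 * 60 * 1000);
}
```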

4.2 Process

4.2.1 Interface and Tooling

As a whole, when designing the interface of my application, I thought through the design around the user experience/design concept of mapping and the principle of familiarity. The principle of familiarity is concerned with the ability of an interactive system to allow a user to map prior experiences, whether from the real world or gained from interaction with other systems, onto the features of a new system. By extension, mapping in this context means using familiar imagery to invoke the action or operation which an interactive element will perform. The layout of the app is therefore similar to that of other popular tools for image manipulation, such as those within the Adobe Suite, MS Paint, or Windows Photo Viewer.

To the left of the window is a simple tool bar, which allows the user to select the shape they want to use for annotating the image (a rectangle by default), perform basic manipulations like zooming in and out in a controlled manner, and “reset” the image to its original position. This tool bar hovers over the largest component of the application, the image viewer and annotation canvas. This viewer utilizes technology that those in the humanities are likely already familiar with, even if they may not be aware of it. The image viewer itself uses OpenSeadragon, a tool for viewing high-resolution zoomable images and the technology behind many of the image viewers used by digital archives. Aside from the tool bar’s buttons, OpenSeadragon also allows the user to manipulate the image using trackpad gestures as well as click-and-drag to move around the image. Annotorious, a JavaScript annotation library, works with OpenSeadragon to allow annotations to be drawn on images viewed in the OpenSeadragon window; this combination has been leveraged for cultural heritage purposes before, one notable example being the Arts and Humanities Research Council crowdsourcing platform MicroPasts, which allows the public to assist with large-scale archaeology, history, and heritage tasks.58 At the bottom right of the image viewer are arrows the user may use to switch from one image to the next; they may also do so using the left and right arrow keys.
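For readers curious about the wiring, the sketch below shows one common way to pair OpenSeadragon with the Annotorious OpenSeadragon plugin; exact import paths and configuration options vary between versions, the image URL is a placeholder, and this is an assumption about a typical setup rather than the application’s exact source.

```javascript
// Illustrative pairing of OpenSeadragon and Annotorious (APIs are version-dependent).
import OpenSeadragon from 'openseadragon';
import * as Annotorious from '@recogito/annotorious-openseadragon';
import '@recogito/annotorious-openseadragon/dist/annotorious.min.css';

// The deep-zoom viewer that handles panning, zooming, and trackpad gestures.
const viewer = OpenSeadragon({
  id: 'viewer',                                  // DOM element hosting the canvas
  tileSources: { type: 'image', url: 'placeholder-page.jpg' },
  showNavigationControl: false                   // the application supplies its own tool bar
});

// The annotation layer that lets shapes be drawn over the viewed image.
const anno = Annotorious(viewer, {});

// Each completed annotation can then be recorded in the project save file.
anno.on('createAnnotation', (annotation) => {
  console.log('annotation created:', annotation.id);
});
```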

Occupying the right side of the application window is the primary space for application and file management. At the top of this space, users can return to the home menu should they want to begin a new project, manually save their project, or select where they want the project save file to be saved. Below this is a file viewer, where users can add images they want to annotate from the image folder they selected when creating the project, remove them, and view a list of the image files they have added. The user can also jump to any of the images by clicking the relevant file name in the file list. Below the file list is a drop-down menu where the user may apply a filter indicating whether images have or have not been annotated, which works by highlighting the entries in the file list that match the filter criteria. The search bar functions in the same highlighting manner, except that it highlights the images whose metadata matches the search term. Following this, there is an “Export” menu in which users can choose to export their annotated images into a variety of popular data formats used for training object detection models. The last section included in this side bar is a quick guide to how the application functions, a reminder for users who have just begun using the application or are returning to a project after a period of time away from it.
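As an example of what the “Export” menu produces, the sketch below converts a project’s annotations into the COCO-style JSON structure widely used for object detection. The input field names are hypothetical, and only the core keys of the COCO specification are shown.

```javascript
// Rough sketch: map project annotations to a COCO-style structure
// (hypothetical input fields; only the core COCO keys are included).
function exportToCoco(images, annotations, categoryNames) {
  return {
    images: images.map((img, i) => ({
      id: i + 1,
      file_name: img.fileName,
      width: img.width,
      height: img.height
    })),
    annotations: annotations.map((a, i) => ({
      id: i + 1,
      image_id: a.imageIndex + 1,
      category_id: categoryNames.indexOf(a.category) + 1,
      bbox: [a.x, a.y, a.width, a.height],   // COCO bounding boxes are [x, y, width, height]
      area: a.width * a.height,
      iscrowd: 0
    })),
    categories: categoryNames.map((name, i) => ({ id: i + 1, name }))
  };
}
```

The resulting object can then be serialised with JSON.stringify and written to disk alongside the exported images.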

4.2.2 Annotation and Metadata

The annotation editor and metadata viewer which occupy most of RocketAnnotator’s bottom pane were created specifically to address these issues of decontextualized data in large-scale datasets. At the surface level, the editor is designed to look like a spreadsheet such as those found in Excel, so that it is intuitive for the user to understand what this section of the application is for and how it is used. In the first tab, “Annotations”, five descriptive columns are present by default: the annotation’s unique ID, the date and time the annotation was created, who created it, the broader category the annotation falls into, and what specifically the annotation is. A “+” symbol at the end of the column headers allows the user to extend this metadata by adding their own columns specific to the project. A row is added to this table each time an annotation is drawn on an image, and likewise, deleting an annotation on the image canvas deletes the corresponding row, maintaining a direct connection between the image and the metadata being generated. The second tab, “Metadata”, displays data associated with the image being annotated: this can be metadata from the digital archive the image was obtained from, if the archive tags its images with this information or the user does so in the process of collecting the images, as well as any additional EXIF data – the metadata embedded within digital images – which can be extracted.
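A hypothetical sketch of how a row in the “Annotations” tab might be generated when a shape is drawn follows; it covers the five default columns plus a user-added column, but the field names are illustrative rather than the application’s exact schema.

```javascript
// Illustrative creation of a default annotation row (field names are hypothetical).
function newAnnotationRow(creator, category, label, extraColumns = {}) {
  return {
    id: crypto.randomUUID(),            // the annotation's unique ID
    created: new Date().toISOString(),  // date and time the annotation was made
    creator,                            // who created the annotation
    category,                           // the broader category it falls into
    label,                              // what specifically the annotation is
    ...extraColumns                     // any project-specific columns added via "+"
  };
}

// Example: a user-defined "transcription" column added for a marginalia project.
const row = newAnnotationRow('annotator-01', 'marginalia', 'manicule', {
  transcription: 'nota bene'
});
```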

Cordell encourages us to think of items found within digital archives not simply as transparent surrogates for a corresponding physical object, but instead as a “new edition” in the full bibliographic sense of the word; while it “departs more and more from the form impressed upon it by its original author,” it nonetheless “exerts, through its imperfections as much as through its perfections, its own influence upon its surroundings.”59 When it comes to cultural heritage collections, the digitised item is often described in metadata as if it were the original item pictured rather than a new version; in museums, replicas of deteriorated artefacts are marked as such, yet digitised objects are often treated as if they were exact substitutes for the physical. As Adam Crymble demonstrates in his history of mass digitization, the digitisation of primary sources was to a great extent driven by the desire to democratize access to them for education and research purposes; in the beginning, digitised sources were explicitly intended to be surrogates for the original.60 By and large, digitised sources have been used as such, and so this form of metadata has been considered suitable for its audience. Yet in the age of big data and machine learning, the digital archive’s audience has shifted from solely human consumption to machine consumption as well. Archival metadata and what it entails must be expanded to fit this use. Metadata, the data which describes data, is what holds data accountable.

In the context of machine learning, what has been perceived as valuable is the data that will be used to train a model; any data surrounding that data is largely ignored or discarded once it has served its purpose in creating the training data. A model uses an algorithm to make sense of the data given to it and to produce some form of output, such as classifying images or generating a paragraph of text. For an algorithm to adequately “learn” to do something, it needs an extensive number of examples; the Common Objects in Context (COCO) dataset, a popular dataset for training object detection models, contains 1.5 million examples of objects in photos, each falling into one of 80 categories.61 The development of these massive datasets nearly always involves ingesting vast amounts of data from convenient or easily scraped Internet sources such as Twitter or Flickr, under the assumption that this will inherently result in diverse content; metadata is therefore thought to serve little purpose, since the data was neither created by a single person nor possible for one to review in its entirety.62 The datasets which do offer metadata associated with their items rarely offer it in an accessible manner, with metadata for the mass of content being stored in obscure file formats or in large multipart archives.63 This belief in the unimportance of metadata has left researchers without an understanding of the training data being used to train their models, which has led to multiple instances of machines learning to replicate the harmful views their data possess. A recent example at the time of writing is the Stable Diffusion text-to-image generation model, which was trained on billions of image-text pairs scraped from across the internet.64 Both casual users and formal investigations of the model have found that Stable Diffusion may unexpectedly generate inappropriate or disturbing images, as well as otherwise offensive content; for example, images generated from the prompt “Japanese body” yielded almost exclusively inappropriate material, with 90% showing explicit nudity.65 Closer attention paid to the metadata of the training data could have mitigated undesirable outcomes and identified patterns of discrimination before they were fed to the model and reproduced.

In recent years, there has been movement within the field of computer science towards critical analysis of how datasets are constructed, composed, and used. Primarily, these efforts have been directed toward standardizing the documentation of datasets through ‘datasheets’, overviews attached to datasets which communicate their contents in a way that prioritizes transparency and accountability.66 Within this conversation, there has been encouragement to draw upon the existing language and procedures for managing sociocultural data within libraries and archives. In Lessons from Archives, scholars Eun Seo Jo and Timnit Gebru argue that archives, as “a form of large-scale, collective human record-keeping”, can aid in addressing the questions of power imbalance, privacy, and other ethical concerns that datasheets leave unaddressed, through interventionist data collection strategies designed to address biases and ensure fair representation.67 They indicate a number of ways in which they believe practices emerging from archival studies would enhance the practice of machine learning. Firstly, archives begin with focused, institutional mission statements outlining a commitment to “collecting the cultural remains of certain concepts, topics, or demographic groups”, which guides their data collection process; they also employ curators who are responsible for weighing the risks and benefits of gathering different types of data in relation to an archive’s objectives and who have developed theoretical frameworks for appraising collected data.68 Jo and Gebru encourage the machine learning community to approach data collection and appraisal by at least starting with a statement of commitment, rather than starting from whatever datasets are available, to ensure equitable targets during the construction of datasets. This echoes D’Ignazio and Klein’s earlier conceptual call for data scientists to proceed with an awareness of context and an analysis of power in the collection environment, to determine whose interests are served by being counted in the dataset and who runs the risk of being harmed.69 Additionally, archives often have codes of conduct or ethics, a professional framework for enforcing them, and detailed standards for data description, which support ethical practices in data collection by helping to ensure transparency and accountability; such multi-faceted forms of review and record-keeping are largely unheard of in machine learning data collection.70 Lastly, the archival sciences have promoted collective efforts to address issues of representation, inclusivity, and power imbalance; for example, community-based activism has been used to ensure that various cultures are represented in the manner in which they would like to be seen.71 Machine learning researchers can draw from these efforts towards participatory archives to ensure diverse and inclusive datasets.

As Lessons from Archives highlights, the issues of historical power structures and how they may be dismantled have long been discussed within archival studies. Yet what Lessons from Archives does not discuss is the digital turn within the archive itself: how the archive, through embracing a digital form, has moved from being a collective form of human record-keeping to a collection of data to be made sense of and mined.72 When viewing collections themselves as data, archival data is seen as beneficial for the existing metadata associated with or describing each archival item; unlike a blog post, where metadata needs to be constructed by identifying and compiling available information from the web page, items in a collection have this descriptive information already curated and compiled. Yet a significant issue arises from the perception of digital archives as broadly complete in their current state. Despite institutional mission statements, codes of ethics, and community contributions, at the individual item level the metadata is still the same as that in catalogs which have long represented groups of people in problematic ways. What has changed is the new methodology being used to promote the use and reuse of these descriptions and collections. As librarian Sophie Ziegler writes in their article Open Data in Cultural Heritage Institutions: Can We Be Better Than Data Brokers?, “The collections as data framework in cultural institutions carries with it the possibility for our descriptions of people to be shared, combined with other data, and used to negatively affect groups.”73 When framing the archive as data, there is a risk that the archival holdings and their descriptions will look objective and natural, and that the work of archivists and others to show how archival collections are never neutral will be obscured; Devon Mordell accordingly encourages “active participation and critical discourse” around these tools and practices to ensure that new technologies do not reinscribe this false sense of neutrality.74 One simple yet significant action that works toward this goal is incorporating data process and provenance into the standardized documentation practices for collections. During his time as Humanities Data Curator at the University of California, Santa Barbara, Thomas Padilla emphasized the concept of legibility within metadata: to make collections as data usable, the processes behind their establishment must be transparent and documented. In the context of libraries, Padilla indicates that:

Libraries do not often provide access to the scripts that generate collection derivatives, access to processes for cleaning or subsetting data, access to custom schema that have been used, indications of how representative digital holdings are relative to overall holdings, nor is the quality of data typically indicated. Libraries do not typically expose why some collections have been made available and others have not. Libraries do not typically identify the library staff personally responsible for modifying, describing, and creating collections – a dimension of provenance that must be accessed in order to determine data ability to support a research claim.75

These same claims can be applied to archival items. Without this information, the user’s ability to comprehend and thus utilize a collection as data is hindered, or even made impossible, by the elusive gaps left in the collection – gaps which are then transferred to any project that makes use of it. The data is left vulnerable to misuse when not fortified through comprehensive metadata. The potential of collections as data hinges on integrity, validated through expanded documentation practice.

The annotation editor and metadata viewer within my application seek to address the digital archive in its present state. The level at which data provenance is addressed varies widely from institution to institution; thus, there are features built into the application which seek to close some of the gaps surrounding the digital origins of the objects being annotated. The metadata viewer shows how the image was contextualised within the archive it was extracted from, and, since many archives have not yet begun to include information about the entry as a digital object, the application also automatically extracts EXIF data to expand the predefined archival metadata. EXIF data can provide details on camera settings, including the make and model of the device, the date and time the image was captured, geographic information about where it was taken, photography settings such as white balance or flash usage, and the software used to process or edit the image – in essence, a potentially detailed history of how an image was captured and processed when this information might not otherwise be present. Being able to view both the archival and digital metadata in the process of annotation ultimately aids in circumventing the decontextualized access and consumption which occurs during the process of annotating data for computational research.76
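As an illustration of the kind of extraction involved, the sketch below reads a handful of provenance-related EXIF tags with the exifr JavaScript library, named here as one example of such a library rather than necessarily the one the application uses; the tag selection and function name are assumptions.

```javascript
// Illustrative EXIF extraction with the exifr library; tag availability varies by image.
import exifr from 'exifr';

async function readImageProvenance(filePath) {
  // Request only the tags relevant to provenance: device, capture time,
  // location, exposure settings, and the software used to process the image.
  const tags = await exifr.parse(filePath, [
    'Make', 'Model', 'DateTimeOriginal',
    'GPSLatitude', 'GPSLongitude',
    'Flash', 'WhiteBalance', 'Software'
  ]);
  // Images stripped of their EXIF data yield no result.
  return tags ?? {};
}
```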

In light of the discussion of digitised archival objects as new editions in the lineage of an item, the annotation editor expands on this mode of thought and encourages the annotator to view their annotations in the same way. Each annotation visually segments a portion of the image from its surroundings, marking it as something significant enough to be highlighted, and thus something that should be documented in a way similar to other digital objects. One way humanists can distinguish themselves in the process of creating datasets destined for machine learning is through the addition of explanations about the decisions made while creating the data.77 While the ability to add their own columns in this tab encourages the user to create structured metadata for their annotations, even without additional columns the user must still record who created the annotation and basic descriptive information about its contents, capturing key metadata that holds the creator accountable for each annotation they produce. Treating annotations as new digital objects both enhances familiarity with the training data and constructs a more robust log of it, with items that are easier to find should an issue arise during or after the training of a model.78


  1. Rachael Blodgett, “Frequently Asked Questions,” NASA, January 2018, http://www.nasa.gov/feature/frequently-asked-questions-0.↩︎

  2. ChantalMB, “ChantalMB/Issap-Image-Annotator,” February 2022, https://github.com/ChantalMB/issap-image-annotator.↩︎

  3. Shawn Graham and Justin Walsh, “Recording Archaeological Data from Space,” International Space Station Archaeological Project, February 2022, https://issarchaeology.org/how-do-you-get-from-an-astronauts-photo-to-usable-archaeological-data/.↩︎

  4. Abhishek Dutta and Andrew Zisserman, “The VIA Annotation Software for Images, Audio and Video,” in Proceedings of the 27th ACM International Conference on Multimedia, MM ’19 (New York, NY, USA: Association for Computing Machinery, 2019), 2276–79, https://doi.org/10.1145/3343031.3350535.↩︎

  5. Ryan Cordell, “‘Q i-Jtb the Raven’: Taking Dirty OCR Seriously,” Book History 20, no. 1 (2017): 191, https://doi.org/10.1353/bh.2017.0006.↩︎

  6. Catherine D’Ignazio and Lauren F. Klein, Data Feminism (Cambridge, MA: The MIT Press, 2020), 164, https://doi.org/10.7551/mitpress/11805.001.0001.↩︎

  7. Benjamin Lee, “Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset,” Digital Humanities Quarterly 015, no. 4 (December 2021), http://www.digitalhumanities.org/dhq/vol/15/4/000578/000578.html.↩︎

  8. D’Ignazio and Klein, Data Feminism, 152.↩︎

  9. Ian Milligan, “We Are All Digital Now: Digital Photography and the Reshaping of Historical Practice,” The Canadian Historical Review 101, no. 4 (2020): 620, https://doi.org/10.3138/chr-2020-0023.↩︎

  10. “SvelteKit: Web Development, Streamlined,” accessed August 21, 2023, https://kit.svelte.dev/.↩︎

  11. The MicroPasts team, “Crowdfuelled and Crowdsourced Archaeological Data,” MicroPasts: Crowd Sourcing Platform, accessed August 21, 2023, https://crowdsourced.micropasts.org/.↩︎

  12. Cordell, “‘Q i-Jtb the Raven’,” with quote from W. W. Greg.↩︎

  13. Adam Crymble, Technology and the Historian: Transformations in the Digital Age (Champaign, IL: University of Illinois Press, 2021), 68, https://doi.org/10.5406/j.ctv1k03s73.↩︎

  14. Tsung-Yi Lin et al., “Microsoft COCO: Common Objects in Context” (arXiv, February 2015), https://doi.org/10.48550/arXiv.1405.0312.↩︎

  15. Bender et al., “On the Dangers of Stochastic Parrots,” 613.↩︎

  16. Andy Baio, “Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator,” Waxy.org, August 2022, https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/.↩︎

  17. Robin Rombach et al., “High-Resolution Image Synthesis with Latent Diffusion Models,” 2021, https://arxiv.org/abs/2112.10752.↩︎

  18. Patrick Schramowski et al., “Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models,” in Proceedings of the 22nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Vancouver, BC: arXiv, 2023), 3, https://doi.org/10.48550/arXiv.2211.05105.↩︎

  19. Timnit Gebru et al., “Datasheets for Datasets,” December 2021, 10, https://doi.org/10.48550/arXiv.1803.09010.↩︎

  20. Eun Seo Jo and Timnit Gebru, “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning,” in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20 (New York, NY, USA: Association for Computing Machinery, 2020), 2, https://doi.org/10.1145/3351095.3372829.↩︎

  21. Jo and Gebru, “Lessons from Archives,” 5.↩︎

  22. D’Ignazio and Klein, Data Feminism, 111.↩︎

  23. Jo and Gebru, “Lessons from Archives,” 7.↩︎

  24. Jo and Gebru, “Lessons from Archives,” 5.↩︎

  25. Michael Moss, David Thomas, and Tim Gollins, “The Reconfiguration of the Archive as Data to Be Mined,” Archivaria, November 2018, 131, https://archivaria.ca/index.php/archivaria/article/view/13646.↩︎

  26. S. L. Ziegler, “Open Data in Cultural Heritage Institutions: Can We Be Better Than Data Brokers?” Digital Humanities Quarterly 014, no. 2 (June 2020), http://www.digitalhumanities.org/dhq/vol/14/2/000462/000462.html.↩︎

  27. Ziegler, “Open Data in Cultural Heritage Institutions”; Devon Mordell, “Critical Questions for Archives as (Big) Data,” Archivaria 87 (2019): 156, https://proxy.library.carleton.ca/login?qurl=https%3A%2F%2Fwww.proquest.com%2Fscholarly-journals%2Fcritical-questions-archives-as-big-data%2Fdocview%2F2518871266%2Fse-2%3Faccountid%3D9894.↩︎

  28. Thomas Padilla, “On a Collections as Data Imperative,” 2017, 3, https://escholarship.org/uc/item/9881c8sv.↩︎

  29. Milligan, “We Are All Digital Now,” 617.↩︎

  30. Ziegler, “Open Data in Cultural Heritage Institutions.”↩︎

  31. For discussion on limitations of the application at present, see the application’s GitHub repository: https://github.com/ChantalMB/MRE-RocketAnno↩︎