2 Introduction

While studying texts of the past, it is not unusual to stumble upon evidence that a document had a ‘life’ before coming to reside in the archive where it is now found: numerous notes at the bottom of pages throughout the text made by the original purchaser, corrections scrawled in rough handwriting by a child using the document as reading practice, or even an initial on the title page left from the document’s first foray into the archive. Indeed, these markings, known formally as marginalia, not only served a purpose for those who created them; they also serve historians by situating a text within its history and allowing a glimpse into the public and private lives of the annotator.

Despite the insights which marginalia can offer historians of reading, the book, and beyond, the study of marginalia proves challenging because of its inherent nature as an element residing in the margins, often scorned or overlooked during the archival process. Most studies of marginalia focus on tracing select annotators or on small collections, at least partially due to the difficulty of finding marginalia across larger collections when these annotations are neither abundant nor conspicuous. It is this issue of discoverability which my research project addresses in the form of a case study, demonstrating the use of an application I built to identify marginalia, contextualized within the ongoing conversations surrounding the reconfiguration of digitised cultural heritage collections as data.

Over the last thirty years, historical study has been revolutionized by the rapid emergence of digitised resources which have become widely available to the public, yet techniques which take advantage of the unique digital affordances of such representations are still being developed. A key component of these digitised materials is the metadata which situates them: it describes the object both as a unique digital entity and as the original object it represents. Metadata within digital archives has been used by scholars such as Ryan Cordell to demonstrate the political and social contexts that inform such corpora of materials. Yet other forms of metadata, generated when scholars use these digitised resources and particularly when the resources are adapted for a data-forward project, remain unaccounted for.
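
As a concrete, and entirely hypothetical, illustration of this dual description, a minimal record for one digitised page might look like the following sketch, with one set of fields for the digital surrogate and another for the physical object. The field names loosely echo conventions such as Dublin Core rather than reproducing any particular institution’s schema.

```python
# A hypothetical metadata record describing a digitised document twice over:
# once as a digital file, once as the original physical object it represents.
# All field names and values are illustrative.
record = {
    "digital": {
        "identifier": "chapbook-0001-page-07",  # hypothetical repository identifier
        "format": "image/jpeg",
        "capture_date": "2019-06-12",           # when the digital surrogate was made
    },
    "original": {
        "title": "A Garland of New Songs",      # hypothetical chapbook title
        "date_printed": "c. 1800",
        "place_printed": "Falkirk",             # a provincial imprint, as discussed below
        "extent": "8 pages",
    },
}
```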

My MRE project develops an approach to capture this missing metadata. I build, and critically situate, an image annotation application for identifying notable material features in digitised documents, with a focus on marginalia composed by readers of these documents. The tool functions both manually and automatically, and at scale, whether applied to one document or many. Drawing on my experience publishing in The Programming Historian, my MRE designs, tests, and describes the tool in such a way that other scholars can immediately deploy it for their own research.1 Tools used in research are theory-laden in that there are always choices to be made; my MRE situates these choices so that scholars who use the tool will understand the consequences for their own research and can make this step of the process more transparent to those consuming its output.

Machine learning is a branch of artificial intelligence concerned with creating mathematical algorithms or ‘formulas’ that can be said to ‘learn’ through exposure to data, and thus improve automatically the more data they see.2 This process results in the creation of a model, an abstraction of the patterns in the information the machine has seen, which is then able to make predictions or decisions about new data it has not previously encountered. The discovery of marginalia across expansive collections is an exemplary case for an object detection model; as the name implies, this is a type of machine learning model designed to identify objects in an image. Yet the use of machine learning has ethical implications both broadly and, in particular, when applied to cultural heritage collections; questions surrounding who gets to decide which information is used for training, whose culture becomes the standard, whose voices are left out, and who profits from the work all come to the forefront when using collections as data. Machine learning is also computationally intensive, and its costs must always be considered. The environmental cost of powering the technology machine learning requires, and the financial cost of accessing that technology, which erects barriers to entry and limits who can contribute to this research area, are two significant concerns across all disciplines at present, amid the rush towards ever vaster models.3
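
To make the inference step concrete, the sketch below runs a standard object detection architecture from the torchvision library over a single page image. The file name page.jpg is a placeholder, and the generic pretrained weights stand in for a model fine-tuned on annotated pages; this is a minimal sketch of how such a model is queried, not the detector developed in this project.

```python
# Minimal object detection inference sketch (assumes torchvision and Pillow).
# In practice the weights would come from fine-tuning on annotated page
# images rather than the generic pretrained checkpoint used here.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # inference mode: the model predicts rather than learns

image = Image.open("page.jpg").convert("RGB")  # placeholder file name
with torch.no_grad():
    # For each image the model returns bounding boxes, class labels,
    # and a confidence score for every detected object.
    prediction = model([to_tensor(image)])[0]

for box, score in zip(prediction["boxes"], prediction["scores"]):
    if score.item() > 0.5:  # keep only reasonably confident detections
        print(f"candidate region {box.tolist()} (confidence {score.item():.2f})")
```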

The process of teaching machine learning algorithms, formally known as training, requires such massive amounts of data that the accompanying metadata is often overlooked or omitted, given the challenge of managing and understanding it at that scale. But to truly understand the influences and limitations of a machine learning model, it is crucial to know the data it is built on and to record these details as part of its initial construction. For meaningful results, it is essential to examine the input that shaped the model: who created it, what it meant to them, where it resides in both a temporal and a tangible sense, and its material context. This is of particular importance when using cultural heritage collections for machine learning, as models trained on these collections risk replicating the epistemologies, injustices, and anxieties exemplified by previous institutional orders and hierarchies of power.4
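
One practical way to keep this context from being stripped away is to let every training example carry its provenance alongside the image and labels from which the machine actually learns. The record below is a hypothetical illustration of such a structure, not a fixed schema; every field name and value is illustrative.

```python
# A hypothetical training example that retains provenance metadata alongside
# the image and bounding boxes. Field names and values are illustrative.
annotation_record = {
    "image": "chapbook-0001-page-07.jpg",   # hypothetical image file
    "boxes": [[112, 640, 398, 702]],        # pixel coordinates [x1, y1, x2, y2]
    "labels": ["marginalia"],
    "provenance": {
        "source_collection": "National Library of Scotland chapbooks",
        "rights": "see collection licence",  # placeholder, not the actual licence
        "annotator": "researcher-01",        # who drew the boxes
        "annotation_date": "2024-03-15",     # illustrative date
    },
}
```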

My MRE addresses this discourse by using my image annotation application and the surrounding workflow to prepare a selection of digitised archival texts featuring handwritten marginalia from the early modern period for use as training data for an object detection model. Image annotators in general serve as tools for generating training datasets for image-based machine learning techniques: their primary function is to let researchers annotate images, creating examples from which the machine learns what features of an image are considered important. Both this image annotator and its workflow can be adapted to corpora of materials beyond marginalia. This essay contextualizes the necessity for such a tool and the scholarly issues around its design and construction by applying the model trained with the tool to the collection of chapbooks provided by the National Library of Scotland (NLS).5
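
To give a sense of the output such an annotator produces, the sketch below converts simple records like the one above into the widely used COCO detection format, which most object detection frameworks can consume directly. The helper is an illustrative assumption about the workflow, not the application’s actual export code.

```python
# Convert simple annotation records into the COCO detection format.
# A sketch under the assumed record layout shown earlier.
import json

def to_coco(records):
    images, annotations = [], []
    ann_id = 1
    for img_id, rec in enumerate(records, start=1):
        images.append({"id": img_id, "file_name": rec["image"]})
        for x1, y1, x2, y2 in rec["boxes"]:
            annotations.append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": 1,                    # a single 'marginalia' class
                "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO uses [x, y, width, height]
                "area": (x2 - x1) * (y2 - y1),
                "iscrowd": 0,
            })
            ann_id += 1
    return {
        "images": images,
        "annotations": annotations,
        "categories": [{"id": 1, "name": "marginalia"}],
    }

# e.g. json.dump(to_coco([annotation_record]), open("train.json", "w"))
```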

These pocket-sized pieces of reading material were printed on a single sheet and then folded into booklets of 8, 12, 16, or 24 pages, produced continuously in this manner from the 17th to the 19th century.6 The subject matter of chapbooks was diverse: sermons of covenanting ministers, prophecies, the last words of murderers, and biographies of famous figures of the time such as Wallace, Napoleon, and Nelson, interspersed with works of humour, fairy tales, and poetry, not to mention manuals of instruction and almanacs. It has been estimated that around two thirds of chapbooks contain songs and poems, often published under the title of garlands.

Chapbook printers frequently used worn and broken type purchased second-hand, which naturally produced rough and unrefined prints; likewise, the woodcuts used to decorate chapbooks were recycled and reused in print, often bearing no relation at all to the text in which they appeared. Chapbooks were sold on streets and at fairs for a penny a time by pedlars dubbed ‘chapmen’, a term related to the word ‘cheap’ but likely also to the Anglo-Saxon ‘ceapian’, meaning to barter, buy, and sell. Individuals could also buy them directly from printing shops, and one of the features of chapbooks was the proliferation of provincial imprints, with places such as Fintray, Falkirk, and Inveraray being common homes to cheap print shops. Chapmen, supported by running stationers, made chapbooks, alongside broadsides, the most popular reading material for the masses during the latter half of the early modern period. Chapbooks gradually disappeared in the mid 19th century, owing both to the rapidly increasing amount of cheap printed content available and to the rise of Victorian morality, which considered many chapbook publications crude and profane. As a widely available and affordable source of both entertainment and paper, chapbooks offer great potential as sites of early modern marginalia.

In creating a model specifically designed to extract marginalia, I will not only demonstrate how an image annotator such as the one I created can be used to consolidate metadata and training data, but also show how this form of image-centered machine learning can facilitate the large-scale study of the habits and observations of readers who wrote in their books during the early modern period. The MRE will conclude with a reflection on notable marginalia found within the NLS chapbook collection that can serve as a basis for future study, as well as on the historiographical impact that using this tool might have within book history and more generally, as the study of history becomes increasingly digital and joins the conversations about transparency and accessible data within the humanities.