6 Conclusions
In this work, I have identified key issues that digitised cultural heritage collections, and by extension the institutions that hold them, face during this time of rapid technological development, specifically in the realm of machine learning. I address these issues through the development of an application that situates the data on which machine learning models are built in both their archival context and their new context as metadata. I further demonstrate the potential positive impacts machine learning approaches might have for the study of history when used consciously, with active effort made towards ethical usage.
The study of marginalia has evolved over time, with scholars exploring different aspects of this practice. Early works focused on using marginalia to reconstruct the lives and opinions of earlier scholars and expand on their published works. More recent studies have shifted towards analyzing the materiality of marginalia, examining how readers interacted with and used literature. Scholars have classified marginalia into different categories, including editing, interaction, and avoidance, and have explored the social, cultural, and intellectual contexts in which reading took place.
The presence of marginalia in physical and digital archives presents a challenge for researchers. Physical archives associated with academic and archival institutions have historically prioritized the preservation of pristine copies of books, which often led to the removal of marginalia unless it was composed by someone considered significant. Digitised archives, on the other hand, lack comprehensive metadata and search functionalities for marginalia. This lack of intentional curation has limited the discovery and exploration of marginalia. However, recent advances in computer vision and machine learning, particularly object detection, offer new opportunities to identify and analyze marginalia within digital archives.
By examining object traces and using machine learning methods, researchers can expand their understanding of marginalia beyond the marginal doodles of a single reader. The approach I have developed allows us to explore many manuscripts at once and to look at questions of genre and categorization macroscopically. This in turn allows researchers to gain a deeper understanding of the historical, cultural, and social practices of early modern readers, as well as the ways in which texts were consumed, understood, and used in society. The automated detection of marginalia enables large and diverse corpora to be evaluated for the different forms and amounts of marginalia much faster than locating it manually, giving researchers a more efficient way to sift through works and focus on analysis. Overall, integrating machine learning into the study of marginalia presents exciting possibilities for advancing scholarship in this field.
It is clear from discussions surrounding digitised archives and machine learning methods that there is a need for more robust and comprehensive metadata. The current common practice of treating the content of digitised cultural heritage collections as transparent surrogates for physical objects overlooks the affordances and limitations of these digital artifacts. The use of machine learning in research further emphasizes the need for contextual information throughout the data lifecycle, from acquisition to analysis. The integration of archival materials into research also presents challenges in terms of handling diverse and abundant data, as well as the omission of archival metadata during the data collection process.
Understanding the foundations of machine learning models is essential for meaningful output. The biases and subjectivities inherent in the input data must be interrogated and contextualized to avoid reproducing harmful views or perpetuating discriminatory outcomes. When this is taken into consideration, descriptive metadata becomes a crucial part of holding the training data accountable, alongside providing transparency and accountability in the research process. By extension, it is equally crucial to develop tools and technologies that promote transparency, reproducibility, and accountability, with a central focus on preserving the metadata created during the formation of training datasets. The application I built provides a means to address these gaps and challenges in current digitised archives. By providing comprehensive metadata integrations through the annotation editor and metadata viewer, it encourages researchers to understand and build their data through the viewing and creation of structured description, facilitating the meaningful annotation of data for computational research. Treating annotations as new digital objects and documenting their creation enhances familiarity with the training data and creates a more robust log of the data, ensuring transparency and accountability throughout the research process.
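To make the idea of treating an annotation as a new digital object more concrete, the following is a minimal sketch and not the data model of the application itself. All class and field names (`SourceObject`, `Annotation`, `revise`, and so on) are hypothetical and chosen only for illustration. The sketch keeps the archival context of the source page alongside the annotation and logs every revision.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Dict, List, Tuple
import json


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class SourceObject:
    """Archival context of the digitised page (hypothetical fields)."""
    institution: str
    collection: str
    item_id: str
    page: int


@dataclass
class Annotation:
    """A training annotation recorded as a digital object with provenance."""
    source: SourceObject
    label: str
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in pixels
    annotator: str
    created: str = field(default_factory=_now)
    history: List[Dict] = field(default_factory=list)

    def revise(self, annotator: str, **changes) -> None:
        """Apply changes to label or bbox and log who changed what, and when."""
        for key, new_value in changes.items():
            if key not in ("label", "bbox"):
                raise ValueError(f"cannot revise field: {key}")
            self.history.append({
                "field": key,
                "old": getattr(self, key),
                "new": new_value,
                "annotator": annotator,
                "timestamp": _now(),
            })
            setattr(self, key, new_value)

    def to_json(self) -> str:
        """Serialise the annotation together with its source context and log."""
        return json.dumps(asdict(self), indent=2)
```

Storing the source description inside each record is a deliberate simplification: it keeps every exported annotation self-describing, at the cost of repeating the archival metadata.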
Essentially, the integration of robust and comprehensive metadata within digitised archives and machine learning workflows is essential for understanding the context, limitations, and biases of digital artifacts. By drawing on lessons from archival studies and implementing tools that prioritize transparency and accountability, researchers can make the most of digitised collections and ensure ethical and inclusive practices in data collection and analysis. The described application serves as an example of how these principles can be applied in practice, facilitating the creation and annotation of training data while maintaining a clear record of the data’s origins and transformations.
The use of machine learning models for the detection of marginalia in digitised chapbooks has both advantages and limitations. The application of machine learning in this context allowed for the quick and efficient identification of marginalia, providing a wealth of insights into how readers interacted with chapbooks in the past. It also allowed for the identification of patterns and trends in the annotations through marks of ownership, engagement with the text, and practical uses of the chapbook’s surface as a place for quick calculations or writing practice. At the same time, there were still a significant number of false positives. These false positives included ink smudges, bleeding, and printing errors, which demonstrate the challenges of detecting marginalia in a dataset with complex and diverse page elements.
Nevertheless, the application of machine learning in the study of marginalia offers a new approach to understanding the materiality and readership of chapbooks. It provides a novel way to explore the annotations made by readers and the ways in which they engaged with the texts. This research demonstrates the viability and potential of both digitised collections as data and machine learning models in the study of cultural heritage collections.
As technology progresses, it is important to consider the ethical implications of using digitised collections as data for machine learning. Institutions holding these collections should actively participate in making their digitised collections open and accessible, while also implementing responsible data practices. If an institution is unable to distribute its data in multiple formats, there should be clear documentation of its format of choice and how it is used, and ideally information or links to technical resources that allow transformation into other formats in a straightforward and reproducible way. To harness the full potential of their content, cultural heritage institutions cannot rely only on researchers' ability to access their data through unmonitored and time-consuming means such as web scraping. Instead, they must invest in more suitable ways to share their data and in digital curation with a considerably broader scope of use, while also meeting their responsibilities regarding any ethical issues and inequities present in the content of their data. Further, when collections as data are applied to a project, guidelines and checklists such as the “Collections as ML Data” checklist proposed by Benjamin Lee can help researchers and practitioners navigate the challenges and ethical considerations involved in using cultural heritage collections as data.
Overall, the use of machine learning models for the detection of marginalia in digitised chapbooks has the potential to enrich our understanding of the past and provide valuable insights into the materiality of these cultural artifacts. By engaging with these collections as data, researchers can shed light on the ways in which readers interacted with these texts, creating a more nuanced understanding of their historical, social, and cultural significance.
6.1 Future Directions
Several promising avenues for further exploration have emerged from this project. The simplest would be to enhance the model’s capabilities by expanding the training data. Incorporating marginalia from noisier and more flawed pages, such as those encountered in the chapbooks collection, would be an effective first step. This expansion would encompass a broader spectrum of paper quality, thereby improving the model’s adaptability. Furthermore, including more instances of roughly printed text accompanied by marginalia could refine the model’s ability to differentiate between these elements accurately. The accidental detection of thumbprints in the model’s output gestures toward ways in which the training method I used can be repurposed to investigate other, more obscure facets of book history. Since the model has shown it is capable of finding trace marks even without intentional training, it would be interesting to train a model using the object traces found in early modern books which Smyth discusses. This would, however, be a much more challenging task, as Smyth struggled to compile even the few object traces he studied in his work. Gathering enough examples of object traces to build a dataset and effectively train a model would be difficult.
Extending the object detection approach to other types of historical documents represents a natural progression. For instance, the technique could be adapted to unearth marginalia in digital collections such as EEBO. The same methodology could also be carried over to the object marks explored in Adam Smyth’s “Object Traces in Early Modern Books.” Constructing a suitable training dataset in this area would be more challenging but equally rewarding.
When considering the results of this project in the context of digital cultural heritage collections themselves, there are multiple ways that machine learning models such as the one I produced can enhance these resources. First and foremost, there is immense potential for the enrichment of the current object metadata. Pages could be consistently flagged as containing marginalia, making it easier to search for and find within a cultural heritage institution’s collections. This benefits not only the researcher but also the institution, allowing it to discover new readers present in its holdings and, in turn, to better understand its collections. As Uppsala University’s project points out, once the marginalia is detected, metadata could be enhanced even further through the automated transcription of the marginalia. Further, if the marginalia found across collections is substantial enough and patterns are identified, detection could be coupled with an automated classifier, enabling the automated identification of marginalia patterns.
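The metadata enrichment described above can be sketched as a small post-processing step over a detector's output. This is an illustrative sketch only: the detection format (dictionaries with `page_id`, `label`, and `score`), the function name `flag_pages`, and the default threshold of 0.5 are all assumptions, not part of the model or application described in this work. The threshold is one simple way to limit the influence of low-confidence detections such as ink smudges.

```python
from collections import defaultdict


def flag_pages(page_ids, detections, min_score=0.5):
    """Summarise detections as page-level metadata flags.

    page_ids   -- every page in the item, so pages without detections
                  are still recorded as containing no marginalia
    detections -- dicts with hypothetical keys: page_id, label, score
    min_score  -- assumed confidence threshold below which a detection
                  is ignored
    """
    counts = defaultdict(int)
    for det in detections:
        if det["label"] == "marginalia" and det["score"] >= min_score:
            counts[det["page_id"]] += 1

    return {
        page_id: {
            "has_marginalia": counts[page_id] > 0,
            "marginalia_count": counts[page_id],
        }
        for page_id in page_ids
    }
```

The resulting flags could then be added to an institution's existing object metadata in whatever schema it already uses. Any threshold would need to be chosen against manually checked pages rather than taken from this sketch.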
6.2 Closing Thoughts
Historical study in our digital age has been marked by the transformative impact of digitised resources, which have opened up new avenues for research and exploration, expanding the horizons of historical inquiry. This work has delved into the challenges and opportunities presented by the intersection of digitised cultural heritage collections and machine learning. The development of an application aimed at contextualizing training data for machine learning has showcased the potential positive impact of leveraging machine learning for historical analysis when accompanied by conscious efforts toward ethical usage. The study of marginalia within digitised archives has evolved from a focus on individual readers to a broader exploration of social, cultural, and intellectual contexts. Machine learning methods, particularly object detection, have opened new avenues for understanding marginalia’s role in history.
The first part of this work detailed the evolution of marginalia study, emphasizing the challenges faced by both physical and digital archives in preserving and providing access to these annotations. Then, the importance of robust metadata in the age of machine learning was highlighted, underlining the need for transparency, accountability, and ethical considerations throughout the research process as emphasized in the creation of my application. Finally, the advantages and limitations of using machine learning models to detect marginalia, and historical data broadly, were laid out, emphasizing the significance of ethical data practices in digitised collections. The future directions explored in this final section suggest avenues for expanding the model’s capabilities, as well as investigating new facets of book history and digital archival practice as a result. In conclusion, this work has contributed to bridging the gap between digital cultural heritage collections and machine learning, emphasizing the potential for meaningful insights while advocating for responsible and ethical practices. By integrating technology, humanist thought, and historical analysis, this research advances our understanding of the past and offers a roadmap for further exploration in this dynamic and evolving field.