7 Appendix A: The Full “Collections as ML Data” Checklist

The cultural heritage collection as data

Here, a distinction is drawn between the cultural heritage collection being studied and the training dataset used by the machine learning model. For example, a project might use a pre-trained model to generate embeddings for a photo collection: the photo collection and the model's training data are two distinct datasets. This section considers the cultural heritage collection itself; the section “The machine learning model” considers the model's training data.
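
The sketch below illustrates this distinction in code. It is a minimal, hypothetical example: the sentence-transformers library and the CLIP checkpoint are one possible choice of pre-trained model, and the collection directory is an assumed location, not part of the checklist.

```python
# A minimal sketch: embedding a photo collection with a pre-trained model.
# The model checkpoint and directory path are illustrative assumptions.
from pathlib import Path

from PIL import Image
from sentence_transformers import SentenceTransformer

# The pre-trained model (the subject of "The machine learning model" below);
# its training data is documented separately from the collection itself.
model = SentenceTransformer("clip-ViT-B-32")

# The cultural heritage collection (the subject of this section).
collection_dir = Path("photo_collection")  # hypothetical location
image_paths = sorted(collection_dir.glob("*.jpg"))

# One embedding vector per photograph.
embeddings = model.encode([Image.open(p) for p in image_paths])
print(embeddings.shape)  # (number of photos, 512) for this checkpoint
```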

  1. Dataset composition:
    1. Who or what is depicted in the dataset? (Gebru et al., 2020)
    2. If the dataset depicts people, are any specific subgroups of people represented? Are any specific individuals personally identifiable? (Gebru et al., 2020)
    3. If the dataset depicts people, are any individuals still living? Does this project comply with privacy laws in countries where it will be shared?
    4. What is the dataset's medium (image, video, text, web archive, etc.)?
    5. How large is the dataset, both in cardinality and in disk storage? (See the sketch following this list.)
    6. What metadata is available for the dataset items? (Holland et al., 2018)
    7. Does copyright impact this dataset? If so, how? (Cordell, 2020; Gebru et al., 2020; Jakeway et al., 2020; Padilla, 2018)
    8. Does this dataset pertain to a difficult history? If so, what extra precautions are being taken?
  2. Collecting process and curation rationale (language borrowed from Bender and Friedman (2018)):
    1. Who curated the cultural heritage collection from which this dataset is derived?
    2. What organization or institution was the collection created for?
    3. What funding was utilized (if known)?
    4. What collection process was utilized? (Bender & Friedman, 2018)
    5. When was the collection assembled? (e.g., when were the photographs taken or ethnographies recorded?)
    6. What instruments were used to create the collection? (e.g., a recording device, camera, etc.)
    7. If people are included, did individuals consent at the time of collection?
    8. What were the decision-making processes behind the collection's curation? (Bender & Friedman, 2018)
    9. What is unknown about the collection process and curation rationale?
  3. Digitization pipeline (only applicable if the dataset is a digitized version of a physical collection):
    1. Who selected what was digitized?
    2. What organization or institution oversaw the digitization?
    3. What funding was utilized?
    4. What criteria were utilized for determining what was digitized? (Cordell, 2020)
    5. What were the steps in the digitization pipeline? (For example, in the case of photos, what scanners were used to digitize the documents? In the case of documents, what OCR engines were utilized?)
    6. What metadata was algorithmically produced?
  4. Data provenance:
    1. What is the provenance of the dataset, from collection through digitization? (Bender & Friedman, 2018; Diakopoulos et al., n.d.; Holland et al., 2018)
    2. Is any part of the provenance unknown?
  5. Crowd labor:
    1. Have volunteers or crowd workers added metadata to the dataset? (Cordell, 2020; Jakeway et al., 2020; Padilla, 2018)
    2. If so, how were they recruited and compensated?
    3. If so, what metadata did they produce? (e.g., transcriptions, annotations)
  6. Additional modification:
    1. Were any additional steps taken after collection curation and digitization in order to produce the dataset in question? (e.g., were any items removed? was any additional metadata added?)
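
Question 1.5 above asks for the dataset's cardinality and disk storage; the following is a minimal sketch of how both might be reported, assuming (hypothetically) that the dataset is stored as files under a single directory.

```python
# A minimal sketch for question 1.5: report cardinality and disk storage.
# The directory path is a hypothetical stand-in for the real dataset.
from pathlib import Path

dataset_dir = Path("dataset")
files = [p for p in dataset_dir.rglob("*") if p.is_file()]

cardinality = len(files)                            # number of items
total_bytes = sum(p.stat().st_size for p in files)  # disk storage

print(f"{cardinality} items, {total_bytes / 1e9:.2f} GB on disk")
```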

The machine learning model

Note: if multiple machine learning models were used in the project, this section should be completed for each model.

  1. Overview:
    1. What model architecture has been utilized? (Mitchell et al., 2019)
    2. What is the task that the model is being deployed to perform?
    3. Who trained, finetuned, and/or deployed this model? (Mitchell et al., 2019)
    4. Across what organizations or institutions did this training, finetuning, and/or deployment take place? (Mitchell et al., 2019)
    5. What funding was utilized? (Gebru et al., 2020)
  2. Training/finetuning:
    1. Was the model trained from scratch?
    2. If so, what data was used to train the model? (Mitchell et al., 2019)
    3. If not, was a pre-trained model utilized? Where can more information on the pre-trained model be found? (Mitchell et al., 2019)
    4. Was the pre-trained model finetuned? If so, what data was utilized for finetuning?
    5. If training or finetuning was performed, what computational resources were utilized?
  3. Evaluation:
    1. How was the model's performance evaluated? (Mitchell et al., 2019)
    2. What data was used for evaluation? (M. Arnold, Bellamy, et al., 2019; Mitchell et al., 2019)
    3. If the model involves data pertaining to people, has the model been audited for fairness and bias using tools such as Fairlearn (see the first sketch following this list)? (M. Arnold, Bellamy, et al., 2019; Bird et al., 2020; Diakopoulos et al., n.d.; Jakeway et al., 2020; Madaio et al., 2020; Reisman et al., 2018)
    4. Have any tools been used to generate explanations for predictions (e.g., LIME (Ribeiro et al., 2016), SHAP (Lundberg & Lee, 2017), TCAV (Kim et al., 2018)) and modify the model in response? (M. Arnold, Bellamy, et al., 2019; Cordell, 2020; Diakopoulos et al., n.d.; Padilla, 2020; Ribeiro et al., 2020)
  4. Deployment:
    1. How was the model deployed? Was it used to make a single pass over the cultural heritage dataset in question, or will it be continuously deployed?
    2. What computational resources were utilized for deployment?
    3. Are the metadata generated by the machine learning model (embeddings, classifications, etc.) available as project deliverables?
  5. Release:
    1. Has the resulting model been made available for download? (If not, the remaining questions in this subsection can be skipped.)
    2. What license has been provided? (Mitchell et al., 2019)
    3. Who are the primary intended users, and what are the intended use cases? (Mitchell et al., 2019)
    4. Does this model have applicability outside of cultural heritage collections?
    5. What are ways that this model could be misused, either intentionally or unintentionally? (Madaio et al., 2020; Mitchell et al., 2019)
  6. Environmental impact:
    1. What were the carbon emissions produced by training, finetuning, and/or deploying this model? (See the second sketch following this list.) (Cordell, 2020; Lacoste et al., 2019; Strubell et al., 2019)
    2. How does the environmental impact of this model compare to that of other components of the project, such as a collection's digitization or stakeholders’ flights to relevant conferences?
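
Two minimal sketches follow for the more tool-oriented questions above. First, for question 3.3, a disaggregated fairness audit using Fairlearn's MetricFrame; the labels, predictions, and sensitive feature below are illustrative stand-ins for a real evaluation set.

```python
# A minimal sketch for question 3.3: disaggregate an evaluation metric by
# subgroup with Fairlearn. All data below is illustrative.
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]              # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 0]              # model predictions
groups = ["a", "a", "b", "b", "a", "b"]  # e.g., a demographic attribute

audit = MetricFrame(
    metrics=accuracy_score,
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=groups,
)
print(audit.overall)       # accuracy over the whole evaluation set
print(audit.by_group)      # accuracy disaggregated by subgroup
print(audit.difference())  # largest gap between subgroups
```

Second, for question 6.1, carbon emissions can be estimated with a tracking tool such as the codecarbon package (one option among several; Lacoste et al. (2019) also provide an online calculator). The training function here is a hypothetical placeholder.

```python
# A minimal sketch for question 6.1: estimate the emissions of a training,
# finetuning, or deployment run with codecarbon.
from codecarbon import EmissionsTracker

def train_model():
    ...  # placeholder for the actual training/finetuning loop

tracker = EmissionsTracker()
tracker.start()
train_model()
emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent
print(f"Estimated emissions: {emissions_kg:.3f} kg CO2e")
```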

Organizational considerations

  1. Stakeholders:
    1. What stakeholder groups are involved in this project? (Cordell, 2020)
    2. What is each project member's familiarity with machine learning? (Cordell, 2020; Jakeway et al., 2020)
    3. What is each project member's familiarity with cultural heritage collections as data?
    4. Has the project notified and sought input from all potentially relevant stakeholder groups, such as those included within the cultural heritage dataset itself? (Madaio et al., 2020; Reisman et al., 2018)
    5. Do groups affected by the project, such as individuals and communities directly represented within the cultural heritage dataset, have an avenue for contacting project staff and seeking recourse? If so, whom should they contact? If not, why not? (Diakopoulos et al., n.d.; Mitchell et al., 2019; Reisman et al., 2018)
  2. Use of machine learning:
    1. Was it necessary to use machine learning for this project?
    2. If so, why?
    3. If not, why was machine learning still utilized?
    4. What are potential critiques of applying machine learning in this context?
  3. Organizational context:
    1. Can this project be used to build data fluency within the organization or institution? (Padilla, 2020)
    2. Are there programs or pathways for staff affiliated with the project to develop machine learning skillsets? (Cordell, 2020; Padilla, 2020)
    3. Are there programs or pathways for staff affiliated with the project to develop fluency with cultural heritage collections?
  4. Project deployment and launch:
    1. Who is the target audience of this project? (Madaio et al., 2020)
    2. How does the target audience align with the audiences that the institution or organization is hoping to engage?
    3. If the target audience of the project is the public, does it make an attempt to educate the public regarding the machine learning approaches employed?
    4. Did the project launch reach the intended audience?*
    5. Has the project received feedback from stakeholders, including the audience? If so, what feedback has been received?*
    6. Has the launch of the project resulted in any changes to the project?*

(* = to be completed post-launch)

Copyright, transparency, documentation, maintenance, and privacy

  1. Copyright:
    1. Building on question 1.1.7, does copyright impact the dataset, model, code, or deliverables for the project? (Cordell, 2020; Gebru et al., 2020; Jakeway et al., 2020; Mitchell et al., 2019; Padilla, 2018)
    2. If they are made available, what licenses have been chosen?
    3. If they are proprietary, how does this impact re-use?
  2. Transparency and re-use:
    1. Can the project be audited by outsiders? If so, is there funding available to support outside audits? (Mitchell et al., 2019; Reisman et al., 2018)
    2. Is the code created for the project extensible for other cultural heritage researchers? (Padilla, 2020)
    3. If so, does the project provide any tutorials or toolkits for re-use?
  3. Documentation:
    1. Does the project have documentation? (Katell et al., 2019)
    2. If so, is the documentation interpretable by the project's audience?
    3. Is the project reproducible by an outside researcher, given the documentation available?
  4. Privacy:
    1. If the project is hosted online, are data on visitors collected? If so, what kinds of user data are collected? (Cordell, 2020)
    2. Is visitor consent gained before gathering online data? (Cordell, 2020)
  5. Maintenance:
    1. Will the project and code be maintained? (Gebru et al., 2020)
    2. If so, how frequently, and who will be responsible for maintaining it?
References

“About the Collections in Calisphere.” Calisphere. Accessed August 21, 2023. https://calisphere.org/overview/.
Acheson, Katherine, ed. Early Modern English Marginalia. New York: Routledge, 2019. https://doi.org/10.4324/9781315228815.
Adler, Noah, and Justin Hall. “Matt 28:19 - 28:20, Pg 141.” Manuscripts of Lichfield Cathedral. Accessed August 21, 2023. https://lichfield.ou.edu/file/14428.
“Archaeology of Reading,” September 2014. https://archaeologyofreading.org/.
Baio, Andy. “Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator.” Waxy.org, August 2022. https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. Virtual Event Canada: ACM, 2021. https://doi.org/10.1145/3442188.3445922.
Blodgett, Rachael. “Frequently Asked Questions.” NASA, January 2018. http://www.nasa.gov/feature/frequently-asked-questions-0.
Bonde Thylstrup, Nanna, Daniela Agostinho, Annie Ring, Catherine D’Ignazio, and Kristin Veel, eds. Uncertain Archives: Critical Keywords for Big Data, 2021. https://doi.org/10.7551/mitpress/12236.001.0001.
Brayman Hackel, Heidi. Reading Material in Early Modern England: Print, Gender, and Literacy. Cambridge, U.K.; New York: Cambridge University Press, 2005.
Brousseau, Chantal. “Interrogating a National Narrative with GPT-2.” Programming Historian, October 2022. https://doi.org/10.46430/phen0104.
ChantalMB. “ChantalMB/Issap-Image-Annotator,” February 2022. https://github.com/ChantalMB/issap-image-annotator.
Cordell, Ryan. “‘Q i-Jtb the Raven’: Taking Dirty OCR Seriously.” Book History 20, no. 1 (2017): 188–225. https://doi.org/10.1353/bh.2017.0006.
Crymble, Adam. Technology and the Historian: Transformations in the Digital Age. Champaign, IL: University of Illinois Press, 2021. https://doi.org/10.5406/j.ctv1k03s73.
D’Ignazio, Catherine, and Lauren F. Klein. Data Feminism. Cambridge, MA: The MIT Press, 2020. https://doi.org/10.7551/mitpress/11805.001.0001.
Derrida, Jacques. Of Grammatology. Translated by Gayatri Chakravorty Spivak. Corrected ed. Baltimore: Johns Hopkins University Press, 1997.
Dutta, Abhishek, and Andrew Zisserman. “The VIA Annotation Software for Images, Audio and Video.” In Proceedings of the 27th ACM International Conference on Multimedia, 2276–79. MM ’19. New York, NY, USA: Association for Computing Machinery, 2019. https://doi.org/10.1145/3343031.3350535.
“Early Modern Annotated Books.” University of California Los Angeles: William Andrews Clark Memorial Library, n.d. https://calisphere.org/collections/26771/.
“Energy Use in Sweden.” Sweden.se, November 2022. https://sweden.se/climate/sustainability/energy-use-in-sweden.
Fleming, Juliet. Cultural Graphology: Writing After Derrida. University of Chicago Press, 2016.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. “Datasheets for Datasets,” December 2021. https://doi.org/10.48550/arXiv.1803.09010.
“Getty API Documentation.” Getty. Accessed August 18, 2023. https://data.getty.edu/museum/collection/docs/#attribution.
Graham, Shawn, and Justin Walsh. “Recording Archaeological Data from Space.” International Space Station Archaeological Project, February 2022. https://issarchaeology.org/how-do-you-get-from-an-astronauts-photo-to-usable-archaeological-data/.
Grindley, Carl James. “Reading Piers Plowman C-Text Annotations: Notes Toward the Classification of Printed and Written Marginalia in Texts from the British Isles 1300-1641.” In The Medieval Professional Reader at Work: Evidence from Manuscripts of Chaucer, Langland, Kempe, and Gower, edited by Kathryn Kerby-Fulton and Maidie Hilmo, 73–141. Victoria, BC: English Literary Studies, 2001.
Harvey, Gabriel, and George Charles Moore Smith. Gabriel Harvey’s Marginalia. Stratford-upon-Avon: Shakespeare Head Press, 1913.
Jardine, Lisa, and Anthony Grafton. “‘Studied for Action’: How Gabriel Harvey Read His Livy.” Past & Present, no. 129 (1990): 30–78. https://www.jstor.org/stable/650933.
Jo, Eun Seo, and Timnit Gebru. “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 306–16. FAT* ’20. New York, NY, USA: Association for Computing Machinery, 2020. https://doi.org/10.1145/3351095.3372829.
Lacoste, Alexandre, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. “Quantifying the Carbon Emissions of Machine Learning.” arXiv Preprint arXiv:1910.09700, 2019.
“Laion-Aesthetic-6pls: Images 1582553.” LAION-Aesthetics V2 6+. Accessed August 21, 2023. http://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images/1582553.
“Lauriston Castle Collection.” Accessed August 16, 2023. https://digital.nls.uk/catalogues/special-and-named-printed-collections/?id=598.
Lee, Benjamin. “Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset.” Digital Humanities Quarterly 015, no. 4 (December 2021). http://www.digitalhumanities.org/dhq/vol/15/4/000578/000578.html.
Lee, Benjamin Charles Germain. “The Collections as ML Data Checklist for Machine Learning and Cultural Heritage.” Journal of the Association for Information Science and Technology (May 2023). https://doi.org/10.1002/asi.24765.
Lerer, Seth. “Devotion and Defacement: Reading Children’s Marginalia.” Representations 118, no. 1 (2012): 126–53. https://doi.org/10.1525/rep.2012.118.1.126.
Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. “Microsoft COCO: Common Objects in Context.” arXiv, February 2015. https://doi.org/10.48550/arXiv.1405.0312.
Manley, K. A. “Scottish Circulating and Subscription Libraries as Community Libraries.” Library History 19 (July 2013). https://doi.org/10.1179/lib.2003.19.3.185.
Milligan, Ian. “We Are All Digital Now: Digital Photography and the Reshaping of Historical Practice.” The Canadian Historical Review 101, no. 4 (2020): 602–21. https://doi.org/10.3138/chr-2020-0023.
Mordell, Devon. “Critical Questions for Archives as (Big) Data.” Archivaria 87 (2019): 140–61. https://proxy.library.carleton.ca/login?qurl=https%3A%2F%2Fwww.proquest.com%2Fscholarly-journals%2Fcritical-questions-archives-as-big-data%2Fdocview%2F2518871266%2Fse-2%3Faccountid%3D9894.
Moss, Michael, David Thomas, and Tim Gollins. “The Reconfiguration of the Archive as Data to Be Mined.” Archivaria, November 2018, 118–51. https://archivaria.ca/index.php/archivaria/article/view/13646.
Neudecker, Clemens. “Cultural Heritage as Data: Digital Curation and Artificial Intelligence in Libraries.” In Proceedings of the Third Conference on Digital Curation Technologies (Qurator 2022), Berlin, Germany, Sept. 19th-23rd, 2022, edited by Adrian Paschke, Georg Rehm, Clemens Neudecker, and Lydia Pintscher, Vol. 3234. CEUR Workshop Proceedings. CEUR-WS.org, 2022. https://ceur-ws.org/Vol-3234/paper2.pdf.
Orgel, Stephen. The Reader in the Book: A Study of Spaces and Traces. Oxford, UK: Oxford University Press, Incorporated, 2015. http://ebookcentral.proquest.com/lib/oculcarleton-ebooks/detail.action?docID=4310757.
Oxford English Dictionary. “Marginalia, n., Etymology.” Oxford University Press, 2023. https://doi.org/10.1093/OED/7050641376.
Padilla, Thomas. “On a Collections as Data Imperative,” 2017. https://escholarship.org/uc/item/9881c8sv.
Palmer, Philip. “Annotated Books at UCLA: Wider Applications of the AoR Schema.” Archaeology of Reading, September 2018. https://archaeologyofreading.org/annotated-books-at-ucla-wider-applications-of-the-aor-schema/.
“Results – Search Objects | eMuseum.” Peabody Museum of Archaeology & Ethnology. Accessed August 21, 2023. https://collections.peabody.harvard.edu/search/daguerreotype/objects.
Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. “High-Resolution Image Synthesis with Latent Diffusion Models,” 2021. https://arxiv.org/abs/2112.10752.
Schramowski, Patrick, Manuel Brack, Björn Deiseroth, and Kristian Kersting. “Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models.” In Proceedings of the 22nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC: arXiv, 2023. https://doi.org/10.48550/arXiv.2211.05105.
National Library of Scotland. “Chapbooks Printed in Scotland.” Accessed May 30, 2023. https://doi.org/10.34812/VB2S-9G58.
Scott-Warren, Jason. “Reading Graffiti in the Early Modern Book.” The Huntington Library Quarterly, 2010. https://www.proquest.com/docview/763492186/abstract/832093B897F1447APQ/1.
Sherman, William H. Used Books: Marking Readers in Renaissance England. Philadelphia, PA: University of Pennsylvania Press, 2008. https://www.jstor.org/stable/j.ctt3fhgzw.
“SvelteKit • Web Development, Streamlined.” Accessed August 21, 2023. https://kit.svelte.dev/.
The MicroPasts Team. “Crowdfuelled and Crowdsourced Archaeological Data.” MicroPasts: Crowd Sourcing Platform. Accessed August 21, 2023. https://crowdsourced.micropasts.org/.
Victoria and Albert Museum. “Victoria and Albert Museum Collections Data,” 2021. https://collections.vam.ac.uk/.
Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. “YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors.” arXiv Preprint arXiv:2207.02696, 2022. https://doi.org/10.48550/arXiv.2207.02696.
Weaver, Edmund. Wigmaker’s Account Book. London: A. Parker for the Company of Stationers, 1737. https://calisphere.org/item/ark:/21198/n14s4d/.
Whitaker, Elaine E. “A Collaboration of Readers: Categorization of the Annotations in Copies of Caxton’s Royal Book.” Text 7 (1994): 233–42. https://www.jstor.org/stable/30227702.
Yale, Elizabeth. “The History of Archives: The State of the Discipline.” Book History 18 (2015): 332–59. https://doi.org/10.1353/bh.2015.0007.
Ziegler, S. L. “Open Data in Cultural Heritage Institutions: Can We Be Better Than Data Brokers?” Digital Humanities Quarterly 014, no. 2 (June 2020). http://www.digitalhumanities.org/dhq/vol/14/2/000462/000462.html.