Digital Archives: Reading and Manipulating Large-Scale Catalogues, Curating and Creating Small-Scale Archives

Instructor: Yael Netzer
Duration: both weeks

Abstract

This two-week intensive workshop provides participants with the practical and critical skills necessary to navigate the lifecycle of digital knowledge representation. The first week focuses on “reading” and manipulating data, utilizing OpenRefine to clean, enrich, and reconcile messy datasets with Linked Open Data (LOD) sources like Wikidata and GeoNames. In the second week, the focus shifts to the creation of “Archives of the Present,” where participants move from theory to implementation. Using tools such as Omeka-S, students will conceptualize metadata structures and build small-scale digital archives while grappling with the ethical and technical challenges of preserving contemporary, ephemeral digital traces. By blending hands-on technical training—including web scraping, API integration, and machine learning-assisted metadata extraction—with archival theory, this workshop empowers researchers to transform raw data into structured, sustainable, and accessible digital resources. No prior technical knowledge is required.

Week 1 – Reading and Working with Data / Collections in OpenRefine

Digital data in various formats is at the heart of humanities research. Often, datasets are large, messy, or structured in unfamiliar ways. This week, students will learn to inspect, clean, and enrich digital catalogues using OpenRefine, as well as how to enhance datasets with Linked Open Data (LOD) from sources such as the Library of Congress, VIAF, and Wikidata. By the end of this week, students will be proficient in:

Understanding different file formats (CSV, TSV, spreadsheets, JSON, TEI XML)
Using regular expressions for data manipulation (with some practice and help from ChatGPT)
Writing expressions with GREL (OpenRefine’s scripting language)
Fetching and reconciling data via REST APIs (e.g., GeoNames, Wikidata)
Scraping and structuring data from the web
Mapping textual data to geographic locations
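To give a flavour of the regular-expression and date work listed above, here is a minimal Python sketch that normalizes a few common date spellings into an ISO-style form. The sample formats, the `normalize_date` helper, and the day-first reading of slashed dates are all assumptions made for illustration; they are not taken from the workshop materials, and the same idea can be expressed in GREL inside OpenRefine.

```python
import re

# Map three-letter month prefixes to month numbers.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# "3 March 1921" or "3 Mar. 1921"
DAY_MONTH_YEAR = re.compile(
    r"^\s*(\d{1,2})\s+([A-Za-z]{3})[A-Za-z]*\.?\s+(\d{4})\s*$"
)
# "12/07/1948", "12.07.1948" or "12-07-1948" (assumed to be day-first)
SLASHED = re.compile(r"^\s*(\d{1,2})[./-](\d{1,2})[./-](\d{4})\s*$")
# "1890", "ca. 1890", "circa 1890"
YEAR_ONLY = re.compile(r"^\s*(?:c\.|ca\.|circa)?\s*(\d{4})\s*$", re.IGNORECASE)


def normalize_date(raw):
    """Return 'YYYY-MM-DD' or 'YYYY', or None when no pattern matches.

    The function only reshapes the string; it does not check that the
    day and month form a valid calendar date.
    """
    match = DAY_MONTH_YEAR.match(raw)
    if match:
        day, month_name, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month:
            return f"{year}-{month:02d}-{int(day):02d}"
    match = SLASHED.match(raw)
    if match:
        day, month, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    match = YEAR_ONLY.match(raw)
    if match:
        return match.group(1)
    return None
```

Under these assumptions, `normalize_date("3 March 1921")` gives `"1921-03-03"`, while a value such as `"undated"` gives `None` and can be faceted out for manual review.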

Schedule:

Class 1: Introduction, loading a file, faceting, and exploring data
Class 2: Regular expressions and working with dates
Class 3: Clustering techniques for data cleaning
Class 4: Fetching external data using REST APIs (GeoNames example)
Hands-On Session: Practicing administrative tasks (changing the working directory, adjusting memory allocation)
Class 5: Reconciliation and enriching data with Wikidata
Class 6: Handling JSON and XML file formats
Class 7: Web scraping techniques and automation
Class 8: From text to map – Geospatial representations in OpenRefine
Class 9: Summary and discussion
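As an illustration of the clustering idea in Class 3, the sketch below groups spelling variants by a normalized key. It is a simplified approximation of the "fingerprint" key-collision approach that OpenRefine offers, written in plain Python; the function names and sample values are hypothetical, and the exact normalization steps in OpenRefine may differ.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key so that trivially different spellings collide."""
    # Trim surrounding whitespace and lowercase.
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks (ë -> e).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate and sort the tokens, then rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Return groups of distinct values that share the same fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [members for members in groups.values() if len(set(members)) > 1]
```

With this sketch, `"Brontë, Charlotte"` and `"charlotte bronte"` both reduce to the key `"bronte charlotte"` and land in the same cluster, which a person can then review and merge.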

Week 2 – Building a Digital Archive: Archives of the Present

This week focuses on creating and structuring small-scale digital archives. It also introduces the concept of archives of the present: a critical reflection on how contemporary events, data, and digital traces shape our archival practices. Participants will work with their own or provided collections, conceptualizing metadata structures and curatorial strategies. The workshop covers best practices in digital archive development, including metadata schema selection, linked data integration, and user-friendly design. The discussion of archives of the present will explore:

How digital documentation of real-time events (social media, news articles, live-streamed content) can be archived
The ethical challenges of archiving contemporary materials
Methods for ensuring accessibility and preservation of ephemeral data
The evolving nature of authority files and metadata in fast-changing digital environments

By the end of this week, students will be proficient in:

Theoretical foundations of archival studies
Metadata structuring and best practices
Using Omeka-S for archive implementation
Using Tropy for organizing and annotating images
Linking archives to external sources and ontologies
Designing and publishing an accessible, structured digital archive
Engaging with contemporary data collection and preservation strategies

Schedule:

Class 1: Theory of archives – an introduction
Class 2: Digital archives – examples and reviewing participant collections
Class 3: Modeling the domain
Class 4: Metadata – methods of description, challenges, and dilemmas
Class 5: Introduction to Omeka-S – setting up and structuring an archive
Class 6: Using Tropy – basic features and integration with Omeka
Hands-On Session: Working on participant collections
Class 7: Archives of the present – Capturing and preserving digital traces
Class 8: Linking and integrating with external resources and authority files
Class 9: Publishing – designing Omeka pages for public access
Class 10: Summary and reflections

To enrich the learning experience, the workshop aims to incorporate:

Case studies of successful digital archive projects
Collaborative group work, where teams handle different types of archival materials
An expanded toolset beyond OpenRefine and Omeka, including basic Python for data manipulation and SPARQL for querying LOD sources
Introduction to IIIF (International Image Interoperability Framework) for handling digital images in archives
Machine learning-assisted metadata extraction, including OCR (Transkribus), Google Vision API, and Named Entity Recognition (NER)
Sustainability and long-term digital archive maintenance strategies
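For a sense of what combining basic Python with SPARQL might look like, here is a small sketch that builds a label lookup query for the Wikidata Query Service and the corresponding request URL. The helper names, the query shape, and the `format=json` parameter are assumptions made for illustration, not part of the workshop materials. The code only constructs strings and sends nothing over the network.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_label_query(label, lang="en", limit=5):
    """Return a SPARQL query that finds items whose label equals `label`."""
    # Escape backslashes and double quotes so the label stays a valid literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item rdfs:label "{escaped}"@{lang} .\n'
        "  SERVICE wikibase:label { "
        f'bd:serviceParam wikibase:language "{lang}" . }}\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(query):
    """Return a GET URL for the query (assumes JSON results are wanted)."""
    return WIKIDATA_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

A call such as `build_request_url(build_label_query("Jerusalem"))` produces a URL that could be fetched with any HTTP client, or pasted into OpenRefine's "Add column by fetching URLs" workflow, subject to the service's usage policy.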

6-18 July 2026
