Instructors: Yael Netzer
Duration: both weeks


Abstract

This two-week intensive workshop provides participants with the practical and critical skills necessary to navigate the lifecycle of digital knowledge representation. The first week focuses on “reading” and manipulating data, utilizing OpenRefine to clean, enrich, and reconcile messy datasets with Linked Open Data (LOD) sources like Wikidata and GeoNames. In the second week, the focus shifts to the creation of “Archives of the Present,” where participants move from theory to implementation. Using tools such as Omeka-S, students will conceptualize metadata structures and build small-scale digital archives while grappling with the ethical and technical challenges of preserving contemporary, ephemeral digital traces. By blending hands-on technical training—including web scraping, API integration, and machine learning-assisted metadata extraction—with archival theory, this workshop empowers researchers to transform raw data into structured, sustainable, and accessible digital resources. No prior technical knowledge is required.

Week 1 – Reading and Working with Data / Collections in OpenRefine

Digital data in various formats is at the heart of humanities research. Often, datasets are large, messy, or structured in unfamiliar ways. This week, students will learn to inspect, clean, and enrich digital catalogues using OpenRefine, and to enhance datasets with Linked Open Data (LOD) from sources such as the Library of Congress, VIAF, and Wikidata. By the end of this week, students will be proficient in:

  • Understanding different file formats (CSV, TSV, spreadsheets, JSON, TEI XML)
  • Using regular expressions for data manipulation (with some practice and help from ChatGPT)
  • Writing expressions with GREL (OpenRefine’s scripting language)
  • Fetching and reconciling data via REST API (e.g., GeoNames, Wikidata)
  • Scraping and structuring data from the web
  • Mapping textual data to geographic locations
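
As a small illustration of the regular-expression skill listed above, the following is a minimal sketch in Python (not GREL) of pulling a year out of messy catalogue date strings. The sample strings, the pattern, and the function name extract_year are hypothetical and chosen only for this example; they are not taken from the workshop materials.

```python
import re

# Hypothetical pattern: a four-digit year from 1000 to 2099 that is not
# embedded in a longer run of digits (such as an inventory number).
YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def extract_year(raw):
    """Return the first plausible four-digit year in a string, or None."""
    match = YEAR_PATTERN.search(raw)
    return int(match.group(1)) if match else None


# Hypothetical catalogue values with typical inconsistencies.
samples = ["c. 1897", "[1923?]", "printed 12/05/2004", "n.d."]
years = [extract_year(s) for s in samples]

for raw, year in zip(samples, years):
    print(f"{raw!r} -> {year}")
```

A comparable pattern could be used inside an OpenRefine transformation; the lookarounds are there so that long numeric identifiers are not mistaken for years.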

Schedule:

  • Class 1: Introduction, loading a file, faceting, and exploring data
  • Class 2: Regular expressions and working with dates
  • Class 3: Clustering techniques for data cleaning
  • Class 4: Fetching external data using REST APIs (GeoNames example)
  • Hands-On Session: Practicing administrative tasks (changing working directory, memory allocation)
  • Class 5: Reconciliation and enriching data with Wikidata
  • Class 6: Handling JSON and XML file formats
  • Class 7: Web scraping techniques and automation
  • Class 8: From text to map – Geospatial representations in OpenRefine
  • Class 9: Summary and discussion
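
To give a sense of what the clustering class deals with, here is a minimal Python sketch of the key-collision ("fingerprint") idea: values that reduce to the same normalized key are grouped as likely variants of one name. It is a simplified approximation for illustration, not OpenRefine's actual implementation, and the sample names and the functions fingerprint and cluster are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: lowercase, no accents or punctuation, sorted unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop combining marks so that accented and unaccented spellings collide.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, keeping word characters and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint and keep groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical creator names as they might appear in a messy catalogue.
names = ["Zola, Émile", "Emile Zola", "emile  zola", "Victor Hugo"]
clusters = cluster(names)
print(clusters)
```

Sorting the tokens is what lets "Surname, Forename" and "Forename Surname" collide; the trade-off is that genuinely different names with the same tokens would also be grouped, so suggested clusters still need human review.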

Week 2 – Building a Digital Archive: Archives of the Present

This week focuses on the creation and structuring of small-scale digital archives and introduces the concept of archives of the present: a critical reflection on how contemporary events, data, and digital traces shape our archival practices. Participants will work with their own or provided collections, conceptualizing metadata structures and curatorial strategies. The workshop covers best practices in digital archive development, including metadata schema selection, linked data integration, and user-friendly design. The discussion of archives of the present will explore:

  • How digital documentation of real-time events (social media, news articles, live-streamed content) can be archived
  • The ethical challenges of archiving contemporary materials
  • Methods for ensuring accessibility and preservation of ephemeral data
  • The evolving nature of authority files and metadata in fast-changing digital environments

By the end of this week, students will be proficient in:

  • Theoretical foundations of archival studies
  • Metadata structuring and best practices
  • Using Omeka-S for archive implementation
  • Using Tropy for organizing and annotating images
  • Linking archives to external sources and ontologies
  • Designing and publishing an accessible, structured digital archive
  • Engaging with contemporary data collection and preservation strategies

Schedule:

  • Class 1: Theory of archives – an introduction
  • Class 2: Digital archives – examples and reviewing participant collections
  • Class 3: Modeling the domain
  • Class 4: Metadata – methods of description, challenges, and dilemmas
  • Class 5: Introduction to Omeka-S – setting up and structuring an archive
  • Class 6: Using Tropy – basic features and integration with Omeka
  • Hands-On Session: Working on participant collections
  • Class 7: Archives of the present – Capturing and preserving digital traces
  • Class 8: Linking and integrating with external resources and authority files
  • Class 9: Publishing – designing Omeka pages for public access
  • Class 10: Summary and reflections

To enrich the learning experience, the workshop aims to incorporate:

  • Case studies of successful digital archive projects
  • Collaborative group work, where teams handle different types of archival materials
  • Expanded toolset beyond OpenRefine and Omeka, including basic Python for data manipulation and SPARQL for querying LOD sources
  • Introduction to IIIF (International Image Interoperability Framework) for handling digital images in archives
  • Machine learning-assisted metadata extraction, including OCR (Transkribus), Google Vision API, and Named Entity Recognition (NER)
  • Sustainability and long-term digital archive maintenance strategies
