Reassembling digital archives—strategies for counter-archiving

T. Blanke

This research by Tobias Blanke explores the innovative methodologies shaping digital archives, merging traditional practices with modern techniques. By reassembling marginalized voices, it highlights how archives can be expanded and reinvented, offering fresh perspectives on overlooked content.

Introduction

The paper asks how digital archives both continue and discontinue traditional archival practices. It frames digital archives within broader debates on truth, memory, recording and power, noting the transformative impact of digitisation and computational mediation. While digital archives are often seen as democratising, they reproduce and can amplify existing biases through abundance and algorithmic selection. Drawing on Ann Laura Stoler, the paper argues for reading along the grain (attending to archival regularities) and against the grain (reassembling to surface marginalised perspectives). From data-science practice, the paper conceptualises reassembling as producing alternate data representations through pipelines and proposes a distinction inspired by Balibar between extensifying (inclusion of non-archives and otherwise discarded evidence) and intensifying (non-discriminatory reorganisation via new associations). It emphasises the algorithmic processuality of digital archives and proposes to invert or redirect these processes to reassemble archives in ways that inscribe lost voices and address power asymmetries.

Literature Review

The review situates the work within two archival turns. The first repositioned archives from neutral repositories to subjects of power analysis (Schwartz and Cook; Derrida on archive and political power; Foucault’s archives as systems of discursivity; Mbembe on the state and archives with potential traces of dissent). The second turn uses the archive as a methodological lens and develops counter-archiving (Ben-David on Facebook counter-archives; Stoler’s call to read along and against the grain and to create new archives). Latour’s notion of reassembling suggests deploying complex traceable associations to open and recombine archival materials. Hobsbawm’s grassroots history anticipates digital techniques to surface everyday lives from non-canonical sources, contrasting traditional serendipity with technologically enabled mining. The review also discusses professional notions of enduring value versus non-archival records (e.g., state guidance on destroying non-archival records) and argues that digitisation blurs boundaries between archives and non-archives. It frames extensification and intensification via Balibar’s extensive (inclusion) and intensive (non-discrimination) universality. Additional references cover digitisation as creating surrogates (Conway), archives as data things (Bowker), democratisation debates (Gauld; Taylor & Gibson), and data-science infrastructures and reproducibility (R, Python, Jupyter).

Methodology

The paper develops and demonstrates reassembling strategies across three types of digital archives via extensification and intensification.

Extensifying digital archives (access and inclusion):

Incidental legal document repository (UK Immigration and Asylum Upper Tribunal Decisions):
- Brute-force web scraping via URL-hacking to enumerate and download decisions lacking usable metadata; parsing documents into machine-readable text.
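
The enumeration step can be illustrated with a minimal sketch. The URL pattern (`base/year/number`), the function names, and the injected `fetch` callable are all hypothetical; the summary does not describe the tribunal site's actual URL scheme or the tooling used, so this only shows the general idea of guessing addresses and keeping whatever resolves.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


def candidate_urls(base: str, years: Iterable[int], max_id: int) -> Iterator[str]:
    """Yield guessed document URLs of the (hypothetical) form base/year/number."""
    for year in years:
        for number in range(1, max_id + 1):
            yield f"{base}/{year}/{number}"


def harvest(urls: Iterable[str],
            fetch: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Try every candidate URL and keep those for which fetch returns text.

    fetch is supplied by the caller (e.g. an HTTP client wrapper) and is
    expected to return None when a guessed URL does not exist.
    """
    found = {}
    for url in urls:
        text = fetch(url)
        if text is not None:
            found[url] = text
    return found
```

Passing the fetcher in as a parameter keeps the enumeration logic separate from network access, so rate limiting and error handling can live in the fetcher.
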
Public transparency repositories (EU TED/eTendering and TED Tenders):
- Construct virtual collections using expert search, URL parameters, regular expressions, and metadata filters (e.g., keywords such as “drones,” “border,” status, start date); bulk retrieval via RESTful services; PDF-to-text transformation for analysis.
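
A virtual collection of this kind can be sketched as a filter over already-retrieved notice records. The record fields (`title`, `status`, `start_date`) and the keyword pattern are assumptions made for illustration; they are not the TED schema, and the actual queries in the paper are not reproduced in this summary.

```python
import re
from datetime import date
from typing import Dict, List, Optional

# Illustrative keyword pattern based on the example keywords "drones" and "border".
KEYWORDS = re.compile(r"\b(?:drones?|borders?)\b", re.IGNORECASE)


def virtual_collection(notices: List[Dict],
                       status: Optional[str] = None,
                       start_from: Optional[date] = None) -> List[Dict]:
    """Select notices whose title matches the keywords and the metadata filters.

    Each notice is a dict with hypothetical keys 'title', 'status', 'start_date'.
    """
    selected = []
    for notice in notices:
        if not KEYWORDS.search(notice["title"]):
            continue
        if status is not None and notice["status"] != status:
            continue
        if start_from is not None and notice["start_date"] < start_from:
            continue
        selected.append(notice)
    return selected
```

The same filters could instead be pushed into search URL parameters, as the text describes; filtering locally is shown here only because it is easy to inspect.
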
Time-indexed Web archive (Internet Archive Wayback Machine for UK NGOs on surveillance):
- Link Ripper to collect timestamped URLs; content extraction of text, links, images.
- Redundancy reduction by removing pages ≥90% identical to mitigate reinforcement bias.
- Fair temporal sampling: keep at most 4 snapshots per page per month, or all of them where fewer are available, to balance coverage of popular vs. lesser-known NGOs.
- Human–machine workflow (workshop, July 2022) to select most relevant pages (home/news/campaigns/blog) per NGO aligned to research questions on surveillance and oversight.
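
The redundancy reduction and fair temporal sampling steps can be sketched as follows. Two things are assumptions here: the similarity measure (the summary gives the 90% threshold but not how similarity is computed, so `difflib.SequenceMatcher` stands in), and the rule for choosing which snapshots survive the cap (the earliest ones in each month). The `Snapshot` type is hypothetical; timestamps follow the Wayback `YYYYMMDDhhmmss` convention.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    url: str
    timestamp: str  # Wayback-style YYYYMMDDhhmmss
    text: str


def similarity(a: str, b: str) -> float:
    # autojunk=False avoids difflib's heuristic that ignores frequent characters
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def drop_near_duplicates(snapshots: List[Snapshot], threshold: float = 0.9) -> List[Snapshot]:
    """Drop a snapshot if it is >= threshold similar to an earlier kept one of the same URL."""
    kept_by_url = defaultdict(list)
    kept = []
    for snap in sorted(snapshots, key=lambda s: s.timestamp):
        earlier = kept_by_url[snap.url]
        if any(similarity(snap.text, e.text) >= threshold for e in earlier):
            continue
        earlier.append(snap)
        kept.append(snap)
    return kept


def fair_sample(snapshots: List[Snapshot], cap: int = 4) -> List[Snapshot]:
    """Keep at most cap snapshots per URL per calendar month (earliest first)."""
    counts = defaultdict(int)
    kept = []
    for snap in sorted(snapshots, key=lambda s: s.timestamp):
        key = (snap.url, snap.timestamp[:6])  # (page, YYYYMM)
        if counts[key] < cap:
            counts[key] += 1
            kept.append(snap)
    return kept
```

Pairwise comparison is quadratic in the number of snapshots per page, so a corpus-scale run would likely need hashing or shingling instead; that choice is outside what the summary reports.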

Intensifying digital archives (new associations and nondiscrimination):

Issue-based subdocuments and knowledge graph (Upper Tribunal Decisions):
- Issue filtering: extract sentences mentioning platforms (Facebook, Google, Telegram, Twitter, Viber, WhatsApp, YouTube) plus ±5 sentences; deduplicate overlaps, yielding ~1100 subdocuments.
- Named entity resolution: map variants to canonical entities via Wikipedia titles.
- Relation extraction: apply REBEL (a fine-tuned BART sequence-to-sequence model) to extract triples; build a knowledge graph constrained to social-media-related entities; visualise network to narrate asylum experiences through socio-technical relations.
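
The issue-filtering step can be sketched as a windowing operation over sentences. The regex sentence splitter is a naive stand-in (the summary does not say which splitter was used), and treating "deduplicate overlaps" as merging overlapping windows into one subdocument is an interpretation, not a confirmed detail.

```python
import re
from typing import List

PLATFORMS = re.compile(
    r"\b(?:Facebook|Google|Telegram|Twitter|Viber|WhatsApp|YouTube)\b"
)


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def issue_subdocuments(text: str, window: int = 5) -> List[str]:
    """Return one subdocument per platform mention plus +/- window sentences.

    Windows that overlap are merged so no sentence appears twice.
    """
    sentences = split_sentences(text)
    spans = []  # list of [start, end) index pairs, in document order
    for i, sentence in enumerate(sentences):
        if not PLATFORMS.search(sentence):
            continue
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        if spans and start < spans[-1][1]:
            spans[-1][1] = max(spans[-1][1], end)  # extend the previous span
        else:
            spans.append([start, end])
    return [" ".join(sentences[a:b]) for a, b in spans]
```
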
Syntactic/dependency parsing of parliamentary debates (Hansard speeches, 1979–2017):
- Data selection: retain speeches where “GCHQ” appears in the top 10% of term frequencies.
- Parse sentences to extract noun–verb pairings and direct-object (DOBJ) relations to reveal actor–action structures and historically specific topics; contrast frequent combinations with rare/salient DOBJs across time (e.g., 1982 “espionage,” 1996 “register paedophiles”).
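
The data-selection criterion can be sketched with simple term counts. How "top 10% of term frequencies" was operationalised is not stated in this summary; the version below assumes it means the term's count is at least as high as the count at the 10% rank among a speech's distinct terms, with ties included. Tokenisation is a plain lowercase regex and stop-word handling is omitted, both as simplifications.

```python
import math
import re
from collections import Counter


def in_top_fraction(text: str, term: str = "gchq", fraction: float = 0.10) -> bool:
    """Check whether term ranks within the top fraction of distinct terms by frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    if term not in counts:
        return False
    # Number of distinct terms that make up the top fraction (at least one).
    k = max(1, math.ceil(len(counts) * fraction))
    cutoff = sorted(counts.values(), reverse=True)[k - 1]
    return counts[term] >= cutoff


def select_speeches(speeches, term: str = "gchq", fraction: float = 0.10):
    """Keep only the speeches in which term is among the most frequent terms."""
    return [s for s in speeches if in_top_fraction(s, term, fraction)]
```

In practice very common function words would crowd out the top ranks, so some stop-word filtering before counting is probably needed for the criterion to be meaningful.
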
Guided topic modeling as paratext (NGO Wayback corpus):
- Preprocessing: restrict vocabulary to 10,000 most common English words; correct common web spelling issues; resulting corpus ~20 million words.
- Human–machine co-design: workshop-derived seed topics and keywords informed by prior qualitative research; parallel unseeded LDA runs to inspect distributions; iterative refinement.
- Seeded LDA (Jagarlamudi et al.) with seven topics: state_surveillance, corporate_surveillance, general_democracy, singular_democracy, actors, resistance, oversight; track topic prominence over time (topic-as-summary paratexts).
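
Tracking topic prominence over time can be sketched as an aggregation over per-document topic weights. The input format (a year paired with a topic-to-weight mapping) is an assumption; it stands for whatever a fitted seeded LDA model outputs per document, and averaging is one plausible definition of prominence rather than the paper's confirmed measure.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def topic_prominence_by_year(
    docs: Iterable[Tuple[int, Dict[str, float]]]
) -> Dict[int, Dict[str, float]]:
    """Average each topic's weight over all documents captured in the same year.

    A topic missing from a document's mapping counts as weight 0 for that document.
    """
    totals = defaultdict(lambda: defaultdict(float))
    n_docs = defaultdict(int)
    for year, weights in docs:
        n_docs[year] += 1
        for topic, weight in weights.items():
            totals[year][topic] += weight
    return {
        year: {topic: total / n_docs[year] for topic, total in topics.items()}
        for year, topics in totals.items()
    }
```

Plotting the resulting per-year averages for a topic such as `oversight` would give the kind of timeline the text calls a topic-as-summary paratext.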

Across cases, the approach reads along algorithmic and structural grains (search, URLs, timestamps, syntax) to invert or redirect them for counter-archiving.

Key Findings

Digital archives entail both continuities and discontinuities of archival practice; algorithmic mediation can be inverted to counter selective forgettings.
Extensification outcomes:
- Incidental archives (UK Upper Tribunal Decisions) can be re-collected despite minimal metadata using URL-hacking and scraping, enabling downstream analyses otherwise impossible via generic search alone.
- Public procurement repositories (EU TED) support creation of targeted virtual collections via expert search, URL parameters, and regex; REST access plus PDF-to-text unlock opaque government practices (e.g., border technologies) otherwise traceable only through procurement clues.
- Web archives (Wayback) reassembled via temporal strategies: removing ≥90% duplicate snapshots reduces reinforcement bias; fair sampling (≤4 snapshots/page/month) balances visibility between popular and smaller NGOs.
Intensification outcomes:
- Asylum decisions: social-media platforms are increasingly salient—by 2021, references appear in about 15% of all cases; issue-based subdocuments (~1100) and REBEL-derived knowledge graphs reveal socio-technical relations that re-narrate asylum experiences.
- Parliamentary Hansard (GCHQ): noun–verb frequency profiles surface legislative/performative language (“make point,” “belong right,” “call power”); rare/salient DOBJ relations uncover historically specific topics (e.g., 1982 Cold War “espionage,” 1996 “register paedophiles”), highlighting marginal or excluded concerns.
- Guided topic modeling of NGO sites (~20M words): seven seeded topics capture temporal dynamics—“oversight” surges post-Snowden and peaks in 2016, then declines; democracy and surveillance topics persist across years. Topic timelines offer interpretable paratexts summarising evolving debates.
Practical strategies codified: six techniques across three archive types (incidental legal collections, public transparency repositories, time-indexed web archives), spanning rebuilding entire archives, issue-based subdocumenting, virtual collections, syntactic actor–action extraction, temporal sampling, and human–machine guided topic modeling.

Discussion

The study addresses how digital archives both continue and discontinue traditional practices by operationalising reassembling as algorithmic and structural redirection. Reading along the grain (search interfaces, URL schemas, timestamps, syntactic structures) enables reliable decoding of regularities; reading against the grain (issue scoping, relation extraction, virtual collections, guided topics) surfaces marginalised entities, voices, and narratives. Extensification broadens inclusion by transforming non-archives/incidental collections into analyzable corpora, while intensification reduces discrimination by reorganising internal structures to highlight underrepresented relations, topics, and actors. Across legal, governmental, and civil-society domains, the methods demonstrate how algorithmically ruled processuality can be repurposed for counter-archiving, mitigating reinforcement biases (via deduplication and fair sampling) and producing paratexts that render historical dynamics legible. The results show the feasibility and value of human–machine workflows that ground computational models in domain expertise, thereby avoiding uninterpretability (e.g., topic-model ‘tea leaves’) and enabling credible counter-histories.

Conclusion

The paper contributes a conceptual and practical framework for reassembling digital archives to counter dominant knowledge: a) a distinction between extensifying (inclusive expansion to non-archives) and intensifying (nondiscriminatory reorganisation) reassembling; b) six concrete, reproducible strategies across three archive types (incidental legal collections, public transparency repositories, time-indexed web archives); and c) human–machine workflows that align computational pipelines with critical archival questions. The approaches recover lost or overlooked voices (e.g., asylum seekers, dissenting parliamentary discourse) and render opaque practices (e.g., procurement, surveillance oversight) analyzable. Future research should extend dynamic access methods beyond scraping (on-demand APIs/streaming), incorporate hyperlink structures into temporal reassembling, explore richer internal semantic document relations beyond sentences/entities, and leverage multimodality (e.g., images from web archives) to further inclusivity and reduce discrimination. Reassembling remains necessarily partial; expanding to additional archives and non-archives will broaden counter-archival potentials.

Limitations

Data access and legal/ethical constraints: web scraping faces copyright, data protection, and unauthorized access issues; choices varied by jurisdiction. The UK Tribunal dataset cannot be publicly shared due to privacy restrictions (available on request).
Coverage and sampling biases: incidental archives often lack metadata; Wayback captures uneven snapshot frequencies across sites; mitigated via fair temporal sampling but not eliminated.
Technical challenges: dynamic scraping (e.g., Selenium) is resource-intensive; PDF-to-text conversion is error-prone; changes in document standards over time complicate parsing.
Model and extraction errors: entity resolution (aliasing), relation extraction (REBEL) and syntactic parsing are imperfect; knowledge graphs and dependency-derived insights should be read as incomplete perspectives rather than definitive statements.
Computational cost: end-to-end relation extraction over large corpora can run for hours; scaling may require more efficient or approximate methods.
Interpretability: topic models risk ungrounded interpretations; addressed via seeded, expert-guided workflows, but human judgment remains contingent.
