A deep-learning model for predictive archaeology and archaeological community detection

A. Resler, R. Yeshurun, et al.

This research by Abraham Resler, Reuven Yeshurun, Filipe Natalio, and Raja Giryes presents a metric-learning-based CNN trained on a large archaeological image dataset. In a blind test on assigning artefacts to periods, the model outperformed two expert archaeologists across the full temporal range, while the experts matched it within their own specializations. The model's systematic confusions also reveal connections between sites and periods, illustrated through case studies.

Introduction
The study addresses the challenge of classifying archaeological artefacts—spanning diverse materials, forms, and long temporal ranges—where expert-driven typologies can be subjective and rely on preferred visual criteria. The authors aim to leverage deep convolutional neural networks to automate and standardize artefact classification across the full temporal and cultural diversity of the Levantine archaeological record, using a large public image repository from the Israel Antiquities Authority. The research questions are: (1) Can a CNN trained with transfer learning predict an artefact’s site and period from its image across a million-plus-year span? (2) Can learned embeddings retrieve visually and archaeologically similar artefacts for a query image? (3) Can patterns of systematic “confusion” in predictions be transformed into meaningful archaeological communities that reflect real temporal or cultural affinities? The study’s importance lies in enhancing efficiency and consistency of artefact classification, enabling large-scale comparative analyses, and uncovering previously unknown relations among assemblages and sites.
Literature Review
The paper reviews prior efforts to apply computational methods and machine learning in archaeology. Early attempts relied on hand-crafted features and achieved limited performance. Recent advances include automatic feature extraction via machine learning, such as combining Raman spectroscopy and ML to assess thermal alteration on flint. Deep learning, especially CNNs, has shown promising results in archaeological tasks like ceramic classification, lithic assemblage discrimination, and identifying bone surface modifications. However, previous CNN applications typically focused on narrow material classes or contexts and did not tackle the full diversity of archaeological periods and artefact types. This gap motivates the present work to develop a broadly applicable CNN-based approach with community detection to capture meaningful affinities across time and space.
Methodology
Dataset: The public IAA repository comprises 12,364 photographs of 6,770 artefacts from the southern Levant, spanning the Lower Palaeolithic to Late Islamic periods. Artefact categories include stone, bone, metal, pottery, and figurative items. Each artefact has site and period labels. To balance classes, the 200 largest period-site classes were selected (9,909 images of 5,450 artefacts) and split into training (8,031 images; 81%; 4,428 artefacts) and validation (1,878 images; 19%; 1,020 artefacts) sets.
Labeling granularity: Multiple label types were used: period-site, site, period, and two temporal groupings (fine: 21 groups; rough: 13 groups). This supports evaluating classification at different temporal resolutions.
Pre-processing: Images originally had varied backgrounds and scales. Backgrounds were standardized to white and scale bars were removed using either automatic contour retrieval (Canny edge detection followed by border following) or interactive GrabCut when needed. Images were resized to 300×300 pixels.
Base network and transfer learning: VGG, InceptionResNetV2, and EfficientNetB3, all pre-trained on ImageNet, were evaluated; EfficientNetB3 performed best (≈1% better, in fewer epochs). The ImageNet classification head was replaced with a fully connected layer mapping 1536-dim embeddings to 200 classes. Data augmentation included random rotations, shifts, zooms, and horizontal flips. Models were trained for 25 epochs with categorical cross-entropy loss.
Ensemble and feature construction: Five CNNs with the same ImageNet initialization were trained independently to generate diverse feature vectors. The five vectors were concatenated (dimension 5D, where D is the embedding dimension of a single network) and randomly projected via a Gaussian matrix back to D dimensions to form the final feature vector (Z^RP) used for retrieval and classification.
Prediction and metric: For a query image, features are extracted and the image is classified by k-nearest neighbors (k=1) in the training set using cosine similarity.
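The ensemble fusion and nearest-neighbor prediction steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the 1/sqrt(D) scaling of the Gaussian matrix, the toy dimensions, and the random toy embeddings are all assumptions, and the standard library is used only to keep the sketch dependency-free.

```python
import math
import random


def gaussian_projection(in_dim, out_dim, seed=0):
    """Build an in_dim x out_dim matrix with i.i.d. Gaussian entries."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(out_dim)  # assumed scaling; the source does not give one
    return [[rng.gauss(0.0, sigma) for _ in range(out_dim)] for _ in range(in_dim)]


def fuse(embeddings, projection):
    """Concatenate the per-network vectors (length 5D for five networks), project to D."""
    concat = [x for emb in embeddings for x in emb]
    if len(concat) != len(projection):
        raise ValueError("projection rows must match the concatenated length")
    out_dim = len(projection[0])
    return [sum(concat[i] * projection[i][j] for i in range(len(concat)))
            for j in range(out_dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_1nn(query, train_vectors, train_labels):
    """Return the label of the training vector most cosine-similar to the query (k=1)."""
    best = max(range(len(train_vectors)),
               key=lambda i: cosine_similarity(query, train_vectors[i]))
    return train_labels[best]


# Toy run: 5 hypothetical networks, each emitting a D=4 embedding for one image.
D, NUM_MODELS = 4, 5
projection = gaussian_projection(NUM_MODELS * D, D, seed=0)
toy_rng = random.Random(1)
toy_embeddings = [[toy_rng.random() for _ in range(D)] for _ in range(NUM_MODELS)]
fused = fuse(toy_embeddings, projection)
print(len(fused))  # 4
```

The same projection matrix would have to be reused for every training and query image, since cosine similarities are only comparable between vectors projected with the same matrix.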
Alternative losses: Large Margin Cosine Loss and cross-entropy yielded similar accuracy; adding triplet loss with online triplet mining improved VGG/InceptionResNetV2 by ≈1%, but the final EfficientNetB3 model used cross-entropy only.
Training details: The optimizer was AdamW with an initial learning rate of 1e-4, reduced by a factor of 10 on plateau. The batch size was 20, and training ran on a single Nvidia GeForce 2080 Ti GPU.
Community detection workflow: Build a confusion matrix A (C×C), where A_ij is the normalized proportion of samples with true label i predicted as j (not necessarily symmetric). Symmetrize to B = ½(A + A′), where A′ is the transpose of A, and use B_ij as edge weights in an undirected weighted graph whose nodes are period-site classes. Apply the Louvain algorithm to detect communities, and report the modularity. Enhancements included (1) using the top-10 nearest neighbors to densify the graph and (2) restricting the confusion matrix to neighboring temporal ranges (e.g., Palaeolithic–Epipalaeolithic, Bronze–Iron, Hellenistic–Byzantine) to reduce irrelevant confusion and enable finer-grained analyses.
Expert comparison: A blind validation used 63 images (3 per fine-period group), which were shown to two archaeologists and to the model for fine-period labeling, enabling a direct performance comparison.
Interactive tool: An application visualizes classes and detected communities on a map and supports period filtering, node selection, and exploration of community members.
Code and model availability: Code is available at https://github.com/aviresler/antique-gen; a website section enables uploading images for predictions.
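The graph construction in the community detection workflow can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and toy labels are hypothetical, and the diagonal of B (correct predictions) is dropped here so that only confusions form edges, which the summary does not specify. The Louvain search itself is not implemented; the sketch only scores a given partition with Newman modularity, the quantity Louvain tries to maximize. In practice a library routine such as NetworkX's `louvain_communities` could perform the search.

```python
def normalized_confusion(true_ids, pred_ids, num_classes):
    """A[i][j] = fraction of samples with true class i that were predicted as j."""
    counts = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_ids, pred_ids):
        counts[t][p] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def symmetrize(a):
    """B = (A + A') / 2, so the confusion graph can be treated as undirected."""
    n = len(a)
    return [[0.5 * (a[i][j] + a[j][i]) for j in range(n)] for i in range(n)]


def modularity(weights, community_of):
    """Newman modularity of a partition; the diagonal is ignored (an assumption)."""
    n = len(weights)
    w = [[0.0 if i == j else weights[i][j] for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in w]
    total = sum(degree)  # equals 2m for an undirected weighted graph
    if total == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community_of[i] == community_of[j]:
                q += w[i][j] - degree[i] * degree[j] / total
    return q / total


# Toy data: classes 0 and 1 are confused with each other, as are classes 2 and 3.
true_ids = [0, 0, 1, 1, 2, 2, 3, 3]
pred_ids = [0, 1, 1, 0, 2, 3, 3, 2]
A = normalized_confusion(true_ids, pred_ids, num_classes=4)
B = symmetrize(A)
print(modularity(B, [0, 0, 1, 1]))  # 0.5
```

In the toy data, grouping the mutually confused classes together scores higher than a partition that splits them, which is the sense in which systematic confusion is turned into communities.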
Key Findings
- Overall validation accuracies (trained on period-site objective; Top-1/Top-3/Top-5):
  • Period-site: 58.10% / 64.22% / 67.36%
  • Site: 63.58% / 68.96% / 71.89%
  • Period: 67.79% / 74.18% / 77.69%
  • Fine-period grouping: 71.03% / 77.96% / 81.47%
  • Rough-period grouping: 76.36% / 82.70% / 85.41%
- The model’s confusion concentrates along temporally adjacent periods, reflecting true visual similarity across neighboring time ranges.
- Expert comparison on 63 images (fine-period task): model achieved 69.84% accuracy versus archaeologists at 44.44% and 20.63%. Experts matched the model within their specialization but underperformed across all periods.
- Training objective comparison: Training on period-site classes yielded highest accuracies across period-site, period, and site evaluations. Training on periods preserved period accuracy but reduced site/period-site performance; training on sites yielded near-comparable performance to period-site training. Site information adds significant value for learning.
- Community detection: From the confusion-derived graph, Louvain detected 28 communities with modularity 0.77. Using top-10 neighbors and restricting to adjacent temporal ranges improved community quality. A Dead Sea-centered community illustrated geographically coherent Roman/Byzantine links.
- Natufian case-study (restricted rough-periods spanning Middle Palaeolithic to Pre-Pottery Neolithic B) produced five communities mixing Natufian and closely related periods/sites. Examples showed archaeologically meaningful similarities (e.g., bone implements, worked teeth, flint tools) and highlighted occasional false associations due to visual similarity (e.g., Natufian awls with PPNB arrowheads).
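The accuracies above are reported at several label granularities for a model trained on period-site classes. One way such an evaluation could be computed is sketched below: each period-site prediction is collapsed to a coarser label before comparison. This is a hedged sketch, not the paper's evaluation code. It assumes a prediction counts as a top-k hit when any of the k highest-ranked candidates maps to the true coarse label, and the function names, site identifiers, and toy rankings are hypothetical.

```python
def topk_accuracy(true_labels, ranked_predictions, k=1, coarsen=lambda label: label):
    """Fraction of samples whose true (coarsened) label appears among the top-k candidates."""
    if not true_labels:
        return 0.0
    hits = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        target = coarsen(truth)
        if any(coarsen(candidate) == target for candidate in ranked[:k]):
            hits += 1
    return hits / len(true_labels)


def period_of(label):
    """Collapse a (period, site) label to its period."""
    return label[0]


def site_of(label):
    """Collapse a (period, site) label to its site."""
    return label[1]


# Hypothetical period-site labels; the site identifiers are placeholders.
truth = [("Natufian", "site_1"), ("Roman", "site_2"), ("Roman", "site_3")]
ranked = [
    [("Natufian", "site_1"), ("Roman", "site_2")],
    [("Roman", "site_3"), ("Roman", "site_2")],
    [("Byzantine", "site_3"), ("Roman", "site_2")],
]

for name, coarsen in [("period-site", lambda label: label),
                      ("site", site_of),
                      ("period", period_of)]:
    scores = [topk_accuracy(truth, ranked, k, coarsen) for k in (1, 2)]
    print(name, scores)
```

Because several period-site classes collapse onto the same coarse label, a prediction that is wrong at the period-site level can still be right at the period or site level. This is consistent with the reported accuracies rising as the labels become coarser.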
Discussion
The CNN successfully learns discriminative visual embeddings for diverse archaeological artefacts across extensive temporal ranges, enabling accurate retrieval and classification at multiple temporal resolutions. Concentration of confusion among adjacent periods indicates that the model captures real stylistic and technological continuities over time. Incorporating site information during training enhances representation learning, improving period-site and site predictions and maintaining strong period performance. The model matches or exceeds archaeologists’ performance when generalizing across periods, suggesting utility as an assistive tool for cross-period classification and rapid triage. Community detection transforms systematic misclassifications into archaeologically meaningful clusters, revealing inter-site and inter-period affinities that can guide hypothesis generation about cultural interactions, trade, or technological diffusion. Case studies (e.g., Natufian) demonstrate both valid, interpretable connections and the need for expert validation where visual similarity can create spurious links. Overall, the findings support using confusion-informed networks to explore cultural structure in large archaeological datasets.
Conclusion
The paper presents a deep-learning workflow for archaeological image analysis that (1) classifies artefacts by period and site with strong accuracy across diverse materials and epochs, (2) retrieves visually and archaeologically similar items via learned embeddings, and (3) discovers meaningful communities from prediction confusions using Louvain clustering. The approach outperforms or complements expert performance in broad temporal tasks and provides scalable tools for exploring large corpora, accelerating routine classification, and surfacing novel patterns across sites and periods. Future work could: expand and balance datasets; incorporate multimodal metadata (context, material, provenience); refine temporal labeling to handle ambiguity; improve community detection by integrating geographic/stratigraphic priors; and evaluate generalization to other regions and repositories. Enhancements to loss functions that encode prior temporal ambiguity and more robust ensembles could further improve performance.
Limitations
- Label ambiguity: Period boundaries are inherently vague; artefact types often span multiple periods, producing label noise (e.g., Early Roman vs. Roman). Attempts to incorporate prior confusion into the loss were hindered by difficulty quantifying ambiguity and did not improve accuracy.
- Dataset balance and diversity: Class sizes vary widely; analysis was limited to the 200 largest classes (≈80% of images/artefacts), potentially biasing performance and community structure.
- Imaging heterogeneity: Original photos lacked a standardized capture protocol; although backgrounds/scales were normalized, residual variability may affect features.
- Evaluation setup: No separate test set beyond the held-out validation split; k=1 nearest neighbor classification limits robustness to noise; image resolutions capped at ~300×300 for model input and ~600 px available for full images.
- Community detection sensitivity: Results depend on the chosen temporal window and number of neighbors used to build the graph; some communities include outliers requiring expert interpretation.
- Generalizability: Trained on Levantine artefacts; performance on other regions or repositories was not assessed.
- Prior confusion modeling: Gaussian-based and confusion-informed loss variants did not yield improvements, indicating limitations in current approaches to encode temporal ambiguity.