Date Published: November 5, 2018
Publisher: Public Library of Science
Author(s): Sarah Klassen, Jonathan Weed, Damian Evans, Michael D. Petraglia.
Archaeologists often need to date and group artifact types to discern typologies, chronologies, and classifications. For over a century, statisticians have been using classification and clustering techniques to infer patterns in data that can be defined by algorithms. In the case of archaeology, linear regression algorithms are often used to chronologically date features and sites, and pattern recognition is used to develop typologies and classifications. However, archaeological data is often expensive to collect, and analyses are often limited by poor sample sizes and datasets. Here we show that recent advances in computation allow archaeologists to use machine learning based on much of the same statistical theory to address more complex problems using increased computing power and larger and incomplete datasets. This paper approaches the problem of predicting the chronology of archaeological sites through a case study of medieval temples in Angkor, Cambodia. For this study, we have a large dataset of temples with known architectural elements and artifacts; however, less than ten percent of the sample of temples have known dates, and much of the attribute data is incomplete. Our results suggest that the algorithms can predict dates for temples from 821–1150 CE with a 49-66-year average absolute error. We find that this method surpasses traditional supervised and unsupervised statistical approaches for under-specified portions of the dataset and is a promising new method for anthropological inquiry.
Archaeologists often rely on statistical methods to date and group archaeological sites, artifact types, and architecture. However, these methods can be limited by incomplete datasets. It can be relatively easy to create large archaeological datasets, with excavations producing thousands of ceramic sherds and lithic assemblages. Similarly, hundreds of archaeological sites can be identified on the landscape using remote sensing at relatively little cost. However, determining the chronology of the sites using excavation and C14 dating methods, or assigning ceramics to groups using instrumental neutron activation analysis (INAA), is comparatively expensive and time-consuming. As such, archaeologists often have large inventories of artifacts and sites, but the majority of the data points are underspecified because the chronology and group classifications are unknown and expensive to obtain using traditional methods. In this paper, we introduce the use of semi-supervised machine learning algorithms for archaeological inquiry. Machine learning mimics human pattern recognition and learning processes through a series of complex mathematical computations to find structure and define algorithms for large datasets. In this scenario, an algorithm refers to the equation, rules, or set of steps and pattern recognition necessary to transform the data (input) into the categories (output). Pattern recognition is the process of finding structure in data that can be used to divide the data into discrete categories.
Angkor is a sprawling low-density urban complex with hundreds of temples and occupation mounds connected through a network of hydraulic infrastructure. Until recently, the full extent of the settlement was only partially understood. Much of the habitational space was constructed in non-durable organic materials that have since disintegrated. Decades of aerial mapping and other remote sensing, however, have revealed traces of archaeological features including ponds, occupation mounds, embankments, and channels on the landscape [1, 15]. Evans and Pottier mapped much of the hinterlands and identified over 1400 temples (Fig 3). In this paper, we are interested in identifying the construction sequence of these temples so that we can date other urban features by proxy and create historical models of the urban development of the city.
In this paper, we explore several statistical approaches that fall under supervised or unsupervised paradigms. In the case study, there are 1332 undated temples (non-labeled data points) and 105 dated temples (labeled data points). Seriation, like k-means clustering, is unsupervised: it uses data from all the temples but does not incorporate the known dates in the analysis. In contrast, multiple linear regression (MLR) is supervised and uses the known dates to determine the algorithm, but it is trained on less than 10% of the dataset and could only predict dates for approximately half of the dataset. As a result, none of these analyses took full advantage of the dataset by using information from both the labeled and unlabeled data to improve the algorithms. Since collecting data for all the undated temples using excavation and traditional dating methods would be prohibitively costly and time-consuming, a semi-supervised paradigm was a natural approach for predicting dates for the remaining temples that could not be dated using MLR. However, the graph-based semi-supervised learning (GSSL) model had a higher average absolute error (AAE) than the MLR. As a result, we decided to merge the results from the GSSL and the MLR to combine the benefits of both approaches and determine estimated errors for different types of temples.
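To make the semi-supervised idea concrete, the following is a minimal sketch of one standard graph-based formulation (the harmonic solution on a similarity graph), not the implementation used in this study. It assumes a dense Gaussian affinity between data points, and the function names, the `sigma` bandwidth, and the synthetic "temple" data are all hypothetical choices made for illustration. Known dates are held fixed, and each undated point receives a weighted average of its neighbors' values, so that both labeled and unlabeled points shape the result.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Dense Gaussian (RBF) affinity matrix with a zero diagonal."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def harmonic_predict(X, y, labeled_mask, sigma=1.0):
    """Fill in unlabeled values with the harmonic solution on the graph.

    Labeled values are kept fixed; unlabeled values solve
    L_uu f_u = W_ul y_l, where L = D - W is the graph Laplacian.
    """
    W = gaussian_affinity(X, sigma)
    L = np.diag(W.sum(axis=1)) - W
    u = ~labeled_mask
    L_uu = L[np.ix_(u, u)]
    W_ul = W[np.ix_(u, labeled_mask)]
    f = y.astype(float).copy()
    f[u] = np.linalg.solve(L_uu, W_ul @ y[labeled_mask])
    return f

def average_absolute_error(y_true, y_pred):
    """Mean of |true - predicted|, in the units of the dates (years)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Synthetic stand-in data (NOT the Angkor dataset): 200 sites, two noisy
# attributes that vary with construction date, and 10% of dates known.
rng = np.random.default_rng(0)
n = 200
true_dates = rng.uniform(821, 1150, size=n)
t = (true_dates - 821) / (1150 - 821)
X = np.column_stack([t + rng.normal(0, 0.05, n), t ** 2 + rng.normal(0, 0.05, n)])

labeled_mask = np.zeros(n, dtype=bool)
labeled_mask[: n // 10] = True
y = np.where(labeled_mask, true_dates, np.nan)  # unknown dates are NaN

pred = harmonic_predict(X, y, labeled_mask, sigma=0.05)
aae = average_absolute_error(true_dates[~labeled_mask], pred[~labeled_mask])
print(f"AAE on held-out synthetic sites: {aae:.1f} years")
```

Because each propagated value is a weighted average of neighboring values, predictions from this formulation always lie within the range of the known dates. The bandwidth `sigma` controls how far information spreads along the graph and would need to be tuned for any real dataset.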
In the absence of detailed chronological models, the working assumption has always been that essentially all of the temples we see on the landscape were operational at the pinnacle of Angkor’s development in the eleventh century, and the lack of chronological resolution has been a persistent obstacle to complex diachronic studies of social and environmental processes. By combining the results of GSSL and MLR, we were able to predict dates for otherwise undated temples from 821–1149 CE with a 49–66-year AAE. These data can be used to create historical models of urban development at Angkor by assigning dates to temples and other landscape features that are associated with the temples. The resulting maps can then be used in the future for diachronic analyses of human-environmental and urban dynamics in the Khmer world.