cuneiform-sign-detection-dataset

This repository contains the dataset for cuneiform sign detection introduced in the article:

Dencker, T., Klinkisch, P., Maul, S. M., and Ommer, B. (2020): Deep Learning of Cuneiform Sign Detection with Weak Supervision using Transliteration Alignment, PLOS ONE, 15:12, pp. 1–21 https://doi.org/10.1371/journal.pone.0243039

The dataset comprises image references, transliterations and sign annotations of clay tablets from the Neo-Assyrian epoch.

Annotated tablet segment

Dataset Overview

The cuneiform sign detection dataset is based on images of over 1800 clay tablets with over 3000 view segments (inscribed side of clay tablet) and their corresponding transliterations. We provide bounding box annotations for 81 tablets with a total of 8109 annotated signs spanning over 186 sign classes.

We concentrated on the Neo-Assyrian stage of cuneiform writing (c. 900-600 BCE) because of the large number of readily available texts and the standardized form of cuneiform writing at that time. The dataset of clay tablet images with their transliterations mostly originates from two different sources:

For the Neo-Assyrian period, the SAAo series are invaluable, delivering a diverse corpus of texts, including royal correspondence, divination and literary texts or international treaties, among other genres.

Dataset Compilation

We distilled our dataset by first extracting the usable transliterations from SAAo and then collating those with the available images in cdli. Only those tablets having both a usable transliteration and an image were included in our corpus.

A clay tablet image, as available at cdli, is a composite image made up of images of the various inscribed sides (usually front and back, sometimes the upper and lower or the left and right sides of the clay tablet). Similarly, the transliterations are organized according to the visible sides of the clay tablet. To compile this dataset, we first segmented the composite images and then matched the resulting view segments with their transliteration sides.

Both pre-processing steps were performed algorithmically and therefore contain errors, which need to be taken into account when working with the provided dataset. In particular, the matching step (segment assignment) relies on several heuristics, since composite images of clay tablets differ in their number of segments and in their layout.

To obtain sign annotations, we employed graduate and PhD students of Assyriology, who used a custom annotation tool to draw bounding boxes and label individual cuneiform signs. We follow the widely used sign code classification (mzl_label) established by Borger’s Mesopotamisches Zeichenlexikon (MZL), which we hope makes the dataset more accessible to other researchers.

Dataset Organization

The dataset is organized in terms of collections of clay tablets. For the most part we follow the organization of SAAo and only deviate in the train and test collections. In total we provide ten collections with eight from SAAo and two custom collections (with clay tablets from multiple sources):

This dataset is stored in csv files that contain all the information for the individual collections. We provide the following csv files for all collections:

The annotation csv files are only provided for a smaller subset of collections:

For image data we rely on the cdli website, where the tablet composite image files are made available. The individual image files are accessed by means of their cdli no., which we provide in the csv files. For a code example of how to load a tablet image and visualize the data provided in this dataset, please consult our short introduction jupyter notebook.
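As a minimal sketch of addressing an image by its cdli no., the helper below normalizes an identifier to the usual "P" plus six digits form and fills it into a URL template. The function names and the template are hypothetical and only serve as illustration; the placeholder address is not the real cdli location, which should be taken from the introduction notebook.

```python
import re


def normalize_cdli_no(raw):
    """Return a cdli number as 'P' followed by six digits (assumed format)."""
    text = str(raw).strip().upper()
    match = re.fullmatch(r"P?(\d{1,6})", text)
    if match is None:
        raise ValueError(f"not a cdli number: {raw!r}")
    return "P" + match.group(1).zfill(6)


def tablet_image_url(cdli_no, template):
    """Fill a caller-supplied URL template with the normalized cdli number."""
    return template.format(cdli=normalize_cdli_no(cdli_no))


# Placeholder template for illustration only; NOT the real cdli address.
EXAMPLE_TEMPLATE = "https://example.org/photo/{cdli}.jpg"

print(tablet_image_url("p334933", EXAMPLE_TEMPLATE))
```

Keeping the template as an argument means the same helper can point at the cdli website or at a local folder holding the images from the ZIP archive.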

While the vast majority of tablet images from our dataset are available on the cdli website, seven tablet images from the collection of the Vorderasiatisches Museum Berlin are currently not available there. The museum kindly granted us permission to provide them for research purposes as a downloadable ZIP archive here.

Details on the Dataset Composition

| Set | BBoxes | Signs | Segments | Tablets | Sources |
|---|---|---|---|---|---|
| train_full (Train BB) | Yes | 4663 | 67 | 47 | train(mixed), saa05, saa09 |
| test_full (Test) | Yes | 3446 | 57 | 34 | test(mixed, saa03), saa06 |
| transliteration (Train TL) | No | 185399 | 2983 | 1745 | saa01, saa05, saa08, saa10, saa13, saa16 |

IMPORTANT: These statistics only count signs with train_labels in the range 1<=x<=240 and, in the case of transliterations, only signs that are not marked as broken. Label 0 indicates background annotations.
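The counting rule can be sketched as follows. This is an illustration under assumptions, not the script used to produce the table: the column names `train_label` and `broken`, the encoding of the broken flag as `1`, and the sample rows are all hypothetical.

```python
import csv
import io

# Tiny in-memory stand-in for a csv file; values are illustrative only.
SAMPLE = """tablet_CDLI,segm_idx,train_label,broken
P000001,0,0,0
P000001,0,12,0
P000001,1,240,1
P000002,0,241,0
P000002,0,57,0
"""


def is_counted(row, check_broken=False):
    """Apply the rule: label in 1..240 and, optionally, not marked as broken."""
    label = int(row["train_label"])
    if not 1 <= label <= 240:
        return False  # label 0 is background, labels above 240 are ignored
    if check_broken and row.get("broken", "0") == "1":
        return False  # only relevant for transliteration records
    return True


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
n_bbox = sum(is_counted(r) for r in rows)
n_tl = sum(is_counted(r, check_broken=True) for r in rows)
print(n_bbox, n_tl)  # prints: 3 2
```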

Details on the CSV Files

The tablet segment metadata csv files contain the following information:

The transliteration csv files contain the following information:

The bbox annotation csv files contain the following information:

The line annotation csv files contain the following information:

The columns tablet cdli no. and segment index together serve as a unique key across the different csv files, which makes it possible to identify the records belonging to a specific tablet segment.
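A minimal sketch of using this composite key to bring records from two csv files together is shown below. The column names `tablet_CDLI`, `segm_idx`, `view_desc` and `mzl_label`, as well as the sample rows, are assumptions made for illustration; check the headers of the actual files before relying on them.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-ins for a segment metadata file and a bbox annotation file.
SEGMENTS = """tablet_CDLI,segm_idx,view_desc
P000001,0,Obv
P000001,1,Rev
"""
BBOXES = """tablet_CDLI,segm_idx,mzl_label
P000001,0,839
P000001,0,1
P000001,1,10
"""


def segment_key(row):
    """Composite key: (tablet cdli no., segment index)."""
    return (row["tablet_CDLI"], int(row["segm_idx"]))


def group_by_segment(csv_text):
    """Group all records of one csv file by their tablet segment."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[segment_key(row)].append(row)
    return groups


segments = group_by_segment(SEGMENTS)
bboxes = group_by_segment(BBOXES)

# For every segment, report its view description and number of annotated signs.
for key in sorted(segments):
    print(key, segments[key][0]["view_desc"], len(bboxes.get(key, [])))
```

The same grouping works for the transliteration and line annotation files, since they share the key columns.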

References