Extracting archaeological data with Python 3 - part 2

Samian Pottery data are available in various formats, namely XLS, SXC (OpenOffice.org 1.0 format) and TXT (in fact tab-separated values). There is no actual difference in content or structure among the formats: the spreadsheet versions simply pack many contexts into just 3 files (each context is a single sheet), while the tab-separated values files come one per context. That said, and given that I had already planned to extract the data using modules from the Python standard library, I thought the text files would be the best place to start. Unfortunately, there are 173 of them and, because of the ADS site’s conditions of use, there’s no easy way to automagically download them with command line tools like wget or similar (which I otherwise use frequently). You will be better off with a Firefox extension like FlashGot. I assume that if you want to follow this series you will manage to download the files to your machine. I wonder how difficult it would have been to zip those files into a single compressed file. That would have saved me some 5 minutes and at least 3.5 MB on the server (not to mention that, using the recently released XZ compression format, I could compress them down to 74K; yes, I did). Speaking about the size of data, see below.
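For the curious, bundling the downloaded text files into a single XZ-compressed archive can be done with the standard library alone. What follows is only a minimal sketch: the function name `bundle_text_files`, the `.txt` extension and the idea that all files sit in one directory are assumptions made for illustration, and the resulting size will depend on the actual files.

```python
import tarfile
from pathlib import Path


def bundle_text_files(source_dir, archive_path):
    """Pack every .txt file found in source_dir into one .tar.xz archive.

    Returns the number of files that were added.
    """
    # Assumption: the context files were saved with a .txt extension.
    files = sorted(Path(source_dir).glob("*.txt"))
    # Mode "w:xz" makes tarfile write an LZMA (XZ) compressed tarball.
    with tarfile.open(archive_path, "w:xz") as archive:
        for path in files:
            # Store only the file name, not the full path on disk.
            archive.add(str(path), arcname=path.name)
    return len(files)
```

Calling something like `bundle_text_files("samian", "samian.tar.xz")` (both names hypothetical) would leave a single file to move around instead of 173.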

Apart from the generic terms of use of ADS, there is no clear and explicit license attached to these files. Downloading them for personal research use should be ok, anyway.

The text files are in fact nothing more than the spreadsheets saved in tab-separated values format, and thus have the same structure, if it can be called that. This structure is a bit odd and not really something one would call “raw data”: the first 4 or 5 lines hold “metadata” about the archaeological context, like the name, the type of site (urban, rural, etc.) and the administrative regions where the site lies. After that comes the actual table of data, in which columns represent time (years) and rows represent pottery types. The bad news starts here: each file contains rows for all known types, even those that weren’t found in that context. For this reason, there are more than 320 such rows in each file. I can’t find any reason to have empty rows like this, which are confusing if you look at the file in a text editor and time-consuming if you want to process the file like I did. The oddness continues when it comes to describing the quantity of each type in the context. This is presented by filling the cells in a row from the start date to the end date of that context (columns hold years, remember?). Obviously all rows filled like this describe the same time span, which again makes the information redundant. I also found out that there are extra rows that introduce every production area or site (like “Italian” or “South Gaulish La Graufesenque”) interleaved with the actual “records”, but each record has that information anyway, coded by means of a prefix like ITAL or SGLG. Furthermore, 3 files had some text missing in the first rows, and I had to repair them by hand.
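To make this layout more concrete, here is a minimal sketch of how one such file could be reduced to its useful rows. It is an illustration, not the code used in this series: the function name `parse_context`, the rule used to spot the row of years, and the keys of the returned dictionaries are all assumptions, and the real files may need more care (for example the 3 files with damaged first rows).

```python
def parse_context(text):
    """Split one tab-separated context file into metadata and records."""
    rows = [line.split("\t") for line in text.splitlines()]

    def is_year_header(row):
        # Assumption: the table header is the first row whose cells,
        # apart from the first one, are all integer years.
        cells = [cell.strip() for cell in row[1:] if cell.strip()]
        return len(cells) > 1 and all(c.lstrip("-").isdigit() for c in cells)

    header_index = next(i for i, row in enumerate(rows) if is_year_header(row))

    # Everything above the header is treated as free-form metadata.
    metadata = [
        "\t".join(row).strip()
        for row in rows[:header_index]
        if any(cell.strip() for cell in row)
    ]
    years = rows[header_index][1:]

    records = []
    for row in rows[header_index + 1:]:
        label = row[0].strip()
        if not label:
            continue
        # Keep only the cells that are filled, paired with their year.
        filled = [
            (int(year), cell.strip())
            for year, cell in zip(years, row[1:])
            if year.strip() and cell.strip()
        ]
        if not filled:
            # Either a type not found in this context, or one of the
            # rows that merely introduce a production area.
            continue
        records.append({
            "type": label,
            "start": filled[0][0],
            "end": filled[-1][0],
            "quantity": filled[0][1],  # kept as text, format not assumed
        })
    return metadata, records
```

Skipping every row without filled cells takes care of both the types missing from the context and the interleaved production area rows, which is why the 320 or so rows of a file shrink to a handful of records.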

So, each file holds more than 300 rows of data, but in the end the grand total of actual records (i.e. the sum of the number of types found in each context) is 1759. All things considered, the index file proved to be much more useful for getting the general information about each context: it also has bibliographic references, for one, and the time span is in a single place, instead of having to be worked out by parsing an entire table for the crossings between the right rows and columns.

However, after going through this digital journey, the data are there and can be effectively retrieved for further analysis. In the next article, I’ll show some Python 3 code snippets that I have used so far to parse the data files.

Speaking about the size of data, the entire dataset I built for my MA thesis, compressed in XZ format from the source file in JSON (not actually the most efficient storage format, if you ask me), weighs 84 K. Given that the number of records in the Samian Pottery dataset is roughly double, that would make for a complete dataset of less than 200 K. And that JSON file can be loaded straight into Django, making all queries and graphs available at a glance.
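As a rough illustration of the JSON plus XZ combination, the sketch below writes a list of records to an XZ-compressed JSON file and reads it back, using only the standard library. The function names `save_compressed` and `load_compressed` are made up for this example, and the structure of the records is left entirely open.

```python
import json
import lzma


def save_compressed(records, path):
    """Serialise records as JSON into an XZ-compressed file."""
    # lzma.open in text mode handles the encoding and the compression.
    with lzma.open(path, "wt", encoding="utf-8") as handle:
        json.dump(records, handle)


def load_compressed(path):
    """Read back the records written by save_compressed."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

The file produced this way is an ordinary `.xz` file, so it can also be unpacked with the command line `xz` tool to get the plain JSON back.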
