
Creating Our Data Sets

Introduction

We began our project with a dataset containing 1,428 entries exported from the MSU Libraries Special Collections Criminology Collection. Our tasks for this project were to a) decide which of the MARC fields for the 1,428 entries were relevant to our Poor Sinners Pamphlets website, b) break up pieces of metadata and MARC records into manageable and meaningful chunks of information, c) translate some entry data (for example, location of publication) into a standard format (English), and d) format the data to rid the datasheet of empty space, while also color coding the cells by language.

The initial dataset contained many flaws (see Fig. 1), such as missing diacritical marks in foreign-language names and titles, multiple pieces of data imported into single cells, and data repeated throughout the dataset. We eventually obtained another dataset from the MSU Libraries which included the correct diacritical marks, allowing us to break up the concatenated data with less difficulty. It is also worth mentioning that we faced some difficulties with datasheet compatibility across programs. Opening the dataset in Google Sheets, for example, would strip any diacritical marks that were included and replace them with garbled symbols, an apparent character-encoding issue. We therefore used Google Sheets only to plan and outline the data cleanup. Microsoft Excel and OpenRefine were used to perform the actual cleanup, since these programs, unlike Google Sheets, caused fewer formatting issues, such as unwanted deletions and unneeded spacing between records in the dataset.

Figure 1: Raw Data Screenshot

Data Planning

We began our project by creating a data planning sheet in Google Sheets. The sheet listed all the MARC data fields included in the dataset (see Fig. 2). This was the initial and arguably the most important stage of the project, in which we went through each MARC category and decided which ones contained information relevant to our collection of documents (we also received some help from the rest of the class in this matter). Many MARC fields were removed from our dataset either because they were irrelevant to our project, such as Dewey decimal codes, or because they contained so few entries that they were unreliable as information about the collection as a whole. To make the plan easier to scan, we color coded the MARC categories: green for relevant or necessary data, yellow for potentially relevant information, and no color for irrelevant or unnecessary data. Additionally, we outlined the tasks at hand and the order in which they needed to be performed: sorting the MARC data, deleting excess metadata, breaking up concatenated data, text faceting to change Latin to English, and formatting. Once we finished mapping out which pieces of MARC data we wanted to keep, we started cleaning up the actual dataset.

Figure 2: MARC Data Screenshot

Data Clean-Up

We started refining the data by inserting MARC field descriptions as column headings in the datasheet, so that the data in each column was easily identifiable. We then "cleaned" the dataset by opening it in Excel and deleting the entire columns we had previously deemed irrelevant to the project, per the planning document. After deleting the columns containing non-essential information, it was time to separate the concatenated data into adjacent columns. Using the text-to-columns wizard in Excel, we were able to do this in two different ways: delimited format and fixed-width format.
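We performed this step by hand in Excel, but for readers who prefer code, here is a minimal pandas sketch of the same operation. The file name, the MARC tags chosen, and the heading names are all illustrative assumptions, not our actual export:

```python
import pandas as pd

# Hypothetical CSV export of the MARC records (the real cleanup was done in Excel).
records = pd.read_csv("criminology_export.csv", dtype=str)

# Give the MARC tags human-readable column headings (names are illustrative).
records = records.rename(columns={
    "041": "Language",
    "100": "Main Entry (Author)",
    "245": "Title",
    "260": "Publication Info",
})

# Drop columns the planning sheet marked irrelevant (tags shown are examples:
# 082 = Dewey decimal, 020 = ISBN, 035 = system control number).
irrelevant = ["082", "020", "035"]
records = records.drop(columns=[c for c in irrelevant if c in records.columns])

records.to_csv("criminology_cleaned.csv", index=False)
```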

The delimited format was useful for data that appeared clumped together but was separated by a common delimiter per MARC subfield, indicated by "$" (see Fig. 3). Delimited mode removes the delimiter and splits the contents of each cell into adjacent columns, one column per piece. This made it simpler for us to remove the excess metadata afterwards; for example, the "\\" column pictured in Fig. 3 was split off at the "$" delimiter into its own column, which we could then easily delete.

With the fixed-width format, the column is split at character positions defined by the shape of the data, using one or more movable break-line cursors. Rather than removing a delimiter, this feature splits the column wherever a cursor is placed, moving the excess metadata into a separate column where it can easily be removed (see Fig. 4). While using the fixed-width feature, we ran into a problem: if we did not leave enough blank columns between the column being split and the data that came after it, the split data would overwrite the pre-existing data under the neighboring headings. To avoid this, we used a separate Excel sheet to check how many columns the data would require before splitting it. By cutting and pasting the column and doing a test run of the text-to-columns feature, we were able to insert the required number of blank columns to accommodate the split data (see Fig. 5).

Text-to-columns thus allowed us to move the excess metadata into separate columns and remove it from our dataset. In our author column, for example, multiple pieces of information were strung together on one line. We first used text-to-columns (fixed width) to separate the "20$a" and "$d" markers from the name and the dates. We then used text-to-columns (delimited), with the comma as the delimiter, to split the data further into three columns: last name, first name, and birth/death years. The end product of the text-to-columns feature is shown below with a sample piece of data (for more examples of text-to-columns, see Fig. 6):

[Image: the author data after separation into last name, first name, and birth/death years]
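As a scripted analogue, here is a small Python sketch of the same two-step split, using an invented author string in the format described above (the name and dates are made up for illustration):

```python
# Hypothetical raw author cell: MARC subfields are prefixed with "$".
raw = "20$aMeyer, Johann Heinrich,$d1702-1760"

# Step 1 (like the fixed-width pass): split the cell into its subfields
# at the "$" delimiter and strip the leading tag and subfield codes.
subfields = raw.split("$")                 # ["20", "aMeyer, Johann Heinrich,", "d1702-1760"]
name_part = subfields[1][1:].rstrip(",")   # drop the "a" code -> "Meyer, Johann Heinrich"
dates_part = subfields[2][1:]              # drop the "d" code -> "1702-1760"

# Step 2 (like the delimited pass on ","): split the name into last and first.
last_name, first_name = [p.strip() for p in name_part.split(",", 1)]

print(last_name, "|", first_name, "|", dates_part)
# Meyer | Johann Heinrich | 1702-1760
```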

Aside from using text-to-columns to break up concatenated data, we used OpenRefine to fix various inconsistencies in the data itself. OpenRefine's text facet feature allowed us to easily edit cells across the whole dataset, for example expanding language codes ("ger" to "German") or changing Latin place names to English ("Amstelodami" to "Amsterdam"). The text facet groups cells containing the same text into clusters, which can then be edited all at once (see Fig. 7).
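OpenRefine's facet-then-edit workflow has a simple scripted analogue: count the distinct values in a column, then apply a bulk replacement map. A minimal Python sketch, using an invented column of language codes standing in for our real data:

```python
from collections import Counter

# Invented column of raw language codes (placeholder for our MARC 041 data).
languages = ["ger", "ger", "lat", "ger", "fre", "lat", "eng", "ger"]

# Facet step: list every distinct value and how often it occurs. This is
# how we spotted codes and Latin names that needed correction.
print(Counter(languages))  # Counter({'ger': 4, 'lat': 2, 'fre': 1, 'eng': 1})

# Edit step: change every cell in a cluster at once via a mapping;
# unmapped values pass through unchanged.
code_to_name = {"ger": "German", "lat": "Latin", "fre": "French", "eng": "English"}
languages = [code_to_name.get(code, code) for code in languages]
print(languages)  # ['German', 'German', 'Latin', 'German', 'French', ...]
```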

Figure 3: Concatenated data with "$" subfield delimiters

Figure 4: Fixed-width split in the text-to-columns wizard

Figure 5: Test run used to determine the number of blank columns needed

Figure 6: Further examples of text-to-columns output

Figure 7: OpenRefine text facet clusters

Data Translation and Formatting

Given that many of our records were written in different languages, we were confronted with the tasks of translation and reformatting. The language breakdown of the dataset, as a percentage of total entries, is as follows: German (74.4%), Latin (21.6%), French (1.7%), and English (1.2%), with Italian (3 entries), Dutch (3), and "Undetermined" (7) making up the remaining 1.1%. Using Excel's conditional formatting feature, we color coded the data entries by language (MARC field 041). That is, this tool allowed us to fill a row with a blue background, for example, on the condition that it contained "German" in the 041 field. A key listing the languages and their respective colors is located on a separate sheet in the DataFinal Excel workbook.

Although the finished project will feature images of the texts in their original languages, it is important that we provide the relevant bibliographical information, namely the date and location of publication, in English. It became vital, then, to translate MARC field 260 (publication and distribution information) into English. Given that almost a quarter of the data entries were in Latin, this was no easy task. Some location names, such as Lipsiae, Jenae, and Altorfium, were recognizably Leipzig, Jena, and Altorf respectively. Other locations, however, required a little more digging to find their English equivalents (see appendix, Fig. 8); Argentorati, for example, looks nothing like its English equivalent, Strasbourg. Two or three locations were indecipherable and remain in their original language under MARC 260. Location names, however, were not the only pieces of data to be decoded: many of the Latin documents' publication dates were printed in Roman numerals and required further interpretation (see below).
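The row-coloring rule described above can also be scripted. Here is a minimal openpyxl sketch under stated assumptions: the language name is taken to sit in a hypothetical column C, and the fill color, cell range, and output file name are all illustrative:

```python
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.formatting.rule import FormulaRule

wb = load_workbook("DataFinal.xlsx")
ws = wb.active

# Fill a whole row light blue whenever its language cell (assumed to be
# column C) contains "German"; one such rule per language would follow.
blue = PatternFill(start_color="BDD7EE", end_color="BDD7EE", fill_type="solid")
ws.conditional_formatting.add(
    "A2:Z1429",  # 1,428 data rows under a header row; the width is illustrative
    FormulaRule(formula=['$C2="German"'], fill=blue),
)

wb.save("DataFinal_colored.xlsx")
```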

Figure 8: Latin location names and their English equivalents
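The Roman-numeral dates mentioned above follow the standard subtractive rules, so converting them is mechanical. A small Python sketch (the example year is invented, not taken from our collection):

```python
# Value of each Roman numeral character.
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral to an integer using the subtractive rule:
    a smaller value before a larger one is subtracted (e.g. CM = 900)."""
    total = 0
    # Pad with a space so the final character is always added, not subtracted.
    for char, next_char in zip(numeral, numeral[1:] + " "):
        value = VALUES[char]
        total += -value if value < VALUES.get(next_char, 0) else value
    return total

# An invented imprint date of the kind found on the Latin title pages.
print(roman_to_int("MDCCXXXIV"))  # 1734
```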

Conclusion

Working with this data was insightful; it required us to think about the relevance of each piece of information in the context of our overarching project and collection. Many of the categories of data that we deemed irrelevant to our project were, in fact, created by librarians and archivists themselves and told us nothing new about the documents. That is, pulling from our class reading of Foucault's The Order of Things, a classification is only relevant insofar as it contributes to the description of, or offers clarification of, the classified object. A few examples of these irrelevant classifications would include the LOC control numbers, ISBNs, local holdings, and control number identifiers; since visitors to our website are unlikely to need such information to access or better understand the pamphlets, omitting these classifications was necessary. Some classifications that elaborated on the source information in a few interesting entries were deemed only potentially relevant, because only a handful of entries actually contained data in those fields.

The following categories, among others not listed, were found to be most relevant: Language, Location, Main Entry (Author), Title, Publication Info, Physical Description, General Note, and Summary. In our opinion, and given what we have taken from our readings in our unit on data, these categories contain the essential information: they define the structure and character of their respective pamphlets. The purpose of classification is to organize knowledge, making disparate pieces of information decipherable through language. It is therefore unnecessary to include every bit of information we have on each entry, but it is important to record the specific choices we made in including and excluding certain types of data. Our hope is that, in working with this data, we have selected the information that will give viewers valuable insight into each document, without burying them in details and irrelevancies.

Download Our Data

Full Data Set (2.4 MB, XLS): Download full data here
Collection Subset (66 KB, XLS): Download subset data here