
Visualization

About

The Michigan State University Library German Criminology Collections contain a set of European crime and punishment documents also known as the ‘Poor Sinners’ Pamphlets.’ These documents are pamphlets that were sold as advertisements to the general public at public executions in the eighteenth and nineteenth centuries. They tell the stories of these executions: written information about crimes, criminals, victims, executions, and executioners. Additional data provided by each document includes publication year, publication location (city and country), language of text, physical dimensions, and name of publisher(s). Our goal as the data visualization team was to take this information from previously created datasets and explore trends and connections within the data that can be represented visually as a means of micro-analysis and macro-analysis, revealing stories in the archive and answering questions about the collection as a whole. We used tools such as Google Drive, Google Sheets, Microsoft Excel, RAW from DensityDesign Research Lab, OpenRefine (previously Google Refine), and Adobe Illustrator. Our final results consist of copies of the two datasets we used to create visualizations in RAW, three visualizations made in RAW, and this reflection, which seeks to make legible the decisions we made in creating our visualizations. We aimed to answer three separate questions about the data and represent our findings in visual form.

First Question

The first question we decided to try to answer with a visualization was what kinds of trends connect types of crime to a certain city in a given time period. We were motivated to ask this question by a discussion in class about how easily the people that the data represent are turned into just a point on a graph. Hence, we wanted to make sure our first question focused on the stories of the people behind the data. We thought that by creating a visualization showing this trend, we would be able to answer more questions about the people. For example, if there was a strong connection between crimes involving theft and Berlin during the 1850s, perhaps there was a particular social crisis or event that afflicted poor people in that time period. Or if, hypothetically, a high number of crimes against women occurred in a certain place and time, this might tell us that women were not well respected and were treated very poorly, which could explain why the number of women perpetrators was also high in this area at this time, since they could have been acting out against their oppression or been forced into a life of crime. We had insights like these in mind when deciding which data to include and which type of visualization would help us answer such questions most effectively.

We decided to use the Alluvial Diagram in RAW to visualize the connection between crime, city, and year, because this would make the magnitude of such correlations particularly evident. Here, though, we ran into quite a bit of difficulty with the limitations of the data available to us. RAW works most effectively with smaller sets of data, especially with the type of graph and the three categories we chose to best answer our question. We ended up taking the data subset, deleting quite a few of the columns that contained no information or repeated information, and then loading this new set into RAW. We then set the hierarchy of the graph to the “Publication, Distribution, etc. (imprint)” string, which gave us the cities of publication, then added the “Fixed- Length Data Elements 2” number, which gave us the year of publication, and finally added the “Subject Topical Term” string, which gave us the crimes committed (see figure 2.1). The final visualization can be seen in figure 1.1.
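The column clean-up and the city, year, crime grouping described above can be sketched in Python. This is only an illustration under stated assumptions, not the procedure we actually ran (that was done by hand and in RAW): the three column names are the ones quoted in the text, while the function names, the "Notes" and "Topic copy" columns, and the sample rows are invented for the example.

```python
from collections import Counter

# Column names quoted in the text above.
CITY = "Publication, Distribution, etc. (imprint)"
YEAR = "Fixed- Length Data Elements 2"
CRIME = "Subject Topical Term"


def drop_uninformative_columns(rows):
    """Remove columns that are blank in every row or repeat an earlier column."""
    if not rows:
        return []
    kept, seen = [], set()
    for name in rows[0]:
        values = tuple((row.get(name) or "").strip() for row in rows)
        if not any(values) or values in seen:
            continue  # entirely blank, or a copy of a column already kept
        seen.add(values)
        kept.append(name)
    return [{name: row[name] for name in kept} for row in rows]


def count_flows(rows, dimensions=(CITY, YEAR, CRIME)):
    """Count how many pamphlets share each (city, year, crime) combination."""
    return Counter(
        tuple((row.get(d) or "").strip() or "(blank)" for d in dimensions)
        for row in rows
    )


# Invented sample rows; "Notes" and "Topic copy" are hypothetical columns.
sample = [
    {CITY: "Augsburg", YEAR: "1781", CRIME: "Robbery", "Notes": "", "Topic copy": "Robbery"},
    {CITY: "Augsburg", YEAR: "1781", CRIME: "Robbery", "Notes": "", "Topic copy": "Robbery"},
    {CITY: "", YEAR: "1790", CRIME: "Murder", "Notes": "", "Topic copy": "Murder"},
]
cleaned = drop_uninformative_columns(sample)
flows = count_flows(cleaned)
```

Each key in `flows` corresponds to one band in an alluvial diagram, and its count to the width of that band. Rows with a blank city are kept under a "(blank)" label rather than dropped, which mirrors the unlabeled section that appears in our visual.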

Figure 1.1

Figure 2.1

Quite a few of the rows in the “Publication, Distribution, etc.” column in our data set were blank, which is why a large portion of the lines come from an unlabeled section on the left side of the visual. Nevertheless, this visual shows us that the most frequent places of publication in our subset were Augsburg, Munich, and publications catalogued with the label “Germany” instead of a city. The most frequent years of publication were 1819, 1781, 1788, and 1790, and the crimes most frequently committed were murder, robbery, and infanticide. This visualization was not as informative as we’d hoped, but it still allowed us to glean some answers to our question. For example, it seems a good number of the publications from the most represented cities were also published in the most common years, which also correspond to the most common crimes. Murder being one of the most common crimes could be the result of social unrest or rebellion of some kind; robbery being another very common crime could be the result of economic hardship, or of some change in political power that negatively affected the poor or created further social division between the wealthy and the poor. Indeed, the possible explanations of these trends are numerous, but the main point of posing a question and creating a visualization of this type was to highlight the importance of representing different groups of people in a dataset like ours, because it is easy to get caught up in the sensational nature of the crimes in the pamphlets rather than focusing on the individuals behind the crimes.

We went back to the archive to look at some of the crimes committed in 1781 and see whether the answers we got from the visualization supported our hypothesis. As it turns out, for the crimes involving robbery during this year, there was in fact a stigma surrounding members of the lower class that probably caused them to receive harsher sentences than someone of a higher class. Looking at the summaries of the executions, there seemed to be obvious partiality. For example, a single Catholic man named Mathias, a wagon driver, confessed in 1781 to robbery during the second torture session performed on him. Another single Catholic man, 28-year-old Xaveri, was also sentenced to death in 1781 for robbery. Then we found a married 50-year-old Catholic man named Anton, who was a swine butcher and had two children. Anton had been arrested three times before and had served time in prison for robbery; however, he was not sentenced to death until the fourth offense, which he had planned beforehand with multiple accomplices and during which a woman was brutally murdered. Mathias and Xaveri, in the first two cases, seem to have been sentenced to death rather hastily, and Mathias was even forced to confess by torture. Xaveri was younger, Mathias was a wagon driver, and both were single Catholic men; taking these facts into consideration, it seems their social standing may have been the reason for their unfair treatment. Anton, on the other hand, was older, married, and the father of two children, and he had a job that probably paid better than a wagon driver’s. It seems that because of his higher social standing he was only imprisoned for stealing, not just once but three times, instead of being put to death right away like Mathias and Xaveri. Given the difference in treatment, it seems the justice system substantially favored higher-standing members of society when dealing out punishments.
Looking deeper into the results of our visualization by examining summaries in the archive has borne out our hypothesis: the social division between the wealthy and the poor members of society proved detrimental to the poor, because those deciding the punishments were partial towards the wealthy.

Second Question

The second question we aimed to answer regarded the languages in which pamphlets were published in different countries. With this type of visual, we hoped to look at which languages were predominant in which locations, what that can tell us about the social structure of that country, and what that might have meant for the people committing the crimes or deciding on the punishments. For this visual we used the full data set, so that the overall language trends could be included. In RAW we chose the Cluster Dendrogram in order to see trends more clearly. The visual can be seen in figure 1.2:

Figure 1.2:  

The hierarchy consisted of language first, then country. From this visual we can tell that German, Latin, and English were the three most popular languages, being published in the most countries. Dutch, Catalan, and Italian were each published in only one country, which could mean that people who spoke these languages were a minority, or that criminological writings in these languages are less represented in the MSU collection. However, it is important to note that although the data includes Catalan as a language, this is a mistake from a previous group, and the text is actually in Latin, not Catalan. We can also tell which countries are best represented in the collection: Germany, England, and France. While it may be tempting to conclude that these countries had the highest crime rates, it is more likely that they were the most active in producing crime-related pamphlets and documents, or maybe even that these countries relied on such publications as a means of political control. Regardless, this visualization helps us to draw conclusions or discuss causes and effects of this nature. Narrowing the type of data we look at in a visualization makes it much easier to see trends that allow for discussion of the people behind the data, rather than just looking at the data in full on an Excel spreadsheet.

Third Question

Our third question is intended to get more information about the 15 countries present in the Criminology Collection at MSU, specifically regarding the cities in which these documents were published. To answer these components together visually using RAW, we chose to use a circular packing graph. What we ended up creating is this:

What is useful about this representation of the data is its ability to display multiple levels of meaning in a simple, straightforward manner. Each node represents a pamphlet imprint location: the larger the node, the more pamphlets carry that specific imprint city. Clustering these cities by country also provides a global and a national gauge by which one can measure the data. Globally, we can see how many countries published crime documents that included the punishment. On the level of “nation” (although many of these countries existed as political entities different from the “nations” we know now), we see how many cities in that country had the means to publish these pamphlets throughout the 18th and 19th centuries. Unfortunately, there is no way to express visually the years in which the pamphlets within each node were published. It may also appear easy to argue that larger nodes equate to larger concentrations of crime or criminal charges; however, that sort of assumption could be made falsely about any documentation concerning crime. Crimes are often committed with the explicit intention of not being caught, so there is no discernible evidence for taking instances of criminals being caught as representative of the crimes being committed. Even with all of the data that we currently have about these pamphlets, our findings do not help us explain the inner workings of a past society through criminological metrics: how many of these pamphlets didn’t survive the test of time? Instead, what we arrive at is another set of questions regarding not the historical nature of crime, but the historical nature of how a society publishes, displays, and talks about crime and, ultimately, how it preserves its histories of crime through the set of documents now housed at MSU.

This graph was made using our full dataset with 1500 entries (why that number should not have come about is discussed later), exported as a CSV file for easier import into RAW. Circular packing graphs require hierarchy, size, color, and label components. We used a hierarchy of country then city, attributed a size of 1 to each entry (each entry being one pamphlet), color coded the nodes by country, and labeled each node with its city; this graph was the result.
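The country-then-city hierarchy with a size of 1 per pamphlet amounts to a nested count, which the following Python sketch illustrates. It is a hedged example rather than a description of how RAW works internally: the "Country" and "City Publication" column names come from our dataset's column list, while the function name, the "(unknown ...)" labels, and the sample rows are assumptions made for the example.

```python
from collections import Counter, defaultdict


def pack_counts(rows, country_col="Country", city_col="City Publication"):
    """Group pamphlets by country, then city; every row contributes a size of 1."""
    sizes = defaultdict(Counter)
    for row in rows:
        country = (row.get(country_col) or "").strip() or "(unknown country)"
        city = (row.get(city_col) or "").strip() or "(unknown city)"
        sizes[country][city] += 1
    return {country: dict(cities) for country, cities in sizes.items()}


# Invented sample rows for illustration only.
pamphlets = [
    {"Country": "Germany", "City Publication": "Augsburg"},
    {"Country": "Germany", "City Publication": "Augsburg"},
    {"Country": "Germany", "City Publication": "Munich"},
    {"Country": "England", "City Publication": "London"},
    {"Country": "Germany", "City Publication": ""},
]
nodes = pack_counts(pamphlets)
```

In `nodes`, each outer key would be one country cluster and each inner value the size of one city node. Note that publication year has no place in this structure, which is the limitation of the circular packing graph described above.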

 

Methods

The methodology behind these visualizations ultimately relied on the datasets we currently have, which are varied, to say the least. Working on datasets that multiple people have edited and altered creates a variety of issues. Our major problems were entries being lost or altered accidentally, and new information being added to separate datasets that did not always include the call numbers or ID numbers which helped to keep track of entries. After two separate projects that worked on and modified our datasets, we now have three different versions of our data, which vary in the number of entries and lack consistent descriptions and a consistent means of identifying a specific pamphlet’s location within any of the sets. For our visualizations we were given the AL340_VisData_updated 2.0.csv file, which has these columns of information about the pamphlets:

Year

Number

Country

Language

Last name

First Name

Title Statement

City Publication

Imprint Publication

LCSH_1

We encountered the first major issue in the fact that our data set lacked call numbers. Without each pamphlet’s call number we had no way to update any of the information within the rows en masse. That issue could have been avoided had each dataset contained the same number of entries, which they did not. Our master dataset, from which all data was gathered, has exactly 1,428 rows, meaning there are exactly 1,427 entries. This becomes important because the only full dataset to contain updated years and locations has 1,502 rows (1,501 entries), the extra entries having been added during the mapping in order to take account of documents with duplicate [...]. The lack of certain pamphlets’ call numbers now becomes a concern: duplicates can be fixed, but if new entries appear without a call number there is little we can do to match up information among sets. When sorting the datasets by call number, the official full data set is missing 128 call numbers while the CartoDB full set with updated year and location is missing 137. It is important to note that even though the updated VisData set has the correct number of entries, when sorting each column in an attempt to align sets, the information in these columns is not specific enough to match with certainty. With each set having slightly varied updates, matching up any of the columns to the original dataset in order to restore the call numbers was impossible. Even in the subset of 100 selected pamphlets, the mapping dataset with the updated year and location did not have 100 entries; it has only 97.
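The kind of comparison described above, counting missing call numbers and checking which entries can be matched between two versions of the data, can be sketched in Python as follows. This is an illustration only: the "Call Number" column name, the function name, and the sample call numbers are assumptions, and we did this comparison by sorting in spreadsheets rather than with a script.

```python
from collections import Counter


def call_number_report(master, updated, key="Call Number"):
    """Compare two versions of a dataset by call number.

    `key` is a hypothetical column name; substitute the one a dataset uses.
    """
    def keys(rows):
        return [(row.get(key) or "").strip() for row in rows]

    master_keys, updated_keys = keys(master), keys(updated)
    master_set = {k for k in master_keys if k}
    updated_set = {k for k in updated_keys if k}
    repeats = Counter(k for k in updated_keys if k)
    return {
        "master_missing": master_keys.count(""),
        "updated_missing": updated_keys.count(""),
        "only_in_master": sorted(master_set - updated_set),
        "only_in_updated": sorted(updated_set - master_set),
        "duplicated_in_updated": sorted(k for k, n in repeats.items() if n > 1),
    }


# Invented call numbers for illustration only.
master_rows = [{"Call Number": "A1"}, {"Call Number": "A2"}, {"Call Number": ""}]
updated_rows = [
    {"Call Number": "A1"},
    {"Call Number": "A1"},
    {"Call Number": "A3"},
    {"Call Number": ""},
    {},
]
report = call_number_report(master_rows, updated_rows)
```

A report like this makes the problem concrete: duplicated call numbers can be found and resolved, but rows counted under "missing" cannot be matched to anything, which is why restoring the call numbers across our sets was not possible.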

 

Complications

Another issue that came up when working with the data was the difficulty of sorting out the LCSH columns. The tags that currently sit in the columns are formatted so that each contains 3-4 individual descriptions (figure 2.2). The major problem with these descriptions is that their order is not consistent; they are simply separated by “--” (see figure 2.4). Our approach to sorting this data was straightforward. We used OpenRefine to separate each column’s cells into multiple columns (figure 2.2), then ran a text facet to find clusters and merge descriptions that are similar but spelled differently or contain typos (figure 2.3). That gave us 5 columns with all of the separated descriptions. To pare down our dataset, we decided to remove any descriptions unrelated to crime, which required editing each entry through the text facet. From that point we needed to consolidate the descriptions into one column, which proved to be beyond the scope of this single project.
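For readers without OpenRefine, the two steps above (splitting on “--” and grouping near-identical spellings) can be approximated in Python. This is a rough sketch under stated assumptions: the key function below is a simplified imitation of a fingerprint-style clustering key, not OpenRefine's own code, and the function names and sample cells are invented for the example.

```python
import re
import unicodedata
from collections import defaultdict


def split_heading(cell):
    """Split one subject-heading cell on '--' and trim each description."""
    return [part.strip() for part in cell.split("--") if part.strip()]


def fingerprint(text):
    """Reduce a description to a key shared by near-identical spellings.

    Simplified: strip accents and punctuation, lowercase, sort unique words.
    """
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    tokens = re.sub(r"[^\w\s]", "", ascii_text.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster(descriptions):
    """Return groups of differently written descriptions that share a key."""
    groups = defaultdict(set)
    for description in descriptions:
        groups[fingerprint(description)].add(description)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Invented sample cells for illustration only.
cells = ["Murder--Germany--Augsburg", "Germany -- murder.", "Robbery--Bavaria"]
descriptions = [d for cell in cells for d in split_heading(cell)]
clusters = cluster(descriptions)
```

A key of this kind only catches differences in case, punctuation, accents, and word order. It would not merge genuinely different words such as "murder" and "murderers"; deciding whether those belong together remains a judgment call for a person.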

Figure 2.2: 

Figure 2.3:

Figure 2.4:

We encountered another major hurdle in the amount of time OpenRefine took to edit each cell. Changing an unrelated description to either null or blank took 20-30 seconds on average because of the size of our dataset, regardless of the number of cells altered. Adding to that frustration was our mistaken expectation that we would be working with a refined dataset and would be able to focus on visualization methods and exploration of the data rather than on data sorting. To create a meaningful visualization that is representative of the data available, we believe it would take extensive analysis of the provided descriptions and a way to flesh out minute differences, such as why similar descriptions like “murderers” and “murder” both exist and what different aspect of the pamphlets each recognizes. These sorts of decisions are difficult to make without knowledge of how the Library of Congress Subject Headings are determined or what evidence there is to support their meaning. There is also a common description of canon law or Roman law which, while informative about the legislation behind why certain acts are considered crimes, is often used in a way that appears to describe a person acting against these laws, which is simply too vague a description of a crime to be meaningful.

Further Analysis

After finishing our initial project work, we decided it was worth spending extra time sorting out the Library of Congress Subject Headings (LCSH) descriptions of crime and punishment. Our approach was the same as the methods used to analyze all previous data, with the addition of more extensive data set cleaning and sorting. As previously mentioned, the information contained within the LCSH columns was jumbled and, if not random, very nearly so. We used OpenRefine again to place the descriptions, previously separated in each cell by “--”, into individual cells in new columns. This created an additional 4-5 columns for each LCSH column (figure 2.2), so around 12 new columns, which we text faceted to see all descriptions and then manually filtered by editing and deleting the descriptions not related to crime or punishment (figure 2.5). Due to the wide variation and obscurity of the descriptions, it is highly probable that certain crime or punishment terms were taken out of context and are therefore misrepresentative of what actually happened. This seems to be a common occurrence with data that is handled by multiple organizations or even multiple archivists, and especially by multiple students. Once each of the 12 columns contained only descriptions of crimes and/or punishments, condensing that information was a problem, because Google Sheets and Microsoft Excel did not offer a clear way of merging string values. A workaround that we discovered is the Google Sheets add-on Merge Values by www.ablebits.com. Since it is a third-party app, using it on files in the class’s shared folder in Google Drive was not permitted by Michigan State University’s Acceptable Use Policy. This is an institutional issue that is neither isolated nor unique, and it is not adequately transparent what this policy entails or why it is enforced. Nonetheless, dropping the dataset into a non-MSU Google Drive allowed us to use the add-on and merge all values into one column (figure 2.6).
Revisiting the first issue, each cell in the merged column again holds multiple descriptions, but now only terms of crime and punishment. Due to the range of descriptions, the best option for separating the column into one column for crimes and one column for punishments was to go cell by cell and manually transfer descriptions; this was also useful in making the descriptions more uniform and consistent with each other.
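The merge step, and a first pass at the crime and punishment split, could also be sketched in Python as an alternative to a spreadsheet add-on. This is a hypothetical illustration, not what we did: the column names, the function names, and the small punishment vocabulary (taken from the punishments named in our visualizations) are assumptions, and a word list like this would not replace the cell-by-cell review described above.

```python
def merge_columns(row, columns, separator="; "):
    """Join the non-empty values of several columns, dropping exact repeats."""
    merged = []
    for name in columns:
        value = (row.get(name) or "").strip()
        if value and value not in merged:
            merged.append(value)
    return separator.join(merged)


def split_crime_punishment(merged, punishment_terms, separator="; "):
    """Sort merged terms into crimes and punishments using a given word list."""
    crimes, punishments = [], []
    for term in merged.split(separator):
        if not term:
            continue
        if term.lower() in punishment_terms:
            punishments.append(term)
        else:
            crimes.append(term)
    return crimes, punishments


# Hypothetical column names and vocabulary, for illustration only.
LCSH_PARTS = ["LCSH_1_a", "LCSH_1_b", "LCSH_2_a", "LCSH_2_b"]
PUNISHMENTS = {"hanging", "death", "prison"}

row = {"LCSH_1_a": "Murder", "LCSH_1_b": "", "LCSH_2_a": "Hanging", "LCSH_2_b": "Murder"}
merged = merge_columns(row, LCSH_PARTS)
crimes, punishments = split_crime_punishment(merged, PUNISHMENTS)
```

Anything not on the punishment list is treated as a crime here, so obscure or ambiguous terms would still need to be checked by hand.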

Figure 2.5:

Figure 2.6:

With the cleaned LCSH data, we were able to create visualizations in RAW that incorporate year, crime, and punishment. The first visual we created, Figure 3.1, focused specifically on the crime committed in a certain city at a certain time. We wanted to create a visual that allows us to answer questions such as: which crimes were most common in which areas, whether there was a time period when certain crimes were committed at a higher rate, whether at a certain time crime was more prevalent in some cities than in others, and what that could tell us about the city, its citizens, and its social structure or culture. With these questions in mind, we chose the Treemap to visualize our data. The hierarchy for the map is city, the color represents crime, the size represents year, and the label includes city and year. Each of the colors represents a different crime: purple is killing of the elderly, red is assassination, gold is assault and battery, light green is arson by female offenders, dark green is theft by female offenders, teal is murder, light blue is murder and robbery, dark blue is murder, and pink is for entries in which a crime was not included. This visualization shows the types of crime that were most prevalent in a city in a given year. For example, looking at Munich in 1804, the visual shows us that murder was the most frequently committed crime. The visual displays a wide range of years pertaining to different cities, but three of the cities have the same color: in Munich, St. Gallen, and Nuremberg, murder is the most common crime.

Figure 3.1:

The cleaned data also allowed us to see the punishments for crimes committed. We decided to use the Treemap again to answer questions about punishments. Some of the questions we had in mind when deciding what kind of visual to create were why certain punishments were given for some crimes and not others, how sentencing in different cities compared, and whether certain punishments were more common in some years than in others and how that might have changed as time went on. Using RAW we created Figure 3.2, in which the hierarchy for the map is city, the color represents punishment, the size is according to year, and the labels include city and year. Each color represents a different punishment: red represents entries with no punishment associated, blue represents hanging, purple represents death, and green represents prison. The labels show that in 1758 and 1804 death and hanging were used as punishments more often; then, over 12 years later, prison, a less severe punishment, was used more often. This shows a development of the society and culture in the cities, in this case specifically in Straubing.

Figure 3.2

We then wanted to look at the crime and punishment data together to answer questions such as which sentences were commonly used to punish offenders for certain crimes. We also included the location to see whether there were trends in the crimes committed in different cities and whether each area responded with a certain punishment. In order to see these kinds of trends, we chose to use the Reingold-Tilford Tree in RAW to visualize the data. Figure 3.3 is what we created to answer these questions; it uses a hierarchy of city, crime, then punishment. This visual allows us to understand more about the data, especially its trends, and helps answer some of our questions. The graph shows the most prevalent type of crime or crimes for each city, then the punishments assigned to some of the crimes in certain cities. One trend the graph shows that might have been harder to find with the data in a different form is that both the city of Ansbach and the city of Augsburg have brigands and robbers as one of their major crimes, and both cities punish that offense with death. Another interesting finding from this visualization is that in the city of Mindelheim theft is punished by hanging specifically, while in the city of Neunhof theft, if committed by a female offender, is punished by death, which could have included more severe methods. Perhaps women had lower social standing than men at this time in Neunhof, so they were sentenced to various forms of death, some of which could have made for a more embarrassing public execution than hanging, or may have been seen as more painful than hanging.

Conclusion

Despite the complications and limitations of the data, we were still able to use visualization as a tool to answer new questions, draw new conclusions, and generally look at the data in a new way. Visualizations create a whole different set of opportunities for producing knowledge about our data. By posing a question, choosing which data would best answer it, figuring out what kind of picture we are looking for in the data, creating that picture, and analyzing the visual, we produce a whole new system of inquiry that offers its own unique set of affordances, implications, and restrictions. Visuals present trends, connections, groupings, and proportions in the data, and these various types of visualizations, along with others, order data so that it can be viewed in a way that allows new questions to be discovered, explored, and answered. Visualizations are more than interesting ways of looking at data; they open a new avenue of experimentation and create a new form of knowledge that changes the way in which we create, analyze, and think about knowledge and its production. The further analysis we did, which included the crime and punishment information, developed this point even more, highlighting the usefulness and added depth of the questions that can be answered using visualizations.