Addendum: The Shape of Data

Addendum: The Shape of Data#

The treemap project asks you to process hierarchical data, represented in files in JSON notation. However, most of the publicly available data you will find in the real world is not represented this way. You are more likely to encounter data in some “flat” tabular format, like comma-separated values (CSV) or relational database records. If you want to use the implicit hierarchical structure of that data to create treemaps or in other ways, you will need to understand the difference between the information (semantic content) represented by those flat structures and the physical data structures used to represent them. With this distinction, you can devise conversions from the data structures available to you to the data structures you need for your computational task.

Semantic content vs representation#

The same semantic content may be represented in different ways in a concrete data structure. For example, suppose we wanted a data structure to represent the data in this table conveying the income share of the richest 1% of the population in different countries in 1975 and 2022:

Country or region	1975	2022	Absolute Change	Relative Change
China	6.52%	15.72%	+9.20 pp	+141%
France	8.20%	12.69%	+4.49 pp	+55%
United States	10.43%	20.87%	+10.44 pp	+100%

Row-wise representation of tabular data#

We could represent this data as a list of lists by row or by column:

columns = {"region": 0, "1975": 1, "2022": 2, "Chng_Abs": 3, 
"Chng_Rel": 4}
rows = [["China", .0652, .1572, .092, 1.41], 
        ["France", .082, .1269, .0449, .55],
        ["United States", .1043, .2087, .1044, 1.0]]
rows[2][columns["Chng_Abs"]]      

0.1044

This row-wise organization might also be represented by a file with a line for each row. The comma-separated values (CSV) format used to exchange data with spreadsheet programs is a common example.

Country or region,1975,2022,Absolute Change,Relative Change
China,6.52%,15.72%,+9.20 pp,+141%
France,8.20%,12.69%,+4.49 pp,+55%
United States,10.43%,20.87%,+10.44 pp,+100%

Column-wise representation of tabular data#

The same information could be represented column-wise:

income_distribution = {
    "Region": ["China", "France", "United States"], 
    "1975": [.0652, .082, .1043],
    "2022": [.1572, .1269, .2087],
    "Chng_Abs": [.092, .0449, .1044],
    "Chng_Rel": [1.41, .55, 1.0]}
income_distribution["Chng_Abs"][2]

0.1044

Column-wise representation is common in scientific and statistical computing. It is the form used in Pandas dataframes, with each column represented as a NumPy array.

Hierarchical data#

Information is often categorized into a hierarchy. Hierarchical part-whole relations might sometimes be fixed and immutable. More often they are at least partly arbitrary and mutable conveniences. Consider, for example, the customary grouping of geographic regions into continents: Europe and Asia are sometimes considered a single continent, Eurasia, but traditionally have often been considered separate continents, with the consequence that Ankara is grouped with Bangkok and not with Athens. Statistical analyses by region may use the United Nations geoscheme, while the Internet Corporation for Assigned Names and Numbers uses a different grouping into just five major administrative regions, for example classifying American Samoa as part of North America.

One can similarly find conflicting and shifting hierarchical groupings for languages, for biological taxa, and for social and cultural groups, among many other classification systems. But as arbitrary as they may be, hierarchical grouping is often essential to making sense of the world. We may quarrel with a particular hierarchical organization, but it is hard to argue with the usefulness of hierarchical organization for data analysis.

Hierarchical part-whole relations can be represented by nesting collection data structures in Python. For example, global causes of death in 2019 could be represented by nested dictionary structures:

death_causes_2017 = {
   "Noncommunicable disease": { 
         "Cardiovascular" : 0.33, 
         "Cancers": 0.18, 
         "Respiratory": 0.07, 
         "Neurological": 0.039, 
         "Digestive": 0.045, 
         "Diabetes": 0.027, 
         "Other noncommunicable": 0.057
         }, 
   "Infectious disease": {
        "Pneumonia": 0.044, 
        "Diarrheal": 0.027, 
        "Tuberculosis": 0.02, 
        "HIV/AIDS": 0.015, 
        "Malaria": 0.011,
        "Other infectious": 0.021,
        },
   "Neonatal": 0.033,
   "Maternal": 0.004,
   "Nutritional": 0.004, 
   "Accidents": {
        "Transport accidents": 0.023,
        "Other accidents": 0.031
        },
   "Violence": {
        "Suicides": 0.013,
        "Homicides": 0.007,
        "Combat":  0.002,
        "Terrorism": 0.0005
   }
}

Such a structure may be represented in a file using a suitable data exchange format such as JSON or YAML. The Python snippet above is legal JSON. More often, we find hierarchical data that is “flattened” into a tabular format. For example, one free online converter from JSON to comma-separated values (CSV) produces column headers like Infectious disease__Pneumonia and Infectious disease__Malaria with a single row of numerical entries. More typically we would find one or more columns used for labels, and one or more additional columns used for numerical values.

In a “flat” tabular representation (e.g., a CSV file), a row may represent a path from the top-level category (e.g, “Infectious disease”) through sub-categories (e.g., “Pneumonia”), one per column, to one or more columns of numerical data. For example, consider the following hierarchy, represented in tree form:

Beverage consumption in a hypothetical organization

Each path from the top of the tree to the amounts of coffee and tea consumed in a particular part of this imaginary organization (teams A1a, A1b, A2, and B) will be represented by one row in the tabular representation. Intermediate levels A and A1 may be represented on separate rows:

Division	Subdivision	Team	Coffee	Tea
A
	A1
		A1a	13	24
		A1b	19	17
	A2		7	12
B			49	16

Alternatively, each row may contain the complete path:

Division	Subdivision	Team	Coffee	Tea
A	A1	A1a	13	24
A	A1	A1b	19	17
A	A2		7	12
B			49	16

In place of empty cells, as in subdivision and team for for B above, a label may be repeated. Applying this tactic to the mortality data, I have repeated “Neonatal” and “Maternal” in the “Cause” column:

Category	Cause	Proportion
Noncommunicable disease	Cardiovascular	0.33
Noncommunicable disease	Cancers	0.18
Noncommunicable disease	Respiratory	0.07
Noncommunicable disease	Neurological	0.039
Noncommunicable disease	Digestive	0.045
Noncommunicable disease	Diabetes	0.027
Noncommunicable disease	Other	0.057
Infectious disease	Pneumonia	0.044
Infectious disease	Diarrheal	0.027
Infectious disease	Tuberculosis	0.02
Infectious disease	HIV/AIDS	0.015
Infectious disease	Malaria	0.011
Infectious disease	Other	0.021
Neonatal	Neonatal	0.033
Maternal	Maternal	0.004
Nutritional	Nutritional	0.004
Accidents	Transport accidents	0.023
Accidents	Other accidents	0.031
Violence	Suicides	0.013
Violence	Homicides	0.007
Violence	Combat	0.002
Violence	Terrorism	0.0005

Indented lists#

An indented list is essentially equivalent to the flattened tabular format above, except that each label appears in a row by itself. Equivalently, one can think of an indented list as shorthand for the nested dictionary structure as written above:

Noncommunicable disease
- Cardiovascular: 0.33
- Cancers: 0.18
- Respiratory: 0.07
- Neurological: 0.039
- Digestive: 0.045
- Diabetes: 0.027
- Other noncommunicable: 0.057
Infectious disease
- Pneumonia: 0.044
- Diarrheal: 0.027
- Tuberculosis: 0.02
- HIV/AIDS: 0.015
- Malaria: 0.011
- Other: 0.021
Neonatal: 0.033
Maternal: 0.004
Nutritional: 0.004
Accidents
- Transport accidents: 0.023
- Other accidents: 0.031
Violence
- Suicides: 0.013
- Homicides: 0.007
- Combat: 0.002
- Terrorism: 0.0005

Other representations of hierarchical data#

There are additional variations in representations that you will encounter, including some that combine multiple tables that must be combined (“joined”) to recover the hierarchy and/or a unified flat representation. It is often relatively straightforward to convert among representations of the same information, provided we make a clear distinction between the semantic content of the data and its representation.

In the treemap project, we use only the libraries provided in a standard Python installation.
The restructure directory provides some additional Python programs to pre-process data and extract a hierarchical representation in JSON format. In a typical workflow, one might clean and summarize a data set with other tools before extracting a hierarchical representation. The SQL language supported by most relational databases provides operations including summarizing and combining tables. The Pandas framework for Python includes aggregation tools. One might even perform some data manipulation in a spreadsheet like Excel or Google Sheets, but often it is best to fully automate a workflow so that it can be reapplied consistently and efficiently to related or evolving datasets.

Summary#

The same information (semantic content) may be represented with differently structured data. External sources of data (say, a spreadsheet or database) may not be organized in a way that is convenient for the specific computations we wish to perform, such as grouping and summarizing the data. Hierarchical information, in particular, is often represented in some “flat” structure (e.g., one or more tables) in which the hierarchical structure is implicit. Often it is useful to reorganize such data as a first step in using it.

Example data preparation workflow#

Consider, for example, the course enrollment data used in our treemap project. The primary source of this data was a class roster, with a row for each student.

Class roster

The primary major code for each student was a column in that table, which was extracted to use as input in our enrollment analysis project. We summarized that table by counting occurrences of each major code and combined the summarized data with another table relating major codes like “ARCH” to full titles like “Architecture”.

Major codes

While we printed the output of that project in a human-friendly format, it is a trivial change to instead produce a table in CSV or JSON format.

The grouping of majors into schools, colleges, and divisions is represented by yet another file, a JSON structure representing that structure at University of Oregon. I wrote a simple Python program to combine the structure with the counts table.

As each of these steps is automated, they form a repeatable “workflow” that can be scripted to apply quickly and consistently to class rosters in subsequent course offerings or other courses.