Addendum: The Shape of Data#

The treemap project asks you to process hierarchical data, represented in files in JSON notation. However, most of the publicly available data you will find in the real world is not represented this way. You are more likely to encounter data in some “flat” tabular format, like comma-separated values (CSV) or relational database records. If you want to use the implicit hierarchical structure of that data to create treemaps or in other ways, you will need to understand the difference between the information (semantic content) represented by those flat structures and the physical data structures used to represent them. With this distinction, you can devise conversions from the data structures available to you to the data structures you need for your computational task.

Semantic content vs representation#

The same semantic content may be represented in different ways in a concrete data structure. For example, suppose we wanted a data structure to represent the data in this table conveying the income share of the richest 1% of the population in different countries in 1975 and 2022:

Country or region

1975

2022

Absolute Change

Relative Change

China

6.52%

15.72%

+9.20 pp

+141%

France

8.20%

12.69%

+4.49 pp

+55%

United States

10.43%

20.87%

+10.44 pp

+100%

Row-wise representation of tabular data#

We could represent this data as a list of lists by row or by column:

columns = {"region": 0, "1975": 1, "2022": 2, "Chng_Abs": 3, 
"Chng_Rel": 4}
rows = [["China", .0652, .1572, .092, 1.41], 
        ["France", .082, .1269, .0449, .55],
        ["United States", .1043, .2087, .1044, 1.0]]
rows[2][columns["Chng_Abs"]]      
0.1044

This row-wise organization might also be represented by a file with a line for each row. The comma-separated values (CSV) format used to exchange data with spreadsheet programs is a common example.

Country or region,1975,2022,Absolute Change,Relative Change
China,6.52%,15.72%,+9.20 pp,+141%
France,8.20%,12.69%,+4.49 pp,+55%
United States,10.43%,20.87%,+10.44 pp,+100%

Column-wise representation of tabular data#

The same information could be represented column-wise:

income_distribution = {
    "Region": ["China", "France", "United States"], 
    "1975": [.0652, .082, .1043],
    "2022": [.1572, .1269, .2087],
    "Chng_Abs": [.092, .0449, .1044],
    "Chng_Rel": [1.41, .55, 1.0]}
income_distribution["Chng_Abs"][2]
0.1044

Column-wise representation is common in scientific and statistical computing. It is the form used in Pandas dataframes, with each column represented as a NumPy array.

Hierarchical data#

Information is often categorized into a hierarchy. Hierarchical part-whole relations might sometimes be fixed and immutable. More often they are at least partly arbitrary and mutable conveniences. Consider, for example, the customary grouping of geographic regions into continents: Europe and Asia are sometimes considered a single continent, Eurasia, but traditionally have often been considered separate continents, with the consequence that Ankara is grouped with Bangkok and not with Athens. Statistical analyses by region may use the United Nations geoscheme, while the Internet Corporation for Assigned Names and Numbers uses a different grouping into just five major administrative regions, for example classifying American Samoa as part of North America.

One can similarly find conflicting and shifting hierarchical groupings for languages, for biological taxa, and for social and cultural groups, among many other classification systems. But as arbitrary as they may be, hierarchical grouping is often essential to making sense of the world. We may quarrel with a particular hierarchical organization, but it is hard to argue with the usefulness of hierarchical organization for data analysis.

Hierarchical part-whole relations can be represented by nesting collection data structures in Python. For example, global causes of death in 2019 could be represented by nested dictionary structures:

death_causes_2017 = {
   "Noncommunicable disease": { 
         "Cardiovascular" : 0.33, 
         "Cancers": 0.18, 
         "Respiratory": 0.07, 
         "Neurological": 0.039, 
         "Digestive": 0.045, 
         "Diabetes": 0.027, 
         "Other noncommunicable": 0.057
         }, 
   "Infectious disease": {
        "Pneumonia": 0.044, 
        "Diarrheal": 0.027, 
        "Tuberculosis": 0.02, 
        "HIV/AIDS": 0.015, 
        "Malaria": 0.011,
        "Other infectious": 0.021,
        },
   "Neonatal": 0.033,
   "Maternal": 0.004,
   "Nutritional": 0.004, 
   "Accidents": {
        "Transport accidents": 0.023,
        "Other accidents": 0.031
        },
   "Violence": {
        "Suicides": 0.013,
        "Homicides": 0.007,
        "Combat":  0.002,
        "Terrorism": 0.0005
   }
}

Such a structure may be represented in a file using a suitable data exchange format such as JSON or YAML. The Python snippet above is legal JSON. More often, we find hierarchical data that is “flattened” into a tabular format. For example, one free online converter from JSON to comma-separated values (CSV) produces column headers like Infectious disease__Pneumonia and Infectious disease__Malaria with a single row of numerical entries. More typically we would find one or more columns used for labels, and one or more additional columns used for numerical values.

In a “flat” tabular representation (e.g., a CSV file), a row may represent a path from the top-level category (e.g, “Infectious disease”) through sub-categories (e.g., “Pneumonia”), one per column, to one or more columns of numerical data. For example, consider the following hierarchy, represented in tree form:

Beverage consumption in a hypothetical organization

Each path from the top of the tree to the amounts of coffee and tea consumed in a particular part of this imaginary organization (teams A1a, A1b, A2, and B) will be represented by one row in the tabular representation. Intermediate levels A and A1 may be represented on separate rows:

Division

Subdivision

Team

Coffee

Tea

A

A1

A1a

13

24

A1b

19

17

A2

7

12

B

49

16

Alternatively, each row may contain the complete path:

Division

Subdivision

Team

Coffee

Tea

A

A1

A1a

13

24

A

A1

A1b

19

17

A

A2

7

12

B

49

16

In place of empty cells, as in subdivision and team for for B above, a label may be repeated. Applying this tactic to the mortality data, I have repeated “Neonatal” and “Maternal” in the “Cause” column:

Category

Cause

Proportion

Noncommunicable disease

Cardiovascular

0.33

Noncommunicable disease

Cancers

0.18

Noncommunicable disease

Respiratory

0.07

Noncommunicable disease

Neurological

0.039

Noncommunicable disease

Digestive

0.045

Noncommunicable disease

Diabetes

0.027

Noncommunicable disease

Other

0.057

Infectious disease

Pneumonia

0.044

Infectious disease

Diarrheal

0.027

Infectious disease

Tuberculosis

0.02

Infectious disease

HIV/AIDS

0.015

Infectious disease

Malaria

0.011

Infectious disease

Other

0.021

Neonatal

Neonatal

0.033

Maternal

Maternal

0.004

Nutritional

Nutritional

0.004

Accidents

Transport accidents

0.023

Accidents

Other accidents

0.031

Violence

Suicides

0.013

Violence

Homicides

0.007

Violence

Combat

0.002

Violence

Terrorism

0.0005

Indented lists#

An indented list is essentially equivalent to the flattened tabular format above, except that each label appears in a row by itself. Equivalently, one can think of an indented list as shorthand for the nested dictionary structure as written above:

  • Noncommunicable disease

    • Cardiovascular: 0.33

    • Cancers: 0.18

    • Respiratory: 0.07

    • Neurological: 0.039

    • Digestive: 0.045

    • Diabetes: 0.027

    • Other noncommunicable: 0.057

  • Infectious disease

    • Pneumonia: 0.044

    • Diarrheal: 0.027

    • Tuberculosis: 0.02

    • HIV/AIDS: 0.015

    • Malaria: 0.011

    • Other: 0.021

  • Neonatal: 0.033

  • Maternal: 0.004

  • Nutritional: 0.004

  • Accidents

    • Transport accidents: 0.023

    • Other accidents: 0.031

  • Violence

    • Suicides: 0.013

    • Homicides: 0.007

    • Combat: 0.002

    • Terrorism: 0.0005

Other representations of hierarchical data#

There are additional variations in representations that you will encounter, including some that combine multiple tables that must be combined (“joined”) to recover the hierarchy and/or a unified flat representation. It is often relatively straightforward to convert among representations of the same information, provided we make a clear distinction between the semantic content of the data and its representation.

In the treemap project, we use only the libraries provided in a standard Python installation.
The restructure directory provides some additional Python programs to pre-process data and extract a hierarchical representation in JSON format. In a typical workflow, one might clean and summarize a data set with other tools before extracting a hierarchical representation. The SQL language supported by most relational databases provides operations including summarizing and combining tables. The Pandas framework for Python includes aggregation tools. One might even perform some data manipulation in a spreadsheet like Excel or Google Sheets, but often it is best to fully automate a workflow so that it can be reapplied consistently and efficiently to related or evolving datasets.

Summary#

The same information (semantic content) may be represented with differently structured data. External sources of data (say, a spreadsheet or database) may not be organized in a way that is convenient for the specific computations we wish to perform, such as grouping and summarizing the data. Hierarchical information, in particular, is often represented in some “flat” structure (e.g., one or more tables) in which the hierarchical structure is implicit. Often it is useful to reorganize such data as a first step in using it.

Example data preparation workflow#

Consider, for example, the course enrollment data used in our treemap project. The primary source of this data was a class roster, with a row for each student.

Class roster

The primary major code for each student was a column in that table, which was extracted to use as input in our enrollment analysis project. We summarized that table by counting occurrences of each major code and combined the summarized data with another table relating major codes like “ARCH” to full titles like “Architecture”.

Major codes

While we printed the output of that project in a human-friendly format, it is a trivial change to instead produce a table in CSV or JSON format.

The grouping of majors into schools, colleges, and divisions is represented by yet another file, a JSON structure representing that structure at University of Oregon. I wrote a simple Python program to combine the structure with the counts table.

As each of these steps is automated, they form a repeatable “workflow” that can be scripted to apply quickly and consistently to class rosters in subsequent course offerings or other courses.