1.2: Exploring Categorical Data


An Overview of Exploratory Data Analysis


Exploratory data analysis, or EDA for short, can be thought of as a cycle:

  • Generate questions about our data.
  • Search for answers by visualizing, transforming, and modeling our data.
  • Use what we learn to refine our questions and/or generate new questions.

The main goal of EDA is to develop an understanding of our data. When we ask a question, the question focuses our attention on a specific part of the data set and helps us decide which graphs, models, or transformations to make.

Loading the dplyr Package


The dplyr package is one of the most useful R packages for data wrangling and EDA. Data wrangling generally refers to cleaning, reorganizing, and transforming data so that it can be more easily analyzed. dplyr also contains data sets that we can use to practice our wrangling and visualization skills. In this lab, we will work with the data set called storms.

# loads dplyr package
library(dplyr)

Finding Help Documentation


  • The code cell below opens a help tab with an index of the functions and data sets in the dplyr package.
# open glossary of dplyr functions
help(package = "dplyr")
  • The code cell below opens a help tab with information about the storms data set.
# opens help tab with info about storms data set
?storms

The Structure of Data


Data frames are two-dimensional data objects and are the fundamental data structure used by most of R’s libraries of functions and data sets.

  • Tabular data is tidy if each row corresponds to a different observation and each column corresponds to a different variable.

Each column of a data frame is a variable (stored as a vector), and different columns may have different data types.

  • If a variable is measured or counted by a number, it is called a quantitative or numerical variable.
    • Quantitative variables may be discrete (integers) or continuous (decimals).
  • If a variable groups observations into different categories or rankings, it is a qualitative or categorical variable.
    • The different categories of a qualitative variable are called levels or classes.
      • Levels are typically labeled with descriptive character strings or integers.
      • Levels may or may not have an ordering.
    • A categorical variable whose levels have a specified order is called an ordinal variable.
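
As a small illustration of levels and ordering, the sketch below builds an ordinal variable from made-up data. The vector sizes and its level labels are hypothetical and are not part of the storms data set.

```r
# hypothetical shirt sizes (made-up data for illustration only)
sizes <- c("small", "large", "medium", "small", "large")

# store as a categorical variable whose levels have a specified order
sizes.ord <- factor(sizes,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

levels(sizes.ord)      # the levels, listed in their specified order
is.ordered(sizes.ord)  # TRUE since the levels are ranked
sizes.ord < "large"    # comparisons are meaningful for ordinal variables
```

Without ordered = TRUE, the same call would create a categorical variable whose levels have no ranking.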

Exploring Our Data


Below are some common functions used to get a first introduction to our data:

  • summary(df) gives a numerical summary of all variables in a data frame with the generic name df.
  • glimpse(df) gives a glimpse of the data frame df.
  • str(df) summarizes the structure of all variables in the data frame df.
  • head(df) displays the first 6 rows of the data frame.
  • tail(df) displays the last 6 rows of the data frame.
  • View(df) opens the full data frame.
    • It is typically not recommended to include a View() command in your work since that opens the entire data set.
    • The goal of EDA is to provide nice summaries of a data set so we do not need to look at all the raw data!
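
The functions above can be complemented by a few base R helpers that report the size of a data frame directly. The sketch below is only an illustration: it uses iris, a small data set that ships with R and is not otherwise used in this section, and the functions dim(), nrow(), ncol(), and sapply() are additions that do not appear in the list above.

```r
# iris is a small data set included with R (an assumption for illustration)
dim(iris)            # number of rows (observations) and columns (variables)
nrow(iris)           # number of observations
ncol(iris)           # number of variables
sapply(iris, class)  # how each variable is stored
head(iris, 3)        # first 3 rows instead of the default 6
```

The same calls work on any data frame, so they offer one way to check counts without opening the raw data.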

Question 1


Let’s get to know the storms data set. Using some (or all) of the commands above, answer the following questions:

  • How many observations are in data set storms?
  • How many variables are in storms?
    • Which variables are quantitative and which are categorical?
    • Which categorical variables have a ranking? Which do not?
  • Hint: Avoid using the View() function. Instead, use head() or tail() if you want to get a sense of what the raw data looks like.

Experiment with some of the functions in the code cell below to answer the questions. Then type your answer in the space below.

summary(storms)  # summary of each variable in storms
     name                year          month             day       
 Length:19066       Min.   :1975   Min.   : 1.000   Min.   : 1.00  
 Class :character   1st Qu.:1993   1st Qu.: 8.000   1st Qu.: 8.00  
 Mode  :character   Median :2004   Median : 9.000   Median :16.00  
                    Mean   :2002   Mean   : 8.699   Mean   :15.78  
                    3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:24.00  
                    Max.   :2021   Max.   :12.000   Max.   :31.00  
                                                                   
      hour             lat             long                         status    
 Min.   : 0.000   Min.   : 7.00   Min.   :-109.30   tropical storm     :6684  
 1st Qu.: 5.000   1st Qu.:18.40   1st Qu.: -78.70   hurricane          :4684  
 Median :12.000   Median :26.60   Median : -62.25   tropical depression:3525  
 Mean   : 9.094   Mean   :26.99   Mean   : -61.52   extratropical      :2068  
 3rd Qu.:18.000   3rd Qu.:33.70   3rd Qu.: -45.60   other low          :1405  
 Max.   :23.000   Max.   :70.70   Max.   :  13.50   subtropical storm  : 292  
                                                    (Other)            : 408  
    category          wind           pressure      tropicalstorm_force_diameter
 Min.   :1.000   Min.   : 10.00   Min.   : 882.0   Min.   :   0.0              
 1st Qu.:1.000   1st Qu.: 30.00   1st Qu.: 987.0   1st Qu.:   0.0              
 Median :1.000   Median : 45.00   Median :1000.0   Median : 110.0              
 Mean   :1.898   Mean   : 50.02   Mean   : 993.6   Mean   : 146.3              
 3rd Qu.:3.000   3rd Qu.: 65.00   3rd Qu.:1007.0   3rd Qu.: 220.0              
 Max.   :5.000   Max.   :165.00   Max.   :1024.0   Max.   :1440.0              
 NA's   :14382                                     NA's   :9512                
 hurricane_force_diameter
 Min.   :  0.00          
 1st Qu.:  0.00          
 Median :  0.00          
 Mean   : 14.81          
 3rd Qu.:  0.00          
 Max.   :300.00          
 NA's   :9512            
#head(storms)  # prints first 6 rows to screen
#tail(storms)  # prints last 6 rows to screen
#glimpse(storms)  # gives a glimpse of the data set
#str(storms)   # summary of data structure

Solution to Question 1







Question 2


What additional information would you like to know about the storms data set that you were unable to find? What questions do you have about the data? In particular, what data is missing and why?

Solution to Question 2







Data Types


R has 6 basic data types:

  1. character: collections of characters. E.g., "a", "hello world!"
  2. double: decimal numbers. E.g., 1.2, 1.0
  3. integer: whole numbers. In R, you must add L to the end of a number to specify it as an integer. E.g., 1L is an integer but 1 is a double.
  4. logical: Boolean values, TRUE and FALSE
  5. complex: complex numbers. E.g., 1+3i
  6. raw: a type to hold raw bytes.

See the appendix Fundamentals of Working with Data for more information.

Checking Data Types Using typeof()


  • The typeof() function returns the R internal type or storage mode of any object.
typeof(1.0)
[1] "double"
typeof(2)
[1] "double"
typeof(3L)
[1] "integer"
typeof("hello")
[1] "character"
typeof(TRUE)
[1] "logical"
typeof(storms$status)
[1] "integer"
typeof(storms$year)
[1] "double"
typeof(storms$name)
[1] "character"
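
It may be surprising that typeof(storms$status) returns "integer" even though status holds category labels. A factor is stored internally as integer codes together with a set of level labels. The sketch below shows this with a small made-up factor; the name status.toy is hypothetical, and class() is a base R function not used elsewhere in this section.

```r
# hypothetical factor with two levels (made-up data)
status.toy <- factor(c("hurricane", "tropical storm", "hurricane"))

typeof(status.toy)      # "integer": the internal storage type
class(status.toy)       # "factor": how R treats the object
levels(status.toy)      # the category labels
as.integer(status.toy)  # integer code of each observation: 1 2 1
```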

Extracting a Variable By Name with $


In the command typeof(storms$status), notice we refer to just the status variable of the storms data frame. A variable from a data frame may be extracted using $ and then specifying the name of the desired variable.

  • First indicate the name of the data frame, storms.
  • Followed by a dollar sign $.
  • Then indicate the name of the variable, for example wind.
  • The result, storms$wind, is a vector that contains the wind speed recorded in each observation.
is.vector(storms)  # storms is not a vector
[1] FALSE
is.data.frame(storms)  # storms is a data frame
[1] TRUE
is.data.frame(storms$wind)  # wind is not a data frame
[1] FALSE
is.vector(storms$wind)  # wind is a vector
[1] TRUE
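
The $ operator is not the only way to pull out a single variable. As a hedged aside, the sketch below compares $ with double and single square brackets on a small made-up data frame; the data frame toy and its values are hypothetical.

```r
# hypothetical two-row data frame (made-up values)
toy <- data.frame(name = c("Amy", "Bob"), wind = c(25, 40))

toy$wind       # extract the wind column as a vector
toy[["wind"]]  # double brackets extract the same vector

identical(toy$wind, toy[["wind"]])  # TRUE
is.data.frame(toy["wind"])          # TRUE: single brackets keep a data frame
```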

Investigating Data Types with is.numeric()


  • The is.numeric(x) function tests whether or not an object x is numeric.
  • The is.character(x) function tests whether x is a character or not.
  • The is.factor(x) function tests whether x is a factor or not.
  • Note: Categorical data is typically stored as a factor in R.
is.numeric(storms$year)  # year is numeric
[1] TRUE
is.numeric(storms$category)  # category is also numeric
[1] TRUE
is.numeric(storms$name)  # name is not numeric
[1] FALSE
is.character(storms$name)  # name is character string
[1] TRUE
is.numeric(storms$status)  # status is not numeric
[1] FALSE
is.character(storms$status)  # status is not a character
[1] FALSE
is.factor(storms$status)  # status is a factor which is categorical
[1] TRUE

Converting Decimals to Integers


From the summary of the storms data set above, we see that the variables year and month are treated as numeric, and typeof() shows they are stored as double. The values of these variables, however, are whole numbers.

We can convert a variable from one data type to another using a function of the form as.[new_datatype]().

  • For example, to convert year to an integer, we use as.integer(storms$year).
  • To convert a data type to character, we can use as.character(x).
  • To convert to a decimal (double), we can use as.numeric(x).
typeof(storms$year)
[1] "double"
typeof(storms$month)
[1] "double"
storms$year <- as.integer(storms$year)
storms$month <- as.integer(storms$month)
typeof(storms$year)
[1] "integer"
typeof(storms$month)
[1] "integer"
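
The conversion above is safe because year and month already hold whole numbers. As a small sketch of what the as.[new_datatype]() functions do with other inputs (all values below are made up for illustration):

```r
as.integer(2.9)     # 2: the decimal part is dropped, not rounded
as.integer(-2.9)    # -2: values are truncated toward zero
round(2.9)          # 3: round first if rounding is what we want

as.character(2021)  # "2021": a number stored as a character string
as.numeric("3.5")   # 3.5: a character string converted to a double
typeof(as.numeric("3.5"))  # "double"
```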

Converting to Categorical Data with factor()


Sometimes we think a variable is one data type, but it is actually being stored (and thus interpreted by R) as a different data type. One common issue is that categorical data is stored as characters or integers. We would like observations with the same value to be grouped together.

  • The status variable in storms is being properly stored as a factor.
  • The category variable in storms is being stored as a numeric since the classes are integers.
summary(storms$category)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.000   1.000   1.898   3.000   5.000   14382 

The summary of category computes statistics such as mean and median. Typically with categorical data, we prefer to count how many observations are in each class of the variable.

  • In the code cell below, we convert category to a factor, and then observe the resulting summary.
storms$category <- factor(storms$category)
summary(storms$category)
    1     2     3     4     5  NA's 
 2478   973   579   539   115 14382 
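
One thing to watch for after converting numbers to a factor: converting back with as.numeric() returns the internal level codes, not the original values. The sketch below uses a made-up vector (ratings is hypothetical) to show the difference.

```r
# hypothetical numeric codes (made-up data)
ratings <- c(10, 30, 30, 20)
ratings.f <- factor(ratings)  # levels are "10", "20", "30"

summary(ratings.f)     # counts in each level instead of mean and median
as.numeric(ratings.f)  # 1 3 3 2: the level codes, not the original values

# convert to character first to recover the original numbers
as.numeric(as.character(ratings.f))  # 10 30 30 20
```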

Assignment of New Objects


Notice in the code cell above, we replace the original numeric values in the category column with values that are now stored as the levels of a categorical variable.

  • To store a data structure in the computer’s memory we must assign it a name.
    • In this case, we choose to give the variable the same name, and therefore overwrite the original category column.
  • Data structures can be stored using the assignment operator <-.
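
As a brief sketch of why assignment matters, the code below uses a made-up vector x (hypothetical) to show that a conversion is lost unless its result is assigned to a name.

```r
x <- c(1.7, 2.2, 3.9)   # hypothetical vector stored as double

as.integer(x)           # result is printed but not stored anywhere
typeof(x)               # still "double"

x.int <- as.integer(x)  # store the result under a new name; x is unchanged
typeof(x.int)           # "integer"

x <- as.integer(x)      # reuse the same name to overwrite the original
typeof(x)               # now "integer"
```

Storing the result under a new name keeps the original values available, while reusing the name replaces them.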

Frequency and Relative Frequency Tables


  • table(x) creates a frequency table for categorical variable x.
  • table(x, y) creates a two-way (or contingency) table for two categorical variables x and y.
  • prop.table([table]) creates a relative frequency table relative to the grand total.
    • The input of prop.table() must be a table.
    • prop.table([table], 1) creates a relative frequency table relative to total in each row.
    • prop.table([table], 2) creates a relative frequency table relative to total in each column.
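
To see how the three kinds of relative frequency differ, it can help to start with a table small enough to check by hand. The sketch below uses made-up data; the vectors pet and home are hypothetical and are not part of any data set in this section.

```r
# hypothetical survey of 6 households (made-up data)
pet  <- c("cat", "dog", "dog", "cat", "dog", "dog")
home <- c("apt", "apt", "house", "house", "house", "house")

pet.table <- table(pet, home)  # two-way table of counts
pet.table

prop.table(pet.table)     # each count divided by the grand total (6)
prop.table(pet.table, 1)  # each count divided by its row total
prop.table(pet.table, 2)  # each count divided by its column total

sum(prop.table(pet.table))         # all proportions add up to 1
rowSums(prop.table(pet.table, 1))  # each row adds up to 1
colSums(prop.table(pet.table, 2))  # each column adds up to 1
```

Checking which sums equal 1 is a quick way to confirm which total a relative frequency table is based on.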

Question 3


  1. Which month over the period from 1975-2021 had the greatest number of subtropical storms? Which table did you use to help answer your question?

  2. Which month over the period from 1975-2021 had the greatest proportion of subtropical storms? Which table did you use to help answer your question?

Run each of the four code cells below, and after interpreting the output of each, answer the questions in the space below.

my.table <- table(storms$month, storms$status)  # gives counts
my.table
    
     disturbance extratropical hurricane other low subtropical depression
  1            0            29         5         5                      0
  4            0            40         0         0                      4
  5            0            18         0        49                      5
  6           13           130        18        82                     35
  7           45           135       202       175                     11
  8           25           275      1038       317                     36
  9           41           732      2380       446                     34
  10          14           520       799       219                     22
  11           8           175       209        81                      4
  12           0            14        33        31                      0
    
     subtropical storm tropical depression tropical storm tropical wave
  1                  6                   2             23             0
  4                  3                   1             18             0
  5                 20                  49             60             0
  6                 12                 213            276             0
  7                  6                 397            625             7
  8                 23                 975           1696            55
  9                 72                1315           2448            41
  10                66                 413           1024             0
  11                42                 139            443             8
  12                42                  21             71             0
prop.grand <- prop.table(my.table)  # gives proportions relative to grand total
round(prop.grand, 4)
    
     disturbance extratropical hurricane other low subtropical depression
  1       0.0000        0.0015    0.0003    0.0003                 0.0000
  4       0.0000        0.0021    0.0000    0.0000                 0.0002
  5       0.0000        0.0009    0.0000    0.0026                 0.0003
  6       0.0007        0.0068    0.0009    0.0043                 0.0018
  7       0.0024        0.0071    0.0106    0.0092                 0.0006
  8       0.0013        0.0144    0.0544    0.0166                 0.0019
  9       0.0022        0.0384    0.1248    0.0234                 0.0018
  10      0.0007        0.0273    0.0419    0.0115                 0.0012
  11      0.0004        0.0092    0.0110    0.0042                 0.0002
  12      0.0000        0.0007    0.0017    0.0016                 0.0000
    
     subtropical storm tropical depression tropical storm tropical wave
  1             0.0003              0.0001         0.0012        0.0000
  4             0.0002              0.0001         0.0009        0.0000
  5             0.0010              0.0026         0.0031        0.0000
  6             0.0006              0.0112         0.0145        0.0000
  7             0.0003              0.0208         0.0328        0.0004
  8             0.0012              0.0511         0.0890        0.0029
  9             0.0038              0.0690         0.1284        0.0022
  10            0.0035              0.0217         0.0537        0.0000
  11            0.0022              0.0073         0.0232        0.0004
  12            0.0022              0.0011         0.0037        0.0000
prop.row <- prop.table(my.table, 1)  # gives proportions relative to row totals
round(prop.row, 4)
    
     disturbance extratropical hurricane other low subtropical depression
  1       0.0000        0.4143    0.0714    0.0714                 0.0000
  4       0.0000        0.6061    0.0000    0.0000                 0.0606
  5       0.0000        0.0896    0.0000    0.2438                 0.0249
  6       0.0167        0.1669    0.0231    0.1053                 0.0449
  7       0.0281        0.0842    0.1260    0.1092                 0.0069
  8       0.0056        0.0619    0.2338    0.0714                 0.0081
  9       0.0055        0.0975    0.3170    0.0594                 0.0045
  10      0.0045        0.1690    0.2597    0.0712                 0.0071
  11      0.0072        0.1578    0.1885    0.0730                 0.0036
  12      0.0000        0.0660    0.1557    0.1462                 0.0000
    
     subtropical storm tropical depression tropical storm tropical wave
  1             0.0857              0.0286         0.3286        0.0000
  4             0.0455              0.0152         0.2727        0.0000
  5             0.0995              0.2438         0.2985        0.0000
  6             0.0154              0.2734         0.3543        0.0000
  7             0.0037              0.2477         0.3899        0.0044
  8             0.0052              0.2196         0.3820        0.0124
  9             0.0096              0.1751         0.3260        0.0055
  10            0.0214              0.1342         0.3328        0.0000
  11            0.0379              0.1253         0.3995        0.0072
  12            0.1981              0.0991         0.3349        0.0000
prop.col <- prop.table(my.table, 2)  # gives proportions relative to column totals
round(prop.col, 4)
    
     disturbance extratropical hurricane other low subtropical depression
  1       0.0000        0.0140    0.0011    0.0036                 0.0000
  4       0.0000        0.0193    0.0000    0.0000                 0.0265
  5       0.0000        0.0087    0.0000    0.0349                 0.0331
  6       0.0890        0.0629    0.0038    0.0584                 0.2318
  7       0.3082        0.0653    0.0431    0.1246                 0.0728
  8       0.1712        0.1330    0.2216    0.2256                 0.2384
  9       0.2808        0.3540    0.5081    0.3174                 0.2252
  10      0.0959        0.2515    0.1706    0.1559                 0.1457
  11      0.0548        0.0846    0.0446    0.0577                 0.0265
  12      0.0000        0.0068    0.0070    0.0221                 0.0000
    
     subtropical storm tropical depression tropical storm tropical wave
  1             0.0205              0.0006         0.0034        0.0000
  4             0.0103              0.0003         0.0027        0.0000
  5             0.0685              0.0139         0.0090        0.0000
  6             0.0411              0.0604         0.0413        0.0000
  7             0.0205              0.1126         0.0935        0.0631
  8             0.0788              0.2766         0.2537        0.4955
  9             0.2466              0.3730         0.3662        0.3694
  10            0.2260              0.1172         0.1532        0.0000
  11            0.1438              0.0394         0.0663        0.0721
  12            0.1438              0.0060         0.0106        0.0000

Solution to Question 3







Visualizing Categorical Data


There are many different ways we can create visualizations to gain insight into our data and search for possible patterns and relations between different variables.

  • For one categorical variable: Bar charts (or bar plots) and pie charts are commonly used.
  • For two categorical variables: Grouped and stacked bar charts are commonly used.

Creating Bar Charts with plot()


  • The plot() function is the most broadly used function for plotting different data types.
  • The type of plot generated by plot() depends on the data type(s) and the number of variable(s) we input.
    • If we input one categorical variable that is stored as factor, we get a bar chart.
    • If we input one quantitative variable, the plot is typically not very useful.
    • If a categorical variable is stored incorrectly as quantitative, we do not get an appropriate plot.
  • If we use plot() to display a categorical variable, the variable must be stored as a factor!
    • Recall we converted category to a factor in a previous code cell.
    • The variable month is stored as a quantitative (integer) variable.
# plots appear in an array with 1 row and 2 columns
par(mfrow = c(1, 2))  # create an array of plots

plot(storms$category,  # categorical data
     main = "Hurricanes by Category",  # main title
     xlab = "Hurricane Category",  # horizontal axis label
     ylab = "Frequency",  # vertical axis label
     col = "steelblue")  # fill color of bars

plot(storms$month,  # quantitative data
     main = "Not Number of Storms in Month",  # main title
     xlab = "Index (Row of Observation)",  # horizontal axis label
     ylab = "Month")  # vertical axis label

Creating Bar Charts and Pie Charts from a Table


If we want to keep month stored as an integer, but would like to create a visualization to display the number of storms that occurred in each month, we can:

  1. First use the table() function to count how many storms occurred in each month.
  2. Then create a bar chart using the barplot() function or pie chart using pie().
# plots appear in an array with 1 row and 2 columns
par(mfrow = c(1, 2))  # create an array of plots

month.table <- table(storms$month)  # create table of month counts
category.table <- table(storms$category)  # create table of category counts

barplot(month.table,  # input table of month counts
        main = "Storms in Each Month",  # main title
        xlab = "Month",  # horizontal axis label
        ylab = "Frequency",  # vertical axis label
        col = "seagreen")  # fill color of bars

pie(category.table,  # input table of category counts
    main = "Hurricanes by Category")  # main title

Note that we pass tables of counts to both pie() and barplot(). Even if a variable is stored as a factor, we need to store the counts in a table first.

  • The category variable in storms is stored as a factor, but the code below still produces an error.
# see what happens if input is not a table
pie(storms$category)
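
The same behavior can be reproduced on a tiny made-up factor. In the sketch below, the vector grades is hypothetical, and tryCatch() is used only so that the error message is captured and printed rather than stopping the code.

```r
# hypothetical letter grades stored as a factor (made-up data)
grades <- factor(c("A", "B", "A", "C", "B", "A"))

# passing the factor directly gives an error; capture the message
msg <- tryCatch(pie(grades),
                error = function(e) conditionMessage(e))
msg  # the error message returned by pie()

# count the observations in each level first, then plot the counts
grade.table <- table(grades)
pie(grade.table,
    main = "Grades")  # main title
```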

Relationship Between Two Categorical Variables


Imagine we would like to compare the number of hurricanes of each category that occurred in each month. In this case, we would like to compare two qualitative variables, namely category and month.

Grouped Frequency Bar Charts


To create a bar chart displaying the number of hurricanes of each category that occurred in each month:

  1. First create a two-way table of counts.
    • The second variable (month) is displayed on the horizontal axis.
    • We get a separate bar for each level of the first variable (category).
  2. Input the table into the barplot() function.
    • Note the option beside = TRUE groups the bars for each month.
    • The default option beside = FALSE stacks the bars.
# two-way table of counts of category in each month
cat.table <- table(storms$category, storms$month)  # gives counts
# create a vector of colors
my.colors <- c("green", "purple", "grey", "red", "blue") 

# create side by side bar chart
barplot(cat.table,  # use counts from contingency table
        beside = TRUE,  # groups side-by-side
        main = "Category Hurricanes By Month",  # main title
        xlab = "Month",  # horizontal axis label
        col = my.colors,  # fill color of bars
        ylab = "Frequency")  # vertical axis label

# add a legend to plot
legend(x="topleft",  # place legend in top left
       legend=rownames(cat.table),  # get labels from row name in contingency table
       fill = my.colors)  # use same fill colors
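
Since the second variable of the table determines the groups on the horizontal axis, swapping the roles of the two variables changes the chart. A minimal sketch with made-up data is below; the vectors size and color are hypothetical. It uses t() to transpose the table and the barplot() option legend.text = TRUE, which labels the legend with the row names of the table.

```r
# hypothetical data on 6 items (made-up data)
size  <- c("S", "M", "M", "L", "S", "M")
color <- c("red", "red", "blue", "blue", "blue", "red")

toy.table <- table(size, color)  # rows are sizes, columns are colors

par(mfrow = c(1, 2))  # create an array of plots

# columns (color) form the groups, rows (size) form the bars in each group
barplot(toy.table,
        beside = TRUE,  # groups side-by-side
        legend.text = TRUE,  # legend labels from row names
        main = "Grouped by Color",  # main title
        ylab = "Frequency")  # vertical axis label

# transpose the table so size forms the groups instead
barplot(t(toy.table),
        beside = TRUE,  # groups side-by-side
        legend.text = TRUE,  # legend labels from row names
        main = "Grouped by Size",  # main title
        ylab = "Frequency")  # vertical axis label
```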

Stacked Relative Frequency Bar Charts


To create a bar chart displaying the relative frequency (or proportion) of hurricanes of each category that occurred in each month:

  1. First create a two-way table of relative frequencies.
    • Pay attention to whether you want the proportions relative to grand, row, or column totals.
  2. Input the table into the barplot() function.
    • The default option beside = FALSE stacks the bars.
cat.grand <- prop.table(cat.table)  # gives proportions relative to grand total
cat.row <- prop.table(cat.table, 1)  # gives proportions relative to row totals
cat.col <- prop.table(cat.table, 2)  # gives proportions relative to column totals
par(mfrow = c(1, 3))  # create an array of plots

# create stacked bar chart 1
barplot(cat.grand,  # use proportions from contingency table
        main = "Category Hurricanes By Month",  # main title
        xlab = "Month",  # horizontal axis label
        col = my.colors,  # color of bars
        ylab = "Proportion")  # vertical axis label

# add legend to plot
legend(x="topleft",  # place legend in top left
       legend=rownames(cat.grand),  # get labels
       fill = my.colors)  # use same colors

##########

# create stacked bar chart 2
barplot(cat.row,  # use proportions from contingency table
        main = "Category Hurricanes By Month",  # main title
        xlab = "Month",  # horizontal axis label
        col = my.colors,  # color of bars
        ylab = "Proportion")  # vertical axis label

# add legend to plot
legend(x="topleft",  # place legend in top left
       legend=rownames(cat.row),  # get labels
       fill = my.colors)  # use same colors

###########

# create stacked bar chart 3
barplot(cat.col,  # use proportions from contingency table
        main = "Category Hurricanes By Month",  # main title
        xlab = "Month",  # horizontal axis label
        col = my.colors,  # color of bars
        ylab = "Proportion")  # vertical axis label

# add legend to plot
legend(x="topleft",  # place legend in top left
       legend=rownames(cat.col),  # get labels
       fill = my.colors)  # use same colors

Note

In the middle bar chart, a stacked bar can rise above 1 because each proportion in the bar is relative to a different category (row) total. For example, the September proportions (one relative to each category total) add up to about \(2.8\), since September accounts for over half the total in 4 of the 5 hurricane categories.
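
The effect described in the note can be checked by hand on a small example. The sketch below uses made-up data; the vectors grp and mon are hypothetical stand-ins for category and month.

```r
# hypothetical data: 4 observations in group a and 2 in group b
grp <- c("a", "a", "a", "a", "b", "b")
mon <- c("Aug", "Sep", "Sep", "Sep", "Sep", "Oct")

toy <- table(grp, mon)  # rows are groups, columns are months

toy.row <- prop.table(toy, 1)  # proportions relative to row totals
toy.row

# a stacked bar adds up one column, but each entry uses a different row total
colSums(toy.row)  # Sep adds up to 0.75 + 0.50 = 1.25

# with column or grand totals, the heights behave as expected
colSums(prop.table(toy, 2))  # every column adds up to 1
sum(prop.table(toy))         # all entries add up to 1
```

Stacking proportions is easiest to interpret when the proportions in each bar share the same total.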

Question 4


Based on the three plots generated in the previous code cell, answer the questions below.

  1. Which month has the most hurricanes?

  2. In which month is the proportion of category 1 hurricanes greatest?

Solution to Question 4







Question 5


What are the differences between the three plots in the output of the previous code cell? Which of the three bar plots above do you believe best visualizes the occurrence of different category hurricanes by month? Which plot do you think is the least useful overall? Explain why.

Solution to Question 5







Exploring the General Social Survey Data Set


The data set gss_cat can be accessed from the forcats package. Below is a quote taken from the GSS Data Explorer website maintained by NORC at the University of Chicago:

The General Social Survey (GSS) is a project of NORC at the University of Chicago, with principal funding from the National Science Foundation. Since 1972, the GSS has been monitoring societal change and studying the growing complexity of American society.

The GSS is a publicly available national resource, and is one of the most frequently analyzed sources of information in the social sciences. GSS Data Explorer is one of many ways that NORC supports the dissemination of GSS data for use by legislators, policymakers, researchers, educators and others.

# load the forcats package
library(forcats)
# open help documentation for gss_cat
?gss_cat
# get numerical summary of variables
summary(gss_cat)
      year               marital           age                    race      
 Min.   :2000   No answer    :   17   Min.   :18.00   Other         : 1959  
 1st Qu.:2002   Never married: 5416   1st Qu.:33.00   Black         : 3129  
 Median :2006   Separated    :  743   Median :46.00   White         :16395  
 Mean   :2007   Divorced     : 3383   Mean   :47.18   Not applicable:    0  
 3rd Qu.:2010   Widowed      : 1807   3rd Qu.:59.00                         
 Max.   :2014   Married      :10117   Max.   :89.00                         
                                      NA's   :76                            
           rincome                   partyid            relig      
 $25000 or more:7363   Independent       :4119   Protestant:10846  
 Not applicable:7043   Not str democrat  :3690   Catholic  : 5124  
 $20000 - 24999:1283   Strong democrat   :3490   None      : 3523  
 $10000 - 14999:1168   Not str republican:3032   Christian :  689  
 $15000 - 19999:1048   Ind,near dem      :2499   Jewish    :  388  
 Refused       : 975   Strong republican :2314   Other     :  224  
 (Other)       :2603   (Other)           :2339   (Other)   :  689  
              denom          tvhours      
 Not applicable  :10072   Min.   : 0.000  
 Other           : 2534   1st Qu.: 1.000  
 No denomination : 1683   Median : 2.000  
 Southern baptist: 1536   Mean   : 2.981  
 Baptist-dk which: 1457   3rd Qu.: 4.000  
 United methodist: 1067   Max.   :24.000  
 (Other)         : 3134   NA's   :10146   

Question 6


Create a plot to visualize one categorical variable in the gss_cat data set. Based on your plot, comment on any interesting features of the variable you plotted.

# create a visualization for one categorical variable

Solution to Question 6







Question 7


Create a plot to visualize the relationship between two categorical variables in the gss_cat data set. Based on your plot, comment on any interesting features or relations between the two variables you plotted.

# create a visualization to illustrate the relation
# between two categorical variables

Solution to Question 7








Creative Commons License

Statistical Methods: Exploring the Uncertain by Adam Spiegler is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.