# loads dplyr package
library(dplyr)
1.2: Exploring Categorical Data
Additional Reading:
- See Overview of Plotting Data in R for further reading and examples about plotting in R.
- See Fundamentals of Working with Data for more information about data types and structures in R.
- The R Graph Gallery has examples of many other types of graphs.
An Overview of Exploratory Data Analysis
Exploratory data analysis, or EDA for short, can be thought of as a cycle:
- Generate questions about our data.
- Search for answers by visualizing, transforming, and modeling our data.
- Use what we learn to refine our questions and/or generate new questions.
The main goal of EDA is to develop an understanding of our data. When we ask a question, the question focuses our attention on a specific part of the data set and helps us decide which graphs, models, or transformations to make.
Loading the dplyr Package
The dplyr package is perhaps one of the most useful R packages for data wrangling and EDA. Data wrangling is generally the cleaning, reorganizing, and transforming of data so it can be more easily analyzed. dplyr also contains data sets that we can use to practice our wrangling and visualization skills. In this lab, we will work with the data set called storms.
- See official dplyr documentation for a comprehensive tutorial.
- Run the code cell containing library(dplyr) to load the dplyr package.
Finding Help Documentation
- The code cell below opens a help tab that lists the documented functions and data sets in the dplyr package.
# open glossary of dplyr functions
help(package = "dplyr")
- The code cell below opens a help tab with information about the storms data set.
# opens help tab with info about storms data set
?storms
The Structure of Data
Data frames are two-dimensional data objects and are the fundamental data structure used by most of R’s libraries of functions and data sets.
- Tabular data is tidy if each row corresponds to a different observation and each column corresponds to a different variable.
Each column of a data frame is a variable (stored as a vector) of possibly different data types.
- If a variable is measured or counted by a number, it is called a quantitative or numerical variable.
  - Quantitative variables may be discrete (integers) or continuous (decimals).
- If a variable groups observations into different categories or rankings, it is a qualitative or categorical variable.
  - The different categories of a qualitative variable are called levels or classes.
  - Levels are typically labeled with descriptive character strings or integers.
  - Levels may or may not have an ordering.
  - A categorical variable whose levels have a specified order is called an ordinal variable.
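As a small illustration of these ideas, the sketch below builds a toy data frame with one character variable, one quantitative variable, and one ordinal variable. The names toy, storm, wind, and size are made up for this example and are not part of the storms data set.

```r
# toy data frame: each row is an observation, each column is a variable
toy <- data.frame(
  storm = c("Amy", "Bob", "Cal", "Dot"),            # labels stored as character strings
  wind  = c(35.5, 80, 120, 60),                     # quantitative (continuous)
  size  = factor(c("small", "large", "medium", "small"),
                 levels = c("small", "medium", "large"),
                 ordered = TRUE),                   # ordinal: levels have an order
  stringsAsFactors = FALSE                          # keep storm as character in older R versions
)

col.classes <- sapply(toy, function(v) class(v)[1])  # first class of each column
col.classes
levels(toy$size)            # the levels, in their specified order
is.ordered(toy$size)        # TRUE for an ordinal factor
toy$size[1] < toy$size[2]   # TRUE: comparisons follow the level order (small < large)
```

Setting ordered = TRUE is one way to record that the levels have a ranking; an unordered factor would be the natural choice for categories with no ranking.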
Exploring Our Data
Below are some common functions used to get a first introduction to our data:
- summary(df) gives a numerical summary of all variables in a data frame with generic name df.
- glimpse(df) gives a glimpse of the data frame df.
- str(df) summarizes the structure of all variables in data frame df.
- head(df) shows the first 6 rows of the data frame.
- tail(df) shows the last 6 rows of the data frame.
- View(df) opens the full data frame.
  - It typically is not recommended to include a View() command in your work since that opens the entire data set.
  - The goal of EDA is to provide nice summaries of a data set so we do not need to look at all the raw data!
Question 1
Let’s get to know the storms data set. Using some (or all) of the commands above, answer the following questions:
- How many observations are in data set storms?
- How many variables are in storms?
- Which variables are quantitative and which are categorical?
- Which categorical variables have a ranking? Which do not?
- Hint: Avoid using the View() function. Instead, use head() or tail() if you want to get a sense of what the raw data looks like.
Experiment with some of the functions in the code cell below to answer the questions. Then type your answer in the space below.
summary(storms) # summary of each variable in storms
name year month day
Length:19066 Min. :1975 Min. : 1.000 Min. : 1.00
Class :character 1st Qu.:1993 1st Qu.: 8.000 1st Qu.: 8.00
Mode :character Median :2004 Median : 9.000 Median :16.00
Mean :2002 Mean : 8.699 Mean :15.78
3rd Qu.:2012 3rd Qu.: 9.000 3rd Qu.:24.00
Max. :2021 Max. :12.000 Max. :31.00
hour lat long status
Min. : 0.000 Min. : 7.00 Min. :-109.30 tropical storm :6684
1st Qu.: 5.000 1st Qu.:18.40 1st Qu.: -78.70 hurricane :4684
Median :12.000 Median :26.60 Median : -62.25 tropical depression:3525
Mean : 9.094 Mean :26.99 Mean : -61.52 extratropical :2068
3rd Qu.:18.000 3rd Qu.:33.70 3rd Qu.: -45.60 other low :1405
Max. :23.000 Max. :70.70 Max. : 13.50 subtropical storm : 292
(Other) : 408
category wind pressure tropicalstorm_force_diameter
Min. :1.000 Min. : 10.00 Min. : 882.0 Min. : 0.0
1st Qu.:1.000 1st Qu.: 30.00 1st Qu.: 987.0 1st Qu.: 0.0
Median :1.000 Median : 45.00 Median :1000.0 Median : 110.0
Mean :1.898 Mean : 50.02 Mean : 993.6 Mean : 146.3
3rd Qu.:3.000 3rd Qu.: 65.00 3rd Qu.:1007.0 3rd Qu.: 220.0
Max. :5.000 Max. :165.00 Max. :1024.0 Max. :1440.0
NA's :14382 NA's :9512
hurricane_force_diameter
Min. : 0.00
1st Qu.: 0.00
Median : 0.00
Mean : 14.81
3rd Qu.: 0.00
Max. :300.00
NA's :9512
#head(storms) # prints first 6 rows to screen
#tail(storms) # prints last 6 rows to screen
#glimpse(storms) # gives a glimpse of the data set
#str(storms) # summary of data structure
Solution to Question 1
Question 2:
What additional information would you like to know about the storms
data set that you were unable to find? What questions do you have about the data? In particular, what data is missing and why?
Solution to Question 2:
Data Types
R has 6 basic data types:
- character: collections of characters. E.g., "a", "hello world!"
- double: decimal numbers. E.g., 1.2, 1.0
- integer: whole numbers. In R, you must add L to the end of a number to specify it as an integer. E.g., 1L is an integer but 1 is a double.
- logical: Boolean values, TRUE and FALSE
- complex: complex numbers. E.g., 1+3i
- raw: a type to hold raw bytes.
See the appendix Fundamentals of Working with Data for more information.
Checking Data Types Using typeof()
- The typeof() function returns the R internal type or storage mode of any object.
typeof(1.0)
[1] "double"
typeof(2)
[1] "double"
typeof(3L)
[1] "integer"
typeof("hello")
[1] "character"
typeof(TRUE)
[1] "logical"
typeof(storms$status)
[1] "integer"
typeof(storms$year)
[1] "double"
typeof(storms$name)
[1] "character"
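The result "integer" for typeof(storms$status) may look surprising for a categorical variable. A factor is stored internally as integer codes together with a set of level labels. The sketch below uses a made-up vector f (not part of storms) to show the difference between how a factor is stored and how R treats it.

```r
f <- factor(c("low", "high", "low", "medium"))  # made-up categorical data

typeof(f)       # "integer": how the values are stored internally
class(f)        # "factor": how R treats the object
levels(f)       # labels, sorted alphabetically by default: "high" "low" "medium"
as.integer(f)   # integer codes pointing into levels(f): 2 1 2 3
```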
Extracting a Variable By Name with $
In the command typeof(storms$status), notice we refer to just the status variable of the storms data frame. A variable from a data frame may be extracted using $ and then specifying the name of the desired variable.
- First indicate the name of the data frame, storms.
- Followed by a dollar sign $.
- Then indicate the name of the variable, for example wind.
- The result storms$wind is a vector that contains the wind speed recorded in each row of storms.
is.vector(storms) # storms is not a vector
[1] FALSE
is.data.frame(storms) # storms is a data frame
[1] TRUE
is.data.frame(storms$wind) # wind is not a data frame
[1] FALSE
is.vector(storms$wind) # wind is a vector
[1] TRUE
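For comparison, a column can also be extracted with double brackets. The sketch below uses a made-up data frame df (an assumption for illustration, not the storms data) to check that both forms return the same vector, while single brackets keep a data frame.

```r
# made-up data frame with two columns
df <- data.frame(wind = c(30L, 45L, 65L),
                 name = c("Amy", "Bob", "Cal"),
                 stringsAsFactors = FALSE)

by.dollar  <- df$wind        # extract by name with $
by.bracket <- df[["wind"]]   # equivalent extraction with [[ ]]

identical(by.dollar, by.bracket)   # TRUE: both give the same vector
is.vector(by.dollar)               # TRUE: a single column is a vector
is.data.frame(df["wind"])          # TRUE: single brackets return a data frame
```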
Investigating Data Types with is.numeric()
- The is.numeric(x) function tests whether or not an object x is numeric.
- The is.character(x) function tests whether x is a character or not.
- The is.factor(x) function tests whether x is a factor or not.
- Note: Categorical data is typically stored as a factor in R.
is.numeric(storms$year) # year is numeric
[1] TRUE
is.numeric(storms$category) # category is also numeric
[1] TRUE
is.numeric(storms$name) # name is not numeric
[1] FALSE
is.character(storms$name) # name is character string
[1] TRUE
is.numeric(storms$status) # status is not numeric
[1] FALSE
is.character(storms$status) # status is not a character
[1] FALSE
is.factor(storms$status) # status is a factor which is categorical
[1] TRUE
Converting Decimals to Integers
From the summary of the storms data set we first found above, we see that the variables year and month are being stored as double. These variables actually are integer values.
We can convert a variable from one data type to another using as.[new_datatype]().
- For example, to convert year to integer, we use as.integer(storms$year).
- To convert a data type to character, we can use as.character(x).
- To convert to a decimal (double), we can use as.numeric(x).
typeof(storms$year)
[1] "double"
typeof(storms$month)
[1] "double"
storms$year <- as.integer(storms$year)
storms$month <- as.integer(storms$month)
typeof(storms$year)
[1] "integer"
typeof(storms$month)
[1] "integer"
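The conversion above is safe because year and month already hold whole numbers. As a hedged side note, the sketch below uses a made-up vector x to show what as.integer() does when the values are not whole numbers: the decimal part is dropped rather than rounded.

```r
x <- c(3.0, 2.9, -2.9)              # made-up decimal values

x.int <- as.integer(x)              # decimal part is dropped (truncation toward zero)
x.int                               # 3 2 -2
x.rounded <- as.integer(round(x))   # round first if rounding is what we intend
x.rounded                           # 3 3 -3
typeof(x.int)                       # "integer"

as.character(2021L)                 # "2021"
as.numeric("3.5")                   # 3.5
```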
Converting to Categorical Data with factor()
Sometimes we think a variable is one data type, but it is actually being stored (and thus interpreted by R) as a different data type. One common issue is that categorical data is stored as characters or integers. We would like observations with the same values to be grouped together.
- The status variable in storms is being properly stored as a factor.
- The category variable in storms is being stored as numeric since the classes are integers.
summary(storms$category)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 1.000 1.000 1.898 3.000 5.000 14382
The summary of category computes statistics such as mean and median. Typically with categorical data, we prefer to count how many observations are in each class of the variable.
- In the code cell below, we convert category to a factor, and then observe the resulting summary.
storms$category <- factor(storms$category)
summary(storms$category)
1 2 3 4 5 NA's
2478 973 579 539 115 14382
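One thing to watch for after converting numbers to a factor is converting back. The sketch below uses made-up values (cat.num and cat.fac are hypothetical names) to show that as.integer() on a factor returns the internal level codes, which need not match the original values, while going through as.character() first recovers them.

```r
cat.num <- c(1, 3, 5, 3, NA)        # made-up category values
cat.fac <- factor(cat.num)          # levels are "1", "3", "5"

summary(cat.fac)                    # counts per level, plus the number of NA's

codes  <- as.integer(cat.fac)                 # internal codes: 1 2 3 2 NA
values <- as.numeric(as.character(cat.fac))   # original values: 1 3 5 3 NA
```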
Assignment of New Objects
Notice in the code cell above, we replace the original integers in the category column with integers that are now stored as different levels of a categorical variable.
- To store a data structure in the computer’s memory we must assign it a name.
  - In this case, we choose to give the variable the same name, and therefore overwrite the original category column.
- Data structures can be stored using the assignment operator <-.
Frequency and Relative Frequency Tables
- table(x) creates a frequency table for categorical variable x.
- table(x, y) creates a two-way (or contingency) table for two categorical variables x and y.
- prop.table([table]) creates a relative frequency table relative to the grand total.
  - The input of prop.table() must be a table.
  - prop.table([table], 1) creates a relative frequency table relative to the total in each row.
  - prop.table([table], 2) creates a relative frequency table relative to the total in each column.
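To see how the three kinds of relative frequency differ, the sketch below applies these functions to two made-up categorical vectors (group and kind are hypothetical names, not variables in storms) and checks which totals equal 1 in each case.

```r
# made-up categorical data for six observations
group <- c("a", "a", "b", "b", "b", "a")
kind  <- c("x", "y", "x", "x", "y", "x")

counts <- table(group, kind)        # two-way table of counts
counts

p.grand <- prop.table(counts)       # relative to the grand total
p.row   <- prop.table(counts, 1)    # relative to each row total
p.col   <- prop.table(counts, 2)    # relative to each column total

sum(p.grand)      # all cells together sum to 1
rowSums(p.row)    # each row sums to 1
colSums(p.col)    # each column sums to 1
```

Checking which sums equal 1 is a quick way to confirm which total a table of proportions is relative to.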
Question 3
Which month over the period from 1975-2021 had the greatest number of subtropical storms? Which table did you use to help answer your question?
Which month over the period from 1975-2021 had the greatest proportion of subtropical storms? Which table did you use to help answer your question?
Run each of the four code cells below, and after interpreting the output of each, answer the questions in the space below.
my.table <- table(storms$month, storms$status) # gives counts
my.table
disturbance extratropical hurricane other low subtropical depression
1 0 29 5 5 0
4 0 40 0 0 4
5 0 18 0 49 5
6 13 130 18 82 35
7 45 135 202 175 11
8 25 275 1038 317 36
9 41 732 2380 446 34
10 14 520 799 219 22
11 8 175 209 81 4
12 0 14 33 31 0
subtropical storm tropical depression tropical storm tropical wave
1 6 2 23 0
4 3 1 18 0
5 20 49 60 0
6 12 213 276 0
7 6 397 625 7
8 23 975 1696 55
9 72 1315 2448 41
10 66 413 1024 0
11 42 139 443 8
12 42 21 71 0
prop.grand <- prop.table(my.table) # gives proportions relative to grand total
round(prop.grand, 4)
disturbance extratropical hurricane other low subtropical depression
1 0.0000 0.0015 0.0003 0.0003 0.0000
4 0.0000 0.0021 0.0000 0.0000 0.0002
5 0.0000 0.0009 0.0000 0.0026 0.0003
6 0.0007 0.0068 0.0009 0.0043 0.0018
7 0.0024 0.0071 0.0106 0.0092 0.0006
8 0.0013 0.0144 0.0544 0.0166 0.0019
9 0.0022 0.0384 0.1248 0.0234 0.0018
10 0.0007 0.0273 0.0419 0.0115 0.0012
11 0.0004 0.0092 0.0110 0.0042 0.0002
12 0.0000 0.0007 0.0017 0.0016 0.0000
subtropical storm tropical depression tropical storm tropical wave
1 0.0003 0.0001 0.0012 0.0000
4 0.0002 0.0001 0.0009 0.0000
5 0.0010 0.0026 0.0031 0.0000
6 0.0006 0.0112 0.0145 0.0000
7 0.0003 0.0208 0.0328 0.0004
8 0.0012 0.0511 0.0890 0.0029
9 0.0038 0.0690 0.1284 0.0022
10 0.0035 0.0217 0.0537 0.0000
11 0.0022 0.0073 0.0232 0.0004
12 0.0022 0.0011 0.0037 0.0000
prop.row <- prop.table(my.table, 1) # gives proportions relative to row totals
round(prop.row, 4)
disturbance extratropical hurricane other low subtropical depression
1 0.0000 0.4143 0.0714 0.0714 0.0000
4 0.0000 0.6061 0.0000 0.0000 0.0606
5 0.0000 0.0896 0.0000 0.2438 0.0249
6 0.0167 0.1669 0.0231 0.1053 0.0449
7 0.0281 0.0842 0.1260 0.1092 0.0069
8 0.0056 0.0619 0.2338 0.0714 0.0081
9 0.0055 0.0975 0.3170 0.0594 0.0045
10 0.0045 0.1690 0.2597 0.0712 0.0071
11 0.0072 0.1578 0.1885 0.0730 0.0036
12 0.0000 0.0660 0.1557 0.1462 0.0000
subtropical storm tropical depression tropical storm tropical wave
1 0.0857 0.0286 0.3286 0.0000
4 0.0455 0.0152 0.2727 0.0000
5 0.0995 0.2438 0.2985 0.0000
6 0.0154 0.2734 0.3543 0.0000
7 0.0037 0.2477 0.3899 0.0044
8 0.0052 0.2196 0.3820 0.0124
9 0.0096 0.1751 0.3260 0.0055
10 0.0214 0.1342 0.3328 0.0000
11 0.0379 0.1253 0.3995 0.0072
12 0.1981 0.0991 0.3349 0.0000
prop.col <- prop.table(my.table, 2) # gives proportions relative to column totals
round(prop.col, 4)
disturbance extratropical hurricane other low subtropical depression
1 0.0000 0.0140 0.0011 0.0036 0.0000
4 0.0000 0.0193 0.0000 0.0000 0.0265
5 0.0000 0.0087 0.0000 0.0349 0.0331
6 0.0890 0.0629 0.0038 0.0584 0.2318
7 0.3082 0.0653 0.0431 0.1246 0.0728
8 0.1712 0.1330 0.2216 0.2256 0.2384
9 0.2808 0.3540 0.5081 0.3174 0.2252
10 0.0959 0.2515 0.1706 0.1559 0.1457
11 0.0548 0.0846 0.0446 0.0577 0.0265
12 0.0000 0.0068 0.0070 0.0221 0.0000
subtropical storm tropical depression tropical storm tropical wave
1 0.0205 0.0006 0.0034 0.0000
4 0.0103 0.0003 0.0027 0.0000
5 0.0685 0.0139 0.0090 0.0000
6 0.0411 0.0604 0.0413 0.0000
7 0.0205 0.1126 0.0935 0.0631
8 0.0788 0.2766 0.2537 0.4955
9 0.2466 0.3730 0.3662 0.3694
10 0.2260 0.1172 0.1532 0.0000
11 0.1438 0.0394 0.0663 0.0721
12 0.1438 0.0060 0.0106 0.0000
Solution to Question 3
Visualizing Categorical Data
There are many different ways we can create visualizations to gain insight into our data and search for possible patterns and relations between different variables.
- For one categorical variable: Bar charts (or bar plots) and pie charts are commonly used.
- For two categorical variables: Grouped and stacked bar charts are commonly used.
Creating Bar Charts with plot()
- The plot() function is the most broadly used function for plotting different data types.
- The type of plot generated by plot() depends on the data type(s) and the number of variable(s) we input.
  - If we input one categorical variable that is stored as factor, we get a bar chart.
  - If we input one quantitative variable, the plot is typically not very useful.
  - If a categorical variable is stored incorrectly as quantitative, we do not get an appropriate plot.
- If we use plot() to display a categorical variable, the variable must be stored as a factor!
  - Recall we converted category to a factor in a previous code cell.
  - The variable month is a quantitative variable.
# plots appear in an array with 1 row and 2 columns
par(mfrow = c(1, 2)) # create an array of plots
plot(storms$category, # categorical data
main = "Hurricanes by Category", # main title
xlab = "Hurricane Category", # horizontal axis label
ylab = "Frequency", # vertical axis label
col = "steelblue") # fill color of bars
plot(storms$month, # quantitative data
main = "Not Number of Storms in Month", # main title
xlab = "Index (Row of Observation)", # horizontal axis label
ylab = "Month") # vertical axis label
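The contrast between the two plots can be reproduced on a small scale. The sketch below uses a made-up numeric vector month (a stand-in for illustration, not storms$month) and shows that wrapping it in factor() inside the call is enough for plot() to draw a bar chart, without changing how the variable is stored.

```r
month <- c(8, 9, 9, 10, 9, 8, 11)    # made-up month numbers stored as numeric

pdf(NULL)              # null graphics device so this sketch runs without drawing;
                       # drop this line and dev.off() below to see the plots
plot(month)            # numeric input: values plotted against their index
plot(factor(month))    # factor input: bar chart of counts for each month
invisible(dev.off())

month.counts <- table(factor(month))  # the counts that the bar chart displays
month.counts
typeof(month)                         # still "double": month itself is unchanged
```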
Creating Bar Charts and Pie Charts from a Table
If we want to keep month stored as an integer, but would like to create a visualization to display the number of storms that occurred in each month, we can:
- First use the table() function to count how many storms occurred in each month.
- Then create a bar chart using the barplot() function or a pie chart using pie().
# plots appear in an array with 1 row and 2 columns
par(mfrow = c(1, 2)) # create an array of plots
month.table <- table(storms$month) # create table of month counts
category.table <- table(storms$category) # create table of category counts

barplot(month.table, # input table of month counts
main = "Storms in Each Month", # main title
xlab = "Month", # horizontal axis label
ylab = "Frequency", # vertical axis label
col = "seagreen") # fill color of bars
pie(category.table, # input table of category counts
main = "Hurricanes by Category") # main title
Note that pie() and barplot() both take tables as inputs. Even if a variable is stored as a factor, we need to store the counts in a table first.
- The category variable in storms is stored as a factor, but the code below still produces an error.
# see what happens if input is not a table
pie(storms$category)
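The reason for the error is that pie() expects non-negative numbers, and a factor is not numeric. The sketch below uses a made-up factor f (an assumption for illustration) and tryCatch() to capture the error message instead of stopping, then draws the pie chart from a table of counts.

```r
f <- factor(c("1", "2", "1", "3", "1"))   # made-up category labels

# pie() applied directly to a factor fails; capture the message instead of stopping
result <- tryCatch(
  { pie(f); "no error" },
  error = function(e) conditionMessage(e)
)
result

is.numeric(f)          # FALSE: a factor is not numeric
is.numeric(table(f))   # TRUE: the counts in the table are numbers

pdf(NULL)              # null graphics device so this sketch runs without drawing;
                       # drop this line and dev.off() below to see the chart
pie(table(f))          # works once the counts are stored in a table
invisible(dev.off())
```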
Relationship Between Two Categorical Variables
Imagine we would like to compare the number of hurricanes of each category that occurred in each month. In this case, we would like to compare two qualitative variables, namely category and month.
Grouped Frequency Bar Charts
To create a bar chart displaying the number of hurricanes of each category that occurred in each month:
- First create a two-way table of counts.
  - The second variable (month) is displayed on the horizontal axis.
  - We get a separate bar for each level of the first variable (category).
- Input the table into the barplot() function.
  - Note the option beside = TRUE groups the bars for each month.
  - The default option beside = FALSE stacks the bars.
# two-way table of counts of category in each month
cat.table <- table(storms$category, storms$month) # gives counts

# create a vector of colors
my.colors <- c("green", "purple", "grey", "red", "blue")

# create side by side bar chart
barplot(cat.table, # use counts from contingency table
beside = TRUE, # groups side-by-side
main = "Category Hurricanes By Month", # main title
xlab = "Month", # horizontal axis label
col = my.colors, # fill color of bars
ylab = "Frequency") # vertical axis label
# add a legend to plot
legend(x="topleft", # place legend in top left
legend=rownames(cat.table), # get labels from row name in contingency table
fill = my.colors) # use same fill colors
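As a possible alternative to a separate legend() call, barplot() can build the legend itself through its legend.text and args.legend arguments. The sketch below is a minimal example on a made-up table of counts (toy.table and toy.colors are hypothetical names) and also compares grouped and stacked layouts through the bar midpoints that barplot() returns.

```r
# made-up two-way table: rows are categories, columns are months
# (matrix() fills by column, so month 8 has counts 5, 2, 1 and month 9 has 8, 4, 3)
toy.table <- matrix(c(5, 2, 1, 8, 4, 3),
                    nrow = 3,
                    dimnames = list(category = c("1", "2", "3"),
                                    month = c("8", "9")))
toy.colors <- c("green", "purple", "grey")

pdf(NULL)   # null graphics device so this sketch runs without drawing;
            # drop this line and dev.off() below to see the plots
mids.grouped <- barplot(toy.table, beside = TRUE, col = toy.colors,
                        legend.text = TRUE,                  # legend built from row names
                        args.legend = list(x = "topleft"))   # passed on to legend()
mids.stacked <- barplot(toy.table, beside = FALSE, col = toy.colors)
invisible(dev.off())

dim(mids.grouped)      # 3 2: one bar per category within each month
length(mids.stacked)   # 2: one stacked bar per month
```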
Stacked Relative Frequency Bar Charts
To create a bar chart displaying the relative frequency (or proportion) of hurricanes of each category that occurred in each month:
- First create a two-way table of relative frequencies.
  - Pay attention to whether you want the proportions relative to grand, row, or column totals.
- Input the table into the barplot() function.
  - The default option beside = FALSE stacks the bars.
cat.grand <- prop.table(cat.table) # gives proportions relative to grand total
cat.row <- prop.table(cat.table, 1) # gives proportions relative to row totals
cat.col <- prop.table(cat.table, 2) # gives proportions relative to column totals
par(mfrow = c(1, 3)) # create an array of plots
# create stacked bar chart 1
barplot(cat.grand, # use proportions from contingency table
main = "Category Hurricanes By Month", # main title
xlab = "Month", # horizontal axis label
col = my.colors, # color of bars
ylab = "Proportion") # vertical axis label
# add legend to plot
legend(x="topleft", # place legend in top left
legend=rownames(cat.grand), # get labels
fill = my.colors) # use same colors
##########
# create stacked bar chart 2
barplot(cat.row, # use proportions from contingency table
main = "Category Hurricanes By Month", # main title
xlab = "Month", # horizontal axis label
col = my.colors, # color of bars
ylab = "Proportion") # vertical axis label
# add legend to plot
legend(x="topleft", # place legend in top left
legend=rownames(cat.row), # get labels
fill = my.colors) # use same colors
###########
# create stacked bar chart 3
barplot(cat.col, # use proportions from contingency table
main = "Category Hurricanes By Month", # main title
xlab = "Month", # horizontal axis label
col = my.colors, # color of bars
ylab = "Proportion") # vertical axis label
# add legend to plot
legend(x="topleft", # place legend in top left
legend=rownames(cat.col), # get labels
fill = my.colors) # use same colors
In the middle bar chart, the proportions are relative to each category (row) total, so the segments stacked for a single month do not share a common total and a bar can be taller than 1. For example, the sum of all the September proportions (one relative to each category total) adds up to \(2.8\), since September accounts for over half the total in 4 of the 5 possible hurricane categories.
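A small worked example may make this clearer. The sketch below uses made-up counts (toy.counts is a hypothetical table, not derived from storms) in which both categories are concentrated in one month, and compares the stacked bar heights under row and column proportions.

```r
# made-up counts: rows are categories, columns are months
# (matrix() fills by column: month 8 has 2 and 1, month 9 has 6 and 9, month 10 has 2 and 0)
toy.counts <- matrix(c(2, 1, 6, 9, 2, 0),
                     nrow = 2,
                     dimnames = list(category = c("1", "2"),
                                     month = c("8", "9", "10")))

toy.row <- prop.table(toy.counts, 1)   # each category (row) sums to 1
toy.col <- prop.table(toy.counts, 2)   # each month (column) sums to 1

rowSums(toy.row)   # 1 1
colSums(toy.row)   # 0.3 1.5 0.2: stacked bar heights under row proportions
colSums(toy.col)   # 1 1 1: stacked bar heights under column proportions
```

Here month 9 holds 60% of category 1 and 90% of category 2, so its stacked bar under row proportions has height 1.5, while every bar under column proportions has height 1.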
Question 4
Based on the three plots generated in the previous code cell, answer the questions below.
Which month has the most hurricanes?
In which month is the proportion of category 1 hurricanes greatest?
Solution to Question 4
Question 5
What are the differences in the three plots in the output of the previous code cell? Which of the three bar plots above do you believe best visualizes the occurrence of different category hurricanes by month? Which plot do you think is the least useful overall? Explain why.
Solution to Question 5