# load the library of functions and data in dplyr
library(dplyr)
Appendix A — Fundamentals of Working with Data
Introduction
Before we can identify which aspects of a data set might need to be cleaned and transformed for efficient statistical analysis, we need to understand the data types of its variables and the structure of the data.
This notebook is intended to be a brief overview of some fundamentals of working with data in R.
- How to save and display output from commands?
- What are the basic data types?
- How to convert data types?
- What are the data structures in R?
- How to create or import a data frame?
- How to slice and extract rows and columns of a data frame?
These topics are important, and this notebook only scratches the surface of many of them. If you do not find a complete answer here, free online resources cover these concepts in more depth. Recommended references include:
An Introduction to R by the developers of R
Loading Packages with the library() Command
To explore some fundamentals of working with data in R, we will use the storms data set, which is located in the package dplyr.
- The dplyr package is already installed in Google Colaboratory.
- We still need to use a library command to load the package.
- Run the library(dplyr) code cell to load the dplyr package.
Each time you connect or restart a session, you will need to run a library() command in order to access data and scripts in a package.
Help Documentation
The functions introduced in this document have robust help documentation with lots of options to customize. If you want to view help documentation for any of the functions used in this document, run commands such as ?typeof, ?is.numeric, ?read.table, and so on.
# access help documentation for storms
?storms # side panel should open with help manual for storms
# access help documentation for typeof
?typeof
Getting to Know Our Data
The package dplyr contains a data set called storms. Let’s find some useful information about this data.
- The code cell below will provide a numeric summary of all variables in the storms data.
- Recall we need to first run the command library(dplyr) in the code cell above to be able to access storms.
# get a numerical summary of all variables
summary(storms)
name year month day
Length:19066 Min. :1975 Min. : 1.000 Min. : 1.00
Class :character 1st Qu.:1993 1st Qu.: 8.000 1st Qu.: 8.00
Mode :character Median :2004 Median : 9.000 Median :16.00
Mean :2002 Mean : 8.699 Mean :15.78
3rd Qu.:2012 3rd Qu.: 9.000 3rd Qu.:24.00
Max. :2021 Max. :12.000 Max. :31.00
hour lat long status
Min. : 0.000 Min. : 7.00 Min. :-109.30 tropical storm :6684
1st Qu.: 5.000 1st Qu.:18.40 1st Qu.: -78.70 hurricane :4684
Median :12.000 Median :26.60 Median : -62.25 tropical depression:3525
Mean : 9.094 Mean :26.99 Mean : -61.52 extratropical :2068
3rd Qu.:18.000 3rd Qu.:33.70 3rd Qu.: -45.60 other low :1405
Max. :23.000 Max. :70.70 Max. : 13.50 subtropical storm : 292
(Other) : 408
category wind pressure tropicalstorm_force_diameter
Min. :1.000 Min. : 10.00 Min. : 882.0 Min. : 0.0
1st Qu.:1.000 1st Qu.: 30.00 1st Qu.: 987.0 1st Qu.: 0.0
Median :1.000 Median : 45.00 Median :1000.0 Median : 110.0
Mean :1.898 Mean : 50.02 Mean : 993.6 Mean : 146.3
3rd Qu.:3.000 3rd Qu.: 65.00 3rd Qu.:1007.0 3rd Qu.: 220.0
Max. :5.000 Max. :165.00 Max. :1024.0 Max. :1440.0
NA's :14382 NA's :9512
hurricane_force_diameter
Min. : 0.00
1st Qu.: 0.00
Median : 0.00
Mean : 14.81
3rd Qu.: 0.00
Max. :300.00
NA's :9512
Missing Data
A missing value occurs when the value of something isn’t known. R uses the special object NA to represent a missing value. If you have a missing value, you should represent that value as NA. Note: The character string "NA" is not the same thing as NA.
- The storms data has properly coded 14,382 missing values for category since storms that are not hurricanes do not have a category.
- The storms data has properly coded 9,512 missing values for each of tropicalstorm_force_diameter and hurricane_force_diameter since these values only began being recorded in 2004.
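As a small illustration of the points above, the sketch below uses a made-up vector (vals is a hypothetical name, not part of storms) to show how NA differs from the character string "NA", and how one missing value affects a calculation unless it is explicitly removed.

```r
# vals is a made-up vector used only for illustration
vals <- c(4, NA, 10)

na_flags <- is.na(vals)               # TRUE only where the value is missing
string_is_missing <- is.na("NA")      # the string "NA" is ordinary text, not a missing value

mean_with_na <- mean(vals)                   # one unknown value makes the mean unknown
mean_without_na <- mean(vals, na.rm = TRUE)  # drop missing values before averaging

print(na_flags)
print(string_is_missing)
print(mean_with_na)
print(mean_without_na)
```

Many summary functions, such as mean() and sd(), accept na.rm = TRUE in the same way.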
Assignment to New (or Existing) Objects
To store a data structure in the computer’s memory we must assign it a name.
Data structures can be stored using the assignment operator <- or =.
Some comments:
- In general, both <- and = can be used for assignment.
- <- and = can be used identically most of the time, but not always.
- It’s safer and more conventional to use <- for assignment.
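One place where the two operators behave differently is inside a function call. The sketch below is only an illustration (the object names m_eq, m_arrow, and the flags are made up): within the parentheses of a call, = matches a value to the argument named x, while <- creates an object named x in the workspace and then passes its value along.

```r
# Make sure no object named x exists in the current workspace yet
if (exists("x", inherits = FALSE)) rm(x)

# Inside a call, = only matches the value to mean()'s argument named x
m_eq <- mean(x = c(2, 4, 6))
x_after_eq <- exists("x", inherits = FALSE)     # no object x was created

# Inside a call, <- assigns to x first, then passes the value to mean()
m_arrow <- mean(x <- c(2, 4, 6))
x_after_arrow <- exists("x", inherits = FALSE)  # x now exists in the workspace

print(c(m_eq, m_arrow))
print(c(x_after_eq, x_after_arrow))
```

Both calls compute the same mean; the difference is the side effect of leaving x behind, which is one reason the two operators are usually kept to separate roles.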
In the following code, we compute the mean of a vector. Why can’t we see the result after running it?
w <- storms$wind # wind is now stored in w
xbar.w <- mean(w) # compute mean wind speed and assign to xbar.w
- Once an object has been assigned a name, it can be printed by executing the name of the object.
xbar.w # print the mean wind speed to screen
[1] 50.01741
- We can also print an object to screen using the print() function.
print(xbar.w) # print the mean with print() command
[1] 50.01741
- We can calculate, assign, and print the result by putting parentheses around the assignment.
# calculate, assign, and print standard deviation
(s <- sd(w)) # note ( ) around the entire command
[1] 25.50103
- Sometimes you want to see the result of a code cell, and sometimes you will not.
Basic Data Types
R has 6 basic data types:
- character: collections of characters. E.g., "a", "hello world!"
- double: decimal numbers. E.g., 1.2, 1.0
- integer: whole numbers. In R, you must add L to the end of a number to specify it as an integer. E.g., 1L is an integer but 1 is a double.
- logical: Boolean values, TRUE and FALSE
- complex: complex numbers. E.g., 1+3i
- raw: a type to hold raw bytes.
Checking Data Type Using typeof()
- The typeof() function returns the R internal type or storage mode of any object.
typeof(1.0)
[1] "double"
typeof(2)
[1] "double"
typeof(3L)
[1] "integer"
typeof("hello")
[1] "character"
typeof(TRUE)
[1] "logical"
typeof(storms$status)
[1] "integer"
typeof(storms$year)
[1] "double"
typeof(storms$name)
[1] "character"
Investigating Data Types with is.numeric()
- The is.numeric(x) function tests whether or not an object x is numeric.
- The is.character(x) function tests whether x is a character or not.
- The is.factor(x) function tests whether x is a factor or not.
Categorical data is typically stored as a factor in R.
is.numeric(storms$year) # year is numeric
[1] TRUE
is.numeric(storms$category) # category is also numeric
[1] TRUE
is.numeric(storms$name) # name is not numeric
[1] FALSE
is.character(storms$name) # name is character string
[1] TRUE
is.numeric(storms$status) # status is not numeric
[1] FALSE
is.character(storms$status) # status is not a character
[1] FALSE
is.factor(storms$status) # status is a factor which is categorical
[1] TRUE
- The function str(x) provides information about the levels or classes of x.
str(storms$status)
Factor w/ 9 levels "disturbance",..: 7 7 7 7 7 7 7 7 8 8 ...
Changing Data Types
Converting to Categorical Data with factor()
- Sometimes we think a variable is one data type, but it is actually being stored (and thus interpreted by R) as a different data type.
- One common issue is that categorical data is stored as characters. We would like observations with the same values to be grouped together.
- The status variable in storms is being properly stored as a factor!
- The category variable in storms is being stored as numeric since it is ordinal.
- With ordinal categories, we may choose to keep them stored as numeric, or we may prefer to treat them as factors.
summary(storms$category)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 1.000 1.000 1.898 3.000 5.000 14382
- The summary of category computes statistics such as mean and median.
- Typically with categorical data, we prefer to count how many observations are in each class of the variable.
- In the code cell below, we convert category to a factor, and then observe the resulting summary.
storms$category <- factor(storms$category)
summary(storms$category)
1 2 3 4 5 NA's
2478 973 579 539 115 14382
Converting Data Types with as.numeric(), as.integer(), etc.
From the summary of the storms data set we first found above, we see that the variables year and month are being stored as double. These variables actually take integer values.
We can convert a variable from one format into another using as.[new_datatype]().
- For example, to convert year to integer, we use as.integer(storms$year).
- To convert a data type to character, we can use as.character(x).
- To convert to a decimal (double), we can use as.numeric(x).
typeof(storms$year)
[1] "double"
typeof(storms$month)
[1] "double"
storms$year <- as.integer(storms$year)
storms$month <- as.integer(storms$month)
typeof(storms$year)
[1] "integer"
typeof(storms$month)
[1] "integer"
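Conversions do not always behave as one might first expect. The sketch below is an illustration with made-up values (fct is a hypothetical factor, not a column of storms). It shows two common surprises: as.numeric() applied to a factor returns the internal level codes rather than the labels, and as.integer() drops the fractional part instead of rounding.

```r
# fct is a made-up factor whose labels happen to look like numbers
fct <- factor(c("3", "1", "5", "1"))

codes <- as.numeric(fct)                  # internal level codes, not the labels
values <- as.numeric(as.character(fct))   # convert labels to text first, then to numbers

truncated <- as.integer(2.9)              # the fractional part is dropped, not rounded

print(codes)
print(values)
print(truncated)
```

This matters if a factor such as the converted category variable ever needs to be turned back into numbers: going through as.character() first recovers the original labels.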
Data structures
R operates on data structures. A data structure is simply some sort of “container” that holds certain kinds of information.
R has 5 basic data structures:
- vector: One dimensional object of a single data type.
- matrix: Two dimensional object of a single data type.
- array: \(n\) dimensional object of a single data type.
- data frame: Two dimensional object where each column can be a different data type.
- list: An object that can contain elements of different types (possibly including another list).
See R documentation for more info.
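For a concrete picture of the five structures listed above, here is a minimal sketch that builds one small object of each kind. All object names and values are made up for illustration.

```r
v <- c(1.5, 2, 3)                                # vector: one dimension, one type
m <- matrix(1:6, nrow = 2)                       # matrix: 2 rows by 3 columns, one type
a <- array(1:24, dim = c(2, 3, 4))               # array: three dimensions, one type
d <- data.frame(id = 1:2, label = c("a", "b"))   # data frame: columns may differ in type
l <- list(v, "text", list(TRUE, 2L))             # list: mixed types, including another list

print(dim(m))
print(dim(a))
print(class(d))
print(length(l))
```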
Vectors
A vector is a single-dimensional set of data of the same type.
Creating Vectors from Scratch
The most basic way to create a vector is the combine function c. The first three commands below create vectors of type numeric, character, and logical, respectively. The fourth mixes types, so R coerces every element to character.
x1 <- c(1, 2, 5.3, 6, -2, 4)
x2 <- c("one", "two", "three")
x3 <- c(TRUE, TRUE, FALSE, TRUE)
x4 <- c(TRUE, 3.4, "hello")
typeof(x1)
[1] "double"
typeof(x2)
[1] "character"
typeof(x3)
[1] "logical"
typeof(x4)
[1] "character"
- We can check the data structure of an object using commands such as is.vector(), is.list(), is.matrix(), and so on.
is.list(x1)
[1] FALSE
is.vector(x1)
[1] TRUE
is.list(x4)
[1] FALSE
is.vector(x4)
[1] TRUE
Data Frames
Data frames are two-dimensional data objects and are the fundamental data structure used by most of R’s libraries of functions and data sets.
- Tabular data is tidy if each row corresponds to a different observation and each column corresponds to a different variable.
Each column of a data frame is a variable (stored as a vector). If the variable:
- Is measured or counted by a number, it is a quantitative or numerical variable.
- Groups observations into different categories or rankings, it is a qualitative or categorical variable.
Creating Data Frames from Scratch
Data frames are created by passing vectors into the data.frame() function.
The names of the columns in the data frame are the names of the vectors you give the data.frame function.
Consider the following simple example.
# create basic data frame
d <- c(1, 2, 3, 4)
e <- c("red", "white", "blue", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
df <- data.frame(d,e,f)
df
d e f
1 1 red TRUE
2 2 white TRUE
3 3 blue TRUE
4 4 <NA> FALSE
Naming Column Headers
The columns of a data frame can be renamed using the names() function on the data frame.
# name columns of data frame
names(df) <- c("ID", "Color", "Passed")
df
ID Color Passed
1 1 red TRUE
2 2 white TRUE
3 3 blue TRUE
4 4 <NA> FALSE
The columns of a data frame can be named when you are first creating the data frame by using [new_name] = [orig_vec_name] for each vector of data.
# create data frame with better column names
df2 <- data.frame(ID = d, Color = e, Passed = f)
df2
ID Color Passed
1 1 red TRUE
2 2 white TRUE
3 3 blue TRUE
4 4 <NA> FALSE
Checking Data Structure
- The is.matrix(x) function tests whether or not an object x is a matrix.
- The is.vector(x) function tests whether x is a vector.
- The is.data.frame(x) function tests whether x is a data frame.
is.matrix(df)
[1] FALSE
is.vector(df)
[1] FALSE
is.data.frame(df)
[1] TRUE
Extracting and Slicing Data Frames
Extracting a Column By Name
The column vectors of a data frame may be extracted using $ and specifying the name of the desired vector.
- df$Color would access the Color column of data frame df.
df$Color # prints column of data frame df named Color
[1] "red" "white" "blue" NA
Slicing Rows and Columns By Indexing
Part of a data frame can also be extracted by thinking of it as a general matrix and specifying the desired rows or columns in square brackets after the object name.
- Note that R starts indexing at 1, unlike Python, which starts at 0.
For example, if we had a data frame named df:
- df[1,] would access the first row of df.
- df[1:2,] would access the first two rows of df.
- df[,2] would access the second column of df.
- df[1:2, 2:3] would access the information in rows 1 and 2 of columns 2 and 3 of df.
df[,2] # second column is Color
[1] "red" "white" "blue" NA
df[2,] # second row of df
ID Color Passed
2 2 white TRUE
df[1:2,2:3] # first and second rows of columns 2 and 3
If you need to select multiple columns of a data frame by name, you can pass a character vector with column names in the column position of [].
- df[, c("ID", "Passed")] would extract the ID and Passed columns of df.
df[, c("ID", "Passed")]
ID Passed
1 1 TRUE
2 2 TRUE
3 3 TRUE
4 4 FALSE
df[, c(1, 3)] # another way to pick columns 1 and 3
ID Passed
1 1 TRUE
2 2 TRUE
3 3 TRUE
4 4 FALSE
# another way to pick columns 1 and 3
df[,-2] # exclude column 2
ID Passed
1 1 TRUE
2 2 TRUE
3 3 TRUE
4 4 FALSE
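Selecting a single column with square brackets, as in df[,2], returns a plain vector rather than a data frame. If a one-column data frame is wanted instead, the drop = FALSE argument of [ keeps the data frame structure. The sketch below uses a small made-up data frame (demo is a hypothetical name) so it can be run on its own.

```r
# demo is a made-up data frame used only for illustration
demo <- data.frame(ID = c(1, 2, 3),
                   Color = c("red", "white", "blue"),
                   stringsAsFactors = FALSE)

col_vec <- demo[, 2]                # a single column collapses to a vector
col_df <- demo[, 2, drop = FALSE]   # drop = FALSE keeps a one-column data frame

print(is.vector(col_vec))
print(is.data.frame(col_df))
print(names(col_df))
```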
Importing an External File as a Data Frame
The read.table function imports data from a file into R as a data frame.
Usage: read.table(file, header = TRUE, sep = ",")
- file is the file path and name of the file you want to import into R.
  - If you don’t know the file path, setting file = file.choose() will bring up a dialog box asking you to locate the file you want to import.
- header specifies whether the data file has a header (variable labels for each column of data in the first row of the data file).
  - If you don’t specify this option in R or use header = FALSE, then R will assume the file doesn’t have any headings.
  - header = TRUE tells R to read in the data as a data frame with column names taken from the first row of the data file.
- sep specifies the delimiter separating elements in the file.
  - If each column of data in the file is separated by a space, then use sep = " ".
  - If each column of data in the file is separated by a comma, then use sep = ",".
  - If each column of data in the file is separated by a tab, then use sep = "\t".
Here is an example reading a csv (comma separated file) with a header:
# import data as data frame
bike.store <- read.table(file="https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Data/Transactions.csv",
  header = TRUE, # Keep column headers as names
  sep = ",") # comma as separator of columns
glimpse(bike.store)
Rows: 20,000
Columns: 13
$ transaction_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ product_id <int> 2, 3, 37, 88, 78, 25, 22, 15, 67, 12, 5, 61, 3…
$ customer_id <int> 2950, 3120, 402, 3135, 787, 2339, 1542, 2459, …
$ transaction_date <chr> "25-02-2017", "21-05-2017", "16-10-2017", "31-…
$ online_order <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, F…
$ order_status <chr> "Approved", "Approved", "Approved", "Approved"…
$ brand <chr> "Solex", "Trek Bicycles", "OHM Cycles", "Norco…
$ product_line <chr> "Standard", "Standard", "Standard", "Standard"…
$ product_class <chr> "medium", "medium", "low", "medium", "medium",…
$ product_size <chr> "medium", "large", "medium", "medium", "large"…
$ list_price <dbl> 71.49, 2091.47, 1793.43, 1198.46, 1765.30, 153…
$ standard_cost <dbl> 53.62, 388.92, 248.82, 381.10, 709.48, 829.65,…
$ product_first_sold_date <int> 41245, 41701, 36361, 36145, 42226, 39031, 3416…
- The glimpse() function provides a nice summary of the structure.
- Run the code cell below to see the various options of read.table().
- There are other functions and packages that may be better at reading in tabular data. read.table() is a good place to start!
?read.table
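To see the effect of the header option without relying on an internet connection, the sketch below writes a tiny made-up comma-separated file to a temporary location and reads it back twice. The file contents and the object names (tmp, with_header, no_header) are assumptions for illustration only.

```r
# Write a small made-up CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,color,passed",
             "1,red,TRUE",
             "2,white,FALSE"), tmp)

# header = TRUE: the first row supplies the column names
with_header <- read.table(file = tmp, header = TRUE, sep = ",",
                          stringsAsFactors = FALSE)

# header = FALSE: the first row is read as data and R supplies names V1, V2, V3
no_header <- read.table(file = tmp, header = FALSE, sep = ",",
                        stringsAsFactors = FALSE)

str(with_header)
str(no_header)
```

With header = FALSE, the labels in the first row are mixed in with the data, so every column is read as character.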
Logical Statements
Sometimes we need to know if the elements of an object satisfy certain conditions. This can be determined using the logical operators <, <=, >, >=, ==, !=.
- == means equal to.
- != means NOT equal to.
Execute the following commands in R and see what you get.
a <- seq(2, 16, by = 2) # creating the vector a
a
[1] 2 4 6 8 10 12 14 16
a > 10
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
a <= 4
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
a == 10
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
a != 10
[1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
And and Or Statements
More complicated logical statements can be made using & and |.
- & means “and”
  - Both statements must be true for state1 & state2 to return TRUE.
- | means “or”
  - At least one of the two statements must be true for state1 | state2 to return TRUE.
  - If both statements are true in an “or” statement, the statement is also TRUE.
Below is a summary of “and” and “or” logic:
- TRUE & TRUE returns TRUE
- FALSE & TRUE returns FALSE
- FALSE & FALSE returns FALSE
- TRUE | TRUE returns TRUE
- FALSE | TRUE returns TRUE
- FALSE | FALSE returns FALSE
# relationship between logicals & (and), | (or)
TRUE & TRUE
[1] TRUE
FALSE & TRUE
[1] FALSE
FALSE & FALSE
[1] FALSE
TRUE | TRUE
[1] TRUE
FALSE | TRUE
[1] TRUE
FALSE | FALSE
[1] FALSE
Execute the following commands in R and see what you get.
b <- 3 # b is equal to the number 3
# complex logical statements
(b > 6) & (b <= 10) # FALSE and TRUE
[1] FALSE
(b <= 4) | (b >= 12) # TRUE or FALSE
[1] TRUE
Logical Indexing
We can use a logical statement as an index to extract certain entries from a vector or data frame. For example, if we want to know the product_id (column 2), brand (column 7), product_line (column 8), and list_price (column 11) of all transactions that have a list_price greater than $2,090, we can run the code cell below.
- We use a logical index for the row to extract just the rows that have a list_price value strictly greater than 2090.
- We indicate we want to keep just columns 2, 7 through 8, and 11 with the column index c(2, 7:8, 11).
- We store the results to a new data frame named expensive.
- Finally, we print the first 6 rows of our new data frame with the head() function to check the results.
expensive <- bike.store[bike.store$list_price > 2090, c(2, 7:8, 11)]
head(expensive)
product_id brand product_line list_price
2 3 Trek Bicycles Standard 2091.47
16 3 Trek Bicycles Standard 2091.47
69 38 Trek Bicycles Standard 2091.47
154 3 Trek Bicycles Standard 2091.47
165 3 Trek Bicycles Standard 2091.47
188 3 Trek Bicycles Standard 2091.47
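Logical indexing interacts with missing values: a comparison involving NA evaluates to NA, and an NA in the row index produces a row filled with NA. The sketch below uses a small made-up data frame (sales and the other object names are hypothetical) to show this, along with two common ways to avoid it. It also combines several conditions with &.

```r
# sales is a made-up data frame used only for illustration
sales <- data.frame(brand = c("A", "B", "A", "C"),
                    price = c(120, 2500, NA, 2100),
                    stringsAsFactors = FALSE)

# The comparison is NA for the row with a missing price, which yields an all-NA row
pricey <- sales[sales$price > 2000, ]

# which() keeps only the positions where the condition is TRUE, dropping NA
pricey_clean <- sales[which(sales$price > 2000), ]

# Combining conditions with &, and excluding missing prices explicitly
pricey_not_c <- sales[!is.na(sales$price) & sales$price > 2000 & sales$brand != "C", ]

print(pricey)
print(pricey_clean)
print(pricey_not_c)
```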
Creative Commons License Information
Statistical Methods: Exploring the Uncertain by Adam Spiegler is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.