Appendix A — Fundamentals of Working with Data

Author

Adam Spiegler, University of Colorado Denver

Introduction

Before we can identify which aspects of a data set need to be cleaned and transformed for efficient statistical analysis, we must understand the data types of its variables and the structure of the data.

This notebook is intended to be a brief overview of some fundamentals of working with data in R.

How to save and display output from commands?
What are the basic data types?
How to convert data types?
What are the data structures in R?
How to create or import a data frame?
How to slice and extract rows and columns of a data frame?

These topics are important, and this notebook only scratches the surface of many of them. If you do not find a complete answer here, free online resources cover these concepts in more depth. Two recommended references are listed below.

An Introduction to R by the developers of R
Programming with R Guide to Data Types and Structures

Loading Packages with the `library()` Command

To explore some fundamentals of working with data in R, we will use the storms data set which is located in the package dplyr.

The dplyr package is already installed in Google Colaboratory.
We still need to use a library command to load the package.
Run the code cell below to load the dplyr package.

# load the library of functions and data in dplyr
library(dplyr)

Caution

Each time you connect or restart a session, you will need to run a library() command in order to access data and scripts in a package.
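
A defensive way to start a session can be sketched as follows. This is only an illustration: the built-in stats package stands in for dplyr so the snippet runs anywhere, and the names pkg and attached are placeholders chosen for this example.

```r
# package to load; "stats" ships with R and is a stand-in for "dplyr"
pkg <- "stats"

# install only if the package is not already available (needs internet access)
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# load the package; character.only = TRUE lets us pass the name as a string
library(pkg, character.only = TRUE)

# confirm the package is now on the search path
attached <- paste0("package:", pkg) %in% search()
attached
```

In everyday use, a plain library(dplyr) at the top of the notebook is usually all that is needed.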

Help Documentation

The functions introduced in this document have robust help documentation with lots of options to customize. If you want to view help documentation for any of the functions used in this document, run commands such as ?typeof, ?is.numeric, ?read.table, and so on.

# access help documentation for storms
?storms  # side panel should open with help manual for storms

# access help documentation for typeof
?typeof

Getting to Know Our Data

The package dplyr contains a data set called storms. Let’s find some useful information about this data.

The code cell below will provide a numeric summary of all variables in the storms data.
Recall we need to first run the command library(dplyr) in the code cell above to be able to access storms.

# get a numerical summary of all variables
summary(storms)

     name                year          month             day       
 Length:19066       Min.   :1975   Min.   : 1.000   Min.   : 1.00  
 Class :character   1st Qu.:1993   1st Qu.: 8.000   1st Qu.: 8.00  
 Mode  :character   Median :2004   Median : 9.000   Median :16.00  
                    Mean   :2002   Mean   : 8.699   Mean   :15.78  
                    3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:24.00  
                    Max.   :2021   Max.   :12.000   Max.   :31.00  
                                                                   
      hour             lat             long                         status    
 Min.   : 0.000   Min.   : 7.00   Min.   :-109.30   tropical storm     :6684  
 1st Qu.: 5.000   1st Qu.:18.40   1st Qu.: -78.70   hurricane          :4684  
 Median :12.000   Median :26.60   Median : -62.25   tropical depression:3525  
 Mean   : 9.094   Mean   :26.99   Mean   : -61.52   extratropical      :2068  
 3rd Qu.:18.000   3rd Qu.:33.70   3rd Qu.: -45.60   other low          :1405  
 Max.   :23.000   Max.   :70.70   Max.   :  13.50   subtropical storm  : 292  
                                                    (Other)            : 408  
    category          wind           pressure      tropicalstorm_force_diameter
 Min.   :1.000   Min.   : 10.00   Min.   : 882.0   Min.   :   0.0              
 1st Qu.:1.000   1st Qu.: 30.00   1st Qu.: 987.0   1st Qu.:   0.0              
 Median :1.000   Median : 45.00   Median :1000.0   Median : 110.0              
 Mean   :1.898   Mean   : 50.02   Mean   : 993.6   Mean   : 146.3              
 3rd Qu.:3.000   3rd Qu.: 65.00   3rd Qu.:1007.0   3rd Qu.: 220.0              
 Max.   :5.000   Max.   :165.00   Max.   :1024.0   Max.   :1440.0              
 NA's   :14382                                     NA's   :9512                
 hurricane_force_diameter
 Min.   :  0.00          
 1st Qu.:  0.00          
 Median :  0.00          
 Mean   : 14.81          
 3rd Qu.:  0.00          
 Max.   :300.00          
 NA's   :9512

Missing Data

A missing value occurs when the value of something isn’t known. R uses the special object NA to represent a missing value. If you have a missing value, you should represent that value as NA. Note: The character string "NA" is not the same thing as NA.

The storms data has properly coded 14,382 missing values for category since storms that are not hurricanes do not have a category.
The storms data has properly coded 9,512 missing values for each of tropicalstorm_force_diameter and hurricane_force_diameter since these values only began being recorded in 2004.
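
The difference between NA and the string "NA", and the effect of missing values on a summary, can be seen in a small sketch. The vectors wind.demo and name.demo are made-up values for illustration and are not taken from storms.

```r
# numeric vector with one missing value
wind.demo <- c(30, NA, 45)

# character vector: the string "NA" is ordinary text, NA is a missing value
name.demo <- c("Amy", "NA", NA)

is.na(wind.demo)  # FALSE  TRUE FALSE
is.na(name.demo)  # FALSE FALSE  TRUE

# count the missing values
n.missing <- sum(is.na(wind.demo))  # 1

# a mean that involves an unknown value is itself unknown
mean(wind.demo)  # NA

# na.rm = TRUE removes missing values before computing
mean.known <- mean(wind.demo, na.rm = TRUE)  # 37.5
```

Many summary functions, such as mean(), sd(), and sum(), accept the na.rm argument.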

Assignment to New (or Existing) Objects

To store a data structure in the computer’s memory we must assign it a name.

Data structures can be stored using the assignment operator <- or =.

Some comments:

In general, both <- and = can be used for assignment.
<- and = can be used identically most of the time, but not always.
It’s safer and more conventional to use <- for assignment.
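
One situation where <- and = behave differently is inside a function call. The sketch below uses made-up values; the object names m1, m2, and v are chosen only for this illustration.

```r
# inside a function call, = names an argument; no object called x is created
m1 <- mean(x = c(2, 4, 6))

# inside a function call, <- first creates the object v, then passes it to mean()
m2 <- mean(v <- c(2, 4, 6))

m1  # 4
m2  # 4
v   # 2 4 6, so v now exists in the workspace
```

Using <- for assignment and = only for function arguments keeps the two roles visually distinct.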

In the following code, we compute the mean of a vector. Why can’t we see the result after running it?

w <- storms$wind  # wind is now stored in w
xbar.w <- mean(w)  # compute mean wind speed and assign to xbar.w

Once an object has been assigned a name, it can be printed by executing the name of the object.

xbar.w  # print the mean wind speed to screen

[1] 50.01741

We can also print an object to screen using the print() function.

print(xbar.w)  # print the mean with print() command

[1] 50.01741

We can calculate, assign, and print the result by putting parentheses around the assignment.

# calculate, assign, and print standard deviation
(s <- sd(w))  # note ( ) around the entire command

[1] 25.50103

Sometimes you want to see the result of a code cell, and sometimes you will not.

Basic Data Types

R has 6 basic data types:

character: collections of characters. E.g., "a", "hello world!"
double: decimal numbers. E.g., 1.2, 1.0
integer: whole numbers. In R, you must add L to the end of a number to specify it as an integer. E.g., 1L is an integer but 1 is a double.
logical: Boolean values, TRUE and FALSE
complex: complex numbers. E.g., 1+3i
raw: a type to hold raw bytes.

Checking Data Type Using `typeof()`

The typeof() function returns the R internal type or storage mode of any object.

typeof(1.0)

[1] "double"

typeof(2)

[1] "double"

typeof(3L)

[1] "integer"

typeof("hello")

[1] "character"

typeof(TRUE)

[1] "logical"

typeof(storms$status)

[1] "integer"

typeof(storms$year)

[1] "double"

typeof(storms$name)

[1] "character"
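
It may look surprising that typeof(storms$status) returns "integer". A factor is stored internally as integer codes with a set of labels (levels) attached, which the following sketch illustrates. The factor status.demo is a made-up example, not the storms data.

```r
# small made-up factor with two levels
status.demo <- factor(c("hurricane", "tropical storm", "hurricane"))

typeof(status.demo)      # "integer": how the values are stored
class(status.demo)       # "factor": how R treats the object
levels(status.demo)      # "hurricane" "tropical storm"
as.integer(status.demo)  # 1 2 1: the underlying integer codes

# the two remaining basic types from the list above
typeof(1 + 3i)       # "complex"
typeof(as.raw(255))  # "raw"
```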

Investigating Data Types with `is.numeric()`

The is.numeric(x) function tests whether or not an object x is numeric.
The is.character(x) function tests whether x is a character or not.
The is.factor(x) function tests whether x is a factor or not.

Note

Categorical data is typically stored as a factor in R.

is.numeric(storms$year)  # year is numeric

[1] TRUE

is.numeric(storms$category)  # category is also numeric

[1] TRUE

is.numeric(storms$name)  # name is not numeric

[1] FALSE

is.character(storms$name)  # name is character string

[1] TRUE

is.numeric(storms$status)  # status is not numeric

[1] FALSE

is.character(storms$status)  # status is not a character

[1] FALSE

is.factor(storms$status)  # status is a factor which is categorical

[1] TRUE

The function str(x) gives a compact display of the structure of x. For a factor, this includes the number of levels and the first few level names.

str(storms$status)

 Factor w/ 9 levels "disturbance",..: 7 7 7 7 7 7 7 7 8 8 ...

Changing Data Types

Converting to Categorical Data with `factor()`

Sometimes we think a variable is one data type, but it is actually being stored (and thus interpreted by R) as a different data type.
One common issue is that categorical data is stored as characters. We would like observations with the same values to be grouped together.
The status variable in storms is being properly stored as a factor!
The category variable in storms is being stored as a numeric since it is ordinal.
With ordinal categories, we may choose to keep it stored as numeric, or we may prefer to treat them as factors.

summary(storms$category)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.000   1.000   1.898   3.000   5.000   14382

The summary of category computes statistics such as mean and median.
Typically with categorical data, we prefer to count how many observations are in each class of the variable.
In the code cell below, we convert category to a factor, and then observe the resulting summary.

storms$category <- factor(storms$category)
summary(storms$category)

    1     2     3     4     5  NA's 
 2478   973   579   539   115 14382
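
Because category is ordinal, one option is an ordered factor, which counts observations per class and still remembers the ranking. The sketch below is only an illustration on made-up values; cat.raw and cat.ord are names chosen for this example.

```r
# made-up category values, including one missing value
cat.raw <- c(1, 3, NA, 5, 1, 2)

# ordered factor with all five possible levels declared
cat.ord <- factor(cat.raw, levels = 1:5, ordered = TRUE)

# counts per level; level 4 has a count of 0 and the NA is reported separately
summary(cat.ord)

# ordered factors support comparisons between levels
cat.ord[2] > cat.ord[1]  # TRUE, since 3 ranks above 1

# table() gives the same counts; useNA = "ifany" includes missing values
n.by.level <- table(cat.ord, useNA = "ifany")
```

Declaring the levels explicitly keeps classes that have no observations in the summary.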

Converting Data Types with `as.numeric()`, `as.integer()`, etc.

As we confirm below, the variables year and month are being stored as double even though they only take integer values.

We can convert a variable from one data type to another using as.[new_datatype]().

For example, to convert year to integer, we use as.integer(storms$year).
To convert a data type to character, we can use as.character(x).
To convert to a decimal (double), we can use as.numeric(x).

typeof(storms$year)

[1] "double"

typeof(storms$month)

[1] "double"

storms$year <- as.integer(storms$year)
storms$month <- as.integer(storms$month)
typeof(storms$year)

[1] "integer"

typeof(storms$month)

[1] "integer"
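
A few conversions behave in ways that are easy to overlook. The following sketch uses made-up values (year.f, codes, and years are names chosen for this example) to show truncation and a common pitfall when converting a factor to numbers.

```r
as.integer(3.9)      # 3: the decimal part is dropped, not rounded
as.character(2021L)  # "2021"
as.numeric("12.5")   # 12.5

# pitfall: converting a factor directly returns its integer codes
year.f <- factor(c("2010", "2020", "2010"))
codes <- as.numeric(year.f)  # 1 2 1

# convert to character first to recover the original numbers
years <- as.numeric(as.character(year.f))  # 2010 2020 2010
```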

Data structures

R operates on data structures. A data structure is simply some sort of “container” that holds certain kinds of information.

R has 5 basic data structures:

vector: One dimensional object of a single data type.
matrix: Two dimensional object of a single data type.
array: $n$-dimensional object of a single data type.
data frame: Two dimensional object where each column can be a different data type.
list: An object that can contain elements of different types (and possibly another list inside it).

See R documentation for more info.
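
As a rough sketch, each of the five structures can be created with a base R function. All object names and values below are made up for illustration.

```r
# vector: one dimension, single data type
ex.vector <- c(1.5, 2, 3)

# matrix: two dimensions, single data type (2 rows, 3 columns)
ex.matrix <- matrix(1:6, nrow = 2)

# array: n dimensions, single data type (here 2 x 3 x 4)
ex.array <- array(1:24, dim = c(2, 3, 4))

# data frame: two dimensions, columns may differ in type
ex.frame <- data.frame(id = 1:2, label = c("a", "b"))

# list: elements of different types, including another list
ex.list <- list(ex.vector, "text", list(TRUE, 2L))

length(ex.vector)  # 3
dim(ex.matrix)     # 2 3
dim(ex.array)      # 2 3 4
dim(ex.frame)      # 2 2
length(ex.list)    # 3
```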

Vectors

A vector is a single-dimensional set of data of the same type.

Creating Vectors from Scratch

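The most basic way to create a vector is the combine function c. The first three commands below create vectors of type numeric, character, and logical, respectively. The fourth command mixes types, so R converts every element to a single common type (character).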

x1 <- c(1, 2, 5.3, 6, -2, 4)
x2 <- c("one", "two", "three")
x3 <- c(TRUE, TRUE, FALSE, TRUE)
x4 <- c(TRUE, 3.4, "hello")
typeof(x1)

[1] "double"

typeof(x2)

[1] "character"

typeof(x3)

[1] "logical"

typeof(x4)

[1] "character"

We can check the data structure of an object using commands such as is.vector(), is.list(), is.matrix(), and so on.

is.list(x1)

[1] FALSE

is.vector(x1)

[1] TRUE

is.list(x4)

[1] FALSE

is.vector(x4)

[1] TRUE
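
The vector x4 above ended up as character because a vector holds a single data type, so c() converts mixed inputs to the most flexible type present. A short sketch of this coercion, using made-up values:

```r
# logical and integer combine to integer
typeof(c(TRUE, 2L))  # "integer"

# integer and double combine to double
typeof(c(2L, 3.5))   # "double"

# anything combined with a character becomes character
typeof(c(3.5, "a"))  # "character"

# each element is converted to text
mixed <- c(TRUE, 3.4, "hello")
mixed  # "TRUE" "3.4" "hello"

# in arithmetic, TRUE counts as 1 and FALSE as 0
n.true <- sum(c(TRUE, TRUE, FALSE, TRUE))  # 3
```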

Data Frames

Data frames are two-dimensional data objects and are the fundamental data structure used by most of R’s libraries of functions and data sets.

Tabular data is tidy if each row corresponds to a different observation and each column corresponds to a different variable.

Each column of a data frame is a variable (stored as a vector). If the variable:

Is measured or counted by a number, it is a quantitative or numerical variable.
Groups observations into different categories or rankings, it is a qualitative or categorical variable.

Creating Data Frames from Scratch

Data frames are created by passing vectors into the data.frame() function.

The names of the columns in the data frame are the names of the vectors you give the data.frame function.

Consider the following simple example.

# create basic data frame
d <- c(1, 2, 3, 4)
e <- c("red", "white", "blue", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
df <- data.frame(d,e,f)
df

  d     e     f
1 1   red  TRUE
2 2 white  TRUE
3 3  blue  TRUE
4 4  <NA> FALSE

Naming Column Headers

The columns of a data frame can be renamed using the names() function on the data frame.

# name columns of data frame
names(df) <- c("ID", "Color", "Passed")
df

  ID Color Passed
1  1   red   TRUE
2  2 white   TRUE
3  3  blue   TRUE
4  4  <NA>  FALSE

The columns of a data frame can be named when you are first creating the data frame by using [new_name] = [orig_vec_name] for each vector of data.

# create data frame with better column names
df2 <- data.frame(ID = d, Color = e, Passed = f)
df2

  ID Color Passed
1  1   red   TRUE
2  2 white   TRUE
3  3  blue   TRUE
4  4  <NA>  FALSE

Checking Data Structure

The is.matrix(x) function tests whether or not an object x is a matrix.
The is.vector(x) function tests whether x is a vector.
The is.data.frame(x) function tests whether x is a data frame.

is.matrix(df)

[1] FALSE

is.vector(df)

[1] FALSE

is.data.frame(df)

[1] TRUE
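
Beyond the is.*() checks, a few base R functions describe the size and column types of a data frame. The sketch below rebuilds the small example under the hypothetical name df.demo so that it runs on its own.

```r
# rebuild the example data frame (df.demo is a name chosen for this sketch)
df.demo <- data.frame(ID = c(1, 2, 3, 4),
                      Color = c("red", "white", "blue", NA),
                      Passed = c(TRUE, TRUE, TRUE, FALSE),
                      stringsAsFactors = FALSE)

dim(df.demo)   # 4 3: number of rows, then number of columns
nrow(df.demo)  # 4
ncol(df.demo)  # 3

# data type of each column
col.types <- sapply(df.demo, typeof)
col.types  # ID "double", Color "character", Passed "logical"

# internally, a data frame is a list of equal-length column vectors
is.list(df.demo)  # TRUE
```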

Extracting and Slicing Data Frames

Extracting a Column By Name

The column vectors of a data frame may be extracted using $ and specifying the name of the desired vector.

df$Color would access the Color column of data frame df.

df$Color  # prints column of data frame df named Color

[1] "red"   "white" "blue"  NA
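
Base R offers two bracket forms that also extract by name, and they return different structures. The sketch below rebuilds the example under the hypothetical name df.demo; the other object names are likewise chosen for illustration.

```r
df.demo <- data.frame(ID = c(1, 2, 3, 4),
                      Color = c("red", "white", "blue", NA),
                      Passed = c(TRUE, TRUE, TRUE, FALSE),
                      stringsAsFactors = FALSE)

col.dollar <- df.demo$Color       # a character vector
col.double <- df.demo[["Color"]]  # the same character vector
col.single <- df.demo["Color"]    # a data frame with one column

identical(col.dollar, col.double)  # TRUE
is.data.frame(col.single)          # TRUE

# [[ ]] also works when the column name is stored in another object
wanted <- "Passed"
df.demo[[wanted]]  # TRUE TRUE TRUE FALSE
```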

Slicing Rows and Columns By Indexing

Part of a data frame can also be extracted by thinking of it as a general matrix and specifying the desired rows or columns in square brackets after the object name.

Note: R indexing starts at 1, unlike Python, where indexing starts at 0.

For example, if we had a data frame named df:

df[1,] would access the first row of df.
df[1:2,] would access the first two rows of df.
df[,2] would access the second column of df.
df[1:2, 2:3] would access the information in rows 1 and 2 of columns 2 and 3 of df.

df[,2]  # second column is Color

[1] "red"   "white" "blue"  NA

df[2,]  # second row of df

  ID Color Passed
2  2 white   TRUE

df[1:2,2:3]  # first and second rows of columns 2 and 3

  Color Passed
1   red   TRUE
2 white   TRUE

If you need to select multiple columns of a data frame by name, you can pass a character vector with column names in the column position of [].

df[, c("ID", "Passed")] would extract the ID and Passed columns of df.

df[, c("ID", "Passed")]

  ID Passed
1  1   TRUE
2  2   TRUE
3  3   TRUE
4  4  FALSE

df[, c(1, 3)]  # another way to pick columns 1 and 3

  ID Passed
1  1   TRUE
2  2   TRUE
3  3   TRUE
4  4  FALSE

# another way to pick columns 1 and 3
df[, -2]  # exclude column 2

  ID Passed
1  1   TRUE
2  2   TRUE
3  3   TRUE
4  4  FALSE
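
Two further indexing details are worth a quick sketch: selecting a single column with [ , ] returns a vector unless drop = FALSE is given, and negative indices work for rows as well as columns. The data frame is rebuilt here under the hypothetical name df.demo so the code runs on its own.

```r
df.demo <- data.frame(ID = c(1, 2, 3, 4),
                      Color = c("red", "white", "blue", NA),
                      Passed = c(TRUE, TRUE, TRUE, FALSE),
                      stringsAsFactors = FALSE)

# a single column comes back as a vector
one.col <- df.demo[, 2]

# drop = FALSE keeps the result as a one-column data frame
one.col.df <- df.demo[, 2, drop = FALSE]

# negative row index: everything except row 1
no.first <- df.demo[-1, ]

# a vector of row numbers picks non-adjacent rows
picked <- df.demo[c(1, 4), ]
```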

Importing an External File as a Data Frame

The read.table function imports data from a file into R as a data frame.

Usage: read.table(file, header = TRUE, sep = ",")

file is the file path and name of the file you want to import into R.
- If you don’t know the file path, setting file = file.choose() will bring up a dialog box asking you to locate the file you want to import.
header specifies whether the data file has a header (variable labels for each column of data in the first row of the data file).
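- If you use header = FALSE, then R will assume the file doesn’t have any headings. If you don’t specify this option, R treats the first row as headings only when it contains one fewer field than the other rows; otherwise it assumes there are no headings.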
- header = TRUE tells R to read in the data as a data frame with column names taken from the first row of the data file.
sep specifies the delimiter separating elements in the file.
- If each column of data in the file is separated by a space, then use sep = " "
- If each column of data in the file is separated by a comma, then use sep = ","
- If each column of data in the file is separated by a tab, then use sep = "\t".

Here is an example reading a csv (comma separated file) with a header:

# import data as data frame
bike.store <- read.table(file="https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Data/Transactions.csv",
                         header = TRUE,  # Keep column headers as names
                         sep = ",")  # comma as separator of columns

glimpse(bike.store)

Rows: 20,000
Columns: 13
$ transaction_id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ product_id              <int> 2, 3, 37, 88, 78, 25, 22, 15, 67, 12, 5, 61, 3…
$ customer_id             <int> 2950, 3120, 402, 3135, 787, 2339, 1542, 2459, …
$ transaction_date        <chr> "25-02-2017", "21-05-2017", "16-10-2017", "31-…
$ online_order            <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, F…
$ order_status            <chr> "Approved", "Approved", "Approved", "Approved"…
$ brand                   <chr> "Solex", "Trek Bicycles", "OHM Cycles", "Norco…
$ product_line            <chr> "Standard", "Standard", "Standard", "Standard"…
$ product_class           <chr> "medium", "medium", "low", "medium", "medium",…
$ product_size            <chr> "medium", "large", "medium", "medium", "large"…
$ list_price              <dbl> 71.49, 2091.47, 1793.43, 1198.46, 1765.30, 153…
$ standard_cost           <dbl> 53.62, 388.92, 248.82, 381.10, 709.48, 829.65,…
$ product_first_sold_date <int> 41245, 41701, 36361, 36145, 42226, 39031, 3416…

The glimpse() function provides a nice summary of the structure.
Run the code cell below to see the various options of read.table().
There are other functions and packages that may be better at reading in tabular data. read.table() is a good place to start!

?read.table
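
To experiment with the header and sep options without downloading anything, one possibility is to write a tiny file to a temporary location and read it back. The file contents and object names below are made up for this sketch.

```r
# write a small comma-separated file to a temporary location
tmp.file <- tempfile(fileext = ".csv")
writeLines(c("id,color,passed",
             "1,red,TRUE",
             "2,white,FALSE"), tmp.file)

# first row supplies the column names
demo.a <- read.table(file = tmp.file, header = TRUE, sep = ",")
str(demo.a)

# read.csv() is a wrapper with header = TRUE and sep = "," as defaults
demo.b <- read.csv(tmp.file)

# with header = FALSE, the first row is read as data
no.header <- read.table(file = tmp.file, header = FALSE, sep = ",")
names(no.header)  # "V1" "V2" "V3"
nrow(no.header)   # 3

# remove the temporary file
unlink(tmp.file)
```

Comparing demo.a with no.header shows why the header setting matters: with header = FALSE the labels become a row of data and every column is read as character.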

Logical Statements

Sometimes we need to know if the elements of an object satisfy certain conditions. This can be determined using the logical operators <, <=, >, >=, ==, !=.

== means equal to.
!= means NOT equal to.

Execute the following commands in R and see what you get.

a <- seq(2, 16, by = 2) # creating the vector a
a

[1]  2  4  6  8 10 12 14 16

a > 10

[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

a <= 4

[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

a == 10

[1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

a != 10

[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
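
Since TRUE counts as 1 and FALSE as 0, a logical vector can be summarized directly. The sketch below recreates the vector a from above; the name big is chosen for this example.

```r
a <- seq(2, 16, by = 2)  # 2 4 6 8 10 12 14 16
big <- a > 10            # logical vector, one entry per element of a

sum(big)    # 3: how many elements exceed 10
mean(big)   # 0.375: proportion of elements that exceed 10
which(big)  # 6 7 8: positions of the TRUE entries
a[big]      # 12 14 16: the elements themselves

any(a == 10)  # TRUE: at least one element equals 10
all(a > 0)    # TRUE: every element is positive
```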

And and Or Statements

More complicated logical statements can be made using & and |.

& means “and”
- Both statements must be true for state1 & state2 to return TRUE.
| means “or”
- Only one of the two statements must be true for state1 | state2 to return TRUE.
- If both statements are true in an “or” statement, the statement is also TRUE.

Below is a summary of “and” and “or” logic:

TRUE & TRUE returns TRUE
FALSE & TRUE returns FALSE
FALSE & FALSE returns FALSE
TRUE | TRUE returns TRUE
FALSE | TRUE returns TRUE
FALSE | FALSE returns FALSE

# relationship between logicals & (and), | (or)
TRUE & TRUE

[1] TRUE

FALSE & TRUE

[1] FALSE

FALSE & FALSE

[1] FALSE

TRUE | TRUE

[1] TRUE

FALSE | TRUE

[1] TRUE

FALSE | FALSE

[1] FALSE

Execute the following commands in R and see what you get.

b <- 3  # b is equal to the number 3

# complex logical statements
(b > 6) & (b <= 10)  # FALSE and TRUE

[1] FALSE

(b <= 4) | (b >= 12)  # TRUE or FALSE

[1] TRUE
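
The operators & and | also work element by element on vectors, and they follow specific rules when a value is missing. A brief sketch, reusing the vector a from earlier (the names mid and ends are chosen for this example):

```r
a <- seq(2, 16, by = 2)  # 2 4 6 8 10 12 14 16

# TRUE where an element is strictly between 4 and 12
mid <- (a > 4) & (a < 12)
a[mid]  # 6 8 10

# TRUE where an element is below 4 or above 14
ends <- (a < 4) | (a > 14)
a[ends]  # 2 16

# ! negates a logical vector
a[!mid]  # 2 4 12 14 16

# with missing values, the result is NA only when the answer is truly unknown
NA & FALSE  # FALSE, since "and" with FALSE is always FALSE
NA & TRUE   # NA
NA | TRUE   # TRUE, since "or" with TRUE is always TRUE
```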

Logical Indexing

We can use a logical statement as an index to extract certain entries from a vector or data frame. For example, if we want to know the product_id (column 2), brand (column 7), product_line (column 8), and list_price (column 11) of all transactions that have a list_price greater than $2,090, we can run the code cell below.

We use a logical index for the row to extract just the rows that have a list_price value strictly greater than 2090.
We indicate we want to keep just columns 2, 7 through 8, and 11 with the column index c(2, 7:8, 11).
We store the results to a new data frame named expensive.
Finally, we print the first 6 rows of our new data frame with the head() function to check the results.

expensive <- bike.store[bike.store$list_price > 2090, c(2, 7:8, 11)]
head(expensive)

    product_id         brand product_line list_price
2            3 Trek Bicycles     Standard    2091.47
16           3 Trek Bicycles     Standard    2091.47
69          38 Trek Bicycles     Standard    2091.47
154          3 Trek Bicycles     Standard    2091.47
165          3 Trek Bicycles     Standard    2091.47
188          3 Trek Bicycles     Standard    2091.47
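
When the column used in the condition contains missing values, a logical index returns an extra row of NA for each of them. The sketch below uses a small made-up data frame (sales.demo and its values are hypothetical, not the bike.store data) to show the issue and two common ways around it.

```r
# made-up data with one missing price
sales.demo <- data.frame(product_id = c(3, 38, 12, 5),
                         brand = c("Trek Bicycles", "Trek Bicycles", "Solex", "Norco"),
                         list_price = c(2091.47, NA, 71.49, 1198.46),
                         stringsAsFactors = FALSE)

# the condition is TRUE, NA, FALSE, TRUE; the NA produces a row of NAs
with.na <- sales.demo[sales.demo$list_price > 1000, c("brand", "list_price")]
nrow(with.na)  # 3

# which() keeps only the positions that are TRUE
pricey <- sales.demo[which(sales.demo$list_price > 1000), c("brand", "list_price")]
nrow(pricey)  # 2

# subset() also drops rows where the condition is NA
pricey2 <- subset(sales.demo, list_price > 1000, select = c(brand, list_price))
nrow(pricey2)  # 2
```

Selecting columns by name, as done here, is an alternative to numeric positions such as c(2, 7:8, 11) and keeps working if the column order changes.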

Creative Commons License Information

Statistical Methods: Exploring the Uncertain by Adam Spiegler is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
