Appendix A — Fundamentals of Working with Data

Author

Adam Spiegler, University of Colorado Denver


Introduction


Before we can identify what needs to be cleaned or transformed, we must understand the data types of the variables in our data set and the structure of the data. Getting both right lets us perform statistical analysis more efficiently.

This notebook is intended to be a brief overview of some fundamentals of working with data in R.

These topics are important, and this notebook only scratches the surface of many of them. If you do not find a complete answer here, free online resources cover these concepts in more depth. Below are two such recommended references.

Loading Packages with the library() Command


To explore some fundamentals of working with data in R, we will use the storms data set which is located in the package dplyr.

  • The dplyr package is already installed in Google Colaboratory
  • We still need to use a library command to load the package.
  • Run the code cell below to load the dplyr package.
# load the library of functions and data in dplyr
library(dplyr)
Caution

Each time you connect to or restart a session, you will need to run the library() command again to access the functions and data in a package.
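
One way to confirm that a package is attached in the current session is to look at the search path. The sketch below is not from the original notebook. It uses only the base R functions requireNamespace() and search(), and the variable name dplyr_attached is made up for illustration.

```r
# Attach dplyr only if it is installed on this machine.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)  # attach the package for this session
}

# TRUE only when dplyr is attached in the current session
dplyr_attached <- "package:dplyr" %in% search()
dplyr_attached
```

If this prints FALSE after a restart, run library(dplyr) again before using storms.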

Help Documentation


The functions introduced in this document have robust help documentation with many options to customize their behavior. To view the help documentation for any function used in this document, run commands such as ?typeof, ?is.numeric, ?read.table, and so on.

# access help documentation for storms
?storms  # side panel should open with help manual for storms
# access help documentation for typeof
?typeof

Getting to Know Our Data


The package dplyr contains a data set called storms. Let’s find some useful information about this data.

  • The code cell below will provide a numeric summary of all variables in the storms data.
  • Recall we need to first run the command library(dplyr) in the code cell above to be able to access storms.
# get a numerical summary of all variables
summary(storms)
     name                year          month             day       
 Length:19066       Min.   :1975   Min.   : 1.000   Min.   : 1.00  
 Class :character   1st Qu.:1993   1st Qu.: 8.000   1st Qu.: 8.00  
 Mode  :character   Median :2004   Median : 9.000   Median :16.00  
                    Mean   :2002   Mean   : 8.699   Mean   :15.78  
                    3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:24.00  
                    Max.   :2021   Max.   :12.000   Max.   :31.00  
                                                                   
      hour             lat             long                         status    
 Min.   : 0.000   Min.   : 7.00   Min.   :-109.30   tropical storm     :6684  
 1st Qu.: 5.000   1st Qu.:18.40   1st Qu.: -78.70   hurricane          :4684  
 Median :12.000   Median :26.60   Median : -62.25   tropical depression:3525  
 Mean   : 9.094   Mean   :26.99   Mean   : -61.52   extratropical      :2068  
 3rd Qu.:18.000   3rd Qu.:33.70   3rd Qu.: -45.60   other low          :1405  
 Max.   :23.000   Max.   :70.70   Max.   :  13.50   subtropical storm  : 292  
                                                    (Other)            : 408  
    category          wind           pressure      tropicalstorm_force_diameter
 Min.   :1.000   Min.   : 10.00   Min.   : 882.0   Min.   :   0.0              
 1st Qu.:1.000   1st Qu.: 30.00   1st Qu.: 987.0   1st Qu.:   0.0              
 Median :1.000   Median : 45.00   Median :1000.0   Median : 110.0              
 Mean   :1.898   Mean   : 50.02   Mean   : 993.6   Mean   : 146.3              
 3rd Qu.:3.000   3rd Qu.: 65.00   3rd Qu.:1007.0   3rd Qu.: 220.0              
 Max.   :5.000   Max.   :165.00   Max.   :1024.0   Max.   :1440.0              
 NA's   :14382                                     NA's   :9512                
 hurricane_force_diameter
 Min.   :  0.00          
 1st Qu.:  0.00          
 Median :  0.00          
 Mean   : 14.81          
 3rd Qu.:  0.00          
 Max.   :300.00          
 NA's   :9512            

Missing Data


A missing value occurs when the value of something isn’t known. R uses the special object NA to represent a missing value. If you have a missing value, you should represent that value as NA. Note: The character string "NA" is not the same thing as NA.

  • The storms data has properly coded 14,382 missing values for category since storms that are not hurricanes do not have a category.
  • The storms data has properly coded 9,512 missing values for each of tropicalstorm_force_diameter and hurricane_force_diameter since these values only began being recorded in 2004.
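
As a minimal sketch of how NA behaves, the code below uses a small made-up vector (vals is a hypothetical name, not part of storms). It assumes only base R functions: is.na() to detect missing values and the na.rm argument of mean() to skip them.

```r
# A small made-up vector with one missing value
vals <- c(10, NA, 30)

is.na(vals)                           # FALSE  TRUE FALSE
n_missing <- sum(is.na(vals))         # counts the missing values: 1

mean(vals)                            # NA, because one value is unknown
mean_vals <- mean(vals, na.rm = TRUE) # 20, NA removed before averaging

# The character string "NA" is ordinary text, not a missing value
is.na("NA")                           # FALSE
is.na(NA)                             # TRUE
```

The same pattern, sum(is.na(storms$category)), is one way to reproduce the NA counts reported by summary().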

Assignment to New (or Existing) Objects


To store a data structure in the computer’s memory we must assign it a name.

Data structures can be stored using the assignment operator <- or =.

Some comments:

  • In general, both <- and = can be used for assignment.
  • <- and = can be used identically most of the time, but not always.
  • It’s safer and more conventional to use <- for assignment.
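
One place where <- and = behave differently is inside a function call. The sketch below is an illustration rather than part of the original notebook; the object names m1, m2, x_created, and y_created are made up.

```r
# Inside a function call, = names an argument; it does not create an object.
m1 <- mean(x = c(1, 2, 3))
x_created <- exists("x", inherits = FALSE)  # FALSE if no object named x already exists

# Inside a function call, <- assigns first and then passes the value along.
m2 <- mean(y <- c(1, 2, 3))
y_created <- exists("y", inherits = FALSE)  # TRUE: y now holds 1, 2, 3

m1  # 2
m2  # 2
```

Both calls compute the same mean, but only the second leaves a new object behind, which is one reason to reserve = for function arguments.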

In the following code, we compute the mean of a vector. Why can’t we see the result after running it?

w <- storms$wind  # wind is now stored in w
xbar.w <- mean(w)  # compute mean wind speed and assign to xbar.w
  • Once an object has been assigned a name, it can be printed by executing the name of the object.
xbar.w  # print the mean wind speed to screen
[1] 50.01741
  • We can also print an object to screen using the print() function.
print(xbar.w)  # print the mean with print() command
[1] 50.01741
  • We can calculate, assign, and print the result by putting parentheses around the assignment.
# calculate, assign, and print standard deviation
(s <- sd(w))  # note ( ) around the entire command
[1] 25.50103
  • Sometimes you want to see the result of a code cell, and sometimes you will not.

Basic Data Types


R has 6 basic data types:

  1. character: collections of characters. E.g., "a", "hello world!"
  2. double: decimal numbers. E.g., 1.2, 1.0
  3. integer: whole numbers. In R, you must add L to the end of a number to specify it as an integer. E.g., 1L is an integer but 1 is a double.
  4. logical: Boolean values, TRUE and FALSE
  5. complex: complex numbers. E.g., 1+3i
  6. raw: a type to hold raw bytes.

Checking Data Type Using typeof()


  • The typeof() function returns the R internal type or storage mode of any object.
typeof(1.0)
[1] "double"
typeof(2)
[1] "double"
typeof(3L)
[1] "integer"
typeof("hello")
[1] "character"
typeof(TRUE)
[1] "logical"
typeof(storms$status)
[1] "integer"
typeof(storms$year)
[1] "double"
typeof(storms$name)
[1] "character"
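
The examples above do not show the complex and raw types, and the "integer" result for storms$status can be surprising. The following sketch fills in those cases with made-up values (lvl is a hypothetical factor) and uses class() alongside typeof() to show how a factor is stored.

```r
typeof(1+3i)            # "complex"
typeof(as.raw(255))     # "raw"
typeof(charToRaw("A"))  # "raw"

# A factor is stored as integer codes plus a set of labels (levels)
lvl <- factor(c("low", "high", "low"))
typeof(lvl)      # "integer": the internal storage type
class(lvl)       # "factor": how R treats the object
levels(lvl)      # "high" "low" (sorted alphabetically by default)
as.integer(lvl)  # 2 1 2: position of each value in levels(lvl)
```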

Investigating Data Types with is.numeric()


  • The is.numeric(x) function tests whether or not an object x is numeric.
  • The is.character(x) function tests whether x is a character or not.
  • The is.factor(x) function tests whether x is a factor or not.
Note

Categorical data is typically stored as a factor in R.

is.numeric(storms$year)  # year is numeric
[1] TRUE
is.numeric(storms$category)  # category is also numeric
[1] TRUE
is.numeric(storms$name)  # name is not numeric
[1] FALSE
is.character(storms$name)  # name is character string
[1] TRUE
is.numeric(storms$status)  # status is not numeric
[1] FALSE
is.character(storms$status)  # status is not a character
[1] FALSE
is.factor(storms$status)  # status is a factor which is categorical
[1] TRUE
  • The function str(x) compactly displays the structure of x. For a factor, it reports the number of levels and the codes of the first few values.
str(storms$status)
 Factor w/ 9 levels "disturbance",..: 7 7 7 7 7 7 7 7 8 8 ...

Changing Data Types


Converting to Categorical Data with factor()


  • Sometimes we think a variable is one data type, but it is actually being stored (and thus interpreted by R) as a different data type.
  • One common issue is that categorical data is stored as characters. We would like observations with the same value to be grouped together.
  • The status variable in storms is being properly stored as a factor!
  • The category variable in storms is being stored as numeric since it is ordinal.
  • With an ordinal variable, we may choose to keep it stored as numeric, or we may prefer to treat it as a factor.
summary(storms$category)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.000   1.000   1.898   3.000   5.000   14382 
  • The summary of category computes statistics such as mean and median.
  • Typically with categorical data, we prefer to count how many observations are in each class of the variable.
  • In the code cell below, we convert category to a factor, and then observe the resulting summary.
storms$category <- factor(storms$category)
summary(storms$category)
    1     2     3     4     5  NA's 
 2478   973   579   539   115 14382 
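
The same idea applies to categorical data stored as characters. The sketch below uses a made-up character vector (sizes is hypothetical) and the levels argument of factor() to set the order of the categories explicitly.

```r
# Made-up categorical data stored as character strings
sizes <- c("medium", "large", "medium", "small", "medium")
summary(sizes)  # only reports length, class, and mode

# Convert to a factor, listing the levels in a meaningful order
sizes_f <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_f)   # "small" "medium" "large"
summary(sizes_f)  # counts per level: small 1, medium 3, large 1
table(sizes_f)    # the same counts as a frequency table
```

Without the levels argument, factor() orders the levels alphabetically, which is rarely the natural order for ordinal categories.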

Converting Data Types with as.numeric(), as.integer(), etc.


In the summary of the storms data set above, the variables year and month are summarized as numeric, and both are stored as double. These variables actually take only integer values.

We can convert a variable from one data type to another using as.[new_datatype]().

  • For example, to convert year to integer, we use as.integer(storms$year).
  • To convert a data type to character, we can use as.character(x).
  • To convert to a decimal (double), we can use as.numeric(x).
typeof(storms$year)
[1] "double"
typeof(storms$month)
[1] "double"
storms$year <- as.integer(storms$year)
storms$month <- as.integer(storms$month)
typeof(storms$year)
[1] "integer"
typeof(storms$month)
[1] "integer"
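
Conversions do not always do what we might first expect. The sketch below uses made-up values (codes is a hypothetical factor) to show two common surprises: as.integer() truncates rather than rounds, and as.numeric() applied to a factor returns the internal level codes rather than the labels.

```r
as.integer(3.9)      # 3: truncates toward zero, does not round
as.character(12L)    # "12"
as.numeric("3.14")   # 3.14

# A factor whose labels look like numbers
codes <- factor(c("10", "20", "10"))
as.numeric(codes)                # 1 2 1: the internal level codes
as.numeric(as.character(codes))  # 10 20 10: the labels as numbers
```

This matters for a variable such as category once it has been converted to a factor: convert to character first if you need the original numbers back.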

Data structures


R operates on data structures. A data structure is simply some sort of “container” that holds certain kinds of information.

R has 5 basic data structures:

  • vector: One dimensional object of a single data type.
  • matrix: Two dimensional object of a single data type.
  • array: \(n\) dimensional object of a single data type.
  • data frame: Two dimensional object where each column can be a different data type.
  • list: An object that can contain elements of different types (possibly including another list).

See R documentation for more info.
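
As a rough illustration of the five structures, the sketch below builds one small made-up example of each. All object names (vec, mat, arr, dat, lst) are hypothetical, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
vec <- c(1.5, 2.5, 3.5)                      # vector: one dimension, one type
mat <- matrix(1:6, nrow = 2, ncol = 3)       # matrix: two dimensions, one type
arr <- array(1:24, dim = c(2, 3, 4))         # array: here three dimensions
dat <- data.frame(id = 1:2,                  # data frame: columns may differ in type
                  label = c("a", "b"),
                  stringsAsFactors = FALSE)
lst <- list(1, "two", TRUE, list(4, "five")) # list: mixed types, nested list

dim(mat)             # 2 3
dim(arr)             # 2 3 4
sapply(dat, typeof)  # id is "integer", label is "character"
length(lst)          # 4 elements, the last of which is itself a list
```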

Vectors


A vector is a single-dimensional set of data of the same type.

Creating Vectors from Scratch


The most basic way to create a vector is the combine function c(). The first three commands below create vectors of type numeric, character, and logical, respectively. The fourth mixes types, so R coerces every element to character.

x1 <- c(1, 2, 5.3, 6, -2, 4)
x2 <- c("one", "two", "three")
x3 <- c(TRUE, TRUE, FALSE, TRUE)
x4 <- c(TRUE, 3.4, "hello")
typeof(x1)
[1] "double"
typeof(x2)
[1] "character"
typeof(x3)
[1] "logical"
typeof(x4)
[1] "character"
  • We can check the data structure of an object using commands such as is.vector(), is.list(), is.matrix(), and so on.
is.list(x1)
[1] FALSE
is.vector(x1)
[1] TRUE
is.list(x4)
[1] FALSE
is.vector(x4)
[1] TRUE
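
Because a vector holds a single data type, c() coerces mixed inputs to the most flexible type present, in the order logical, integer, double, character. The sketch below illustrates this with made-up values (mixed is a hypothetical name).

```r
typeof(c(TRUE, 2L))   # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))    # "double"
typeof(c(3.5, "a"))   # "character"

mixed <- c(TRUE, 3.4, "hello")
mixed                 # "TRUE" "3.4" "hello"

# Coercion of logicals to numbers is what makes counting TRUE values work
sum(c(TRUE, TRUE, FALSE, TRUE))  # 3
```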

Data Frames


Data frames are two-dimensional data objects and are the fundamental data structure used by most of R’s libraries of functions and data sets.

  • Tabular data is tidy if each row corresponds to a different observation and each column corresponds to a different variable.

Each column of a data frame is a variable (stored as a vector). If the variable:

  • Is measured or counted by a number, it is a quantitative or numerical variable.
  • Groups observations into different categories or rankings, it is a qualitative or categorical variable.

Creating Data Frames from Scratch


Data frames are created by passing vectors into the data.frame() function.

The names of the columns in the data frame are the names of the vectors you give the data.frame function.

Consider the following simple example.

# create basic data frame
d <- c(1, 2, 3, 4)
e <- c("red", "white", "blue", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
df <- data.frame(d,e,f)
df
  d     e     f
1 1   red  TRUE
2 2 white  TRUE
3 3  blue  TRUE
4 4  <NA> FALSE

Naming Column Headers


The columns of a data frame can be renamed using the names() function on the data frame.

# name columns of data frame
names(df) <- c("ID", "Color", "Passed")
df
  ID Color Passed
1  1   red   TRUE
2  2 white   TRUE
3  3  blue   TRUE
4  4  <NA>  FALSE

The columns of a data frame can be named when you are first creating the data frame by using [new_name] = [orig_vec_name] for each vector of data.

# create data frame with better column names
df2 <- data.frame(ID = d, Color = e, Passed = f)
df2
  ID Color Passed
1  1   red   TRUE
2  2 white   TRUE
3  3  blue   TRUE
4  4  <NA>  FALSE

Checking Data Structure


  • The is.matrix(x) function tests whether or not an object x is a matrix.
  • The is.vector(x) function tests whether x is a vector.
  • The is.data.frame(x) function tests whether x is a data frame.
is.matrix(df)
[1] FALSE
is.vector(df)
[1] FALSE
is.data.frame(df)
[1] TRUE
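
Beyond the is.*() checks, a few base R functions report the size and column types of a data frame. The sketch below rebuilds the small example under the hypothetical name toy so that it runs on its own; stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Rebuild the small example data frame (toy is a made-up name)
toy <- data.frame(ID = c(1, 2, 3, 4),
                  Color = c("red", "white", "blue", NA),
                  Passed = c(TRUE, TRUE, TRUE, FALSE),
                  stringsAsFactors = FALSE)

dim(toy)            # 4 3: rows, then columns
nrow(toy)           # 4
ncol(toy)           # 3
names(toy)          # "ID" "Color" "Passed"
sapply(toy, class)  # "numeric" "character" "logical"
is.list(toy)        # TRUE: a data frame is a list of equal-length columns
```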

Extracting and Slicing Data Frames


Extracting a Column By Name


The column vectors of a data frame may be extracted using $ and specifying the name of the desired vector.

  • df$Color would access the Color column of data frame df.
df$Color  # prints column of data frame df named Color
[1] "red"   "white" "blue"  NA     

Slicing Rows and Columns By Indexing


Part of a data frame can also be extracted by thinking of it as a general matrix and specifying the desired rows or columns in square brackets after the object name.

  • Note that R indexes from 1, unlike Python, which indexes from 0.

For example, if we had a data frame named df:

  • df[1,] would access the first row of df.
  • df[1:2,] would access the first two rows of df.
  • df[,2] would access the second column of df.
  • df[1:2, 2:3] would access the information in rows 1 and 2 of columns 2 and 3 of df.
df[,2]  # second column is Color
[1] "red"   "white" "blue"  NA     
df[2,]  # second row of df
  ID Color Passed
2  2 white   TRUE
df[1:2,2:3]  # first and second rows of columns 2 and 3
  Color Passed
1   red   TRUE
2 white   TRUE

If you need to select multiple columns of a data frame by name, you can pass a character vector with column names in the column position of [].

  • df[, c("ID", "Passed")] would extract the ID and Passed columns of df.
df[, c("ID", "Passed")]
  ID Passed
1  1   TRUE
2  2   TRUE
3  3   TRUE
4  4  FALSE
df[, c(1, 3)]  # another way to pick columns 1 and 3
  ID Passed
1  1   TRUE
2  2   TRUE
3  3   TRUE
4  4  FALSE
# yet another way to pick columns 1 and 3
df[, -2]  # exclude column 2
  ID Passed
1  1   TRUE
2  2   TRUE
3  3   TRUE
4  4  FALSE
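
Notice that df[,2] printed a plain vector while the multi-column selections printed data frames. The sketch below, which rebuilds the example under the hypothetical name toy, shows how the drop argument of [ and the [[ ]] operator control which of the two you get.

```r
# Rebuild the small example data frame (toy is a made-up name)
toy <- data.frame(ID = c(1, 2, 3, 4),
                  Color = c("red", "white", "blue", NA),
                  Passed = c(TRUE, TRUE, TRUE, FALSE),
                  stringsAsFactors = FALSE)

col_vec <- toy[, 2]                # a single column is simplified to a vector
col_df  <- toy[, 2, drop = FALSE]  # drop = FALSE keeps a one-column data frame

is.data.frame(col_vec)  # FALSE
is.data.frame(col_df)   # TRUE

toy[["Color"]]          # same vector as toy$Color
toy["Color"]            # one-column data frame
rest <- toy[-1, ]       # a negative row index excludes row 1
nrow(rest)              # 3
```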

Importing an External File as a Data Frame


The read.table() function imports data from a file into R as a data frame.

Usage: read.table(file, header = TRUE, sep = ",")

  • file is the file path and name of the file you want to import into R.
    • If you don’t know the file path, setting file = file.choose() will bring up a dialog box asking you to locate the file you want to import.
  • header specifies whether the data file has a header (variable labels for each column of data in the first row of the data file).
    • If you don’t specify this option in R or use header = FALSE, then R will assume the file doesn’t have any headings.
    • header = TRUE tells R to read in the data as a data frame with column names taken from the first row of the data file.
  • sep specifies the delimiter separating elements in the file.
    • If each column of data in the file is separated by a space, then use sep = " "
    • If each column of data in the file is separated by a comma, then use sep = ","
    • If each column of data in the file is separated by a tab, then use sep = "\t".

Here is an example reading a CSV (comma-separated values) file with a header:

# import data as data frame
bike.store <- read.table(file="https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Data/Transactions.csv",
                         header = TRUE,  # Keep column headers as names
                         sep = ",")  # comma as separator of columns

glimpse(bike.store)
Rows: 20,000
Columns: 13
$ transaction_id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ product_id              <int> 2, 3, 37, 88, 78, 25, 22, 15, 67, 12, 5, 61, 3…
$ customer_id             <int> 2950, 3120, 402, 3135, 787, 2339, 1542, 2459, …
$ transaction_date        <chr> "25-02-2017", "21-05-2017", "16-10-2017", "31-…
$ online_order            <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, F…
$ order_status            <chr> "Approved", "Approved", "Approved", "Approved"…
$ brand                   <chr> "Solex", "Trek Bicycles", "OHM Cycles", "Norco…
$ product_line            <chr> "Standard", "Standard", "Standard", "Standard"…
$ product_class           <chr> "medium", "medium", "low", "medium", "medium",…
$ product_size            <chr> "medium", "large", "medium", "medium", "large"…
$ list_price              <dbl> 71.49, 2091.47, 1793.43, 1198.46, 1765.30, 153…
$ standard_cost           <dbl> 53.62, 388.92, 248.82, 381.10, 709.48, 829.65,…
$ product_first_sold_date <int> 41245, 41701, 36361, 36145, 42226, 39031, 3416…
  • The glimpse() function provides a nice summary of the structure.
  • Run the code cell below to see the various options of read.table().
  • There are other functions and packages that may be better at reading in tabular data. read.table() is a good place to start!
?read.table
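
To see the effect of header and sep without relying on a network connection, the sketch below writes a tiny made-up tab-separated file to a temporary location and reads it back twice. The file contents and the names tmp, with_header, and no_header are all hypothetical.

```r
# Write a small tab-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tcolor\tprice",
             "1\tred\t9.5",
             "2\tblue\t12"), tmp)

# header = TRUE: the first row supplies the column names
with_header <- read.table(file = tmp, header = TRUE, sep = "\t")
names(with_header)  # "id" "color" "price"
nrow(with_header)   # 2

# header = FALSE: the first row is treated as data, columns are named V1, V2, ...
no_header <- read.table(file = tmp, header = FALSE, sep = "\t")
names(no_header)    # "V1" "V2" "V3"
nrow(no_header)     # 3

unlink(tmp)         # remove the temporary file
```

Reading a file with the wrong header setting is a common cause of numeric columns being imported as character, since the column labels end up mixed in with the values.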

Logical Statements


Sometimes we need to know if the elements of an object satisfy certain conditions. This can be determined using the logical operators <, <=, >, >=, ==, !=.

  • == means equal to.
  • != means NOT equal to.

Execute the following commands in R and see what you get.

a <- seq(2, 16, by = 2) # creating the vector a
a
[1]  2  4  6  8 10 12 14 16
a > 10
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
a <= 4
[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
a == 10
[1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
a != 10
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
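
Since TRUE counts as 1 and FALSE as 0, the result of a comparison can be summarized directly. The sketch below rebuilds the vector a and applies a few base R functions to the logical results.

```r
a <- seq(2, 16, by = 2)  # 2 4 6 8 10 12 14 16

sum(a > 10)    # 3: how many entries exceed 10
mean(a > 10)   # 0.375: the proportion of entries that exceed 10
which(a > 10)  # 6 7 8: the positions of those entries
any(a == 10)   # TRUE: at least one entry equals 10
all(a > 0)     # TRUE: every entry is positive
```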

And and Or Statements


More complicated logical statements can be made using & and |.

  • & means “and”
    • Both statements must be true for state1 & state2 to return TRUE.
  • | means “or”
    • At least one of the two statements must be true for state1 | state2 to return TRUE.
    • If both statements are true in an “or” statement, the statement is also TRUE.

Below is a summary of “and” and “or” logic:

  • TRUE & TRUE returns TRUE
  • FALSE & TRUE returns FALSE
  • FALSE & FALSE returns FALSE
  • TRUE | TRUE returns TRUE
  • FALSE | TRUE returns TRUE
  • FALSE | FALSE returns FALSE
# relationship between logicals & (and), | (or)
TRUE & TRUE
[1] TRUE
FALSE & TRUE
[1] FALSE
FALSE & FALSE
[1] FALSE
TRUE | TRUE
[1] TRUE
FALSE | TRUE
[1] TRUE
FALSE | FALSE
[1] FALSE

Execute the following commands in R and see what you get.

b <- 3  # b is equal to the number 3

# complex logical statements
(b > 6) & (b <= 10)  # FALSE and TRUE
[1] FALSE
(b <= 4) | (b >= 12)  # TRUE or FALSE
[1] TRUE
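
The operators & and | also work element by element on vectors, and they have defined behavior when a value is missing. The sketch below is an illustration with made-up object names (in_range, extremes).

```r
a <- seq(2, 16, by = 2)  # 2 4 6 8 10 12 14 16

# Element-wise "and": entries strictly above 4 and at most 10
in_range <- (a > 4) & (a <= 10)
in_range       # FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

# Element-wise "or": entries below 4 or above 14
extremes <- (a < 4) | (a > 14)
extremes       # TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

!in_range      # ! negates each element

# With a missing value, the result is NA only when the answer is undetermined
NA & FALSE     # FALSE
NA | TRUE      # TRUE
NA & TRUE      # NA
```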

Logical Indexing


We can use a logical statement as an index to extract certain entries from a vector or data frame. For example, if we want to know the product_id (column 2), brand (column 7), product_line (column 8), and list_price (column 11) of all transactions that have a list_price greater than $2,090, we can run the code cell below.

  • We use a logical index for the row to extract just the rows that have a list_price value strictly greater than 2090.
  • We indicate we want to keep just columns 2, 7 through 8, and 11 with the column index c(2, 7:8, 11).
  • We store the results to a new data frame named expensive.
  • Finally, we print the first 6 rows of our new data frame with the head() function to check the results.
expensive <- bike.store[bike.store$list_price > 2090, c(2, 7:8, 11)]
head(expensive)
    product_id         brand product_line list_price
2            3 Trek Bicycles     Standard    2091.47
16           3 Trek Bicycles     Standard    2091.47
69          38 Trek Bicycles     Standard    2091.47
154          3 Trek Bicycles     Standard    2091.47
165          3 Trek Bicycles     Standard    2091.47
188          3 Trek Bicycles     Standard    2091.47
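
Logical indexing needs some care when the column being tested contains missing values. The sketch below uses a small made-up data frame (bikes, with invented rows that only mimic the column names of the transactions data) to compare a raw logical index with two alternatives, which() and subset().

```r
# Made-up data with one missing price
bikes <- data.frame(
  product_id = c(3, 37, 88, 38),
  brand = c("Trek Bicycles", "OHM Cycles", "Solex", "Trek Bicycles"),
  list_price = c(2091.47, 1793.43, NA, 2091.47),
  stringsAsFactors = FALSE
)

keep <- bikes$list_price > 2090  # TRUE FALSE NA TRUE

# An NA in a logical row index produces a row filled with NA
with_na <- bikes[keep, ]
nrow(with_na)                    # 3

# which() returns only the positions that are TRUE, so the NA is skipped
pricey <- bikes[which(keep), ]
nrow(pricey)                     # 2

# subset() also treats NA as "do not keep" and can select columns by name
pricey2 <- subset(bikes, list_price > 2090, select = c(product_id, list_price))
dim(pricey2)                     # 2 2
```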

Creative Commons License Information


Statistical Methods: Exploring the Uncertain by Adam Spiegler is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.