Chapter 1 Quick intro to R.
This chapter skips the download, installation and basic navigation instructions for R and RStudio.
In R, you work with variables. Variables contain values, just like cells in Excel do. They can contain numbers, characters (text), or logical values (TRUE/FALSE, abbreviated T/F). A variable name can contain numbers, but it cannot begin with a number.
A vector in R is a collection of objects of only one kind, i.e., all numbers, all logicals, or all character strings.
1.1 Assigning a value to a variable
You assign a vector, or any other object, to a variable using the assignment characters <-. So if you want x to have a value of 2, you will write x <- 2. Spaces don’t matter; x<-2 would have worked just as well. This is the notation you will see in most R code.
However, there is an alternative to <-, which is the = sign. I personally prefer to use that as it is fewer keys to press.
To see the value in a variable, just type it on the console.
Let us try something hands on:
x = 2
y = 2
x
## [1] 2
y
## [1] 2
x + y
## [1] 4
1.2 Examples of vector operations
Let us create 2 numeric vectors x and y and try a few things.
x = c(2, 3, 4)
y = c(10, 15, 20)
x; y
## [1] 2 3 4
## [1] 10 15 20
Now try the following. Do the results match what you would have intuitively thought? If not, study how R did it.
x + y
x - y
x * y
x/y
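For reference, here is a minimal sketch of what those four expressions produce. R applies each operator element by element, pairing the first value of x with the first value of y, and so on. The result names (total, difference, product, ratio) are only illustrative.

```r
x = c(2, 3, 4)
y = c(10, 15, 20)

total = x + y        # 12 18 24
difference = x - y   # -8 -12 -16
product = x * y      # 20 45 80
ratio = x / y        # 0.2 0.2 0.2

print(total)
print(difference)
print(product)
print(ratio)
```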
1.3 Working directory
R has a base working directory. You can find out what it is on your installation by using the command getwd(). Try it. The working directory is the place where R looks for files you are trying to load, or save.
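A minimal sketch of checking and changing the working directory. setwd() is the base R counterpart to getwd(); it is not covered in the text above, and using tempdir() as the target folder is only an assumption made so the example can run anywhere.

```r
old_dir = getwd()     # remember where we started
print(old_dir)

setwd(tempdir())      # switch to R's per-session temporary folder
new_dir = getwd()
print(new_dir)

setwd(old_dir)        # switch back so nothing else is affected
```

In practice you would pass setwd() the folder that holds your own data files.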
1.4 Using packages
When you start R, it comes with a base set of functionality. You can do a lot of things with the base set of features. You may read or hear the term “base R function”, which means that the function or feature being talked about is part of the base set of functionality available by default, and does not require any special libraries or packages.
However, lots of people have had needs that go beyond base R functionality, and thousands of packages have been created to meet those needs. These packages are freely available for download. Think of them as Excel add-ins. For most advanced tasks, you will need a specialized package. Most experienced R users load as many as a few dozen packages by default to do their work.
Loading packages is a two-step process. First, you need to download them to your local drive. Second, you need to tell R to open or load them. The first step has to be done only once on a computer. The second step is required each time you start R (unless you have set them up to be loaded each time R starts).
Installing packages
We use the command install.packages("package-name") to install a package. This is generally not required in corporate installations as administrators would have already installed the packages you need.
Loading packages
We use the command library(package-name) to load the package into memory. Once done, all the functions in the package become available to us for use on the R console.
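The two steps can be sketched as follows. The package used here, tools, is only a stand-in chosen because it ships with R, so the install step is normally skipped; substitute the package you actually need. requireNamespace() is a base R function that reports whether a package is already installed.

```r
pkg = "tools"  # stand-in package name; replace with the one you need

# Step 1: install only if the package is not already on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Step 2: load it for this session.
# character.only = TRUE tells library() that pkg holds the name as a string.
library(pkg, character.only = TRUE)

# search() lists everything currently attached to the session
print("package:tools" %in% search())
```

When typing at the console you would usually just write library(tools) directly; the string form is useful when the package name is held in a variable.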
1.5 Reading external data
R can read a variety of data formats. To read csv files, we use the command read.csv. So if you have a file called Book1.csv in your working directory, all you will need to do is to type x = read.csv("Book1.csv") on the console to read the file into a data frame called x.
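A self-contained sketch of the round trip. Since Book1.csv may not exist on your machine, this example first writes a small made-up data frame to a temporary file with write.csv() and then reads it back. The file location and the sample values are assumptions for illustration only.

```r
# Made-up sample data, for illustration only
sample_data = data.frame(name = c("A", "B", "C"), score = c(90, 85, 77))

# Write it out as a csv file in R's temporary folder
csv_path = file.path(tempdir(), "Book1.csv")
write.csv(sample_data, csv_path, row.names = FALSE)

# Read the file back into a data frame called x
x = read.csv(csv_path)
print(x)
str(x)
```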
1.6 The data frame
Of all the different types of data objects in R, the data frame is the most useful to data scientists. There are arrays, matrices and lists, but the data frame is the true workhorse that drives much of the work.
The data frame is like a table, or a spreadsheet.
To look at a data frame, type the following commands on the console:
data(mtcars)
mtcars
You will see the mtcars data frame. There are rows, and there are columns. In this case, both rows and columns have names: the column names appear across the top and the row names down the left.
- To look at the top few rows of a data frame, use the command head(dataframe-name). Try it with mtcars: head(mtcars).
- To look at the last few rows of a data frame, try tail(mtcars), or whatever your data frame name is.
- Use summary(mtcars) to look at the summary of the data frame.
- Use nrow(mtcars) to see the number of rows.
- Use dim(mtcars) to see the dimensions of the data frame, number of rows followed by the number of columns.
- Use str(mtcars) to look at the structure of the data frame.
Try all of the above out, don’t just read it!
1.7 How do you refer to a single column of a data frame?
You type the data frame name, followed by the $ sign, then the name of the column.
mean(mtcars$wt) # Give me the mean of the wt column in the mtcars data frame
## [1] 3.21725
We will be using this notation throughout!
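As a small aside, the same column can also be pulled out with double square brackets and the column name in quotes, which is handy when the column name is stored in a variable. This sketch checks that both forms return the same vector; the variable names used are just for illustration.

```r
data(mtcars)

wt_dollar = mtcars$wt          # the $ notation used in this chapter
wt_bracket = mtcars[["wt"]]    # same column, using [[ ]] and a quoted name

col = "wt"                     # column name held in a variable
wt_variable = mtcars[[col]]

print(identical(wt_dollar, wt_bracket))
print(mean(wt_variable))
```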
1.8 Subsetting
Subsetting is one of the most basic skills you will need to know in R. Subsetting is extracting the data you need from an R object. For example, consider the mtcars data. Now imagine you are interested in cars where mpg>20. You may also be interested in only the number of cylinders and the displacement columns. Extracting just the elements you are interested in is called subsetting.
1.8.1 Vectors
Vectors are subsetted by the [] operator, with the elements you need placed inside these brackets.
x = c(letters) # letters is a built-in constant that holds a to z. LETTERS holds A to Z
x
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
length(x) # Count the number of elements in vector x
## [1] 26
x[1] # Extract the first element
## [1] "a"
x[4] # Extract the fourth element
## [1] "d"
x[1:3] # Extract the first three elements
## [1] "a" "b" "c"
x[c(1,3,5)] # extract the 1st, 3rd and 5th elements
## [1] "a" "c" "e"
x[x>"g"] # extract elements greater than g
## [1] "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
## [18] "y" "z"
x = 1:15 # Reset the variable to contain values
x # List what is in x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
x[x>9] # List all values of x greater than 9
## [1] 10 11 12 13 14 15
length(x[x>9]) # Count how many values in x are greater than 9
## [1] 6
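To see why x[x>9] works, it may help to look at the condition on its own. The comparison produces one TRUE or FALSE per element, and the brackets keep only the positions marked TRUE. A minimal sketch follows; the names keep and positions are illustrative, and which() and sum() are base R functions.

```r
x = 1:15

keep = x > 9               # one TRUE/FALSE per element of x
print(keep)

print(x[keep])             # same result as x[x > 9]

positions = which(keep)    # positions of the TRUE values
print(positions)

print(sum(keep))           # TRUE counts as 1, so this counts the matches
```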
1.8.2 Removing NA values
Your data will very often have NA values. NA marks a missing value, much like NULL in a database. You have to do something about them, as many functions return NA or fail when their input contains NAs.
x = c("Alpha", "Bravo", NA, "Delta", NA, "Fox")
is.na(x) # Gives a boolean vector evaluating each member of x
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
!is.na(x) # ! stands for NOT. It will change TRUE to FALSE and vice-versa.
## [1] TRUE TRUE FALSE TRUE FALSE TRUE
x[is.na(x)] # we can use subsetting with the is.na function to extract only the NA values of x
## [1] NA NA
x[!is.na(x)] # This though is more useful
## [1] "Alpha" "Bravo" "Delta" "Fox"
x = x[!is.na(x)] # We set x to be only its non-NA values
1.8.3 Subsetting data frames
We will be subsetting data frames the most often, as most data will be data frame based.
There are two ways to subset data frames: using base R functionality, and using dplyr, a package written by Hadley Wickham. Base R functionality to subset data frames can appear cumbersome. dplyr, on the other hand, is a bit more intuitive. We will try both, and you can choose whichever you are most comfortable with.
1.8.3.1 Create a data frame:
tt = data.frame(age = c(1), ht = c(3))
tt
## age ht
## 1 1 3
1.8.3.2 Add new rows to a data frame:
tt = rbind(tt, c(3, 4), c(7, 8))
tt
## age ht
## 1 1 3
## 2 3 4
## 3 7 8
1.8.3.2.1 Add a column to a data frame:
tt = cbind(tt, new = c(8, 9, 0))
tt
## age ht new
## 1 1 3 8
## 2 3 4 9
## 3 7 8 0
1.8.3.3 Or to add a column, you can do it simply without using cbind:
cbind means bind a column, or column bind.
tt = data.frame(price = c(2,4,3), quantity = c(10,20,5))
tt$NewCol = tt$price * tt$quantity
tt
## price quantity NewCol
## 1 2 10 20
## 2 4 20 80
## 3 3 5 15
1.8.3.4 Change column names
tt
## price quantity NewCol
## 1 2 10 20
## 2 4 20 80
## 3 3 5 15
colnames(tt)
## [1] "price" "quantity" "NewCol"
colnames(tt)[3] = "value"
tt
## price quantity value
## 1 2 10 20
## 2 4 20 80
## 3 3 5 15
1.8.3.4.1 Adding or modifying a column
tt$new = tt$quantity * 0.5
1.8.3.4.2 Add a blank column with zeroes (or NAs)
tt = cbind(tt, another=0)
tt
## price quantity value new another
## 1 2 10 20 5.0 0
## 2 4 20 80 10.0 0
## 3 3 5 15 2.5 0
1.8.3.5 Remove a column by setting it to NULL
tt$new = NULL
tt
## price quantity value another
## 1 2 10 20 0
## 2 4 20 80 0
## 3 3 5 15 0
1.8.3.6 Removing all rows in a data frame that have at least one NA in it
x = data.frame(company = c("Goldman", "JPMC", "BofA", "Citi"), shareprice = c(235, 96, NA, 73))
x
## company shareprice
## 1 Goldman 235
## 2 JPMC 96
## 3 BofA NA
## 4 Citi 73
na.omit(x) # All rows of data frame x without an NA
## company shareprice
## 1 Goldman 235
## 2 JPMC 96
## 4 Citi 73
x = na.omit(x) # Same as the above, only that we store this back in x, and the NA is permanently removed
x
## company shareprice
## 1 Goldman 235
## 2 JPMC 96
## 4 Citi 73
Setting na.rm = TRUE also takes care of NAs.
x = data.frame(company = c("Goldman", "JPMC", "BofA", "Citi"), shareprice = c(235, 96, NA, 73))
mean(x$shareprice) # Returns NA (not an error) because shareprice has an NA in it.
## [1] NA
mean(x$shareprice, na.rm = TRUE) # Now we are good
## [1] 134.6667
Selecting rows and columns that meet a certain condition generally takes the form x[conditions for rows here, conditions for columns here]. This means that even if you are filtering only rows, you still need to insert the comma.
Selecting some rows based on a condition (akin to auto-filter in Excel). (Results not shown, please try these out on your console.)
mtcars[mtcars$mpg > 22, ] # Select all rows where mpg is greater than 22
mtcars[, c("mpg", "cyl")] # Extracting the columns mpg and cyl. Note there is nothing before the comma, which is telling R to select all rows.
mtcars[mtcars$mpg > 22, c("mpg", "cyl")] # All rows where mpg > 22 and the columns mpg and cyl
mtcars[mtcars$mpg > 22 | mtcars$mpg < 18, c("mpg", "cyl")] # Rows where mpg is greater than 22 or less than 18. Note that | stands for OR and & stands for AND.
mtcars[mtcars$mpg > 22 | mtcars$mpg < 18, -c(1,2)] # The same rows, with all columns except the first and the second.
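Base R also offers subset(), which expresses the same row and column selection without repeating the data frame name. It is not used elsewhere in this chapter, so treat the sketch below as an optional alternative; the variable names are illustrative.

```r
data(mtcars)

# Bracket form, as shown above
by_brackets = mtcars[mtcars$mpg > 22, c("mpg", "cyl")]

# subset() form: the row condition first, then the columns to keep
by_subset = subset(mtcars, mpg > 22, select = c(mpg, cyl))

print(by_subset)

# Check that the two approaches agree
print(isTRUE(all.equal(by_brackets, by_subset)))
```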
1.9 Other useful functions
1.9.1 table
table is a great function to provide a very quick summary of counts. For example, in the mtcars dataset if you want to see how many cars of each cylinder type are included in the data set, you just do the following:
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
How about if you want a cross-tab of cyl versus number of gears?
table(mtcars$cyl, mtcars$gear)
##
## 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
How about if you want to throw in number of carburetors? The output has been suppressed as you should try it out.
table(mtcars$cyl, mtcars$gear, mtcars$carb)
1.9.2 unique
unique gives you the distinct values for anything. Suppose you want to know which are the distinct gears in the data set.
unique(mtcars$gear)
## [1] 4 3 5
Sometimes there may be too many values and you are only interested in knowing how many unique values exist, and don’t really care about the values themselves. You can calculate the length of the vector returned by unique to know that.
length(unique(mtcars$mpg))
## [1] 25
1.9.3 quantile
quantile is an extremely useful function that gives you the quantiles (roughly, percentiles) for any data. Suppose you wish to know the 50th, 70th, 80th, 90th, 95th and 99th percentiles for mpg in the mtcars dataset.
quantile(mtcars$mpg, c(0.5, 0.7, 0.8, 0.9, 0.95, 0.99 ))
## 50% 70% 80% 90% 95% 99%
## 19.200 21.470 24.080 30.090 31.300 33.435
1.9.4 help
help: the help function provides help on any function. Just type help(function-name), e.g., help(summary), or its equivalent ?summary.
If you do not remember the exact function name, try ??summary, for example, and R will search the help pages for that term.
1.9.5 read_csv
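readr::read_csv: reads a csv file using the readr package. The col_types argument lets you declare the type of each column, with one letter per column.
library(readr) # read_csv comes from the readr package, which must be installed first
x = read_csv("filename", col_types="ccciiinnn") # c = character, i = integer, n = number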
1.9.6 seq
seq is a function that provides you a sequence of numbers, separated by whatever gap you specify.
1:9 #lists 1 through 9
## [1] 1 2 3 4 5 6 7 8 9
seq(1, 9, 0.5) # lists 1 through 9, separated by 0.5
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0
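Instead of the gap, seq can also be given the number of values you want through its length.out argument, and it works out the gap itself. A short sketch, with illustrative variable names:

```r
by_gap = seq(1, 9, by = 2)              # gap of 2: 1 3 5 7 9
by_count = seq(1, 9, length.out = 5)    # 5 evenly spaced values: 1 3 5 7 9

print(by_gap)
print(by_count)
```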
1.9.7 rep
rep is similar to seq; it repeats something any number of times.
rep(1, 10)
## [1] 1 1 1 1 1 1 1 1 1 1
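rep also accepts a whole vector, and its times and each arguments control whether the vector is repeated as a block or element by element. A small sketch of the difference, with illustrative variable names:

```r
whole = rep(c("a", "b"), times = 2)   # repeat the whole vector: "a" "b" "a" "b"
paired = rep(c("a", "b"), each = 2)   # repeat each element:     "a" "a" "b" "b"

print(whole)
print(paired)
```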
1.9.8 ifelse
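ifelse: checks a condition for each element of a vector and returns one value where the condition is TRUE and another where it is FALSE.
zip = c(10001, -1, 60614) # hypothetical zip codes, for illustration only
zip_wrong = ifelse(zip < 0, TRUE, FALSE) # TRUE where the zip code is negative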
1.9.9 merge
merge: joins data frames.
Imagine two data frames df1 and df2. You specify the common fields on which to join in the by.x (the column in df1) and by.y (the column in df2) parameters. Setting all = TRUE creates an outer join, i.e., rows with no match in the other data frame are kept, with NA filling the missing columns in the merged dataset.
merged_df = merge(df1, df2, by.x = "id_column", by.y = "id_col_name", all = TRUE)
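A worked sketch of the merge call, using two small made-up data frames. The names and scores are hypothetical and only there so the join has something to match on. The second call leaves out the all argument to show the default inner join for comparison.

```r
# Hypothetical data: ids 2 and 3 appear in both data frames
df1 = data.frame(id_column = c(1, 2, 3), name = c("Ann", "Bob", "Cal"))
df2 = data.frame(id_col_name = c(2, 3, 4), score = c(80, 90, 70))

# Outer join: keeps ids 1 to 4, with NA where one side has no match
merged_df = merge(df1, df2, by.x = "id_column", by.y = "id_col_name", all = TRUE)
print(merged_df)

# Default (inner) join: keeps only ids present in both, i.e. 2 and 3
inner_df = merge(df1, df2, by.x = "id_column", by.y = "id_col_name")
print(inner_df)
```

Note that the key column in the result takes its name from by.x, so it appears as id_column.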