Chapter 3 Dplyer

The dplyr package makes selecting rows and columns really simple. Here are the key commands:

First, make sure you have dplyr installed. To do that, type install.packages("dplyr") on the console. Then, load it with the library command as below.

library(dplyr)

Following are the main commands for dplyr
* select: return selected columns of a data frame
* filter: extract rows from a data frame based on conditions
* arrange: sort rows
* mutate: add new variables/columns or transform existing variables
* summarize: akin to GROUP BY, summarizes variables in a data frame
* %>%: connects multiple actions in a plain English-y way, allowing more readable code.


3.1 select

When giving the select command, we specify the dataset first. Then we list the columns we want, or don’t want.

Here are some examples of using select. We use the mtcars dataset. Commands are self-explanatory, the output is hidden for brevity. Try them out!

data(mtcars)
select(mtcars, mpg, cyl) # From mtcars, select columns mpg and cyl
select(mtcars, mpg, cyl, 5:8) # Select mpg, cyl, and columns 5:8
select(mtcars, mpg, cyl, 5:8, -6) # Select mpg, cyl, and columns 5:8, but not the 6th
select(mtcars, contains("a")) # All columns that contain the character 'a' 
select(mtcars, starts_with("c")) # All columns that start with the letter c
select(mtcars, ends_with("p")) # All columns that end with the letter p

Look at the columns in mtcars.

head(mtcars, 2)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

We can also use

select(mtcars, hp:vs) # Select all columns from hp to vs.  This is truly flexible!

3.2 filter

We use filter to extract the rows we want. The pipe symbol (|) represents OR, and & represents AND conditions. Or the Try the below commands out! (You must try)

Note the ==. That means equal to. A single equal to sign is considered the assignment symbol.

mtcars$car = rownames(mtcars) #Okay, so we need to do this because unfortunately dplyr ignors rownames. Sigh.  
filter(mtcars, disp==160 & mpg==21) #From mtcars, show rows where displacement=160 and mpg=21.
##   mpg cyl disp  hp drat    wt  qsec vs am gear carb           car
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4     Mazda RX4
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4 Mazda RX4 Wag
filter(mtcars, mpg>22 | mpg <12, cyl>6) # Rows where mpg is either more than 22 or less than 12 *AND* there are more than 6 cylinders
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb                 car
## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4  Cadillac Fleetwood
## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4 Lincoln Continental

3.3 Using select and filter together

You can nest select and filter together, for example if you wish to pick only certain rows and columns. For example, you might like to pick only the mpg, cyl, disp and car name from the set of rows where displacement is 160 and mpg is 21.

  select(filter(mtcars, disp==160 & mpg==21), mpg, cyl, disp, car)
##   mpg cyl disp           car
## 1  21   6  160     Mazda RX4
## 2  21   6  160 Mazda RX4 Wag

3.4 arrange

Arrange essentially means sort. You can have a multilevel sort by putting in multiple columns. To sort descending, add desc to the column name as in the example below.

arrange(mtcars, mpg) #sort mtcars by mpg
arrange(mtcars, desc(mpg)) #sort mtcars in descending order of mpg
arrange(mtcars, cyl, desc(mpg)) #sort mtcars by cylinders, then mpg in desc order

3.5 summarize

This is an extremely useful command. It allows us to group data together and perform calculations on groups (a bit like, though not identical to, pivots in Excel).
The command takes the format summarize(group_by(df, colname), function(colname))

summarize(group_by(mtcars, cyl), mean(disp)) #Summarize rows in mtcars by cyl, then show me the mean for displacement
## # A tibble: 3 x 2
##     cyl `mean(disp)`
##   <dbl>        <dbl>
## 1     4         105.
## 2     6         183.
## 3     8         353.

Play around with this a bit.

summarize(group_by(mtcars, cyl, gear), mean(carb), mean(wt)) # Note group_by both cyl and gear
summarize(group_by(mtcars, cyl, gear), carbmean=mean(carb), wtmean=mean(wt), count=n()) # giving names to columns returned, n() gives count
summarize(group_by(mtcars, gear), n_distinct(carb), count=n()) # Total count under grouping and distinct carburetor counts

3.6 mutate

Mutate adds new columns based on a calculation, eg, mutate(dataframe, newcolname = calculation).
Best explained by an example.

# Simple example.  To illustrate, we create a data frame with 6 banks' share prices.
x = data.frame(company = c("Goldman", "JPMC", "BofA", "Citi", "BNP", "Deutsche", "Barclays"), 
               shareprice = c(235, 96, 25, 73, 68, 17, 10), 
               region=c(rep("US",4), rep("EU", 3))) #Note how the rep function has been used here

x
##    company shareprice region
## 1  Goldman        235     US
## 2     JPMC         96     US
## 3     BofA         25     US
## 4     Citi         73     US
## 5      BNP         68     EU
## 6 Deutsche         17     EU
## 7 Barclays         10     EU
# Assume we want to add a column that multiplies the USD share price by 100 to get the JPY equivalent.
mutate(x, JPY = 100 * shareprice)
# But that is a very simple example, real life is complex.
# Assume we want to see the average share price by region.  Easy with summarize and group_by.
summarize(group_by(x, region), mean(shareprice))
#
# What we would like to do is to add this calculation to each row in the main table.  
# We can do it with `mutate` by replacing summarize with mutate as below.
mutate(group_by(x, region), region_avg = mean(shareprice))