Chapter 3 Dplyer
The dplyr
package makes selecting rows and columns really simple. Here are the key commands:
First, make sure you have dplyr installed. To do that, type install.packages("dplyr")
on the console. Then, load it with the library
command as below.
library(dplyr)
Following are the main commands for dplyr
* select
: return selected columns of a data frame
* filter
: extract rows from a data frame based on conditions
* arrange
: sort rows
* mutate
: add new variables/columns or transform existing variables
* summarize
: akin to GROUP BY, summarizes variables in a data frame
* %>%
: connects multiple actions in a plain English-y way, allowing more readable code.
3.1 select
When giving the select command, we specify the dataset first. Then we list the columns we want, or don’t want.
Here are some examples of using select. We use the mtcars dataset. Commands are self-explanatory, the output is hidden for brevity. Try them out!
data(mtcars)
select(mtcars, mpg, cyl) # From mtcars, select columns mpg and cyl
select(mtcars, mpg, cyl, 5:8) # Select mpg, cyl, and columns 5:8
select(mtcars, mpg, cyl, 5:8, -6) # Select mpg, cyl, and columns 5:8, but not the 6th
select(mtcars, contains("a")) # All columns that contain the character 'a'
select(mtcars, starts_with("c")) # All columns that start with the letter c
select(mtcars, ends_with("p")) # All columns that end with the letter p
Look at the columns in mtcars.
head(mtcars, 2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
We can also use
select(mtcars, hp:vs) # Select all columns from hp to vs. This is truly flexible!
3.2 filter
We use filter
to extract the rows we want. The pipe symbol (|) represents OR, and & represents AND conditions. Or the Try the below commands out! (You must try)
Note the ==
. That means equal to. A single equal to sign is considered the assignment symbol.
mtcars$car = rownames(mtcars) #Okay, so we need to do this because unfortunately dplyr ignors rownames. Sigh.
filter(mtcars, disp==160 & mpg==21) #From mtcars, show rows where displacement=160 and mpg=21.
## mpg cyl disp hp drat wt qsec vs am gear carb car
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4 Mazda RX4
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 Mazda RX4 Wag
filter(mtcars, mpg>22 | mpg <12, cyl>6) # Rows where mpg is either more than 22 or less than 12 *AND* there are more than 6 cylinders
## mpg cyl disp hp drat wt qsec vs am gear carb car
## 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 Cadillac Fleetwood
## 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 Lincoln Continental
3.3 Using select
and filter
together
You can nest select and filter together, for example if you wish to pick only certain rows and columns. For example, you might like to pick only the mpg, cyl, disp and car name from the set of rows where displacement is 160 and mpg is 21.
select(filter(mtcars, disp==160 & mpg==21), mpg, cyl, disp, car)
## mpg cyl disp car
## 1 21 6 160 Mazda RX4
## 2 21 6 160 Mazda RX4 Wag
3.4 arrange
Arrange essentially means sort. You can have a multilevel sort by putting in multiple columns. To sort descending, add desc
to the column name as in the example below.
arrange(mtcars, mpg) #sort mtcars by mpg
arrange(mtcars, desc(mpg)) #sort mtcars in descending order of mpg
arrange(mtcars, cyl, desc(mpg)) #sort mtcars by cylinders, then mpg in desc order
3.5 summarize
This is an extremely useful command. It allows us to group data together and perform calculations on groups (a bit like, though not identical to, pivots in Excel).
The command takes the format summarize(group_by(df, colname), function(colname))
summarize(group_by(mtcars, cyl), mean(disp)) #Summarize rows in mtcars by cyl, then show me the mean for displacement
## # A tibble: 3 x 2
## cyl `mean(disp)`
## <dbl> <dbl>
## 1 4 105.
## 2 6 183.
## 3 8 353.
Play around with this a bit.
summarize(group_by(mtcars, cyl, gear), mean(carb), mean(wt)) # Note group_by both cyl and gear
summarize(group_by(mtcars, cyl, gear), carbmean=mean(carb), wtmean=mean(wt), count=n()) # giving names to columns returned, n() gives count
summarize(group_by(mtcars, gear), n_distinct(carb), count=n()) # Total count under grouping and distinct carburetor counts
3.6 mutate
Mutate adds new columns based on a calculation, eg, mutate(dataframe, newcolname = calculation)
.
Best explained by an example.
# Simple example. To illustrate, we create a data frame with 6 banks' share prices.
x = data.frame(company = c("Goldman", "JPMC", "BofA", "Citi", "BNP", "Deutsche", "Barclays"),
shareprice = c(235, 96, 25, 73, 68, 17, 10),
region=c(rep("US",4), rep("EU", 3))) #Note how the rep function has been used here
x
## company shareprice region
## 1 Goldman 235 US
## 2 JPMC 96 US
## 3 BofA 25 US
## 4 Citi 73 US
## 5 BNP 68 EU
## 6 Deutsche 17 EU
## 7 Barclays 10 EU
# Assume we want to add a column that multiplies the USD share price by 100 to get the JPY equivalent.
mutate(x, JPY = 100 * shareprice)
# But that is a very simple example, real life is complex.
# Assume we want to see the average share price by region. Easy with summarize and group_by.
summarize(group_by(x, region), mean(shareprice))
#
# What we would like to do is to add this calculation to each row in the main table.
# We can do it with `mutate` by replacing summarize with mutate as below.
mutate(group_by(x, region), region_avg = mean(shareprice))