Chapter 2 Excel to R
This chapter deals with providing R equivalent actions for the most common Excel operations.
2.1 Dealing with NAs
2.1.1 To replace all NAs in a dataframe with a single value:
x[is.na(x)] = 0
The same logic applies if we want to do this for a single or a set of columns. Instead of x
, we use the subsetted values of the dataframe in the two places in the code above where x
appears.
dataframe_name[,c("colname")][is.na(dataframe_name[,c("colname")])] = 0
2.1.2 Keeping only complete rows
Another way to do the above is to use the na.omit() function
df <- na.omit(df)
2.1.3 Identifying rows with no NAs
The function complete.cases(dataframe)
gives a logical (T F T T etc) vector of whether there is any NA value in a row.
So the following will return a data frame with all the rows that have at least one NA value excluded.
df = df[complete.cases(df),] # to just retain the complete rows.
df = df[!complete.cases(df),] # to just retain the NA rows. The ! sign reverses the logical values.
This looks the same as na.omit but with one important difference - complete.cases gives you a vector that is more flexible. So if you are only concerned with NA values in only a few of the columns rather than all columns, then complete.cases
provides a way to do it that is not possible with na.omit()
eg df[complete.cases(df[ ,1]), ]
if one were only concerned with NA values in the first column.
2.1.4 na.rm = TRUE
na.rm
is an argument used in some functions to exclude NA values (eg mean(mtcars$hp, na.rm = TRUE)
) . It is not a function by itself.
2.2 Add columns
2.2.1 Using base R
Pretty straightforward, here is an example where we add a new column to the dataset mtcars which is equal to the sum of columns hp and drat.
mtcars$newcol = mtcars$hp + mtcars$drat
2.2.2 Using dplyr
mutate
You add columns quite easily with dplyr’s mutate function as follows:
mutate(mtcars, hpdrat = hp + drat)
# Note that this provides the result without changing mtcars. To update mtcars, you have to:
mtcars = mutate(mtcars, hpdrat = hp + drat)
2.3 Delete a column
2.3.1 Set the column to null, eg
mtcars$newcol = NULL
2.3.2 Use subsets
Subset by putting a minus sign in front of a vector of the column numbers. This method only works with column numbers, not names.
df = df[,-c(1, 2)]
2.3.3 Delete using column names
In the below code names(mtcars) %in% c("mpg", "cyl", "disp")
gives a logical vector (equal in length to the number of columns) of whether each column name was found in the c("mpg", "cyl", "disp")
.
mtcars = mtcars[ , !(names(mtcars) %in% c("mpg", "cyl", "disp"))]
2.4 Change a column
df$new = df$new * 0.5
2.5 Add rows
Use rbind
eg
m = mtcars[1:5,]
m2 = mtcars[8:10,]
m = rbind(m, m2)
2.6 Reorder columns
mtcars = mtcars[c(1,3,2, 4:7)]
2.7 Delete a row
Subset by putting a minus sign in front of a vector of the row numbers. This method only works with row numbers.
mtcars = mtcars[-c(2, 4, 6), ]
2.8 Filter rows based on criteria
2.8.1 Base R
Use subsetting
mtcars[mtcars$hp>100, ]
mtcars[mtcars$hp>100 & mtcars$disp<300, ] # Use & for AND.
mtcars[(mtcars$hp>100 & mtcars$disp<300) | mtcars$cyl==8, ] # Use | for OR.
2.8.2 Using dplyr
We use filter
to extract the rows we want. The pipe symbol (|) represents OR, and & represents AND conditions. Or the Try the below commands out! (You must try)
Note the ==
. That means equal to. A single equal to sign is considered the assignment symbol.
mtcars$car = rownames(mtcars) #Okay, so we need to do this because unfortunately dplyr ignors rownames. Sigh.
filter(mtcars, disp==160 & mpg==21) #From mtcars, show rows where displacement=160 and mpg=21.
filter(mtcars, mpg>22 | mpg <12, cyl>6) # Rows where mpg is either more than 22 or less than 12 *AND* there are more than 6 cylinders
2.9 Pick only some columns
2.9.1 Using base R
2.9.2 Using dplyr
When giving the select command, we specify the dataset first. Then we list the columns we want, or don’t want.
Here are some examples of using select. We use the mtcars dataset. Commands are self-explanatory, the output is hidden for brevity. Try them out!
data(mtcars)
select(mtcars, mpg, cyl) # From mtcars, select columns mpg and cyl
select(mtcars, mpg, cyl, 5:8) # Select mpg, cyl, and columns 5:8
select(mtcars, mpg, cyl, 5:8, -6) # Select mpg, cyl, and columns 5:8, but not the 6th
select(mtcars, contains("a")) # All columns that contain the character 'a'
select(mtcars, starts_with("c")) # All columns that start with the letter c
select(mtcars, ends_with("p")) # All columns that end with the letter p
Look at the columns in mtcars.
head(mtcars, 2)
We can also use
select(mtcars, hp:vs) # Select all columns from hp to vs. This is truly flexible!
2.10 Bin data using cut
The cut
function can be used to split a numeric column into a defined number of categories, or at different ‘cut’ points. The function returns a factor vector with the format (a, b]; which means a is the lower limit of the bin, and b is the higher limit of the bin. The (
indicates that the value has to be greater than a, and ]
indicates that the value has to be lower than or equal to b.
mtcars$mpgnew = cut(mtcars$mpg, 4) # Create 4 breaks
mtcars$mpgnew = cut(mtcars$mpg, breaks = c(0, 15, 20, 24, 100)) # Create breaks at arbitrary points
mtcars$mpgnew = cut(mtcars$mpg, breaks = c(0, 15, 20, 24, 100), labels = c("L1", "L2", "L3", "L4")) # Use custom labels instead of (a, b]
Note that new column that cut returns is a factor. This can be used to split
the dataframe by the new categories. eg, the below gives us four dataframes in a list
mtcars$mpgnew = cut(mtcars$mpg, breaks = c(0, 15, 20, 24, 100), labels = c("L1", "L2", "L3", "L4")) # Use custom labels instead of (a, b]
split(mtcars, mtcars$mpgnew)
2.11 Replace a value
2.11.1 Replace a value anywhere in the dataframe (ie in all columns)
df[df=="OldValue"] = "NewValue"
2.11.2 Replace a value in a particular column(s)
Essentially the same as above except replace references to df by a subset of df as df[,colnames]
df[,c("colname")][df[,c("colname")]=="OldValue"] = "NewValue"
2.11.3 Use recode
Use dplyr’s recode function
df$colname = recode(df$colname, "OldValue1" = "NewValue1", "OldValue2" = "NewValue2", .default = df$colname)
#The last piece is necessary otherwise dplyr will replace everything that you don't want changed with NAs.
#Also if you are changing numeric values, then you still need to put the numeric value in quotes, eg, "4" = ".4"
#In such a case, the .default will need to be set to be equal to `as.character(df$colname)`.
2.12 Sort
2.12.1 Using base R
mtcars = mtcars[order(mtcars$mpg),] #order ascending
mtcars = mtcars[order(mtcars$mpg, mtcars$cyl),] #order ascending using two variables
mtcars = mtcars[order(-mtcars$mpg),] #order descending
mtcars = mtcars[order(-mtcars$mpg, mtcars$cyl),] #order descending mpg, ascending cyl
2.12.2 dplyr arrange
Arrange essentially means sort. You can have a multilevel sort by putting in multiple columns. To sort descending, add desc
to the column name as in the example below.
arrange(mtcars, mpg) #sort mtcars by mpg
arrange(mtcars, desc(mpg)) #sort mtcars in descending order of mpg
arrange(mtcars, cyl, desc(mpg)) #sort mtcars by cylinders, then mpg in desc order
2.13 Pivot
2.13.1 Using base R
Use the table
command to pivot. The first field is rows, the second is the columns, and any additional ones are the page levels. The numbers that appear are only counts, not sums, averages or anything else!
table(mtcars$cyl, mtcars$gear, mtcars$am)
2.13.2 Using reshape2
You can get pivot table equivalents using reshape2::dcast
. Read the reshape section.
2.14 vlookup
merge
: Joins data frames
We can get the same results as a vlookup in R by using merge, only that we should limit the lookup table dataframe to just two columns - the value being looked up, and the look up value itself.
Imagine two data frames df1 and df2. You need to specify the common fields in the two data frames on which to join in the by.x and by.y parameters. Putting all=T creates an outer join, ie, if no corresponding value found in df2, then have a row with NA in the merged dataset.
merged_df = merge(df1, df2[,c("looked_up_value", "lookup")], by.x = "id_column", by.y = "id_col_name", all.x = TRUE)
In merge
, use all.x = TRUE
for a left join, all.y = TRUE
for a right join, all = T
for an outer join, and use nothing for an inner join (intersection)
2.15 If Else
ifelse
:
zip_wrong = ifelse(zip < 0, TRUE, FALSE)
2.16 Ramdom selections
When picking a certain number of sample rows from a dataframe, there are two approaches:
2.16.1 Use rbinom
n = rbinom(n = nrow(mtcars),size=1,prob=.5) #n will be a vector with 0s and 1s, the prob for getting a 1 being 50%
mtcars[n, ] #pick the row numbers where n==1
2.16.2 Use sample_n
from dplyr
mtcars = sample_n(mtcars, 8) #where 8 is the number of samples, and mtcars is the dataframe