Chapter 2 Excel to R

This chapter deals with providing R equivalent actions for the most common Excel operations.

2.1 Dealing with NAs

2.1.1 To replace all NAs in a dataframe with a single value:

x[is.na(x)] = 0

The same logic applies if we want to do this for a single or a set of columns. Instead of x, we use the subsetted values of the dataframe in the two places in the code above where x appears.

 dataframe_name[,c("colname")][is.na(dataframe_name[,c("colname")])] = 0

2.1.2 Keeping only complete rows

Another way to do the above is to use the na.omit() function

df <- na.omit(df)

2.1.3 Identifying rows with no NAs

The function complete.cases(dataframe) gives a logical (T F T T etc) vector of whether there is any NA value in a row.

So the following will return a data frame with all the rows that have at least one NA value excluded.

df = df[complete.cases(df),] # to just retain the complete rows.
df = df[!complete.cases(df),] # to just retain the NA rows.  The ! sign reverses the logical values.

This looks the same as na.omit but with one important difference - complete.cases gives you a vector that is more flexible. So if you are only concerned with NA values in only a few of the columns rather than all columns, then complete.cases provides a way to do it that is not possible with na.omit() eg df[complete.cases(df[ ,1]), ] if one were only concerned with NA values in the first column.

2.1.4 `na.rm = TRUE`

na.rm is an argument used in some functions to exclude NA values (eg mean(mtcars$hp, na.rm = TRUE)) . It is not a function by itself.

2.2 Add columns

2.2.1 Using base R

Pretty straightforward, here is an example where we add a new column to the dataset mtcars which is equal to the sum of columns hp and drat.

 mtcars$newcol = mtcars$hp + mtcars$drat

2.2.2 Using `dplyr` mutate

You add columns quite easily with dplyr’s mutate function as follows:

mutate(mtcars, hpdrat = hp + drat)
# Note that this provides the result without changing mtcars.  To update mtcars, you have to:

mtcars = mutate(mtcars, hpdrat = hp + drat)

2.3 Delete a column

2.3.1 Set the column to null, eg

mtcars$newcol = NULL

2.3.2 Use subsets

Subset by putting a minus sign in front of a vector of the column numbers. This method only works with column numbers, not names.

df = df[,-c(1, 2)]

2.3.3 Delete using column names

In the below code names(mtcars) %in% c("mpg", "cyl", "disp") gives a logical vector (equal in length to the number of columns) of whether each column name was found in the c("mpg", "cyl", "disp").

mtcars = mtcars[ , !(names(mtcars) %in% c("mpg", "cyl", "disp"))]

2.4 Change a column

df$new = df$new * 0.5

2.5 Add rows

Use rbind

m = mtcars[1:5,]
m2 = mtcars[8:10,]

m = rbind(m, m2)

2.6 Reorder columns

mtcars = mtcars[c(1,3,2, 4:7)]

2.7 Delete a row

Subset by putting a minus sign in front of a vector of the row numbers. This method only works with row numbers.

mtcars = mtcars[-c(2, 4, 6), ]

2.8 Filter rows based on criteria

2.8.1 Base R

Use subsetting

mtcars[mtcars$hp>100, ]

mtcars[mtcars$hp>100 & mtcars$disp<300, ] # Use & for AND.

mtcars[(mtcars$hp>100 & mtcars$disp<300) | mtcars$cyl==8, ] # Use | for OR.

2.8.2 Using `dplyr`

We use filter to extract the rows we want. The pipe symbol (|) represents OR, and & represents AND conditions. Or the Try the below commands out! (You must try)

Note the ==. That means equal to. A single equal to sign is considered the assignment symbol.

mtcars$car = rownames(mtcars) #Okay, so we need to do this because unfortunately dplyr ignors rownames. Sigh.  
filter(mtcars, disp==160 & mpg==21) #From mtcars, show rows where displacement=160 and mpg=21.
filter(mtcars, mpg>22 | mpg <12, cyl>6) # Rows where mpg is either more than 22 or less than 12 *AND* there are more than 6 cylinders

2.9 Pick only some columns

2.9.1 Using base R

2.9.2 Using `dplyr`

When giving the select command, we specify the dataset first. Then we list the columns we want, or don’t want.

Here are some examples of using select. We use the mtcars dataset. Commands are self-explanatory, the output is hidden for brevity. Try them out!

data(mtcars)
select(mtcars, mpg, cyl) # From mtcars, select columns mpg and cyl
select(mtcars, mpg, cyl, 5:8) # Select mpg, cyl, and columns 5:8
select(mtcars, mpg, cyl, 5:8, -6) # Select mpg, cyl, and columns 5:8, but not the 6th
select(mtcars, contains("a")) # All columns that contain the character 'a' 
select(mtcars, starts_with("c")) # All columns that start with the letter c
select(mtcars, ends_with("p")) # All columns that end with the letter p

Look at the columns in mtcars.

head(mtcars, 2)

We can also use

select(mtcars, hp:vs) # Select all columns from hp to vs.  This is truly flexible!

2.10 Bin data using `cut`

The cut function can be used to split a numeric column into a defined number of categories, or at different ‘cut’ points. The function returns a factor vector with the format (a, b]; which means a is the lower limit of the bin, and b is the higher limit of the bin. The ( indicates that the value has to be greater than a, and ] indicates that the value has to be lower than or equal to b.

mtcars$mpgnew = cut(mtcars$mpg, 4) # Create 4 breaks
mtcars$mpgnew = cut(mtcars$mpg, breaks = c(0, 15, 20, 24, 100)) # Create breaks at arbitrary points
mtcars$mpgnew = cut(mtcars$mpg, breaks = c(0, 15, 20, 24, 100), labels = c("L1", "L2", "L3", "L4")) # Use custom labels instead of (a, b]

Note that new column that cut returns is a factor. This can be used to split the dataframe by the new categories. eg, the below gives us four dataframes in a list

mtcars$mpgnew = cut(mtcars$mpg, breaks = c(0, 15, 20, 24, 100), labels = c("L1", "L2", "L3", "L4")) # Use custom labels instead of (a, b]
split(mtcars, mtcars$mpgnew)

2.11 Replace a value

2.11.1 Replace a value anywhere in the dataframe (ie in all columns)

df[df=="OldValue"] = "NewValue"

2.11.2 Replace a value in a particular column(s)

Essentially the same as above except replace references to df by a subset of df as df[,colnames]

df[,c("colname")][df[,c("colname")]=="OldValue"] = "NewValue"

2.11.3 Use `recode`

Use dplyr’s recode function

df$colname = recode(df$colname, "OldValue1" = "NewValue1", "OldValue2" = "NewValue2", .default = df$colname)
#The last piece is necessary otherwise dplyr will replace everything that you don't want changed with NAs.
#Also if you are changing numeric values, then you still need to put the numeric value in quotes, eg, "4" = ".4"
#In such a case, the .default will need to be set to be equal to `as.character(df$colname)`.

2.12 Sort

2.12.1 Using base R

mtcars = mtcars[order(mtcars$mpg),] #order ascending
mtcars = mtcars[order(mtcars$mpg, mtcars$cyl),] #order ascending using two variables
mtcars = mtcars[order(-mtcars$mpg),] #order descending
mtcars = mtcars[order(-mtcars$mpg, mtcars$cyl),] #order descending mpg, ascending cyl

2.12.2 dplyr `arrange`

Arrange essentially means sort. You can have a multilevel sort by putting in multiple columns. To sort descending, add desc to the column name as in the example below.

arrange(mtcars, mpg) #sort mtcars by mpg
arrange(mtcars, desc(mpg)) #sort mtcars in descending order of mpg
arrange(mtcars, cyl, desc(mpg)) #sort mtcars by cylinders, then mpg in desc order

2.13 Pivot

2.13.1 Using base R

Use the table command to pivot. The first field is rows, the second is the columns, and any additional ones are the page levels. The numbers that appear are only counts, not sums, averages or anything else!

table(mtcars$cyl, mtcars$gear, mtcars$am)

2.13.2 Using reshape2

You can get pivot table equivalents using reshape2::dcast. Read the reshape section.

2.14 vlookup

merge: Joins data frames

We can get the same results as a vlookup in R by using merge, only that we should limit the lookup table dataframe to just two columns - the value being looked up, and the look up value itself.

Imagine two data frames df1 and df2. You need to specify the common fields in the two data frames on which to join in the by.x and by.y parameters. Putting all=T creates an outer join, ie, if no corresponding value found in df2, then have a row with NA in the merged dataset.

merged_df = merge(df1, df2[,c("looked_up_value", "lookup")], by.x = "id_column", by.y = "id_col_name", all.x = TRUE)

In merge, use all.x = TRUE for a left join, all.y = TRUE for a right join, all = T for an outer join, and use nothing for an inner join (intersection)

2.15 If Else

ifelse:

zip_wrong = ifelse(zip < 0, TRUE, FALSE)

2.16 Ramdom selections

When picking a certain number of sample rows from a dataframe, there are two approaches:

2.16.1 Use `rbinom`

n = rbinom(n = nrow(mtcars),size=1,prob=.5) #n will be a vector with 0s and 1s, the prob for getting a 1 being 50%
mtcars[n, ] #pick the row numbers where n==1

2.16.2 Use `sample_n` from dplyr

mtcars = sample_n(mtcars, 8) #where 8 is the number of samples, and mtcars is the dataframe