Chapter 14 Logistic Regression

14.1 The basics - Bernoulli and Binomial

A Bernoulli trial is a single random experiment with exactly two possible outcomes, coded 0 or 1. For example, one toss of a single coin.

The binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials that all share the same probability of success.

R's rbinom function takes three arguments:
- n: number of observations (how many times is/are the coin(s) tossed?)
- size: number of trials in each observation (how many coins are tossed together?)
- prob: probability of success of each coin toss.

For example, if 5 coins are tossed together 3 times, this would be rbinom(n = 3, size = 5, prob = 0.5), which might return [1] 2 3 3, ie the number of heads in each of the 3 rounds.
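The relationship between the two distributions can be sketched with rbinom alone. This is a minimal illustration; the seed value and the variable names are arbitrary choices, not part of the original text.

```r
set.seed(42)  # arbitrary seed, only so the draws are repeatable

# Bernoulli: ten single coin tosses, each result is 0 or 1
bernoulli_draws <- rbinom(n = 10, size = 1, prob = 0.5)

# Binomial: 3 observations, each counting heads among 5 coins
binomial_draws <- rbinom(n = 3, size = 5, prob = 0.5)

# Summing 5 Bernoulli trials gives one binomial (size = 5) observation
one_binomial <- sum(rbinom(n = 5, size = 1, prob = 0.5))

print(bernoulli_draws)
print(binomial_draws)
print(one_binomial)
```

A Bernoulli trial is therefore just the special case size = 1.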

14.2 Odds and log odds

Odds = p / (1-p)

Log of the odds = Logit function = log(p/(1-p))

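Base R has no function named logit; its built-in equivalents are qlogis (the logit) and plogis (the inverse logit). Functions named logit and invlogit are provided by add-on packages such as arm.

The inverse of logit is the function invlogit: invlogit(x) = exp(x) / (1 + exp(x)).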

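Odds (ie p/(1-p)) go from 0 to Inf. Taking the log stretches this to the whole real line: log odds go from -Inf to Inf, which is why they can be modeled with a linear function of the predictors. invlogit maps any real number back to a probability between 0 and 1.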
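The ranges above can be checked numerically. The sketch below uses only base R (qlogis and plogis), so no extra package is assumed; the probabilities chosen are arbitrary.

```r
p <- c(0.1, 0.5, 0.9)

# Odds: always non-negative, equal to 1 when p = 0.5
odds <- p / (1 - p)

# Log odds: negative below p = 0.5, zero at 0.5, positive above
log_odds <- log(odds)

# qlogis() is base R's logit, so it should agree with log(odds)
logit_base <- qlogis(p)

# plogis() is the inverse logit and should recover the probabilities
p_back <- plogis(log_odds)

print(odds)
print(log_odds)
print(p_back)
```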

14.3 Fitting a logistic regression model

Logistic regression is modeled as:
log(odds) = beta0 + beta1*x1 + beta2*x2 + ... + betan*xn
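To see how this linear equation turns into a probability, the sketch below plugs in made-up coefficients for a single predictor. The values of b0 and b1 are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical coefficients, for illustration only
b0 <- -1
b1 <- 0.5

x <- c(0, 2, 4)

# Linear predictor on the log-odds scale
log_odds <- b0 + b1 * x

# Inverse logit converts log odds to probabilities
prob <- plogis(log_odds)

# exp(b1) is the factor by which the odds change per unit increase in x
odds_ratio <- exp(b1)

print(log_odds)
print(prob)
print(odds_ratio)
```

At x = 2 the log odds are 0, which corresponds to a probability of exactly 0.5.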

Logistic regression allows us to predict binary values. It predicts a dichotomous dependent variable based on one or more categorical or continuous independent variables.

We try to predict the variable vs in the mtcars dataset.

What does this ‘vs’ variable mean? It records whether the engine is V-shaped (0) or straight (1). In a V engine the cylinders and pistons are aligned in two separate planes or ‘banks’, so that they appear to form a “V”. The straight or inline engine is an internal-combustion engine with all cylinders aligned in one row.

We will need to convert it to a factor first of course.

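library(caret)  # provides createDataPartition, train and confusionMatrix

mtcars$vs <- factor(mtcars$vs)
table(mtcars$vs)
##  0  1 
## 18 14
# Note: the split is stratified on cyl, not on the outcome vs
inTrain <- createDataPartition(y = mtcars$cyl, p = 0.75, list = FALSE)
training <- mtcars[inTrain, ]
testing <- mtcars[-inTrain, ]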

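Next we fit the logistic regression model. With so few observations the two classes can be almost perfectly separable, in which case glm may warn that fitted probabilities are numerically 0 or 1. The model can still be used for prediction, but its coefficient estimates should then be read with caution.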

fit <- train(vs ~ wt + qsec,  data=training, method="glm", family="binomial")
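For comparison, roughly the same model can be fitted with base R's glm, which is what caret calls under the hood for method = "glm". The sketch below is an assumption-laden simplification: it fits on the full mtcars data rather than the training split, uses a fresh copy of the data with vs left as 0/1, and the names cars, fit_glm, prob and pred_class are illustrative. It may print the same kind of warnings.

```r
cars <- datasets::mtcars  # fresh copy; vs is numeric 0/1 here

# Binomial family with the default logit link = logistic regression
fit_glm <- glm(vs ~ wt + qsec, data = cars, family = binomial)

# Coefficients are on the log-odds scale
print(coef(fit_glm))

# type = "response" returns probabilities that vs == 1
prob <- predict(fit_glm, type = "response")

# Classify with a 0.5 cut-off (a common default, not the only choice)
pred_class <- ifelse(prob > 0.5, 1, 0)

print(table(Predicted = pred_class, Actual = cars$vs))
```

Because the coefficients are log odds, exp(coef(fit_glm)) gives the corresponding odds ratios.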

Check the model by printing it:

fit
## Generalized Linear Model 
## 25 samples
##  2 predictor
##  2 classes: '0', '1' 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 25, 25, 25, 25, 25, 25, ... 
## Resampling results:
##   Accuracy   Kappa    
##   0.9400635  0.8585464
# confusionMatrix expects the predictions first, then the true (reference) values
confusionMatrix(predict(fit, newdata = testing), testing$vs)
## Confusion Matrix and Statistics
##           Reference
## Prediction 0 1
##          0 3 0
##          1 0 4
##                Accuracy : 1          
##                  95% CI : (0.5904, 1)
##     No Information Rate : 0.5714     
##     P-Value [Acc > NIR] : 0.01989    
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4286     
##          Detection Rate : 0.4286     
##    Detection Prevalence : 0.4286     
##       Balanced Accuracy : 1.0000     
##        'Positive' Class : 0          
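The headline numbers in this output can be recomputed by hand from the 2x2 table. The sketch below types the table in directly and uses base R's binom.test for the interval and the test against the no-information rate; this appears to match the printed values, though it is a reconstruction rather than caret's own code.

```r
# Confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(c(3, 0, 0, 4), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference = c("0", "1")))

n <- sum(cm)
correct <- sum(diag(cm))

accuracy <- correct / n              # 7 of 7 correct
nir <- max(colSums(cm)) / n          # share of the largest class, 4/7
prevalence <- sum(cm[, "0"]) / n     # share of the 'positive' class 0, 3/7

# The 'positive' class is 0
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Exact binomial interval for the accuracy: about (0.5904, 1)
ci <- binom.test(correct, n)$conf.int

# One-sided test that accuracy exceeds the NIR: about 0.01989
p_value <- binom.test(correct, n, p = nir, alternative = "greater")$p.value

print(c(accuracy = accuracy, nir = nir, prevalence = prevalence))
print(ci)
print(p_value)
```

With only 7 test cars, even perfect accuracy leaves a wide confidence interval, so the result should not be over-interpreted.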