Chapter 14 Logistic Regression
14.1 The basics - Bernoulli and Binomial
A Bernoulli trial is a single experiment with exactly two possible outcomes, coded 0 or 1. For example, a single toss of one coin.
The binomial distribution describes a collection of Bernoulli trials: it counts the number of successes in n independent Bernoulli trials that share the same probability of success.
In R, rbinom() has three parameters:
- n, the number of observations (how many times are the coins tossed?)
- size, the number of trials in each observation (how many coins are tossed together?)
- prob, the probability of success on each coin toss
e.g. if 5 coins are tossed together 3 times, then this would be rbinom(n = 3, size = 5, prob = 0.5), giving us, for example, [1] 2 3 3 (the exact counts are random draws).
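The example above can be made reproducible by fixing the random seed first (the seed value here is an arbitrary choice):

```r
# Simulate tossing 5 coins together, repeated 3 times
set.seed(42)  # arbitrary seed so the draws are repeatable
tosses <- rbinom(n = 3, size = 5, prob = 0.5)
tosses  # three counts of heads, each between 0 and 5
```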
14.2 Odds and log odds
Odds = p / (1 - p)
Log of the odds = logit(p) = log(p / (1 - p))
In R, the logit is calculated using the function logit, and its inverse is the function invlogit (both available in, e.g., the arm package).
Odds (i.e. p/(1-p)) go from 0 to Inf. Taking the log maps them onto the whole real line (-Inf to Inf), which is exactly the scale a linear model produces. invlogit maps back from the real line to the (0, 1) probability scale.
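As a quick sanity check, base R already provides these functions: qlogis() is the logit and plogis() is its inverse (equivalent to the logit/invlogit functions mentioned above):

```r
p <- 0.8
log_odds <- qlogis(p)   # log(p / (1 - p)) = log(4) ~ 1.386
plogis(log_odds)        # inverse logit, recovers 0.8
```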
14.3 Fitting a logistic regression model
Logistic Regression is modeled as:
log(odds) = log(p / (1 - p)) = beta0 + beta1*x1 + beta2*x2 + ... + betan*xn
Logistic regression allows us to predict binary values. It predicts a dichotomous dependent variable based on one or more categorical or continuous independent variables.
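To see how the pieces fit together, here is a sketch with made-up coefficient values (beta0 = -1, beta1 = 0.5 are hypothetical, chosen only for illustration): the linear predictor gives the log-odds, and the inverse logit converts it to a probability.

```r
# Hypothetical coefficients, for illustration only
beta0 <- -1
beta1 <- 0.5
x1 <- 3

log_odds <- beta0 + beta1 * x1  # linear predictor: 0.5
p <- plogis(log_odds)           # inverse logit: probability of the "1" class
p                               # about 0.62
```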
We try to predict the variable vs in the mtcars dataset.
What does this ‘vs’ variable mean? It indicates whether the engine is V-shaped (vs = 0) or straight (vs = 1). In a V engine the cylinders and pistons are aligned in two separate planes or ‘banks’, so that they appear to form a “V”. A straight (inline) engine is an internal-combustion engine with all cylinders aligned in one row.
Since vs is coded as 0/1, we will of course need to convert it to a factor first so that caret treats this as a classification problem.
library(caret)
data(mtcars)
mtcars$vs = factor(mtcars$vs)
table(mtcars$vs)
##
## 0 1
## 18 14
inTrain <- createDataPartition(y=mtcars$cyl, p=0.75, list=FALSE)
training <- mtcars[inTrain,]
testing <- mtcars[-inTrain,]
Next we fit the logistic regression model. Because the partition is random, your split and the numbers below may differ unless you set a seed. We may see some warnings from glm (for example about fitted probabilities numerically 0 or 1), but the model may still work.
fit <- train(vs ~ wt + qsec, data=training, method="glm", family="binomial")
Check the model:
fit
## Generalized Linear Model
##
## 25 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 25, 25, 25, 25, 25, 25, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9400635 0.8585464
confusionMatrix(predict(fit, newdata = testing), testing$vs)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3 0
## 1 0 4
##
## Accuracy : 1
## 95% CI : (0.5904, 1)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.01989
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4286
## Detection Rate : 0.4286
## Detection Prevalence : 0.4286
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
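To interpret the fitted coefficients, note that caret stores the underlying glm object in fit$finalModel; equivalently, the same model can be fit with base glm(), as sketched below on the full mtcars data. exp() of a coefficient gives the odds ratio: the multiplicative change in the odds of vs = 1 per unit increase in that predictor. Exact values will depend on the data used to fit.

```r
data(mtcars)
# Fit the same model with base glm() for interpretation
m <- glm(vs ~ wt + qsec, data = mtcars, family = binomial)
coef(m)       # coefficients on the log-odds scale
exp(coef(m))  # odds ratios per unit increase in each predictor
```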