Generalized linear model (R code)
R packages
- car
- Mass
- lme4 (liner mixed model)
1.Linear regression model
- Linear relationship of coefficients
- Normality of residuals (Shapilo wilk test, Q-Q plot, histogram)
- Homoscedasticity (statistical test, variance does not depend on level of X)
- Independent observation (i.e., no correlation between samples)
#plot
RR_plot <- ggplot(data, aes(x = X, y = Y, colour = males))
RR_plot <- RR_plot + geom_point()
RR_plot <- RR_plot + scale_colour_brewer(palette = "Set1")
RR_plot <- RR_plot + geom_smooth(method = "lm", formula = y ~ x, aes(col = “lift”))
RR_plot <- RR_plot + labs(title = "Linear regression", x = "Number of drivers (log10)", y = "deaths (log10)")
RR_plot <- RR_plot + theme_classic() + theme(
axis.line = element_line(linewidth = 1.0, lineend = "square"),
text = element_text(colour ="black", size = 14),
legend.position = "none",)
#linear model
res_lm <- lm(Y ~ X + males + X*males, data = data)
summary(res_lm)
#Regression and confidence interval
X_new <- data.frame(X = 23:60)
Conf <- predict(res_lm, X_new, interval = ‘confidence’, level = 0.95)
Conf <- predict(res_lm, X_new, interval = ‘prediction’, level = 0.95)
- Normality & Homoscedasticity
- Outliners
- Collinearity
- Coversion of variables
Interaction - Confounding
- Fitness or predictive performance
Conversion of varaibles, interaction, confunding could be identified with likelihood test.(But Baysian modeling is better)
- Fit the model containing the variables we want to test and save the log-likelihood.
- Fit the model removing the variables we want to test and save the log-likelihood.
- Compare the two log-likelihoods to perform the likelihood ratio test (e.g., testing whether interaction is present)
- Compute the odds ratio
Step-wise method (should not be used)
lm_model <- lm(data = fev, logfev ~ Age + Hgt + sex + smoke)
ols_step_both_p(lm_model, pent = 0.05, prem = 0.051) #The P-value of 0.05 is the threshold.
ols_step_best_subset(lm_model)
2.Logistic regression analysis
- Output is binary variable
- Linear relationship of coefficient
- Independent observations
#pin : プロット領域, par : 余白の指定
par(pin = c(3,3))
par(mar = c(3, 3, 3, 3))
p1 <- ggplot(data, aes(x = Clinical, y = WBE)) #WBE: 0 or 1
p1 <- p1+ geom_point(alpha = 0.4, shape = 21, size = 2, colour = "black", fill = "grey") + scale_x_log10()
p1 <- p1 + stat_smooth(method = glm, method.args = list(family = binomial), fullrange = TRUE)
lo_1 <- glm(disease ~ passive, data = ex_1[ex_1$personal ==0, ], family = binomial(link = “logit”))
summary(lo_1)
#compute the odds ratio
library(oddsratio)
or_glm(ex_1, lo_1, incr = list(passive = 1))
- likelihood ratio test
- Interaction
- Confounding
- Goodness of fit (Hosmer-Lemeshow test)
- Goodness of fit (Outlinear & ROC curve)
3. Multiple impuation
- library(mice)
- library(Amelia)
- library (miceadds)
- R code
詳しくは、”欠測データ処理 Rによる単一代入法と多重代入法, P97
library(mice)
library(miceadds) #library(Amelia)を使用するのもよい
# 例となるデータの作成 (# 一部の値を欠損させる)
set.seed(123)
mydata <- data.frame(
y = rbinom(100, 1, 0.5), # 応答変数(0か1の二項変数)
x1 = rnorm(100), # 説明変数1
x2 = rnorm(100), # 説明変数2
x3 = rnorm(100) # 説明変数3
)
mydata[sample(1:100, 20), "x1"] <- NA
mydata[sample(1:100, 20), "x2"] <- NA
# 欠損値を多重代入で補完(mice)
imp <- mice(mydata, m = 5, method = "pmm", seed = 500)
# 補完されたデータセットでロジスティック回帰モデルをフィッティング
logit_model <- with.mids(imp, glm(y ~ x1 + x2 + x3, family = binomial()))
pooled_results <- pool(logit_model)
summary(pooled_results)