This method uses a generalized linear model to estimate the effect of each level of a factor predictor on the outcome. These values are retained to serve as the new encodings for the factor levels. This approach is sometimes referred to as likelihood encoding. The embed package has two estimation methods for accomplishing this: with and without pooling.
The example used here is the Grant data from Kuhn and Johnson (2013); these data are used to predict whether a grant application was accepted. One predictor, the sponsor, is represented by the factor variable sponsor_code. The frequencies of the sponsor codes in the data set used here vary between 0 and 1874 per code. There are 298 sponsor codes in the data. Rather than producing 297 indicator variables for a model, a single numeric variable can be used to represent the effect or impact of each factor level on the outcome.
In this case, where a factor outcome is being predicted (accepted or not), the effects are quantified by the log-odds, for each sponsor code, of a grant being accepted.
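To make that concrete, a quick worked example (matching sponsor code 100D in the output further below): a sponsor with 6 accepted grants out of 10 applications has a raw log-odds of
log(0.6 / (1 - 0.6))
## [1] 0.4054651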
We first calculate the raw log-odds for the data (independent of any model):
library(tidymodels)
library(embed)
# ggiraph is used later for the interactive plots
library(ggiraph)

tidymodels_prefer()
theme_set(theme_bw() + theme(legend.position = "top"))

data(grants)

props <-
  grants_other %>%
  group_by(sponsor_code) %>%
  summarise(
    prop = mean(class == "successful"),
    log_odds = log(prop / (1 - prop)),
    n = length(class)
  ) %>%
  mutate(label = paste0(gsub("_", " ", sponsor_code), " (n=", n, ")"))

props %>%
  select(-label)
## # A tibble: 291 × 4
## sponsor_code prop log_odds n
## <fct> <dbl> <dbl> <int>
## 1 100D 0.6 0.405 10
## 2 101A 0.125 -1.95 16
## 3 103C 0.111 -2.08 9
## 4 105A 0.167 -1.61 6
## 5 107C 1 Inf 2
## 6 10B 1 Inf 1
## 7 111C 0.167 -1.61 6
## 8 112D 0.667 0.693 12
## 9 113A 0.5 0 10
## 10 118B 0.5 0 2
## # ℹ 281 more rows
# later, for plotting
rng <- extendrange(props$log_odds[is.finite(props$log_odds)], f = 0.1)
In subsequent sections, a logistic regression model is used. When the outcome variable is numeric, the steps automatically use linear regression models to estimate effects.
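As a minimal sketch of the numeric-outcome case, using the no-pooling step described in the next section on a small simulated data set (not the grants data); because the outcome is numeric, a linear model is used internally:
# Hypothetical numeric outcome: the step falls back to a linear model,
# storing one estimated effect per factor level
set.seed(1)
sim_dat <- tibble(
  x = factor(sample(letters[1:3], 100, replace = TRUE)),
  y = rnorm(100)
)
recipe(y ~ x, data = sim_dat) %>%
  step_lencode_glm(x, outcome = vars(y)) %>%
  prep() %>%
  tidy(number = 1)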
No Pooling
In this case, the effect of each sponsor code is estimated separately, one per factor level. One method for conducting this estimation is to fit a logistic regression with the acceptance classification as the outcome and the sponsor code as the sole predictor; because logistic regression works on the log-odds scale, the per-level estimates are naturally log-odds.
For these data, a recipe is created and step_lencode_glm is used:
grants_glm <-
  recipe(class ~ ., data = grants_other) %>%
  # specify the variable being encoded and the outcome
  step_lencode_glm(sponsor_code, outcome = vars(class)) %>%
  # estimate the effects
  prep(training = grants_other)
The tidy method can be used to extract the encodings; these are then merged with the raw estimates computed above.
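A sketch of the extraction step (assuming the encoding step is the first step in the recipe; this likely reproduces the output shown below):
glm_estimates <-
  tidy(grants_glm, number = 1) %>%
  dplyr::select(-terms, -id)

glm_estimates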
## # A tibble: 292 × 2
## level value
## <chr> <dbl>
## 1 100D 4.05e- 1
## 2 101A -1.95e+ 0
## 3 103C -2.08e+ 0
## 4 105A -1.61e+ 0
## 5 107C 1.66e+ 1
## 6 10B 1.66e+ 1
## 7 111C -1.61e+ 0
## 8 112D 6.93e- 1
## 9 113A 0
## 10 118B 3.14e-16
## # ℹ 282 more rows
glm_estimates <-
glm_estimates %>%
set_names(c("sponsor_code", "glm")) %>%
inner_join(props, by = "sponsor_code")
For the sponsor codes whose raw log-odds are finite (i.e., acceptance rates strictly between zero and one), the two sets of estimates are effectively the same:
glm_estimates %>%
dplyr::filter(is.finite(log_odds)) %>%
mutate(difference = log_odds - glm) %>%
dplyr::select(difference) %>%
summary()
## difference
## Min. :-1.776e-15
## 1st Qu.:-2.220e-16
## Median : 0.000e+00
## Mean :-2.789e-17
## 3rd Qu.: 2.220e-16
## Max. : 8.882e-16
Note that there is also an effect that is used for novel sponsor codes in future data sets; it is the average effect.
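One way to see this value is to filter the tidy results for the reserved ..new level (a sketch; the column selection assumes the tidy() columns shown below):
tidy(grants_glm, number = 1) %>%
  dplyr::select(-id) %>%
  dplyr::filter(level == "..new")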
## # A tibble: 1 × 3
## level value terms
## <chr> <dbl> <chr>
## 1 ..new -2.88 sponsor_code
Partial Pooling
This method estimates the effects by using all of the sponsor codes at once using a hierarchical Bayesian generalized linear model. The sponsor codes are treated as a random set that contributes a random intercept to the previously used logistic regression.
Partial pooling estimates each effect as a combination of the separate empirical estimates of the log-odds and the prior distribution. For sponsor codes with small sample sizes, the final estimate is shrunk towards the overall mean of the log-odds. This makes sense since we have poor information for estimating these sponsor codes. For sponsor codes with many data points, the estimates rely more on the empirical estimates. This page has a good discussion of pooling using Bayesian models.
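The intuition can be sketched as a precision-weighted average (illustrative only; this is not the exact computation the Bayesian model performs):
# Illustrative shrinkage: a weighted average of a level's empirical log-odds
# and the overall mean, where `k` plays the role of a hypothetical prior weight
shrink <- function(empirical, overall, n, k = 20) {
  (n * empirical + k * overall) / (n + k)
}
shrink(empirical = 2, overall = -0.5, n = 2)    # small n: pulled toward the overall mean
shrink(empirical = 2, overall = -0.5, n = 500)  # large n: stays near the empirical value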
Bayesian Methods
One approach to partial pooling is the function step_lencode_bayes, which uses the stan_glmer function in the rstanarm package. There are a number of options that can be used to control the model estimation routine, including:
opts <-
  list(
    ## the number of chains
    chains = 4,
    ## how many cores to use
    cores = 4,
    ## the total number of iterations per chain (low here for time)
    iter = 1000,
    ## set the random number seed
    seed = 8779
  )
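Since these options control the estimation routine, other stan_glmer() arguments could presumably be supplied in the same way; for example, a hedged sketch of a non-default prior (not run here):
# Not run: a sketch only, assuming the options list is forwarded to
# rstanarm::stan_glmer()
opts_prior <- c(opts, list(prior_intercept = rstanarm::normal(0, 5)))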
Using the default priors, the model is estimated via:
grants_glmer <-
  recipe(class ~ ., data = grants_other) %>%
  step_lencode_bayes(
    sponsor_code,
    outcome = vars(class),
    options = opts
  ) %>%
  prep(training = grants_other)
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
This took more time than the simple non-pooled model. The embeddings are extracted in the same way:
all_estimates <-
  tidy(grants_glmer, number = 1) %>%
  dplyr::select(-terms, -id) %>%
  set_names(c("sponsor_code", "glmer")) %>%
  inner_join(glm_estimates, by = "sponsor_code")

all_estimates %>% dplyr::select(sponsor_code, log_odds, glm, glmer)
## # A tibble: 291 × 4
## sponsor_code log_odds glm glmer
## <chr> <dbl> <dbl> <dbl>
## 1 100D 0.405 4.05e- 1 0.245
## 2 101A -1.95 -1.95e+ 0 -1.67
## 3 103C -2.08 -2.08e+ 0 -1.58
## 4 105A -1.61 -1.61e+ 0 -1.25
## 5 107C Inf 1.66e+ 1 0.718
## 6 10B Inf 1.66e+ 1 0.218
## 7 111C -1.61 -1.61e+ 0 -1.18
## 8 112D 0.693 6.93e- 1 0.502
## 9 113A 0 0 -0.0891
## 10 118B 0 3.14e-16 -0.252
## # ℹ 281 more rows
Note that the n = 1 sponsor codes have estimates that are less extreme than the naive estimates.
Let’s see the effect of the shrinkage induced by partial pooling by plotting the naive results versus the new results (finite data only):
pooled_plot <-
  all_estimates %>%
  dplyr::filter(is.finite(log_odds)) %>%
  ggplot(aes(x = log_odds, y = glmer)) +
  geom_abline(col = "red", alpha = .5) +
  geom_point_interactive(
    aes(size = sqrt(n), tooltip = label),
    alpha = .5
  ) +
  xlim(rng) +
  ylim(rng)

# Convert the plot to a format that the html file can handle
girafe(ggobj = pooled_plot)
New levels are encoded as:
## # A tibble: 1 × 2
## level value
## <chr> <dbl>
## 1 ..new -0.468
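A sketch of how this value can be extracted (likely the step that produced the output above, assuming the same tidy() columns as before):
tidy(grants_glmer, number = 1) %>%
  dplyr::filter(level == "..new") %>%
  dplyr::select(level, value)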
Empirical Bayesian Methods/Mixed Models
The same generalized linear model can be fit as a mixed model with a random intercept per sponsor code. The lme4 package can also be used to get pooled estimates via step_lencode_mixed.
grants_mixed <-
  recipe(class ~ ., data = grants_other) %>%
  step_lencode_mixed(
    sponsor_code,
    outcome = vars(class)
  ) %>%
  prep(training = grants_other)
all_estimates <-
  tidy(grants_mixed, number = 1) %>%
  dplyr::select(-terms, -id) %>%
  set_names(c("sponsor_code", "mixed")) %>%
  inner_join(all_estimates, by = "sponsor_code")

all_estimates %>%
  dplyr::select(sponsor_code, log_odds, glm, glmer, mixed)
## # A tibble: 291 × 5
## sponsor_code log_odds glm glmer mixed
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 100D 0.405 4.05e- 1 0.245 0.222
## 2 101A -1.95 -1.95e+ 0 -1.67 -1.57
## 3 103C -2.08 -2.08e+ 0 -1.58 -1.47
## 4 105A -1.61 -1.61e+ 0 -1.25 -1.14
## 5 107C Inf 1.66e+ 1 0.718 0.607
## 6 10B Inf 1.66e+ 1 0.218 0.218
## 7 111C -1.61 -1.61e+ 0 -1.18 -1.14
## 8 112D 0.693 6.93e- 1 0.502 0.470
## 9 113A 0 0 -0.0891 -0.0949
## 10 118B 0 3.14e-16 -0.252 -0.258
## # ℹ 281 more rows
Comparing the raw and mixed model estimates:
mixed_plot <-
  all_estimates %>%
  dplyr::filter(is.finite(log_odds)) %>%
  ggplot(aes(x = log_odds, y = mixed)) +
  geom_abline(col = "red", alpha = .5) +
  geom_point_interactive(
    aes(size = sqrt(n), tooltip = label),
    alpha = .5
  ) +
  xlim(rng) +
  ylim(rng)

girafe(ggobj = mixed_plot)
These values are very similar to the Bayesian estimates.
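One quick way to quantify the similarity is the correlation between the two sets of encodings (a sketch):
all_estimates %>%
  summarise(correlation = cor(glmer, mixed))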