Entity Embeddings of Categorical Variables using TensorFlow
Source:vignettes/Applications/Tensorflow.Rmd
Tensorflow.Rmd
The approach encodes categorical data as multiple numeric variables using a word embedding approach. Originally intended as a way to take a large number of word identifiers and represent them in a smaller dimension. Good references on this are Guo and Berkhahn (2016) and Chapter 6 of Francois and Allaire (2018).
The methodology first translates the C factor levels as a
set of integer values then randomly allocates them to the new D
numeric columns. These columns are optionally connected in a neural
network to an intermediate layer of hidden units. Optionally, other
predictors can be added to the network in the usual way (via the
predictors
argument) that also link to the hidden layer.
This implementation uses a single layer with ReLu activations. Finally,
an output layer is used with either linear activation (for numeric
outcomes) or softmax (for classification).
To translate this model to a set of embeddings, the coefficients of the original embedding layer are used to represent the original factor levels.
As an example, we use the Ames housing data where the sale price of houses are being predicted. One predictor, neighborhood, has the most factor levels of the predictors.
library(tidymodels)
data(ames)
length(levels(ames$Neighborhood))
## [1] 29
The distribution of data in the neighborhood is not uniform:
ames %>%
count(Neighborhood) %>%
ggplot(aes(n, reorder(Neighborhood, n))) +
geom_col() +
labs(y = NULL) +
theme_bw()
Fo plotting later, we calculate the simple means per neighborhood:
means <-
ames %>%
group_by(Neighborhood) %>%
summarise(
mean = mean(log10(Sale_Price)),
n = length(Sale_Price),
lon = median(Longitude),
lat = median(Latitude)
)
We’ll fit a model with 10 hidden units and 3 encoding columns:
library(embed)
tf_embed <-
recipe(Sale_Price ~ ., data = ames) %>%
step_log(Sale_Price, base = 10) %>%
# Add some other predictors that can be used by the network
# We preprocess them first
step_YeoJohnson(Lot_Area, Full_Bath, Gr_Liv_Area) %>%
step_range(Lot_Area, Full_Bath, Gr_Liv_Area) %>%
step_embed(
Neighborhood,
outcome = vars(Sale_Price),
predictors = vars(Lot_Area, Full_Bath, Gr_Liv_Area),
num_terms = 5,
hidden_units = 10,
options = embed_control(epochs = 75, validation_split = 0.2)
) %>%
prep(training = ames)
theme_set(theme_bw() + theme(legend.position = "top"))
tf_embed$steps[[4]]$history %>%
filter(epochs > 1) %>%
ggplot(aes(x = epochs, y = loss, col = type)) +
geom_line() +
scale_y_log10()
The embeddings are obtained using the tidy
method:
hood_coef <-
tidy(tf_embed, number = 4) %>%
dplyr::select(-terms, -id) %>%
dplyr::rename(Neighborhood = level) %>%
# Make names smaller
rename_at(
vars(contains("emb")),
funs(gsub("Neighborhood_", "", ., fixed = TRUE))
)
hood_coef
## # A tibble: 30 × 6
## embed_1 embed_2 embed_3 embed_4 embed_5 Neighborhood
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0.0343 -0.0136 -0.0180 0.0418 0.0229 ..new
## 2 -0.000622 0.0234 0.0538 -0.0860 0.00240 North_Ames
## 3 -0.0531 -0.00936 -0.0267 -0.0535 -0.0713 College_Creek
## 4 -0.0326 0.0663 -0.00537 -0.0590 0.0550 Old_Town
## 5 0.0464 -0.0118 -0.0427 0.00593 0.0445 Edwards
## 6 -0.0725 -0.0798 -0.0351 0.00125 -0.0914 Somerset
## 7 -0.0446 -0.102 0.0305 -0.0128 -0.140 Northridge_Heights
## 8 0.0263 -0.0334 0.00209 -0.00356 -0.0324 Gilbert
## 9 -0.0254 -0.0125 -0.0386 -0.0341 0.0216 Sawyer
## 10 -0.0127 -0.0541 -0.0271 0.0123 0.00500 Northwest_Ames
## # ℹ 20 more rows
hood_coef <-
hood_coef %>%
inner_join(means, by = "Neighborhood")
hood_coef
## # A tibble: 28 × 10
## embed_1 embed_2 embed_3 embed_4 embed_5 Neighborhood mean n
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <int>
## 1 -0.000622 0.0234 0.0538 -0.0860 0.00240 North_Ames 5.15 443
## 2 -0.0531 -0.00936 -0.0267 -0.0535 -0.0713 College_Cre… 5.29 267
## 3 -0.0326 0.0663 -0.00537 -0.0590 0.0550 Old_Town 5.07 239
## 4 0.0464 -0.0118 -0.0427 0.00593 0.0445 Edwards 5.09 194
## 5 -0.0725 -0.0798 -0.0351 0.00125 -0.0914 Somerset 5.35 182
## 6 -0.0446 -0.102 0.0305 -0.0128 -0.140 Northridge_… 5.49 166
## 7 0.0263 -0.0334 0.00209 -0.00356 -0.0324 Gilbert 5.27 165
## 8 -0.0254 -0.0125 -0.0386 -0.0341 0.0216 Sawyer 5.13 151
## 9 -0.0127 -0.0541 -0.0271 0.0123 0.00500 Northwest_A… 5.27 131
## 10 -0.0107 -0.0377 -0.0150 0.000668 -0.0173 Sawyer_West 5.25 125
## # ℹ 18 more rows
## # ℹ 2 more variables: lon <dbl>, lat <dbl>
We can make a simple, interactive plot of the new features versus the outcome:
tf_plot <-
hood_coef %>%
dplyr::select(-lon, -lat) %>%
gather(variable, value, starts_with("embed")) %>%
# Clean up the embedding names
# Add a new variable as a hover-over/tool tip
mutate(
label = paste0(gsub("_", " ", Neighborhood), " (n=", n, ")"),
variable = gsub("_", " ", variable)
) %>%
ggplot(aes(x = value, y = mean)) +
geom_point_interactive(aes(size = sqrt(n), tooltip = label), alpha = .5) +
facet_wrap(~variable, scales = "free_x") +
theme_bw() +
theme(legend.position = "top") +
labs(y = "Mean (log scale)", x = "Embedding")
ggiraph(ggobj = tf_plot)
## Warning: 'ggiraph' is deprecated.
## Use 'girafe' instead.
## See help("Deprecated")
However, this has induced some between-predictor correlations:
## embed_1 embed_2 embed_3 embed_4 embed_5
## embed_1 1.00 0.31 -0.20 -0.05 0.26
## embed_2 0.31 1.00 -0.23 -0.26 0.57
## embed_3 -0.20 -0.23 1.00 -0.09 -0.18
## embed_4 -0.05 -0.26 -0.09 1.00 0.10
## embed_5 0.26 0.57 -0.18 0.10 1.00