step_discretize_cart
creates a specification of a recipe step that will
discretize numeric data (e.g. integers or doubles) into bins in a
supervised way using a CART model.
step_discretize_cart(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  outcome = NULL,
  cost_complexity = 0.01,
  tree_depth = 10,
  min_n = 20,
  rules = NULL,
  skip = FALSE,
  id = rand_id("discretize_cart")
)

# S3 method for step_discretize_cart
tidy(x, ...)
Argument | Description |
---|---|
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose which variables are
affected by the step. See |
role | Defaults to |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
outcome | A call to |
cost_complexity | The regularization parameter. Any split that does not decrease the overall lack of fit by a factor of cost_complexity is not attempted. Corresponds to cp in rpart. |
tree_depth | The maximum depth in the final tree. Corresponds to maxdepth in rpart. |
min_n | The number of data points in a node required to continue splitting. Corresponds to minsplit in rpart. |
rules | The splitting rules of the best CART tree to retain for each variable. If length zero, splitting could not be used on that column. |
skip | A logical. Should the step be skipped when the recipe is baked by bake()? |
id | A character string that is unique to this step to identify it. |
x | A step_discretize_cart object. |
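To make the tuning arguments concrete, the following is a minimal sketch that calls rpart directly rather than the step itself. It assumes that cost_complexity, tree_depth and min_n are passed to rpart.control() as cp, maxdepth and minsplit; the helper cart_breaks() is hypothetical and is not part of the package. It also illustrates the length-zero case described for rules: when no split clears the cost-complexity threshold, no cut points are returned.

```r
library(rpart)

# Hypothetical helper (not part of the package): fit a one-predictor CART
# model and return its sorted cut points. The mapping of the step's
# arguments onto rpart.control() is an assumption of this sketch.
cart_breaks <- function(x, y, cost_complexity = 0.01, tree_depth = 10,
                        min_n = 20) {
  fit <- rpart(
    y ~ x,
    data = data.frame(x = x, y = y),
    control = rpart.control(
      cp = cost_complexity,
      maxdepth = tree_depth,
      minsplit = min_n,
      maxcompete = 0,   # keep only primary splits in fit$splits
      maxsurrogate = 0
    )
  )
  # A root-only tree has no `splits` component: return a length-zero vector.
  if (is.null(fit$splits)) {
    return(numeric(0))
  }
  sort(unique(unname(fit$splits[, "index"])))
}

# Default settings versus a threshold that no split is expected to clear.
default_breaks <- cart_breaks(iris$Petal.Length, iris$Species)
strict_breaks <- cart_breaks(iris$Petal.Length, iris$Species,
                             cost_complexity = 1)

length(default_breaks)  # one or more cut points
length(strict_breaks)   # no cut points: the predictor would stay unbinned
```

Raising cost_complexity or min_n, or lowering tree_depth, can only remove cut points in this sketch, so these arguments act as the knobs that control how coarse the resulting bins are.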
An updated version of recipe
with the new step added to the
sequence of existing steps (if any).
step_discretize_cart()
creates non-uniform bins from numerical
variables by utilizing the information about the outcome variable and
applying a CART model.
The bin boundaries for each variable are chosen using the standard cost-complexity pruning of CART, which makes this discretization method resistant to overfitting.
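The idea can be sketched outside of a recipe with rpart and base R. This is an illustration under stated assumptions, not the package's implementation: it fits a single-predictor tree on a training set, takes the split points as bin boundaries, and applies the same boundaries to held-out data with cut(). The iris data, the object names and the bin() helper are chosen here for illustration only.

```r
library(rpart)

set.seed(1)
in_train <- sample(nrow(iris), 100)
train <- iris[in_train, ]
test <- iris[-in_train, ]

# Fit a pruned CART model of the outcome on one numeric predictor.
fit <- rpart(
  Species ~ Petal.Length,
  data = train,
  control = rpart.control(cp = 0.01, maxdepth = 10, minsplit = 20,
                          maxcompete = 0, maxsurrogate = 0)
)

# The surviving split points become the bin boundaries.
breaks <- if (is.null(fit$splits)) {
  numeric(0)
} else {
  sort(unique(unname(fit$splits[, "index"])))
}

# Hypothetical helper: left-closed bins that cover the whole real line,
# so values outside the training range still fall into a bin.
bin <- function(x) {
  cut(x, breaks = c(-Inf, breaks, Inf), right = FALSE, include.lowest = TRUE)
}

train_bins <- bin(train$Petal.Length)
test_bins <- bin(test$Petal.Length)

table(test_bins)
```

Because the boundaries are learned once on the training set and then reused, the held-out data gets exactly the same factor levels as the training data.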
This step requires the rpart package. If not installed, the step will stop with a note about installing the package.
Note that the original data will be replaced with the new bins.
library(embed)  # provides step_discretize_cart() and attaches recipes
library(modeldata)
data(ad_data)
library(rsample)

split <- initial_split(ad_data, strata = "Class")

ad_data_tr <- training(split)
ad_data_te <- testing(split)

cart_rec <-
  recipe(Class ~ ., data = ad_data_tr) %>%
  step_discretize_cart(tau, age, p_tau, Ab_42, outcome = "Class", id = "cart splits")

cart_rec <- prep(cart_rec, training = ad_data_tr)
#> Warning: `step_discretize_cart()` failed to find any meaningful splits for predictor 'age', which will not be binned.

# The splits:
tidy(cart_rec, id = "cart splits")
#> # A tibble: 16 x 3
#>    terms values id
#>    <chr>  <dbl> <chr>
#>  1 tau     6.15 cart splits
#>  2 tau     6.25 cart splits
#>  3 tau     6.32 cart splits
#>  4 tau     6.42 cart splits
#>  5 tau     6.66 cart splits
#>  6 p_tau   3.90 cart splits
#>  7 p_tau   4.36 cart splits
#>  8 p_tau   4.40 cart splits
#>  9 p_tau   4.49 cart splits
#> 10 p_tau   4.54 cart splits
#> 11 p_tau   4.62 cart splits
#> 12 Ab_42   9.98 cart splits
#> 13 Ab_42  10.3  cart splits
#> 14 Ab_42  11.1  cart splits
#> 15 Ab_42  11.2  cart splits
#> 16 Ab_42  11.3  cart splits

bake(cart_rec, ad_data_te, tau)
#> # A tibble: 82 x 1
#>    tau
#>    <fct>
#>  1 [6.147,6.25)
#>  2 [6.418,6.661)
#>  3 [-Inf,6.147)
#>  4 [-Inf,6.147)
#>  5 [6.147,6.25)
#>  6 [-Inf,6.147)
#>  7 [6.25,6.322)
#>  8 [6.661, Inf]
#>  9 [-Inf,6.147)
#> 10 [-Inf,6.147)
#> # … with 72 more rows