`step_discretize_cart()` creates a *specification* of a recipe step that will discretize numeric data (e.g. integers or doubles) into bins in a supervised way using a CART model.

## Usage

```
step_discretize_cart(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  outcome = NULL,
  cost_complexity = 0.01,
  tree_depth = 10,
  min_n = 20,
  rules = NULL,
  skip = FALSE,
  id = rand_id("discretize_cart")
)
```

## Arguments

- recipe
A recipe object. The step will be added to the sequence of operations for this recipe.

- ...
One or more selector functions to choose which variables are affected by the step. See `selections()` for more details.

- role
Defaults to `"predictor"`.

- trained
A logical to indicate if the quantities for preprocessing have been estimated.

- outcome
A call to `vars` to specify which variable is used as the outcome to train CART models in order to discretize explanatory variables.

- cost_complexity
The regularization parameter. Any split that does not decrease the overall lack of fit by a factor of `cost_complexity` is not attempted. Corresponds to `cp` in `rpart::rpart()`. Defaults to 0.01.

- tree_depth
The *maximum* depth in the final tree. Corresponds to `maxdepth` in `rpart::rpart()`. Defaults to 10.

- min_n
The number of data points in a node required to continue splitting. Corresponds to `minsplit` in `rpart::rpart()`. Defaults to 20.

- rules
The splitting rules of the best CART tree to retain for each variable. If length zero, splitting could not be used on that column.

- skip
A logical. Should the step be skipped when the recipe is baked by `recipes::bake()`? While all operations are baked when `recipes::prep()` is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using `skip = TRUE` as it may affect the computations for subsequent operations.

- id
A character string that is unique to this step to identify it.

## Value

An updated version of `recipe` with the new step added to the sequence of any existing operations.

## Details

`step_discretize_cart()` creates non-uniform bins from numerical variables by utilizing the information about the outcome variable and applying a CART model.

The best set of buckets for each variable is chosen using the standard cost-complexity pruning of CART, which makes this discretization method resistant to overfitting.

This step requires the rpart package. If not installed, the step will stop with a note about installing the package.

Note that the original data will be replaced with the new bins.
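The idea behind this section can be sketched directly with rpart. The block below is a minimal illustration, not the step's actual implementation: it fits a single-predictor tree on made-up toy data (`toy`, `x`, and `y` are hypothetical names), reads the split locations from the fitted tree, and uses them as bin boundaries. The `cp`, `maxdepth`, and `minsplit` values mirror the step's defaults for `cost_complexity`, `tree_depth`, and `min_n`.

```r
library(rpart)

# Toy data (made up for illustration): the class flips between x = 5 and x = 5.05
toy <- data.frame(
  x = (1:200) / 20,
  y = factor(rep(c("a", "b"), each = 100))
)

# Fit a CART model with one predictor; control values mirror the step defaults
fit <- rpart(y ~ x, data = toy, cp = 0.01, maxdepth = 10, minsplit = 20)

# Split locations are stored in the "index" column of the splits matrix
breaks <- sort(unique(unname(fit$splits[, "index"])))

# Use the split points as bin boundaries, open on the right like the step's output
bins <- cut(toy$x, breaks = c(-Inf, breaks, Inf), right = FALSE)
table(bins, toy$y)
```

With several predictors, the same procedure would be repeated once per selected column, each time using only that column and the outcome.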

## Tidying

When you `tidy()` this step, a tibble is returned with columns `terms` (the columns that are selected), `value` (the location of each split), and `id`.

## Tuning Parameters

This step has 3 tuning parameters:

- `cost_complexity`: Cost-Complexity Parameter (type: double, default: 0.01)
- `tree_depth`: Tree Depth (type: integer, default: 10)
- `min_n`: Minimal Node Size (type: integer, default: 20)
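As a rough sketch of how these parameters might be flagged for tuning, the block below sets each one to the `tune()` placeholder and then lists what the recipe reports as tunable. The toy data frame is hypothetical, and the explicit `hardhat::` and `generics::` prefixes are an assumption about where `tune()` and `tunable()` are exported from; in a typical tidymodels session they are available without a prefix.

```r
library(recipes)
library(embed)

# Toy data (made up for illustration)
toy <- data.frame(
  x = (1:200) / 20,
  y = factor(rep(c("a", "b"), each = 100))
)

# Mark all three parameters with the tune() placeholder instead of fixed values
tune_rec <-
  recipe(y ~ x, data = toy) %>%
  step_discretize_cart(
    x,
    outcome = "y",
    cost_complexity = hardhat::tune(),
    tree_depth = hardhat::tune(),
    min_n = hardhat::tune()
  )

# List the parameters the recipe reports as tunable
tunable_params <- generics::tunable(tune_rec)
tunable_params$name
```

A recipe like this is not meant to be prepped directly; the placeholders are expected to be filled in by a tuning function before the step is trained.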

## Case weights

This step performs a supervised operation that can utilize case weights. To use them, see the documentation in `recipes::case_weights` and the examples on `tidymodels.org`.
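A minimal sketch of supplying case weights, assuming the weights are stored as a column created with `hardhat::importance_weights()` so that the recipe assigns it the `case_weights` role. The toy data and the weight values are made up for illustration.

```r
library(recipes)
library(embed)

# Toy data (made up for illustration) with a hypothetical weights column
toy <- data.frame(
  x = (1:200) / 20,
  y = factor(rep(c("a", "b"), each = 100))
)
toy$wts <- hardhat::importance_weights(rep(c(1, 2), times = 100))

wt_rec <-
  recipe(y ~ ., data = toy) %>%
  step_discretize_cart(x, outcome = "y")

# The weights column should be listed with the role "case_weights"
rec_summary <- summary(wt_rec)
rec_summary[, c("variable", "role")]

# Train the step; the splits are estimated on the training data
wt_prepped <- prep(wt_rec, training = toy)
wt_splits <- tidy(wt_prepped, number = 1)
wt_splits
```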

## Examples

```
library(recipes)
library(embed)
library(modeldata)
data(ad_data)
library(rsample)

# Hold out a test set, stratified by the outcome
split <- initial_split(ad_data, strata = "Class")
ad_data_tr <- training(split)
ad_data_te <- testing(split)

# Bin four numeric predictors using CART splits learned against Class
cart_rec <-
  recipe(Class ~ ., data = ad_data_tr) %>%
  step_discretize_cart(
    tau, age, p_tau, Ab_42,
    outcome = "Class", id = "cart splits"
  )
cart_rec <- prep(cart_rec, training = ad_data_tr)
# The splits:
tidy(cart_rec, id = "cart splits")
#> # A tibble: 16 × 3
#>    terms  value id
#>    <chr>  <dbl> <chr>
#>  1 tau    6.05  cart splits
#>  2 tau    6.17  cart splits
#>  3 tau    6.31  cart splits
#>  4 tau    6.37  cart splits
#>  5 tau    6.66  cart splits
#>  6 age    0.986 cart splits
#>  7 age    0.987 cart splits
#>  8 age    0.988 cart splits
#>  9 age    0.988 cart splits
#> 10 age    0.988 cart splits
#> 11 p_tau  4.62  cart splits
#> 12 Ab_42  9.66  cart splits
#> 13 Ab_42 10.8   cart splits
#> 14 Ab_42 11.2   cart splits
#> 15 Ab_42 11.3   cart splits
#> 16 Ab_42 11.7   cart splits
bake(cart_rec, ad_data_te, tau)
#> # A tibble: 84 × 1
#>    tau
#>    <fct>
#>  1 [6.166,6.308)
#>  2 [6.045,6.166)
#>  3 [6.374,6.661)
#>  4 [-Inf,6.045)
#>  5 [-Inf,6.045)
#>  6 [-Inf,6.045)
#>  7 [-Inf,6.045)
#>  8 [-Inf,6.045)
#>  9 [-Inf,6.045)
#> 10 [6.374,6.661)
#> # ℹ 74 more rows
```