`step_collapse_stringdist()` creates a *specification* of a recipe step that
will collapse factor levels that have a low string distance between them.

## Usage

```
step_collapse_stringdist(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  distance = NULL,
  method = "osa",
  options = list(),
  results = NULL,
  columns = NULL,
  skip = FALSE,
  id = rand_id("collapse_stringdist")
)
```

## Arguments

- recipe
A recipe object. The step will be added to the sequence of operations for this recipe.

- ...
One or more selector functions to choose which variables are affected by the step. See `selections()` for more details. For the `tidy` method, these are not currently used.

- role
Not used by this step since no new variables are created.

- trained
A logical to indicate if the quantities for preprocessing have been estimated.

- distance
Integer, the threshold that determines which strings are collapsed together. The threshold is inclusive, so `2` will collapse levels that have a string distance between them of 2 or lower.

- method
Character, method for distance calculation. The default is `"osa"`; see stringdist::stringdist-metrics.

- options
List, other arguments passed to `stringdist::stringdistmatrix()`, such as `weight`, `q`, `p`, and `bt`, that are used for different values of `method`.

- results
A list denoting how the labels should be collapsed is stored here once this preprocessing step has been trained by `prep()`.

- columns
A character string of variable names that will be populated (eventually) by the `terms` argument.

- skip
A logical. Should the step be skipped when the recipe is baked by `bake()`? While all operations are baked when `prep()` is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using `skip = TRUE` as it may affect the computations for subsequent operations.

- id
A character string that is unique to this step to identify it.
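
To make the inclusive `distance` threshold concrete, the following sketch computes pairwise distances directly with `stringdist::stringdistmatrix()`. It only illustrates the distance values that `distance`, `method`, and `options` refer to, and is not the step's internal code. The level vector `lvls` is an illustrative assumption.

```
library(stringdist)

# Illustrative factor levels (an assumption for this sketch)
lvls <- c("ak", "b", "djj", "e", "hjhgfgjgr")

# Pairwise distances with the default method, "osa"
d <- stringdistmatrix(lvls, lvls, method = "osa")
dimnames(d) <- list(lvls, lvls)

d["b", "e"]   # 1: one substitution
d["ak", "b"]  # 2: one substitution plus one deletion

# The threshold is inclusive, so distance = 2 treats "ak" and "b" as a match
d["ak", "b"] <= 2

# Extra arguments such as `p` only matter for some methods, e.g. Jaro-Winkler
d_jw <- stringdistmatrix(lvls, lvls, method = "jw", p = 0.1)
```

With `distance = 1`, only the pair `"b"` and `"e"` falls under the threshold. Raising it to `2` also brings in `"ak"`.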

## Value

An updated version of `recipe` with the new step added to the
sequence of existing steps (if any). For the `tidy` method, a tibble with
columns `terms` (the columns that will be affected), `from`, `to`, and `id`.

## Tidying

When you `tidy()` this step, a tibble is returned with columns `"terms"`
(the column being modified), `"from"` (the old levels), `"to"` (the new
levels), and `"id"`.
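
As a rough illustration of how to read this table, the sketch below lists only the levels that were actually merged, then counts how many levels survive per column. The `tidied` data frame is a hand-built stand-in: its rows are copied from the `distance = 1` output in the Examples section, with the `id` column left out.

```
# Stand-in for the tidy() result; rows copied from the distance = 1 example
tidied <- data.frame(
  terms = c(rep("x1", 6), rep("x2", 5)),
  from = c("a", "b", "d", "e", "hjhgfgjgr", "sfgsfgsd",
           "ak", "b", "e", "djj", "hjhgfgjgr"),
  to = c("a", "a", "a", "a", "hjhgfgjgr", "sfgsfgsd",
         "ak", "b", "b", "djj", "hjhgfgjgr"),
  stringsAsFactors = FALSE
)

# Rows where the level changed, i.e. levels absorbed into another level
merged <- tidied[tidied$from != tidied$to, ]
merged

# Number of distinct levels left in each column after collapsing
surviving <- tapply(tidied$to, tidied$terms, function(x) length(unique(x)))
surviving
```

Rows where `from` equals `to` are levels that were kept unchanged.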

## Examples

```
library(recipes)
library(tibble)
# step_collapse_stringdist() is assumed here to come from the embed package
library(embed)

# Toy data with several near-duplicate levels
data0 <- tibble(
  x1 = c("a", "b", "d", "e", "sfgsfgsd", "hjhgfgjgr"),
  x2 = c("ak", "b", "djj", "e", "hjhgfgjgr", "hjhgfgjgr")
)

# Collapse levels whose string distance is 1 or lower
rec <- recipe(~., data = data0) %>%
  step_collapse_stringdist(all_predictors(), distance = 1) %>%
  prep()

rec %>%
  bake(new_data = NULL)
#> # A tibble: 6 × 2
#>   x1        x2
#>   <fct>     <fct>
#> 1 a         ak
#> 2 a         b
#> 3 a         djj
#> 4 a         b
#> 5 sfgsfgsd  hjhgfgjgr
#> 6 hjhgfgjgr hjhgfgjgr

# Inspect the mapping from old levels to new levels
tidy(rec, 1)
#> # A tibble: 11 × 4
#>    terms from      to        id
#>    <chr> <chr>     <chr>     <chr>
#>  1 x1    a         a         collapse_stringdist_F7qKG
#>  2 x1    b         a         collapse_stringdist_F7qKG
#>  3 x1    d         a         collapse_stringdist_F7qKG
#>  4 x1    e         a         collapse_stringdist_F7qKG
#>  5 x1    hjhgfgjgr hjhgfgjgr collapse_stringdist_F7qKG
#>  6 x1    sfgsfgsd  sfgsfgsd  collapse_stringdist_F7qKG
#>  7 x2    ak        ak        collapse_stringdist_F7qKG
#>  8 x2    b         b         collapse_stringdist_F7qKG
#>  9 x2    e         b         collapse_stringdist_F7qKG
#> 10 x2    djj       djj       collapse_stringdist_F7qKG
#> 11 x2    hjhgfgjgr hjhgfgjgr collapse_stringdist_F7qKG

# Same recipe with a wider threshold: distance of 2 or lower
rec <- recipe(~., data = data0) %>%
  step_collapse_stringdist(all_predictors(), distance = 2) %>%
  prep()

rec %>%
  bake(new_data = NULL)
#> # A tibble: 6 × 2
#>   x1        x2
#>   <fct>     <fct>
#> 1 a         ak
#> 2 a         ak
#> 3 a         djj
#> 4 a         ak
#> 5 sfgsfgsd  hjhgfgjgr
#> 6 hjhgfgjgr hjhgfgjgr

tidy(rec, 1)
#> # A tibble: 11 × 4
#>    terms from      to        id
#>    <chr> <chr>     <chr>     <chr>
#>  1 x1    a         a         collapse_stringdist_m6FIF
#>  2 x1    b         a         collapse_stringdist_m6FIF
#>  3 x1    d         a         collapse_stringdist_m6FIF
#>  4 x1    e         a         collapse_stringdist_m6FIF
#>  5 x1    hjhgfgjgr hjhgfgjgr collapse_stringdist_m6FIF
#>  6 x1    sfgsfgsd  sfgsfgsd  collapse_stringdist_m6FIF
#>  7 x2    ak        ak        collapse_stringdist_m6FIF
#>  8 x2    b         ak        collapse_stringdist_m6FIF
#>  9 x2    e         ak        collapse_stringdist_m6FIF
#> 10 x2    djj       djj       collapse_stringdist_m6FIF
#> 11 x2    hjhgfgjgr hjhgfgjgr collapse_stringdist_m6FIF
```
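
The mappings shown for `x2` are consistent with a simple greedy rule: walk through the levels in order, and let each level that has not yet been assigned absorb every unassigned level within the threshold. The sketch below implements that rule as an assumption for illustration only. `collapse_greedy()` is a hypothetical helper, not a function from any package, and the step's actual implementation may differ.

```
library(stringdist)

# Hypothetical helper: greedy collapsing under an inclusive distance threshold
collapse_greedy <- function(lvls, distance, method = "osa") {
  d <- stringdistmatrix(lvls, lvls, method = method)
  to <- rep(NA_character_, length(lvls))
  for (i in seq_along(lvls)) {
    if (is.na(to[i])) {
      # Unassigned levels within the threshold (including level i itself)
      hit <- is.na(to) & d[i, ] <= distance
      to[hit] <- lvls[i]
    }
  }
  setNames(to, lvls)
}

# Levels of x2 from the example data, in sorted order
lvls <- c("ak", "b", "djj", "e", "hjhgfgjgr")

collapse_greedy(lvls, distance = 1)  # "e" maps to "b"
collapse_greedy(lvls, distance = 2)  # "b" and "e" map to "ak"
```

Under this rule the outcome depends on the order of the levels, which is why `"ak"` rather than `"b"` becomes the surviving label when `distance = 2`.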