David Dale
Apr 2, 2022


If I retrained the tokenizer from scratch, it would lose its connection to the token embeddings in the neural network, and I would have to retrain the neural network weights as well, which is slow and expensive.

For example, the old tokenizer might split the word “babushka” as “bab”+“#ushka”, whereas the new one (trained from scratch) could learn different tokens, such as “ba”+“#bush”+“#ka”. The model already has embeddings for “bab” and “#ushka”, and I want to reuse these embeddings, instead of training the model on a large corpus until it learns embeddings for “ba”, “#bush”, and “#ka”.
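As an illustration, here is a minimal sketch (in Python with Hugging Face transformers) of one way the old embeddings could be reused: if a new tokenizer introduced “babushka” as a single token, its embedding could be initialized from the average of the embeddings of the pieces the old tokenizer produces, instead of being learned from a large corpus. The model name and the averaging heuristic are assumptions for the sake of the example, not the exact code from the post.

```python
# Sketch: reuse old sub-token embeddings to initialize a new token's embedding.
# Assumes a BERT-like model from Hugging Face; names here are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

word = "babushka"
# The old tokenizer splits the word into pieces it already knows,
# e.g. something like ["bab", "##ushka"] (WordPiece marks continuations with "##").
pieces = old_tokenizer.tokenize(word)
piece_ids = old_tokenizer.convert_tokens_to_ids(pieces)

# The model already has trained embeddings for these pieces.
old_embeddings = model.get_input_embeddings().weight  # shape: (vocab_size, hidden_size)

# If a new tokenizer had a single "babushka" token, one cheap initialization
# is the average of the old pieces' embeddings, so most of what the model
# has already learned is carried over instead of retrained from scratch.
with torch.no_grad():
    new_token_embedding = old_embeddings[piece_ids].mean(dim=0)

print(pieces, new_token_embedding.shape)
```

After such an initialization, only a light fine-tuning pass (rather than full pre-training) should be needed for the model to adapt to the new vocabulary.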
