
How to adapt a multilingual T5 model for a single language

Load embeddings only for the tokens of your language to reduce the model size

David Dale
Published in TDS Archive · 4 min read · May 4, 2021


T5 is an encoder-decoder transformer from Google that was once SOTA on several NLU and NLG problems and is still very useful as a base for seq2seq tasks such as text summarization. The first T5 model was English-only; the massively multilingual mT5 followed. It covers 101 languages and is massive indeed.

This post shows how to extract a single-language model from the multilingual one by pruning its redundant embeddings. This cuts the number of parameters by more than half without a significant loss in quality. Our result is for Russian, but you can try the same approach with any other of the 101 languages that mT5 supports.

Two-thirds of mT5 parameters are embeddings, and we can drop the unused ones. Image by the author.

Selecting the vocabulary

The idea is similar to the one in the paper Load What You Need: Smaller Versions of Multilingual BERT. We use the original tokenizer to process a Russian corpus, count the frequencies of different tokens, and keep only the tokens that are used frequently enough, pruning all the others.

We also preserve a small number of English tokens in the model to make it bilingual. We need this to enable the model to transfer knowledge from English to Russian downstream tasks, and also because English words and phrases often occur within modern Russian texts.

We start by loading the existing multilingual model.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# load the original multilingual tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('google/mt5-base')
model = T5ForConditionalGeneration.from_pretrained('google/mt5-base')

The model consists mostly of embeddings: 33% of its parameters are input embeddings (shared between its encoder and decoder) and 33% are output embeddings.

def msize(m):
    return sum(p.numel() for p in m.parameters())

print(msize(model.shared) / msize(model))   # 0.3298
print(msize(model.lm_head) / msize(model))  # 0.3298

To estimate the frequency of different tokens, we take Russian and English sentence corpora from the Leipzig corpora collection. We use these two languages because we want our model to be bilingual in the end.

import pandas as pd
import csv
from collections import Counter
from tqdm.auto import tqdm, trange

df_ru = pd.read_csv('rus-ru_web-public_2019_1M-sentences.txt', sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_ru.columns = ['idx', 'text']

# count how often each token id appears in the Russian corpus
cnt_ru = Counter()
for text in tqdm(df_ru.text):
    cnt_ru.update(tokenizer.encode(text))
print(len(cnt_ru), len(cnt_ru) / tokenizer.vocab_size)
# 58438 0.2336

After counting the tokens in the Russian corpus we discover that only 23% of the model vocabulary was used. Moreover, the top 20K tokens constitute more than 99% of the Russian corpus. For English, the statistics are similar.

for top in 10_000, 20_000, 30_000:
    print(top, sum(v for k, v in cnt_ru.most_common(top)) / sum(cnt_ru.values()))
# 10000 0.9645
# 20000 0.9940
# 30000 0.9982
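
The vocabulary selection below also relies on analogous counts for the English corpus, cnt_en, which the code listings above do not show. Here is a minimal sketch of that step, continuing with the imports above; the filename is a placeholder — use whichever English file you downloaded from the Leipzig collection.

df_en = pd.read_csv('eng_news_2019_1M-sentences.txt', sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_en.columns = ['idx', 'text']

# count token frequencies in the English corpus, exactly as for Russian
cnt_en = Counter()
for text in tqdm(df_en.text):
    cnt_en.update(tokenizer.encode(text))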

We decide on the following vocabulary composition:

  • The top 1K tokens of the original tokenizer (just in case)
  • The top 10K tokens of the English corpus
  • The top 20K tokens of the Russian corpus
  • The 100 special tokens that T5 uses

This gives us a vocabulary of 30K tokens, 12% of the 250K tokens in the multilingual version.

new_tokens = set(range(1000))  # the top 1K tokens of the original vocabulary
for i, (k, v) in enumerate(cnt_en.most_common(10_000)):
    if k not in new_tokens:
        new_tokens.add(k)
for i, (k, v) in enumerate(cnt_ru.most_common(25_000)):
    if len(new_tokens) == 29_900:
        print(i, 'Russian tokens are included')
        break
    if k not in new_tokens:
        new_tokens.add(k)
# the 100 special tokens at the end of the vocabulary
for t in range(tokenizer.vocab_size - 100, tokenizer.vocab_size):
    new_tokens.add(t)
print(len(new_tokens))
kept_ids = sorted(new_tokens)

Updating the model

Updating the neural network is easy: we just replace the parameters of its input and output embeddings. This reduces the model size by 58% (from 2.2 GB to 0.9 GB).

new_size = len(kept_ids)
new_emb = torch.nn.Embedding(new_size, model.shared.embedding_dim)
new_head = torch.nn.Linear(in_features=model.lm_head.in_features, out_features=new_size, bias=False)

# copy the rows for the kept tokens into the new, smaller matrices
for new_id, old_id in enumerate(kept_ids):
    new_emb.weight.data[new_id] = model.shared.weight.data[old_id]
    new_head.weight.data[new_id] = model.lm_head.weight.data[old_id]

model.shared.weight = new_emb.weight
model.lm_head.weight = new_head.weight
model.config.__dict__['vocab_size'] = new_size
model.config.__dict__['_name_or_path'] = 'cointegrated/rut5-base'

Updating the tokenizer is, surprisingly, trickier. T5 uses a SentencePiece tokenizer, which is implemented in C++ and is opaque to Python. Fortunately, we can take its serialized model and edit it in Python through its Protobuf representation.

! wget https://raw.githubusercontent.com/google/sentencepiece/master/src/sentencepiece_model.proto
! protoc --python_out=. sentencepiece_model.proto

import sentencepiece_model_pb2 as spmp

smp = tokenizer.sp_model.serialized_model_proto()
m = spmp.ModelProto()
m.ParseFromString(smp)
print('the loaded model has pieces:', len(m.pieces))
new_pieces = [m.pieces[idx] for idx in kept_ids]
print('the new pieces:', len(new_pieces))

# replace the content of the first 30K pieces
for i, p in enumerate(new_pieces):
    m.pieces[i].piece = p.piece
    m.pieces[i].score = p.score
    m.pieces[i].type = p.type

# drop the remaining pieces
n = len(new_pieces)
for i in trange(len(m.pieces) - n):
    m.pieces.pop(len(m.pieces) - 1)

print(len(m.pieces))
with open('new_sp.model', 'wb') as f:
    f.write(m.SerializeToString())

new_tokenizer = T5Tokenizer('new_sp.model', extra_ids=0)

Now we can save the new model and the new tokenizer.

new_tokenizer.save_pretrained('rut5-base')
model.save_pretrained('rut5-base')
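
Before moving on, it can be reassuring to reload everything from disk and check that the pruned model and the new tokenizer agree. This is an optional sanity check, not part of the original pipeline:

from transformers import T5ForConditionalGeneration, T5Tokenizer

loaded_model = T5ForConditionalGeneration.from_pretrained('rut5-base')
loaded_tokenizer = T5Tokenizer.from_pretrained('rut5-base')
print(loaded_model.config.vocab_size)  # ~30K instead of ~250K

# a round trip through the new tokenizer should reproduce (approximately) the same text
text = 'Привет, мир! Hello, world!'
ids = loaded_tokenizer.encode(text)
print(ids)
print(loaded_tokenizer.decode(ids, skip_special_tokens=True))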

All the code for creating the model up to this stage is available on GitHub. The Russian T5 model is available in the Hugging Face repository.

Frankly, this model is pretty useless by itself, because mT5 was trained only on the unsupervised task of predicting missing words. However, this model can be fine-tuned for many other tasks: text summarization, translation, dialogue response generation, paraphrasing, etc. In the next post, we will show how to perform such fine-tuning. Subscribe to stay tuned!
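
For readers who want to experiment before then, here is a minimal, generic sketch of what a single fine-tuning step looks like with the transformers API. The toy paraphrasing pair, the learning rate, and the optimizer are placeholders for illustration only, not the recipe from the follow-up post:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('rut5-base')
tokenizer = T5Tokenizer.from_pretrained('rut5-base')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# a single toy (source, target) pair; a real task needs many such pairs
src = tokenizer('перефразируй: Привет, как твои дела?', return_tensors='pt')
tgt = tokenizer('Здравствуй, как у тебя дела?', return_tensors='pt')

# one training step: compute the seq2seq loss and update the weights
model.train()
loss = model(input_ids=src.input_ids,
             attention_mask=src.attention_mask,
             labels=tgt.input_ids).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(loss.item())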

The post was written by David Dale (https://daviddale.ru/en), a research scientist in NLP and developer of chatbots.

