5/09/2022

BERT Tokenizer, string to token, token to string

 

BERT Tokenizer token understanding examples

..

text = "I am e/mail"
# text = "I am a e-mail"
tokens = tokenizer.tokenize(text)
print(f'Tokens: {tokens}')
print(f'Tokens length: {len(tokens)}')
encoding = tokenizer.encode(text)
print(f'Encoding: {encoding}')
print(f'Encoding length: {len(encoding)}')
tok_text = tokenizer.convert_tokens_to_string(tokens)
print(f'token to string: {tok_text}')

..

output:

Tokens: ['I', 'Δ am', 'Δ e', '/', 'mail']
Tokens length: 5
Encoding: [0, 100, 524, 364, 73, 6380, 2]
Encoding length: 7
token to string: I am e/mail

--
Thank you.
www.marearts.com