2/07/2023

Tokenizer token grouping (token-to-entity grouping)

Refer to the example code below. It groups sub-word token indices by their source word, which is the mapping you need when aligning word-level (entity) labels with tokenizer output. Two short follow-up sketches appear after the output.


..

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


words = ["on", "a", "reported", "basis", "and", "10.4%", "on", "a", "like-for-like", "basis."]
sentence = ' '.join(words)

tokens = tokenizer(sentence, return_tensors="np", max_length=128, padding='max_length')

print('origin tokenizer result -------')
print(f'words({len(words)}), {words}')
print(f'sentence: {sentence}')
print(f'tokens ({ len(tokens["input_ids"][0]) }) : {tokens}')

print('grouping tokens -------')
# token ids per unique word (duplicate words such as 'on' and 'a' collapse into one key)
word_tokens_list = {x: tokenizer.encode(x, add_special_tokens=False) for x in words}
print('word_tokens_list: ', word_tokens_list)

# token ids per word, kept in sentence order
idx_for_words = [tokenizer.encode(x, add_special_tokens=False) for x in words]
print(f'idx_for_words ({len(idx_for_words)}): ', idx_for_words)

# hand out consecutive token positions word by word, so each inner
# list holds the token indices that belong to one word
desired_output = []
idx = 0
for word_token_ids in idx_for_words:
    tokenoutput = []
    for _ in word_token_ids:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)

print('tokens in grouped list')
print(desired_output)

..



Output:

..

origin tokenizer result -------
words(10), ['on', 'a', 'reported', 'basis', 'and', '10.4%', 'on', 'a', 'like-for-like', 'basis.']
sentence: on a reported basis and 10.4% on a like-for-like basis.
tokens (128)
input_ids: [[ 101 1113 170 2103 3142 1105 1275 119 125 110 1113 170 1176 118
1111 118 1176 3142 119 102 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0]]
token_type_ids: [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
attention_mask: [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
grouping tokens -------
word_tokens_list: {'on': [1113], 'a': [170], 'reported': [2103], 'basis': [3142], 'and': [1105], '10.4%': [1275, 119, 125, 110], 'like-for-like': [1176, 118, 1111, 118, 1176], 'basis.': [3142, 119]}
idx_for_words (10): [[1113], [170], [2103], [3142], [1105], [1275, 119, 125, 110], [1113], [170], [1176, 118, 1111, 118, 1176], [3142, 119]]
tokens in grouped list
[[0], [1], [2], [3], [4], [5, 6, 7, 8], [9], [10], [11, 12, 13, 14, 15], [16, 17]]

..
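
With a fast tokenizer such as bert-base-cased, the same word-to-token grouping can also be read directly from word_ids() by passing the pre-split word list with is_split_into_words=True. A minimal sketch under that assumption (this part is not in the original script):

..

from collections import defaultdict
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
words = ["on", "a", "reported", "basis", "and", "10.4%", "on", "a", "like-for-like", "basis."]

# encode the already-split words; word_ids() then reports which word each token came from
enc = tokenizer(words, is_split_into_words=True, add_special_tokens=False)

grouped = defaultdict(list)
for token_pos, word_idx in enumerate(enc.word_ids()):
    grouped[word_idx].append(token_pos)

# expected to match the grouped list computed above:
# [[0], [1], [2], [3], [4], [5, 6, 7, 8], [9], [10], [11, 12, 13, 14, 15], [16, 17]]
print([grouped[i] for i in range(len(words))])

..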


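The usual reason for this grouping is to align word-level labels (for example entity tags) with the sub-word tokens. A minimal sketch of that step, using made-up tags that are not part of the original example:

..

# word -> token-position mapping computed by the script above
desired_output = [[0], [1], [2], [3], [4], [5, 6, 7, 8], [9], [10], [11, 12, 13, 14, 15], [16, 17]]

# hypothetical word-level tags, one per word (illustration only)
word_labels = ["O", "O", "O", "O", "O", "B-PCT", "O", "O", "B-MISC", "O"]

# repeat each word's label for every token that belongs to that word
token_labels = []
for label, token_positions in zip(word_labels, desired_output):
    token_labels.extend([label] * len(token_positions))

print(token_labels)

..
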
Thank you.

πŸ™‡πŸ»‍♂️

www.marearts.com
