Where to find the token IDs of the tokenizer?

#22 opened by Mohamed123321

Hello,
I was wondering how I can access and change the tokenizer's token IDs.
Thanks!

To clarify, I mean the mapping between tokens (subword pieces) and IDs.

Google org

Hey! The tokenizer is based on SentencePiece by default. You can't really change the existing token-to-ID mapping, but you can add new tokens with tokenizer.add_tokens() and inspect the vocabulary with tokenizer.get_vocab() (there's a short sketch of both after the example below).

sentence = "What time is it, Tom?"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
sentence_encoded["input_ids"][0],
skip_special_tokens=True
)

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)


ENCODED SENTENCE:
tensor([ 363, 97, 19, 34, 6, 3059, 58, 1])

DECODED SENTENCE:
What time is it, Tom?
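
And here is a minimal sketch of the two calls mentioned above, get_vocab() and add_tokens(), again assuming a SentencePiece checkpoint such as google/flan-t5-base (the thread doesn't name the model, and "<tom_clock>" is just a made-up token for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# get_vocab() returns a dict mapping each token string to its integer ID
vocab = tokenizer.get_vocab()
tokens = tokenizer.tokenize("What time is it, Tom?")
print([(tok, vocab[tok]) for tok in tokens])

# add_tokens() appends new tokens at the end of the vocabulary and
# returns how many were actually added
num_added = tokenizer.add_tokens(["<tom_clock>"])
print(num_added, tokenizer.convert_tokens_to_ids("<tom_clock>"))

If you then use the tokenizer with a model, remember to call model.resize_token_embeddings(len(tokenizer)) so the newly added IDs have embeddings.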

Hope this helps.
