Where to find the token IDs of the tokenizer?

#22 opened by Mohamed123321

Hello,
I was wondering how I can access and change the tokenizer's token IDs.
Thanks!

To clarify, I mean the mapping between tokens (subword pieces) and IDs.

Google org

Hey! The tokenizer is based on SentencePiece by default. You can't really change the existing token-to-ID mapping, but you can add new tokens with tokenizer.add_tokens() and inspect the vocabulary with tokenizer.get_vocab() (there's a short sketch of both after the example below).

sentence = "What time is it, Tom?"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
sentence_encoded["input_ids"][0],
skip_special_tokens=True
)

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)


ENCODED SENTENCE:
tensor([ 363, 97, 19, 34, 6, 3059, 58, 1])

DECODED SENTENCE:
What time is it, Tom?
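
And here is a minimal sketch of the two calls mentioned above, get_vocab() and add_tokens(), again assuming a SentencePiece checkpoint such as google/flan-t5-base (the thread doesn't name the model, and "<tom_clock>" is just a made-up token for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# get_vocab() returns a dict mapping each token string to its integer ID
vocab = tokenizer.get_vocab()
tokens = tokenizer.tokenize("What time is it, Tom?")
print([(tok, vocab[tok]) for tok in tokens])

# add_tokens() appends new tokens at the end of the vocabulary and
# returns how many were actually added
num_added = tokenizer.add_tokens(["<tom_clock>"])
print(num_added, tokenizer.convert_tokens_to_ids("<tom_clock>"))

If you then use the tokenizer with a model, remember to call model.resize_token_embeddings(len(tokenizer)) so the newly added IDs have embeddings.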

Hope this helps.
