---
library_name: transformers
tags: ["gemma","chatml"]
---

# ChatML Tokenizer for Gemma

This repository includes a fast tokenizer for [google/gemma-7b](https://huggingface.co/google/gemma-7b) that uses the ChatML format. The tokenizer was created by replacing the string values of the original tokens with ids `106` (`<start_of_turn>`) and `107` (`<end_of_turn>`) with the ChatML tokens `<|im_start|>` and `<|im_end|>`.

No new tokens were added in the process, so the original model's embedding matrix does not need to be modified.
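The replacement can be pictured as a simple key swap in the vocabulary: the token ids stay fixed and only the surface strings change. A minimal sketch with illustrative dict names (the real edit happens in the tokenizer's `tokenizer.json`, not in a plain dict):

```python
# Sketch: swap the surface strings of two existing tokens without touching ids.
# `vocab` and `replacements` are illustrative, not the actual tokenizer internals.
vocab = {"<start_of_turn>": 106, "<end_of_turn>": 107}
replacements = {
    "<start_of_turn>": "<|im_start|>",
    "<end_of_turn>": "<|im_end|>",
}

# Same ids, new strings - the vocabulary size is unchanged, so the model's
# embedding matrix can be reused as-is.
new_vocab = {replacements.get(token, token): idx for token, idx in vocab.items()}

print(new_vocab)  # {'<|im_start|>': 106, '<|im_end|>': 107}
```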


_Note: This tokenizer is not 100% ChatML compliant, since [google/gemma-7b](https://huggingface.co/google/gemma-7b) appears to always require the original `<bos>` token to be part of the input. The resulting chat template is therefore `<bos>` + ChatML + `<eos>`._

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")

messages = [
  {"role": "system", "content": "You are Gemma."},
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)
# <bos><|im_start|>system
# You are Gemma.<|im_end|>
# <|im_start|>user
# Hello, how are you?<|im_end|>
# <|im_start|>assistant
# I'm doing great. How can I help you today?<|im_end|>\n<eos>

```
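The printed output above follows the `<bos>` + ChatML + `<eos>` layout from the note. That layout can be sketched in plain Python (`build_prompt` is a hypothetical helper for illustration, not the Jinja template the tokenizer actually ships):

```python
# Sketch of the `<bos>` + ChatML + `<eos>` layout described above.
# `build_prompt` is a hypothetical helper, not part of the tokenizer.
def build_prompt(messages):
    text = "<bos>"
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    return text + "<eos>"

print(build_prompt([{"role": "user", "content": "Hi"}]))
# <bos><|im_start|>user
# Hi<|im_end|>
# <eos>
```

In practice you should rely on `tokenizer.apply_chat_template` as shown above rather than building the string by hand.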


## Test

```python
tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)

# check length of vocab
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same length"

# tokenize messages 
messages = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
google_format = original_tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)

print(f"ChatML: \n{chatml}\n-------------------\nGoogle: \n{google_format}")

```