metadata

title: Burmese Tokenizers Comparison
emoji: 🐬
colorFrom: blue
colorTo: pink
sdk: streamlit
sdk_version: 1.34.0
app_file: app.py
pinned: false
license: cc-by-nc-sa-4.0

Burmese Tokenizers Comparison

This project is a tool for comparing how various tokenizers handle Burmese. It visualizes how different tokenizers split the same Burmese text, highlighting differences in the resulting token counts.

Inspiration

This project is inspired by the article "All languages are NOT created (tokenized) equal!" by Yennie Jun. The article compares token counts produced by various LLM tokenizers across many languages and shows that the same content can cost far more tokens in some languages than in others.

Dataset

The project uses the Burmese subset of MASSIVE, a parallel dataset introduced by Amazon that consists of 1 million realistic short texts translated across 52 languages and 18 domains. Specifically, it uses the dev split, which contains 2,033 texts translated into each language. The dataset is available on Hugging Face under the CC BY 4.0 license.
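As a rough illustration of selecting the dev split, the sketch below filters records by partition. It assumes the per-locale JSONL layout of the public MASSIVE release, with `locale`, `partition`, and `utt` fields; the two records and the `load_split` helper are made up for this example and are not taken from the app.

```python
import json

# Two made-up records shaped like lines of a MASSIVE per-locale JSONL file.
# The field names ("locale", "partition", "utt") are an assumption about the
# released format, and the utterances are placeholders, not real data.
SAMPLE_JSONL = """\
{"id": "0", "locale": "my-MM", "partition": "train", "utt": "placeholder train utterance"}
{"id": "1", "locale": "my-MM", "partition": "dev", "utt": "placeholder dev utterance"}
"""


def load_split(lines, partition="dev"):
    """Return the utterance text of every record in the requested partition."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("partition") == partition:
            texts.append(record["utt"])
    return texts


dev_texts = load_split(SAMPLE_JSONL.splitlines())
print(len(dev_texts), "dev text(s) selected")
```

With the real Burmese file in place of `SAMPLE_JSONL`, the same filter would be expected to yield the 2,033 dev texts mentioned above.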

Features

Select from a wide range of tokenizers to compare their performance on Burmese text.
Visualize the number of tokens generated by each tokenizer for a randomly sampled Burmese text.
Explore the distribution density and variability of token counts across selected tokenizers.
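The per-tokenizer comparison for one sampled text can be sketched as below. The three tokenizers are deliberately simple stand-ins (code points, UTF-8 bytes, whitespace splitting) so the example runs without downloads; they are not the tokenizers offered in the app, and `count_tokens` and `STAND_IN_TOKENIZERS` are hypothetical names.

```python
import random

# Stand-in tokenizers: each maps a string to a list of tokens.
# These are illustrative only, not the tokenizers used by the app.
STAND_IN_TOKENIZERS = {
    "code_points": list,
    "utf8_bytes": lambda text: list(text.encode("utf-8")),
    "whitespace": str.split,
}


def count_tokens(text, tokenizers):
    """Return {tokenizer name: number of tokens produced for text}."""
    return {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}


# Two short Burmese words used as a tiny placeholder corpus.
texts = ["မြန်မာ", "ကျေးဇူး"]
sample = random.choice(texts)  # the app samples from the MASSIVE dev split instead

counts = count_tokens(sample, STAND_IN_TOKENIZERS)
for name, n in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{name:12s} {n}")
```

Characters in the Myanmar Unicode block take three bytes each in UTF-8, so the byte-level stand-in reports three times as many tokens as the code-point one. This is the kind of disparity the comparison is meant to surface.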

Distribution Density and Variability

The project provides two visualizations to analyze the distribution and variability of token counts across the selected tokenizers:

Distribution Density Plot: This plot shows the estimated probability density function (PDF) of the token counts for each tokenizer. It lets you compare the most likely token counts and the spread of the distribution across tokenizers. A higher, narrower peak means token counts cluster tightly around one value, while a wider, flatter curve indicates greater variability.
Box Plot: This plot displays the dispersion of token counts for each tokenizer. It summarizes the distribution using the median, quartiles, and outliers. A taller box (a larger interquartile range) indicates greater variability in token counts, while a median line that sits off-center within the box suggests a skewed distribution.

By examining these plots, you can gain insights into which tokenizers tend to generate more or fewer tokens, how the token counts are distributed, and the level of consistency or variability in tokenization across different tokenizers.
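To make the quantities behind the two plots concrete, here is a minimal sketch using only the standard library. It assumes the common 1.5 × IQR rule for flagging outliers and a fixed-bandwidth Gaussian kernel for the density estimate; the app's plotting library may use different defaults. The token counts are made up, and `box_summary` and `gaussian_kde` are hypothetical helper names.

```python
import math
import statistics


def box_summary(counts):
    """Quartiles, interquartile range, and 1.5 * IQR outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [c for c in counts if c < low or c > high]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


def gaussian_kde(counts, bandwidth):
    """Return a function estimating the density at x with a Gaussian kernel."""
    norm = len(counts) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in counts) / norm

    return density


# Made-up token counts for one tokenizer; 40 plays the role of an unusually long text.
token_counts = [12, 14, 15, 15, 16, 18, 40]

summary = box_summary(token_counts)
density = gaussian_kde(token_counts, bandwidth=2.0)

print(summary)
print("density near the bulk (15):", round(density(15), 4))
print("density in the gap (30):  ", round(density(30), 4))
```

The box is drawn from `q1` to `q3`, so a larger `iqr` gives a taller box, and the density curve peaks where many counts fall close together.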

Usage

Select the desired tokenizers from the sidebar. You can select all tokenizers at once or pick specific ones.
The app will display a randomly sampled Burmese text from the MASSIVE dataset.
The number of tokens generated by each selected tokenizer for the sampled text will be shown in a table.
A distribution density plot and a box plot will visualize the token count distribution and variability across the selected tokenizers.

Acknowledgements

The original inspiration and insights for this project come from the article "All languages are NOT created (tokenized) equal!" by Yennie Jun.
The MASSIVE dataset used in this project was introduced by Amazon and is available on Hugging Face.
The code for the Streamlit app is adapted from the Tokenizers Languages Hugging Face Space by Yennie Jun.

License

This project is licensed under the CC BY-NC-SA 4.0 license.