
bert-mini-amharic

This model has the same architecture as bert-mini and was pretrained from scratch on the Amharic subsets of the oscar and mc4 datasets, a total of 137 million tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k. The model achieves the following results on the evaluation set:

  • Loss: 3.11
  • Perplexity: 22.42
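
The reported perplexity is just the exponential of the evaluation loss, which you can verify directly:

>>> import math
>>> round(math.exp(3.11), 2)  # perplexity = exp(loss)
22.42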

Even though this model has only 10.7 million parameters, its performance on the same Amharic evaluation set is only slightly behind that of the 26x larger, 279-million-parameter xlm-roberta-base model.

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

[{'score': 0.6525624394416809,
  'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.22671808302402496,
  'token': 9345,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.07071439921855927,
  'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.02838180586695671,
  'token': 9913,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.006343209184706211,
  'token': 22459,
  'token_str': 'ዓመታትን',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታትን ተቆጥሯል ።'}]
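
Under the hood, the pipeline tokenizes the input, runs the model, and reads off the highest-scoring tokens at the [MASK] position. The snippet below is a minimal sketch of the same prediction without the pipeline helper, using the standard AutoModelForMaskedLM API:

>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained('rasyosef/bert-mini-amharic')
>>> model = AutoModelForMaskedLM.from_pretrained('rasyosef/bert-mini-amharic')
>>> inputs = tokenizer("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።", return_tensors='pt')
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
>>> tokenizer.decode(logits[0, mask_pos].argmax(dim=-1))  # top prediction, matching the pipeline output above
'ዓመታት'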

Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks: sentiment classification, named entity recognition, and news category classification. A finetuning sketch follows the tables below.

Finetuned Model Performance

The reported F1 scores are macro averages.

| Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
|---|---|---|---|---|
| bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
| bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
| bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
| bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
| xlm-roberta-base | 279M | | 0.83 | 0.73 |
| am-roberta | 443M | | 0.82 | 0.69 |

Amharic News Category Classification

| Model | Size (# params) | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| bert-small-amharic | 25.7M | 0.89 | 0.86 | 0.87 | 0.86 |
| bert-mini-amharic | 9.67M | 0.87 | 0.83 | 0.83 | 0.83 |
| xlm-roberta-base | 279M | 0.90 | 0.88 | 0.88 | 0.88 |
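
The sketch below shows one way such finetuning runs could be set up with the Hugging Face Trainer API. It is a minimal sketch, not the exact training setup used for the tables above: the dataset variables (train_ds and eval_ds, with "text" and "label" columns), the number of labels, and all hyperparameters are illustrative assumptions.

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_id = "rasyosef/bert-mini-amharic"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# num_labels=3 is an assumption (e.g. positive/neutral/negative for
# sentiment); adjust to the label set of your task.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# train_ds and eval_ds are hypothetical Hugging Face datasets with
# "text" and "label" columns.
train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-mini-amharic-finetuned",
    learning_rate=2e-5,              # illustrative hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=32,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
)
trainer.train()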