---
license: apache-2.0
language: fr
library_name: nemo
datasets:
- mozilla-foundation/common_voice_13_0
- multilingual_librispeech
- facebook/voxpopuli
- google/fleurs
- gigant/african_accented_french
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- CTC
- Transformer
- pytorch
- NeMo
- hf-asr-leaderboard
model-index:
- name: stt_fr_fastconformer_hybrid_large
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice 13.0
type: mozilla-foundation/common_voice_13_0
config: fr
split: test
args:
language: fr
metrics:
- name: WER
type: wer
value: 9.16
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Multilingual LibriSpeech (MLS)
type: facebook/multilingual_librispeech
config: french
split: test
args:
language: fr
metrics:
- name: WER
type: wer
value: 4.82
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: VoxPopuli
type: facebook/voxpopuli
config: french
split: test
args:
language: fr
metrics:
- name: WER
type: wer
value: 9.23
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Fleurs
type: google/fleurs
config: fr_fr
split: test
args:
language: fr
metrics:
- name: WER
type: wer
value: 8.65
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: African Accented French
type: gigant/african_accented_french
config: fr
split: test
args:
language: fr
metrics:
- name: WER
type: wer
value: 6.55
---
# FastConformer-Hybrid Large (fr)
<style>
img {
display: inline;
}
</style>
| [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)
This model replicates [nvidia/stt_fr_fastconformer_hybrid_large_pc](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc), but is restricted to predicting only the lowercase French alphabet, the hyphen, and the apostrophe. This choice gives up broader functionality such as predicting casing, numbers, and punctuation, but it can improve accuracy for use cases that only need normalized lowercase text.
Like its sibling, this is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters). It is a hybrid model trained with two loss functions: Transducer (the default) and CTC.
## Performance
We evaluated our model on the following datasets and re-ran the evaluations of other models for comparison. Note that the reported WER is computed after converting numbers to words, removing punctuation (except apostrophes and hyphens), and lowercasing all characters.
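The normalization above can be sketched as follows. This is a minimal illustration, not the exact script used for the reported numbers; the number-to-words step is omitted here (in practice it could be handled by a library such as `num2words` with `lang="fr"`):

```python
import re

def normalize_for_wer(text: str) -> str:
    """Lowercase text and strip punctuation, keeping apostrophes and hyphens."""
    # Unify typographic apostrophes with the ASCII one
    text = text.replace("\u2019", "'")
    text = text.lower()
    # Replace everything that is not a word character, whitespace,
    # apostrophe, or hyphen with a space (\w is Unicode-aware in Python 3)
    text = re.sub(r"[^\w\s'\-]", " ", text)
    # Collapse repeated whitespace
    return " ".join(text.split())
```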
![Benchmarks](https://huggingface.co/bofenghuang/stt_fr_fastconformer_hybrid_large/resolve/main/assets/bench.png)
All the evaluation results can be found [here](https://drive.google.com/drive/folders/1adZTgGAptYx2ut9jddjmlj5--dkY2XWZ?usp=sharing).
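For reference, the word error rate reported above is the word-level edit distance between the reference and the hypothesis, divided by the reference length. A self-contained sketch (toolkits such as `jiwer` or NeMo's own metrics provide equivalent, battle-tested implementations):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] holds the edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,      # deletion
                d[i][j - 1] + 1,      # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```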
## Usage
The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
```python
# Install nemo
# !pip install nemo_toolkit['all']
import nemo.collections.asr as nemo_asr
model_name = "bofenghuang/stt_fr_fastconformer_hybrid_large"
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
# Path to your 16kHz mono-channel audio file
audio_path = "/path/to/your/audio/file"
# Transcribe with the default Transducer decoder
asr_model.transcribe([audio_path])
# (Optional) Switch to CTC decoder
asr_model.change_decoding_strategy(decoder_type="ctc")
# (Optional) Transcribe with CTC decoder
asr_model.transcribe([audio_path])
```
## Datasets
This model has been trained on a composite dataset comprising over 2500 hours of French speech audio and transcriptions, including [Common Voice 13.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [Voxpopuli](https://github.com/facebookresearch/voxpopuli), [Fleurs](https://huggingface.co/datasets/google/fleurs), [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french), and more.
## Limitations
Since this model was trained on publicly available speech datasets, its performance may degrade on speech containing technical terms or vernacular that the model has not been trained on. It may also perform worse on accented speech.
The model only outputs the lowercase French alphabet, hyphens, and apostrophes, so it is not suitable for situations where casing or additional punctuation is required.
## References
[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
[2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
## Acknowledgements
Thanks to NVIDIA for the research behind this model architecture and to the NeMo team for the training framework.