---
license: apache-2.0
language: fr
library_name: nemo
datasets:
- mozilla-foundation/common_voice_13_0
- multilingual_librispeech
- facebook/voxpopuli
- google/fleurs
- gigant/african_accented_french
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- CTC
- Transformer
- pytorch
- NeMo
- hf-asr-leaderboard
model-index:
- name: stt_fr_fastconformer_hybrid_large
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: fr
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 9.16
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 4.82
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 9.23
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      config: fr_fr
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 8.65
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      config: fr
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 6.55
---

# FastConformer-Hybrid Large (fr)

<style>
img {
 display: inline;
}
</style>

| [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)

This model replicates [nvidia/stt_fr_fastconformer_hybrid_large_pc](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc), but predicts only lowercase French letters, the hyphen, and the apostrophe. This choice gives up casing, digits, and punctuation in the output, but it can improve accuracy in use cases that do not need them.

Like its sibling, this is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters). It is a hybrid model trained with two loss functions: Transducer (the default decoder) and CTC.

## Performance

We evaluated our model on the following datasets and re-ran the evaluation on other models for comparison. The reported WER is computed after converting numbers to words, removing punctuation (except apostrophes and hyphens), and lowercasing all characters.

![Benchmarks](https://huggingface.co/bofenghuang/stt_fr_fastconformer_hybrid_large/resolve/main/assets/bench.png)

All the evaluation results can be found [here](https://drive.google.com/drive/folders/1adZTgGAptYx2ut9jddjmlj5--dkY2XWZ?usp=sharing).
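
As a rough illustration of the normalization described above, the following is a minimal sketch, not the evaluation script used for the reported numbers. The helper names `normalize_text` and `word_error_rate` are hypothetical, and number-to-word conversion is left out because it needs an external library; digits are kept as-is here.

```python
import re
import unicodedata

# Anything that is not a letter/digit, whitespace, apostrophe, or hyphen.
# This character set is an assumption made for illustration.
_DISALLOWED = re.compile(r"[^\w\s'-]", flags=re.UNICODE)


def normalize_text(text: str) -> str:
    """Lowercase the text and strip punctuation except apostrophes and hyphens."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("\u2019", "'")  # map the typographic apostrophe to ASCII
    text = _DISALLOWED.sub(" ", text)
    text = text.replace("_", " ")  # "\w" also matches underscores, so drop them
    return " ".join(text.split())  # collapse repeated whitespace


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize_text(reference).split()
    hyp = normalize_text(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

The function returns a fraction, while the figures in this card are percentages, so multiply by 100 to compare.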

## Usage

The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

```python
# Install nemo
# !pip install nemo_toolkit['all']

import nemo.collections.asr as nemo_asr

model_name = "bofenghuang/stt_fr_fastconformer_hybrid_large"
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

# Path to your 16kHz mono-channel audio file
audio_path = "/path/to/your/audio/file"

# Transcribe with default transducer decoder
asr_model.transcribe([audio_path])

# (Optional) Switch to CTC decoder
asr_model.change_decoding_strategy(decoder_type="ctc")

# (Optional) Transcribe with CTC decoder
asr_model.transcribe([audio_path])
```
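
Since the snippet above expects 16 kHz mono audio, it can help to check a file before transcribing it. The following is a minimal sketch using Python's standard `wave` module; the helper name `is_16khz_mono_wav` is hypothetical, and the check only works for uncompressed PCM WAV files.

```python
import wave


def is_16khz_mono_wav(path: str) -> bool:
    """Return True if the WAV file is single-channel and sampled at 16 kHz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels() == 1 and wav_file.getframerate() == 16000
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before being passed to `transcribe`.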

## Datasets

This model was trained on a composite dataset of over 2,500 hours of French speech audio and transcriptions, including [Common Voice 13.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Fleurs](https://huggingface.co/datasets/google/fleurs), [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french), and more.

## Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade on speech that includes technical terms or vernacular the model has not been trained on. The model might also perform worse on accented speech.

The model exclusively generates the lowercase French alphabet, hyphen, and apostrophe. Therefore, it may not perform well in situations where uppercase characters and additional punctuation are also required.

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)

[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## Acknowledgements

Thanks to NVIDIA for its research on the advanced model architecture and to the NeMo team for the training framework.