weiqipedia committed on
Commit
566afff
1 Parent(s): cab651c

Update metrics in README.md

Files changed (1)
  1. README.md +5 -4
README.md CHANGED
@@ -43,13 +43,14 @@ The evaluation was done zero-shot with Indonesian prompts and only a sample of 1
  | Model                          | QA (F1) | Sentiment (F1) | Toxicity (F1) | Eng>Indo (ChrF++) | Indo>Eng (ChrF++) | Summary (ROUGE-L) | NLI (Acc) | Causal (Acc) |
  |--------------------------------|---------|----------------|---------------|-------------------|-------------------|-------------------|-----------|--------------|
  | SEA-LION-7B-Instruct-Research  | 24.86   | 76.13          | 24.45         | 52.50             | 46.82             | 15.44             | 33.20     | 23.80        |
- | SEA-LION-7B-Instruct           | **68.41**   | **91.45**          | 17.98         | 57.48             | 58.04             | **17.54**             | **53.10**     | 60.80        |
+ | SEA-LION-7B-Instruct           | **68.41**| **91.45**     | 17.98         | 57.48             | 58.04             | **17.54**         | 53.10     | 60.80        |
  | SeaLLM 7B v1                   | 30.96   | 56.29          | 22.60         | 62.23             | 41.55             | 14.03             | 26.50     | 56.60        |
- | SeaLLM 7B v2                   | 44.40   | 80.13          | **55.24**         | 64.01             | **63.28**             | 17.31             | 43.60     | **82.00**        |
- | Sailor-7B (Base)               | 65.43   | 59.48          | 20.48         | **64.27**             | 60.68             | 8.69              | 15.10     | 38.40        |
+ | SeaLLM 7B v2                   | 44.40   | 80.13          | **55.24**     | 64.01           | **63.28**         | 17.31             | 43.60     | 82.00   |
+ | Sailor-7B (Base)               | 65.43   | 59.48          | 20.48         | **64.27**         | 60.68             | 8.69              | 15.10     | 38.40        |
+ | Sailor-7B-Chat | 38.02 | 87.64 | 52.07 | 64.25 | 61.87 | 15.28 | **68.30** |**85.60** |
  | Llama 2 7B Chat                | 11.12   | 52.32          | 0.00          | 44.09             | 57.58             | 9.24              | 0.00      | 0.00         |
  | Mistral 7B Instruct v0.1       | 38.85   | 74.38          | 20.83         | 30.60             | 51.43             | 15.63             | 28.60     | 50.80        |
- | GPT-4 | 73.60 | 74.14 | 63.96 | 69.38 | 67.53 | 18.71 | 83.20 | 96.00 |
+ | GPT-4 (gpt-4-0314) | 73.60 | 74.14 | 63.96 | 69.38 | 67.53 | 18.71 | 83.20 | 96.00 |
 
  - For Natural Language Understanding (NLU) tasks, we tested the model on Sentiment Analysis (`Sentiment`) using the NusaX dataset, Question Answering (`QA`) using the TyDiQA dataset, and Toxicity Detection (`Toxicity`) using the Indonesian Multi-Label Hate Speech Detection dataset. The metrics used are F1 scores for all three tasks.
  - For Natural Language Generation (NLG) tasks, we tested the model on Machine Translation from English to Indonesian (`Eng>Indo`) and from Indonesian to English (`Indo>Eng`) using the FLORES-200 dataset, and Abstractive Summarization (`Summary`) using the XLSum dataset. The metrics used for Machine Translation and Abstractive Summarization are ChrF++ and ROUGE-L respectively.
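The F1 numbers in the table are computed differently per task: `Sentiment` and `Toxicity` are classification tasks, while `QA` F1 is conventionally the SQuAD-style token-overlap F1. A minimal sketch of both, assuming `scikit-learn` for the classification side (the commit does not state the exact tooling, and all labels and strings below are toy values):

```python
from collections import Counter

from sklearn.metrics import f1_score  # assumed tooling; any F1 implementation works


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1, the usual metric for extractive QA (e.g. TyDiQA)."""
    pred_toks, ref_toks = prediction.split(), reference.split()
    overlap = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)


# Classification-style tasks (Sentiment, Toxicity): F1 over predicted vs. gold labels.
gold = ["positive", "negative", "neutral"]  # toy labels, not from the benchmark
pred = ["positive", "neutral", "neutral"]
print(f1_score(gold, pred, average="macro"))

# Extractive QA: token-overlap F1 between the predicted span and the gold answer.
print(token_f1("di Jakarta", "Jakarta"))
```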
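The generation metrics are likewise standard, off-the-shelf measures: chrF++ corresponds to sacrebleu's chrF with word-level bigrams enabled, and ROUGE-L is available in the `rouge-score` package. A minimal sketch under those assumed library choices (neither is stated in the commit), again with toy sentences:

```python
from sacrebleu.metrics import CHRF     # chrF++ = chrF with word_order=2
from rouge_score import rouge_scorer   # ROUGE-L for abstractive summarization

# Machine translation (Eng>Indo, Indo>Eng): corpus-level chrF++.
chrf = CHRF(word_order=2)
hypotheses = ["Kucing itu duduk di atas tikar."]  # toy system outputs
references = [["Kucing duduk di atas tikar."]]    # one reference stream
print(chrf.corpus_score(hypotheses, references).score)

# Abstractive summarization (Summary): per-example ROUGE-L F-measure.
scorer = rouge_scorer.RougeScorer(["rougeL"])
result = scorer.score(
    target="pemerintah mengumumkan kebijakan baru",  # toy gold summary
    prediction="pemerintah umumkan kebijakan baru",  # toy model summary
)
print(result["rougeL"].fmeasure)
```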