[FLAG] fblgit/una-xaberius-34b-v1beta

#444
by XXXGGGNEt - opened

It was trained from Yi-34B, yet it is listed between two flagged models at half their size.

Compared with its base model:

ARC: 64.59 → 70.39 (+5.80 points, ↑9.0%)
TruthfulQA: 56.23 → 61.45 (+5.22 points, ↑9.3%)
GSM8K: 50.64 → 63.38 (+12.74 points, ↑25.2%)
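As a quick sanity check, the deltas can be recomputed from the quoted scores (note that the figures reported in this thread mix absolute point gains with relative percentage gains):

```python
# Score comparison of una-xaberius-34b-v1beta against its Yi-34B base,
# using the numbers quoted in this thread.
base = {"ARC": 64.59, "TruthfulQA": 56.23, "GSM8K": 50.64}
una = {"ARC": 70.39, "TruthfulQA": 61.45, "GSM8K": 63.38}

for bench in base:
    b, u = base[bench], una[bench]
    # Report both the absolute gain in points and the relative gain.
    print(f"{bench}: {b} -> {u} (+{u - b:.2f} pts, +{(u - b) / b * 100:.1f}%)")
```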

The maker claims he used a method called UNA, but he has never said a word about what it is. He said he would share details later when he's free, but I doubt he ever will. And it's suspicious one-man work.

And there are discussions about its hallucinations, suggesting it was overfit on benchmarks: https://huggingface.co/fblgit/una-xaberius-34b-v1beta/discussions/2 and https://www.reddit.com/r/LocalLLaMA/comments/18ejwm1/what_is_fblgituna_unified_neural_alignment_looks/

Either he explains what he has done, or this model should be flagged.

These endless cheats are killing the credibility of the leaderboard. @clefourrier

Not sure if I'm convinced, but transparency on UNA would be nice, especially since it's popping up more and more (e.g. in merges).

Edit: Something's not right. People keep taking the leading Mistral model, adding some mysterious DPO to it, and now they're scoring higher than any Llama 70b, such as the current leader Marcoroni 7B v3 with a score of 72.5 (https://huggingface.co/AIDC-ai-business/Marcoroni-7B-v3).

UNA models are not like non-UNA models; their properties are unique, and this is now known. This FLAG is nonsense, and I won't release something that can indeed be dangerous to society, especially under bigcorp.com... You can say whatever you want, bring contamination evidence... you're making it look like the models are not extremely intelligent :)

Either he explains what he has done, or this model should be flagged.

What nonsense... you should avoid typing under the influence.

UNA models are not like non-UNA models; their properties are unique, and this is now known.

So you are not going to say a word about the method, let alone the suspiciously 'good' performance?
Fair enough; no doubt you cannot tell the truth, since you cheated, either by training directly on the benchmarks or on rewrites of them.
Or you could say something before we examine the differences in loss across the train/test/dev sets and prove it was contaminated on purpose.

You have been evading direct answers about your methods. Combined with these questionable results, if you cannot explain them, then an observer other than yourself can draw only one conclusion: you have not maintained the most basic integrity.
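For what it's worth, the loss-difference check proposed above is straightforward to sketch. The helper below is a hypothetical illustration (the loss values are made-up placeholders, not real measurements; in practice you would compute per-example negative log-likelihoods of the benchmark items under the model and compare them against paraphrased or held-out controls):

```python
from statistics import mean

def loss_gap(test_losses, control_losses):
    """Mean loss on control items minus mean loss on benchmark test items.

    A model that has memorized the benchmark shows unusually low loss on
    the original test items relative to paraphrases of the same content,
    so a large positive gap is a contamination red flag.
    """
    return mean(control_losses) - mean(test_losses)

# Hypothetical per-example losses (placeholders only):
test = [0.8, 0.7, 0.9, 0.6]      # original benchmark test items
control = [2.1, 1.9, 2.3, 2.0]   # paraphrased versions of the same items

print(f"loss gap: {loss_gap(test, control):.2f}")
```

A gap near zero is what an uncontaminated model would be expected to show; how large a gap counts as damning is a judgment call that needs a proper significance test over many examples.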

So you are not going to say a word about the method, let alone the suspiciously 'good' performance? LOL...
IDK man, the thousands of people highlighting the uniqueness of UNA and its models... the hundreds of people training their model a second time without degrading it... the scores... the performance on real tasks... it's not enough? What's needed is for you to validate the code... hah...
This really looks like a kid on the supermarket floor whining for source code he doesn't have. Not much argument, no rationale... Talk is free, commits are not.

Go find some evidence to support your fairy tale. So far, UNA models are SOLID: Juanako, Cybertron, Xaberius... they are unique UNA kings... no contamination.
The only contamination resides in your mouth. And to confirm, I'm not sharing the source code with you, kid.

UNA: Uniform Neural Alignment. It operates on the attention layers and the multilayer perceptrons, and it does what it says. There are multiple phases. Juanako = UNA v1, implemented only at the perceptron level. Cybertron = UNA v2, applied to both the MLP and attention. Xaberius = UNA v1, meaning I can release a much more powerful version of it. It's based on platypus-34b, and if you compare the performance, it's not too distant from it. And if you compare what UNA increases (rationale/logic capacity), you'll see the pattern across them.

End of story, and if I feel I'm being extorted for source code... I'll do magnets :)

Cringe.
Because there is NO method, just cheating on benchmark data. And no one is begging for your code, boy.
Kejserens nye Klæder ("The Emperor's New Clothes")
You're just too afraid to say a word about your mythical method.

deleted

@catalystman You're just a troll. Your account has no activity other than this thread. There's no conspiracy. Open-source models are progressing astonishingly fast. Just last year they couldn't even solve simple problems, yet some are now solving problems that the 375b GPT-3.5 usually gets wrong. They wouldn't be able to do this, along with countless other things like fixing user code, writing prompted stories, and summarizing random papers, if their test scores were a sham. The proof is in the pudding.

My guess is you got caught cheating and came back here with a different account to angrily accuse everyone else of cheating.

Close this discussion, please; it's spamming my inbox.

The discussion will be closed when fair treatment is applied and all contaminated models get flagged, not just one. I'm waiting for HF to do something before I open hundreds of flag issues...

deleted
This comment has been hidden

@phil337, I'm a concerned citizen at best. I have no models of my own, nor has anything I have said displayed anger; I'm but a proactive user. I'm concerned because I use Yi and Deepseek with tremendous success, and in my use cases (advanced signal-processing programming and design) they have outperformed all others of similar or smaller size (however, the Mistral base is still impressive). The metrics the creators present for these two models match my experience. I fear some in this thread are focused on discrediting Chinese models and possibly destabilizing the open-source community. I shall keep eating my pudding, and hope the chefs keep improving these dishes and aren't demotivated by propositions in this thread that seemingly have no real bearing on practical application. If I were into conspiracies, Phil, I would probably be talking about the significance of the number 4 in Chinese numerology (bad luck) and the coincidence of this discussion being #444.

@fblgit Do you want me to test your model using my own tests? If your model is really as great as you claim it to be, the tests will show it. Or if you are a dirty cheater, you will get dunked on.

Is there a way to unsubscribe from an HF thread so that new messages don't show up in the inbox? I posted a message here and it seems there's no escape from the notifications. That would be a good feature to have @clefourrier

julien-c locked this discussion
Open LLM Leaderboard org

Hey all! Please read @clefourrier replies above (e.g., https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/444#657c12befcba5f698c2e3fed).

This discussion is going beyond the initial scope of the report, with lots of counterproductive comments, so we're proceeding to lock this conversation. As discussed above, we have ongoing work to analyze contaminants at scale. Feel free to flag contaminated models and open new discussions about concrete issues or ideas.

Thanks, everyone, for contributing to improving the Open LLM Leaderboard, and have a llamastic day! 🤗

Sign up or log in to comment