Why not include MedQA in your benchmarks?

#1
by Hugman2345 - opened

It's one of the good reasoning benchmarks, built on USMLE questions. It was included in the benchmarks for Phi-3 and its June update, so it makes sense to include it in the Phi-3.5 benchmarks, no?

Thanks for the model and all your work too!

Thank you for your interest in the Phi-3.5 models! We did benchmark MedQA 🩺 but we will let the community run this benchmark themselves (hint: we think the Phi-3.5 MoE and Mini are very competitive 🌞)

It's great and competes with much bigger models on USMLE/medical questions, information, and reasoning. In this area, Phi-3.5 is better than other 7B, 8B, and 9B competitors, and its bigger context size is a plus. Sadly, it feels like it doesn't beat Phi-3-small-8k and Phi-3-medium-4k in this particular area. This is just from first impressions and needs to be confirmed by others. Still, it's so much better than other tiny models that it's not even remotely close.

Thanks for Phi-3.5, I don't know how such a small model is even close to the level of big models.

@Hugman2345 Thank you for your effort in independently benchmarking the Phi-3.5 models on MedQA. It is great to see that the models perform within our expectations.
