phi3 4K vs 128K

#87
by Emilio - opened

Hello there, a quick question here. I just read the technical report and I am not sure what the benefit of using the 4K version is. Apparently the 128K version has similar performance across the board.

Thanks!

Depending on your hardware and how it implements the attention computation, the 4K version can be much faster than the 128K version due to the quadratic scaling of attention computation with respect to context size. Note that 4K context is more than enough for simple interactions.
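A back-of-the-envelope sketch of that quadratic growth, assuming Phi-3-mini-like dimensions (32 layers, 32 heads, head dim 96; adjust these for your checkpoint):

```python
# Rough attention-only cost estimate; the layer/head/head-dim numbers below
# are assumptions based on the Phi-3-mini config, not exact measurements.

def attention_flops(seq_len: int, num_layers: int = 32,
                    num_heads: int = 32, head_dim: int = 96) -> float:
    # Per layer, Q·K^T and (softmax scores)·V each cost roughly
    # 2 * num_heads * seq_len^2 * head_dim multiply-adds.
    per_layer = 2 * 2 * num_heads * seq_len * seq_len * head_dim
    return per_layer * num_layers

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> ~{attention_flops(n) / 1e12:8.1f} TFLOPs just for attention")
```

Going from 4K to 128K is a 32x increase in length, so the attention term alone grows by roughly 1000x.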

Thanks armankaz, I understand that the computation scales quadratically with the sequence length. However, my understanding is that both architectures are exactly the same; the only difference is the post-training stage, where the data used for the 128K version is different. So, let me rephrase my question: for the same dataset that I want to forward through the model, is there any difference in speed between using the 4K and the 128K version? My understanding is that there is no difference, since it is the same architecture, and I can only expect improvements from the 128K model if some examples in my data exceed the 4K length.
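If you want to check this empirically, a minimal sketch would be to time the same prompt through both checkpoints (the two repo IDs are the public Hugging Face checkpoints; the prompt and token budget are just placeholders):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Summarize the trade-offs between a 4K and a 128K context window."

for repo in ("microsoft/Phi-3-mini-4k-instruct",
             "microsoft/Phi-3-mini-128k-instruct"):
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(
        repo,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128, do_sample=False)
    elapsed = time.perf_counter() - start
    print(f"{repo}: {elapsed:.2f}s for 128 new tokens")
```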

The 4K and 128K versions are not identical: extending the context to 128K is a significant architectural difference (per the technical report, the 128K variant uses LongRoPE for its positional embeddings). The 128K model will never be faster than the 4K model, so if you don't need the 128K model, you shouldn't use it.
If your data exceeds 4K tokens, the 4K version will give an error and stop computation. That is why there is a 128K version.
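One way to guard against that is to compare your token count with the checkpoint's advertised window before calling generate. A minimal sketch, assuming the public repo IDs and a placeholder `text` variable for whatever you plan to feed in:

```python
from transformers import AutoConfig, AutoTokenizer

repo = "microsoft/Phi-3-mini-4k-instruct"  # or "microsoft/Phi-3-mini-128k-instruct"
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo)

text = "..."  # your input document / conversation
n_tokens = len(tokenizer(text).input_ids)
limit = config.max_position_embeddings  # 4096 for the 4K model, 131072 for 128K

if n_tokens > limit:
    print(f"{n_tokens} tokens exceeds the {limit}-token window; "
          "truncate the input or switch to the 128K checkpoint.")
```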
