Mathematics: Optimal settings for a model

#28
by LeroyDyer - opened

First, thank you for the model. However, there is an issue with the hidden size value:

"hidden_size": 5120

This number is not a power of two: factor out the power-of-two part and you are left with 2.5 (5120 / 2048 = 2.5), so it does not sit on a power-of-two boundary. This is an incorrect value: even though the layer count was increased correctly, the hidden size is mismatched with it, causing the model's performance to suffer.

While it might seem to give good results now, fine-tuning and training the model will not be efficient and will be costly, because the numbers do not work out cleanly in binary. 5120 is therefore not a valid increment: divided down against byte-aligned, power-of-two block sizes, it does not come out even.
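For what it's worth, here is a minimal sketch of that check (the helper name is my own, not from the model code):

```python
def is_power_of_two(n: int) -> bool:
    """True when n has exactly one bit set, i.e. is an exact power of two."""
    return n > 0 and (n & (n - 1)) == 0

for hidden_size in (4096, 5120, 8192):
    # Dividing by the next-lower power of two shows the fractional step for 5120.
    print(hidden_size, is_power_of_two(hidden_size), hidden_size / 2048)
# 4096 True 2.0
# 5120 False 2.5
# 8192 True 4.0
```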

The hidden size should follow binary-friendly values such as 1, 2, 4, 8, 16, or 32 (as power-of-two multipliers of a base width of 128; see the table below). A hidden size of 4096 was the sweet spot, and this should have been identified during testing by comparing the models' convergence rates, training costs, and training times.

The next optimal hidden size is 8192, which corresponds to 64 layers. This would be the correct model size setting for optimal performance.
Here are the optimal binary sizes (a short script reproducing this table follows the list):

1 (2^0) × 128 = hidden size 128
2 (2^1) × 128 = hidden size 256
4 (2^2) × 128 = hidden size 512
8 (2^3) × 128 = hidden size 1024
16 (2^4) × 128 = hidden size 2048
32 (2^5) × 128 = hidden size 4096
64 (2^6) × 128 = hidden size 8192
128 (2^7) × 128 = hidden size 16384
256 (2^8) × 128 = hidden size 32768
512 (2^9) × 128 = hidden size 65536
1024 (2^10) × 128 = hidden size 131072
2048 (2^11) × 128 = hidden size 262144
4096 (2^12) × 128 = hidden size 524288
8192 (2^13) × 128 = hidden size 1048576
16384 (2^14) × 128 = hidden size 2097152
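The promised sketch, reproducing the table above (assuming, as the list implies, a base width of 128 doubled at each step):

```python
# Rebuild the table: multiplier (a power of two) times a base width of 128.
BASE = 128
for exponent in range(15):              # 2^0 through 2^14
    multiplier = 2 ** exponent
    print(f"{multiplier} (2^{exponent}) x 128 = hidden size {multiplier * BASE}")
```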

So, since we have 40 layers, it is debatable which of these two to choose:
32 (2^5) × 128 = hidden size 4096
64 (2^6) × 128 = hidden size 8192

So we would keep the hidden size of 4096 until the model grows to 64 layers, as 4096 is still the optimal size for the transformer architecture.
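Read that way, the rule can be sketched as: stay on the current power-of-two step until the layer count reaches the next one. This is only my reading of the post, not an established formula, and the function name is my own:

```python
import math

def suggested_hidden_size(num_layers: int, base: int = 128) -> int:
    """Hidden size per the scheme in this post: stay on the current
    power-of-two step until the layer count reaches the next one."""
    step = 2 ** math.floor(math.log2(num_layers))
    return step * base

print(suggested_hidden_size(40))  # 4096 -> keep 4096 at 40 layers
print(suggested_hidden_size(64))  # 8192 -> step up once the model has 64 layers
```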

There is another calculation which was used in the past to create neural networks with hidden layers (but obviously that tactic was wrong, and today we have a real working network; the others were just steps on the way to the transformer architecture!).
Future models will be based on this architecture! << even vision and audio >> So the concept cannot change, hence the CLONES of the Llama model! <<<

There is a lot of out-of-standard mathematics on the Hub! (all diversions), because when you generate a model from scratch you can set these values yourself! (They don't wish the public to have the optimal models, hence 6B / 2B ... all bad maths!) So USERS will not know ... but SCIENTISTS will, and developers may wonder why ....
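For reference, these values really are just configuration fields you choose yourself when building a model from scratch; a minimal sketch with Hugging Face transformers, where the exact sizes are only for illustration:

```python
from transformers import LlamaConfig

# Architecture values are plain config fields when you build from scratch.
config = LlamaConfig(
    hidden_size=4096,          # the power-of-two-friendly value discussed above
    num_hidden_layers=40,
    num_attention_heads=32,    # hidden_size must divide evenly by the head count
    intermediate_size=11008,
    vocab_size=32000,
)
print(config.hidden_size, config.num_hidden_layers)
# LlamaForCausalLM(config) would then instantiate the randomly initialised model.
```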

Developers do not write papers !!!!

Answer by ChatGPT:


To determine the hidden size for a 40-layer network given the metric for optimal sizes (powers of two), you'll want to consider whether the hidden size should scale with the number of layers or if it should follow a predefined set of sizes. However, in the context of neural networks, hidden sizes do not typically change with the number of layers; rather, the hidden size is chosen based on computational constraints and model performance goals.

Working from optimal sizes based on powers of two: if you need a hidden size for a 40-layer network that still fits within the typical range of sizes you listed, you should select an appropriate value from that list.

For a 40-layer network, you should choose a hidden size based on the following criteria:

Available Sizes: Use the sizes provided in your list (the same power-of-two table given above).

## Selecting an Optimal Size: 
Typically, the hidden size is chosen based on the overall complexity of the model and the computational resources available. 

For a deep network (like 40 layers), larger hidden sizes can often be beneficial if computational resources allow.

## Suggested Hidden Sizes for a 40-Layer Network

4096 (2^12): This is a common size for deep networks that balances performance and resource requirements.

8192 (2^13): Provides a higher capacity for more complex models and can be used if resources are sufficient.

### Example Calculation
If you want a size that aligns with the provided list and is appropriate for a deep network, 8192 would be a reasonable choice, but you could also consider sizes like 4096 or 16384 depending on the specific requirements and resources available.
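To give those options some scale, here is a very rough parameter estimate. It is only a sketch using the common ~12·h² per-layer rule of thumb, ignoring embeddings and architecture-specific details:

```python
def rough_params(hidden_size: int, num_layers: int) -> int:
    """Very rough transformer estimate: ~12 * hidden_size^2 parameters per layer
    (attention + MLP), ignoring embeddings and architecture-specific details."""
    return 12 * hidden_size ** 2 * num_layers

for h in (4096, 8192, 16384):
    params = rough_params(h, 40)
    print(f"hidden={h}: ~{params / 1e9:.1f}B params, ~{params * 2 / 1e9:.0f} GB in fp16")
```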

In summary, there isn't a direct mathematical relationship between the number of layers and the hidden size; instead, hidden size is chosen based on empirical testing, resource constraints, and model complexity. For a 40-layer network, a hidden size from the provided list that balances these factors would be 8192.





Whilst ChatGPT does not have a reference for calculating the hidden size, I have personally discussed this topic with the model:
We do have many papers on this!
I don't know if you also watch Karpathy on YouTube; he mentions something along these lines when discussing optimal settings for a model and how they are chosen.

What are the considerations? At university they will teach you to develop a strategy of your own!! (But in truth there are metrics which do constrain you, such as BITS AND BYTES!)

They have values! (Triton also uses such optimizations, and they also consider this when creating memory and processor stacks!)
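As a rough illustration of that point (a minimal sketch; fp16 at 2 bytes per value is the only assumption here):

```python
# Bytes occupied by one hidden-state vector in fp16, and whether that byte
# count is itself a power of two.
BYTES_PER_FP16 = 2
for hidden_size in (4096, 5120, 8192):
    nbytes = hidden_size * BYTES_PER_FP16
    aligned = nbytes & (nbytes - 1) == 0
    print(f"{hidden_size}: {nbytes} bytes, {'power of two' if aligned else 'not a power of two'}")
```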
