Questions about training recipe

#1
by rise123 - opened

Hi Ross,

If possible, could you share the hyperparameters you used to train the MobileNetV4 hybrid models? If I understand correctly, you got 83.8 top-1 vs. our 83.4 for MNv4-Hybrid-Large with 384px input, and 81.3 vs. our 80.7 for MNv4-Hybrid-Medium with 256px input. I am very curious to know how you achieved that.

Best,
Danfeng

PyTorch Image Models org

@rise123 I plan to add the hparams (via YAML arg files) to these repos soon; I have to run through and gather them all. Overall, most settings aren't that far off from the paper. For the hybrid models I did run into some stability issues and had to dial back the LR a bit, or lower beta2.

My weight init was based on the EfficientNet scheme; I'm not sure whether the default tf.keras layers follow that. Though, I recently changed my weight init for the MQA module to try and match Keras more closely and improve stability (the Xavier uniform init had a lower std-dev than my default for the qkv/out projections), and I now have 83.95 for the hybrid-large @ 384.
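
As a rough illustration of that std-dev gap, here is a hedged sketch (the dim and layer below are assumptions for illustration, not the actual timm init code): Xavier uniform vs. a fan_out-based normal init for an attention output projection.

```python
# Illustrative only: compare the spread of Xavier/Glorot uniform against a
# fan_out-based normal init (the kind used for convs in EfficientNet-style
# schemes) on a stand-in for the MQA output projection.
import math
import torch.nn as nn

dim = 768                                   # assumed channel dim
proj = nn.Linear(dim, dim, bias=False)      # stand-in for the out projection

# Xavier uniform: std = sqrt(2 / (fan_in + fan_out))
nn.init.xavier_uniform_(proj.weight)
print('xavier uniform std:', proj.weight.std().item())          # ~0.036

# fan_out-based normal: std = sqrt(2 / fan_out), larger whenever fan_in > 0
nn.init.normal_(proj.weight, std=math.sqrt(2.0 / dim))
print('fan_out normal std:', proj.weight.std().item())          # ~0.051
```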

timm has a rather different implementation of RandAugment. Most of the tf.data-based implementations I'm aware of don't consistently vary the magnitude of the augs with the M value: some ops get stronger as M increases while others get weaker, and there are some really questionable outputs for posterize etc. at certain M values... that could be a factor.
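
For reference, a small sketch of how timm's RandAugment can be constructed directly; the config string and hparams values here are illustrative examples, not the recipe used for these weights.

```python
# Sketch: timm's RandAugment. 'm9' sets the magnitude, 'mstd0.5' adds per-op
# noise to the magnitude, and 'inc1' selects the variant where op strength
# increases monotonically with M.
from timm.data.auto_augment import rand_augment_transform

ra = rand_augment_transform(
    config_str='rand-m9-mstd0.5-inc1',
    hparams={'translate_const': 100, 'img_mean': (124, 116, 104)},
)
# `ra` operates on PIL images and is typically composed after
# RandomResizedCrop in the training transform pipeline.
```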

NOTE: In the supplementary material, section A of the ResNet Strikes Back paper, I outlined some of the differences and features of the timm augs that may be having an impact here: https://arxiv.org/abs/2110.00476

PyTorch Image Models org

@rise123 https://gist.github.com/rwightman/f6705cb65c03daeebca8aa129b1b94ad ... posted hparams

I redid the hybrid-large @ 384 and have 83.99 result now.

Also ran higher res hybrid-mediums at 256 and also at 384, the 384 is pretty nice, matches my conv-large at 384, 82.97.


Hi Ross,

Many thanks for the insightful discussion. Your new results on the hybrid models are amazing! A 0.6% accuracy improvement on hybrid-large is quite a lot, and your new result on hybrid-medium@384 just rocks. We are interested in investigating further and figuring out the key enablers. Please see detailed comments below.

  • β€œFor the hybrid models I did run into some stability issues and had to dial back LR a bit, or lower beta2.”
    We met the same issue. We found LayerScale is very helpful in stabilizing the training. Additionally, we found very large batch sizes (16k), and 2x longer warm up epochs helps to stabilize it. We have tried a lower learning rate, it helped with stability but led to much worse accuracy. FYI, our final training curve still looks a bit odd, training curves looked good, but the eval curve first had a big dip, then went back up again, and finally recovered. A reason for the instability could be that MobileNets are very deep, because of ExtraDepthwise.
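
A minimal LayerScale sketch, assuming the general per-channel residual scaling idea rather than the exact implementation in either codebase:

```python
# Minimal LayerScale sketch: a learnable per-channel scale applied to a
# residual branch, initialized near zero so each block starts close to
# identity, which is what helps early-training stability.
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    def __init__(self, dim: int, init_value: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gamma

# typical use inside a block:  x = x + layer_scale(attn(norm(x)))
```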

Thanks for the insights on RandAugment and the ResNet Strikes Back paper. I read the paper a long time ago and learned a lot from it; I will read it carefully again. Do you have any ablation study with an apples-to-apples comparison of timm RandAugment vs. TF RandAugment? Since the MNv4-Conv models have rather similar results in timm and TF, and RandAugment is used for both, would RandAugment be the main contributor to the large improvements you see on the hybrid models?

One thing that could make a difference is numeric precision. Please correct me if this is not the case: on GPU the default precision is FP32, which is what you use in timm training, while we use BF16 on TPU for all of our training. When we are hitting training stability issues, numeric precision could have played a bigger role than we appreciated. We will try to enable FP32 on our side and update you on that.
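
As a rough way to see the numeric gap (a sketch with a toy module, not either training setup), one can compare a forward pass in default FP32 against bfloat16 autocast:

```python
# Sketch: same forward pass in default FP32 vs. bfloat16 autocast; the latter
# roughly mirrors TPU-default BF16 matmuls. Toy layer for illustration only.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device='cuda')

y_fp32 = model(x)                                    # default FP32 on GPU

with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    y_bf16 = model(x)                                # matmuls run in BF16

print((y_fp32 - y_bf16.float()).abs().max())         # size of the numeric gap
```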

Best,
Danfeng

Hi Ross,

Just to confirm: I guess the ImageNet numbers reported here are obtained using an inference input resolution higher than the training one. Correct? If so:

  1. That would explain part of the delta WRT the results in the MobileNet V4 paper.
  2. Any idea of how your numbers look without this trick?

Context

  • The release notes mention "Image size: test = 448 x 448"
  • Which appears consistent with the contents of config.json:
    "input_size": [
      3,
      384,
      384
    ],
    "test_input_size": [
      3,
      448,
      448
    ],

Thanks!

PyTorch Image Models org

@fornoni all results are posted at both the train and test sizes; the train-size results are still good, at least matching the paper and in some cases above it by a statistically significant amount.

https://huggingface.co/timm/mobilenetv4_hybrid_large.e600_r384_in1k#model-comparison

If you look at the tables, you'll see two results for each weight: the 'rXXX' in the model tag is the train res, and there is an eval result at that resolution plus one at a higher resolution. Also, in the collection for these models I added a note with the top-1 evals, with the higher-res eval first and the train-res eval second: https://huggingface.co/collections/timm/mobilenetv4-pretrained-weights-6669c22cda4db4244def9637

This is the train-test resolution discrepancy (https://arxiv.org/abs/1906.06423), though it's since been noticed that you do not need to fine-tune the model to leverage it, especially with more aggressive aug schemes like RandAugment on top of the usual random-resized-crop. Other papers post both (or sometimes just) the higher-res eval; I feel it's worth posting both to be fair to papers that only post train-res results. In Table 4, page 14 of ResNet Strikes Back (https://arxiv.org/abs/2110.00476), we posted eval accuracy vs. image size for the main model trained at 224. You usually see the peak at 1.2-1.4x the train res, and at higher res, moving to no crop is usually better.
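
For anyone wanting to reproduce both numbers, a hedged sketch of building eval transforms at the train and test resolutions with timm (assuming `use_test_size` is available in the installed timm version):

```python
# Sketch: build eval transforms at the train res (384) and the higher test
# res (448) recorded in the pretrained config, then run the usual ImageNet
# eval loop with each and compare top-1.
import timm
from timm.data import create_transform, resolve_model_data_config

model = timm.create_model(
    'mobilenetv4_hybrid_large.e600_r384_in1k', pretrained=True,
).eval()

cfg_train = resolve_model_data_config(model)                      # 384x384
cfg_test = resolve_model_data_config(model, use_test_size=True)   # 448x448

tf_train = create_transform(**cfg_train, is_training=False)
tf_test = create_transform(**cfg_test, is_training=False)
```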
