norm_act_layer altered training dynamics for mobilenetv2_120d? #1447
I've noticed a fairly significant change in my training dynamics after updating timm. In the above figure, all inputs, configurations, and dependencies are identical except for the timm version. My current hypothesis, based on the delta between those two commits (here), is that the introduction of the norm_act_layer changes is responsible.

**Details**

Unfortunately, the model code is not something I can readily share. However, I have included a few details below, the most interesting of which are the ONNX export differences. It's very clear the convolutional layer stack is different, but some elements, like the ONNX ...

**ONNX Differences**

I have no idea where that ...

**Old Export**

Input normalization layers and initial block of ...

**New Export**

Input normalization layers and initial block of ...
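Purely for illustration, here is a rough sketch of how the exports above could be reproduced and diffed; the input resolution, opset version, and file name are assumptions on my part, not details from my actual setup:

```python
import torch
import timm
import onnx

# Build the backbone the same way under the old and new timm commits,
# export each to ONNX, then diff the resulting graphs / text dumps.
model = timm.create_model('mobilenetv2_120d', pretrained=True, num_classes=0)
model.eval()

dummy = torch.randn(1, 3, 224, 224)  # assumed input resolution
torch.onnx.export(
    model, dummy, 'mobilenetv2_120d.onnx',
    opset_version=13,  # assumed opset
    input_names=['input'], output_names=['features'],
)

# Print the first few nodes (input normalization / stem) to compare layer stacks
graph = onnx.load('mobilenetv2_120d.onnx').graph
for node in graph.node[:10]:
    print(node.op_type, node.name)
```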
**Common Model Config**

```yaml
backbone:
  model_name: mobilenetv2_120d
  global_pool: avg
  pretrained: true
  preprocess: true
```
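For anyone trying to reproduce this, the backbone block above presumably maps onto timm.create_model roughly as sketched below (an assumption about how my framework consumes it; the preprocess key is handled outside timm):

```python
import timm

# Sketch: num_classes=0 yields pooled features when the model is used as a backbone.
backbone = timm.create_model(
    'mobilenetv2_120d',
    pretrained=True,
    global_pool='avg',
    num_classes=0,
)
```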
**Common Optimizer Config**

```yaml
optimizer_config:
  opt: adam
  lr: 1.0e-3
  weight_decay: 1.0e-5
  filter_bias_and_bn: True
  kwargs:
    eps: 1.0e-7
```
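Similarly, the optimizer block corresponds roughly to timm's optimizer factory; a sketch assuming create_optimizer_v2 is the entry point, with the extra kwargs forwarded to torch.optim.Adam:

```python
import timm
from timm.optim import create_optimizer_v2

model = timm.create_model('mobilenetv2_120d', pretrained=True, num_classes=0)

optimizer = create_optimizer_v2(
    model,
    opt='adam',
    lr=1.0e-3,
    weight_decay=1.0e-5,
    filter_bias_and_bn=True,  # keep weight decay off bias / BN parameters
    eps=1.0e-7,               # forwarded to the underlying Adam optimizer
)
```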
**Common Scheduler Config**

```yaml
scheduler_config:
  sched: cosine
  warmup_lr: 1.0e-5
  warmup_epochs: 2
  decay_epochs: 1.0
  decay_rate: 1.0
  epochs: 30
  lr_cycle_decay: 0.95
  lr_cycle_limit: 20
```
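And the scheduler block maps onto timm's CosineLRScheduler roughly like this; again a sketch, and the mapping of the config keys onto the cosine cycle parameters is my assumption:

```python
from timm.scheduler import CosineLRScheduler

# decay_epochs / decay_rate (both 1.0 above) are not used by the cosine schedule
# here and are left at their defaults.
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=30,           # epochs
    warmup_t=2,             # warmup_epochs
    warmup_lr_init=1.0e-5,  # warmup_lr
    cycle_decay=0.95,       # lr_cycle_decay
    cycle_limit=20,         # lr_cycle_limit
)

for epoch in range(30):
    # ... train one epoch ...
    scheduler.step(epoch + 1)
```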
**Common Dependencies**
If anyone has already encountered this or notices something I have missed, I would be very thankful. Thanks for your time, and thanks for the amazing repo! I am very motivated to figure this out so that I can benefit from all the improvements since.
Replies: 1 comment 5 replies
@AffineParameter please see #1444 and #1254 ... does that answer the issue? (i.e. are you using sync BN?) In general I would avoid syncbn unless you really need it (you're down at very low batch sizes like < 16). The torch native sync bn conversion hack does not work with norm + act layers, so I've added a timm version (works for native AMP + syncbn, but I haven't added support for APEX).
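To make the two options concrete, here is a minimal sketch of both conversion paths, assuming the timm helper is exposed as convert_sync_batchnorm (the exact import path may differ between releases):

```python
import torch
import timm
# Assumption: in some releases this lives under timm.models.layers instead.
from timm.layers import convert_sync_batchnorm

model = timm.create_model('mobilenetv2_120d', pretrained=True)

# torch-native conversion: per the reply above, this does not handle timm's
# fused norm + act layers (e.g. BatchNormAct2d) correctly.
model_native = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# timm's conversion: aware of the fused norm + act layers (native AMP + syncbn
# works; APEX is not supported per the reply).
model_timm = convert_sync_batchnorm(model)
```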