[BUG] Why training hand writing digits produce incorrect results #1558

shunmian · 2022-11-21T08:49:18Z

Describe the bug

I have been trying to train efficientnet_b2 to classify handwritten digits. However, the training is not working correctly.

To Reproduce

Step 1: Train

./distributed_train.sh 1 ./HWD/ --model efficientnet_b2 -b 128 --sched step --epochs 100 --decay-epochs 2.4 --decay-rate .97 --opt rmsproptf --opt-eps .001 -j 8 --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.3 --drop-path 0.2 --model-ema --model-ema-decay 0.9999 --aa rand-m9-mstd0.5 --remode pixel --reprob 0.2 --amp --lr .016 --num-classes 10

The ./HWD/ folder has the following structure:

- train
  - 0
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg
  - 1
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg
  - ...
  - 9
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg

- validation
  - 0
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
  - 1
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
  - ...
  - 9
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg

When training finished, the log said "Best metric: 10.0 (epoch 0)".
The full training log is as follows:

epoch,train_loss,eval_loss,eval_top1,eval_top5,lr
0,3.4499157269795737,2.3024010416666667,10.0,50.0,1e-06
1,3.624408103801586,2.3024010416666667,8.266666666666667,50.0,0.0032008
2,3.4603706112614385,2.3025208333333333,7.866666666666666,50.0,0.0064006
3,2.9570982544510453,2.302640625,10.0,50.0,0.0096004
4,2.7964119204768427,2.3029270833333335,10.0,50.0,0.0128002
5,2.799766266787494,2.302546875,10.0,49.46666666666667,0.015054399999999999
6,2.5497683684031167,2.3022604166666665,10.0,50.0,0.015054399999999999
7,2.4894364321673357,2.3029270833333335,10.0,49.8,0.015054399999999999
8,2.4517396291097007,2.302640625,10.0,50.0,0.014602768
9,2.4140525658925376,2.3028541666666666,10.0,50.0,0.014602768
....
90,0.9781680151268288,2.3033385416666667,10.0,50.0,0.00518410931230147
91,0.9978100569159897,2.3030052083333334,10.0,50.0,0.00518410931230147
92,0.9757056457025034,2.303671875,10.0,50.0,0.005028586032932427
93,0.9612530094605906,2.3033854166666665,10.0,50.0,0.005028586032932427
94,0.9917539711351748,2.3035520833333334,10.0,50.0,0.004877728451944453
95,0.9673527876536051,2.303765625,10.0,50.0,0.004877728451944453
96,0.957898685225734,2.302765625,10.0,50.0,0.00473139659838612
97,0.9409491817156473,2.303598958333333,10.0,50.0,0.00473139659838612
98,0.9428280658192105,2.302979166666667,10.0,50.0,0.00473139659838612
99,0.9692743840040984,2.3031458333333332,10.0,50.0,0.0045894547004345365
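A quick sanity check on these numbers, as a minimal sketch: an eval_loss near 2.3026 with eval_top1 at 10.0 and eval_top5 at 50.0 is what a 10-class classifier scores when it gives every image the same output. This assumes a class-balanced validation set (150 images per class, as in the folder listing above); the function name is hypothetical.

```python
import math

def chance_level_metrics(num_classes: int, k: int = 5):
    """Loss and accuracy of a classifier that ignores its input,
    evaluated on a class-balanced dataset."""
    # Cross-entropy of a uniform prediction over num_classes classes.
    loss = math.log(num_classes)
    # A single fixed guess is correct for exactly one class out of N.
    top1 = 100.0 / num_classes
    # A fixed ranking covers k of the N classes.
    topk = 100.0 * min(k, num_classes) / num_classes
    return loss, top1, topk

loss, top1, top5 = chance_level_metrics(10)
print(f"loss={loss:.4f} top1={top1:.1f} top5={top5:.1f}")
# prints: loss=2.3026 top1=10.0 top5=50.0
```

So the eval columns suggest the evaluated weights never moved away from chance, even though train_loss kept falling.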

Step 2: Inference

When I run inference with

python inference.py ./output/inference/hwd/input --model efficientnet_b2 --checkpoint ./output/train/20221121-221541-efficientnet_b2-256/last.pth.tar --output_dir ./output/inference/hwd/output --num-classes 10

it produces an unexpected result:

0-352.jpg,9,3,5,2,8
0-353.jpg,9,3,5,2,8
1-351.jpg,9,3,5,2,8
1-352.jpg,9,3,5,2,8
2-351.jpg,9,3,5,2,8
2-352.jpg,9,3,5,2,8
3-351.jpg,9,3,5,2,8
3-352.jpg,9,3,5,2,8
4-351.jpg,9,3,5,2,8
4-352.jpg,9,3,5,2,8
5-351.jpg,9,3,5,2,8
5-352.jpg,9,3,5,2,8
6-351.jpg,9,3,5,2,8
6-352.jpg,9,3,5,2,8
7-351.jpg,9,3,5,2,8
7-352.jpg,9,3,5,2,8
8-351.jpg,9,3,5,2,8
8-352.jpg,9,3,5,2,8
9-351.jpg,9,3,5,2,8
9-352.jpg,9,3,5,2,8

What would be the possible cause of that?

The training data is here.

rwightman · 2022-11-21T16:32:44Z

Maintainer

@shunmian you can't take hparams that are tuned for ImageNet and expect them to work on a task that's closer to MNIST. These RMSProp settings are unlikely to work on a smaller dataset; AdamW will be more forgiving as an optimizer. Also, disable model-ema until you get some results, and then re-enable it with a much shorter time constant (a decay of around 0.99 to 0.999).
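To see why a decay of 0.9999 is slow for a dataset this small, here is a rough sketch. With an EMA update of the form ema = decay * ema + (1 - decay) * weights, the share of the initial weights still present in the averaged model after n updates is decay ** n. The numbers assume 3,500 training images (10 classes of 350, per the folder listing), batch size 128 on one GPU with the last partial batch dropped, one EMA update per optimizer step, and a constant decay; the function name is hypothetical.

```python
def initial_weight_share(decay: float, num_updates: int) -> float:
    """Fraction of the starting weights left in an EMA after num_updates
    updates of the form ema = decay * ema + (1 - decay) * weights."""
    return decay ** num_updates

# Assumed run size: 3500 images, batch size 128, 100 epochs.
steps_per_epoch = 3500 // 128          # 27 full batches per epoch
total_updates = steps_per_epoch * 100  # 2700 EMA updates in total

for decay in (0.9999, 0.999, 0.99):
    share = initial_weight_share(decay, total_updates)
    print(f"decay={decay}: {share:.3g} of the initial weights remain")
```

Under these assumptions, about three quarters of the averaged model is still the random initialization after 100 epochs at 0.9999, against roughly 7% at 0.999 and effectively nothing at 0.99. If the logged eval metrics come from the EMA weights, that would fit an eval curve stuck at chance while train_loss falls.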

shunmian Nov 26, 2022
Author

Thanks for the reply! Actually, when I add --pretrained to the training script, the training log looks good:

epoch,train_loss,eval_loss,eval_top1,eval_top5,lr
0,1.3350351367677962,1.641640625,54.83333343505859,82.33333333333333,1e-06
1,1.095807475703103,1.6381119791666667,54.83333343505859,82.33333333333333,0.0032008
2,0.9124079289890471,1.6352083333333334,54.91666676839193,82.58333333333333,0.0064006
3,0.917520576999301,1.6275520833333332,55.08333343505859,82.91666666666667,0.0096004
4,0.9753972462245396,1.6193098958333334,55.83333343505859,83.25,0.0128002
5,1.0293909850574674,1.6094140625,56.50000010172526,83.5,0.015054399999999999
.....
95,0.6453268073853993,0.5199479166666666,95.33333353678385,99.75,0.004877728451944453
96,0.6515413153739202,0.5161979166666667,95.41666687011718,99.75,0.00473139659838612
97,0.672047149567377,0.5122916666666667,95.58333353678385,99.75,0.00473139659838612
98,0.6521482410885039,0.5087369791666667,95.66666687011718,99.75,0.00473139659838612
99,0.6687697569529215,0.5046223958333333,95.75000020345053,99.75,0.0045894547004345365
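As an aside, the lr column is identical in both logs and is consistent with a linear warmup followed by a stepped decay. The sketch below reproduces the logged values. It assumes a 5-epoch warmup from --warmup-lr 1e-6 up to --lr .016 (the warmup length is inferred from the log, not stated in the command), then a decay by --decay-rate .97 every --decay-epochs 2.4 epochs. The function name is hypothetical, and this is an illustration rather than timm's scheduler code.

```python
def lr_at_epoch(epoch: int, base_lr: float = 0.016, warmup_lr: float = 1e-6,
                warmup_epochs: int = 5, decay_rate: float = 0.97,
                decay_epochs: float = 2.4) -> float:
    """Learning rate at the start of a given epoch: linear warmup,
    then one multiplicative decay per completed decay_epochs interval."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_lr toward base_lr.
        return warmup_lr + epoch * (base_lr - warmup_lr) / warmup_epochs
    # Count of whole decay intervals elapsed since epoch 0.
    return base_lr * decay_rate ** int(epoch // decay_epochs)

for epoch in (0, 1, 4, 5, 8, 99):
    print(epoch, lr_at_epoch(epoch))
```

By the last epoch the rate has only dropped to about 0.0046, so the schedule stays fairly aggressive for the whole run.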
