Reassessing adversarial training with fixed data augmentation

A recently discovered bug in the PyTorch + NumPy combination got me thinking: how much does it affect adversarial robustness?

Image credit: Tanel Pärnamaa

Overview

A couple of months ago, a post on Reddit highlighted a bug in PyTorch + NumPy that affects how data augmentation works (see image above). Since nearly all of my projects use this combination, I read through the linked blog post by Tanel Pärnamaa to see what it was all about. I was a bit shocked that it took our community this long to notice a bug this severe: nearly all data-loaders use more than one worker, yet few people sit down to debug data augmentation at this level of their ML pipeline.
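For readers who have not seen the bug: when data-loader workers are forked from the parent process, they can inherit the same NumPy random state, so each worker draws the same "random" augmentations. A minimal sketch of the usual workaround is to reseed NumPy inside each worker. The `TensorDataset` below is only a placeholder, and the batch size and worker count are arbitrary assumptions, not the settings used in these experiments.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id):
    # Inside a worker process, torch.initial_seed() is already distinct per
    # worker, so it can be reused to seed NumPy and Python's `random`.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


# Placeholder dataset (assumption): a real pipeline would call np.random
# inside __getitem__ or inside its augmentation transforms.
dataset = TensorDataset(torch.arange(8))

loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2,
    worker_init_fn=seed_worker,  # runs once in every worker at startup
)
```

Whether a given PyTorch version already reseeds NumPy in its workers is worth checking; passing `worker_init_fn` explicitly is harmless either way.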

Reading about this bug, I remembered a paper by authors at DeepMind that proposed (proper) data augmentation as a means to reduce robust overfitting. It got me thinking: “Could fixing this augmentation bug and rerunning adversarial training lead to gains in robustness?” Curious to see the impact of the fix, I decided to run some experiments of my own. You can head over to the repository and run them yourself if you want.

I chose the CIFAR-10 dataset: small enough to iterate on experiments quickly and challenging enough to observe performance gains. Standard training (without an adversarial loss) with the fixed data-augmentation pipeline performed slightly worse than with the faulty augmentation:

| Model | Standard Accuracy (%) | Robust Accuracy (ε = 8/255) (%) |
| --- | --- | --- |
| Standard | 89.140 | 0.000 |
| Standard (augmentation) | 94.720 | 0.000 |
| Standard (fixed augmentation) | 94.620 | 0.000 |

Not thinking much about the 0.1% performance drop (probably statistical noise, right?), I ran adversarial training with $L_\infty$ robustness ($\epsilon=\frac{8}{255}$):

| Model | Standard Accuracy (%) | Robust Accuracy (ε = 8/255) (%) | Robust Accuracy (ε = 16/255) (%) |
| --- | --- | --- | --- |
| Robust | 79.520 | 44.370 | 15.680 |
| Robust (augmentation) | 86.320 | 51.400 | 17.480 |
| Robust (fixed augmentation) | 86.730 | 51.880 | 17.570 |

As visible here, the fixed augmentation pipeline gives an absolute gain of 0.48% in robust accuracy for $\epsilon=\frac{8}{255}$ and 0.09% for $\epsilon=\frac{16}{255}$. Although the 0.09% is not very significant, the 0.48% improvement seems non-trivial, especially compared to the kind of performance differences reported on benchmarks for this dataset. Accuracy on clean data improves as well, by an absolute 0.41%.
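For context, here is a rough sketch of what adversarial training under an $L_\infty$ budget typically looks like. The post does not say which attack was used; a PGD attack is a common choice and is assumed here. The function names, step size, and number of steps are hypothetical and are not taken from the repository.

```python
import torch
import torch.nn.functional as F


def pgd_linf(model, x, y, eps=8 / 255, step_size=2 / 255, steps=10):
    """Search for a perturbation with ||delta||_inf <= eps that raises the loss.

    Assumes inputs are scaled to [0, 1]; step_size and steps are illustrative.
    """
    # Random start inside the eps-ball, kept within the valid pixel range.
    delta = torch.empty_like(x).uniform_(-eps, eps)
    delta = (x + delta).clamp(0, 1) - x
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        # Signed gradient ascent step, then projection back onto the eps-ball.
        delta = (delta.detach() + step_size * grad.sign()).clamp(-eps, eps)
        # Keep the perturbed image inside [0, 1].
        delta = (x + delta).clamp(0, 1) - x
    return (x + delta).detach()


def adversarial_training_step(model, optimizer, x, y):
    """One optimizer update on adversarial examples instead of clean inputs."""
    x_adv = pgd_linf(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Data augmentation is applied to `x` before this step, which is why a faulty augmentation pipeline also changes what the adversarial examples are built from.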

Not wanting to make any claims based on experiments on just the $L_\infty$ norm, I reran the same set of experiments for the $L_2$ norm ($\epsilon=1$).

| Model | Standard Accuracy (%) | Robust Accuracy (%), ε = 0.5 | Robust Accuracy (%), ε = 1 |
| --- | --- | --- | --- |
| Robust | 78.190 | 61.740 | 42.830 |
| Robust (augmentation) | 80.560 | 67.200 | 51.140 |
| Robust (fixed augmentation) | 81.070 | 67.620 | 51.220 |

Performance gains appear in this case as well. Accuracy on clean data goes up by 0.51%, while robust accuracy at $\epsilon=0.5$ and $\epsilon=1.0$ improves by 0.42% and 0.08%, respectively (all absolute). The consistent, albeit small, improvement in both clean and perturbed-data performance hints that simply fixing this augmentation may provide a nice bump for existing training methods. That said, it is entirely possible that these gains are just coincidental fluctuations from the randomness of model training. Regardless, fixing data-loaders is something that should be done anyway. The goal of these experiments was to try to quantify the impact of improper augmentation. It would be great if someone with sufficient resources could run these experiments on a larger scale to rule out statistical noise.
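To get a feel for how large "statistical noise" could be, here is a back-of-the-envelope sketch. It assumes the accuracies were measured on the full CIFAR-10 test set of 10,000 images and treats each test image as an independent coin flip. It ignores run-to-run training variance entirely, so it is a rough yardstick rather than a proper significance test.

```python
import math


def binomial_standard_error(accuracy_percent, n_samples):
    """Standard error, in percentage points, of an accuracy measured on
    n_samples independent test examples."""
    p = accuracy_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_samples)


# Assumption: evaluation on the full CIFAR-10 test set.
N_TEST = 10_000

# Robust accuracies at eps = 8/255 from the L_inf table above.
faulty, fixed = 51.40, 51.88

gain = fixed - faulty
se = binomial_standard_error(fixed, N_TEST)
print(f"gain = {gain:.2f} points, standard error = {se:.2f} points")
```

Under these assumptions the standard error is roughly 0.5 percentage points, about the same size as the 0.48-point gain. That is consistent with the caveat above: repeating the runs over several seeds would be needed to separate the effect from noise.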

Takeaway: Fixing data augmentation can have a non-trivial (and positive) impact when training for robustness. Anyone training robust models (especially with adversarial training, since that is what I tested on) should fix their data-loaders.

Anshuman Suri
PhD Student

My research interests include security and privacy in machine learning.