Gradient Descent Adversarial Attack Example

Introduction

Sommeliers have a knack for identifying great wine, but even with decades of experience, they can still be tricked by imposters.

"In a sneaky study, Brochet dyed a white wine red and gave it to 54 enology (wine science) students. The supposedly expert panel overwhelmingly described the beverage like they would a red wine. They were completely fooled."

A gradient descent attack is a lot like tricking a wine expert. In this article, we'll learn how to purposefully change our input (dye the wine) to trick the model (the wine expert) into producing the exact output we want.

Remember: our random noise attack was able to trick the model into giving a false answer, but this more advanced technique will allow us to choose the output we want.

This is a powerful attack, but there are a few caveats. As we discussed in our overview article, gradient descent (GD) attacks require white-box knowledge of the model - including its weights.

Overview of Gradient Descent

Gradient descent is an algorithm used to update model weights during training. If we apply the same technique with an adversarial mindset, we can find the boundaries of classification decisions.

Our model - MNIST image classifier

In our previous article, we used the MNIST ML database to train an image classifier. We'll be using that model again, so please refer to that page for any additional context.

Here's a direct link to the code:

Client: https://github.com/cyberaiguy/attacking-mnist/blob/main/client.py

Server: https://github.com/cyberaiguy/attacking-mnist/blob/main/server.py

If you haven't already, please build the code; from here on we'll be expanding client.py to include a gradient descent attack.

Gradient Descent Adversarial Attacks

Visualize this: our wine expert has memorized various aspects of how different vintages taste. They vary in acidity, flavor, dryness, etc. Each of these aspects are somewhere within a range, and when he tries to identify a wine he compares the unknown wine to this series of tastes. But what if we map these tastes to numerical values?

That's basically a neural network. Ranges of features (or 'flavors') have been memorized, and the output of the neural network is the best guess when comparing the input to the memorized data.

If we wanted to trick the neural network, we can subtly change, say, the acidity. Maybe it results in a misclassification, maybe it doesn't. We could randomly change every value by some amount, but the result would be a disgusting wine.

But since we have intricate knowledge of the model (the memorized numerical values of each taste), we can look at what change we need to make, to get which output we want.

It's easy to visualize. Think of a 3D plot with random hills and valleys.

We map our memorized tastes on a 3D grid, where the hills and valleys represent different wines (e.g., one hill might be a Bourdeaux, one valley might be a chardonnay, etc.). It's our map - our guide.

We taste a wine, determine it has 12% acidity, we plot it on our graph. It's a light color, so we plot that point on this graph. We continue this for each aspect of the unknown wine, we land on one hill, and we can determine it's a chardonnay.

So, if we wanted to trick our map (e.g., execute an adversarial attack), we could use this graph. Starting from the chardonnay hill, we know that to get to a Bourdeaux wine, we need to reduce acidity, add a little color, and make it a little sweet.

This is the same idea as a gradient descent attack. We start on one hill and descend into another area to get a new answer from our model.

There are two classes of gradient descent attacks, FSGM and PGD.

Fast Gradient Sign Method (FGSM)

A FGSM attack starts at one hill, takes a single glance at which direction to go, and then launches in that direction. In our analogy, we start with a Chardonnay. To get to a Bourdeaux, we need to add some deep red dye, throw in some dark fruit flavor, and take out some creamy/buttery flavor.

In FGSM, we make all these changes in one large haphazard step.

Projected Gradient Descent (PGD)

PGD, on the other hand, is simply an iterative implementation of the FGSM. We start on one hill, look at which direction to go, and take a small step in that direction. We do this process over and over again until we get to our target area.

Comparison

PGD will get us to a better answer because we keep pausing, looking around, and selecting the best path. FGSM will be much faster to compute, but won't find the best solution.

Implementation

We're starting with the code we built in the last article: an MNIST image recognition model built with Keras. The article can be found here. Make sure to run the server and save the model to disk.

Load model in client

For any Gradient Descent attack to work, we'll need knowledge of the model. Update the client to load the model from disk.

# Load the pre-trained model
model = tf.keras.models.load_model('mnist-saved-model')

Build GD algorithm

def calculate_adversarial_gradient(input_image, target_label):
    target_label = tf.convert_to_tensor([target_label], dtype=tf.int64)

    with tf.GradientTape() as tape:
        tape.watch(input_image)
        prediction = model(input_image)
        loss = tf.keras.losses.sparse_categorical_crossentropy(target_label, prediction)

    # Calculate loss for given input image 
    gradient = tape.gradient(loss, input_image)
    return gradient

Load images from MNIST

Now that we can find a direction to "walk down the hill", let's load up some images to start testing with.

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# grab a random image from the MNIST dataset
random_index = np.random.choice(test_images.shape[0])
random_image = test_images[random_index]
random_label = test_labels[random_index]

Pick an attack direction

# Choose a target value 
target_label = 5 
# Convert to tf.Tensor
image = tf.convert_to_tensor([random_image], dtype=tf.float32)

Build a helper function to apply changes to an image

def apply_perturbations(image, epsilon, iterations=20):
    adv_image = tf.identity(image)
    for i in range(iterations):
        perturbations = calculate_adversarial_gradient(adv_image, target_label)
        # Actually apply the changes
        adv_image = adv_image + epsilon * perturbations
        # Make sure the image is still valid; throw away excess changes
        adv_image = tf.clip_by_value(adv_image, 0, 1)
    return adv_image

Putting it together - Execute the attack

epsilon = 0.1  # Adjust epsilon based on your image scaling
iterations = 10  # Number of iterations for the attack (1 for FGSM; increase epsilon)
adversarial = apply_perturbations(image, epsilon, iterations)

Measure the results

adversarial_prediction = np.argmax(model.predict(adversarial))
original_prediction = np.argmax(model.predict(image))

print("Original Image Prediction:", original_prediction)
print("Adversarial Image Prediction:", adversarial_prediction)

$ python ./gd-attacks.py 

Original Image Prediction: 8
Adversarial Image Prediction: 5

And review the images

plt.subplot(1, 2, 1)
plt.axis('off')
plt.title(f"Original Image")
plt.imshow(image.numpy().reshape(28, 28), cmap='gray')  # Use cmap='gray' for grayscale images
plt.subplot(1, 2, 2)
plt.title(f"Adversarial Image")
plt.imshow(adversarial.numpy().reshape(28, 28), cmap='gray')  # Use cmap='gray' for grayscale images
plt.axis('off')  # Turn off axis numbers and ticks
plt.show()

We can run this a series of times.

When we display the images, it's very obvious we've made changes. Think about it for a second though - the actual range of possible values for our MNSIT format is awfully limited. We've got tiny 28x28 images for a total of 784 pixels. Then, the grayscale is defined by a simple range between 0-255. That's it. Our entire dataset is so small, that we could practically run this gradient descent attack by hand.

With larger inputs, our changes will be so small relative to the range of values (and thus the perceptibility of humans) that they'll escape notice.

Conclusion

In our article, we've shown just how easy it is to abuse neural network classifier models. With knowledge of the model weights, we can simply "look around" from hilltops (combinations of input values) to determine how to trick the model into misclassifying input after subtle changes.

This is important. Our "wine sommelier" example is fairly benign, but models are created daily to handle all sorts of sensitive tasks. For example, a model in charge of assisting a judicial process could misclassify someone's guilt or innocence simply by incorporating a small change in its evaluated data. This could be small and seemingly irrelevant - a small sticker on a scanned document or a strange middle name of a defendant.

Remember, we're attacking the models in a particular direction, so in theory, anyone with knowledge of the model weights can build these attacks to specify their outcome.

There are defenses to these techniques, and we'll discuss them in a future article, but they ultimately fall short of making these models immune to gradient descent attacks. It's a manifestation of the employed technology - we can't at once have models trained using weighted nodes (via gradient descent) and have the nodes immune to gradient descent attacks.

Gradient Descent Adversarial Attacks

Introduction

Overview of Gradient Descent

Our model - MNIST image classifier

Gradient Descent Adversarial Attacks

Fast Gradient Sign Method (FGSM)

Projected Gradient Descent (PGD)

Implementation

Load model in client

Build GD algorithm

Load images from MNIST

Pick an attack direction

Build a helper function to apply changes to an image

Putting it together - Execute the attack

Measure the results

And review the images

Conclusion

More from this blog

Updating the Purdue model for AI threats

Industrial Series - Don't use LLMs

LLM safety and CS Lewis

Putting the 'I' in CIA for AI Models: A Framework for Model Integrity

Malicious ML series - generate ELF training data

Command Palette

Introduction

Overview of Gradient Descent

Our model - MNIST image classifier

Gradient Descent Adversarial Attacks

Fast Gradient Sign Method (FGSM)

Projected Gradient Descent (PGD)

Implementation

Load model in client

Build GD algorithm

Load images from MNIST

Pick an attack direction

Build a helper function to apply changes to an image

Putting it together - Execute the attack

Measure the results

And review the images

Conclusion

More from this blog