Brute Generative Adversarial Network

In this approach, we use a GAN to generate entire binaries. GANs sound perfect: the generator tries to produce a binary from some noise, a discriminator judges whether the result looks real, and then the generator goes back and tries again. However, there's a lot of nuance that prevents this from being reliable (or really useful at all). But it's fun!

This 'brute' approach is an experiment to see how well a GAN can generate a functional binary. It's not likely to work, but it'll be interesting to see how far we can get with the easy approach before worrying about feature extraction (for example, handling individual binary sections such as .data and .text).

Code

Import and preprocess

We'll build on the binaries we generated using MSFVenom: small snippets of ~300 bytes.


import numpy as np
import os

def load_binary_files(directory, file_size):
    samples = []
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        with open(file_path, 'rb') as file:
            binary_data = bytearray(file.read(file_size))
            # Ensure each file is exactly file_size bytes
            if len(binary_data) < file_size:
                # NOP padding
                binary_data += b'\x90' * (file_size - len(binary_data))
            samples.append(np.array(binary_data))
    return np.array(samples, dtype=np.float32) / 255.  # Normalize byte values to [0, 1]

directory = 'aimwg-ph/'
file_size = 300  # or whatever your target size is
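
Since the loader scales bytes to floats and the generation step later scales them back, it is worth confirming that the round trip loses nothing. The following is a minimal sketch using only the standard library: `array('f')` stands in for NumPy's float32 storage, and the helper name `roundtrip_is_lossless` is made up for illustration.

```python
from array import array

def roundtrip_is_lossless():
    """Check that byte -> float32 in [0, 1] -> byte recovers every value."""
    # Store each normalised byte as a 32-bit float, like the loader's float32 array
    normalised = array('f', [b / 255.0 for b in range(256)])
    # Scale back up and round to the nearest integer, as the generation step does
    recovered = [round(x * 255) for x in normalised]
    return recovered == list(range(256))

print(roundtrip_is_lossless())
```

If this holds, any byte-level errors in the generated files come from the model rather than from the normalisation itself.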

Build the model

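Remember, a GAN needs a generator to build the binary and a discriminator to judge whether a sample looks like a real binary from the training set or a generated one. Note that the discriminator never actually runs anything, so "looks real" is not the same as "is functional".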


import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, LeakyReLU, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def build_generator(latent_dim, output_dim):
    """Builds the generator model."""
    inputs = Input(shape=(latent_dim,))
    x = Dense(128)(inputs)
    x = LeakyReLU(alpha=0.2)(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = Dense(256)(x)
    x = LeakyReLU(alpha=0.2)(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = Dense(512)(x)
    x = LeakyReLU(alpha=0.2)(x)
    x = BatchNormalization(momentum=0.8)(x)
    # Sigmoid keeps outputs in [0, 1], matching the normalized training bytes
    outputs = Dense(output_dim, activation='sigmoid')(x)

    model = Model(inputs, outputs)
    return model

def build_discriminator(input_dim):
    """Builds the discriminator model."""
    inputs = Input(shape=(input_dim,))
    x = Dense(512)(inputs)
    x = LeakyReLU(alpha=0.2)(x)
    x = Dense(256)(x)
    x = LeakyReLU(alpha=0.2)(x)
    outputs = Dense(1, activation='sigmoid')(x)

    model = Model(inputs, outputs)
    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(0.0002, 0.5),
                  metrics=['accuracy'])
    return model
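
One detail worth checking in any setup like this is that the generator's output range matches the range of the training data, which the loader normalises to [0, 1]. Below is a sketch, in plain Python, of how the two common output activations would map back to bytes. The function names are illustrative and not part of the code above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def byte_from_sigmoid(pre_activation):
    """Sigmoid output lies in (0, 1), so it scales straight to 0..255."""
    return round(sigmoid(pre_activation) * 255)

def byte_from_tanh(pre_activation):
    """Tanh output lies in (-1, 1), so it must be shifted to (0, 1) first."""
    return round((math.tanh(pre_activation) + 1.0) / 2.0 * 255)

# Extreme pre-activations should land on the lowest and highest byte values
for z in (-20.0, 20.0):
    print(z, byte_from_sigmoid(z), byte_from_tanh(z))
```

If a tanh output layer is used, the training data would need the matching [-1, 1] scaling. Otherwise negative outputs have no valid byte value to map to.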

Train the model

def train_gan(generator, discriminator, combined, data, epochs, batch_size, latent_dim):
    """Trains the GAN for generating binary data."""
    valid = np.ones((batch_size, 1))
    fake = np.zeros((batch_size, 1))

    for epoch in range(epochs):
        # Train discriminator
        idx = np.random.randint(0, data.shape[0], batch_size)
        real_samples = data[idx]

        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        generated_samples = generator.predict(noise, verbose=0)  # verbose=0 avoids a progress bar every step

        d_loss_real = discriminator.train_on_batch(real_samples, valid)
        d_loss_fake = discriminator.train_on_batch(generated_samples, fake)
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

        # Train generator
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        g_loss = combined.train_on_batch(noise, valid)

        # Print progress
        print(f"Epoch: {epoch+1}/{epochs} | D Loss: {d_loss[0]}, D Acc: {100*d_loss[1]} | G Loss: {g_loss}")

latent_dim = 100
output_dim = 300  # Adjust based on your binary size

# Build and compile the discriminator
discriminator = build_discriminator(output_dim)

# Build the generator
generator = build_generator(latent_dim, output_dim)

# The generator takes noise as input and generates samples
z = Input(shape=(latent_dim,))
sample = generator(z)

# For the combined model we will only train the generator
discriminator.trainable = False

# The discriminator takes generated samples as input and determines validity
valid = discriminator(sample)

# The combined model (stacked generator and discriminator)
# Trains the generator to fool the discriminator
combined = Model(z, valid)
combined.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))

data = load_binary_files(directory, file_size)

# Train the GAN
train_gan(generator, discriminator, combined, data, epochs=10000, batch_size=32, latent_dim=latent_dim)
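
The D Loss and G Loss figures printed during training are binary cross-entropy values. As a rough illustration of what those numbers mean, here is a plain-Python sketch of the formula. The `bce` helper is hypothetical and does not reproduce Keras's exact numerical handling.

```python
import math

def bce(labels, predictions, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# A discriminator that outputs 0.5 for everything is just guessing
print(bce([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5]))
# A confident, correct discriminator scores close to zero
print(bce([1, 0], [0.99, 0.01]))
```

A discriminator loss hovering near ln 2 (about 0.693) would suggest it cannot tell real samples from generated ones, while a loss near zero suggests the generator is not fooling it at all.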

Generate some new malware!

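num_samples_to_generate = 10  # Specify the number of samples you want to generate
latent_dim = 100  # Must match the latent_dim the generator was built with
random_latent_vectors = np.random.normal(size=(num_samples_to_generate, latent_dim))
generated_samples = generator.predict(random_latent_vectors)
# Map outputs back to bytes; clip first so out-of-range values cannot wrap around
generated_samples = np.round(np.clip(generated_samples, 0, 1) * 255).astype(np.uint8)

output_dir = "generated3"
os.makedirs(output_dir, exist_ok=True)  # Make sure the output directory exists

for i, sample in enumerate(generated_samples):
    # Save each generated sample to a binary file
    file_path = os.path.join(output_dir, f"generated_binary_{i}.bin")
    with open(file_path, "wb") as file:
        file.write(sample.tobytes())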

Commentary

Well, we generated something. Interestingly, one file (generated_binary_6.bin) looks like it might be functional, but don't be fooled! It has some correct header information, but it is in no way a functional binary.
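
A quick way to see why a correct-looking header proves little: a magic-number check passes on any file that starts with the right four bytes. The sketch below assumes the training binaries are ELF files (an assumption, since the format is not stated here), and `looks_like_elf` is an illustrative helper.

```python
ELF_MAGIC = b"\x7fELF"

def looks_like_elf(data: bytes) -> bool:
    """True if the data starts with the ELF magic number; says nothing about the rest."""
    return data[:4] == ELF_MAGIC

# 300 bytes with a valid magic number followed by nothing meaningful
junk_with_header = ELF_MAGIC + bytes(296)
print(looks_like_elf(junk_with_header))
print(looks_like_elf(bytes(300)))
```

The first buffer passes the check despite containing no code at all, so a functional binary needs far more than a plausible header.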

For that, we'll have to improve our process. In our next article, we look at feature extraction and using Docker in the discriminator to measure the effectiveness of the generated malware.

Malicious ML series - GAN to generate binaries
