Malicious ML series - VAE to generate binaries


I hold a PhD in Computer Science and have been published in a variety of international peer-reviewed journals.

AI is going to be a problem. I don't know what will cause the first "big issue": it might come from a courtroom where a defendant is sent to jail on the basis of erroneous AI-generated data, or it could be a death in a medical setting. But something is going to happen.

Let's take the existing adversarial AI research (there's been plenty) and make it useful.

I'm here to bring you up to speed.

Brute Variational Autoencoder

In this approach, we use a VAE to generate entire binaries.

This 'brute' approach is an experiment to see if it can generate functional binaries. Although unlikely to work, it will be interesting to see how far we can get before worrying about feature extraction or metadata interpolation (e.g., extract PE headers and correct the metadata of a generated binary).

Code

Import and preprocess

Here we load the ELFs generated in our earlier work and bring each sample to a fixed length of 300 bytes, padding shorter samples with \x90 (the x86 NOP opcode) as filler.


import numpy as np

def preprocess_samples(samples, max_length=300):
    # 'samples' is a list of byte sequences (raw file contents)
    processed_samples = []

    for sample in samples:
        # Truncate longer samples so every row has exactly max_length bytes
        sample = sample[:max_length]
        if len(sample) < max_length:
            # Pad shorter samples with NOPs
            sample += b'\x90' * (max_length - len(sample))
        # Normalize byte values to [0, 1]
        processed_samples.append(np.array(list(sample), dtype=np.float32) / 255.0)

    return np.array(processed_samples)

import os

def load_binary_files(directory):
    samples = []  # List to hold the byte sequences
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        if os.path.isfile(filepath):
            # Open the file in binary read mode
            with open(filepath, 'rb') as file:
                binary_data = file.read()
                samples.append(binary_data)
    return samples

# Example usage
directory = 'aimwg-ph/'
samples = load_binary_files(directory)

print(samples[:5])
print(len(samples))

pp = preprocess_samples(samples)
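
As a quick sanity check on the preprocessing, the sketch below pads one short byte string in the same way and confirms that scaling to [0, 1] and back recovers the original bytes. The helper names `to_unit_floats` and `to_bytes` are hypothetical and not part of the code above, and the example bytes are made up for illustration.

```python
import numpy as np

MAX_LENGTH = 300  # same fixed length as the preprocessing step

def to_unit_floats(sample: bytes, max_length: int = MAX_LENGTH) -> np.ndarray:
    """Truncate or NOP-pad a byte string, then scale each byte to [0, 1]."""
    fixed = sample[:max_length].ljust(max_length, b"\x90")
    return np.frombuffer(fixed, dtype=np.uint8).astype(np.float32) / 255.0

def to_bytes(vector: np.ndarray) -> bytes:
    """Invert the scaling: map floats in [0, 1] back to byte values."""
    return np.round(vector * 255).astype(np.uint8).tobytes()

example = b"\x7fELF\x01\x01\x01"   # illustrative bytes, not a real sample
vector = to_unit_floats(example)   # shape (300,), values in [0, 1]
restored = to_bytes(vector)        # original bytes followed by NOP padding
```

The same inverse mapping is what the post-processing step relies on later, when decoder outputs are turned back into bytes.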

Compile the model



from tensorflow.keras import layers, models, backend as K

def sampling(args):
    z_mean, z_log_var = args
    batch = K.shape(z_mean)[0]
    dim = K.int_shape(z_mean)[1]
    epsilon = K.random_normal(shape=(batch, dim))
    return z_mean + K.exp(0.5 * z_log_var) * epsilon

input_dim = 300  # Input dimension: 300 bytes
intermediate_dim = 64  # Intermediate dimension
latent_dim = 2  # Latent space dimension

# Encoder
inputs = layers.Input(shape=(input_dim,))
x = layers.Dense(intermediate_dim, activation='relu')(inputs)
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)
z = layers.Lambda(sampling)([z_mean, z_log_var])

# Decoder
latent_inputs = layers.Input(shape=(latent_dim,))
x = layers.Dense(intermediate_dim, activation='relu')(latent_inputs)
outputs = layers.Dense(input_dim, activation='sigmoid')(x)

encoder = models.Model(inputs, [z_mean, z_log_var, z], name='encoder')
decoder = models.Model(latent_inputs, outputs, name='decoder')
outputs = decoder(encoder(inputs)[2])
vae = models.Model(inputs, outputs, name='vae')

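# Loss function: per-sample reconstruction error plus KL divergence from the unit Gaussian prior
reconstruction_loss = K.mean(K.binary_crossentropy(inputs, outputs), axis=-1) * input_dim  # BCE summed over the input bytes
kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
vae_loss = K.mean(reconstruction_loss + kl_loss)  # Average over the batch
vae.add_loss(vae_loss)
vae.compile(optimizer='adam')  # No loss argument: the objective is attached via add_loss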
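
To make the two loss terms concrete, here is a minimal NumPy sketch of the same arithmetic outside Keras: the closed-form KL divergence between the encoder's diagonal Gaussian and a standard normal prior, the binary cross-entropy summed over features, and the reparameterized draw used by the sampling layer. The function names and toy values are assumptions for illustration only.

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_log_var):
    """Closed-form KL(N(mean, exp(log_var)) || N(0, I)), one value per sample."""
    return -0.5 * np.sum(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1)

def summed_bce(x, x_hat, eps=1e-7):
    """Binary cross-entropy summed over features, one value per sample."""
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat), axis=-1)

def reparameterize(z_mean, z_log_var, rng):
    """Draw z = mean + sigma * epsilon with epsilon ~ N(0, I)."""
    epsilon = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * epsilon

rng = np.random.default_rng(0)
z_mean = np.zeros((4, 2))     # toy batch of 4 samples, latent_dim = 2
z_log_var = np.zeros((4, 2))  # log-variance 0 means unit variance
kl = kl_to_standard_normal(z_mean, z_log_var)  # zero when posterior equals prior
z = reparameterize(z_mean, z_log_var, rng)
```

Multiplying the mean cross-entropy by the input dimension, as the Keras code does, gives the same quantity as summing over features here.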

Train and evaluate

from sklearn.model_selection import train_test_split

# 'pp' is the array of preprocessed samples, shape (num_samples, 300)

# Split the data into training and test sets
X_train, X_test = train_test_split(pp, test_size=0.2, random_state=42)

# Verify the shape
print("Training shape:", X_train.shape)
print("Testing shape:", X_test.shape)

# Train the VAE
# X_train is your training data, normalized and preprocessed as needed
# For a VAE, the input data is also used as the target data
vae.fit(X_train, X_train, epochs=500, batch_size=32, validation_data=(X_test, X_test))  # X_test serves as both input and target for validation



loss = vae.evaluate(X_test, X_test, batch_size=32)  # Using X_test as both input and target
print("Test loss (reconstruction + KL):", loss)
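
The loss value alone says little about how close a reconstruction is to a real file. One possible complement, sketched below under the assumption that inputs and reconstructions are both float arrays scaled to [0, 1], is the fraction of bytes reproduced exactly once both are converted back to integers. `byte_accuracy` is a hypothetical helper, and the arrays are toy values rather than results from the model.

```python
import numpy as np

def byte_accuracy(original, reconstructed):
    """Fraction of positions whose byte value matches after de-normalizing."""
    orig_bytes = np.round(np.asarray(original) * 255).astype(np.uint8)
    recon_bytes = np.round(np.asarray(reconstructed) * 255).astype(np.uint8)
    return float(np.mean(orig_bytes == recon_bytes))

# Toy example: four bytes, the last one reconstructed incorrectly
original = np.array([[0x7F, 0x45, 0x4C, 0x46]], dtype=np.float32) / 255.0
reconstructed = np.array([[0x7F, 0x45, 0x4C, 0x00]], dtype=np.float32) / 255.0
accuracy = byte_accuracy(original, reconstructed)
```

With the trained model, this could be called as byte_accuracy(X_test, vae.predict(X_test)).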

Generate new samples

import numpy as np

def sample_latent_points(latent_dim, num_samples):
    # Sample from a standard normal distribution
    return np.random.normal(loc=0.0, scale=1.0, size=(num_samples, latent_dim))

def generate_samples(decoder, latent_points):
    # Decode the latent points to generate new data
    generated_data = decoder.predict(latent_points)
    return generated_data

latent_dim = 2  # This should match the latent dimension size used in your VAE model
num_samples = 10  # Number of samples you want to generate

# Sample points in the latent space
latent_points = sample_latent_points(latent_dim, num_samples)

# Generate new data samples from these latent points
generated_samples = generate_samples(decoder, latent_points)

def postprocess_binary_samples(samples):
    # Assuming samples were normalized to [0, 1], convert back to byte values
    samples = np.round(samples * 255).astype(np.uint8)
    return samples

generated_binaries = postprocess_binary_samples(generated_samples)

!mkdir generated/

def save_generated_binaries(generated_binaries, output_dir):
    for i, sample in enumerate(generated_binaries):
        filepath = os.path.join(output_dir, f"generated_binary_{i}.bin")
        with open(filepath, 'wb') as file:
            file.write(sample)

# Example usage
output_dir = 'generated/'
save_generated_binaries(generated_binaries, output_dir)

Try and use them!

!ls -la generated/
!file generated/*

total 48
drwxr-xr-x 2 root root 4096 Mar  9 01:47 .
drwxr-xr-x 1 root root 4096 Mar  9 02:02 ..
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_0.bin
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_1.bin
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_2.bin
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_3.bin
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_4.bin
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_5.bin
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_6.bin
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_7.bin
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_8.bin
-rw-r--r-- 1 root root  300 Mar  9 02:27 generated_binary_9.bin
generated/generated_binary_0.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
generated/generated_binary_1.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
generated/generated_binary_2.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
generated/generated_binary_3.bin: data
generated/generated_binary_4.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
generated/generated_binary_5.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
generated/generated_binary_6.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
generated/generated_binary_7.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
generated/generated_binary_8.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
generated/generated_binary_9.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header

Commentary

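Interestingly, we have created something resembling a binary: file identifies nine of the ten generated samples as 32-bit ELF executables and only one as plain data. The model has learned the leading bytes of the ELF header well enough to fool the file command, but the generated files lack section headers.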
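
One way to see how much of the header the model has picked up is to inspect the leading ELF identification bytes directly. The sketch below checks the four magic bytes and decodes the class and data-encoding bytes that follow. `describe_elf_ident` is a hypothetical helper and looks at only the first six bytes, which is far less than file examines.

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASS = {1: "32-bit", 2: "64-bit"}  # e_ident[4], EI_CLASS
ELF_DATA = {1: "LSB", 2: "MSB"}         # e_ident[5], EI_DATA

def describe_elf_ident(data: bytes) -> str:
    """Summarize the first six e_ident bytes, or return 'data' if the magic is absent."""
    if len(data) < 6 or data[:4] != ELF_MAGIC:
        return "data"
    word_size = ELF_CLASS.get(data[4], "invalid class")
    byte_order = ELF_DATA.get(data[5], "invalid encoding")
    return f"ELF {word_size} {byte_order}"

# Illustrative inputs: a plausible 32-bit little-endian ident, then pure NOP filler
print(describe_elf_ident(b"\x7fELF\x01\x01\x01" + b"\x00" * 9))
print(describe_elf_ident(b"\x90" * 16))
```

To apply it to the generated files, read each one in binary mode and pass the contents to the helper.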

Predictably, execution of any of these binaries results in immediate failure (although for one sample it actually generates a segfault, which oddly feels like great progress). Debugging is equally unfruitful.

The brute approach is a fun experiment, but it is doomed to failure because we haven't addressed any specific features of the binary. I suspect it's possible to create a 'fixer' application that takes this raw unstructured ELF and reformats it into an executable binary, but then what's the point of training a model to do the heavy lifting for us?

Let's move on to GANs!