YARA, Entropy and a bit of Math


When people ask what I think the number one tool in my arsenal is, I tell them without pause that it is YARA.  From versatility to function, YARA beats the stuffing out of just about everything else I use.  Mainly because, in investigations, in research, in threat hunting, and even in practicing my intelligence tradecraft, everything comes down to patterns: evolving a pattern and then searching through data with it.  Nothing does that better than YARA.  It just shines.  When it doesn’t find what I’m looking for, invariably the issue isn’t YARA but my logic.  I asked the wrong question, or asked it the wrong way, and once that’s corrected the answer rises to the surface.

 

Now don’t get me wrong, Python is my go-to as well.  If I can’t beat something down with YARA, then Python steps in to fill the gap, and I commonly use the two together.  When the data is known and well defined, even if it’s obfuscated or encrypted in some fashion, both tools function beyond expectation to derive the answer, usually some variation of “it’s malicious, suspicious or just plain interesting.”  When the data is less defined, however, when it’s filled with “soft” variables or is hard to measure, the challenge level rises.  I’ve discussed this topic with many folks over the years, and in those situations people tend to discard YARA and move on to other tools.  Since the Math module came out to supplement YARA a while back, though, it’s been even easier to handle these softer use cases.  One of my favorite functions to play with is entropy.

 

Entropy, in this context, is a measure of randomness, or how unpredictable a set of data is.  File entropy, for example, measures how unpredictable the bytes that make up a file are, not simply how much data the file holds.  More importantly, because entropy is really a measure of unpredictability or information content, we don’t have to stop at whole files: we can also look at the entropy of chunks of data or strings, like the entropy of a domain name or other interesting things.
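To make that concrete, here is a minimal Python sketch of the Shannon entropy calculation (bits per byte, the same 0 to 8 scale the Math module reports).  The domain names below are made-up placeholders, purely for illustration.

from collections import Counter
from math import log2

def shannon_entropy(data: bytes) -> float:
    # Shannon entropy in bits per byte, ranging from 0.0 to 8.0.
    if not data:
        return 0.0
    total = len(data)
    return -sum((count / total) * log2(count / total)
                for count in Counter(data).values())

# An ordinary-looking domain versus a random-looking one (placeholders).
print(shannon_entropy(b"example.com"))       # moderate entropy
print(shannon_entropy(b"aq3zx9kt0b7.info"))  # noticeably higher entropy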

 

YARA lets you play in that arena with the Math module.  Not only can it help you determine entropy via the math.entropy() function, it can also check whether values fall within a range, calculate deviation and mean, and handle other fun mathematical tasks.  With that in mind, you can do basic things with YARA, like measuring the entropy of a file or the entropy of portable executable section names (with a little help from the PE module):

 

import "math"

rule low_file_entropy {
    condition:
        math.entropy(0, filesize) < 4.5
}

 

import "math"
import "pe"

// need this private rule to constrain detections to just PE files...
private rule PE
{
    condition:
        uint16(0) == 0x5A4D and
        uint32(uint32(0x3C)) == 0x00004550
}

rule section_name_entropy {
    condition:
        PE and
        for any i in (0..pe.number_of_sections - 1) : (
            math.entropy(pe.sections[i].name) >= 2.4 or
            math.entropy(pe.sections[i].name) <= 1
        )
}

 

In the first rule, I’m asking to match on a file with lower than normal entropy.  In the second, I’m matching any PE section name with an entropy of 2.4 or higher, or one at 1.0 or lower.  If you run the normal list of regular section names (CODE, rsrc, BSS, data, rdata, idata, .tls, and so on), you find they range from 1.5 (BSS) to 2.35 (.CODE).  Randomly generated names like “pjdzicab” and “ajhgnkzj” come in at 3.0 and 2.75 respectively.  Blank and very short, non-standard section names tend to fall under 1.0; a blank, single-character, or repeated-single-character name has an entropy of 0.
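If you want to sanity-check numbers like those outside of YARA, the same calculation only takes a few lines of Python.  A quick sketch, using the random-looking names from above plus a couple of degenerate cases:

from collections import Counter
from math import log2

def shannon_entropy(data: bytes) -> float:
    # Shannon entropy in bits per byte, as in the earlier sketch.
    if not data:
        return 0.0
    total = len(data)
    return -sum((count / total) * log2(count / total)
                for count in Counter(data).values())

# Random-looking names land around 3.0 and 2.75; blank or
# repeated-single-character names land at 0.
for name in (b"pjdzicab", b"ajhgnkzj", b"", b"AAAAAAAA"):
    print(name, round(shannon_entropy(name), 2))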

 

Of course, as I’ve mentioned before, you can localize your detections to specific portions of a file instead of calculating entropy for the entire file or for whole strings.

 

import "math"
import "pe"

// need this private rule to constrain detections to just PE files...
private rule PE
{
    condition:
        uint16(0) == 0x5A4D and
        uint32(uint32(0x3C)) == 0x00004550
}

rule final_1k {
    condition:
        PE and math.entropy(filesize - 1024, 1024) > 7
}

rule final_512 {
    condition:
        PE and math.entropy(filesize - 512, 512) <= 1
}

rule first_200 {
    condition:
        PE and math.entropy(0, 200) > 4.7
}

 

These look at the last 1024 bytes, the last 512 bytes, or the first 200 bytes of a PE file respectively, and calculate entropy for those areas.  Why?  If extra data has been appended to the end of a file, it can be found this way.  For the front end, the headers of a PE file are fairly static, so entropy there that sits outside the norm can indicate byte reuse or other techniques.
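If you want to experiment with those slice boundaries and thresholds before committing them to rules, the same checks are easy to prototype in Python.  A rough sketch, with sample.exe standing in as a placeholder for whatever file you point it at:

from collections import Counter
from math import log2
from pathlib import Path

def shannon_entropy(data: bytes) -> float:
    # Shannon entropy in bits per byte, same calculation as before.
    if not data:
        return 0.0
    total = len(data)
    return -sum((count / total) * log2(count / total)
                for count in Counter(data).values())

data = Path("sample.exe").read_bytes()  # placeholder path
print("first 200 bytes :", round(shannon_entropy(data[:200]), 2))
print("last 512 bytes  :", round(shannon_entropy(data[-512:]), 2))
print("last 1024 bytes :", round(shannon_entropy(data[-1024:]), 2))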

 

Some other fun use cases involve non-binaries, since binaries seem to get the most attention.  As nature would have it, “normal” ASCII text files run between 4 and 5 bits of entropy per byte.  Homogeneous ASCII files with lots of repeated characters run much lower, mainly in the 0 to 2 range and highly skewed toward the lower end.  Highly random ones, like the output of a key logger, run much higher.  Going back to full-file entropy checks, you might map that out and look for those use cases:

 

import "math"

rule interesting_entropy {
    condition:
        math.entropy(0, filesize) <= 3.7 or math.entropy(0, filesize) >= 5.2
}

 

You’ll notice I didn’t lock this to a file type.  Since entropy is a measure of randomness or information density and I’m looking for high and low values, finding matches regardless of the file type becomes interesting.  While the context above was ASCII files, we can apply the same idea any way we want.  How about BMP files with higher than normal entropy?

 

import "math"

private rule BMP
{
    strings:
        $BMP = { 42 4D }
    condition:
        $BMP at 0
}

rule BMP_high_entropy {
    condition:
        BMP and math.entropy(0, filesize) > 3.0
}

 

A while back I was part of an intelligence cell that measured the entropy of PDFs, both genuine and malicious.  We collected a couple of thousand malicious PDFs from VirusTotal and retrieved a similar pool of “clean” PDFs from other sources.  We then calculated entropy for the various PDF sets and found some interesting results.  While entropy values alone don’t provide a direct pointer to maliciousness, they do constitute a strong indicator when separating genuine from malicious files.  Clean PDFs averaged an entropy of 7.42 and ranged from a few low values at 2.7 up to 8.0.  Malicious ones had a much lower average, 4.96, with a wider spread of 0.9 to 8.0.  That presented some obvious entropy-based YARA insights:

 

import "math"

private rule pdf_file
{
    strings:
        $a = "%PDF-"
    condition:
        $a at 0
}

rule PDF_low_entropy {
    condition:
        pdf_file and math.entropy(0, filesize) <= 7.5
}
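For what it’s worth, the measurement side of that exercise is easy to reproduce against your own sample sets.  A rough Python sketch, assuming the clean and malicious PDFs sit in two placeholder folders, that prints the average and range for each set:

from collections import Counter
from math import log2
from pathlib import Path
from statistics import mean

def shannon_entropy(data: bytes) -> float:
    # Shannon entropy in bits per byte, as in the earlier sketches.
    if not data:
        return 0.0
    total = len(data)
    return -sum((count / total) * log2(count / total)
                for count in Counter(data).values())

# Placeholder folder names; point these at your own collections.
for label, folder in (("clean", "pdfs/clean"), ("malicious", "pdfs/malicious")):
    values = [shannon_entropy(p.read_bytes()) for p in Path(folder).glob("*.pdf")]
    if values:
        print(f"{label}: average {mean(values):.2f}, range {min(values):.2f} to {max(values):.2f}")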

 

There’s plenty more hidden in YARA’s Math module, and it’s something we explore in the two-day YARA class.  Check out the next class to find out more!

About the author

Monty St John

Monty is a security professional with more than two decades of experience in threat intelligence, digital forensics, malware analytics, quality services, software engineering, development, IT/informatics, project management and training. He is an ISO 17025 laboratory auditor and assessor, reviewing and auditing 40+ laboratories. Monty is also a game designer and publisher who has authored more than 24 products and 35 editorial works.