YARA works well, very well in fact, against a diverse range of targets. One of those is webpages. It’s tough to find a more diverse and testy target to build an accurate rule against. Webpages contain text, HTML, scripts, CSS and plenty more, which complicates devising a solid strategy to detect consistently and accurately with YARA. Detection, then, is not just a matter of focusing on the right target elements to match, but also of paying attention to the location of the elements and their order of occurrence.
HTML and webpages are not the normal fodder of YARA talk; they are an occasional blip as a conversation piece. Attacks, however, come in all shapes and sizes, and exploit kit pages, redirection portals, footprinting scripts, infection scripts, iframe pop-ups and more are all eligible targets for YARA. If you happen to leverage the OWASP Web Scanner, that project is focused on scanning webpages. Of course, you don’t need it. If you routinely pull down webpages that represent attack sites, you might have a need for YARA to dig through them. It’s a fun use of YARA to identify the coding approach used to generate the iframe, redirection or information gathering occurring on the webpage. The same goes for phishing: it’s a good use of the hash module to hash full and partial chunks of webpages to look for identical deployments of the same phishing kit.
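As a rough sketch of the hashing idea, the rule below compares the hash of the whole page and the hash of the first form block against known values. The rule name, the choice of the form block as the "partial chunk," and both hash values are assumptions for illustration; the hashes are obvious placeholders and would need to be replaced with values computed from a real kit sample.

```yara
import "hash"

rule phishing_kit_reuse_sketch
{
    strings:
        $form_open  = "<form" nocase
        $form_close = "</form>" nocase

    condition:
        // Full-page match: placeholder hash, replace with a real value
        hash.sha256(0, filesize) == "0000000000000000000000000000000000000000000000000000000000000000" or
        (
            // Partial match: hash only the first <form ...> ... </form> chunk
            #form_open > 0 and #form_close > 0 and
            @form_close[1] > @form_open[1] and
            // size = end of the closing tag minus start of the opening tag
            hash.sha256(@form_open[1], @form_close[1] + !form_close[1] - @form_open[1]) == "ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff"
        )
}
```

A full-page hash breaks as soon as the kit operator changes a single byte, so hashing a chunk that tends to stay fixed between deployments is usually the more forgiving of the two checks.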
Finding the right targets
Since the majority of a webpage is plain text, you are in the opposite situation from the one you face with an executable. The abundance of text and the repetitiveness of strings mean careful selection of target strings is even more crucial. For all their diversity, webpages are relatively well defined. Certain types of tags are required for HTML, CSS, scripts and the other elements you come across in webpages. I harp on this a lot, but remember to keep the objective in mind. If you are looking for a script, then focus on finding the script. If you are looking for the construction of an iframe, then keep the crosshairs there. If it’s an HTML5 interaction that’s stirring the pot, then cast your eye there.
Once you know what you are looking to find, pinpoint the representations that define it. Avoid common elements. Know what you shouldn’t use for matching. Script tags are poor matches beyond ensuring that scripts are present. Functions inside of scripts can be equally low-utility matches if they are too common, e.g., document.createDocumentFragment(), window.addEvent, or encodeURIComponent. URL-encoded values can be worthy choices if selected carefully and combined with identifiable elements. Alone, they are rarely unique enough to constitute solid detectors. Values inside an iframe tag that push the iframe off the screen can speak to intent, but again need to be combined with other elements to provide a good detection.
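To make the "combine it with other elements" point concrete, here is a minimal sketch that only fires when an off-screen positioning value sits close to an iframe tag and a second, more specific string is also present. The rule name, the 300-byte window, the pixel-value regexes and the `$marker` string are all assumptions chosen for illustration, not indicators from a real campaign.

```yara
rule offscreen_iframe_sketch
{
    strings:
        $iframe   = "<iframe" nocase
        // Negative offsets of three to five digits, e.g. "left:-9999px"
        $off_left = /left\s*:\s*-[0-9]{3,5}px/ nocase
        $off_top  = /top\s*:\s*-[0-9]{3,5}px/ nocase
        // Hypothetical kit-specific string; swap in something from your sample
        $marker   = "gate.php" nocase

    condition:
        $marker and
        for any i in (1..#iframe) : (
            // Off-screen value must appear within 300 bytes after an iframe tag
            $off_left in (@iframe[i]..@iframe[i] + 300) or
            $off_top in (@iframe[i]..@iframe[i] + 300)
        )
}
```

On its own, either the iframe tag or the negative offset would match plenty of benign pages; it is the proximity plus the extra string that narrows things down.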
Once you have an idea of what you are looking to find, the next goal is to reduce the amount of debris you need to sift through. Webpages have an organization to them. Only certain items will appear in the header, for example. Scripts can be just about anywhere, but will have a script tag. They might be made hard to read, but they still have to be present. Use the focus on presentation to your benefit, since it limits where some elements have to be placed. Zoom in on situations where that axiom is broken.
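One way to look for a broken placement rule is to compare match offsets. The sketch below flags a meta refresh tag that shows up after the closing head tag, where it normally has no business being. The rule name and the regex are assumptions for illustration, and real pages can be malformed enough that this would need tuning.

```yara
rule meta_refresh_outside_head_sketch
{
    strings:
        $head_close = "</head>" nocase
        // A meta tag carrying http-equiv=refresh, quoted or unquoted
        $refresh    = /<meta[^>]{0,100}http-equiv\s*=\s*["']?refresh/ nocase

    condition:
        // Only meaningful if the page actually closes its head section
        #head_close > 0 and
        for any i in (1..#refresh) : (
            // Any refresh tag located after the first </head> is out of place
            @refresh[i] > @head_close[1]
        )
}
```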
YARA looping structures can really shine here. If the target lies within a div tag, for example, you could iterate through the buckets of content they represent in the condition line:
for any i in (1..#divopen) : ( $my_target in (@divopen[i]..@divclose[i]) )
You can easily switch that to, say, scripts, since they have a defined <script> </script> set of tags just like <div> </div> tags do.
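Put together as a complete rule for the script case, the loop might look something like the sketch below. The rule name and the `$my_target` value are placeholders; substitute whatever string actually defines the script you are hunting.

```yara
rule target_inside_script_sketch
{
    strings:
        $script_open  = "<script" nocase
        $script_close = "</script>" nocase
        // Placeholder target string; replace with your own
        $my_target    = "document.write(unescape(" nocase

    condition:
        // Require balanced tags so the i-th open pairs with the i-th close
        #script_open > 0 and #script_open == #script_close and
        for any i in (1..#script_open) : (
            $my_target in (@script_open[i]..@script_close[i])
        )
}
```

Pairing the i-th opening tag with the i-th closing tag works for scripts because they do not nest. Div tags do nest, so with divs the same index can point at an opening tag and a closing tag that do not belong together, and the range may cover more or less content than you expect.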