
Pattern matching and YARA rules

What is Pattern Matching?

Pattern matching is the process of identifying specific sequences or structures within a larger set of data. At its core, it involves detecting identifiable details within large volumes of data, whether that data consists of lines of code, network packets, or logs. In other words, every piece of data carries a pattern, a signature, a behavior, or a series of repetitive elements, and pattern matching is the technique we use to detect them.

One of my favorite principles says that if you need to perform a task more than twice, you should think about automating it, and this is where pattern matching becomes invaluable. Automation not only speeds up the process of sifting through large volumes of data but also scales well, so detection remains possible even as the data grows.

One of the most impactful uses of pattern matching is malware detection. EDR, IDS, and NIDS products continuously scan files and network traffic, comparing them against known malware signatures. Another useful case is threat hunting: pattern matching makes it easy to look for known IOCs, to analyze network traffic (for example, looking for connections to C2 servers), or to carry out behavior-based hunting.

There are three primary approaches to pattern matching:

  • Signature-Based Matching: This method relies on predefined signatures of known threats. It is highly efficient for detecting previously identified malware, exploits, or attack patterns but struggles against new, polymorphic, or obfuscated threats, which are usually the most dangerous ones. Signature-based matching is commonly used in antivirus software, intrusion detection systems (IDS), and firewall rules.
  • Fuzzy or Approximate Signature Matching: This technique allows for minor variations in patterns to detect threats that do not match exactly but share similarities with known malicious activity. It is useful for identifying polymorphic malware, slightly modified attack payloads, and basic evasive techniques used by attackers. Some examples include:
    • Levenshtein Distance: measures the difference between two strings by counting the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. This allows detecting slightly modified malware samples by comparing known malicious code against potentially altered versions.
    • Context-Triggered Piecewise Hashing (CTPH) (e.g., ssdeep) generates fuzzy hashes to identify similar files, even if they are not identical. Unlike traditional cryptographic hashes which change completely with minor modifications, CTPH detects partial similarities between files. This can be useful to identify variants of malware that share common code blocks or to find reused malicious scripts across different attack campaigns.
    • N-Gram Analysis is another technique that helps identify patterns by breaking a piece of data into overlapping sequences of N elements (characters, bytes, or words). An example of a 3-gram breakdown of the string “malware” would be: “mal”, “alw”, “lwa”, “war”, “are”.
  • Heuristic and Behavioral Analysis: Unlike signature-based methods, which rely on predefined patterns, heuristic and behavioral analysis detect threats based on deviations from a baseline of normal activity. This can include monitoring system calls, analyzing network traffic patterns, or detecting unusual user behavior. Behavioral techniques are particularly effective against zero-day attacks and APTs but require significant computational resources and tuning to avoid false positives.
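As a rough illustration of two of the fuzzy techniques listed above, the following Python sketch computes a Levenshtein distance and an n-gram breakdown. The function names levenshtein and ngrams are made up for this example, and real tooling would normally rely on an optimized library rather than this plain implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def ngrams(data: str, n: int) -> list[str]:
    """Overlapping sequences of n characters."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


# One substituted character separates the two strings.
print(levenshtein("CreateRemoteThread", "CreateRemoteThreat"))  # 1
print(ngrams("malware", 3))  # ['mal', 'alw', 'lwa', 'war', 'are']
```

A small distance, or a large share of common n-grams, suggests that two samples are related even when they are not byte-for-byte identical.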

Based on the Pyramid of Pain, it is clear that behavioral matching can be much more effective than signature-based detection at disrupting an attacker’s operations. Behavioral patterns are harder for adversaries to evade because they focus on how an attack behaves rather than on specific indicators like hashes or IP addresses, which can be easily changed. However, implementing behavioral analysis is significantly more complex, requiring Threat Intelligence, machine learning models, or rule-based detection systems that are maintained over time.

Limitations

While pattern matching is a powerful tool in threat detection, it has several limitations that reduce its effectiveness against advanced cyber threats. One of the primary drawbacks is its reliance on known patterns. Traditional pattern-matching techniques, such as signature-based detection explained above, can only identify threats that have been previously analyzed and documented. This makes them ineffective against zero-day attacks and newly developed malware, which do not match any existing signatures.

Another significant limitation is the susceptibility to evasion techniques. Cybercriminals can bypass detection by making minor modifications to malware, such as inserting junk code, applying polymorphic encryption, or using obfuscation techniques. Because pattern matching relies on detecting exact or near-exact sequences, even subtle alterations in code structure can render signature-based detection rules ineffective. I’ll explore evasion techniques in greater depth in a future post, since it’s a very fascinating and complex topic on its own.
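To make the evasion problem concrete, here is a minimal Python sketch. The byte signature and the surrounding bytes are invented for illustration and do not come from any real sample; the point is only that one junk byte is enough to defeat an exact match.

```python
import hashlib

# Hypothetical byte signature, chosen only for this example
signature = bytes.fromhex("68 83 F0 50 FF 15")

original = bytes.fromhex("55 8B EC") + signature + bytes.fromhex("C3")
# Same code with a single junk NOP (0x90) inserted in the middle of the signature
modified = (bytes.fromhex("55 8B EC") + signature[:3] + b"\x90"
            + signature[3:] + bytes.fromhex("C3"))


def matches(sample: bytes) -> bool:
    """Exact signature matching: the byte sequence must appear unchanged."""
    return signature in sample


print(matches(original))  # True
print(matches(modified))  # False: one inserted byte breaks the exact match
# Cryptographic hashes of the two samples also have nothing in common
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(modified).hexdigest())  # False
```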

Additionally, performance overhead is a concern, particularly when scanning large datasets or monitoring real-time network traffic. Running complex pattern-matching operations across vast volumes of data can slow down security systems, impacting overall performance. This is especially problematic in environments where speed is critical, such as incident response and real-time malware detection.

YARA rules

Now that we have completed this lengthy introduction to pattern matching, we can discuss YARA rules. YARA is an open-source tool used for pattern matching. It allows users to create rules that search for specific patterns, strings, or other characteristics within files. Rather than duplicating the well-written and much more complete YARA documentation by defining every single function, I will provide a brief introduction and then dive into practical use cases to give a good overview of this valuable tool.

Every YARA rule begins with the keyword “rule”, followed by a unique identifier. This identifier must be distinct from others in your ruleset and should be descriptive. After the identifier, you can optionally add rule tags, which help organize and categorize rules. Tags allow you to quickly apply all rules related to a specific category, improving efficiency and organization. For example: rule Ransomware_Kangaroo : tag1 tag2 {

The structure of a rule is pretty simple, since it’s made of just three main blocks:

  1. Meta: contains metadata for organization and documentation.
  2. String: defines patterns (e.g. text, hex, regex) that the rule searches for.
  3. Condition: specifies the logic for matching a file.

Since version 3.0, YARA also supports modules, which extend its functionality with additional features such as parsing PE headers, inspecting file metadata, and analyzing embedded scripts. YARA modules are written in C and built into YARA at compile time.

1. Meta

The meta block is the most boring one. It is an optional (but not too optional) section that provides metadata about the rule. While it doesn’t influence the rule’s logic, it helps to organize, document, and maintain rules over time, especially if they’re part of an information-sharing program.

A well-designed meta block makes it possible to automatically aggregate rules by author, malware family, or confidence level, and to generate documentation from the metadata. It’s also possible to implement rule lifecycle management based on creation and expiration dates.

Below is a very unrealistic but complete meta section, just to show off some of the most common fields:

meta: 
	author = "Matteo" 
	description = "Detects a suspicious executable file" 
	date = "2025-02-23" 
	version = "1.0" 
	severity = "high" 
	tlp = "amber"
	mitre_att = "T1547.001,T1055.012"
	reference = "https://example.com"
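As a sketch of the kind of automation that metadata enables, the following Python snippet groups rules by a meta field. It is a deliberately naive parser written for this post: it assumes each rule ends with a closing brace at the start of a line and that meta values are quoted strings. The rule names Example_A and Example_B and the helper rules_by are hypothetical.

```python
import re
from collections import defaultdict

RULE_TEXT = '''
rule Example_A {
    meta:
        author = "Matteo"
        severity = "high"
    condition:
        true
}
rule Example_B {
    meta:
        author = "Matteo"
        severity = "low"
    condition:
        true
}
'''

# Rule name plus everything up to a closing brace at the start of a line
RULE_RE = re.compile(r'rule\s+(\w+)[^{]*\{(.*?)\n\}', re.S)
# key = "value" pairs; string identifiers start with $ and are therefore skipped
META_RE = re.compile(r'^\s*(\w+)\s*=\s*"([^"]*)"', re.M)


def rules_by(field: str, text: str) -> dict[str, list[str]]:
    """Group rule names by the value of one meta field."""
    grouped = defaultdict(list)
    for name, body in RULE_RE.findall(text):
        meta = dict(META_RE.findall(body))
        grouped[meta.get(field, "unknown")].append(name)
    return dict(grouped)


print(rules_by("severity", RULE_TEXT))  # {'high': ['Example_A'], 'low': ['Example_B']}
```

For a real ruleset, a proper YARA parser would be a safer choice than regular expressions, but the idea stays the same: consistent meta fields turn a folder of rules into something you can query.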

2. String

The String block is the core of a YARA rule, defining the specific patterns to identify within the target data. These patterns can be simple text strings, binary sequences, or regular expressions. The String section provides the “fingerprints” that YARA uses to match against files, memory, or other data being analyzed.

Text Strings
Text strings in YARA rules match literal ASCII or Unicode text found in files. They are the simplest form of pattern and are enclosed in double quotes. They can be combined with modifiers such as ‘wide’, which matches strings encoded with two bytes per character (as in UTF-16), ‘nocase’ for case-insensitive matching, or ‘fullword’ to match only when the string is delimited by non-alphanumeric characters.

strings:
    $text_string = "CreateRemoteThread"
    $login_text = "Enter your credentials"
    $unicode_string = "password" wide
    $case_insensitive = "Admin" nocase
    $fullword_match = "rare sequence" fullword

Hexadecimal Strings
Hex strings match binary data that may not be representable as text. They are useful in a variety of cases, but keep in mind that, due to endianness, different rules may be needed depending on the architecture. Some use cases include:

  • Direct Binary Representation: Hex patterns match raw binary data byte-by-byte, allowing you to identify content that doesn’t have a textual representation or contains non-printable characters.
  • Obfuscation Resistance: Many malware authors obfuscate text strings but overlook binary structures that must remain intact for the malware to function. An example could be identifying files that maintain valid PE structure but have unusual modifications between standard sections.
  • Format Agnostic: Hex patterns can match across any file type or memory region regardless of encoding, providing detection at opcode level rather than relying on higher-level patterns.

Examples of hex patterns include:

  • Modified PE header: 4D 5A [10-30] 50 45 00 00 [4-8] 57 69 6E 33 32
  • RDTSC for anti-debugging: 0F 31
  • INT 2D for anti-debugging: CD 2D
  • Windows Syscall pattern: { B8 ?? ?? ?? ?? BA ?? ?? ?? ?? E8 [0-6] FF D0 }

Hex strings also support the following operators:

  • Wildcard (??): represents any byte value, useful when certain bytes are variable or irrelevant to the pattern.
  • Jump Intervals ([x-y]): specify a variable number of bytes to skip. For example, [4-8] means “skip between 4 and 8 bytes.”
  • Exact Jump ([x]): skips exactly x bytes.
  • Alternation ((A|B)): matches either pattern A or pattern B.
  • Grouping (()): parentheses group related hex bytes, mainly so that an alternation can cover multi-byte sequences. The curly braces {} only delimit the whole hex string and cannot be nested.
  • Negation (~): matches any byte except the specified value (available from YARA 4.3).

strings:
    $hex_pattern = {
        90 90 90 ?? ??               // Wildcard (??)
        68 [4-8] 6A 02               // Jump Intervals ([x-y])
        33 C0 [3] 50                 // Exact Jump ([x])
        (E8 | FF 15)                 // Alternation ((A|B))
        (8B 45 FC 83)                // Grouping (())
        ~F0 89 ?? ??                 // Negation (~)
    }
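The wildcard and jump semantics can be approximated with an ordinary byte-oriented regular expression. The sketch below is a simplified translation written for this post: hex_pattern_to_regex is a hypothetical helper, not part of YARA, and it handles only full bytes, ?? wildcards, and [x] or [x-y] jumps.

```python
import re


def hex_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified YARA-style hex pattern into a bytes regex."""
    parts = []
    for token in pattern.split():
        if token == "??":
            parts.append(b".")  # wildcard: any single byte
        elif token.startswith("[") and token.endswith("]"):
            bounds = token[1:-1].split("-")
            low, high = bounds[0], bounds[-1]  # [3] gives low == high == "3"
            parts.append(b".{%s,%s}" % (low.encode(), high.encode()))
        else:
            parts.append(re.escape(bytes.fromhex(token)))  # literal byte
    # DOTALL so that "." also matches the newline byte 0x0A
    return re.compile(b"".join(parts), re.DOTALL)


jump = hex_pattern_to_regex("68 [4-8] 6A 02")
print(bool(jump.search(bytes.fromhex("68 11 22 33 44 6A 02"))))  # True: 4 bytes skipped
print(bool(jump.search(bytes.fromhex("68 11 22 33 6A 02"))))     # False: only 3 bytes
```

YARA does not work this way internally, but reading a hex string as "literal bytes, single-byte gaps, and bounded gaps" is a handy mental model when writing rules.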

Regular Expressions
Regular expressions are one of the most powerful tools in YARA. They are enclosed within forward slashes, as in Perl, and the syntax is very close to PCRE (Perl Compatible Regular Expressions). YARA uses its own regex engine, however, so a few PCRE features such as capture groups and backreferences are not available. A fairly simple regex could be:

strings: 
	$reg = /http:\/\/[a-z0-9.]+\.[a-z]{2,3}\/login\.php/i

Remember that bounded ranges can improve scanning speed: instead of an open-ended quantifier such as * or +, the engine only has to search within a defined range. Regular expressions can also be followed by the nocase, ascii, wide, and fullword modifiers, just like text strings. A few weeks ago I discovered the Regex101 website, which is a great playground to test and practice regular expressions.
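If you prefer to test from a script, the same login-page pattern can be tried with Python's re module. Python's regex dialect is close to, but not identical to, the one YARA implements, so treat this as a quick approximation. The sample URLs are invented.

```python
import re

# Same pattern as the YARA example; the trailing /i becomes re.IGNORECASE
login_url = re.compile(rb"http://[a-z0-9.]+\.[a-z]{2,3}/login\.php", re.IGNORECASE)

samples = [
    b"http://portal.example.com/login.php",    # matches
    b"HTTP://PORTAL.EXAMPLE.COM/LOGIN.PHP",    # matches thanks to IGNORECASE
    b"https://portal.example.com/login.php",   # no match: the pattern expects http://
    b"http://portal.example.info/login.php",   # no match: the TLD is limited to 2-3 letters
]

for sample in samples:
    print(sample.decode(), bool(login_url.search(sample)))
```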

3. Condition

The Condition block is the decision-making engine of a YARA rule. It defines the logical criteria that determine whether a rule matches a file. This block is evaluated after the string block, which is an important thing to remember when creating rules. It’s like a big “if statement” for the rule, since it uses the strings you’ve defined to express when a detection should trigger.

What makes this section particularly powerful is its expressiveness: it ranges from simple string matching to complex combinations of file properties, positional relationships between patterns, and even references to other rules.

Let’s start with some basic operators. As you can imagine, the condition block supports the logical operators and, or, and not. It also supports matches to test a string against a regex, contains to check whether a string contains another string, and for [...] of to iterate over the defined strings. The following example uses the for operator to require that every string whose identifier starts with $s appears more than 3 times:

condition: 
	for all of ($s*) : (# > 3) 
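In plain Python, the logic of that condition could be sketched as follows. The string names and contents are invented for the example, and counting overlapping occurrences is an assumption about how matches are counted, not a statement about YARA's internals.

```python
def count_occurrences(data: bytes, needle: bytes) -> int:
    """Count every position where needle starts, including overlapping ones."""
    count, start = 0, data.find(needle)
    while start != -1:
        count += 1
        start = data.find(needle, start + 1)
    return count


def all_strings_more_than(data: bytes, strings: dict[str, bytes], threshold: int) -> bool:
    """Rough equivalent of: for all of ($s*) : (# > threshold)"""
    return all(count_occurrences(data, value) > threshold for value in strings.values())


# Hypothetical strings standing in for $s1 and $s2
strings = {"$s1": b"cmd.exe", "$s2": b"powershell"}

print(all_strings_more_than(b"cmd.exe " * 4 + b"powershell " * 4, strings, 3))  # True
print(all_strings_more_than(b"cmd.exe " * 4 + b"powershell " * 3, strings, 3))  # False
```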

It is also possible to create time-based conditions using the time module, which can be useful for time-sensitive rules: time.now() returns the current Unix time, which can be compared with a timestamp read from the file, such as pe.timestamp from the PE module. The same goes for file properties: YARA provides the filesize keyword to check the size of a file in bytes, pe.entry_point (from the PE module) to retrieve the offset where execution begins in an executable file, and the uint8, uint16, and uint32 functions to inspect values at specific file offsets.

condition: 
	filesize < 100KB and uint16(0) == 0x5A4D
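The value 0x5A4D may look reversed compared with the 4D 5A bytes of the MZ header. That is because uint16 reads the two bytes as a little-endian integer. A short Python sketch of the same check, where looks_like_small_pe is a name invented for this example:

```python
import struct

header = bytes.fromhex("4D 5A 90 00")  # first bytes of a typical PE file ("MZ...")

little = struct.unpack_from("<H", header, 0)[0]  # little-endian, like uint16(0)
big = struct.unpack_from(">H", header, 0)[0]     # big-endian, for comparison

print(hex(little))  # 0x5a4d
print(hex(big))     # 0x4d5a


def looks_like_small_pe(data: bytes) -> bool:
    """Rough equivalent of: filesize < 100KB and uint16(0) == 0x5A4D"""
    return (len(data) < 100 * 1024
            and len(data) >= 2
            and struct.unpack_from("<H", data, 0)[0] == 0x5A4D)


print(looks_like_small_pe(header))            # True
print(looks_like_small_pe(b"PK\x03\x04zip"))  # False
```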

Rules optimization

Literal strings are processed using the Aho-Corasick algorithm, which is highly efficient for multiple pattern matching. Regular expressions, while more flexible, use a different matching engine that typically incurs higher performance costs.

// More efficient
$str1 = "suspicious_string"

// Less efficient but more flexible
$re1 = /susp[i1]c[i1]ous_str[i1]ng/

When possible, prefer literal strings over regular expressions. As said before, when regular expressions are necessary, make them as specific as possible. String modifiers affect how YARA searches for patterns. Using them wisely can improve both performance and detection accuracy:

  • nocase: while useful for case-insensitive matching, it increases the number of potential matches YARA must evaluate.
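  • wide: searches for the two-bytes-per-character (UTF-16 style) form of the string. Combined with ascii, YARA has to look for both forms, effectively doubling the search space.
  • ascii: limits searching to ASCII encoding, potentially improving performance. This is the default when no encoding modifier is given.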
  • fullword: adds word boundary checks but can improve specificity.

Hex strings should be made as specific as possible while accommodating necessary variations, using jumps ([-] or [4-8]) instead of multiple wildcards when appropriate, as they can provide both flexibility and efficiency.

YARA evaluates conditions from left to right using short-circuit logic. This means:

  • In AND conditions, if the left operand is false, the right is never evaluated.
  • In OR conditions, if the left operand is true, the right is never evaluated.

Try to leverage this behavior by placing the most likely-to-fail or least expensive checks first in AND conditions, and the most likely-to-succeed checks first in OR conditions.
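Python's and operator short-circuits in the same way, which makes it convenient for showing the effect. In this sketch the two check functions are hypothetical stand-ins for a cheap header test and an expensive string search, and the calls list only records which of them actually ran.

```python
calls = []


def cheap_header_check(data: bytes) -> bool:
    calls.append("header")
    return data[:2] == b"MZ"


def expensive_string_search(data: bytes) -> bool:
    calls.append("search")
    return b"malicious_string" in data


def rule_matches(data: bytes) -> bool:
    # The cheap check comes first, so non-PE data never reaches the search
    return cheap_header_check(data) and expensive_string_search(data)


print(rule_matches(b"PK\x03\x04 not a PE file"), calls)  # False ['header']
```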

Using file properties like filesize or entrypoint can quickly filter out files before performing expensive string matching, limiting the scope of string matching when possible.

Private rules act as building blocks in your YARA ruleset. They perform preliminary filtering that can be reused across multiple detection rules. Create private rules that identify potential candidates, then use these in more complex rules. In the following example, IsPotentialMalware performs the initial filtering by identifying executable files within a reasonable size limit, which ensures that subsequent rules conserve computational resources:

private rule IsPotentialMalware
{
    condition:
        uint16(0) == 0x5A4D and filesize < 5MB
}

rule ActualMalwareDetection
{
    strings:
        $s1 = "malicious_string"
    condition:
        IsPotentialMalware and $s1
}

Finally, here are some broader guidelines for creating future-proof rules to minimize the need for recreation or editing:

  1. Focus on Behavior: target fundamental behaviors rather than easily changed artifacts.
  2. Parameterization: use global rule variables or external variables to make rules adjustable.
  3. Modular Design: build rules from composable components that can be updated independently.
  4. Documentation: include comprehensive comments explaining the purpose and logic behind each optimization.

YARA Examples

Ransomhub

Let’s start with the first example, designed to detect the Ransomhub ransomware family. I’ll skip the meta block, since it’s quite self-explanatory. The string block is divided into five sections that cover the most common identifiable elements of a ransomware sample:

  1. The first step is the MZ header check (hex pattern 4D 5A), which identifies executable files and works as a preliminary filter for PE files.
  2. We then search for common ransom note patterns. These are usually quite similar for each ransomware family.
  3. Then comes the most interesting part, where we search for encryption logic patterns. This part could be expanded by including related cryptographic API calls:
    • The first pattern includes a PUSH instruction (68), an LEA instruction that loads a stack address (8D 45 F?), and an indirect call to an imported function (FF 15), such as CryptEncrypt, which is often seen in cryptographic operations.
    • The second pattern contains a CMP (81 F9) and a conditional jump (0F 85), often found in key scheduling or encryption loop conditions.
  4. We check for common file extension patterns.
  5. Finally, we search for command execution patterns usually abused to disable recovery options.

The condition block combines the string patterns and is divided into three logical sections:

  1. Initial Filters:
    • uint16(0) == 0x5A4D: Ensures the file is a Windows executable
    • filesize < 5MB: Limits analysis to reasonably sized files for performance
  2. Logical Structure:
    • Uses short-circuit evaluation with OR conditions for efficiency
    • Provides multiple detection paths to catch different variants
  3. Three Detection Paths:
    • Ransom note + file extension: (2 of ($note*) and 1 of ($ext*))
    • Encryption logic: all of ($encrypt*)
    • Anti-recovery + ransom note: (2 of ($cmd*) and 1 of ($note*))

rule Ransomhub_Example
{
    meta:
        author = "Matteo"
        description = "Detects Ransomhub ransomware example"
        date = "2025-03-01"
        version = "1.0"
        severity = "critical"
        confidence = "high"
        tlp = "amber"
        mitre_att = "T1083,T1486,T1490"
        malware_family = "Ransomhub"
        malware_type = "Ransomware"
        category = "Ransomware"

    strings:
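        // PE preliminary filter: the MZ header (4D 5A) is checked in the
        // condition with uint16(0), because YARA rejects a rule whose strings
        // are defined but never referenced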
        
        // Ransom note patterns
        $note1 = "Your files have been encrypted by Ransomhub" ascii nocase
        $note2 = "Pay 1.5 BTC to restore your files" ascii nocase
        $note3 = "RANSOMHUB_README.txt" ascii
        $note4 = "Visit our payment portal at" ascii
        
        // Encryption code patterns
        $encrypt1 = { 68 83 F0 ?? 8D 45 F? 50 FF 15 }
        $encrypt2 = { 81 F9 ?? ?? 55 00 0F 85 }
        
        // Extension patterns
        $ext1 = ".ransomhub" ascii fullword
        $ext2 = ".crypted" ascii fullword
        
        // cmd execution patterns
        $cmd1 = "vssadmin delete shadows /all" ascii nocase
        $cmd2 = "bcdedit /set" ascii nocase
        $cmd3 = "wbadmin delete catalog" ascii nocase

    condition:
        uint16(0) == 0x5A4D and
        filesize < 5MB and
        (
            (2 of ($note*) and 1 of ($ext*)) or
            all of ($encrypt*) or
            (2 of ($cmd*) and 1 of ($note*))
        )
}
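To see how the string-based detection paths of this rule combine, here is a simplified Python sketch. It is an approximation under stated assumptions: it ignores the nocase and fullword modifiers, leaves out the hex-based encryption path, and the helpers n_of and condition are invented for this example.

```python
NOTES = [
    b"Your files have been encrypted by Ransomhub",
    b"Pay 1.5 BTC to restore your files",
    b"RANSOMHUB_README.txt",
    b"Visit our payment portal at",
]
EXTS = [b".ransomhub", b".crypted"]
CMDS = [b"vssadmin delete shadows /all", b"bcdedit /set", b"wbadmin delete catalog"]


def n_of(data: bytes, patterns: list[bytes], n: int) -> bool:
    """True when at least n of the patterns occur in data."""
    return sum(pattern in data for pattern in patterns) >= n


def condition(data: bytes) -> bool:
    return (
        data[:2] == b"MZ"                    # uint16(0) == 0x5A4D
        and len(data) < 5 * 1024 * 1024      # filesize < 5MB
        and (
            (n_of(data, NOTES, 2) and n_of(data, EXTS, 1))
            or (n_of(data, CMDS, 2) and n_of(data, NOTES, 1))
        )
    )


sample = b"MZ" + b"\x00" * 16 + b"RANSOMHUB_README.txt Visit our payment portal at x.ransomhub"
print(condition(sample))      # True: two note strings plus one extension
print(condition(sample[2:]))  # False: the MZ header is missing
```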

Crypto Mining

Our second example is a crypto-mining detection rule. Let’s break down the key components and the philosophy behind this detection approach. The string block is divided into seven logical sections that cover all the distinctive characteristics:

  1. As in the last example, the first step is the MZ header check (hex pattern 4D 5A), used as a preliminary filter for PE files.
  2. Strings for the most common mining algorithms. We can change them based on our needs, or create a separate rule for each algorithm to improve effectiveness.
  3. Strings that identify common mining pools, such as XMR pools, Nanopool, and more.
  4. A hex pattern that identifies assembly instructions commonly found in CPU mining code:
    • 0F 57 is the XORPS (Exclusive OR for Packed Single-Precision Floating-Point Values) opcode, often used to zero out registers before mathematical operations. This is common in mining code, which performs many hashing operations and typically starts each new hash calculation with cleared registers.
    • 0F 29 is MOVAPS (Move Aligned Packed Single-Precision Floating-Point Values), used in SIMD (Single Instruction, Multiple Data) operations, which crypto miners leverage for efficient parallel computation to achieve higher hash rates.
    • 48 8D 54 24 is an LEA instruction, often used to calculate addresses for hash state buffers and set up pointers to the data being hashed; it computes an address without actually accessing memory.
  5. String detection of common GPU libraries and device enumeration functions.
  6. String detection of hiding techniques.
  7. Finally, patterns that target Monero and Ethereum wallet addresses and JSON configuration files with mining pool URLs.

Regarding the condition block:

  • It starts by applying two filtering conditions:
    1. uint16(0) == 0x5A4D: immediately filters out non-Windows executable files.
    2. filesize < 15MB: limits the rule to files smaller than 15 megabytes, to avoid excessive scanning time on large files and because most cryptominers are relatively compact.
  • After passing the initial filters, the file must match at least one of the four detection scenarios.

rule Crypto_Mining_Example
{
    meta:
        author = "Matteo"
        description = "Detect cryptocurrency mining software"
        date = "2025-03-01"
        version = "1.0"
        severity = "medium"
        confidence = "high"
        tlp = "amber"
        mitre_att = "T1496"
        malware_type = "Cryptominer"
        category = "Cryptomining"

    strings:
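        // PE preliminary filter: the MZ header (4D 5A) is checked in the
        // condition with uint16(0), because YARA rejects a rule whose strings
        // are defined but never referenced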
        
        // mining algorithm strings
        $algo_1 = "cryptonight" ascii nocase
        $algo_2 = "randomx" ascii nocase
        $algo_3 = "ethash" ascii nocase
        $algo_4 = "keccak" ascii nocase
        $algo_5 = "argon2" ascii nocase
        
        // pool strings
        $pool_1 = "pool.minexmr.com" ascii nocase
        $pool_2 = "xmrpool.eu" ascii nocase
        $pool_3 = "pool.supportxmr.com" ascii nocase
        $pool_4 = "nanopool.org" ascii nocase
        $pool_5 = "stratum+tcp://" ascii
        $pool_6 = "stratum+ssl://" ascii
        
        // CPU code patterns
        $cpu_1 = { 0F 57 C0 0F 29 44 24 ?? 0F 29 44 24 ?? 48 8D 54 24 }
        
        // GPU code patterns
        $gpu_1 = "OpenCL" ascii
        $gpu_2 = "CL_DEVICE_TYPE_GPU" ascii
        $gpu_3 = "cudaGetDeviceCount" ascii
        $gpu_4 = "nvmlDeviceGetHandleByIndex" ascii
        
        // hiding techniques
        $hide_1 = "SetProcessPriorityBoost" ascii
        $hide_2 = "SetPriorityClass" ascii
        $hide_3 = "CreateMutexA" ascii

        // wallet addresses and configuration files
        $wallet_mon = /[48][a-zA-Z1-9]{94}/ // Monero wallet
        $wallet_eth = /0x[a-fA-F0-9]{40}/ // Ethereum wallet

        $config = /"url":\s*"stratum\+tcp:\/\// // JSON config pattern
        
    condition:
        uint16(0) == 0x5A4D and
        filesize < 15MB and
        (
            // Mining algorithm with pool or wallet
            (1 of ($algo*) and (1 of ($pool*) or 1 of ($wallet*) or $config)) or
            
            // CPU mining specific patterns
            (1 of ($cpu*) and 1 of ($pool*) and 1 of ($algo*)) or
            
            // GPU mining specific patterns
            (1 of ($gpu*) and 1 of ($pool*)) or
            
            // Process hiding with mining components
            (2 of ($hide*) and 1 of ($algo*) and 1 of ($pool*))
        )
}
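Before relying on wallet and configuration patterns like these, it helps to try them against known-good and known-bad inputs. The following Python sketch does that for the Ethereum-style address pattern (0x followed by 40 hex characters) and the stratum configuration pattern. The configuration content is synthetic, and the address is just a repeated filler value, not a real wallet.

```python
import re

ETH_WALLET = re.compile(rb"0x[a-fA-F0-9]{40}")
STRATUM_CONFIG = re.compile(rb'"url":\s*"stratum\+tcp://')

# Synthetic miner configuration with a filler 40-hex-character address
miner_config = (b'{"url": "stratum+tcp://pool.example.org:3333", "user": "0x'
                + b"ab" * 20 + b'"}')
benign_config = b'{"url": "https://example.org/api", "user": "alice"}'

for name, blob in (("miner", miner_config), ("benign", benign_config)):
    print(name,
          bool(ETH_WALLET.search(blob)),
          bool(STRATUM_CONFIG.search(blob)))
```

Keep in mind that the address pattern also matches any other 40-character hex value prefixed with 0x, which is one reason the rule never relies on it alone.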