Malware Functionality - Data Encoding

Posted on Apr 28, 2023

Notes
1. Simple Ciphers
  1. XOR
    1. NULL-Preserving Single-Byte XOR Encoding
  2. Other Simple Encoding Schemes
2. Common Cryptographic Algorithms
  1. Searching for High-Entropy Content
Labs

Notes

A malware author might use a layer of encoding for these purposes:

To hide configuration information, such as a command-and-control domain.
To save information to a staging file before stealing it.
To store strings used by the malware and decode them just before they are needed.
To disguise the malware as a legitimate tool, hiding the strings used for malicious activities.

Simple Ciphers

XOR

NULL-Preserving Single-Byte XOR Encoding

Plain single-byte XOR has a weakness: XOR-ing a NULL byte with the key yields the key itself, so any run of NULL bytes in the plaintext exposes the key in the encoded output. Malware authors mitigate this issue with a NULL-preserving single-byte XOR encoding scheme. Unlike the regular XOR encoding scheme, the NULL-preserving scheme has two exceptions:

If the plaintext character is NULL or the key itself, then the byte is skipped.
If the plaintext character is neither NULL nor the key, then it is encoded via an XOR with the key.

This NULL-preserving XOR technique is especially popular in shellcode, where it is important to be able to perform encoding with a very small amount of code.
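The two rules above can be sketched in a few lines of Python. This is only an illustration: the function name, the sample bytes and the key 0x12 are made up for the example and do not come from any particular sample.

```python
def null_preserving_xor(data: bytes, key: int) -> bytes:
    """XOR every byte with key, except bytes equal to 0x00 or to the key itself."""
    out = bytearray()
    for b in data:
        if b == 0 or b == key:
            out.append(b)        # skipped: NULLs stay NULL, the key never becomes NULL
        else:
            out.append(b ^ key)  # regular single-byte XOR
    return bytes(out)


# Hypothetical input: a few bytes with NULLs and one byte equal to the key.
plain = b"MZ\x00\x00\x12\x00"
key = 0x12
encoded = null_preserving_xor(plain, key)
decoded = null_preserving_xor(encoded, key)
print(encoded.hex(), decoded == plain)
```

Because a skipped byte is never produced by the XOR branch (and vice versa), running the same routine a second time restores the original data.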

Other Simple Encoding Schemes

ADD, SUB
ROL, ROR
ROT
Multibyte
Chained or loopback
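As a rough illustration of the first three entries in this list, the sketch below implements ADD/SUB, ROL/ROR and ROT13 on bytes and text. The helper names and sample values are assumptions chosen for the example.

```python
import codecs


def add_encode(data: bytes, key: int) -> bytes:
    """ADD: add the key to every byte, wrapping around at 256."""
    return bytes((b + key) & 0xFF for b in data)


def sub_decode(data: bytes, key: int) -> bytes:
    """SUB: the inverse of ADD."""
    return bytes((b - key) & 0xFF for b in data)


def rol8(b: int, n: int) -> int:
    """ROL: rotate an 8-bit value left by n bits."""
    n %= 8
    return ((b << n) | (b >> (8 - n))) & 0xFF


def ror8(b: int, n: int) -> int:
    """ROR: rotate an 8-bit value right by n bits (the inverse of ROL)."""
    n %= 8
    return ((b >> n) | (b << (8 - n))) & 0xFF


def rot13(text: str) -> str:
    """ROT: the classic ROT13 letter rotation, via the standard codec."""
    return codecs.encode(text, "rot_13")


sample = b"cmd.exe"
print(sub_decode(add_encode(sample, 0x20), 0x20))  # round-trips to the input
print(hex(rol8(0x81, 1)), hex(ror8(0x03, 1)))
print(rot13("Hello"))
```

Each of these is reversible with its counterpart (ADD with SUB, ROL with ROR, ROT13 with itself), which is why a single small routine is usually enough for both encoding and decoding.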

Common Cryptographic Algorithms

Malware often uses simple cipher schemes because they are easy and often sufficient. Also, using standard cryptography does have potential drawbacks, particularly with regard to malware:

Cryptographic libraries can be large, so malware may need to statically integrate the code or link to existing code.
Having to link to code that exists on the host may reduce portability.
Standard cryptographic libraries are easily detected (via function imports, function matching, or the identification of cryptographic constants).
Users of symmetric encryption algorithms need to worry about how to hide the key.

💡 Malware often employs the RC4 algorithm, probably because it is small and easy to implement in software, and it has no cryptographic constants to give it away.
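To give an idea of how small RC4 is, here is a minimal Python sketch of the textbook algorithm (key scheduling followed by keystream generation). It is not taken from any sample, and the key and message are just the commonly published test values.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4: the same call encrypts and decrypts."""
    # Key-scheduling algorithm: start from the identity permutation 0..255
    # and shuffle it using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    # Pseudo-random generation: keep shuffling and XOR one keystream
    # byte into each data byte.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())
print(rc4(b"Key", ciphertext))
```

Note that the only constants involved are 256 and 0xFF, so there are no magic values to search for, although the 256-iteration loops are a pattern worth recognizing in a disassembly.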

Searching for High-Entropy Content

A 64-byte string with 64 distinct byte values has the highest possible entropy for a chunk of that size. Those 64 values correspond to an entropy value of 6 (that is, 6 bits of entropy), since exactly 64 values can be expressed with 6 bits.

Another setting that can be useful is a chunk size of 256 with entropy above 7.9. This means that there is a string of 256 consecutive bytes, reflecting nearly all 256 possible byte values.
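The numbers in this section can be checked with a small Shannon entropy calculation. The sketch below is an assumption about how such a scan could work, not the actual implementation of any entropy plugin; the function names and the test buffer are invented for the example.

```python
import math
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Entropy in bits per byte of a chunk (0.0 for an empty chunk)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def high_entropy_offsets(data: bytes, chunk_size: int = 256, threshold: float = 7.9):
    """Slide a window over data and return offsets whose entropy exceeds threshold."""
    return [
        off
        for off in range(0, len(data) - chunk_size + 1)
        if shannon_entropy(data[off:off + chunk_size]) > threshold
    ]


print(shannon_entropy(bytes(range(64))))   # 64 distinct values in 64 bytes
print(shannon_entropy(bytes(range(256))))  # 256 distinct values in 256 bytes

# Hypothetical buffer: zero padding around a 256-byte table of all byte values.
blob = b"\x00" * 100 + bytes(range(256)) + b"\x00" * 100
print(high_entropy_offsets(blob)[:3])
```

A chunk can only reach log2(chunk size) bits of entropy, which is why the 64-byte setting tops out at 6 and the 256-byte setting at 8.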

Labs

Lab 1

1. Compare the strings in the malware (from the output of the strings command) with the information available via dynamic analysis. Based on this comparison, which elements might be encoded?

First of all, we take a look at the imports and strings via static analysis.

Imports

Strings

When we run the malware, we observe that it sends some HTTP requests to “hxxp://practicalmalwareanalysis[.]com” but the GET request is encoded.

It is noticeable that the sample also performs some registry manipulation.

To answer the first question, the URL in the GET requests is encoded.

2. Use IDA Pro to look for potential encoding by searching for the string xor. What type of encoding do you find?

We are going to do it with Ghidra instead.

In order to remove the XOR operations that are used to clear out registers, we are going to apply this regular expression: XOR (.*),\1

From 105 entries to 9 entries! 😁
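The same filtering idea can be reproduced outside the disassembler. The sketch below applies a slightly stricter variant of the regular expression to a list of instruction strings; the instructions are made-up examples, not the actual listing from the lab.

```python
import re

# Matches "XOR reg,reg" where both operands are the same register,
# i.e. the idiom used to zero a register.
SELF_XOR = re.compile(r"XOR\s+(\w+),\s*\1\s*$", re.IGNORECASE)

# Hypothetical search results exported from a disassembler.
listing = [
    "XOR EAX,EAX",
    "XOR ECX,0x3b",
    "XOR EDX,EDX",
    "XOR byte ptr [EAX],0x3b",
    "XOR AL,AH",
]

# Keep only the XORs that do not simply clear a register.
interesting = [line for line in listing if not SELF_XOR.search(line)]
for line in interesting:
    print(line)
```

The backreference `\1` is what makes this work: it only matches when the second operand repeats the first one exactly.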

If we take a look at the first one we can observe the typical loop for encoding.

It is called from the subroutine located at 0x401300, which handles the resource of the program.

3. What is the key used for encoding and what content does it encode?

The key is 0x3B (59 in decimal), and the whole operation XOR-encodes the resource attached to the program.
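Once the key is known, the resource can be decoded offline. The following is a minimal sketch that assumes a plain single-byte XOR over the whole resource; the byte string used here is a placeholder, not the real resource contents.

```python
KEY = 0x3B  # key identified in the encoding loop


def xor_decode(blob: bytes, key: int = KEY) -> bytes:
    """Plain single-byte XOR; the same call encodes and decodes."""
    return bytes(b ^ key for b in blob)


# Placeholder standing in for the bytes extracted from the resource section.
placeholder_plain = b"example resource contents"
placeholder_encoded = xor_decode(placeholder_plain)

print(placeholder_encoded.hex())
print(xor_decode(placeholder_encoded))
```

In practice the input would be the resource dumped with a resource editor or a PE parsing tool, read from disk as raw bytes.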

4. Use the static tools FindCrypt2, Krypto ANALyzer (KANAL), and the IDA Entropy Plugin to identify any other encoding mechanisms. What do you find?

Ghidra also shows entropy. In the image below, the lighter the color, the higher the entropy; the grey sections have the highest entropy (around 6.3).

As we can see, the high-entropy area is a Base64 encoding string.

5. What type of encoding is used for a portion of the network traffic sent by the malware?

If we take a look at the strings, we can see that some of them are related to a browser User-Agent.

The User-Agent string is used in the subroutine at 0x4011C9. We can assume that this function has some network functionality, so we take a closer look at its function call graph to learn more about it.

The function at 0x4010B1 uses the Base64 string that we pointed out before, so we can be confident that it performs Base64 encoding.

6. Where is the Base64 function in the disassembly?

At 0x4010B1.

7. What is the maximum length of the Base64-encoded data that is sent? What is encoded?

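The function takes a buffer of at most 12 characters, copied from the hostname string. Since Base64 turns every 3 input bytes into 4 output characters, the encoding adds 4 characters, for a maximum of 16 encoded characters.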

8. In this malware, would you ever see the padding characters (= or == ) in the Base64-encoded data?

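Yes, as shown in the code of the function. With standard Base64, padding appears whenever the length of the encoded input is not a multiple of 3 bytes.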
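The length and padding behaviour is easy to check with Python's standard base64 module. This sketch assumes the sample follows standard Base64 padding rules, and the hostnames are invented for the example.

```python
import base64


def encoded_length(n: int) -> int:
    """Number of Base64 characters produced for n input bytes (with padding)."""
    return 4 * ((n + 2) // 3)


def describe(hostname: bytes):
    """Return the Base64 text and the number of '=' padding characters."""
    text = base64.b64encode(hostname).decode("ascii")
    return text, text.count("=")


# Hypothetical hostnames of 12, 11 and 10 characters.
for name in (b"HOSTNAME1234", b"WORKSTATION", b"DESKTOP-PC"):
    text, padding = describe(name)
    print(len(name), len(text), padding, text)
```

A 12-byte input always yields 16 characters without padding, while shorter inputs whose length is not divisible by 3 end in `=` or `==`.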

9. What does this malware do?

This malware is a loader that contains another sample in its resource section. It gains an extra layer of stealth by using Base64 encoding in its GET requests.

Lab 2

1. Using dynamic analysis, determine what this malware creates.

First of all, static analysis reveals some suspicious imports such as WriteFile or TerminateProcess. As for the strings, there is nothing of value.

With dynamic analysis we observe the creation of multiple files inside the directory where the malware has been run.

2. Use static techniques such as a xor search, FindCrypt2, KANAL, and the IDA Entropy Plugin to look for potential encoding. What do you find?

In the following picture we can observe XOR operations at different locations. As we can see, most of the calls are performed in subroutine_401739.

The subroutine has a loop that seems to be in charge of the encoding stage.

3. Based on your answer to question 1, which imported function would be a good prospect for finding the encoding functions?

Both CreateFile and WriteFile are functions that must be tightly related to the encoding subroutines.

For instance, the function at 0x401000 is shared between them. Another thing that caught my attention was that WriteFile was being called in the following way: For asynchronous write operations, _hFile_ can be any handle opened with the CreateFile function using the **FILE_FLAG_OVERLAPPED**

After following cross-references for a while, we find a suspicious workflow. The following image shows that the function at 0x40181F calls the XOR-encoding method, and after that the function at 0x401000 calls CreateFile and WriteFile.

4. Where is the encoding function in the disassembly?

It is located at 0x40181F.

5. Trace from the encoding function to the source of the encoded content. What is the content?

The encoding function is called in func_401851. As we can observe, the encoding function takes two parameters that are previously used in func_401070.

This is what we find if we dive into the function. It operates with device contexts (Windows data structure containing information about the drawing attributes of a device such as a display or a printer) and bitmaps.

6. Can you find the algorithm used for encoding? If not, how can you decode the content?

It is not a standard algorithm.

7. Using instrumentation, can you recover the original source of one of the encoded files?

The book’s answer assumes that decoding can also be performed with the same encryption function.

The goal is to set one breakpoint before the XOR encryption and another just after it ends: one at 0x00401880 (before func_40181F) and the other at 0x0040190A (after the return).

Right-click the top value on the stack in the stack pane (the value located at ESP) and select Follow in Dump. Then, with a hex editor, copy the content of one of the already encrypted files and paste it into the same section of the dump by selecting Binary -> Binary Paste.

In my case the file seemed to be corrupted, although the screen resolution was not changed during the process. Anyway, the goal of the lab was to practice the concepts, not to retrieve the original file :)

Lab 3

1. Compare the output of strings with the information available via dynamic analysis. Based on this comparison, which elements might be encoded?

The first thing to do is to obtain the strings and imports with PEStudio.

Imports

Strings

There are a lot of junk strings, but near the end there are some related to cryptographic ciphers and blocks, as well as the URL “www[.]practicalmalwareanalysis[.]com” and a long string that looks very much like a Base64 encoding string.

Regarding dynamic analysis, we observe a request to the URL mentioned above, but what stands out is the number of queries to registry keys related to sockets. Even worse, the sample modifies registry keys concerning the TCP protocol.

Random TCP packets to port 8910

Registry queries to WinSocket

Registry queries to TCP protocol

To answer the question: network parameters seem to be encoded.

2. Use static analysis to look for potential encoding by searching for the string xor. What type of encoding do you find?

If we look for potential encoding by searching for the string xor, we get more results than in the previous labs.

First of all, we land in main, where the first thing we notice is a call to func_401AC2, one of the subroutines that contains several XOR operations.

Diving into this function, we observe all the magic. Ghidra detects AES encryption, and several key-related strings appear that were already spotted during static analysis.

3. Use static tools like FindCrypt2, KANAL, and the IDA Entropy Plugin to identify any other encoding mechanisms. How do these findings compare with the XOR findings?

Ghidra detects an area with an entropy value of 8.0 (the red part), from 0x40EB54 to 0x40F7BD.

4. Which two encoding techniques are used in this malware?

Base64 (at 0x40103F) and AES encryption.

5. For each encoding technique, what is the key?

Base64 has no key as such; what plays that role here is the custom index string CDEFGHIJKLMNOPQRSTUVWXYZABcdefghijklmnopqrstuvwxyzab0123456789+/. The key for AES is ijklmnopqrstuvwx.
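Data encoded with such a custom string can be handled by translating it to the standard alphabet and then using a regular Base64 decoder. The sketch below assumes the string above is a drop-in replacement for the standard index string, position by position; the helper names and the sample command are illustrative only.

```python
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
CUSTOM = "CDEFGHIJKLMNOPQRSTUVWXYZABcdefghijklmnopqrstuvwxyzab0123456789+/"

# Character-for-character mappings between the two alphabets.
# The '=' padding character is not in either table, so it passes through.
TO_STANDARD = str.maketrans(CUSTOM, STANDARD)
TO_CUSTOM = str.maketrans(STANDARD, CUSTOM)


def custom_b64decode(text: str) -> bytes:
    """Decode text that was Base64-encoded with the custom index string."""
    return base64.b64decode(text.translate(TO_STANDARD))


def custom_b64encode(data: bytes) -> str:
    """Encode data with the custom index string (useful for testing)."""
    return base64.b64encode(data).decode("ascii").translate(TO_CUSTOM)


# Hypothetical payload, encoded and decoded again.
encoded_cmd = custom_b64encode(b"dir")
print(encoded_cmd, custom_b64decode(encoded_cmd))
```

A standard decoder applied directly to this data would produce garbage, which is the whole point of shuffling the index string.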

6. For the cryptographic encryption algorithm, is the key sufficient? What else must be known?

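For AES, the key alone is not sufficient: the length of the key and the length of the block must also be known, along with details such as the mode of operation and the initialization vector, if one is used.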

7. What does this malware do?

As the function call graph from the main subroutine shows, the sample combines encryption algorithms with network connections through sockets. This communication is encrypted, and it is carried out via a shell process.
