12.3 Obfuscating Code

12.3.1 Problem

Most C programs use common programming idioms based on C statements, default data types, and function invocation/return conventions based on the C standard library. Those familiar with C and how it is compiled to assembly language can easily identify these idioms in compiled binary code.

12.3.2 Solution

Obfuscating compiled code requires understanding how the code will look at an assembly-language level. The purpose of obfuscating C code is to create maintainable source code that will run at close to the speed of the original, but that is difficult to understand when represented in assembly language. This difficulty may arise from an increase in the complexity of the algorithm, from an apparent increase in complexity, or from a misrepresentation of the constants, data types, and conditional expressions used in an algorithm.

The examples presented in the discussion for this recipe represent only a handful of ways in which code can be obfuscated. More involved transformations include blurring the boundaries between functions by interleaving the code of two or more functions into a multipurpose function, using custom virtual machines or emulators to execute a byte-code representation of a function, and spawning new threads or processes to perform trivial or irrelevant tasks.
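As an illustration of the byte-code approach mentioned above, here is a minimal sketch of a stack-based interpreter. The opcode names, the vm_run function, and the check_zero_code program are all hypothetical and are not taken from any existing library; the sketch also omits the stack bounds checking that a real interpreter would need.

```c
#include <stddef.h>

/* Hypothetical opcodes for a tiny stack machine (illustration only). */
enum vm_op { VM_PUSH, VM_LOAD, VM_XOR, VM_EQ, VM_HALT };

/* Byte-code for: (input ^ 0x5A) == 0x5A, which is true only when input is 0. */
static const unsigned char check_zero_code[] = {
    VM_LOAD,
    VM_PUSH, 0x5A, VM_XOR,
    VM_PUSH, 0x5A, VM_EQ,
    VM_HALT
};

/* Interprets code against one input value and returns the top of the stack.
   No bounds checking is done on the stack; this is a sketch only. */
static unsigned int vm_run(const unsigned char *code, unsigned int input) {
    unsigned int stack[16];
    size_t sp = 0;                       /* number of values on the stack */

    for (;;) {
        unsigned int a, b;
        switch (*code++) {
        case VM_PUSH: stack[sp++] = *code++; break;   /* push 8-bit operand */
        case VM_LOAD: stack[sp++] = input;   break;   /* push the input     */
        case VM_XOR:  b = stack[--sp]; a = stack[--sp]; stack[sp++] = a ^ b;    break;
        case VM_EQ:   b = stack[--sp]; a = stack[--sp]; stack[sp++] = (a == b); break;
        default:      return sp ? stack[sp - 1] : 0;  /* VM_HALT or unknown */
        }
    }
}
```

Calling vm_run(check_zero_code, value) takes the place of if (!value). In the compiled binary the comparison appears only as data consumed by the interpreter loop, not as a test-and-branch on value.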

12.3.3 Discussion

Increased code obfuscation comes at the price of code maintainability. In general, it is preferable to combine several simple techniques with data obfuscation rather than to dedicate development and debugging time to perfecting a single, advanced obfuscation technique.

The most common idiom in C programs is "test-and-branch": a value is tested, and the result of the test determines the next statement to be executed. The test-and-branch idiom is the underlying mechanism for conditional expressions (if, if-else, switch) and loops (for, while, do-while), and it is usually implemented in assembly language as:

cmp value, constant
jcc if_true_handler

where jcc is a conditional branch determined by the type of test being performed. Table 12-1 lists the Intel conditional branch instructions and their corresponding C comparison operators.

Table 12-1. Intel conditional branch instructions and their C comparison operators

C operator        Asm mnemonic     Flags tested
==                jz, je           ZF == 1
!=                jnz, jne         ZF == 0
>= (signed)       jge, jnl         SF == OF
>= (unsigned)     jae, jnb, jnc    CF == 0
>  (signed)       jg, jnle         ZF == 0 && SF == OF
>  (unsigned)     ja, jnbe         CF == 0 && ZF == 0
<= (signed)       jle, jng         ZF == 1 || SF != OF
<= (unsigned)     jbe, jna         CF == 1 || ZF == 1
<  (signed)       jl, jnge         SF != OF
<  (unsigned)     jb, jc, jnae     CF == 1

Intel provides conditional branch instructions that check the parity (PF) flag as well as the zero (ZF), sign (SF), overflow (OF), and carry (CF) flags. The parity flag is set if the least-significant byte in the result of an operation contains an even number of 1 bits; the zero flag is set if the result of an operation is zero; the sign flag is set to the most-significant bit of the result; the overflow flag is set if an operation overflows the bounds of a signed integer; and the carry flag is set on arithmetic carry or borrow, and when an operation overflows the bounds of an unsigned integer.
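The flag definitions above can be modeled in portable C. The following is a minimal sketch, assuming a 32-bit compare that computes a - b; the cmp_flags structure and the model_cmp function are hypothetical names used only for illustration.

```c
#include <stdint.h>

struct cmp_flags { int zf, sf, cf, of, pf; };

/* Models the flags that a 32-bit compare computing a - b would set. */
static struct cmp_flags model_cmp(uint32_t a, uint32_t b) {
    struct cmp_flags f;
    uint32_t r = a - b;            /* result of the implied subtraction */
    uint32_t low = r & 0xFFu;      /* PF looks only at the low byte     */
    int ones = 0;

    while (low) { ones += (int)(low & 1u); low >>= 1; }

    f.zf = (r == 0);                          /* result is zero            */
    f.sf = (int)(r >> 31);                    /* most-significant bit      */
    f.cf = (a < b);                           /* unsigned borrow           */
    f.of = (int)(((a ^ b) & (a ^ r)) >> 31);  /* signed overflow           */
    f.pf = ((ones & 1) == 0);                 /* even number of 1 bits     */
    return f;
}
```

With this model, a signed less-than corresponds to f.sf != f.of and an unsigned less-than to f.cf == 1, matching the jl and jb rows of Table 12-1.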

In compiled C code, equality tests make use of ZF, while greater-than and less-than tests make use of OF, CF, and SF. By rewriting test-and-branch code to use the PF, or to use the sign, overflow, or carry flags in circumstances where a zero flag would be expected, the purpose of the test-and-branch can be made less obvious. A simple example can be found in the test-for-zero operation, often implemented in C as:

if (!value) {
  ; /* zero-handling code here */
}

This produces the following assembly language:

  movl value, %eax
  test %eax, %eax           ; equivalent to (%eax & %eax)
  jnz  nonzero_value        ; jump over zero-handling code
                            ; zero-handling code is here
                            ; execution resumes here

In an alternate implementation, a negl instruction followed by a jc to nonzero_value replaces the test and jnz pair. The negl instruction replaces the contents of the eax register with its two's complement. More importantly, it sets CF to 0 if the eax register is 0, and to 1 otherwise. A test for equality has now been replaced by what appears to be a bounds or range check.
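The carry-flag behavior of negl can be tried from C with a short sketch like the one below. It assumes GCC-style extended inline assembly on an x86 target; the function name nonzero_via_carry is hypothetical, and the fallback branch is only a portable stand-in for other compilers and architectures.

```c
/* Returns 1 if v is nonzero, using the carry flag left behind by negl. */
static int nonzero_via_carry(unsigned int v) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned char cf;
    __asm__("negl %1\n\t"      /* v = -v; CF is set when v was nonzero */
            "setc %0"          /* copy CF into a byte register         */
            : "=q"(cf), "+r"(v)
            :
            : "cc");
    return cf;
#else
    /* Portable stand-in: 0 - v borrows exactly when v is nonzero. */
    return 0u < v;
#endif
}
```

In obfuscated code the setc would be replaced by a jc or jnc, so that the zero test reads like an unsigned range check.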

Removing the conditional branch can make things even less obvious. The rcl instruction, for example, can be used to rotate CF into a register, which can then be used as an index into a two-element table of addresses. The following IF_ZERO macro demonstrates this technique.

#define IF_ZERO(val)                             \
        asm("   xorl %%ebx, %%ebx\n\t"           \
            "   negl %%eax\n\t"                  \
            "   rcl $3, %%ebx\n\t"               \
            "   movl 0f( , %%ebx ), %%eax \n\t"  \
            "   jmp *%%eax \n"                   \
            "0: \n\t"                            \
            "   .long 1f\n\t"                    \
            "   .long 2f\n"                      \
            "1: \n"                              \
            : : "a" (val) : "%ebx");
#define ELSE                                     \
        asm("   jmp 3f\n\t"                      \
            "2: \n");
#define ENDIF                                    \
        asm("3: \n");

The IF_ZERO macro places the value to be tested in the eax register, then uses the negl instruction to set the carry flag if the value in the eax register is nonzero. The carry flag is then rotated into a register and used as an index into a jump table. As written, the macro relies on GCC inline assembly and 32-bit x86 addressing. The macro can be used to test for equality by subtracting one value from another and passing it the result. The following example demonstrates how to use IF_ZERO to test the result of calloc( ). Note that the ELSE macro must be included even if an else condition is not needed.

struct MY_STRUCT *my_struct;
my_struct = calloc(1, sizeof(struct MY_STRUCT));
IF_ZERO(my_struct)
  fprintf(stderr, "alloc failed\n");
  return 0;
ELSE /* the else is required */
ENDIF
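The idea behind IF_ZERO, replacing a conditional branch with an index into a two-element table of addresses, can also be sketched in portable C. This is not the macro above but a simplified stand-in that uses function pointers instead of code labels; the handler names, return codes, and dispatch_on_zero are hypothetical.

```c
#include <limits.h>

enum { HANDLED_ZERO = 10, HANDLED_NONZERO = 20 };   /* illustrative codes */

typedef int (*handler_fn)(void);

static int on_zero(void)    { return HANDLED_ZERO; }
static int on_nonzero(void) { return HANDLED_NONZERO; }

/* Selects a handler through a table lookup instead of an if statement. */
static int dispatch_on_zero(unsigned int v) {
    static const handler_fn table[2] = { on_zero, on_nonzero };

    /* (v | (0 - v)) has its top bit set exactly when v is nonzero,
       so shifting that bit down yields an index of 0 or 1. */
    unsigned int index = (v | (0u - v)) >> (sizeof(unsigned int) * CHAR_BIT - 1);

    return table[index]();
}
```

A compiler is free to turn this back into a compare and branch, so the generated assembly should be inspected when the layout matters.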

The C if statement itself is simple, and it is easy to recognize in a binary. For example:

int value = check_input(user_input);
if (value) {
  ; /* success-handling code here */
}

This will usually be compiled as a test of value followed by a jnz instruction. Comparing value with a constant results in a jnz instruction following a compare of value with that constant. Changing the type of the value being tested from an integer to a floating-point number will change not only its representation in memory, but also the actual assembly-language comparison instruction:

float value = check_input(user_input);
if (value == 1.0) {
  ; /* success-handling code here */
}

Comparing the assembly code generated for the integer test and the float test clearly illustrates the difference between the two from a code obfuscation standpoint:

; First, the integer test: if (value) ...
 8048346:       8b 45 fc                mov    0xfffffffc(%ebp),%eax
 8048349:       85 c0                   test   %eax,%eax
 804834b:       74 10                   je     804835d <main+0x35>
; Compare with the float test: if (value == 1.0) ...
 804835d:       d9 45 f8                flds   0xfffffff8(%ebp)
 8048360:       d9 e8                   fld1
 8048362:       d9 c9                   fxch   %st(1)
 8048364:       da e9                   fucompp
 8048366:       df e0                   fnstsw %ax
 8048368:       80 e4 45                and    $0x45,%ah
 804836b:       80 fc 40                cmp    $0x40,%ah
 804836e:       74 02                   je     8048372 <main+0x4a>

When a constant value is used in a comparison, it can be increased or decreased as long as value is adjusted by the same amount:

if (((value + 8) << 2) == 32) { /* if (!value) */
  ; /* success-handling code here */
}
if (!(--value)) { /* if (value == 1) */
  ; /* success-handling code here */
}
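The adjusted comparisons can be checked against their plain equivalents with a small sketch such as the following. The function names are hypothetical, and uint32_t is used so that wraparound is well defined.

```c
#include <stdint.h>

/* if (!value), written with both sides moved up by 8 */
static int is_zero_adjusted(uint32_t value) {
    return (uint32_t)(value + 8u) == 8u;
}

/* if (value == 1), written as a test of value - 1 against zero */
static int equals_one_adjusted(uint32_t value) {
    return !(uint32_t)(value - 1u);
}

/* Shifted variant: note the parentheses, since == binds tighter than << */
static int is_zero_shifted(uint32_t value) {
    return (uint32_t)((value + 8u) << 2) == 32u;
}
```

Adding or subtracting a constant on both sides preserves the test exactly under modular arithmetic. The shifted variant discards the top two bits, so in this sketch it also accepts values such as 0x40000000; it matches the plain test only when the inputs are known to be small. An optimizing compiler may also fold simple adjustments back into a plain comparison.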

A conditional expression in an if or while statement can be made more confusing by adding expressions that always evaluate to true or false but that appear to be real conditions from within the context of the expression:

volatile int bogus_value = rand() % 7;
if (value == MAGIC_CONSTANT) {
  ; /* success-handling code here */
} else if (bogus_value > 8) {
  ; /* this will never be true */
}

The volatile keyword is used here to prevent the compiler from optimizing the else if block out of existence; many "dead code" obfuscations will be recognized as such and discarded by an optimizing compiler. See Recipe 13.2 for a more in-depth discussion of compiler dead-code elimination optimizations.
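Another way to build a condition that always has the same outcome is an arithmetic identity. The sketch below assumes nothing beyond standard C; MAGIC_CONSTANT is given a hypothetical value, and check_value is an illustrative name. It relies on the fact that the product of two consecutive integers is always even.

```c
#include <stdlib.h>

#define MAGIC_CONSTANT 0x1234   /* hypothetical value for illustration */

static int check_value(int value) {
    volatile unsigned int noise = (unsigned int)rand();
    unsigned int a = noise;     /* two separate volatile reads, so the   */
    unsigned int b = noise;     /* compiler cannot assume that a == b    */

    /* a and b hold the same number, so a * (b + 1) is a product of
       consecutive integers and its low bit is always 0, even after
       unsigned wraparound. */
    if (value == MAGIC_CONSTANT && ((a * (b + 1u)) & 1u) == 0u) {
        return 1;               /* success-handling path */
    } else if ((a * (b + 1u)) & 1u) {
        return -1;              /* never reached in practice */
    }
    return 0;
}
```

Reading the volatile variable twice keeps the compiler from proving the identity and deleting the bogus branch, in the same way that volatile protects the else if block in the example above.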

The best type of bogus condition involves entirely unrelated data, thereby implying that a connection exists between the data in the real and the bogus conditions. Function pointers are ideal candidates for this type of obfuscation:

volatile int const_value = (int) printf;
if (value == MAGIC_CONSTANT && (const_value & 0xFFFF0000)) {
  ; /* success-handling code here */
}

Because library functions are loaded into a predictable range of memory, the upper half of a library function's address can be used as a runtime constant. In the previous code, the second half of the logical AND operation always evaluates to true.

Most programs link to shared libraries using dynamic linking resolved by the program loader, which creates references to the shared library functions at the point where they are called. To make compiled code more difficult to understand, shared library functions should be referenced as far away as possible from the calls to them, if not replaced entirely with custom code. By explicitly loading a library with functions like dlopen( ) on Unix or LoadLibrary( ) on Windows, you can refer only to the function pointers where the function is called. The function pointers can be reused during the course of execution so that different library functions are stored in the same function pointer. Alternatively, a function can be used to return the function pointer from a list or table of such pointers, thereby frustrating automatic analysis:

#ifdef WIN32
#include <windows.h>
#define SPC_C_RUNTIME              "msvcrt.dll"
#define SPC_LIBRARY_TYPE           HMODULE
#define SPC_LOAD_LIBRARY(name)     LoadLibrary((name))
#define SPC_RESOLVE_SYM(lib, name) ((void *)GetProcAddress((lib), (name)))
#else
#include <dlfcn.h>
#define SPC_C_RUNTIME              "libc.so"
#define SPC_LIBRARY_TYPE           void *
#define SPC_LOAD_LIBRARY(name)     dlopen((name), RTLD_LAZY)
#define SPC_RESOLVE_SYM(lib, name) dlsym((lib), (name))
#endif

enum file_op_enum {
  fileop_open, fileop_close, fileop_read, fileop_write, fileop_seek
};

/* Returns a pointer to the requested file function, resolving it on first use. */
void *file_op(enum file_op_enum op) {
  static SPC_LIBRARY_TYPE lib = 0;
  static struct FILEOP {
    void *open, *close, *read, *write, *seek;
  } s = {0};

  if (!lib) lib = SPC_LOAD_LIBRARY(SPC_C_RUNTIME);
  switch (op) {
    case fileop_open:
      if (!s.open) s.open = SPC_RESOLVE_SYM(lib, "open");
      return s.open;
    case fileop_close:
      if (!s.close) s.close = SPC_RESOLVE_SYM(lib, "close");
      return s.close;
    case fileop_read:
      if (!s.read) s.read = SPC_RESOLVE_SYM(lib, "read");
      return s.read;
    case fileop_write:
      if (!s.write) s.write = SPC_RESOLVE_SYM(lib, "write");
      return s.write;
    case fileop_seek:
      /* the C library exports the seek operation as lseek */
      if (!s.seek) s.seek = SPC_RESOLVE_SYM(lib, "lseek");
      return s.seek;
  }
  return 0;
}

The names of the libraries and functions should of course be stored as encrypted strings (see Recipe 12.11) to provide the best possible obfuscation; additional unused library and function names can be stored in plaintext to mislead the analyst.
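The table-lookup and pointer-reuse ideas described above can be tried without any dynamic loading. The following minimal sketch uses three <ctype.h> functions because they share one signature; char_fn, char_op, and upper_if_not_digit are hypothetical names chosen for this illustration.

```c
#include <ctype.h>

typedef int (*char_fn)(int);

enum char_op_enum { charop_upper, charop_lower, charop_isdigit };

/* Returns a library function from a table, so that call sites never
   name the function they end up calling. */
static char_fn char_op(enum char_op_enum op) {
    static const char_fn table[] = { toupper, tolower, isdigit };
    return table[op];
}

/* A single pointer variable holds different library functions over time. */
static int upper_if_not_digit(int c) {
    char_fn fn = char_op(charop_isdigit);

    if (fn(c)) return c;            /* fn is isdigit here          */
    fn = char_op(charop_upper);     /* same pointer, now toupper   */
    return fn(c);
}
```

Every call in the compiled code goes through an indirect call on fn, so a disassembler cannot label the call targets without tracking the table index at run time.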

12.3.4 See Also

Recipe 12.11, Recipe 13.2