Intro to Reverse Engineering

Created On 02. May 2021

Updated: 2021-06-06 01:41:01.809061000 +0000

Created By: acidghost

Reverse Engineering is a tedious process, but just like everything else, it's rewarding in a special way. Before getting into the heavy stuff, let's define first what is it.The concept of reverse engineering dates to around the times of industrial revolution. It was the act of extracting information from human made objects. Before computers and software became popular, this was done on engines and other electronic devices that could be physically dissected and investigated. Reverse engineering is present also in sciences, such as biology where certain cells are examined under a microscope. In software, reverse engineering refers to the act of retrieving the details of design and implementation of a program for which no documentation or source code is available. Software reverse engineering involves code breaking, puzzle solving, programming and logical analysis. In the industry it is useful for software security and development. In security, reverse engineering plays a central part in malware analysis while in development it is often helpful for analyzing the quality of third party code. In development, it quite often intersects with the legal questions about reversing a certain software. I won't be touching and discussing anything that is not legally permitted and would also ask the readers to make their extensive research on this topic to keep everyone happy. On my blog I will be concentrating on GNU software and diverse malware, that anyone can modify without worrying about any legal consequences. One of the great parts about this is that someone well versed in reverse engineering can usually patch themselves and continue working with a nice 3rd party tool. Bugs are not rare and it is certain that with use of any software they will be encountered quite often. While reverse engineering opens up many doors it becomes often an almost impossible job not only because of anti-reversing techniques deployed in the software, but also due to the stripped code after the compilation process.

The Compilation Process

When a program is compiled, it is translated to assembly and from there an executable is born. Let's see go through this process in a C program:

# 1 "hi.c"
// array of characters gets taken as input
void hi(char *name)
{
        printf("Hi! %s!\n", name);
}

int main()
// in this variable is the name and the size of the buffer that gets taken as input above
{      
        char name[128];
        scanf("%1000s", name);
        say_hi(name);
}

In the program above, we have some information in the comments about what the program does. When the program is compiled like thiscpp hi.c, we will see that the comments are gone. The cpp processor handles all the include, macros and removes the comments. If we would include a library in the program with for example #include <stdio.h> before the compilation, all the libraries from stdio.h would be included in the compiled program (therefore the library will be contained statically in the file).
After this the program gets assembled. With gcc -masm-intel hi.c, the program can be compiled into assembly. It can be viewed from the hi.s file. If we would look into the result of the compilation, we will notice that the character array char name is gone and becomes just a memory reference.
The program can be finished to be compiled with gcc -o hi hi.s. Like this we will still see in the memory the text location of hi as a separate function, however, software is usually also stripped during the final compilation stage. It can be stripped like this strip hi. This will remove all the unnecessary metadata and hide the functions, that previously were directly visible in the text section as a separate hi function.

Data Access

When reversing a file the data can be in:

  • .data - pre-initialized global writable data
  • .rodata - global read only data
  • .bss - uninitialized writable global data
  • stack - statically-allocated local variables
  • heap - dynamically-allocated variables

A big confusion at the beginning is usually understanding where does the data go when it's pushed and poped? With the push instruction, the data will be stored on the top/left of the stack. Pop will retrieve the data from the top/left of the stack. When we are storing or retrieving the data from the stack, we can access it by using rsp or rbp relative access. When you push, rsp is decreased by 8. When you pop rsp is increased by 8. With rsp the offsets will be positive, because it points to the top/left edge of the function frame. When you push, rbp is increased by 8. When you pop, rbp is decreased by 8. rbp has negative offsets, because it points to the lower/right edge of the frame.
An example would be:

push 0x01        ; 1 is now on the top of the stack
mov rax, rsp     ; 1 is now passed from the top of the stack into rax

The stack is growing upwards or to the left. There is no conceptual difference between the two representations and it is just to personal preference.

| Stack  |  🠕
|        |  |
|        |  |
|        |  |          <----------
|        |  |         ____________________
|   rsp  |          
|        |          Stack     rsp     rbp
|        |            ____________________
|   rbp  |           

The .bss, .rodata and .data are stored at known offsets from the program code and can be accesed via rip-relative instructions. A rip-relative instruction looks like this: mov rax, [rip+0x10000] - where 0x10000 is the offset.
Data that is stored on the heap cannot be accessed directly. The heap data is mostly stored in memory or on the stack. This data is accessed by a stack stored reference.

mov rax, [rsp] ; the data at the stack pointer is dereferenced into rax
               ; the data that was in the memory location at rsp is now in rax
mov rdx, [rax]; that data is now dereferenced from rax into rdx.

Modules and Functions

Software often contains public libraries. When reversing such software, these libraries can make lives easier, because they have well documented functionality.
Functions have goals that can be directly defined. For example, a function can get some data, make a calculation or do some other specific action. They can be reverse engineered in isolation. When analyzing functions, they will be usually represented in separate blocks. When a function is called (in assembly call mnemonic), the address that the called function should return to is implicitly pushed onto the stack. This return address is implicitly popped when the function returns.
Every function has a function frame.

a representation of a function frame
 _________________________________________________________________________________________
|                                                                                                                             
|Stack | buffer | saved rbp | argc | arg[0] | arg[1] | null | envp[1] | null | strings ..   
|_________________________________________________________________________________________

A function frame is the space on the stack where all the addresses, variables, strings etc. are. They also have a prologue and epilogue. In the prologue, the space for the data that will be used is created. This space is known as buffer. Below is described the process how this happens.

  • Prologue
  1. save off the caller's base pointer (push rbp)
  2. set the current stack pointer as the base pointer (mov rbp, rsp)
  3. "allocate" space on the stack - subtract the stack pointer (sub rsp)

The epilogue prepares the function to return. It frees the allocated space on the stack, returns the value into rbp back and returns the control flow to the caller.

  • Epilogue
  1. "deallocate" the stack (mov rsp, rbp)
  2. restore the old base pointer (pop rbp)
  3. return ( ret 0)

Function prologues and epilogues is usually how disassemblers detect the functions.

Disassemblers

A disassembler decodes the binary machine code into a readable assembly syntax. Since assembly language is architecture and platform specific, a disassembler will be as well. However, since reversing is such a hot topic, many disassemblers have support for more than one specific platform or architecture.

Decompilers

These are the dream of every reverse prodigy. Decompilers produce from the low-level assembly a representation of the code in a high-level language. Decompilation is mostly an experimental feature and while some tools have it, many of them won't do a great job at it. Reverse engineers should still always look for answers first in the disassembled view. This feature is highly appreciated though and is always useful to have by side a decompiled perspective as well.

a function seen in the ghidra decompiler used in cutter

Static and Dynamic Tools

Static tools are usually more simple as they analyse the program at rest. These are the same tools that can disassemble and give information on elf files. Some of them are mentioned in Binary Files and Processes in Linux. Few others are nm, checksec, and strings.
Dynamic tools have more advanced functionality that allow to analyze the program in action. Some tools that can analyze the program at runtime are strace, rr and qira.

Reversing Tools

Without the right tool to reverse something, the job just can't be done (obviously :wink:). For newcomers it is easy to get confused which tool to use. There are thousands of them and some are free while the basic package in others can hit up to $3000. Unfortunately, there is no universal one tool that can do everything. This is because reversing often requires different approaches. Not every tool supports every available file format there, not all of them can decompile or have an own scripting language and so on. All these little features make lives easier and that's why there is such a overhead of amount and price between them.

1. GDB

Gdb comes already by default in some Linux distributions. It is certainly a good start for any beginner, but can also become your worst enemy. Compared many other tools, the vanilla gdb is just a cold blank mysterious command line tool in your terminal. Yes, it doesn't have an UI :shipit:. However, don't let this make your day worse, since it can do a lot of things, really! One of the things gdb shines at is using it as a debugger. Here are some instructions:

  • x/s - print string;
  • x/s $rip - print string from the address that is currently being pointed at in the memory;
  • x/gx $rip - print qwords;
  • x/2dx %rip - dwords;
  • x/4hx $rip - halfword;
  • x/8b $rip - bytes.
  • si - step to the next instruction - into the call
  • rsi - step backwards - into the call;
  • ni - step to the next instruction - over call instructions
  • rni - step backwards - over call instructions
  • c - continue to the next breakpoint
  • rc - go back at the beginning
  • int3 - gdb instruction analog to a breakpoint. This is what gdb sets when it breaks somewhere.

However, for anyone starting out reverse engineering, the vanilla gdb is not the best tool. It has more alternatives, such Pwndbg, that display more information by default and make reversing easier especially for beginners. See more on this here.


default fanciness of Pwndbg after starting a binary

default view in vanilla gdb after starting a binary and displaying next 24 instructions

2. Binary Ninja


a function graph in Binary Ninja
This is a tool that comes in an online and downloadable version. Luckily, the online version is free. While you won't get much debugging there, it shows beautiful graphs and supports many features such as disassembly and decompilation. This is certainly one of the best tools for any beginner to start out on their reversing journey.

3. Radare2

Radare2 is an awesome tool and probably my favorite. It is free and works cross platform and Linux, Windows and MacOS. In recent few years its developer base has grown a lot and they have been pushing awesome features. While radare2 runs in terminal, it can draw nice graphs, debug, decompile and even has its own scripting language. However, radare2 is quite complex and involves a steep curve in mastering it. Still it is one of the strongest competitors out there.

4. Cutter

Cutter in it's design acts as a GUI version of radare2. While experienced reversers might hit the limit fast, it is still worth the attention as it can decompile and draw beautiful graphs as well. Additionally, it has some unique to itself features, that every reverser has to see :wink:

5. IDA Pro

The gold standard of reverse engineering is IDA Pro. Every professional reverse engineer must have worked with it. Without describing it more, if you can afford, then it's definitively worth checking it out.

6. Ghidra

Ghidra is a reversing tool created by NSA. It has been made open source after more years from its launch and certainly it is one of the most powerful ones.

7. Angr

Angr is an academic binary analysis framework and is also definitively worth a try along the reversing journey.

Anti-reversing techniques

Anti-reversing techniques are deployed in software to hinder the attackers get the information they want. However, there is no way to entirely prevent reversing. What can be done, is to hinder and make reversers job harder. Eventually the evil reverse engineers would give up, however, if they are very skilled and motivated, then chances that an anti-reversing technique will work is quite low.
There are few popular techniques deployed in software for anti-reversing.

1. Destroy all symbolic information

Similarly to stripping the binary, destroying all symbolic information would imply that such information as class names, class member names, names of instantiated global objects must not exist.

2. Antidebugger code

In this approach, usually a live analysis is hindered. Thea idea is to damage or terminate the debugger while its being stepped through. There are also other more sophisticated ways of dealing with attached debuggers.

3. Obfuscation

When the code is obfuscated, usually the symbols in a program are turned into meaningless sequence of characters. One downside of this, is that an obfuscated program will run slower. Ironically, if a program is obfuscated and undergoes a license check online, it could have considerable performance handicaps (depending on the resource load), compared to a patched version that can run offline.
A simple example of obfuscation that can be used as an anti-reversing technique would be:

sprintf(buf, "%c%c%s", "h", "i", "friend");

References

Eilam Eldad - Reversing: Secrets of Reverse Engineering
Dennis Yurichev - Reverse Engineering for Beginners
https://pwn.college/modules/reversing

Section: Reverse Engineering

Back