
Introduction

This project aims to give a simple overview of how good various x64 hooking engines (on Windows) are. I’ll try to write various functions that are hard to patch and then see how each hooking engine does.

I’ll test the engines that show up in the results below:

  • PolyHook
  • MinHook
  • MHook

(I’d like to test Detours, but I’m not willing to pay for it. So that isn’t tested :( )

There are multiple things that make hooking difficult. Maybe you want to patch while the application is running -- in that case you might get race conditions, as the application may execute your half-finished hook. Maybe the software has some self-protection features (or other software on the system provides them, e.g. Trusteer Rapport).

Evaluating how the hooking engines stack up against that is not the goal here. Neither are non-functional criteria, like how fast it is or how much memory it needs for each hook. This is just about the challenges the function to be hooked itself poses.

Namely:

  • Are jumps relocated?
  • What about RIP-relative addressing?
  • If there’s a loop at the beginning / if it’s a tail recursive function, does the hooking engine handle it?
  • How good is the disassembler, how many instructions does it know?
  • Can it hook already hooked functions?

First I will give a short walkthrough of the architecture, then quickly go over the test cases. After that come the results and an evaluation for each engine.

I think I found a flaw in all of them; I’ll publish a small POC which should at least detect the existence of problematic code.

A word of caution: my results are worse than expected, so do assume I have made a mistake in using the libraries. I went into this expecting that some engines at least would try to detect e.g. the loops back into the first few bytes. But none did? That’s gotta be wrong.

Another word of caution: parts of this are rushed and/or ugly. Please double check parts that seem suspicious. And I’d love to get patches, even for the most trivial things -- spelling mistakes? Yes please.

Architecture

This project is made up of two parts: a .DLL with the test cases and an .exe that hooks those, tests whether they still work and prints the results.

(I could have done it all in the .exe, but this makes it trivial to (at some point) force the hook function and the target function to be more than 2GB apart. Just set fixed image bases in the project settings and you’re done.)
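
The 2GB figure comes from the signed 32-bit displacement of the 5-byte JMP rel32 (opcode E9), which is measured from the end of the jump. A minimal sketch of that reachability check follows; the function names and the example addresses are made up for illustration and are not taken from any of the tested engines.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// True if a 5-byte `E9 rel32` placed at `from` can reach `to`.
bool reachable_with_jmp_rel32(std::uint64_t from, std::uint64_t to) {
    // The displacement is relative to the address right after the jump.
    const std::int64_t rel = static_cast<std::int64_t>(to - (from + 5));
    return rel >= INT32_MIN && rel <= INT32_MAX;
}

// Encodes `jmp to` at address `from` as E9 followed by a little-endian rel32.
std::array<std::uint8_t, 5> encode_jmp_rel32(std::uint64_t from, std::uint64_t to) {
    assert(reachable_with_jmp_rel32(from, to));
    const auto rel = static_cast<std::uint32_t>(to - (from + 5));
    std::array<std::uint8_t, 5> out{};
    out[0] = 0xE9;
    for (int i = 0; i < 4; ++i) {
        out[1 + i] = static_cast<std::uint8_t>(rel >> (8 * i));
    }
    return out;
}
```

With two image bases further apart than that, reachable_with_jmp_rel32 returns false and an engine has to fall back to a longer absolute jump or a relay close to the target.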

My main concern was automatically identifying whether the hook worked. I consider a hook to work if: a) the original function can still execute successfully and b) the hook was called.

Criterion a) is really similar to a unit test: verify that a function returns what is expected. So for a) the .exe just runs unit tests after all the hooks have been applied. Each failing function is reported (or the program crashes and I can look at the call stack), so I can correlate that with the hooking engine I’m currently testing and see where it fails. I’ve used Catch2 for the unit tests, because I wanted to try it anyway.

From the get-go it was clear that I wanted to test multiple hooking engines. And they all needed to do the same steps in the same order -- so I implemented a basic AbstractHookingEngine with a boolean for every test case and made a child class for each engine. The child classes have to override hook_all and unhook_all. In between the calls to those, the unit tests run.
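
Roughly, that structure could look like the sketch below. Only AbstractHookingEngine, hook_all and unhook_all are names from this project; FakeEngine, small_fn, small_called and hook_works are invented for the example. The fake engine swaps a std::function instead of patching machine code, and I assume here that the per-test-case boolean records "the hook was called".

```cpp
#include <cassert>
#include <functional>

// Stand-in for one DLL export; calling it through a std::function lets the
// fake engine below "hook" it without touching any machine code.
inline int small_impl() { return 0; }
std::function<int()> small_fn = small_impl;

class AbstractHookingEngine {
public:
    virtual ~AbstractHookingEngine() = default;
    virtual bool hook_all() = 0;    // install every hook, true on success
    virtual bool unhook_all() = 0;  // remove every hook again

    // One boolean per test case (assumed meaning: "the hook ran").
    bool small_called = false;
    // ... one more flag for each remaining test case
};

class FakeEngine : public AbstractHookingEngine {
public:
    bool hook_all() override {
        original_ = small_fn;
        small_fn = [this] { small_called = true; return original_(); };
        return true;
    }
    bool unhook_all() override {
        assert(original_ && "unhook_all() called before hook_all()");
        small_fn = original_;
        return true;
    }

private:
    std::function<int()> original_;
};

// A hook "works" if a) the function still behaves and b) the hook was reached.
bool hook_works(AbstractHookingEngine& engine) {
    if (!engine.hook_all()) return false;
    const bool still_correct = (small_fn() == 0);  // stand-in for the unit tests
    const bool unhooked = engine.unhook_all();
    return still_correct && engine.small_called && unhooked;
}
```

A real child class would call into its hooking library inside hook_all and unhook_all; the order of the three steps in hook_works is the part that has to be identical for every engine.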

Test case: Small

This is just a very small function; it is smaller than the hook code will be - so how does the library react?

_small:
	xor eax, eax
	ret
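
To put numbers on "smaller than the hook code": assuming the usual encodings (31 C0 for xor eax, eax and C3 for ret), _small is 3 bytes long. A JMP rel32 needs 5 bytes, and the indirect FF 25 form with an inline 64-bit address needs 14. Which of these a given engine emits is not covered here, so the sizes in this sketch are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed machine code of _small: xor eax, eax ; ret
const std::uint8_t kSmallCode[] = {0x31, 0xC0, 0xC3};

const std::size_t kJmpRel32Size = 5;   // E9 + rel32
const std::size_t kJmpAbs64Size = 14;  // FF 25 00 00 00 00 + 8-byte address

// Number of bytes a patch would clobber beyond the end of the function.
std::size_t patch_overrun(std::size_t function_size, std::size_t patch_size) {
    assert(function_size > 0 && patch_size > 0);
    return patch_size > function_size ? patch_size - function_size : 0;
}
```

For _small this gives an overrun of 2 bytes for the short jump and 11 for the long one. Whether those bytes are harmless padding or the start of the next function depends on how the code is laid out.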

Test case: Branch

Instead of the FASM code I’ll show the disassembled version, so you can see the instruction lengths & offsets.

0026 | 48 83 E0 01 | and rax,1
002A | 74 17       | je test_cases.0043 ----+
002C | 48 31 C0    | xor rax,rax            |
002F | 90          | nop                    |
0030 | 90          | nop                    |
0031 | 90          | nop                    |
0032 | 90          | nop                    |
0033 | 90          | nop                    |
0034 | 90          | nop                    |
0035 | 90          | nop                    |
0036 | 90          | nop                    |
0037 | 90          | nop                    |
0038 | 90          | nop                    |
0039 | 90          | nop                    |
003A | 90          | nop                    |
003B | 90          | nop                    |
003C | 90          | nop                    |
003D | 90          | nop                    |
003E | 90          | nop                    |
003F | 90          | nop                    |
0040 | 90          | nop                    |
0041 | 90          | nop                    |
0042 | 90          | nop                    |
0043 | C3          | ret  <-----------------+

This function has a branch in the first 5 bytes. Hooking it detour-style isn’t possible without fixing that branch in the trampoline. The NOP sled is just so the hooking engine can’t cheat and just put the whole function into the trampoline. Instead the jump in the trampoline needs to be modified so it jumps back to the original destination.
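
The fix-up for that branch can be sketched as follows. The short je (74 17 at offset 002A) targets 002A + 2 + 17h = 0043. Once copied into a trampoline the 8-bit displacement can no longer reach that address, so one option is to re-encode it in the near form 0F 84 rel32. The helper names and the trampoline address used in the usage note are hypothetical, and the sketch assumes the trampoline sits within 2GB of the target.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Target of a short conditional jump (7x rel8) located at `addr`.
std::uint64_t short_jcc_target(std::uint64_t addr, std::uint8_t rel8) {
    // rel8 is signed and relative to the end of the 2-byte instruction.
    return addr + 2 + static_cast<std::int8_t>(rel8);
}

// Re-encodes `7x rel8` as the 6-byte near form `0F 8x rel32` at `new_addr`.
std::array<std::uint8_t, 6> relocate_short_jcc(std::uint8_t opcode,
                                               std::uint64_t target,
                                               std::uint64_t new_addr) {
    assert((opcode & 0xF0) == 0x70);  // only short Jcc opcodes 70..7F
    const std::int64_t rel = static_cast<std::int64_t>(target - (new_addr + 6));
    assert(rel >= INT32_MIN && rel <= INT32_MAX);  // must fit into rel32
    const auto rel32 = static_cast<std::uint32_t>(rel);

    std::array<std::uint8_t, 6> out{};
    out[0] = 0x0F;
    out[1] = static_cast<std::uint8_t>(0x80 | (opcode & 0x0F));  // same condition
    for (int i = 0; i < 4; ++i) {
        out[2 + i] = static_cast<std::uint8_t>(rel32 >> (8 * i));
    }
    return out;
}
```

For example, relocate_short_jcc(0x74, 0x43, 0x100004) yields a 0F 84 jump whose displacement still lands on offset 0043. The relocated instruction is 4 bytes longer than the original, which the trampoline layout has to account for.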

Test case: RIP relative

One of the new things in AMD64 is RIP-relative addressing. I guess the reason to include it was to make it easier to generate position-independent code (PIC) -- all references to data can now be made relative, instead of absolute. So it doesn’t matter anymore where the program is loaded into memory and there’s less need for the relocation table.

A quick and dirty[1] test for this is re-implementing the well-known C rand function.

public _rip_relative
_rip_relative:
	mov rax, qword[seed]
	mov ecx, 214013
	mul ecx
	add eax, 2531011
	mov [seed], eax

	shr eax, 16
	and eax, 0x7FFF
	ret

seed dd 1
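
For readers who don’t read FASM fluently, this is my understanding of what the routine above computes, written as C++. The function names are made up, and the seed is passed in explicitly instead of living next to the code.

```cpp
#include <cassert>
#include <cstdint>

// Linear congruential step with the same constants as the assembly above.
// The multiplication wraps modulo 2^32, just like `mul ecx` followed by
// storing only EAX back into the seed.
int rip_relative_equivalent(std::uint32_t& seed) {
    seed = seed * 214013u + 2531011u;
    return static_cast<int>((seed >> 16) & 0x7FFF);
}

// Usage: the first two values when starting from seed 1.
void rip_relative_demo() {
    std::uint32_t seed = 1;
    assert(rip_relative_equivalent(seed) == 41);
    assert(rip_relative_equivalent(seed) == 18467);
}
```

A unit test for criterion a) only needs to compare a few of these values against what the hooked function returns.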

The very first instruction uses RIP-relative addressing, thus it needs to be fixed in the trampoline.
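
A sketch of what "fixing" means here: the disp32 of a RIP-relative operand is relative to the end of the instruction, so moving the instruction changes what it points at. If the instruction length stays the same, the new displacement is the old one plus the distance the instruction was moved back by. The function names and the addresses in the usage note are invented; the 7-byte length corresponds to a mov rax, [rip+disp32] encoded as 48 8B 05 disp32.

```cpp
#include <cassert>
#include <cstdint>

// Absolute address referenced by a RIP-relative operand.
std::uint64_t rip_target(std::uint64_t addr, unsigned len, std::int32_t disp) {
    assert(len >= 1 && len <= 15);  // x86 instructions are at most 15 bytes
    return addr + len + static_cast<std::int64_t>(disp);
}

// Computes the displacement the copied instruction needs at `new_addr` so it
// still references the same data. Returns false if it no longer fits.
bool relocate_rip_disp(std::int32_t disp, std::uint64_t old_addr,
                       std::uint64_t new_addr, std::int32_t& new_disp) {
    // old_addr + len + disp == new_addr + len + new_disp
    const std::int64_t moved_by = static_cast<std::int64_t>(old_addr - new_addr);
    const std::int64_t wanted = disp + moved_by;
    if (wanted < INT32_MIN || wanted > INT32_MAX) return false;
    new_disp = static_cast<std::int32_t>(wanted);
    return true;
}
```

Moving a 7-byte instruction with disp 0x20 from 0x1000 to 0x500000 succeeds and keeps the target at 0x1027. Moving it 4GB away fails, and an engine then has to rewrite the instruction instead of just patching the displacement.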

Test case: AVX & RDRAND

The AMD64 instruction set is extended with every CPU generation. Because the hooking engines need to know the instruction lengths and their side effects to properly apply their hooks, they need to keep up.

The actual code in the test case is boring and doesn’t matter. I’m sure there are disagreements on whether I’ve picked good candidates of “exotic” or new instructions, but those were the first that came to mind.
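
To illustrate why an unknown instruction is fatal for a hooking engine, here is a deliberately tiny length decoder. It is a toy, not how any of the tested engines work: it knows four one- or two-byte opcodes and gives up on everything else. The two byte sequences are hypothetical examples (rdrand eax is 0F C7 F0, vzeroupper is C5 F8 77) and not the actual code of the test cases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Length of the instruction at `p`, or 0 if this toy decoder doesn't know it.
std::size_t toy_length(const std::uint8_t* p, std::size_t avail) {
    assert(p != nullptr);
    if (avail == 0) return 0;
    switch (p[0]) {
    case 0x90: return 1;                   // nop
    case 0xC3: return 1;                   // ret
    case 0x74: return avail >= 2 ? 2 : 0;  // je rel8
    case 0xE2: return avail >= 2 ? 2 : 0;  // loop rel8
    default:   return 0;                   // unknown: refuse to guess
    }
}

// Bytes that must move to the trampoline so that `patch_size` bytes are
// covered by whole instructions. Returns 0 if any instruction is unknown.
std::size_t bytes_to_relocate(const std::uint8_t* code, std::size_t size,
                              std::size_t patch_size) {
    std::size_t off = 0;
    while (off < patch_size) {
        const std::size_t len = toy_length(code + off, size - off);
        if (len == 0) return 0;
        off += len;
    }
    return off;
}

// Hypothetical function starts the toy decoder cannot handle.
const std::uint8_t kRdrandStart[] = {0x0F, 0xC7, 0xF0, 0xC3, 0x90, 0x90};
const std::uint8_t kVzeroupperStart[] = {0xC5, 0xF8, 0x77, 0xC3, 0x90, 0x90};
```

Returning 0 (and refusing to hook) is the safe reaction. Guessing a length would cut an instruction in half and corrupt the trampoline.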

Test case: loop and TailRec

My hypothesis before starting this evaluation was that those two cases would make most hooking engines fail. Back in the good ol’ days of x86, detour hooking didn’t require any special thought because the common prologue made room for the hook: PUSH EBP; MOV EBP, ESP is 3 bytes, with the hotpatching MOV EDI, EDI in front of it that makes exactly 5 bytes, and JMP +- 2GB is 5 bytes as well[2]. That isn’t so easy for AMD64: a) the hook sometimes needs to be way bigger b) due to changes in the calling convention and the general architecture of AMD64 there just isn’t a common prologue, used for almost all functions, anymore.

Those by themselves aren’t a problem, since the hooking engines can fix all the instructions they would overwrite. However I hypothesized that only a few would check whether the function contained a loop that jumps back into the instructions that have been overwritten. Consider this:

public _loop
_loop:
	mov rax, rcx
@loop_loop:
	mul rcx
	nop
	nop
	nop
	loop @loop_loop ; lol
	ret

There are only 3 bytes that can be safely overwritten. Right after that is the destination of the backwards jump. This is a very simple (and kinda pointless) function, so detecting that the loop might cause trouble shouldn’t be hard. Basically the same applies to the next example:

public _tail_recursion
_tail_recursion:
	test ecx, ecx
	je @is_0
	mov eax, ecx
	dec ecx
@loop:
	test ecx, ecx
	jz @tr_end

	mul ecx
	dec ecx

	jnz @loop
	jmp @tr_end
@is_0:
	mov eax, 1
@tr_end:
	ret
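
The check both examples call for can be sketched like this: collect the branches of the function and see whether one that stays behind in the original code lands inside the bytes the hook overwrites. This is not the POC mentioned in the introduction, just an illustration. The offsets assume the shortest encodings (e.g. 3 bytes for mov rax, rcx and 2 bytes for each short jump), and all names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A branch inside a function; both offsets are relative to the function start.
struct Branch {
    std::size_t from;
    std::size_t target;
};

// True if a branch that is not itself overwritten jumps into the middle of
// the patch. Branches inside the patch are skipped because they move to the
// trampoline, and offset 0 is the start of the hook jump, not its middle.
bool jumps_into_patch(const std::vector<Branch>& branches, std::size_t patch_size) {
    assert(patch_size > 0);
    for (const Branch& b : branches) {
        if (b.from < patch_size) continue;
        if (b.target > 0 && b.target < patch_size) return true;
    }
    return false;
}

// _loop: `loop @loop_loop` at offset 9 jumps back to offset 3.
const std::vector<Branch> kLoopBranches = {{9, 3}};

// _tail_recursion: je @is_0, jz @tr_end, jnz @loop (back to offset 8), jmp @tr_end.
const std::vector<Branch> kTailRecBranches = {{2, 20}, {10, 25}, {16, 8}, {18, 25}};
```

With a 5-byte patch only _loop is flagged; with a 14-byte patch both functions are. A real implementation would get the branch list from the disassembler it already uses for relocation.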

(Preliminary) Results

+----------+-----+------+------------+---+------+----+-------+
|      Name|Small|Branch|RIP Relative|AVX|RDRAND|Loop|TailRec|
+----------+-----+------+------------+---+------+----+-------+
|  PolyHook|  X  |  X   |     X      | X |      |    |       |
|   MinHook|  X  |  X   |     X      |   |      |    |   X   |
|     MHook|     |      |     X      |   |      |    |       |
+----------+-----+------+------------+---+------+----+-------+

[1] This is one of the things that could easily be improved, but hasn’t been because I just couldn’t motivate myself. Putting the data right after the func meant that a section containing code needed to be writable. Which is bad. Also I load the seed DWORD as a QWORD -- which only works because the upper half is then thrown away by the multiplication. It’s shitty code is what I’m saying.

In retrospect I should have used a jump table like a switch-case could be compiled into. That would be read only data. Oh well.

[2] And Microsoft decided at some point to make it even easier for their code with the advent of hotpatching.