Introduction
============

This project aims to give a simple overview of how good various x64 hooking
engines (on Windows) are. I'll try to write various functions that are hard to
patch and then see how each hooking engine does.

I'll test:

* [EasyHook](https://easyhook.github.io/)
* [PolyHook](https://github.com/stevemk14ebr/PolyHook)
* [MinHook](https://www.codeproject.com/Articles/44326/MinHook-The-Minimalistic-x-x-API-Hooking-Libra)
* [Mhook](http://codefromthe70s.org/mhook24.aspx)

(I'd like to test Detours, but I'm not willing to pay for it. So that isn't
tested :( )

There are multiple things that make hooking difficult. Maybe you want to patch
while the application is running -- in that case you might get race conditions,
as the application is executing your half-finished hook. Maybe the software has
some self-protection features (or other software on the system provides that,
e.g. Trusteer Rapport).

Evaluating how the hooking engines stack up against that is not the goal here.
Neither are non-functional criteria, like how fast an engine is or how much
memory it needs for each hook. This is just about the challenges that the
function to be hooked itself poses.

Namely:

* Are jumps relocated?
* What about RIP-relative addressing?
* If there's a loop at the beginning / if it's a tail-recursive function, does
  the hooking engine handle it?
* How good is the disassembler, how many instructions does it know?
* Can it hook already hooked functions?

At first I will give a short walkthrough of the architecture, then quickly go
over the test cases. After that come the results and an evaluation for each
engine.

I think I found a flaw in all of them; I'll publish a small POC which should at
least detect the existence of problematic code.

**A word of caution**: my results are worse than expected, so do assume I have
made a mistake in using the libraries. I went into this expecting that at least
some engines would try to detect e.g. the loops back into the first few
bytes. But none did? That's gotta be wrong.

**Another word of caution**: parts of this are rushed and/or ugly. Please
double check parts that seem suspicious. And I'd love to get patches, even for
the most trivial things -- spelling mistakes? Yes please.

Architecture
============

This project is made up of two parts: a .DLL with the test cases and an .exe
that hooks those, tests whether they still work and prints the results.

(I could have done it all in the .exe but this makes it trivial to (at some
point) force the function to be hooked and the target function to be further
apart than 2GB. Just set fixed image bases in the project settings and you're
done.)

My main concern was automatically identifying whether the hook worked. I
consider a hook to work if: a) the original function can still execute
successfully *and* b) the hook was called.

Criterion a) is really similar to a unit test: verify that a function
returns what is expected. So for a) the .exe just runs unit tests after all the
hooks have been applied. Each failing function is reported (or the program
crashes and I can look at the call stack) so I can correlate that with the
hooking engine I'm currently testing and see where it fails. I've used
Catch2 for the unit tests, because I wanted to try it anyway.

From the get-go it was clear that I wanted to test multiple hooking engines.
And they all needed to do the same steps in the same order -- so I implemented
a basic AbstractHookingEngine with a boolean for every test case and made a
child class for each engine. The child classes have to override `hook_all`
and `unhook_all`. In between the calls to those, the unit tests run.
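
To make that layout a bit more concrete, here is a minimal sketch of how such a
base class could look. Only `AbstractHookingEngine`, `hook_all` and
`unhook_all` come from the description above; the flag names, `NullEngine` and
`run_engine` are made up for illustration and are not the actual project code.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch: only the class name and hook_all/unhook_all are taken
// from the text, everything else is an assumption.
class AbstractHookingEngine {
public:
    explicit AbstractHookingEngine(std::string name) : name_(std::move(name)) {}
    virtual ~AbstractHookingEngine() = default;

    // Each engine installs / removes its hooks for all test cases here.
    virtual bool hook_all() = 0;
    virtual bool unhook_all() = 0;

    const std::string& name() const { return name_; }

    // One boolean per test case; a hook function would set its flag when called.
    bool small_called = false;
    bool branch_called = false;
    bool rip_relative_called = false;
    bool loop_called = false;

private:
    std::string name_;
};

// Stand-in engine that hooks nothing, just to show the shape of a child class.
class NullEngine : public AbstractHookingEngine {
public:
    NullEngine() : AbstractHookingEngine("null") {}
    bool hook_all() override { return true; }
    bool unhook_all() override { return true; }
};

// Same steps in the same order for every engine:
// hook everything, run the unit tests, unhook everything.
inline bool run_engine(AbstractHookingEngine& engine,
                       const std::function<bool()>& run_unit_tests) {
    if (!engine.hook_all()) return false;
    const bool originals_ok = run_unit_tests();
    const bool unhooked = engine.unhook_all();
    return originals_ok && unhooked;
}
```

In the real project the unit test step is Catch2; here it is just a callable
so the sketch stays self-contained.
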

Test case: Small
================

This is just a very small function; it is smaller than the hook code will be --
so how does the library react?

    _small:
        xor eax, eax
        ret

Test case: Branch
=================

Instead of the FASM code I'll show the disassembled version, so you can see the
instruction lengths & offsets.

    0026 | 48 83 E0 01 | and rax,1
    002A | 74 17       | je test_cases.0043  --+
    002C | 48 31 C0    | xor rax,rax           |
    002F | 90          | nop                   |
    0030 | 90          | nop                   |
    ...                                        |
    0041 | 90          | nop                   |
    0042 | 90          | nop                   |
    0043 | C3          | ret  <----------------+

This function has a branch in the first 5 bytes. Hooking it detour-style isn't
possible without fixing that branch in the trampoline. The NOP sled is just so
the hooking engine can't cheat and just put the whole function into the
trampoline. Instead the jump in the trampoline needs to be modified so it jumps
back to the original destination.
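
As a rough illustration of what "fixing that branch" means: the short `je`
(`74 17`) only has an 8 bit displacement, so once it lives in a far away
trampoline it has to be re-encoded as the near form (`0F 84 rel32`) with a
recalculated displacement. The function below is my own sketch with made-up
names, not code from any of the tested engines.

```cpp
#include <array>
#include <cstdint>
#include <limits>
#include <optional>

// Re-encode a short conditional jump (7x rel8) located at `old_addr` as a near
// conditional jump (0F 8x rel32) located at `new_addr`, keeping the target.
// Returns nothing if the opcode is not a Jcc rel8 or the target is out of reach.
inline std::optional<std::array<std::uint8_t, 6>>
relocate_short_jcc(std::uint8_t opcode, std::int8_t rel8,
                   std::uint64_t old_addr, std::uint64_t new_addr) {
    if ((opcode & 0xF0) != 0x70) return std::nullopt;

    // rel8 is relative to the end of the 2 byte short instruction.
    const std::int64_t target = static_cast<std::int64_t>(old_addr) + 2 + rel8;
    // rel32 is relative to the end of the 6 byte near instruction.
    const std::int64_t rel32 = target - (static_cast<std::int64_t>(new_addr) + 6);
    if (rel32 < std::numeric_limits<std::int32_t>::min() ||
        rel32 > std::numeric_limits<std::int32_t>::max())
        return std::nullopt;

    const auto d = static_cast<std::uint32_t>(static_cast<std::int32_t>(rel32));
    return std::array<std::uint8_t, 6>{
        0x0F,
        static_cast<std::uint8_t>(0x80 | (opcode & 0x0F)),  // same condition code
        static_cast<std::uint8_t>(d),
        static_cast<std::uint8_t>(d >> 8),
        static_cast<std::uint8_t>(d >> 16),
        static_cast<std::uint8_t>(d >> 24)};
}
```

For the `je` at 002A the target works out to 002A + 2 + 17h = 0043, which
matches the listing. If the trampoline ends up more than 2GB away the function
gives up, and an engine would need some other trick there.
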

Test case: RIP relative
=======================

One of the new things in AMD64 is RIP-relative addressing. I guess the reason
to include it was to make it easier to generate position-independent code
(PIC) -- all references to data can now be made relative, instead of absolute.
So it doesn't matter anymore where the program is loaded into memory and
there's less need for the relocation table.

A quick and dirty[1] test for this is re-implementing the well-known C rand
function.

    public _rip_relative
    _rip_relative:
        mov rax, qword[seed]
        mov ecx, 214013
        mul ecx
        add eax, 2531011
        mov [seed], eax
        shr eax, 16
        and eax, 0x7FFF
        ret
    seed dd 1

The very first instruction uses RIP-relative addressing, thus it needs to be
fixed in the trampoline.
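
The fix itself is just arithmetic. A RIP-relative operand points at
`address of instruction + instruction length + disp32`. If the instruction is
copied unchanged to another address, the length stays the same, so the
displacement has to absorb the distance the instruction moved. A small sketch
(the function name is mine):

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Recompute the disp32 of a RIP-relative instruction that is copied from
// `old_addr` to `new_addr` so that it still references the same data.
//   old_addr + len + old_disp == new_addr + len + new_disp
// The instruction length cancels out, so it is not needed as a parameter.
inline std::optional<std::int32_t>
rebase_rip_disp32(std::int32_t old_disp, std::uint64_t old_addr,
                  std::uint64_t new_addr) {
    const std::int64_t new_disp = static_cast<std::int64_t>(old_disp) +
                                  static_cast<std::int64_t>(old_addr) -
                                  static_cast<std::int64_t>(new_addr);
    // The data must stay within +-2GB of the relocated instruction.
    if (new_disp < std::numeric_limits<std::int32_t>::min() ||
        new_disp > std::numeric_limits<std::int32_t>::max())
        return std::nullopt;
    return static_cast<std::int32_t>(new_disp);
}
```

The range check is why the 2GB distance mentioned in the architecture section
matters: if the trampoline is too far from the original function, the
displacement no longer fits and the instruction can't simply be copied.
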

Test case: AVX & RDRAND
=======================

The AMD64 instruction set is extended with every CPU generation. Because the
hooking engines need to know the instruction lengths and their side effects to
properly apply their hooks, they need to keep up.

The actual code in the test case is boring and doesn't matter. I'm sure there
are disagreements on whether I've picked good candidates for "exotic" or new
instructions, but those were the first that came to mind.

(It's also doubtful whether you'll ever encounter functions where the first
instructions are of this category, because most probably there's some setup
needed before, e.g. checking that addresses are aligned, initializing loop
counters, yadda, yadda.)

Test case: loop and TailRec
===========================

My hypothesis before starting this evaluation was that those two cases would
make most hooking engines fail. Back in the good ol' days of x86, detour hooking
didn't require any special thought because the prologue lined up neatly with
the hook itself -- `PUSH EBP; MOV EBP, ESP` (3 bytes, or exactly 5 bytes with
the hotpatching `MOV EDI, EDI` in front) and 5 bytes for `JMP +-2GB`[2]. That
isn't so easy for AMD64: a) the hook sometimes needs to be *way* bigger, and
b) due to changes in the calling convention and the general architecture
of AMD64 there just isn't a common prologue, used for almost all functions,
anymore.
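
To put numbers on "way bigger": a `JMP rel32` (`E9`) is 5 bytes but only reaches
+-2GB. One common way to jump anywhere on AMD64 is `JMP qword ptr [RIP+0]`
(`FF 25 00 00 00 00`) followed by the 8 byte absolute address, which is 14
bytes. The sketch below picks between the two. It is only an illustration;
I'm not claiming any of the tested engines emit exactly these bytes.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Encode a jump from `from` to `to`: the 5 byte relative form if the target is
// in range, otherwise a 14 byte indirect jump through an inline address.
inline std::vector<std::uint8_t> encode_hook_jump(std::uint64_t from,
                                                  std::uint64_t to) {
    // rel32 is relative to the end of the 5 byte instruction.
    const std::int64_t rel = static_cast<std::int64_t>(to) -
                             static_cast<std::int64_t>(from + 5);
    std::vector<std::uint8_t> code;
    if (rel >= std::numeric_limits<std::int32_t>::min() &&
        rel <= std::numeric_limits<std::int32_t>::max()) {
        const auto d = static_cast<std::uint32_t>(rel);
        code.push_back(0xE9);  // jmp rel32
        for (int i = 0; i < 4; ++i)
            code.push_back(static_cast<std::uint8_t>(d >> (8 * i)));
        return code;
    }
    code = {0xFF, 0x25, 0x00, 0x00, 0x00, 0x00};  // jmp qword ptr [rip+0]
    for (int i = 0; i < 8; ++i)                   // absolute target, little endian
        code.push_back(static_cast<std::uint8_t>(to >> (8 * i)));
    return code;
}
```

So depending on where the hook function lives, between 5 and 14 bytes of the
original function get overwritten, and every instruction touched by that has to
be moved to the trampoline.
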

Those by themselves aren't a problem, since the hooking engines can fix all the
instructions they would overwrite. However I hypothesized that only a few would
check whether the function contained a loop that jumps back into the
instructions that have been overwritten. Consider this:

    public _loop
    _loop:
        mov rax, rcx
    @loop_loop:
        mul rcx
        nop
        nop
        nop
        loop @loop_loop ; lol
        ret

There are only 3 bytes that can be safely overwritten. Right after that is the
destination of the jump backwards. This is a very simple (and kinda pointless)
function, so detecting that the loop might lead to problems shouldn't be
hard. But consider what happens with MHook (and all the others):

    _loop original:
    008C | 48 89 C8       | mov rax,rcx
    008F | 48 F7 E1       | mul rcx
    0092 | 90             | nop
    0093 | 90             | nop
    0094 | 90             | nop
    0095 | E2 F8          | loop test_cases.008F
    0097 | C3             | ret

    _loop hooked:
    008C | E9 0F 69 23 00 | jmp <MHook_Hooks::hookLoop>
    0091 | E1 90          | loope test_cases.0023
    0093 | 90             | nop
    0094 | 90             | nop
    0095 | E2 F8          | loop test_cases.008F
    0097 | C3             | ret

    trampoline:
    00007FFF7CD200C0 | 48 89 C8       | mov rax,rcx
    00007FFF7CD200C3 | 48 F7 E1       | mul rcx
    00007FFF7CD200C6 | E9 C7 96 DC FF | jmp test_cases.0092

The trampoline jumps back into the original function, which then executes:

    0092 | 90             | nop
    0093 | 90             | nop
    0094 | 90             | nop
    0095 | E2 F8          | loop test_cases.008F

But that jumps back into the middle of the hook's jump and thus executes:

    008F | 23 00          | and eax,dword ptr ds:[rax]
    0091 | E1 90          | loope test_cases.0023

Which isn't right and will crash horribly.

(Preliminary) Results
=====================

    +----------+-------+--------+--------------+-----+--------+------+---------+
    | Name     | Small | Branch | RIP Relative | AVX | RDRAND | Loop | TailRec |
    +----------+-------+--------+--------------+-----+--------+------+---------+
    | PolyHook |   X   |   X    |      X       |  X  |        |      |         |
    | MinHook  |   X   |   X    |      X       |     |        |      |    X    |
    | MHook    |       |        |      X       |     |        |      |         |
    +----------+-------+--------+--------------+-----+--------+------+---------+

As expected nothing could correctly hook the loop. In fact I had to comment out
those parts because even Catch2 couldn't recover from the crashes generated by
the botched hooks. Some hooking engines are a bit lacking in their support for
newer instruction sets, but a simple update of the disassembler library should
fix that.

I was pleasantly surprised by MinHook, both by the general API and because it
managed to build a trampoline that worked perfectly even for the tail
recursion case. I'd recommend it, even though it seems there's no chance that
the disassembler will ever be updated.

Detecting tail recursive functions / loops into overwritten code
================================================================

Back in 2015 I wanted to write my own hooking engine which would be able to
hook ALL THE FUNCTIONS! And I did actually start to write it and then
abandoned it before I got to the interesting part. However, since then I've had
the basic idea down:

1) Find out how long the function is
2) Analyze it, by checking whether some jump could jump into the overwritten
   instructions
3) Somehow fix that

Fixing that code probably means putting the whole function in the trampoline;
by definition there is no space in the original location to put the
additional/longer instructions. However I think that hooking engines should at
least fail fast if they can't hook that function and give the user the ability
to handle that error at that stage instead of waiting for unpredictable
crashes. I'll post example code
[here](https://git.free-hack.com/wacked/x64hook) and outline the general
technique below.

(My x64hook hooking engine doesn't work. There are literally two interesting
functions in it, and I give pseudocode for them below.)

Estimate the length of a function
---------------------------------

Note: This is an estimation of the function length. There are various ways to
go about it; one way would be to search for the pro- and epilogue, which would
fail for all functions that -- for whatever reason -- don't have them. I'm sure
this way also isn't perfect, but maybe it could be used as another source of
information[5].

Over the years I've seen various attempts at estimating the function length.
One of the top hits in my Google history is a question on Stack Overflow[3]
which uses the same technique that I've seen in various malware strains --
checking byte for byte until the RET opcode is found. Which won't work if
either:

1) The `RET imm16` opcode is used, which is often the case for `__stdcall`
   funcs.
2) There are multiple returns.
3) The function doesn't actually return with the RET instruction. For example
   if a function A at its end calls another function B, with A and B sharing
   the same parameters and either A or B not modifying the stack pointer, it is
   perfectly possible to just jump to function B. Execution will continue in B,
   which ends with a normal RET.
4) The value 0xC3 appears for some other reason in the function.

Point 4) can be easily solved by using a length disassembler engine and just
checking the actual instruction byte. 1) and 3) aren't that hard either, you'll
just need to check for some additional opcodes. What about 2)?

The key insight I had was why a function might have multiple returns -- because
it needed to do additional work in some cases. Which meant that there had to be
branching, to sometimes skip some instructions or get to them.

If there is a branch backwards it's a loop. But a branch forwards means that
the function extends at least up to there[4]. Or in pseudocode:

    offsetOfInstr = 0   // offset of the current instruction from the function start
    furthestJump = 0    // furthest forward jump target seen so far
    funcLen = 0
    while(can disassemble next instruction)
    {
        op = getOpcode(instruction);
        if(is_jump(op))
        {
            // jump target as an offset from the function start
            off = get_jump_offset(instruction);
            if(off > furthestJump)
                furthestJump = off;
        }
        if(is_end_of_function(op, furthestJump, offsetOfInstr))
        {
            // the function ends right after this instruction
            funcLen = offsetOfInstr + length(instruction);
            break;
        }
        offsetOfInstr += length(instruction);
    }

    bool is_end_of_function(opc, furthestJump, instrOffset)
    {
        // a RET only ends the function if no jump goes past it
        if(opc == RET && furthestJump <= instrOffset)
            return true;
        else if(opc == UD_Ijmp)
        {
            // unconditional jump out of the function (tail call)
            if(destination is IMM || destination is register)
                return true;
        }
        return false;
    }
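
To be able to actually run this without dragging in a disassembler, here is a
sketch that works on a list of already decoded instructions. `Insn`, `Kind`
and `estimate_function_length` are hypothetical names; in a real engine the
decoded instructions would come from whatever disassembler library is used.
The logic mirrors the pseudocode, including its assumption that a `JMP` to an
immediate or a register is a tail call.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, already decoded instruction (what a disassembler would yield).
enum class Kind { Other, Ret, CondJump, JmpImmOrReg, JmpMem };

struct Insn {
    std::size_t length;   // encoded length in bytes
    Kind kind;
    std::int64_t target;  // for CondJump: target as offset from function start
};

// Estimate the function length in bytes: stop at the first RET that no forward
// branch jumps past, or at a JMP imm/reg, which is treated as a tail call.
inline std::size_t estimate_function_length(const std::vector<Insn>& insns) {
    std::int64_t offset = 0;         // offset of the current instruction
    std::int64_t furthest_jump = 0;  // furthest forward branch target so far
    for (const Insn& insn : insns) {
        // Only conditional branches are tracked here; a JMP imm/reg ends the
        // scan right away, so tracking its target would change nothing.
        if (insn.kind == Kind::CondJump && insn.target > furthest_jump)
            furthest_jump = insn.target;

        const bool is_end =
            (insn.kind == Kind::Ret && furthest_jump <= offset) ||
            insn.kind == Kind::JmpImmOrReg;

        offset += static_cast<std::int64_t>(insn.length);
        if (is_end) break;
    }
    return static_cast<std::size_t>(offset);
}
```

Fed with the Branch test case from above (`and`, `je` to offset 1Dh, `xor`,
20 NOPs, `ret`) this should stop at the `ret` and report 30 bytes, while a
function with an early return inside a branch is followed up to its last
`ret`.
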

Detecting loops to the start of a function
------------------------------------------

    firstJumpOffset = MAX_INT
    foreach(instruction in function)
        if(instruction is a jump)
            jumpOffset = getOffset(instruction) // relative to function start
            /* jumps to exactly the start of a function are fine, since that is
               where our overwritten code starts. Thus it doesn't jump into the
               middle of an instruction */
            if(jumpOffset == 0)
                continue
            if(jumpOffset < firstJumpOffset)
                firstJumpOffset = jumpOffset;
    return firstJumpOffset < lengthNeededForHook
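
Here is the same check as a small runnable sketch, again with names I made up.
It takes the branch targets as offsets from the function start and, unlike the
pseudocode, ignores targets in front of the function (negative offsets), which
is an assumption on my part.

```cpp
#include <cstdint>
#include <vector>

// Target of a 2 byte rel8 branch (Jcc, LOOP, JMP short) that starts at
// `insn_offset`; the displacement is relative to the end of the instruction.
inline std::int64_t rel8_target(std::int64_t insn_offset, std::uint8_t rel8_byte) {
    return insn_offset + 2 + static_cast<std::int8_t>(rel8_byte);
}

// True if any branch target lands strictly inside the bytes the hook will
// overwrite. A jump to offset 0 is fine, that is where the hook itself starts.
inline bool jumps_into_hook(const std::vector<std::int64_t>& target_offsets,
                            std::int64_t hook_len) {
    for (const std::int64_t target : target_offsets) {
        if (target > 0 && target < hook_len) return true;
    }
    return false;
}
```

For `_loop` the `loop` instruction (`E2 F8`) sits at offset 9 and its target
works out to 9 + 2 - 8 = 3, which is inside a 5 byte hook, so an engine could
refuse to hook it instead of crashing later. The `je` in the Branch test case
targets offset 1Dh and is harmless.
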

------------

[1] This is one of the things that could easily be improved, but hasn't been
because I just couldn't motivate myself. Putting the data right after the func
meant that a section containing code needed to be writable. Which is bad. Also
I load the seed DWORD as a QWORD -- which only works because the upper half is
then thrown away by the multiplication. It's shitty code is what I'm saying.
In retrospect I should have used a jump table like a switch-case could be
compiled into. That would be read-only data. Oh well.

[2] And Microsoft decided at some point to make it even easier for their code
with the advent of hotpatching.

[3] https://stackoverflow.com/questions/8705215/get-the-size-length-of-a-c-function

[4] With some caveats, e.g. one could assume that no function is longer than
512 bytes. And obviously keeping in mind point 3.

[5] Another heuristic would be to check for the next slide of filler
instructions, such as INT3 or NOP. Some compilers align functions on 16-byte
boundaries and fill the gaps with those.