Browse Source

add the content i already wrote

master
wacked 5 years ago
parent
commit
8bf5c2c436
3 changed files with 154 additions and 0 deletions
  1. +102
    -0
      content/posts/arm-glibc-strlen.md
  2. +42
    -0
      content/posts/defeating-mbr-filter-with-techniques-from-2011.md
  3. +10
    -0
      content/projects/theme

+ 102
- 0
content/posts/arm-glibc-strlen.md View File

@@ -0,0 +1,102 @@
---
title: "Arm Glibc Strlen"
date: 2019-07-03T22:14:20+01:00
draft: false
---
# Intro
I wanted to get back into ARM assembler so I wrote my own strlen. And then I looked at the strlen() glibc uses and did not understand
a single thing. So I sat down and figured it out.

XXX IMG HERE XXX

On the left you see a flow diagram of the building blocks of my naive implementation. On the right you see glibc's. You might
notice that it is more complex. (How much faster it is, and which optimization exactly makes it fast is an interesting topic.
maybe another time)

I want to go through the function with a focus on two things:
- the actual counting (d'uh)
- the optimization if the string does not start on a word[0] boundary

# 1. The actual counting / the inner loop
The first thing I noticed about the inner loop is that it is unrolled. That is a fairly obvious optimization, made a bit harder
because strlen does not clearly unroll as the input is cleanly divisible by word size. So on the end of every basic block there
is a check which skips out of the loop and to the end where the result is calculated and returned.

XXX IMG HERE XXX

The basic blocks are basically identical. First the registers r2 and r3 are populated with the next two words to be checked for
the null byte. Now r2 and r3 each contain one word (i.e. 4 bytes). How do you check whether there is a null byte *somewhere* in them?
Without going through the bytes seperatly of course.

ARM processors have a status word called Application Program Status Register (APSR). It contains the result of ALU operations,
for example if the operation overflowed, if it was zero, ... It also has 4 bits corrosponding to the "Greater or Equal" (GE) result of
the ARM SIMD[1] instructions. But every possible value is "greater or equal" to zero. So how does that help?

The bitwise opposite of 0 is 0xFF or 255 in decimal. Because it is the biggest possible value representable in one byte a GE
comparion will only match if 0xFF is compared with 0xFF itself. The code already set up r12 to the value of 0xFFFFFFFF so one
part of the equation is available. However the usual compare instructions do not set the 4 GE bits. They are only set by
ADD16, ASX, SAX, SUB16, ADD8 and SUB8.

One possible way to turn a 0 into an 0xFF is to just add 0xFF. For all other values the result will overflow and be less.
Luckily r12 already contains 4 bytes of 0xFF. And that is the second step in every basic block of the inner loop.

So for example if you had
`r2 = 0x30313200`, `r12 = 0xffffff` and `ASPR.GE = [0, 0, 0, 0]`
after `r2 = r2 + r12` `r2` contains `0x293031FF` and `ASPR.GE = [0, 0, 0, 1]`.

While it is great to know that where used to be a 0 now is a 0xFF it did not actually help. Instead of trying to figure out
where a zero byte was now we have to figure out where a 0xFF is. So we did not actually gain anything.

The arm manual for the GE flags contains this little sentence:
> These flags can control a later SEL instruction. For more information, see SEL on page A8-602.


The SEL opcode "selects each byte of its result from either its first operand or its second operand, according to the
values of the GE flags". So this is a way to quickly eliminate all non zero (well non 0xFF) bytes.
SEL r2, r4, r12
r4 contains all zeros, so if the GE flag for this byte is false, the value from r4 is taken. r2 now contains a zero. Else it still contains the 0xFF.
Now you could just reverse that value and count the leading zeros to get the position of the null byte. However the `ldrd`
instruction did not only read one word into r2 but also one into r3. Since zero bytes are rare it would be a waste to not
also check if there maybe wasn't one in the second word -- which we already read anyway.

It also adds r12 to turn all zero bytes into 0xFF. However it can't then do the exact same thing, because that would destroy
the information from the previous word. So instead it SEL with r2 instead of an all zero register. That means that in the case
that there was a null byte (and thus the GE flag is set) at this position in r3 the 0xFF from r12 is taken. If not however the
value from r2 is. Which might or might not be a 0xFF.

If no zero byte was in those two words r3 will now be all zero. If not it will be something else. So the last instruction of
the basic block is a `cbnz r3` skipping the next two unrolled blocks. This completes the inner loop.

# 2. Alignment
The `ldrd` instruction can not do unaligned accesses. But a string might be unaligned, for all sorts of reasons. (And POSIX probably mandates
that strlen() should handle unaligned strings). So like in all SIMD cases, there need to be a couple steps before the actual work begins. OFten
that is going through the data one byte at a time until it is aligned. However this implementation does something much smarter.

First it checks whether the string is unaligned by checking if the lower three bits are zero using straightforward bit twiddeling. But if they are not,
the code actually puts the argument into r1 and rounds down. So from now on it is safe to load from there. While doing that it puts by how much it rounded
down into r4 and then sets r0 to `-r4`. So effectively the result is that this string starts with a length of -4 instead of 0.

# 3. Other cute features
When calculating the end result, there is a problem. In every basic block (but the first one obv) the code already incremented the length. But now the null byte could be in either r2 or r3
-- which is it? The code firstly checks if r2 is all zeros. If that is the case the it has to be r3. So there is a conditial move `r2 = r3` and incrementing the length by 4.
(if you recall the length started counting at -4 also for unaligned access, so now the length would be zero)

It is reasonable to assume that every strlen() implementation will be memory bound. So at the beginning of the inner loop,
there is a preload instruction to load the next cacheline. As there are 4 unrolled basic block, each checking 2 words the loop in total checks 64bits.
which is exactly a cacheline. This ensure that the data to be checked should always be there, minimising waiting time.

# 4. Wrap up
Going through this was a valuable learning experience. I learned some new instructions, learned some new tricks (`rev r2, r2` then `clz r2, r2` to get the index in a word).

There are a couple obvious further topics just with this function:

is it even worth it? How much faster is this? Which trick contributed how much?
do those changes break anything? If it reads two words at a time, what happens if it reads into un-allocated memory?
how would it look with NEON instructions?
wait... i never even looked at a real life AMD64 implementation of strlen
it doesn't look like the code in glibc. What gives?


Footnotes
[0] word as in machine width (32bit in this case), not the windows.h #define WORD.
[1] Not NEON (Advanced SIMD).

+ 42
- 0
content/posts/defeating-mbr-filter-with-techniques-from-2011.md View File

@@ -0,0 +1,42 @@
---
title: "Defeating Mbr Filter With Techniques From 2011"
date: 2016-10-21T12:27:31+01:00
draft: false
---
# Summary
A few days ago TalosIntel, a cisco subunit I guess, released MBRFilter. It is a simple driver protecting the MBR from malicious modification - and it does exactly what it says on the tin.

@SwiftOnSecurity, softpedia and bleepingcomputer picked it up, so I guess many people saw it. I'm not entirely sure if even the bleepingcomputer video really conveyed how much/little security it actually offers. (Minor nitpick: Ransomware would not encrypt the MBR but the Master File Table)

I have two problems with this release:

- It does not actually solve the problem, as any later stage in the boot process can be overwritten instead - as Rovnix did years ago
- direct disk access already is protected -- only processes running with administrator rights can modify it. If malware is able to run as administrator the PC should be considered toast anyway, even if the MBR itself is protected

I will release an extremely simple POC that circumvents the filter. Malware authors gain nothing from that, if they haven't figured it out by now, this code will not help them. (Not to mention that cisco advertises MBRFilter to protect specifically against ransomware - whose authors would be unable to empty a bucket if the instructions to do that were printed on its bottom...)

# Details about MBRFilter
MBRFilter is a kernelmode driver written in C. It draws heavily on example code by Microsoft, and prevents writes to sector 0 of non-removable devices or CDs.

It installs a callback to filter SCSI Passthrough (in MBRFDevControl()) and requests coming from the normal I/O mechanism (in MBRFReadWrite()). If those requests try to sector 0 of any drive it shows an error message and makes that request fail.

(I'm not entirely sure why the check in MBRFReadWrite() is diskoffset / 512 == 0, instead of just a simple comparison. However I'd be happy to hear about that)

Reading the MBR is not protected however, because there is no possible harm in that correct?

# Rudimentary introduction to the Windows booting procedure
The, unmodified, Windows MBR has, like all Master Boot Records, a size limit of 512 bytes. And not even all of that can be used freely, for example at the end there has to be a 2 byte signature and before that a partition table.

So all the actual MBR does is setup a few small things and then read the next stage of the boot process. The next stage is at the first sector of the partition that is bootable (i.e. has the bootable flag set in that table).

The MBR then reads that next stage (called Virtual Boot Record) into memory and continues execution there.

# Circumventing MBRFilter
As astute readers may have noticed the VBR is on another sector than the MBR. As MBRFilter only protects the MBR, the VBR can be modified using normal WinApi functions. The only challenge is finding the VBR...

Well as MBRFilter does not prevent reading the MBR a malicious application could read the MBR, parse the partion table to find the VBR and then overwrite it.

And would you look at that, I did exactly that. The code is nothing fancy, it doesn't even have any sort of output. Your PC just will be stuck; the only remedy is turning it off. Forever.

# Summary
MBRFilter does what it says. Sadly if you know enough to accurately judge what that exactly means you are basically able to circumvent that. Obviously the average user, and as it seems not even the average tech journalist, don't have that knowledge. And that is fine, but Cisco should make the shortcomings more clear! Putting "defense-in-depth" anywhere in the Text would have been enough to hint at the shortcomings.

+ 10
- 0
content/projects/theme View File

@@ -0,0 +1,10 @@
---
title: "This Blog's Theme"
date: 2019-11-03T21:28:19+01:00
draft: false
---
> Personal blog theme powered by Hugo. I only took [Calin Tataru](http://calintat.github.io/) theme and
> stripped out all the links to varios CDNs.
> I would prefer my website to have as little dependencies outside my control as possible please.

[Download](http://vcs.wacked.codes/wacked/minimal-no-third-party.git)

Loading…
Cancel
Save