Reverse engineering blobs: adding diff to the toolkit

Last time I talked about the benefits of using sed to transform repetitive low-level patterns into meaninful function calls.  And still, doing all that regex magic did not get us a fully working replay. A great portion of the hardware initialization flow is based on situational awareness. What hardware is connected? What are our capabilities? What if …?

That means a simple sequence of writes is, in most cases, not sufficient. We may need to modify registers, wait on other hardware, or respond differently to hardware states. While that seems daunting and tedious, it gives us an unexpected advantage: that every execution of the blob produces a different trace.

This is where diff comes in. By getting a bunch of traces and diffing them, we can see the points where the firmware takes different decisions, and the states which determine those decisions.  It won’t tell us what condition triggers path A or path B, but it allows us to infer that by comparing the hardware states. Let’s have a look:

@@ -305,28 +305,28 @@ void run_replay(void)
 radeon_read_sync(0x6430); /* 04040101 */
 radeon_write_sync(0x6430, 0x04000101);
 radeon_write_sync(0x3f50, 0x00000000);
- radeon_read(0x3f54); /* 000dda12 */
- inl(0x2004); /* 000dda8a */
- inl(0x2004); /* 000ddb1a */
- inl(0x2004); /* 000ddb9c */
- ...
- inl(0x2004); /* 000de3a0 */
- inl(0x2004); /* 000de41a */
+ radeon_read(0x3f54); /* 000d9efb */
+ inl(0x2004); /* 000d9f73 */
+ inl(0x2004); /* 000da003 */
+ inl(0x2004); /* 000da081 */
+ ...
+ inl(0x2004); /* 000da883 */
+ inl(0x2004); /* 000da8fd */
 radeon_read_sync(0x611c); /* 00000000 */
 radeon_write_sync(0x611c, 0x00000002);
 radeon_write_sync(0x6ccc, 0x00007fff);

I’ve chosen an example that shows similarities rather than differences, as I find this to be a more interesting case. Since we’ve already established that 0x2004 is our data port, as long as we don’t touch the index port, we’ll be reading from the same register, in this case 0x3f54.

Now the values returned by this register are completely different in every trace, yet the behavior of the blob is strikingly similar every time. The first key observation is that this register increase monotonically in every trace. It also increases by roughly the same amount on successive reads. The differences between the last read and first read of the register are also strikingly similar in both traces: 0xa08 and 0xa02 respectively.

This register looks to be a monotonic timer, and the loop has all the elements of a delay loop. To determine the actual delay, we could try to extract absolute timing information when collecting the trace; however, in this specific case, I had the AtomBIOS tables handy. By comparing register accesses around this loop, I was able to figure out where in the tables this delay is occuring:

 0200: 0d250c1901 OR reg[190c] [...X] <- 01
 0205: 54300c19 CLEAR reg[190c] [.X..]
 0209: 5132 DELAY_MicroSec 32

The ’32’ in the delay is a hex number. Doing a bit of hex math we see we’re waiting about 51 ticks per microsecond. Comparing more loops, we get between 50 to 52 ticks per microsecond. Since a delay loop normally waits until the minimum time has elapsed, we now have a very convincing case that register 0x3f54 implements a 50MHz monotonic timer. Every time before accessing this register, we also poke register 0x3f50. That looks very much like the timer control register.

We now extend our sed script with:

sed "s/radeon_read($timer);[^$hex\r]*\([$hex]*\)[^\r]*\(\r\tinl($dport);[^$hex\r]*\([$hex]*\)[^\r]*\)*/radeon_delay(0x\3 - 0x\1);/g" |
sed "s/radeon_write_sync($timerctl, 0x0\{1,8\});[^\r]*\r\tradeon_delay(/radeon_delay(/g"

Now when we rerun our logs through the script, the results decrease in size from 20K lines to 13K lines. The diffs between processed logs also decrease in size significantly. All the more proof we were right!

There’s another way in which diff is excellent for our purpose. We can implement our helpers to generate the same output as the processed logs. That allows us to poke the replay from userspace, yet get the same output format. Now we can diff the replay and original log, and observe how the hardware state changes. We can even go as far as implementing our delay with usleep() instead of the timer at 0x3f54. When the diff is independent of the delay method we use, we have another strong proof our assumptions are true. This is the case here.

‘diff’ is an extremely powerful tool. Despite its name, it can show similarities just as well as differences. While regular expressions exaust their usability with simple patterns, diff can take us a lot further. Now that we’ve cracked the delay implementation of the blob, we can more easily see delay and wait loops — again, using diff. Complex, multivariable patterns are too awkward to handle with sed. I’ll go over those some other time. However, once such patterns are simplified to a function call, diff can once again show the story. Different GPU model? diff. New display? diff. HDMI connected? diff. It’s almost as versatile as det cord .

Reverse engineering blobs with replay and sed

I recently started re-visiting the HP Pavilion m6 1035dx with a recent coreboot master. As usual the benign VGA BIOS got in the way again. This time I decided to use coreboot’s YABEL realmode emulator to tell the story. Let’s dive in:

Executing Initialization Vector...
[0000f2cb]c000:3b69 inb(0x03c3) = 0x20
[0000f2cc]c000:3b6f inl(0x204c) = 0x00000000
[0000f2ce]c000:3e06 inl(0x2000) = 0x9ffffffc
[0000f2cf]c000:3b69 inb(0x03c3) = 0x20
[0000f2d0]c000:3b6f inl(0x204c) = 0x00000000
[0000f2d3]c000:3b69 inb(0x03c3) = 0x20
[0000f2d4]c000:3b6f inl(0x204c) = 0x00000000
[0000f2d5]c000:3e61 outl(0x00001728, 0x2000)
[0000f2d5]c000:3e67 outl(0x0008c000, 0x2004)
[0000f2d8]c000:3b69 inb(0x03c3) = 0x20
[0000f2d9]c000:3b6f inl(0x204c) = 0x00000000
[0000f2da]c000:3e49 outl(0x00003f54, 0x2000)

That’s how the coreboot log looks when we enable YABEL traces. Enabling direct hardware access produces a cleaner log to start with. We want to clean that up a little bit. A bit of simple regex substitution gets us there. Excuse the wrap-around:

egrep "c000:[0-9,a-f]{4}|x86emuOp_halt|runInt[0-9,a-f]{2}.*starting" $1 |
grep -v "Running option rom at c000:0003" |
sed "s/c000\:[0-9,a-f]\{4\}\ /\t/g" |

tr '\n' '\r' |

sed "s/\[[0-9,a-f]\{8\}\]//g" |
sed "s/)/);/g" |
sed "s/=\ 0x\([0-9,a-f]*\)/\/\*\ \1\ \*\//g" |

tr '\r' '\n'

That’s enough to get our trace to something that looks more like C code, which could be used as a replay function (HINT!):

 inb(0x03c3); /* 20 */
 inl(0x204c); /* 00000000 */
 inl(0x2000); /* 9ffffffc */
 inb(0x03c3); /* 20 */
 inl(0x204c); /* 00000000 */
 inb(0x03c3); /* 20 */
 inl(0x204c); /* 00000000 */
 outl(0x00001728, 0x2000);
 outl(0x0008c000, 0x2004);
 inb(0x03c3); /* 20 */
 inl(0x204c); /* 00000000 */
 outl(0x00003f54, 0x2000);

We’ve neatly kept the return values of IO input operations for future reference, but the current form doesn’t tell us much. We do, however see patterns emerging. IO to 0x2000 followed by 0x2004 is fairly common. That looks like an index/data pair. Also, access to 0x03c3 and 0x204c before poking the above pair is all too common. Let’s extend our script with:



sed "s/outl(0x0\{0,4\}\([0-9,a-f]*\), $iport);[^\r]*\r\toutl(0x\([0-9,a-f]*\), $dport);/radeon_write(0x\1, 0x\2);/g" |
sed "s/outl(0x0\{0,4\}\([0-9,a-f]*\), $iport);[^\r]*\r\tinl($dport);/radeon_read(0x\1);/g" |

sed "s/inb(0x03c3);[^\r]*\r\tinl($stsport);[^\r]*/sync_read();/g" |

Since we’ve converted all our newlines to carriage returns, we can match patterns from multiple lines. It doesn’t matter what we think these patterns do, or how we call the new functions. We’re just interested in grouping them to see the bigger picture. I could as well have called these hamburger() and french_fries().

 inl(0x2000); /* 9ffffffc */
 radeon_write(0x1728, 0x0008c000);
 radeon_read(0x3f54); /* 002badc3 */
 radeon_read(0x0670); /* 0000fe04 */
 radeon_write(0x0670, 0x0000fe04);

Much better! We’ve replaced the low-level patterns with more meaningful descriptions. We could grep out the sync_read(), or, if we were using this for replay code, incorporate the _sync() into another function with radeon_(). Even in this form, we can start looking for higher-level patterns. If we look into the disassembly of the video BIOS provided by AtomDis, we see:

 0009: 07a59c01fc AND reg[019c] [.X..] <- fc
 000e: 0d659c0180 OR reg[019c] [..X.] <- 80

Since the registers are 32-bit, then their address would be reg << 2. Thus [019c] becomes 0x0760. These read-modify-write sequences appear in the replay trace.

Now we have an idea where we are with respect to the AtomBIOS tables, we see, at a low level, how those tables translate to hardware accesses. As more patterns are identified, they can be transformed into more meaningful function calls

Now we have the opportunity to look for more advanced patterns, and even identify any code that is not in the AtomBIOS tables. This, in turn can allow us to figure out what actually turns on the display. There is still a huge gap before having anything close to native init. What we’ve done here is develop a coreboot-level understanding of the init process, and made the first tiny step towards native VGA initialization.

While there is only so much regex substitution can do for us, it is a necessarry first step towards a larger understanding of the problem. Transforming 0x0760 to 0x019c is ill suited for regex.  Identifying more complex patterns such as waiting for a condition, or delaying a set amount of time is also a more demanding task; however, the power lies in the ability to script and automate the conversion from a nonsensical log into something more human-friendly. It then becomes trivial to diff several traces and see higher-level patterns. I’ll discuss those some other time.

The pink room – a coreboot developer meeting in Prague

A highly unusual sight? Yes. A pink room filled to the brim with hardware, hackers and pizza boxes.
An unexpected place? Indeed. The architecture faculty building of the Czech Technical University in Prague.
An unusual schedule for four days in Prague? Of course. Intense hacking, talks, discussions and hardware destruction/modding.
Highly unusual sightseeing? Absolutely. A steam-powered wastewater treatment plant and the Prague Signal Festival.
Excellent food? Oh yes. Czech specialties at really low prices.
Disasters? Not really. There were canceled trains/planes from and to Prague affecting quite a few of the attendees, but those were eventually overcome.
Bad things? Possibly. We did unspeakable things to BIOS and EFI images, and we’re proud of it, so I’m not sure if that qualifies as bad.
Great meeting? Awesome meeting!

So what did we do?
– Found a way forward for the dozens of boards which haven’t been tested in ages.
– Overhauled the board status reporting model, so be implemented soon.
– Evaluated ways to get board testing into a state where people can do so easily without manual work.
– Discussed hardware vendor interaction.
– Agreed to ship verified boot functionality by default in a way which still keeps all the freedom for the user.
– coreboot will get a compelling security story out of the box.
– Fixed bugs in some ports.
– Ported a few new laptops.
– Learned new tricks to improve suspend-to-RAM functionality the easy way.
– Made a plan to upstream essential drivers in the chromiumos branch of flashrom.
– Adjusted the flashrom development model to something that works with the scarcity of reviewer time currently available.
– Found out that a video projector is a nice way to test VGA output in case the only VGA capable monitor is in use.
– Noticed that pretty much every question in the form of “does anybody have X” can be answered with “yes” if X is something that may be connected to a computer or is remotely useful for coreboot.
– Didn’t get a lot of sleep.
– Listened to talks about the present and future of coreboot.
– Met some longtime community members for the first time.

Photos of the meeting and sightseeing will be added soon. website updates

Woot! A new look for  We have shifted the landing page from the mediawiki to WordPress. DON’T PANIC!, we are still using the wiki as the primary location for developer content. The new landing page and WordPress site is more visually appealing and is the location for news, blogs,  and other basic information for those that are just discovering and learning about coreboot.

GSoC [infrastructure] : Along the way, something went terribly wrong

I started working with AMD platforms’ infrastructure with high hopes of being able  to better manage the CBMEM setup. While I have a selection of family14 boards to work with, things did not continue so well:

First off, tree had literally thousands of lines of copy-pasted or misplaced AGESA interface code remaining in the tree. A lot of that should have been caught in the reviews, but it appears a few years back the attitude was that if coreboot project was lucky enough to get some patches from an industry partner, the code must be good (as the development was paid for!) and just got rubber-stamped and committed.

Second, the agreement I have for chipset documentation is not open-source friendly. It contains a clause saying all documentation behind the site login is to be used for internal evaluation only. I was well aware of this at the beginning of GSoC and at that time I expected my mentor organisation would be able to get me in contact with right people at AMD to get this fixed. But that never happened. I am also concerned of the little amount of feedback received as essentially nobody in community has fam16kb to test.

So currently I am balancing which parts of my work on AGESA I should and can publish and what I cannot. Furthermore, first evidences that vendor has decided to withdraw from releasing  AGESA sources have appeared for review. In practice there has been zero communication with the coreboot community on this so I anticipate the mistakes that were done with FSP binary blobs will get repeated.

Needless to say the impact this has had on my motivation to further work on AGESA as my efforts are likely to go wasted with any new boards using blobs. I guess this leaves pleanty of opportunities for future positive surprises once we have things like complete timestamps, CBMEM and USBDEBUG consoles and generally any working debug output from AGESA implemented. I attempted to initiate communication around these topics already in fall 2013 without success and it is sad to see the communications between different parties interested in overall tree maintenance have not improved at all since then.

GSoC 2014 [flashrom] Support for Intel Bay Trail, Rangeley/Avoton and Wildcat Point

While we were busy updating our AMD driver code to accommodate the new SPI controller found in Kabini and Temash, Intel has also changed their SPI interface(s) in a way that required quite some effort to support it in flashrom. A pending patch set is the result of the work of a number of parties and I will shortly explain some details below.

All started with a patch for ChromiumOS’s flashrom fork in fall 2013 that introduced support for Intel’s Bay Trail SoCs which are used in a number of currently shipping or announced Chromebooks. Bay Trail is part of the Silvermont architecture also in other SoCs intended for different use cases like mobile phones (Merrifield and Moorefield) or special-purpose servers (Avoton and Rangeley). At least for the latter two the SPI interface is equivalent to Bay Trail’s. This was handy for Sage when they were developing their support package for Intel’s Mohon Peak (Rangeley reference board) which was upstreamed to the coreboot repository shortly before this blog post was written. They ported the patch to vanilla flashrom, added the necessary PCI IDs and submitted the result to our mailing list at the end of May.

Because we, the flashrom maintainers, are very picky, the code could not be incorporated as is. I took the patch and completely reworked and refactored it so that more code could be shared and we are hopefully better prepared for future variations of similar changes. Additionally, I have also backported the Intel Wildcat Point support that ChromiumOS got already in May.

The major part of my contribution was not simply integrating foreign code, but refactoring and refining it where possible as well as verifying it against datasheets. While digging these numerous datasheets and SPI programming guides I have also fixed the problem of hardware sequencing not working on Lynx Point (and Wildcat Point) as it was reported in March when I had no time to correct it.

All of this is not committed to the main repository yet, but will be soon. It is mostly untested so far and I would very welcome any testers with the respective hardware.

[GSoC-2014] Payload Loading – Success!

This post marks the completion of the payload re-structuring that I did as a second part of the project. The following summarizes the major highlights of the work:

  • We change the ‘struct payload’ ( in the src/include/payload_loader.h) to have a ‘struct cbfs_media’ and cbfs_file_handle. That way we have the two thing we need to read the contents of the file.
  • In the build_self_segment_lists() we use the above data structures to media->read() the payload metadata. 
  • We use the metadata to segregate segments on the basis of their types and form a linked list of segments for reference later.
  •  Then we do mapping and then decompression for the compressed segments and direct reading for the uncompressed segments.

Some of the major issues were: playing with the data_offset and segment offsets; to get src_address to read to.

I was able to get past all the issues and finally got a successful working boot This completes the revamp of stage and payload loading :D

During the past week I also worked with gerrit; wherein I submitted my patches and received feedback. I am more involved with the development process of coreboot now and leaving aside some minor glitches; the process was very smooth.


[GSoC-2014] [cbfs_media] Stage 2

As per the plan, we set out to investigate the decompression algorithm that is being employed. Mid-way through, we found something interesting that appealed to us. That was the payload loading process. In our existing architecture, we memory map the entire payload. The selfload()’s current API assumes the payload has already been memory mapped. That’s the bad assumption that needs to change. Even if we investigate and resolve the decompression algorithm, and get a pipelined architecture relying on smaller buffer size, still this mapping will cost resources. Hence this needs to be rectified before we go on the the decompression thingy. So we decided this is what we will be targeting. 

I then spent some time trying to break the payload loading and see where we can put some map() saving efforts. Below shows some details of the process:
1. First we locate the payload. Here the process for stage loading happens where we have the following:
Reading done. size = 24 bytes
Load entry 0x14440 file name (32 bytes)…
Mapping size is equal to 32
Found file:offset = 0x14478, len=95716
CBFS: Found file.
default_media->map(0x14478, 0x1761c)
Mapping size is equal to 95804
CBFS: located payload @ 7ec14298, 95716 bytes.
Thus we map the all the segments with this one big mapping.
2. Loading procedure begins by building the segment list (build_self_segment_list() does this) Here w.r.t. payload_segment_types, we check for proper destination address, file_size etc. After checking for the segment; we do simple aligning by pointing the prev and next pointers appropriately; to reach to the further places where to place the segment; and so on.
3. In load_self_segments() we run a simple for loop covering all segments. The loading that happens is
–> Loading segment from rom address 0x7ec14298
code (compression=1)
New segment dstaddr 0x4a000000 memsize 0x39929 srcaddr 0x7ec142d0
filesize 0x175ac
(cleaned up) New segment addr 0x4a000000 size 0x39929 offset 0x7ec142d0 filesize 0x175ac
–> Loading segment from rom address 0x7ec142b4
Entry Point 0x4a000000
(iii) After this we come to load_self_segments()
First a bounce buffer is created:  Bounce Buffer at 7ffcf000, 186192 bytes.
We have one segment that is worked upon; which is compressed hence ulzma(src , dest) reads it.
–>Loading Segment: addr: 0x000000004a000000 memsz: 0x0000000000039929 filesz: 0x00000000000175ac
Post relocation: addr: 0x000000004a000000 memsz: 0x0000000000039929 filesz: 0x00000000000175ac
using LZMA
After that one segment that we see on the logs it says
–> Loaded segments
Hence process complete. In essence we had only 3 segments.
Currently, I am working on a strategy on deciding how to modify the architecture of the API so as to conserve as much sram memory consumption as possible.

[GSoC 2014][cbfs_media] Stage 1 : Mission Accomplished

Firstly, sorry for the delay in posting update on the work. I had been busy getting the design to code and wanted to post after its successful completion.

As I had talked about in the previous post, we did a detailed analysis on the existing read() and map() calls. The original log; with all the extra gibberish removed can be seen here. The first design modification that was done was to remove the mapping done for getting cbfs_header. These were the  0x20 size mappings we see in the log. These were unnecessary and could be done away with. And we did! :P This log shows the first optimized build; Stage 1 -> Part 1 ->done.

Now we moved on to the more complex and colossal mappings. A function cbfs_find_file() was created, which returned the absolute data_offset of the file based on the name and type we ask for. Once we have the whereabouts of the file; modifications were made in cbfs_load_stage() to appropriately read() and/or map() various files.

The files are arranged as  -> [  cbfs_file  ] [  cbfs_stage  ] [  data  ] <Thanks Aaron for this visualization >

cbfs_find_file() : worked with the cbfs_file to get details about the whereabouts of the file

cbfs_load_stage() : we first read fundamental information about the stage; and then do corresponding map() or read()

Voila!! Stage 1 Complete! :D

Now, the major issue we have persisting is that the decompression of file data assumes memory mapped access to its contents, and hence is quite inefficient due the that ‘one’ large buffer. SO  this is what we tackle next, to be more precise, have a pipelined decompression strategy which would eliminate the need for one large data buffer.

Its getting fascinating to work on the project by the day! Until the next post, signing off.

P.S.  Thanks Aaron for helping out with any and every issue I face, and always finding the time to reply, even on sundays! :D