It is now the final week of GSoC, and it is time for me to write my final blog post. Over the past summer I have worked on fixing the Coverity scan issues in coreboot, with the goal of making the code base “Coverity clean”. This has involved writing a substantial number of patches, the vast majority of which are in coreboot, with a sprinkling in a few other projects:
- 146 patches in coreboot
- 6 patches in flashrom
- 6 patches in vboot
- 3 patches in Chromium EC
- 4 patches in OpenSBI
- 2 patches in em100
- 1 patch in the Linux kernel
At the time of writing, a few of my patches are still under review on Gerrit, so it is possible (and I am hopeful!) that this list will grow over the next few weeks.
In total, these patches resolved 172 Coverity reports of actual bugs. However, Coverity isn't always right, and some reports weren't actually problems that required patches. These, 91 in total, were either false positives or intentional behavior and were dismissed. There are currently 223 remaining reports in the coreboot issue tracker. While that is a substantial number, it is almost entirely composed of issues from third-party projects (such as OpenSBI or vboot, which probably shouldn't be counted in the coreboot tracker anyway) and the AMD vendorcode. The original plan at the beginning of the summer was to work on the AMD vendorcode; however, after discussion with my mentors we decided to skip it, since with the upcoming deprecations for coreboot 4.11 it might not be around much longer. Aside from this, there are roughly 20 remaining issues, which mostly require refactoring or domain knowledge that I don't have.
With the summary out of the way, I'd like to show everyone a sample of the sort of bugs I've worked on during the project, and hopefully offer some advice for avoiding them in the future. Here is a list of the most common, nasty, or subtle types of bugs I've found over the summer.
Missing Break Statements
In C, every case in a switch statement implicitly falls through to the next one. However, this is almost never the desired behavior, so every case needs to be manually terminated with a break to prevent the fall-through. Unfortunately, this is tedious to do, and the break is often accidentally left out. For a prototypical example, let's look at CB:32180 from the AGESA vendorcode.
switch (AccessWidth) {
case AccessS3SaveWidth8:
	RegValue = *(UINT8 *) Value;
	break;
case AccessS3SaveWidth16:
	RegValue = *(UINT16 *) Value;
	break;
case AccessS3SaveWidth32:
	RegValue = *(UINT32 *) Value;
default:
	ASSERT (FALSE);
}
In this switch there is a missing break after the AccessS3SaveWidth32 case, which will then fall through to the false assertion. Clearly not intentional! Other examples of this, though not as severe, can be found in CB:32088 and CB:34293. Fortunately, these errors can now be prevented by the compiler: GCC recently added the -Wimplicit-fallthrough option, which warns on all implicit fall-throughs and alerts to a potentially missing break. However, some fall-throughs are intentional, and these can be annotated with a /* fall through */ comment to silence the warning. Since CB:34297 and CB:34300 this warning has been enabled in coreboot, so this should be the last we see of missing break statements.
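As a minimal sketch of what such an annotation looks like (the names here are hypothetical, not from the coreboot tree):

switch (log_level) {
case LOG_VERBOSE:
	/* Verbose output implies debug output as well. */
	enable_verbose_output();
	/* fall through */
case LOG_DEBUG:
	enable_debug_output();
	break;
default:
	break;
}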
Off-by-One Errors
There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.
Anonymous
Everyone has been bitten by off-by-one errors. Let’s take a look at CB:32125 from the Baytrail graphics code.
static void gfx_lock_pcbase(struct device *dev)
{
	const u16 gms_size_map[17] = { 0, 32, 64, 96, 128, 160, 192, 224, 256,
				       288, 320, 352, 384, 416, 448, 480, 512 };
	...
	u32 gms, gmsize, pcbase;

	gms = pci_read_config32(dev, GGC) & GGC_GSM_SIZE_MASK;
	gms >>= 3;
	if (gms > ARRAY_SIZE(gms_size_map))
		return;
	gmsize = gms_size_map[gms];
	...
}
Here we have an array gms_size_map of 17 elements, and a bounds check on the gms variable before it is used to index into the array. However, there's a problem: the check misses the case when gms == ARRAY_SIZE(gms_size_map) == 17, which is one past 16 – the index of the last array element. The fix is to use >= in the check instead of >. This exact error when performing a bounds check is very common: see at least CB:32244, CB:34498, and CL:1752766 for other examples.
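Applied to the snippet above, the corrected bounds check reads:

	/* Reject gms == 17, which is one past the last valid index. */
	if (gms >= ARRAY_SIZE(gms_size_map))
		return;
	gmsize = gms_size_map[gms];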
Another nasty place where off-by-one errors strike is with strings – in particular, when making sure they are null terminated. Here is CB:34374 from the ACPI setup of the Getac P470.
static long acpi_create_ecdt(acpi_ecdt_t * ecdt)
{
	...
	static const char ec_id[] = "\_SB.PCI0.LPCB.EC0";
	...
	strncpy((char *)ecdt->ec_id, ec_id, strlen(ec_id));
	...
}
The problem is that strncpy() will copy at most strlen(ec_id) characters, which excludes the null terminator. The author might have been thinking of the similar strlcpy(), which does explicitly null-terminate the string buffer even if it never reaches a null character. In this case none of the string-copying functions are needed, since ec_id is a string buffer and so can be copied using a simple memcpy().
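As a sketch of that fix: sizeof on the ec_id array includes the terminating null byte, so a single memcpy() copies the whole string (assuming the destination field is at least that large):

	/* sizeof(ec_id) counts the trailing '\0', unlike strlen(ec_id). */
	memcpy(ecdt->ec_id, ec_id, sizeof(ec_id));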
Boolean vs Bitwise Operators
In C, all integers are implicitly convertible to boolean values and can be used with all boolean operators. While somewhat convenient, this also makes it very easy to mistakenly use a boolean operator when a bitwise one was intended. Let’s take a look at CB:33454 from the CIMX southbridge code.
void sb_poweron_init(void)
{
	u8 data;
	...
	data = inb(0xCD7);
	data &= !BIT0;
	if (!CONFIG(PCIB_ENABLE)) {
		data |= BIT0;
	}
	outb(data, 0xCD7);
	...
}
Here BIT0 is the constant 0x1, so !BIT0 evaluates to 0, with the net effect of data being completely cleared, regardless of the previous value read from inb(). The intended operator was the bitwise negation ~, which would clear only the lowest bit. For more examples of this sort of bug, see CB:34560 and OpenSBI 3f738f5.
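Applied to the snippet above, the corrected line is:

	data &= ~BIT0;	/* clear only bit 0, preserving the rest */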
Implicit Integer Conversions
C allows implicit conversions between all integer types, which opens the door for many accidental or unintentional bugs. For an extremely subtle example of this, let’s take a look at OpenSBI 5e4021a.
void *sbi_memset(void *s, int c, size_t count);

void sbi_fifo_init(struct sbi_fifo *fifo, void *queue_mem,
		   u16 entries, u16 entry_size)
{
	...
	sbi_memset(fifo->queue, 0, entries * entry_size);
}
Do you see the problem? The issue is that entries and entry_size are both 16-bit integers, and by the rules of C they are implicitly promoted to int before the multiplication. An int cannot hold all possible values of a u16 * u16, so if the multiplication overflows, the intermediate result can be a negative number. On 64-bit platforms size_t is a u64, and the negative result will then be sign-extended into a massive integer. As the last argument to sbi_memset(), this could lead to a very large out-of-bounds write. The solution is to cast one of the operands to size_t before the multiplication, which is wide enough to prevent the implicit promotion to int. For other examples of this problem, see CB:33986 and CB:34529.
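Sketched against the snippet above, the cast forces the multiplication to be carried out in size_t:

	/* (size_t)entries * entry_size cannot overflow into a negative int. */
	sbi_memset(fifo->queue, 0, (size_t)entries * entry_size);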
Another situation where implicit conversions strike is in error handling. Here is CB:33962 in the x86 ACPI code.
static ssize_t acpi_device_path_fill(const struct device *dev, char *buf,
				     size_t buf_len, size_t cur);

const char *acpi_device_path_join(const struct device *dev, const char *name)
{
	static char buf[DEVICE_PATH_MAX] = {};
	size_t len;

	if (!dev)
		return NULL;

	/* Build the path of this device */
	len = acpi_device_path_fill(dev, buf, sizeof(buf), 0);
	if (len <= 0)
		return NULL;
	...
}
With the function prototype right there, the problem is obvious: acpi_device_path_fill() returns negative values in a ssize_t to indicate errors, but len is a size_t, so all those negative error values are converted to extremely large positive integers, which then pass the subsequent error check. During code review, though, this may not be obvious at all.
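A sketch of the fix: declaring len with the same signed type as the function's return value preserves the negative error values, so the check behaves as intended:

	ssize_t len;	/* signed, to match acpi_device_path_fill() */
	...
	len = acpi_device_path_fill(dev, buf, sizeof(buf), 0);
	if (len <= 0)
		return NULL;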
Both of these errors could be prevented using the -Wconversion compiler option, which warns about all implicit integer conversions. However, there are an incredible number of such conversions in coreboot, and it would be a mammoth task to fix them all.
Null Pointers
Null pointers need no introduction – they are well known to cause all sorts of problems. For a simple example, let’s take a look at CB:33134 from the HiFive Unleashed mainboard.
static void fixup_fdt(void *unused)
{
	void *fdt_rom;
	struct device_tree *tree;

	/* load flat dt from cbfs */
	fdt_rom = cbfs_boot_map_with_leak("fallback/DTB", CBFS_TYPE_RAW, NULL);

	/* Expand DT into a tree */
	tree = fdt_unflatten(fdt_rom);
	...
}
This code attempts to load a device tree from a location in the CBFS. However, cbfs_boot_map_with_leak() will return a null pointer if the object can't be found in the CBFS, and that pointer is then dereferenced in the call to fdt_unflatten(). On most systems dereferencing a null pointer leads to a segfault, since the operating system has set up permissions that prevent accessing the memory at address 0. However, coreboot runs before the operating system has even started, so there are no memory permissions at all! If fdt_rom is a null pointer, fdt_unflatten() will attempt to expand the device tree from whatever memory is at address 0, leading to who knows what problems. A simple null check avoids this, but it requires the programmer to always remember to add one.
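A sketch of such a check for the snippet above (the error message is illustrative):

	fdt_rom = cbfs_boot_map_with_leak("fallback/DTB", CBFS_TYPE_RAW, NULL);
	if (!fdt_rom) {
		printk(BIOS_ERR, "Unable to load fallback/DTB from CBFS\n");
		return;
	}

	/* Expand DT into a tree */
	tree = fdt_unflatten(fdt_rom);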
Another common issue with null pointers is that even if you do add a check, it might not matter if the pointer has already been dereferenced. For example, here is a problem with the EDID parser, from CB:32055.
int decode_edid(unsigned char *edid, int size, struct edid *out)
{
...
dump_breakdown(edid);
memset(out, 0, sizeof(*out));
if (!edid || memcmp(edid, "\x00\xFF\xFF\xFF\xFF\xFF\xFF\x00", 8)) {
printk(BIOS_SPEW, "No header found\n");
return EDID_ABSENT;
}
...
}
In this case the EDID is dumped before the null pointer check, but at worst there should only be a wonky dump if edid is null, right? Not necessarily. Since dereferencing a null pointer is undefined behavior, the compiler is allowed to assume that no null pointer dereferences occur in the program. Here, dereferencing the edid pointer inside dump_breakdown() is an implicit assertion that edid is not null, so an over-zealous compiler could remove the following null check! This optimization can be disabled using -fno-delete-null-pointer-checks (which coreboot does), but that does not prevent any problems caused by the null dereference that happens before the check. See this article on LWN for details on how a vulnerability arising from this problem was dealt with in the Linux kernel.
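The robust fix is to order the code so the pointer is checked before it is ever dereferenced; sketched against the snippet above:

	/* Validate edid before any function touches it. */
	if (!edid || memcmp(edid, "\x00\xFF\xFF\xFF\xFF\xFF\xFF\x00", 8)) {
		printk(BIOS_SPEW, "No header found\n");
		return EDID_ABSENT;
	}

	dump_breakdown(edid);
	memset(out, 0, sizeof(*out));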
Conclusion
C has always had the mantra of “trust the programmer”, which makes mistakes very easy to make. Some of these errors can be prevented at compile time using compiler warnings, but many cannot. Coverity and other static analyzers like it are very useful and powerful tools for catching bugs that slip past the compiler and through peer review. However, they are no silver bullet. All of these errors were present in production code and were only caught after the fact, and there are certainly bugs of this sort left that Coverity hasn’t found. What do we do about them, and how can we ever be sure that we’ve caught them all? Today, there are new languages designed from the ground up to enable safe and correct programming. For example, libgfxinit is written in SPARK, a subset of Ada that can be formally verified at compile time to rule out essentially all of the above errors. There is also the new oreboot project, written in Rust, which makes similar compile-time guarantees thanks to its extensive type system. I hope to see these languages and others increasingly used in the future, so that at some point this job will have become obsolete. 🙂