In the last post I talked about using aarch64-linux-gnu-gdb and debugging in qemu. In these two weeks I was intensely involved in stepping through gdb, disassembly and in-turn debugging the qemu port. I summarise the major highlights below.
Firstly, the correct instruction to invoke qemu is as follows
./aarch64-softmmu/qemu-system-aarch64 -machine virt -cpu cortex-a57 -machine type=virt -nographic -smp 1 -m 2048 -bios ~/coreboot/build coreboot.rom -s -S
After invoking gdb, I moved onto tracing the execution of the instructions step by step to determine where and how the code fails. A compendium of the code execution is as follows
gdb) target remote :1234
Remote debugging using :1234
(gdb) set disassemble-next-line on
(gdb) stepi
0x0000000000000980 in ?? ()
=> 0x0000000000000980: 02 00 00 14 b 0x988
(gdb)
0x0000000000000988 in ?? ()
=> 0x0000000000000988: 1a 00 80 d2 mov x26, #0x0 // #0
(gdb)
0x000000000000098c in ?? ()
=> 0x000000000000098c: 02 00 00 14 b 0x994
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0000000000000750 in ?? ()
=> 0x0000000000000750: 3f 08 00 71 cmp w1, #0x2
The detailed version can be seen here.
The first sign of error can be seen here, where the instruction is 0 and the address is way off.
0x64672d3337303031 in ?? ()
=> 0x64672d3337303031: 00 00 00 00 .inst 0x00000000 ; undefined
To find insights as to why this is happening, I resorted to tracing in gdb. This can be done by adding the following in the qemu invoke command. This creates a log file in /tmp which can be read to determine suitable information.
-d out_asm,in_asm,exec,cpu,int,guest_errors -D /tmp/qemu.log
Looking at the disassembly, it can be seen that execution of instructions till 0x784 is correct and it goes bonkers immediately after it. Looking at the trace, this is where the code hangs
IN:
0x0000000000000784: d65f03c0 ret
The ret goes to somewhere bad. So the stack has been blown or it has executed into an area it should have prior to this. Next, I did a objdump on the bootblock.debug file. Relating to the code at this address, it could be determined that the code fails at “ret in 0000000000010758 <raw_write_sctlr>:”
Next up was determining where the stack gets blown or corrupt. For this, while stepping through each instruction, I looked at the stack pointer. The output
here shows the details. Everything seems to function correctly till 0x00000000000007a0 (0x00000000000007a0: f3 7b 40 a9 ldp x19, x30, [sp] ), then next is 0x00000000000007a4: ff 43 00 91 add sp, sp, #0x10 . This is where saved pc goes corrupt. This code gets called in the “raw_write_sctlr_current” (using objdump)
From the trace, we have the following information : The ret goes bad at 0000000000010758 <raw_write_sctlr>:
0x0000000000000908: 97fffe06 bl #-0x7e8 (addr 0x120)
…
0x0000000000000120: 3800a017 sturb w23, [x0, #10]
0x0000000000000124: 001c00d5 unallocated (Unallocated)
…
Taking exception 1 [Undefined Instruction]
…from EL1
…with ESR 0x2000000
Which is here:
0000000000010908 <arm64_c_environment>:
10908: 97fffe06 bl 10120 <loop3_csw+0x1b>
1090c: aa0003f8 mov x24, x0
This finally gave some leads in the qemu debug. There seems be some misalignment in smp_processor_id.
While tracing in gdb, we have
0x0000000000000908 in ?? ()
=> 0x0000000000000908: 06 fe ff 97 bl 0x120
(which is actually bl smp_processor_id (from src/arch/arm64/stage_entry.S))
Under arm64_c_environment (in objdump) we have;
10908: 97fffe06 bl 10120 <loop3_csw+0x1b>
Also in the trace we have
IN:
0x0000000000000908: 97fffe06 bl #-0x7e8 (addr 0x120)
Now loop3_csw is defined at (from objdump)
0000000000010105 <loop3_csw>:
So this + 0x1b = 10120
Thus it wants to branch and link to 0x120 but smp_processor_id is at 121.
smp_processor_id is at (from objdump)
0000000000010121 <smp_processor_id>:
This gives us where the code is failing. Next up is finding out the reason for this misalignment and rectifying it.