Talos II - 0.5.0 - One CPU : Doesn't boot #80
Comments
The bug is here. |
@tlaurion As said on the call, we do not have single CPU setup right now, and it would be troublesome to switch back and forth. Thanks for the report, it is very valuable. Maybe we can try with one or two iterations of fixes which you can test on your side. |
I made function that returns mask of chips to return Ignoring second CPU isn't the same as installing only one, but it's a good approximation. We might want to make a build option for that or do something else that allows limiting number of CPUs. |
@SergiiDmytruk Thanks. @tlaurion Binaries from this MR are here: https://github.com/Dasharo/coreboot/suites/6282695824/artifacts/224773170 |
For reference, my hostboot logs on same machine
|
root@talos:~# cat /var/log/obmc-console.log
|
We also have 8 GiB stick at There is also passing memory data from romstage to ramstage, but it looks quite reliable. @krystian-hebel any ideas what can go wrong on moving to ramstage and how to debug it? Setting log level to "spew" doesn't seem to add anything to the output.
DD21. Hope we didn't mess up with attribute values for different DDs... @tlaurion Does it work with two CPUs or you haven't tried it? |
@SergiiDmytruk didn't try 2 cpus, no. For the moment, I am testing one CPU one ram module, basically Talos lite setup. Shipped board was rejected for sale since some (unknown) ram slots were non-functional. I could buy more RAM to test second CPU later on if there is a way to test one CPU setup even if 2 CPUs are inserted. I would not see myself adding and removing those often and would prefer testing one CPU setup here until you guys can software disable the second CPU and move from there. |
Seems it hangs during loading of ramstage, because final romstage line ( First step of debugging would be checking in the GUI whether it checkstopped or the watchdog "just" timed out. Then, to pinpoint what exactly checkstops, one has to read the appropriate SCOMs in search of the failure. Maybe it is worth preparing a script for dumping all of those, although this kind of depends on installed hardware (e.g. which cores on the CPU are enabled) and I don't know if reading all registers is safe. |
@tlaurion can you try installing DIMM in other slots? This looks like memory issue, but it's hard to tell whether it is hardware or configuration issue at this point. On the other hand memory training seems to be successful... |
@krystian-hebel Hostboot boots, as per #80 (comment) provided logs. The test report that came with the board stated that hostboot test run failed on some other ram slots, but the one stated per user manual for cpu0 one ram module is working. |
@krystian-hebel I have not changed jumpers either, if that is a needed thing (secure boot is on by default). I thought it was not, since booting from mounted ROM. But if anything else needs to be done, please advise. |
@tlaurion could you log in the GUI after it hangs and check if |
Four errors are reported from hostboot boot, yes. Export in json from that page: Hostboot boot reported (snippets) from above logs:
and
|
Right. |
@tlaurion those events (unfortunately?) aren't checkstops, which may make debugging harder. Most likely those are just logs from those errors reported by Hostboot. Feel free to clear that log so it won't keep showing them. Still, I'd like to ask you to dump some SCOM registers that would help if it were a checkstop, because some errors may still be masked at that time:
# Read every SCOM address listed in /tmp/fir_scoms.txt; echo all other lines as-is
while read -r scom; do
    if [[ "$scom" == "0x"* ]]; then
        pdbg -P pib0 getscom "$scom"
    else
        echo "$scom"
    fi
done < /tmp/fir_scoms.txt > /tmp/scom_dump.log
It is important to get the full dump from the first run: if there is a problem in one place, trying to read additional registers may report further errors, and that can obscure the real issue. Ideally those registers should be accessed in proper order, depending on results of earlier reads, but asking for them one at a time would take too long. |
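The dump loop above can also be wrapped in functions so that it can be tried out without hardware. This is only a minimal sketch: `is_scom_addr`, `dump_scoms`, and the `SCOM_READER` variable are hypothetical names introduced here; the `pdbg -P pib0 getscom` invocation and the `/tmp` paths are the ones quoted in the comment above.

```shell
#!/usr/bin/env bash
# Sketch only: function and variable names are illustrative assumptions.

# Command used to read one SCOM register. Defaults to the invocation
# quoted in the thread; override it to test without hardware.
SCOM_READER=${SCOM_READER:-"pdbg -P pib0 getscom"}

# Succeed only if the whole argument looks like a hex address (0x...).
is_scom_addr() {
    [[ "$1" =~ ^0x[0-9a-fA-F]+$ ]]
}

# Read lines from file $1: run the reader on addresses,
# print every other line unchanged.
dump_scoms() {
    local line
    while read -r line || [[ -n "$line" ]]; do
        if is_scom_addr "$line"; then
            # Intentionally unquoted so the reader command splits into words.
            $SCOM_READER "$line"
        else
            printf '%s\n' "$line"
        fi
    done < "$1"
}

# Usage on the BMC (paths as in the thread):
#   dump_scoms /tmp/fir_scoms.txt > /tmp/scom_dump.log
```

Note that this check is stricter than the original `"0x"*` prefix match: a line such as `0x500f001c extra text` would be echoed rather than read.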
The problem here, in testing:
is that the machine bootloops, so that when I run the script:
Here is the output of it anyway, prior to it giving pdbg: |
IIRC it would stop bootlooping after 3 or 4 tries (there is a way to reduce that number, but Hostboot doesn't like it), but I think all important registers were dumped. I'll take a look and will let you know when I find anything. |
For some reason, next boot didn't hit a bootloop. Running:
obmc-console.log Here is scp'ed logs |
Maybe we should consider some scripts or guides on which information to dump and how to do it, to help us debug issues. |
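As one possible starting point for such a script, the sketch below copies a set of log files into a timestamped directory so that they can be fetched with scp in one go. `collect_debug_bundle` is a hypothetical helper, not an existing tool; the file paths in the usage line are the ones mentioned in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: collect_debug_bundle is a hypothetical helper.

# Copy whichever of the given files are readable into a timestamped
# directory under $1 and print that directory's path.
# Missing files are recorded in MISSING.txt instead of aborting.
collect_debug_bundle() {
    local outroot=$1
    shift
    local stamp dest f
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$outroot/debug-$stamp"
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [[ -r "$f" ]]; then
            cp "$f" "$dest/"
        else
            printf 'missing: %s\n' "$f" >> "$dest/MISSING.txt"
        fi
    done
    printf '%s\n' "$dest"
}

# Usage on the BMC:
#   collect_debug_bundle /tmp /var/log/obmc-console.log /tmp/scom_dump.log
```

Recording missing files rather than failing keeps the bundle usable when a run never got far enough to produce a SCOM dump.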
@macpijan would it be useful to have scom dump from successful hostboot booting petitboot? |
Those shouldn't matter, I kind of know what to expect there. What would help would be fuller dump when it hangs and a checkstop happens (i.e. when second register on the list, I don't know how #80 (comment) didn't get to the checkstop, instead it just stopped in a random place. |
@krystian-hebel: not sure what a checkstop is. (0x500f001c is zero here) The logs collected here were collected from machine not stuck in a bootloop. The machine is still on now, not having rebooted automatically. So I understand that to test and report the proper logs, it is required to boot successfully from hostboot first (to reset state and produce bootloops, for which a checkstop is what is triggered when a bootloop occurs and where system reboots automatically?) fir_scoms_issue80.txt |
Thanks for the clarification. There are org.open_power.Host.Boot.Error.Checkstop when booting coreboot, but they only contain PIDs and seemed irrelevant. Content of such logs from last boot attempt were: |
@krystian-hebel anything else you would need ? |
@tlaurion I updated binaries at https://cloud.3mdeb.com/index.php/s/A8qNefnkaBYDbsT, please check with those. I expect it to die on https://github.com/Dasharo/coreboot/blob/raptor-cs_talos-2/ram_debug_fixes/src/soc/ibm/power9/istep_13_11.c#L896. If that is the case, I would like to see both console dump and SCOM dump, if it doesn't hang there just the console output (may be just from start of istep 13.11 onward) will be enough. As the comment above line from link says, there is a lot of code in the workaround missing. I haven't implemented it because all of our DIMMs were nice enough to not require it, so I had no way of testing whether it works. Seems that this may have just changed 🙂 |
Will be reusing:
Output:
No checkstop?
Attached are the scom logs: |
@tlaurion to exclude possibility of bad support for CPU, could you test if it works (or at least behaves differently) when this DIMM is installed in other slots, even though instruction tells to use this one? |
@macpijan : user's manual suggests memory bank population for one DIMM to be in B1, on one/two CPU configuration. You suggest that I put my single M393A1K43BB0-CRC in what other slot? EDIT: outside channel answer of @krystian-hebel "Any one should do." |
hmmm. Memory stick in D1:
|
Memory stick now in A1. It seems that the issue above was linked to components being faulty and the Talos deactivating those components to continue on other working ones as described here. To make A1 work, it was necessary to wipe from instructions: Note: On my board, D1 slot is faulty (let's remember the donation came from a board that failed DUT_HW_GUARD_FAILURE), with the device Test report specifying that "2x ram slots fail". What we know as of now is that A1 and B1 (blue slots, under CPU0) work, and hostboot is able to train (as coreboot) the M393A1K43BB0-CRC stick. It seems that the CPU revision in my setup is not known from DEFCONFIG (from Hostboot):
|
@krystian-hebel @macpijan some updates? |
Interesting find. Soon after entering ramstage processor hits an unknown instruction and as a result exception happens. Normally, at that point there should be no interrupt handlers installed, but for some reason around that particular exception (@ 0x840) I see something that looks like proper PPC64 code:
From there it jumps into unused memory (all 0s), hits another exception and goes back into this handler. There are some open questions that I'll try to debug more:
On memory training front: there are no errors when only 1D write centering is performed, instead of 2D centering. This is one of the workarounds Hostboot does when full algorithm fails, but according to my notes all registers that Hostboot tests for failure are reporting a success, even for 2D training. I will have to dig into Hostboot's code again and compare it with my notes. Nevertheless, even with forced 1D write centering, the problem with "random" data in interrupt handlers' area still happens. Other parts of RAM are properly zeroed so it is most likely working as it should. It may be a different issue altogether and this is what I'm assuming right now. At least at that point we have RAM and can install our own interrupt handlers to help with debugging, assuming they won't get overwritten. @SergiiDmytruk constructed a list of differences between different DD revisions of CPU, but it seems that Hostboot doesn't always use everything it defines. There may be more code depending on DD versions that doesn't use attributes listed there. Some things are set before FAPI (used to access attributes, among other things) is enabled, like HRMOR or XIVE. |
More findings:
|
I just hope we don't chase broken hardware here. Please note we may have some more remote setups of Talos II for comparison of behavior soon. I'm working on that. |
@pietrushnic Hostboot works with this hardware, so even if it is not perfect, it should at least hit one of TODO workaround paths in coreboot. For some reason this does not happen, no error is detected until it breaks fatally. |
@krystian-hebel understood, that means we need improvements on our side and we don't know what the heck is going on. I hope you can invent some diagnostic steps. Also, maybe we should redirect this to the mailing list; maybe someone there would have ideas on how to debug that further? |
@tlaurion I have another request: I need logs from |
Full logs provided off channel. |
@krystian-hebel any updates? |
Sorry for the delay, I was busy with other projects and then reading the provided log (almost 200k lines). Based on it, and additional SCOM dumps performed on the platform, I have some rough idea of what happens. There were actually multiple errors (11 to be exact) reported in FIR, but some of them were probably caused by trying to load an exception handler that does not exist. We messed up a bit during simplification done to various isteps, like this one. As a result, a bit that tells that the current CPU is the only one in the system is not set. Because of that, when code tries to write to RAM, a message is sent to the other CPU that it should flush and invalidate its L3 cache. That other CPU does not exist, so the main CPU doesn't receive an acknowledgement to that message. With a quick and dirty fix in place (hardcoding it to the 1 CPU version) "only" 4 errors were reported. Further issues were caused by misunderstanding what I also noticed some differences that will become important when we get to OCC initialization, but we're not that far just yet. Also, if OCC was (partially) started earlier, it should be able to gather FIR SCOM, basically what I tried to get with scripts a few comments above. This would however require starting OCC (at least) twice, and as I didn't think it would be necessary, we decided to skip it in coreboot. AFAIK there is no ready-to-use tool for parsing those dumps. Right now we're about to remove one CPU from our platform and see if we get exactly the same issue, as at least some of these problems should happen on every 1 CPU platform, if I understood it correctly. |
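The "only CPU in the system" condition described above can be illustrated with a chip-presence mask like the one mentioned earlier in the thread. This is purely an illustration of the logic, not the coreboot code (which is written in C): the function names are hypothetical, and the layout "bit N set means chip N is present" is an assumption.

```shell
#!/usr/bin/env bash
# Illustration only: names and mask layout are assumptions, not coreboot's.

# Count set bits in a chip-presence mask (assumed: bit N set = chip N present).
count_chips() {
    local mask=$(( $1 )) n=0
    while (( mask != 0 )); do
        mask=$(( mask & (mask - 1) ))   # clear the lowest set bit
        n=$(( n + 1 ))
    done
    printf '%d\n' "$n"
}

# Succeed when exactly one chip is present, i.e. when the
# single-CPU setting would have to be applied.
is_single_chip() {
    [[ "$(count_chips "$1")" -eq 1 ]]
}

# Usage:
#   if is_single_chip 0x1; then echo "single-chip configuration"; fi
```

Deriving the flag from the detected mask, rather than hardcoding it, would cover both one- and two-CPU boards with the same code path.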
Update: with additional changes coreboot is able to boot to ramstage, now it stops at Final cause was wrong Power Bus frequency. Hostboot did read it from MVPD, which we mimicked in coreboot, but what we hadn't noticed is that the value is overwritten with a hardcoded one later. The hardcoded value is identical to the value from MVPD for our CPU, which is why it worked in the first place. WOF is another issue altogether, but I think it shouldn't be too hard to fix. We just didn't think that there are processors which aren't included in the WOF table, but this is what community testing is all about: catching such corner cases. |
Latest tests with unreleased version supported my single CPU! Current issue now is to have Heads payload output to vga console, which seems to miss either AST+DRM in kernel config and/or proper skiboot passed arguments. Off-channel notes:
|
@krystian-hebel I'm searching for the tool that was developed to sit on the BMC and collect the logs without nohup. Can you tag me and point me to where it is? That should be added to a debugging page for the Talos II board. |
I think it was here: https://github.com/3mdeb/openpower-coreboot-docs/pull/74/files Please let me know what you believe would be the best place to put it |
@macpijan : https://docs.dasharo.com/variants/talos_2/overview/ should have minimally a link to https://github.com/3mdeb/openpower-coreboot-docs But a debugging page draft would be useful, pointing directly to https://github.com/3mdeb/openpower-coreboot-docs/blob/main/devnotes/scat/README.md to facilitate bug reporting? |
Maybe we should move all user-level documentation to |
Dasharo version
0.5.0 release from https://docs.dasharo.com/variants/talos_2/releases/
Dasharo variant
Workstation
Affected component(s) or functionality
Memory initialization fails ( 1x M393A1K43BB0-CRC in B1 memory slot)
Brief summary
The coreboot output stops at
How reproducible
Always (Single 16 cores CPU, one RAM module: 8GB
(More info on changes needed to be documented on non-flashing instructions under #79)
How to reproduce
Steps to reproduce the behavior:
On laptop:
On a separate SSH connection to BMC:
On another separate SSH connection to BMC:
Expected behavior
RAM init succeeds and next steps are engaged
Actual behavior
Stops at