I'm still getting lockups #1

Closed
greyltc opened this issue Sep 15, 2019 · 4 comments

@greyltc

greyltc commented Sep 15, 2019

arduino/ArduinoCore-avr#42
@matthijskooijman

I've been struggling with exactly what you describe in your comments here. In my one-master, one-slave bus, my slave periodically (about 1 time in 100,000) sends its ACK one clock pulse too soon, after which my calls to the stock Wire library never return and my firmware is toast. I've dropped the code in this repo in place of my Wire library, but I'm still getting the lockups! I'd be happy for any advice you can offer me. I've been in #arduino on freenode.net IRC for a few days trying to figure this out and I've gained a ton of conceptual understanding of the lockups and why they might happen, but I haven't yet arrived at a solution. Join if you can!

The comment here says you think the master/slave desync is caused by noise on the bus. I'm not so sure it's noise. My leading theory right now is that glitches (tall, thin voltage spikes) I've seen on SDA just before or after many of the slave/master ACK/NAK bit handoffs are causing the desync. I often see ~200ns gaps (with 400kHz SCL) during these handoffs where neither the master nor the slave actually does the job of pulling down SDA (when both agree that it should be low). During these gaps, my pullups cause little spikes in the waveform. Depending on what SCL is doing at the time, these could be interpreted by either member of the bus as start/stop/repeated start conditions or something else I haven't thought of. Here's what I'm talking about:
[Oscilloscope capture: DS1Z_QuickPrint9]

  • SDA is channel 1 (yellow)
  • SCL is channel 2 (magenta)
  • The low to high clock transition centered on the capture is for the ACK/NAK bit which in this case is sent by the master. So here, I think the master (Arduino MEGA2560) waits 200ns too long to take control of the SDA line from the slave for its NAK.

Similarly, for the opposite ACK/NAK handoff, where the master has just sent 8 bits to the slave and the slave takes the bus to send its ACK/NAK bit to the master, the master (Arduino) waits 200ns too long to take SDA back afterwards, causing a 4.2V spike that should not be there:
[Oscilloscope capture: DS1Z_QuickPrint12]

  • Here again, the low-->high clock transition centered on the trace is for the slave to send its ACK/NAK bit to the master.

In summary: my slave takes control of SDA pretty much right at the falling edge of SCL, which makes sense to me (though I actually can't find the requirement for this in the spec). The Arduino waits 340ns after SCL has fallen to take control of SDA after a NAK/ACK, which leaves SDA uncontrolled for 200ns and causes unwanted glitches/spikes. These might be the root cause of the lockups that the timeout approach in this repo is meant to recover from.
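For a sense of scale, the timing figures above can be put next to the SCL period. This is only a rough arithmetic sketch: the 400kHz clock and the 200ns and 340ns durations are the values quoted in this comment, and the helper names (`sclPeriodNs`, `percentOfPeriod`) are made up for illustration.

```cpp
#include <cstdint>

// Bus clock stated above: 400 kHz SCL.
constexpr std::uint32_t kSclHz = 400000;

// Durations read off the scope captures, in nanoseconds.
constexpr std::uint32_t kUndrivenGapNs = 200;  // SDA driven by nobody
constexpr std::uint32_t kMasterDelayNs = 340;  // SCL falling edge to master driving SDA

// One full SCL period in nanoseconds.
constexpr std::uint32_t sclPeriodNs(std::uint32_t sclHz) {
    return 1000000000u / sclHz;
}

// A duration as a whole-number percentage of one SCL period (truncated).
constexpr std::uint32_t percentOfPeriod(std::uint32_t durationNs, std::uint32_t sclHz) {
    return durationNs * 100u / sclPeriodNs(sclHz);
}

static_assert(sclPeriodNs(kSclHz) == 2500, "400 kHz gives a 2.5 us period");
static_assert(percentOfPeriod(kUndrivenGapNs, kSclHz) == 8, "200 ns is 8% of the period");
static_assert(percentOfPeriod(kMasterDelayNs, kSclHz) == 13, "340 ns is about 13% of the period");
```

So the undriven gap is a small but not negligible slice of each clock cycle.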

@matthijskooijman
Collaborator

Hey @greyltc, this completely dropped off my radar. I've finally taken some time to look at your analysis now...

I'm not sure if what you say is correct, because:

  • SDA is not allowed to change when SCL is high (that would be a start/stop condition), so I guess doing the handover a bit after SCL went low might actually be intentional and sane. (For a cleaner signal, you could also do the handover before SCL goes low when SDA is already low, and after SCL goes low when SDA is high, but I think that is not actually needed.)
  • A start/stop condition occurs when SDA changes when SCL is high. Here, SCL is low, so this glitch could not be seen as a START/STOP that would suggest a second master.
  • An arbitration error is caused when the master wants to write HIGH, but observes a LOW signal on either line (which indicates another master is also writing). This glitch does not look like that.
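The start/stop rule from the points above can be written down as a tiny classifier. This is a minimal sketch with hypothetical names (`BusEvent`, `classifySdaEdge`), not code from the Wire library or the TWI hardware; it only encodes the I2C rule that an SDA edge counts as START or STOP solely while SCL is high.

```cpp
// Possible interpretations of a single SDA transition.
enum class BusEvent { None, Start, Stop, DataChange };

// Classify one SDA transition given the SCL level at that instant.
// Levels are plain booleans: true = high, false = low.
BusEvent classifySdaEdge(bool sclHigh, bool sdaBefore, bool sdaAfter) {
    if (sdaBefore == sdaAfter) {
        return BusEvent::None;        // no edge on SDA at all
    }
    if (!sclHigh) {
        return BusEvent::DataChange;  // SDA may change freely while SCL is low
    }
    // SDA edge while SCL is high: falling edge = START, rising edge = STOP.
    return sdaAfter ? BusEvent::Stop : BusEvent::Start;
}
```

Under this rule, a spike on SDA that rises and falls entirely while SCL is low yields two `DataChange` events and never a `Start` or `Stop`.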

So, I'm not sure if the glitch you're seeing is actually related to the lockup.

In our case, we were quite positive our lockup was noise-induced, since it only happened when a nearby brushed DC motor was running and became better when we improved our cable shielding. That might not mean that there are no other causes, of course.

In our case, the problem was also fixed by this forked version of the library, because it simply checks for the hardware no longer responding, regardless of the cause (which I think is the only fool-proof solution here). So I'm actually surprised that you were still seeing the lockups in your testing.
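The general idea of checking for hardware that no longer responds can be sketched as a bounded busy-wait. This is an illustration under assumptions, not the fork's actual code: `waitWithTimeout` and its poll-count limit are hypothetical, and a real implementation would more likely compare elapsed time than count iterations.

```cpp
#include <cstdint>

// Poll `done` until it returns true or `maxPolls` polls have been made.
// Returns true if the condition was met, false if the wait timed out.
// `done` stands in for a check such as "has the hardware set its ready flag?".
template <typename Predicate>
bool waitWithTimeout(Predicate done, std::uint32_t maxPolls) {
    for (std::uint32_t i = 0; i < maxPolls; ++i) {
        if (done()) {
            return true;   // hardware responded in time
        }
    }
    return false;          // give up instead of spinning forever
}
```

The caller can then treat a `false` return as "the bus is stuck" and reset or re-initialise the peripheral, regardless of what caused the hang.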

@greyltc
Author

greyltc commented Mar 28, 2020

> So, I'm not sure if the glitch you're seeing is actually related to the lockup.

Yeah, I totally agree with that. I'm not sure either. These spikes are the only suspicious behavior I can find on my bus though which is why I'm keying in on them. Maybe they're harmless since they only occur when SCL is low (though it's only been low for a few hundred ns) and far away from a low-->high transition. Maybe they're expected behavior?

Since my problem is not fixed by your fork and I'm not operating in a noisy environment, I think it's probably safe to say that your lockup issue does not have the same root cause as mine. I'd be curious to know whether your lockup issue is solved by the PR arduino/ArduinoCore-avr#107 that I made to solve my issue, though. My PR is meant to be a universal bandaid for anything that unexpectedly causes the Wire library to get stuck in one of its while loops.

@greyltc
Author

greyltc commented Mar 28, 2020

Closing as likely unrelated

@greyltc greyltc closed this as completed Mar 28, 2020
@matthijskooijman
Collaborator

I will try to test your PR on our hardware (first trying the unpatched library to see if the problem is still there). However, I can't make any promises about that: due to Corona I do not currently have access to the hardware, and once that clears up, I think we'll have other priorities first...
