Resolved serial flash issues for legacy devices #172

MartyMacGyver · 2017-01-23T11:36:07Z

Resolves issue #136

Attempts normal connections, then attempts legacy connections
Improves legacy connection compatibility
Doesn't slow down non-legacy connect time

I've tested this in Windows (in the latest MSYS2 as well as a compiled .exe with Arduino-ESP32 in Arduino - it worked equally well with both). Note that I used the latest Silicon Labs and FTDI drivers for the respective boards, without enumeration mode.

I tested against the Core Board v2 and the ESP-WROVER-KIT. In both cases the time to connect was minimal for the given board (the Core Board requires the delay, so it takes longer to connect to, but because the normal attempt always runs first, the WROVER-KIT board connects as quickly as usual).

The net effect is that the Core Board v2 can finally be reliably flashed now via Windows without manual intervention. This may similarly positively impact older boards without requiring extra command-line options such as esp32r0.
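As a rough illustration of the fallback design described above (normal attempt first, workaround only if that fails), here is a minimal sketch. `FakeBoard`, `try_sync` and `connect` are hypothetical names made up for this example and are not the code in this PR.

```python
import time


class FakeBoard:
    """Hypothetical stand-in for a serial-attached board (not part of esptool)."""

    def __init__(self, needs_workaround):
        self.needs_workaround = needs_workaround
        self.attempts = []

    def try_sync(self, workaround):
        # Record what was tried; succeed only if the board's needs are met.
        self.attempts.append(workaround)
        return workaround or not self.needs_workaround


def connect(board, retries=3, sleep=time.sleep):
    """Try a normal connection first, then fall back to the slower workaround."""
    for _ in range(retries):
        for workaround in (False, True):
            if board.try_sync(workaround):
                return workaround  # tells the caller which path succeeded
            sleep(0.05)  # brief pause before the next attempt (assumed value)
    raise RuntimeError("failed to connect")


fast = FakeBoard(needs_workaround=False)
slow = FakeBoard(needs_workaround=True)
print(connect(fast, sleep=lambda s: None))  # False: normal attempt worked
print(connect(slow, sleep=lambda s: None))  # True: fallback was needed
```

Because the normal attempt always runs first, a board that does not need the workaround never pays for the extra delays.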

MartyMacGyver · 2017-01-25T08:18:08Z

Note: I didn't pull out the previous workaround because it's not quite clear what it works for, if it worked right / reliably at all (I don't have a Core Board v1 to test against but it's quite possible that that specific code and command line option for that is superseded by this). I know esp32r0 did not help with the Core Board v1.

projectgus

Thanks for submitting this. I had considered doing something like this as well, but I was worried about burdening all of the non-esp32r0 devices (all ESP8266s, and all ESP32s after r1 comes out shortly) with an esp32r0-only workaround. I think what you've done here is a good compromise though, and given how extensive this problem appears to be it seems like a good fix.

Left a few comments and I haven't run the automated tests with these changes yet, but generally looks like something that should be merged.

projectgus · 2017-01-27T00:14:45Z

esptool.py

@@ -653,7 +668,8 @@ def set_data_lengths(mosi_bits, miso_bits):
        if data_bits == 0:
            self.write_reg(SPI_W0_REG, 0)  # clear data register before we read it
        else:
-            data = pad_to(data, 4, b'\00')  # pad to 32-bit multiple


The commit seems to have accidentally reverted a bunch of recent unrelated changes on master. Suggest doing something like a git reset HEAD^, then git add -p and stage only the relevant changes, then redo the commit as git commit --reuse-message=2b4bd and force push.

Will definitely fix that - not sure what happened there. :-/

I see the problem now - esp-idf is using an older version (e9e917) so I didn't see that master here has something newer. Will get it straight with the rest of the fixes here.

projectgus · 2017-01-27T00:16:10Z

esptool.py

@@ -281,6 +281,53 @@ def sync(self):
        for i in range(7):
            self.command()

+    def _connect_attempt(self, mode='default_reset', legacy=False):


I see why you called this parameter "legacy" (and "legacy" in the comment) but I think it's better named specifically as the esp32r0_workaround or something like that. It's not really clear what a "legacy" device is in the esptool context (ESP8266s are arguably a legacy device, but they are still fully supported.)

Question: is what's on the Core Board V2 the esp32r0? Either way, yes, easily renamed... but if this is the same problem esp32r0 solves then I may be able to simplify this a bit further so it's cleaner.

Yes, the only esp32 revision released so far is esp32r0. Revision 1 ("esp32r1") is coming in a month or two.

This "workaround" for these reset problems (waiting an extra 400ms for a second reset due to a spurious watchdog timeout) will only ever work on esp32r0, as it's actually a silicon bug. However provided dev board manufacturers increase the capacitance on the EN pin for subsequent board designs, no workaround should be needed for future revisions.

I'm still in favour of keeping the --before esp32r0 option in the command line interface, though, I think. Although maybe it's easier to remove it entirely for now.

I'd prefer to remove the option either now or in a future PR, which would deprecate ESPTOOL_BEFORE as well as the --before option - only those who explicitly use --before esp32r0 would be affected, and only to the extent that they'd just need to not use the option.

projectgus · 2017-01-27T00:17:44Z

esptool.py

+                time.sleep(2.0)  # Need to sleep longer here!
+            self._port.setDTR(True)   # IO0=LOW
+            self._port.setRTS(False)  # EN=HIGH, chip out of reset
+            if mode == 'esp32r0' or legacy:


Agree with your decision to keep the esp32r0 mode here. I don't think it will be necessary to use it after this change lands, but it does make the successful reset happen slightly faster and it also keeps the door open for removing the automatic retry behaviour in some future release, maybe in a year or two when the vast majority of ESP32s are r1 or newer (and hopefully on dev boards where more capacitance is added to the EN pin.)

I'm thinking of how to converge this a bit if what's on a Core Board v2 is esp32r0 (as it seems might be the case). What's on the ESP-WROOM-KIT currently (r0 or r1)?

Also, for older boards esp32r0 mode as an option only saves a fraction of a second (the default attempt is very quick). Given all the info I have now, it would be more maintainable to eliminate this named mode now or eventually.

projectgus · 2017-01-27T00:21:35Z

esptool.py

+            self._port.setDTR(False)  # IO0=HIGH
+            self._port.setRTS(True)   # EN=LOW, chip in reset
+            if mode == 'esp32r0' or legacy:
+                time.sleep(2.0)  # Need to sleep longer here!


Two things:

If the mode isn't esp32r0/legacy, then this change removes the sleep while EN=LOW. Probably this won't matter because Python+OS+USB creates enough latency to create a meaningful reset period, but I'd prefer to at least keep the old 50ms sleep here.

Does the esp32r0 really need to be held in reset for a whole 2 seconds? That is very unexpected!

I'd prefer to remove the if and slightly increase the sleep time for all modes, maybe to 100ms, provided this works OK in your testing.

I will look into restoring that delay.

As for the two seconds, even then it sometimes misses on the first pass (but rarely). I can try some different values and see what happens with that, though I don't believe it can be much less.

As for the if itself, it'll still need to be there since that mediates whether the extra delay is present for older devices. Or do you mean the other if that's potentially obsoleted by this?
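To make the sequence under discussion concrete, here is a sketch of the reset steps with the 50ms sleep kept for all modes and the longer holds applied only in the fallback path. The 2.0s value comes from the diff and the 0.4s value from the 400ms mentioned earlier in this thread; the final DTR release, the `workaround` flag and `RecordingPort` are assumptions for illustration.

```python
import time


class RecordingPort:
    """Hypothetical fake with the two pyserial-style methods used in the diff."""

    def __init__(self):
        self.log = []

    def setDTR(self, state):
        self.log.append(("DTR", state))

    def setRTS(self, state):
        self.log.append(("RTS", state))


def reset_to_bootloader(port, workaround=False, sleep=time.sleep):
    port.setDTR(False)  # IO0=HIGH
    port.setRTS(True)   # EN=LOW, chip in reset
    # Always hold reset briefly; hold much longer in the fallback path.
    sleep(2.0 if workaround else 0.05)
    port.setDTR(True)   # IO0=LOW
    port.setRTS(False)  # EN=HIGH, chip out of reset
    if workaround:
        sleep(0.4)      # span the spurious esp32r0 watchdog reset
    sleep(0.05)
    port.setDTR(False)  # IO0=HIGH again (assumed final step)


sleeps = []
port = RecordingPort()
reset_to_bootloader(port, workaround=True, sleep=sleeps.append)
print(port.log)
print(sleeps)  # [2.0, 0.4, 0.05]
```

Passing `sleep` as a parameter keeps the sketch testable without waiting for real delays.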

MartyMacGyver · 2017-01-28T00:15:22Z

I've made the requested changes (and fixed the earlier merge problem).

As uploads are never a problem on the ESP-WROOM-KIT I think this is technically more of a board capacitance delay workaround than an r0 thing in particular.

Either way, it doesn't hurt to have the fallback mode (the ESP-WROOM-KIT doesn't end up falling back to it; only the Core Board v2 does, and consistently). If someone is using an esp32r0 in some other configuration they will get a fractionally longer delay to connect (and if they used the same reference design, then they're likely going to have the same capacitance issue).

I'll be curious to see how this works for you against various boards. Note that '_' during connect signifies attempts using this or the explicit esp32r0 workaround.

projectgus · 2017-01-29T23:38:51Z

As uploads are never a problem on the ESP-WROOM-KIT I think this is technically more of a board capacitance delay workaround than an r0 thing in particular.

I think you probably already understand this, but I'm going to repeat it in case I've explained it poorly: The cause of the bug is definitely insufficient capacitance on the EN line. This workaround in software will only ever work on esp32r0, because it exploits a silicon bug (the board resets twice during a normal boot, once when EN goes high and then ~350ms later when a spurious watchdog reset occurs.) By holding the GPIO0 line low for 400ms we make sure that even if the ESP32 misses the initial pull of GPIO0 low it will pick it up on the spurious watchdog reset.
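The timing argument above can be written as a small model. This is a simplified sketch: it assumes the chip samples GPIO0 at each boot and that catching it low on any boot is enough, and the 10ms lateness is an invented figure for the example.

```python
def enters_download_mode(io0_low_from, io0_low_until, boot_times):
    """True if GPIO0 is low at any moment the chip boots and samples it.

    All times are seconds relative to EN going high.
    """
    return any(io0_low_from <= t <= io0_low_until for t in boot_times)


# esp32r0 boots twice: once when EN goes high, once ~350 ms later (watchdog).
ESP32R0_BOOTS = [0.0, 0.350]
# A chip without the silicon bug boots only once.
SINGLE_BOOT = [0.0]

# GPIO0 arrives 10 ms late (hypothetical latency), held low for 50 ms.
print(enters_download_mode(0.010, 0.060, ESP32R0_BOOTS))  # False
# Same late arrival, but held low for 400 ms: the second boot catches it.
print(enters_download_mode(0.010, 0.410, ESP32R0_BOOTS))  # True
# Without the second boot, the longer hold does not help.
print(enters_download_mode(0.010, 0.410, SINGLE_BOOT))    # False
```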

Once esp32r1 comes out, the capacitance on EN will still be a factor but the extra 400ms delays won't do anything apart from delay inevitable failure. However hopefully all boards with insufficient capacitance on EN also have revision 0 chips - Espressif has already updated its designs (as you've noticed with ESP-WROVER-KIT v2, still esp32r0 but sufficient capacitance) and I believe most third parties also have sufficient capacitance in their current designs.

No doubt someone will miss this bug, copy an older ref design and release an incompatible board with revision 1 ESP32 - but unfortunately there's no possible fix for that apart from manually pressing reset.

I'd prefer to remove the option either now or in a future PR, which would deprecate ESPTOOL_BEFORE as well as the --before option - only those who explicitly use --before esp32r0 would be affected,

Fair enough, let's remove "esp32r0" before v2.0 final release. The --before/ESPTOOL_BEFORE has to stay because it also has the --before no_reset option to disable any auto-reset functionality. This is useful for other configurations.

I can make those changes separately to this MR, if you prefer.
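For reference, the option shape being proposed (the esp32r0 choice removed, no_reset kept, ESPTOOL_BEFORE still honoured) might look roughly like this. It is a standalone sketch, not esptool's actual argument parser; `build_parser` and the help text are made up for the example.

```python
import argparse
import os


def build_parser(environ=os.environ):
    """Build a parser whose --before default can come from ESPTOOL_BEFORE."""
    parser = argparse.ArgumentParser(prog="esptool-sketch")
    parser.add_argument(
        "--before",
        choices=["default_reset", "no_reset"],
        default=environ.get("ESPTOOL_BEFORE", "default_reset"),
        help="What to do before connecting to the chip",
    )
    return parser


print(build_parser({}).parse_args([]).before)
print(build_parser({}).parse_args(["--before", "no_reset"]).before)
```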

MartyMacGyver · 2017-01-30T01:00:09Z

I have a longer reply but first a question: between the current ESP32-DevKitC and ESP-WROVER-KIT, the latest schematics show the same capacitors... The base resistors though are quite a bit different. Is there a capacitor I'm missing here or is this down to the resistors and their effect on that circuit?

projectgus · 2017-01-30T01:34:38Z

I have a longer reply but first a question: between the current ESP32-DevKitC and ESP-WROVER-KIT, the latest schematics show the same capacitors

My mistake, I guess the updated capacitor is coming in ESP-WROVER-KIT V3. There are some circuit differences, but I expect the rise time on EN is the same on both (sorry I don't have time to confirm on an oscilloscope right now.)

I think the reason you don't see the bug on ESP-WROVER-KIT may come down to FTDI driver behaviour. That's the other factor in this bug - there's a race between when DTR is asserted (GPIO0->LOW) and when RTS is released (EN->HIGH) - if these happen close enough in time then everything is fine, because GPIO0 is already low when the ESP32 starts up. However if there's a gap between the two, the nature of the "auto reset circuit" is that while DTR & RTS are asserted, both lines go high - and if this window overlaps the ESP32 reading GPIO0, you get the bug. Increasing capacitance on EN means that the line stays low for longer when released, which covers this gap.

The CP2102 drivers on Windows seem to consistently introduce enough of a delay for this to be a problem. Similarly, in VM environments the extra latency over virtualised USB seems to trigger it. However, on my development Linux machine I almost never see any kind of problem - it only seems to occur if there happens to be an OS context switch between setDTR(True) and setRTS(False), and when that does occur it's usually correct on the next retry rather than failing outright.

Maybe the FTDI drivers on Windows don't have the same behaviour (they may batch both of the control line operations into a single USB request, in which case both lines will change simultaneously. Or maybe the change just usually happens close enough in time that it doesn't trigger the bug.)

If I get a chance then I'll try and get some scope captures for this behaviour on various OSes & boards.

MartyMacGyver · 2017-01-30T01:44:42Z

All that is just an effort to better characterize this fix and what it's accomplishing.

As for the general fact that the two boards have something different about the reset circuit, I do get that... it was more of whether the extra delay I've added is so much about triggering the watchdog as it is about compensating for the circuit difference.

The ESP-WROVER-KIT has 12x less resistance in that RC loop, which says to me it's got a shorter time constant when you trigger EN. Interestingly, simply pressing EN on that board can often lead into the boot loader multiple times... so from that perspective I wonder if the current circuit isn't too sensitive (whereas the Core Board V2 is so insensitive that it needs the extra time I added there). Perhaps the extra capacitance in v3 deals with all that.

Given that everything we're dealing with here is for r0, I was wondering about nomenclature - what I'm adding isn't so much more of the same r0 workaround as it is a workaround for slower reset circuits in general (and specifically, as they differ between the Core and WROVER boards). The two timeouts seem to do two somewhat different things that together make the Core Board v2 work reliably.

And yes, I meant removing the particular --before esp32r0 selection, not that whole switch. I can make that change later tonight.

On nomenclature (for within the code), let me look at the schematics and the serial protocol a bit too tonight - I want to be very clear in the code what's doing what in this alternate reset regime (one thing is an explicit r0 WDT delay that was already there as part of the old option, while what I've added is more of a "longer reset" thing to compensate for a slow EN or race condition). The fallback code will still do the same steps but it'll be clear what the new step actually accomplishes.

Naming suggestions for this are welcome. And again thanks for taking the time to look at this and explain what needed explaining! 👍

Edit: "latency workaround" has a nice ring to it, for the new delay...

projectgus · 2017-01-30T02:08:46Z

it was more of whether the extra delay I've added is so much about triggering the watchdog as it is about compensating for the circuit difference.

I honestly doubt this, but without putting on the scope/LA I can't say for sure.

I had some speculation about the limited effect of the different base resistor on Q1/Q2, but I can't explain the need to hold the board in reset for so long. So it's probably safest that I'll put both boards on the scope in the next couple of days and double check everything.

MartyMacGyver · 2017-01-30T02:54:41Z

I'm going to call out my extra delay as a latency workaround if that's ok. That and I will get rid of the esp32r0 argument to --before as above.

MartyMacGyver · 2017-01-30T09:59:40Z

Code and code comments clarified
Latency delay increased a bit (it wasn't triggering reliably at 0.8s and even 1.0s was somewhat unreliable)
Removed --before esp32r0 option
Updated README
Updated merge comments

projectgus · 2017-02-08T22:54:10Z

Hi @MartyMacGyver,

Sorry for the extended silence regarding this.

I've run about half the tests I wanted to run - on the WROVER board, I measure that fall time on GPIO0 is a bit slow (~15ms to reach 0V, less to cross the "low" threshold). However that's not enough to explain why you need such a long additional delay.

Here's a couple of scope captures - yellow trace is EN, blue trace is GPIO0.

First capture is of a "correct" reset (EN goes high before GPIO0 goes high):

Second capture shows "incorrect" timing on Linux (USB on host delays EN so both EN & GPIO0 transition at the same time).

Next step is to run some digital logic analyzer captures from Linux & Windows on both board types and try to figure out exactly what's different. I suspect maybe the Windows FTDI driver is delaying control line transitions until some internal timer expires. I've done a few of these captures, but haven't had time to do them all yet.

MartyMacGyver · 2017-02-10T04:42:41Z

As an aside, running certain code - particularly code with interrupts (all via Arduino-ESP32) can make it difficult to get to the boot loader even using the buttons. The fact that that varies at all suggests to me that the code you run matters here.

That said, it might be a weird edge case.... it's just interesting that code would have an effect on this response time at all.

projectgus · 2017-02-10T05:43:02Z

As an aside, running certain code - particularly code with interrupts (all via Arduino-ESP32) can make it difficult to get to the boot loader even using the buttons.

That shouldn't be technically possible. When the EN line gets pulled low (via DTR), the entire chip should go into a reset state regardless of what the CPU is doing.

Is the chip outputting a lot of serial data at the same time it's failing to go into bootloader mode? Maybe we're not flushing the serial buffers aggressively enough so esptool is reading old/garbage output and assuming the reset failed.

If you have some code that seems to manifest this problem, could you please post here or email it to me? angus at espressif. Thanks!

MartyMacGyver · 2017-02-10T21:06:28Z

It definitely goes into the reset state... but even with the BOOT button it becomes challenging to get into the boot loader state (ready to download). Right now that's the only circumstance when I've seen that happen, and I'm not sure why it happens. If I can reproduce it (I'm hopeful it wasn't just a transient problem), I'll ping you with the details.

EDIT: It's operator error apparently.

I have a device under test (a TCS34725) that was feeding in periodic interrupts (about two per second) to... GPIO2. Evidently that's a particularly problematic situation during boot. My handy pinout chart didn't call that out as particularly special, except for the pullup/downs which except for BOOT and RESET seemed noteworthy but not exceptional. It's much more explicit in the docs I dug up:

https://www.espressif.com/sites/default/files/documentation/esp32_chip_pin_list_en.pdf

That said, I wonder how often the strapping pins are leading to unexpected behavior for people, particularly those using them as inputs?

Note: I normally wasn't using these special pins, until this device arrived... I picked the wrong pin to try HW interrupts on one of them.

MartyMacGyver · 2017-02-11T00:11:06Z

Well, FWIW it still happens occasionally that the Core Board won't go into bootloader mode without a LOT of button pressing, even when I'm carefully avoiding the strapping pins. However, this is not happening in a consistent way so for now I cannot pinpoint it or accurately reproduce it. Is there anything that running code can do that would influence what happens during the next reset, or which would confuse the bootloader?

Improves connection compatibility - Attempts normal connection, then attempts connection with an extended reset delay and a delay that triggers a watchdog timeout (esp32r0-only) to more reliably enter download mode. Fallback design - initial connect is without workarounds - if that fails, the extra delays are used. Resolves issue espressif#136

MartyMacGyver · 2017-02-22T09:32:48Z

Just rebased and squashed. If you require further edits let me know - it'd be nice to see this merged when possible.

projectgus · 2017-02-23T01:09:00Z

Hi Marty,

Sorry again for the extended silence. Thanks for the rebase. I like the new factoring of latency_delay and esp32r0_delay.

Regarding esp32r0_delay, I'm happy to merge this more or less now as I understand the mechanics of it, and it fixes the problem nicely.

Regarding latency_delay, I honestly didn't believe this was a thing until I tested it. And you're right, this helps - although on ESP-WROVER-KIT only. I have some theories about why, but I'd rather not post more details until I'm sure. However as it's a flaw in a single revision of a single dev board on a single platform, I'm hesitant to bake it into the default behaviour of esptool - at least until we understand the mechanics of it better.

If you want to split the PR, I'm happy to merge the first part now. Otherwise I'm actively looking into the latency_delay behaviour today, so we'll hopefully get a firm picture of what's going on.

projectgus · 2017-02-23T01:10:04Z

Is there anything that running code can do that would influence what happens during the next reset, or which would confuse the bootloader?

Sorry, I missed this question. No, this shouldn't ever happen.

MartyMacGyver · 2017-02-23T02:29:14Z

Let me know what you learn about latency_delay - because that's what helps on the Core Board v2 (without it I must manually intervene) (I've been using that board more than the ESP-WROVER-KIT lately, for reasons of form factor). The cost of having this workaround baked in is virtually negligible since it's only active after the primary attempt fails (though I personally think as long as people are using any of these r0 boards these will be useful to have in the code).

MartyMacGyver · 2017-02-23T02:31:16Z

And as for my other question, that was the strapping pins' fault - if you wire them up and they get non-default levels during boot-up, things get screwy. (I thought at first it was the code somehow, unlikely as it was, because I didn't realize what all the strapping pins were.)

projectgus · 2017-02-23T03:48:57Z

Let me know what you learn about latency_delay - because that's what helps on the Core Board v2 (without it I must manually intervene) (I've been using that board more than the ESP-WROVER-KIT lately, for reasons of form factor).

Oh, OK. That's interesting. I don't find this makes any difference on the Core V2 boards I've tested, but it does seem necessary on my ESP-WROVER-KIT.

MartyMacGyver · 2017-02-23T04:36:15Z

Then it's safe to say that this has its uses for both production dev boards, while not adding any significant cycle time and eliminating the need for special toggles (which are not easily made when using things like Arduino-ESP32).

projectgus · 2017-02-23T04:53:53Z

An extra 1.2 seconds is fairly significant, as these things go. On Linux & OS X the initial reset fails sometimes as well, so other platforms are getting this additional delay even though it doesn't seem like they need them (and I still don't understand what's special about Windows in this particular case of latency_delay.)

More generally, from a maintainer's perspective it's problematic to merge fixes for problems when we don't fully understand what the problem is, or how we're fixing it. esptool.py probably has many more years of life ahead of it, and many more chips and revisions of chips. Without fully understanding them, these fixes tend to end up as the kind of magic code that is viewed as "don't touch that, it might be important". I'd rather understand why this works so we can document it properly.

I'm guessing you don't have access to a logic analyzer? I'd be very interested to see your Core Board's behaviour with and without the latency_delay change.

MartyMacGyver · 2017-02-23T05:25:24Z

I'm not sure what you want to see - it seems to me there is no small amount of variability between boards and components, otherwise your WROVER-KIT board would work fine, whereas my Core Board unfailingly requires an extended hold of the boot button to get it into the bootloader during a codeload with the non-revised code (and the defunct esp32r0 option has no effect whatsoever on that). The only thing that actually helps reliably and doesn't require manual intervention or hacking the Arduino-ESP32 codebase is the automatic fallback.

Specific to the occasional spurious failure as you noted on certain OSes and configs, it's already a given that that's going to be a bit costly in terms of times spent retrying even with the current code. As a fraction of the flash time it's barely noticeable, and as a fraction of the time spent having to re-run the flash manually it's negligible (and adding to that, make flash is not at all speedy, assuming you've already built the code... and the time spent unfolding and starting the executable version of esptool is itself quite long when that's in play).

In summary, occasionally adding a second or even two to a spuriously failed boot - let alone one that will fully fail every time because of component and design variances - is still a tiny price to pay for boards to consistently boot without the need to press buttons on it manually. (And with non-flakey boards, failed boots that will recover before hitting fallback are very rare events which means this would cost them almost nothing.) The extra time isn't anywhere near a long straw in any of these cases - and given the architecture and constraints I can't see flash time getting faster, nor do I see make flash or the executable itself getting noticeably faster either. It's a very small price to pay for consistency.

MartyMacGyver · 2017-02-23T05:31:31Z

I will see what kind of data I can get, if I can get anything useful given my setup here. I can tell you that on two different computers, using the latest Si Labs CP210x driver (6.7.4) without enumeration, the need for that delay is constant (that is, a shorter delay becomes marginal, and as stated previously it's the Core Board v2 that has the most noticeable problems - the WROVER-KIT is generally ok).

projectgus · 2017-02-23T06:25:31Z

non-flakey boards, failed boots that will recover before hitting fallback are very rare events which means this would cost them almost nothing.

On my system it's more than 50% of the time that a second reset is needed, no matter which board. Flashing a 400KB image compressed takes 5 seconds at 921600bps, so an extra 1.2 seconds compared to 0.2 seconds is a measurable difference.

That said, I totally agree that making esptool.py stable and reliable is the number one priority. Making it fast is a secondary priority.
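To put the figures above side by side, here is a quick calculation of how much of the total connect-plus-flash time is spent on the retry. It uses only the numbers quoted in this comment (5 seconds to flash, 0.2 or 1.2 seconds for the retry); `retry_share` is a name made up for the example.

```python
def retry_share(retry_s, flash_s):
    """Fraction of total time (retry + flash) spent on the retry."""
    return retry_s / (retry_s + flash_s)


FLASH_S = 5.0  # 400KB compressed image at 921600bps, as quoted above

for label, retry_s in [("plain retry", 0.2), ("fallback retry", 1.2)]:
    print(f"{label}: {retry_share(retry_s, FLASH_S):.0%} of total time")
# plain retry: 4% of total time
# fallback retry: 19% of total time
```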

I will see what kind of data I can get,

Thanks. I appreciate that you think this is not particularly useful but I would be very grateful for any data you can dig up. I've figured out a fairly simple test (no extra hardware required) that will confirm my hypothesis (see below).

I'm actively looking into the latency_delay behaviour today, so we'll hopefully get a firm picture of what's going on.

I've now compared six boards across three computers.

On Windows, Core boards always require the esp32r0_delay fix. For ESP-WROVER-KIT, my dedicated Windows PC doesn't require any fixes but Windows VM requires esp32r0_delay. On the Windows VM, one particular WROVER-KIT board sometimes also requires latency_delay (works approx 75% of the time with esp32r0_delay only).

Here's what I think is happening, based on some LA captures and experimenting:

The esp32r0 silicon bug where an extra RTCWDT_RTC_RESET happens ~300ms after the first reset doesn't happen consistently on all chips. On my "problem" board, it only happens sometimes.
Holding the problem board ESP32 in reset for longer than a second seems to cause the extra RTCWDT_RTC_RESET to trigger.
Therefore, this board is immune to the esp32r0 workaround unless it's held in reset for the extra time.

I only have one data point though. If you could please take your Core board and a serial terminal, this should be straightforward to compare:

When you press and release the EN button quickly, you should see something like

ets Jun  8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x16 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0x00
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0008,len:8
load:0x3fff0010,len:3488
(etc, etc.)

But if you hold the EN button for a couple of seconds then release, you get something like:

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
ets Jun  8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0x00
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0008,len:8
(etc, etc.)

Keen to find out if this is the same for your problem board.
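If it helps when comparing captures, the reset reasons can be pulled out of a saved serial log with a short script. This is a sketch that assumes the `rst:0x... (NAME)` format shown in the logs above; the function names are made up for the example.

```python
import re

# Matches e.g. "rst:0x10 (RTCWDT_RTC_RESET)" and captures code and name.
RESET_RE = re.compile(r"rst:0x([0-9a-fA-F]+) \((\w+)\)")


def reset_reasons(log_text):
    """Return the reset reason names in the order they appear in the log."""
    return [name for _code, name in RESET_RE.findall(log_text)]


def had_watchdog_reset(log_text):
    return "RTCWDT_RTC_RESET" in reset_reasons(log_text)


quick_press = "rst:0x1 (POWERON_RESET),boot:0x16 (SPI_FAST_FLASH_BOOT)\n"
long_press = (
    "rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)\n"
    "rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)\n"
)
print(reset_reasons(quick_press))      # ['POWERON_RESET']
print(had_watchdog_reset(long_press))  # True
```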

Thanks.

MartyMacGyver · 2017-02-23T06:58:02Z

The logic analyzer does show some interesting behavior.

For one thing, if RTS and DTR are both high, toggling RTS toggles both.
Similarly if both are low, toggling DTR toggles both.
This may or may not be normal, but that may have something to do with the odd little pulse I see below. In turn, that pulse may or may not be significant... I kind of think it isn't since it's present on both traces.

Analog wasn't terribly noteworthy, but it's hard to tell how much the probes might affect capacitance falloff (I suspect a combination of issues, as you may be noticing in your followup note above).

Using the original esptool, and starting with DTR and RTS high:

0.0ms - RTS goes low
1.0ms - DTR goes low
49.5ms - RTS goes high and stays that way
49.7ms - DTR goes high
50.0ms - DTR goes low once again (ending a short but very repeatable pulse)
99.5ms - DTR goes high and stays that way

(Always, about 4ms before the very first connect attempt, we see a 0.5ms low pulse on RTS. I presume this is when serial is connected to the script in the first place.)

Bootloader is not entered.

Using the new method, we see the same initial attempt as above, with newer timings:

0.0ms - RTS goes low
1.1ms - DTR goes low
99.7ms - RTS goes high and stays that way
99.9ms - DTR goes high
100.3ms - DTR goes low once again (ending a short but very repeatable pulse)
149.6ms - DTR goes high and stays that way

After 0.9s we see the first fallback attempt:

0.0ms - RTS goes low
1.1ms - DTR goes low
1299.5ms - RTS goes high and stays that way
1299.7ms - DTR goes high
1300.0ms - DTR goes low once again (ending a short but very repeatable pulse)
1749.5ms - DTR goes high and stays that way

This tends to work on the first fallback try for the Core Board. (The WROVER-KIT never seems to have these problems, but occasionally it's tricky to keep it out of bootload mode - that's beyond the scope of this fix though.)
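The three traces above can be reduced to two numbers each: how long RTS stayed low, and how long after RTS was released DTR finally went high for good. The sketch below just re-enters the timings listed above and subtracts them; the helper names are made up for the example.

```python
def first_time(events, line, level):
    return next(t for t, l, lv in events if l == line and lv == level)


def last_time(events, line, level):
    return [t for t, l, lv in events if l == line and lv == level][-1]


def summarize(events):
    """Return (ms RTS was low, ms from RTS release to final DTR high)."""
    rts_released = first_time(events, "RTS", "high")
    rts_low_ms = rts_released - first_time(events, "RTS", "low")
    dtr_tail_ms = last_time(events, "DTR", "high") - rts_released
    return round(rts_low_ms, 1), round(dtr_tail_ms, 1)


# (time in ms, line, level) transcribed from the traces above.
ORIGINAL = [(0.0, "RTS", "low"), (1.0, "DTR", "low"), (49.5, "RTS", "high"),
            (49.7, "DTR", "high"), (50.0, "DTR", "low"), (99.5, "DTR", "high")]
NEW_FIRST = [(0.0, "RTS", "low"), (1.1, "DTR", "low"), (99.7, "RTS", "high"),
             (99.9, "DTR", "high"), (100.3, "DTR", "low"), (149.6, "DTR", "high")]
FALLBACK = [(0.0, "RTS", "low"), (1.1, "DTR", "low"), (1299.5, "RTS", "high"),
            (1299.7, "DTR", "high"), (1300.0, "DTR", "low"), (1749.5, "DTR", "high")]

for name, trace in [("original", ORIGINAL), ("new first", NEW_FIRST),
                    ("fallback", FALLBACK)]:
    print(name, summarize(trace))
# original (49.5, 50.0)
# new first (99.7, 49.9)
# fallback (1299.5, 450.0)
```

The fallback numbers line up with roughly 1.2s of extra hold on top of the 100ms reset, plus the extra 400ms before the final 50ms.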

As for booting, I'm not seeing what you're seeing... a brief tap on EN usually gives me a POWERON_RESET with a flash read err, 1000 followed by a RTCWDT_RTC_RESET. This is with an Arduino program loaded onto it - similar results with native code.

Variant 1:
ets Jun 8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0x00
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0008,len:8
... and it's running

Variant 2:

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
flash read err, 1000
Falling back to built-in command interpreter.
OK
>ets Jun 8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0x00
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0008,len:8
... and it's running

If you in turn press BOOT at just the right time after EN and are lucky and get into boot mode:

ets Jun  8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x3 (DOWNLOAD_BOOT(UART0/UART1/SDIO_REI_REO_V2))
waiting for download

And finally, if you're doing this manually, it seems a whole lot faster than a second when you press the buttons - but very tricky to get the timing just right. Perhaps the latency I'm fixing is all in the serial stack on the given host, and not so much a matter of the board itself. Whatever the source, it's real and repeatable during software-commanded DTR/RTS controls over USB using the SI Labs drivers, et al. on other platforms.

MartyMacGyver · 2017-02-23T07:33:41Z

Further info - in steady state (DTR and RTS are normally high):

If I press EN, both go low
If I press BOOT, only DTR goes low
If I manage to press EN and then BOOT, then release EN, I get the bootloader

This isn't necessarily interesting behavior in itself, but I'm not seeing the watchdog timeout in that short sequence, and it's pretty fast. The logic when entering the bootloader using the buttons alone is a lot cleaner too:

0.0ms - RTS goes low
1.0ms - DTR goes low
~400ms - RTS goes high and stays that way (varies widely as to how fast I press buttons)
~650ms - DTR goes high and stays that way (varies widely as to how fast I press buttons)

The weird but persistent DTR pulse seen with the software is nowhere to be found when doing it this way.

I'm thinking about rigging up something in python that can test this more expediently, and which would actually log the states it sees as it tries this - a test rig, basically. Probably won't have further info until tomorrow evening at earliest though... it may just be that there are certain quirks between pyserial and/or the driver and/or the USB hardware subsystem and/or even the USB UART on the board that in aggregate make this fix necessary.
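A minimal sketch of what such a rig could look like, assuming a port object that exposes `setDTR()` and `setRTS()` (pyserial's `Serial` has both). The function name `exercise_lines`, the hold time, and the injectable `sleep` and `clock` parameters are all hypothetical and not part of the PR. This only logs the commanded settings with timestamps; the actual IO0/EN levels would still have to be captured separately, for example with a logic analyzer.

```python
import itertools
import time


def exercise_lines(port, hold=0.5, sleep=time.sleep, clock=time.monotonic):
    """Step through every DTR/RTS combination and log when each was commanded.

    `port` is any object with setDTR(bool) and setRTS(bool) methods.
    Returns a list of (seconds_since_start, dtr_setting, rts_setting).
    """
    log = []
    start = clock()
    for dtr, rts in itertools.product((True, False), repeat=2):
        # The two lines are set one after the other, exactly as a normal
        # caller would, so any intermediate state is part of the test.
        port.setDTR(dtr)
        port.setRTS(rts)
        log.append((clock() - start, dtr, rts))
        sleep(hold)  # give the scope or analyzer time to see a stable state
    return log
```

With real hardware this would be called with an opened `serial.Serial` instance; passing a stub object and a no-op `sleep` makes it possible to dry-run the sequence without a board attached.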

MartyMacGyver · 2017-02-23T09:39:42Z

Here's a little truth table for RTS and DTR:

DTR (IO0) RTS (EN)  DTR (IO0) RTS (EN)
 setting   setting   actual    actual 
--------------------------------------
 True      True      High      High
 False     False     High      High
 True      False     Low       High
 False     True      Low       Low

There is no stable state where IO0 is High while EN is low, only transient states ~0.5ms long.
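For reference, the table above can be written down as a small lookup. This is only a transcription of the measured values, not a model of the board's circuit; the names `MEASURED_LEVELS`, `stable_levels` and `reachable_states` are made up for illustration.

```python
# Stable pin levels as measured in the table above, keyed by the
# (DTR setting, RTS setting) pair given to the serial driver.
# Each value is (IO0 level, EN level).
MEASURED_LEVELS = {
    (True, True): ("High", "High"),
    (False, False): ("High", "High"),
    (True, False): ("Low", "High"),
    (False, True): ("Low", "Low"),
}


def stable_levels(dtr, rts):
    """Return the (IO0, EN) levels recorded for a DTR/RTS setting."""
    return MEASURED_LEVELS[(bool(dtr), bool(rts))]


def reachable_states():
    """Return the distinct (IO0, EN) level pairs that appear in the table."""
    return set(MEASURED_LEVELS.values())


if __name__ == "__main__":
    for (dtr, rts), (io0, en) in MEASURED_LEVELS.items():
        print(f"DTR={dtr!s:5} RTS={rts!s:5} -> IO0={io0:4} EN={en}")
    # IO0 High together with EN Low is not among the stable states.
    print(("High", "Low") in reachable_states())
```

Only three distinct level pairs come out of the four settings, which is the observation about the missing "IO0 High, EN Low" state.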

Is this an effect of the CP2102 alone, or is it an interaction with the reset circuit?

To find out, I've ordered a CP2102 breakout board that should arrive later today (a little expensive but worth it). It breaks out ALL the signals for the chip, so it should be enlightening to see how it differs, if at all.

projectgus · 2017-02-23T23:22:38Z

Thanks for all that investigation and analysis, @MartyMacGyver. I appreciate it.

The behaviour you're seeing with DTR/RTS is by design on the development boards. If you look at the schematics for the Core Board & ESP-WROVER-KIT then the same truth table that you derived is shown there.
The reason for that behaviour is that most serial programs assert both DTR & RTS when the port is opened. Early ESP8266 dev board setups wired DTR & RTS directly to IO0 & EN, but this means that a standard serial program will hold the board in reset until you de-assert RTS. The "truth table" circuit was first used in NodeMCU's ESP8266 boards, and has become popular.
You are correct that the glitch where GPIO0 goes high for a short period is a result of the truth table and the order of setting DTR & RTS. ESP8266 takes slightly longer to read its strapping pins, so this is rarely a problem there. ESP32 is a little faster, so the pulse and the reading of strapping pins sometimes coincide. This happens occasionally on OS X & Linux (due to OS scheduling, USB delays) but it happens a lot on Windows, at least with the CP2102 driver (Core Board); it seems to happen less with the FTDI driver (ESP-WROVER-KIT). Running any OS inside a VM is also sometimes enough to trigger the problem, because when USB is virtualised the timing tends to spread out a bit.
It may look like the GPIO0 glitch is just a result of pyserial's API (one pin per function call) and can be fixed by changing the way DTR/RTS are set. At least on Windows with CP2102 this doesn't work: I tried setting both in a single Windows API call and the driver still makes two separate USB transfers to set the pins.
Instead, the long-term fix for this problem is to increase the capacitance on the EN pin from 1nF to 100nF or more, so that the EN pin rise time (after RTS is released) becomes extended and the chip starts up approx 1ms later, which is enough to miss the GPIO0 pulse. This is a confirmed solution: newer Espressif dev boards will have this capacitor and many third party vendors already include it.
What to do on the existing boards? The trick here is that esp32r0 has the "extra WDT reset" bug where a RTCWDT_RTC_RESET happens 300ms after a power on reset. This is a silicon bug, but in this case we can use it to work around the error. The additional 400ms delay (esp32r0_delay) leverages this: we sit around with GPIO0 low (and EN high) until this WDT reset occurs and the ESP32 re-reads the strapping pins. Bingo!
Once esp32r1 comes out, the silicon bug is fixed and this reset won't happen any more so this workaround will no longer be effective. But we're relying on everyone increasing EN pin capacitance on their dev boards by this point (Espressif boards will, most third party vendors either already have or hopefully will), so the default reset will work.
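The ordering glitch described above can be illustrated with the measured truth table from earlier in this thread. This is a sketch, assuming the reset sequence has to move from "RTS only" (EN low) to "DTR only" (IO0 low, EN high) and that the driver changes one line per USB transfer; `LEVELS`, `pin_trace` and `has_glitch` are names invented for this example.

```python
# (DTR setting, RTS setting) -> (IO0 level, EN level), from the measured table.
LEVELS = {
    (True, True): ("High", "High"),
    (False, False): ("High", "High"),
    (True, False): ("Low", "High"),
    (False, True): ("Low", "Low"),
}


def pin_trace(start, end, first):
    """Levels seen when moving from `start` to `end` one line at a time.

    `start` and `end` are (dtr, rts) settings; `first` is "dtr" or "rts"
    and names the line that is updated first.
    """
    if first == "dtr":
        middle = (end[0], start[1])
    elif first == "rts":
        middle = (start[0], end[1])
    else:
        raise ValueError("first must be 'dtr' or 'rts'")
    return [LEVELS[state] for state in (start, middle, end)]


def has_glitch(trace):
    """True if an intermediate step releases EN while IO0 is still High."""
    return any(levels == ("High", "High") for levels in trace[1:-1])


if __name__ == "__main__":
    hold_reset = (False, True)    # EN low
    select_boot = (True, False)   # IO0 low, EN high
    for first in ("dtr", "rts"):
        trace = pin_trace(hold_reset, select_boot, first)
        print(first, trace, has_glitch(trace))
```

Under these assumptions both orderings pass through a state with IO0 and EN high, which fits the observation that merging the two settings into one API call does not help while the driver still issues two transfers.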

The above is pretty well understood (at least by me, and there are some other Github Issues and forum threads that go over it) and I more or less trust it as being correct. That said: if anything there doesn't match what you've seen, please let me know.
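As a rough sanity check on the capacitor change, the delay can be estimated from the usual RC charging formula, t = -R * C * ln(1 - Vth/Vcc). The pull-up value of 10 kOhm and the 50% threshold used here are assumptions for illustration only; neither figure is stated in this thread, and `time_to_threshold` is a hypothetical helper.

```python
import math


def time_to_threshold(r_ohms, c_farads, threshold_fraction=0.5):
    """Seconds for an RC node charging towards Vcc to reach a threshold.

    `threshold_fraction` is the threshold voltage divided by Vcc.
    """
    if not 0.0 < threshold_fraction < 1.0:
        raise ValueError("threshold_fraction must be between 0 and 1")
    return -r_ohms * c_farads * math.log(1.0 - threshold_fraction)


if __name__ == "__main__":
    R_PULLUP = 10e3  # ASSUMED pull-up resistance, not taken from the thread
    for cap in (1e-9, 100e-9):
        delay = time_to_threshold(R_PULLUP, cap)
        print(f"C = {cap * 1e9:.0f} nF -> about {delay * 1e6:.0f} us")
```

With those assumed values the delay grows from a few microseconds at 1nF to several hundred microseconds at 100nF, the same order of magnitude as the "approx 1ms" figure. The exact number depends on the real pull-up and the chip's EN threshold.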

All of that explains esp32r0_delay, but the case for latency_delay is the more unusual situation.

> a brief tap on EN usually gives me a POWERON_RESET with a flash read err, 1000 followed by a RTCWDT_RTC_RESET

This is Core board V2, yes?

I actually get the "flash read err" as well. In the second log I posted above I took it out because I thought it was a separate issue with my board (sorry), but if you're seeing it as well then it's related.

This is different to what I see though, I only get this "flash read err" if I hold the EN button for 1-2 seconds, not if I just tap it.

> Variant 1:
> Variant 2:

I'm not sure what this means, what is Variant 1 and Variant 2? These logs match what I see for "short EN press" (Variant 1) and "long EN press" (Variant 2). However that's at odds with your statement about "a brief tap on EN".

My hypothesis was: on some ESP32s (one of the six that I have), if EN is not held low for long enough then the extra RTCWDT_RTC_RESET bug doesn't happen, so there's no way for the esp32r0_delay workaround to be effective. However, if EN is held a bit longer, the RTCWDT_RTC_RESET does happen, so esp32r0_delay can work around the problem. But I've only seen this behaviour on one ESP32, and what you've posted above may not quite fit that hypothesis.

MartyMacGyver · 2017-02-24T04:13:04Z

Variant 1 and Variant 2 are just the two things that happen - sometimes one, sometimes the other. It isn't strongly dependent on how long I press the button.

projectgus · 2017-02-24T04:48:09Z

> Variant 1 and Variant 2 are just the two things that happen - sometimes one, sometimes the other. It isn't strongly dependent on how long I press the button.

That's very odd, because if the button press time isn't a factor then I can't see how this relates to the legacy_delay fix. Even though it seems to clearly relate on my board.

If you have time, there's another debugging step that may help:

Do you have a second USB/serial adapter of some kind? If you wire it to the TX0 pin of the Core Board, you can use a second serial program to monitor the output from the Core Board UART as it is reset by esptool.py.

If you capture the reset sequence with and without legacy_delay, you should be able to see whether or not legacy_delay causes the RTCWDT_RTC_RESET to trigger more reliably. (I recommend using a command like 'esptool.py flash_id' to minimise the amount of other traffic sent over serial.)

It's also possible to do this with the logic analyzer (monitor the TX pin), although I find some logic analyzer programs don't make viewing a long serial stream particularly easy.
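Once the UART output has been captured to a file or string, the reset reasons can be picked out mechanically. The sketch below assumes the ROM prints lines in the `rst:0x.. (...),boot:0x.. (...)` form shown in the logs in this thread; `reset_events` and `entered_download_mode` are hypothetical helper names.

```python
import re

# Matches e.g. "rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)".
# The boot mode name is captured up to the first non-word character, so
# "DOWNLOAD_BOOT(UART0/UART1/SDIO_REI_REO_V2)" is reported as "DOWNLOAD_BOOT".
RESET_LINE = re.compile(
    r"rst:0x(?P<rst>[0-9a-fA-F]+) \((?P<reason>\w+)\),"
    r"boot:0x(?P<boot>[0-9a-fA-F]+) \((?P<mode>\w+)"
)


def reset_events(log_text):
    """Return [(reset_reason, boot_mode), ...] in the order they were logged."""
    return [(m.group("reason"), m.group("mode"))
            for m in RESET_LINE.finditer(log_text)]


def entered_download_mode(log_text):
    """True if any reset in the capture ended in DOWNLOAD_BOOT."""
    return any(mode == "DOWNLOAD_BOOT" for _, mode in reset_events(log_text))


if __name__ == "__main__":
    capture = (
        "rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)\n"
        "flash read err, 1000\n"
        "rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)\n"
    )
    print(reset_events(capture))
    print(entered_download_mode(capture))
```

Comparing the event lists from captures taken with and without the extra delay shows directly whether the RTCWDT_RTC_RESET appears and which boot mode follows it.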

MartyMacGyver · 2017-02-24T08:06:02Z

Well, at least overnighting the CP2102 breakout came in handy after all! 💯 It didn't occur to me to monitor the TX pin this way.

Without the latency fix, I just get the following every time it tries to flash (using the normal process):

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0x00
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0008,len:8
...

With the fixed code, I get the same failure when it tries the quick default pass, followed by success when it uses the latency delay pass:

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0x00
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0008,len:8
...

ets Jun  8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
flash read err, 1000
Falling back to built-in command interpreter.
OK
>ets Jun  8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x3 (DOWNLOAD_BOOT(UART0/UART1/SDIO_REI_REO_V2))
waiting for download
...
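The two-pass behaviour in the log above (a quick default attempt, then a retry with a longer hold) can be sketched as follows. This is an illustration only, not the code from the PR: the function names, the timing constants and the `sync` callback are assumptions, and `port` is any object with `setDTR()` and `setRTS()` methods (pyserial's `Serial` has both).

```python
import time


def enter_bootloader(port, extra_delay=0.0, sleep=time.sleep):
    """Reset the chip via RTS while holding IO0 low via DTR.

    The 0.1 s and 0.05 s holds are illustrative values, not measured ones.
    """
    port.setDTR(False)
    port.setRTS(True)    # EN low: chip held in reset
    sleep(0.1)
    port.setDTR(True)    # request IO0 low
    port.setRTS(False)   # EN released, chip starts
    sleep(0.05)
    if extra_delay:
        # Keep IO0 low for longer so that a later watchdog reset
        # re-reads the strapping pins while IO0 is still low.
        sleep(extra_delay)
    port.setDTR(False)   # release IO0


def connect_with_fallback(port, sync, extra_delays=(0.0, 0.5), sleep=time.sleep):
    """Try each extra delay in order and return the first one that works.

    `sync` is a callable returning True once the bootloader answers.
    """
    for extra in extra_delays:
        enter_bootloader(port, extra, sleep)
        if sync():
            return extra
    raise RuntimeError("no response from bootloader")
```

Putting the shortest delay first means that boards which do not need the workaround return after the first pass and are not slowed down; only boards that fail the quick attempt pay for the longer hold.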

MartyMacGyver · 2017-02-24T08:11:04Z

To be honest, maybe it's got something to do with that extra, short DTR pulse I described above after all. I'm really not sure, but that's of course absent when doing this via the buttons. Just a thought - whatever the case, the delay required when commanding this via DTR/RTS is what it is, and it's far greater than is needed to perform a clean EN/IO0 boot sequence via buttons.

projectgus · 2017-02-27T01:21:44Z

> With the fixed code, I get the same failure when it tries the quick default pass, followed by success when it uses the latency delay pass:

OK, great. This is what I see.

It's weird that you can't reproduce this by pressing the EN button in the same way I can. All I can think is that it might be a button debouncing issue: if you looked at the EN trace on a scope when releasing the button, it might toggle high and low a few times, and this is enough either to get things going properly or to confuse the SPI flash chip into yielding the read error.

Based on your results I think I have a solid picture of what's happening here. Thanks.

The latency_delay workaround appears to be an extension of esp32r0_delay, as the longer reset period helps the esp32r0 watchdog bug trigger more reliably. Ref #172 #136

projectgus · 2017-03-02T01:28:11Z

Confirmed with some colleagues who are able to reproduce the bugs discussed above.

Merged this PR with a few tweaks, as it seems clear latency_delay is a way to trigger the esp32r0_delay more clearly on some chips.

Thanks @MartyMacGyver for figuring this out and sending it in the first place, and bearing with me through the review process. :)

MartyMacGyver · 2017-03-02T04:45:30Z

Thank you!

projectgus suggested changes Jan 27, 2017

projectgus mentioned this pull request Jan 27, 2017

ESP32R0 workaround is not Windows specific espressif/esp-idf#305

Closed

MartyMacGyver force-pushed the serial_legacy_fix branch 2 times, most recently from ad45d7d to d75c0a6 Compare January 28, 2017 00:02

MartyMacGyver force-pushed the serial_legacy_fix branch 2 times, most recently from 940a248 to 2cd6768 Compare January 30, 2017 09:57

MartyMacGyver force-pushed the serial_legacy_fix branch from 16e653f to 93e363f Compare February 22, 2017 09:31

projectgus approved these changes Feb 23, 2017

projectgus merged commit 93e363f into espressif:master Mar 2, 2017
