-
-
Notifications
You must be signed in to change notification settings - Fork 4k
implement async DNS lookup - no more stuck NTP requests #5332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- adding a wrapper class for DNS lookup (lwIP) - replaces blockig WiFi.hostByName() in NTP and virtual Bus DNS look-ups
WalkthroughAdds a non-blocking AsyncDNS header for lwIP-based DNS resolution, integrates async hostname resolution into BusManager and NTP flows, and changes the AP channel fallback default from 1 to 6 during config deserialization. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
I tested this on a C3 and ESP8266 for NTP, I have no hardware setup to thest the virtual bus changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
wled00/bus_manager.cpp (2)
675-679:⚠️ Potential issue | 🔴 CriticalInitialization of
_DNSlookupis outside the ESP32 guard.Line 675 initializes
_DNSlookup = nullptrbut this line is before the#ifdef ARDUINO_ARCH_ESP32block that starts at line 676. Since_DNSlookupis only declared for ESP32 in the header, this will cause a compilation error on ESP8266.Proposed fix
_client = IPAddress(bc.pins[0],bc.pins[1],bc.pins[2],bc.pins[3]); - _DNSlookup = nullptr; `#ifdef` ARDUINO_ARCH_ESP32 + _DNSlookup = nullptr; _hostname = bc.text; resolveHostname(); // resolve hostname to IP address if needed `#endif`
775-781:⚠️ Potential issue | 🟡 MinorMissing cleanup of
_DNSlookupinBusNetwork::cleanup().When
BusNetwork::cleanup()is called, the_DNSlookuppointer is not deleted, which could cause a memory leak if a DNS lookup is pending when the bus is destroyed.Proposed fix
void BusNetwork::cleanup() { DEBUGBUS_PRINTLN(F("Virtual Cleanup.")); + `#ifdef` ARDUINO_ARCH_ESP32 + delete _DNSlookup; + _DNSlookup = nullptr; + `#endif` d_free(_data); _data = nullptr; _type = I_NONE; _valid = false; }
🤖 Fix all issues with AI agents
In `@wled00/asyncDNS.h`:
- Around line 35-36: Fix the typo in the comment inside asyncDNS.h: change
"handlig" to "handling" in the sentence "with the current IDF bug external error
handlig is required anyway or dns will just stay stuck" so the comment reads
"...external error handling is required anyway or dns will just stay stuck."
In `@wled00/ntp.cpp`:
- Around line 209-212: The debug prints occur before ntpServerIP is updated,
causing the wrong IP to be logged; move the assignment ntpServerIP =
ntpDNSlookup->getIP() to before the DEBUG_PRINTF_P/Serial.printf_P calls (and
keep sendNTPPacket() after the assignment) so both debug outputs and
sendNTPPacket() use the newly resolved ntpServerIP.
- Line 201: Replace the direct Serial.printf_P debug prints with the conditional
debug macro so debug output is gated; specifically, change the
Serial.printf_P(...) calls that reference ntpServerName (and the other
occurrence) in ntp.cpp to use DEBUG_PRINTF_P(...) instead, preserving the PSTR
format strings and arguments so the messages only appear when debugging is
enabled.
🧹 Nitpick comments (3)
wled00/bus_manager.cpp (2)
716-716: StaticnextResolveis shared across all BusNetwork instances.The
static unsigned long nextResolvevariable is shared among all network bus instances. This means if you have multiple network buses with different hostnames, they'll all share the same 10-minute resolution timer. This might be intentional to reduce DNS traffic, but if independent resolution per bus is desired, consider making this a member variable instead.
749-752: Unreachable else branch.This
elseblock handles states other thanError, but at this point in the code, the only possible states areIdle,Busy,Success, orError.IdleandSuccessare handled earlier (line 719 checks Success),Busywould mean an in-progress query, andErroris handled in the if block. This else branch appears to be defensive but unreachable.Consider simplifying or adding a comment
} else { - delete _DNSlookup; // cleanup if other error just in case (should not happen) - _DNSlookup = nullptr; + // Note: This branch should be unreachable as all DnsResult states are handled above + // Kept as defensive cleanup in case of unexpected state + delete _DNSlookup; + _DNSlookup = nullptr; }wled00/ntp.cpp (1)
184-184: Global pointer declaration could be more explicit.While C++ zero-initializes global pointers, an explicit
= nullptrwould make the intent clearer and be consistent with the initialization inbus_manager.cpp.-AsyncDNS* ntpDNSlookup; +AsyncDNS* ntpDNSlookup = nullptr;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@wled00/bus_manager.cpp`:
- Around line 717-718: The static local nextResolve in
BusNetwork::resolveHostname is shared across all instances, causing only one
hostname to be resolved per cycle; make the timer instance-specific by adding an
unsigned long member (e.g. _nextResolve = 0) to the BusNetwork class (in
bus_manager.h) and replace the static nextResolve usage inside
BusNetwork::resolveHostname with this member (_nextResolve), ensuring any logic
that checks or updates nextResolve now operates on the instance field.
- Around line 727-728: The comparison using millis() > nextResolve will fail
after millis() wraps; change the check to use unsigned subtraction to handle
overflow correctly (e.g., (millis() - nextResolve) >= 0 or millis() -
nextResolve < someInterval). Update the condition around nextResolve and
_hostname in the bus_manager.cpp block where millis() and nextResolve are used
so the resolve runs every 600000ms even after overflow and keep the existing
update to nextResolve = millis() + 600000 to schedule the next run.
- Around line 751-754: The AsyncDNS instance can be deleted while a DNS query is
still pending, causing a use-after-free when lwIP invokes the static callback;
add a destructor to the AsyncDNS class that checks for a pending state (e.g.
_status == DnsResult::Busy) and cancels any outstanding query via the
appropriate lwIP cancel API (matching how dns_gethostbyname was used) before
allowing destruction; ensure the else branch that deletes _DNSlookup (and any
other code paths) relies on this destructor to safely unregister/cancel
callbacks so lwIP never receives a pointer to a destroyed AsyncDNS.
🧹 Nitpick comments (1)
wled00/bus_manager.cpp (1)
675-681: Consider merging consecutive#ifdefblocks.Two adjacent
#ifdef ARDUINO_ARCH_ESP32blocks can be combined into one for clarity.♻️ Proposed refactor
_client = IPAddress(bc.pins[0],bc.pins[1],bc.pins[2],bc.pins[3]); `#ifdef` ARDUINO_ARCH_ESP32 _DNSlookup = nullptr; - `#endif` - `#ifdef` ARDUINO_ARCH_ESP32 _hostname = bc.text; resolveHostname(); // resolve hostname to IP address if needed `#endif`
| //int ap_pskl = ap[F("pskl")]; | ||
| CJSON(apChannel, ap[F("chan")]); | ||
| if (apChannel > 13 || apChannel < 1) apChannel = 1; | ||
| if (apChannel > 13 || apChannel < 1) apChannel = 6; // reset to default if invalid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seams unrelated to the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is. I should have changed this line in #5115 but was unaware at the time.
|
Great to find this issue, but not sure async dns is the right approach. surely ntp should but it's own task not part of our main loop? |
willmmiles
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great idea and definitely the right general approach. The possible use-after-free when the object is destructed needs to be fixed though.
wled00/ntp.cpp
Outdated
| } | ||
| } | ||
|
|
||
| AsyncDNS* ntpDNSlookup; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use std::unique_ptr instead of naked pointers. Also it should be static inside the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::unique_ptr gives me errors as make_unique is not defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a shim in bus_manager.h. (This might be moot re fixing the safety issues though -- shared_ptr might allow a cleaner solution.)
wled00/asyncDNS.h
Outdated
| uint16_t _errorcount = 0; | ||
|
|
||
| // callback for dns_gethostbyname(), called when lookup is complete or timed out | ||
| static void _dns_callback(const char *name, const ip_addr_t *ipaddr, void *arg) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this class is destructed while a query is still outstanding, this will result in the arg pointer becoming invalid, resulting in memory corruption when the callback is eventually executed. Disconnecting from the wifi does not dequeue DNS callbacks -- they're supposedly guaranteed to be called "eventually" by the lwip stack.
Fixing this is nontrivial -- the state data needs to be stored on the heap separately from the interface class, and scoped with tools like std::shared_ptr so it won't be released until the callback is run. I made a little sketch of this, I can add it to the PR if you'd like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I am not opposed to "do it properly" but I also see two more trivial solutions:
- make the class global and instantiate only one, add a "key" or a char array and share the class between callers. this adds the complexity of a shared resource though and probably has implications I am currently not thinking of.
- simplest: do not delete if status is busy (works well for NTP use)
both of these will not solve the issue for virtual buses though - maybe adding another function to the class that is actually blocking would simplify its use for places that "need an IP immediately" - what I have implemented in busmanager is way too complex IMHO but mostly due to the fact that it needs to dance around that bug gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just deletion: renew() and reset() are also invalid if status == Busy as the results of a new query will be overwritten by the old callback at some point. If a query is stuck, the state object has to be abandoned -- there's no real alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote it the way I did because I never got it to not run the callback - you are right it is not as robust as it could be but it was not my intention to write it "fool proof" i.e. reset while a request is going on.
So which one do you prefer:
- using seperate heap for the callback as you propose
- not use a pointer and make it fully static (like I did now in the busmanager) and not go down the road of managing dangling pointers?
to me the option of not using it dynamically makes sense, at least ATM as DNS is only used for NTP and virtual bus. The class only uses a few bytes of RAM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to be able to test/treat an AsyncDNS object as a unit, with safety by design -- "it doesn't crash because we don't happen to delete/reuse it while it's active" isn't good enough IMO, static or otherwise. Eventually that will bite us because the class violates the C++ object requirement that a destructor is always safe.
I took a stab at safety, while keeping the current object and re-use semantics:
willmmiles@26c202a
I think it might be done better with std::future<> style semantics though; the code would be closer to the current class design. Eg. the fundamental call is:
class AsyncDNS {
static std::shared_ptr<AsyncDNS> query(const char* hostname, std::shared_ptr<AsyncDNS> previous = {});
private:
AsyncDNS(); // Can only be produced by making a query
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future-like formulation: willmmiles@efcfc5e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will say that I'm having a lot of trouble with DNS hanging on my test system. Do you have any references on that IDF v4 bug you mention?
I do too, that is why I am pretty sure its a bug, it just locks up and does not recover until reconnect. I did not really find concret info about it but the changelog mentions some fix for DNS deadlock: https://github.com/espressif/esp-idf/releases?page=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future-like formulation: willmmiles@efcfc5e
that is a whole lot of fancy c++ magic I do not (yet) understand. Is it to make it multi-thread safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future-like formulation: willmmiles@efcfc5e
that is a whole lot of fancy c++ magic I do not (yet) understand. Is it to make it multi-thread safe?
Sort of. The goal is to make sure that the addresses that will be touched by the callback in the future cannot under any circumstances be released until the callback has run. std::shared_ptr<> is a reference-counting pointer: it keeps track of how many shared_ptrs address the same storage, so the pointed-to object won't be deleted until the last shared_ptr pointing at it is removed. shared_ptr guarantees that the reference count is atomic and so it is thread-safe in this context.
The first implementation hides the shared_ptr by making the AsyncDNS class a wrapper around it: the only member is a shared_ptr to the DNS results. A copy of the shared_ptr is "gifted" to the callback, who is then guaranteed to be able to write the state. If the AsyncDNS object is destroyed, or the request is abandoned via renew() or reset(), that copy keeps the DNS result struct alive until it's finished with it.
The second implementations strips off the wrapper layer, so clients always have to store a shared_ptr<AsyncDNS>. This is intended to make the behaviour to the client more clear, ie. the thing you hold is a handle to the real object (and removes an extra layer of code). The major change to make that work is the interface to query(): it lets a client pass in a shared_ptr<AsyncDNS> that's been created before, so it can re-use the context (ie. error count) or storage if that request has been completed. (If the request is still busy, or if you pass in an empty pointer, query() makes a new storage object for you.)
I liken the second approach to the use of std::promise<>/std::future<>. Promises and futures are part of the C++11 async library. You can think of them as a thread-safe single result queue: the end holding the std::promise can write a single result value in a safe way, such that the std::future can wait() for it and read it later. Unfortunately actual std::promise and std::future are quite heavyweight, particularly with their use of exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave this to your judgement. I would keep it as light-weight as possible. Feel free to add to this PR.
That is a good point actually, we should think about putting the rendering and background work in different tasks but I think the current code structure is not ready for that. Sure could put NTP in a task but that eats up RAM and on single core ESPs the arduino function this PR replaces may still be blocking, I did not look how the espressif "wait for bit set" part is implemented, it may not play nicely with the RTOS as it sounds like a pretty bare-metal thing to wait for a bit to be set. |
…ix busmanager bug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@wled00/ntp.cpp`:
- Line 13: NTP_SYNC_INTERVAL is currently set to 15 (seconds) which is far too
frequent; change the macro NTP_SYNC_INTERVAL back to a realistic interval (e.g.,
replace 15 with the previously-intended 42000UL or another sensible value like
43200UL for 12 hours) and restore/update the comment to reflect "Get fresh NTP
time about twice per day"; if 15 was used only for debugging, add a short
comment noting that and ensure the production value is the larger interval.
🧹 Nitpick comments (3)
wled00/asyncDNS.h (3)
35-47:constqualifiers are misplaced on return types instead of member functions.The
constbefore the return type (e.g.,const IPAddress,const result) has no meaningful effect. To indicate these methods don't modify the object state, moveconstafter the parameter list.♻️ Suggested fix
- const IPAddress getIP() { + IPAddress getIP() const { if (_status != result::Success) return IPAddress(0,0,0,0); `#ifdef` ARDUINO_ARCH_ESP32 return IPAddress(_raw_addr.u_addr.ip4.addr); `#else` return IPAddress(_raw_addr.addr); `#endif` } void renew() { _status = result::Idle; } // reset status to allow re-query void reset() { _status = result::Idle; _errorcount = 0; } // reset status and error count - const result status() { return _status; } - const uint16_t getErrorCount() { return _errorcount; } + result status() const { return _status.load(); } + uint16_t getErrorCount() const { return _errorcount; }
50-52: Consider thread safety for_raw_addrand_errorcountaccess.While
_statusis correctly atomic,_raw_addrand_errorcountare accessed from both the lwIP callback context and the main thread without explicit synchronization. In practice, the atomic_statuscheck before reading these values provides implicit ordering (the callback sets_raw_addr/_errorcountbefore setting_status), but this relies on the atomic's acquire-release semantics.For robustness, consider either:
- Making
_errorcountatomic (low overhead)- Adding a comment documenting the synchronization guarantee through
_statusThis is a minor concern given the current usage pattern where callers always check
status()first.
13-64: Document the callback lifecycle constraint to guide future maintainers.The
AsyncDNSclass relies on lwIP callbacks that invoke_dns_callbackafter the query completes. While the current callers (ntp.cpp and bus_manager.cpp) safely handle this by only deleting after completion or using stack allocation, the API should explicitly document this constraint.ntp.cpp deletes only after
SuccessorErrorstates (lines 211, 221). bus_manager.cpp avoids dynamic allocation entirely with a stack-allocated instance and polls with a timeout loop—its TODO comment ("make this dynamic? requires to handle the callback properly") shows developers were aware of the callback lifecycle issue.Add a comment to the class definition clarifying that instances must not be deleted while
status() == Busy, or update callers to use stack allocation where feasible (as bus_manager already does).
NTP sync should never have been anything other than a background task. Rather than adding a load of complex code to handle async DNS lookups, we should be correcting the failure in design |
Hard disagree. Every FreeRTOS task requires >4k of RAM for stack and ancillary data structures. That would be an enormous waste just for NTP synchronization. |
I retract my statement, I was unaware that there was such a high penalty for running this way |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@wled00/ntp.cpp`:
- Around line 186-229: The DNS helper ntpDNSlookup can remain allocated if the
server becomes a literal IP (the else branch calling sendNTPPacket()); ensure
you delete and null out ntpDNSlookup when ntpServerIP.fromString(ntpServerName)
returns true. In practice, before calling sendNTPPacket() in the IP path (and
also before creating a new AsyncDNS in the hostname path), check if ntpDNSlookup
!= nullptr, call delete ntpDNSlookup, set ntpDNSlookup = nullptr to free the
stale allocation and avoid leaks or stale state.
🧹 Nitpick comments (1)
wled00/ntp.cpp (1)
218-224: Replace retry/backoff magic numbers with named constants.
This makes the backoff policy self-documenting and easier to adjust.♻️ Suggested refactor
`#define` NTP_SYNC_INTERVAL 42000UL //Get fresh NTP time about twice per day +#define NTP_DNS_MAX_ERRORS 6 +#define NTP_DNS_RETRY_BACKOFF_MS 300000UL // 5 minutes @@ - if (ntpDNSlookup->getErrorCount() > 6) { + if (ntpDNSlookup->getErrorCount() > NTP_DNS_MAX_ERRORS) { @@ - ntpLastSyncTime = millis() - (1000*NTP_SYNC_INTERVAL - 300000); // pause for 5 minutes + ntpLastSyncTime = millis() - (1000*NTP_SYNC_INTERVAL - NTP_DNS_RETRY_BACKOFF_MS); // pause for 5 minutesBased on learnings: In WLED code reviews, when code is modified or added, look for "magic numbers" (hardcoded numeric literals) and suggest replacing them with appropriate defined constants when those constants are meaningful in the context of the PR. This improves code maintainability and reduces the risk of inconsistencies.
Yeah, the minimum stack requirement for proper Tasks is really kind of frustrating honestly. The end result seems to be that every reasonably sized project ends up writing their own async executive, from as simple as just |
WiFi.hostByName() is blocking, leading to freezeing/glitches on NTP requests - what makes it worse is that there is a bug in the IDF that leads to a timeout on every call if the look-up ever fails - if that happens, LEDs freeze for a few seconds every 10 seconds.
The new class does the look-up async without blocking and also handles that bug with workaround code. IDF V5 should have that bug fixed though.
Fixes #3285
Summary by CodeRabbit
New Features
Configuration
Behavior Changes
✏️ Tip: You can customize this high-level summary in your review settings.