Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Net module model review and reimplementation. #1080

Closed
TerryE opened this issue Feb 23, 2016 · 37 comments
Closed

Net module model review and reimplementation. #1080

TerryE opened this issue Feb 23, 2016 · 37 comments
Assignees

Comments

@TerryE
Copy link
Collaborator

TerryE commented Feb 23, 2016

The net module attempts to provide a unified model for TCP and UDP and really fails to paper over the cracks in how these are structurally different and in particular that UDP is not connection orientated. In the current implementation:

  • You can have one TCP server and one UDP server
  • You can create multiple listeners per TCP server (number TBD) to listen on more than one port, but you will need to use bound upvals to know which socket you are receiving on since this information isn't passed as a parameter.
  • The TCP server will receive up to 5 connections and create a net.socket for each.
  • You can reply on connected sockets
  • You can (also) create multiple outbound TCP connections and send and receive on these.
  • You can create multiple listeners per UDP server (number TBD) to listen on more than one socket
  • The UDP model is different in that the listen() method ignores any connection function and the on and send methods a bound to the server and not the socket.
  • If you are receiving from multiple hosts or on multiple ports, your send will go the the last port/IP address and not necessarily the one which initiated the send.
  • If you do try any if these "permitted" multiple combinations then your ESP will regularly PANIC or die with an exception.

IMO, this is a mess.

  1. The model should be clearly documented and the ESP should not crash. We should either return a success fail status or through a Lua error, but the application should remain in control.
  2. We should not try to make UDP look as if it is connection oriented. Better that the on receive gets an IP and port and can whether to process it or not. we shouldn't have a createConnection for UDP but the send should just accept an IP and a port.

Thoughts, comments?

@pjsg
Copy link
Member

pjsg commented Feb 23, 2016

With regard to UDP, it is common programming practice to call connect on a socket and have it send and receive packets just to that destination (i.e. avoid the sendto calls).

I agree wholeheartedly with your first goal!

Also, if you are going to rework the net module, then making sure that it will be possible to move to IPv6 would be a significant advantage. I know that the current SDK does not support v6, but it is supposed to be coming.....

@TerryE
Copy link
Collaborator Author

TerryE commented Feb 23, 2016

With regard to UDP, it is common programming practice to call connect on a socket and have it send and receive packets just to that destination (i.e. avoid the sendto calls).

I agree that this is simpler but it is still not the same as the current model, because with this on a UDP server we would still want to use a net.socket for I/O and open a new socket if an inbound connection from from a different IP address, rather than the current "last sender gets the reply, regardless"

@jmattsson
Copy link
Member

  1. Yes, though this can be tricky due to the SDK internally failing to test for failed memory allocations and crashing & burning as a result thereof.

  2. Yes. We might need to break backwards compat though, which would be unpleasant.

As a general comment, the SDK's espconn layer attempts to ride a middle ground between a traditional socket API and a more highlevel stream-oriented API which hides the complexity of TLS from the user. For the better part I think it works on the TCP side, but the UDP aspect has been challeng(ed|ing).

@TerryE
Copy link
Collaborator Author

TerryE commented Feb 26, 2016

In terms of resource configuration, I think the sensible thing it to introduce a new net.config() call to allow the application to programmatically set the max numbers of connections per listener, etc.. This would also have the advantage tha if you weren't using TCP, say, then you could set the TCP resources to zero, and use the RAM for other purposes.

In terms of UDP, my instinct would be to maintain soft backwards compatibility -- that is correctly designed and written UDP apps would continue to work as before, but some of the nature of bizarre encoding and failure modes wouldn't be preserved.

We are also constrained by the espconn UDP API, but in my view UDP listener should return return a pseudo connection socket which is unique to the IP of the sender, so that if more than one other host is in UDP exchange with the ESP then the reply packets go tot he correct host.

If would also be nice to be able to have an optional port parameter on the send because some UDP protocols use different ports for the server and client ends.

@jmattsson
Copy link
Member

We might even want to consider not using the espconn layer for the UDP path, and use the lower-level LWIP api directly instead. I took that approach when I wrote the SNTP module, because the espconn abstraction was more a hindrance than a help at the time.

@TerryE
Copy link
Collaborator Author

TerryE commented Feb 27, 2016

This one is a biggy so I am not rushing into it. I want to pick some low hanging fruit off my tasklist first ;-?

@pjsg
Copy link
Member

pjsg commented Mar 6, 2016

@TerryE The work that I am doing to add certificate support adds some new code. Currently I have this in net.c, but I could break it out into it's own file. It adds another 200 lines (at the moment).

Which would you prefer?

@TerryE TerryE mentioned this issue Apr 8, 2016
@TerryE
Copy link
Collaborator Author

TerryE commented Apr 10, 2016

@pjsg @jmattsson @dnc40085 I keep looking at the net.c implementation and it's just horrible, which makes it very hard to rework without a total rewrite. If we do do a total rewrite that this will be view as a high risk change, so my instinct is to do this in incremental steps. I would suggest that the first is a code clean up which simply removes all of the deadwood, for example

  • paths that can never be taken, for example the code includes luaL_checkudata() calls which throw an error if the metatable is wrong, but the code still then check the result.
  • code that's been commented out (typically for over a year)
  • cut and paste repetition which should be either MACROed or pulled into a shared routine.

I've done a previous dry run and this dropped the size of the module by roughly a third in source lines and slightly less in terms of generated code. I would suggest that I don't attempt to fix any of the execution logic on this pass so that functional execution paths are preserved.

Thoughts?

@jmattsson
Copy link
Member

That sounds like a prudent approach if you're willing to spend the time on it!

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 11, 2016

I tend to interleave my work. When one issue is giving me brain ache then I switch to another. The net module is pretty core and I feel that we do need to progress this somehow.

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 13, 2016

@jmattsson @pjsg @devsaurus Here is a my first pass of the tidy of net.c. I would really appreciate you scanning this and giving me your reactions.

  • It's now in 1TBS
  • I've killed off long-dead commented out code.
  • I've used macros and outlining of common sequences to drop both the number of lines and generate code size, so the logic is a lot more readable.
  • There were a large number of trivial redundant checks. E.g all of the lua_check macros throw an error if the check fails, so there's no point in following one of these API calls with the inline logic to validate that the check has worked. For example luaL_checkudata() checks to see if the type item is a userdata with the given metatable. It will never return a NULL value so there is no point in following one of these calls with a NULL check.
  • I have used compile time optimization rather than preprocesor code where this results in cleaner and more readable source. For example if (SECURE(nud)) generates if ((nud->secure)) if SSL is enabled and if((0)) otherwise enabling the compiler to remove the branch (better than the nasty #ìfdef CLIENT_SSL_ENABLE method chopping if ... then ... else logic in half.
  • There are still a large number of checks which should be Lua assertions at best, and are redundant at worst.
  • There are a large number of silent error returns which I dislike. These are almost impossible to diagnose at an application level.
  • If failure is valid programmatic result then OK (e.g. some callbacks can be validlly omitted)
  • In some cases there should there be some status returned
  • In general IMO it is far better a throw so the error is clearly apparent.
  • There should be no silent errors. (Printing out a NODE_DBG() informational error doesn't count)
  • I know Philip dislikes the Lua allocator because it throws an error on allocation failure but I believe that we should still deprecate the use of anything else in Lua modules. My reasons for this are (i) the Lua allocator is integrated into the Lua GC. If an allocation failure occurs then the GC is run and the allocation retried. (ii) I know that if you don't establish a cleanup error handler, then you can get leakage in some cases. On these systems out-of -memory is never graceful, so it doesn't matter if you leak some small block, as the only sensible recovery is to restart the CPU.

Anyway please scan through this and give me your thoughts. Unfortunate since its now 1TBS and 2/3rds the size you can't sensibly do a diff, but I have kept the routine order for this pass so you can do a side-by-side.

An example of a silent error is net.createConnection(net.TCP,1) if CLIENT_SSL_ENABLE is not defined. This fails back to creating an insecure connection. Hence a programmer could forget to enable SSL in the firmware build, but the program will still appear to work, silently using an insecure connection. Not good.

PS. This version doesn't introduce any material functional changes (e.g. luaL_argcheck() gives more informative errors than luaL_error() but the errors will still occur at the same points / code paths.) If this receives general positive feedback then I can now start to sort out some of the known issues.

@devsaurus
Copy link
Member

Do you plan to issue an PR so we can comment on code in-line or do you expect us to feed back within this thread?

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 13, 2016

Sorry Arnim but when you look at the review screen in the commit, then you get

464 additions, 1,162 deletions not shown because the diff is too large. Please use a local Git client to view these changes.

So I don't think that doing a PR will help. At this stage I am more interested in comments on the general approach to see if there is any support for doing this properly. The lessons learnt can also carry across to our general module programming guide. I'd you guys like the macros then these might be worth hoisting into a eLua or nodeMCU header to make them generally available. I personally think this is a lot shorter, clearer and more efficient at runtime. I've more redundant code to take out, but my main aim is to have a solid base to fix some of the known limitations of the current implementation.

If you just want to refer to specific code then just use the line no.

@devsaurus
Copy link
Member

I favour the review features of a PR since they enable to have separate sub threads. Keeps things separated compared to a bulk issue thread. But I see your point.

So my comments after a quick read through. Will give it the code a try later.

Discarding the check for NULL result from luaL_checkudata()

I can't match your statements above with the description for Metatables. Are we running on a different flavour of lauxlib? Until now I thought that I don't have to study the code but can live with the online docs.

Memory leak when registering unknown methods?

There are paths through the conditional tree in net_on() that neither call NEWREF_NUDCB() nor error out. Will the pushed callback function argument remain on the stack and consume memory until reset?

Redundant check for LUA_NOREF

According to http://pgl.yoyo.org/luai/i/luaL_unref, luaL_unref() will "do nothing" when ref equals LUA_NOREF. The check can be removed IMO.

Lua allocator

How is the Lua allocator enabled for your code? It's not obvious to me.

Deprecated espconn_[secure_]sent()

SDK docs recommend to switch to espconn_[secure_]send().

Coding style

There are still some TABs used for indentation.


All in all the code looks very well structured and the macros enhance readability a lot 👌.

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 13, 2016

I favour the review features of a PR since they enable to have separate sub threads.

So do I, but github doesn't seem to support this the review on this level of difference. Also this is compile-clean only. I don't guarantee that it works, yet.

Discarding the check for NULL result from luaL_checkudata(). I can't match your statements above with the description for Metatables.

This is a 5.1 tweak. See here. The online PiL is for 5.0, and the wording in current book is consistent with the code "It raises an error if the object does not have the correct metatable, or if it is not userdata".

Memory leak when registering unknown methods? There are paths through the conditional tree in net_on() that neither call NEWREF_NUDCB() nor error out. Will the pushed callback function argument remain on the stack and consume memory until reset?

I will take a look, but in this case the original probably had the same bug. As I said, I haven't changed the logic this pass. It's just easier to see the flaws now.

According to pgl.yoyo.org luaL_unref() will "do nothing" when ref equals LUA_NOREF. The check can be removed IMO.

Correct, but since this is now out-lined into a common routine the check saves a call. If in general the cb isn't unref, then you are right, we should still remove it. Next pass.

How is the Lua allocator enabled for your code? It's not obvious to me.

Correct. It's not. This is for the next pass.

SDK docs recommend to switch to espconn_[secure_]send().

Yes, next pass.

There are still some TABs used for indentation.

Thanks. Will clean these up.

All in all the code looks very well structured and the macros enhance readability a lot.

Thanks. That was my aim. It's a lot easier to spot errors when the code is readable. In this case the binary is smaller and the execution slightly faster as well. 😄

@jmattsson
Copy link
Member

I've only had a very cursory look at it, and it didn't want me to run screaming for the hills. I'd say keep at it!

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 23, 2016

Gosh, this one goes on and on. I discussed using the Lua memory allocator system for espconn resources. Bad idea. Why? Because if you look at the espconn code, it can clone / copy espconn blocks (and free them) and it uses the OS allocator system to do this. I know that with the nodeMCU implementation, the Lua allocator is just a wrapper around the OS one, but one of the nice features of it being a unified hook is that you can rehook this in a test harness to check for leaks, etc. and this will all die messily if you do this rehook when some of the memory objects are allocated / freed by different systems (Lua / OS).

The API documentation also includes the caveat:

void *arg: pointer corresponding structure espconn. This pointer may be different in different callbacks, please don’t use this pointer directly to distinguish one from another in multiple connections.

Which is another way of saying: don't assume that the esconn block that you get passed back in a callback is one that you have allocated. I need to follow through the alloc / free logic here on some use cases.

Final comment here: most of the espconn routines can error out early for a variety of reasons, the most frequent (but not only) being an out of memory error. Needless to say, the current implementation always assumes successful execution. Durrhhh.

@pjsg
Copy link
Member

pjsg commented Apr 23, 2016

This is awful. It seems that getting rid of the espconn layer is the only viable solution..... However, that is probably a significant amount of work.....

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 23, 2016

Sorry Philip, I should have qualified this. The current net.c implementation always assumes successful execution. The ESPCONN layer might have a few holes, but its our code (not Espressif's) that is horrible.

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 24, 2016

Incidentally I note that the SDK now supports

#ifdef SERVER_SSL_ENABLE
    espconn_secure_set_default_certificate(default_certificate, default_certificate_len);
    espconn_secure_set_default_private_key(default_private_key, default_private_key_len);
    espconn_secure_accept(&esp_conn);
#else
    espconn_accept(&esp_conn);
#endif 

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 24, 2016

@jmattsson @pjsg @Alkorin et al, I've just started testing my cleaned up version of net.c and I can push a copy to my github repo if anyone is interested, but I must admit that I am rather ambivalent about all of this. I feel that this new version is a step improvement over the old code, but so much of ESPCONN subtleties are undocumented, so I have to keep trawling the ESPCONN code to retro-engineer these.

What really concerns me is that as far as I can deduce, the Espressif guys have added major functional enhancements in the various 1.x SDK releases which have stretched the ESPCONN parameter block interface well beyond its original design. The struct espconn block is now really an input parameter block, which is then cloned in the espconn layer -- possibly multiple times as you can my accept on multiple ports for a server, and servers can accept multiple sockets. This introduces a extra layer of object management and GC on top of the LwIP one and the Lua one. Hence we have ~1K lines of net.c + ~4K lines of espconn source over LwIP whereas a direct Lua module over LwIP only needs to use the Lua memory system and can dump a lot of this complexity. Even so it might be say 2-3K lines (probably on the low side).

  • Pros: LwIP is well documented and the 2K would be our source; the memory footprint both RAM and Flash would be less.
  • Cons: It's still double the current 1K lines, and modules would either need to use the net.c or direct LwIP stack. We don't know if there are any hidden dependencies on ESPCONN in say the Espressif WiFi code.

So I would appreciate some comment and feedback. Do I press on with the cleaned up net.c over ESPCONN? Are people more comfortable dumping ESPCONN?

@pjsg
Copy link
Member

pjsg commented Apr 25, 2016

There are other modules which make use of ESPCONN today. (Almost) Every module that touches the network does so with the ESPCONN stuff. Changing all of that will be a big struggle..... Before we can dump ESPCONN entirely we will be stuck having to maintain both approaches.

On the upside, we can fix a lot of the annoying features of the current stack.

@jmattsson
Copy link
Member

In my mind the only major "good" thing the ESPCONN layer provides is the same API for plain vs TLS connections. As Terry said, internally the ESPCONN design model has outgrown itself (and I'd argue that happened the moment they added support for TCP server sockets), and that's ignoring the rest of the code quality in that layer.

There is nothing stopping us from doing a gradual shift away from ESPCONN module by module if that's what we want to do. The SNTP module never used ESPCONN for example, and the EUS module was recently moved over to regular LwIP.

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 25, 2016

@pjsg Philip, the modules which use the ESPCONN layer are coap, http and mqtt. Changing these to use the "native" net library would be a relatively small change compared to the swap-out of ESPCONN in net.c itself, especially if I exposed the basic net API as a C callable interface.

However, let's see hat testing this new net.c module throws up. I've now got a better idea of the edge cases that will cause it trouble.

@Alkorin
Copy link
Contributor

Alkorin commented Apr 25, 2016

@TerryE push your code somewhere, I'll test it

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 25, 2016

@Alkorin Thomas, let me do a first pass set of tests to iron out the obvious stuff. I will push it up to my Github fork then, and post back the id here.

@jmattsson
Copy link
Member

jmattsson commented Apr 28, 2016

I just noticed that the sock:on('reconnection', ...) which we have documented will never actually fire, as internally the net module maps the "reconcb" (really, it should be "errorcb") to the disconnection event. I think we need to keep this for backwards compatibility, and we should probably deprecate the reconnection registration as part of the cleanup.

And to add to that, currently the reconcb is only registered once the socket has connected, so DNS failures are never noticed other than with a printf. That needs to change.

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 28, 2016

Yup, already picked this one up. When you look at the ESPCONN code, the name is a misnomer. It's actually an unexpected disconnect (as opposed to a negotiated disconnect), e.g. the link has timed out or the far end has aborted the link. I suppose it is called reconcb because this gives the app the opportunity to reconnect if that is appropriate.

What I've done is to flip these two around, so that the disconnect callback returns an error code, which is ESPCONN_OK in the case of a normal disconnect. Of course the cb function is free to ignore this, maintaining backwards compatability.

This is the sort of detail that I meant re having to trawl through the Espressif code to work out what is actually going on.

@TerryE
Copy link
Collaborator Author

TerryE commented May 1, 2016

Sorry guys, but this is all slow going, made worse by my own non-nodeMCU commitments. The new error checking around the espconn calls continues to throw up interesting issues. For example, espconn_disconnect() can be used to close a TCP socket, but it barfs on an TCP server close with an illegal argument (the server espconn block). No, at some point the use of espconn_delete() has been extended in the case of TCP servers to be the logical close of a espconn_accept(), but you need to trawl into the espconn_tcp.c to discover this. Upwards and onwards.

@danjohn
Copy link

danjohn commented Oct 9, 2016

There is another issue with the UDP "replay to last sender" policy: There is no way to use the built-in multicast feature (what would be very desirable for IoT nodes)!

Multicasting works in receiving from, and sending to a specific udp address+port (say, 240.0.0.240:7777), so that all **net.multicastJoin()**ed nodes share that communication.

This works well on the receiving side; but socket.send() (and server.send()) sends to "random" addresses (last received, or even 0.0.0.0:0 on startup).
There has to be a method to nail down send() to a specific address+port, otherwise any data will get unicasted to the last received node (which definitly breaks the communication mesh).

At last I see no way to establish multicasting so far... hints, anyone?
(If the current implementation really can't cope with this (but hoping for the best:), IMHO the most intuitive solution would be to adapt socket.connect(port,ip) (which makes no sense with UDP anyway) to specify a target address bevore send()...?)

@TerryE
Copy link
Collaborator Author

TerryE commented Oct 9, 2016

Sorry guys, building a house got in the way of completing this, and @djphoenix Yuri's work on #1379 now takes precedence. I need to get myself up to speed with what Yuri has done, and then I will update and reissue this if appropriate, once I've gone through his changes - - or close this.

@marcelstoer
Copy link
Member

Was fixed with #1379.

@eyaleb
Copy link

eyaleb commented Jan 3, 2017

I recently rebuilt the firmware and had an unexpected failure (is one ever expected?). After spending much time and drilling through the fw code I found that for UDP connect() was removed and is now part of send().

While this makes sense (UDP does not have the concept of connection) and I like it, I think that if this change is folded into master without backward compatibility then this fact should be highlighted.

@djphoenix
Copy link
Contributor

@eyaleb Don't need to look in code but in docs :)
I think there are case where we haven't option for backward compatibility. And so I think we're should note it something in changelog (every project have a changelog? Hmmm...)
Anyway firmware update always is breaking-aware.

@eyaleb
Copy link

eyaleb commented Jan 3, 2017

I did look at github and it does not say much, this is all there is:

net.udpsocket:send()
Sends data to specific remote peer.

readthedocs is similar.

Then again, this is dev so I expect surprises or omissions and I should be able to deal with them. My note is looking forward to when we hit master.

@marcelstoer
Copy link
Member

@eyaleb thanks for the note. I agree we could do better regarding this specific issue and I created #1701 for that.

We don't have a dedicated changelog, nor did I want to maintain one, but the major changes are highlighted in the release notes. Of course that doesn't help you if the changes in question haven't been released yet. We also keep (more or less internal) notes for each milestone at e.g. for 2.0.0. I once documented that in the contribution guidelines.

@TerryE
Copy link
Collaborator Author

TerryE commented Jan 3, 2017

@eyelab, the points that you make are valid, but don't belong on this issue, since it has been superseded by @djphoenix Yuri's work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants