[RFC] switch from axTLS to BearSSL #3490

Closed
igrr opened this issue Aug 2, 2017 · 13 comments

Comments

@igrr
Member

igrr commented Aug 2, 2017

BearSSL is a relatively new TLS library. It has some features that may come in handy in the ESP8266 environment, such as:

  • can work without any dynamic memory allocation (much easier to support memory pools, more predictable than axTLS, easier handling of out-of-memory conditions)
  • IO interface is state machine based and doesn't assume the existence of threads. Hence it should be easy to integrate with the LwIP raw API (I think). @me-no-dev: this may be useful for async libraries as well.
  • configurable fragment buffers: can support half-duplex, full-duplex, asymmetric buffer sizes, etc. axTLS only supports half-duplex, and we have patched it to somewhat support dynamic (on-demand) fragment buffer size.

See https://bearssl.org/goals.html for more.

On the other hand, axTLS is fairly well studied by now. I have spent a good amount of time reading its source and doing some optimizations. Others (@slaff, @earlephilhower, @ADiea) have also become familiar with axTLS and made many improvements and bug fixes. If we switch to BearSSL, that would mean investing more time to learn the ins and outs of it. If we do it though, we may end up with a more predictable and reliable TLS implementation.

This issue is mainly intended to collect feedback and host discussion related to BearSSL in the context of this project.

@ADiea

ADiea commented Aug 2, 2017

Hi @igrr thanks for the heads up.
To me it still looks too immature to base commercial products on. axTLS is more mature and proven.
I would rather consider mbedTLS, because its elliptic-curve keys enable forward secrecy, something axTLS lacks for now and which is important in my opinion. There are already some projects using it like SuperHouse and a hub/team
If I remember correctly, mbedTLS wanted 2x16k buffers allocated up front; maybe SuperHouse somehow got around this issue.

Also, BearSSL is not on GitHub but on a proprietary Git server, so tracking upstream will be hard, as it is now with axTLS because it uses SVN...
In contrast, mbedTLS is on GitHub and easy to use as a submodule, as the SuperHouse project does...

@earlephilhower
Collaborator

I'm probably not the heaviest user, but the only real instability I've seen with axTLS has been when the heap runs out or gets fragmented. Unless one of those has a smaller heap footprint or a real, static allocation of about the same size as axTLS, then I don't think it's going to move the needle.

I don't think BearSSL being on a self-hosted GIT is really an issue other than for submitting pull requests. They do have a gitweb which seems to be up-to-date, so you can even trawl through code online.

If I had to guess, I'd say mbedTLS has a better chance of fitting in the little RAM we've got on chip. Looking through their code and readmes I see configuration options and suggestions for small-memory systems. And it's now owned by Softbank/ARM so there's that pedigree.

BearSSL seems to be worried about stopping denial-of-service by allocating a fixed amount of memory, so you can't fill up RAM by getting dozens of connections. That doesn't necessarily mean the fixed allocation is particularly small, just constant.

But I don't see any actual memory statistics so can't really know. mbedTLS's README talks about there being a 16K TLS buffer required by the spec. If you need one for xmit and one for receive, then you're out of luck anyway on the ESP8266 since only ~40KB is free to begin with before any connections...

@mtnbrit

mtnbrit commented Aug 2, 2017 via email

@ADiea

ADiea commented Aug 2, 2017

I think for full duplex you will need 2x16 kB buffers because of the max fragment size of SSL, so I don't think full duplex is doable with any system. I would directly allocate 16 kB for axTLS and that's it: one reusable connection, no more fragmentation...

@pornin

pornin commented Aug 4, 2017

Some extra information:

  • In TLS, records may go up to 16 kB of plaintext, and when receiving a record, you have to wait for the authentication tag at the end of the record to know if the data is correct or not, hence the need for a full 16 kB buffer.

  • However, it is possible to use smaller buffers if both parties agree. There is an extension (Max Fragment Length, from RFC 6066) by which the client may ask for a smaller maximum record length. This is supported by BearSSL (but not by, for instance, OpenSSL). However, the extension is poorly designed, in that only the client may ask for a small buffer, not the server; thus, this does not work well with situations where the client is big (say, a Web browser) and the server is small (e.g. some ESP8266 device). The maximum record length may be dropped down to 512 bytes, but encryption requires a bit of overhead; to account for the maximum encryption overhead (there may be up to 255 padding bytes with CBC cipher suites), BearSSL needs 325 extra bytes, so a full minimum of 837 bytes. Note that smaller records mean larger relative overhead of encryption, both in size (per-record header and authentication tag) and CPU cost.

  • BearSSL supports both half-duplex (shared input/output buffer) and full-duplex (separate input/output buffers) modes. In full-duplex mode, nothing requires that the input and output buffers have the same size. The protocol mandates a maximum record size, but no minimum, and sending smaller records is always permitted. If you are short on RAM, you need full-duplex because the underlying protocol is asynchronous (e.g. HTTP/1.1), and you cannot use the Max Fragment Length extension (because the BearSSL-using device is the server, and/or the peer does not support the extension), then you can still, for instance, have a 16 kB input buffer (16709 bytes, exactly) and a 2 kB output buffer, for a total of 18 kB instead of 32.

  • Buffers are exposed through the API. This means that when the application code wants to send some data, it can write it directly into the buffer (the API returns a pointer-and-length) without needing an extra outside buffer, as would be the case with a classic read()/write() API. This helps with saving a bit of RAM. I don't know if axTLS does the same.

  • Apart from the I/O buffer, BearSSL needs a context structure that maintains some running state, especially during the handshake. Size depends on the underlying architecture (size of pointers and alignment requirements). On 32-bit x86, that's 3348 bytes for a client, 3684 bytes for a server. If the client needs to validate the server's certificate (i.e. it is a client, and it does not already know the server's public key), then that's an extra context structure (3036 bytes on 32-bit x86). Context structures are allocated by the caller (the application that uses BearSSL) and can be anywhere in RAM (stack, static data, heap if there is a heap,...). BearSSL has no static modifiable data, so it is reentrant (i.e. you can maintain several SSL connections, each with its own context structure).

  • Occasionally, BearSSL needs up to about 3 kB of stack space, mainly for asymmetric cryptographic operations. Since it has a state-machine API, this use is purely transient (i.e. when the application goes about doing the actual low-level I/O, BearSSL has returned, so that stack space is free again).
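
The pointer-and-length style described in the list above can be illustrated with a small mock. This is not BearSSL code: `mock_engine`, `mock_sendapp_buf`, `mock_sendapp_ack` and `mock_write` are hypothetical names invented for this sketch, and the 2048-byte size is arbitrary. It only shows the idea that the application writes straight into the engine's own output buffer instead of staging data in a second buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mock, NOT the BearSSL API: an engine that owns one output buffer. */
#define MOCK_OUT_SIZE 2048

typedef struct {
    unsigned char out[MOCK_OUT_SIZE];
    size_t used;               /* bytes already written and acknowledged */
} mock_engine;

static void mock_init(mock_engine *e) { e->used = 0; }

/* Return a pointer to the free part of the output buffer and its length. */
static unsigned char *mock_sendapp_buf(mock_engine *e, size_t *len)
{
    *len = MOCK_OUT_SIZE - e->used;
    return (*len > 0) ? e->out + e->used : NULL;
}

/* Tell the engine that 'len' bytes were written into the returned area. */
static void mock_sendapp_ack(mock_engine *e, size_t len)
{
    assert(len <= MOCK_OUT_SIZE - e->used);
    e->used += len;
}

/* Write as much of 'msg' as fits, directly into the engine buffer. */
static size_t mock_write(mock_engine *e, const char *msg)
{
    size_t room, n = strlen(msg);
    unsigned char *dst = mock_sendapp_buf(e, &room);
    if (dst == NULL) return 0;
    if (n > room) n = room;
    memcpy(dst, msg, n);       /* the only copy: application data -> engine buffer */
    mock_sendapp_ack(e, n);
    return n;
}
```

After `mock_init(&e)`, a call such as `mock_write(&e, "GET / HTTP/1.1\r\n")` places the request line in `e.out` and advances `e.used`, with no intermediate application-side buffer.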

All together, a minimal client that uses a half-duplex mode and talks to a server whose public-key is already known, and that understands the Max Fragment Length extension, should work with as little as 837 + 3348 = 4185 bytes of RAM, and a 4 kB stack (for transient allocation). A minimal server, that can accept connections from big client (who will not send the Max Fragment Length extension), will need 16709 + 3684 = 20393 bytes of RAM, again in half-duplex mode.
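
The arithmetic above can be checked with a few lines of C. The constants are the figures quoted in this comment (context sizes are the 32-bit x86 values, so they may differ on other targets); the macro and function names are made up for this sketch and are not BearSSL identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Figures quoted in the comment above (context sizes are for 32-bit x86). */
#define TLS_MAX_PLAINTEXT    16384u  /* largest TLS record plaintext */
#define MFL_MIN_PLAINTEXT      512u  /* smallest Max Fragment Length (RFC 6066) */
#define RECORD_OVERHEAD        325u  /* worst-case per-record overhead */
#define CLIENT_CONTEXT        3348u
#define SERVER_CONTEXT        3684u
#define X509_MINIMAL_CONTEXT  3036u

/* Buffer needed to receive records of up to max_plaintext bytes. */
static size_t io_buffer_size(size_t max_plaintext)
{
    assert(max_plaintext >= MFL_MIN_PLAINTEXT && max_plaintext <= TLS_MAX_PLAINTEXT);
    return max_plaintext + RECORD_OVERHEAD;
}

/* Persistent RAM: I/O buffer(s) plus context structure(s); stack not included. */
static size_t ram_budget(size_t in_buf, size_t out_buf, size_t contexts)
{
    return in_buf + out_buf + contexts;
}
```

For example, `ram_budget(io_buffer_size(512), 0, CLIENT_CONTEXT)` gives the 4185-byte minimal client, `ram_budget(io_buffer_size(16384), 0, SERVER_CONTEXT)` gives the 20393-byte server, and a client that validates certificates would add `X509_MINIMAL_CONTEXT` to the context total.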

Right now, BearSSL is declared "beta", which means that it has passed through some extensive testing, and thus should have few remaining bugs. This also means that the features I'd like to add before version 1.0 (the "stable" version) should not require any breaking API change.

Code is not on GitHub so that the legal status is simpler (it's developed in Canada and distributed from Canada). The Git repository is on a dedicated VM that I rent (from OVH) for that exact purpose. The gitweb feeds dynamically on the repository, so by construction it is always up-to-date. As for pull requests, in any case, I am a complete maniac and I would never merge external code "as is": suggestions and patches are welcome (and I already receive some) but I will read them through and mostly rewrite them completely, because I want to be able to say: "I fully know and understand every single line of code in BearSSL".

@earlephilhower
Collaborator

@pornin , Thanks for the detailed info (and your concern for getting something as important as SSL right in a 1.0 version)!

Seems like OpenSSL has been slow with RFC6066 (at least from the pull requests..I think they're at month 13 with people still asking for tests to be added), so while it sounds great I think it will break most things on the internet for now.

But the suggestion on using a smaller xmit buffer, "... a 16 kB input buffer (16709 bytes, exactly) and a 2 kB output buffer, for a total of 18 kB instead of 32," sounds like a great thing to have. Sorry if it's a silly question, but I'm not familiar w/SSL internals: Does this limit the connection capabilities, or will it fragment packets automatically to the smaller size? That is, can I still do, for example, a HTTPS POST of 8KB of data if I have my send buffer only at 2K?

The 3K+ stack requirements are a bit rough, as the current setup gives a total of 4K (and this stack is used by the OS WiFi code, too, so it's not even all for user apps!). As long as it's bounded, I imagine @igrr can increase it if he swaps out axTLS.

@pornin

pornin commented Aug 5, 2017

Fragmentation is automatic and nominally invisible to applications. In fact, in the last few years, when using a CBC cipher suite with TLS 1.0, Web browsers have taken to automatically splitting off the first byte of each record into a record of its own (this is a defence against the "BEAST attack"), and they still work well with existing servers. There used to be some very poorly written applications that would not tolerate fragmentation, but they have basically died out. Pre-1.0 OpenSSL would not accept fragmentation for some of the handshake messages, but this has been fixed, and here we are talking about "application data" records anyway.
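
To make the 8 KB POST question concrete, here is a minimal sketch of how a payload ends up split across records when each record can carry only a limited amount of plaintext. The helper name is hypothetical, and the per-record plaintext limit is passed in as a parameter because the exact usable size for a given output buffer (buffer size minus per-record overhead) is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many records a payload is split into when each record can carry
 * at most max_plaintext bytes. Hypothetical helper for illustration only.
 */
static size_t records_needed(size_t payload_len, size_t max_plaintext)
{
    assert(max_plaintext > 0);
    size_t records = 0;
    while (payload_len > 0) {
        size_t chunk = payload_len < max_plaintext ? payload_len : max_plaintext;
        payload_len -= chunk;   /* this chunk goes out as one record */
        records++;
    }
    return records;
}
```

Under the simplifying assumption that a 2 kB output buffer carries 2048 bytes of plaintext per record, `records_needed(8192, 2048)` is 4: the POST still goes through, just as four records instead of one.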

Of note, during the initial handshake, when encryption is not active yet, BearSSL can handle records larger than its input buffer (since there is no authentication tag to verify at this point, data can be processed as it is received). It's only after the handshake that the maximum record size matters. In a closed application where both client and server code are controlled, one can ensure that outgoing records are small by adding some "flush" calls where appropriate.

RFC 6066 support would nominally break nothing. But it is useful only to the client, when RAM is scarce, and OpenSSL uses too much RAM to run on systems with little RAM, so it feels little pressure to implement it. I think patches have been floating around since at least 2014.

The biggest stack user is the RSA code. In order to support RSA keys up to 4096 bits, it uses stack buffers that eat up to 2208 bytes. One can gain a kilobyte or so by reducing the maximum supported RSA key size to a lower value, such as 2048 bits. In practice, most RSA keys are 2048 bits, but some CAs will use 3072 or 4096-bit RSA keys. If not using X.509 certificate validation, then only the server key size matters, and that will normally be 2048 bits (if using RSA).

@earlephilhower
Collaborator

earlephilhower commented Jan 30, 2018

@pornin I know this is an ooooold thread, but I just got around to building BearSSL for the ESP8266 and doing some testing this weekend.

A client w/the standard required 16K++ receive and minimum (876?) send buffer seems to take ~25.5KB. A server with the min send/recv. buffers was around ~5.5KB. Both of those numbers seem outstanding and would allow both a client that could talk to any server, and a server that could talk to any client, to run simultaneously in the ~43KB free on the ESP8266 Arduino system.

I did notice that the full stack for X509 validation (of your bearssl cert using Let's Encrypt as the trust anchor, actually) took 4.5-5K on the ESP8266 from the app calling BearSSL to it returning. GCC for the xtensa isn't compiled to dump stack sizes, and it's a pain to manually instrument each function to get a runtime accounting, so I didn't go into it in detail. Is there one function or area where a large chunk or two are stack allocated to do the RSA validation? As there's nominally only 4KB total for stack (including interrupts and inline TCP processing), I'd need to allocate those larger variables on the heap to have any chance of stability. In fact, since during this time data's not being xferred, I think I could piggyback on the already-allocated buffer memory and not actually require any more space.

I see there are lots of machine generated .c files. Are those built by the .NET code included in the repo, and is there an incantation for rebuilding them? The constant (state transition?) tables come to many KB and need to be moved to a different linker segment with a decorator, and anything that accesses them needs to use helper functions because the non-RAM segment they'd be moved to can only be read by 32-bit accesses. (There is also the possibility of patching GCC to only ever use 32-bit accesses, but that slows down even RAM accesses as GCC is not cognizant of any variable-specific requirements.) I can either hand-edit the generated .c files every release (bad idea) or patch the generator and keep a much smaller change...
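
For readers unfamiliar with the 32-bit-only restriction, this is roughly what such a helper function looks like. It is a portable illustration, not code from the port: the function names are hypothetical, and it assumes the table bytes are packed little-endian within each 32-bit word (byte 0 in the least significant bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read one byte from a table that may only be accessed with aligned 32-bit
 * loads. Bytes are assumed to be packed little-endian within each word
 * (byte 0 in the least significant bits). Hypothetical helper name.
 */
static uint8_t table_read_u8(const uint32_t *table, size_t index)
{
    uint32_t word = table[index >> 2];            /* one aligned 32-bit load */
    return (uint8_t)(word >> ((index & 3u) * 8)); /* select the wanted byte */
}

/* Same idea for a 16-bit entry stored at an even byte offset. */
static uint16_t table_read_u16(const uint32_t *table, size_t byte_offset)
{
    assert((byte_offset & 1u) == 0);
    uint32_t word = table[byte_offset >> 2];
    return (uint16_t)(word >> ((byte_offset & 2u) * 8));
}
```

Every access costs a shift and a mask on top of the load, which is why it makes sense to apply this only to the tables that actually live outside RAM rather than forcing 32-bit accesses everywhere.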

One last thing, what is the magic incantation to not validate the X509 cert at all, to save memory? Reading the doxygen it looks like I can pass in a custom X509 hash function, but not being a SSL expert I'm not sure exactly what that means or if I'm looking at the wrong spot entirely.

Thanks!

@pornin

pornin commented Jan 30, 2018

@earlephilhower For stack usage in RSA, the actual allocation occurs in src/rsa/rsa_i31_pub.c (if you are using the "i31" implementation; see rsa_i15_pub.c for the "i15" implementation); this is for public key operations, as will occur when verifying signatures on certificates. The allocation size depends on the macro BR_MAX_RSA_SIZE, defined in src/inner.h at the value 4096. Normally, you should be able to decrease that value to, say, 2048, which would save some space. Let's Encrypt certificate chains use RSA-2048 keys, so, for these certificates, you do not need full 4096-bit support.

However, at 4096 bits, stack usage is only a bit more than 2 kB for RSA; decreasing to 2048 bits will save about 1 kB. If you observe 4-5 kB stack usage, then there is some unaccounted extra stack usage elsewhere. BearSSL should normally keep itself within about 3 kB of stack space at all times. Predicting exact stack usage is hard since it depends on the target architecture, and how the C compiler allocates space; maybe GCC for xtensa does things suboptimally. You should first check that the relatively bulky state structures (br_ssl_client_context, br_x509_minimal_context) are not allocated on the stack.
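
As a rough back-of-the-envelope check of the "about 1 kB" saving, one can scale the 2208-byte figure quoted earlier in the thread. The linear scaling is an assumption made for this sketch, not a statement about how the buffers are actually declared, and the names are hypothetical.

```c
#include <assert.h>

/* Figure quoted earlier in the thread: RSA stack buffers at the 4096-bit default. */
#define RSA_STACK_AT_4096 2208u

/*
 * Rough estimate only: assumes the stack buffers scale linearly with the
 * maximum modulus size, which is an approximation made for this sketch.
 */
static unsigned rsa_stack_estimate(unsigned max_rsa_bits)
{
    assert(max_rsa_bits > 0 && max_rsa_bits <= 4096u);
    return (RSA_STACK_AT_4096 * max_rsa_bits) / 4096u;
}
```

Under that assumption, `rsa_stack_estimate(2048)` is 1104 bytes, so dropping from 4096 to 2048 bits frees roughly 1.1 kB of stack, in line with the estimate above.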

For the machine-generated files: you can rebuild them with "make kT0". It will invoke the T0Comp.exe compiler, which is written in C#; on Linux you'll need to install Mono to do that (on Debian-like Linux systems, install the "mono-devel" package to get both the Mono runtime, and the C# compiler, which is needed if you modify T0Comp.exe itself). The generated C files are portable, which is why I can simply include them in the source archive. If you want to modify the .t0 files, or the T0Comp compiler itself, then you'll need to re-run "make kT0" (this invocation will recompile T0Comp.exe itself if necessary; then it will invoke it to regenerate the C files). To add extra directives to the static tables, you will want to modify T0/T0Comp.cs, lines 1776 to 1787. I am interested in seeing the modifications you need: I may be able to add some generic hook that lets you do what you want through a simple macro definition when compiling C files; this would avoid all this business with T0Comp.

About not validating the X.509 certificate: for SSL to actually achieve some security, the client MUST have some way to make sure that it uses the proper public key (the one that truly belongs to the intended server). The normal way to do that is through validation, but if you have other methods to get that public key, then it is possible to do otherwise. The certificate validation engine is pluggable, and BearSSL comes with two implementations: the "minimal" engine, which is the default and performs the basic steps of validation (name matching, signature verification,...), and the "knownkey" engine, which is used when the client already knows in some unspecified way the server's public key, and just wants to use it. The "knownkey" engine simply discards whatever certificates the server sends, and uses the configured public key instead. The API is explained on: https://www.bearssl.org/x509.html
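
The "pluggable engine" idea can be sketched in plain C as a structure holding a function pointer. Everything below is hypothetical and deliberately simplified: the type and function names are invented for illustration and are not BearSSL's structures or API (see the URL above for the real one). It only shows a "knownkey"-style engine that ignores the presented certificate and hands back the configured key.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; these are NOT BearSSL's structures. */
typedef struct {
    const unsigned char *data;
    size_t len;
} pubkey;

typedef struct x509_engine x509_engine;
struct x509_engine {
    /* Process the server's certificate bytes; return the key to use, or NULL. */
    const pubkey *(*get_server_key)(x509_engine *self,
                                    const unsigned char *cert, size_t cert_len);
    pubkey known;   /* only used by the "knownkey"-style engine */
};

/* "knownkey" style: ignore whatever the server sent, use the configured key. */
static const pubkey *knownkey_get(x509_engine *self,
                                  const unsigned char *cert, size_t cert_len)
{
    (void)cert;
    (void)cert_len;
    return &self->known;
}

static void knownkey_init(x509_engine *e, const unsigned char *key, size_t len)
{
    e->get_server_key = knownkey_get;
    e->known.data = key;
    e->known.len = len;
}

/* The TLS layer only ever calls through the function pointer. */
static const pubkey *tls_server_key(x509_engine *e,
                                    const unsigned char *cert, size_t cert_len)
{
    assert(e != NULL && e->get_server_key != NULL);
    return e->get_server_key(e, cert, cert_len);
}
```

A validating engine would plug into the same slot with a different `get_server_key` that parses and checks the chain before returning a key, which is what lets the TLS layer stay unaware of which policy is in use.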

The default SSL client initialization function is br_ssl_client_init_full(), that expects (among other things) a br_x509_minimal_context engine. However, you can also perform a custom client initialization, as explained in samples/custom_profile.c (I suppose you already do that, in order to reduce the number of algorithms linked in the binary). That way, you can also choose which X.509 validation engine will be used.

It would be entirely possible, and actually easy, to make a third, reduced "validation" engine that would simply decode (not validate) the server certificate to get the public key, and trust it. BearSSL includes a non-validating X.509 certificate decoder (look up br_x509_decoder_context). However, a client that simply trusts whatever the server sends is also a client that can be fooled with a simple Man-in-the-Middle attack, so don't do that. It is a very classic failure of embedded systems that use SSL.

(Conceptually, you could use TOFU, i.e. "Trust On First Use": the client would simply trust the server key when first connecting, then remember it, and enforce its use for all subsequent connections. This is not a bad model, but it can be tough to do properly. Notably, it's hard to make TOFU work in a context where you can still occasionally change the server public key, without letting an active attacker fool the client. It would require some sort of explicit pairing process, just like Bluetooth gadgets, or SSH clients.)
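
A minimal sketch of the TOFU check itself, assuming the client stores a 32-byte fingerprint of the server public key (for instance a SHA-256 digest, computed elsewhere). The store, the names and the fingerprint length are assumptions for illustration; persistence and the explicit re-pairing step mentioned above are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PIN_LEN 32   /* e.g. a SHA-256 digest of the server public key */

/* Hypothetical pin store; on a real device this would live in flash/EEPROM. */
typedef struct {
    bool have_pin;
    unsigned char pin[PIN_LEN];
} tofu_store;

typedef enum { TOFU_FIRST_USE, TOFU_MATCH, TOFU_MISMATCH } tofu_result;

/*
 * Check a key fingerprint against the store. On first use the fingerprint is
 * remembered; afterwards any different fingerprint is rejected.
 */
static tofu_result tofu_check(tofu_store *s, const unsigned char fp[PIN_LEN])
{
    assert(s != NULL && fp != NULL);
    if (!s->have_pin) {
        memcpy(s->pin, fp, PIN_LEN);
        s->have_pin = true;
        return TOFU_FIRST_USE;
    }
    /* memcmp is fine here: the fingerprint of a public key is not secret. */
    return memcmp(s->pin, fp, PIN_LEN) == 0 ? TOFU_MATCH : TOFU_MISMATCH;
}
```

Note that a mismatch never overwrites the stored pin; replacing it silently would let an active attacker take over, which is exactly the hard part of key rotation described above.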

@earlephilhower
Collaborator

Much appreciate the detailed info!

For the stack test, all buffers that the app passed in to BearSSL were new'd from the heap, so this measurement was just inside the library + the LWIP library + anything the ESP IP stack itself took out of the current stack. So some could definitely have been outside of BearSSL, but still in the code flow from the app to the lib and back.

I definitely hear you on not validating the x509 certificate. On a real commercial or professional-level product that's negligent, but for folks starting on the Arduino it's kind of harsh to tell them they need to figure out the root CA for, say, www.cnn.com, download and convert the cert, add it to their sketch, and recompile their app just to use an HTTPS RSS news feed. Plus, they'd need to do it all over again if they decided they wanted to watch the BBC news feeds instead, making it rather unwieldy. We don't have the luxury of enough space to include a whole directory of trusted CAs. :(

@earlephilhower
Collaborator

earlephilhower commented Jan 31, 2018

The T0 interpreter changes are actually quite minimal. The instruction and jump tables can go into PROGMEM and use a simple accessor helper as they are not touched except in very focused spots. The constant datatable can't, as it seems to be passed out of the T0 and into certain pluggable functions (but it takes under 700 bytes total so it's not a big deal compared to the ~9KB for the other two tables).

The preliminary diffs for the C# are attached for your perusal, but it's still a WIP so please don't bother doing anything other than looking them over and seeing if there's something that makes you cringe in them:
T0.diffs.txt

With this and moving the crypto (u)int32_t tables (which requires simply adding the "PROGMEM" decorator to the static const [] declaration...a simple SED script may be able to do it) it leaves ~18KB free heap out of the 44KB total while supporting a SSL bidirectional client connection.

@earlephilhower
Collaborator

I've got a pre-alpha version replacing WiFiClientSecure w/a bidirectional BearSSL one in my Arduino fork: https://github.com/earlephilhower/Arduino/tree/bearssl_wip . The bearssl.ino example has downloaded your homepage so many times during my debug that I think I could re-type it from memory now.

SSL_io and examples were very handy in getting it up and running so fast!

Still work to go to replace the existing axtls server and handle the (IMO very silly) Arduino "copy objects by value instead of passing pointers" refcounting/etc.

@devyte
Collaborator

devyte commented May 29, 2018

Given that #4273 is merged, I'm setting this as staged for release.

@devyte devyte added this to the 2.5.0 milestone May 29, 2018
@d-a-v d-a-v modified the milestones: 2.5.0, 2.4.2 May 31, 2018
@devyte devyte closed this as completed Aug 1, 2018