Question about the ALPN bytes in JA4 #148

p-l- · 2024-08-21T13:56:15Z

Hi again!

This is related to #16 (see also #32 & #147). It seems that the Rust implementation uses the .is_ascii() method, which is true iff the char code is less than or equal to 0x7f. This means it would also be true for, e.g., a null byte. The Python implementation is even more straightforward: it tests whether the first byte is greater than 127 (code).

I'm not sure this is wanted. Even a space character could be unexpected in a JA4 fingerprint string. Since they are not escaped or hashed, I think those chars should be replaced with 9 (which seems to be the value chosen for that) if they are not strictly in the range 33 <= c <= 126.
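A minimal Python sketch of the two checks being compared here. The function names are illustrative and are not taken from either reference implementation; `is_ascii_byte` only mirrors the "code <= 0x7f" condition described above.

```python
def is_ascii_byte(b: int) -> bool:
    # Same condition as a plain ASCII test: any code up to 0x7f passes,
    # including control characters such as NUL.
    return b <= 0x7F


def is_visible_ascii(b: int) -> bool:
    # Stricter check suggested above: printable, non-space ASCII only.
    return 33 <= b <= 126


for b in (0x00, 0x20, 0x5F, 0x68, 0xFF):
    print(f"0x{b:02x}: ascii={is_ascii_byte(b)} visible={is_visible_ascii(b)}")
```

The null byte and the space both pass the ASCII test but fail the stricter range check, which is the gap this comment is about.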

And even so, would we accept an underscore in the value? I suppose that would break a lot of parsers... So maybe we even want to replace anything apart from lower & upper case letters and numbers?


john-althouse · 2024-08-26T21:07:32Z

Good point. While all IANA approved ALPN values start and end with a-z or 0-9, https://www.iana.org/assignments/tls-extensiontype-values/tls-extensiontype-values.xhtml#alpn-protocol-ids, anyone could put whatever they wanted in those fields which would break JA4.

As such I think an update to the spec is required and your suggestion makes sense. If no ALPN, the value is 00, if not a-z, A-Z, or 0-9, the value is 99, to indicate that an ALPN exists but it's special characters or non-ASCII.

How does that sound @p-l- ?

john-althouse · 2024-08-26T21:15:51Z

The other option, which would better preserve fingerprintability, would be to take the first and last nibble of the hex codes if the ALPN is special characters or non-ascii.
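As a rough illustration of the nibble idea, assuming "first and last nibble" means the first and last hex digit of the hex-encoded ALPN value (the helper name is hypothetical):

```python
def nibble_pair(alpn: bytes) -> str:
    # Hex-encode the whole value, then keep the high nibble of the first
    # byte and the low nibble of the last byte. Assumes a non-empty value.
    hex_digits = alpn.hex()
    return hex_digits[0] + hex_digits[-1]


print(nibble_pair(b"\xab\xcd"))  # ad
print(nibble_pair(b"__"))        # 5f
```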

lrstewart · 2024-08-26T21:37:49Z

> The other option, which would better preserve fingerprintability, would be to take the first and last nibble of the hex codes if the ALPN is special characters or non-ascii.

Pros:

fingerprints can distinguish between invalid alpns like "__" and "&&"
wireshark won't need to be updated / change behavior

Cons:

fingerprints can't distinguish between valid alpns like "c-webrtc" and invalid alpns like [ 0xc9, 0xcc ]
- but fingerprints already can't distinguish between valid alpns like "c-webrtc" and invalid alpns like "cc", so is this really a loss?
less human readable

I vote for the nibble strategy. It's less human-readable, but it does fingerprint better.
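The pros and cons above can be checked with a small Python sketch. Both functions are assumptions about how the two options would be implemented, not code from any JA4 implementation; "invalid" is taken to mean that the first or last byte is not alphanumeric.

```python
import string

ALNUM = frozenset((string.ascii_letters + string.digits).encode())


def with_99(alpn: bytes) -> str:
    # Fixed "99" marker whenever an end byte is not alphanumeric.
    if alpn[0] in ALNUM and alpn[-1] in ALNUM:
        return chr(alpn[0]) + chr(alpn[-1])
    return "99"


def with_nibbles(alpn: bytes) -> str:
    # First and last hex digit whenever an end byte is not alphanumeric.
    if alpn[0] in ALNUM and alpn[-1] in ALNUM:
        return chr(alpn[0]) + chr(alpn[-1])
    hex_digits = alpn.hex()
    return hex_digits[0] + hex_digits[-1]


for value in (b"__", b"&&", b"c-webrtc", bytes([0xC9, 0xCC])):
    print(value, with_99(value), with_nibbles(value))
```

With nibbles, "__" and "&&" give "5f" and "26" instead of both collapsing to "99", while "c-webrtc" and [ 0xc9, 0xcc ] both give "cc".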

p-l- · 2024-08-26T22:21:12Z

I'd be in favour of something like:

00 if no ALPN (I don't really like it as it would collide with 0http0 for example, but we need to keep this one because it happens a lot and changing it would break existing fingerprints).
first and last char if they are in [A-Za-z0-9], replaced with 9 otherwise (http\x00 => h9, \xffhttp => 9p, \x00\xff => 99) so that we keep 99 for non-ascii chars (here too, I'd like to use a different char that would not be within the valid chars, but we don't want to break existing fingerprints).
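A short Python sketch of these two rules, reproducing the three examples given. The function name and the use of `None` for "no ALPN" are assumptions made for illustration; single-byte and empty-string values are left out because they are the open cases listed next.

```python
import string
from typing import Optional

ALNUM = frozenset((string.ascii_letters + string.digits).encode())


def alpn_code(alpn: Optional[bytes]) -> str:
    # None stands for "no ALPN" in this sketch.
    if alpn is None:
        return "00"
    # Each end byte is judged separately; assumes at least one byte.
    ends = (alpn[0], alpn[-1])
    return "".join(chr(c) if c in ALNUM else "9" for c in ends)


print(alpn_code(b"http\x00"))  # h9
print(alpn_code(b"\xffhttp"))  # 9p
print(alpn_code(b"\x00\xff"))  # 99
```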

There are other cases we may want to consider (preferred: in my opinion only!):

if only one byte
- proposal: pad with either 0 or 9 (preferred, to be consistent with the following)
if ALPN exists and first protocol is an empty string
- proposal: 00 or 99 (preferred)
if ALPN exists and protocol list is empty (I think this one is already clear, but just in case)
- 00 (same as no ALPN extension)

The proposals are not ideal but I have tried to keep as compatible as possible with existing fingerprints.

WDYT?

lrstewart · 2024-08-26T22:34:20Z

> The proposals are not ideal but I have tried to keep as compatible as possible with existing fingerprints.

I opened #147 because Wireshark 4.2 was using nibbles while my implementation had matched the Rust implementation and was using "9". It's hard to say that "9" is any more compatible with existing fingerprints than the nibble approach.

p-l- · 2024-08-26T22:37:20Z

> It's hard to say that "9" is any more compatible with existing fingerprints than the nibble approach.

At first glance that was my impression, based on implementations here. Plus I did not read anything about using nibbles in the reference implementations or in the documentation. Did I miss something?

lrstewart · 2024-08-26T23:30:08Z

The documentation doesn't say anything about using "9" either, and the reference implementations aren't exactly consistent. Only Rust really does "9" right, and even it only considers non-ascii. Python does it wrong, and I don't think Zeek or the local Wireshark extension handle it at all. Wireshark requested clarification in #16, and now uses nibbles in their official release (that tripped me up because I was testing my implementation against tshark). Because it wasn't covered in the spec, I wouldn't expect consistency or even for most implementations to consider that particular edge case.

Any choice is going to break compatibility. I think it's more important that the spec makes a choice and is clear going forwards.

p-l- · 2024-08-27T08:43:58Z

> I think it's more important that the spec makes a choice and is clear going forwards.

Agreed!

john-althouse · 2024-08-28T21:01:53Z

Thanks @p-l- @lrstewart for bringing this back up. I see now that I dropped the ball in #16 by not updating the spec and then totally forgetting about this edge case.

As listed there, I like the option of using nibbles. Yes, 0xc9 0xcc would = cc same as c-webrtc but these are edge cases. The point is not to prevent collisions in the ALPN part of the fingerprint, which would not have much attacker value, they might as well use c-webrtc. The point is to prevent a vulnerability where an attacker could break JA4 or systems running JA4 by sending malformed ALPNs. The option of using nibbles preserves the fingerprintability of these non-RFC8447-17 compliant client hello packets.

I'm not aware of any existing fingerprints which have malformed ALPNs. If you're seeing some, I'd love to get some pcap.

Regarding the other cases:

If only one byte, first and last nibble still work.
If ALPN exists and protocol list is empty 01
If ALPN exists and first protocol is an empty string, 02

This preserves some uniqueness in those edge cases, which again, I haven't seen. What do you think?
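One possible reading of this proposal, sketched in Python. The function name, the use of `None` for a missing extension, and the choice to send every one-byte value through the nibble path are assumptions based on how the proposal is read in this thread.

```python
import string
from typing import List, Optional

ALNUM = frozenset((string.ascii_letters + string.digits).encode())


def alpn_code(protocols: Optional[List[bytes]]) -> str:
    if protocols is None:
        return "00"  # no ALPN extension at all
    if not protocols:
        return "01"  # extension present, protocol list empty
    first = protocols[0]
    if not first:
        return "02"  # first protocol is an empty string
    # Assumed reading: a single byte always uses nibbles, as does any
    # value whose first or last byte is not alphanumeric.
    if len(first) > 1 and first[0] in ALNUM and first[-1] in ALNUM:
        return chr(first[0]) + chr(first[-1])
    hex_digits = first.hex()
    return hex_digits[0] + hex_digits[-1]


print(alpn_code([b"h2"]), alpn_code([b"h"]), alpn_code([]), alpn_code([b""]))
```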

lrstewart · 2024-08-28T21:54:05Z

You're right, these are all VERY niche edge cases, so even changing the existing spec probably isn't the end of the world.

> If only one byte, first and last nibble still work.

Oh, I like that. That's a much better solution for the one byte case than "0"!

> If ALPN exists and protocol list is empty 01

I'm not sure about this. By my reading, the spec actually is already clear on that case: "If there are no ALPN values or no ALPN extension then we print “00” as the value in the fingerprint." While changing the spec isn't a deal breaker (VERY edge case), the benefit here seems pretty negligible and maybe not worth it.

> If ALPN exists and first protocol is an empty string, 02

I guess an empty string doesn't actually count as "no ALPN values". But if we've already committed to the "00" case for the other "missing" cases, I'm not sure how useful giving this one a unique code is. It might be simpler / more consistent just to keep "00" for all the "missing" cases. Since we want all versions of JA4 to agree, simplicity and ease of implementation does have value, and I'd argue more value than being able to distinguish between those "missing" cases.

p-l- · 2024-08-28T22:06:48Z

> If only one byte, first and last nibble still work.

True for non-ASCII, but for "h", why would we set the value to "68" when "http" will be "hp"? Wouldn't "h0" be a better option (for readability at least)?

Regarding the different "no ALPN" (or close) cases, it is true that using 01 when no ALPN is listed changes the existing spec, but it won't break existing fingerprint databases since nobody does that, while adding the opportunity to detect weird stuff in the future if that happens. Also, the existence or not of the ALPN extension is not even listed in the hashed list of extensions, so using the same value would lose that piece of information.

In any case, I really like having that "ALPN exists but is an empty string" represented differently than "no ALPN exist".

lrstewart · 2024-08-29T04:52:38Z

> Also, the existence or not of the ALPN extension is not even listed in the hashed list of extensions, so using the same value would lose that piece of information.

I'm still not convinced. We'd be able to distinguish between "ALPN exists but is an empty string" and "no ALPN exist", but not between "no ALPN exist" and alpn="01", or alpn="0abcd1", or alpn=[0x00, 0x01]. We don't really avoid losing a piece of information, because we can't tell whether we're printing "00"/"01"/"02" because of one of the "missing" edge cases or because of an alpn value. I'm not convinced that any complication of these edge cases is worth the minor distinctions between edge cases.
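To illustrate the collision, here is a small Python sketch, assuming alphanumeric end bytes are printed as characters and anything else falls back to nibbles (the function name is hypothetical):

```python
import string

ALNUM = frozenset((string.ascii_letters + string.digits).encode())


def alpn_code(alpn: bytes) -> str:
    # Alphanumeric ends: first and last character. Otherwise: first and
    # last hex digit of the whole value.
    if alpn[0] in ALNUM and alpn[-1] in ALNUM:
        return chr(alpn[0]) + chr(alpn[-1])
    hex_digits = alpn.hex()
    return hex_digits[0] + hex_digits[-1]


for value in (b"01", b"0abcd1", bytes([0x00, 0x01])):
    print(value, "->", alpn_code(value))
```

All three values print "01", so a dedicated "01" code for an empty protocol list could not be told apart from them.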

john-althouse · 2024-08-29T14:58:17Z

I've updated the JA4 spec to reflect our conversation here about non-alphanumeric characters in the ALPN, and have left the case where the ALPN exists but is an empty string as "00". This way, people who implemented the upper-nibble and lower-nibble method, like Wireshark, have nothing to update in their code.

See: https://github.com/FoxIO-LLC/ja4/blob/main/technical_details/JA4.md

lrstewart · 2024-08-29T17:54:04Z

Unfortunately I think this might still be ambiguous.

From the updated spec:

> If the ALPN value is non-alphanumeric (0x30-0x39, 0x41-0x5A, 0x61-0x7A), we take the first high-nibble and the last low-nibble.

Your examples only cover cases where all bytes in the alpn are non-alphanumeric. What if only one is?

It looks like Wireshark falls back to nibbles for both characters if either character is non-ascii: https://gitlab.com/wireshark/wireshark/-/merge_requests/12699/diffs That differs from the current Rust implementation, which considers each character separately and only replaces the non-ascii ones with "9". Which behavior did you intend? How should 0x30 0xAA be printed?
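The two behaviors can be contrasted with a Python sketch. Both functions follow the wording of this comment only; they are not derived from the Wireshark or Rust source, and the names are made up for illustration.

```python
def both_from_nibbles(alpn: bytes) -> str:
    # Behavior attributed to Wireshark above: if either end byte is
    # non-ASCII, both output characters come from the nibbles.
    first, last = alpn[0], alpn[-1]
    if first > 0x7F or last > 0x7F:
        hex_digits = alpn.hex()
        return hex_digits[0] + hex_digits[-1]
    return chr(first) + chr(last)


def per_character(alpn: bytes) -> str:
    # Behavior attributed to the Rust implementation above: each end byte
    # is judged separately and a non-ASCII one becomes "9".
    ends = (alpn[0], alpn[-1])
    return "".join(chr(c) if c <= 0x7F else "9" for c in ends)


value = bytes([0x30, 0xAA])
print(both_from_nibbles(value))  # 3a
print(per_character(value))      # 09
```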

john-althouse · 2024-08-29T18:24:35Z

Wireshark's implementation makes sense. I'll update it to say "If either the first, last, or both characters of the ALPN are non-alphanumeric"

lrstewart · 2024-08-29T18:40:09Z

Thanks! It'd probably be safest to also include an example with one alphanumeric character, to completely avoid any difference in interpretation.

p-l- · 2024-08-29T20:24:22Z

I have more questions...

I'm not sure what to do with one single alphanumeric char: should we

repeat the char twice (as the first and last are the same)?
fall back to the nibbles (but would that make sense since the char is alphanumeric)?
use the char + "0" (as "0" is used to mean empty in this case)?

My personal preference goes to using the char twice, as it avoids making an exception when no exception is really needed. For now, in IVRE, I'll keep that as it does not need an exception (and update it when it has been clarified)
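For a concrete comparison, the three options give different results for the same one-byte value. A Python sketch with hypothetical function names, assuming a single alphanumeric byte as input:

```python
def repeat_char(alpn: bytes) -> str:
    # Option 1: first and last char are the same byte, so it is repeated.
    return chr(alpn[0]) * 2


def nibbles(alpn: bytes) -> str:
    # Option 2: high and low nibble of the single byte.
    hex_digits = alpn.hex()
    return hex_digits[0] + hex_digits[-1]


def pad_with_zero(alpn: bytes) -> str:
    # Option 3: the char followed by "0".
    return chr(alpn[0]) + "0"


for value in (b"1", b"h"):
    print(value, repeat_char(value), nibbles(value), pad_with_zero(value))
```

For "1" this prints 11, 31 and 10; for "h" it prints hh, 68 and h0.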

Second question is: what should we do if ALPN extension has two (or more) protocol values, and the first one is empty? Should we use the next one or set "00"? I don't have any preference here. For now, in IVRE, I'll use the next value.

I know those are corner cases but I think it's important the spec covers all of them to make sure values are consistent across all implementations.

lrstewart · 2024-08-29T21:09:13Z

> I'm not sure what to do with one single alphanumeric char: should we

The discussed answer from this thread was to use the nibble strategy, but I agree that didn't really make it into the spec. It reads like the spec is saying to only do that for non-alphanumeric single bytes.

> what should we do if ALPN extension has two (or more) protocol values, and the first one is empty?

I think the only reasonable answer is to treat it the same as if the ALPN only had one empty protocol value. Skipping an empty value seems both unnecessarily complicated and misleading. But it also looks like the spec still isn't clear on how to handle an empty value. The discussed answer from this thread was to use "00" like the other "missing" cases.
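The difference between the two choices can be sketched in Python. The helper names are illustrative, and `to_code` assumes alphanumeric input to keep the example short.

```python
from typing import List, Optional


def pick_first(protocols: List[bytes]) -> Optional[bytes]:
    # Look only at the first entry; an empty one counts as missing.
    if protocols and protocols[0]:
        return protocols[0]
    return None


def pick_first_non_empty(protocols: List[bytes]) -> Optional[bytes]:
    # Skip empty entries and use the next non-empty one.
    return next((p for p in protocols if p), None)


def to_code(value: Optional[bytes]) -> str:
    # "00" for the missing cases, otherwise first and last character.
    if value is None:
        return "00"
    return chr(value[0]) + chr(value[-1])


offered = [b"", b"h2"]
print(to_code(pick_first(offered)))            # 00
print(to_code(pick_first_non_empty(offered)))  # h2
```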

lrstewart · 2024-08-30T18:01:58Z

How about this: #156

p-l- · 2024-08-30T20:30:43Z

That clarifies all the points. My only (minor) concern is about 1 char ALPN extensions: why create an exception and use the nibble pattern when we could use first and last chars (which happen to be the same)? I think that too many exceptions won't help make/keep the implementations simple, so unless it has an advantage for fingerprinting (but I don't think so), I would propose to remove that exception. WDYT?

lrstewart · 2024-08-30T21:00:41Z

Oh I see, so you're saying that 0x31 would be "11" instead of "31"?

The 1-byte case is going to be an exception no matter how we handle it. Like, we're going to have to call out that case specifically in the spec no matter how we end up handling it, because how it should be handled isn't obvious. So the question is just whether it belongs in the valid alpn handling (first and last char) or the invalid alpn handling (first and last nibble).

I lean towards it belonging in the invalid alpn handling, since no 1-byte alpns currently exist and I don't think a 1-byte alpn is any more likely to be approved than an alpn that ends in a non-alphanumeric character. However, it looks like Wireshark currently treats it as valid and just repeats the character (I tested). It's probably easiest to just match Wireshark.

p-l- · 2024-08-31T19:36:56Z

> The 1-byte case is going to be an exception no matter how we handle it.

I see what you mean but I don't agree. If we don't add an exception specifically for this case, then "\x31" is alphanumeric, so the first and last bytes are used. And you can see it is not an exception in the implementations I have written¹: I don't have to write a specific case for one-byte values (alphanumeric or not).

So unless it brings value to handle it specifically, I don't think we should do that.
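A minimal Python sketch of that point, assuming the "alphanumeric ends, else nibbles" rule. It is not the IVRE code, and the function name is hypothetical. There is no length check anywhere: for a one-byte value, the first and last byte are simply the same byte.

```python
import string

ALNUM = frozenset((string.ascii_letters + string.digits).encode())


def alpn_code(alpn: bytes) -> str:
    # Works unchanged for one-byte values; assumes a non-empty value.
    first, last = alpn[0], alpn[-1]
    if first in ALNUM and last in ALNUM:
        return chr(first) + chr(last)
    hex_digits = alpn.hex()
    return hex_digits[0] + hex_digits[-1]


print(alpn_code(b"\x31"))  # 11
print(alpn_code(b"\xab"))  # ab
print(alpn_code(b"h2"))    # h2
```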

¹ Both for the IVRE project: Zeek-based & Python/Scapy-based implementations.

lrstewart · 2024-09-04T20:57:45Z

It may not be an exception in the implementation, but the spec should still call it out as a special case. It also might be better to move this discussion to #156, since this issue is closed.

p-l- mentioned this issue Aug 21, 2024

Harden JA4 generation / parsing ivre/ivre#1637

Closed

4 tasks

john-althouse self-assigned this Aug 26, 2024

john-althouse added the bug Something isn't working label Aug 26, 2024

lrstewart mentioned this issue Aug 26, 2024

Clarify alpn edge case handling #147

Closed

john-althouse closed this as completed Aug 29, 2024

p-l- mentioned this issue Aug 29, 2024

Zeek/JA4: fix ALPN based on updated spec ivre/ivre#1645

Merged

p-l- mentioned this issue Aug 29, 2024

Nmap/JA4: fix ALPN based on updated spec ivre/ivre#1646

Merged

lrstewart mentioned this issue Aug 30, 2024

More clarifications of ALPN handling #156

Merged

lrstewart mentioned this issue Sep 5, 2024

fix: update handling of ja4 alpn edge cases aws/s2n-tls#4755

Merged

p-l- mentioned this issue Sep 12, 2024

JA4: fix ALPN based on updated spec ivre/ivre#1655

Merged

nakoo mentioned this issue Sep 28, 2024

fix alpn value O-X-L/haproxy-ja4#15

Merged
