Question about the ALPN bytes in JA4 #148
Comments
Good point. While all IANA-approved ALPN values start and end with a-z or 0-9 (https://www.iana.org/assignments/tls-extensiontype-values/tls-extensiontype-values.xhtml#alpn-protocol-ids), anyone could put whatever they wanted in those fields, which would break JA4. As such, I think an update to the spec is required, and your suggestion makes sense: if there is no ALPN, the value is 00; if the characters are not a-z, A-Z, or 0-9, the value is 99, to indicate that an ALPN exists but consists of special characters or non-ASCII. How does that sound @p-l- ? |
The other option, which would better preserve fingerprintability, would be to take the first and last nibble of the hex codes if the ALPN is special characters or non-ascii. |
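As a rough illustration of the nibble idea, here is a minimal Python sketch. The function name `first_last_nibbles` is hypothetical, and the reading of "first and last nibble" as the first and last hex digits of the whole ALPN value is an assumption based on this thread, not a quote from any reference implementation.

```python
def first_last_nibbles(alpn: bytes) -> str:
    """Hex-encode the ALPN value and keep its first and last hex digits.

    Assumption: "first nibble" is the high nibble of the first byte and
    "last nibble" is the low nibble of the last byte.
    """
    if not alpn:
        raise ValueError("ALPN value must contain at least one byte")
    hex_digits = alpn.hex()  # lowercase hex, two digits per byte
    return hex_digits[0] + hex_digits[-1]


print(first_last_nibbles(b"\xab\xcd"))  # high nibble of 0xAB, low nibble of 0xCD
```

With a single byte, the same function returns both nibbles of that byte, so some uniqueness is preserved even in that case.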
Pros:
Cons:
I vote for the nibble strategy. It's less human-readable, but it does fingerprint better. |
I'd be in favour of something like:
There are other cases we may want to consider (preferred: in my opinion only!):
The proposals are not ideal but I have tried to keep as compatible as possible with existing fingerprints. WDYT? |
I opened #147 because Wireshark 4.2 was using nibbles while my implementation had matched the Rust implementation and was using "9". It's hard to say that "9" is any more compatible with existing fingerprints than the nibble approach. |
At first glance that was my impression, based on implementations here. Plus I did not read anything about using nibbles in the reference implementations or in the documentation. Did I miss something? |
The documentation doesn't say anything about using "9" either, and the reference implementations aren't exactly consistent. Only Rust really does "9" right, and even it only considers non-ascii. Python does it wrong, and I don't think Zeek or the local Wireshark extension handle it at all. Wireshark requested clarification in #16, and now uses nibbles in their official release (that tripped me up because I was testing my implementation against tshark). Because it wasn't covered in the spec, I wouldn't expect consistency or even for most implementations to consider that particular edge case. Any choice is going to break compatibility. I think it's more important that the spec makes a choice and is clear going forwards. |
Agreed! |
Thanks @p-l- @lrstewart for bringing this back up. I see now that I dropped the ball in #16 by not updating the spec and then totally forgetting about this edge case. As listed there, I like the option of using nibbles. I'm not aware of any existing fingerprints which have malformed ALPNs. If you're seeing some, I'd love to get some pcap. Regarding the other cases: if there is only one byte, the first and last nibble still work. This preserves some uniqueness in those edge cases, which, again, I haven't seen. What do you think? |
You're right, these are all VERY niche edge cases, so even changing the existing spec probably isn't the end of the world.
Oh, I like that. That's a much better solution for the one byte case than "0"!
I'm not sure about this. By my reading, the spec actually is already clear on that case: "If there are no ALPN values or no ALPN extension then we print “00” as the value in the fingerprint." While changing the spec isn't a deal breaker (VERY edge case), the benefit here seems pretty negligible and maybe not worth it.
I guess an empty string doesn't actually count as "no ALPN values". But if we've already committed to the "00" case for the other "missing" cases, I'm not sure how useful giving this one a unique code is. It might be simpler / more consistent just to keep "00" for all the "missing" cases. Since we want all versions of JA4 to agree, simplicity and ease of implementation does have value, and I'd argue more value than being able to distinguish between those "missing" cases. |
Regarding the different "no ALPN" (or close) cases, it is true that using 01 when no ALPN is listed changes the existing spec, but it won't break existing fingerprint databases since nobody does that, and it adds the opportunity to detect weird behavior in the future if that happens. Also, the existence or not of the ALPN extension is not even listed in the hashed list of extensions, so using the same value would lose that piece of information. In any case, I really like having "ALPN exists but is an empty string" represented differently than "no ALPN exists". |
I'm still not convinced. We'd be able to distinguish between "ALPN exists but is an empty string" and "no ALPN exists", but not between "no ALPN exists" and alpn="01", or alpn="0abcd1", or alpn=[0x00, 0x01]. We don't really avoid losing a piece of information, because we can't tell whether we're printing "00"/"01"/"02" because of one of the "missing" edge cases or because of an alpn value. I'm not convinced that complicating these edge cases is worth the minor distinctions between them. |
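The collision described above can be reproduced with a small Python sketch. The helper `alpn_code` is hypothetical and assumes the rule discussed in this thread (first and last characters when both are alphanumeric, otherwise first and last hex digits); it is not taken from any reference implementation.

```python
import string

# Byte values of a-z, A-Z and 0-9.
ALNUM = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def alpn_code(alpn: bytes) -> str:
    """Two-character ALPN code under the assumed rule from this thread."""
    if not alpn:
        return "00"
    if alpn[0] in ALNUM and alpn[-1] in ALNUM:
        return chr(alpn[0]) + chr(alpn[-1])
    hex_digits = alpn.hex()
    return hex_digits[0] + hex_digits[-1]


# Three different ALPN values that all map to the same two characters.
values = [b"01", b"0abcd1", bytes([0x00, 0x01])]
codes = {alpn_code(v) for v in values}
print(codes)
```

All three inputs yield the same code, so a reserved marker such as "01" for a "missing" case could not be told apart from these values by looking at the fingerprint alone.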
I've updated the JA4 spec to reflect our conversation here about non-alphanumeric characters in the ALPN, and have left the case where the ALPN exists but is an empty string as "00". This way, people who implemented the upper-nibble and lower-nibble method, like Wireshark, have nothing to update in their code. See: https://github.com/FoxIO-LLC/ja4/blob/main/technical_details/JA4.md |
Unfortunately I think this might still be ambiguous. From the updated spec:
Your examples only cover cases where all bytes in the alpn are non-alphanumeric. What if only one is? It looks like Wireshark falls back to nibbles for both characters if either character is non-ascii: https://gitlab.com/wireshark/wireshark/-/merge_requests/12699/diffs That differs from the current Rust implementation, which considers each character separately and only replaces the non-ascii ones with "9". Which behavior did you intend? How should |
Wireshark's implementation makes sense. I'll update it to say "If either the first, last, or both characters of the ALPN are non-alphanumeric" |
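A minimal Python sketch of that wording follows. The names `is_alnum` and `ja4_alpn_code` are hypothetical; the "00" handling for a missing or empty value and the fallback to hex digits for both characters follow what is discussed in this thread, and repeating a single alphanumeric character is an assumption about a case that is debated further below.

```python
from typing import Optional


def is_alnum(byte: int) -> bool:
    """True only for ASCII 0-9, A-Z and a-z."""
    return (0x30 <= byte <= 0x39) or (0x41 <= byte <= 0x5A) or (0x61 <= byte <= 0x7A)


def ja4_alpn_code(first_alpn: Optional[bytes]) -> str:
    """Two-character ALPN code for the first ALPN value (sketch only)."""
    if not first_alpn:
        # No ALPN extension, no values, or an empty value.
        return "00"
    first, last = first_alpn[0], first_alpn[-1]
    if is_alnum(first) and is_alnum(last):
        return chr(first) + chr(last)
    # Either end is non-alphanumeric: use hex digits for BOTH characters.
    hex_digits = first_alpn.hex()
    return hex_digits[0] + hex_digits[-1]


for value in (b"h2", b"http/1.1", b"\x30\xab", None):
    print(value, ja4_alpn_code(value))
```

Note that `b"\x30\xab"` starts with an alphanumeric byte ("0") but still takes the hex path, because the last byte is not alphanumeric.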
Thanks! It'd probably be safest to also include an example with one alphanumeric character, to completely avoid any difference in interpretation. |
I have more questions... I'm not sure what to do with one single alphanumeric char: should we
My personal preference goes to using the char twice, as it avoids making an exception when no exception is really needed. For now, in IVRE, I'll keep that as it does not need an exception (and update it when it has been clarified) Second question is: what should we do if ALPN extension has two (or more) protocol values, and the first one is empty? Should we use the next one or set "00"? I don't have any preference here. For now, in IVRE, I'll use the next value. I know those are corner cases but I think it's important the spec covers all of them to make sure values are consistent across all implementations. |
The discussed answer from this thread was to use the nibble strategy, but I agree that didn't really make it into the spec. It reads like the spec is saying to only do that for non-alphanumeric single bytes.
I think the only reasonable answer is to treat it the same as if the ALPN only had one empty protocol value. Skipping an empty value seems both unnecessarily complicated and misleading. But it also looks like the spec still isn't clear on how to handle an empty value. The discussed answer from this thread was to use "00" like the other "missing" cases. |
How about this: #156 |
That clarifies all the points. My only (minor) concern is about 1 char ALPN extensions: why create an exception and use the nibble pattern when we could use first and last chars (which happen to be the same)? I think that too many exceptions won't help make/keep the implementations simple, so unless it has an advantage for fingerprinting (but I don't think so), I would propose to remove that exception. WDYT? |
Oh I see, so you're saying that The 1-byte case is going to be an exception no matter how we handle it. Like, we're going to have to call out that case specifically in the spec no matter how we end up handling it, because how it should be handled isn't obvious. So the question is just whether it belongs in the valid alpn handling (first and last char) or the invalid alpn handling (first and last nibble). I lean towards it belonging in the invalid alpn handling, since no 1-byte alpns currently exist and I don't think a 1-byte alpn is any more likely to be approved than an alpn that ends in a non-alphanumeric character. However, it looks like Wireshark currently treats it as valid and just repeats the character (I tested). It's probably easiest to just match Wireshark. |
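To make the two candidate treatments of a one-byte alphanumeric ALPN concrete, here is a small Python sketch. Both function names are hypothetical and neither is presented as the spec's answer.

```python
def one_byte_as_valid(alpn: bytes) -> str:
    """Treat the single byte as both first and last character."""
    return chr(alpn[0]) * 2


def one_byte_as_invalid(alpn: bytes) -> str:
    """Treat the single byte as malformed and print its two nibbles."""
    return alpn.hex()


alpn = b"h"  # 0x68
print(one_byte_as_valid(alpn), one_byte_as_invalid(alpn))
```

The first variant needs no special branch in an implementation that already takes the first and last character; the second keeps one-byte values in the same bucket as other malformed ALPNs.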
I see what you mean but I don't agree. If we don't add an exception specifically for this case, then So unless it brings value to handle it specifically, I don't think we should do that. |
It may not be an exception in the implementation, but the spec should still call it out as a special case. It also might be better to move this discussion to #156, since this issue is closed. |
Hi again!
This is related to #16 (see also #32 & #147). It seems that the Rust implementation uses the .is_ascii() method, which is true iff the char code is lower than or equal to 0x7f. That means it would be true for, e.g., a null byte. The Python implementation is even more straightforward and tests whether the first byte is higher than 127 (code).
I'm not sure this is wanted. Even a space character could be unexpected in a JA4 fingerprint string. Since they are not escaped or hashed, I think those chars should be replaced with "9" (which seems to be the value chosen for that) if they are not strictly in the range 33 <= c <= 126.
And even so, would we accept an underscore in the value? I suppose that would break a lot of parsers... So maybe we even want to replace anything apart from lowercase and uppercase letters and numbers?
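The three candidate checks can be compared side by side in a short Python sketch. The function names are hypothetical; `is_ascii_byte` only mirrors the "lower than or equal to 0x7f" behaviour described above and is not the Rust or Python reference code.

```python
def is_ascii_byte(b: int) -> bool:
    """ASCII range check: accepts control characters such as NUL."""
    return b <= 0x7F


def is_visible_ascii(b: int) -> bool:
    """Printable, non-space ASCII: 33 <= b <= 126."""
    return 33 <= b <= 126


def is_alnum(b: int) -> bool:
    """Only 0-9, A-Z and a-z."""
    return (0x30 <= b <= 0x39) or (0x41 <= b <= 0x5A) or (0x61 <= b <= 0x7A)


# NUL, space, underscore and the letter "h".
for b in (0x00, 0x20, 0x5F, 0x68):
    print(hex(b), is_ascii_byte(b), is_visible_ascii(b), is_alnum(b))
```

Only the alphanumeric check rejects the underscore, which is the character most likely to confuse parsers that split a JA4 string on "_".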