Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate right-length node name #30491

Closed
turing85 opened this issue Jan 19, 2023 · 26 comments · Fixed by #36752
Closed

Generate right-length node name #30491

turing85 opened this issue Jan 19, 2023 · 26 comments · Fixed by #36752
Labels
area/narayana Transactions / Narayana kind/enhancement New feature or request
Milestone

Comments

@turing85
Copy link
Contributor

turing85 commented Jan 19, 2023

Background

Currently, when we pass a value to quarkus.transaction-manager.node-name that is too long, the application fails to start with an exception:

2023-01-19 17:37:06,788 WARN  [com.arj.ats.arjuna] (Quarkus Main Thread) ARJUNA012138: Node name cannot exceed 28 bytes!
2023-01-19 17:37:06,788 WARN  [com.arj.ats.arjuna] (Quarkus Main Thread) ARJUNA012138: Node name cannot exceed 28 bytes!
2023-01-19 17:37:06,797 ERROR [io.qua.run.Application] (Quarkus Main Thread) Failed to start application (with profile dev): java.lang.IllegalArgumentException
	at com.arjuna.ats.arjuna.coordinator.TxControl.setXANodeName(TxControl.java:178)
	at io.quarkus.narayana.jta.runtime.NarayanaJtaRecorder.setNodeName(NarayanaJtaRecorder.java:36)
	at io.quarkus.deployment.steps.NarayanaJtaProcessor$build75739834.deploy_0(Unknown Source)
	at io.quarkus.deployment.steps.NarayanaJtaProcessor$build75739834.deploy(Unknown Source)
	at io.quarkus.runner.ApplicationImpl.doStart(Unknown Source)
	at io.quarkus.runtime.Application.start(Application.java:101)
	at io.quarkus.runtime.ApplicationLifecycleManager.run(ApplicationLifecycleManager.java:110)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:69)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:42)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:122)
	at io.quarkus.runner.GeneratedMain.main(Unknown Source)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at io.quarkus.runner.bootstrap.StartupActionImpl$1.run(StartupActionImpl.java:103)
	at java.base/java.lang.Thread.run(Thread.java:1589)

2023-01-19 17:37:06,797 ERROR [io.qua.run.Application] (Quarkus Main Thread) Failed to start application (with profile dev): java.lang.IllegalArgumentException
	at com.arjuna.ats.arjuna.coordinator.TxControl.setXANodeName(TxControl.java:178)
	at io.quarkus.narayana.jta.runtime.NarayanaJtaRecorder.setNodeName(NarayanaJtaRecorder.java:36)
	at io.quarkus.deployment.steps.NarayanaJtaProcessor$build75739834.deploy_0(Unknown Source)
	at io.quarkus.deployment.steps.NarayanaJtaProcessor$build75739834.deploy(Unknown Source)
	at io.quarkus.runner.ApplicationImpl.doStart(Unknown Source)
	at io.quarkus.runtime.Application.start(Application.java:101)
	at io.quarkus.runtime.ApplicationLifecycleManager.run(ApplicationLifecycleManager.java:110)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:69)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:42)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:122)
	at io.quarkus.runner.GeneratedMain.main(Unknown Source)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at io.quarkus.runner.bootstrap.StartupActionImpl$1.run(StartupActionImpl.java:103)
	at java.base/java.lang.Thread.run(Thread.java:1589)

It would be a better developer experience when if quarkus-narayana-jta transformed the value to a valid value.


Story

AS a developer
WHEN I provide a value for quarkus.transaction-manager.node-name
THEN the extension quarkus-narayana-jta guarantees that the value is transformed to a valid node name if necessary.

Implementation ideas

No response

@turing85 turing85 added the kind/enhancement New feature or request label Jan 19, 2023
@maxandersen maxandersen added area/narayana Transactions / Narayana and removed triage/needs-triage labels Jan 19, 2023
@quarkus-bot
Copy link

quarkus-bot bot commented Jan 19, 2023

/cc @mmusgrov (narayana)

@maxandersen
Copy link
Member

Wondering if we should just sha any value longer than 28(?) bytes ?

What could go wrong :)

@turing85
Copy link
Contributor Author

FTR: relevant zulip discussion

@mmusgrov
Copy link
Contributor

mmusgrov commented Apr 19, 2023

@maxandersen Wouldn't truncating the hash increase the probability of a collision.

If there is someone else here with a background in hashing functions that can provide an estimate of the two probabilities (truncated hash versus the truncated hashes of any of the other pods) then we can decide how exposed we'd be if we go with truncating the hash.

@mmusgrov
Copy link
Contributor

But if the hashes for two different nodes collide then you could end up with two transaction recovery managers independently recovering the same in doubt transaction, probably resulting in a heuristic outcome, that would be wrong ;-)

@maxandersen
Copy link
Member

So your concern is that a sha256 is 32 characters and we only have room for 28?

Where does that 28 size come from anyway ? ::)

@maxandersen
Copy link
Member

Just realizing sha-224 exists which would fit into 28 bytes. Thus not be truncated.

My understanding is that collisions on small values is still very unlikely - one can potentially have a hit if one generate a very long string that coincidentally has same hash. But that is not bound to happen in any realistic usage of these key to signify identity.

Don't know the exact numbers of these but should be possible to find.

@maxandersen
Copy link
Member

maxandersen commented Apr 20, 2023

Okey so late night reading. The hash collision is the same as the birthday problem. This article explains it pretty well https://kevingal.com/blog/collisions.html with a calculator https://kevingal.com/apps/collision.html

Let's say we expect there to be on average 4(?) nodes participating; how likely is it their values will have hash collisions given their values are most likely unique (i.e. host names are user entered keys with the intention to be unique) then the simplified calculation is:

4^2/2*(2^224) = 215679573337205118357336120696157045389097155380324579848828881993728

Approximate: 2.15*10^68

It's way more likely a asteroid hits and eliminate the planet in our lifetime than there will be collisions Afaics :)

@maxandersen
Copy link
Member

Ps. Check my math - I simply based it on the article explaining the approximation for the birthday problem.

@mmusgrov
Copy link
Contributor

Interesting, this is a useful analysis.

The limitation on the size of the node name is because we include it in the (X/Open) Xid which is a fixed size data structure.
When deployed on k8s we need to further restrict the node name/pod name length to 23 characters.
We would need to reverse the hash to get back the node name (I think this is what Jonathan was referring to in the linked zulip chat) but hash functions are not reversible in general. In the chat someone did suggest a way to "re-establish" the connection between the cryptic value and the actual name.

But I think you have kick-started another round of discussions about how to work around the issues ...

@maxandersen
Copy link
Member

Is there a technical reason for needing the reversal? If yes; then we should consider using compression instead.

If it's just to locate the node; doesn't the protocol allow to annotate/tag the log to get more context ?

@turing85
Copy link
Contributor Author

turing85 commented May 6, 2023

...When deployed on k8s we need to further restrict the node name/pod name length to 23 characters. ...

@mmusgrov Why is that? I know that there is a limit on route URLs (I think of 63 characters). I definitively know that we have pod-names with more than 23 and 28 character in production in an OpenShift cluster.

@turing85
Copy link
Contributor Author

turing85 commented May 6, 2023

Just realizing sha-224 exists which would fit into 28 bytes. Thus not be truncated.

@maxandersen while this is true, not all of the generated bytes represent printable characters. So sha224sum would actually result in a 56-character long hexdec string (28 * 2). Not sure how relevant this is for the X/Open Xid. Maybe @mmusgrov has some additional input?

@turing85
Copy link
Contributor Author

turing85 commented May 6, 2023

Okey so late night reading. The hash collision is the same as the birthday problem. This article explains it pretty well https://kevingal.com/blog/collisions.html with a calculator https://kevingal.com/apps/collision.html

Let's say we expect there to be on average 4(?) nodes participating; how likely is it their values will have hash collisions given their values are most likely unique (i.e. host names are user entered keys with the intention to be unique) then the simplified calculation is:

4^2/2*(2^224) = 215679573337205118357336120696157045389097155380324579848828881993728

Approximate: 2.15*10^68

It's way more likely a asteroid hits and eliminate the planet in our lifetime than there will be collisions Afaics :)

Not sure the math is correct, but the order of magnitude seems sensible.

So basically

f(k, n) = prod_(i = 0)^(k - 1) ((2^n - i) / (2^n)) 

should be the probability to have k hashes of bit-length n that do not have a collision (f ~ "(collision-)free").

Conversely,

c(k, n) = 1 - f(k, n) 
        = 1 - prod_(i = 0)^(k - 1) ((2^n - i) / (2^n))

is the probability that at least one collision occurs (c ~ "collision").

Wolframalpha refuses to calculate the value c(4, 244) (wolframalpha.com), but happily computates c(4,1000) as 0 (wolframalpha.com), so the probability seems astronomically low.

If we let wolframalpha calculate f(4, 224) instead, it is pretty darn close to 1 (wolframalpah.com).

so yeah... sha224 should be sufficient.

@turing85
Copy link
Contributor Author

turing85 commented May 7, 2023

Okay, I used the approximation used by the link provided by @maxandersen and calculated the number of nodes necessary to reach a probability of 0.01 = 1% for a collision (wolframalpha.com), which comes down to something in the range of 10^32. For comparison: the estimated number of grains of sand on earth is in the order of 10^18. So yeah... this should be sufficient.

@turing85
Copy link
Contributor Author

turing85 commented Sep 7, 2023

bump so... do we wanna do this?

@mmusgrov
Copy link
Contributor

mmusgrov commented Sep 13, 2023

I'm not sure what to say here other than to state our needs:

  1. We embed the nodeIdentifier into XIDs (these identify resource manager XA branches) and we need the nodeIdentifier to stop independent recovery managers from trying to recover the same branch.
  2. The XID data structure is a fixed size and after we have put all of the other information, that we and XA demand, into the structure we are only left with 23 bytes for the nodeIdentifier.

So our key requirement is uniqueness and I don't feel sufficiently qualified to comment on the solutions that Marco and Max have presented here. So to answer your question, we would like to do this but I'm uncomfortable with the solutions presented so far. Can we not detect the problem, of someone setting too long a node name, at build time?

@turing85
Copy link
Contributor Author

turing85 commented Oct 21, 2023

@mmusgrov The node identifier is something one would set at run time. Think, for example, the pod name of a pod in a stateful set. We cannot detect this value at build time. Also: where are the 23 bytes coming from if the error message says 28 bytes?

@mmusgrov
Copy link
Contributor

Thanks.

The 23 byte node identifier limit is something that was originally imposed when running WildFly on OpenShift and subsequently got added as a WildFly restriction too and there are a number of JIRAs for it, none of which say why. I've asked around a few times without success and the two people that I think could know the answer no longer work in this area (and no longer work for RedHat). Perhaps it's just one of those things that got "lost in the mists of time" and is no longer relevant, I know that OpenShift has changed a lot since the restriction was added. So let's just assume this isn't applicable to quarkus, I'm comfortable doing that. Note that the 28 bytes for the datum is still a hard requirement since we inherit this limitation from the XA specification which constrains the size of an Xid.

@turing85
Copy link
Contributor Author

28 bytes is fine, we have hash-algorithms that generate 28-byte hashes. But there is nothing that generates 23-byte hashes.

@turing85
Copy link
Contributor Author

turing85 commented Oct 27, 2023

So would an identifier like "Rӽ[/d��h�}1�:����o¦Z��2���" be fine, or is this problematic? (This is basically the result of hashing some string with SHA224)

@mmusgrov
Copy link
Contributor

That would be fine with me. I know it was said elsewhere that this would make debugging a pain but that complaint can be side-stepped if the boot log can report the original value prior to hashing?

@turing85
Copy link
Contributor Author

Printing the initial value is no problem. So the last question... do we want to always hash, or only if the name is too long (and print information accordingly)?

@mmusgrov
Copy link
Contributor

mmusgrov commented Oct 27, 2023

I don't know. But if you only hash when necessary then if someone complained about it then we could tell them to use a shorter node name so personally I'd go for only hash if necessary.

@mmusgrov
Copy link
Contributor

Oh, and thanks very much for sorting through the issues and coming up with a good solution.

turing85 added a commit to turing85/quarkus that referenced this issue Oct 27, 2023
@turing85
Copy link
Contributor Author

@maxandersen @mmusgrov PR is open. You are welcome to review it 🙂
When the issue is closed, I'd love to see a backport to 3.2.x.

turing85 added a commit to turing85/quarkus that referenced this issue Oct 31, 2023
@quarkus-bot quarkus-bot bot added this to the 3.6 - main milestone Nov 6, 2023
DavideD pushed a commit to DavideD/quarkus that referenced this issue Nov 27, 2023
deepred-dev added a commit to deepred-dev/narayana-spring-boot that referenced this issue Feb 6, 2024
Generate right-length nodeIdentifier
if nodeIdentifier is longer than 28 bytes, the XID generated by narayana ist too long, and it breaks.
This shortens it, while nodeIdentifier remains unique.

Fix ported from quarkus
quarkusio/quarkus#30491
holly-cummins pushed a commit to holly-cummins/quarkus that referenced this issue Feb 8, 2024
graben pushed a commit to graben/narayana-spring-boot that referenced this issue Mar 5, 2024
Generate right-length nodeIdentifier
if nodeIdentifier is longer than 28 bytes, the XID generated by narayana ist too long, and it breaks.
This shortens it, while nodeIdentifier remains unique.

Fix ported from quarkus
quarkusio/quarkus#30491
cunningt pushed a commit to cunningt/narayana-spring-boot that referenced this issue Jun 21, 2024
Generate right-length nodeIdentifier
if nodeIdentifier is longer than 28 bytes, the XID generated by narayana ist too long, and it breaks.
This shortens it, while nodeIdentifier remains unique.

Fix ported from quarkus
quarkusio/quarkus#30491
cunningt added a commit to jboss-fuse/narayana-spring-boot that referenced this issue Jun 24, 2024
* Update NarayanaPropertiesInitializer.java

Generate right-length nodeIdentifier
if nodeIdentifier is longer than 28 bytes, the XID generated by narayana ist too long, and it breaks.
This shortens it, while nodeIdentifier remains unique.

Fix ported from quarkus
quarkusio/quarkus#30491

* Replace deprecated API usage

* Compute XA recovery nodes from node identifier by default

* Upgrade versions

---------

Co-authored-by: deepred-dev <67878078+deepred-dev@users.noreply.github.com>
Co-authored-by: Benjamin Graf <benjamin.graf@gmx.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/narayana Transactions / Narayana kind/enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants