Remove wasteful safer Inet4 lookups #276

franz1981 · 2024-01-15T12:45:38Z

No description provided.

franz1981 · 2024-01-15T12:48:46Z

@radcortez can you check if the ones happening in clinit are the one you saw in the flamegraphs?

here

i can see many, not just the 3 ones I've fixed.

franz1981 · 2024-01-15T13:12:44Z

Ah no, the flamegraph is per line of code, so the fix here should take care of it already

radcortez · 2024-01-15T13:29:51Z

Here is the flamegraph with the current behaviour:
alloc-999.html.zip

And with the change:
alloc-999-sr-common-276.html.zip

franz1981 · 2024-01-15T13:47:51Z

Thanks @radcortez

so

The overall saving as "allocation bamdwidth" is ~217KB, due to removing the indy allocation, which "can" translate in more work to the JIT as shown by quarkusio/quarkus#36204.

If you want to give this a shot you could profile with the -e malloc --total argument which should (see https://github.com/async-profiler/async-profiler/blob/8910a7a46241abc7cb5b80ba6717c3eab603aae6/src/perfEvents_linux.cpp#L194) report the size of native allocations to see if something evident has changed around the mentioned hot parts involved in JIT compiler work (see quarkusio/quarkus#36204 (comment)).
This will give the overall footprint saving, apart from the 217 KB reported here (which is still a just once cost).

The overall heap allocation pressure reduction between the 2 versions is ~0.8% which is tiny amount, but still measurable; with the native one you can get the global picture of the improvement, if you wish.

dmlloyd · 2024-01-16T15:53:43Z

The diff as given doesn't seem to do anything to remove an indy. Also as I said on the WF common change, I think that it's better to use new String(bytes, StandardCharsets.ISO_8859_1). @franz1981 referenced a bug (JDK-8295496) as reason why that wasn't done but the bug seemed to relate to a different ctor, relating to copyOfRange? Maybe it was the wrong #. The code path for the ctor I give above should be nearly identical to the deprecated one used in the PR.

radcortez · 2024-01-22T16:30:40Z

Yes, I confirm an improvement in allocs RSS for native as well with this PR.

It is somehow significant, but I only noticed it because it was in my analysis path, and it wasn't showing up before on previous versions. We should move forward, but we need to clarify @dmlloyd suggestion first.

dmlloyd · 2024-01-22T16:36:18Z

How are you measuring, and are you comparing against main? I believe that we are talking about a difference of three byte[4] plus six byte[7 <= n <= 15] plus three String instances allocated out of TLAB during class initialization. I would be astonished if that showed up on RSS reports or CPU time reports compared to our (thousands of) other class initializations.

I'm not arguing against the change per se, but if we're justifying any additional complexity by a measurable performance improvement, then the measurements had better show a real-world improvement that is greater than the margin of error.

franz1981 · 2024-01-22T16:58:19Z

@dmlloyd

@franz1981 referenced a bug (JDK-8295496) as reason why that wasn't done but the bug seemed to relate to a different ctor, relating to copyOfRange? Maybe it was the wrong #

Nope, sadly the JDK issue is correct, just Claes probaby didn't reported in the issue itself the full analysis which brought me to prepare the reproducer for him, but I will later attach my comment to the PR which explain something more about it.

In the original benchmark reported in the issue, you can find https://gist.github.com/franz1981/bf67ed8f328ed112a524c1150833c718#file-asciicopy-java-L73 which is indeed simplifying the life of the JIT during the generation of the array's copy stub, by having separate call-sites which:

allow the JIT to propagate the information that we always pass an array which length is what we use as String's parameter: given that this information is constant and propagated in that version of the method, it would constant-fold some (but not all! that's why the JDK issue is still opened!) the other checks
this information about array's length propagate till the stub generation which would skip these bound checks, making it as minimal as possible (although never as good the indy which is using an hidden String constuctor and not performing any - sigh - vectorized copy)

Which translate in the conclusion that having a properly sized array payoff in the already slower (if compared to indy) String creation, at 2 level:

the String checks itself
the array's copy one

Related using the String constructor which suggest the encoding type, instead, I can quote what Claes said

I also included String(byte[], int, int, ISO_8859_1) which disappointingly lags slightly behind the deprecated String(byte[], int, int, int).

Although on the PR at openjdk/jdk#12453 (comment) I cannot see any mention about it in the main PR comment, but my previous tests report that one as the worst method to use; I can re-verify it, in case.

The changes at https://bugs.openjdk.org/browse/JDK-8301958 are instead providing a "better" copy method, using the same trick I've suggested, but it will improve just the copy, and not the inherent checks in String itself, which will still be performed (and that's why https://bugs.openjdk.org/browse/JDK-8295496 is still opened, despite the copy improvement!).

Said that, these are facts which should be weighted with the perceived (it's personal, but vs the original version is clearly worsen now!) increase of code complexity vs the actual gain: I've no problem to change the PR or close it and let anyone to propose a different change which still improve the RSS and native usage as this one; the runtime performance is sadly already reduced because of not using the indy, but probably is a small drop in the ocean, if is not the hottest path!

dmlloyd · 2024-01-22T17:04:51Z

I think the loss of the indy is not a problem for the same reason that the non-deprecated ctor usage is not a problem: it is indeed a drop in the ocean. We do not create many of these address instances in this way, normally. Certainly not on any hot paths. The purpose of these methods is to construct addresses for which a RDNS lookup is never performed, and the code that exists does that pretty well.

The primary question is, do we see a measurable improvement with the patch as it stands, in a real-world application (e.g. a Quarkus application), which is greater than the error%? If not, then I don't think the complexity is warranted (in addition to the message "we provide these methods to construct an address, but do not use them because they are too slow").

dmlloyd · 2024-01-22T17:07:11Z

Addendum: In retrospect, the best approach would have been to create these as static methods with return a dynamic constant that lazily constructs each address. But, the only way we could possibly move to that approach (leaving out the necessary bytecode processing, which is not trivial) would be to deprecate the fields, which would not fix the problem until they are actually removed. So, I'd say it's too late for that.

franz1981 · 2024-01-22T17:09:47Z

The primary question is, do we see a measurable improvement with the patch as it stands, in a real-world application (e.g. a Quarkus application), which is greater than the error%?

I can share what I've learnt at #276 (comment): the difference exists in quickstart uses cases, which, although are not representative of any "real" usage IMO, are still used by some users (eg some experts which can share RSS random numbers on their social accounts) - which can affect us, somehow.
Maybe @radcortez got some different type of tests, but I believe that ~217KB is still small but surprisingly high cost for something like that (and I didn't checked the RSS native footprint yet) and if we can avoid users to complain about it in public, spending even more cycles, later on, it's an (half) win.

dmlloyd · 2024-01-22T17:25:57Z

The primary question is, do we see a measurable improvement with the patch as it stands, in a real-world application (e.g. a Quarkus application), which is greater than the error%?

I can share what I've learnt at #276 (comment): the difference exists in quickstart uses cases, which, although are not representative of any "real" usage IMO, are still used by some users (eg some experts which can share RSS random numbers on their social accounts) - which can affect us, somehow. Maybe @radcortez got some different type of tests, but I believe that ~217KB is still small but surprisingly high cost for something like that (and I didn't checked the RSS native footprint yet) and if we can avoid users to complain about it in public, spending even more cycles, later on, it's an (half) win.

Where do you get ~217KB from? Upstream already has the indy-removing change AFAICT... this patch only seems to reuse byte[]. Did I miss something?

radcortez · 2024-01-22T17:41:09Z

Sorry, to clarify, the measurement is with Quarkus main, and with a replace of Inet from Wildfly with this one in the PR from SR Commons. I know that Wildfly will soon delegate to SR Commons, but I believe that one is not released yet.

How are you measuring, and are you comparing against main

I didn't use any fancy measurement; a plain ps -e -o pid, RSS,args shows the difference between both versions. I believe the increased RSS came from a new Wildfly Commons version built with a more recent JDK.

For reference, this is how it looks like in main with no changes (and with WF Commons 1.7.0):

And this is what it looks like with Quarkus 3.5.0, before the WF Common (1.5.4) update:

And with the PR change:

dmlloyd · 2024-01-22T18:15:38Z

This however all has no bearing on this PR. WF Common is clearly the main cost here.

To maximize the measurement of this change, I'd recommend building wildfly-common from main (this will be 2.0.0.Final-SNAPSHOT), and then retest the quickstart without and with this PR to see if it makes an impact. The reason for that is since wildfly/wildfly-common#100, WildFly Common will redirect to SmallRye Common for this code, so all of the relevant code paths for this class will come through one route. Then it will be possible to test just this PR in isolation to see if there is indeed a relevant cost savings.

franz1981 · 2024-01-22T18:41:11Z

@dmlloyd

Where do you get ~217KB from?

It was from the allocation flamegraphs which @radcortez shared in #276 (comment)

franz1981 · 2024-01-22T19:59:43Z

@dmlloyd let me apologize, my brain was still stuck at #275 which was already merged and for some reason I was thinking this PR to be that one instead 😂

I have probably misread my own comment at #276 (comment) in which I was providing the full analysis of this and the other PR, altogether, compared to without.

This one is just saving to allocate few bytes (and Strings), because we already know the addresses, in string literal form; said that I have no problem to close it, in case it looks weird or useless...

Also, use the new API to construct the global IP address constants, to avoid unnecessary `String` construction. Supercedes smallrye#276.

dmlloyd · 2024-01-22T20:22:07Z

In that case WDYT of #279 as an API-friendly alternative?

radcortez · 2024-01-23T15:07:14Z

Here are some additional flamegraphs:

With Wildfly Commons 2.0.0-SNAPSHOT, build with J17:

With SR Commons 2.1.2, and using SR Commons Inet directly in InetSocketAddressConverter:

franz1981 · 2024-01-23T15:11:50Z

Yeah @radcortez in both I see that the strong concat indy is very relevant

dmlloyd · 2024-01-23T15:17:22Z

Yeah we all agree about the string concat issue. Now we just need to all agree that string concat is not the subject of this PR nor the subject of this discussion. 😉

franz1981 · 2024-01-23T15:39:14Z

I see @dmlloyd that @radcortez has been stucked with me on the previous pr ihih

Also, use the new API to construct the global IP address constants, to avoid unnecessary `String` construction. Supercedes smallrye#276.

dmlloyd · 2024-04-03T15:37:10Z

@franz1981 do we still want to pursue this further?

franz1981 · 2024-04-03T16:01:20Z

We can ask @radcortez if dmlloyd@2274755 (which I like) superseded and fixed this same issue; if it does it, we can safely close it!

radcortez · 2024-04-03T16:30:35Z

Correct.

franz1981 force-pushed the main_no_concat_indy branch from e0f3b3c to a2c7235 Compare January 15, 2024 12:48

Remove using the safe method if not necessary

3a6d6da

franz1981 force-pushed the main_no_concat_indy branch from a2c7235 to 3a6d6da Compare January 15, 2024 13:14

dmlloyd mentioned this pull request Jan 22, 2024

Provide API to construct IP addresses from host name and segments #279

Merged

radcortez closed this Apr 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove wasteful safer Inet4 lookups #276

Remove wasteful safer Inet4 lookups #276

franz1981 commented Jan 15, 2024

franz1981 commented Jan 15, 2024 •

edited

Loading

franz1981 commented Jan 15, 2024

radcortez commented Jan 15, 2024

franz1981 commented Jan 15, 2024 •

edited

Loading

dmlloyd commented Jan 16, 2024

radcortez commented Jan 22, 2024

dmlloyd commented Jan 22, 2024

franz1981 commented Jan 22, 2024

dmlloyd commented Jan 22, 2024

dmlloyd commented Jan 22, 2024

franz1981 commented Jan 22, 2024 •

edited

Loading

dmlloyd commented Jan 22, 2024

radcortez commented Jan 22, 2024

dmlloyd commented Jan 22, 2024

franz1981 commented Jan 22, 2024 •

edited

Loading

franz1981 commented Jan 22, 2024 •

edited

Loading

dmlloyd commented Jan 22, 2024

radcortez commented Jan 23, 2024

franz1981 commented Jan 23, 2024

dmlloyd commented Jan 23, 2024

franz1981 commented Jan 23, 2024 •

edited

Loading

dmlloyd commented Apr 3, 2024

franz1981 commented Apr 3, 2024

radcortez commented Apr 3, 2024 •

edited

Loading

Remove wasteful safer Inet4 lookups #276

Remove wasteful safer Inet4 lookups #276

Conversation

franz1981 commented Jan 15, 2024

franz1981 commented Jan 15, 2024 • edited Loading

franz1981 commented Jan 15, 2024

radcortez commented Jan 15, 2024

franz1981 commented Jan 15, 2024 • edited Loading

dmlloyd commented Jan 16, 2024

radcortez commented Jan 22, 2024

dmlloyd commented Jan 22, 2024

franz1981 commented Jan 22, 2024

dmlloyd commented Jan 22, 2024

dmlloyd commented Jan 22, 2024

franz1981 commented Jan 22, 2024 • edited Loading

dmlloyd commented Jan 22, 2024

radcortez commented Jan 22, 2024

dmlloyd commented Jan 22, 2024

franz1981 commented Jan 22, 2024 • edited Loading

franz1981 commented Jan 22, 2024 • edited Loading

dmlloyd commented Jan 22, 2024

radcortez commented Jan 23, 2024

franz1981 commented Jan 23, 2024

dmlloyd commented Jan 23, 2024

franz1981 commented Jan 23, 2024 • edited Loading

dmlloyd commented Apr 3, 2024

franz1981 commented Apr 3, 2024

radcortez commented Apr 3, 2024 • edited Loading

franz1981 commented Jan 15, 2024 •

edited

Loading

franz1981 commented Jan 15, 2024 •

edited

Loading

franz1981 commented Jan 22, 2024 •

edited

Loading

franz1981 commented Jan 22, 2024 •

edited

Loading

franz1981 commented Jan 22, 2024 •

edited

Loading

franz1981 commented Jan 23, 2024 •

edited

Loading

radcortez commented Apr 3, 2024 •

edited

Loading