Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve pod SHA256 values #2262

Merged
merged 3 commits into from
Mar 16, 2021
Merged

Preserve pod SHA256 values #2262

merged 3 commits into from
Mar 16, 2021

Conversation

russgold
Copy link
Member

This changes verifies some cases of SHA256 hash computations and ensures that they don't change over time. The hash values used as references are those generated by operator version 3.1.0.

@russgold russgold self-assigned this Mar 15, 2021
// Add legacy prometheus annotations. These have to remain, as they are included in the computation
// of the pod sha256 hash, and removing them will cause pods to roll when customers upgrade to new
// versions of the operator.
AnnotationHelper.annotateForPrometheus(metadata, "/wls-exporter", getMetricsPort());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line may cause a NPE on unboxing a null Integer if both getListenPort() and getSslPort() is null in getMetricsPort().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True - of course, such a domain would be pretty worthless. IMO the logic is wrong in any case, but I didn't want to change it here - all I have done is refactored the existing logic.

Copy link
Member

@doxiao doxiao Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operator definitely hit this problem without Johnny's fix in PR #2256 with secure mode enabled but SSL is not configured. Johnny's changes tire up the loose ends in some of the configuration permutations. I am not sure if there will be any configuration that may hit this even with PR #2256.

Copy link
Member

@doxiao doxiao Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like adding defensive code for invalid configurations, but the ugly part of this NPE is that when it happens it is hard to figure out what is going on. It would be helpful if we could at least detect the situation and log a warning message, instead of letting NPE occur.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a change that should catch that.

return metadata;
}

private Integer getMetricsPort() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure if the NPE will happen on return or on the method that calls this, although I would think it is latter. It is probably safer to catch the exception on the caller, or assign the port to a variable in this method, and check that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should happen on the unboxing - so I should ensure that happens here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, even that doesn't help. That kind of domain error causes problems all over the place. If you catch it here, it pops up elsewhere. This PR is not the place to fix that preexisting problem, I think. The code here does not cause the problem, and fixing it will require more effort and is out of scope. I am going to revert my last change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a simple class to test this, and unboxing occurred on the method call where null Integer value is assigned to an int parameter. In other words it does not happen in the return statement here and the try/catch should be on the calling method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public class Test {
public static void main(String[] args) {
try {
printValue(getValue());
} catch (Throwable t) {
System.out.println("t2 = " + t);
}
}

static Integer getValue() {
Integer value = null;
try {
return value;
} catch (Throwable t) {
System.out.println("t = " + t);
throw t;
}
}

static void printValue(int i) {
System.out.println("i = " + i);
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the output is

t2 = java.lang.NullPointerException

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably better to use a local variable to check the value. If it is null, log and return zero.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can reproduce the problem. I can also reproduce it before this change. I can prevent getMetricsPort() from returning a null, which causes the problem later - and then the NPE happens in getLocalAdminProtocolChannelPort.

I agree that there is a problem - but that problem has nothing to do with the PR, and should be addressed elsewhere. Could you please file a bug report or a GitHub issue so we can track it?

The basic problem is the logic by which we select the port for admin protocol. It should be done in one place, and it is done in multiple places. Worse, a change was made in 3.1.4 to the logic now served by getMetricsPort which cannot be changed, as it would cause pods to roll on upgrade if we do.

The best thing would probably be to catch the error in the domain long before we get here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that the logic of choosing a port to use is not clean and needs to be revisited.

I was glad that this particular section was removed by a previous PR, and was just concerned that this code is now added back (to prevent pods to roll) so the issue came back.

We can certainly handle this in a different jira.

Copy link
Member

@doxiao doxiao Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to complete the current discussion. It is pretty rare that getLocalAdminProtocolChannelPort returns null since it looks into every port that is configured (admin port, listen port, SSL port, and NAPs). But getMetricsPort only looks for listen port and SSL port.

Copy link
Member

@doxiao doxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can track the NPE on unboxing and revisiting the logic of choosing a port for Prometheus in a different jira task.

@rjeberhard rjeberhard merged commit 386e724 into develop Mar 16, 2021
@russgold russgold deleted the prevent_roll_on_upgrade branch March 17, 2021 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants