Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjusted TaskListener flushing in remote code #3961

Merged
merged 11 commits into from
Apr 11, 2019

Conversation

jglick
Copy link
Member

@jglick jglick commented Apr 2, 2019

Amends #3563 a bit. See discussion in jenkinsci/workflow-api-plugin#81.

  • A StreamTaskListener created on the master does not set the PrintStream.autoFlush flag. So why would we do so on the remote side? That merely produces a flurry of ProxyOutputStream.Flush packets. We expect that code flushes a stream when it is done with it. In fact from 2.121.x → 2.138.x for freestyle builds running Shell I see introduced flush packets, from [JENKINS-52729] Launcher.ProcStarter.stdout(TaskListener) discards remotability #3563 I presume, which this PR corrects again.
  • Conversely, RemoteLaunchCallable on the agent side produces a LocalLauncher which produces a LocalProc which uses StreamCopyThread to forward stdout/stderr of the process. Yet as far as I can tell, this never flushed its sink unless you pass the closeOut flag (which no code in core does), meaning the tail end of log output could be lost if the sink were buffered. Freestyle builds have unbuffered logs, so probably this never mattered, but for JEP-210 I am finding that some buffering for Pipeline with local log storage is useful for performance reasons, and definitely for cloud log storage it is expected. Launcher is not used in this way for durable tasks like sh, but various SynchronousNonBlockingStepExecutions (and thus also SimpleBuildSteps) use it.

(The main pieces originally filed in #3703, but with a feature flag added.)

Proposed changelog entries

  • Adjusted stream flushing behavior for code running remotely on agents for better performance. This may lead to loss of messages for plugins which print to a build log from the agent machine but do not flush their output. Use -Dhudson.util.StreamTaskListener.AUTO_FLUSH=true to restore the previous behavior for freestyle builds. Note that Pipeline builds always expect remote flush.
  • API: SystemProperties may now be used from agent-side code.

Desired reviewers

@jenkinsci/code-reviewers

@@ -761,6 +761,7 @@ protected final void printCommandLine(@Nonnull String[] cmd, @CheckForNull FileP
buf.append(c);
}
listener.getLogger().println(buf.toString());
listener.getLogger().flush();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is commented out, ArgumentListBuilder2Test.slaveMask fails (gets a blank string)—unless you start the test with

static {
    System.setProperty("hudson.util.StreamTaskListener.AUTO_FLUSH", "true");
}

in which case it passes, demonstrating the effectiveness of the flag on the remote side.

@@ -376,27 +462,4 @@ public static Long getLong(String name, Long def, Level logLevel) {
return def;
}

@CheckForNull
private static String tryGetValueFromContext(String key) {
if (!JenkinsJVM.isJenkinsJVM()) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inverting control. This was always a bit hackish anyway. Amends #3362 to make agent-side calls work, not merely not crash.

*/
//TODO: Define a correct design of this engine later. Should be accessible in libs (remoting, stapler) and Jenkins modules too
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed for agents, and Jenkins modules should be able to use this already. If the need arises, we could move this into a standalone library usable from Remoting and Stapler, though I think these do well enough for now using Java Platform methods.

private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
out = new PrintStream((OutputStream)in.readObject(),true);
out = new PrintStream((OutputStream)in.readObject(), AUTO_FLUSH);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meat of the change: switching auoFlush on the remote copy from true to (by default) false.

// Take a snapshot of all system properties and context variables available on the master at the time the agent starts.
allStrings = new HashMap<>();
allStrings.putAll(handler.getAllStrings());
allStrings.putAll((Map) System.getProperties()); // these take precedence
Copy link
Member

@dwnusbaum dwnusbaum Apr 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this would mean that calling SystemProperties.getString("file.separator") on the agent would actually get you the value from the master, which could be problematic if the agent is running a different OS than the master. Not necessarily a huge problem as long as anyone using SystemProperties on an agent is careful to avoid reading machine/OS dependent properties, and the API is @Restricted(NoExternalUse.class), but maybe it would be safer to only copy properties whose keys contain hudson or jenkins in the name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling SystemProperties.getString("file.separator") on the agent would actually get you the value from the master

No, because getString will first check the local system properties. This snapshot is only used for fallbacks of variables not set on the agent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be a problem, since even on master you would get weird results today if part of your code relied on the System property value alone for a given key, while the rest takes into account the context variable. If the former is undefined, SystemProperties might still have a value but System.getProperty won't.

@jglick jglick requested a review from dwnusbaum April 8, 2019 17:01
@jglick jglick added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Apr 8, 2019
@daniel-beck daniel-beck removed the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Apr 8, 2019
@jglick jglick added the work-in-progress The PR is under active development, not ready to the final review label Apr 8, 2019
@jglick jglick requested review from daniel-beck and dwnusbaum April 8, 2019 18:30
@jglick jglick removed the work-in-progress The PR is under active development, not ready to the final review label Apr 8, 2019
@jglick
Copy link
Member Author

jglick commented Apr 11, 2019

INFO	hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel slave0.
java.util.concurrent.TimeoutException: No ping
	at hudson.slaves.PingThreadTest.failedPingThreadResetsComputerChannel(PingThreadTest.java:78)
java.lang.AssertionError
	at hudson.slaves.PingThreadTest.failedPingThreadResetsComputerChannel(PingThreadTest.java:90)

in one branch looks like a flake to me.

@jglick jglick added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Apr 11, 2019
@batmat
Copy link
Member

batmat commented Apr 11, 2019

After checking the master branch, I don't seem to be able to find similar test failure, so I am going to retrigger to have another data point. But it could also be actually related to the change here I suppose.

FTR
image

@batmat batmat closed this Apr 11, 2019
@batmat batmat reopened this Apr 11, 2019
@batmat batmat self-assigned this Apr 11, 2019
@batmat
Copy link
Member

batmat commented Apr 11, 2019

FWIW, I ran mvn test -Dtest=PingThreadTest 220 times locally in a loop. It failed only twice, both for a test timeout:

[INFO] Running hudson.slaves.PingThreadTest
=== Starting failedPingThreadResetsComputerChannel(hudson.slaves.PingThreadTest)
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 180.734 s <<< FAILURE! - in hudson.slaves.PingThreadTest
[ERROR] failedPingThreadResetsComputerChannel(hudson.slaves.PingThreadTest)  Time elapsed: 180.113 s  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 180 seconds
        at java.lang.Object.wait(Native Method)
        at hudson.remoting.Request.call(Request.java:177)
        at hudson.remoting.Channel.call(Channel.java:956)
        at hudson.slaves.PingThreadTest.failedPingThreadResetsComputerChannel(PingThreadTest.java:81)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:556)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.lang.Thread.run(Thread.java:748)

So, AFAICT not even the failure seen in CI. We'll see how the currently ongoing build goes.

@batmat
Copy link
Member

batmat commented Apr 11, 2019

So the PR build succeeed, so given the PR build for this PR in ci.jenkins.io was blue almost for all builds but one, I am going to merge this.

image

*/
//TODO: Define a correct design of this engine later. Should be accessible in libs (remoting, stapler) and Jenkins modules too
@SuppressFBWarnings(value = "ST_WRITE_TO_STATIC_FROM_INSTANCE_METHOD", justification = "Currently Jenkins instance may have one ond only one context")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just @oleg-nenashev’s grammar, moved as is from line 85.

@batmat batmat merged commit 850168f into jenkinsci:master Apr 11, 2019
@jglick
Copy link
Member Author

jglick commented Apr 11, 2019

220 times locally in a loop

I suspect the flake manifests itself only under heavy load conditions, which would not be simulated by rerunning the test serially.

@jglick jglick deleted the TaskListener-flush branch April 11, 2019 19:23
orrc added a commit to jenkinsci/google-play-android-publisher-plugin that referenced this pull request Sep 28, 2019
This appears to be behaviour that manifests itself in Pipeline builds,
and was fixed by jenkinsci/jenkins#3961.

To ensure we don't lose buffered logs on Jenkins versions before 2.173,
we now ensure the log is flushed after use on the remote agent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants