Release pipeline times out syncing artifacts #2704

Closed
planetf1 opened this issue Mar 4, 2020 · 13 comments
Labels: build-failure (High Priority - a build is failing), release (Work to create a new release)


planetf1 commented Mar 4, 2020

The final step of our release pipeline synchronizes artifacts from Bintray to Maven Central.

However, this is very slow, and with our number of artifacts (around 200) it times out, since the virtual machine running the job is terminated after 6 hours.

We need to investigate:

  • whether the process can be sped up (per artifact)
  • whether the overall timeout can be increased

The step can be restarted and will continue from where it left off, but this requires human intervention.

Finally, it's worth noting that there is no 'transactionality' to this sync from Bintray to Maven Central. If it were instead possible to sync from Bintray to a staging area at Maven Central, we could at least hold back the final commit until all artifacts are synced. Currently there are hours or days where an incomplete release sits on Maven Central, which could cause confusion.

IN MORE DETAIL
Example: https://dev.azure.com/ODPi/Egeria/_releaseProgress?_a=release-environment-logs&releaseId=116&environmentId=348

Error reported (by pipelines)

Job issues · 1 warning
Received request to deprovision: The request was cancelled by the remote provider.

Failing step: Sync Missing Package to Maven-Central

Progress log:

2020-03-03T17:18:12.0386260Z Syncing discovery-engine-services-api:1.5
2020-03-03T17:22:35.3705556Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:22:35.3709765Z Syncing discovery-engine-services-client:1.5
2020-03-03T17:26:10.2062656Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:26:10.2065902Z Syncing discovery-engine-services-server:1.5
2020-03-03T17:30:21.0519918Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:30:21.0521862Z Syncing discovery-engine-services-spring:1.5
2020-03-03T17:36:43.2953966Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:36:43.2957556Z Syncing discovery-engine-spring:1.5
2020-03-03T17:41:31.7039712Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:41:31.7042080Z Syncing discovery-service-connectors:1.5
2020-03-03T17:41:41.3023523Z ##[error]The operation was canceled.
2020-03-03T17:41:41.3037573Z ##[section]Finishing: Sync Missing Package to Maven-Central

From this we can see we didn't get very far (the list is alphabetical), and each package at this stage takes around 4 minutes to sync.

Mitigation

To kick off the step again, open the pipeline, click on this step in the graphical view, and select 'redeploy'. Accept the defaults and the step will start running again; the sync will pick up where it left off. This may be needed a number of times, as each attempt runs for 6 hours and then stops.

@planetf1 planetf1 added the build-failure, cicd and release labels Mar 4, 2020

planetf1 commented Mar 4, 2020

From a quick look:

  • Each job takes about the same amount of time
  • The sync operation is a SINGLE REST call to JFrog

One possibility would be to see if we can issue the blocking curl sync calls in parallel -- it's unclear what rate limiting JFrog may apply. Some techniques for parallelizing bash loops are at https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop

Today it likely takes 3-4 restarts to complete, so at a minimum a parallelism level of 4 should let the sync finish in a single run (if truly parallel), or a little more if batching. I'd be inclined to go with a value like 8.
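
For illustration, a minimal sketch of that capped fan-out using plain bash background jobs - assuming bash 4.3+ for 'wait -n', and using the sync_package function from our pipeline script (reproduced in full in a comment below):

THREADS=4   # parallelism level discussed above

while read -r package_name; do
    # Throttle: once THREADS syncs are in flight, wait for one to finish
    while (( $(jobs -rp | wc -l) >= THREADS )); do
        wait -n
    done
    sync_package "${package_name}" "${VERSION}" &
done < "packages_to_sync_${VERSION}"

wait   # let the last batch of syncs drain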

@bramwelt any other ideas?

I'll complete this manually for the current release (1.5) but aim to address it for 1.6 along the above lines, if nothing better surfaces!

@planetf1 planetf1 added this to the 2020.03 (1.6) milestone Mar 4, 2020

planetf1 commented Mar 4, 2020

When executing the parallel loop we probably want to continue even if a failure occurs, as any issue is likely to be specific to one artifact.

At the end of the loop it would be useful to summarize what was processed OK and what failed, and to set the final return code to success only if EVERY sync worked correctly.
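
A minimal sketch of that bookkeeping on top of the parallel approach - everything here is illustrative: sync_results/ is just a scratch directory, sync_and_record is a hypothetical wrapper, and the success test assumes a good Bintray response contains the string "Successfully synced", as in the logs above:

mkdir -p sync_results

function sync_and_record () {
    # Hypothetical wrapper: capture each package's API response for later inspection
    local pkg=$1
    sync_package "${pkg}" "${VERSION}" > "sync_results/${pkg}.log" 2>&1
}
export -f sync_and_record
# sync_package, VERSION and the Bintray credentials must also be exported

xargs -n1 -P"${THREADS}" bash -c 'sync_and_record "$1"' _ < "packages_to_sync_${VERSION}"

# Summarize: any response not containing "Successfully synced" counts as a failure
failures=$(grep -L "Successfully synced" sync_results/*.log || true)
if [ -n "${failures}" ]; then
    echo "FAILED syncs:"
    echo "${failures}"
    exit 1
fi
echo "All syncs completed successfully"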


planetf1 commented Mar 4, 2020

For reference, here is the current script from the pipeline definition:

# Sync one package version from Bintray to Maven Central via the Bintray
# maven_central_sync API - a single blocking REST call per package
function sync_package () {
    local package_name=$1
    local package_version=$2
    echo "Syncing ${package_name}:${package_version}"
    curl \
    -u "${BINTRAY_USER}:${BINTRAY_TOKEN}" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{"close":"1"}' \
    -sSL "https://api.bintray.com/maven_central_sync/${JFROG_ORGANIZATION}/${JFROG_PROJECT}/${package_name}/versions/${package_version}"
    echo
}

# Sync every package listed in packages_to_sync_${VERSION}, one at a time
while read -r package_name; do
    sync_package "${package_name}" "${VERSION}"
done < "packages_to_sync_${VERSION}"


planetf1 commented Mar 4, 2020

Obviously adding a '&' at the end of the sync_package call could work, but Bintray would quite likely object to 200+ simultaneous requests (untested). It's also not the clearest way to collect results.

GNU Parallel is interesting, but the license/attribution requirements may be problematic.


planetf1 commented Mar 4, 2020

Tried some basic parallelism (without summarizing the final status based on the HTTP responses):

THREADS=8

function sync_package () {
    local package_name=$1
    local package_version=$2
    echo "STARTING - Sync ${package_name}:${package_version}"
    curl \
    -u "${BINTRAY_USER}:${BINTRAY_TOKEN}" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{"close":"1"}' \
    -sSL "https://api.bintray.com/maven_central_sync/${JFROG_ORGANIZATION}/${JFROG_PROJECT}/${package_name}/versions/${package_version}"
}
export -f sync_package

# Fan out up to THREADS syncs at once. Note: VERSION and the Bintray
# credentials must be exported so the inner bash invocations can see them.
xargs -n1 -P"${THREADS}" bash -c 'sync_package "$1" "${VERSION}"' _ < "packages_to_sync_${VERSION}"

The syncs get launched as I expected. However, the JFrog API starts returning errors. I am guessing it cannot handle parallel requests, even though the temporary repo is supposed to be unique, i.e.:

{"status":"Sync Failed","messages":"[Failed to close repository: orgodpi-2791. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Staging repository is already transitioning: orgodpi-2791<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository.]"}STARTING - Sync platform-services-server:1.5
{"status":"Sync Failed","messages":"[Failed to close repository: orgodpi-2794. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Staging repository is already transitioning: orgodpi-2794<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository.]"}STARTING - Sync platform-services-spring:1.5
{"status":"Sync Failed","messages":"[Failed to close repository: orgodpi-2794. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Staging repository is already transitioning: orgodpi-2794<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository.]"}STARTING - Sync platform-services-spring:1.5
{"status":"Sync Failed","messages":"[Failed to promote repository: orgodpi-2795. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2795<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository., Failed to drop repository: orgodpi-2795. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2795<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>]"}STARTING - Sync project-management:1.5
{"status":"Sync Failed","messages":"[Failed to promote repository: orgodpi-2795. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2795<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository., Failed to drop repository: orgodpi-2795. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2795<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>]"}STARTING - Sync project-management:1.5
{"status":"Validation Failed","messages":"[Failed to close repository: orgodpi-2792., Dropping existing partial staging repository., Failed to drop repository: orgodpi-2792. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2792<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>]"}STARTING - Sync project-management-api:1.5
{"status":"Validation Failed","messages":"[Failed to close repository: orgodpi-2792., Dropping existing partial staging repository., Failed to drop repository: orgodpi-2792. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2792<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>]"}STARTING - Sync project-management-api:1.5


planetf1 commented Mar 5, 2020

Reverted the parallelism to N=1 (i.e. left the code in place) - it seems to cause additional issues over and above the prior kinds of sync issues we've had (JCenter, POMs etc.).
Good to revisit this after 1.6 has shipped.

@planetf1 planetf1 mentioned this issue Mar 5, 2020

planetf1 commented Mar 5, 2020

@bramwelt did you find any official documentation for this API? I wonder if there are any other parameters that may affect the behaviour, such as the identity of the staging repo. Currently it looks as if one is automatically created when the API call is made... but that's not safe when calls overlap.

If we can control it, having a single staging repo for the run, and then only 'committing' if EVERY package was updated, would give us the transactionality I referred to above, and stop a new release dribbling out over many days while we fight issues or simply wait for the slow sync to complete.


planetf1 commented Mar 5, 2020

i.e.:

  • Create the staging repo
  • Check it is complete
  • Start the syncs in parallel
  • Check the status of each sync
  • If ALL succeed, close the staging repo
  • If not, manual intervention is needed - we can then resume, retry, or close manually
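
A rough skeleton of that flow - purely illustrative, since create_staging_repo, check_staging_repo, all_syncs_succeeded and close_staging_repo are hypothetical helpers we could only write once we know whether the Bintray/Sonatype APIs expose these operations:

# Hypothetical orchestration; the helper functions are placeholders
repo_id=$(create_staging_repo) || exit 1
check_staging_repo "${repo_id}" || exit 1

# Fan out the syncs against the one staging repo (sync_package as above)
xargs -n1 -P"${THREADS}" bash -c 'sync_package "$1" "${VERSION}"' _ < "packages_to_sync_${VERSION}"

# Commit only if EVERY sync reported success (see the result-recording sketch above)
if all_syncs_succeeded; then
    close_staging_repo "${repo_id}"
else
    echo "Some syncs failed - leaving staging repo ${repo_id} open for manual resume/retry/close"
    exit 1
fi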


planetf1 commented Mar 5, 2020

On the transactionality/staging repository issues: per https://help.sonatype.com/repomanager2/staging-releases/managing-staging-repositories it looks like our current process will open a Maven Central staging repo for each IP/user-agent/org combination.

For the next release I'm inclined to try close:0 in our script - it may be that all our artifacts will then go into a single staging repo. That might (a) allow them all to be released at once and (b) address the issues with parallelism.
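
Concretely, that would only change the payload of the existing curl call - a sketch, assuming the close flag behaves as its name suggests (with 0 the staging repo is left open, so a final close/release step would still be needed, e.g. via the Sonatype staging UI):

# As in sync_package above, but ask Bintray NOT to close the staging repo
curl \
-u "${BINTRAY_USER}:${BINTRAY_TOKEN}" \
-H "Content-Type: application/json" \
-X POST \
-d '{"close":"0"}' \
-sSL "https://api.bintray.com/maven_central_sync/${JFROG_ORGANIZATION}/${JFROG_PROJECT}/${package_name}/versions/${package_version}"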


planetf1 commented Apr 9, 2020

This went through with 2 attempts in release 1.6, which is a lot better. Moving to 1.7 to monitor.

@planetf1 planetf1 modified the milestones: 2020.03 (1.6), 2020.04 (1.7) Apr 9, 2020

planetf1 commented May 6, 2020

Will summarise the current issues in the 1.8 timeframe and request assistance from the LF team.

@planetf1 planetf1 modified the milestones: 2020.04 (1.7), 2020.05 (1.8) May 6, 2020
planetf1 commented

Will track in #3914
