Release pipeline times out syncing artifacts #2704

Closed
planetf1 opened this issue Mar 4, 2020 · 13 comments
Labels: build-failure (High Priority - a build is failing), release (Work to create a new release)


planetf1 commented Mar 4, 2020

The final step of our release pipeline synchronizes artifacts from Bintray to Maven Central.

However, this is very slow, and with our number of artifacts (around 200) it times out, since the virtual machine running the job is terminated after 6 hours.

We need to investigate:

  • whether the process can be sped up (per artifact)
  • whether the overall timeout can be increased

The step can be restarted and will continue from where it left off, but this requires human intervention.

Finally, it's worth noting that there is no 'transactionality' to this sync from Bintray to Maven Central. If it were instead possible to sync from Bintray to a staging area at Maven Central, we could at least hold back the final commit until all artifacts are synced. Currently there are hours or days where an incomplete release sits on Maven Central, which could cause confusion.

IN MORE DETAIL
Example: https://dev.azure.com/ODPi/Egeria/_releaseProgress?_a=release-environment-logs&releaseId=116&environmentId=348

Error reported (by pipelines)

Job issues · 1 warning
Received request to deprovision: The request was cancelled by the remote provider.

Failing step: Sync Missing Package to Maven-Central

Progress log:

2020-03-03T17:18:12.0386260Z Syncing discovery-engine-services-api:1.5
2020-03-03T17:22:35.3705556Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:22:35.3709765Z Syncing discovery-engine-services-client:1.5
2020-03-03T17:26:10.2062656Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:26:10.2065902Z Syncing discovery-engine-services-server:1.5
2020-03-03T17:30:21.0519918Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:30:21.0521862Z Syncing discovery-engine-services-spring:1.5
2020-03-03T17:36:43.2953966Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:36:43.2957556Z Syncing discovery-engine-spring:1.5
2020-03-03T17:41:31.7039712Z {"status":"Successfully synced and closed repo.","messages":["Sync finished successfully."]}
2020-03-03T17:41:31.7042080Z Syncing discovery-service-connectors:1.5
2020-03-03T17:41:41.3023523Z ##[error]The operation was canceled.
2020-03-03T17:41:41.3037573Z ##[section]Finishing: Sync Missing Package to Maven-Central

From this we can see we didn't get very far (the list is alphabetical), and each package at this stage takes around 4 minutes to sync.

Mitigation

To kick off the step again, open the pipeline, click on this step in the graphical view, and select 'redeploy'. Accept the defaults and the step will start running again; the sync will pick up where it left off. This may be needed a number of times, as each attempt runs for 6 hours and then stops.

@planetf1 planetf1 added the build-failure, cicd and release labels Mar 4, 2020

planetf1 commented Mar 4, 2020

From a quick look:

  • Each job takes about the same amount of time
  • The sync operation is a SINGLE REST call to JFrog

One possibility would be to see if we can issue the blocking curl sync calls in parallel -- it's unclear what rate limiting JFrog may apply. Some techniques for parallelizing bash loops are at https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop

Today it likely takes 3-4 restarts to complete, so at a minimum a parallelism level of 4 should let the sync finish in a single run (if truly parallel), or a little more if batching. I'd be inclined to go with a value like 8.
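
For illustration, a minimal sketch of that capped fan-out using plain bash background jobs - assuming bash 4.3+ for 'wait -n', and using the sync_package function from our pipeline script (reproduced in full in a comment below):

THREADS=4   # parallelism level discussed above

while read -r package_name; do
    # Throttle: once THREADS syncs are in flight, wait for one to finish
    while (( $(jobs -rp | wc -l) >= THREADS )); do
        wait -n
    done
    sync_package "${package_name}" "${VERSION}" &
done < "packages_to_sync_${VERSION}"

wait   # let the last batch of syncs drain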

@bramwelt any other ideas?

I'll complete this manually for the current release (1.5) but aim to address it for 1.6 along the above lines, if nothing better surfaces!

@planetf1 planetf1 added this to the 2020.03 (1.6) milestone Mar 4, 2020

planetf1 commented Mar 4, 2020

When executing the parallel loop we probably want to continue even if a failure occurs, as any issue is likely to be specific to one artifact.

At the end of the loop it would be useful to summarize what was processed OK and what failed, and to set the final return code to success only if EVERY sync worked correctly.
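
A minimal sketch of that bookkeeping on top of the parallel approach - everything here is illustrative: sync_results/ is just a scratch directory, sync_and_record is a hypothetical wrapper, and the success test assumes a good Bintray response contains the string "Successfully synced", as in the logs above:

mkdir -p sync_results

function sync_and_record () {
    # Hypothetical wrapper: capture each package's API response for later inspection
    local pkg=$1
    sync_package "${pkg}" "${VERSION}" > "sync_results/${pkg}.log" 2>&1
}
export -f sync_and_record
# sync_package, VERSION and the Bintray credentials must also be exported

xargs -n1 -P"${THREADS}" bash -c 'sync_and_record "$1"' _ < "packages_to_sync_${VERSION}"

# Summarize: any response not containing "Successfully synced" counts as a failure
failures=$(grep -L "Successfully synced" sync_results/*.log || true)
if [ -n "${failures}" ]; then
    echo "FAILED syncs:"
    echo "${failures}"
    exit 1
fi
echo "All syncs completed successfully"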


planetf1 commented Mar 4, 2020

For reference, here is the current script from the pipeline definition:

# Sync one package version from Bintray to Maven Central via the Bintray
# maven_central_sync API - a single blocking REST call per package
function sync_package () {
    local package_name=$1
    local package_version=$2
    echo "Syncing ${package_name}:${package_version}"
    curl \
    -u "${BINTRAY_USER}:${BINTRAY_TOKEN}" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{"close":"1"}' \
    -sSL "https://api.bintray.com/maven_central_sync/${JFROG_ORGANIZATION}/${JFROG_PROJECT}/${package_name}/versions/${package_version}"
    echo
}

# Sync every package listed in packages_to_sync_${VERSION}, one at a time
while read -r package_name; do
    sync_package "${package_name}" "${VERSION}"
done < "packages_to_sync_${VERSION}"


planetf1 commented Mar 4, 2020

Obviously adding a '&' at the end of the sync_package call could work, but Bintray would quite likely object to 200+ simultaneous requests (untested). It's also not the clearest way to collect results.

GNU Parallel is interesting, but the license/attribution requirements may be problematic.


planetf1 commented Mar 4, 2020

Tried some basic parallelism (without summarizing the final status based on the HTTP responses):

THREADS=8

function sync_package () {
    local package_name=$1
    local package_version=$2
    echo "STARTING - Sync ${package_name}:${package_version}"
    curl \
    -u "${BINTRAY_USER}:${BINTRAY_TOKEN}" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{"close":"1"}' \
    -sSL "https://api.bintray.com/maven_central_sync/${JFROG_ORGANIZATION}/${JFROG_PROJECT}/${package_name}/versions/${package_version}"
}
export -f sync_package

# Fan out up to THREADS syncs at once. Note: VERSION and the Bintray
# credentials must be exported so the inner bash invocations can see them.
xargs -n1 -P"${THREADS}" bash -c 'sync_package "$1" "${VERSION}"' _ < "packages_to_sync_${VERSION}"

The syncs get launched as I expected. However, the JFrog API starts returning errors. I am guessing it cannot handle parallel requests, even though the temporary repo is supposed to be unique, i.e.:

{"status":"Sync Failed","messages":"[Failed to close repository: orgodpi-2791. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Staging repository is already transitioning: orgodpi-2791<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository.]"}STARTING - Sync platform-services-server:1.5
{"status":"Sync Failed","messages":"[Failed to close repository: orgodpi-2794. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Staging repository is already transitioning: orgodpi-2794<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository.]"}STARTING - Sync platform-services-spring:1.5
{"status":"Sync Failed","messages":"[Failed to close repository: orgodpi-2794. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Staging repository is already transitioning: orgodpi-2794<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository.]"}STARTING - Sync platform-services-spring:1.5
{"status":"Sync Failed","messages":"[Failed to promote repository: orgodpi-2795. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2795<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository., Failed to drop repository: orgodpi-2795. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2795<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>]"}STARTING - Sync project-management:1.5
{"status":"Sync Failed","messages":"[Failed to promote repository: orgodpi-2795. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2795<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>, Dropping existing partial staging repository., Failed to drop repository: orgodpi-2795. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2795<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>]"}STARTING - Sync project-management:1.5
{"status":"Validation Failed","messages":"[Failed to close repository: orgodpi-2792., Dropping existing partial staging repository., Failed to drop repository: orgodpi-2792. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2792<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>]"}STARTING - Sync project-management-api:1.5
{"status":"Validation Failed","messages":"[Failed to close repository: orgodpi-2792., Dropping existing partial staging repository., Failed to drop repository: orgodpi-2792. Server response:\n <nexus-error>\n  <errors>\n    <error>\n      <id>*<\u002fid>\n      <msg>Unhandled: Missing staging repository: orgodpi-2792<\u002fmsg>\n    <\u002ferror>\n  <\u002ferrors>\n<\u002fnexus-error>]"}STARTING - Sync project-management-api:1.5


planetf1 commented Mar 5, 2020

Reverted the parallelism to N=1 (i.e. left the code in place) - it seems to cause additional issues over and above the prior kinds of sync issues we've had (JCenter, POMs etc.).
Good to revisit this after 1.6 has shipped.

@planetf1 planetf1 mentioned this issue Mar 5, 2020

planetf1 commented Mar 5, 2020

@bramwelt did you find any official documentation for this API? I wonder if there are any other parameters that may affect the behaviour, such as the identity of the staging repo. Currently it looks as if one is automatically created when the API call is made... but that's not safe when calls overlap.

If we can control it, having a single staging repo for the run, and then only 'committing' if EVERY package was updated, would give us the transactionality I referred to above, and stop a new release dribbling out over many days while we fight issues or simply wait for the slow sync to complete.


planetf1 commented Mar 5, 2020

i.e.:

  • Create the staging repo
  • Check it is complete
  • Start the syncs in parallel
  • Check the status of each sync
  • If ALL succeed, close the staging repo
  • If not, manual intervention is needed - we can then resume, retry, or close manually
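
A rough skeleton of that flow - purely illustrative, since create_staging_repo, check_staging_repo, all_syncs_succeeded and close_staging_repo are hypothetical helpers we could only write once we know whether the Bintray/Sonatype APIs expose these operations:

# Hypothetical orchestration; the helper functions are placeholders
repo_id=$(create_staging_repo) || exit 1
check_staging_repo "${repo_id}" || exit 1

# Fan out the syncs against the one staging repo (sync_package as above)
xargs -n1 -P"${THREADS}" bash -c 'sync_package "$1" "${VERSION}"' _ < "packages_to_sync_${VERSION}"

# Commit only if EVERY sync reported success (see the result-recording sketch above)
if all_syncs_succeeded; then
    close_staging_repo "${repo_id}"
else
    echo "Some syncs failed - leaving staging repo ${repo_id} open for manual resume/retry/close"
    exit 1
fi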


planetf1 commented Mar 5, 2020

On the transactionality/staging repository issues: per https://help.sonatype.com/repomanager2/staging-releases/managing-staging-repositories it looks like our current process will open a Maven Central staging repo for each IP/user-agent/org combination.

For the next release I'm inclined to try close:0 in our script - it may be that all our artifacts will then go into a single staging repo. That might (a) allow them all to be released at once and (b) address the issues with parallelism.
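
Concretely, that would only change the payload of the existing curl call - a sketch, assuming the close flag behaves as its name suggests (with 0 the staging repo is left open, so a final close/release step would still be needed, e.g. via the Sonatype staging UI):

# As in sync_package above, but ask Bintray NOT to close the staging repo
curl \
-u "${BINTRAY_USER}:${BINTRAY_TOKEN}" \
-H "Content-Type: application/json" \
-X POST \
-d '{"close":"0"}' \
-sSL "https://api.bintray.com/maven_central_sync/${JFROG_ORGANIZATION}/${JFROG_PROJECT}/${package_name}/versions/${package_version}"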


planetf1 commented Apr 9, 2020

This went through with 2 attempts in release 1.6, which is a lot better. Moving to 1.7 to monitor.

@planetf1 planetf1 modified the milestones: 2020.03 (1.6), 2020.04 (1.7) Apr 9, 2020

planetf1 commented May 6, 2020

Will summarise the current issues in the 1.8 timeframe and request assistance from the LF team.

@planetf1 planetf1 modified the milestones: 2020.04 (1.7), 2020.05 (1.8) May 6, 2020
planetf1 commented

Will track in #3914
