Skip to content

Conversation

@mik-laj
Copy link
Member

@mik-laj mik-laj commented Jan 9, 2021

Master Issue: #9154

Motivation

This PR aims to reduce the GitHub Action usage by cancelling the previous build.

Modifications

Similar to: apache/spark#31098 apache/calcite#2318 (solution suggestted by @vlsi - #9154 (comment))

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes / no)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)
  • The rest endpoints: (yes / no)
  • The admin cli options: (yes / no)
  • Anything that affects deployment: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

@vlsi
Copy link

vlsi commented Jan 9, 2021

@mik-laj
Copy link
Member Author

mik-laj commented Jan 9, 2021

@vlsi @potiuk Can I ask for a review?

@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

1 similar comment
@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

@mik-laj
Copy link
Member Author

mik-laj commented Jan 10, 2021

@codelipenghui We want to support fork as well, so we have to use workflow_run, so this change will only take effect after merging.

If you use forks, you should create a separate "Cancelling" workflow_run triggered workflow. The workflow_run should be responsible for all canceling actions. The examples below show the possible ways the action can be utilized.

https://github.com/potiuk/cancel-workflow-runs#usage

Note: This event will only trigger a workflow run if the workflow file is on the default branch.

@mik-laj
Copy link
Member Author

mik-laj commented Jan 10, 2021

This action has been added to white-list. See: https://issues.apache.org/jira/browse/INFRA-21287
I working on the update for this PR.

@mik-laj
Copy link
Member Author

mik-laj commented Jan 10, 2021

I updated this PR.

@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

1 similar comment
@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

on:
workflow_run:
workflows:
- 'CI - Build - MacOS'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to write here the list of jobs ?
we can forget about it in the future

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation is not very precise and does not describe whether this field is optional. I have not seen any examples that do not define it, so I think it is required. On the other hand, this event doesn't support the exclude list, so it seems to be the only way to exclude "PulsarBot".
https://docs.github.com/en/free-pro-team@latest/actions/reference/events-that-trigger-workflows#workflow_run

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I think this is the only way :( . What you can do is to make some pre-commit check to verify if the list is complete/

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@eolivelli
Copy link
Contributor

It looks like that this job is being triggered too many times.
It is using many resources and for every workflow run we are adding many items to the queue
https://github.com/apache/pulsar/actions?page=4&query=is%3Aqueued

I wonder if we can restore #8393

cc @lhotari @sijie @mik-laj

@potiuk
Copy link
Member

potiuk commented Jan 25, 2021

@eolivelli

I think the root cause is a bit different. Seems like your workflow/job pattern is different than many other ASF projects, you are simply launching MANY workflows and jobs.

Have you considered merging those workflows into smaller number of them ? I think actually the many, many workflows launched by pulsar are the reason for many ASF projects to get blocked queue.

See this chart:

06a6b973-fee9-4651-a197-a275a39b6f55

I am happy to help Pulsar to redesign their CI to be a bit more friendlier towards other projects? I have quite an experience with that - in Apache Airflow I introduced changes that brought the usage of Github Actions by 70% maintaining it's usefulness and I am pretty sure the way you are using GitHub Actions can be optimised. Super Happy to help with that if needed.

You can also see the discussions we had about it at the ASF discussion lists recently. I thik this thread from the last weekend describes the current state of affairs rather well: https://lists.apache.org/thread.html/r6130d6d111217b8a90b3b90bbb3029b75b5e2a4b5463f8d2c8f130f6%40%3Cusers.infra.apache.org%3E (you need to sign-up to users@infra.apache.org to view it).

Do you think this can be done ?

@zymap
Copy link
Member

zymap commented Jan 26, 2021

Hi @potiuk. In the beginning, Pulsar separates the tests into different workflows is for reducing the build and test time. We category with the different tests to make the tests can run parallel. If there has a specific workflow failed, we can re-run the failed workflow and don't need to re-run the entire project tests.
I saw the other apache projects use a single workflow to run the build and test, if Pulsar uses one workflow to run the build and test, that would take a long time to pass the CI.
Do you have some suggestions? If you can help with improving the CI that would be nice!
Thanks!

@eolivelli
Copy link
Contributor

I am happy to help Pulsar to redesign their CI to be a bit more friendlier towards other projects? I have quite an experience with that - in Apache Airflow I introduced changes that brought the usage of Github Actions by 70% maintaining it's usefulness and I am pretty sure the way you are using GitHub Actions can be optimised. Super Happy to help with that if needed.

@potiuk any help is very appreciated
as @zymap says the problem is that we have lots of tests and they take much time.

We are working on fixing all of the flakes, but this will take time. This will help because we won't need to rerun PR validation more times than needed.

Probably it would be interesting to try to merge the workflows: all java tests, all integration tests, all cpp....in some test GitHub repository and work on a new setup.
We also should remove the 'retry' at build script level.

if we let the build continue in case of errors we should have some tool to gather all of the failures and we must be sure that we are not missing any of them

@sijie
Copy link
Member

sijie commented Jan 26, 2021

@zymap - I think we can consider running Pulsar CI in the Azure pipeline like Flink is doing. We can reach out to our Flink friends to see how they do it.

Example: apache/flink#14756 (comment)

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12498&view=results

@lhotari
Copy link
Member

lhotari commented Jan 28, 2021

Hi @potiuk @zymap @sijie,

I have started a draft PIP (Pulsar Improvement Proposal) discussion on the dev@pulsar.apache.org mailing list to address the Github Actions Runner resource consumption issues that currently exist in the Pulsar Github Actions workflows:
https://lists.apache.org/thread.html/ra2fcf7b973448bb51e00aceaeed06433e8d886270b0f0db0c80d4e0c%40%3Cdev.pulsar.apache.org%3E

Please participate in the discussion on the dev@pulsar.apache.org mailing list or place comments / editing suggestions directly on the Google doc:
https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit?usp=sharing

@potiuk
Copy link
Member

potiuk commented Jan 28, 2021

I will take a look tomorrow. I'v been swamped by some company matters but tomorrow is the day I start going back to life :)

@potiuk
Copy link
Member

potiuk commented Jan 29, 2021

I commented as much as I could and provided all the examples and code links from Apache Airflow :)

@lhotari
Copy link
Member

lhotari commented Jan 29, 2021

I commented as much as I could and provided all the examples and code links from Apache Airflow :)

Thank you @potiuk . Very helpful and informative comments with lots of good advice. I'll take a closer look next week when I continue working on the topic. Have a great weekend!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants