Update doc and diagram #1

Merged
merged 1 commit into from
Jun 27, 2019
Conversation

jimbobby5
Contributor

No description provided.

@jankaspar jankaspar merged commit 1842105 into master Jun 27, 2019
@jankaspar jankaspar deleted the doc-update branch June 27, 2019 15:51
stackedsax pushed a commit that referenced this pull request Dec 23, 2020
d80tb7 added a commit that referenced this pull request Nov 15, 2021
GROpenSourceRO pushed a commit that referenced this pull request Mar 25, 2022
…build Armada internally (#1)

* Moved in events code and added scheduler spec

* Updated scheduler

* Adapter from log messages to Armada

* Comments

* Deleted unused file

* Copied in missing function principalHasQueuePermissions

* Replaced atMostOnce with fragile

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Store k8s services and ingresses in the api.Job object

* Use correct time type

* Executor uses bundled k8s services and ingresses if present

* Removed unused code

* Guard populateServicesIngresses against nil values

* Added groups to event sequence and namespace, labels, and annotations to EventSequence

* Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels

* Comment

* Pass through namespace, labels, and annotations

* Added a list of concerns

* Refactored log submit authorization

* Dockerfile for building .proto internally

* make proto

* Added armadaerrors.ErrNoPermission

* Replaced timestamp with time type

* Removed go_package option not needed by gogo

* make proto

* Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop

* Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error

* Removed "", which caused tests to fail, from auth exec test script

* Improved exec authenticator error messages, fixed bug where locks were copied

* Fail test immediately on error

* Fail test immediately on error to avoid panics

* Fail test immediately on error to avoid panics

* Fail test immediately on error, improved error messages

* Create slices using make (seems to have fixed a test failure)

* Import ordering

* Added corporate proxy and compilation of events.proto

* Replace assert.NotEmpty -> assert.NoError

* Fixed erroneous error message

* commented out ca-certificates install

* added google.golang.org/api

* replaced assert.Nil -> assert.NoError
GROpenSourceRO pushed a commit that referenced this pull request Apr 22, 2022
* Pulsar submit API and adapter prototypes, scheduler spec, updates to build Armada internally (#1)

* Moved in events code and added scheduler spec

* Updated scheduler

* Adapter from log messages to Armada

* Comments

* Deleted unused file

* Copied in missing function principalHasQueuePermissions

* Replaced atMostOnce with fragile

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Store k8s services and ingresses in the api.Job object

* Use correct time type

* Executor uses bundled k8s services and ingresses if present

* Removed unused code

* Guard populateServicesIngresses against nil values

* Added groups to event sequence and namespace, labels, and annotations to EventSequence

* Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels

* Comment

* Pass through namespace, labels, and annotations

* Added a list of concerns

* Refactored log submit authorization

* Dockerfile for building .proto internally

* make proto

* Added armadaerrors.ErrNoPermission

* Replaced timestamp with time type

* Removed go_package option not needed by gogo

* make proto

* Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop

* Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error

* Removed "", which caused tests to fail, from auth exec test script

* Improved exec authenticator error messages, fixed bug where locks were copied

* Fail test immediately on error

* Fail test immediately on error to avoid panics

* Fail test immediately on error to avoid panics

* Fail test immediately on error, improved error messages

* Create slices using make (seems to have fixed a test failure)

* Import ordering

* Added corporate proxy and compilation of events.proto

* Replace assert.NotEmpty -> assert.NoError

* Fixed erroneous error message

* commented out ca-certificates install

* added google.golang.org/api

* replaced assert.Nil -> assert.NoError

* Added gr-tests-e2e make target (#2)

* Moved in events code and added scheduler spec

* Updated scheduler

* Adapter from log messages to Armada

* Comments

* Deleted unused file

* Copied in missing function principalHasQueuePermissions

* Replaced atMostOnce with fragile

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Store k8s services and ingresses in the api.Job object

* Use correct time type

* Executor uses bundled k8s services and ingresses if present

* Removed unused code

* Guard populateServicesIngresses against nil values

* Added groups to event sequence and namespace, labels, and annotations to EventSequence

* Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels

* Comment

* Pass through namespace, labels, and annotations

* Added a list of concerns

* Refactored log submit authorization

* Dockerfile for building .proto internally

* make proto

* Added armadaerrors.ErrNoPermission

* Replaced timestamp with time type

* Removed go_package option not needed by gogo

* make proto

* Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop

* Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error

* Removed "", which caused tests to fail, from auth exec test script

* Improved exec authenticator error messages, fixed bug where locks were copied

* Fail test immediately on error

* Fail test immediately on error to avoid panics

* Fail test immediately on error to avoid panics

* Fail test immediately on error, improved error messages

* Create slices using make (seems to have fixed a test failure)

* Import ordering

* Added corporate proxy and compilation of events.proto

* Replace assert.NotEmpty -> assert.NoError

* Fixed erroneous error message

* commented out ca-certificates install

* added google.golang.org/api

* replaced assert.Nil -> assert.NoError

* added gr-tests-e2e target

* make e2e tests work inside gr (#3)

* make e2e tests work inside gr

* revert change for normal e2e

* and again

* enable docker build

* Enable e2e tests running in WSL (#4)

* Enable e2e tests running in WSL

* Submit to pulsar, fall back to existing API for queue admin

* Added pulsar-client-go and go-multierror

* Spin up Pulsar in e2e tests, load config for Pulsar

* Start Pulsar submit API and log processor in Armada

* Removed debug messages, comments

* Add flag to explicitly enable Pulsar

* Added periodic logging to the submit from Pulsar service

* Import ordering

* Kubernetes object metadata improvements, improved logging (#5)

* Improved logging and error handling

* Import ordering

* Comments, logging

* Include any additional podspecs in Pulsar submit jobs message

* go mod tidy

* Use a separate ObjectMeta for each k8s object in Pulsar

* Merge namespace/annotations/labels at the Pulsar submit API

* Support submitting jobs with multiple podspecs

* Annotate each incoming gRPC request with a request id

* Annotate Pulsar messages with gRPC request id

* Annotate per-message logger with gRPC request id attached to the Pulsar message

* Import ordering

* Preserve ordering within sequences (#6)

* Publish job transitions to Pulsar (#7)

* Added JobRunFailed reasons

* Added logic to convert legacy events to Pulsar events

* Publish events to Pulsar in addition to Redis

* Added e2e tests that connect directly to Pulsar

* Updated Pulsar message spec (#8)

* Comments

* comments

* Added function to return a request id or missing if none is found

* Updated events spec

* Updated Pulsar e2e tests

* Removed commented-out code

* Updated state transition message adapter to reflect changes to the proto

* Generate JobSucceeded on JobRunSucceeded, logging

* Provide Pulsar producer for SubmitFromLog service

* Added utility function to insert error information and stack trace to a logrus.Entry

* Removed deprecated code

* Import ordering

* Removed commented-out code

* Removed commented-out code

* Added isSequencef that takes a message to be logged on error

* Removed commented-out code

* Comments, removed debug logging

* Removed temporary swagger.merged file (#9)

* Removed temporary swagger.merged file

* Removed temporary swagger.merged file

* add pulsar tls config

* add pulsar tls config

* remove stray files

* Separate services for updating Redis/Nats and Pulsar from Pulsar messages (#10)

* Pulsar message utilities

* Added service for writing to Pulsar based on Pulsar messages

* Refactoring, use separate PulsarFromPulsar service

* Import ordering

* Refactoring

* Return an error on invalid pulsar message id comparison

* Improved error message

* Renamed Pulsar events topic to be more descriptive

* Removed commented-out code

* add advanced pulsar config

* review comments

* review comments

* more review comments

* more review comments

* more review comments

* Pulsar events spec improvements (#13)

* Use uint32 instead of double for priority

* Todo comment

* Hash queue + job_set_name instead of job_set_name

* Added efficient UUID message type

* Added conversion between google UUID and proto message UUID

* Added converters between proto UUIDs and ULIDs

* Import ordering

* Use optimised uuid message

* Added converters between strings and proto uuids

* Function to generate a plain ULID, comments

* Use optimised proto UUIDs

* Comments

* More fine-grained settings for job guarantees

* Replace 4294967295 by math.MaxUint32

* Break priority parsing into a separate function

* Securely hash queue and jobSetName together

* Refactoring

* Added lifetime to SubmitJob message

* fix chart

* move defaults

* remove pulsar enabled

* test fixes on wsl and windows

* End-to-end test improvements and fixes to Pulsar ingress/service code (#15)

* Pass through GOPROXY/GOPRIVATE from the host for make proto

* Removed commented-out code

* Open armadactl by relative path, use valid priority

* Refactoring, cleanup

* Added test submitting several jobs, more rigorous event comparison

* Removed test submitting only a single job

* Pulsar e2e test cleanup

* Added code for getting jobIds from events

* Remove GR-specific GOPROXY/GOPRIVATE

* Remove references to GR from proto build

* Todo, whitespace

* Test improvements

* Use same alpine image as for tests, set limits equal to requests (as required by Armada)

* Removed todos

* Disallow combining PodSpec and PodSpecs, disallow PodSpecs

* Correctly create services and ingresses in log submit API

* Set name of objects to create from the ObjectMeta included with the SubmitJob message

* Comments

* Added todo

* Comments

* Todos

* Test jobs with services/ingresses

* Comments

* Use PodSpec instead of PodSpecs

* Fail immediately on failure to connect to db

* Convert PodSpecs with 1 entry to PodSpec

* Avoid panics, check for PodSpec instead of PodSpecs[0]

* Import ordering

* Pulsar events refactoring (#17)

* Remove accelerator logging (#894)

This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval)

This causes massive spam for little to no benefit

* Moved events package into pkg

* Update reference to events.proto

* Updated events package import

* Added Pulsar properties to distinguish between control and utilisation messages

* Handle legacy job utilisation messages, set message key

* Removed queue_job_set_hash

* Comments

* Added ObjectMeta to main object, comments

* Added executor_id to ObjectMeta

* Renamed code to exit_code in ApplicationError message

* Comments

* Comments

Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>

* Address comments on PR ARMADA/990 GRPub/armada (#18)

* Refer to corporate proxies in general terms

* Comments

* Removed events.pb.go to simplify PR

* Restore swagger files to simplify PR

* Comments

* Removed proposed scheduler code

* Only report jobs done once their state has been reported (#899)

Normally the state gets reported instantly so this is already true 99% of the time.

However if reporting the state goes wrong, we shouldn't report the job as done
 - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease

In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too

This will only really impact edge cases and most of the time this will already be true

* Enable dotnet and npm build internally (#19)

* Added make target for dotnet that works internally

* Added dotnet tests to tests-e2e target

* Handle end-of-line symbols in a cross-platform manner

* Moved dotnet build to separate make target

* Get protoc via Maven

* Load Maven URL from environment variables

* Single Dockerfile for building proto

* Cleaned up tests make target

* Run unit tests in docker containers

* Removed unused armada-test docker network

* Run e2e tests and dotnet target in containers

* Bump go version to 1.16 for consistency

* Optionally run all go commands in containers

* Mount GOPROXY and GOPRIVATE into go containers

* Comments

* Run npm in docker containers

* Get go version string correctly from containers

* Added missing .SubmitServer

* Generate legacy job submitted events when submitting to Pulsar

* Have cancel and reprioritise endpoints generate set messages

* Always run builds in containers

* Moved armadactl tests into e2e

* Moved pulsar e2e tests into separate directory

* Renamed directory

* Updated e2e tests target to reflect new directories

* Updated tests-e2e-no-setup target

* Commented in tests

* Removed npm environment variables values

* Commented in tests

* Removed commented-out code

* ARMADA-1028 Events proto updates (#20)

* Moved terminal flag into individual errors

* Generate JobErrors instead of JobRunErrors on JobFailed API message

* Updated to reflect moving terminal flag in errors

* Added JobErrors to JobIdFromEvent

* Added config files for local use to .gitignore

* Added ReprioritisedJob, CancelledJob, and JobDuplicateDetected

* Handle all api messages

* Populate ObjectMeta info for errors

* Include container name with container errors

* Create EventId type (#21)

* Return a concrete type from new to enable comparison

* Added event id type

* Create utility for sniffing Pulsar events (#22)

* Added program to print events

* Write eventsprinter as a cobra app

* Take pulsar.Message instead of pulsar.ConsumerMessage

* Filter out non-control messages

* Fix bug associated with creating nil events

* Include CancelledJob in list of messages indicating job failure

* Print job ids

* Improved testing (#23)

* Added missing events to JobIdFromEvent

* write test output to disk, convert test output to junit format

* Added go-junit-report as dependency

* Spin up postgres for e2e tests

* Write html test report if possible

* Removed problematic -e flag

* Test submitting a job with errors, test cancelling jobs

* Ignore test_reports

* Improve Pulsar consumer retry logic (#24)

* Do not return an error on messages requiring no action

* Ack messages after processing

* Fail test immediately on error

* Propagate errors correctly

* Add code to detect (possibly nested) network errors

* Return immediately on nil in IsNetworkError

* Improved error handling and retry logic

* Added missing parentheses

* Consider context.DeadlineExceeded a network error

* Improved failure and retry logic

* Removed seek from Pulsar setup

* Removed multierror from GetActiveJobIds

* Added tests for non-network errors

* Only ack on successfully processing a sequence

* Set Pulsar message key

* keyshared sub (#27)

* Merge In changes from public github (#28)

* Remove accelerator logging (#894)

This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval)

This causes massive spam for little to no benefit

* Only report jobs done once their state has been reported (#899)

Normally the state gets reported instantly so this is already true 99% of the time.

However if reporting the state goes wrong, we shouldn't report the job as done
 - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease

In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too

This will only really impact edge cases and most of the time this will already be true

Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>

* fix infinite loop (#29)

* lower case returned jobIds (#30)

* ARMADA-995: Ingester from pulsar-> lookout database (#26)

* initial impl of lookout ingester

* added pod/container error

* fixes after testing

* goimports

* doc

* doc

* fixes

* remove unneeded code

* add docker file- move tests

* Fix event sequence number update (#31)

* Enable CI builds (#25)

* Added Jenkinsfile

* Fix syntax errors

* Changed var to def

* Removed need to clone armada-ci

* Removed armada-ci dependency

* Use writeFile instead of echo

* Removed -it

* Get version correctly

* Added debug printouts

* Printouts

* Set PWD to one valid on the host

* Set PWD in sh

* Set PWD consistently

* Make proto

* Exit on proto not matching

* Added goimports as dependency

* Import ordering

* Check error code

* Check error

* Added code checks make target

* Go mod tidy

* Added code checks stage

* Updated line endings

* Escape $

* Added download make target

* Set GOPATH such that it is available on the host

* Removed unnecessary mkdir

* Added build tag to avoid compiling tools into binary

* Automatically install tools

* Run goimports for pkg/events

* Whitespace

* Fixed line endings

* Added templify to list of tools

* Comment out proto stage during development

* Renamed junit_report to junit-report for consistency

* Run go-junit-report in docker

* Require kind, run kubectl in docker

* Do not check for kind, to allow downloading it using make download

* Add kind to list of tools

* Fixed import for go-junit-report

* Added tests-teardown target, avoid returning error code from tests-e2e-teardown

* Run teardown targets before tests

* Removed unnecessary cleanup stage

* Set shorter OperationTimeout, bundle stack trace with errors

* Commented out tests failing due to Docker setup

* Added gox to list of tools

* Run gox in docker

* Comments

* Whitespace

* Whitespace

* Whitespace

* Commented out tests during development

* Commented in all stages

* Enabled junit plugin

* Removed junit2html

* Updated paths

* Updated PWD

* Removed -set-exit-code from go-junit-report

* For Pulsar, only expose IPv4 to fix issue with Pulsar client library

* Added go-swagger and grpc-gateway to tools

* Move proto post-processing from proto.sh into makefile for performance

* Fix go-imports

* Comment in Pulsar tests

* Update Jenkinsfile

* Fix ineffassign

* Fix typo NoError -> Error

* Only build if tests succeed

* Delete Jenkinsfile

* Import ordering

* Timeout retrying processing an event sequence (#32)

* Add dependent targets to tests-e2e-no-setup

* Only ack messages if sequence is empty or if at least 1 event was processed

* Timeout processing a sequence after 5 minutes

* Added rebuild-server target for testing, removed tests-e2e-no-setup dependencies

* Comments

* Always sleep on making no progress

* Rename events module to armadaevents (#33)

* Fix test

* Rename events -> armadaevents

* Rename events -> armadaevents

* Import ordering

* Lookout Ingester fixes following testing (#35)

* fixes following testing

* fixes following testing

* Use PGX for Database Connections to Lookout (#915)

* use pgx

* removed log lines

* fix import order

* fix test

* fix test

* Add retries to Lookout Ingester (#36)

* fixes following testing

* fixes following testing

* added database error handling

* import order

* code review comments

* code review comments

* cause doesn't work

* import order

* Set default tolerations (#34)

* Fix test

* Rename events -> armadaevents

* Rename events -> armadaevents

* Import ordering

* Add default tolerations for e2e tests

* Test for default tolerations

* Improved job conversion code

* Add job conversion tests

* Use job conversion code in eventutil

* Test expected tolerations

* More verbose printing

* Lookout ingester produces compressed proto (#37)

* add compressed proto

* wip

* go imports

* fixed config- added todo

* Remove changed generated files

* Comments

* Remove changed generated files

* Remove changed generated files

Co-authored-by: Chris Martin <Chris.Martin@gresearch.co.uk>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: Chris Martin <council_tax@hotmail.com>
severinson added a commit that referenced this pull request Apr 26, 2022
* Pulsar submit API and adapter prototypes, scheduler spec, updates to build Armada internally (#1)

* Moved in events code and added scheduler spec

* Updated scheduler

* Adapter from log messages to Armada

* Comments

* Deleted unused file

* Copied in missing function principalHasQueuePermissions

* Replaced atMostOnce with fragile

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Store k8s services and ingresses in the api.Job object

* Use correct time type

* Executor uses bundled k8s services and ingresses if present

* Removed unused code

* Guard populateServicesIngresses against nil values

* Added groups to event sequence and namespace, labels, and annotations to EventSequence

* Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels

* Comment

* Pass through namespace, labels, and annotations

* Added a list of concerns

* Refactored log submit authorization

* Dockerfile for building .proto internally

* make proto

* Added armadaerrors.ErrNoPermission

* Replaced timestamp with time type

* Removed go_package option not needed by gogo

* make proto

* Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop

* Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error

* Removed "", which caused tests to fail, from auth exec test script

* Improved exec authenticator error messages, fixed bug where locks were copied

* Fail test immediately on error

* Fail test immediately on error to avoid panics

* Fail test immediately on error to avoid panics

* Fail test immediately on error, improved error messages

* Create slices using make (seems to have fixed a test failure)

* Import ordering

* Added corporate proxy and compilation of events.proto

* Replace assert.NotEmpty -> assert.NoError

* Fixed erroneous error message

* commented out ca-certificates install

* added google.golang.org/api

* replaced assert.Nil -> assert.NoError

* Added gr-tests-e2e make target (#2)

* Moved in events code and added scheduler spec

* Updated scheduler

* Adapter from log messages to Armada

* Comments

* Deleted unused file

* Copied in missing function principalHasQueuePermissions

* Replaced atMostOnce with fragile

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Store k8s services and ingresses in the api.Job object

* Use correct time type

* Executor uses bundled k8s services and ingresses if present

* Removed unused code

* Guard populateServicesIngresses against nil values

* Added groups to event sequence and namespace, labels, and annotations to EventSequence

* Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels

* Comment

* Pass through namespace, labels, and annotations

* Added a list of concerns

* Refactored log submit authorization

* Dockerfile for building .proto internally

* make proto

* Added armadaerrors.ErrNoPermission

* Replaced timestamp with time type

* Removed go_package option not needed by gogo

* make proto

* Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop

* Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error

* Removed "", which caused tests to fail, from auth exec test script

* Improved exec authenticator error messages, fixed bug where locks were copied

* Fail test immediately on error

* Fail test immediately on error to avoid panics

* Fail test immediately on error to avoid panics

* Fail test immediately on error, improved error messages

* Create slices using make (seems to have fixed a test failure)

* Import ordering

* Added corporate proxy and compilation of events.proto

* Replace assert.NotEmpty -> assert.NoError

* Fixed erroneous error message

* commented out ca-certificates install

* added google.golang.org/api

* replaced assert.Nil -> assert.NoError

* added gr-tests-e2e target

* make e2e tests work inside gr (#3)

* make e2e tests work inside gr

* revert change for normal e2e

* and again

* enable docker build

* Enable e2e tests running in WSL (#4)

* Enable e2e tests running in WSL

* Submit to pulsar, fall back to existing API for queue admin

* Added pulsar-client-go and go-multierror

* Spin up Pulsar in e2e tests, load config for Pulsar

* Start Pulsar submit API and log processor in Armada

* Removed debug messages, comments

* Add flag to explicitly enable Pulsar

* Added periodic logging to the submit from Pulsar service

* Import ordering

* Kubernetes object metadata improvements, improved logging (#5)

* Improved logging and error handling

* Import ordering

* Comments, logging

* Include any additional podspecs in Pulsar submit jobs message

* go mod tidy

* Use a separate ObjectMeta for each k8s object in Pulsar

* Merge namespace/annotations/labels at the Pulsar submit API

* Support submitting jobs with multiple podspecs

* Annotate each incoming gRPC request with a request id

* Annotate Pulsar messages with gRPC request id

* Annotate per-message logger with gRPC request id attached to the Pulsar message

* Import ordering

* Preserve ordering within sequences (#6)

* Publish job transitions to Pulsar (#7)

* Added JobRunFailed reasons

* Added logic to convert legacy events to Pulsar events

* Publish events to Pulsar in addition to Redis

* Added e2e tests that connect directly to Pulsar

* Updated Pulsar message spec (#8)

* Comments

* comments

* Added function to return a request id or missing if none is found

* Updated events spec

* Updated Pulsar e2e tests

* Removed commented-out code

* Updated state transition message adapter to reflect changes to the proto

* Generate JobSucceeded on JobRunSucceeded, logging

* Provide Pulsar producer for SubmitFromLog service

* Added utility function to insert error information and stack trace to a logrus.Entry

* Removed deprecated code

* Import ordering

* Removed commented-out code

* Removed commented-out code

* Added isSequencef that takes a message to be logged on error

* Removed commented-out code

* Comments, removed debug logging

* Removed temporary swagger.merged file (#9)

* Removed temporary swagger.merged file

* Removed temporary swagger.merged file

* add pulsar tls config

* add pulsar tls config

* remove stray files

* Separate services for updating Redis/Nats and Pulsar from Pulsar messages (#10)

* Pulsar message utilities

* Added service for writing to Pulsar based on Pulsar messages

* Refactoring, use separate PulsarFromPulsar service

* Import ordering

* Refactoring

* Return an error on invalid pulsar message id comparison

* Improved error message

* Renamed Pulsar events topic to be more descriptive

* Removed commented-out code

* add advanced pulsar config

* review comments

* review comments

* more review comments

* more review comments

* more review comments

* Pulsar events spec improvements (#13)

* Use uint32 instead of double for priority

* Todo comment

* Hash queue + job_set_name instead of job_set_name

* Added efficient UUID message type

* Added conversion between google UUID and proto message UUID

* Added converters between proto UUIDs and ULIDs

* Import ordering

* Use optimised uuid message

* Added converters between strings and proto uuids

* Function to generate a plain ULID, comments

* Use optimised proto UUIDs

* Comments

* More fine-grained settings for job guarantees

* Replace 4294967295 by math.MaxUint32

* Break priority parsing into a separate function

* Securely hash queue and jobSetName together

* Refactoring

* Added lifetime to SubmitJob message

* fix chart

* move defaults

* remove pulsar enabled

* test fixes on wsl and windows

* End-to-end test improvements and fixes to Pulsar ingress/service code (#15)

* Pass through GOPROXY/GOPRIVATE from the host for make proto

* Removed commented-out code

* Open armadactl by relative path, use valid priority

* Refactoring, cleanup

* Added test submitting several jobs, more rigorous event comparison

* Removed test submitting only a single job

* Pulsar e2e test cleanup

* Added code for getting jobIds from events

* Remove GR-specific GOPROXY/GOPRIVATE

* Remove references to GR from proto build

* Todo, whitespace

* Test improvements

* Use same alpine image as for tests, set limits equal to requests (as required by Armada)

* Removed todos

* Disallow combining PodSpec and PodSpecs, disallow PodSpecs

* Correctly create services and ingresses in log submit API

* Set name of objects to create from the ObjectMeta included with the SubmitJob message

* Comments

* Added todo

* Comments

* Todos

* Test jobs with services/ingresses

* Comments

* Use PodSpec instead of PodSpecs

* Fail immediately on failure to connect to db

* Convert PodSpecs with 1 entry to PodSpec

* Avoid panics, check for PodSpec instead of PodSpecs[0]

* Import ordering

* Pulsar events refactoring (#17)

* Remove accelerator logging (#894)

This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval)

This causes massive spam for little to no benefit

* Moved events package into pkg

* Update reference to events.proto

* Updated events package import

* Added Pulsar properties to distinguish between control and utilisation messages

* Handle legacy job utilisation messages, set message key

* Removed queue_job_set_hash

* Comments

* Added ObjectMeta to main object, comments

* Added executor_id to ObjectMeta

* Renamed code to exit_code in ApplicationError message

* Comments

* Comments

Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>

* Address comments on PR ARMADA/990 GRPub/armada (#18)

* Refer to corporate proxies in general terms

* Comments

* Removed events.pb.go to simplify PR

* Restore swagger files to simplify PR

* Comments

* Removed proposed scheduler code

* Sync changes made internally 220322-220419 to GRPub (#15)

* Pulsar submit API and adapter prototypes, scheduler spec, updates to build Armada internally (#1)

* Moved in events code and added scheduler spec

* Updated scheduler

* Adapter from log messages to Armada

* Comments

* Deleted unused file

* Copied in missing function principalHasQueuePermissions

* Replaced atMostOnce with fragile

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Store k8s services and ingresses in the api.Job object

* Use correct time type

* Executor uses bundled k8s services and ingresses if present

* Removed unused code

* Guard populateServicesIngresses against nil values

* Added groups to event sequence and namespace, labels, and annotations to EventSequence

* Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels

* Comment

* Pass through namespace, labels, and annotations

* Added a list of concerns

* Refactored log submit authorization

* Dockerfile for building .proto internally

* make proto

* Added armadaerrors.ErrNoPermission

* Replaced timestamp with time type

* Removed go_package option not needed by gogo

* make proto

* Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop

* Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error

* Removed "", which caused tests to fail, from auth exec test script

* Improved exec authenticator error messages, fixed bug where locks were copied

* Fail test immediately on error

* Fail test immediately on error to avoid panics

* Fail test immediately on error to avoid panics

* Fail test immediately on error, improved error messages

* Create slices using make (seems to have fixed a test failure)

* Import ordering

* Added corporate proxy and compilation of events.proto

* Replace assert.NotEmpty -> assert.NoError

* Fixed erroneous error message

* commented out ca-certificates install

* added google.golang.org/api

* replaced assert.Nil -> assert.NoError

* Added gr-tests-e2e make target (#2)

* Moved in events code and added scheduler spec

* Updated scheduler

* Adapter from log messages to Armada

* Comments

* Deleted unused file

* Copied in missing function principalHasQueuePermissions

* Replaced atMostOnce with fragile

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Scheduler updates

* Store k8s services and ingresses in the api.Job object

* Use correct time type

* Executor uses bundled k8s services and ingresses if present

* Removed unused code

* Guard populateServicesIngresses against nil values

* Added groups to event sequence and namespace, labels, and annotations to EventSequence

* Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels

* Comment

* Pass through namespace, labels, and annotations

* Added a list of concerns

* Refactored log submit authorization

* Dockerfile for building .proto internally

* make proto

* Added armadaerrors.ErrNoPermission

* Replaced timestamp with time type

* Removed go_package option not needed by gogo

* make proto

* Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop

* Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error

* Removed "", which caused tests to fail, from auth exec test script

* Improved exec authenticator error messages, fixed bug where locks were copied

* Fail test immediately on error

* Fail test immediately on error to avoid panics

* Fail test immediately on error to avoid panics

* Fail test immediately on error, improved error messages

* Create slices using make (seems to have fixed a test failure)

* Import ordering

* Added corporate proxy and compilation of events.proto

* Replace assert.NotEmpty -> assert.NoError

* Fixed erroneous error message

* commented out ca-certificates install

* added google.golang.org/api

* replaced assert.Nil -> assert.NoError

* added gr-tests-e2e target

* make e2e tests work inside gr (#3)

* make e2e tests work inside gr

* revert change for normal e2e

* and again

* enable docker build

* Enable e2e tests running in WSL (#4)

* Enable e2e tests running in WSL

* Submit to pulsar, fall back to existing API for queue admin

* Added pulsar-client-go and go-multierror

* Spin up Pulsar in e2e tests, load config for Pulsar

* Start Pulsar submit API and log processor in Armada

* Removed debug messages, comments

* Add flag to explicitly enable Pulsar

* Added periodic logging to the submit from Pulsar service

* Import ordering

* Kubernetes object metadata improvements, improved logging (#5)

* Improved logging and error handling

* Import ordering

* Comments, logging

* Include any additional podspecs in Pulsar submit jobs message

* go mod tidy

* Use a separate ObjectMeta for each k8s object in Pulsar

* Merge namespace/annotations/labels at the Pulsar submit API

* Support submitting jobs with multiple podspecs

* Annotate each incoming gRPC request with a request id

* Annotate Pulsar messages with gRPC request id

* Annotate per-message logger with gRPC request id attached to the Pulsar message

* Import ordering

* Preserve ordering within sequences (#6)

* Publish job transitions to Pulsar (#7)

* Added JobRunFailed reasons

* Added logic to convert legacy events to Pulsar events

* Publish events to Pulsar in addition to Redis

* Added e2e tests that connect directly to Pulsar

* Updated Pulsar message spec (#8)

* Comments

* comments

* Added function to return a request id or missing if none is found

* Updated events spec

* Updated Pulsar e2e tests

* Removed commented-out code

* Updated state transition message adapter to reflect changes to the proto

* Generate JobSucceeded on JobRunSucceeded, logging

* Provide Pulsar producer for SubmitFromLog service

* Added utility function to insert error information and stack trace to a logrus.Entry

* Removed deprecated code

* Import ordering

* Removed commented-out code

* Removed commented-out code

* Added isSequencef that takes a message to be logged on error

* Removed commented-out code

* Comments, removed debug logging

* Removed temporary swagger.merged file (#9)

* Removed temporary swagger.merged file

* Removed temporary swagger.merged file

* add pulsar tls config

* add pulsar tls config

* remove stray files

* Separate services for updating Redis/Nats and Pulsar from Pulsar messages (#10)

* Pulsar message utilities

* Added service for writing to Pulsar based on Pulsar messages

* Refactoring, use separate PulsarFromPulsar service

* Import ordering

* Refactoring

* Return an error on invalid pulsar message id comparison

* Improved error message

* Renamed Pulsar events topic to be more descriptive

* Removed commented-out code

* add advanced pulsar config

* review comments

* review comments

* more review comments

* more review comments

* more review comments

* Pulsar events spec improvements (#13)

* Use uint32 instead of double for priority

* Todo comment

* Hash queue + job_set_name instead of job_set_name

* Added efficient UUID message type

* Added conversion between google UUID and proto message UUID

* Added converters between proto UUIDs and ULIDs

* Import ordering

* Use optimised uuid message

* Added converters between strings and proto uuids

* Function to generate a plain ULID, comments

* Use optimised proto UUIDs

* Comments

* More fine-grained settings for job guarantees

* Replace 4294967295 by math.MaxUint32

* Break priority parsing into a separate function

* Securely hash queue and jobSetName together

* Refactoring

* Added lifetime to SubmitJob message

* fix chart

* move defaults

* remove pulsar enabled

* test fixes on wsl and windows

* End-to-end test improvements and fixes to Pulsar ingress/service code (#15)

* Pass through GOPROXY/GOPRIVATE from the host for make proto

* Removed commented-out code

* Open armadactl by relative path, use valid priority

* Refactoring, cleanup

* Added test submitting several jobs, more rigorous event comparison

* Removed test submitting only a single job

* Pulsar e2e test cleanup

* Added code for getting jobIds from events

* Remove GR-specific GOPROXY/GOPRIVATE

* Remove references to GR from proto build

* Todo, whitespace

* Test improvements

* Use same alpine image as for tests, set limits equal to requests (as required by Armada)

* Removed todos

* Disallow combining PodSpec and PodSpecs, disallow PodSpecs

* Correctly create services and ingresses in log submit API

* Set name of objects to create from the ObjectMeta included with the SubmitJob message

* Comments

* Added todo

* Comments

* Todos

* Test jobs with services/ingresses

* Comments

* Use PodSpec instead of PodSpecs

* Fail immediately on failure to connect to db

* Convert PodSpecs with 1 entry to PodSpec

* Avoid panics, check for PodSpec instead of PodSpecs[0]

* Import ordering

* Pulsar events refactoring (#17)

* Remove accelerator logging (#894)

This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval)

This causes massive spam for little to no benefit

* Moved events package into pkg

* Update reference to events.proto

* Updated events package import

* Added Pulsar properties to distinguish between control and utilisation messages

* Handle legacy job utilisation messages, set message key

* Removed queue_job_set_hash

* Comments

* Added ObjectMeta to main object, comments

* Added executor_id to ObjectMeta

* Renamed code to exit_code in ApplicationError message

* Comments

* Comments

Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>

* Address comments on PR ARMADA/990 GRPub/armada (#18)

* Refer to corporate proxies in general terms

* Comments

* Removed events.pb.go to simplify PR

* Restore swagger files to simplify PR

* Comments

* Removed proposed scheduler code

* Only report jobs done once their state has been reported (#899)

Normally the state gets reported instantly so this is already true 99% of the time.

However if reporting the state goes wrong, we shouldn't report the job as done
 - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease

In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too

This will only really impact edge cases and most of the time this will already be true
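
The ordering described in this commit message (report the job's state first, and report it as done only if that succeeded) can be sketched as follows. This is a minimal illustration, not the executor's actual code: the `reporter` type and the `reportEvent`, `reportDone`, and `finishJob` names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// reporter is a hypothetical stand-in for the executor's event and lease clients.
type reporter struct {
	failEvents bool     // simulate a failure when reporting job state
	done       []string // ids of jobs that have been reported as done
}

// reportEvent pretends to send the job's final state to the server.
func (r *reporter) reportEvent(jobID string) error {
	if r.failEvents {
		return errors.New("event reporting failed")
	}
	return nil
}

// reportDone records that the job has been reported as done.
func (r *reporter) reportDone(jobID string) {
	r.done = append(r.done, jobID)
}

// finishJob reports the job as done only once its state has been reported.
// If reporting the state fails, the job is left as not done so that it can be retried.
func finishJob(r *reporter, jobID string) error {
	if err := r.reportEvent(jobID); err != nil {
		return fmt.Errorf("not reporting job %s as done: %w", jobID, err)
	}
	r.reportDone(jobID)
	return nil
}

func main() {
	ok := &reporter{}
	fmt.Println(finishJob(ok, "job-1"), ok.done)

	bad := &reporter{failEvents: true}
	fmt.Println(finishJob(bad, "job-2"), bad.done)
}
```

Returning the error instead of swallowing it leaves the decision to retry with the caller, which keeps the "done" report strictly downstream of a successful state report.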

* Enable dotnet and npm build internally (#19)

* Added make target for dotnet that works internally

* Added dotnet tests to tests-e2e target

* Handle end-of-line symbols in a cross-platform manner

* Moved dotnet build to separate make target

* Get protoc via Maven

* Load Maven URL from environment variables

* Single Dockerfile for building proto

* Cleaned up tests make target

* Run unit tests in docker containers

* Removed unused armada-test docker network

* Run e2e tests and dotnet target in containers

* Bump go version to 1.16 for consistency

* Optionally run all go commands in containers

* Mount GOPROXY and GOPRIVATE into go containers

* Comments

* Run npm in docker containers

* Get go version string correctly from containers

* Added missing .SubmitServer

* Generate legacy job submitted events when submitting to Pulsar

* Have cancel and reprioritise endpoints generate set messages

* Always run builds in containers

* Moved armadactl tests into e2e

* Moved pulsar e2e tests into separate directory

* Renamed directory

* Updated e2e tests target to reflect new directories

* Updated tests-e2e-no-setup target

* Commented in tests

* Removed npm environment variables values

* Commented in tests

* Removed commented-out code

* ARMADA-1028 Events proto updates (#20)

* Moved terminal flag into individual errors

* Generate JobErrors instead of JobRunErrors on JobFailed API message

* Updated to reflect moving terminal flag in errors

* Added JobErrors to JobIdFromEvent

* Added config files for local use to .gitignore

* Added ReprioritisedJob, CancelledJob, and JobDuplicateDetected

* Handle all api messages

* Populate ObjectMeta info for errors

* Include container name with container errors

* Create EventId type (#21)

* Return a concrete type from new to enable comparison

* Added event id type

* Create utility for sniffing Pulsar events (#22)

* Added program to print events

* Write eventsprinter as a cobra app

* Take pulsar.Message instead of pulsar.ConsumerMessage

* Filter out non-control messages

* Fix bug associated with creating nil events

* Include CancelledJob in list of messages indicating job failure

* Print job ids

* Improved testing (#23)

* Added missing events to JobIdFromEvent

* write test output to disk, convert test output to junit format

* Added go-junit-report as dependency

* Spin up postgres for e2e tests

* Write html test report if possible

* Removed problematic -e flag

* Test submitting a job with errors, test cancelling jobs

* Ignore test_reports

* Improve Pulsar consumer retry logic (#24)

* Do not return an error on messages requiring no action

* Ack messages after processing

* Fail test immediately on error

* Propagate errors correctly

* Add code to detect (possibly nested) network errors

* Return immediately on nil in IsNetworkError

* Improved error handling and retry logic

* Added missing parentheses

* Consider context.DeadlineExceeded a network error

* Improved failure and retry logic

* Removed seek from Pulsar setup

* Removed multierror from GetActiveJobIds

* Added tests for non-network errors

* Only ack on successfully processing a sequence

* Set Pulsar message key

* keyshared sub (#27)

* Merge In changes from public github (#28)

* Remove accelerator logging (#894)

This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval)

This causes massive spam for little to no benefit

* Only report jobs done once their state has been reported (#899)

Normally the state gets reported instantly so this is already true 99% of the time.

However if reporting the state goes wrong, we shouldn't report the job as done
 - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease

In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too

This will only really impact edge cases and most of the time this will already be true

Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>

* fix infinite loop (#29)

* lower case returned jobIds (#30)

* ARMADA-995: Ingester from pulsar-> lookout database (#26)

* initial impl of lookout ingester

* added pod/container error

* fixes after testing

* goimports

* doc

* doc

* fixes

* remove unneeded code

* add docker file- move tests

* Fix event sequence number update (#31)

* Enable CI builds (#25)

* Added Jenkinsfile

* Fix syntax errors

* Changed var to def

* Removed need to clone armada-ci

* Removed armada-ci dependency

* Use writeFile instead of echo

* Removed -it

* Get version correctly

* Added debug printouts

* Printouts

* Set PWD to one valid on the host

* Set PWD in sh

* Set PWD consistently

* Make proto

* Exit on proto no matching

* Added goimports as dependency

* Import ordering

* Check error code

* Check error

* Added code checks make target

* Go mod tidy

* Added code checks stage

* Updated line endings

* Escape $

* Added download make target

* Set GOPATH such that it is available on the host

* Removed unnecessary mkdir

* Added build tag to avoid compiling tools into binary

* Automatically install tools

* Run goimports for pkg/events

* Whitespace

* Fixed line endings

* Added templify to list of tools

* Comment out proto stage during development

* Renamed junit_report to junit-report for consistency

* Run go-junit-report in docker

* Require kind, run kubectl in docker

* Do not check for kind, to allow downloading it using make download

* Add kind to list of tools

* Fixed import for go-junit-report

* Added tests-teardown target, avoid returning error code from tests-e2e-teardown

* Run teardown targets before tests

* Removed unnecessary cleanup stage

* Set shorter OperationTimeout, bundle stack trace with errors

* Commented out tests failing due to Docker setup

* Added gox to list of tools

* Run gox in docker

* Comments

* Whitespace

* Whitespace

* Whitespace

* Commented out tests during development

* Commented in all stages

* Enabled junit plugin

* Removed junit2html

* Updated paths

* Updated PWD

* Removed -set-exit-code from go-junit-report

* For Pulsar, only expose IPv4 to fix issue with Pulsar client library

* Added go-swagger and grpc-gateway to tools

* Move proto post-processing from proto.sh into makefile for performance

* Fix go-imports

* Comment in Pulsar tests

* Update Jenkinsfile

* Fix ineffassign

* Fix typo NoError -> Error

* Only build if tests succeed

* Delete Jenkinsfile

* Import ordering

* Timeout retrying processing an event sequence (#32)

* Add dependent targets to tests-e2e-no-setup

* Only ack messages if sequence is empty or if at least 1 event was processed

* Timeout processing a sequence after 5 minutes

* Added rebuild-server target for testing, removed tests-e2e-no-setup dependencies

* Comments

* Always sleep on making no progress

* Rename events module to armadaevents (#33)

* Fix test

* Rename events -> armadaevents

* Rename events -> armadaevents

* Import ordering

* Lookout Ingester fixes following testing (#35)

* fixes following testing

* fixes following testing

* Use PGX for Database Connections to Lookout (#915)

* use pgx

* removed log lines

* fix import order

* fix test

* fix test

* Add retries to Lookout Ingester (#36)

* fixes following testing

* fixes following testing

* added database error handling

* import order

* code review comments

* code review comments

* cause doesn't work

* import order

* Set default tolerations (#34)

* Fix test

* Rename events -> armadaevents

* Rename events -> armadaevents

* Import ordering

* Add default tolerations for e2e tests

* Test for default tolerations

* Improved job conversion code

* Add job conversion tests

* Use job conversion code in eventutil

* Test expected tolerations

* More verbose printing

* Lookout ingester produces compressed proto (#37)

* add compressed proto

* wip

* go imports

* fixed config- added todo

* Remove changed generated files

* Comments

* Remove changed generated files

* Remove changed generated files

Co-authored-by: Chris Martin <Chris.Martin@gresearch.co.uk>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: Chris Martin <council_tax@hotmail.com>

* goimports

* removed duplicated tests

* fix duplicate gopath

* fix duplicate gopath

* Minor fixes

* Regenerate dotnet

* Update circleci to reflect makefile changes

* Separate e2e-test job

* Renamed e2e test job for consistency

* Run e2e tests

* Increase test VM size

* Download tools for build job

* Enable integration tests in circleci

* Store junit test report

* Whitespace

* Circleci cleanup

* Update job name

* Typo

* Improved dependency caching

* Remove unnecessary make download calls

* Print GOPATH

* Print GOPATH

* Change go cache directories

* Only download with GO_TEST_CMD

* Always check dependencies

* Go mod tidy after make download

* Use large resource class

* Refactored

* Use xlarge resource class

* Removed unused dependency

* Specify container versions

* Remove unused jobs

* Enable DLC

* Print logs

* Fix order of printing logs

* Bump e2e test instance resource class

* Revert e2e test instance resource class

Co-authored-by: Albin Severinson <Albin.Severinson@gresearch.co.uk>
Co-authored-by: Chris Martin <Chris.Martin@gresearch.co.uk>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: Chris Martin <council_tax@hotmail.com>
Co-authored-by: Albin Severinson <larsalbins@latpoc32.maas>
Sharpz7 added a commit that referenced this pull request Aug 21, 2023
* Separate python script for armada v1 system diagram

* removed generate.py so it can be replaced with two separate files for Armada V1 and Armada V2

* Python script to generate Armada V2 system diagram

* generate_v1.py Update #1

* generate_v1.py Update Number:2

* generate.py runs generate_v1.py as well as generate_v2.py and it is consistent with our instructions as 'docs/design/diagrams/relationships'

* generate_v1.py Update No:3

* Armada V1 and Armada V2 diagrams

* updated relationships_diagram.md to include armada v1 and v2 diagrams

---------

Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
svc-gh-ghzonetrans-p pushed a commit that referenced this pull request Oct 23, 2023
* Update simulator

* Replace Output with C

* Typo

* Restore pkg proto

* Restore files

* Fixing simulator changes (#6)

* Fixing simulator changes

* Changed to less than or equal

Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>

* Simulator Changes (#9)

* Add config and dependency injection to scheduler metrics (#2892)

* Replace metrics singleton with an injection pattern.

* fix

* add configuration structures to metrics

* add configuration

* rename elements

* Make Pulsar ReceiverQueueSize Configurable (#2895)

* wip

* wip

* set receiverQueueSize to 100

* remove old PulsarReceiverQueueSize

* revert

* subscriptionin api

---------

Co-authored-by: Chris Martin <chris@cmartinit.co.uk>

* Add poll_interval (#2805)

* Add poll_interval

* Add poll_interval

* Added poll_interval

* update by running tox -e docs

---------

Co-authored-by: Kevin Hannon <kannon1992@gmail.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* Separate python script for armada v1 and v2 system diagrams (#2758)

* Separate python script for armada v1 system diagram

* removed generate.py so it can be replaced with two separate files for Armada V1 and Armada V2

* Python script to generate Armada V2 system diagram

* generate_v1.py Update #1

* generate_v1.py Update Number:2

* generate.py runs generate_v1.py as well as generate_v2.py and it is consistent with our instructions as 'docs/design/diagrams/relationships'

* generate_v1.py Update No:3

* Armada V1 and Armada V2 diagrams

* updated relationships_diagram.md to include armada v1 and v2 diagrams

---------

Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* Add config to use autoupdater on tagged branches (#2905)

* #2904 add autoupdate config

* #2904 add label config and other options

* docs: create README.md for plugins directory (#2897)

* Create README.md for plugins directory

* Update README.md

* Update plugins/README.md

Co-authored-by: Kevin Hannon <kehannon@redhat.com>

* Update README.md

---------

Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* Enables airflow operator level retry. (#2894)

* Update docker stuff for latest airflow 2.7.0

* Use AirflowException instead of AirflowFailException to allow for retries

* Remove codecov workflows (#2902)

* Upgrade Pulsar Client to v0.11 (#2896)

* update

* update pulsar client

* Fix bug causing server spinning

* Abstract out the retry until success logic for testing (#2901)

* Respond to review

---------

Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Daniel Rastelli <rastellidani@gmail.com>

* Sync quickstart/index.md with gh-pages/quickstart.md (#2891)

* Log Call Site (#2909)

* allow logger to report caller

* allow logger to report caller

* lint

---------

Co-authored-by: Chris Martin <chris@cmartinit.co.uk>

* Add cleaner test output for mage with os/exec.Command (#2907)

* feat: Update Semver from version 6.3.0 to 6.3.1 (#2686)

Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* fix: upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0 (#2743)

Snyk has created this PR to upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* fix: upgrade @types/react from 16.14.32 to 16.14.43 (#2747)

Snyk has created this PR to upgrade @types/react from 16.14.32 to 16.14.43.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Bump github.com/go-openapi/jsonreference from 0.20.0 to 0.20.2 (#2316)

Bumps [github.com/go-openapi/jsonreference](https://github.com/go-openapi/jsonreference) from 0.20.0 to 0.20.2.
- [Release notes](https://github.com/go-openapi/jsonreference/releases)
- [Commits](go-openapi/jsonreference@v0.20.0...v0.20.2)

---
updated-dependencies:
- dependency-name: github.com/go-openapi/jsonreference
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Order leased jobs by serial (#2912)

This ensures that the job leased first gets sent to the cluster first

Currently we just order by the Postgres default sorting, which often picks the most recently leased jobs, causing the first leased jobs to get stuck
 - This only occurs when scheduling is faster than leasing
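
As a rough in-memory illustration of the ordering this commit message describes, the sketch below sorts leased jobs by an increasing serial so that the earliest lease is sent first. The `leasedJob` type and `orderBySerial` function are hypothetical; the commit itself may well apply the ordering in the database query, which this log does not show.

```go
package main

import (
	"fmt"
	"sort"
)

// leasedJob is a hypothetical record of a leased job. Serial is assumed to
// increase with each lease, so a lower serial means an earlier lease.
type leasedJob struct {
	ID     string
	Serial int64
}

// orderBySerial sorts jobs in place so that the job leased first comes first.
// A stable sort keeps the relative order of jobs that share a serial.
func orderBySerial(jobs []leasedJob) {
	sort.SliceStable(jobs, func(i, j int) bool {
		return jobs[i].Serial < jobs[j].Serial
	})
}

func main() {
	jobs := []leasedJob{
		{ID: "c", Serial: 30},
		{ID: "a", Serial: 10},
		{ID: "b", Serial: 20},
	}
	orderBySerial(jobs)
	for _, j := range jobs {
		fmt.Println(j.ID, j.Serial)
	}
}
```

Without an explicit ordering, a store is free to return rows in any order, which is how the oldest leases could be starved when scheduling outpaces leasing.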

* Bump webpack from 5.75.0 to 5.77.0 in /internal/lookout/ui (#2302)

Bumps [webpack](https://github.com/webpack/webpack) from 5.75.0 to 5.77.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](webpack/webpack@v5.75.0...v5.77.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Bump word-wrap from 1.2.3 to 1.2.5 in /internal/lookout/ui (#2806)

Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.5.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](jonschlinkert/word-wrap@1.2.3...1.2.5)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* resolve flaky (#2914)

Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* fix: upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0 (#2744)

Snyk has created this PR to upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* fix: upgrade react-router-dom from 6.9.0 to 6.14.1 (#2746)

Snyk has created this PR to upgrade react-router-dom from 6.9.0 to 6.14.1.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Bump semver from 6.3.0 to 6.3.1 in /internal/lookout/ui (#2661)

Bumps [semver](https://github.com/npm/node-semver) from 6.3.0 to 6.3.1.
- [Release notes](https://github.com/npm/node-semver/releases)
- [Changelog](https://github.com/npm/node-semver/blob/v6.3.1/CHANGELOG.md)
- [Commits](npm/node-semver@v6.3.0...v6.3.1)

---
updated-dependencies:
- dependency-name: semver
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Run CodeQL once daily on a schedule (#2918)

* Helm chart update: executor  (#2917)

* Helm chart update: executor

At the moment the helm chart for the executor doesn't include priorityClass even though one is created in the chart. This means that the executor deployment is unable to set the priorityClass.

* Patch/dependencies (#2923)

* Bump github.com/go-openapi/strfmt from 0.21.3 to 0.21.7

Bumps [github.com/go-openapi/strfmt](https://github.com/go-openapi/strfmt) from 0.21.3 to 0.21.7.
- [Release notes](https://github.com/go-openapi/strfmt/releases)
- [Commits](go-openapi/strfmt@v0.21.3...v0.21.7)

---
updated-dependencies:
- dependency-name: github.com/go-openapi/strfmt
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump github.com/go-openapi/runtime from 0.24.2 to 0.26.0

Bumps [github.com/go-openapi/runtime](https://github.com/go-openapi/runtime) from 0.24.2 to 0.26.0.
- [Release notes](https://github.com/go-openapi/runtime/releases)
- [Commits](go-openapi/runtime@v0.24.2...v0.26.0)

---
updated-dependencies:
- dependency-name: github.com/go-openapi/runtime
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump github.com/goreleaser/nfpm/v2 from 2.25.1 to 2.29.0

Bumps [github.com/goreleaser/nfpm/v2](https://github.com/goreleaser/nfpm) from 2.25.1 to 2.29.0.
- [Release notes](https://github.com/goreleaser/nfpm/releases)
- [Changelog](https://github.com/goreleaser/nfpm/blob/main/.goreleaser.yml)
- [Commits](goreleaser/nfpm@v2.25.1...v2.29.0)

---
updated-dependencies:
- dependency-name: github.com/goreleaser/nfpm/v2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump github.com/go-playground/validator/v10 from 10.11.1 to 10.14.1

Bumps [github.com/go-playground/validator/v10](https://github.com/go-playground/validator) from 10.11.1 to 10.14.1.
- [Release notes](https://github.com/go-playground/validator/releases)
- [Commits](go-playground/validator@v10.11.1...v10.14.1)

---
updated-dependencies:
- dependency-name: github.com/go-playground/validator/v10
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump Grpc.Net.Client in /client/DotNet/ArmadaProject.Io.Client

Bumps [Grpc.Net.Client](https://github.com/grpc/grpc-dotnet) from 2.47.0 to 2.52.0.
- [Release notes](https://github.com/grpc/grpc-dotnet/releases)
- [Changelog](https://github.com/grpc/grpc-dotnet/blob/master/doc/release_process.md)
- [Commits](grpc/grpc-dotnet@v2.47.0...v2.52.0)

---
updated-dependencies:
- dependency-name: Grpc.Net.Client
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: upgrade @mui/material from 5.10.17 to 5.13.6

Snyk has created this PR to upgrade @mui/material from 5.10.17 to 5.13.6.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade prettier from 2.7.1 to 2.8.8

Snyk has created this PR to upgrade prettier from 2.7.1 to 2.8.8.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade @mui/icons-material from 5.10.16 to 5.14.3

Snyk has created this PR to upgrade @mui/icons-material from 5.10.16 to 5.14.3.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade eslint-plugin-import from 2.26.0 to 2.28.0

Snyk has created this PR to upgrade eslint-plugin-import from 2.26.0 to 2.28.0.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade eslint-config-prettier from 8.5.0 to 8.10.0

Snyk has created this PR to upgrade eslint-config-prettier from 8.5.0 to 8.10.0.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* Trying to update klog

* go mod fix

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Fix bug causing GetJobSetEvents to get stuck (#2903)

* Add error message of final job run to JobFailedMessage

When we hit the maximum retry limit, the JobFailedMessage just says something along the lines of
"Job has been retried too many times, giving up"

Now we include the final run error in that message - to make it easier to work out the cause of retries

* Fix bug causing GetJobSetEvents to get stuck

GetJobSetEvents only increments its fromId variable on sending new messages

However, not all redis events produce api events that will be sent downstream

The issue is that if we get 500 redis events in a row that don't produce api events, the fromId never gets updated
 - Meaning the watch gets stuck here

To fix this, ReadEvents now returns a lastMessageId. So if there are no messages to process, the fromId should be updated using the lastMessageId
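
The cursor handling described above can be sketched as follows. This is a minimal illustration under assumptions, not the Armada code: `storedEvent`, `readEvents`, `nextFromID` and the integer ids are hypothetical; only the names fromId and lastMessageId come from this commit message.

```go
package main

import "fmt"

// storedEvent is a hypothetical stand-in for a redis event. Only some
// stored events produce an api event that is sent downstream.
type storedEvent struct {
	ID          int
	ProducesAPI bool
}

// readEvents looks at up to batchSize stored events after fromID. It returns
// the ids that should be sent downstream and the id of the last stored event
// it looked at (0 if there was nothing to read).
func readEvents(log []storedEvent, fromID, batchSize int) (toSend []int, lastMessageID int) {
	seen := 0
	for _, e := range log {
		if e.ID <= fromID {
			continue
		}
		if seen == batchSize {
			break
		}
		seen++
		lastMessageID = e.ID
		if e.ProducesAPI {
			toSend = append(toSend, e.ID)
		}
	}
	return toSend, lastMessageID
}

// nextFromID advances the cursor. When nothing in the batch was sendable,
// it falls back to lastMessageID so the watcher cannot get stuck.
func nextFromID(fromID int, sent []int, lastMessageID int) int {
	if len(sent) > 0 {
		return sent[len(sent)-1]
	}
	if lastMessageID > fromID {
		return lastMessageID
	}
	return fromID
}

func main() {
	// Five events that produce nothing downstream, then one that does.
	log := []storedEvent{{1, false}, {2, false}, {3, false}, {4, false}, {5, false}, {6, true}}
	fromID := 0
	for i := 0; i < 2; i++ {
		sent, last := readEvents(log, fromID, 5)
		fromID = nextFromID(fromID, sent, last)
		fmt.Println("sent:", sent, "fromId:", fromID)
	}
}
```

Without the fallback to lastMessageID, the first batch in `main` would leave fromId unchanged and every later read would return the same five unsendable events.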

* Formatting

* Bump @adobe/css-tools from 4.0.1 to 4.3.1 in /internal/lookout/ui (#2931)

Bumps [@adobe/css-tools](https://github.com/adobe/css-tools) from 4.0.1 to 4.3.1.
- [Changelog](https://github.com/adobe/css-tools/blob/main/History.md)
- [Commits](https://github.com/adobe/css-tools/commits)

---
updated-dependencies:
- dependency-name: "@adobe/css-tools"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Improved etcd protection (#2925)

* Initial commit

* Delete unused code

* Export metrics collection delay metrics

* Add mutex to InMemoryJobRepository

* Add tests

* Lint

* Update internal/executor/configuration/types.go

* Lint

---------

Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>

* Stop executor requesting more jobs when it still has leased jobs (#2932)

* Stop executor requesting more jobs when it still has leased jobs

Currently we "queue" jobs to be submitted on the executor - which sit in the leased state until they are submitted to kubernetes

However this causes 2 issues with our current setup:
 - It prevents back-pressure from working well on the scheduler side, as it sees all these "Leased" jobs as active and so just keeps scheduling more
 - When we are slowing submission because etcd is over its limit, we "queue" lots of jobs, and as soon as etcd goes under its limit we hit it with potentially thousands of jobs

This flow needs further work and thought - however for now this is the minimal fix to prevent bad behaviour
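
One way to picture the minimal fix is a cap on how many leased-but-unsubmitted jobs the executor may hold. The sketch below is an assumption about the shape of that check, not the actual executor code: `jobsToRequest`, `leasedJobs` and `freeSlots` are hypothetical names, and only maxLeasedJobs is mentioned in this pull request.

```go
package main

import "fmt"

// jobsToRequest returns how many new jobs a hypothetical executor should ask
// the scheduler for. maxLeasedJobs caps how many jobs may sit in the leased
// state; leasedJobs is how many are currently waiting to be submitted;
// freeSlots is how many more jobs the cluster could take right now.
func jobsToRequest(maxLeasedJobs, leasedJobs, freeSlots int) int {
	room := maxLeasedJobs - leasedJobs
	if room <= 0 || freeSlots <= 0 {
		// Still holding leased jobs up to the cap (or no capacity): ask for nothing.
		return 0
	}
	if freeSlots < room {
		return freeSlots
	}
	return room
}

func main() {
	fmt.Println(jobsToRequest(10, 10, 50)) // cap reached, request nothing
	fmt.Println(jobsToRequest(10, 4, 50))  // room for 6 more leased jobs
	fmt.Println(jobsToRequest(10, 0, 3))   // limited by free capacity
}
```

Because the executor stops asking once it is at the cap, the scheduler no longer sees an ever-growing set of "Leased" jobs, and a recovering etcd is not hit with a large backlog at once.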

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* WIP

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Fix scheduler side tests

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Implement number of requested jobs on executor side

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Remove unused config

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Fixing panic on startup when etcd health monitor not registered

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Enhance logging

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Set more sensible default for maxLeasedJobs

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

---------

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Fix race in etcd protections (#2937)

* Initial commit

* Fix MultiHealthMonitor race

* Fix etcd health metric naming conflict (#2939)

* Fix metric naming conflict

* Fix metric names

* Fix metric prefix

* Fix label

* Bump golang.org/x/sync from 0.1.0 to 0.3.0 (#2946)

Bumps [golang.org/x/sync](https://github.com/golang/sync) from 0.1.0 to 0.3.0.
- [Commits](golang/sync@v0.1.0...v0.3.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sync
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add more scheduler metrics (#2906)

* Add jobs considered and refactor to counters

* Add fair share metrics

* Add reset for gauge metrics

* format

* cycle imports

* modify cycle return struct

* verbose logging

---------

Co-authored-by: Albin Severinson <albin@severinson.org>

* Update config.yaml (#2953)

* Remove gang job cardinality submit check. Add placeholder for min gang size

* Add msumner91 and mustafai to magic list of trusted people (#2956)

* Add msumner91 to magic list of trusted people

* Update .mergify.yml

* Airflow: always set credentials from args in channel ctor (#2952)

In the GrpcChannelArguments constructor, always set the
credentials_callback_args member from what is given. Add a test to
verify serialization round-tripping is complete, and a __eq__
implementation for GrpcChannelArguments.

Signed-off-by: Rich Scott <richscott@sent.com>

* Removed Makefile from repo (#2915)

Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Add per-queue scheduling rate-limiting (#2938)

* Initial commit

* Add rate limiters

* go mod tidy

* Updates

* Add tests

* Update default config

* Update default scheduler config

* Whitespace

* Cleanup

* Docstring improvements

* Remove limiter nil checks

* Add Cardinality() function on gctx

* Fix test

* Fix test

* Add note about signed commits to Contributor documentation (#2960)

* Add note about signed commits to Contributor documentation

Signed-off-by: Aviral Singh <itsaviral.2609@gmail.com>

* Add note about signed commits to Contributor documentation

---------

Signed-off-by: Aviral Singh <itsaviral.2609@gmail.com>

* ArmadaContext that includes a logger (#2934)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* compilation!

* rename package

* more compilation

* rename to Context

* embed

* compilation

* compilation

* fix test

* remove old ctxloggers

* revert design doc

* revert developer doc

* formatting

* wip

* tests

* don't gen

* don't gen

* merged master

---------

Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Albin Severinson <albin@severinson.org>

* Bump armada airflow operator to version 0.5.4 (#2961)

* Bump armada airflow operator to version 0.5.4

Signed-off-by: Rich Scott <richscott@sent.com>

* Regenerate Airflow Operator Markdown doc.

Signed-off-by: Rich Scott <richscott@sent.com>

* Fix regenerated Airflow doc error.

Signed-off-by: Rich Scott <richscott@sent.com>

* Pin versions of all modules, especially around docs generation.

Signed-off-by: Rich Scott <richscott@sent.com>

* Regenerate Airflow docs using Python 3.10

Signed-off-by: Rich Scott <richscott@sent.com>

---------

Signed-off-by: Rich Scott <richscott@sent.com>

* Simulator Changes

Made a number of changes to the simulator and simulator tests, most notably:
 - Fixed implementation of minSubmitTime setting for workload
   specifications
 - Added tests for SchedulingConfigsFromPattern,
   ClusterSpecsFromPattern, WorkloadFromPattern
 - Added sample workloads, clusters and scheduling configs
 - Added tests which simulate per-pool and per-executorGroup scheduling
 - Implemented further metrics for use in simulator tests, such as a
   cluster's aggregate resources, number of preemptions and schedules
   for a given test run
 - Added optimisation to speed up simulator, whereby the scheduler skips
   the current schedule event if no eventSequences have been received
   since the previous schedule.
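
The last optimisation in the list above can be sketched as a counter that is reset on every schedule. This is an illustration under assumptions, not the simulator's code: the `simulator` type and its method names are hypothetical.

```go
package main

import "fmt"

// simulator is a hypothetical, stripped-down model of the event loop.
type simulator struct {
	eventSequencesSinceSchedule int // event sequences received since the last schedule
	schedulesRun                int
	schedulesSkipped            int
}

// onEventSequence records that new information arrived.
func (s *simulator) onEventSequence() {
	s.eventSequencesSinceSchedule++
}

// onScheduleEvent runs a scheduling round only if something changed since
// the previous one. It reports whether a round was actually run.
func (s *simulator) onScheduleEvent() bool {
	if s.eventSequencesSinceSchedule == 0 {
		s.schedulesSkipped++
		return false
	}
	s.eventSequencesSinceSchedule = 0
	s.schedulesRun++
	return true
}

func main() {
	s := &simulator{}
	s.onEventSequence()
	fmt.Println(s.onScheduleEvent()) // new input since last schedule: runs
	fmt.Println(s.onScheduleEvent()) // nothing new: skipped
	fmt.Println(s.schedulesRun, s.schedulesSkipped)
}
```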

* Simplified TestClusterSpecsFromPattern and TestWorkloadFromPattern tests

* Removed unused test

* Fixed malformed yaml

* Improved metrics for simulations. Improved simulator tests with errorgroups.

* Removed all simulator test data except basic data necessary for testing

* Implementing CLI

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Aviral Singh <itsaviral.2609@gmail.com>
Co-authored-by: Daniel Rastelli <rastellidani@gmail.com>
Co-authored-by: Chris Martin <council_tax@hotmail.com>
Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Sarthak Negi <122533767+sarthaksarthak9@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kannon1992@gmail.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Pradeep Kurapati <113408145+Pradeep-Kurapati@users.noreply.github.com>
Co-authored-by: Dave Gantenbein <dave@gr-oss.io>
Co-authored-by: Shivang Shandilya <101946115+ShivangShandilya@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Co-authored-by: Clif Houck <me@clifhouck.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>
Co-authored-by: Kanu Mike Chibundu <michotall95@gmail.com>
Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: owenthomas17 <owen@owen-thomas.co.uk>
Co-authored-by: Albin Severinson <albin@severinson.org>
Co-authored-by: Mark Sumner <m.sumner91@hotmail.co.uk>
Co-authored-by: Rich Scott <rich@gr-oss.io>
Co-authored-by: MeenuyD <116630390+MeenuyD@users.noreply.github.com>
Co-authored-by: Aviral Singh <itsaviral.2609@gmail.com>
Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>

* Adding verbose flag to simulator CLI, changing logging context in simulator

* Improved simulator CLI output, removed redundant features, implemented parallel simulations by addressing mutability of structures inputted into the simulator

* Removed unknown logging library

* Changing threadSafeLogger Info call to Print. Adding separation back between simulation results

* Implemented stochastic runtime for jobs using a shifted exponential distribution (#13)

* Implemented stochastic runtime for jobs using a shifted exponential distribution

* Implemented min submit time from dependency completion (#14)

Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>

* Fixed tests

* Fixed implementation of shifted exponential distribution

* Using FP unrounded parameters to sample from distribution

* Modified stochastic runtime definition

* Adding logging to simulator

Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Aviral Singh <itsaviral.2609@gmail.com>
Co-authored-by: Albin Severinson <larsalbins@uberit.net>
Co-authored-by: Mustafa Ilyas <Mustafa.Ilyas@gresearch.co.uk>
Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>
Co-authored-by: Daniel Rastelli <rastellidani@gmail.com>
Co-authored-by: Chris Martin <council_tax@hotmail.com>
Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Sarthak Negi <122533767+sarthaksarthak9@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kannon1992@gmail.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Pradeep Kurapati <113408145+Pradeep-Kurapati@users.noreply.github.com>
Co-authored-by: Dave Gantenbein <dave@gr-oss.io>
Co-authored-by: Shivang Shandilya <101946115+ShivangShandilya@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Co-authored-by: Clif Houck <me@clifhouck.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>
Co-authored-by: Kanu Mike Chibundu <michotall95@gmail.com>
Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: owenthomas17 <owen@owen-thomas.co.uk>
Co-authored-by: Albin Severinson <albin@severinson.org>
Co-authored-by: Mark Sumner <m.sumner91@hotmail.co.uk>
Co-authored-by: Rich Scott <rich@gr-oss.io>
Co-authored-by: MeenuyD <116630390+MeenuyD@users.noreply.github.com>
Co-authored-by: Aviral Singh <itsaviral.2609@gmail.com>
severinson added a commit that referenced this pull request Oct 27, 2023
* Sync out testsuite changes (#19)

* Update simulator

* Replace Output with C

* Typo

* Restore pkg proto

* Restore files

* Fixing simulator changes (#6)

* Fixing simulator changes

* Changed to less than or equal

Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>

* Simulator Changes (#9)

* Add config and dependency injection to scheduler metrics (#2892)

* Replace metrics singleton with an injection pattern.

* fix

* add configuration structures to metrics

* add configuration

* rename elements

* Make Pulsar ReceiverQueueSize Configurable (#2895)

* wip

* wip

* set receiverQueueSize to 100

* remove old PulsarReceiverQueueSize

* revert

* subscriptionin api

---------

Co-authored-by: Chris Martin <chris@cmartinit.co.uk>

* Add poll_interval (#2805)

* Add poll_interval

* Add poll_interval

* Added poll_interval

* update by running tox -e docs

---------

Co-authored-by: Kevin Hannon <kannon1992@gmail.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* Separate python script for armada v1 and v2 system diagrams (#2758)

* Separate python script for armada v1 system diagram

* removed generate.py so it can be replaced with two separate files for Armada V1 and Armada V2

* Python script to generate Armada V2 system diagram

* generate_v1.py Update #1

* generate_v1.py Update Number:2

* generate.py runs generate_v1.py as well as generate_v2.py and it is consistent with our instructions as 'docs/design/diagrams/relationships'

* generate_v1.py Update No:3

* Armada V1 and Armada V2 diagrams

* updated relationships_diagram.md to include armada v1 and v2 diagrams

---------

Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* Add config to use autoupdater on tagged branches (#2905)

* #2904 add autoupdate config

* #2904 add label config and other options

* docs: create README.md for plugins directory (#2897)

* Create README.md for plugins directory

* Update README.md

* Update plugins/README.md

Co-authored-by: Kevin Hannon <kehannon@redhat.com>

* Update README.md

---------

Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* Enables airflow operator level retry. (#2894)

* Update docker stuff for latest airflow 2.7.0

* Use AirflowException instead of AirflowFailException to allow for retries

* Remove codecov workflows (#2902)

* Upgrade Pulsar Client to v0.11 (#2896)

* update

* update pulsar client

* Fix bug causing server spinning

* Abstract out the retry until success logic for testing (#2901)

* Respond to review

---------

Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Daniel Rastelli <rastellidani@gmail.com>

* Sync quickstart/index.md with gh-pages/quickstart.md (#2891)

* Log Call Site (#2909)

* allow logger to report caller

* allow logger to report caller

* lint

---------

Co-authored-by: Chris Martin <chris@cmartinit.co.uk>

* Add cleaner test output for mage with os/exec.Command (#2907)

* feat: Update Semver from version 6.3.0 to 6.3.1 (#2686)

Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* fix: upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0 (#2743)

Snyk has created this PR to upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* fix: upgrade @types/react from 16.14.32 to 16.14.43 (#2747)

Snyk has created this PR to upgrade @types/react from 16.14.32 to 16.14.43.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Bump github.com/go-openapi/jsonreference from 0.20.0 to 0.20.2 (#2316)

Bumps [github.com/go-openapi/jsonreference](https://github.com/go-openapi/jsonreference) from 0.20.0 to 0.20.2.
- [Release notes](https://github.com/go-openapi/jsonreference/releases)
- [Commits](go-openapi/jsonreference@v0.20.0...v0.20.2)

---
updated-dependencies:
- dependency-name: github.com/go-openapi/jsonreference
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Order leased jobs by serial (#2912)

This will ensure the job leased first gets sent to the cluster first

Currently we just order by postgres default sorting - which often picks the most recently leased - causing the first-leased jobs to get stuck
 - This only occurs when scheduling is faster than leasing
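
The change itself concerns how rows are ordered when read from postgres. The in-memory analogue below only illustrates the effect and rests on assumptions: `leasedJob` and `orderBySerial` are hypothetical names, and Serial is assumed to increase with each lease.

```go
package main

import (
	"fmt"
	"sort"
)

// leasedJob is a hypothetical row: Serial is assumed to grow with each lease,
// so a lower Serial means the job was leased earlier.
type leasedJob struct {
	ID     string
	Serial int64
}

// orderBySerial returns a copy of jobs sorted so the earliest lease comes first.
func orderBySerial(jobs []leasedJob) []leasedJob {
	out := append([]leasedJob(nil), jobs...)
	sort.Slice(out, func(i, j int) bool { return out[i].Serial < out[j].Serial })
	return out
}

func main() {
	// Rows as they might come back without an explicit ordering.
	jobs := []leasedJob{{"job-c", 3}, {"job-a", 1}, {"job-b", 2}}
	for _, j := range orderBySerial(jobs) {
		fmt.Println(j.ID, j.Serial)
	}
}
```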

* Bump webpack from 5.75.0 to 5.77.0 in /internal/lookout/ui (#2302)

Bumps [webpack](https://github.com/webpack/webpack) from 5.75.0 to 5.77.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](webpack/webpack@v5.75.0...v5.77.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Bump word-wrap from 1.2.3 to 1.2.5 in /internal/lookout/ui (#2806)

Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.5.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](jonschlinkert/word-wrap@1.2.3...1.2.5)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* resolve flaky (#2914)

Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>

* fix: upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0 (#2744)

Snyk has created this PR to upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* fix: upgrade react-router-dom from 6.9.0 to 6.14.1 (#2746)

Snyk has created this PR to upgrade react-router-dom from 6.9.0 to 6.14.1.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Bump semver from 6.3.0 to 6.3.1 in /internal/lookout/ui (#2661)

Bumps [semver](https://github.com/npm/node-semver) from 6.3.0 to 6.3.1.
- [Release notes](https://github.com/npm/node-semver/releases)
- [Changelog](https://github.com/npm/node-semver/blob/v6.3.1/CHANGELOG.md)
- [Commits](npm/node-semver@v6.3.0...v6.3.1)

---
updated-dependencies:
- dependency-name: semver
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Run CodeQL once daily on a schedule (#2918)

* Helm chart update: executor  (#2917)

* Helm chart update: executor

At the moment the helm chart for the executor doesn't include priorityClass even though one is created in the chart. This means that the executor deployment is unable to set the priorityClass.

* Patch/dependencies (#2923)

* Bump github.com/go-openapi/strfmt from 0.21.3 to 0.21.7

Bumps [github.com/go-openapi/strfmt](https://github.com/go-openapi/strfmt) from 0.21.3 to 0.21.7.
- [Release notes](https://github.com/go-openapi/strfmt/releases)
- [Commits](go-openapi/strfmt@v0.21.3...v0.21.7)

---
updated-dependencies:
- dependency-name: github.com/go-openapi/strfmt
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump github.com/go-openapi/runtime from 0.24.2 to 0.26.0

Bumps [github.com/go-openapi/runtime](https://github.com/go-openapi/runtime) from 0.24.2 to 0.26.0.
- [Release notes](https://github.com/go-openapi/runtime/releases)
- [Commits](go-openapi/runtime@v0.24.2...v0.26.0)

---
updated-dependencies:
- dependency-name: github.com/go-openapi/runtime
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump github.com/goreleaser/nfpm/v2 from 2.25.1 to 2.29.0

Bumps [github.com/goreleaser/nfpm/v2](https://github.com/goreleaser/nfpm) from 2.25.1 to 2.29.0.
- [Release notes](https://github.com/goreleaser/nfpm/releases)
- [Changelog](https://github.com/goreleaser/nfpm/blob/main/.goreleaser.yml)
- [Commits](goreleaser/nfpm@v2.25.1...v2.29.0)

---
updated-dependencies:
- dependency-name: github.com/goreleaser/nfpm/v2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump github.com/go-playground/validator/v10 from 10.11.1 to 10.14.1

Bumps [github.com/go-playground/validator/v10](https://github.com/go-playground/validator) from 10.11.1 to 10.14.1.
- [Release notes](https://github.com/go-playground/validator/releases)
- [Commits](go-playground/validator@v10.11.1...v10.14.1)

---
updated-dependencies:
- dependency-name: github.com/go-playground/validator/v10
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump Grpc.Net.Client in /client/DotNet/ArmadaProject.Io.Client

Bumps [Grpc.Net.Client](https://github.com/grpc/grpc-dotnet) from 2.47.0 to 2.52.0.
- [Release notes](https://github.com/grpc/grpc-dotnet/releases)
- [Changelog](https://github.com/grpc/grpc-dotnet/blob/master/doc/release_process.md)
- [Commits](grpc/grpc-dotnet@v2.47.0...v2.52.0)

---
updated-dependencies:
- dependency-name: Grpc.Net.Client
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: upgrade @mui/material from 5.10.17 to 5.13.6

Snyk has created this PR to upgrade @mui/material from 5.10.17 to 5.13.6.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade prettier from 2.7.1 to 2.8.8

Snyk has created this PR to upgrade prettier from 2.7.1 to 2.8.8.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade @mui/icons-material from 5.10.16 to 5.14.3

Snyk has created this PR to upgrade @mui/icons-material from 5.10.16 to 5.14.3.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade eslint-plugin-import from 2.26.0 to 2.28.0

Snyk has created this PR to upgrade eslint-plugin-import from 2.26.0 to 2.28.0.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade eslint-config-prettier from 8.5.0 to 8.10.0

Snyk has created this PR to upgrade eslint-config-prettier from 8.5.0 to 8.10.0.

See this package in npm:


See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* Trying to update klog

* go mod fix

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Fix bug causing GetJobSetEvents to get stuck (#2903)

* Add error message of final job run to JobFailedMessage

When we hit the maximum retry limit, the JobFailedMessage just says something along the lines of
"Job has been retried too many times, giving up"

Now we include the final run error in that message - to make it easier to work out the cause of retries

* Fix bug causing GetJobSetEvents to get stuck

GetJobSetEvents only increments its fromId variable on sending new messages

However, not all redis events produce api events that will be sent downstream

The issue is that if we get 500 redis events in a row that don't produce api events, the fromId never gets updated
 - Meaning the watch gets stuck here

To fix this, ReadEvents now returns a lastMessageId. So if there are no messages to process, the fromId should be updated using the lastMessageId

* Formatting

* Bump @adobe/css-tools from 4.0.1 to 4.3.1 in /internal/lookout/ui (#2931)

Bumps [@adobe/css-tools](https://github.com/adobe/css-tools) from 4.0.1 to 4.3.1.
- [Changelog](https://github.com/adobe/css-tools/blob/main/History.md)
- [Commits](https://github.com/adobe/css-tools/commits)

---
updated-dependencies:
- dependency-name: "@adobe/css-tools"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Improved etcd protection (#2925)

* Initial commit

* Delete unused code

* Export metrics collection delay metrics

* Add mutex to InMemoryJobRepository

* Add tests

* Lint

* Update internal/executor/configuration/types.go

* Lint

---------

Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>

* Stop executor requesting more jobs when it still has leased jobs (#2932)

* Stop executor requesting more jobs when it still has leased jobs

Currently we "queue" jobs to be submitted on the executor - which sit in the leased state until they are submitted to kubernetes

However this causes 2 issues with our current setup:
 - It prevents back-pressure from working well on the scheduler side, as it sees all these "Leased" jobs as active and so just keeps scheduling more
 - When we are slowing submission because etcd is over its limit, we "queue" lots of jobs, and as soon as etcd goes under its limit we hit it with potentially thousands of jobs

This flow needs further work and thought - however for now this is the minimal fix to prevent bad behaviour

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* WIP

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Fix scheduler side tests

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Implement number of requested jobs on executor side

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Remove unused config

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Fixing panic on startup when etcd health monitor not registered

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Enhance logging

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Set more sensible default for maxLeasedJobs

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

---------

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Fix race in etcd protections (#2937)

* Initial commit

* Fix MultiHealthMonitor race

* Fix etcd health metric naming conflict (#2939)

* Fix metric naming conflict

* Fix metric names

* Fix metric prefix

* Fix label

* Bump golang.org/x/sync from 0.1.0 to 0.3.0 (#2946)

Bumps [golang.org/x/sync](https://github.com/golang/sync) from 0.1.0 to 0.3.0.
- [Commits](golang/sync@v0.1.0...v0.3.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sync
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add more scheduler metrics (#2906)

* Add jobs considered and refactor to counters

* Add fair share metrics

* Add reset for gauge metrics

* format

* cycle imports

* modify cycle return struct

* verbose logging

---------

Co-authored-by: Albin Severinson <albin@severinson.org>

* Update config.yaml (#2953)

* Remove gang job cardinality submit check. Add placeholder for min gang size

* Add msumner91 and mustafai to magic list of trusted people (#2956)

* Add msumner91 to magic list of trusted people

* Update .mergify.yml

* Airflow: always set credentials from args in channel ctor (#2952)

In the GrpcChannelArguments constructor, always set the
credentials_callback_args member from what is given. Add a test to
verify serialization round-tripping is complete, and a __eq__
implementation for GrpcChannelArguments.

Signed-off-by: Rich Scott <richscott@sent.com>

* Removed Makefile from repo (#2915)

Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>

* Add per-queue scheduling rate-limiting (#2938)

* Initial commit

* Add rate limiters

* go mod tidy

* Updates

* Add tests

* Update default config

* Update default scheduler config

* Whitespace

* Cleanup

* Docstring improvements

* Remove limiter nil checks

* Add Cardinality() function on gctx

* Fix test

* Fix test

* Add note about signed commits to Contributor documentation (#2960)

* Add note about signed commits to Contributor documentation

Signed-off-by: Aviral Singh <itsaviral.2609@gmail.com>

* Add note about signed commits to Contributor documentation

---------

Signed-off-by: Aviral Singh <itsaviral.2609@gmail.com>

* ArmadaContext that includes a logger (#2934)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* compilation!

* rename package

* more compilation

* rename to Context

* embed

* compilation

* compilation

* fix test

* remove old ctxloggers

* revert design doc

* revert developer doc

* formatting

* wip

* tests

* don't gen

* don't gen

* merged master

---------

Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Albin Severinson <albin@severinson.org>

* Bump armada airflow operator to version 0.5.4 (#2961)

* Bump armada airflow operator to version 0.5.4

Signed-off-by: Rich Scott <richscott@sent.com>

* Regenerate Airflow Operator Markdown doc.

Signed-off-by: Rich Scott <richscott@sent.com>

* Fix regenerated Airflow doc error.

Signed-off-by: Rich Scott <richscott@sent.com>

* Pin versions of all modules, especially around docs generation.

Signed-off-by: Rich Scott <richscott@sent.com>

* Regenerate Airflow docs using Python 3.10

Signed-off-by: Rich Scott <richscott@sent.com>

---------

Signed-off-by: Rich Scott <richscott@sent.com>

* Simulator Changes

Made a number of changes to the simulator and simulator tests, most notably:
 - Fixed implementation of minSubmitTime setting for workload
   specifications
 - Added tests for SchedulingConfigsFromPattern,
   ClusterSpecsFromPattern, WorkloadFromPattern
 - Added sample workloads, clusters and scheduling configs
 - Added tests which simulate per-pool and per-executorGroup scheduling
 - Implemented further metrics for use in simulator tests, such as a
   cluster's aggregate resources, number of preemptions and schedules
   for a given test run
 - Added optimisation to speed up simulator, whereby the scheduler skips
   the current schedule event if no eventSequences have been received
   since the previous schedule.

* Simplified TestClusterSpecsFromPattern and TestWorkloadFromPattern tests

* Removed unused test

* Fixed malformed yaml

* Improved metrics for simulations. Improved simulator tests with errorgroups.

* Removed all simulator test data except basic data necessary for testing

* Implementing CLI

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Aviral Singh <itsaviral.2609@gmail.com>
Co-authored-by: Daniel Rastelli <rastellidani@gmail.com>
Co-authored-by: Chris Martin <council_tax@hotmail.com>
Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Sarthak Negi <122533767+sarthaksarthak9@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kannon1992@gmail.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Pradeep Kurapati <113408145+Pradeep-Kurapati@users.noreply.github.com>
Co-authored-by: Dave Gantenbein <dave@gr-oss.io>
Co-authored-by: Shivang Shandilya <101946115+ShivangShandilya@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Co-authored-by: Clif Houck <me@clifhouck.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>
Co-authored-by: Kanu Mike Chibundu <michotall95@gmail.com>
Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: owenthomas17 <owen@owen-thomas.co.uk>
Co-authored-by: Albin Severinson <albin@severinson.org>
Co-authored-by: Mark Sumner <m.sumner91@hotmail.co.uk>
Co-authored-by: Rich Scott <rich@gr-oss.io>
Co-authored-by: MeenuyD <116630390+MeenuyD@users.noreply.github.com>
Co-authored-by: Aviral Singh <itsaviral.2609@gmail.com>
Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>

* Adding verbose flag to simulator CLI, changing logging context in simulator

* Improved simulator CLI output, removed redundant features, implemented parallel simulations by addressing mutability of structures inputted into the simulator

* Removed unknown logging library

* Changing threadSafeLogger Info call to Print. Adding separation back between simulation results

* Implemented stochastic runtime for jobs using a shifted exponential distribution (#13)

* Implemented stochastic runtime for jobs using a shifted exponential distribution

* Implemented min submit time from dependency completion (#14)

Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>

* Fixed tests

* Fixed implementation of shifted exponential distribution

* Using FP unrounded parameters to sample from distribution

* Modified stochastic runtime definition

* Adding logging to simulator

Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Aviral Singh <itsaviral.2609@gmail.com>
Co-authored-by: Albin Severinson <larsalbins@uberit.net>
Co-authored-by: Mustafa Ilyas <Mustafa.Ilyas@gresearch.co.uk>
Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>
Co-authored-by: Daniel Rastelli <rastellidani@gmail.com>
Co-authored-by: Chris Martin <council_tax@hotmail.com>
Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Sarthak Negi <122533767+sarthaksarthak9@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kannon1992@gmail.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Pradeep Kurapati <113408145+Pradeep-Kurapati@users.noreply.github.com>
Co-authored-by: Dave Gantenbein <dave@gr-oss.io>
Co-authored-by: Shivang Shandilya <101946115+ShivangShandilya@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Co-authored-by: Clif Houck <me@clifhouck.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>
Co-authored-by: Kanu Mike Chibundu <michotall95@gmail.com>
Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: owenthomas17 <owen@owen-thomas.co.uk>
Co-authored-by: Albin Severinson <albin@severinson.org>
Co-authored-by: Mark Sumner <m.sumner91@hotmail.co.uk>
Co-authored-by: Rich Scott <rich@gr-oss.io>
Co-authored-by: MeenuyD <116630390+MeenuyD@users.noreply.github.com>
Co-authored-by: Aviral Singh <itsaviral.2609@gmail.com>
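
The simulator commits above mention sampling job runtimes from a shifted exponential distribution. A minimal sketch of that idea follows. The function name and the `minimum_seconds` and `tail_mean_seconds` parameters are hypothetical and are not taken from the simulator's code or config; the sketch only shows the distribution itself, a fixed minimum plus an exponentially distributed tail, with float parameters left unrounded.

```python
import random


def sample_shifted_exponential(
    rng: random.Random, minimum_seconds: float, tail_mean_seconds: float
) -> float:
    """Return minimum_seconds plus an exponential tail with the given mean.

    Hypothetical helper for illustration only.
    """
    if tail_mean_seconds <= 0:
        # A non-positive tail mean degenerates to a deterministic runtime.
        return minimum_seconds
    # random.expovariate takes the rate (1 / mean), not the mean.
    return minimum_seconds + rng.expovariate(1.0 / tail_mean_seconds)


rng = random.Random(0)  # seeded so repeated runs draw the same samples
samples = [sample_shifted_exponential(rng, 60.0, 30.0) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

Every sample is at least `minimum_seconds`, and the sample mean approaches `minimum_seconds + tail_mean_seconds` as the number of samples grows.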

* Add missing brace

* Lint

* Lint

* Lint

* Cleanup

* Testsuite improvements

* Lint

* Tidying

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Aviral Singh <itsaviral.2609@gmail.com>
Co-authored-by: Albin Severinson <Albin.Severinson@gresearch.co.uk>
Co-authored-by: Albin Severinson <larsalbins@uberit.net>
Co-authored-by: Mustafa Ilyas <Mustafa.Ilyas@gresearch.co.uk>
Co-authored-by: Mustafa Ilyas <mustafai@uberit.net>
Co-authored-by: Daniel Rastelli <rastellidani@gmail.com>
Co-authored-by: Chris Martin <council_tax@hotmail.com>
Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Sarthak Negi <122533767+sarthaksarthak9@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kannon1992@gmail.com>
Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com>
Co-authored-by: Pradeep Kurapati <113408145+Pradeep-Kurapati@users.noreply.github.com>
Co-authored-by: Dave Gantenbein <dave@gr-oss.io>
Co-authored-by: Shivang Shandilya <101946115+ShivangShandilya@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Co-authored-by: Clif Houck <me@clifhouck.com>
Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com>
Co-authored-by: Kanu Mike Chibundu <michotall95@gmail.com>
Co-authored-by: snyk-bot <snyk-bot@snyk.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com>
Co-authored-by: owenthomas17 <owen@owen-thomas.co.uk>
Co-authored-by: Mark Sumner <m.sumner91@hotmail.co.uk>
Co-authored-by: Rich Scott <rich@gr-oss.io>
Co-authored-by: MeenuyD <116630390+MeenuyD@users.noreply.github.com>
Co-authored-by: Aviral Singh <itsaviral.2609@gmail.com>
svc-gh-ghzonetrans-p pushed a commit that referenced this pull request Oct 17, 2024
* Move PulsarConfig into common/config (#217) (#3907)

* ARMADA-2848 Move PulsarConfig into commonconfig

* Update test name TestValidateHasJobSetID->Id

* Revert unintended changes to yarn.lock file

* fix import order

Co-authored-by: Eleanor Pratt <Eleanor.Pratt@gresearch.co.uk>

(cherry picked from commit 35cb59f)
Signed-off-by: mustaily891 <mustafa.ilyas@gresearch.co.uk>

* Adding ControlPlaneEventsTopic to pulsar config

* Evolving ControlPlaneEvents message structure

We've decided on a parent/wrapper message for the ControlPlaneEvents to avoid passing around ambiguous proto.Message slices in the Publisher and Ingester.

* Setting maxAllowedMessageSize to correct value in relevant tests

* Removing reason for uncordon requests to the executor service

* Moving event creation time to parent Control Plane Event, modifying executor service rpcs to reflect the events being published, changing pulsar message keys to hard-coded strings rather than proto name

* Renaming UpdateExecutorSettings rpc to UpsertExecutorSettings

* Removing message keys from ControlPlaneEvent messages, reverting method name changes

* Renaming LimitEventSequencesByteSize

* Adding executor cordoning functionality to armadactl

* Renaming ControlPlaneEvent to Event

* Simplifying executor cordoning code

* More sane checks on UpsertExecutorSettings rpc, better error messages

* Typo

* Updated command descriptions for executor cordoning and uncordoning

* Separating executor service args from controlplaneevents

* Executor Service #2 (#254)

* Generalising common ingestion pipeline

* Removing unused config

* Amending comments and variable names in common ingestion pipeline to be more event agnostic

* Returning to original metric name, denoting ingested event type via label rather than metric name

* Import ordering

* Generalising pulsar publisher

* Executor Service #3 (#255)

* Modifying SchedulerIngester to ingest control plane events, creating executor settings table and associated plumbing

* Simplifying dbops merge for controlplanevents

* Moving DBOperation scoping into schedulerdb

* Adding GetOperation method to DBOperation, determining locking using this

* Executor Service #4 (#257)

* Implementing cluster cordoning in scheduler

* Filter executors from previous filter result

* Adding default value for queue label when publishing controlplaneevent metrics

---------

Signed-off-by: mustaily891 <mustafa.ilyas@gresearch.co.uk>
Co-authored-by: Eleanor Pratt <Eleanor.Pratt@gresearch.co.uk>
MustafaI added a commit that referenced this pull request Oct 17, 2024
* Move PulsarConfig into common/config (#217) (#3907)

* ARMADA-2848 Move PulsarConfig into commonconfig

* Update test name TestValidateHasJobSetID->Id

* Revert unintended changes to yarn.lock file

* fix import order



(cherry picked from commit 35cb59f)


* Adding ControlPlaneEventsTopic to pulsar config

* Evolving ControlPlaneEvents message structure

We've decided on a parent/wrapper message for the ControlPlaneEvents to avoid passing around ambiguous proto.Message slices in the Publisher and Ingester.

* Setting maxAllowedMessageSize to correct value in relevant tests

* Removing reason for uncordon requests to the executor service

* Moving event creation time to parent Control Plane Event, modifying executor service rpcs to reflect the events being published, changing pulsar message keys to hard-coded strings rather than proto name

* Renaming UpdateExecutorSettings rpc to UpsertExecutorSettings

* Removing message keys from ControlPlaneEvent messages, reverting method name changes

* Renaming LimitEventSequencesByteSize

* Adding executor cordoning functionality to armadactl

* Renaming ControlPlaneEvent to Event

* Simplifying executor cordoning code

* More sane checks on UpsertExecutorSettings rpc, better error messages

* Typo

* Updated command descriptions for executor cordoning and uncordoning

* Separating executor service args from controlplaneevents

* Executor Service #2 (#254)

* Generalising common ingestion pipeline

* Removing unused config

* Amending comments and variable names in common ingestion pipeline to be more event agnostic

* Returning to original metric name, denoting ingested event type via label rather than metric name

* Import ordering

* Generalising pulsar publisher

* Executor Service #3 (#255)

* Modifying SchedulerIngester to ingest control plane events, creating executor settings table and associated plumbing

* Simplifying dbops merge for controlplanevents

* Moving DBOperation scoping into schedulerdb

* Adding GetOperation method to DBOperation, determining locking using this

* Executor Service #4 (#257)

* Implementing cluster cordoning in scheduler

* Filter executors from previous filter result

* Adding default value for queue label when publishing controlplaneevent metrics

---------

Signed-off-by: mustaily891 <mustafa.ilyas@gresearch.co.uk>
Co-authored-by: Mustafa Ilyas <Mustafa.Ilyas@gresearch.co.uk>
Co-authored-by: Eleanor Pratt <Eleanor.Pratt@gresearch.co.uk>