Merge main into feature branch #452
Merged
RyaliNvidia merged 5 commits into feature/PROJ-148-auth-rework, Feb 19, 2026
* Sync main into feature/PROJ-148-user-mapping (#375)
* fix: update extraArgs to render string properly (#362)
* Remove auth router in agent service (#371)
* Various fixes to stabilize GitHub actions on self-hosted nodes (#366)
* Add subagents to help debug CI
* Pin digests in Github actions
* Add safe Bazel and workspace cleanup to ci-internal
Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk.
Changes:
- Add Bazel cache cleanup to prevent unbounded local cache growth
- Add workspace cleanup for pytest and temporary files
- Keep concurrency control per PR
- Rely on Testcontainers for Docker resource cleanup
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add 30-minute timeout to ci-internal job
Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Use bazel clean --expunge to prevent unbounded cache growth
Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add resource limits to Docker-in-Docker service
Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Tune Bazel CI config
* Testcontainers resource limit
* Clean up testcontainers networkedcontainer list
* Shutdown bazel at the end of job
* Close docker client in test utils
* SandboxedWorker shutdown in tests
* Add docker clean up
* Add clean up
* Add node dep
* Add docker deps
* Use the right image
* Tune bazel in CI
* Remove golang.org/x/crypto from root module
* Pylint suppress
* Fix redis closure in tests
* Fix jinja_sandbox test
* Clean up in jinja_sandbox test
* Fix jinja_sandbox test and lint
* Enhance cleanup
* Fix pr-checks yaml
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #294 #295 - Add user table and user-role mapping (#373)
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #339 - Add documentation for creating PAT and service accounts (#395)
* Rename pat (#428)
* #403 - Set optional default admin during service creation + fix PAT wording (#404)
* use base64 for access tokens (#437)
* Merge main into feature (#441)
* Support Non-AWS S3 storage without environment variables (#421)
* Add override_url to data credentials and remove cache_config
* Update local run and quickstart
* Make sure local dev works
* Revert unnecessary changes
* Update from workflows/ to cookbook/ (#419)
* Update from workflows/ to cookbook/
* Update README
* Update links from workflows/ to cookbook/
* Update paths
* Update README
* Revert to pass link test
* Update links to cookbook instead of workflows in Github (#432)
* Update links to cookbook instead of workflows in Github
* Update more links
---------
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* Modify profile list api to show role info (#442)
---------
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
tdewanNvidia approved these changes on Feb 18, 2026
fernandol-nvidia approved these changes on Feb 18, 2026
723f339 into feature/PROJ-148-auth-rework
29 of 40 checks passed
RyaliNvidia added a commit that referenced this pull request on Feb 20, 2026
* fix: pass node_condition_prefix to backend-worker deployment (#448)
* #148 - Add user mapping into OSMO (#418)
* Sync main into feature/PROJ-148-user-mapping (#375)
* fix: update extraArgs to render string properly (#362)
* Remove auth router in agent service (#371)
* Various fixes to stabilize GitHub actions on self-hosted nodes (#366)
* Add subagents to help debug CI
* Pin digests in Github actions
* Add safe Bazel and workspace cleanup to ci-internal
Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk.
Changes:
- Add Bazel cache cleanup to prevent unbounded local cache growth
- Add workspace cleanup for pytest and temporary files
- Keep concurrency control per PR
- Rely on Testcontainers for Docker resource cleanup
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add 30-minute timeout to ci-internal job
Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Use bazel clean --expunge to prevent unbounded cache growth
Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add resource limits to Docker-in-Docker service
Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Tune Bazel CI config
* Testcontainers resource limit
* Clean up testcontainers networkedcontainer list
* Shutdown bazel at the end of job
* Close docker client in test utils
* SandboxedWorker shutdown in tests
* Add docker clean up
* Add clean up
* Add node dep
* Add docker deps
* Use the right image
* Tune bazel in CI
* Remove golang.org/x/crypto from root module
* Pylint suppress
* Fix redis closure in tests
* Fix jinja_sandbox test
* Clean up in jinja_sandbox test
* Fix jinja_sandbox test and lint
* Enhance cleanup
* Fix pr-checks yaml
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #294 #295 - Add user table and user-role mapping (#373)
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #339 - Add documentation for creating PAT and service accounts (#395)
* Rename pat (#428)
* #403 - Set optional default admin during service creation + fix PAT wording (#404)
* use base64 for access tokens (#437)
* Merge main into feature (#441)
* Support Non-AWS S3 storage without environment variables (#421)
* Add override_url to data credentials and remove cache_config
* Update local run and quickstart
* Make sure local dev works
* Revert unnecessary changes
* Update from workflows/ to cookbook/ (#419)
* Update from workflows/ to cookbook/
* Update README
* Update links from workflows/ to cookbook/
* Update paths
* Update README
* Revert to pass link test
* Update links to cookbook instead of workflows in Github (#432)
* Update links to cookbook instead of workflows in Github
* Update more links
---------
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* Modify profile list api to show role info (#442)
---------
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* Revert "Remove auth router in agent service (#371)" (#390)
This reverts commit fac90f2.
* Merge main into feature
---------
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
RyaliNvidia added a commit that referenced this pull request on Feb 20, 2026
* Sync main into feature/PROJ-148-auth-rework (#258) * allow flexible squid proxy replicas (#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * #244 - Use last login url if url is not specified (#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (#249) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> * * Add authz sidecar service with Go implementation This commit adds the authorization sidecar service including: - Go-based authz server implementing Envoy External Authorization - PostgreSQL client for role/policy storage - Role caching for performance optimization - Action registry for path-to-action mapping - Comprehensive test suite - Python test service for integration testing - Documentation and quickstart guide * * Begin resource action model * Server validates both legacy and new * Update logic for action registry * Sync main into feature/PROJ-148-auth-rework (#298) * allow flexible squid proxy replicas (#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements * 
Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * #244 - Use last login url if url is not specified (#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (#249) * sync-feature-branches: fix no conflict case, allow single branch to be synced (#252) * Fix sync-feature-branches with no merge conflicts * Allow a single branch to be specified for sync-feature-branches * Perform operations as OSMO CI Bot * Add external label when the PR is created * extract issue number * add test cases (#247) * Allow PR checks to run on release branches (#264) * Database Pooling in Postgres Singleton Across Services (#251) * Initial commit for database pooling * Update set_session * Fix lint * Update PostgresConnector to have semaphor to control connections * Lint fix * Fix number of maxconn for test * Address comments * Add Go Postgres utils (#272) * #148 - Auth Project Design Documents (#165) * add args to postgres (#282) * #267 - cloud deployment scripts (#268) * script to create azure resources and deploy * Remove auto-generated values files from tracking - Added .gitignore to ignore values/, *.env files - Removed values/*.yaml files from git (auto-generated during deployment) * add aws script * add aws script * add copyright * update copyright * Support for Azure workload identity in AKS and Arc clusters (#141) * feat(src): add Azure service account and extra pod labels configuration - implement service account creation with customizable name and annotations - enhance service templates to support extra pod labels for various services - update Azure backend to utilize DefaultAzureCredential for authentication - add tests for Azure credential extraction and client creation * feat(src): extract account key from connection string for Azure Blob Storage - 
add function to extract AccountKey from connection string - update AzureBlobStorageClient to handle different credential types * feat(test): add tests for account key extraction from Azure connection strings * chore: clean up linting issues for tests * refactor(src): update data credential types in PostgresConnector and TaskGroup - change StaticDataCredential to DataCredential in get_all_data_creds method - update fetch_creds function signature to use DataCredential * feat(src): update Azure client creation to include storage account and account URL - remove deprecated storage account extraction function - modify create_client to accept storage_account and account_url parameters - update AzureBlobStorageClientFactory to use new parameters - adjust tests to reflect changes in client creation 🔒 - Generated by Copilot * refactor(src): mark storage_account parameter as unused in create_client function 🔧 - Generated by Copilot * refactor(src): remove unused storage_account parameter from client creation 🔧 - Generated by Copilot * Fix conflicts --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com> Co-authored-by: xutongNV <xutongr@nvidia.com> Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com> * Remove action permissions from pool config (#307) * Sync main into feature/PROJ-148-auth-rework (#322) * allow flexible squid proxy replicas (#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define 
semaphor and storage client executor count * #244 - Use last login url if url is not specified (#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (#249) * sync-feature-branches: fix no conflict case, allow single branch to be synced (#252) * Fix sync-feature-branches with no merge conflicts * Allow a single branch to be specified for sync-feature-branches * Perform operations as OSMO CI Bot * Add external label when the PR is created * extract issue number * add test cases (#247) * Allow PR checks to run on release branches (#264) * Database Pooling in Postgres Singleton Across Services (#251) * Initial commit for database pooling * Update set_session * Fix lint * Update PostgresConnector to have semaphor to control connections * Lint fix * Fix number of maxconn for test * Address comments * Add Go Postgres utils (#272) * #148 - Auth Project Design Documents (#165) * add args to postgres (#282) * #267 - cloud deployment scripts (#268) * script to create azure resources and deploy * Remove auto-generated values files from tracking - Added .gitignore to ignore values/, *.env files - Removed values/*.yaml files from git (auto-generated during deployment) * add aws script * add aws script * add copyright * update copyright * Support for Azure workload identity in AKS and Arc clusters (#141) * feat(src): add Azure service account and extra pod labels configuration - implement service account creation with customizable name and annotations - enhance service templates to support extra pod labels for various services - update Azure backend to utilize DefaultAzureCredential for authentication - add tests for Azure credential extraction and client creation * feat(src): extract account key from connection string for Azure Blob Storage - add function to extract AccountKey from connection string - update 
AzureBlobStorageClient to handle different credential types * feat(test): add tests for account key extraction from Azure connection strings * chore: clean up linting issues for tests * refactor(src): update data credential types in PostgresConnector and TaskGroup - change StaticDataCredential to DataCredential in get_all_data_creds method - update fetch_creds function signature to use DataCredential * feat(src): update Azure client creation to include storage account and account URL - remove deprecated storage account extraction function - modify create_client to accept storage_account and account_url parameters - update AzureBlobStorageClientFactory to use new parameters - adjust tests to reflect changes in client creation 🔒 - Generated by Copilot * refactor(src): mark storage_account parameter as unused in create_client function 🔧 - Generated by Copilot * refactor(src): remove unused storage_account parameter from client creation 🔧 - Generated by Copilot * Add new project proposal to describe nvlink + topology aware scheduling (#211) * Add new project proposal to describe nvlink + topology aware scheduling * Split design into two docs * Finish docs and add some updates from feedback * Add some open items * OSMO-6044: Application error when closing Task Details after switching Events view from Task to Workflow (#315) * add redis utlis, update postgres utils (#313) * add redis utlis, update postgres utils * add deps * Fix missing seperator in the test runner roles (#320) * fix * remove * fix --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com> Co-authored-by: xutongNV <xutongr@nvidia.com> Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com> Co-authored-by: ecolternv <ecolter@nvidia.com> Co-authored-by: tdewanNvidia <tdewan@nvidia.com> * 
Connect envoy with authz sidecar (#319) * Connect the authz sidecar to envoy * update sidecar * fix typo * add extra env * uncomment * #290 - Add attribute fetching for workflow pool matching (#338) * update * fix * fix * Merge main into feature branch (#452) * fix: pass node_condition_prefix to backend-worker deployment (#448) * #148 - Add user mapping into OSMO (#418) * Sync main into feature/PROJ-148-user-mapping (#375) * fix: update extraArgs to render string properly (#362) * Remove auth router in agent service (#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk. Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. 
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * #294 #295 - Add user table and user-role mapping (#373) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * #339 - Add documentation for creating PAT and service accounts (#395) * Rename pat (#428) * #403 - Set optional default admin during service creation + fix PAT wording (#404) * use base64 for access tokens (#437) * Merge main into feature (#441) * Support Non-AWS S3 storage without environment variables (#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from 
workflows/ to cookbook/ (#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (#432) * Update links to cookbook instead of workflows in Github * Update more links --------- Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Modify profile list api to show role info (#442) --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Revert "Remove auth router in agent service (#371)" (#390) This reverts commit fac90f2. * Merge main into feature --------- Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * #407 - Authz sidecar sends accessible pools to service (#455) * update cache specification * lint * authz sidecar sends info to service * Add logging to the go code * comments * Merge main into feature (#463) * fix: pass node_condition_prefix to backend-worker deployment (#448) * #148 - Add user mapping into OSMO (#418) * Sync main into feature/PROJ-148-user-mapping (#375) * fix: update extraArgs to render string properly (#362) * Remove auth router in agent service (#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. 
Testcontainers handles Docker resource cleanup automatically via ryuk. Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. 
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * #294 #295 - Add user table and user-role mapping (#373) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * #339 - Add documentation for creating PAT and service accounts (#395) * Rename pat (#428) * #403 - Set optional default admin during service creation + fix PAT wording (#404) * use base64 for access tokens (#437) * Merge main into feature (#441) * Support Non-AWS S3 storage without environment variables (#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from workflows/ to cookbook/ (#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (#432) * Update links to cookbook instead of workflows in Github * Update more links --------- Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: 
ethany-nv <ethany@nvidia.com> * Modify profile list api to show role info (#442) --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Add override_url when forwarding default_credential to StaticDataCredential (#444) * Oauth2 proxy design (#436) * design for oauth2 proxy * format * update design after POC * update format * update design per envoy / UI changes * #356 - Group Template Implementation (#454) - Add group templates as a new type of config - Allow group templates to be assigned to pools - When a workflow is submitted, group templates will be instantiated per group, and when it is completed, they will be destroyed - Cleanup logic for creating kubernetes objects in the backend by using kubernetes dynamic client. * Client install location is determined by service (#447) * Revert "Remove auth router in agent service (#371)" (#390) This reverts commit fac90f2. * dupe * remove --------- Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: ecolternv <ecolter@nvidia.com> * Revert "Remove auth router in agent service (#371)" (#390) This reverts commit fac90f2. * Add new fields in the envoy logs (#466) * Revert "#356 - Group Template Implementation (#454)" (#459) This reverts commit a483a88. 
* fix --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com> Co-authored-by: xutongNV <xutongr@nvidia.com> Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com> Co-authored-by: ecolternv <ecolter@nvidia.com> Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
xutongNV added a commit that referenced this pull request on Feb 24, 2026
* Update the wording re: creating feature branches (#204)
* Add a link back to OSMO from the brev launchable (#205)
* Improve styling for badges in the brev launchable readme (#207)
* Fix osmo config pool update payload in backend installation docs (#210)
* Fix osmo config pool update payload in practical guide (#213)
* #147 - backend operator redesign doc (#149)
* backend operator redesign doc
* 195 - Bump quick-start version due to updated dependencies (#217)
* Perform Client Side Data Auth Check In the Event of Environment Based Auth (#177)
* Data/Dataset Auth Check CLIs
* Remove auth check from data service
* Use auth check CLIs in ctrl
* Add exit code to docs
* Fix build issues
* Fix lint
* Ctrl to use user config when validating data auth
* Use the correct CLI argument type
* Fix lint
* Use profile when looking up data credential from config
* Update quick start installation to always install latest version (#218)
* Add workflow to label external issues and pull requests (#222)
* Add workflow to label external issues and pull requests
* pin to allowed action version
* add reopened event
* allow flexible squid proxy replicas (#241)
* allow flexible squid proxy replicas
* fix
* Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167)
* Improving Performance for Uploading Workflow Artifacts in Worker Jobs
* Cleanup
* Add progress writing after upload
* Add dependency in Bazel BUILD
* Add type to mypy requirements
* Update mypy requirements
* Add to mypy_cli BUILD
* Fix lint
* Comment
* Use constant to define semaphore and storage client executor count
* #244 - Use last login url if url is not specified (#245)
* Use last login url if url is not specified
* print message
* Cannot select any text inside modals or slideouts (#248)
* Video html element not changing when selecting different video files in the UI for OSMO dataset (#249)
* sync-feature-branches: fix no conflict case, allow single branch to be synced (#252)
* Fix sync-feature-branches with no merge conflicts
* Allow a single branch to be specified for sync-feature-branches
* Perform operations as OSMO CI Bot
* Add external label when the PR is created
* extract issue number
* add test cases (#247)
* Allow PR checks to run on release branches (#264)
* Database Pooling in Postgres Singleton Across Services (#251)
* Initial commit for database pooling
* Update set_session
* Fix lint
* Update PostgresConnector to have semaphore to control connections
* Lint fix
* Fix number of maxconn for test
* Address comments
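The semaphore-controlled pooling named in the #251 commits above can be pictured with a minimal sketch. It is not the PostgresConnector code: `BoundedConnections`, `connect`, and `maxconn` are hypothetical names, and the connection factory is a stand-in for a real database driver.

```python
import threading
from contextlib import contextmanager


class BoundedConnections:
    """Hypothetical pool: a semaphore caps how many connections are in use at once."""

    def __init__(self, connect, maxconn):
        self._connect = connect  # assumed factory returning a new connection
        self._slots = threading.BoundedSemaphore(maxconn)
        self._idle = []  # connections waiting to be reused
        self._lock = threading.Lock()

    @contextmanager
    def connection(self, timeout=None):
        # Block (up to `timeout` seconds) until one of the maxconn slots is free.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free connection slot")
        try:
            with self._lock:
                conn = self._idle.pop() if self._idle else self._connect()
            try:
                yield conn
            finally:
                # Return the connection for reuse before freeing the slot.
                with self._lock:
                    self._idle.append(conn)
        finally:
            self._slots.release()


# Usage with a stand-in factory; a real service would pass its driver's connect call.
pool = BoundedConnections(connect=object, maxconn=2)
with pool.connection() as first, pool.connection() as second:
    pass  # a third concurrent caller would wait here until a slot is released
```

Keeping the limit in a semaphore rather than in each caller means every service that shares the singleton sees the same cap.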
* Add Go Postgres utils (#272)
* #148 - Auth Project Design Documents (#165)
* add args to postgres (#282)
* #267 - cloud deployment scripts (#268)
* script to create azure resources and deploy
* Remove auto-generated values files from tracking
- Added .gitignore to ignore values/, *.env files
- Removed values/*.yaml files from git (auto-generated during deployment)
* add aws script
* add aws script
* add copyright
* update copyright
* Support for Azure workload identity in AKS and Arc clusters (#141)
* feat(src): add Azure service account and extra pod labels configuration
- implement service account creation with customizable name and annotations
- enhance service templates to support extra pod labels for various services
- update Azure backend to utilize DefaultAzureCredential for authentication
- add tests for Azure credential extraction and client creation
* feat(src): extract account key from connection string for Azure Blob Storage
- add function to extract AccountKey from connection string
- update AzureBlobStorageClient to handle different credential types
* feat(test): add tests for account key extraction from Azure connection strings
* chore: clean up linting issues for tests
* refactor(src): update data credential types in PostgresConnector and TaskGroup
- change StaticDataCredential to DataCredential in get_all_data_creds method
- update fetch_creds function signature to use DataCredential
* feat(src): update Azure client creation to include storage account and account URL
- remove deprecated storage account extraction function
- modify create_client to accept storage_account and account_url parameters
- update AzureBlobStorageClientFactory to use new parameters
- adjust tests to reflect changes in client creation
🔒 - Generated by Copilot
* refactor(src): mark storage_account parameter as unused in create_client function
🔧 - Generated by Copilot
* refactor(src): remove unused storage_account parameter from client creation
🔧 - Generated by Copilot
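The account-key extraction mentioned in the #141 commits above can be sketched as follows. This is an illustration under assumptions, not the function added in that PR: `extract_account_key` is a hypothetical name, and the sample string uses a fake account and key in the usual `Name=Value;` connection-string layout.

```python
from typing import Optional


def extract_account_key(connection_string: str) -> Optional[str]:
    """Return the AccountKey value from a connection string, or None if absent."""
    for segment in connection_string.split(";"):
        # Split on the first '=' only: base64 keys usually end in '=' padding.
        name, sep, value = segment.strip().partition("=")
        if sep and name == "AccountKey":
            return value or None
    return None


# Fake values for illustration only.
example = (
    "DefaultEndpointsProtocol=https;AccountName=demoaccount;"
    "AccountKey=ZmFrZS1rZXk=;EndpointSuffix=core.windows.net"
)
print(extract_account_key(example))  # ZmFrZS1rZXk=
```

Returning `None` when no key is present lets a caller fall back to another credential type, which is one way to read the "handle different credential types" item above.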
* Add new project proposal to describe nvlink + topology aware scheduling (#211)
* Add new project proposal to describe nvlink + topology aware scheduling
* Split design into two docs
* Finish docs and add some updates from feedback
* Add some open items
* OSMO-6044: Application error when closing Task Details after switching Events view from Task to Workflow (#315)
* add redis utils, update postgres utils (#313)
* add redis utils, update postgres utils
* add deps
* Fix missing separator in the test runner roles (#320)
* show backend name in scheduler validation error message (#323)
* #220 - Design documentation for dynamic subpool (#221)
* Initial design spike for dynamic subpool
* Add more context to design
* Address feedback
* fix: update extraArgs to render string properly (#362)
* Remove auth router in agent service (#371)
* Various fixes to stabilize GitHub actions on self-hosted nodes (#366)
* Add subagents to help debug CI
* Pin digests in Github actions
* Add safe Bazel and workspace cleanup to ci-internal
Add filesystem cleanup steps that don't interfere with concurrent jobs.
Testcontainers handles Docker resource cleanup automatically via ryuk.
Changes:
- Add Bazel cache cleanup to prevent unbounded local cache growth
- Add workspace cleanup for pytest and temporary files
- Keep concurrency control per PR
- Rely on Testcontainers for Docker resource cleanup
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add 30-minute timeout to ci-internal job
Prevent jobs from running indefinitely if tests hang. 30 minutes
provides sufficient time for normal test execution while ensuring
hung jobs don't block the runner.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Use bazel clean --expunge to prevent unbounded cache growth
Change from 'bazel clean' to 'bazel clean --expunge' to remove
repository cache in addition to build outputs. This prevents
unbounded growth of external dependencies on self-hosted runners.
Uses synchronous --expunge (not --expunge_async) since we're in a
Docker container that will terminate after the job completes.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add resource limits to Docker-in-Docker service
Limit DinD service to 4GB memory and 2 CPUs to prevent runaway
resource consumption and OOM conditions on self-hosted runners.
These limits provide sufficient resources for Testcontainers while
leaving headroom for the main job container and preventing memory
leaks from exhausting runner resources.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Tune Bazel CI config
* Testcontainers resource limit
* Clean up testcontainers networkedcontainer list
* Shutdown bazel at the end of job
* Close docker client in test utils
* SandboxedWorker shutdown in tests
* Add docker clean up
* Add clean up
* Add node dep
* Add docker deps
* Use the right image
* Tune bazel in CI
* Remove golang.org/x/crypto from root module
* Pylint suppress
* Fix redis closure in tests
* Fix jinja_sandbox test
* Clean up in jinja_sandbox test
* Fix jinja_sandbox test and lint
* Enhance cleanup
* Fix pr-checks yaml
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* Fix swift region check (#360)
* Add Python and Golang Expert Subagents (#382)
* Remove dataset version retention policy (#372)
* Detect non-aws S3 compatible endpoints when validating data_auth (#385)
* Handle non-aws s3-compatible data-auth
* Remove OSMO_SKIP_DATA_AUTH from quickstart
* Method rename
* Fix tests
* add envoy lua filter to refresh id_token (#388)
* add envoy lua filter for id_token refresh
* add filter to service and router
* add validate token
* only remove auth cookies
* allow user configuration
* Revert "Remove auth router in agent service (#371)" (#390)
This reverts commit fac90f2b799ee9d11b77b3e8e81144abc7b8f4cd.
* Increase Backend Worker Max Message Size (#391)
* remove router refresh filter (#393)
* Improve Concurrent Log Upload (#394)
* Improve Concurrent Log Upload
* Fix lint
* Remove fsync
* Sphinx multiversion should prefer remote (#405)
* Compress Backend Job when Service sends to Backend Worker (#398)
* Compress Backend Job when Service sends to Backend Worker
* Fix lint
* Fix lint
* address comments
* Service Config History and Editor (#406)
* Service Config History and Editor
* Add yup and react-hook-forms
* Prevent destruction of pending async task (#412)
* Fix node level clean up in CI (#417)
* Physical AI Workflow Series: Nut Pouring (#230)
* Physical AI Workflow Series: Nut Pouring
* Update copyright and whitespaces
* Remove internal
* Update mimic generation workflow with Swift storage and file injection
- Add nutpour_gr1t2_base_env_cfg.py for GR1T2 nut pouring task
- Configure Swift input/output storage for datasets
- Add MimicGen dataset generation command with 1000 trials
- Inject custom environment config via file injection
* Minor cleanup
* Add starting dataset to instructions
* Update top level readme
* Add copyright
* add copyright
---------
Co-authored-by: Saurav Nanda <sauravn@nvidia.com>
* Fix ConfigMap name mismatch in backend test CronJob template (#422)
The CronJob volume was referencing the ConfigMap as
'{{backend_name}}-{{test_name}}-config' but the actual ConfigMap is
created with the name '{{configmap_name}}' (which is '{test_name}-config').
This mismatch would cause the CronJob to fail to mount the ConfigMap.
Use the configmap_name template variable that is already passed to the
Jinja context to ensure consistency.
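The mismatch described in the #422 message above can be reproduced with a small sketch. The `render` helper is a tiny stand-in for Jinja rendering, and the values `backend-a` and `smoke` are hypothetical; only the two template strings and the `configmap_name` variable come from the commit message.

```python
def render(template: str, context: dict) -> str:
    """Stand-in for Jinja rendering: replaces {{name}} placeholders only."""
    for key, value in context.items():
        template = template.replace("{{" + key + "}}", value)
    return template


test_name = "smoke"  # hypothetical test name
context = {
    "backend_name": "backend-a",  # hypothetical backend name
    "test_name": test_name,
    "configmap_name": f"{test_name}-config",  # name the ConfigMap is created with
}

created_name = context["configmap_name"]
old_volume_ref = render("{{backend_name}}-{{test_name}}-config", context)
new_volume_ref = render("{{configmap_name}}", context)

print(old_volume_ref == created_name)  # False: the volume points at a name that was never created
print(new_volume_ref == created_name)  # True: both sides use the same variable
```

Deriving the volume reference from the same variable that names the ConfigMap removes the second place where the name could drift.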
* Support Non-AWS S3 storage without environment variables (#421)
* Add override_url to data credentials and remove cache_config
* Update local run and quickstart
* Make sure local dev works
* Revert unnecessary changes
* Update from workflows/ to cookbook/ (#419)
* Update from workflows/ to cookbook/
* Update README
* Update links from workflows/ to cookbook/
* Update paths
* Update README
* Revert to pass link test
* Update links to cookbook instead of workflows in Github (#432)
* Update links to cookbook instead of workflows in Github
* Update more links
* fix: pass node_condition_prefix to backend-worker deployment (#448)
* #148 - Add user mapping into OSMO (#418)
* Sync main into feature/PROJ-148-user-mapping (#375)
* fix: update extraArgs to render string properly (#362)
* Remove auth router in agent service (#371)
* Various fixes to stabilize GitHub actions on self-hosted nodes (#366)
* Add subagents to help debug CI
* Pin digests in Github actions
* Add safe Bazel and workspace cleanup to ci-internal
Add filesystem cleanup steps that don't interfere with concurrent jobs.
Testcontainers handles Docker resource cleanup automatically via ryuk.
Changes:
- Add Bazel cache cleanup to prevent unbounded local cache growth
- Add workspace cleanup for pytest and temporary files
- Keep concurrency control per PR
- Rely on Testcontainers for Docker resource cleanup
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add 30-minute timeout to ci-internal job
Prevent jobs from running indefinitely if tests hang. 30 minutes
provides sufficient time for normal test execution while ensuring
hung jobs don't block the runner.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Use bazel clean --expunge to prevent unbounded cache growth
Change from 'bazel clean' to 'bazel clean --expunge' to remove
repository cache in addition to build outputs. This prevents
unbounded growth of external dependencies on self-hosted runners.
Uses synchronous --expunge (not --expunge_async) since we're in a
Docker container that will terminate after the job completes.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add resource limits to Docker-in-Docker service
Limit DinD service to 4GB memory and 2 CPUs to prevent runaway
resource consumption and OOM conditions on self-hosted runners.
These limits provide sufficient resources for Testcontainers while
leaving headroom for the main job container and preventing memory
leaks from exhausting runner resources.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Tune Bazel CI config
* Testcontainers resource limit
* Clean up testcontainers networkedcontainer list
* Shutdown bazel at the end of job
* Close docker client in test utils
* SandboxedWorker shutdown in tests
* Add docker clean up
* Add clean up
* Add node dep
* Add docker deps
* Use the right image
* Tune bazel in CI
* Remove golang.org/x/crypto from root module
* Pylint suppress
* Fix redis closure in tests
* Fix jinja_sandbox test
* Clean up in jinja_sandbox test
* Fix jinja_sandbox test and lint
* Enhance cleanup
* Fix pr-checks yaml
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #294 #295 - Add user table and user-role mapping (#373)
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #339 - Add documentation for creating PAT and service accounts (#395)
* Rename pat (#428)
* #403 - Set optional default admin during service creation + fix PAT wording (#404)
* use base64 for access tokens (#437)
* Merge main into feature (#441)
* Support Non-AWS S3 storage without environment variables (#421)
* Add override_url to data credentials and remove cache_config
* Update local run and quickstart
* Make sure local dev works
* Revert unnecessary changes
* Update from workflows/ to cookbook/ (#419)
* Update from workflows/ to cookbook/
* Update README
* Update links from workflows/ to cookbook/
* Update paths
* Update README
* Revert to pass link test
* Update links to cookbook instead of workflows in Github (#432)
* Update links to cookbook instead of workflows in Github
* Update more links
---------
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* Modify profile list api to show role info (#442)
---------
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* Add override_url when forwarding default_credential to StaticDataCredential (#444)
* Oauth2 proxy design (#436)
* design for oauth2 proxy
* format
* update design after POC
* update format
* update design per envoy / UI changes
* #356 - Group Template Implementation (#454)
- Add group templates as a new type of config
- Allow group templates to be assigned to pools
- When a workflow is submitted, group templates will be instantiated per group, and when it is completed, they will be destroyed
- Cleanup logic for creating kubernetes objects in the backend by using kubernetes dynamic client.
* Client install location is determined by service (#447)
* Revert "#356 - Group Template Implementation (#454)" (#459)
This reverts commit a483a882f08f121d8e0d06d227930c093df4a6c3.
* chore: bump helm chart versions to 1.0.2 (#465)
* #148 - Add RBAC authz sidecar (#445)
* Sync main into feature/PROJ-148-auth-rework (#258)
* allow flexible squid proxy replicas (#241)
* allow flexible squid proxy replicas
* fix
* Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167)
* Improving Performance for Uploading Workflow Artifacts in Worker Jobs
* Cleanup
* Add progress writing after upload
* Add dependency in Bazel BUILD
* Add type to mypy requirements
* Update mypy requirements
* Add to mypy_cli BUILD
* Fix lint
* Comment
* Use constant to define semaphore and storage client executor count
* #244 - Use last login url if url is not specified (#245)
* Use last login url if url is not specified
* print message
* Cannot select any text inside modals or slideouts (#248)
* Video html element not changing when selecting different video files in the UI for OSMO dataset (#249)
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: patclarknvidia <patc@nvidia.com>
* Add authz sidecar service with Go implementation
This commit adds the authorization sidecar service including:
- Go-based authz server implementing Envoy External Authorization
- PostgreSQL client for role/policy storage
- Role caching for performance optimization
- Action registry for path-to-action mapping
- Comprehensive test suite
- Python test service for integration testing
- Documentation and quickstart guide
* Begin resource action model
* Server validates both legacy and new
* Update logic for action registry
* Sync main into feature/PROJ-148-auth-rework (#298)
* allow flexible squid proxy replicas (#241)
* allow flexible squid proxy replicas
* fix
* Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167)
* Improving Performance for Uploading Workflow Artifacts in Worker Jobs
* Cleanup
* Add progress writing after upload
* Add dependency in Bazel BUILD
* Add type to mypy requirements
* Update mypy requirements
* Add to mypy_cli BUILD
* Fix lint
* Comment
* Use constant to define semaphore and storage client executor count
* #244 - Use last login url if url is not specified (#245)
* Use last login url if url is not specified
* print message
* Cannot select any text inside modals or slideouts (#248)
* Video html element not changing when selecting different video files in the UI for OSMO dataset (#249)
* sync-feature-branches: fix no conflict case, allow single branch to be synced (#252)
* Fix sync-feature-branches with no merge conflicts
* Allow a single branch to be specified for sync-feature-branches
* Perform operations as OSMO CI Bot
* Add external label when the PR is created
* extract issue number
* add test cases (#247)
* Allow PR checks to run on release branches (#264)
* Database Pooling in Postgres Singleton Across Services (#251)
* Initial commit for database pooling
* Update set_session
* Fix lint
* Update PostgresConnector to have semaphore to control connections
* Lint fix
* Fix number of maxconn for test
* Address comments
* Add Go Postgres utils (#272)
* #148 - Auth Project Design Documents (#165)
* add args to postgres (#282)
* #267 - cloud deployment scripts (#268)
* script to create azure resources and deploy
* Remove auto-generated values files from tracking
- Added .gitignore to ignore values/, *.env files
- Removed values/*.yaml files from git (auto-generated during deployment)
* add aws script
* add aws script
* add copyright
* update copyright
* Support for Azure workload identity in AKS and Arc clusters (#141)
* feat(src): add Azure service account and extra pod labels configuration
- implement service account creation with customizable name and annotations
- enhance service templates to support extra pod labels for various services
- update Azure backend to utilize DefaultAzureCredential for authentication
- add tests for Azure credential extraction and client creation
* feat(src): extract account key from connection string for Azure Blob Storage
- add function to extract AccountKey from connection string
- update AzureBlobStorageClient to handle different credential types
* feat(test): add tests for account key extraction from Azure connection strings
* chore: clean up linting issues for tests
* refactor(src): update data credential types in PostgresConnector and TaskGroup
- change StaticDataCredential to DataCredential in get_all_data_creds method
- update fetch_creds function signature to use DataCredential
* feat(src): update Azure client creation to include storage account and account URL
- remove deprecated storage account extraction function
- modify create_client to accept storage_account and account_url parameters
- update AzureBlobStorageClientFactory to use new parameters
- adjust tests to reflect changes in client creation
🔒 - Generated by Copilot
* refactor(src): mark storage_account parameter as unused in create_client function
🔧 - Generated by Copilot
* refactor(src): remove unused storage_account parameter from client creation
🔧 - Generated by Copilot
* Fix conflicts
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: patclarknvidia <patc@nvidia.com>
Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com>
Co-authored-by: xutongNV <xutongr@nvidia.com>
Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com>
* Remove action permissions from pool config (#307)
* Sync main into feature/PROJ-148-auth-rework (#322)
* allow flexible squid proxy replicas (#241)
* allow flexible squid proxy replicas
* fix
* Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167)
* Improving Performance for Uploading Workflow Artifacts in Worker Jobs
* Cleanup
* Add progress writing after upload
* Add dependency in Bazel BUILD
* Add type to mypy requirements
* Update mypy requirements
* Add to mypy_cli BUILD
* Fix lint
* Comment
* Use constant to define semaphore and storage client executor count
* #244 - Use last login url if url is not specified (#245)
* Use last login url if url is not specified
* print message
* Cannot select any text inside modals or slideouts (#248)
* Video html element not changing when selecting different video files in the UI for OSMO dataset (#249)
* sync-feature-branches: fix no conflict case, allow single branch to be synced (#252)
* Fix sync-feature-branches with no merge conflicts
* Allow a single branch to be specified for sync-feature-branches
* Perform operations as OSMO CI Bot
* Add external label when the PR is created
* extract issue number
* add test cases (#247)
* Allow PR checks to run on release branches (#264)
* Database Pooling in Postgres Singleton Across Services (#251)
* Initial commit for database pooling
* Update set_session
* Fix lint
* Update PostgresConnector to have semaphore to control connections
* Lint fix
* Fix number of maxconn for test
* Address comments
* Add Go Postgres utils (#272)
* #148 - Auth Project Design Documents (#165)
* add args to postgres (#282)
* #267 - cloud deployment scripts (#268)
* script to create azure resources and deploy
* Remove auto-generated values files from tracking
- Added .gitignore to ignore values/, *.env files
- Removed values/*.yaml files from git (auto-generated during deployment)
* add aws script
* add aws script
* add copyright
* update copyright
* Support for Azure workload identity in AKS and Arc clusters (#141)
* feat(src): add Azure service account and extra pod labels configuration
- implement service account creation with customizable name and annotations
- enhance service templates to support extra pod labels for various services
- update Azure backend to utilize DefaultAzureCredential for authentication
- add tests for Azure credential extraction and client creation
* feat(src): extract account key from connection string for Azure Blob Storage
- add function to extract AccountKey from connection string
- update AzureBlobStorageClient to handle different credential types
* feat(test): add tests for account key extraction from Azure connection strings
* chore: clean up linting issues for tests
* refactor(src): update data credential types in PostgresConnector and TaskGroup
- change StaticDataCredential to DataCredential in get_all_data_creds method
- update fetch_creds function signature to use DataCredential
* feat(src): update Azure client creation to include storage account and account URL
- remove deprecated storage account extraction function
- modify create_client to accept storage_account and account_url parameters
- update AzureBlobStorageClientFactory to use new parameters
- adjust tests to reflect changes in client creation
🔒 - Generated by Copilot
* refactor(src): mark storage_account parameter as unused in create_client function
🔧 - Generated by Copilot
* refactor(src): remove unused storage_account parameter from client creation
🔧 - Generated by Copilot
* Add new project proposal to describe nvlink + topology aware scheduling (#211)
* Add new project proposal to describe nvlink + topology aware scheduling
* Split design into two docs
* Finish docs and add some updates from feedback
* Add some open items
* OSMO-6044: Application error when closing Task Details after switching Events view from Task to Workflow (#315)
* add redis utils, update postgres utils (#313)
* add redis utils, update postgres utils
* add deps
* Fix missing separator in the test runner roles (#320)
* fix
* remove
* fix
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: patclarknvidia <patc@nvidia.com>
Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com>
Co-authored-by: xutongNV <xutongr@nvidia.com>
Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com>
Co-authored-by: ecolternv <ecolter@nvidia.com>
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
* Connect envoy with authz sidecar (#319)
* Connect the authz sidecar to envoy
* update sidecar
* fix typo
* add extra env
* uncomment
* #290 - Add attribute fetching for workflow pool matching (#338)
* update
* fix
* fix
* Merge main into feature branch (#452)
* fix: pass node_condition_prefix to backend-worker deployment (#448)
* #148 - Add user mapping into OSMO (#418)
* Sync main into feature/PROJ-148-user-mapping (#375)
* fix: update extraArgs to render string properly (#362)
* Remove auth router in agent service (#371)
* Various fixes to stabilize GitHub actions on self-hosted nodes (#366)
* Add subagents to help debug CI
* Pin digests in Github actions
* Add safe Bazel and workspace cleanup to ci-internal
Add filesystem cleanup steps that don't interfere with concurrent jobs.
Testcontainers handles Docker resource cleanup automatically via ryuk.
Changes:
- Add Bazel cache cleanup to prevent unbounded local cache growth
- Add workspace cleanup for pytest and temporary files
- Keep concurrency control per PR
- Rely on Testcontainers for Docker resource cleanup
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add 30-minute timeout to ci-internal job
Prevent jobs from running indefinitely if tests hang. 30 minutes
provides sufficient time for normal test execution while ensuring
hung jobs don't block the runner.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Use bazel clean --expunge to prevent unbounded cache growth
Change from 'bazel clean' to 'bazel clean --expunge' to remove
repository cache in addition to build outputs. This prevents
unbounded growth of external dependencies on self-hosted runners.
Uses synchronous --expunge (not --expunge_async) since we're in a
Docker container that will terminate after the job completes.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add resource limits to Docker-in-Docker service
Limit DinD service to 4GB memory and 2 CPUs to prevent runaway
resource consumption and OOM conditions on self-hosted runners.
These limits provide sufficient resources for Testcontainers while
leaving headroom for the main job container and preventing memory
leaks from exhausting runner resources.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Tune Bazel CI config
* Testcontainers resource limit
* Clean up testcontainers networkedcontainer list
* Shutdown bazel at the end of job
* Close docker client in test utils
* SandboxedWorker shutdown in tests
* Add docker clean up
* Add clean up
* Add node dep
* Add docker deps
* Use the right image
* Tune bazel in CI
* Remove golang.org/x/crypto from root module
* Pylint suppress
* Fix redis closure in tests
* Fix jinja_sandbox test
* Clean up in jinja_sandbox test
* Fix jinja_sandbox test and lint
* Enhance cleanup
* Fix pr-checks yaml
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #294 #295 - Add user table and user-role mapping (#373)
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #339 - Add documentation for creating PAT and service accounts (#395)
* Rename pat (#428)
* #403 - Set optional default admin during service creation + fix PAT wording (#404)
* use base64 for access tokens (#437)
* Merge main into feature (#441)
* Support Non-AWS S3 storage without environment variables (#421)
* Add override_url to data credentials and remove cache_config
* Update local run and quickstart
* Make sure local dev works
* Revert unnecessary changes
* Update from workflows/ to cookbook/ (#419)
* Update from workflows/ to cookbook/
* Update README
* Update links from workflows/ to cookbook/
* Update paths
* Update README
* Revert to pass link test
* Update links to cookbook instead of workflows in Github (#432)
* Update links to cookbook instead of workflows in Github
* Update more links
---------
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* Modify profile list api to show role info (#442)
---------
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* Revert "Remove auth router in agent service (#371)" (#390)
This reverts commit fac90f2b799ee9d11b77b3e8e81144abc7b8f4cd.
* Merge main into feature
---------
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* #407 - Authz sidecar sends accessible pools to service (#455)
* update cache specification
* lint
* authz sidecar sends info to service
* Add logging to the go code
* comments
* Merge main into feature (#463)
* fix: pass node_condition_prefix to backend-worker deployment (#448)
* #148 - Add user mapping into OSMO (#418)
* Sync main into feature/PROJ-148-user-mapping (#375)
* fix: update extraArgs to render string properly (#362)
* Remove auth router in agent service (#371)
* Various fixes to stabilize GitHub actions on self-hosted nodes (#366)
* Add subagents to help debug CI
* Pin digests in Github actions
* Add safe Bazel and workspace cleanup to ci-internal
Add filesystem cleanup steps that don't interfere with concurrent jobs.
Testcontainers handles Docker resource cleanup automatically via ryuk.
Changes:
- Add Bazel cache cleanup to prevent unbounded local cache growth
- Add workspace cleanup for pytest and temporary files
- Keep concurrency control per PR
- Rely on Testcontainers for Docker resource cleanup
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add 30-minute timeout to ci-internal job
Prevent jobs from running indefinitely if tests hang. 30 minutes
provides sufficient time for normal test execution while ensuring
hung jobs don't block the runner.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Use bazel clean --expunge to prevent unbounded cache growth
Change from 'bazel clean' to 'bazel clean --expunge' to remove
repository cache in addition to build outputs. This prevents
unbounded growth of external dependencies on self-hosted runners.
Uses synchronous --expunge (not --expunge_async) since we're in a
Docker container that will terminate after the job completes.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add resource limits to Docker-in-Docker service
Limit DinD service to 4GB memory and 2 CPUs to prevent runaway
resource consumption and OOM conditions on self-hosted runners.
These limits provide sufficient resources for Testcontainers while
leaving headroom for the main job container and preventing memory
leaks from exhausting runner resources.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Tune Bazel CI config
* Testcontainers resource limit
* Clean up testcontainers networkedcontainer list
* Shutdown bazel at the end of job
* Close docker client in test utils
* SandboxedWorker shutdown in tests
* Add docker clean up
* Add clean up
* Add node dep
* Add docker deps
* Use the right image
* Tune bazel in CI
* Remove golang.org/x/crypto from root module
* Pylint suppress
* Fix redis closure in tests
* Fix jinja_sandbox test
* Clean up in jinja_sandbox test
* Fix jinja_sandbox test and lint
* Enhance cleanup
* Fix pr-checks yaml
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #294 #295 - Add user table and user-role mapping (#373)
---------
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* #339 - Add documentation for creating PAT and service accounts (#395)
* Rename pat (#428)
* #403 - Set optional default admin during service creation + fix PAT wording (#404)
* use base64 for access tokens (#437)
* Merge main into feature (#441)
* Support Non-AWS S3 storage without environment variables (#421)
* Add override_url to data credentials and remove cache_config
* Update local run and quickstart
* Make sure local dev works
* Revert unnecessary changes
* Update from workflows/ to cookbook/ (#419)
* Update from workflows/ to cookbook/
* Update README
* Update links from workflows/ to cookbook/
* Update paths
* Update README
* Revert to pass link test
* Update links to cookbook instead of workflows in Github (#432)
* Update links to cookbook instead of workflows in Github
* Update more links
---------
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* Modify profile list api to show role info (#442)
---------
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
* Add override_url when forwarding default_credential to StaticDataCredential (#444)
* Oauth2 proxy design (#436)
* design for oauth2 proxy
* format
* update design after POC
* update format
* update design per envoy / UI changes
* #356 - Group Template Implementation (#454)
- Add group templates as a new type of config
- Allow group templates to be assigned to pools
- When a workflow is submitted, group templates will be instantiated per group, and when it is completed, they will be destroyed
- Clean up the logic for creating Kubernetes objects in the backend by using the Kubernetes dynamic client.
* Client install location is determined by service (#447)
* Revert "Remove auth router in agent service (#371)" (#390)
This reverts commit fac90f2b799ee9d11b77b3e8e81144abc7b8f4cd.
* dupe
* remove
---------
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: ecolternv <ecolter@nvidia.com>
* Revert "Remove auth router in agent service (#371)" (#390)
This reverts commit fac90f2b799ee9d11b77b3e8e81144abc7b8f4cd.
* Add new fields in the envoy logs (#466)
* Revert "#356 - Group Template Implementation (#454)" (#459)
This reverts commit a483a882f08f121d8e0d06d227930c093df4a6c3.
* fix
---------
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: patclarknvidia <patc@nvidia.com>
Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com>
Co-authored-by: xutongNV <xutongr@nvidia.com>
Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com>
Co-authored-by: ecolternv <ecolter@nvidia.com>
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* chore: set appVersion to 6.1 across all helm charts (#468)
* chore: set appVersion to 6.1 across all helm charts
* chore: bump helm chart versions to 1.1.0
* Turn authz sidecar on by default (#471)
* Workflow API Fixes: server-side sorting, more_entries value (#467)
* Workflow API Fixes: server-side sorting, more_entries value
* Simplify
* Update CLI
* Use .extend
* Fix actions for auth (#474)
* Oauth2 proxy (#443)
* update envoy and add oauth2 proxy
* fix
* fix
* update oauth2 proxy args
* update oauth2 proxy args
* update envoy filters
* upgrade proxy to v7.14
* fix envoy checksum
* fix envoy
* update x-osmo-auth approach
* update ui to handle session
* remove oauth2 proxy sign in page
* update UI to handle auth header from proxy
* fix token
* use envoy to copy token to header
* update ui accordingly
* add user info to header
* fix
* set name in header
* fix
* Non-AWS S3 documentation (#456)
* Non-AWS S3 Documentation
* Spelling
* Shorten tab name
* Address feedback
* database schema integration (#473)
* integrate with pgroll schema version
* don't use in init tables
* go env var
* use RuntimeParams to set the search_path
* fix
* update python search path
* use flag to check initialized
* Fix osmo config set (#477)
* #206 - Nvlink support (#479)
- Implement topology aware scheduling
- Add topology aware scheduling unit tests
- Add group template implementation
- Add documentation for topology aware scheduling
- Add documentation for group templates
* Grouping Pools with Shared Nodes in the Same Nodeset (#457)
* Group Pools as Shared if Pools share at least one node with each other
* Optimize BFS
* Fix lint
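The grouping rule above (pools count as shared when they have at least one node in common) can be read as finding connected components with a breadth-first search. The sketch below is only an illustration in TypeScript: the `PoolNodes` shape and the `groupSharedPools` name are hypothetical, and it assumes sharing is transitive (if A shares a node with B and B with C, all three land in one group), which the commit message does not state explicitly.

```typescript
// Hypothetical input shape: pool name -> names of the nodes that pool can use.
type PoolNodes = Map<string, Set<string>>;

function groupSharedPools(poolNodes: PoolNodes): string[][] {
  // Invert the mapping so each node lists the pools that use it.
  const poolsByNode = new Map<string, string[]>();
  for (const [pool, nodes] of poolNodes) {
    for (const node of nodes) {
      const pools = poolsByNode.get(node) ?? [];
      pools.push(pool);
      poolsByNode.set(node, pools);
    }
  }

  const visited = new Set<string>();
  const groups: string[][] = [];
  for (const start of poolNodes.keys()) {
    if (visited.has(start)) continue;
    visited.add(start);
    const group: string[] = [];
    const queue: string[] = [start];
    // An index-based queue avoids the linear cost of Array.prototype.shift().
    for (let head = 0; head < queue.length; head++) {
      const pool = queue[head];
      group.push(pool);
      for (const node of poolNodes.get(pool) ?? []) {
        for (const neighbor of poolsByNode.get(node) ?? []) {
          if (!visited.has(neighbor)) {
            visited.add(neighbor);
            queue.push(neighbor);
          }
        }
      }
    }
    groups.push(group.sort());
  }
  return groups;
}
```

Marking a pool as visited when it is enqueued, rather than when it is dequeued, keeps each pool in the queue at most once.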
* Update python base containers (#486)
* UI Next (#476)
* feat: Add created_at column to datasets table
* fix: Use DATASET type instead of file format in mock data
* feat: Implement dataset detail page with Overview tab
* refactor: Change dataset detail URL from /datasets/[name] to /datasets/[bucket]/[name]
* feat: Add dataset detail metadata, versions tab, and improved breadcrumbs
* fix(datasets): Match workflows tab styling, fix version identification, add search, remove duplicate breadcrumb
* feat(datasets): Replace versions table with DataTable, add search/filter/columns, fix page title, update file browser to Coming Soon
* Fix workflow panel rendering
* Fix timeline and panel-tab hysteresis.
* Clean up docs
* Minor stylistic tweaks for dashboard
* My Workflows User Filter Update
* fix: buckets and breadcrumbs
* Prepare for auto-refresh by auditing our codebase for antipatterns
* Refresh infra + wiring through the pages
* Exit Code column on by default
* Add doc link
* Add hover-card
* CLI Install Hovercard
* Vertical padding between CLI/Documentation
* Group section header connector line
* Clean up workflowTasksTable
* Zebra table styling
* Use runtime env instead of buildtime env
* Fix mock user roles
* Reduce extra API calls
* No total count tracking
* Search log text
* Add server-optimized queryClient to prevent retry delays
* Simplify log streaming implementation
* Update mock to match logging
* Ensure mock handler works with streaming
* Fix infinite loop rendering
* Log Viewer presentation layer
* Timeline Axis
* Add markers and remove invalid zones
* Timeline redesign
* Remove refresh button and animation from log-viewer
* Log-viewer in workflow
* find shortcut for workflow log viewer
* Mod Key Hint Text
* Link out properly
* Log-viewer in Tasks
* Fix log streaming URL
* Tab cycle in filter-bar
* Proper shell guarding
* No autoscroll/pin on initial log-viewer load
* Add more details and cross links to workflow details
* Node cross reference in Task Details
* Proper scoping of "My Workflows" vs. "All Workflows"
* Enter to select suggestions
* My Pools vs. All Pools
* Fix version fetch
* My Pools only in Resubmit
* Add code-simplifier guidance
* Add backend TODO for events
* Events css
* Handle plain text in fetcher
* Events API Adapter
* Events mock
* Tailwind best practice guidance
* Clean up events css
* Event css and date format
* Event Viewer Initial Implementation
* Event viewer refinements
* CSS DRY-ness
* SSOT event derived states
* Refactor filter-bar tab cycle behavior
* Clean up filter-bar
* Add "failed" events derived state
* Better zebra table colors
* Event Viewer
* Mock event stream
* Event Viewer stream
* Event stream mock
* Fix panel slide in animation
* Fix event viewer
* Inferred Lifecycle for Events Viewer
* Lifecycle Timeline
* Simplify event lifecycle
* Profile Settings Refactor
* Search Input Styling
* Styling fixes for profile settings page
* Bucket fix in profile
* Credential section fixes
* Profile Setting Loading
* Remove planning docs
* Envoy debugger
* Nested children issue fix
* Passive wheel fix
* Fix hydration issues
* More hydration fixes
* Consistent error handling
* Minor UI Tweaks
* Tab panel padding tweak
* Mock Spec and Template generator
* Use backend response for spec/template fetching
* Rename jwt-utils
* Simplify auth related code
* Refresh Flow
* Accessible Pools on Dashboard
* Default workflow panel width 60%
* Hide backend column in resources table by default
* Events always expanded
* Events Lifecycle Derived State Fix
* Events Viewer Task Duration
* No double-url for spec open-new-tab
* ctrl click new tab from table
* Status -> Events Affordance
* Mod key UI label
* Streamline resubmit spec edit
* Format
* Log Viewer Select and Copy
* Log level simplification
* Fix log-viewer focus and row selection height
* Workflows Page Default User Filter
* Events table utils consolidation and expansion
* Ensure consistent query keys
* Workflow list sort order fetch
* Mod Key consolidation
* Nuqs state patch in useDefaultFilter
* Pools Default Filter
* Profile Prefetch during Pools page
* Parallelize profile call on dashboard
* fix(datasets): always pass dataset_type=DATASET in list API request
Ensures the backend returns only DATASET entries by always sending
dataset_type: DatasetType.DATASET in buildApiParams(). Previously it
was set to undefined, which could return COLLECTION entries mixed in.
Mock handler now caps count at totalDatasets to prevent generating
10,000 entries for the fetchAll path, and short-circuits with empty
response when dataset_type=COLLECTION is requested.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): disable ShowAll toggle when user: filter chip is active
When a user: chip is present in the FilterBar, the ShowAll toggle is
disabled — the chip already controls per-user filtering so the toggle
would conflict. Matches the behavior in workflows.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(datasets): remove format field entirely
The backend's type field (DATASET/COLLECTION) was incorrectly mapped to
a UI format field (parquet, arrow, etc.) — these are unrelated concepts.
Since we now always filter to dataset_type=DATASET, the field carries no
meaningful data.
Removes: Dataset.format, format column def, format size config, format
search field, format query key segment, format in buildApiParams comment,
format in GeneratedDataset and DATASET_PATTERNS.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(claude): update project instructions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(datasets): add user: filter field with live suggestions from /api/users
The user: chip was wired to the backend API but was never registered as a
search field, so it never appeared in the FilterBar dropdown. Adds an async
field backed by useUsers() with the same lazy-loading pattern as workflows
(query disabled until field accessed, then cached 5 min). Updates placeholder
to include 'user:'.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(filter-bar): support value|hint encoding for secondary suggestion text
getValues() may now return strings in "rawValue|hint" format. The suggestion
engine splits on the first "|" to use rawValue as the chip value and chip
deduplication key, while displaying hint as right-aligned secondary text.
Existing fields with plain string values are unaffected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
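A minimal sketch of the "rawValue|hint" split described above, assuming nothing beyond what the commit message says. The `Suggestion` type and `parseSuggestion` name are hypothetical; the real suggestion engine is not shown in this log.

```typescript
// Hypothetical result shape: the chip value plus optional secondary text.
interface Suggestion {
  value: string;
  hint?: string;
}

function parseSuggestion(encoded: string): Suggestion {
  // Split on the FIRST "|" only, so later pipes stay inside the hint.
  const separator = encoded.indexOf("|");
  if (separator === -1) {
    // Plain strings are unaffected: the whole string is the chip value.
    return { value: encoded };
  }
  return {
    value: encoded.slice(0, separator),
    hint: encoded.slice(separator + 1),
  };
}
```

Because only `value` would be used as the chip value and deduplication key, two suggestions that differ only in their hint would count as the same chip.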
* feat(datasets): add date range filter utilities, search fields, and client-side shim
- date-filter-utils.ts: preset ranges (today, last 7/30/90/365 days) with
value|hint encoding; parseDateRangeValue handles ISO ranges, single dates,
and preset label backward-compat
- datasets-shim.ts: applyDatasetsFiltersSync applies created_at/updated_at
date range filters client-side from the React Query cache; to be deleted
when backend supports these params (see BACKEND_TODOS.md #25)
- dataset-search-fields.ts: add created_at and updated_at search fields using
preset suggestions
- datasets-toolbar.tsx: add date fields to searchFields; update placeholder
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
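To illustrate the kind of parsing the date-filter utilities above perform, here is a hedged sketch. Everything in it is an assumption made for the example: the `parseDateRange` name, the `start..end` range syntax, the exact preset labels, UTC day boundaries, and the choice to include today in "last N days". The actual `parseDateRangeValue` in date-filter-utils.ts may differ on all of these.

```typescript
interface DateRange {
  start: Date;
  end: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Assumed preset labels; the commit only lists "today, last 7/30/90/365 days".
const PRESET_DAYS = new Map<string, number>([
  ["today", 0],
  ["last 7 days", 7],
  ["last 30 days", 30],
  ["last 90 days", 90],
  ["last 365 days", 365],
]);

function parseIsoDay(text: string): Date | null {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(text)) return null;
  const day = new Date(`${text}T00:00:00Z`);
  return Number.isNaN(day.getTime()) ? null : day;
}

function parseDateRange(value: string, now: Date): DateRange | null {
  const presetDays = PRESET_DAYS.get(value.trim().toLowerCase());
  if (presetDays !== undefined) {
    // Presets run from midnight UTC N days ago to the end of the current UTC day.
    const today = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
    return {
      start: new Date(today - presetDays * DAY_MS),
      end: new Date(today + DAY_MS - 1),
    };
  }
  // Assumed syntax: "YYYY-MM-DD..YYYY-MM-DD" for a range, "YYYY-MM-DD" for one day.
  const parts = value.split("..");
  if (parts.length > 2) return null;
  const from = parts[0] ?? "";
  const to = parts[1] ?? from;
  const start = parseIsoDay(from);
  const endDay = parseIsoDay(to);
  if (start === null || endDay === null || start > endDay) return null;
  return { start, end: new Date(endDay.getTime() + DAY_MS - 1) };
}
```

Passing `now` in as a parameter keeps the function deterministic, which makes preset handling easy to test.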
* feat(datasets): add fetch-all adapter and fix all_users overriding user filter
- fetchAllDatasets: fetches with count:10_000; server-side params (name,
bucket, user, all_users) applied at API level; date/sort handled by shim
- buildAllDatasetsQueryKey: excludes created_at/updated_at so date filter
changes don't trigger API calls — shim filters from cache
- useAllDatasets: React Query hook wrapping fetchAllDatasets with STATIC
stale time
- buildApiParams: force all_users=false when user chips are active; backend
all_users overrides the user param, breaking user: filter chips
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
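The `all_users` fix above can be sketched roughly as follows. The parameter names (`dataset_type`, `count`, `all_users`, `user`, `name`, `bucket`) come from the commit messages in this log; the `DatasetFilterState` and `DatasetListParams` shapes are hypothetical, and the real `buildApiParams` may be structured differently.

```typescript
// Hypothetical UI-side filter state.
interface DatasetFilterState {
  userChips: string[];
  showAllUsers: boolean;
  name?: string;
  bucket?: string;
}

// Hypothetical request shape using the parameter names mentioned in the log.
interface DatasetListParams {
  dataset_type: "DATASET";
  count: number;
  all_users: boolean;
  user?: string[];
  name?: string;
  bucket?: string;
}

function buildApiParams(filters: DatasetFilterState): DatasetListParams {
  const hasUserChips = filters.userChips.length > 0;
  const params: DatasetListParams = {
    dataset_type: "DATASET",
    count: 10000,
    // The backend's all_users overrides the user param, so it must be
    // forced off whenever explicit user: chips are active.
    all_users: hasUserChips ? false : filters.showAllUsers,
  };
  if (hasUserChips) params.user = [...filters.userChips];
  if (filters.name) params.name = filters.name;
  if (filters.bucket) params.bucket = filters.bucket;
  return params;
}
```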
* refactor(datasets): migrate to fetch-all + shim, remove infinite scroll
use-datasets-data: replace usePaginatedData + fetchPaginatedDatasets with
useAllDatasets + applyDatasetsFiltersSync in useMemo. DataTable receives the
full filtered list; virtual scrolling handles display. Removes hasMore,
fetchNextPage, isFetchingNextPage from the return type.
datasets-page-content: remove hasNextPage, onLoadMore, isFetchingNextPage
props passed to DatasetsDataTable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(mocks): spread dataset updated_at over past year for date filter testing
Changed from faker.date.recent({ days: 90 }) to faker.date.past({ years: 1 })
so date range filter chips produce visible filtering in mock mode.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(backend): update BACKEND_TODOS for dataset date filtering and sorting
- Issue #23: updated to reflect current fetch-all workaround and migration path
- Issue #25 (new): documents desired created_after/created_before/updated_after/
updated_before/sort_by/sort_dir params and offset support; lists client-side
workaround locations and full migration path for when backend is updated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): replace show-all toggle with My Datasets preset and default user filter
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): add column sorting via client-side shim
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(data-table): add onRowDoubleClick support to DataTable and VirtualTableBody
- Forward onRowDoubleClick prop through DataTable → VirtualTableBody
- Delegate dblclick via tbody onDoubleClick (same pattern as click/auxclick)
- Guard handleTbodyClick to skip detail>=2 clicks when a double-click handler
is registered, preventing competing document.startViewTransition calls
- Capture row item at mousedown as fallback for handleTbodyDoubleClick in case
the virtualizer remounts rows between the first click and the dblclick event
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): add panel width persistence to datasets table store
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): add DatasetPanel component with details and versions
DatasetPanel: sticky header (PanelHeaderActions badge + close) + scrollable
content with DatasetPanelDetails and DatasetPanelVersions stacked vertically.
DatasetPanelDetails: pool-style card with key/value metadata grid (bucket,
version, size, created, updated, created by, path) and always-visible labels
section showing key/value pairs or "No labels".
DatasetPanelVersions: pool-style card wrapping a table sorted latest-first
with columns: version (current highlighted green), created by, date, size,
tags (pill badges or em dash).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): wire slideout panel and row interactions into page
datasets-page-content: add ResizablePanel wrapping the table, URL-synced panel
state via useSelectionState("view") in "bucket/name" format, usePanelLifecycle
for open/close animation, usePanelWidth for persistence.
datasets-data-table: separate navigateToDataset helper, add onRowSelect /
onRowDoubleClick / selectedDatasetId props. Debounce onRowSelect by 250ms so
the panel URL state change never races with double-click's router.push.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
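One way to picture the 250 ms debounce described above is a small coordinator that delays the single-click action and cancels it when a double-click arrives. This is a framework-free sketch, not the component code: `RowClickCoordinator` and its callbacks are hypothetical names, and the real implementation lives in React handlers.

```typescript
class RowClickCoordinator<T> {
  private timer: ReturnType<typeof setTimeout> | null = null;
  private readonly onSelect: (row: T) => void;
  private readonly onOpen: (row: T) => void;
  private readonly delayMs: number;

  constructor(onSelect: (row: T) => void, onOpen: (row: T) => void, delayMs = 250) {
    this.onSelect = onSelect;
    this.onOpen = onOpen;
    this.delayMs = delayMs;
  }

  // Single click: schedule the selection instead of running it immediately.
  handleClick(row: T): void {
    this.cancel();
    this.timer = setTimeout(() => {
      this.timer = null;
      this.onSelect(row);
    }, this.delayMs);
  }

  // Double click: drop any pending selection so only the navigation runs.
  handleDoubleClick(row: T): void {
    this.cancel();
    this.onOpen(row);
  }

  hasPendingSelect(): boolean {
    return this.timer !== null;
  }

  private cancel(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
  }
}
```

The trade-off is that opening the panel on a single click is delayed by `delayMs`; in exchange, the panel's URL update and the double-click navigation never fire for the same gesture.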
* feat(datasets): implement file listing adapter with path and version params
- Add url field to DatasetFile type for preview support
- Update fetchDatasetFiles to call info endpoint with ?path= and ?version= params
- Update buildDatasetFilesQueryKey to include version in cache key
- Update useDatasetFiles hook to accept optional version param
- Update mock generateFileTree to return preview URLs for files
- Add HEAD and GET /api/bucket/:bucket/dataset/:name/preview mock handlers
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(datasets): remove num_files field from Dataset type
num_files does not exist in the backend API (DataInfoResponse or
DataListEntry) and was always set to 0. Removing it eliminates a
misleading field and reduces the adapter surface area.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): add file browser URL state hook
Manages ?path=, ?version=, and ?file= URL params together via nuqs.
navigateTo() clears ?file= on folder navigation; setVersion() preserves
?path= when switching versions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
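The URL-param rules above can be expressed as two pure functions over `URLSearchParams`. This is only a sketch: the actual hook manages state through nuqs, and the standalone `navigateTo` and `setVersion` signatures here are assumptions. Leaving `?file=` untouched when the version changes is also an assumption, since the commit message only says `?path=` is preserved.

```typescript
// Folder navigation: update ?path= and drop any selected ?file=.
function navigateTo(params: URLSearchParams, path: string): URLSearchParams {
  const next = new URLSearchParams(params);
  if (path) {
    next.set("path", path);
  } else {
    next.delete("path"); // empty path means the dataset root
  }
  next.delete("file");
  return next;
}

// Version switch: update ?version= and leave ?path= as it was.
function setVersion(params: URLSearchParams, version: string | null): URLSearchParams {
  const next = new URLSearchParams(params);
  if (version === null) {
    next.delete("version"); // null is assumed to mean "latest"
  } else {
    next.set("version", version);
  }
  return next;
}
```

Both functions copy their input rather than mutating it, so the previous URL state stays available for comparison.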
* feat(datasets): add FileBrowserBreadcrumb and VersionSwitcher components
FileBrowserBreadcrumb renders dataset name / path segments with clickable
intermediate segments for directory navigation.
VersionSwitcher renders a compact Select with versions sorted newest-first,
status labels, and a (latest) indicator on the highest version.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): add FileBrowserTable component
Renders dataset directory contents using DataTable with:
- Folders sorted before files (both groups alphabetical)
- File type icons (folder amber, image blue, video purple, text/other zinc)
- Size, modified date, and extension columns
- Per-row copy-path button (visible on hover) via useServices().clipboard
- Row click: folders navigate, files select for preview
- selectedRowId highlights the currently previewed file
- Loading skeleton and empty-directory state
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): add FilePreviewPanel with HEAD preflight and copy path
- HEAD preflight via useQuery: 200 renders preview, 401/403 shows public
bucket error, 404 shows not-found error, network error shows retry
- image/* renders <img>, video/* renders <video controls>,
text/* and JSON render <iframe sandbox>, binary shows fallback message
- Copy Path button always visible in panel header
- Metadata footer: size, modified date, checksum when available
- Falls back to metadata-only view when file.url is not set
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
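The preflight and rendering rules listed above amount to two small lookups, sketched below. The function and type names are hypothetical. Mapping any other status to a generic `"error"` is an assumption, as is treating only `application/json` as JSON.

```typescript
type PreflightResult = "ok" | "access-denied" | "not-found" | "error";

// Map the HEAD response status to the state the panel should show.
function classifyPreflight(status: number): PreflightResult {
  if (status === 200) return "ok";
  if (status === 401 || status === 403) return "access-denied";
  if (status === 404) return "not-found";
  return "error"; // assumption: everything else is a retryable error
}

type PreviewKind = "image" | "video" | "text" | "binary";

// Pick a renderer from the Content-Type header, ignoring parameters
// such as "; charset=utf-8".
function pickPreviewKind(contentType: string | null): PreviewKind {
  const mime = ((contentType ?? "").split(";")[0] ?? "").trim().toLowerCase();
  if (mime.startsWith("image/")) return "image";
  if (mime.startsWith("video/")) return "video";
  if (mime.startsWith("text/") || mime === "application/json") return "text";
  return "binary";
}
```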
* feat(datasets): overhaul dataset detail page to Google Drive-style file browser
Replaces the tab-based layout (Overview / Versions / Coming Soon) with a
full-page file browser backed by useFileBrowserState (?path, ?version, ?file
URL params) and a ResizablePanel preview for individual files.
- Add FileBrowserHeader: sticky breadcrumb + version switcher row
- Rewrite DatasetDetailContent: wires FileBrowserHeader, FileBrowserTable,
FilePreviewPanel, and ResizablePanel into a single cohesive layout
- Update DatasetDetailSkeleton: matches new header-bar + table-row layout
- Delete DatasetDetailHeader: superseded by the new header component
- Chrome breadcrumbs include dataset name as third segment; title is "Files"
- Fix VersionSwitcher trigger to show only version number (no double "v")
- Fix buildDatasetFilesQueryKey: use null instead of "latest" sentinel
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(datasets): fix version picker and breadcrumb issues
- Rewrite VersionSwitcher to use SelectPrimitive.ItemText directly so only
the version number reflects in the SelectValue trigger (no double "v" or
verbose subtitle leaking into the trigger display)
- Add prev/next chevron buttons to step one version at a time
- Replace status badge with created_by + created_date subtitle per item
- Make dataset name breadcrumb non-clickable (href: null) since we are
already on that page
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(datasets): anchor version dropdown below trigger using popper position
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(datasets): prevent CSP violation from auth redirect in file listing fetch
The backend returns a 302 redirect to an external Keycloak host when the
session is not established. With the default redirect:"follow" behaviour,
fetch follows the redirect and the browser blocks the connection because
auth-us-west-2-aws.osmo.nvidia.com is outside the page's connect-src CSP.
Use redirect:"manual" so the redirect is never followed. An opaque redirect
response is detected and converted to a clear error message instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
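A hedged sketch of the `redirect: "manual"` approach described above. In browsers, a manually handled redirect surfaces as a response whose `type` is `"opaqueredirect"`. The `ResponseLike` and `Fetcher` types, the function names, and the error wording are all assumptions for illustration. The fetcher is injected so the logic can be exercised without a network; the 3xx status check is an extra guard for runtimes that expose the redirect response directly.

```typescript
// Minimal subset of the Fetch API Response that this sketch relies on.
interface ResponseLike {
  type: string;
  status: number;
  ok: boolean;
}

type Fetcher = (url: string, init: { redirect: "manual" }) => Promise<ResponseLike>;

function isAuthRedirect(res: ResponseLike): boolean {
  return res.type === "opaqueredirect" || (res.status >= 300 && res.status < 400);
}

async function fetchWithoutFollowingRedirects(
  url: string,
  fetcher: Fetcher,
): Promise<ResponseLike> {
  // "manual" stops fetch from following the 302 to the external auth host,
  // which the page's connect-src CSP would block anyway.
  const res = await fetcher(url, { redirect: "manual" });
  if (isAuthRedirect(res)) {
    throw new Error("Session is not established; sign in again and retry.");
  }
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return res;
}
```

In a browser, passing the global `fetch` as the `fetcher` argument should fit this shape.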
* feat(datasets): fetch file manifest via location URL proxy
Switch file listing from a path-filtered /info endpoint (which returns no
files) to fetching the version's location URL — a flat manifest of all files
with relative_path, size, etag, and url fields.
- Add RawFileItem type and buildDirectoryListing() to build per-path views
client-side from the flat manifest
- Add /api/datasets/location-files route to proxy the location fetch
server-side, avoiding CORS issues with the storage URL
- Add /api/datasets/file-proxy route for proxying GET/HEAD requests to
storage URLs, avoiding CSP violations (img-src, connect-src, frame-src)
- Update fetchDatasetFiles() to use customFetch via the location proxy
- Update useDatasetFiles() to accept location: string | null
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
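Building a per-directory view from the flat manifest could look roughly like the sketch below. The `RawFileItem` field names (`relative_path`, `size`, `etag`, `url`) and the `buildDirectoryListing` name come from the commit message; the `DirectoryEntry` shape and the function body are assumptions, not the actual implementation. Folders are sorted before files, matching the FileBrowserTable behavior described earlier in this log.

```typescript
interface RawFileItem {
  relative_path: string;
  size: number;
  etag?: string;
  url?: string;
}

// Hypothetical output shape for one row of the file browser.
interface DirectoryEntry {
  name: string;
  type: "folder" | "file";
  size?: number;
  url?: string;
}

function buildDirectoryListing(items: RawFileItem[], path: string): DirectoryEntry[] {
  // Normalize "train" and "train/" to the prefix "train/"; root is "".
  const trimmed = path.replace(/\/+$/, "");
  const prefix = trimmed === "" ? "" : `${trimmed}/`;
  const folderNames = new Set<string>();
  const files: DirectoryEntry[] = [];

  for (const item of items) {
    if (!item.relative_path.startsWith(prefix)) continue;
    const rest = item.relative_path.slice(prefix.length);
    if (rest === "") continue;
    const slash = rest.indexOf("/");
    if (slash === -1) {
      // Direct child file of the requested directory.
      files.push({ name: rest, type: "file", size: item.size, url: item.url });
    } else {
      // Deeper file: only its first path segment shows up, as a folder.
      folderNames.add(rest.slice(0, slash));
    }
  }

  const byName = (a: DirectoryEntry, b: DirectoryEntry) => a.name.localeCompare(b.name);
  const folders = Array.from(folderNames, (name): DirectoryEntry => ({ name, type: "folder" }));
  return [...folders.sort(byName), ...files.sort(byName)];
}
```

Folders are implied by the paths rather than listed in the manifest, so an empty directory cannot appear in this view.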
* feat(datasets): replace slideout panel with push side-by-side layout
Replace the overlay ResizablePanel with a flex-row split where the file
browser and preview panel share horizontal space. The drag handle (1px
separator) gives the browser width to the preview and back.
- Derive location from the current version's DatasetVersion.location
- Build directory listing via buildDirectoryListing() in a useMemo
- selectedFileData resolves the DatasetFile for the selected path
so the preview panel always receives a live file object
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(datasets): improve copy buttons, file preview, and empty state
FileBrowserTable:
- Copy button now copies the fully qualified storage URL (file.url)
instead of the local relative path
- Copy column locked to ACTIONS_SMALL width (minSize = maxSize) so it
never expands; hidden for folders and files without a URL
- Switch to useCopy hook; wrap button in controlled Tooltip — shows
"Copied!" on click, auto-dismisses after 2 s
- Empty state updated to "This directory is empty or does not exist"
FilePreviewPanel:
- All file requests proxied via /api/datasets/file-proxy to avoid CSP
violations (HEAD preflight, <img src>, <video src>)
- Remove <iframe> branch (blocked by frame-src 'none'); text/* files
rendered via TextPreview — fetches content via proxy and renders in
a <pre> with monospace styling; loading + error states included
- Access-denied error: lock icon + updated copy referencing public
bucket requirement + prominent "Copy path" button
- Not-found error: also gets a "Copy path" button
- Header "Copy path" button and both error-state buttons use useCopy
+ controlled Tooltip for "Copied!" feedback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* No double fetch for resources
* Filter suggestion left align
* Filter Bar Presets Layout
* Single scroll in filter-bar dropdown
* Fix filter-bar behavior
* Remove elkjs and install dagre
* Remove elk worker
* Refine Dagre + ReactFlow integration
* Use permissive licenses only
* Setup error-boundary agent/skill/memory
* Audit and fix error boundaries
* Add vercel skills and tailwind skill
* Setup /audit-and-fix
* Run /audit-and-fix for all except design-guidelines
* Fix lint
* /audit-and-fix for design-guidelines
* Minor fixes
* Add code graph builder and traverser agent skills
* More agent/skills
* Abstraction Enforcer
* Update tools and model for agents
* Remove dead code
* File name convention enforcement - 1/N
* File name convention enforcement - 2/N
* File name convention enforcement - 3/N
* File name convention enforcement - 4/N
* File name convention enforcement - 5/N
* File name convention enforcement - 6/N
* File name convention enforcement - 7/N
* File name convention enforcement - 8/N
* File name convention enforcement - 9/N
* File name convention enforcement - 10/N
* File name convention enforcement - 11/N
* File name convention enforcement - 12/N
* File name convention enforcer - 13/N
* Update agent to include all files in scope
* Dead code removal
* Folder restructure - 1/N
* Folder structure audit
* Import Layer Compliance
* Update deps
* Update folder-structure-enforcer
* Fix types
* Clear agent memory
* Update agents
* Update to node 24.13.1
* Update audit-and-fix pipeline
* Remove dep graph from audit-and-fix
* Pools Feature Folder Restructure
* Resource feature folder restructure
* Log viewer feature folder
* Profile feature folder
* Dashboard feature folder
* Dataset feature folder
* Lint
* Workflow feature folder - 1/N
* Workflow feature folder - 2/N
* Workflow feature folder - 3/N
* Update folder-structure-enforcer to work with sub-dir
* Dag feature folder
* Flatten dashboard feature folder
* Lint and update agent memory
* Flatten log-viewer
* Flatten more files
* Move more files
* Move more files around
* More file moves
* Move more files around
* Fix type-check, format
* Fix unidirectional code import
* Lint
* Fix react-doctor errors
* Fix react-doctor warnings - 1/N
* Fix react-doctor warnings - 2/N
* Fix react-doctor warnings - 3/N
* Fix react-doctor warnings - 4/N
* Fix react-doctor warnings - 5/N
* Fix react-doctor warnings - 6/N
* Fix react-doctor warnings - 7/N
* Fix react-doctor warnings - 8/N
* Fix react-doctor warnings - 9/N
* Code Viewer Skeleton
* Proper copyright text
* Update licenses and readme
* Scrub unnecessary nvidia info
* Scrub openapi.json
* Fix dockerfile
* Revert ui.yaml
* Update export script, orval, and tsconfig target
* Move agent/claude temp out of PR
* Fix lint
* Move ui-next into /src/ui
* Update files per latest ui-next move
* Enforce kebab-case for filenames
* Lint
* Fix breadcrumb origin
---------
Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Upgrade python packages (#485)
* Upgrade python packages
* Update mypy.ini
* Update to node 22.22 (#487)
* Update to node 22.22
* Update sha tag
* address (#488)
* Update sha (#489)
* chore: bump helm chart versions to 1.2.0 and set appVersion to 6.2 (#493)
* chore: remove _v8 from architecture specifiers for distroless version (#497)
* Sort the task list so that the lead task is first (#501)
* Sort the task list so that the lead task is first
* update the task table as well
* Fix multi-select, single-select, and default filter-chips (#500)
* Fix pool multi select
* Fix pool filter
* Fix resource filters
* Default Field
* Fix more filter chips
* Format
* No infinite loop when refresh fails (#502)
* Cleanup doctree and pr previews (#503)
* Push the github pages into a single commit (#505)
* Push the github pages into a single commit
* Fix rename
* lint
---------
Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: patclarknvidia <patc@nvidia.com>
Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com>
Co-authored-by: ecolternv <ecolter@nvidia.com>
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Saurav Nanda <sauravn@nvidia.com>
Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com>
cypres pushed a commit to cypres/OSMO that referenced this pull request on Feb 26, 2026
* Sync main into feature/PROJ-148-auth-rework (NVIDIA#258) * allow flexible squid proxy replicas (NVIDIA#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (NVIDIA#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * NVIDIA#244 - Use last login url if url is not specified (NVIDIA#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (NVIDIA#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (NVIDIA#249) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> * * Add authz sidecar service with Go implementation This commit adds the authorization sidecar service including: - Go-based authz server implementing Envoy External Authorization - PostgreSQL client for role/policy storage - Role caching for performance optimization - Action registry for path-to-action mapping - Comprehensive test suite - Python test service for integration testing - Documentation and quickstart guide * * Begin resource action model * Server validates both legacy and new * Update logic for action registry * Sync main into feature/PROJ-148-auth-rework (NVIDIA#298) * allow flexible squid proxy replicas (NVIDIA#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (NVIDIA#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD 
* Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * NVIDIA#244 - Use last login url if url is not specified (NVIDIA#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (NVIDIA#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (NVIDIA#249) * sync-feature-branches: fix no conflict case, allow single branch to be synced (NVIDIA#252) * Fix sync-feature-branches with no merge conflicts * Allow a single branch to be specified for sync-feature-branches * Perform operations as OSMO CI Bot * Add external label when the PR is created * extract issue number * add test cases (NVIDIA#247) * Allow PR checks to run on release branches (NVIDIA#264) * Database Pooling in Postgres Singleton Across Services (NVIDIA#251) * Initial commit for database pooling * Update set_session * Fix lint * Update PostgresConnector to have semaphor to control connections * Lint fix * Fix number of maxconn for test * Address comments * Add Go Postgres utils (NVIDIA#272) * NVIDIA#148 - Auth Project Design Documents (NVIDIA#165) * add args to postgres (NVIDIA#282) * NVIDIA#267 - cloud deployment scripts (NVIDIA#268) * script to create azure resources and deploy * Remove auto-generated values files from tracking - Added .gitignore to ignore values/, *.env files - Removed values/*.yaml files from git (auto-generated during deployment) * add aws script * add aws script * add copyright * update copyright * Support for Azure workload identity in AKS and Arc clusters (NVIDIA#141) * feat(src): add Azure service account and extra pod labels configuration - implement service account creation with customizable name and annotations - enhance service templates to support extra pod labels for various services - update Azure backend to utilize DefaultAzureCredential for 
authentication - add tests for Azure credential extraction and client creation * feat(src): extract account key from connection string for Azure Blob Storage - add function to extract AccountKey from connection string - update AzureBlobStorageClient to handle different credential types * feat(test): add tests for account key extraction from Azure connection strings * chore: clean up linting issues for tests * refactor(src): update data credential types in PostgresConnector and TaskGroup - change StaticDataCredential to DataCredential in get_all_data_creds method - update fetch_creds function signature to use DataCredential * feat(src): update Azure client creation to include storage account and account URL - remove deprecated storage account extraction function - modify create_client to accept storage_account and account_url parameters - update AzureBlobStorageClientFactory to use new parameters - adjust tests to reflect changes in client creation 🔒 - Generated by Copilot * refactor(src): mark storage_account parameter as unused in create_client function 🔧 - Generated by Copilot * refactor(src): remove unused storage_account parameter from client creation 🔧 - Generated by Copilot * Fix conflicts --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com> Co-authored-by: xutongNV <xutongr@nvidia.com> Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com> * Remove action permissions from pool config (NVIDIA#307) * Sync main into feature/PROJ-148-auth-rework (NVIDIA#322) * allow flexible squid proxy replicas (NVIDIA#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (NVIDIA#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress 
writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * NVIDIA#244 - Use last login url if url is not specified (NVIDIA#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (NVIDIA#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (NVIDIA#249) * sync-feature-branches: fix no conflict case, allow single branch to be synced (NVIDIA#252) * Fix sync-feature-branches with no merge conflicts * Allow a single branch to be specified for sync-feature-branches * Perform operations as OSMO CI Bot * Add external label when the PR is created * extract issue number * add test cases (NVIDIA#247) * Allow PR checks to run on release branches (NVIDIA#264) * Database Pooling in Postgres Singleton Across Services (NVIDIA#251) * Initial commit for database pooling * Update set_session * Fix lint * Update PostgresConnector to have semaphor to control connections * Lint fix * Fix number of maxconn for test * Address comments * Add Go Postgres utils (NVIDIA#272) * NVIDIA#148 - Auth Project Design Documents (NVIDIA#165) * add args to postgres (NVIDIA#282) * NVIDIA#267 - cloud deployment scripts (NVIDIA#268) * script to create azure resources and deploy * Remove auto-generated values files from tracking - Added .gitignore to ignore values/, *.env files - Removed values/*.yaml files from git (auto-generated during deployment) * add aws script * add aws script * add copyright * update copyright * Support for Azure workload identity in AKS and Arc clusters (NVIDIA#141) * feat(src): add Azure service account and extra pod labels configuration - implement service account creation with customizable name and annotations - enhance service templates to support extra pod labels for various services - update Azure 
backend to utilize DefaultAzureCredential for authentication - add tests for Azure credential extraction and client creation * feat(src): extract account key from connection string for Azure Blob Storage - add function to extract AccountKey from connection string - update AzureBlobStorageClient to handle different credential types * feat(test): add tests for account key extraction from Azure connection strings * chore: clean up linting issues for tests * refactor(src): update data credential types in PostgresConnector and TaskGroup - change StaticDataCredential to DataCredential in get_all_data_creds method - update fetch_creds function signature to use DataCredential * feat(src): update Azure client creation to include storage account and account URL - remove deprecated storage account extraction function - modify create_client to accept storage_account and account_url parameters - update AzureBlobStorageClientFactory to use new parameters - adjust tests to reflect changes in client creation 🔒 - Generated by Copilot * refactor(src): mark storage_account parameter as unused in create_client function 🔧 - Generated by Copilot * refactor(src): remove unused storage_account parameter from client creation 🔧 - Generated by Copilot * Add new project proposal to describe nvlink + topology aware scheduling (NVIDIA#211) * Add new project proposal to describe nvlink + topology aware scheduling * Split design into two docs * Finish docs and add some updates from feedback * Add some open items * OSMO-6044: Application error when closing Task Details after switching Events view from Task to Workflow (NVIDIA#315) * add redis utlis, update postgres utils (NVIDIA#313) * add redis utlis, update postgres utils * add deps * Fix missing seperator in the test runner roles (NVIDIA#320) * fix * remove * fix --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: patclarknvidia 
<patc@nvidia.com> Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com> Co-authored-by: xutongNV <xutongr@nvidia.com> Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com> Co-authored-by: ecolternv <ecolter@nvidia.com> Co-authored-by: tdewanNvidia <tdewan@nvidia.com> * Connect envoy with authz sidecar (NVIDIA#319) * Connect the authz sidecar to envoy * update sidecar * fix typo * add extra env * uncomment * NVIDIA#290 - Add attribute fetching for workflow pool matching (NVIDIA#338) * update * fix * fix * Merge main into feature branch (NVIDIA#452) * fix: pass node_condition_prefix to backend-worker deployment (NVIDIA#448) * NVIDIA#148 - Add user mapping into OSMO (NVIDIA#418) * Sync main into feature/PROJ-148-user-mapping (NVIDIA#375) * fix: update extraArgs to render string properly (NVIDIA#362) * Remove auth router in agent service (NVIDIA#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (NVIDIA#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk. Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. 
This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#294 NVIDIA#295 - Add user table and user-role mapping (NVIDIA#373) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#339 - Add documentation for creating PAT and service accounts (NVIDIA#395) * Rename pat (NVIDIA#428) * NVIDIA#403 - Set optional default admin during service creation + fix PAT wording (NVIDIA#404) * use base64 for access tokens (NVIDIA#437) * 
Merge main into feature (NVIDIA#441) * Support Non-AWS S3 storage without environment variables (NVIDIA#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from workflows/ to cookbook/ (NVIDIA#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (NVIDIA#432) * Update links to cookbook instead of workflows in Github * Update more links --------- Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Modify profile list api to show role info (NVIDIA#442) --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Revert "Remove auth router in agent service (NVIDIA#371)" (NVIDIA#390) This reverts commit fac90f2. 
* Merge main into feature --------- Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * NVIDIA#407 - Authz sidecar sends accessible pools to service (NVIDIA#455) * update cache specification * lint * authz sidecar sends info to service * Add logging to the go code * comments * Merge main into feature (NVIDIA#463) * fix: pass node_condition_prefix to backend-worker deployment (NVIDIA#448) * NVIDIA#148 - Add user mapping into OSMO (NVIDIA#418) * Sync main into feature/PROJ-148-user-mapping (NVIDIA#375) * fix: update extraArgs to render string properly (NVIDIA#362) * Remove auth router in agent service (NVIDIA#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (NVIDIA#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk. Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. 
This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#294 NVIDIA#295 - Add user table and user-role mapping (NVIDIA#373) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#339 - Add documentation for creating PAT and service accounts (NVIDIA#395) * Rename pat (NVIDIA#428) * NVIDIA#403 - Set optional default admin during service creation + fix PAT wording (NVIDIA#404) * use base64 for access tokens (NVIDIA#437) * 
Merge main into feature (NVIDIA#441) * Support Non-AWS S3 storage without environment variables (NVIDIA#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from workflows/ to cookbook/ (NVIDIA#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (NVIDIA#432) * Update links to cookbook instead of workflows in Github * Update more links --------- Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Modify profile list api to show role info (NVIDIA#442) --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Add override_url when forwarding default_credential to StaticDataCredential (NVIDIA#444) * Oauth2 proxy design (NVIDIA#436) * design for oauth2 proxy * format * update design after POC * update format * update design per envoy / UI changes * NVIDIA#356 - Group Template Implementation (NVIDIA#454) - Add group templates as a new type of config - Allow group templates to be assigned to pools - When a workflow is submitted, group templates will be instantiated per group, and when it is completed, they will be destroyed - Cleanup logic for creating kubernetes objects in the backend by using kubernetes dynamic client. * Client install location is determined by service (NVIDIA#447) * Revert "Remove auth router in agent service (NVIDIA#371)" (NVIDIA#390) This reverts commit fac90f2. 
* dupe * remove --------- Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: ecolternv <ecolter@nvidia.com> * Revert "Remove auth router in agent service (NVIDIA#371)" (NVIDIA#390) This reverts commit fac90f2. * Add new fields in the envoy logs (NVIDIA#466) * Revert "NVIDIA#356 - Group Template Implementation (NVIDIA#454)" (NVIDIA#459) This reverts commit a483a88. * fix --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com> Co-authored-by: xutongNV <xutongr@nvidia.com> Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com> Co-authored-by: ecolternv <ecolter@nvidia.com> Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
cypres pushed a commit to cypres/OSMO that referenced this pull request on Feb 26, 2026
* Sync main into feature/PROJ-148-auth-rework (NVIDIA#258) * allow flexible squid proxy replicas (NVIDIA#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (NVIDIA#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * NVIDIA#244 - Use last login url if url is not specified (NVIDIA#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (NVIDIA#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (NVIDIA#249) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> * * Add authz sidecar service with Go implementation This commit adds the authorization sidecar service including: - Go-based authz server implementing Envoy External Authorization - PostgreSQL client for role/policy storage - Role caching for performance optimization - Action registry for path-to-action mapping - Comprehensive test suite - Python test service for integration testing - Documentation and quickstart guide * * Begin resource action model * Server validates both legacy and new * Update logic for action registry * Sync main into feature/PROJ-148-auth-rework (NVIDIA#298) * allow flexible squid proxy replicas (NVIDIA#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (NVIDIA#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD 
* Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * NVIDIA#244 - Use last login url if url is not specified (NVIDIA#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (NVIDIA#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (NVIDIA#249) * sync-feature-branches: fix no conflict case, allow single branch to be synced (NVIDIA#252) * Fix sync-feature-branches with no merge conflicts * Allow a single branch to be specified for sync-feature-branches * Perform operations as OSMO CI Bot * Add external label when the PR is created * extract issue number * add test cases (NVIDIA#247) * Allow PR checks to run on release branches (NVIDIA#264) * Database Pooling in Postgres Singleton Across Services (NVIDIA#251) * Initial commit for database pooling * Update set_session * Fix lint * Update PostgresConnector to have semaphor to control connections * Lint fix * Fix number of maxconn for test * Address comments * Add Go Postgres utils (NVIDIA#272) * NVIDIA#148 - Auth Project Design Documents (NVIDIA#165) * add args to postgres (NVIDIA#282) * NVIDIA#267 - cloud deployment scripts (NVIDIA#268) * script to create azure resources and deploy * Remove auto-generated values files from tracking - Added .gitignore to ignore values/, *.env files - Removed values/*.yaml files from git (auto-generated during deployment) * add aws script * add aws script * add copyright * update copyright * Support for Azure workload identity in AKS and Arc clusters (NVIDIA#141) * feat(src): add Azure service account and extra pod labels configuration - implement service account creation with customizable name and annotations - enhance service templates to support extra pod labels for various services - update Azure backend to utilize DefaultAzureCredential for 
authentication - add tests for Azure credential extraction and client creation * feat(src): extract account key from connection string for Azure Blob Storage - add function to extract AccountKey from connection string - update AzureBlobStorageClient to handle different credential types * feat(test): add tests for account key extraction from Azure connection strings * chore: clean up linting issues for tests * refactor(src): update data credential types in PostgresConnector and TaskGroup - change StaticDataCredential to DataCredential in get_all_data_creds method - update fetch_creds function signature to use DataCredential * feat(src): update Azure client creation to include storage account and account URL - remove deprecated storage account extraction function - modify create_client to accept storage_account and account_url parameters - update AzureBlobStorageClientFactory to use new parameters - adjust tests to reflect changes in client creation 🔒 - Generated by Copilot * refactor(src): mark storage_account parameter as unused in create_client function 🔧 - Generated by Copilot * refactor(src): remove unused storage_account parameter from client creation 🔧 - Generated by Copilot * Fix conflicts --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com> Co-authored-by: xutongNV <xutongr@nvidia.com> Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com> * Remove action permissions from pool config (NVIDIA#307) * Sync main into feature/PROJ-148-auth-rework (NVIDIA#322) * allow flexible squid proxy replicas (NVIDIA#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (NVIDIA#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress 
writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * NVIDIA#244 - Use last login url if url is not specified (NVIDIA#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (NVIDIA#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (NVIDIA#249) * sync-feature-branches: fix no conflict case, allow single branch to be synced (NVIDIA#252) * Fix sync-feature-branches with no merge conflicts * Allow a single branch to be specified for sync-feature-branches * Perform operations as OSMO CI Bot * Add external label when the PR is created * extract issue number * add test cases (NVIDIA#247) * Allow PR checks to run on release branches (NVIDIA#264) * Database Pooling in Postgres Singleton Across Services (NVIDIA#251) * Initial commit for database pooling * Update set_session * Fix lint * Update PostgresConnector to have semaphor to control connections * Lint fix * Fix number of maxconn for test * Address comments * Add Go Postgres utils (NVIDIA#272) * NVIDIA#148 - Auth Project Design Documents (NVIDIA#165) * add args to postgres (NVIDIA#282) * NVIDIA#267 - cloud deployment scripts (NVIDIA#268) * script to create azure resources and deploy * Remove auto-generated values files from tracking - Added .gitignore to ignore values/, *.env files - Removed values/*.yaml files from git (auto-generated during deployment) * add aws script * add aws script * add copyright * update copyright * Support for Azure workload identity in AKS and Arc clusters (NVIDIA#141) * feat(src): add Azure service account and extra pod labels configuration - implement service account creation with customizable name and annotations - enhance service templates to support extra pod labels for various services - update Azure 
backend to utilize DefaultAzureCredential for authentication - add tests for Azure credential extraction and client creation * feat(src): extract account key from connection string for Azure Blob Storage - add function to extract AccountKey from connection string - update AzureBlobStorageClient to handle different credential types * feat(test): add tests for account key extraction from Azure connection strings * chore: clean up linting issues for tests * refactor(src): update data credential types in PostgresConnector and TaskGroup - change StaticDataCredential to DataCredential in get_all_data_creds method - update fetch_creds function signature to use DataCredential * feat(src): update Azure client creation to include storage account and account URL - remove deprecated storage account extraction function - modify create_client to accept storage_account and account_url parameters - update AzureBlobStorageClientFactory to use new parameters - adjust tests to reflect changes in client creation 🔒 - Generated by Copilot * refactor(src): mark storage_account parameter as unused in create_client function 🔧 - Generated by Copilot * refactor(src): remove unused storage_account parameter from client creation 🔧 - Generated by Copilot * Add new project proposal to describe nvlink + topology aware scheduling (NVIDIA#211) * Add new project proposal to describe nvlink + topology aware scheduling * Split design into two docs * Finish docs and add some updates from feedback * Add some open items * OSMO-6044: Application error when closing Task Details after switching Events view from Task to Workflow (NVIDIA#315) * add redis utlis, update postgres utils (NVIDIA#313) * add redis utlis, update postgres utils * add deps * Fix missing seperator in the test runner roles (NVIDIA#320) * fix * remove * fix --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: patclarknvidia 
<patc@nvidia.com> Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com> Co-authored-by: xutongNV <xutongr@nvidia.com> Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com> Co-authored-by: ecolternv <ecolter@nvidia.com> Co-authored-by: tdewanNvidia <tdewan@nvidia.com> * Connect envoy with authz sidecar (NVIDIA#319) * Connect the authz sidecar to envoy * update sidecar * fix typo * add extra env * uncomment * NVIDIA#290 - Add attribute fetching for workflow pool matching (NVIDIA#338) * update * fix * fix * Merge main into feature branch (NVIDIA#452) * fix: pass node_condition_prefix to backend-worker deployment (NVIDIA#448) * NVIDIA#148 - Add user mapping into OSMO (NVIDIA#418) * Sync main into feature/PROJ-148-user-mapping (NVIDIA#375) * fix: update extraArgs to render string properly (NVIDIA#362) * Remove auth router in agent service (NVIDIA#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (NVIDIA#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk. Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. 
This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#294 NVIDIA#295 - Add user table and user-role mapping (NVIDIA#373) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#339 - Add documentation for creating PAT and service accounts (NVIDIA#395) * Rename pat (NVIDIA#428) * NVIDIA#403 - Set optional default admin during service creation + fix PAT wording (NVIDIA#404) * use base64 for access tokens (NVIDIA#437) * 
Merge main into feature (NVIDIA#441) * Support Non-AWS S3 storage without environment variables (NVIDIA#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from workflows/ to cookbook/ (NVIDIA#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (NVIDIA#432) * Update links to cookbook instead of workflows in Github * Update more links --------- Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Modify profile list api to show role info (NVIDIA#442) --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Revert "Remove auth router in agent service (NVIDIA#371)" (NVIDIA#390) This reverts commit fac90f2. 
* Merge main into feature --------- Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * NVIDIA#407 - Authz sidecar sends accessible pools to service (NVIDIA#455) * update cache specification * lint * authz sidecar sends info to service * Add logging to the go code * comments * Merge main into feature (NVIDIA#463) * fix: pass node_condition_prefix to backend-worker deployment (NVIDIA#448) * NVIDIA#148 - Add user mapping into OSMO (NVIDIA#418) * Sync main into feature/PROJ-148-user-mapping (NVIDIA#375) * fix: update extraArgs to render string properly (NVIDIA#362) * Remove auth router in agent service (NVIDIA#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (NVIDIA#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk. Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. 
This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#294 NVIDIA#295 - Add user table and user-role mapping (NVIDIA#373) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#339 - Add documentation for creating PAT and service accounts (NVIDIA#395) * Rename pat (NVIDIA#428) * NVIDIA#403 - Set optional default admin during service creation + fix PAT wording (NVIDIA#404) * use base64 for access tokens (NVIDIA#437) * 
Merge main into feature (NVIDIA#441) * Support Non-AWS S3 storage without environment variables (NVIDIA#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from workflows/ to cookbook/ (NVIDIA#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (NVIDIA#432) * Update links to cookbook instead of workflows in Github * Update more links --------- Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Modify profile list api to show role info (NVIDIA#442) --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Add override_url when forwarding default_credential to StaticDataCredential (NVIDIA#444) * Oauth2 proxy design (NVIDIA#436) * design for oauth2 proxy * format * update design after POC * update format * update design per envoy / UI changes * NVIDIA#356 - Group Template Implementation (NVIDIA#454) - Add group templates as a new type of config - Allow group templates to be assigned to pools - When a workflow is submitted, group templates will be instantiated per group, and when it is completed, they will be destroyed - Cleanup logic for creating kubernetes objects in the backend by using kubernetes dynamic client. * Client install location is determined by service (NVIDIA#447) * Revert "Remove auth router in agent service (NVIDIA#371)" (NVIDIA#390) This reverts commit fac90f2. 
* dupe * remove --------- Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: ecolternv <ecolter@nvidia.com> * Revert "Remove auth router in agent service (NVIDIA#371)" (NVIDIA#390) This reverts commit fac90f2. * Add new fields in the envoy logs (NVIDIA#466) * Revert "NVIDIA#356 - Group Template Implementation (NVIDIA#454)" (NVIDIA#459) This reverts commit a483a88. * fix --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> Co-authored-by: patclarknvidia <patc@nvidia.com> Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com> Co-authored-by: xutongNV <xutongr@nvidia.com> Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com> Co-authored-by: ecolternv <ecolter@nvidia.com> Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
cypres pushed a commit to cypres/OSMO that referenced this pull request Feb 26, 2026
* fix: pass node_condition_prefix to backend-worker deployment (NVIDIA#448) * NVIDIA#148 - Add user mapping into OSMO (NVIDIA#418) * Sync main into feature/PROJ-148-user-mapping (NVIDIA#375) * fix: update extraArgs to render string properly (NVIDIA#362) * Remove auth router in agent service (NVIDIA#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (NVIDIA#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk. Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. 
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: RyaliNvidia <ryali@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#294 NVIDIA#295 - Add user table and user-role mapping (NVIDIA#373) --------- Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> * NVIDIA#339 - Add documentation for creating PAT and service accounts (NVIDIA#395) * Rename pat (NVIDIA#428) * NVIDIA#403 - Set optional default admin during service creation + fix PAT wording (NVIDIA#404) * use base64 for access tokens (NVIDIA#437) * Merge main into feature (NVIDIA#441) * Support Non-AWS S3 storage without environment variables (NVIDIA#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from workflows/ to cookbook/ (NVIDIA#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (NVIDIA#432) * Update links to cookbook instead of workflows in Github * Update more links 
--------- Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Modify profile list api to show role info (NVIDIA#442) --------- Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com> * Revert "Remove auth router in agent service (NVIDIA#371)" (NVIDIA#390) This reverts commit fac90f2. * Merge main into feature --------- Co-authored-by: tdewanNvidia <tdewan@nvidia.com> Co-authored-by: OSMO CI Bot <255188861+svc-osmo-ci@users.noreply.github.com> Co-authored-by: Vivian Pan <vivianp@nvidia.com> Co-authored-by: Fernando L <fernandol@nvidia.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: ethany-nv <ethany@nvidia.com>
Issue #None
Checklist