[BUG] - Notebook Jobs failing because environment is not found #2277

Closed
marcelovilla opened this issue Feb 23, 2024 · 9 comments · Fixed by #2286
Labels
area: integration/Argo block-release ⛔️ Must be completed for release needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug type: bug 🐛 Something isn't working

Comments

@marcelovilla
Member

Describe the bug

When trying to submit a Jupyter Notebook job, the job immediately stops and shows the following error:

No such kernel named conda-env-global-global-papermill-py

Here is a screenshot of the Notebook Jobs tab with more information:
[screenshot: Notebook Jobs tab showing the failed job]

Expected behavior

I would expect the Notebook Job to successfully start and execute.

OS and architecture in which you are running Nebari

AWS

How to Reproduce the problem?

I am deploying Nebari from a1e5fd1 to AWS, using the following config file:

provider: aws
namespace: dev
nebari_version: 2024.1.2.dev44+ga1e5fd1e
project_name: release-test
domain: some.domain.dev
ci_cd:
  type: none
terraform_state:
  type: remote
security:
  keycloak:
    initial_root_password: extra-secure-password
  authentication:
    type: password
theme:
  jupyterhub:
    hub_title: Nebari - release-test
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted on Amazon Web Services
amazon_web_services:
  kubernetes_version: '1.26'
  region: us-west-2
monitoring:
  enabled: true
argo_workflows:
  enabled: true

Command output

No response

Versions and dependencies used.

No response

Compute environment

AWS

Integrations

Argo

Anything else?

The environment I am using to submit the Notebook Job has papermill installed, and Argo is enabled. This also happens with other environments that have papermill installed.

@marcelovilla marcelovilla added type: bug 🐛 Something isn't working needs: triage 🚦 Someone needs to have a look at this issue and triage labels Feb 23, 2024
@marcelovilla marcelovilla added needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug area: integration/Argo block-release ⛔️ Must be completed for release and removed needs: triage 🚦 Someone needs to have a look at this issue and triage labels Feb 23, 2024
@krassowski
Member

I am getting "There is no kernel associated with the notebook. Please open the notebook, select a kernel, and re-submit the job to execute." semi-reliably.

@dharhas
Member

dharhas commented Feb 27, 2024

HTTPConnectionPool(host='nebari-conda-store-server.dev.svc', port=5000): Max retries exceeded with url: /conda-store/api/v1/environment (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fa01067de40>: Failed to resolve 'nebari-conda-store-server.dev.svc' ([Errno -2] Name or service not known)"))

This is the error I see on the JATIC deployment when I try to run a job now.
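A quick way to tell a DNS failure apart from an HTTP-level error is to probe the endpoint from inside a user pod. A minimal sketch, with the host, port, and path copied from the error above:

import requests

# Probe the conda-store API endpoint referenced in the error above.
url = "http://nebari-conda-store-server.dev.svc:5000/conda-store/api/v1/environment"
try:
    resp = requests.get(url, timeout=5)
    print(resp.status_code)  # service reachable; the problem is elsewhere
except requests.exceptions.ConnectionError as err:
    # requests wraps urllib3's NameResolutionError in a ConnectionError
    print(f"DNS/connection failure: {err}")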

@marcelovilla
Member Author

@nkaretnikov do you have any ideas? I know you did some major changes to https://github.com/nebari-dev/argo-jupyter-scheduler last month.

I'm not sure it's related to those changes, though. We used the latest argo-jupyter-scheduler for Nebari and the notebook job submission worked fine.

@nkaretnikov

@marcelovilla That repo has no hardcoded references to the papermill environment. argo-jupyter-scheduler just needs papermill to be available in some environment. Also, the error it would throw if papermill were missing wouldn't look like this. So it's some issue with environments in that deployment.

@marcelovilla
Member Author

Checking the logs in the user pod, I'm seeing the following error:

notebook Traceback (most recent call last):
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/executors.py", line 60, in process
notebook     self.execute()
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/executors.py", line 140, in execute
notebook     ep.preprocess(nb)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/nbconvert/preprocessors/execute.py", line 96, in preprocess
notebook     with self.setup_kernel():
notebook   File "/opt/conda/envs/default/lib/python3.10/contextlib.py", line 135, in __enter__
notebook     return next(self.gen)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/nbclient/client.py", line 596, in setup_kernel
notebook     self.start_new_kernel(**kwargs)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 165, in wrapped
notebook     return loop.run_until_complete(inner)
notebook   File "/opt/conda/envs/default/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
notebook     return future.result()
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/nbclient/client.py", line 546, in async_start_new_kernel
notebook     await ensure_async(self.km.start_kernel(extra_arguments=self.extra_arguments, **kwargs))
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 198, in ensure_async
notebook     result = await obj
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 96, in wrapper
notebook     raise e
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 87, in wrapper
notebook     out = await method(self, *args, **kwargs)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 435, in _async_start_kernel
notebook     kernel_cmd, kw = await self._async_pre_start_kernel(**kw)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 397, in _async_pre_start_kernel
notebook     self.kernel_spec,
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 195, in kernel_spec
notebook     self._kernel_spec = self.kernel_spec_manager.get_kernel_spec(self.kernel_name)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/kernelspec.py", line 285, in get_kernel_spec
notebook     raise NoSuchKernel(kernel_name)
notebook jupyter_client.kernelspec.NoSuchKernel: No such kernel named conda-env-global-global-papermill-py

I checked another deployment with the latest version of Nebari (2024.1.1), where this works, and the jupyter_scheduler version is slightly different. There are, however, also differences in the jupyterlab versions.

I'll need to investigate this further.
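One way to see what the job environment actually finds, as a sketch (assuming it is run inside the user pod with the default environment active):

from jupyter_client.kernelspec import KernelSpecManager, NoSuchKernel

# List the kernelspec names Jupyter can currently see, then try to resolve
# the failing name; an unknown name raises NoSuchKernel, as in the traceback.
ksm = KernelSpecManager()
print(sorted(ksm.find_kernel_specs()))

try:
    ksm.get_kernel_spec("conda-env-global-global-papermill-py")
except NoSuchKernel as err:
    print(f"kernel not visible: {err}")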

@krassowski
Member

krassowski commented Mar 3, 2024

I think I understand the issue that I was running into: when creating a notebook, but before manually saving it, the kernelspec metadata is not there, hence the kernel cannot be found. Manually saving the notebook solves the problem. I think this is an upstream issue, or at the very least a quite misleading error message:

[screen recording: a newly created notebook needs to be explicitly saved before job submission works]

Server log in the details below:

[E 2024-03-03 09:59:26.166 SchedulerApp] There is no kernel associated with the notebook. Please open
                    the notebook, select a kernel, and re-submit the job to execute.
    Traceback (most recent call last):
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/handlers.py", line 233, in post
        job_id = await ensure_async(self.scheduler.create_job(CreateJob(**payload)))
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/scheduler.py", line 380, in create_job
        raise SchedulerError(
    jupyter_scheduler.exceptions.SchedulerError: There is no kernel associated with the notebook. Please open
                    the notebook, select a kernel, and re-submit the job to execute.

[W 2024-03-03 09:59:26.166 SchedulerApp] wrote error: 'There is no kernel associated with the notebook. Please open\n                    the notebook, select a kernel, and re-submit the job to execute.\n                    '
    Traceback (most recent call last):
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/handlers.py", line 233, in post
        job_id = await ensure_async(self.scheduler.create_job(CreateJob(**payload)))
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/scheduler.py", line 380, in create_job
        raise SchedulerError(
    jupyter_scheduler.exceptions.SchedulerError: There is no kernel associated with the notebook. Please open
                    the notebook, select a kernel, and re-submit the job to execute.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/opt/conda/envs/default/lib/python3.10/site-packages/tornado/web.py", line 1790, in _execute
        result = await result
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/handlers.py", line 245, in post
        raise HTTPError(500, str(e)) from e
    tornado.web.HTTPError: HTTP 500: Internal Server Error (There is no kernel associated with the notebook. Please open
                        the notebook, select a kernel, and re-submit the job to execute.
                        )

In the beginning, the notebook is created as an empty shell on disk:

{
 "cells": [],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}

After the kernel connects, it provides metadata; that metadata is then saved when the user saves the notebook for the first time:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27a3458b-29d2-4b29-9695-868f409cd12f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
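A minimal check for that missing metadata, as a sketch (the notebook path is illustrative):

import nbformat

# Read the notebook and look for the kernelspec name that jupyter-scheduler
# needs; the first JSON above has no such entry, the second one does.
nb = nbformat.read("Untitled.ipynb", as_version=4)  # illustrative path
kernel_name = nb.metadata.get("kernelspec", {}).get("name")
if kernel_name is None:
    print("no kernelspec yet; save the notebook once so it gets written")
else:
    print(f"kernelspec: {kernel_name}")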

@krassowski
Member

I am able to reproduce the issue with the global-global-papermill-py environment, with the same error message as @marcelovilla reported.

Since neither that, nor the behaviour of the kernel not being found before the notebook is saved (as reported in my previous comment), occurs in the previous (stable) Nebari version, I would think this is a change in jupyter-scheduler or argo-jupyter-scheduler.

  • The stable version has jupyter-scheduler 2.5.0 and argo-jupyter-scheduler 2024.1.3
  • The test version has jupyter-scheduler 2.5.1 and argo-jupyter-scheduler 2024.1.3
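To compare deployments, a quick version check, as a sketch (assuming these are the installed distribution names; run it in each user environment):

from importlib.metadata import version

# Print the versions of the packages relevant to this regression.
for pkg in ("jupyter-scheduler", "argo-jupyter-scheduler", "jupyterlab"):
    print(pkg, version(pkg))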

@krassowski
Member

Last week I mentioned it is hitting a code path that it has no right to hit if argo-jupyter-scheduler is used. Here is why:

  1. The error I saw is raised from the create_job() method in upstream jupyter-scheduler when the validate() method of the Execution Manager returns False
  2. The default implementation of the Execution Manager indeed returns False whenever it cannot read the notebook and find a kernelspec in it (so the issue with being unable to run newly created notebooks is indeed inherent to the default jupyter-scheduler)
  3. But in argo-jupyter-scheduler this validation is disabled and always returns True (see the sketch below)
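Here is the sketch mentioned in point 3: an illustrative subclass (not the actual argo-jupyter-scheduler source) showing how the validation is effectively disabled:

from jupyter_scheduler.executors import DefaultExecutionManager

class AlwaysValidExecutionManager(DefaultExecutionManager):
    # Upstream's validate() reads the notebook and returns False when no
    # kernelspec metadata is present; returning True unconditionally skips
    # that check, so create_job() should never raise this error.
    @classmethod
    def validate(cls, input_path: str) -> bool:
        return True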

The conclusion is that, in the test environment with the current pre-release, the configuration that should be setting the Scheduler.execution_manager_class traitlet must have no effect or not be passed to the server correctly. In Nebari it is defined in:

from argo_jupyter_scheduler.executor import ArgoExecutor
from argo_jupyter_scheduler.scheduler import ArgoScheduler
c.Scheduler.execution_manager_class=ArgoExecutor
c.SchedulerApp.scheduler_class=ArgoScheduler
c.SchedulerApp.scheduler_class.use_conda_store_env=True

@krassowski
Member

Ok, so #2251 by myself is to blame. The very head of the log contains:

    Traceback (most recent call last):
      File "/opt/conda/envs/default/lib/python3.10/site-packages/traitlets/config/application.py", line 915, in _load_config_files
        config = loader.load_config()
      File "/opt/conda/envs/default/lib/python3.10/site-packages/traitlets/config/loader.py", line 622, in load_config
        self._read_file_as_dict()
      File "/opt/conda/envs/default/lib/python3.10/site-packages/traitlets/config/loader.py", line 655, in _read_file_as_dict
        exec(compile(f.read(), conf_filename, "exec"), namespace, namespace)  # noqa: S102
      File "/etc/jupyter/jupyter_server_config.py", line 13
        preferred_dir =
                        ^
    SyntaxError: invalid syntax

This is exactly the case I was making in the issue:

Another question is if an error during configuration readout should prevent the jupyter server from starting up in the first place. I would say that yes, it should because the configuration may contain additional opt-in security measures, and if those are not active the security guarantees may not be met.
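That failure mode is easy to reproduce in isolation. A minimal standalone sketch using traitlets directly, with the file content mirroring the broken line from the log:

from pathlib import Path
from traitlets.config.loader import PyFileConfigLoader

# A syntactically invalid traitlets Python config file makes load_config()
# raise, so none of the settings in that file take effect, including
# Scheduler.execution_manager_class.
conf = Path("jupyter_server_config.py")
conf.write_text("preferred_dir =\n")  # mirrors the broken line from the log

try:
    PyFileConfigLoader(str(conf)).load_config()
except SyntaxError as err:
    print(f"config file skipped entirely: {err}")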
