[BUG] - Notebook Jobs failing because environment is not found #2277

Closed
marcelovilla opened this issue Feb 23, 2024 · 9 comments · Fixed by #2286
Labels
area: integration/Argo block-release ⛔️ Must be completed for release needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug type: bug 🐛 Something isn't working

Comments

@marcelovilla
Member

Describe the bug

When trying to submit a Jupyter Notebook job, the job immediately stops and shows the following error:

No such kernel named conda-env-global-global-papermill-py

Here is a screenshot of the Notebook Jobs tab with more information:
[screenshot: Notebook Jobs tab showing the failed job]

Expected behavior

I would expect the Notebook Job to successfully start and execute.

OS and architecture in which you are running Nebari

AWS

How to Reproduce the problem?

I am deploying Nebari from a1e5fd1 to AWS, using the following config file:

provider: aws
namespace: dev
nebari_version: 2024.1.2.dev44+ga1e5fd1e
project_name: release-test
domain: some.domain.dev
ci_cd:
  type: none
terraform_state:
  type: remote
security:
  keycloak:
    initial_root_password: extra-secure-password
  authentication:
    type: password
theme:
  jupyterhub:
    hub_title: Nebari - release-test
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted on Amazon Web Services
amazon_web_services:
  kubernetes_version: '1.26'
  region: us-west-2
monitoring:
  enabled: true
argo_workflows:
  enabled: true

Command output

No response

Versions and dependencies used.

No response

Compute environment

AWS

Integrations

Argo

Anything else?

The environment I am using to submit the Notebook Job has papermill installed, and Argo is enabled. This also happens with other environments that have papermill installed.

@marcelovilla marcelovilla added type: bug 🐛 Something isn't working needs: triage 🚦 Someone needs to have a look at this issue and triage labels Feb 23, 2024
@marcelovilla marcelovilla added needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug area: integration/Argo block-release ⛔️ Must be completed for release and removed needs: triage 🚦 Someone needs to have a look at this issue and triage labels Feb 23, 2024
@krassowski
Member

I am getting "There is no kernel associated with the notebook. Please open the notebook, select a kernel, and re-submit the job to execute." semi-reliably.

@dharhas
Member

dharhas commented Feb 27, 2024

HTTPConnectionPool(host='nebari-conda-store-server.dev.svc', port=5000): Max retries exceeded with url: /conda-store/api/v1/environment (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fa01067de40>: Failed to resolve 'nebari-conda-store-server.dev.svc' ([Errno -2] Name or service not known)"))

This is the error I see on the JATIC deployment when I try to run a job now.
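A quick way to tell a DNS failure apart from an HTTP-level error is to probe the endpoint from inside a user pod. A minimal sketch, with the host, port, and path copied from the error above:

import requests

# Probe the conda-store API endpoint referenced in the error above.
url = "http://nebari-conda-store-server.dev.svc:5000/conda-store/api/v1/environment"
try:
    resp = requests.get(url, timeout=5)
    print(resp.status_code)  # service reachable; the problem is elsewhere
except requests.exceptions.ConnectionError as err:
    # requests wraps urllib3's NameResolutionError in a ConnectionError
    print(f"DNS/connection failure: {err}")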

@marcelovilla
Member Author

@nkaretnikov do you have any ideas? I know you did some major changes to https://github.com/nebari-dev/argo-jupyter-scheduler last month.

I'm not sure it's related to those changes, though. We used the latest argo-jupyter-scheduler for Nebari and the notebook job submission worked fine.

@nkaretnikov

@marcelovilla That repo has no hardcoded references to the papermill environment. argo-jupyter-scheduler just needs papermill to be available in some environment. Also, the error it would throw if papermill were missing wouldn't look like this. So it's some issue with environments in that deployment.

@marcelovilla
Member Author

Checking the logs in the user pod, I'm seeing the following error:

notebook Traceback (most recent call last):
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/executors.py", line 60, in process
notebook     self.execute()
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/executors.py", line 140, in execute
notebook     ep.preprocess(nb)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/nbconvert/preprocessors/execute.py", line 96, in preprocess
notebook     with self.setup_kernel():
notebook   File "/opt/conda/envs/default/lib/python3.10/contextlib.py", line 135, in __enter__
notebook     return next(self.gen)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/nbclient/client.py", line 596, in setup_kernel
notebook     self.start_new_kernel(**kwargs)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 165, in wrapped
notebook     return loop.run_until_complete(inner)
notebook   File "/opt/conda/envs/default/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
notebook     return future.result()
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/nbclient/client.py", line 546, in async_start_new_kernel
notebook     await ensure_async(self.km.start_kernel(extra_arguments=self.extra_arguments, **kwargs))
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 198, in ensure_async
notebook     result = await obj
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 96, in wrapper
notebook     raise e
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 87, in wrapper
notebook     out = await method(self, *args, **kwargs)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 435, in _async_start_kernel
notebook     kernel_cmd, kw = await self._async_pre_start_kernel(**kw)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 397, in _async_pre_start_kernel
notebook     self.kernel_spec,
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/manager.py", line 195, in kernel_spec
notebook     self._kernel_spec = self.kernel_spec_manager.get_kernel_spec(self.kernel_name)
notebook   File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_client/kernelspec.py", line 285, in get_kernel_spec
notebook     raise NoSuchKernel(kernel_name)
notebook jupyter_client.kernelspec.NoSuchKernel: No such kernel named conda-env-global-global-papermill-py

I checked another deployment with the latest version of Nebari (2024.1.1), where this works, and the jupyter_scheduler version is slightly different. There are, however, also differences in the jupyterlab versions.

I'll need to investigate this further.
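One way to see what the job environment actually finds, as a sketch (assuming it is run inside the user pod with the default environment active):

from jupyter_client.kernelspec import KernelSpecManager, NoSuchKernel

# List the kernelspec names Jupyter can currently see, then try to resolve
# the failing name; an unknown name raises NoSuchKernel, as in the traceback.
ksm = KernelSpecManager()
print(sorted(ksm.find_kernel_specs()))

try:
    ksm.get_kernel_spec("conda-env-global-global-papermill-py")
except NoSuchKernel as err:
    print(f"kernel not visible: {err}")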

@krassowski
Member

krassowski commented Mar 3, 2024

I think I understand the issue that I was running into: when creating a notebook, but before manually saving it, the kernelspec metadata is not there, hence the kernel cannot be found. Manually saving the notebook solves the problem. I think this is an upstream issue, or at the very least a quite misleading error message:

[screen recording: a newly created notebook needs to be explicitly saved before job submission works]

Server log in the details below:

[E 2024-03-03 09:59:26.166 SchedulerApp] There is no kernel associated with the notebook. Please open
                    the notebook, select a kernel, and re-submit the job to execute.
    Traceback (most recent call last):
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/handlers.py", line 233, in post
        job_id = await ensure_async(self.scheduler.create_job(CreateJob(**payload)))
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/scheduler.py", line 380, in create_job
        raise SchedulerError(
    jupyter_scheduler.exceptions.SchedulerError: There is no kernel associated with the notebook. Please open
                    the notebook, select a kernel, and re-submit the job to execute.

[W 2024-03-03 09:59:26.166 SchedulerApp] wrote error: 'There is no kernel associated with the notebook. Please open\n                    the notebook, select a kernel, and re-submit the job to execute.\n                    '
    Traceback (most recent call last):
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/handlers.py", line 233, in post
        job_id = await ensure_async(self.scheduler.create_job(CreateJob(**payload)))
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/scheduler.py", line 380, in create_job
        raise SchedulerError(
    jupyter_scheduler.exceptions.SchedulerError: There is no kernel associated with the notebook. Please open
                    the notebook, select a kernel, and re-submit the job to execute.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/opt/conda/envs/default/lib/python3.10/site-packages/tornado/web.py", line 1790, in _execute
        result = await result
      File "/opt/conda/envs/default/lib/python3.10/site-packages/jupyter_scheduler/handlers.py", line 245, in post
        raise HTTPError(500, str(e)) from e
    tornado.web.HTTPError: HTTP 500: Internal Server Error (There is no kernel associated with the notebook. Please open
                        the notebook, select a kernel, and re-submit the job to execute.
                        )

In the beginning, the notebook is created as an empty shell on disk:

{
 "cells": [],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}

After the kernel connects, it provides metadata; that metadata is then saved when the user saves the notebook for the first time:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27a3458b-29d2-4b29-9695-868f409cd12f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
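A minimal check for that missing metadata, as a sketch (the notebook path is illustrative):

import nbformat

# Read the notebook and look for the kernelspec name that jupyter-scheduler
# needs; the first JSON above has no such entry, the second one does.
nb = nbformat.read("Untitled.ipynb", as_version=4)  # illustrative path
kernel_name = nb.metadata.get("kernelspec", {}).get("name")
if kernel_name is None:
    print("no kernelspec yet; save the notebook once so it gets written")
else:
    print(f"kernelspec: {kernel_name}")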

@krassowski
Member

I am able to reproduce the issue with the global-global-papermill-py environment, with the same error message as @marcelovilla reported.

Since neither that, nor the behaviour of the kernel not being found before the notebook is saved (as reported in my previous comment), occurs in the previous (stable) Nebari version, I would think this is a change in jupyter-scheduler or argo-jupyter-scheduler.

  • The stable version has jupyter-scheduler 2.5.0 and argo-jupyter-scheduler 2024.1.3
  • The test version has jupyter-scheduler 2.5.1 and argo-jupyter-scheduler 2024.1.3
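To compare deployments, a quick version check, as a sketch (assuming these are the installed distribution names; run it in each user environment):

from importlib.metadata import version

# Print the versions of the packages relevant to this regression.
for pkg in ("jupyter-scheduler", "argo-jupyter-scheduler", "jupyterlab"):
    print(pkg, version(pkg))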

@krassowski
Member

Last week I mentioned it is hitting a code path that it has no right to hit if argo-jupyter-scheduler is used. Here is why:

  1. The error I saw is raised from the create_job() method in upstream jupyter-scheduler when the validate() method of the Execution Manager returns False
  2. The default implementation of the Execution Manager indeed returns False whenever it cannot read the notebook and find a kernelspec in it (so the issue with being unable to run newly created notebooks is indeed inherent to the default jupyter-scheduler)
  3. But in argo-jupyter-scheduler this validation is disabled and always returns True (see the sketch below)
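Here is the sketch mentioned in point 3: an illustrative subclass (not the actual argo-jupyter-scheduler source) showing how the validation is effectively disabled:

from jupyter_scheduler.executors import DefaultExecutionManager

class AlwaysValidExecutionManager(DefaultExecutionManager):
    # Upstream's validate() reads the notebook and returns False when no
    # kernelspec metadata is present; returning True unconditionally skips
    # that check, so create_job() should never raise this error.
    @classmethod
    def validate(cls, input_path: str) -> bool:
        return True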

The conclusion is that, in the test environment with the current pre-release, the configuration that should be setting the Scheduler.execution_manager_class traitlet must have no effect or not be passed to the server correctly. In Nebari it is defined in:

from argo_jupyter_scheduler.executor import ArgoExecutor
from argo_jupyter_scheduler.scheduler import ArgoScheduler
c.Scheduler.execution_manager_class=ArgoExecutor
c.SchedulerApp.scheduler_class=ArgoScheduler
c.SchedulerApp.scheduler_class.use_conda_store_env=True

@krassowski
Member

Ok, so #2251 by myself is to blame. The very head of the log contains:

    Traceback (most recent call last):
      File "/opt/conda/envs/default/lib/python3.10/site-packages/traitlets/config/application.py", line 915, in _load_config_files
        config = loader.load_config()
      File "/opt/conda/envs/default/lib/python3.10/site-packages/traitlets/config/loader.py", line 622, in load_config
        self._read_file_as_dict()
      File "/opt/conda/envs/default/lib/python3.10/site-packages/traitlets/config/loader.py", line 655, in _read_file_as_dict
        exec(compile(f.read(), conf_filename, "exec"), namespace, namespace)  # noqa: S102
      File "/etc/jupyter/jupyter_server_config.py", line 13
        preferred_dir =
                        ^
    SyntaxError: invalid syntax

This is exactly the case I was making in the issue:

Another question is if an error during configuration readout should prevent the jupyter server from starting up in the first place. I would say that yes, it should because the configuration may contain additional opt-in security measures, and if those are not active the security guarantees may not be met.
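That failure mode is easy to reproduce in isolation. A minimal standalone sketch using traitlets directly, with the file content mirroring the broken line from the log:

from pathlib import Path
from traitlets.config.loader import PyFileConfigLoader

# A syntactically invalid traitlets Python config file makes load_config()
# raise, so none of the settings in that file take effect, including
# Scheduler.execution_manager_class.
conf = Path("jupyter_server_config.py")
conf.write_text("preferred_dir =\n")  # mirrors the broken line from the log

try:
    PyFileConfigLoader(str(conf)).load_config()
except SyntaxError as err:
    print(f"config file skipped entirely: {err}")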
