Issue trying to hit Nessie Catalog #1680

Open
3 tasks
adamcodes716 opened this issue Feb 18, 2025 · 8 comments

Comments

@adamcodes716

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

When trying to use this functionality, there is a ton of confusion around the internet. A similar question was recently posted, but the OP's working configuration was never shared. I am posting my question here because of the confusion around the URL that the code generates.

I have this code:

from pyiceberg.catalog import load_catalog

warehouse_path = "s3://warehouse"

try:
    catalog = load_catalog(
        "rest",
        **{
            "uri": "http://localhost:19120/iceberg",  # Nessie Server URI
            "warehouse": warehouse_path,
            "s3.endpoint": "http://localhost:9000",
            "s3.path-style-access": "true",
            "s3.access-key-id": "admin",
            "s3.secret-access-key": "password",
            "supportedAPIVersion": 2,
        },
    )
except Exception:
    # Handler was not included in the post; re-raise to surface the HTTPError
    raise

Why is this URL being generated?

  File "C:\Projects\Project\myenv\Lib\site-packages\requests\models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:19120/iceberg/v1/config?warehouse=s3%3A%2F%2Fwarehouse

The /api/v2/config URL works fine:

{
  "defaultBranch" : "main",
  "minSupportedApiVersion" : 1,
  "maxSupportedApiVersion" : 2,
  "actualApiVersion" : 2,
  "specVersion" : "2.1.0",
  "noAncestorHash" : "2e1cfa82b035c26cbbbdae632cea070514eb8b773f616aaeaf668e2f0be8f10d",
  "repositoryCreationTimestamp" : "2025-02-18T15:47:13.045523855Z",
  "oldestPossibleCommitTimestamp" : "2025-02-18T15:47:13.045523855Z"
}

Like many others, I am a bit confused as to how this all ties together.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

@kevinjqliu
Contributor

The solution to #1524 is here, #1524 (comment)

Otherwise, this is more of a Nessie-centric question, and I'm not familiar with Nessie. Could you also ask on the Nessie project?

> When trying to use this functionality, there is a ton of confusion around the internet.

It would be great to mention this to the Nessie project as well, to improve the documentation around integration with PyIceberg.

@adamcodes716
Author

adamcodes716 commented Feb 18, 2025

> The solution to #1524 is here, #1524 (comment)
>
> Otherwise, this is more of a Nessie-centric question, and I'm not familiar with Nessie. Could you also ask on the Nessie project?
>
>> When trying to use this functionality, there is a ton of confusion around the internet.
>
> It would be great to mention this to the Nessie project as well, to improve the documentation around integration with PyIceberg.

Thank you so much for the reply. Yes, I even referred to that issue in my original post. Unfortunately, the OP did not post his full solution; others asked him to do so at the bottom of the thread.
I think the question at the heart of this is: why is this URL being generated using load_catalog? The "/iceberg" shouldn't be part of the URL.

http://localhost:19120/iceberg/v1/config?warehouse=s3%3A%2F%2Fwarehouse

I think the problem so many people are having is finding the right combination of code and settings to get all of this to work. I do have a request open with the Nessie project, but they often point back here because load_catalog isn't theirs.

@kevinjqliu
Contributor

> Why is this URL being generated using load_catalog? The "/iceberg" shouldn't be part of the URL.

The PyIceberg REST client takes the uri passed to load_catalog as the base URL and calls the Iceberg REST endpoints relative to it. In the failing request, http://localhost:19120/iceberg/v1/config?warehouse=s3%3A%2F%2Fwarehouse, the http://localhost:19120/iceberg part comes from the configured uri, /v1/config is a standard Iceberg REST endpoint, and the warehouse query parameter comes from the warehouse setting.

For any uri given to load_catalog, pyiceberg's rest client will call {uri}/v1/config.
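
The rule above can be sketched in a few lines of Python. This is only an illustration of the URL construction, not PyIceberg's actual implementation; the helper name `config_url` is hypothetical, and trimming a trailing slash from the base uri is an assumption made here for tidiness.

```python
from typing import Optional
from urllib.parse import urlencode


def config_url(base_uri: str, warehouse: Optional[str] = None) -> str:
    """Hypothetical helper: append the standard /v1/config path to a base uri."""
    # Assumption: a trailing slash on the base uri is dropped before joining.
    url = base_uri.rstrip("/") + "/v1/config"
    if warehouse is not None:
        # urlencode percent-encodes ":" and "/" in the warehouse value.
        url += "?" + urlencode({"warehouse": warehouse})
    return url


print(config_url("http://localhost:19120/iceberg", "s3://warehouse"))
# http://localhost:19120/iceberg/v1/config?warehouse=s3%3A%2F%2Fwarehouse
```

The printed URL matches the one in the 404 error, which suggests the client is doing what it is documented to do and the open question is which base uri the server actually serves the Iceberg REST API under.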

> I think the problem so many people are having is finding the right combination of code and settings to get all of this to work. I do have a request open with the Nessie project, but they often point back here because load_catalog isn't theirs.

Yeah, I've seen different issues about connecting to Nessie. I'm actually confused by the docs.
Here the docs mention that https://localhost:19120/api/v2 is the URI for the REST endpoint. But here the URI is http\://localhost\:33405/iceberg/

@adamcodes716
Author

>> Why is this URL being generated using load_catalog? The "/iceberg" shouldn't be part of the URL.
>
> The PyIceberg REST client takes the uri passed to load_catalog as the base URL and calls the Iceberg REST endpoints relative to it. In the failing request, http://localhost:19120/iceberg/v1/config?warehouse=s3%3A%2F%2Fwarehouse, the http://localhost:19120/iceberg part comes from the configured uri, /v1/config is a standard Iceberg REST endpoint, and the warehouse query parameter comes from the warehouse setting.
>
> For any uri given to load_catalog, pyiceberg's rest client will call {uri}/v1/config.
>
>> I think the problem so many people are having is finding the right combination of code and settings to get all of this to work. I do have a request open with the Nessie project, but they often point back here because load_catalog isn't theirs.
>
> Yeah, I've seen different issues about connecting to Nessie. I'm actually confused by the docs. Here the docs mention that https://localhost:19120/api/v2 is the URI for the REST endpoint. But here the URI is http\://localhost\:33405/iceberg/

Thanks for the reply! So I've been down the road of using /api instead of /iceberg. I got a 200 on the connection, but I was unable to get past these errors:

Field required [type=missing, input_value={'defaultBranch': 'main',...SupportedApiVersion': 2}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.6/v/missing

Others had the same issue, and then apparently solved it by using /iceberg. And round and round we go. Have you successfully configured Nessie?
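
One plausible reading of the validation error, offered as a sketch rather than a confirmed diagnosis: the Iceberg REST specification describes the /v1/config response as an object with `defaults` and `overrides` maps, while the Nessie API response shown earlier in this thread contains neither key. The helper `missing_config_fields` below is hypothetical and not part of PyIceberg; it only checks a payload for those two keys.

```python
from typing import Any, Dict, List

# Keys the Iceberg REST spec lists for a /v1/config response. Assumption: these
# are the fields the client-side validation is complaining about.
REQUIRED_CONFIG_FIELDS = ("defaults", "overrides")


def missing_config_fields(payload: Dict[str, Any]) -> List[str]:
    """Return the required config keys that are absent from the payload."""
    return [key for key in REQUIRED_CONFIG_FIELDS if key not in payload]


# Abbreviated copy of the /api/v2/config response quoted in this thread.
nessie_api_v2_config = {
    "defaultBranch": "main",
    "minSupportedApiVersion": 1,
    "maxSupportedApiVersion": 2,
    "actualApiVersion": 2,
    "specVersion": "2.1.0",
}

# Minimal payload in the shape the Iceberg REST spec describes.
iceberg_style_config = {"defaults": {}, "overrides": {}}

print(missing_config_fields(nessie_api_v2_config))  # ['defaults', 'overrides']
print(missing_config_fields(iceberg_style_config))  # []
```

If that reading is right, a 200 from the /api path only shows that the Nessie native API answered; it does not mean the endpoint speaks the Iceberg REST protocol that load_catalog expects.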

@kevinjqliu
Contributor

I have not. I'm surprised that there's no tutorial or blog post from the Nessie side showing integration with PyIceberg. Maybe it's a good idea to open an issue with the Nessie project to figure this out.

@adamcodes716
Author

> I have not. I'm surprised that there's no tutorial or blog post from the Nessie side showing integration with PyIceberg. Maybe it's a good idea to open an issue with the Nessie project to figure this out.

Thank you again for your response. I am trying to understand why you think that this is a Nessie issue. Their endpoints seem to be pretty straightforward and consistent (if I am wrong, please set me straight). It seems that the struggle is in understanding how and why methods like "load_catalog" work. People come to this forum trying to understand it, and they are regularly referred to posts where a full answer was never posted. Clearly some people have this working, but it is also clear that a ton of other people are struggling with the same thing and are hoping to find an answer.

@kevinjqliu
Contributor

PyIceberg's REST client supports any conforming Iceberg REST server. Using load_catalog with Nessie's Iceberg REST endpoint should just work. load_catalog can take a number of configuration parameters such as the REST configuration here

This might be a configuration issue, since #1524 was able to connect to Nessie via its REST endpoint. I can take a look and try to set up a Nessie server when I get some time.

The reason I want to defer to the Nessie project is that PyIceberg's REST client should work like any other REST client, such as Trino/Spark. So the same configuration used to set up a Nessie REST catalog in Spark should work for PyIceberg.

@adamcodes716
Author

> PyIceberg's REST client supports any conforming Iceberg REST server. Using load_catalog with Nessie's Iceberg REST endpoint should just work. load_catalog can take a number of configuration parameters such as the REST configuration here
>
> This might be a configuration issue, since #1524 was able to connect to Nessie via its REST endpoint. I can take a look and try to set up a Nessie server when I get some time.
>
> The reason I want to defer to the Nessie project is that PyIceberg's REST client should work like any other REST client, such as Trino/Spark. So the same configuration used to set up a Nessie REST catalog in Spark should work for PyIceberg.

It absolutely could be a setup issue; I don't know. I had no problem getting it to work with other catalog types.
