
lakeFS S3 gateway behaviour when accessed from R #5441

Closed
rmoff opened this issue Mar 10, 2023 · 11 comments · Fixed by #5447
Labels: area/gateway, area/integrations, bug, team/versioning-engine

Comments


rmoff commented Mar 10, 2023

I'm trying to use the S3 gateway from R, but not having much luck.

  • It connects without error, but returns no objects when listing 'buckets' or objects within a specified bucket.
  • The same lakeFS server works fine with boto and the aws CLI doing the same thing.
  • The lakeFS server log for the HTTP interaction is the same for R (doesn't work) and Python (does work).
  • The R code works fine listing buckets and objects against MinIO.

R library and S3 HTTP code

AWS CLI

aws configure
AWS Access Key ID [****************MPLE]:
AWS Secret Access Key [****************EKEY]:
Default region name [None]:
Default output format [None]:
aws s3api list-buckets --endpoint-url http://127.0.0.1:8000
{
    "Buckets": [
        {
            "Name": "example",
            "CreationDate": "2023-03-07T14:08:35.111000+00:00"
        }
    ],
    "Owner": {
        "DisplayName": "",
        "ID": ""
    }
}

List buckets from boto

import boto3

AWS_ACCESS_KEY_ID='AKIAJNSOLJH5YUW4KLBQ'
AWS_SECRET_ACCESS_KEY='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'

endpoint = 'http://localhost:8000'

session = boto3.Session(
                aws_access_key_id=AWS_ACCESS_KEY_ID,
                aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
           )
s3 = session.resource('s3', endpoint_url=endpoint)

print(list(s3.buckets.all()))

Output

[s3.Bucket(name='test-repo')]

Server log

{
    "action": "list_repos",
    "file": "build/pkg/gateway/middleware.go:124",
    "func": "pkg/gateway.EnrichWithOperation.func1.1",
    "level": "debug",
    "message_type": "action",
    "msg": "performing S3 action",
    "ref": "",
    "repository": "",
    "time": "2023-03-07T13:47:17Z",
    "user_id": "admin"
}
{
    "client": "Boto3/1.26.78 Python/3.11.2 Darwin/22.2.0 Botocore/1.29.78 Resource",
    "file": "usr/local/go/src/net/http/server.go:2109",
    "func": "net/http.HandlerFunc.ServeHTTP",
    "host": "localhost:8000",
    "level": "debug",
    "log_audit": true,
    "method": "GET",
    "msg": "HTTP call ended",
    "operation_id": "list_buckets",
    "path": "/",
    "request_id": "81fb3769-dd20-468d-a889-a4cd8d43b7c2",
    "sent_bytes": 250,
    "service_name": "s3_gateway",
    "source_ip": "172.17.0.1:58854",
    "status_code": 200,
    "time": "2023-03-07T13:47:17Z",
    "took": 644584,
    "user": "admin"
}

List buckets from R

# Install packages
install.packages(c("aws.s3"))

# Load necessary libraries
library(aws.s3)
library(data.table)
library(ggplot2)

# Define AWS S3 credentials and endpoint
Sys.setenv("AWS_ACCESS_KEY_ID" = "AKIAIOSFODNN7EXAMPLE",
           "AWS_SECRET_ACCESS_KEY" = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
baseurl="localhost:8000"
b="drones03"

# List buckets
bucketlist(add_region = FALSE,base_url=baseurl,verbose=FALSE,use_https=FALSE)
# List objects in bucket
get_bucket_df(bucket=b, base_url=baseurl,add_region=FALSE,verbose=FALSE,use_https=FALSE)

Output: Note that nothing is returned for either command

> # List buckets
> bucketlist(add_region = FALSE,base_url=baseurl,verbose=FALSE,use_https=FALSE)
data frame with 0 columns and 0 rows

> # List objects in bucket
> get_bucket_df(bucket=b, base_url=baseurl,add_region=FALSE,verbose=FALSE,use_https=FALSE)
[1] Key               LastModified      ETag              Size             
[5] Owner_ID          Owner_DisplayName StorageClass      Bucket           
<0 rows> (or 0-length row.names)

Server log

{
    "action": "list_repos",
    "file": "build/pkg/gateway/middleware.go:124",
    "func": "pkg/gateway.EnrichWithOperation.func1.1",
    "level": "debug",
    "message_type": "action",
    "msg": "performing S3 action",
    "ref": "",
    "repository": "",
    "time": "2023-03-07T13:49:28Z",
    "user_id": "admin"
}
{
    "client": "libcurl/7.85.0 r-curl/5.0.0 httr/1.4.5",
    "file": "usr/local/go/src/net/http/server.go:2109",
    "func": "net/http.HandlerFunc.ServeHTTP",
    "host": "localhost:8000",
    "level": "debug",
    "log_audit": true,
    "method": "GET",
    "msg": "HTTP call ended",
    "operation_id": "list_buckets",
    "path": "/",
    "request_id": "3c5df6e2-0a0e-49bc-9d74-c3fd3dfcccde",
    "sent_bytes": 250,
    "service_name": "s3_gateway",
    "source_ip": "172.17.0.1:52904",
    "status_code": 200,
    "time": "2023-03-07T13:49:28Z",
    "took": 307500,
    "user": "admin"
}

Prove R code works against another S3 implementation

# Define Minio credentials
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin",
           "AWS_SECRET_ACCESS_KEY" = "minioadmin")
baseurl="localhost:9000"
b="example"

# List buckets
bucketlist(add_region = FALSE,base_url=baseurl,verbose=FALSE,use_https=FALSE)
# List objects in bucket
get_bucket_df(bucket=b, base_url=baseurl,add_region=FALSE,verbose=FALSE,use_https=FALSE)

Output

> # List buckets
> bucketlist(add_region = FALSE,base_url=baseurl,verbose=FALSE,use_https=FALSE)
   Bucket             CreationDate
1 example 2023-03-07T14:08:18.427Z

> # List objects in bucket
> get_bucket_df(bucket=b, base_url=baseurl,add_region=FALSE,verbose=FALSE,use_https=FALSE)
    Key             LastModified                               ETag Size
1 dummy 2023-03-07T14:08:35.103Z "18ab9e2980f7223750c5ed4833f45dab"   70
                                                          Owner_ID
1 02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4
  Owner_DisplayName StorageClass  Bucket
1             minio     STANDARD example
> 
@rmoff rmoff added the area/gateway Changes to the gateway label Mar 10, 2023

rmoff commented Mar 10, 2023

Notes from @nopcoder on Slack:

it seems that the issue is related to the content type we return for the s3 protocol
R checks for "application/xml":

    ctype <- httr::headers(r)[["content-type"]]
    if (is.null(ctype) || ctype == "application/xml"){

But currently for this request we return text/xml.

@arielshaqed
Contributor

I added this code to fix #987. That was a similar problem with the Akka client -- but at least it did send us correct headers. It seems the R client is broken and requests a content-type it cannot handle.

We could prefer application/xml to text/xml when both are possible. Given that S3 seems to prefer application/xml, that should work.

On a personal note... I dream of a protocol. Its spec will look exactly like HTTP, but clients and servers will actually follow that spec. And not request things in headers that they cannot follow.
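The preference described above can be sketched as a small content-negotiation helper. This is a hypothetical illustration of the proposed direction, not lakeFS's actual gateway code: when a client's Accept header permits both XML media types, prefer `application/xml` (what S3 itself returns, and what the R aws.s3 library checks for) over `text/xml`.

```python
# Hypothetical sketch: pick the XML content type for an S3-style
# response based on the client's Accept header, preferring
# "application/xml" whenever the client can accept it.

def pick_xml_content_type(accept_header: str) -> str:
    """Return the XML media type to use for an S3-style response."""
    # Strip quality parameters like ";q=0.9" and surrounding whitespace.
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    if "application/xml" in accepted or "*/*" in accepted or "application/*" in accepted:
        return "application/xml"
    if "text/xml" in accepted:
        return "text/xml"
    # Default to what S3 itself prefers.
    return "application/xml"

print(pick_xml_content_type("text/xml, application/xml"))  # application/xml
print(pick_xml_content_type("text/xml"))                   # text/xml
```

With this rule a client that only advertises `text/xml` still gets it, while clients like aws.s3 that expect `application/xml` are satisfied.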

@nopcoder nopcoder self-assigned this Mar 12, 2023
@nopcoder nopcoder added the bug Something isn't working label Mar 12, 2023

rmoff commented Mar 28, 2023

Listing buckets now works
✅ Testing for object presence works
❌ Listing bucket contents doesn't work (returns zero rows).

See this notebook for test code and working example against minio:
https://gist.github.com/rmoff/c9337b4f4138f5121676351a9f603985

@rmoff rmoff reopened this Mar 28, 2023
@eladlachmi eladlachmi added the team/versioning-engine Team versioning engine label Apr 5, 2023

rmoff commented Jul 3, 2023

@nopcoder @arielshaqed I'm revisiting this; any idea why the object listing against lakeFS wouldn't work when the same code is fine against MinIO?


nopcoder commented Jul 3, 2023

Last time, if I remember correctly, it was related to content type. Will need to retest to see if there is an issue with the list response.


rmoff commented Jul 3, 2023

I did a bit of digging and it looks like this is different from the content-type issue: the payload itself differs from MinIO's for an object underneath an otherwise-empty base path.

Here's MinIO's response for a repo that has a single object main/Action_5.png:

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Name>test</Name>
    <Prefix></Prefix>
    <Marker></Marker>
    <MaxKeys>1000</MaxKeys>
    <Delimiter></Delimiter>
    <IsTruncated>false</IsTruncated>
    <Contents>
        <Key>main/Action_5.png</Key>
        <LastModified>2023-07-03T15:58:49.257Z</LastModified>
        <ETag>&#34;0130d0c155d312ea8214f1641062ce99&#34;</ETag>
        <Size>5696</Size>
        <Owner>
            <ID>02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4</ID>
            <DisplayName>minio</DisplayName>
        </Owner>
        <StorageClass>STANDARD</StorageClass>
    </Contents>
</ListBucketResult>

Here's lakeFS':

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult>
    <Name>quickstart</Name>
    <IsTruncated>false</IsTruncated>
    <Prefix></Prefix>
    <KeyCount>0</KeyCount>
    <MaxKeys>1000</MaxKeys>
    <CommonPrefixes>
        <Prefix>main/</Prefix>
    </CommonPrefixes>
    <Marker></Marker>
</ListBucketResult>

So the main/ path is returned, but evidently the R library does not handle CommonPrefixes as expected.
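The common prefixes really are present in the payload; parsing the lakeFS response shown above with Python's standard library (a sketch, reproducing the namespace-free XML from the lakeFS reply) makes that explicit:

```python
import xml.etree.ElementTree as ET

# The (namespace-free) ListBucketResult that lakeFS returned above,
# as bytes so the XML declaration's encoding attribute is accepted.
lakefs_response = b"""<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult>
    <Name>quickstart</Name>
    <IsTruncated>false</IsTruncated>
    <KeyCount>0</KeyCount>
    <MaxKeys>1000</MaxKeys>
    <CommonPrefixes>
        <Prefix>main/</Prefix>
    </CommonPrefixes>
</ListBucketResult>"""

root = ET.fromstring(lakefs_response)
keys = [c.findtext("Key") for c in root.findall("Contents")]
prefixes = [p.text for p in root.findall("CommonPrefixes/Prefix")]
print(keys)      # []
print(prefixes)  # ['main/']
```

There are no `Contents` entries at all at the repository level, only the `main/` common prefix, so any client that ignores CommonPrefixes sees an empty listing.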


rmoff commented Jul 3, 2023

Found this, which references some alternative code that does work:

as.data.frame(attributes(get_bucket(base_url=baseurl,
    bucket="quickstart",
    use_https=FALSE,
    region="",
    verbose=FALSE))$CommonPrefixes)
<chr>
main/

The question is, should lakeFS be returning a structure similar to MinIO?


rmoff commented Jul 4, 2023

Re-reading the linked issue I found that prefix is my friend here:

get_bucket_df(
    base_url=baseurl,
    bucket="quickstart",
    use_https=FALSE, 
    prefix="main/",
    region="",
    verbose=FALSE)

Returns a dataframe with all the keys, as expected:

Key LastModified ETag Size Owner_ID Owner_DisplayName StorageClass Bucket
main/README.md 2023-07-03T16:33:14.918Z "c8a53d0a2e4cb2ebcffecc53b21faff5" 21806 NA NA STANDARD quickstart
main/_lakefs_actions/pre-commit-metadata-validation.yaml 2023-07-03T16:33:14.936Z "55320f7909fb6c9f91745addbad1e2fa" 308 NA NA STANDARD quickstart
main/_lakefs_actions/pre-merge-format-validation.yaml 2023-07-03T16:33:14.948Z "938502ef903ed61a627775c1ba6d5acb" 462 NA NA STANDARD quickstart
main/data/lakes.source.md 2023-07-03T16:33:14.958Z "20aea0a00ecac19808bf552362a4568d" 531 NA NA STANDARD quickstart
main/images/axolotl.png 2023-07-03T16:33:14.969Z "683a2dc83030e4a0de9f1027944057db" 3823 NA NA STANDARD quickstart
main/images/commit-change-02.png 2023-07-03T16:33:14.979Z "8903a13069299ca842ee8c40c8e8b518" 64400 NA NA STANDARD quickstart
main/images/commit-change.png 2023-07-03T16:33:14.990Z "0fc2f8a0fe17726a737dff0657c704b2" 98801 NA NA STANDARD quickstart
[…]

The open questions

  • Why do lakeFS and MinIO behave differently?
  • Is lakeFS correct to require a prefix for it to return any keys?


nopcoder commented Jul 7, 2023

@rmoff I think I understand the issue, but I'll try to explain first.

When listing objects in lakeFS there is a special case, which is the repository/bucket level. At this level we list branches, and listing branches doesn't support recursive listing unless you list something under a branch. The MinIO output above does include the objects, as in a recursive list, but that's because at this level lakeFS has a branch and MinIO doesn't. We get no output in R at the bucket level because we return a common prefix (folder) for each branch in the repository.

About requiring a prefix - we should behave like S3, but in this case lakeFS can't list objects without the branch being specified. So I think we don't have a choice, unless we want to support such a feature - recursive listing at the repository level.
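The behaviour described here matches plain S3 delimiter semantics: with an empty prefix, everything past the first `/` rolls up into CommonPrefixes, so branches appear as "folders" and no objects are returned. A minimal offline sketch of those semantics (the key names are hypothetical):

```python
def list_objects(keys, prefix="", delimiter="/"):
    """Simulate S3-style delimited listing: keys matching the prefix are
    split at the first delimiter after it; keys with anything beyond the
    delimiter are rolled up into a common prefix (a "folder")."""
    contents, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)

keys = ["main/README.md", "main/data/lakes.source.md", "dev/notes.txt"]

# Repository level: only branch "folders" come back, no objects --
# which is why get_bucket_df() with no prefix shows zero rows.
print(list_objects(keys, prefix=""))       # ([], ['dev/', 'main/'])

# With prefix="main/" the objects under that branch are returned.
print(list_objects(keys, prefix="main/"))  # (['main/README.md'], ['main/data/'])
```

This is why supplying `prefix="main/"` to `get_bucket_df()` is the right fix on the client side.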


rmoff commented Jul 10, 2023

OK that makes sense, thanks @nopcoder.
I think the best option is to add this info to the R docs page that I'm writing and close this issue as expected behaviour. Do you agree?

@rmoff rmoff changed the title lakeFS S3 gateway doesn't work with R lakeFS S3 gateway behaviour when accessed from R Jul 10, 2023

rmoff commented Jul 10, 2023

Added note to R doc. Closing.

@rmoff rmoff closed this as completed Jul 10, 2023