[Python] unexpected URL encoded path (white spaces) when uploading to S3 #34905

@svenatarms

Describe the bug, including details regarding any error messages, version, and platform.

Environment

OS: Windows/Linux
Python: 3.10.10
s3fs: from 2022.7.1 to 2023.3.0 (doesn't matter)
S3 Backend: MinIO / Ceph (doesn't matter)

Description

Version 11.0.0 of pyarrow introduced an unexpected behavior when uploading Parquet files to an S3 bucket (using s3fs.S3FileSystem) if the path to the Parquet file contains spaces. Each space is replaced by its URL-encoded form %20. For example, a directory name like:

product=My Fancy Product

becomes

product=My%20Fancy%20Product

on the S3 filesystem. NOTE: the equal sign = is URL-encoded as %3D for the request, but it does not end up as %3D on the S3 filesystem. That means the URL-encoded equal sign seems to be interpreted correctly.
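
The round trip I would expect can be sketched with the standard library. This is only an illustration: `urllib.parse.quote` stands in for whatever encoding the client applies to the request path, and `unquote` for the decoding done by the S3 backend. Neither is taken from the actual botocore or MinIO code paths.

```python
from urllib.parse import quote, unquote

# Partition directory name as it should appear on S3.
key = "product=My Fancy Product"

# Encoding for the HTTP request: '=' becomes %3D and ' ' becomes %20.
request_path = quote(key)
print(request_path)  # product%3DMy%20Fancy%20Product

# Decoding once on the server side restores the original name.
stored_key = unquote(request_path)
print(stored_key)  # product=My Fancy Product
```

This matches what the pyarrow 10.0.1 debug output shows: %3D and %20 in the request URL, and a plain key in the response.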

Example Code

import aiohttp
import pyarrow.parquet as pq
import s3fs
from pyarrow import ArrowTypeError, Table

# NOTE: delete_if_exists, InvalidDataFrame and S3ConnectionError are
# helpers from my own code base and are not shown here.


# s3fs FileSystem object
def return_s3filesystem(url, user, pw):
    fs = s3fs.S3FileSystem(
        anon=False,
        use_ssl=True,
        client_kwargs={
            "endpoint_url": url,
            "aws_access_key_id": user,
            "aws_secret_access_key": pw,
            "verify": False,
        }
    )
    return fs


def write_df_to_s3(df, partition_cols, path_to_s3_object, url, user, pw, more_than_one_date_per_file,
                   delete_parquet_files):
    '''
    Write Parquet file from pandas DataFrame to S3 bucket
    '''

    # instantiate s3fs.S3FileSystem object
    fs = return_s3filesystem(url, user, pw)
    # if the Parquet file already exists, delete it if requested, to prevent duplicated data
    delete_if_exists(fs, path_to_s3_object, df, more_than_one_date_per_file, delete_existing_files=delete_parquet_files)
    try:
        # create Arrow Table from DataFrame
        arrow_table = Table.from_pandas(df)
    except ArrowTypeError as e:
        # this is Error No. 1626701451158
        raise InvalidDataFrame(errorno=1626701451158, dataframe=df, arrowexception=e)
    except TypeError as e:
        raise InvalidDataFrame(errorno=1627657641211, dataframe=df, arrowexception=e)
    try:
        # write Parquet file to S3 bucket, using S3FileSystem object 'fs' from above.
        # Directories are created from partition_cols.
        pq.write_to_dataset(arrow_table,
                            path_to_s3_object,
                            partition_cols=partition_cols,
                            filesystem=fs,
                            use_dictionary=False,
                            data_page_size=100000,
                            compression="snappy",
                            version="2.0")
    except ArrowTypeError as e:
        raise InvalidDataFrame(errorno=1627575189, dataframe=df, arrowexception=e)
    except aiohttp.client_exceptions.ClientConnectionError as e:
        raise S3ConnectionError(errorno=1627575130, exmsg=e)

Example Result

Expected Result (using pyarrow 10.0.1)

Debug output
botocore.endpoint - DEBUG - Sending http request: <AWSPreparedRequest stream_output=False, method=POST, url=http://localhost:9000/my-products/product%3DMy%20Fancy%20Product/date%3D2023-01-05/0d5d1f2c5032472dbad1d17c845d5432-0.parquet?uploadId=NDBhYjllZDEtNWIxOC00ZTBlLWI4ODYtOGRhZjBhNTg3NzQ5LjYxNTFhMDBlLTQxMmQtNDQ5Ni05YjBjLTBiMGM3ODI3MzhkMg, headers={'User-Agent': b'Botocore/1.27.59 Python/3.10.10 Windows/10', 'X-Amz-Date': b'20230405T073129Z', 'X-Amz-Content-SHA256': b'41dccb632a0540f4f83eaf7138f97c5dd63c09410cbc3aa3412963b2f7006f18', 'Authorization': b'AWS4-HMAC-SHA256 Credential=******/*******/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=7d89128993d6a226d3ac4fa3e6adbb60f638f28c37265446284ca6d629c837f8', 'amz-sdk-invocation-id': b'832bc91b-c285-4413-ad3d-546a3bcefb59', 'amz-sdk-request': b'attempt=1', 'Content-Length': '357'}>
botocore.parsers - DEBUG - Response headers: HTTPHeaderDict({'accept-ranges': 'bytes', 'cache-control': 'no-cache', 'content-length': '471', 'content-security-policy': 'block-all-mixed-content', 'content-type': 'application/xml', 'etag': '"caca775951f07ca64f530aae539fe5cd-3"', 'server': 'MinIO', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'vary': 'Accept-Encoding', 'x-accel-buffering': 'no', 'x-amz-id-2': 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'x-amz-request-id': '1752F9747004F3E5', 'x-content-type-options': 'nosniff', 'x-xss-protection': '1; mode=block', 'date': 'Wed, 05 Apr 2023 07:31:29 GMT'})
botocore.parsers - DEBUG - Response body:
b'<?xml version="1.0" encoding="UTF-8"?>\n<CompleteMultipartUploadResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://localhost:9000/my-products/product=My%20Fancy%20Product/date=2023-01-05/0d5d1f2c5032472dbad1d17c845d5432-0.parquet</Location><Bucket>my-products</Bucket><Key>product=My Fancy Product/date=2023-01-05/0d5d1f2c5032472dbad1d17c845d5432-0.parquet</Key><ETag>&#34;caca775951f07ca64f530aae539fe5cd-3&#34;</ETag></CompleteMultipartUploadResult>'
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <function check_for_200_error at 0x0000023C49ACA3B0>
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <aiobotocore.retryhandler.AioRetryHandler object at 0x0000023C4D8B64D0>
botocore.retryhandler - DEBUG - No retry needed.
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <bound method AioS3RegionRedirector.redirect_from_error of <aiobotocore.utils.AioS3RegionRedirector object at 0x0000023C4D8B6590>>

Actual result (using pyarrow 11.0.0)

Debug output
botocore.endpoint - DEBUG - Sending http request: <AWSPreparedRequest stream_output=False, method=POST, url=http://localhost:9000/my-products/product%3DMy%2520Fancy%2520Product/date%3D2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet?uploadId=NDBhYjllZDEtNWIxOC00ZTBlLWI4ODYtOGRhZjBhNTg3NzQ5LjNlOGIyZmI4LWM4ZDEtNDU0ZS1iNjA0LWMxZjczNTI1NjhmZQ, headers={'User-Agent': b'Botocore/1.27.59 Python/3.10.10 Windows/10', 'X-Amz-Date': b'20230405T073854Z', 'X-Amz-Content-SHA256': b'316db9078636bc3acba7fc81ff32a5704c08a104bfaea7b5e15bf35db799e260', 'Authorization': b'AWS4-HMAC-SHA256 Credential=*****/*****/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=0721f578ded50c01c1a64c05d62c628fb35f0e9385ffd3ecfa45423940995a63', 'amz-sdk-invocation-id': b'5b5bc340-7f6b-48cc-bf2a-0860f8fa859b', 'amz-sdk-request': b'attempt=1', 'Content-Length': '357'}>
botocore.parsers - DEBUG - Response headers: HTTPHeaderDict({'accept-ranges': 'bytes', 'cache-control': 'no-cache', 'content-length': '479', 'content-security-policy': 'block-all-mixed-content', 'content-type': 'application/xml', 'etag': '"f44ab58edcc877c4d00075b9db28e4e5-3"', 'server': 'MinIO', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'vary': 'Accept-Encoding', 'x-accel-buffering': 'no', 'x-amz-id-2': 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'x-amz-request-id': '1752F9DC0E0CE8AD', 'x-content-type-options': 'nosniff', 'x-xss-protection': '1; mode=block', 'date': 'Wed, 05 Apr 2023 07:38:54 GMT'})
botocore.parsers - DEBUG - Response body:
b'<?xml version="1.0" encoding="UTF-8"?>\n<CompleteMultipartUploadResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://localhost:9000/my-products/product=My%2520Fancy%2520Product/date=2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet</Location><Bucket>my-products</Bucket><Key>product=My%20Fancy%20Product/date=2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet</Key><ETag>&#34;f44ab58edcc877c4d00075b9db28e4e5-3&#34;</ETag></CompleteMultipartUploadResult>'
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <function check_for_200_error at 0x00000207A8422710>
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <aiobotocore.retryhandler.AioRetryHandler object at 0x00000207ADA9EE30>
botocore.retryhandler - DEBUG - No retry needed.
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <bound method AioS3RegionRedirector.redirect_from_error of <aiobotocore.utils.AioS3RegionRedirector object at 0x00000207ADA9EEF0>>

The difference in the debug output is in the line following botocore.parsers - DEBUG - Response body:. In the XML, the <Key></Key> node contains a URL-encoded string with pyarrow 11.0.0, versus a "human readable" string with pyarrow 10.0.1. But the key is not consistently URL-encoded: as mentioned before, the equal sign = is interpreted as expected, while the spaces remain as %20. The request URL differs as well: it contains %2520 with pyarrow 11.0.0, versus %20 with pyarrow 10.0.1.

It seems that the URL encoding/decoding is not applied correctly?
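
One possible reading of the two logs is that, with pyarrow 11.0.0, the spaces are already replaced by %20 before the path reaches s3fs, so the request encoding is applied on top of that and the backend decodes only once. The sketch below reproduces the logged URL fragment under that assumption. `urllib.parse.quote` and `unquote` are stand-ins for the real client and server behavior, and `pre_encode_spaces` is a hypothetical helper for illustration, not a pyarrow function.

```python
from urllib.parse import quote, unquote


def pre_encode_spaces(segment: str) -> str:
    """Hypothetical stand-in for the extra encoding step: only spaces are touched."""
    return segment.replace(" ", "%20")


segment = "product=My Fancy Product"

# Assumed pyarrow 11.0.0 behavior: the path already contains %20.
path_11 = pre_encode_spaces(segment)

# Request encoding on top of that: '%' becomes %25 and '=' becomes %3D.
url_11 = quote(path_11)
print(url_11)  # product%3DMy%2520Fancy%2520Product

# A single decode on the server leaves %20 in the stored key.
key_11 = unquote(url_11)
print(key_11)  # product=My%20Fancy%20Product
```

Both printed values match the pyarrow 11.0.0 debug output above (request URL and <Key> node), which is why I suspect an additional encoding step on the pyarrow side rather than a problem in the S3 backend.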

Wild guess of mine: this behavior might have been introduced by #33598 and/or #33468.

Thanks,
Sven

Component(s)

Python
