Unable to read a single parquet file from a Delta table with LastUpdated=2023-10-08T20%3A54%3A33.250Z in the path #7877

Closed
rspears74 opened this issue Oct 20, 2023 · 5 comments · Fixed by #8012
Labels
bug Something isn't working

Comments

@rspears74
Contributor

Describe the bug

I am trying to read parquet files from a Delta table. The parquet files are snappy compressed. My Delta table has 3 partition columns, so the folder structure of the Delta table looks something like this:

table
-- Col1=ABC
---- Col2=123
------ Col3=abc
-------- part-0000-xxxx-asdf.c00.snappy.parquet
-- Col1=XYZ
---- Col2=789
------ Col3=xyz
-------- part-0000-xyzj-qplg.c00.snappy.parquet
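As a rough illustration of the layout above, the `key=value` folder names can be pulled out of a file path with plain string handling. This is only a sketch: `partition_values` is a hypothetical helper, not part of DataFusion or Delta, and it keeps the values exactly as they appear on disk (no decoding of any kind).

```rust
/// Hypothetical helper (not a DataFusion or Delta API): collect the
/// `key=value` directory components of a partitioned path, in order.
/// Values are returned exactly as they appear in the path.
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        // Components without '=' (e.g. "table" or the file name) are skipped.
        .filter_map(|component| component.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

For `table/Col1=ABC/Col2=123/Col3=abc/part-0000-xxxx-asdf.c00.snappy.parquet` this yields the pairs `(Col1, ABC)`, `(Col2, 123)`, `(Col3, abc)`.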

I was originally trying to read a list of files (I need to be able to read an arbitrary list of files), but debugging my issue has brought me to trying to read a single parquet file. My code for this is as follows:

use datafusion::error::DataFusionError;
use datafusion::prelude::*;

async fn read_parquet(path: String) -> Result<(), DataFusionError> {
    let session = SessionContext::new();
    // Called with path = "/Users/me/Downloads/table/Col1=ABC/Col2=123/Col3=abc/part-0000-xxxx-asdf.c00.snappy.parquet"
    let _df = session.read_parquet(path, ParquetReadOptions::default()).await?;
    Ok(())
}

When I run this, I get:

thread 'main' panicked at src/main.rs:7:35:
called `Result::unwrap()` on an `Err` value: ObjectStore(NotFound { path: "/Users/me/Downloads/table/Col1=ABC/Col2=123/Col3=abc/part-0000-xxxx-asdf.c00.snappy.parquet", source: Os { code: 2, kind: NotFound, message: "No such file or directory" } })

HOWEVER, if I run this same code, but instead of passing the full path to the parquet file, I pass only the directory the file is in ("/Users/me/Downloads/table/Col1=ABC/Col2=123/Col3=abc"), I get no such error and I'm able to successfully read the parquet file.

I'm not sure if I'm doing something wrong, or if this is some kind of bug.

To Reproduce

Try to read a single parquet file from a local, partitioned Delta table, using SessionContext::read_parquet.

Expected behavior

I expect the file to be read into DataFusion as a DataFrame.

Additional context

No response

@rspears74 rspears74 added the bug Something isn't working label Oct 20, 2023
@alamb
Contributor

alamb commented Oct 20, 2023

Thank you for the report @rspears74 -- it certainly seems like a bug.

I wonder if something about the .snappy.parquet is causing an issue -- does the problem go away if you use a different filename (e.g. /Users/me/Downloads/table/Col1=ABC/Col2=123/Col3=abc/data.parquet)?

@rspears74
Contributor Author

rspears74 commented Oct 20, 2023

@alamb I've been able to do a bit more debugging and I've figured out what seems to be the root of the problem.

I changed the partition folder names when I posted here. The bottom-level partition folders are actually timestamps, so the folder names look something like LastUpdated=2023-10-08T20%3A54%3A33.250Z. When I change that to something else, or nest the parquet file in folders that don't contain %, it works. (I've been trying this with uncompressed parquet files, but I expect it will work fine with snappy files as well.)

After discovering this, I tried to simply read the file with std::fs::read(path).unwrap(), and I don't get an error. So I'm not sure if there's something different I should be doing, or if DataFusion isn't handling something like %3A in paths correctly.

Side note: I also found that if I tried to show the DataFrame in the original case where it "worked", the output was:

++
++

which I assume means an empty DF.

@alamb alamb changed the title Unable to read a single parquet file from a Delta table Unable to read a single parquet file from a Delta table with LastUpdated=2023-10-08T20%3A54%3A33.250Z in the path Oct 21, 2023
@alamb
Contributor

alamb commented Oct 21, 2023

Thanks @rspears74 - I agree it sounds like an issue with % in the path. To confirm, the next step is probably to pull out a self-contained reproducer.

Thanks again for the report

@Jefffrey
Contributor

Minimum example:

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.read_csv(
        "/home/jeffrey/tmp%123/test.csv",
        CsvReadOptions::new(),
    )
    .await?
    .show()
    .await?;

    Ok(())
}

Running this throws an error:

Error: ObjectStore(NotFound { path: "/home/jeffrey/tmp%25123/test.csv", source: Os { code: 2, kind: NotFound, message: "No such file or directory" } })

This can be avoided by specifying the file:// scheme:

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.read_csv(
        "file:///home/jeffrey/tmp%123/test.csv", // <-- here
        CsvReadOptions::new(),
    )
    .await?
    .show()
    .await?;

    Ok(())
}

(This can be reproduced for parquet as well)
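For anyone who needs to apply this workaround to an arbitrary list of files, a small wrapper along these lines might help. It is a sketch under assumptions: `with_file_scheme` is a hypothetical helper, it only handles absolute Unix-style paths, and the thread only confirms the workaround for the `tmp%123` example, not for folder names containing valid escapes such as `%3A`.

```rust
/// Hypothetical helper: prefix an absolute Unix-style path with `file://`
/// so that it is passed to the reader as a URL rather than a bare path.
/// Inputs that already carry a scheme are returned unchanged.
/// Assumption: `path` is absolute (starts with '/'); relative paths are
/// not handled here.
fn with_file_scheme(path: &str) -> String {
    if path.contains("://") {
        path.to_string()
    } else {
        format!("file://{path}")
    }
}
```

With this, `with_file_scheme("/home/jeffrey/tmp%123/test.csv")` gives `file:///home/jeffrey/tmp%123/test.csv`, the same string used in the second example above.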

The cause seems to be here:

https://github.com/apache/arrow-datafusion/blob/d8e413c0b92b86593ed9801a034bd62bdb9ddc0b/datafusion/core/src/datasource/listing/url.rs#L80-L93

Specifically, if Self::parse_path(...) is called, it will URL-encode the input path (%25 is the URL encoding of %).
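To make the effect concrete, here is a minimal standard-library sketch of percent-encoding a path. It is not DataFusion's or object_store's actual implementation: `percent_encode_path` is a hypothetical helper, and the set of characters it leaves untouched is an assumption chosen for illustration. The point is only that a literal `%` in a folder name becomes `%25`, so the encoded path no longer matches what is on disk.

```rust
/// Hypothetical helper (not DataFusion's code): percent-encode every byte
/// that is not an unreserved URL character or a path separator.
fn percent_encode_path(path: &str) -> String {
    let mut out = String::with_capacity(path.len());
    for byte in path.bytes() {
        match byte {
            // Unreserved characters and '/' are copied through unchanged.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' | b'/' => {
                out.push(byte as char)
            }
            // Everything else, including a literal '%', is escaped as %XX.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}
```

Calling `percent_encode_path("/home/jeffrey/tmp%123/test.csv")` returns `/home/jeffrey/tmp%25123/test.csv`, which is the path shown in the NotFound error above.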

@alamb
Contributor

alamb commented Oct 31, 2023

FYI @tustvold filed #8009
