
readCSV fails for *.zip #469


Closed
koperagen opened this issue Oct 14, 2023 · 4 comments · Fixed by #903
Labels
bug (Something isn't working), csv (CSV / delim related issues)
Milestone

Comments

@koperagen
Collaborator

This file extension is treated in a special way: there's an isCompressed method, and depending on its result readCSV wraps the InputStream. But this doesn't work for *.zip, because the InputStream is always wrapped in a GZIPInputStream. Apparently it's also not enough to just wrap the InputStream in a different class, because ZIP has a more complex structure and you need to call methods of ZipInputStream:

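import java.io.File
import java.util.zip.ZipInputStream
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readCSV

val zipInputStream = ZipInputStream(
    File("data.csv.zip").inputStream(),
    Charsets.UTF_8
)
// Position the stream at the first entry; before this call there is nothing to read.
zipInputStream.nextEntry
val df1 = DataFrame.readCSV(zipInputStream)
zipInputStream.closeEntry()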
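
To make the difference between the two wrappers concrete, here is a minimal sketch that uses only java.util.zip and an in-memory archive, so it does not depend on DataFrame at all. The helper names (zipOf, opensAsGzip, readFirstZipEntry) are made up for this illustration and are not part of the library.

```kotlin
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.IOException
import java.util.zip.GZIPInputStream
import java.util.zip.ZipEntry
import java.util.zip.ZipInputStream
import java.util.zip.ZipOutputStream

// Hypothetical helper: builds a one-entry ZIP archive in memory.
fun zipOf(entryName: String, content: String): ByteArray {
    val bytes = ByteArrayOutputStream()
    ZipOutputStream(bytes).use { zip ->
        zip.putNextEntry(ZipEntry(entryName))
        zip.write(content.toByteArray(Charsets.UTF_8))
        zip.closeEntry()
    }
    return bytes.toByteArray()
}

// Hypothetical helper: true if the bytes can be opened and read as a GZIP stream.
// GZIPInputStream checks the GZIP header in its constructor and throws otherwise.
fun opensAsGzip(data: ByteArray): Boolean {
    return try {
        GZIPInputStream(ByteArrayInputStream(data)).use { it.read() }
        true
    } catch (e: IOException) {
        false
    }
}

// Hypothetical helper: returns the text of the first ZIP entry, or null if there is none.
fun readFirstZipEntry(data: ByteArray): String? {
    ZipInputStream(ByteArrayInputStream(data), Charsets.UTF_8).use { zip ->
        // Without moving to an entry first, reads on the stream return no data.
        if (zip.nextEntry == null) return null
        // Reads until the end of the current entry, not the end of the archive.
        return zip.readBytes().toString(Charsets.UTF_8)
    }
}
```

Calling opensAsGzip(zipOf("data.csv", "a,b\n1,2\n")) returns false, while readFirstZipEntry on the same bytes returns the CSV text. Only the first entry is read here; what to do with archives that contain several files is a separate design question.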
@koperagen koperagen added the bug Something isn't working label Oct 14, 2023
@koperagen
Collaborator Author

Another issue is that a file ending in *.gz can be a *.tar.gz, and we cannot read it properly without some special handling. So I suggest we either support it or at least throw an exception with a message saying that the file should be a plain gzip-compressed file and not a *.tar archive.
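
One possible way to produce such an error message, sketched under assumptions: POSIX (ustar) and GNU tar archives carry the ASCII magic "ustar" at byte offset 257 of the first header block, so the start of the already decompressed stream can be checked for it. The function name looksLikeTar is hypothetical, and very old tar archives without the magic would not be detected.

```kotlin
import java.io.InputStream

// Hypothetical check: does the (already gunzipped) stream start with a tar header?
// Note: this consumes up to 262 bytes, so the caller has to reopen the stream or
// wrap it in a BufferedInputStream and use mark/reset around this call.
fun looksLikeTar(decompressed: InputStream): Boolean {
    val needed = 262 // magic occupies offsets 257..261
    val header = ByteArray(needed)
    var filled = 0
    while (filled < needed) {
        val n = decompressed.read(header, filled, needed - filled)
        if (n < 0) break // stream ended before a full header was available
        filled += n
    }
    if (filled < needed) return false
    return String(header, 257, 5, Charsets.US_ASCII) == "ustar"
}
```

A reader could run this on a GZIPInputStream over the file and, when it returns true, fail with a message that *.tar.gz archives are not supported instead of trying to parse the tar header as CSV.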

@koperagen
Collaborator Author

After the fix, this needs to be mentioned in the docs.

@Jolanrensen Jolanrensen added this to the Backlog milestone Oct 16, 2023
@Jolanrensen
Collaborator

There are actually a lot of places where DataFrame assumes a type based on the file extension, but we should avoid that, as a file's extension can be changed while its contents stay the same.
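
As an illustration of detecting the format from content instead of the extension, here is a rough sketch that looks at the leading magic bytes: GZIP streams start with 0x1F 0x8B, and ZIP archives with at least one entry start with "PK" followed by 0x03 0x04. The enum SniffedCompression and the function sniffCompression are hypothetical names for this example and are not DataFrame API.

```kotlin
import java.io.BufferedInputStream

// Hypothetical result type for this sketch.
enum class SniffedCompression { GZIP, ZIP, NONE }

// Peeks at the first bytes of the stream and rewinds it, so the caller can
// afterwards wrap the same stream in GZIPInputStream or ZipInputStream.
fun sniffCompression(stream: BufferedInputStream): SniffedCompression {
    val head = ByteArray(4)
    stream.mark(head.size)
    var filled = 0
    while (filled < head.size) {
        val n = stream.read(head, filled, head.size - filled)
        if (n < 0) break // fewer than 4 bytes available
        filled += n
    }
    stream.reset()

    // Bytes are signed in Kotlin, so mask to compare against unsigned magic values.
    fun b(i: Int) = head[i].toInt() and 0xFF

    return when {
        filled >= 2 && b(0) == 0x1F && b(1) == 0x8B -> SniffedCompression.GZIP
        filled >= 4 && b(0) == 0x50 && b(1) == 0x4B &&
            b(2) == 0x03 && b(3) == 0x04 -> SniffedCompression.ZIP
        else -> SniffedCompression.NONE
    }
}
```

This only covers the two formats discussed in this issue. An empty ZIP archive starts with a different signature and is reported as NONE here.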

@Jolanrensen Jolanrensen added the csv CSV / delim related issues label Aug 20, 2024
@Jolanrensen Jolanrensen mentioned this issue Aug 20, 2024
28 tasks
@Jolanrensen Jolanrensen mentioned this issue Nov 1, 2024
2 tasks
@Jolanrensen
Collaborator

Will be solved in the new CSV implementation: "dataframe-csv". I will probably also migrate its new Compression class to the :core module in the future to solve reading zips from other read functions too.

2 participants