Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status and roadmap #1

Closed
43 of 48 tasks
martindurant opened this issue Oct 20, 2016 · 9 comments
Closed
43 of 48 tasks

Status and roadmap #1

martindurant opened this issue Oct 20, 2016 · 9 comments

Comments

@martindurant
Copy link
Member

martindurant commented Oct 20, 2016

Features to be implemented.
An asterisk shows the next item(s) on the list.
A question mark shows something that might (almost) work, but isn't tested.

  • python2 compatibility

Reading

  • Types of encoding ( https://github.com/Parquet/parquet-format/blob/master/Encodings.md )
    • plain
    • bitpacked/RLE hybrid
    • dictionary
      • decode to values
      • make into categoricals
    • delta (needs test data)
    • delta-length byte-array (needs test data)
  • compression algorithms (gzip, lzo, snappy, brotli)
  • nulls
  • repeated/list values (*)
  • map, key-value types
  • multi-file (hive-like)
    • understand partition tree structure
      • filtering by partitions
    • parallelized for dask
  • filtering by statistics
  • converted/logical types
  • alternative file-systems
  • index handling

Writing

  • primitive types
  • converted/logical types
  • encodings (selected by user)
    • plain (default)
    • dictionary encoding (default for categoricals)
    • delta-length byte array (should be much faster for variable strings)
      • delta encoding (depends on reading delta encoding)
  • nulls encoding (for dtypes that don't accept NaN)
  • choice of compression
    • per column
  • multi-file
    • partitions on categoricals
    • parallelize for dask
      • partitions and division for dask
  • append
    • single-file
    • multi-file
    • consolidate files into logical data-set
  • alternative file-systems

Admin

  • packaging
    • pypi, conda
  • README
  • documentation
    • RTD
    • API documentation and doc-strings
    • Developer documentation (everything you need to run tests)
    • List of parquet features not yet supported to establish expectations
  • Announcement blogpost with example

Features not to be attempted

  • nested schemas (maybe can find a way to flatten or encode as dicts)
  • choice of encoding on write? (keep it simple)
  • schema evolution
@mrocklin
Copy link
Member

Can I make a request for an additional section for Administrative topics like packaging, documentation, etc..

What do we need for Dask.dataframe integration? Presumably we're depending on dask.bytes.open_files?

@martindurant
Copy link
Member Author

martindurant commented Oct 20, 2016

Yes, passing a file-like object that can be resolved in each worker would do: core.read_col currently takes an open file-object or a string that can be opened within the function. It probably should take a function to create a file object given a path (a parquet metadata file will reference other files with relative paths).
The only places that reading actually happens is core.read_thrift (where the size of the thrift structure is not known) and core._read_page (where the size in bytes is known). The former is small and would fit within a read-ahead buffer, the latter can be formed in terms of dask's read_bytes.

@lomereiter
Copy link

Hi there,

Are you guys aware of ongoing PyArrow development? It is also already on conda-forge and also has pandas <-> parquet read/write (through Arrow), although I don't think it supports multi-file yet.

@mrocklin
Copy link
Member

mrocklin commented Dec 3, 2016

@lomereiter Yes, we're very aware. We've been waiting for comprehensive Parquet read-write functionality from Arrow for a long while. Hopefully fastparquet is just a stopgap measure until PyArrow matures as a comprehensive solution.

@teh
Copy link

teh commented Dec 7, 2016

Hi, amazing work. Two things I noticed:

  • pytest required at runtime (imported in utils.py) which is a bit unusual
  • if column names are not string types then saving fails (e.g. AttributeError: 'int' object has no attribute 'encode')

@frol
Copy link

frol commented Mar 1, 2017

Since @lomereiter mentioned PyArrow, I will just leave this link here: Extreme IO performance with parallel Apache Parquet in Python

@martindurant
Copy link
Member Author

martindurant commented Mar 1, 2017

Thanks @frol . That there are multiple projects pushing on parquet for python is a good thing. You should also have linked to the previous posting python-parquet-update (Wes's work, not mine) which shows that fastparquet and arrow have very similar performance in many cases.

Note also that fastparquet is designed to run in parallel using dask, allowing distributed data access, and reading from remote stores such as s3.

@frol
Copy link

frol commented Mar 1, 2017

@martindurant Thank you! I was actually looking out there for some sorts of benchmarks for fastparquet as I am going to use it with Dask. It would be very helpful to have some info about benchmarks in the documentation as "fast" suffix in the project name implies the focus on speed, but I failed to find any info on this until you pointed me to this article.

@martindurant
Copy link
Member Author

There are some raw benchmarks in https://github.com/dask/fastparquet/blob/master/fastparquet/benchmarks/columns.py

My colleagues at datashader did some benchmarking on census data at the time when we were focusing on performance. Their numbers include both loading and performing aggregations on the data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants