Implementation of `ParquetFile.write_row_groups()` & `ParquetFile._sort_part_names()` #712
Top of docstring should be kept to one line
Reminds me: how is this method different from `write(..., append=True)`?
What `append_as_row_groups` does over `write`:

- it works from an existing `ParquetFile` instance (if the instance is already loaded, you don't re-load it again, as opposed to `write`)
- it requires fewer parameters, as some of them are defined directly from the `ParquetFile` instance attributes (`partition_on` for instance)
- it can skip re-writing the common metadata file (`write_fmd=False`)
- it returns a `ParquetFile` instance (allowing further modifications)

After this PR, I would like to prepare a new one containing documentation and an illustrative test case. This will hopefully better illustrate the value of completing the set of utilities to manage the row groups of a `ParquetFile` instance. This documentation would be a kind of 'tutorial' / example about how to use these functions together to update a parquet dataset.
Assuming:

Then the following update methodology with fastparquet is possible:

I would like to provide 2 application test cases:

- `filter_row_groups(pf, filter=[('name', 'in', ['Yoh', 'Fred'])])`
- `filter_row_groups(pf, filter=[('timestamp', '>=', pd.Timestamp('2021/01/01')), ('timestamp', '<=', pd.Timestamp('2021/01/04'))])`

and then roll out the above-proposed procedure.
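The two `filter_row_groups` test cases above rely on row-group statistics. As a minimal, self-contained sketch of that kind of filtering logic (the `RowGroup` class and `may_match` helper here are illustrative inventions, not fastparquet API):

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    # per-column (min, max) statistics, as found in parquet footers
    stats: dict

def may_match(rg, predicate):
    """True if the row group MAY contain rows satisfying (col, op, value)."""
    col, op, value = predicate
    lo, hi = rg.stats[col]
    if op == ">=":
        return hi >= value                # some rows could be >= value
    if op == "<=":
        return lo <= value                # some rows could be <= value
    if op == "in":
        return any(lo <= v <= hi for v in value)
    raise ValueError(f"unsupported op: {op}")

def filter_row_groups(row_groups, filter):
    # keep row groups that could satisfy ALL predicates (AND semantics);
    # 'filter' shadows the builtin on purpose, to mirror the keyword above
    return [rg for rg in row_groups
            if all(may_match(rg, p) for p in filter)]

rgs = [RowGroup({"x": (0, 9)}), RowGroup({"x": (10, 19)}), RowGroup({"x": (20, 29)})]
kept = filter_row_groups(rgs, [("x", ">=", 8), ("x", "<=", 15)])  # first two groups
```

Statistics-based filtering is conservative by design: it can only say a row group *may* contain matching rows, never prune one that does.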
Compared to `append='overwrite'`, this approach:

I introduced this in #676; I am sorry if I am writing a lot of things that are not necessarily very clear.
Near the end of the 1st post, with bullet points, I briefly summarize the main requirements of this 'generic' update feature. They correspond to the different utilities I have implemented: `remove_row_groups`, `_write_common_metadata`, and `append_as_row_groups`.
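To make the roles of these utilities concrete, here is a hedged, in-memory sketch of the update cycle they enable; the function names are borrowed from the PR, but the bodies below only model the behavior on a plain dict, not real parquet files:

```python
# In-memory model of the generic update cycle: remove stale row groups,
# append refreshed ones, then rewrite the common metadata once at the end.
def remove_row_groups(dataset, indexes):
    # drop the selected row groups, keeping the original order of the rest
    drop = set(indexes)
    dataset["row_groups"] = [
        rg for i, rg in enumerate(dataset["row_groups"]) if i not in drop
    ]

def append_as_row_groups(dataset, new_groups):
    # append new row groups at the end of the dataset
    dataset["row_groups"].extend(new_groups)

def _write_common_metadata(dataset):
    # in fastparquet this would rewrite the _metadata file; here we
    # just refresh a summary field once, after all modifications
    dataset["num_row_groups"] = len(dataset["row_groups"])

dataset = {"row_groups": ["rg-a", "rg-b", "rg-c"], "num_row_groups": 3}
remove_row_groups(dataset, [1])           # drop the stale row group
append_as_row_groups(dataset, ["rg-d"])   # add the refreshed data
_write_common_metadata(dataset)           # publish updated metadata once
```

Deferring the metadata rewrite to a single final step is what `write_fmd=False` makes possible when chaining several modifications.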
(I made some updates to complete the answer above.)
Hi @martindurant,

Replying here to:

> By the way, is "insert" possible (i.e., shuffling the other row-groups and filenames as necessary)?

I rephrased this question in my mind as:

> Could your work be extended so as to make insertion of row groups possible ('shuffling' other row-groups and filenames as necessary)?

To this question, my answer is yes, and I think it is actually a good idea!
It has not been my target so far (as it is possible to do it by using `sorted` on the row-group list directly), but I think doing insertion directly in `append_as_row_groups` (which I will rename to `write_row_groups` as per our discussion above) is a neater solution.

I think I would also change the interface of `write_row_groups()` so that it accepts an iterable of dataframes (e.g. a list of dataframes or a generator of dataframes, each individual dataframe defining a row group) instead of a single dataframe to be split. For each of these row groups, the list of indexes where they have to be inserted could optionally be provided (if not provided, it will be a usual append).
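The insertion semantics described above (optional target indexes, falling back to a plain append, then renaming part files to match the new order) can be sketched in plain Python; `insert_row_groups` and `renumber_parts` are hypothetical helpers, not the actual PR implementation:

```python
def insert_row_groups(row_groups, new_groups, at=None):
    """Insert new row groups; `at` gives each group's position in the result."""
    if at is None:
        return list(row_groups) + list(new_groups)   # no indexes: plain append
    out = list(row_groups)
    # process target positions in ascending order so each insert lands at
    # its final index in the resulting list
    for idx, rg in sorted(zip(at, new_groups)):
        out.insert(idx, rg)
    return out

def renumber_parts(row_groups):
    # after shuffling, every part file gets its new ordinal in its name,
    # mirroring the part.N.parquet scheme of a hive-style dataset
    return [f"part.{i}.parquet" for i, _ in enumerate(row_groups)]

rgs = insert_row_groups(["old0", "old1", "old2"], ["new-a", "new-b"], at=[1, 3])
parts = renumber_parts(rgs)
```

The renaming step is where something like `_sort_part_names()` comes in: once row groups move, the on-disk part names must be brought back in line with the logical order.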
With these changes, we can make the existing `overwrite` feature a separate function, similar in the way it is 'external' to the existing `merge`, and as efficient as it is now (its current implementation is intertwined with `write`).

As I see it, this will bring us more modular code, easier to read and maintain, so yes, I am keen on investigating this!
I will work out something in the coming days, and push updates, probably by the end of this week.
Thanks for your constructive insights!
Best,