Expose the Delta Log in a DataFrame that's easy for analysis #1031
Comments
That does sound useful. We should be careful about the naming, though, since I think this could be confused with two other ideas:
Yep, I'm open to suggestions for names here. FYI, for other interested parties, history() has already been implemented in this lib. Making it easy to grab the full log across versions, or for a given version, would be ideal. That's a good point. It would be especially useful if there were a log entry for vacuum commands, so we could indicate which data has already been vacuumed.
Can we also add an indicator of the "number of versions available" somewhere in this metadata?
@chitralverma Unfortunately that isn't trivial, since we don't currently track that anywhere, and figuring it out requires looking through the log to see which files are still around. I've created #1037 to track that.
# Description
Exposes a function to get a DataFrame of add actions for a selected version of the table.

TODO:
* [x] add unit tests
* [x] write user guide
* [x] handle partition columns
* [x] handle stats
* [x] handle tags
* [x] add a `flatten` option

# Related Issue(s)
- closes #1031

# Documentation
<!--- Share links to useful documentation --->
The _delta_log contains all sorts of valuable information for end users. Valuable chunks of Delta Log data are stored in JSON files that aren't easy for users to access.
It'd be great if the file name, file size, modification time, and column statistics were exposed to the user in a DataFrame so they could better manage their Delta table. Here's a possible interface:
That would return a DataFrame with these columns:
file_name
file_size
modification_time
data_change
col_a_min
col_a_max
col_b_min
col_b_max
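To make the proposed columns concrete, here is a minimal sketch of how one commit file's `add` actions could be flattened into rows of that shape. It is not this library's implementation: the function name `add_actions_to_rows` and the sample data are hypothetical, and it assumes the field names used for `add` actions in the Delta transaction log protocol (`path`, `size`, `modificationTime`, `dataChange`, and a JSON-encoded `stats` string with `minValues` and `maxValues`).

```python
import json
from typing import Any, Dict, Iterable, List


def add_actions_to_rows(log_lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Flatten the `add` actions of one commit file into row dicts.

    Hypothetical helper for illustration only. Each input line is one
    JSON action, as found in a `_delta_log/<version>.json` file.
    """
    rows: List[Dict[str, Any]] = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is None:
            continue  # skip commitInfo, metaData, protocol, remove, ...
        row: Dict[str, Any] = {
            "file_name": add["path"],
            "file_size": add["size"],
            "modification_time": add["modificationTime"],
            "data_change": add["dataChange"],
        }
        # `stats` is optional and stored as a JSON string inside the action.
        stats = json.loads(add["stats"]) if add.get("stats") else {}
        # Top-level columns only; nested (struct) columns stay as dict values.
        for col, value in stats.get("minValues", {}).items():
            row[f"{col}_min"] = value
        for col, value in stats.get("maxValues", {}).items():
            row[f"{col}_max"] = value
        rows.append(row)
    return rows


# Made-up sample commit: one commitInfo action and one add action.
sample_log = [
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({
        "add": {
            "path": "part-00000.parquet",
            "size": 1024,
            "modificationTime": 1670000000000,
            "dataChange": True,
            "partitionValues": {},
            "stats": json.dumps({
                "numRecords": 3,
                "minValues": {"col_a": 1},
                "maxValues": {"col_a": 9},
            }),
        }
    }),
]

rows = add_actions_to_rows(sample_log)
```

The resulting list of dicts can be handed to a DataFrame constructor of your choice. Note that this only reads a single commit; reconstructing the files active at a given table version would also require applying `remove` actions and any checkpoints.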
Here are the types of questions the users could answer with this metadata:
This would help users a lot before they perform expensive computations.