-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add subcommands to support data stream analysis. #104
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question about API design. Would someone ever want to get depth, count, etc. values at the same time? Do you think there's any use case where you absolutely must have only one of these values?
My gut reaction is that it would make sense to have a single ion beta analyze
command that combines analyze count
, analyze size
, analyze depth
, symtab count
and symtab symbol_count
into one command.
Ok(()) | ||
} | ||
|
||
fn calculate_depth(element: &Element, depth: usize) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not a problem with this PR; just musing)
This is an effective way to calculate the depth, but it makes me wonder if we should have some sort of "instrumented" Ion reader that can send events of some sort to a callback function with some details about the state of the reader.
let options = plot::HistogramOptions { | ||
intervals: 4, | ||
..Default::default() | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really neat!
Is there an option to have a logarithmic scale? E.g. buckets of 1-8
, 9-64
, 65-256
, etc.
I think having a logarithmic scale might be a more useful visualization, and fixed bucket sizes will make it easier to compare the histograms of two different files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there's an option allows us to set log_scale
for histograms. I will add it in the next commit for better data distribution visualization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After setting log_scale
as true
, the min
is always 0.0
because this is how min was initialized in histogram.rs which seems not right for me. I will keep using the fixed bucket sizes to make sure min
represents the correct value.
Thanks for the suggestion. I agree with the idea of putting all these analytical options in one subcommand. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new features look really useful!
I agree with Matt's comment that it would be nice if we had the option to get all the data at once (maybe something like ion beta analyze all
). Additionally, having an option to receive the output in a structured format (Ion!) may be useful, e.g. {value_count: 59155, symbol_table_count: 386, ...}
.
Other stats that may be useful:
- How many of the symbol tables in the stream are "brand new" vs "append"?
- Stats / histogram (like you have for value sizes) for number of symbols introduced by the symbol tables. E.g. what is the average number of new symbols introduced, and how does that change throughout the stream?
- Ability to show (i.e., dump) the largest, smallest, or average-size value in the stream.
I mention these not because I expect you to implement them now, but just in case thinking of expansion possibilities influences how you choose to structure the commands.
Thanks for providing these new ideas. This would be a great reference for the further expansion. |
For what it's worth I would prefer I think we should have the same "porcelain" vs. "plumbing" distinction in |
For the API design, I agree with combining
Combine all of the subcommands into one except
Merge ion beta depth to ion beta stats when |
Using
How expensive? If its already 5 seconds for a 20 MB file, and combining the two of them makes it take a few more seconds, I think that's fine. We can always improve the efficiency of it later, and this is something that (a) should not be in a performance critical path, and (b) is still in the
Can we indicate the units for some of these? |
* Add unit test for analyze_data_stream * Add integration test for subcommand stats.
If performance is not the concern here I can make them all in one command. We can always optimize it when the |
This PR has been updated in #134 |
Issue #, if available:
#42
Description of changes:
This PR adds options to analyze the input binary Ion data stream, allowing users to:
Additionally, the existing count option has been reorganized under the analyze command. The usage pattern has been updated from:
ion beta count [File1, File2 ...]
toion beta analyze count [File1, File2 ...]
.Here is the overview of the command structure before and after the change.
Before
After
For symbol table analysis, we added two options, here is the command structure before and after this change.
Before:
After:
Testing:
Examples:
Request the maximum depth of input ion stream
> ion beta analyze depth test.10n
Output:
> The maximum depth is 4
Request the size information of the top level values
> ion beta analyze size test.10n
Output:
Request the number of local symbol table of input ion steam
> ion beta symtab count test.10n
Output:
> The number of symbol tables is 387
Request the number of symbols of the input ion stream
> ion beta symtab symbol_count test.10n
Output:
> The number of symbols is 67808444
These examples above show what you should expect when you execute these new options. All the unit tests for these functionalities are added into the
tests/cli.rs
file.Note:
size
size
subcommand also provides the number of top-level values, we might consider removing thecount
subcommand.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.