Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add subcommands to support data stream analysis. #104

Closed
wants to merge 2 commits into from
Closed

Conversation

linlin-s
Copy link
Contributor

Issue #, if available:

#42

Description of changes:

This PR adds options to analyze the input binary Ion data stream, allowing users to:

  • Get the maximum depth of the input data stream.
  • Get insights into the top-level values, including the minimum, maximum and average size, size distributions, and total count.
  • Gain insights into the symbol table, including the ability to request the number of local symbol tables and the number of symbols in the provided Ion stream.

Additionally, the existing count option has been reorganized under the analyze command. The usage pattern has been updated from: ion beta count [File1, File2 ...] to ion beta analyze count [File1, File2 ...].

Here is the overview of the command structure before and after the change.
Before

ion
└─ beta
    └─ count (The number of total top-level values)

After

ion
└─ beta
    └─ analyze
          └─ count (The number of total top-level values)
          └─ size (Return min/max/mean size and the size distribution of the top-level values)
          └─ depth (The maximum depth of input ion stream)

For symbol table analysis, we added two options, here is the command structure before and after this change.
Before:

ion
└─ beta
    └─ symtab
          └─ filter (Filters user data out of a binary Ion stream, leaving only the symbol table(s) behind.)

After:

ion
└─ beta
    └─ symtab
          └─ filter (Filters user data out of a binary Ion stream, leaving only the symbol table(s) behind.)
          └─ count (The number of local symbol tables of the input ion stream)
          └─ symbol_count (The number of symbols of the input ion stream)

Testing:

Examples:
Request the maximum depth of input ion stream
> ion beta analyze depth test.10n
Output:
> The maximum depth is 4

Request the size information of the top level values
> ion beta analyze size test.10n
Output:
image

Request the number of local symbol table of input ion steam
> ion beta symtab count test.10n
Output:
> The number of symbol tables is 387

Request the number of symbols of the input ion stream
> ion beta symtab symbol_count test.10n
Output:
> The number of symbols is 67808444

These examples above show what you should expect when you execute these new options. All the unit tests for these functionalities are added into the tests/cli.rs file.

Note:

  • We need to add test for subcommand size
  • Since the result of the size subcommand also provides the number of top-level values, we might consider removing the count subcommand.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor

@popematt popematt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question about API design. Would someone ever want to get depth, count, etc. values at the same time? Do you think there's any use case where you absolutely must have only one of these values?

My gut reaction is that it would make sense to have a single ion beta analyze command that combines analyze count, analyze size, analyze depth, symtab count and symtab symbol_count into one command.

Ok(())
}

fn calculate_depth(element: &Element, depth: usize) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not a problem with this PR; just musing)
This is an effective way to calculate the depth, but it makes me wonder if we should have some sort of "instrumented" Ion reader that can send events of some sort to a callback function with some details about the state of the reader.

Comment on lines 67 to 70
let options = plot::HistogramOptions {
intervals: 4,
..Default::default()
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really neat!

Is there an option to have a logarithmic scale? E.g. buckets of 1-8, 9-64, 65-256, etc.
I think having a logarithmic scale might be a more useful visualization, and fixed bucket sizes will make it easier to compare the histograms of two different files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there's an option allows us to set log_scale for histograms. I will add it in the next commit for better data distribution visualization.

Copy link
Contributor Author

@linlin-s linlin-s May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After setting log_scale as true, the min is always 0.0 because this is how min was initialized in histogram.rs which seems not right for me. I will keep using the fixed bucket sizes to make sure min represents the correct value.

@linlin-s
Copy link
Contributor Author

I have a question about API design. Would someone ever want to get depth, count, etc. values at the same time? Do you think there's any use case where you absolutely must have only one of these values?

My gut reaction is that it would make sense to have a single ion beta analyze command that combines analyze count, analyze size, analyze depth, symtab count and symtab symbol_count into one command.

Thanks for the suggestion. I agree with the idea of putting all these analytical options in one subcommand. The count information is already included in the returned value of the size subcommand, so there's no need to keep the count. I'm wondering if keeping all the symbol table related subcommands under symtab could be an option as well? e.g., ion beta symtab analyze will return the number of local symbol tables and the number of symbols.

Copy link
Contributor

@tgregg tgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new features look really useful!

I agree with Matt's comment that it would be nice if we had the option to get all the data at once (maybe something like ion beta analyze all). Additionally, having an option to receive the output in a structured format (Ion!) may be useful, e.g. {value_count: 59155, symbol_table_count: 386, ...}.

Other stats that may be useful:

  • How many of the symbol tables in the stream are "brand new" vs "append"?
  • Stats / histogram (like you have for value sizes) for number of symbols introduced by the symbol tables. E.g. what is the average number of new symbols introduced, and how does that change throughout the stream?
  • Ability to show (i.e., dump) the largest, smallest, or average-size value in the stream.

I mention these not because I expect you to implement them now, but just in case thinking of expansion possibilities influences how you choose to structure the commands.

@linlin-s
Copy link
Contributor Author

These new features look really useful!

I agree with Matt's comment that it would be nice if we had the option to get all the data at once (maybe something like ion beta analyze all). Additionally, having an option to receive the output in a structured format (Ion!) may be useful, e.g. {value_count: 59155, symbol_table_count: 386, ...}.

Other stats that may be useful:

* How many of the symbol tables in the stream are "brand new" vs "append"?

* Stats / histogram (like you have for value sizes) for number of symbols introduced by the symbol tables. E.g. what is the average number of new symbols introduced, and how does that change throughout the stream?

* Ability to show (i.e., dump) the largest, smallest, or average-size value in the stream.

I mention these not because I expect you to implement them now, but just in case thinking of expansion possibilities influences how you choose to structure the commands.

Thanks for providing these new ideas. This would be a great reference for the further expansion.

@jobarr-amzn
Copy link
Contributor

For what it's worth I would prefer stat or stats as the command here, it matches another CLI which people are likely familiar with (git).

I think we should have the same "porcelain" vs. "plumbing" distinction in ion commands (see Git Internals Plumbing and Porcelain), rather than a beta subtree where we push everything because we regard that which is not beta as "plumbing".

@linlin-s
Copy link
Contributor Author

For the API design, I agree with combining analyze count, analyze size, analyze depth, symtab count, and symtab symbol_count into one command.
All the information about top-level data size and the symbol table can be retrieved from SystemReader. However, to get the maximum depth, we need to explore elements from the user reader recursively(current implementation). We can also explore readers on different levels, but it will involve a lot of step_in and step_out, which is not efficient for getting the depth. In this case, combining depth with the rest of the subcommands would require using two different readers to retrieve all information at once, which could be expensive when the data size is significantly large.

LazySystemReader could be an option to achieve this, and we can explore LazyValue to avoid step_in/step_out operations when getting the maximum depth of the nested input data stream. However, there are no public methods in lazy_system_reader to get information about value length under the current version of ion-rust. This feature might be supported in the future, but for now, here is the proposed solution:

Combine all of the subcommands into one except depth.

> ion beta stats 

Samples = 59155; Min = 21; Max = 1519
Average = 374; Variance = 195565.635; STD = 442.228
Each ∎ represents a count of 426

[  21 ..  396] [38823] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
[ 396 ..  770] [ 1137] ∎∎
[ 770 .. 1144] [17845] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
[1144 .. 1519] [ 1350] ∎∎∎
The number of symbols is 67808444 
The number of local symbol tables is 387 
> ion beta depth 
The maximum depth is 4

Merge ion beta depth to ion beta stats when LazyValue supports accessing the underlying stream data.

@popematt
Copy link
Contributor

We can also explore readers on different levels, but it will involve a lot of step_in and step_out, which is not efficient for getting the depth.

Using Element also has the same number of step in/out—it's just a higher level of abstraction that hides the reader from you.

combining depth with the rest of the subcommands would require using two different readers to retrieve all information at once, which could be expensive when the data size is significantly large.

How expensive? If its already 5 seconds for a 20 MB file, and combining the two of them makes it take a few more seconds, I think that's fine. We can always improve the efficiency of it later, and this is something that (a) should not be in a performance critical path, and (b) is still in the beta namespace.

Samples = 59155; Min = 21; Max = 1519
Average = 374; Variance = 195565.635; STD = 442.228
Each ∎ represents a count of 426

Can we indicate the units for some of these?

* Add unit test for analyze_data_stream
* Add integration test for subcommand stats.
@linlin-s
Copy link
Contributor Author

We can also explore readers on different levels, but it will involve a lot of step_in and step_out, which is not efficient for getting the depth.

Using Element also has the same number of step in/out—it's just a higher level of abstraction that hides the reader from you.

combining depth with the rest of the subcommands would require using two different readers to retrieve all information at once, which could be expensive when the data size is significantly large.

How expensive? If its already 5 seconds for a 20 MB file, and combining the two of them makes it take a few more seconds, I think that's fine. We can always improve the efficiency of it later, and this is something that (a) should not be in a performance critical path, and (b) is still in the beta namespace.

Samples = 59155; Min = 21; Max = 1519
Average = 374; Variance = 195565.635; STD = 442.228
Each ∎ represents a count of 426

Can we indicate the units for some of these?

If performance is not the concern here I can make them all in one command. We can always optimize it when the LazySystemReader supports accessing underlying stream data.
I will add comment on the final report to indicates the units. plot::Histogramitself doesn't support adding units for the data.

@linlin-s
Copy link
Contributor Author

This PR has been updated in #134

@linlin-s linlin-s closed this Jul 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants