Add subcommands to support data stream analysis. #104

linlin-s · 2024-04-19T19:55:32Z

Issue #, if available:

Description of changes:

This PR adds options to analyze the input binary Ion data stream, allowing users to:

Get the maximum depth of the input data stream.
Get insights into the top-level values, including the minimum, maximum and average size, size distributions, and total count.
Gain insights into the symbol table, including the ability to request the number of local symbol tables and the number of symbols in the provided Ion stream.

Additionally, the existing count option has been reorganized under the analyze command. The usage pattern has been updated from: ion beta count [File1, File2 ...] to ion beta analyze count [File1, File2 ...].

Here is the overview of the command structure before and after the change.
Before

ion
└─ beta
    └─ count (The number of total top-level values)

After

ion
└─ beta
    └─ analyze
          └─ count (The number of total top-level values)
          └─ size (Return min/max/mean size and the size distribution of the top-level values)
          └─ depth (The maximum depth of input ion stream)

For symbol table analysis, we added two options, here is the command structure before and after this change.
Before:

ion
└─ beta
    └─ symtab
          └─ filter (Filters user data out of a binary Ion stream, leaving only the symbol table(s) behind.)

After:

ion
└─ beta
    └─ symtab
          └─ filter (Filters user data out of a binary Ion stream, leaving only the symbol table(s) behind.)
          └─ count (The number of local symbol tables of the input ion stream)
          └─ symbol_count (The number of symbols of the input ion stream)

Testing:

Examples:
Request the maximum depth of input ion stream
> ion beta analyze depth test.10n
Output:
> The maximum depth is 4

Request the size information of the top level values
> ion beta analyze size test.10n
Output:

Request the number of local symbol table of input ion steam
> ion beta symtab count test.10n
Output:
> The number of symbol tables is 387

Request the number of symbols of the input ion stream
> ion beta symtab symbol_count test.10n
Output:
> The number of symbols is 67808444

These examples above show what you should expect when you execute these new options. All the unit tests for these functionalities are added into the tests/cli.rs file.

Note:

We need to add test for subcommand size
Since the result of the size subcommand also provides the number of top-level values, we might consider removing the count subcommand.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

popematt

I have a question about API design. Would someone ever want to get depth, count, etc. values at the same time? Do you think there's any use case where you absolutely must have only one of these values?

My gut reaction is that it would make sense to have a single ion beta analyze command that combines analyze count, analyze size, analyze depth, symtab count and symtab symbol_count into one command.

popematt · 2024-04-20T00:01:42Z

src/bin/ion/commands/beta/analyze/depth.rs

+    Ok(())
+}
+
+fn calculate_depth(element: &Element, depth: usize) -> usize {


(Not a problem with this PR; just musing)
This is an effective way to calculate the depth, but it makes me wonder if we should have some sort of "instrumented" Ion reader that can send events of some sort to a callback function with some details about the state of the reader.

popematt · 2024-04-20T00:08:45Z

src/bin/ion/commands/beta/analyze/size.rs

+    let options = plot::HistogramOptions {
+        intervals: 4,
+        ..Default::default()
+    };


This looks really neat!

Is there an option to have a logarithmic scale? E.g. buckets of 1-8, 9-64, 65-256, etc.
I think having a logarithmic scale might be a more useful visualization, and fixed bucket sizes will make it easier to compare the histograms of two different files.

Yes, there's an option allows us to set log_scale for histograms. I will add it in the next commit for better data distribution visualization.

After setting log_scale as true, the min is always 0.0 because this is how min was initialized in histogram.rs which seems not right for me. I will keep using the fixed bucket sizes to make sure min represents the correct value.

linlin-s · 2024-04-22T16:35:40Z

I have a question about API design. Would someone ever want to get depth, count, etc. values at the same time? Do you think there's any use case where you absolutely must have only one of these values?

My gut reaction is that it would make sense to have a single ion beta analyze command that combines analyze count, analyze size, analyze depth, symtab count and symtab symbol_count into one command.

Thanks for the suggestion. I agree with the idea of putting all these analytical options in one subcommand. The count information is already included in the returned value of the size subcommand, so there's no need to keep the count. I'm wondering if keeping all the symbol table related subcommands under symtab could be an option as well? e.g., ion beta symtab analyze will return the number of local symbol tables and the number of symbols.

tgregg

These new features look really useful!

I agree with Matt's comment that it would be nice if we had the option to get all the data at once (maybe something like ion beta analyze all). Additionally, having an option to receive the output in a structured format (Ion!) may be useful, e.g. {value_count: 59155, symbol_table_count: 386, ...}.

Other stats that may be useful:

How many of the symbol tables in the stream are "brand new" vs "append"?
Stats / histogram (like you have for value sizes) for number of symbols introduced by the symbol tables. E.g. what is the average number of new symbols introduced, and how does that change throughout the stream?
Ability to show (i.e., dump) the largest, smallest, or average-size value in the stream.

I mention these not because I expect you to implement them now, but just in case thinking of expansion possibilities influences how you choose to structure the commands.

linlin-s · 2024-04-22T22:47:01Z

These new features look really useful!

I agree with Matt's comment that it would be nice if we had the option to get all the data at once (maybe something like ion beta analyze all). Additionally, having an option to receive the output in a structured format (Ion!) may be useful, e.g. {value_count: 59155, symbol_table_count: 386, ...}.

Other stats that may be useful:
* How many of the symbol tables in the stream are "brand new" vs "append"?

* Stats / histogram (like you have for value sizes) for number of symbols introduced by the symbol tables. E.g. what is the average number of new symbols introduced, and how does that change throughout the stream?

* Ability to show (i.e., dump) the largest, smallest, or average-size value in the stream.
I mention these not because I expect you to implement them now, but just in case thinking of expansion possibilities influences how you choose to structure the commands.

Thanks for providing these new ideas. This would be a great reference for the further expansion.

jobarr-amzn · 2024-04-25T17:23:19Z

For what it's worth I would prefer stat or stats as the command here, it matches another CLI which people are likely familiar with (git).

I think we should have the same "porcelain" vs. "plumbing" distinction in ion commands (see Git Internals Plumbing and Porcelain), rather than a beta subtree where we push everything because we regard that which is not beta as "plumbing".

linlin-s · 2024-05-15T23:23:15Z

For the API design, I agree with combining analyze count, analyze size, analyze depth, symtab count, and symtab symbol_count into one command.
All the information about top-level data size and the symbol table can be retrieved from SystemReader. However, to get the maximum depth, we need to explore elements from the user reader recursively(current implementation). We can also explore readers on different levels, but it will involve a lot of step_in and step_out, which is not efficient for getting the depth. In this case, combining depth with the rest of the subcommands would require using two different readers to retrieve all information at once, which could be expensive when the data size is significantly large.

LazySystemReader could be an option to achieve this, and we can explore LazyValue to avoid step_in/step_out operations when getting the maximum depth of the nested input data stream. However, there are no public methods in lazy_system_reader to get information about value length under the current version of ion-rust. This feature might be supported in the future, but for now, here is the proposed solution:

Combine all of the subcommands into one except depth.

> ion beta stats 

Samples = 59155; Min = 21; Max = 1519
Average = 374; Variance = 195565.635; STD = 442.228
Each ∎ represents a count of 426

[  21 ..  396] [38823] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
[ 396 ..  770] [ 1137] ∎∎
[ 770 .. 1144] [17845] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
[1144 .. 1519] [ 1350] ∎∎∎
The number of symbols is 67808444 
The number of local symbol tables is 387

> ion beta depth 
The maximum depth is 4

Merge ion beta depth to ion beta stats when LazyValue supports accessing the underlying stream data.

popematt · 2024-05-16T17:35:04Z

We can also explore readers on different levels, but it will involve a lot of step_in and step_out, which is not efficient for getting the depth.

Using Element also has the same number of step in/out—it's just a higher level of abstraction that hides the reader from you.

combining depth with the rest of the subcommands would require using two different readers to retrieve all information at once, which could be expensive when the data size is significantly large.

How expensive? If its already 5 seconds for a 20 MB file, and combining the two of them makes it take a few more seconds, I think that's fine. We can always improve the efficiency of it later, and this is something that (a) should not be in a performance critical path, and (b) is still in the beta namespace.

Samples = 59155; Min = 21; Max = 1519
Average = 374; Variance = 195565.635; STD = 442.228
Each ∎ represents a count of 426

Can we indicate the units for some of these?

* Add unit test for analyze_data_stream * Add integration test for subcommand stats.

linlin-s · 2024-05-17T23:31:55Z

We can also explore readers on different levels, but it will involve a lot of step_in and step_out, which is not efficient for getting the depth.

Using Element also has the same number of step in/out—it's just a higher level of abstraction that hides the reader from you.

combining depth with the rest of the subcommands would require using two different readers to retrieve all information at once, which could be expensive when the data size is significantly large.

How expensive? If its already 5 seconds for a 20 MB file, and combining the two of them makes it take a few more seconds, I think that's fine. We can always improve the efficiency of it later, and this is something that (a) should not be in a performance critical path, and (b) is still in the beta namespace.

Samples = 59155; Min = 21; Max = 1519
Average = 374; Variance = 195565.635; STD = 442.228
Each ∎ represents a count of 426

Can we indicate the units for some of these?

If performance is not the concern here I can make them all in one command. We can always optimize it when the LazySystemReader supports accessing underlying stream data.
I will add comment on the final report to indicates the units. plot::Histogramitself doesn't support adding units for the data.

linlin-s · 2024-07-29T17:14:31Z

This PR has been updated in #134

Add subcommands to support data stream analysis.

c2b5ef3

popematt reviewed Apr 20, 2024

View reviewed changes

tgregg reviewed Apr 22, 2024

View reviewed changes

Combine all the subcommands to stats.

d766d23

* Add unit test for analyze_data_stream * Add integration test for subcommand stats.

popematt mentioned this pull request Jul 2, 2024

Adds experimental --ion-version option #128

Merged

linlin-s mentioned this pull request Jul 29, 2024

Add subcommand stats to support data stream analysis. #134

Merged

linlin-s closed this Jul 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add subcommands to support data stream analysis. #104

Add subcommands to support data stream analysis. #104

linlin-s commented Apr 19, 2024

popematt left a comment

popematt Apr 20, 2024

popematt Apr 20, 2024

linlin-s Apr 22, 2024

linlin-s May 17, 2024 •

edited

Loading

linlin-s commented Apr 22, 2024

tgregg left a comment

linlin-s commented Apr 22, 2024

jobarr-amzn commented Apr 25, 2024

linlin-s commented May 15, 2024

popematt commented May 16, 2024

linlin-s commented May 17, 2024

linlin-s commented Jul 29, 2024

Add subcommands to support data stream analysis. #104

Add subcommands to support data stream analysis. #104

Conversation

linlin-s commented Apr 19, 2024

Issue #, if available:

Description of changes:

Testing:

popematt left a comment

Choose a reason for hiding this comment

popematt Apr 20, 2024

Choose a reason for hiding this comment

popematt Apr 20, 2024

Choose a reason for hiding this comment

linlin-s Apr 22, 2024

Choose a reason for hiding this comment

linlin-s May 17, 2024 • edited Loading

Choose a reason for hiding this comment

linlin-s commented Apr 22, 2024

tgregg left a comment

Choose a reason for hiding this comment

linlin-s commented Apr 22, 2024

jobarr-amzn commented Apr 25, 2024

linlin-s commented May 15, 2024

popematt commented May 16, 2024

linlin-s commented May 17, 2024

linlin-s commented Jul 29, 2024

linlin-s May 17, 2024 •

edited

Loading