Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add subcommands to support data stream analysis. #104

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,13 @@ keywords = ["format", "parse", "encode"]
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
lowcharts = "*"
anyhow = "1.0"
clap = { version = "4.0.17", features = ["cargo"] }
colored = "2.0.0"
flate2 = "1.0"
infer = "0.15.0"
ion-rs = {version = "1.0.0-rc.2", features = ["experimental"]}
ion-rs = {version = "1.0.0-rc.3", features = ["experimental"]}
memmap = "0.7.0"
tempfile = "3.2.0"
ion-schema = "0.10.0"
Expand Down
55 changes: 0 additions & 55 deletions src/bin/ion/commands/beta/count.rs

This file was deleted.

6 changes: 3 additions & 3 deletions src/bin/ion/commands/beta/mod.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
pub mod count;
pub mod from;

#[cfg(feature = "experimental-code-gen")]
Expand All @@ -7,17 +6,18 @@ pub mod head;
pub mod inspect;
pub mod primitive;
pub mod schema;
pub mod stats;
pub mod symtab;
pub mod to;

use crate::commands::beta::count::CountCommand;
use crate::commands::beta::from::FromNamespace;
#[cfg(feature = "experimental-code-gen")]
use crate::commands::beta::generate::GenerateCommand;
use crate::commands::beta::head::HeadCommand;
use crate::commands::beta::inspect::InspectCommand;
use crate::commands::beta::primitive::PrimitiveCommand;
use crate::commands::beta::schema::SchemaNamespace;
use crate::commands::beta::stats::StatsCommand;
use crate::commands::beta::symtab::SymtabNamespace;
use crate::commands::beta::to::ToNamespace;
use crate::commands::IonCliCommand;
Expand All @@ -35,11 +35,11 @@ impl IonCliCommand for BetaNamespace {

fn subcommands(&self) -> Vec<Box<dyn IonCliCommand>> {
vec![
Box::new(CountCommand),
Box::new(InspectCommand),
Box::new(PrimitiveCommand),
Box::new(SchemaNamespace),
Box::new(HeadCommand),
Box::new(StatsCommand),
Box::new(FromNamespace),
Box::new(ToNamespace),
Box::new(SymtabNamespace),
Expand Down
232 changes: 232 additions & 0 deletions src/bin/ion/commands/beta/stats.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,232 @@
use crate::commands::{IonCliCommand, WithIonCliArgument};
use anyhow::{bail, Context, Result};
use clap::{ArgMatches, Command};
use ion_rs::{
BinaryWriterBuilder, Element, ElementReader, IonReader, IonType, IonWriter, RawBinaryReader,

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (macos-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (macos-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (macos-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (macos-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (ubuntu-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (ubuntu-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (ubuntu-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (ubuntu-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (ubuntu-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (ubuntu-latest)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (windows-2019)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (windows-2019)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (windows-2019)

unused imports: `BinaryWriterBuilder`, `IonWriter`

Check warning on line 5 in src/bin/ion/commands/beta/stats.rs

View workflow job for this annotation

GitHub Actions / Build and Test (windows-2019)

unused imports: `BinaryWriterBuilder`, `IonWriter`
Reader, ReaderBuilder, SystemReader, SystemStreamItem,
};
use lowcharts::plot;
use memmap::MmapOptions;
use std::fs::File;

pub struct StatsCommand;

impl IonCliCommand for StatsCommand {
fn name(&self) -> &'static str {
"stats"
}

fn about(&self) -> &'static str {
"Print the analysis report of the input data stream, including the total number of top-level values, their minimum, maximum, and mean sizes, \n\
and plot the size distribution of the input stream. The report should also include the number of symbol tables in the input stream, \n\
the total number of different symbols that occurred in the input stream, and the maximum depth of the input data stream. \n\
Currently, this subcommand only supports data analysis on binary Ion data."
}

fn configure_args(&self, command: Command) -> Command {
command.with_input()
}

fn run(&self, _command_path: &mut Vec<String>, args: &ArgMatches) -> Result<()> {
if let Some(input_file_names) = args.get_many::<String>("input") {
for input_file in input_file_names {
let file = File::open(input_file.as_str())
.with_context(|| format!("Could not open file '{}'", &input_file))?;
let mmap = unsafe {
MmapOptions::new()
.map(&file)
.with_context(|| format!("Could not mmap '{}'", input_file))?
};
// Treat the mmap as a byte array.
let ion_data: &[u8] = &mmap[..];
match ion_data {
// Pattern match the byte array to verify it starts with an IVM
// Currently the 'stats' subcommand only support binary Ion data stream.
[0xE0, 0x01, 0x00, 0xEA, ..] => {
// TODO: When there is a new release of LazySystemReader which supports accessing the underlying stream data, \n
// we should only initialize LazySystemReader to get all the statistical information instead of initializing user\n
// reader and system reader for retrieving different information.
// Initialize user reader for maximum depth calculation.
let reader = ReaderBuilder::new().build(file)?;
// Initialize system reader to get the information of symbol tables and value length.
let raw_reader = RawBinaryReader::new(ion_data);
let mut system_reader = SystemReader::new(raw_reader);
// Generate data analysis report.
analyze(&mut system_reader, reader, &mut std::io::stdout())
.expect("Failed to analyze the input data stream.");
}
_ => {
bail!(
"Input file '{}' does not appear to be binary Ion.",
input_file
);
}
};
}
} else {
bail!("this command does not yet support reading from STDIN")
}
Ok(())
}
}

#[derive(Debug, Clone, PartialEq)]
struct Output {
size_vec: Vec<f64>,
symtab_count: i32,
symbols_count: usize,
max_depth: usize,
}

fn analyze_data_stream(
system_reader: &mut SystemReader<RawBinaryReader<&[u8]>>,
reader: Reader,
) -> Output {
let mut size_vec: Vec<f64> = Vec::new();
let mut symtab_count = 0;
loop {
let system_value = system_reader.next().unwrap();
match system_value {
SystemStreamItem::SymbolTableValue(IonType::Struct) => {
symtab_count += 1;
}
SystemStreamItem::Value(..) => {
let size = system_reader.annotations_length().map_or(
system_reader.header_length() + system_reader.value_length(),
|annotations_length| {
annotations_length
+ system_reader.header_length()
+ system_reader.value_length()
},
);
size_vec.push(size as f64);
}
SystemStreamItem::Nothing => break,
_ => {}
}
}
// Reduce the number of shared symbols.
let symbols_count = system_reader.symbol_table().symbols().iter().len() - 10;
let max_depth = get_max_depth(reader);

let out = Output {
size_vec,
symtab_count,
symbols_count,
max_depth,
};
return out;
}

fn get_max_depth(mut reader: Reader) -> usize {
reader
.elements()
.map(|element| calculate_top_level_max_depth(&element.unwrap(), 0))
.max()
.unwrap_or(0)
}

fn calculate_top_level_max_depth(element: &Element, depth: usize) -> usize {
return if element.ion_type().is_container() {
if element.ion_type() == IonType::Struct {
element
.as_struct()
.unwrap()
.iter()
.map(|(_field_name, e)| calculate_top_level_max_depth(e, depth + 1))
.max()
.unwrap_or(depth)
} else {
element
.as_sequence()
.unwrap()
.into_iter()
.map(|e| calculate_top_level_max_depth(e, depth + 1))
.max()
.unwrap_or(depth)
}
} else {
depth
};
}

fn analyze(
system_reader: &mut SystemReader<RawBinaryReader<&[u8]>>,
reader: Reader,
mut writer: impl std::io::Write,
) -> Result<()> {
let out = analyze_data_stream(system_reader, reader);
// Plot a histogram of the above vector, with 4 buckets and a precision
// chosen by library. The number of buckets could be changed as needed.
let options = plot::HistogramOptions {
intervals: 4,
..Default::default()
};
let histogram = plot::Histogram::new(&out.size_vec, options);
writeln!(
writer,
"The 'samples' field represents the total number of top-level value of input data stream. The unit of min, max ,avg size is bytes.\n\
{}",
histogram
)
.expect("There is an error occurred while plotting the size distribution of input data stream.");
writeln!(writer, "The number of symbols is {} ", out.symbols_count)
.expect("There is an error occurred while writing the symbols_count.");
writeln!(
writer,
"The number of local symbol tables is {} ",
out.symtab_count
)
.expect("There is an error occurred while writing the symtab_count.");
writeln!(
writer,
"The maximum depth of the input data stream is {}",
out.max_depth
)
.expect("There is an error occurred while writing the max_depth.");
Ok(())
}

#[test]
fn test_analyze() -> Result<()> {
let expect_out = Output {
size_vec: Vec::from([10.0, 15.0, 6.0, 6.0]),
symtab_count: 4,
symbols_count: 8,
max_depth: 2,
};
let test_data: &str = r#"
{
foo: bar,
abc: [123, 456]
}
{
foo: baz,
abc: [42.0, 43e0]
}
{
foo: bar,
test: data
}
{
foo: baz,
type: struct
}
"#;
let binary_buffer = {
let mut buffer = Vec::new();
let mut writer = BinaryWriterBuilder::new().build(&mut buffer)?;
for element in ReaderBuilder::new().build(test_data.as_bytes())?.elements() {
element?.write_to(&mut writer)?;
writer.flush()?;
}
buffer
};

let mut system_reader = SystemReader::new(RawBinaryReader::new(binary_buffer.as_slice()));
let reader = ReaderBuilder::new().build(binary_buffer.as_slice())?;
let output = analyze_data_stream(&mut system_reader, reader);

assert_eq!(output, expect_out);
Ok(())
}
Loading
Loading