-
Notifications
You must be signed in to change notification settings - Fork 73
Supplemental
Joel Natividad edited this page Jan 23, 2025
·
15 revisions
The qsv stats
command computes summary statistics and infers data types for each column in a CSV file. Here's a detailed explanation of the output:
- field: The name of the column (or its index if --no-headers is used).
-
type: Inferred data type (NULL, Integer, String, Float, Date, DateTime, or Boolean).
- data type inferences are GUARANTEED as
stats
scans the entire file. - Date and DateTime are only inferred when
--infer-dates
is enabled. It can infer 19 date formats. - Booleans are only inferred when
-infer-boolean
is enabled. The heuristic for inferring booleans is as follows: When a column's cardinality is 2, and the 2 values' first characters are 0/1, t/f & y/n case-insensitive, the data type is inferred as boolean.
- data type inferences are GUARANTEED as
- is_ascii: Whether the column contains only ASCII characters (true/false).
-
sum: The total sum of all numeric values in the column. For Integers, returns
*OVERFLOW*
/*UNDERFLOW*
when the sum is greater than/lesser than i64::MAX/i64::MIN. For Floats, it returns NaN as the string "NaN", positive infinity as "inf", and negative infinity as "-inf". - min: The minimum value in the column.
- max: The maximum value in the column.
- range: The difference between the maximum and minimum values.
- sort_order: The sorting order of the column (ASCENDING, DESCENDING, or UNSORTED).
NOTE: lengths are byte not char lengths as some UTF-8 characters take more than one byte. They're only computed for String, Float and Integer types.
For Floats, its the string length of the decimal string formatted using the ryu crate. For Integers, formatted using the itoa crate.
- min_length: The length of the shortest value in the column.
- max_length: The length of the longest value in the column.
-
sum_length: The total sum of the lengths of the column. Returns
*OVERFLOW*
when the sum is greater than u64::MAX -
avg_length: The average length of the column. Returns
*OVERFLOW*
whensum_length
overflows. - stddev_length: Standard deviation of the length (only computed for String type)
- variance_length: Variance of the length (only computed for String type)
- cv_length: Coefficient of Variation, length (only computed for String type)
- mean: The average value of the column.
- sem: Standard Error of the Mean, a measure of the precision of the sample mean.
- geometric_mean: is a type of average that finds the central value by multiplying numbers together and then taking the root, making it useful for comparing things like growth rates or percentages. Only valid for positive numbers. Returns zero if there's a zero in the sample.
- harmonic_mean: is a type of average that gives more weight to smaller values, making it ideal for situations like calculating average rates or speeds. Only valid for positive numbers.
- stddev: Standard deviation, a measure of variability in the data.
- variance: The average of the squared differences from the mean.
- cv: Coefficient of Variation, the ratio of the standard deviation to the mean.
- nullcount: The number of null or empty values in the column.
- max_precision: The maximum number of decimal places in numeric values.
- sparsity: The proportion of null or empty values in the column.
-
median: The middle value when the data is sorted (requires
--median
or--everything
) When--quartiles
is specified, this is not returned as its the same as "q2_median". -
mad: Median Absolute Deviation, a robust measure of variability (requires
--mad
or--everything
).
(requires--quartiles
or--everything
). - lower_outer_fence: Q1 - 3 * IQR, used to identify extreme outliers.
- lower_inner_fence: Q1 - 1.5 * IQR, used to identify mild outliers.
- q1: First quartile (25th percentile).
- q2_median: Second quartile (50th percentile, same as median).
- q3: Third quartile (75th percentile).
- iqr: Interquartile Range, the difference between Q3 and Q1.
- upper_inner_fence: Q3 + 1.5 * IQR, used to identify mild outliers.
- upper_outer_fence: Q3 + 3 * IQR, used to identify extreme outliers.
- skewness: A measure of the asymmetry of the probability distribution.
-
cardinality: The number of unique values in the column (requires
--cardinality
or--everything
). -
mode: The most frequent value(s) in the column (requires
--mode
or--everything
). - mode_count: The number of modes.
- mode_occurrences: The number of times the mode(s) appear.
- antimode: The least frequent non-zero/non-null value(s) in the column.
- antimode_count: The number of antimodes.
- antimode_occurrences: The number of times the antimode(s) appear.
When --infer-dates
is enabled, several statistics are computed as follows:
- Date range, standard deviation, variance, MAD, and IQR are returned in days.
- DateTime results are in RFC3339 datetime format (YYYY-MM-DDTHH:MM:SS±HH:MM) - e.g. 2022-01-01T00:16:00+00:00
- Date results are in RFC3339 date format (YYYY-MM-DD) in the UTC timezone.
- The default "streaming" statistics (sum, min/max/range, sort order, min/max/sum/avg/stddev/variance/cv length, mean, sem, geometric_mean, harmonic_mean, stddev, variance, cv, nullcount, max_precision, sparsity) works with constant memory and can be computed efficiently on arbitrarily large CSV files.
- Advanced statistics require require more memory, proportional to the cardinality of each column, and must be explicitly enabled.
- The command supports various caching options to improve performance on subsequent runs.
- The stats command is central to qsv and underpins other "smart" commands like
frequency
,pivotp
,schema
,validate
, andtojsonl
that uses the statistical info to work smarter and faster.
For more detailed information on specific options and usage, refer to the qsv stats --help
output.