[FEATURE] New fieldsummary PPL command #3026

Open
YANG-DB opened this issue Sep 14, 2024 · 0 comments
Labels
enhancement New feature or request PPL Piped processing language

Describe the solution you'd like
We propose adding a new fieldsummary command to OpenSearch PPL that would provide summary statistics for all fields in the current result set.

This command should:

  1. Calculate basic statistics for each field (count and distinct count for every field; min, max, and avg for numeric fields)
  2. Determine the data type of each field
  3. Show the most frequent values and their counts for each field
  4. Calculate the percentage of events that contain each field
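
To make the four requirements above concrete, here is a minimal Python sketch of the per-field computation over a list of dict records. It is not the proposed PPL implementation: the function name `field_summary`, the result keys, the use of Python type names in place of OpenSearch field types, and the choice to treat a missing key or `None` as "field absent" are all assumptions made for illustration.

```python
from collections import Counter
from numbers import Number


def field_summary(rows, topvalues=3):
    """Summarize each field across a list of dict records (illustrative only)."""
    total = len(rows)
    # Collect field names in first-seen order across all records.
    fields = list(dict.fromkeys(key for row in rows for key in row))
    summary = {}
    for field in fields:
        # Assumption: a missing key or None means the event lacks the field.
        values = [row[field] for row in rows if row.get(field) is not None]
        numeric = bool(values) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        counts = Counter(values)
        summary[field] = {
            "count": len(values),
            "distinct": len(counts),
            # min/max/avg are only reported for numeric fields.
            "min": min(values) if numeric else None,
            "max": max(values) if numeric else None,
            "avg": sum(values) / len(values) if numeric else None,
            # Python type name stands in for the OpenSearch field type.
            "type": type(values[0]).__name__ if values else None,
            "top_values": counts.most_common(topvalues),
            "nulls": total - len(values),
            # Percentage of events that contain the field.
            "coverage_pct": 100.0 * len(values) / total if total else 0.0,
        }
    return summary


rows = [
    {"status_code": 200, "user_id": "user123"},
    {"status_code": 200, "user_id": None},
    {"status_code": 404, "user_id": "user456"},
    {"status_code": 200},
]
result = field_summary(rows, topvalues=2)
```

A real implementation would push these aggregations down to the query engine rather than scan records in memory; the sketch only pins down what each reported statistic means.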

Additionally, the command should support the following key optional parameters:

  1. includefields:
    Specify which fields to include in the summary (e.g., | fieldsummary includefields="status_code,user_id,response_time")
  2. excludefields:
    Specify which fields to exclude from the summary (e.g., | fieldsummary excludefields="internal_id,debug_info")
  3. topvalues:
    Set the number of top values to display for each field (e.g., | fieldsummary topvalues=5)
  4. maxfields:
    Limit the number of fields to display (e.g., | fieldsummary maxfields=20)
  5. nulls:
    Include null/empty value counts (e.g., | fieldsummary nulls=true)
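
The issue does not define how includefields, excludefields, and maxfields interact, so the following Python sketch shows one possible reading: apply the include list first, then the exclude list, then truncate to maxfields. The helper name `select_fields` and this order of precedence are assumptions, not part of the proposal.

```python
def select_fields(available, includefields=None, excludefields=None, maxfields=None):
    """Pick which fields to summarize (hypothetical precedence: include, exclude, limit)."""
    # Parameters arrive as comma-separated strings, as in the examples above.
    include = [f.strip() for f in includefields.split(",")] if includefields else None
    exclude = {f.strip() for f in excludefields.split(",")} if excludefields else set()

    # Keep the order in which fields appear in the result set.
    chosen = [f for f in available if include is None or f in include]
    chosen = [f for f in chosen if f not in exclude]

    # maxfields caps the number of fields after filtering.
    return chosen[:maxfields] if maxfields is not None else chosen


available = ["status_code", "user_id", "response_time", "internal_id", "debug_info"]
included = select_fields(available, includefields="status_code,user_id,response_time")
excluded = select_fields(available, excludefields="internal_id,debug_info")
limited = select_fields(available, maxfields=2)
```

Whether a field named in both lists should be dropped, or whether combining the two parameters should be an error, is a design question the command specification would need to settle.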

Example usage:

source = t
| where timestamp >= "2023-01-01" and timestamp < "2023-02-01"
| fieldsummary includefields="status_code,user_id,response_time" topvalues=3 nulls=true

This command would generate a table with summary statistics for the specified fields in the given date range, showing the top 3 values for each field and including null counts.

Example output:

| Field | Count | Distinct | Min | Max | Avg | Type | Top Values | Nulls |
|---|---|---|---|---|---|---|---|---|
| status_code | 10000 | 4 | 200 | 503 | - | short | 200 (8000, 80%); 404 (1500, 15%); 500 (400, 4%) | 0 |
| user_id | 9500 | 1200 | - | - | - | string | user123 (100, 1%); user456 (95, 1%); user789 (90, 0.9%) | 500 |
| response_time | 10000 | 986 | 0.01 | 10.5 | 0.75 | float | 0.5 (2000, 20%); 0.75 (1800, 18%); 1.0 (1500, 15%) | 0 |