-
Notifications
You must be signed in to change notification settings - Fork 181
Description
PPL Format Command Requirements
Problem Statement
Users need the format command to transform query results into a single formatted result. The format command takes the current result set and converts it into a single row with a "search" field containing a formatted string, using configurable formatting options including custom separators, delimiters, and handling for empty results.
OpenSearch Integration
The format command will be implemented as a PPL (Piped Processing Language) command in OpenSearch SQL plugin. Users will be able to:
- Use in PPL queries: Add
| formatat the end of any PPL query to format results - Combine with other PPL commands: Chain format with existing commands like
fields,head,where - Use in subsearches: Format command can be explicitly added to subsearch queries when users want to convert results into search criteria
- Export formatted data: Use formatted output for external system integration or reporting
Note: When users want subsearch results to be used as search criteria in the main query, they can explicitly add the format command to the subsearch: source=logs [search source=errors | fields host | format]
Long-Term Goals
Primary Objectives
- Provide a format command that transforms query results into formatted search strings
- Support both explicit usage and implicit usage in subsearch operations
- Enable configurable formatting with custom separators and delimiters
- Maintain performance parity with native OpenSearch operations
Requirements
Use Case 1: Basic Field Formatting with Default Parameters
User Story: As a security analyst, I want to format search results into a search string using default delimiters, so that I can use the output as a filter in subsequent searches.
Acceptance Criteria:
- WHEN a user applies Format_Command with no parameters, THEN THE Format_Command SHALL create a search field containing formatted results
- WHEN Format_Command processes multiple rows, THEN THE Format_Command SHALL use "OR" as the default row separator
- WHEN Format_Command processes multiple columns, THEN THE Format_Command SHALL use "AND" as the default column separator
- WHEN Format_Command creates field-value pairs, THEN THE Format_Command SHALL use the pattern
field="value" - WHEN Format_Command wraps rows, THEN THE Format_Command SHALL use "(" as row prefix and ")" as row suffix
- WHEN Format_Command wraps columns, THEN THE Format_Command SHALL use "(" as column prefix and ")" as column suffix
PPL Query:
source=webserver | fields server_name, log_type, region | head 2 | format
Input Data:
| server_name | log_type | region |
|-------------|----------|-----------|
| web01 | access | us-east |
| web02 | access | us-west |
Expected Output:
| search |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ( ( log_type="access" AND region="us-east" AND server_name="web01" ) OR ( log_type="access" AND region="us-west" AND server_name="web02" ) ) |
Use Case 2: Custom Row and Column Formatting
User Story: As a data engineer, I want to specify custom row and column delimiters, so that I can format output for external systems with different syntax requirements.
Acceptance Criteria:
- WHEN a user specifies row prefix, column prefix, column separator, column end, row separator, and row end, THEN THE Format_Command SHALL use those values for formatting
- WHEN all six formatting parameters are provided, THEN THE Format_Command SHALL apply them in the correct order
- WHEN custom delimiters contain special characters, THEN THE Format_Command SHALL preserve them literally
- WHEN custom delimiters are empty strings, THEN THE Format_Command SHALL omit those delimiters from output
PPL Query:
source=database | fields db_name, table_name | head 2 | format "[" "[" "&&" "]" "||" "]"
Input Data:
| db_name | table_name |
|------------|------------|
| inventory | products |
| sales | orders |
Expected Output:
| search |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [ [ db_name="inventory" && table_name="products" ] || [ db_name="sales" && table_name="orders" ] ] |
Use Case 3: Multivalue Field Separator Configuration
User Story: As a data analyst, I want to configure how multivalue fields are separated in the formatted output, so that I can control the logical operators used between multiple values of the same field.
Acceptance Criteria:
- WHEN Format_Command processes multivalue fields, THEN THE Format_Command SHALL expand each value into separate field-value pairs
- WHEN mvsep parameter is specified, THEN THE Format_Command SHALL use that separator between multivalue field instances
- WHEN mvsep parameter is not specified, THEN THE Format_Command SHALL use "OR" as the default multivalue separator
- WHEN multivalue fields are combined with other fields, THEN THE Format_Command SHALL maintain proper grouping with parentheses
PPL Query:
source=user_activity | fields user_id, permissions | format mvsep="ALSO" "{" "[" "AND" "]" "AND" "}"
Input Data:
| user_id | permissions |
|---------|--------------------|
| john | ["read", "write"] |
Expected Output:
| search |
|------------------------------------------------------------------------------------------|
| { [ ( user_id="john" AND ( permissions="read" ALSO permissions="write" ) ) ] } |
Use Case 4: Empty Result Handling with Default String
User Story: As a security analyst, I want the format command to handle empty result sets gracefully with a default "NOT( )" string, so that subsearches with no results don't break the main query logic.
Acceptance Criteria:
- WHEN Format_Command receives empty input, THEN THE Format_Command SHALL return a single row with a search field
- WHEN Format_Command processes zero rows, THEN THE Format_Command SHALL use "NOT( )" as the default output
- WHEN Format_Command is used in subsearches with no results, THEN THE system SHALL ensure the main query continues to execute properly
- WHEN empty results are formatted, THEN THE Format_Command SHALL maintain the same output structure as non-empty results
PPL Query:
source=metrics | where timestamp < "1970-01-01" | format
Input Data:
(no rows returned due to impossible condition)
Expected Output:
| search |
|----------|
| NOT ( ) |
Use Case 5: Custom Empty String Handling
User Story: As a data engineer, I want to customize the string returned when format processes empty results, so that I can provide meaningful default values for my specific use case.
Acceptance Criteria:
- WHEN Format_Command receives empty input and emptystr parameter is specified, THEN THE Format_Command SHALL return the custom emptystr value
- WHEN emptystr parameter is provided, THEN THE Format_Command SHALL use that value instead of the default "NOT( )"
- WHEN custom emptystr is used in subsearches, THEN THE system SHALL use the custom value as search criteria
- WHEN emptystr contains special characters, THEN THE Format_Command SHALL preserve them literally
PPL Query:
source=events | where status = "nonexistent" | format emptystr="No Data Available"
Input Data:
(no rows returned due to nonexistent status)
Expected Output:
| search |
|-------------------|
| No Data Available |
Use Case 6: Subsearch Failsafe with Custom Empty String
User Story: As a security analyst, I want to use emptystr as a failsafe in subsearches, so that when my subsearch returns no results, I get a meaningful default that ensures my main search still returns useful data.
Acceptance Criteria:
- WHEN a subsearch returns no results and emptystr is specified, THEN THE Format_Command SHALL use the emptystr value as search criteria
- WHEN emptystr provides a failsafe value, THEN THE main search SHALL execute using that value as filter criteria
- WHEN subsearch fails but emptystr is provided, THEN THE system SHALL continue processing with the failsafe value
- WHEN emptystr contains a valid field-value pair, THEN THE main search SHALL return results matching that criteria
PPL Query:
source=security_logs event_type= [search source=threat_intel | head 1 | fields event_type | format emptystr="malware"]
Expected Behavior:
The subsearch [search source=threat_intel | head 1 | fields event_type | format emptystr="malware"] returns no results, so format outputs "malware" as the failsafe value. The main search then becomes source=security_logs event_type=malware, returning all malware-related security events.
Command Syntax Reference
Syntax:
format [mvsep="<mv separator>"] [maxresults=<int>] ["<row prefix>" "<column prefix>" "<column separator>" "<column end>" "<row separator>" "<row end>"] [emptystr="<string>"]
Parameters:
mvsep: Separator for multivalue fields (default: "OR")maxresults: Maximum results to return (default: 0 = unlimited)- Row/column formatting (must specify all 6 if any):
<row prefix>: Default "("<column prefix>: Default "("<column separator>: Default "AND"<column end>: Default ")"<row separator>: Default "OR"<row end>: Default ")"
emptystr: String to output when results are empty (default: "NOT( )")
Key Behaviors:
- Field-value formatting: Creates
field="value"pairs for each column - Row grouping: Each input row becomes a group of field-value pairs
- Multivalue handling: Each value in multivalue field becomes separate field-value pair
- Empty result handling: Returns
emptystrvalue when no user fields are present - All-or-nothing formatting: Must specify all 6 row/column parameters or none
- Output field: Always creates a field named
search