Skip to content

[RFC] Support PPL format command #4975

@aalva500-prog

Description

@aalva500-prog

PPL Format Command Requirements

Problem Statement

Users need the format command to transform query results into a single formatted result. The format command takes the current result set and converts it into a single row with a "search" field containing a formatted string, using configurable formatting options including custom separators, delimiters, and handling for empty results.

OpenSearch Integration

The format command will be implemented as a PPL (Piped Processing Language) command in OpenSearch SQL plugin. Users will be able to:

  • Use in PPL queries: Add | format at the end of any PPL query to format results
  • Combine with other PPL commands: Chain format with existing commands like fields, head, where
  • Use in subsearches: Format command can be explicitly added to subsearch queries when users want to convert results into search criteria
  • Export formatted data: Use formatted output for external system integration or reporting

Note: When users want subsearch results to be used as search criteria in the main query, they can explicitly add the format command to the subsearch: source=logs [search source=errors | fields host | format]

Long-Term Goals

Primary Objectives

  • Provide a format command that transforms query results into formatted search strings
  • Support both explicit usage and implicit usage in subsearch operations
  • Enable configurable formatting with custom separators and delimiters
  • Maintain performance parity with native OpenSearch operations

Requirements

Use Case 1: Basic Field Formatting with Default Parameters

User Story: As a security analyst, I want to format search results into a search string using default delimiters, so that I can use the output as a filter in subsequent searches.

Acceptance Criteria:

  1. WHEN a user applies Format_Command with no parameters, THEN THE Format_Command SHALL create a search field containing formatted results
  2. WHEN Format_Command processes multiple rows, THEN THE Format_Command SHALL use "OR" as the default row separator
  3. WHEN Format_Command processes multiple columns, THEN THE Format_Command SHALL use "AND" as the default column separator
  4. WHEN Format_Command creates field-value pairs, THEN THE Format_Command SHALL use the pattern field="value"
  5. WHEN Format_Command wraps rows, THEN THE Format_Command SHALL use "(" as row prefix and ")" as row suffix
  6. WHEN Format_Command wraps columns, THEN THE Format_Command SHALL use "(" as column prefix and ")" as column suffix

PPL Query:

source=webserver | fields server_name, log_type, region | head 2 | format

Input Data:

| server_name | log_type | region    |
|-------------|----------|-----------|
| web01       | access   | us-east   |
| web02       | access   | us-west   |

Expected Output:

| search                                                                                                                                                                    |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ( ( log_type="access" AND region="us-east" AND server_name="web01" ) OR ( log_type="access" AND region="us-west" AND server_name="web02" ) )                           |

Use Case 2: Custom Row and Column Formatting

User Story: As a data engineer, I want to specify custom row and column delimiters, so that I can format output for external systems with different syntax requirements.

Acceptance Criteria:

  1. WHEN a user specifies row prefix, column prefix, column separator, column end, row separator, and row end, THEN THE Format_Command SHALL use those values for formatting
  2. WHEN all six formatting parameters are provided, THEN THE Format_Command SHALL apply them in the correct order
  3. WHEN custom delimiters contain special characters, THEN THE Format_Command SHALL preserve them literally
  4. WHEN custom delimiters are empty strings, THEN THE Format_Command SHALL omit those delimiters from output

PPL Query:

source=database | fields db_name, table_name | head 2 | format "[" "[" "&&" "]" "||" "]"

Input Data:

| db_name    | table_name |
|------------|------------|
| inventory  | products   |
| sales      | orders     |

Expected Output:

| search                                                                                                                                                                    |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [ [ db_name="inventory" && table_name="products" ] || [ db_name="sales" && table_name="orders" ] ]                                                                      |

Use Case 3: Multivalue Field Separator Configuration

User Story: As a data analyst, I want to configure how multivalue fields are separated in the formatted output, so that I can control the logical operators used between multiple values of the same field.

Acceptance Criteria:

  1. WHEN Format_Command processes multivalue fields, THEN THE Format_Command SHALL expand each value into separate field-value pairs
  2. WHEN mvsep parameter is specified, THEN THE Format_Command SHALL use that separator between multivalue field instances
  3. WHEN mvsep parameter is not specified, THEN THE Format_Command SHALL use "OR" as the default multivalue separator
  4. WHEN multivalue fields are combined with other fields, THEN THE Format_Command SHALL maintain proper grouping with parentheses

PPL Query:

source=user_activity | fields user_id, permissions | format mvsep="ALSO" "{" "[" "AND" "]" "AND" "}"

Input Data:

| user_id | permissions        |
|---------|--------------------|
| john    | ["read", "write"]  |

Expected Output:

| search                                                                                   |
|------------------------------------------------------------------------------------------|
| { [ ( user_id="john" AND ( permissions="read" ALSO permissions="write" ) ) ] }           |

Use Case 4: Empty Result Handling with Default String

User Story: As a security analyst, I want the format command to handle empty result sets gracefully with a default "NOT( )" string, so that subsearches with no results don't break the main query logic.

Acceptance Criteria:

  1. WHEN Format_Command receives empty input, THEN THE Format_Command SHALL return a single row with a search field
  2. WHEN Format_Command processes zero rows, THEN THE Format_Command SHALL use "NOT( )" as the default output
  3. WHEN Format_Command is used in subsearches with no results, THEN THE system SHALL ensure the main query continues to execute properly
  4. WHEN empty results are formatted, THEN THE Format_Command SHALL maintain the same output structure as non-empty results

PPL Query:

source=metrics | where timestamp < "1970-01-01" | format

Input Data:

(no rows returned due to impossible condition)

Expected Output:

| search   |
|----------|
| NOT ( )  |

Use Case 5: Custom Empty String Handling

User Story: As a data engineer, I want to customize the string returned when format processes empty results, so that I can provide meaningful default values for my specific use case.

Acceptance Criteria:

  1. WHEN Format_Command receives empty input and emptystr parameter is specified, THEN THE Format_Command SHALL return the custom emptystr value
  2. WHEN emptystr parameter is provided, THEN THE Format_Command SHALL use that value instead of the default "NOT( )"
  3. WHEN custom emptystr is used in subsearches, THEN THE system SHALL use the custom value as search criteria
  4. WHEN emptystr contains special characters, THEN THE Format_Command SHALL preserve them literally

PPL Query:

source=events | where status = "nonexistent" | format emptystr="No Data Available"

Input Data:

(no rows returned due to nonexistent status)

Expected Output:

| search            |
|-------------------|
| No Data Available |

Use Case 6: Subsearch Failsafe with Custom Empty String

User Story: As a security analyst, I want to use emptystr as a failsafe in subsearches, so that when my subsearch returns no results, I get a meaningful default that ensures my main search still returns useful data.

Acceptance Criteria:

  1. WHEN a subsearch returns no results and emptystr is specified, THEN THE Format_Command SHALL use the emptystr value as search criteria
  2. WHEN emptystr provides a failsafe value, THEN THE main search SHALL execute using that value as filter criteria
  3. WHEN subsearch fails but emptystr is provided, THEN THE system SHALL continue processing with the failsafe value
  4. WHEN emptystr contains a valid field-value pair, THEN THE main search SHALL return results matching that criteria

PPL Query:

source=security_logs event_type= [search source=threat_intel | head 1 | fields event_type | format emptystr="malware"]

Expected Behavior:
The subsearch [search source=threat_intel | head 1 | fields event_type | format emptystr="malware"] returns no results, so format outputs "malware" as the failsafe value. The main search then becomes source=security_logs event_type=malware, returning all malware-related security events.

Command Syntax Reference

Syntax:

format [mvsep="<mv separator>"] [maxresults=<int>] ["<row prefix>" "<column prefix>" "<column separator>" "<column end>" "<row separator>" "<row end>"] [emptystr="<string>"]

Parameters:

  • mvsep: Separator for multivalue fields (default: "OR")
  • maxresults: Maximum results to return (default: 0 = unlimited)
  • Row/column formatting (must specify all 6 if any):
    • <row prefix>: Default "("
    • <column prefix>: Default "("
    • <column separator>: Default "AND"
    • <column end>: Default ")"
    • <row separator>: Default "OR"
    • <row end>: Default ")"
  • emptystr: String to output when results are empty (default: "NOT( )")

Key Behaviors:

  • Field-value formatting: Creates field="value" pairs for each column
  • Row grouping: Each input row becomes a group of field-value pairs
  • Multivalue handling: Each value in multivalue field becomes separate field-value pair
  • Empty result handling: Returns emptystr value when no user fields are present
  • All-or-nothing formatting: Must specify all 6 row/column parameters or none
  • Output field: Always creates a field named search

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions