
[RFC] Improved structured data extraction with spath #4119

@Swiddis


Problem Statement

Users would like a simple, consistent way to extract nested data from structured documents (JSON/XML) stored in text fields. While OpenSearch PPL provides JSON utility functions, the syntax can be verbose and there's no built-in support for XML extraction.

Most structured data should be indexed directly as OpenSearch fields, but this becomes impractical when documents have highly flexible formats, or when users don't want to pay indexing costs for many nested fields.

Current State

OpenSearch PPL currently supports JSON extraction through functions like json_extract(), but:

  • The function syntax is verbose when extracting just a single field
  • There's no built-in support for XML document parsing
  • Users working with both JSON and XML need to use different approaches for each format

Long-Term Goals

  • Provide a unified, intuitive syntax for extracting data from both JSON and XML documents
  • Improve the usability of analyzing nested structured data
  • Enable consistent handling of structured data regardless of format
  • Keep it extensible so we could implement further structured data improvements down the road

Proposal

Starting with the conventions from SPL's spath command, implement an spath command that provides:

  1. Path-based syntax for both JSON and XML extraction
  2. Automatic format detection between JSON and XML
  3. Output consistent with existing JSON functions
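Automatic detection between the two formats could be as simple as sniffing the first non-whitespace character, along these lines (a Python sketch; `detect_format` is an illustrative name, not part of the proposal):

```python
def detect_format(raw: str) -> str:
    """Guess whether a raw string holds JSON or XML by its first
    non-whitespace character. Returns 'json', 'xml', or 'unknown'."""
    stripped = raw.lstrip()
    # JSON documents open with an object or array delimiter
    if stripped.startswith(("{", "[")):
        return "json"
    # XML documents open with a tag (or declaration) bracket
    if stripped.startswith("<"):
        return "xml"
    return "unknown"
```

A cheap sniff like this avoids attempting a full parse twice; the chosen parser can still reject malformed input afterwards.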

Example usage:

source=logs 
| spath input=message path=user.id 
| where user.id != null

source=events
| spath input=payload path=root.metadata.version output=version

Approach

  1. Implement a core spath command with (for now) required parameters, based on existing JSON functions:
     spath input=<field> [output=<field>] [path=]<datapath>
  2. Add XML support to the existing JSON infrastructure
  3. Implement format detection and appropriate parser selection
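Putting those steps together, the extraction behavior might look roughly like this Python sketch (the function name and the dict-only JSON traversal are illustrative simplifications, not the proposed implementation):

```python
import json
import xml.etree.ElementTree as ET


def spath_extract(raw: str, path: str):
    """Extract a dotted path from a JSON or XML string, returning
    None when the document doesn't parse or the path is missing."""
    parts = path.split(".")
    stripped = raw.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            node = json.loads(stripped)
        except ValueError:
            return None
        for part in parts:
            # Object traversal only; array indexing omitted for brevity
            if not isinstance(node, dict) or part not in node:
                return None
            node = node[part]
        return node
    if stripped.startswith("<"):
        try:
            root = ET.fromstring(stripped)
        except ET.ParseError:
            return None
        # SPL-style paths name the root element as the first step
        if parts and parts[0] == root.tag:
            parts = parts[1:]
        node = root
        for part in parts:
            node = node.find(part)
            if node is None:
                return None
        return node.text
    return None
```

For example, `spath_extract('{"user": {"id": 7}}', 'user.id')` yields the nested value, and the same call shape works for an XML payload with path `root.metadata.version`, mirroring the example queries above.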

Down the road, this could be extended based on usage to support features such as dynamic column loading, more input format types, or parsing the results into specified types. In particular, a built-in cast operation would be a nice enhancement that current JSON extraction doesn't support (we effectively need eval bytes_temp = json_extract(data, 'bytes'), bytes = cast(bytes_temp as int)).

Alternatives

Users can continue using existing JSON functions and implement custom XML parsing logic:

source=logs 
| eval user_id = json_extract(message, 'user.id')
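For XML, where no built-in function exists, the "custom parsing logic" has to live outside PPL entirely, e.g. a client-side helper along these lines (a hypothetical sketch using Python's standard library; the field and tag names are illustrative):

```python
import xml.etree.ElementTree as ET


def extract_user_id(message: str):
    """Client-side workaround: pull user/id out of an XML payload
    that PPL's JSON functions can't handle. Returns None on failure."""
    try:
        root = ET.fromstring(message)
    except ET.ParseError:
        return None
    node = root.find("user/id")
    return node.text if node is not None else None
```

Needing per-format, per-field helpers like this is exactly the friction the unified spath command is meant to remove.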

Labels: PPL (Piped processing language), v3.3.0
