Description
Problem Statement
Users would like a simple, consistent way to extract nested data from structured documents (JSON/XML) stored in text fields. While OpenSearch PPL provides JSON utility functions, the syntax can be verbose and there's no built-in support for XML extraction.
Most structured data should be indexed directly as OpenSearch fields, but this becomes cumbersome when documents have highly flexible formats, or when users don't want to pay indexing costs for many nested fields.
Current State
OpenSearch PPL currently supports JSON extraction through functions like json_extract(), but:
- The function syntax is somewhat verbose when only a single field needs to be extracted
- There's no built-in support for XML document parsing
- Users working with both JSON and XML need to use different approaches for each format
Long-Term Goals
- Provide a unified, intuitive syntax for extracting data from both JSON and XML documents
- Improve the usability of analyzing nested structured data
- Enable consistent handling of structured data regardless of format
- Keep it extensible so we could implement further structured data improvements down the road
Proposal
Starting with the conventions of SPL's spath command, implement an spath command that provides:
- Path-based syntax for both JSON and XML extraction
- Automatic format detection between JSON and XML
- Output consistent with existing JSON functions
Example usage:
source=logs
| spath input=message path=user.id
| where user.id != null
source=events
| spath input=payload path=root.metadata.version output=version
Approach
- Implement the core spath command with (for now) required parameters, based on existing JSON functions:
spath input=<field> [output=<field>] [path=]<datapath>
- Add XML support to the existing JSON infrastructure
- Implement format detection and appropriate parser selection
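As a rough illustration of the detection step, the sketch below guesses JSON vs. XML from the first non-whitespace character and hands off to the matching parser. This is a minimal assumption-laden sketch (function name and heuristic are mine, not part of any OpenSearch API); a real implementation would need to handle bare scalars, malformed input, and XML namespaces.

```python
import json
import xml.etree.ElementTree as ET

def detect_and_parse(text: str):
    """Guess whether a text field holds JSON or XML and parse it accordingly.

    Hypothetical sketch: XML documents must start with '<'; anything else is
    attempted as JSON. Returns a (format, parsed) pair.
    """
    stripped = text.lstrip()
    if stripped.startswith("<"):
        return "xml", ET.fromstring(stripped)
    return "json", json.loads(stripped)

fmt, doc = detect_and_parse('{"user": {"id": 7}}')      # fmt is "json"
fmt, doc = detect_and_parse('<user><id>7</id></user>')  # fmt is "xml"
```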
Down the road, this could be extended based on usage to support features such as dynamic column loading, more input format types, or parsing the results into specified types. Specifically, a built-in cast operation would be a nice enhancement that current JSON extraction doesn't support (we effectively need eval bytes_temp = json_extract(data, 'bytes'), bytes = cast(bytes_temp as int)).
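To make the cast enhancement concrete, here is a sketch of the single-step extract-plus-cast the two-statement PPL workaround is emulating. The helper name extract_and_cast is hypothetical and exists nowhere in OpenSearch; it only illustrates the behavior a built-in cast could provide.

```python
import json

def extract_and_cast(document: str, path: str, target=str):
    """Walk a dotted path through a JSON document and cast the leaf value.

    Hypothetical helper: combines what PPL currently needs two eval steps for
    (json_extract followed by cast).
    """
    value = json.loads(document)
    for key in path.split("."):
        value = value[key]
    return target(value)

# One step instead of: eval bytes_temp = json_extract(data, 'bytes'),
#                      bytes = cast(bytes_temp as int)
print(extract_and_cast('{"bytes": "1024"}', "bytes", int))  # → 1024
```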
Alternatives
Users can continue using existing JSON functions and implement custom XML parsing logic:
source=logs
| eval user_id = json_extract(message, 'user.id')
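For XML there is no PPL equivalent today, so the "custom XML parsing logic" has to live outside the query, e.g. in a post-processing script. A minimal sketch of what that custom logic looks like (function name is mine; paths are resolved relative to the root element):

```python
import xml.etree.ElementTree as ET

def xml_path(document: str, path: str):
    """Resolve a dotted path against an XML string, relative to the root.

    Returns the text of the first matching element, or None if any path
    segment is missing.
    """
    node = ET.fromstring(document)
    for tag in path.split("."):
        node = node.find(tag)
        if node is None:
            return None
    return node.text

print(xml_path("<root><metadata><version>2</version></metadata></root>",
               "metadata.version"))  # → 2
```

An spath command with automatic format detection would replace this per-use-case scripting with a single query clause.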