-
Notifications
You must be signed in to change notification settings - Fork 181
Description
Summary
This RFC proposes the implementation of the mvjoin function in OpenSearch PPL Calcite (V3 engine). The function joins array elements into a single string using a specified delimiter, enhancing PPL's array manipulation capabilities and supporting common data processing workflows.
Motivation
Background
OpenSearch PPL provides comprehensive array manipulation capabilities for data processing and analysis. The mvjoin function fills a key gap by providing array-to-string conversion functionality that is essential for many data processing workflows.
Current State
PPL Calcite currently supports several array functions introduced in version 3.1.0:
ARRAY()- Creates arrays from input valuesARRAY_LENGTH()- Returns array lengthFORALL,EXISTS,FILTER,TRANSFORM,REDUCE- Higher-order functions with lambda support
However, there is no function to join array elements into strings, which is essential for:
- Data Export: Converting arrays to delimited strings for external systems
- Display Formatting: Creating human-readable string representations
- Data Integration: Supporting common data processing patterns and workflows
- Report Generation: Creating formatted output strings from array data
Use Cases
- Log Analysis: Join extracted field arrays into CSV format
- Data Integration: Convert multivalue fields to string format for downstream systems
- Report Generation: Create formatted output strings from array data
- Data Processing: Transform arrays for use in string-based operations and filters
Detailed Design
Function Specification
Syntax
mvjoin(<mv>, <delim>)
Parameters
- mv (ARRAY): Multivalue field (array) containing values to join
- delim (STRING): String delimiter to use as separator between individual values
Return Type
- STRING: Single concatenated string with individual values separated by delimiter
Behavior
- Multivalue Field Processing: Works on any multivalue field, not just string arrays
- Type Coercion: Numeric values and other data types are automatically coerced to text during join
- Single Value Handling: If input is single-valued, returns that value as-is (converted to string)
- String Concatenation: Concatenates individual values using the delimiter as separator
- Element Order: Maintains element order from original multivalue field
- Empty Array: Returns empty string for empty arrays
- Null Handling: Handles null elements by converting them to "null" strings
- Universal Input: Accepts any field type (string, numeric, boolean, etc.) and always returns string
- Industry Standard: Follows common multivalue join function patterns
Examples
# Basic numeric array joining
source=logs | eval base = array(1, 2, 3, 4, 5) | eval joined = mvjoin(base, " OR ")
# Result: joined = "1 OR 2 OR 3 OR 4 OR 5" (numbers coerced to strings)
# Single-valued field handling
source=logs | eval single_value = 42 | eval result = mvjoin(single_value, ",")
# Result: result = "42" (single value returned as string)
# Mixed data types with automatic coercion
source=logs | eval mixed = array(1, 2.5, "text", true) | eval result = mvjoin(mixed, ";")
# Result: result = "1;2.5;text;true" (all types coerced to strings)
# Join multivalue field with semicolon
source=logs | eval n = mvjoin(myfield, ";")
# Result: n = "value1;value2;value3"
# String arrays (traditional case)
source=logs | eval tags = array("error", "timeout", "retry") | eval tags_str = mvjoin(tags, ",")
# Result: tags_str = "error,timeout,retry"
# Numeric arrays with automatic string conversion
source=logs | eval nums = array(100, 200, 300) | eval result = mvjoin(nums, " | ")
# Result: result = "100 | 200 | 300"
# Boolean and numeric mix
source=logs | eval flags = array(true, false, 1, 0) | eval result = mvjoin(flags, ",")
# Result: result = "true,false,1,0"
# Edge case: Empty array handling
source=logs | eval empty = array() | eval result = mvjoin(empty, ",")
# Result: result = ""
# Use in where clause for filtering
source=logs | where mvjoin(tags, ",") LIKE "%error%"
# Format arrays for display
source=logs | eval formatted_tags = mvjoin(tag_array, ", ")
Implementation Architecture
Implementation Strategy Analysis
Option 1: Calcite Built-in Functions
Calcite provides relevant string joining functions that could potentially be leveraged:
- ARRAY_JOIN(array, delimiter [, nullText]): Synonym for ARRAY_TO_STRING, joins array elements into a string
- Signature:
ARRAY_JOIN(array, delimiter [, nullText]) - Behavior: Returns concatenation of array elements as STRING using delimiter
- Null Handling: Omits NULL values (or replaces with nullText if provided)
- Return: NULL if any argument is NULL
- Input Requirement: Strict ARRAY type only
- Signature:
- STRING_AGG(value [, separator]): Aggregate function for joining strings
- CONCAT_WS(separator, arg1, ...): Concatenate with separator
Analysis:
ARRAY_JOINexists and provides basic array-to-string functionality- Critical Limitation: Requires strict array input - cannot handle single-value fields
- Behavioral Mismatch: Omits NULL values instead of converting to "null" strings
- Type Restriction: Only accepts array types, not universal field types as required
- Would require complex wrapper logic to handle single values and achieve mvjoin behavior
Option 1b: Hybrid Approach - Custom UDAF Using Calcite ARRAY_JOIN
Implement custom function that leverages Calcite's ARRAY_JOIN for array cases while handling single values:
public static class MvjoinHybridImplementor implements NotNullImplementor {
@Override
public Expression implement(RexToLixTranslator translator, RexCall call, List<Expression> translatedOperands) {
// Runtime logic: detect array vs single value, delegate to ARRAY_JOIN for arrays
return Expressions.call(
MvjoinHybridImplementor.class,
"mvjoinHybrid",
translatedOperands.get(0), // mv field
translatedOperands.get(1) // delimiter
);
}
public static String mvjoinHybrid(Object mvField, String delimiter) {
if (mvField == null) return null;
// Handle single-valued fields: return as string
if (!(mvField instanceof List)) {
return mvField.toString();
}
// For arrays, could potentially use ARRAY_JOIN via RexBuilder call
// But still need custom logic for null handling differences
List<?> array = (List<?>) mvField;
return array.stream()
.map(element -> element == null ? "null" : element.toString())
.collect(Collectors.joining(delimiter));
}
}Analysis:
- Partial Reuse: Could leverage ARRAY_JOIN for pure array cases
- Still Complex: Requires wrapper logic for single values and null handling
- Mixed Approach: Combines custom logic with built-in function calls
- Performance Overhead: Type checking and delegation adds complexity
Option 2: Custom Implementation with Calcite Integration
Implement custom MvjoinFunctionImpl using Calcite's UDF framework:
- Full control over SPL-compatible behavior
- Handles universal input types (numeric, string, boolean, etc.)
- Manages single-value vs multivalue field logic
- Integrates with existing CollectionUDF pattern
Analysis:
- Provides exact behavioral compatibility
- Better control over type coercion and edge cases
- Consistent with existing OpenSearch PPL function implementations
Option 1c: Type-Aware AST Implementation Using Existing Patterns
Leverage OpenSearch SQL's existing type-aware function registration system used by functions like LOG, ATAN, and XOR. Note: The recent upstream commit (d758163) improved the type checking infrastructure and simplified registration patterns:
// In PPLBuiltinOperators.java - define SqlOperators for different input types
public static final SqlOperator MVJOIN_ARRAY = new MvjoinArrayFunctionImpl().toUDF("mvjoin");
public static final SqlOperator MVJOIN_SINGLE = new MvjoinSingleFunctionImpl().toUDF("mvjoin");
// In PPLFuncImpTable.java - register multiple operators under same function name
// The improved registerOperator method handles type checking automatically
registerOperator(MVJOIN, MVJOIN_ARRAY, MVJOIN_SINGLE);
// Alternative: Use the new external registration pattern
registerExternalOperator(MVJOIN, MVJOIN_ARRAY);
registerExternalOperator(MVJOIN, MVJOIN_SINGLE);Analysis:
- ** Type Resolution at Analysis Time**: No runtime type checking overhead, follows proven patterns
- ** Optimal Performance**: Different optimized implementations for arrays vs single values
- ** Existing Infrastructure**: Uses established
createFunctionImpWithTypeCheckerpattern - ** Clean Architecture**: Multiple function registrations handle different type combinations
- ** Behavioral Control**: Each registration can have specific null handling and type coercion
- ** Proven Pattern**: Same approach successfully used by
LOG(1 vs 2 args),ATAN(1 vs 2 args),XOR(boolean-specific)
Option 1d: CASE-Based Single Function Implementation
Use Calcite's CASE expressions for runtime type detection and routing within a single function:
// In PPLFuncImpTable.java - single function with CASE-based type routing
register(
MVJOIN,
createFunctionImpWithTypeChecker(
(builder, mvFieldArg, delimiterArg) -> {
// Create type detection condition
RexNode isArrayCondition = builder.makeCall(
PPLBuiltinOperators.IS_ARRAY_TYPE, mvFieldArg);
// Array path: delegate to ARRAY_JOIN with null handling
RexNode arrayPath = builder.makeCall(
PPLBuiltinOperators.MVJOIN_ARRAY_HANDLER, mvFieldArg, delimiterArg);
// Single value path: direct string conversion
RexNode singleValuePath = builder.makeCall(
SqlStdOperatorTable.CAST, mvFieldArg,
builder.getTypeFactory().createSqlType(SqlTypeName.VARCHAR));
// CASE expression: runtime routing based on type
return builder.makeCall(
SqlStdOperatorTable.CASE,
isArrayCondition, // WHEN condition
arrayPath, // THEN array handling
singleValuePath); // ELSE single value handling
},
PPLTypeChecker.family(SqlTypeFamily.ANY, SqlTypeFamily.STRING)));Analysis:
- ** Single Function**: One registration handles all input types dynamically
- ** Runtime Flexibility**: Can handle mixed type scenarios within same query
- ** Proven CASE Pattern**: Uses existing
SqlStdOperatorTable.CASEfromNULLIF,IFfunctions - ** Runtime Overhead**: Type detection happens during execution, not analysis
- ** Complex Debugging**: Conditional logic embedded in SQL expression tree
- ** CASE Infrastructure**: Leverages existing Calcite CASE support and type detection utilities
Option 3: OpenSearch Pushdown Optimization
Explore pushing mvjoin operations to OpenSearch level:
Challenges:
- OpenSearch doesn't have native array join functionality in aggregations
- Would require custom painless scripts for complex operations
- Limited benefit since mvjoin is typically used in eval (post-processing)
Analysis:
- Not suitable for pushdown due to lack of native OpenSearch support
- Function operates on results rather than during aggregation
- Better to implement at Calcite level for consistency
Recommended Approach: Type-Aware AST Implementation (Option 1c)
Based on analysis and discovery of existing proven patterns in the codebase, the type-aware AST implementation provides the optimal balance of performance, maintainability, and architectural consistency.
Detailed Implementation Comparison:
| Aspect | mvjoin (Required) | Option 1: ARRAY_JOIN | Option 1b: Hybrid | Option 1c: Type-Aware AST | Option 1d: CASE-Based | Option 2: Custom Implementation |
|---|---|---|---|---|---|---|
| Single Values | ✅ mvjoin(42, ",") → "42" |
❌ ERROR | ✅ Wrapper logic | ✅ Separate registration | ✅ CASE routing | ✅ Full control |
| Universal Input | ✅ Any field type | ❌ Arrays only | ✅ Runtime detection | ✅ Analysis-time routing | ✅ Runtime CASE logic | ✅ Any type accepted |
| Null Elements | ✅ "a,null,c" |
❌ "a,c" (omits) |
✅ Custom logic | ✅ PPL-specific handling | ✅ Conditional handling | ✅ Custom behavior |
| Empty Arrays | ✅ "" (empty string) |
✅ Compatible | ✅ Could delegate | ✅ Type-specific logic | ✅ CASE branch logic | ✅ Configurable |
| Type Coercion | ✅ Auto-convert all types | ✅ Array elements only | ✅ Wrapper + built-in | ✅ Per-registration logic | ✅ CASE-based routing | ✅ Full control |
| Return on NULL | ✅ null (graceful) |
❌ NULL if any arg NULL |
✅ Wrapper handles | ✅ Type-specific handling | ✅ CASE can handle | ✅ Configurable |
| Type Resolution | N/A | ❌ Runtime error | ✅ Analysis-time | |||
| Performance | N/A | ✅ Optimized built-in | ✅ Optimal per type | ✅ Direct implementation | ||
| Architecture | N/A | ❌ Not viable | ✅ Proven pattern | ✅ Single function | ✅ Clean single implementation | |
| Maintenance | N/A | N/A | ✅ Standard registration | ✅ Single function body | ✅ Single codebase | |
| Debugging | N/A | N/A | ✅ Clear type-specific paths | ✅ Straightforward logic | ||
| Code Generation | N/A | ✅ Calcite optimized | ✅ Type-optimized code | ✅ Direct compiled code |
Why Option 1c (Type-Aware AST Implementation) is Strongly Recommended:
-
Proven Architecture Pattern & Recent Improvements: Uses the exact same pattern successfully employed by existing functions, now enhanced by upstream commit d758163:
- LOG function: Different implementations for 1 vs 2 arguments
- ATAN function:
ATAN(single arg) vsATAN2(two args) routing - XOR function: Type-specific boolean logic implementation
- IS_BLANK function: Complex multi-operator type-aware composition
- New: Improved
wrapSqlOperandTypeChecker()utility for better type validation
-
Analysis-Time Type Resolution: Major performance advantage over runtime approaches:
- Zero runtime overhead: Type detection happens during query analysis phase
- Optimal code generation: Each type gets specialized implementation path
- No branching logic: Calcite generates optimal code for each registration
- Proven performance: Same pattern used throughout high-performance PPL functions
-
Implementation Simplicity Comparison:
- Option 1 (Pure ARRAY_JOIN): Not viable due to single-value requirement mismatch
- Option 1b (Hybrid): Complex runtime detection and wrapper logic
- Option 1c (Type-Aware AST): Clean multiple registrations using existing infrastructure
- Option 2 (Custom UDAF): Requires new implementation class and runtime type checking
-
Existing Infrastructure Leverage (Enhanced by upstream commit):
- PPLFuncImpTable.java: Improved multiple registrations per function name
- PPLTypeChecker: Enhanced type validation framework with new wrapper utilities
- registerOperator/registerExternalOperator: Simplified registration methods replacing older patterns
- PPLBuiltinOperators: Standard registration point with improved type checking
-
Behavioral Correctness: Each registration can have specialized logic:
- Array registration: Optimized for array inputs with PPL-specific null handling
- Single-value registration: Direct string conversion without array overhead
- Numeric-specific registration: Type-optimized number-to-string conversion
-
Future Extensibility: Establishes the foundation for the complete mv function family:
- Consistent pattern: All mv functions can use same type-aware registration approach
- Reusable utilities: Type-specific implementations can share common logic
- Proven scalability: Pattern already proven with multiple function implementations
-
Maintainability:
- Standard registration: Follows existing codebase patterns developers already understand
- Isolated implementations: Each type registration is independent and testable
- Clear separation: No complex branching logic or runtime type detection
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
Status