[FEATURE] RFC: mvjoin/array_join Function for PPL Calcite Engine

## Summary

This RFC proposes the implementation of the `mvjoin` function in OpenSearch PPL Calcite (V3 engine). The function joins array elements into a single string using a specified delimiter, enhancing PPL's array manipulation capabilities and supporting common data processing workflows.

## Motivation

### Background
OpenSearch PPL provides comprehensive array manipulation capabilities for data processing and analysis. The `mvjoin` function fills a key gap by providing array-to-string conversion functionality that is essential for many data processing workflows.

### Current State
PPL Calcite currently supports several array functions introduced in version 3.1.0:
- `ARRAY()` - Creates arrays from input values
- `ARRAY_LENGTH()` - Returns array length  
- `FORALL`, `EXISTS`, `FILTER`, `TRANSFORM`, `REDUCE` - Higher-order functions with lambda support

However, there is no function to join array elements into strings, which is essential for:
- **Data Export**: Converting arrays to delimited strings for external systems
- **Display Formatting**: Creating human-readable string representations
- **Data Integration**: Supporting common data processing patterns and workflows
- **Report Generation**: Creating formatted output strings from array data

### Use Cases
1. **Log Analysis**: Join extracted field arrays into CSV format
2. **Data Integration**: Convert multivalue fields to string format for downstream systems  
3. **Report Generation**: Create formatted output strings from array data
4. **Data Processing**: Transform arrays for use in string-based operations and filters

## Detailed Design

### Function Specification

#### Syntax
```ppl
mvjoin(<mv>, <delim>)
```

#### Parameters
- **mv** (ARRAY): Multivalue field (array) containing values to join
- **delim** (STRING): String delimiter to use as separator between individual values

#### Return Type
- **STRING**: Single concatenated string with individual values separated by delimiter

#### Behavior
- **Multivalue Field Processing**: Works on any multivalue field, not just string arrays
- **Type Coercion**: Numeric values and other data types are automatically coerced to text during join
- **Single Value Handling**: If input is single-valued, returns that value as-is (converted to string)
- **String Concatenation**: Concatenates individual values using the delimiter as separator
- **Element Order**: Maintains element order from original multivalue field
- **Empty Array**: Returns empty string for empty arrays
- **Null Handling**: Handles null elements by converting them to "null" strings
- **Universal Input**: Accepts any field type (string, numeric, boolean, etc.) and always returns string
- **Industry Standard**: Follows common multivalue join function patterns

#### Examples
```ppl
# Basic numeric array joining
source=logs | eval base = array(1, 2, 3, 4, 5) | eval joined = mvjoin(base, " OR ")
# Result: joined = "1 OR 2 OR 3 OR 4 OR 5" (numbers coerced to strings)

# Single-valued field handling
source=logs | eval single_value = 42 | eval result = mvjoin(single_value, ",")
# Result: result = "42" (single value returned as string)

# Mixed data types with automatic coercion
source=logs | eval mixed = array(1, 2.5, "text", true) | eval result = mvjoin(mixed, ";")
# Result: result = "1;2.5;text;true" (all types coerced to strings)

# Join multivalue field with semicolon
source=logs | eval n = mvjoin(myfield, ";")  
# Result: n = "value1;value2;value3"

# String arrays (traditional case)
source=logs | eval tags = array("error", "timeout", "retry") | eval tags_str = mvjoin(tags, ",")
# Result: tags_str = "error,timeout,retry"

# Numeric arrays with automatic string conversion
source=logs | eval nums = array(100, 200, 300) | eval result = mvjoin(nums, " | ")
# Result: result = "100 | 200 | 300"

# Boolean and numeric mix
source=logs | eval flags = array(true, false, 1, 0) | eval result = mvjoin(flags, ",")
# Result: result = "true,false,1,0"

# Edge case: Empty array handling
source=logs | eval empty = array() | eval result = mvjoin(empty, ",")
# Result: result = ""

# Use in where clause for filtering
source=logs | where mvjoin(tags, ",") LIKE "%error%"

# Format arrays for display
source=logs | eval formatted_tags = mvjoin(tag_array, ", ")
```

### Implementation Architecture

#### Implementation Strategy Analysis

##### Option 1: Calcite Built-in Functions
Calcite provides relevant string joining functions that could potentially be leveraged:

- **ARRAY_JOIN(array, delimiter [, nullText])**: Synonym for ARRAY_TO_STRING, joins array elements into a string
  - **Signature**: `ARRAY_JOIN(array, delimiter [, nullText])`
  - **Behavior**: Returns concatenation of array elements as STRING using delimiter
  - **Null Handling**: Omits NULL values (or replaces with nullText if provided)
  - **Return**: NULL if any argument is NULL
  - **Input Requirement**: Strict ARRAY type only
- **STRING_AGG(value [, separator])**: Aggregate function for joining strings
- **CONCAT_WS(separator, arg1, ...)**: Concatenate with separator

**Analysis**: 
- `ARRAY_JOIN` exists and provides basic array-to-string functionality
- **Critical Limitation**: Requires strict array input - cannot handle single-value fields
- **Behavioral Mismatch**: Omits NULL values instead of converting to "null" strings
- **Type Restriction**: Only accepts array types, not universal field types as required
- Would require complex wrapper logic to handle single values and achieve mvjoin behavior

##### Option 1b: Hybrid Approach - Custom UDAF Using Calcite ARRAY_JOIN
Implement custom function that leverages Calcite's `ARRAY_JOIN` for array cases while handling single values:

```java
public static class MvjoinHybridImplementor implements NotNullImplementor {
    @Override
    public Expression implement(RexToLixTranslator translator, RexCall call, List<Expression> translatedOperands) {
        // Runtime logic: detect array vs single value, delegate to ARRAY_JOIN for arrays
        return Expressions.call(
            MvjoinHybridImplementor.class,
            "mvjoinHybrid",
            translatedOperands.get(0),  // mv field
            translatedOperands.get(1)   // delimiter
        );
    }
    
    public static String mvjoinHybrid(Object mvField, String delimiter) {
        if (mvField == null) return null;
        
        // Handle single-valued fields: return as string
        if (!(mvField instanceof List)) {
            return mvField.toString();
        }
        
        // For arrays, could potentially use ARRAY_JOIN via RexBuilder call
        // But still need custom logic for null handling differences
        List<?> array = (List<?>) mvField;
        return array.stream()
            .map(element -> element == null ? "null" : element.toString())
            .collect(Collectors.joining(delimiter));
    }
}
```

**Analysis**:
- **Partial Reuse**: Could leverage ARRAY_JOIN for pure array cases
- **Still Complex**: Requires wrapper logic for single values and null handling
- **Mixed Approach**: Combines custom logic with built-in function calls
- **Performance Overhead**: Type checking and delegation adds complexity

##### Option 2: Custom Implementation with Calcite Integration
Implement custom `MvjoinFunctionImpl` using Calcite's UDF framework:

- Full control over SPL-compatible behavior
- Handles universal input types (numeric, string, boolean, etc.)
- Manages single-value vs multivalue field logic
- Integrates with existing CollectionUDF pattern

**Analysis**: 
- Provides exact behavioral compatibility
- Better control over type coercion and edge cases
- Consistent with existing OpenSearch PPL function implementations

##### Option 1c: Type-Aware AST Implementation Using Existing Patterns
Leverage OpenSearch SQL's existing type-aware function registration system used by functions like `LOG`, `ATAN`, and `XOR`. **Note**: The recent upstream commit (d758163cb) improved the type checking infrastructure and simplified registration patterns:

```java
// In PPLBuiltinOperators.java - define SqlOperators for different input types
public static final SqlOperator MVJOIN_ARRAY = new MvjoinArrayFunctionImpl().toUDF("mvjoin");
public static final SqlOperator MVJOIN_SINGLE = new MvjoinSingleFunctionImpl().toUDF("mvjoin");

// In PPLFuncImpTable.java - register multiple operators under same function name
// The improved registerOperator method handles type checking automatically
registerOperator(MVJOIN, MVJOIN_ARRAY, MVJOIN_SINGLE);

// Alternative: Use the new external registration pattern
registerExternalOperator(MVJOIN, MVJOIN_ARRAY);
registerExternalOperator(MVJOIN, MVJOIN_SINGLE);
```

**Analysis**:
- ** Type Resolution at Analysis Time**: No runtime type checking overhead, follows proven patterns
- ** Optimal Performance**: Different optimized implementations for arrays vs single values
- ** Existing Infrastructure**: Uses established `createFunctionImpWithTypeChecker` pattern
- ** Clean Architecture**: Multiple function registrations handle different type combinations
- ** Behavioral Control**: Each registration can have specific null handling and type coercion
- ** Proven Pattern**: Same approach successfully used by `LOG` (1 vs 2 args), `ATAN` (1 vs 2 args), `XOR` (boolean-specific)

##### Option 1d: CASE-Based Single Function Implementation
Use Calcite's CASE expressions for runtime type detection and routing within a single function:

```java
// In PPLFuncImpTable.java - single function with CASE-based type routing
register(
    MVJOIN,
    createFunctionImpWithTypeChecker(
        (builder, mvFieldArg, delimiterArg) -> {
            // Create type detection condition
            RexNode isArrayCondition = builder.makeCall(
                PPLBuiltinOperators.IS_ARRAY_TYPE, mvFieldArg);
            
            // Array path: delegate to ARRAY_JOIN with null handling
            RexNode arrayPath = builder.makeCall(
                PPLBuiltinOperators.MVJOIN_ARRAY_HANDLER, mvFieldArg, delimiterArg);
            
            // Single value path: direct string conversion
            RexNode singleValuePath = builder.makeCall(
                SqlStdOperatorTable.CAST, mvFieldArg, 
                builder.getTypeFactory().createSqlType(SqlTypeName.VARCHAR));
            
            // CASE expression: runtime routing based on type
            return builder.makeCall(
                SqlStdOperatorTable.CASE,
                isArrayCondition,    // WHEN condition
                arrayPath,           // THEN array handling  
                singleValuePath);    // ELSE single value handling
        },
        PPLTypeChecker.family(SqlTypeFamily.ANY, SqlTypeFamily.STRING)));
```

**Analysis**:
- ** Single Function**: One registration handles all input types dynamically
- ** Runtime Flexibility**: Can handle mixed type scenarios within same query
- ** Proven CASE Pattern**: Uses existing `SqlStdOperatorTable.CASE` from `NULLIF`, `IF` functions
- ** Runtime Overhead**: Type detection happens during execution, not analysis
- ** Complex Debugging**: Conditional logic embedded in SQL expression tree
- ** CASE Infrastructure**: Leverages existing Calcite CASE support and type detection utilities

##### Option 3: OpenSearch Pushdown Optimization
Explore pushing mvjoin operations to OpenSearch level:

**Challenges**:
- OpenSearch doesn't have native array join functionality in aggregations
- Would require custom painless scripts for complex operations
- Limited benefit since mvjoin is typically used in eval (post-processing)

**Analysis**: 
- Not suitable for pushdown due to lack of native OpenSearch support
- Function operates on results rather than during aggregation
- Better to implement at Calcite level for consistency

##### Recommended Approach: Type-Aware AST Implementation (Option 1c)
Based on analysis and discovery of existing proven patterns in the codebase, the type-aware AST implementation provides the optimal balance of performance, maintainability, and architectural consistency.

**Detailed Implementation Comparison:**

| Aspect | mvjoin (Required) | Option 1: ARRAY_JOIN | Option 1b: Hybrid | Option 1c: Type-Aware AST | Option 1d: CASE-Based | Option 2: Custom Implementation |
|--------|------------------|---------------------|------------------|--------------------------|----------------------|------------------------------|
| **Single Values** | ✅ `mvjoin(42, ",")` → `"42"` | ❌ ERROR | ✅ Wrapper logic | ✅ Separate registration | ✅ CASE routing | ✅ Full control |
| **Universal Input** | ✅ Any field type | ❌ Arrays only | ✅ Runtime detection | ✅ Analysis-time routing | ✅ Runtime CASE logic | ✅ Any type accepted |
| **Null Elements** | ✅ `"a,null,c"` | ❌ `"a,c"` (omits) | ✅ Custom logic | ✅ PPL-specific handling | ✅ Conditional handling | ✅ Custom behavior |
| **Empty Arrays** | ✅ `""` (empty string) | ✅ Compatible | ✅ Could delegate | ✅ Type-specific logic | ✅ CASE branch logic | ✅ Configurable |
| **Type Coercion** | ✅ Auto-convert all types | ✅ Array elements only | ✅ Wrapper + built-in | ✅ Per-registration logic | ✅ CASE-based routing | ✅ Full control |
| **Return on NULL** | ✅ `null` (graceful) | ❌ `NULL` if any arg NULL | ✅ Wrapper handles | ✅ Type-specific handling | ✅ CASE can handle | ✅ Configurable |
| **Type Resolution** | N/A | ❌ Runtime error | ⚠️ Runtime detection | ✅ **Analysis-time** | ⚠️ Runtime CASE evaluation | ⚠️ Runtime detection |
| **Performance** | N/A | ✅ Optimized built-in | ⚠️ Type checking overhead | ✅ **Optimal per type** | ⚠️ CASE evaluation overhead | ✅ Direct implementation |
| **Architecture** | N/A | ❌ Not viable | ⚠️ Complex dual paths | ✅ **Proven pattern** | ✅ Single function | ✅ Clean single implementation |
| **Maintenance** | N/A | N/A | ⚠️ Multiple code paths | ✅ **Standard registration** | ✅ Single function body | ✅ Single codebase |
| **Debugging** | N/A | N/A | ⚠️ Multiple paths to trace | ✅ Clear type-specific paths | ⚠️ Conditional logic in SQL tree | ✅ Straightforward logic |
| **Code Generation** | N/A | ✅ Calcite optimized | ⚠️ Mixed optimization | ✅ **Type-optimized code** | ⚠️ Runtime branching | ✅ Direct compiled code |

**Why Option 1c (Type-Aware AST Implementation) is Strongly Recommended:**

1. **Proven Architecture Pattern & Recent Improvements**: Uses the exact same pattern successfully employed by existing functions, now enhanced by upstream commit d758163cb:
   - **LOG function**: Different implementations for 1 vs 2 arguments
   - **ATAN function**: `ATAN` (single arg) vs `ATAN2` (two args) routing
   - **XOR function**: Type-specific boolean logic implementation
   - **IS_BLANK function**: Complex multi-operator type-aware composition
   - **New**: Improved `wrapSqlOperandTypeChecker()` utility for better type validation

2. **Analysis-Time Type Resolution**: Major performance advantage over runtime approaches:
   - **Zero runtime overhead**: Type detection happens during query analysis phase
   - **Optimal code generation**: Each type gets specialized implementation path  
   - **No branching logic**: Calcite generates optimal code for each registration
   - **Proven performance**: Same pattern used throughout high-performance PPL functions

3. **Implementation Simplicity Comparison**:
   - **Option 1 (Pure ARRAY_JOIN)**: Not viable due to single-value requirement mismatch
   - **Option 1b (Hybrid)**: Complex runtime detection and wrapper logic
   - **Option 1c (Type-Aware AST)**: **Clean multiple registrations using existing infrastructure**
   - **Option 2 (Custom UDAF)**: Requires new implementation class and runtime type checking

4. **Existing Infrastructure Leverage** (Enhanced by upstream commit):
   - **PPLFuncImpTable.java**: Improved multiple registrations per function name
   - **PPLTypeChecker**: Enhanced type validation framework with new wrapper utilities
   - **registerOperator/registerExternalOperator**: Simplified registration methods replacing older patterns
   - **PPLBuiltinOperators**: Standard registration point with improved type checking

5. **Behavioral Correctness**: Each registration can have specialized logic:
   - **Array registration**: Optimized for array inputs with PPL-specific null handling
   - **Single-value registration**: Direct string conversion without array overhead
   - **Numeric-specific registration**: Type-optimized number-to-string conversion

6. **Future Extensibility**: Establishes the foundation for the complete mv function family:
   - **Consistent pattern**: All mv functions can use same type-aware registration approach
   - **Reusable utilities**: Type-specific implementations can share common logic
   - **Proven scalability**: Pattern already proven with multiple function implementations

7. **Maintainability**: 
   - **Standard registration**: Follows existing codebase patterns developers already understand
   - **Isolated implementations**: Each type registration is independent and testable
   - **Clear separation**: No complex branching logic or runtime type detection

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] RFC: mvjoin/array_join Function for PPL Calcite Engine #4146

Summary

Motivation

Background

Current State

Use Cases

Detailed Design

Function Specification

Syntax

Parameters

Return Type

Behavior

Examples

Implementation Architecture

Implementation Strategy Analysis

Option 1: Calcite Built-in Functions

Option 1b: Hybrid Approach - Custom UDAF Using Calcite ARRAY_JOIN

Option 2: Custom Implementation with Calcite Integration

Option 1c: Type-Aware AST Implementation Using Existing Patterns

Option 1d: CASE-Based Single Function Implementation

Option 3: OpenSearch Pushdown Optimization

Recommended Approach: Type-Aware AST Implementation (Option 1c)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Aspect	mvjoin (Required)	Option 1: ARRAY_JOIN	Option 1b: Hybrid	Option 1c: Type-Aware AST	Option 1d: CASE-Based	Option 2: Custom Implementation
Single Values	✅ `mvjoin(42, ",")` → `"42"`	❌ ERROR	✅ Wrapper logic	✅ Separate registration	✅ CASE routing	✅ Full control
Universal Input	✅ Any field type	❌ Arrays only	✅ Runtime detection	✅ Analysis-time routing	✅ Runtime CASE logic	✅ Any type accepted
Null Elements	✅ `"a,null,c"`	❌ `"a,c"` (omits)	✅ Custom logic	✅ PPL-specific handling	✅ Conditional handling	✅ Custom behavior
Empty Arrays	✅ `""` (empty string)	✅ Compatible	✅ Could delegate	✅ Type-specific logic	✅ CASE branch logic	✅ Configurable
Type Coercion	✅ Auto-convert all types	✅ Array elements only	✅ Wrapper + built-in	✅ Per-registration logic	✅ CASE-based routing	✅ Full control
Return on NULL	✅ `null` (graceful)	❌ `NULL` if any arg NULL	✅ Wrapper handles	✅ Type-specific handling	✅ CASE can handle	✅ Configurable
Type Resolution	N/A	❌ Runtime error	⚠️ Runtime detection	✅ Analysis-time	⚠️ Runtime CASE evaluation	⚠️ Runtime detection
Performance	N/A	✅ Optimized built-in	⚠️ Type checking overhead	✅ Optimal per type	⚠️ CASE evaluation overhead	✅ Direct implementation
Architecture	N/A	❌ Not viable	⚠️ Complex dual paths	✅ Proven pattern	✅ Single function	✅ Clean single implementation
Maintenance	N/A	N/A	⚠️ Multiple code paths	✅ Standard registration	✅ Single function body	✅ Single codebase
Debugging	N/A	N/A	⚠️ Multiple paths to trace	✅ Clear type-specific paths	⚠️ Conditional logic in SQL tree	✅ Straightforward logic
Code Generation	N/A	✅ Calcite optimized	⚠️ Mixed optimization	✅ Type-optimized code	⚠️ Runtime branching	✅ Direct compiled code

[FEATURE] RFC: mvjoin/array_join Function for PPL Calcite Engine #4146

Description

Summary

Motivation

Background

Current State

Use Cases

Detailed Design

Function Specification

Syntax

Parameters

Return Type

Behavior

Examples

Implementation Architecture

Implementation Strategy Analysis

Option 1: Calcite Built-in Functions

Option 1b: Hybrid Approach - Custom UDAF Using Calcite ARRAY_JOIN

Option 2: Custom Implementation with Calcite Integration

Option 1c: Type-Aware AST Implementation Using Existing Patterns

Option 1d: CASE-Based Single Function Implementation

Option 3: OpenSearch Pushdown Optimization

Recommended Approach: Type-Aware AST Implementation (Option 1c)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions