Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3882: [Rust] Cast Kernel for most types #3797

Closed
wants to merge 8 commits into from

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Mar 4, 2019

This is an implementation of a cast kernel for most Arrow types.

Limitations (when PR is complete):

  • Casting to or from StructArray not supported
  • Casting ListArray to non-list array not supported
  • Casting of incompatible primitive types not supported (e.g. temporal to boolean)

@nevi-me
Copy link
Contributor Author

nevi-me commented Mar 4, 2019

@andygrove I created one cast fn to rule them all, but I need some help in DataFusion. When I replace the cast macros with this cast function, I get lifetime issues per below:

error[E0621]: explicit lifetime required in the type of `expr`
   --> datafusion\src\execution\expression.rs:340:24
    |
299 |       expr: &Expr,
    |             ----- help: add explicit lifetime `'static` to the type of `expr`: `&'static logicalplan::Expr`
...
340 |                       f: Rc::new(|batch: &RecordBatch| {
    |  ________________________^
341 | |                         // match compute::cast(batch.column(index), data_type) {
342 | |                         //     Ok(array) => Ok(array),
343 | |                         //     Err(e) => Err(e.into())
...   |
346 | |                             .map_err(|e| ExecutionError::ArrowError(e))
347 | |                     }),
    | |______________________^ lifetime `'static` required

Adding a 'static lifetime as required leads me down a rabbit hole that doesn't solve the problem. I've already tried a few things on the cast's side, including not taking references, but I'm still not getting joy.

@paddyhoran @sunchao do you have any ideas on how we could overcome this? And does the implementation of cast itself look sound enough for me to continue with this?

// Ok(array) => Ok(array),
// Err(e) => Err(e.into())
// }
compute::cast(batch.column(index).clone(), data_type.clone())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andygrove this is where I'm getting the error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can help with this. I'll take a look now and see if it is easy to fix or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you go. This works:

&Expr::Column(index) => {
                let col = input_schema.field(index);
                let dt = data_type.clone();
                Ok(RuntimeExpr::Compiled {
                    name: col.name().clone(),
                    t: col.data_type().clone(),
                    f: Rc::new(move |batch: &RecordBatch| {
                        compute::cast(batch.column(index), &dt)
                            .map_err(|e| ExecutionError::ArrowError(e))
                    }),

The issue was that the closure being built had a reference to data_type from the expression. The fix was to clone it before creating the closure, and use the move keyword to move ownership of the variables into the closure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I would have never figured. Code's compiling, so I should be able to complete this PR by the weekend

(List(_), _) => Err(ArrowError::ComputeError(
"Cannot cast list to non-list data types".to_string(),
)),
(_, List(_)) => unimplemented!("Casting scalars to lists not yet supported"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be easy to change these unimplemented! calls into Err now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the places where I'm not already returning an error are placeholders. I'll replace the unimplemented with cast functions

@nevi-me nevi-me marked this pull request as ready for review March 4, 2019 19:27
@nevi-me
Copy link
Contributor Author

nevi-me commented Mar 4, 2019

@andygrove @paddyhoran @liurenjie1024 this is ready for review

use crate::error::{ArrowError, Result};

/// Macro rule to cast between numeric types
macro_rules! cast_numeric_arrays {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why these have to be macro? can we use generic function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had this problem in a lot of areas across the project. I don't know how I'd convert DateTime::Int32 to Int32Type. Do you have an idea/example of how I could be able to make the cast generic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just use:

fn cast_numeric_arrays<FROM, TO>(src: &ArrayRef) -> Result<ArrayRef>
where
    FROM: ArrowNumericType,
    TO: ArrowNumericType,
    FROM::Native: num::NumCast,
    TO::Native: num::NumCast,
{
    match numeric_cast::<FROM, TO>(
        src.as_any().downcast_ref::<PrimitiveArray<FROM>>().unwrap(),
    ) {
        Ok(to) => Ok(Arc::new(to) as ArrayRef),
        Err(e) => Err(e),
    }
}

This is very similar to the numeric_cast you have below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chao, I've removed all macros and replaced with the above

}

/// Cast array to provided data type
pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this to the top and add more comments on its behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't put it at the top as it depends on the macro definitions. I've expanded the doc comment to include behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved this by removing macros

(UInt64, Float32) => cast_numeric_arrays!(array, UInt64Type, Float32Type),
(UInt64, Float64) => cast_numeric_arrays!(array, UInt64Type, Float64Type),

(Int8, UInt8) => cast_numeric_arrays!(array, Int8Type, UInt8Type),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these casts where the physical storage format is the same, can we just change the data type associated with ArrayData such as how it is done in cpp .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be for going from UInt8 to Int{8|16|32|64} and not the inverse like Int{x} -> UInt{x}? I've just done a quick test:

    fn test_associated_data() {
        let a = Int32Array::from(vec![-5, 6]);
        let b: PrimitiveArray<UInt32Type> = PrimitiveArray::from(a.data());
        let c: PrimitiveArray<Int32Type> = PrimitiveArray::from(b.data());

        // I expect -5 to be invalid for UInt8, not sure how I'd change the bitmask 
        assert_eq!(false, b.is_valid(0));
        assert_eq!(6, b.value(1));

        // this is also not the case as I still get -5
        assert_eq!(false, c.is_null(0));
        assert_eq!(6, c.value(1));
    }

I expect a negative Int cast to UInt to have is_valid(x) == false, but because I haven't changed the bitmask, that's not the case.

I'll have a look at how I could do that if I need to cast from signed to unsigned. I also wouldn't mind getting help with those (and other more efficient) casts in follow-up JIRAs.

@@ -0,0 +1,543 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a kernels directory under compute and rename this to cast.rs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is to be consistent with cpp? In that case we should move all the existing kernels too (boolean, comparison).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think we should do that too.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @nevi-me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants