
Remove the bytecode / interpreter machinery and use the serde API directly #118

Closed
Ten0 opened this issue Jan 16, 2024 · 7 comments

@Ten0
Contributor

Ten0 commented Jan 16, 2024

First of all, thanks for this crate!
Secondly, sorry for opening an issue to ask a question - I couldn't find a "discussions" section.

Disclaimer: I don't know much about building arrow values, but I do know serde very well. I'm considering using this crate for transcoding from avro to arrow, but I'm wondering about the performance implications of the intermediate event-stream representation.

Why the interpreter and event-stream intermediate representation instead of just pushing into arrow vectors directly from your impl Serializers? (Isn't it possible to "just push to arrow vectors"?) Is that a performance improvement? If so, why does it improve performance (compared to the alternative below)? Doesn't it prevent e.g. copying the &strs provided by the Serialize implementor directly into their appropriate buffer, and instead require an extra allocation? (Event::Str("1970-01-01T00:00:00Z".to_owned()))

Basically my question is: why not something along those lines:

```rust
// Pseudocode sketch; error types are elided with `...` for brevity.
struct Field<'a> {
	name: &'a str,
	serializer: FieldSerializer,
}
enum FieldSerializer {
	Int4(SerializeFieldToInt4Array),
	// ... other array types and/or handling - basically everything that's specific to this field
}
struct ArrowSerializer<'a> {
	fields: &'a mut [Field<'a>],
}
impl Serializer for ArrowSerializer<'_> {
	fn serialize_map(self, _len: Option<usize>) -> Result<ArrowSerializeMap<'_>, ...> {
		Ok(ArrowSerializeMap { fields: self.fields.iter_mut() }) // reborrow might be necessary
	}
}
struct ArrowSerializeMap<'a> {
	fields: std::slice::IterMut<'a, Field<'a>>,
}
impl SerializeMap for ArrowSerializeMap<'_> {
	fn serialize_entry<K: Serialize + ?Sized, V: Serialize + ?Sized>(
		&mut self,
		key: &K,
		value: &V,
	) -> Result<(), ...> {
		let field = self.fields.next().expect("more entries than fields");
		// Possibly check field name based on key, or opt out as suggested in #92

		// Then pick the serializer for this type.
		// With a 0-based enum discriminant without holes on FieldSerializer this should
		// be compiled into a jump table by the compiler, so O(1).
		match &mut field.serializer {
			FieldSerializer::Int4(serializer) => {
				value.serialize(&mut *serializer) // reborrow
			}
			// ... other types
		}
	}
}
struct SerializeFieldToInt4Array {
	array: AvroMutableArrayOfInt4,
}
impl Serializer for &mut SerializeFieldToInt4Array { // Note the &mut here
	fn serialize_i32(self, value: i32) -> Result<(), ...> {
		self.array.push(value);
		Ok(())
	}

	// Most other input types are incorrect for this arrow output type
	serde_serializer_quick_unsupported::serializer_unsupported! {
		err = (<Self::Error as serde::ser::Error>::custom("Unsupported type for arrow Int4 array"));
		bool i8 i16 i64 u8 u16 u32 u64 f32 f64 char str bytes none some unit unit_struct
		unit_variant newtype_struct newtype_variant seq tuple tuple_struct tuple_variant map struct
		struct_variant i128 u128
	}
}
```

(I've achieved a 10x performance improvement on avro serialization compared to the apache-avro implementation, specifically by removing the intermediate representation they have, which is why at first glance I'm surprised by this - but I don't know arrow well.)

Thanks a lot ❤️

@chmp
Owner

chmp commented Jan 17, 2024

Thanks for raising this issue :). The reasons are mostly historic. I agree that replacing the event stream abstraction with direct serialization makes sense. I simply did not get around to actually implementing it.

I don't think there is a factor of 10x on the table, though. At most 4x-5x (see benchmarks). I believe arrow2-convert to be close to a hand-rolled implementation. Due to the dynamic nature of serde-arrow, I would expect any speed-up to be closer to 2x. The data is fully handled at runtime and there is a performance gap associated with that that simply cannot be closed.

To explain the history: this crate is my first exposure to the serde API, and I found it easier to conceptualize the underlying data model mismatch by thinking of it in terms of a pseudo-JSON format. The JSON format example in the serde docs also makes it really easy to get started down this road. Initially the events were properly materialized (i.e., actual Event instances were exchanged). However, currently most communication is handled via accept_XX methods and the events are not constructed as objects (hence there is also no copying of Strings) (*). To handle the events I ended up with deeply nested state machines (e.g., for List > Struct > Struct > List > Struct) and performance was not great :D. The bytecode implementation is a way to flatten the state machine. Its introduction resulted in approximately a 3x-4x performance gain. By removing the underlying event model, I agree it would be possible to forgo the whole state machine and use the call stack instead, hopefully giving another performance boost.

(*) See for example here and here.

@chmp chmp changed the title QUESTION(/doc): Why the interpreter and event stream? Remove the bytecode / interpreter machinery and use the serde API directly Jan 17, 2024
@Ten0
Contributor Author

Ten0 commented Jan 17, 2024

Very clear, thanks! :)

> I would expect any speed-up to be closer to 2x

Ah yes, we probably wouldn't get significantly better than arrow2-convert. I did get those gains compared to apache-avro, but that implementation was very far from optimal in many places - I didn't mean to imply that we could expect the same gains here.

@chmp
Owner

chmp commented Jan 21, 2024

I implemented everything required to execute benchmarks (90% sure the resulting buffers are filled correctly :D). The result is a speed-up of approx. 2x (1.81x for complex types, 2.12x for simple types):

complex_common_serialize(1000000)

| label          | time [ms] | arrow2_convert | serde_arrow_ng | serde_arrow | arrow |
|----------------|-----------|----------------|----------------|-------------|-------|
| arrow2_convert | 695.25    | 1.00           | 0.40           | 0.22        | 0.14  |
| serde_arrow_ng | 1745.76   | 2.51           | 1.00           | 0.55        | 0.36  |
| serde_arrow    | 3152.74   | 4.53           | 1.81           | 1.00        | 0.65  |
| arrow          | 4873.73   | 7.01           | 2.79           | 1.55        | 1.00  |

primitives_serialize(1000000)

| label          | time [ms] | arrow2_convert | serde_arrow_ng | serde_arrow | arrow |
|----------------|-----------|----------------|----------------|-------------|-------|
| arrow2_convert | 136.05    | 1.00           | 0.47           | 0.22        | 0.10  |
| serde_arrow_ng | 290.84    | 2.14           | 1.00           | 0.47        | 0.21  |
| serde_arrow    | 617.97    | 4.54           | 2.12           | 1.00        | 0.45  |
| arrow          | 1373.00   | 10.09          | 4.72           | 2.22        | 1.00  |

The next steps will be to switch the internal test suite to the new serializer and then squash test errors one by one. I would expect progress over the next couple of weeks to be a bit slower, though, due to time constraints. As a side note: I am really surprised how well the idea of the bytecode serializer holds up.

@Ten0
Contributor Author

Ten0 commented Jan 21, 2024

Amazing! Congratz!

@chmp
Owner

chmp commented Jan 27, 2024

Not sure how urgent this feature is for you. I definitely would like to touch up some more details before cutting a proper release. But I could do a pre-release for you to test, if that helps.

Oh, and thanks again for pointing me in this direction. The result is now approx. 2x performance, -3000 lines of code, and, I believe, a much easier to understand impl :).

@Ten0
Contributor Author

Ten0 commented Jan 27, 2024

> thanks again for pointing me in this direction. The result is now approx. 2x performance, -3000 lines of code, and, I believe, a much easier to understand impl :)

Your open-mindedness and reactivity made that an absolute pleasure 😊

> I definitely would like to touch up some more details before cutting a proper release. But I could do a pre-release for you to test, if that helps.

Don't mind me, I'll just point to github in the meantime. 😊

@chmp
Owner

chmp commented Jan 28, 2024

I cut a pre-release after all, for my own package using serde_arrow :)

https://crates.io/crates/serde_arrow/0.10.0-rc.1
