
Remove the bytecode / interpreter machinery and use the serde API directly #118

Closed
Ten0 opened this issue Jan 16, 2024 · 7 comments

@Ten0
Contributor

Ten0 commented Jan 16, 2024

First of all, thanks for this crate!
Secondly, sorry for opening an issue to ask a question - I couldn't find a "discussions" section.

Disclaimer: I don't know much about building arrow values, but I do know serde very well. I'm considering using this crate for transcoding from avro to arrow, but I'm wondering about the performance implications of the intermediate event-stream representation.

Why the interpreter and event-stream intermediate representation instead of just pushing into arrow vectors directly from your impl Serializers? (Isn't it possible to "just push to arrow vectors"?) Is that a performance improvement? If so, why does it improve performance (compared to the alternative below)? Doesn't it prevent e.g. copying the &strs provided by the Serialize implementor directly into their appropriate buffer, and instead require an extra allocation? (Event::Str("1970-01-01T00:00:00Z".to_owned()))

Basically my question is: why not something along those lines:

```rust
// Pseudocode sketch; error types are elided with `...` for brevity.
struct Field<'a> {
	name: &'a str,
	serializer: FieldSerializer,
}
enum FieldSerializer {
	Int4(SerializeFieldToInt4Array),
	// ... other array types and/or handling - basically everything that's specific to this field
}
struct ArrowSerializer<'a> {
	fields: &'a mut [Field<'a>],
}
impl Serializer for ArrowSerializer<'_> {
	fn serialize_map(self, _len: Option<usize>) -> Result<ArrowSerializeMap<'_>, ...> {
		Ok(ArrowSerializeMap { fields: self.fields.iter_mut() }) // reborrow might be necessary
	}
}
struct ArrowSerializeMap<'a> {
	fields: std::slice::IterMut<'a, Field<'a>>,
}
impl SerializeMap for ArrowSerializeMap<'_> {
	fn serialize_entry<K: Serialize + ?Sized, V: Serialize + ?Sized>(
		&mut self,
		key: &K,
		value: &V,
	) -> Result<(), ...> {
		let field = self.fields.next().expect("more entries than fields");
		// Possibly check field name based on key, or opt out as suggested in #92

		// Then pick the serializer for this type.
		// With a 0-based enum discriminant without holes on FieldSerializer this should
		// be compiled into a jump table by the compiler, so O(1).
		match &mut field.serializer {
			FieldSerializer::Int4(serializer) => {
				value.serialize(&mut *serializer) // reborrow
			}
			// ... other types
		}
	}
}
struct SerializeFieldToInt4Array {
	array: AvroMutableArrayOfInt4,
}
impl Serializer for &mut SerializeFieldToInt4Array { // Note the &mut here
	fn serialize_i32(self, value: i32) -> Result<(), ...> {
		self.array.push(value);
		Ok(())
	}

	// Most other input types are incorrect for this arrow output type
	serde_serializer_quick_unsupported::serializer_unsupported! {
		err = (<Self::Error as serde::ser::Error>::custom("Unsupported type for arrow Int4 array"));
		bool i8 i16 i64 u8 u16 u32 u64 f32 f64 char str bytes none some unit unit_struct
		unit_variant newtype_struct newtype_variant seq tuple tuple_struct tuple_variant map struct
		struct_variant i128 u128
	}
}
```

(I've achieved a 10x performance improvement on avro serialization compared to the apache-avro implementation, specifically by removing the intermediate representation they have, which is why at first glance I'm surprised by this - but I don't know arrow well.)

Thanks a lot ❤️

@chmp
Owner

chmp commented Jan 17, 2024

Thanks for raising this issue :). The reasons are mostly historic. I agree that replacing the event stream abstraction with direct serialization makes sense. I simply did not get around to actually implementing it.

I don't think there is a factor of 10x on the table, though. At most 4x-5x (see benchmarks). I believe arrow2-convert to be close to a hand-rolled implementation. Due to the dynamic nature of serde-arrow, I would expect any speed-up to be closer to 2x. The data is fully handled at runtime and there is a performance gap associated with that that simply cannot be closed.

To explain the history: this crate is my first exposure to the serde API, and I found it easier to conceptualize the underlying data model mismatch by thinking of it in terms of a pseudo-JSON format. The JSON format example in the serde docs also makes it really easy to get started down this road. Initially the events were properly materialized (i.e., actual Event instances were exchanged). However, currently most communication is handled via accept_XX methods and the events are not constructed as objects (hence there is also no copying of Strings) (*). To handle the events I ended up with deeply nested state machines (e.g., for List > Struct > Struct > List > Struct) and performance was not great :D. The bytecode implementation is a way to flatten the state machine. Its introduction resulted in approximately a 3x-4x performance gain. By removing the underlying event model, I agree it would be possible to forgo the whole state machine and use the call stack instead, hopefully giving another performance boost.

(*) See for example here and here.

@chmp chmp changed the title QUESTION(/doc): Why the interpreter and event stream? Remove the bytecode / interpreter machinery and use the serde API directly Jan 17, 2024
@Ten0
Contributor Author

Ten0 commented Jan 17, 2024

Very clear, thanks! :)

> I would expect any speed-up to be closer to 2x

Ah yes, we probably wouldn't get significantly better than arrow2-convert. I did get those gains compared to apache-avro, but that implementation was very far from optimal in many places - I didn't mean to imply that we could expect the same gains here.

@chmp
Owner

chmp commented Jan 21, 2024

I implemented everything required to execute benchmarks (90% sure the resulting buffers are filled correctly :D). The result is a speed-up of approx. 2x (1.81x for complex types, 2.12x for simple types):

complex_common_serialize(1000000)

| label          | time [ms] | arrow2_convert | serde_arrow_ng | serde_arrow | arrow |
|----------------|-----------|----------------|----------------|-------------|-------|
| arrow2_convert | 695.25    | 1.00           | 0.40           | 0.22        | 0.14  |
| serde_arrow_ng | 1745.76   | 2.51           | 1.00           | 0.55        | 0.36  |
| serde_arrow    | 3152.74   | 4.53           | 1.81           | 1.00        | 0.65  |
| arrow          | 4873.73   | 7.01           | 2.79           | 1.55        | 1.00  |

primitives_serialize(1000000)

| label          | time [ms] | arrow2_convert | serde_arrow_ng | serde_arrow | arrow |
|----------------|-----------|----------------|----------------|-------------|-------|
| arrow2_convert | 136.05    | 1.00           | 0.47           | 0.22        | 0.10  |
| serde_arrow_ng | 290.84    | 2.14           | 1.00           | 0.47        | 0.21  |
| serde_arrow    | 617.97    | 4.54           | 2.12           | 1.00        | 0.45  |
| arrow          | 1373.00   | 10.09          | 4.72           | 2.22        | 1.00  |

The next steps will be to switch the internal test suite to the new serializer and then squash test errors one by one. I would expect progress over the next couple of weeks to be a bit slower, though, due to time constraints. As a side note: I am really surprised how well the idea of the bytecode serializer holds up.

@Ten0
Contributor Author

Ten0 commented Jan 21, 2024

Amazing! Congratz!

@chmp
Owner

chmp commented Jan 27, 2024

Not sure how urgent this feature is for you. I definitely would like to touch up some more details before cutting a proper release. But I could do a pre-release for you to test, if that helps.

Oh, and thanks again for pointing me in this direction. The result is now approx. 2x performance, -3000 lines of code, and, I believe, a much easier to understand impl :).

@Ten0
Contributor Author

Ten0 commented Jan 27, 2024

> thanks again for pointing me in this direction. The result is now approx. 2x performance, -3000 lines of code, and, I believe, a much easier to understand impl :)

Your open-mindedness and reactivity made that an absolute pleasure 😊

> I definitely would like to touch up some more details before cutting a proper release. But I could do a pre-release for you to test, if that helps.

Don't mind me, I'll just point to github in the meantime. 😊

@chmp
Owner

chmp commented Jan 28, 2024

I cut a pre-release after all, for my own package using serde_arrow :)

https://crates.io/crates/serde_arrow/0.10.0-rc.1
