
[QST] Recommended approach for a reference avro writer/reader #927

@galipremsagar

Description


What is your question?

This is a question for the RAPIDS Spark team. As part of the cuIO refactor, we (the RAPIDS cudf team) are adding fuzz-testing coverage for our Avro reader (we currently only have reader support, no writer support). To compare and evaluate our Avro reader, we need a reference writer/reader that can write/read the data so we can compare the resulting dataframes.

There is no Avro Python API support yet in pandas or in pyarrow. We have explored using [pandavro](https://github.com/ynqa/pandavro), but this library lacks support for nullable values like pd.NA and pd.NaT, which leaves us with no ability to test nullable columns.
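For reference, this is roughly what the pandavro attempt looks like (the file name and column are just for illustration); the round trip breaks down once a column uses a pandas nullable extension dtype:

>>> import pandas as pd
>>> import pandavro as pdx
>>> df = pd.DataFrame({"int": pd.array([1, 2, None], dtype="Int64")})
>>> # pandavro has no mapping for pandas nullable extension dtypes,
>>> # so columns holding pd.NA / pd.NaT cannot be round-tripped
>>> pdx.to_avro("reference.avro", df)
>>> pdx.from_avro("reference.avro")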

To achieve this, we tried the following path:

Pandas nullable dtypes written to a parquet file -> read parquet in pyspark -> write to avro in pyspark

The final Avro file is then read by the cudf Python API and compared against the original dataframe.

The pipeline from pandas to pyspark to cudf looks roughly as follows:

>>> import pandas as pd
>>> df
    str  float   int  int8   bool  unsigned cat
0     a   0.32     1     1   <NA>         1   a
1  <NA>   0.32     2     2   True         2   v
2     v   0.00     3     3  False         3   z
3     a   0.23  <NA>  <NA>   True      <NA>   a
>>> df.to_parquet('a.parquet')
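For completeness, the frame above can be built with pandas nullable extension dtypes roughly like this (a minimal sketch with values matching the printout; assumes pandas >= 1.0):

>>> df = pd.DataFrame({
...     "str": pd.array(["a", pd.NA, "v", "a"], dtype="string"),
...     "float": [0.32, 0.32, 0.0, 0.23],
...     "int": pd.array([1, 2, 3, pd.NA], dtype="Int64"),
...     "int8": pd.array([1, 2, 3, pd.NA], dtype="Int8"),
...     "bool": pd.array([pd.NA, True, False, True], dtype="boolean"),
...     "unsigned": pd.array([1, 2, 3, pd.NA], dtype="UInt32"),
...     "cat": pd.Categorical(["a", "v", "z", "a"]),
... })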


>>> from pyspark.sql import SparkSession
>>> # initialise sparkContext
>>> spark = SparkSession.builder \
...     .master('local') \
...     .appName('myAppName') \
...     .config('spark.executor.memory', '5gb') \
...     .config("spark.cores.max", "6") \
...     .getOrCreate()
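One note on the session: the Avro data source is an external Spark module, so it has to be on the classpath, e.g. via spark.jars.packages (the artifact coordinates below are an assumption; match them to your Spark/Scala version):

>>> spark = SparkSession.builder \
...     .master('local') \
...     .appName('myAppName') \
...     .config('spark.jars.packages', 'org.apache.spark:spark-avro_2.12:3.0.1') \
...     .getOrCreate()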


>>> df = spark.read.parquet('a.parquet')
>>> df
DataFrame[str: string, float: double, int: bigint, int8: tinyint, bool: boolean, unsigned: bigint, cat: string]
>>> df.show()
+----+-----+----+----+-----+--------+---+
| str|float| int|int8| bool|unsigned|cat|
+----+-----+----+----+-----+--------+---+
|   a| 0.32|   1|   1| null|       1|  a|
|null| 0.32|   2|   2| true|       2|  v|
|   v|  0.0|   3|   3|false|       3|  z|
|   a| 0.23|null|null| true|    null|  a|
+----+-----+----+----+-----+--------+---+
>>> df.write.format("avro").save("file_avro")


>>> import cudf
>>> cudf.read_avro("file_avro")
    str  float   int  int8   bool  unsigned cat
0     a   0.32     1     1   <NA>         1   a
1  <NA>   0.32     2     2   True         2   v
2     v   0.00     3     3  False         3   z
3     a   0.23  <NA>  <NA>   True      <NA>   a

One limitation of this approach is that duration types cannot be written to parquet, so they never make it into the Avro file.

Given that drawback, would you suggest this approach, or can you recommend a better way to get a reference Avro writer?
