
Blocking tree can not be saved in cloud environment #82

Closed
sonalgoyal opened this issue Dec 20, 2021 · 5 comments

@sonalgoyal
Member

The blocking tree gets written using local file system APIs, which do not work in cloud environments like Databricks. If we run trainMatch in one go, the file probably lands on the local machine, from where it gets picked up again. But this will not work if we run train and match separately.

@sonalgoyal sonalgoyal self-assigned this Dec 31, 2021
@sonalgoyal
Member Author

Can you please check if there is a way to write the serialized blocking tree to Parquet using spark.write? We may(?) need to coalesce to 1 to make sure only one file is written. Then we can read it like standard Spark files without worrying about the location being local. @navinrathore
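For reference, a minimal sketch of this idea: wrap the serialized tree bytes in a one-row Dataset and let Spark handle the filesystem. The class, method, and column names here are illustrative, not Zingg's actual API.

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class BlockingTreeWriterSketch {
    // Wrap the already-serialized tree bytes in a one-row Dataset and write it as parquet.
    public static void writeTree(SparkSession spark, byte[] treeBytes, String path) {
        StructType schema = new StructType().add("blockingTree", DataTypes.BinaryType);
        Dataset<Row> df = spark.createDataFrame(
                Collections.singletonList(RowFactory.create((Object) treeBytes)), schema);
        // coalesce(1) so the serialized tree lands in a single part file
        df.coalesce(1).write().mode(SaveMode.Overwrite).parquet(path);
    }
}
```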

@sonalgoyal sonalgoyal assigned navinrathore and unassigned sonalgoyal Jan 3, 2022
@navinrathore
Contributor

Among the typical data sources used in Spark, the solution mentioned in https://stackoverflow.com/questions/35200988/writing-custom-java-objects-to-parquet looks promising, but it needs an additional dependency jar (org.apache.parquet's parquet-avro.jar). For this or any other format, we need the schema of the object. It can be derived from the class object of Tree, but I am getting a runtime issue doing that. Please take a glance at the approach and we can discuss it, or let me know if you have any suggestions.
It seems the getSchema() API here does not work for a generic type; our type is Tree.
Can we have this schema defined separately?
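For discussion, a rough sketch of the parquet-avro route from that Stack Overflow answer, assuming the parquet-avro dependency is on the classpath. Tree here stands for Zingg's blocking tree class; reflection only sees the raw type, not the erased Tree<Canopy> parameter, which matches the getSchema() problem described above.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroParquetSketch {
    // Derive an Avro schema from the class by reflection and write the object with
    // AvroParquetWriter. The generic parameter of Tree<T> is invisible to reflection.
    public static void writeViaAvro(Tree blockingTree, String path) throws IOException {
        Schema schema = ReflectData.get().getSchema(Tree.class); // raw type only
        try (ParquetWriter<Tree> writer = AvroParquetWriter.<Tree>builder(new Path(path))
                .withSchema(schema)
                .withDataModel(ReflectData.get())
                .build()) {
            writer.write(blockingTree);
        }
    }
}
```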

@navinrathore
Contributor

Can't we serialise to bytes?
Serialise the object to a byte array and write that.
Then read it back and reconstruct the object.

The binary_file source, though, can only be read; writing is not supported.
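As a sketch, plain Java serialization covers the serialise/reconstruct part, assuming the blocking tree (and the nodes it holds) implements java.io.Serializable:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerdeSketch {
    // Serialize any Serializable object (e.g. the blocking tree) to a byte array.
    public static byte[] toBytes(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Reconstruct the object from the byte array read back from storage.
    public static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}
```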

@sonalgoyal
Member Author

I don't have a ready answer here, but I think one of the following approaches should work:
A. Write the object as bytes
B. Write it as JSON
C. Build a Spark dataset with the Canopy class and read and write through that

In all cases we should invoke the Spark APIs for reading and writing. So essentially serialise as bytes/JSON/a custom object, build a dataset, and use spark.read or spark.write.
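A sketch of what option A could look like on the read side, assuming the bytes were written as a single binary column named blockingTree (as in the earlier write sketch) and reusing the fromBytes() helper above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BlockingTreeReaderSketch {
    // Read the single-row parquet file back through Spark and rebuild the object.
    public static Object readTree(SparkSession spark, String path) throws Exception {
        Dataset<Row> df = spark.read().parquet(path);
        byte[] bytes = df.head().getAs("blockingTree");
        return SerdeSketch.fromBytes(bytes); // helper from the earlier sketch
    }
}
```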

navinrathore added a commit to navinrathore/zingg-1 that referenced this issue Jan 8, 2022
@navinrathore
Contributor

Option A (object in bytes) has worked. PR #120

navinrathore added a commit to navinrathore/zingg-1 that referenced this issue Jan 11, 2022
navinrathore added a commit to navinrathore/zingg-1 that referenced this issue Jan 11, 2022
navinrathore added a commit to navinrathore/zingg-1 that referenced this issue Jan 11, 2022
sonalgoyal added a commit that referenced this issue Jan 11, 2022
Blocking tree are saved in parquet file #82