
Blocking tree can not be saved in cloud environment #82

Closed
sonalgoyal opened this issue Dec 20, 2021 · 5 comments

@sonalgoyal
Member

The blocking tree gets written using local file system APIs, which do not work in cloud environments like Databricks. If we run trainMatch in one go, the file probably lands on the local machine, from where it gets picked up again. But this will not work if we run train and match separately.

@sonalgoyal sonalgoyal self-assigned this Dec 31, 2021
@sonalgoyal
Member Author

Can you please check if there is a way to write the serialized blocking tree to Parquet using spark.write? We may(?) need to coalesce to 1 to make sure only one file is written. Then we can read it like standard Spark files without worrying about the location being local. @navinrathore
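For reference, a minimal sketch of this idea: wrap the serialized tree bytes in a one-row Dataset and let Spark handle the filesystem. The class, method, and column names here are illustrative, not Zingg's actual API.

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class BlockingTreeWriterSketch {
    // Wrap the already-serialized tree bytes in a one-row Dataset and write it as parquet.
    public static void writeTree(SparkSession spark, byte[] treeBytes, String path) {
        StructType schema = new StructType().add("blockingTree", DataTypes.BinaryType);
        Dataset<Row> df = spark.createDataFrame(
                Collections.singletonList(RowFactory.create((Object) treeBytes)), schema);
        // coalesce(1) so the serialized tree lands in a single part file
        df.coalesce(1).write().mode(SaveMode.Overwrite).parquet(path);
    }
}
```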

@sonalgoyal sonalgoyal assigned navinrathore and unassigned sonalgoyal Jan 3, 2022
@navinrathore
Contributor

Among the typical data sources used in Spark, the solution mentioned in https://stackoverflow.com/questions/35200988/writing-custom-java-objects-to-parquet looks promising, but it needs an additional dependency jar (org.apache.parquet's parquet-avro.jar). For this or any other format, we need the schema of the object. It can be derived from the class object of Tree, but I am getting a runtime issue doing that. Please take a glance at the approach and we can discuss it, or let me know if you have any suggestions.
It seems the getSchema() API here does not work for a generic type; our type is Tree.
Can we have this schema defined separately?
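For discussion, a rough sketch of the parquet-avro route from that Stack Overflow answer, assuming the parquet-avro dependency is on the classpath. Tree here stands for Zingg's blocking tree class; reflection only sees the raw type, not the erased Tree<Canopy> parameter, which matches the getSchema() problem described above.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroParquetSketch {
    // Derive an Avro schema from the class by reflection and write the object with
    // AvroParquetWriter. The generic parameter of Tree<T> is invisible to reflection.
    public static void writeViaAvro(Tree blockingTree, String path) throws IOException {
        Schema schema = ReflectData.get().getSchema(Tree.class); // raw type only
        try (ParquetWriter<Tree> writer = AvroParquetWriter.<Tree>builder(new Path(path))
                .withSchema(schema)
                .withDataModel(ReflectData.get())
                .build()) {
            writer.write(blockingTree);
        }
    }
}
```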

@navinrathore
Contributor

Can't we serialise to bytes?
Serialise the object to a byte array and write that.
Then read it back and reconstruct the object.

The binary_file source, though, can only be read; writing is not supported.
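As a sketch, plain Java serialization covers the serialise/reconstruct part, assuming the blocking tree (and the nodes it holds) implements java.io.Serializable:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerdeSketch {
    // Serialize any Serializable object (e.g. the blocking tree) to a byte array.
    public static byte[] toBytes(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Reconstruct the object from the byte array read back from storage.
    public static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}
```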

@sonalgoyal
Member Author

I don't have a ready answer here, but I think one of the following approaches should work:
A. Write the object as bytes
B. Write it as JSON
C. Build a Spark dataset with the Canopy class and read and write through that

In all cases we should invoke the Spark APIs for reading and writing. So essentially serialise as bytes/JSON/a custom object, build a dataset, and use spark.read or spark.write.
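A sketch of what option A could look like on the read side, assuming the bytes were written as a single binary column named blockingTree (as in the earlier write sketch) and reusing the fromBytes() helper above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BlockingTreeReaderSketch {
    // Read the single-row parquet file back through Spark and rebuild the object.
    public static Object readTree(SparkSession spark, String path) throws Exception {
        Dataset<Row> df = spark.read().parquet(path);
        byte[] bytes = df.head().getAs("blockingTree");
        return SerdeSketch.fromBytes(bytes); // helper from the earlier sketch
    }
}
```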

navinrathore added a commit to navinrathore/zingg-1 that referenced this issue Jan 8, 2022
@navinrathore
Contributor

Option A (object in bytes) has worked. PR #120

navinrathore added a commit to navinrathore/zingg-1 that referenced this issue Jan 11, 2022
navinrathore added a commit to navinrathore/zingg-1 that referenced this issue Jan 11, 2022
navinrathore added a commit to navinrathore/zingg-1 that referenced this issue Jan 11, 2022
sonalgoyal added a commit that referenced this issue Jan 11, 2022
Blocking tree are saved in parquet file #82