Blocking tree can not be saved in cloud environment #82
Comments
Can you please check if there is a way to write the serialized blocking tree to Parquet using spark.write? We may need to coalesce to 1 to make sure only one file is written. Then we can read it like standard Spark files without worrying about the location being local. @navinrathore
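A minimal sketch of that idea, assuming the blocking tree is Java-serializable and a `SparkSession` is at hand (the `blockingTree` parameter and class name here are placeholders, not the project's actual API):

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class BlockingTreeWriter {
    // Serialize the tree to bytes, wrap it in a one-row dataset with a single binary
    // column, and let spark.write put it wherever the configured file system lives.
    public static void write(SparkSession spark, Object blockingTree, String path) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(blockingTree); // assumes the tree implements java.io.Serializable
        }
        Dataset<byte[]> ds = spark.createDataset(
                Collections.singletonList(bos.toByteArray()), Encoders.BINARY());
        // coalesce(1) keeps the output to a single part file, so there is exactly one object to read back.
        ds.coalesce(1).write().mode("overwrite").parquet(path);
    }
}
```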
Among the typical data sources used in Spark, the solution mentioned in https://stackoverflow.com/questions/35200988/writing-custom-java-objects-to-parquet looks promising, but it needs an additional dependency jar (org.apache.parquet: parquet-avro.jar). For this or any other format, we need the schema of the object. It can be derived from the class object of Tree, but I am getting a runtime issue doing that. Please take a glance at the approach and we can discuss it then, or let me know if you have any suggestions.
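For reference, the parquet-avro route from that Stack Overflow answer is usually wired up roughly as below: derive an Avro schema from the Tree class via reflection and write instances directly as Parquet. Whether `ReflectData` can actually build a schema for Tree's structure is exactly the open question; this is only a sketch, not tested against the real class.

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroTreeWriter {
    // Requires the extra parquet-avro dependency on the classpath.
    public static void write(Tree blockingTree, String path) throws Exception {
        ReflectData model = ReflectData.AllowNull.get();
        Schema schema = model.getSchema(Tree.class); // reflection-derived schema
        try (ParquetWriter<Tree> writer =
                AvroParquetWriter.<Tree>builder(new Path(path))
                        .withSchema(schema)
                        .withDataModel(model)
                        .build()) {
            writer.write(blockingTree);
        }
    }
}
```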
We can only read with the binaryFile data source; writing is not supported.
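To illustrate that limitation (assuming Spark 3.0+): binaryFile loads each file as a row with path/modificationTime/length/content columns, but there is no corresponding writer, so it only helps on the read side.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BinaryFileRead {
    // Read-only: returns the raw bytes of the first matched file.
    public static byte[] readAll(SparkSession spark, String path) {
        Dataset<Row> files = spark.read().format("binaryFile").load(path);
        return files.first().getAs("content");
    }
}
```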
I don’t have a ready answer here, but I think one of the following approaches should work. In all cases we should invoke the Spark APIs for reading and writing: essentially, serialise the tree as bytes/JSON/a custom object, build a dataset from it, and use spark.read or spark.write.
Option A (object as bytes) has worked; see PR #120.
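The read side of that option could look something like the sketch below (illustrative only, not necessarily how PR #120 implements it): load the single-row Parquet file written earlier, take the binary column, and deserialize it back into the tree.

```java
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BlockingTreeReader {
    public static Object read(SparkSession spark, String path) throws Exception {
        Dataset<Row> df = spark.read().parquet(path);
        // A Dataset<byte[]> built with Encoders.BINARY() is stored as a single column named "value".
        byte[] bytes = df.first().getAs("value");
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}
```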
Blocking tree is saved in a parquet file #82
The blocking tree gets written using local file system APIs, which do not work in a cloud environment like Databricks. If we do trainMatch in one go, the file probably lands on the local machine, from where it will get picked up. But it will not work if we do train and match separately.
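For illustration only (not the project's actual code), the problematic pattern is a plain java.io write like the one below: the file ends up on the driver's ephemeral local disk, which is not shared storage on Databricks, so a later, separate match run cannot find what the train run wrote.

```java
import java.io.FileOutputStream;
import java.io.ObjectOutputStream;

public class LocalTreeWrite {
    // Writes to the driver's local file system; the path is not visible to later jobs on another cluster.
    public static void write(Object blockingTree, String localPath) throws Exception {
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(localPath))) {
            oos.writeObject(blockingTree);
        }
    }
}
```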