HDFS URLs fed to Apache Beam should not contain the <hostname:port> part #740
Comments
Thanks @alionkun for reporting this. Amazing work digging into the details of both the TensorFlow and Beam code bases to understand this. Let me circulate this back to the engineers maintaining Beam and the TensorFlow gfile API and see whether they can unify their conventions first. If that cannot be achieved, we can discuss a fix from the TFX side. My sense is that I'd rather take the HDFS path (likely from pipeline_root) in the format compatible with the hadoop fs command, which is more widely understood by Hadoop users. That format aligns better with TensorFlow's and keeps host:port inside.
@alionkun The Beam team was kind enough to file https://issues.apache.org/jira/browse/BEAM-8399 to track this fix. Can you voice your thoughts on it there, and especially confirm whether a host:port format is guaranteed to be used (i.e., always carries ...)? Depending on the progress of that issue, let's see whether the TFX team should carry a patch like the one you suggested.
@zhitaoli I have added a comment on that issue. But I wonder whether the Beam team will support this any time soon :)
I would like to try to fix this from the Beam side directly. I am discussing with the relevant folks whether we can fast-track this into the next version of Beam (since the fix is much more isolated than patching it from the TFX side).
Update: this is being fixed from the Beam side directly: apache/beam#10223 (comment). Once that is merged and released in Beam, TFX will pick it up in a future release. For now I'll close this one from the TFX side.
I use HDFS as the distributed file system that holds the training dataset and pipeline_root.
My HDFS directory structure is something like the following:
After I triggered my pipeline DAG, I got the following error messages:
It seems that Apache Beam treats the host:port of the HDFS URL as part of the file system path and, unsurprisingly, finds nothing there.
After some source code reading and analysis, I found that TFX uses both TensorFlow APIs and Apache Beam APIs to access the training dataset and artifacts.
The problem is:
TensorFlow APIs accept HDFS URLs like hdfs://host:port/path/to/file, but Apache Beam APIs accept HDFS URLs like hdfs://path/to/file. When we give something like hdfs://host:port/path/to/file to TFX, we get the error above.
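To make the mismatch concrete, here is a small illustrative model of the two conventions. It is only a sketch: `split_tensorflow_style` and `split_beam_style` are hypothetical helpers written for this example, not functions from TensorFlow or Beam, and they only mimic the URL shapes described above.

```python
PREFIX = "hdfs://"


def split_tensorflow_style(url):
    """Hypothetical model: hdfs://host:port/path -> (authority, path)."""
    rest = url[len(PREFIX):]
    # Everything before the first "/" is the authority (host:port).
    authority, _, path = rest.partition("/")
    return authority, "/" + path


def split_beam_style(url):
    """Hypothetical model: hdfs://path -> path (no authority expected)."""
    # Everything after the scheme is taken as the file system path.
    return "/" + url[len(PREFIX):]


url = "hdfs://host:port/path/to/file"
print(split_tensorflow_style(url))  # authority and path are separated
print(split_beam_style(url))        # host:port becomes the first path component
```

Under the second convention the same URL resolves to a path that starts with `/host:port/`, which does not exist on the file system, so the lookup finds nothing.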
How to fix:
Considering that TFX stores artifact URLs in MLMD and these URLs should be complete, the fastest way to fix this problem is for TFX to adjust every URL it passes to Apache Beam APIs by removing the host:port part.
I made my own patch and validated it in my TFX cluster; everything looks good.
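The patch itself is not reproduced here. As a rough sketch of the idea only, the adjustment could look like the following. The helper name `strip_hdfs_authority` and the `host:8020` placeholder are assumptions made for this example; the sketch also assumes the authority is always a plain `host:port` with a numeric port (no IPv6 literals, no user info).

```python
import re

_PREFIX = "hdfs://"
# Assumption: an authority is "<host>:<numeric port>" with no slashes.
_HOST_PORT = re.compile(r"^[^/:]+:\d+$")


def strip_hdfs_authority(url):
    """Hypothetical helper: drop host:port from an hdfs:// URL.

    URLs that are not hdfs://, or whose first component does not look
    like host:port, are returned unchanged.
    """
    if not url.startswith(_PREFIX):
        return url
    first, _, rest = url[len(_PREFIX):].partition("/")
    if _HOST_PORT.match(first):
        return _PREFIX + rest
    return url


# "host:8020" is a placeholder authority, not a value from this issue.
print(strip_hdfs_authority("hdfs://host:8020/path/to/file"))
```

Because URLs without a recognizable host:port are passed through untouched, the helper can be applied more than once to the same URL without further changes. The complete URL would still be what is stored in MLMD; only the copy handed to Beam is rewritten.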
Looking forward to an official release that fixes this.