Engine fails on siva files from pipeline-staging cluster #414
Comments
You can skip siva read errors, see https://github.com/src-d/engine/blob/master/python/sourced/engine/engine.py#L30 Just change engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva") to engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True) |
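For readers unsure what the two trailing positional booleans mean, here is a minimal sketch of the same call with keyword arguments. The keyword names skip_cleanup and skip_read_errors are an assumption based on the linked line of engine.py and should be checked against the installed version; make_engine is a hypothetical helper that takes the Engine class as a parameter so the sketch does not need a Spark installation to be read or tried.

```python
def make_engine(engine_cls, spark, repos_path):
    """Build an engine configured to skip unreadable siva files.

    engine_cls is expected to be sourced.engine.Engine. It is passed in
    as a parameter (hypothetical helper design) so this sketch carries
    no hard dependency on pyspark or the engine package.
    """
    return engine_cls(
        spark,
        repos_path,
        "siva",
        skip_cleanup=False,     # assumed name of the 4th positional argument
        skip_read_errors=True,  # assumed name of the 5th positional argument
    )


# Possible usage inside the notebook (not executed here):
#   from sourced.engine import Engine
#   engine = make_engine(Engine, spark, "hdfs://hdfs-namenode/pga/siva/latest/55")
```

Naming the arguments makes it harder to swap the two flags by accident, which is easy to do with a bare False, True.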
When I encountered the error I was using this option and it still fails in the same way with:
Updated logs. |
This must be a siva-java issue; go-siva unpacks it just fine.
$ echo '5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva' | pga get -i
$ cd siva/latest/55/
$ siva unpack 5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
$ ls
5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
HEAD
config
objects
refs |
@erizocosmico I downloaded all siva files starting with 55 with the pga tool and it works fine. The problem seems to be with the files on HDFS on the pipeline-staging cluster. By the way, I also encountered this:
|
You mean if you download the file locally the engine works just fine with it? |
Turns out opening the file with siva java does not fail either, so the problem is not there |
Yes, when I download the single file it works fine. |
You gave me an idea. I downloaded the same siva file, this time from HDFS instead of @ajnavarro what should we do about that? |
@fulaphex whoever did the copy of the original siva files to that folder you're using did it wrong or some files were corrupted during that copy, because the siva file there is corrupted, but the original (in You can either use |
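One way to confirm that two copies of the same siva file differ is to compare their checksums. The sketch below is only an illustration, not something used in this thread: it assumes both copies have already been placed on the local filesystem (for example, the cluster copy fetched with hdfs dfs -get), and the function names and the choice of SHA-256 are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have byte-identical content."""
    return file_digest(path_a) == file_digest(path_b)


# Possible usage (both paths are placeholders):
#   same_content("from_pga/5559a23b...siva", "from_hdfs/5559a23b...siva")
```

A mismatch only shows that the copies differ; it does not say which one is the corrupted copy.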
@erizocosmico @ajnavarro this is indeed a case of the .siva file being corrupted
Although this suggestion does not look very constructive, this most probably happened on You guys did a great job introducing option Right now, we can see Engine canceling the job on read error in a different place, in RepositoryProvider.genSivaRepository. Here is the relevant log:
And this is the .siva file https://drive.google.com/open?id=1yyDAfzFiDgK8YwjOuByc6zfifPCkGw84 What do you think if with the same configuration option we allow This way, it can be treated as part of original https://github.com/src-d/engine/issues/393 |
We cannot expect the engine to work correctly if the files themselves are not even readable. If some parquet file is corrupted, what do you expect Spark to do? Continue? Or fail? In my opinion, the bug here is in the tool or process used to download the siva files. |
I see! I'm not saying it's a bug in Engine - engine works and siva files are corrupted for sure. Rather, what I was trying to say is that as a user, I would love to have an option not to fail the whole job on such files, and was asking if you would be open to extend If that case, we could close this in favor of https://github.com/src-d/engine/issues/393 (or a new similar feature request) and then I would be happy to look deeper into this and submit a patch. WDYT? |
On the other hand, if we have further corruption in one of our downloaded copies, I do not see much value in us working with that corrupted copy. I would rather direct our efforts at fixing the pga tool as well as ensuring that we have a proper copy in our cluster. @fulaphex In the meantime, you can try with the PGA copy at |
@smola |
Expected Behavior
Skip the siva file and continue processing.
Current Behavior
I'm getting this error while processing hdfs://hdfs-namenode/pga/siva/latest/55 on pipeline-staging cluster:
Steps to Reproduce (for bugs)
srcd/engine-jupyter:v0.7.0 image. I left the pod gabor-engine-jupyter with the modified Example.ipynb and saved the failed run. Copy of the logs is also here.
Context
I'm processing siva files on pipeline-staging and gathering stats to compare pga versions between clusters.