
Engine fails on siva files from pipeline-staging cluster #414

Closed
fulaphex opened this issue Jul 10, 2018 · 14 comments

@fulaphex

Expected Behavior

Skip the siva file and continue processing.

Current Behavior

I'm getting this error while processing hdfs://hdfs-namenode/pga/siva/latest/55 on the pipeline-staging cluster:

tech.sourced.siva.SivaException: Exception at file 5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva: At Index footer, index size: Java implementation of siva doesn't support values greater than 9223372036854775807

Steps to Reproduce (for bugs)

  1. Deploy a pod on pipeline-staging cluster with srcd/engine-jupyter:v0.7.0 image.
  2. Modify first cell to change the repositories location and run:
from sourced.engine import Engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Examples") \
    .getOrCreate()

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva")

print("%d repositories successfully loaded" % (engine.repositories.count()/2))

I left the pod gabor-engine-jupyter with the modified Example.ipynb and saved the failed run.
A copy of the logs is also here.

Context

I'm processing siva files on pipeline-staging and gathering stats to compare pga versions between clusters.

@ajnavarro
Contributor

You can skip siva read errors: https://github.com/src-d/engine/blob/master/python/sourced/engine/engine.py#L30

Just change

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva")

to

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)
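
For readability, the same call can use keyword arguments. This is a sketch, assuming the constructor linked above names these two parameters skip_cleanup and skip_read_errors:

from sourced.engine import Engine

# skip_read_errors=True makes the engine skip siva files it cannot read
# instead of failing the whole job (parameter names assumed from engine.py).
engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva",
                skip_cleanup=False, skip_read_errors=True)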

@fulaphex
Author

I was already using this option when I encountered the error, and it still fails in the same way with:

from sourced.engine import Engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Examples") \
    .getOrCreate()

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)

print("%d repositories successfully loaded" % (engine.repositories.count()/2))

Updated logs.

@ajnavarro ajnavarro added the bug label Jul 11, 2018
@erizocosmico
Contributor

erizocosmico commented Jul 11, 2018

This must be a siva-java issue, go-siva unpacks it just fine.

$ echo '5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva' | pga get -i
$ cd siva/latest/55/
$ siva unpack 5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
$ ls
5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
HEAD
config
objects
refs

@erizocosmico erizocosmico self-assigned this Jul 11, 2018
@fulaphex
Author

@erizocosmico I downloaded all the siva files starting with 55 with the pga tool and it works fine. The problem seems to be with the files on HDFS on the pipeline-staging cluster.

btw: I also encountered this:

tech.sourced.siva.SivaException: Exception at file 9b1a3c1efc7e3c9ee0625159b909e6e7f98d2963.siva: Error reading index of file.

@erizocosmico
Contributor

You mean if you download the file locally the engine works just fine with it?

@erizocosmico
Contributor

Turns out opening the file with siva-java does not fail either, so the problem is not there.

@fulaphex
Author

Yes, when I download the single file it works fine.
I also tried recreating the same scenario with files downloaded with the pga tool, and it also works fine.
It only crashes when I run on the cluster reading files from HDFS, although I don't know whether the files on HDFS are the same ones the pga tool downloads.

@erizocosmico
Contributor

You gave me an idea. I downloaded the same siva file, this time from HDFS instead of via pga. Now neither go-siva nor siva-java can read it. So the problem is that the files are corrupted in HDFS.

@ajnavarro what should we do about that?

@erizocosmico
Contributor

@fulaphex whoever did the copy of the original siva files to that folder you're using did it wrong, or some files were corrupted during that copy: the siva file there is corrupted, but the original (in /apps/borges/root-repositories/) is not.

You can either use /apps/borges/root-repositories/ or try to do the copy again.
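
If you want to verify which copies are corrupted before redoing the copy, here is a minimal sketch (file paths are hypothetical) that compares MD5 checksums of two local downloads of the same siva file, e.g. one fetched via pga and one pulled from HDFS with hdfs dfs -get:

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Stream the file so large siva files do not need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical local paths for the same siva file from both sources.
pga_copy = "siva/latest/55/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva"
hdfs_copy = "from-hdfs/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva"

if md5sum(pga_copy) != md5sum(hdfs_copy):
    print("checksum mismatch: the HDFS copy is likely corrupted")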

@bzz
Contributor

bzz commented Jul 17, 2018

@erizocosmico @ajnavarro this is indeed a case of the .siva file being corrupted

whoever did the copy of the original siva files to that folder you're using did it wrong

Although this suggestion does not look very constructive, this most probably happened on pga get, before it got MD5 verification on download in src-d/datasets#69 (which, BTW, is still not merged, so ALL pga users have a high chance of stumbling upon this).

You guys did a great job introducing the option spark.tech.sourced.engine.skip.read.errors=true so the Engine does not break on reading corrupted repositories in https://github.com/src-d/engine/pull/395, but it only covers things reachable with iterators.

Right now, we can see the Engine canceling the job on a read error in a different place, in RepositoryProvider.genSivaRepository.

Here is the relevant log:

tech.sourced.siva.SivaException: Exception at file 022c7272f0c1333a536cb319beadc4171cc8ff6a.siva: At Index footer, index size: Java implementation of siva doesn't support values greater than 9223372036854775807

And this is the .siva file: https://drive.google.com/open?id=1yyDAfzFiDgK8YwjOuByc6zfifPCkGw84

What do you think about allowing RepositoryProvider, under the same configuration option, to also skip broken .siva files (ideally with some metric counter)?

This way, it could be treated as part of the original https://github.com/src-d/engine/issues/393
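
Until something like that lands, a coarse user-level workaround is possible from the Python side: process one two-character bucket directory at a time and catch the Java exception, so a corrupted file only loses its bucket instead of the whole job. A sketch, assuming the bucket layout used above and that the SivaException surfaces as a Py4JJavaError when the DataFrame is evaluated:

from py4j.protocol import Py4JJavaError
from sourced.engine import Engine

total = 0
for i in range(256):
    bucket = "%02x" % i  # 00 .. ff bucket directories
    path = "hdfs://hdfs-namenode/pga/siva/latest/%s" % bucket
    try:
        engine = Engine(spark, path, "siva", False, True)
        total += engine.repositories.count() / 2
    except Py4JJavaError as e:
        # A corrupted siva file aborts only this bucket, not the whole run.
        print("skipping bucket %s: %s" % (bucket, e))

print("%d repositories successfully loaded" % total)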

@bzz bzz reopened this Jul 17, 2018
@ajnavarro
Contributor

We cannot expect the engine to work correctly if the files themselves are not even readable. If some Parquet file is corrupted, what do you expect Spark to do? Continue, or fail?

The bug here is in the tool or process used to download the siva files, in my opinion.

@bzz
Contributor

bzz commented Jul 18, 2018

I see! I'm not saying it's a bug in the Engine - the engine works, and the siva files are corrupted for sure.

Rather, I was trying to say that, as a user, I would love to have an option not to fail the whole job on such files, and I was asking whether you would be open to extending the spark.tech.sourced.engine.skip.read.errors behaviour to cover this failure mode as well.

In that case, we could close this in favor of https://github.com/src-d/engine/issues/393 (or a new, similar feature request), and then I would be happy to look deeper into this and submit a patch. WDYT?

@smola smola removed the bug label Jul 18, 2018
@smola
Contributor

smola commented Jul 18, 2018

spark.tech.sourced.engine.skip.read.errors was a compromise solution that we agreed on because the final PGA release contains some repositories with missing objects. We just needed to be able to work with our own released dataset.

On the other hand, if we have further corruption in one of our downloaded copies, I do not see much value in working with that corrupted copy. I would rather direct our efforts at fixing the pga tool and ensuring that we have a proper copy in our cluster.

@fulaphex In the meantime, you can try the PGA copy in the /pga2 directory, which was a second download with the improved pga tool.

@fulaphex
Author

@smola /pga2 works better; I was able to process everything there.
