
Engine fails on siva files from pipeline-staging cluster #414

Closed
fulaphex opened this issue Jul 10, 2018 · 14 comments

@fulaphex

Expected Behavior

Skip the siva file and continue processing.

Current Behavior

I'm getting this error while processing hdfs://hdfs-namenode/pga/siva/latest/55 on the pipeline-staging cluster:

tech.sourced.siva.SivaException: Exception at file 5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva: At Index footer, index size: Java implementation of siva doesn't support values greater than 9223372036854775807

Steps to Reproduce (for bugs)

  1. Deploy a pod on pipeline-staging cluster with srcd/engine-jupyter:v0.7.0 image.
  2. Modify first cell to change the repositories location and run:
from sourced.engine import Engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Examples") \
    .getOrCreate()

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva")

print("%d repositories successfully loaded" % (engine.repositories.count()/2))

I left the pod gabor-engine-jupyter with the modified Example.ipynb and saved the failed run.
A copy of the logs is also here.

Context

I'm processing siva files on pipeline-staging and gathering stats to compare pga versions between clusters.

@ajnavarro
Contributor

You can skip siva read errors: https://github.com/src-d/engine/blob/master/python/sourced/engine/engine.py#L30

Just change

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva")

to

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)
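
For readability, the same call can use keyword arguments. This is a sketch, assuming the constructor linked above names these two parameters skip_cleanup and skip_read_errors:

from sourced.engine import Engine

# skip_read_errors=True makes the engine skip siva files it cannot read
# instead of failing the whole job (parameter names assumed from engine.py).
engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva",
                skip_cleanup=False, skip_read_errors=True)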

@fulaphex
Author

I was already using this option when I encountered the error, and it still fails in the same way with:

from sourced.engine import Engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Examples") \
    .getOrCreate()

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)

print("%d repositories successfully loaded" % (engine.repositories.count()/2))

Updated logs.

@ajnavarro ajnavarro added the bug label Jul 11, 2018
@erizocosmico
Contributor

erizocosmico commented Jul 11, 2018

This must be a siva-java issue, go-siva unpacks it just fine.

$ echo '5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva' | pga get -i
$ cd siva/latest/55/
$ siva unpack 5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
$ ls
5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
HEAD
config
objects
refs

@erizocosmico erizocosmico self-assigned this Jul 11, 2018
@fulaphex
Author

@erizocosmico I downloaded all the siva files starting with 55 with the pga tool and it works fine. The problem seems to be with the files on HDFS on the pipeline-staging cluster.

btw: I also encountered this:

tech.sourced.siva.SivaException: Exception at file 9b1a3c1efc7e3c9ee0625159b909e6e7f98d2963.siva: Error reading index of file.

@erizocosmico
Contributor

You mean if you download the file locally the engine works just fine with it?

@erizocosmico
Contributor

Turns out opening the file with siva-java does not fail either, so the problem is not there.

@fulaphex
Author

Yes, when I download the single file it works fine.
I also tried recreating the same scenario with files downloaded with the pga tool, and it also works fine.
It only crashes when I run on the cluster reading files from HDFS, although I don't know whether the files on HDFS are the same ones the pga tool downloads.

@erizocosmico
Contributor

You gave me an idea. I downloaded the same siva file, this time from HDFS instead of via pga. Now neither go-siva nor siva-java can read it. So the problem is that the files are corrupted in HDFS.

@ajnavarro what should we do about that?

@erizocosmico
Contributor

@fulaphex whoever did the copy of the original siva files to that folder you're using did it wrong, or some files were corrupted during that copy: the siva file there is corrupted, but the original (in /apps/borges/root-repositories/) is not.

You can either use /apps/borges/root-repositories/ or try to do the copy again.
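
If you want to verify which copies are corrupted before redoing the copy, here is a minimal sketch (file paths are hypothetical) that compares MD5 checksums of two local downloads of the same siva file, e.g. one fetched via pga and one pulled from HDFS with hdfs dfs -get:

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Stream the file so large siva files do not need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical local paths for the same siva file from both sources.
pga_copy = "siva/latest/55/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva"
hdfs_copy = "from-hdfs/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva"

if md5sum(pga_copy) != md5sum(hdfs_copy):
    print("checksum mismatch: the HDFS copy is likely corrupted")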

@bzz
Contributor

bzz commented Jul 17, 2018

@erizocosmico @ajnavarro this is indeed a case of the .siva file being corrupted

whoever did the copy of the original siva files to that folder you're using did it wrong

Although this suggestion does not look very constructive, this most probably happened on pga get, before it got MD5 verification on download in src-d/datasets#69 (which, BTW, is still not merged, so ALL pga users have a high chance of stumbling upon this).

You guys did a great job introducing the option spark.tech.sourced.engine.skip.read.errors=true so the Engine does not break on reading corrupted repositories in https://github.com/src-d/engine/pull/395, but it only covers things reachable with iterators.

Right now, we can see the Engine canceling the job on a read error in a different place, in RepositoryProvider.genSivaRepository.

Here is the relevant log:

tech.sourced.siva.SivaException: Exception at file 022c7272f0c1333a536cb319beadc4171cc8ff6a.siva: At Index footer, index size: Java implementation of siva doesn't support values greater than 9223372036854775807

And this is the .siva file: https://drive.google.com/open?id=1yyDAfzFiDgK8YwjOuByc6zfifPCkGw84

What do you think about allowing RepositoryProvider, under the same configuration option, to also skip broken .siva files (ideally with some metric counter)?

This way, it could be treated as part of the original https://github.com/src-d/engine/issues/393
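
Until something like that lands, a coarse user-level workaround is possible from the Python side: process one two-character bucket directory at a time and catch the Java exception, so a corrupted file only loses its bucket instead of the whole job. A sketch, assuming the bucket layout used above and that the SivaException surfaces as a Py4JJavaError when the DataFrame is evaluated:

from py4j.protocol import Py4JJavaError
from sourced.engine import Engine

total = 0
for i in range(256):
    bucket = "%02x" % i  # 00 .. ff bucket directories
    path = "hdfs://hdfs-namenode/pga/siva/latest/%s" % bucket
    try:
        engine = Engine(spark, path, "siva", False, True)
        total += engine.repositories.count() / 2
    except Py4JJavaError as e:
        # A corrupted siva file aborts only this bucket, not the whole run.
        print("skipping bucket %s: %s" % (bucket, e))

print("%d repositories successfully loaded" % total)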

@bzz bzz reopened this Jul 17, 2018
@ajnavarro
Contributor

We cannot expect the engine to work correctly if the files themselves are not even readable. If some Parquet file is corrupted, what do you expect Spark to do? Continue, or fail?

The bug here is in the tool or process used to download the siva files, in my opinion.

@bzz
Contributor

bzz commented Jul 18, 2018

I see! I'm not saying it's a bug in the Engine - the engine works, and the siva files are corrupted for sure.

Rather, I was trying to say that, as a user, I would love to have an option not to fail the whole job on such files, and I was asking whether you would be open to extending the spark.tech.sourced.engine.skip.read.errors behaviour to cover this failure mode as well.

In that case, we could close this in favor of https://github.com/src-d/engine/issues/393 (or a new, similar feature request), and then I would be happy to look deeper into this and submit a patch. WDYT?

@smola smola removed the bug label Jul 18, 2018
@smola
Contributor

smola commented Jul 18, 2018

spark.tech.sourced.engine.skip.read.errors was a compromise solution that we agreed on because the final PGA release contains some repositories with missing objects. We just needed to be able to work with our own released dataset.

On the other hand, if we have further corruption in one of our downloaded copies, I do not see much value in working with that corrupted copy. I would rather direct our efforts at fixing the pga tool and ensuring that we have a proper copy in our cluster.

@fulaphex In the meantime, you can try the PGA copy in the /pga2 directory, which was a second download with the improved pga tool.

@fulaphex
Author

@smola /pga2 works better; I was able to process everything there.
