
Provide an example of output using Public Git Archive #138

Closed · campoy opened this issue Jun 20, 2018 · 7 comments

campoy commented Jun 20, 2018

I just downloaded all of the siva files in PGA corresponding to the Google and Docker orgs, plus one more, and ran Gemini using docker-compose.

The hash process fails at the end with this message:

```
ERROR 19:13:53 org.apache.spark.internal.Logging$class (Logging.scala:70) - SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(2,1529522033360,JobFailed(org.apache.spark.SparkException: Job 2 cancelled because SparkContext was shut down))
```

This is after all of the information has apparently been stored to the DB successfully:

```
WARN 19:13:38 tech.sourced.gemini.Gemini (Gemini.scala:33) - Getting repositories at /repositories in siva format
WARN 19:13:40 tech.sourced.gemini.Gemini (Gemini.scala:42) - Hashing
WARN 19:13:40 tech.sourced.gemini.Hash (Hash.scala:56) - Listing files
WARN 19:13:41 tech.sourced.gemini.Hash (Hash.scala:67) - Extracting UASTs
WARN 19:13:42 tech.sourced.gemini.Hash (Hash.scala:86) - Extracting features
WARN 19:13:42 tech.sourced.gemini.Hash (Hash.scala:103) - creating document frequencies
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:122) - hashing features
WARN 19:13:44 tech.sourced.gemini.Gemini (Gemini.scala:45) - Saving hashes to DB
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:148) - save meta to DB
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:142) - save document frequencies to docfreq.json
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:167) - save hashtables to DB
Done
```

Unfortunately, when I run report, I don't see any similar files:

```
$ docker-compose exec gemini ./report
Reporting all similar files
No duplicates found.
No similar files found.
```

It is possible there are none, or that there is an issue; I can't really tell, because I don't know which repositories should report similar files.

Could you check whether this is an actual issue, or whether these organizations simply do not contain any similar files (which seems improbable)?

Thanks


bzz commented Jun 20, 2018

@campoy Sure, I'll be happy to double-check this for you if you provide either access to the dataset you are running on (SSH to GCP) or a command that fetches the data locally.

What instance type did you use? How much memory does it have? How many GB of .siva files are there?

From the log fragments above it's very hard to tell whether hash succeeded; the full logs would help.

Please keep in mind that:

  • Gemini is an application on top of Engine, which means it inherits all of the performance characteristics that Engine has right now
  • running multiple distributed systems on a single machine with docker-compose adds overhead on top of that


bzz commented Jun 21, 2018

A quick glance shows that Google has 5008 siva files totaling 17 GB; Docker has 80 totaling 1.4 GB.

Numbers obtained with:

```
pga list -u github.com/google/ -f json | jq -r '.sivaFilenames[]' | wc -l
pga list -u github.com/google/ -f json | jq -r '.sivaFilenames[]' | pga get -j 90 -i -o googl
```
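Once a subset is downloaded, the same counts can be sanity-checked offline without pga. A minimal sketch, assuming a hypothetical local download directory (the paths below are invented for illustration):

```shell
# Fake a tiny downloaded subset: two .siva files in a bucketed directory.
mkdir -p /tmp/pga-demo/google/ab
head -c 1024 /dev/zero > /tmp/pga-demo/google/ab/one.siva
head -c 2048 /dev/zero > /tmp/pga-demo/google/ab/two.siva

# Count the .siva files on disk and sum their size.
find /tmp/pga-demo/google -name '*.siva' | wc -l   # number of .siva files
du -sk /tmp/pga-demo/google                        # total size in kilobytes
```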

In my experience, this is well beyond the scenarios in which docker-compose is useful, and it is not a very realistic setup. Making it work would require a lot of manual tuning of every part of the process.

For reference: as part of the initial release of Gemini by the end of the quarter, we will share the exact timings of every stage of Gemini that we collected in order to set performance expectations, using:

  • a smaller dataset of just 10 GB
  • an Apache Spark cluster of 3 machines, with 16 GB / 32 cores per worker

TL;DR: it is on the order of ~1 h for the first stages, which just use Engine to extract UASTs, even in this distributed configuration.

Update: meanwhile, I switched to testing the docker-compose workflow instead and was able to reproduce this issue locally on a smaller subset of the data.


bzz commented Jun 21, 2018

@campoy To answer the original question about an example of using Gemini on PGA, you would need to run it like:

```
docker-compose exec gemini ./hash ./repositories/siva/latest/
```

The reason is that Engine only supports one level of nested directories, so it does not find any .siva files under the top-level path; Gemini in this case makes the output confusing by listing the .siva files recursively anyway (that is going to be fixed).
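A minimal sketch of the mismatch, using invented directory names: PGA buckets siva files one level deeper than a single-level listing reaches, so a one-level glob from the top path finds nothing while a recursive walk does.

```shell
# Hypothetical bucketed layout, mimicking a
# repositories/siva/latest/<2-char prefix>/<hash>.siva structure.
mkdir -p /tmp/glob-demo/repositories/siva/latest/ab
touch /tmp/glob-demo/repositories/siva/latest/ab/abcd1234.siva

# One level of nesting below the top path: no .siva files found.
ls /tmp/glob-demo/repositories/*.siva 2>/dev/null | wc -l

# A recursive walk finds the bucketed file.
find /tmp/glob-demo/repositories -name '*.siva' | wc -l
```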

On performance on single machine - first step of Gemini is a Engine query to extract all non-binary files content with uniq SHAs at HEAD revision and convert it to UASTs. On 4-core laptop running over just the Docker repositories it takes 13.6 h.

bzz added a commit to bzz/gemini that referenced this issue Jun 24, 2018
Although slower, fixes user experience, i.e. src-d#138

The only possible workaround w/o Engine, using
`org.apache.hadoop.fs.FileSystem#globStatus(org.apache.hadoop.fs.Path)`,

did not work, as "$path/*" there does not match recursively, so

```
fs.globStatus(new Path(s"$path/*"))
  .flatMap(file => if (file.getPath.getName.endsWith(".siva")) {
    Some(file.getPath)
  } else {
    None
  })
```

would not support bucketed repos, and thus we would need to mimic
`org.apache.hadoop.mapreduce.lib.input.FileInputFormat#listStatus`
manually.

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
bzz added a commit to bzz/gemini that referenced this issue Jun 26, 2018
Although slower, fixes user experience, i.e. src-d#138

bzz commented Jun 28, 2018

Closing this for now, as both examples were provided and the confusing CLI UX was fixed in #143.

But please feel free to re-open in case something is not solved.

@bzz bzz closed this as completed Jun 28, 2018

campoy commented Jul 2, 2018

Why does the engine only support one level of nested directories?

@campoy campoy reopened this Jul 2, 2018

bzz commented Jul 24, 2018

Why does the engine only support one level of nested directories?

I don't know, but I'd guess a new issue in https://github.com/src-d/engine might be a better place to ask.

For your convenience, a quick search over the engine repo turns up what looks like the original issue that introduced that feature: src-d/sourced-ce#176


bzz commented Aug 13, 2018

Closing, as there is no further discussion and all the questions above have been addressed.

@bzz bzz closed this as completed Aug 13, 2018