
Provide an example of output using Public Git Archive #138

Closed · campoy opened this issue Jun 20, 2018 · 7 comments

campoy commented Jun 20, 2018

I just downloaded all of the siva files in PGA corresponding to the Google and Docker orgs, plus one more, and ran Gemini using docker-compose.

The hash process fails at the end with this message:

```
ERROR 19:13:53 org.apache.spark.internal.Logging$class (Logging.scala:70) - SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(2,1529522033360,JobFailed(org.apache.spark.SparkException: Job 2 cancelled because SparkContext was shut down))
```

This is after all of the information has apparently been stored to the DB successfully:

```
WARN 19:13:38 tech.sourced.gemini.Gemini (Gemini.scala:33) - Getting repositories at /repositories in siva format
WARN 19:13:40 tech.sourced.gemini.Gemini (Gemini.scala:42) - Hashing
WARN 19:13:40 tech.sourced.gemini.Hash (Hash.scala:56) - Listing files
WARN 19:13:41 tech.sourced.gemini.Hash (Hash.scala:67) - Extracting UASTs
WARN 19:13:42 tech.sourced.gemini.Hash (Hash.scala:86) - Extracting features
WARN 19:13:42 tech.sourced.gemini.Hash (Hash.scala:103) - creating document frequencies
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:122) - hashing features
WARN 19:13:44 tech.sourced.gemini.Gemini (Gemini.scala:45) - Saving hashes to DB
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:148) - save meta to DB
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:142) - save document frequencies to docfreq.json
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:167) - save hashtables to DB
Done
```

Unfortunately, when I run report, I don't see any similar files:

```
$ docker-compose exec gemini ./report
Reporting all similar files
No duplicates found.
No similar files found.
```

It is possible there are none, or that there is an issue; I can't really tell, because I don't know which repositories should report similar files.

Could you check whether this is an actual issue, or whether these organizations simply do not contain any similar files (which seems improbable)?

Thanks


bzz commented Jun 20, 2018

@campoy Sure, I'll be happy to double-check this for you if you provide either access to the dataset you are running on (SSH to GCP) or a command that fetches the data locally.

What instance type did you use? How much memory does it have? How many GB of .siva files are there?

From the log fragments above it's very hard to tell whether hash succeeded; the full logs would help.

Please keep in mind that:

  • Gemini is an application on top of Engine, which means it inherits all of the performance characteristics that Engine has right now
  • running multiple distributed systems on a single machine with docker-compose adds overhead on top of that


bzz commented Jun 21, 2018

A quick glance shows that Google has 5008 siva files totaling 17 GB; Docker has 80 totaling 1.4 GB.

Numbers obtained with:

```
pga list -u github.com/google/ -f json | jq -r '.sivaFilenames[]' | wc -l
pga list -u github.com/google/ -f json | jq -r '.sivaFilenames[]' | pga get -j 90 -i -o googl
```
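Once a subset is downloaded, the same counts can be sanity-checked offline without pga. A minimal sketch, assuming a hypothetical local download directory (the paths below are invented for illustration):

```shell
# Fake a tiny downloaded subset: two .siva files in a bucketed directory.
mkdir -p /tmp/pga-demo/google/ab
head -c 1024 /dev/zero > /tmp/pga-demo/google/ab/one.siva
head -c 2048 /dev/zero > /tmp/pga-demo/google/ab/two.siva

# Count the .siva files on disk and sum their size.
find /tmp/pga-demo/google -name '*.siva' | wc -l   # number of .siva files
du -sk /tmp/pga-demo/google                        # total size in kilobytes
```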

In my experience, this is well beyond the scenarios in which docker-compose is useful, and it is not a very realistic setup. Making it work would require a lot of manual tuning of every part of the process.

For reference: as part of the initial release of Gemini by the end of the quarter, we will share the exact timings of every stage of Gemini that we collected in order to set performance expectations, using:

  • a smaller dataset of just 10 GB
  • an Apache Spark cluster of 3 machines, with 16 GB / 32 cores per worker

TL;DR: it is on the order of ~1 h for the first stages, which just use Engine to extract UASTs, even in this distributed configuration.

Update: meanwhile, I switched to testing the docker-compose workflow instead and was able to reproduce this issue locally on a smaller subset of the data.


bzz commented Jun 21, 2018

@campoy To answer the original question about an example of using Gemini on PGA, you would need to run it like:

```
docker-compose exec gemini ./hash ./repositories/siva/latest/
```

The reason is that Engine only supports one level of nested directories, so it does not find any .siva files under the top-level path; Gemini in this case makes the output confusing by listing the .siva files recursively anyway (that is going to be fixed).
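A minimal sketch of the mismatch, using invented directory names: PGA buckets siva files one level deeper than a single-level listing reaches, so a one-level glob from the top path finds nothing while a recursive walk does.

```shell
# Hypothetical bucketed layout, mimicking a
# repositories/siva/latest/<2-char prefix>/<hash>.siva structure.
mkdir -p /tmp/glob-demo/repositories/siva/latest/ab
touch /tmp/glob-demo/repositories/siva/latest/ab/abcd1234.siva

# One level of nesting below the top path: no .siva files found.
ls /tmp/glob-demo/repositories/*.siva 2>/dev/null | wc -l

# A recursive walk finds the bucketed file.
find /tmp/glob-demo/repositories -name '*.siva' | wc -l
```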

On performance on single machine - first step of Gemini is a Engine query to extract all non-binary files content with uniq SHAs at HEAD revision and convert it to UASTs. On 4-core laptop running over just the Docker repositories it takes 13.6 h.

bzz added a commit to bzz/gemini that referenced this issue Jun 24, 2018
Although slower, fixes user experience, i.e. src-d#138

The only possible workaround w/o Engine, using
`org.apache.hadoop.fs.FileSystem#globStatus(org.apache.hadoop.fs.Path)`,

did not work, as "$path/*" there does not match recursively, so

```
fs.globStatus(new Path(s"$path/*"))
  .flatMap(file => if (file.getPath.getName.endsWith(".siva")) {
    Some(file.getPath)
  } else {
    None
  })
```

would not support bucketed repos, and thus we would need to mimic
`org.apache.hadoop.mapreduce.lib.input.FileInputFormat#listStatus`
manually.

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
bzz added a commit to bzz/gemini that referenced this issue Jun 26, 2018
Although slower, fixes user experience, i.e. src-d#138

bzz commented Jun 28, 2018

Closing this for now, as both examples were provided and the confusing CLI UX was fixed in #143.

But please feel free to re-open in case something is not solved.

@bzz bzz closed this as completed Jun 28, 2018

campoy commented Jul 2, 2018

Why does the engine only support one level of nested directories?

@campoy campoy reopened this Jul 2, 2018

bzz commented Jul 24, 2018

Why does the engine only support one level of nested directories?

I don't know, but I'd guess a new issue in https://github.com/src-d/engine might be a better place to ask.

For your convenience, a quick search over the engine repo turns up what looks like the original issue that introduced that feature: src-d/sourced-ce#176


bzz commented Aug 13, 2018

Closing, as there is no further discussion and all the questions above have been addressed.

@bzz bzz closed this as completed Aug 13, 2018