Provide an example of output using Public Git Archive #138
@campoy Sure, I will be happy to double-check this for you if you provide either access to the dataset you are running on (SSH to GCP) or a command that lets me get the data locally. What instance type did you use? How much memory does it have? How many GB do the .siva files add up to? From the fragments of the logs above it's very hard to tell whether `hash` succeeded; having the full logs would help. Please keep in mind that
A quick glance shows that Google has 5008 .siva files totalling 17 GB, and Docker has 80 totalling 1.4 GB. The numbers were obtained from
In my experience, this is way beyond the scenarios that Gemini has been tested on so far. For reference, you can see the exact timings of every stage of Gemini that we have collected in order to set performance expectations, as part of the initial release of Gemini by the end of the quarter, using:
TL;DR: it is on the order of ~1 h for the first stages, which just use Engine to extract UASTs, even in a distributed configuration. Update: meanwhile, switched to testing
@campoy To answer the original question with an example of using Gemini on PGA, you would need to run it like:
The reason is that Engine only supports one level of nested directories, so it does not find any .siva files, while Gemini in this case produces confusing user output by listing the .siva files recursively (that is going to be fixed). On performance on a single machine: the first step of Gemini is an Engine query that extracts the content of all non-binary files with unique SHAs at the HEAD revision and converts it to UASTs. On a 4-core laptop, running over just the Docker repositories, it takes 13.6 h.
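To make that first stage more concrete, here is a minimal sketch of the kind of Engine query described above: keep only the HEAD reference of each repository, drop binary blobs, deduplicate by content SHA, and extract UASTs. This is an illustration only, assuming the source{d} Engine Scala API of that period (`getRepositories`, `getHEAD`, `getBlobs`, `classifyLanguages`, `extractUASTs`, and the `is_binary`/`blob_id` columns); it is not the actual Gemini code, and method or column names may differ between Engine versions.

```scala
import org.apache.spark.sql.SparkSession
import tech.sourced.engine._

// Hypothetical sketch of Gemini's first stage, NOT the actual Gemini code:
// read repositories from .siva files, keep only HEAD, drop binary blobs,
// deduplicate by content SHA, and extract UASTs.
object HeadUastsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gemini-first-stage-sketch")
      .master("local[*]")
      .getOrCreate()

    // Engine reads .siva files from a single directory level.
    val engine = Engine(spark, "/repositories", "siva")

    val uasts = engine.getRepositories
      .getReferences
      .getHEAD                     // keep only the HEAD reference of each repo
      .getCommits                  // (the real pipeline restricts this to the tip commit)
      .getTreeEntries
      .getBlobs
      .filter("is_binary = false") // non-binary files only
      .dropDuplicates("blob_id")   // unique content SHAs only
      .classifyLanguages           // language detection, required for UASTs
      .extractUASTs()

    uasts.select("blob_id", "lang").show(10)
  }
}
```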
Although slower, this fixes the user experience, i.e. src-d#138.

The only possible workaround without Engine, using `org.apache.hadoop.fs.FileSystem#globStatus(org.apache.hadoop.fs.Path)`, did not work, as `"$path/*"` there does not match recursively, so

```scala
fs.globStatus(new Path(s"$path/*"))
  .flatMap(file =>
    if (file.getPath.getName.endsWith(".siva")) {
      Some(file.getPath)
    } else {
      None
    })
```

would not support bucketed repos, and thus we would need to mimic `org.apache.hadoop.mapreduce.lib.input.FileInputFormat#listStatus` manually.

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
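For illustration, below is a minimal sketch of one way to collect .siva files recursively without glob patterns, using Hadoop's recursive `FileSystem#listFiles`. This is only one possible approach under those constraints, not necessarily what the referenced change implements (the commit message talks about mimicking `FileInputFormat#listStatus` instead).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

import scala.collection.mutable.ArrayBuffer

object SivaFileLister {
  /** Recursively collect all .siva files under `root`, including bucketed
    * layouts such as siva/latest/ab/... that a single-level glob misses. */
  def listSivaFiles(root: String): Seq[Path] = {
    val rootPath = new Path(root)
    val fs: FileSystem = rootPath.getFileSystem(new Configuration())
    val files = ArrayBuffer.empty[Path]

    // listFiles(path, recursive = true) walks the whole tree, unlike
    // globStatus(new Path(s"$root/*")), which only matches one level deep.
    val it = fs.listFiles(rootPath, true)
    while (it.hasNext) {
      val status = it.next()
      if (status.getPath.getName.endsWith(".siva")) {
        files += status.getPath
      }
    }
    files.toList
  }
}
```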
Closing this for now, as both examples were provided and the confusing CLI UX was fixed in #143. But please feel free to re-open in case something is not solved.
Why does the Engine only support one level of nested directories?
I don't know, but I guess a new issue in https://github.com/src-d/engine might be a better place to ask. For your convenience, a quick search over the engine repo brings up what looks like the original issue that introduced that feature: src-d/sourced-ce#176
Closing, as there is no further discussion and all the questions above have been addressed.
I just downloaded all of the siva files in PGA corresponding to the Google and Docker organizations, plus one more org, and ran Gemini using docker-compose.
The `hash` process fails at the end with this message (this is after successfully storing all of the information to the DB, apparently):
```
WARN 19:13:38 tech.sourced.gemini.Gemini (Gemini.scala:33) - Getting repositories at /repositories in siva format
WARN 19:13:40 tech.sourced.gemini.Gemini (Gemini.scala:42) - Hashing
WARN 19:13:40 tech.sourced.gemini.Hash (Hash.scala:56) - Listing files
WARN 19:13:41 tech.sourced.gemini.Hash (Hash.scala:67) - Extracting UASTs
WARN 19:13:42 tech.sourced.gemini.Hash (Hash.scala:86) - Extracting features
WARN 19:13:42 tech.sourced.gemini.Hash (Hash.scala:103) - creating document frequencies
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:122) - hashing features
WARN 19:13:44 tech.sourced.gemini.Gemini (Gemini.scala:45) - Saving hashes to DB
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:148) - save meta to DB
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:142) - save document frequencies to docfreq.json
WARN 19:13:44 tech.sourced.gemini.Hash (Hash.scala:167) - save hashtables to DB
Done
```
Unfortunately, when I run `report`, I don't see any similar files:

```
$ docker-compose exec gemini ./report
Reporting all similar files
No duplicates found.
No similar files found.
```
It is possible there are none, or that there is an issue, and I can't really tell because I don't know which repositories should report similar files.
Could you check whether this is an actual issue, or whether these organizations simply do not contain any similar files (which seems improbable)?
Thanks