Bags of features extracted in July 2018 from 7.8 million distinct files from PGA (taking only the HEAD
commit), using all extractors implemented in sourced.ml at the time (identifiers, literals, graphlets,
children, node2vec and uast2seq) and all languages parsable by Babelfish (Go, Java, Python, Bash,
JavaScript and Ruby). This was done to try to use apollo at scale. We hit scipy.sparse limits while
trying to merge the sparse matrices for all bags, so this is only one of the three BOW models that
hold the bags.
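To illustrate what merging bags from separate parts involves, here is a minimal pure-Python sketch. It assumes each part carries its own document list, its own token list, and sparse rows keyed by column index. That layout, the `merge_bags` helper, and the toy file and token names are all hypothetical; this is not the BOW storage format or the merging code in sourced.ml, and it does not reproduce the scipy.sparse limit mentioned above.

```python
def merge_bags(parts):
    """Merge bag-of-words parts over the union of their vocabularies.

    Each part is a (documents, tokens, rows) triple, where rows[i] is a
    dict {column index in that part's tokens: weight} for documents[i].
    """
    merged_docs, merged_tokens, merged_rows = [], [], []
    token_index = {}  # token -> column index in the merged vocabulary
    for docs, tokens, rows in parts:
        # Map this part's column indices onto the merged vocabulary.
        remap = []
        for token in tokens:
            if token not in token_index:
                token_index[token] = len(merged_tokens)
                merged_tokens.append(token)
            remap.append(token_index[token])
        # Rewrite every sparse row with the merged column indices.
        for doc, row in zip(docs, rows):
            merged_docs.append(doc)
            merged_rows.append({remap[col]: w for col, w in row.items()})
    return merged_docs, merged_tokens, merged_rows


# Toy data: two parts whose vocabularies only partly overlap.
part_a = (["a.py"], ["i.foo", "i.bar"], [{0: 2.0, 1: 1.0}])
part_b = (["b.go"], ["i.bar", "i.baz"], [{0: 3.0, 1: 1.0}])
docs, tokens, rows = merge_bags([part_a, part_b])
print(len(docs), "documents,", len(tokens), "tokens")
```

The merged result has one row per document across all parts and one column per distinct token, so both dimensions and the number of stored entries grow with every part that is added.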
Example:
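from sourced.ml.models import BOW

# Load this part of the bags by its model ID.
bow = BOW().load("694c20a0-9b96-4444-80ae-f2fa5bd1395b")
# One document per file, one token per distinct feature.
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))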
| Property | Value |
|---|---|
| ID | 694c20a0-9b96-4444-80ae-f2fa5bd1395b |
| Uploaded | 2018-07-17 10:28:56.243131 |
| Version | 1.0.0 |
| File | https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F694c20a0-9b96-4444-80ae-f2fa5bd1395b.asdf |
| Size | 26.0 GB |
| Data collection date | July 2018 |
| Number of distinct documents (files) | 3,512,171 |
| Number of distinct features | 6,194,874 |
| Other parts | da8c5dee-b285-4d55-8913-a5209f716564 and 1e0deee4-7dc1-400f-acb6-74c0f4aec471 |
| License | |