BOW

Bags of features, extracted in july 2018 from 7.8 million distinct files from PGA (taking only the HEAD commit), using all implemented extractors in sourced.ml at the time (identifiers, literals, graphlets, children, node2vec and uast2seq) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use apollo at scale. We hit scipy.sparse limits while trying to merge sparse matrices for all bags, so this is only one of three BOW model holding bags.

Example:

from sourced.ml.models import BOW
bow = BOW().load("694c20a0-9b96-4444-80ae-f2fa5bd1395b")
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))

References


ID	694c20a0-9b96-4444-80ae-f2fa5bd1395b
Uploaded	2018-07-17 10:28:56.243131
Version	1.0.0
File	https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F694c20a0-9b96-4444-80ae-f2fa5bd1395b.asdf
Size	26.0 GB
Data collection date	July 2018
Number of distinct documents (files)	3,512,171
Number of distinct features	6,194,874
Other parts	da8c5dee-b285-4d55-8913-a5209f716564 and 1e0deee4-7dc1-400f-acb6-74c0f4aec471
License

Dependencies

55215392-36fc-43e5-b277-500f5b68d0c6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

694c20a0-9b96-4444-80ae-f2fa5bd1395b.md

694c20a0-9b96-4444-80ae-f2fa5bd1395b.md

BOW

References

Dependencies

Files

694c20a0-9b96-4444-80ae-f2fa5bd1395b.md

Latest commit

History

694c20a0-9b96-4444-80ae-f2fa5bd1395b.md

File metadata and controls

BOW

References

Dependencies