Support scraping Subversion #35
Being able to scrape subversion projects would be helpful and is not yet supported. It's a pretty low priority for my agency, but you requested we add issues for examples of repos not yet supported.

Comments
Hmm. Yeah, pretty low for me too. I'm trying to think how we might do this for arbitrary SVN repos (where we'd get the metadata ourselves). Are there specific SVN hosting tools that we would want or need to target? I wonder if we can get a list of all of the …
I think we have about 100-200 projects but haven't counted yet, since no one is really asking internally. And since they aren't scraped properly enough to determine whether they're excludable, it's a vicious cycle: people can't find them. I'm not sure what hosting tools to target. I was reading through the SVN book's API chapter, and it seems like we could crawl using the svn client to check out every directory and then go through it to find history, comments, and maybe enough metadata. I haven't looked at it since then because it seemed like a decent amount of boring work digging into SVN history files and such. I tried checking all the repos, but https://api.code.gov/repos?size=10000 only returned 1000 of the reported 6565 repos. None of those thousand used Subversion; they were all vcs=git.
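One possible shortcut to the history-and-comments part: rather than digging into raw SVN history files, the svn client itself can emit revision history and log messages as XML. A minimal sketch (assuming a reachable, anonymously readable repository URL):

```python
import subprocess
import xml.etree.ElementTree as ET

def fetch_history(repo_url):
    """Yield (revision, author, date, message) for each commit in the repo."""
    xml_log = subprocess.run(
        ["svn", "log", "--xml", "--non-interactive", repo_url],
        capture_output=True, text=True, check=True,
    ).stdout
    # `svn log --xml` wraps each commit in a <logentry> element.
    for entry in ET.fromstring(xml_log).iter("logentry"):
        yield (
            entry.get("revision"),
            entry.findtext("author"),
            entry.findtext("date"),
            entry.findtext("msg"),
        )
```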
Hi, I found the help-wanted tag for this issue on code.gov. You can see more than 1000 repos by passing `&from=[start]` to the code.gov API. Here's a list of all the repository URLs by repository ID: repository_ids_and_urls.txt
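For reference, a minimal sketch of paging through the API that way; the `repos` key in the response body and the absence of an API key are assumptions based on the comments above, not confirmed API details:

```python
import requests

API_URL = "https://api.code.gov/repos"
PAGE_SIZE = 1000

def fetch_all_repos():
    """Collect every repo by paging with size/from until a short page returns."""
    repos, offset = [], 0
    while True:
        resp = requests.get(API_URL, params={"size": PAGE_SIZE, "from": offset})
        resp.raise_for_status()
        page = resp.json().get("repos", [])
        repos.extend(page)
        if len(page) < PAGE_SIZE:  # last page reached
            break
        offset += PAGE_SIZE
    return repos
```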
I tried a simple `svn co ${url}` on each repositoryURL (so no authentication was performed). It only worked for two projects: https://code.gov/projects/doe_office_scientific_technical_information_osti_1_kepler and https://code.gov/projects/doe_office_scientific_technical_information_osti_1_zeptoos. What did I miss?
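For the record, a sketch of that checkout test; the function name is mine, and `--non-interactive` is used so svn fails fast instead of prompting for credentials:

```python
import subprocess
import tempfile

def can_checkout(url, timeout=300):
    """Return True if an anonymous `svn co <url>` succeeds within the timeout."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            result = subprocess.run(
                ["svn", "co", "--non-interactive", url, workdir],
                capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0

# e.g. filter the URL list from the attachment:
# working = [u for u in urls if can_checkout(u)]
```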
What kind of metadata can you extract from those checkouts? Are you able to populate the code.json elements? |
They're just source code repositories containing branch and tag names, source files, and detailed change history, and that's it. Human intervention would be needed for many fields, but you could auto-populate things like vcs=svn and maybe offer guesses for things like releases, license, e-mail, or description based on repo content. I'm guessing that even a barebones code.json file is helpful here.
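A barebones sketch of that kind of guessing, assuming the conventional trunk/branches/tags layout; the underscore-prefixed keys are hypothetical scratch fields for human review, not part of the code.json schema:

```python
import subprocess

def run_svn_ls(path):
    """List a repository path; return [] if the path doesn't exist."""
    result = subprocess.run(
        ["svn", "ls", "--non-interactive", path],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines() if result.returncode == 0 else []

def guess_code_json_fields(url):
    """Build a minimal, human-reviewable stub of code.json metadata."""
    entry = {"repositoryURL": url, "vcs": "svn"}  # vcs is the one sure field
    # Guess releases from the conventional tags/ directory, if present.
    tags = run_svn_ls(url + "/tags")
    if tags:
        entry["_release_guesses"] = [t.rstrip("/") for t in tags]
    # A top-level LICENSE file hints that the license field is recoverable.
    if any(name.upper().startswith("LICENSE") for name in run_svn_ls(url)):
        entry["_license_file_present"] = True
    return entry
```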