Add support for crawling a GitHub User’s various contributions, independent of a specific org or repo #146
Comments
When referring to a user's events, do you mean the following API: list-public-events-that-a-user-has-received? AFAIK there is no way to set up public webhooks to trigger events for specific users. According to the GitHub webhooks documentation:
ghcrawler is more targeted towards collecting both public and private data by listening to webhook events, even though it is possible to queue up different types of requests manually. The following project may be of interest; it collects all publicly available GitHub data: http://ghtorrent.org/.
The user processing in GHCrawler could be enhanced to follow the events for the user via the users/{username}/events API. The other option is to periodically trigger a refresh of each team member's data. That would recrawl their events and potentially the entities related to those events. That might work, though there are a couple of caveats.
As to your specific questions:
As an aside, you might also consider https://www.gharchive.org/, which has the events for all of GitHub. That data is surfaced in BigQuery and you can query it for activity related to your users. If you don't otherwise need GHCrawler, that might work well. If you are running GHCrawler anyway, ...
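For example, something like the sketch below would pull a user's public activity. This assumes the @google-cloud/bigquery Node client and GH Archive's published githubarchive.day.YYYYMMDD tables; the field names follow GH Archive's documented schema, so verify them before relying on this.

```js
// Sketch: query GH Archive's public BigQuery dataset for one user's
// public activity on one day. Table naming (githubarchive.day.YYYYMMDD)
// and fields (type, repo.name, actor.login, created_at) follow GH
// Archive's published schema.
const { BigQuery } = require('@google-cloud/bigquery');

async function publicActivity(login, day) {
  // `day` is a date string like '20190101'; table names can't be
  // parameterized in BigQuery, so it is interpolated directly.
  const query = `
    SELECT type, repo.name AS repo, created_at
    FROM \`githubarchive.day.${day}\`
    WHERE actor.login = @login
    ORDER BY created_at`;
  const [rows] = await new BigQuery().query({ query, params: { login } });
  return rows;
}
```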
Thanks so much for the tips and info. This is still a WIP, but I wanted to write an update. It's a bummer that GitHub doesn't have webhooks for users/{username}/events, but we're going to use the crawler for this anyway and just queue it up manually on a regular interval, mostly because, in the future, we'll also care about our org and the org's repos, and it will be nice to have one tool for all GitHub crawling. Currently, I am able to crawl a user's events, and will be working on trying to limit the crawling to only new events. (I'll check out ETags, like you suggested.) Incidentally, you were right that _shouldFilter was keeping me from getting a user's repos. Thanks.
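For posterity, here's the kind of ETag-based polling I'm planning to try. This is a rough standalone sketch, not ghcrawler code; the If-None-Match / 304 behavior is from GitHub's REST API docs.

```js
// Poll a user's public events, skipping unchanged responses via ETags.
// GitHub returns 304 Not Modified when If-None-Match matches, and 304
// responses don't count against the rate limit. Node 18+ has global fetch.
let etag = null;

async function pollEvents(login, token) {
  const headers = { Authorization: `token ${token}` };
  if (etag) headers['If-None-Match'] = etag;

  const res = await fetch(`https://api.github.com/users/${login}/events`, { headers });
  if (res.status === 304) return []; // nothing new since the last poll
  etag = res.headers.get('etag');
  return res.json(); // the user's most recent public events
}
```

GitHub also sends an X-Poll-Interval header on the events endpoints indicating how often it is reasonable to poll.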
I have a couple more questions (since they're sort of general, I'll be happy to throw the answers into the docs/wiki once legal approves my signing of the CLA): When I run the CIABatta 😁 I get containers for Mongo, Redis, RabbitMQ, and Metabase. If I run the crawler in-memory, I understand that there's no persistence of data because there's no Mongo. So my question is: is the in-memory crawler also running without Redis and RabbitMQ? Are they not necessary for very basic crawling that doesn't look at stored data but just crawls everything fresh? I don't know much about RabbitMQ, but given that it's 'for queueing' I expected it to be necessary for the crawler to function. If Redis and RabbitMQ aren't being used when running the crawler in-memory, what are they being used to do?
Please see https://github.com/Microsoft/ghcrawler/blob/develop/README.md#running-in-memory:
Hi Gene, I'll try to elaborate on my questions. RabbitMQ is for queueing. When I run the crawler with Docker, RabbitMQ is there doing some sort of queueing for something. But when I run the crawler in-memory, I'm still able to queue things up to be entered into the crawler. Does that mean that RabbitMQ is running? (And, if so, where/how?) Or is the queueing that RabbitMQ does unrelated to the basic crawler queueing, maybe something more involved? Similarly, is Redis running when I start the crawler up in-memory? What is Redis being used for? I know it's 'for caching', but caching what? And when? I understand that when you kill the crawler process running in-memory, all work is lost. I assumed that was because there's no MongoDB running. But I don't understand exactly what Redis and RabbitMQ are doing, and when.
Hey @danisyellis, the basic point here is that the crawler is configurable with different providers for each subsystem. You can, in general, mix and match providers. The classic production setup for us is to have Rabbit for queuing, Azure blob for storage, ... You may be using Mongo for storing docs. In-memory setups are used for testing and generally use in-memory or local file system providers for queuing, storage, ... Redis is used pervasively in the system to coordinate and track multiple instances of the crawler, as well as for various rate-limiting and caching features. Redis is not used at all in the standard in-memory setup, as there is only one crawler running and it is local. Check out how the system is configured by following the code at https://github.com/Microsoft/ghcrawler/blob/develop/lib/crawlerFactory.js#L143
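If it helps, here's a toy sketch of that provider pattern. The class and variable names are invented for illustration and are not ghcrawler's actual code; see crawlerFactory.js for the real wiring.

```js
// Toy illustration of provider selection: each subsystem is chosen by
// name, and the in-memory choice needs no external services at all.
class InMemoryQueue {
  constructor() { this.items = []; }
  push(request) { this.items.push(request); }
  pop() { return this.items.shift(); }
}

class RabbitQueue {
  constructor(url) { this.url = url; } // would connect to RabbitMQ here
  // push/pop would publish to and consume from an AMQP queue
}

function createQueue(provider = process.env.CRAWLER_QUEUE_PROVIDER || 'memory') {
  switch (provider) {
    case 'memory': return new InMemoryQueue(); // in-memory run: no RabbitMQ needed
    case 'amqp':   return new RabbitQueue(process.env.CRAWLER_AMQP_URL);
    default: throw new Error(`unknown queue provider: ${provider}`);
  }
}
```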
Our goal is to track contributions by our employees to any open-source project on GitHub. So we'll need to look at each employee's commits, pull requests, issues, etc. We can do this through the user's Events.
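For context, here's roughly how we expect to turn events into contribution counts. This is a sketch; the event type names and payload fields are standard in the GitHub Events API, while the summary shape is our own.

```js
// Reduce a page of GitHub event objects (as returned by
// /users/{username}/events) to per-category contribution counts.
function summarizeContributions(events) {
  const summary = { commits: 0, pullRequests: 0, issues: 0, other: 0 };
  for (const event of events) {
    switch (event.type) {
      case 'PushEvent':
        summary.commits += event.payload.commits.length;
        break;
      case 'PullRequestEvent':
        if (event.payload.action === 'opened') summary.pullRequests++;
        break;
      case 'IssuesEvent':
        if (event.payload.action === 'opened') summary.issues++;
        break;
      default:
        summary.other++;
    }
  }
  return summary;
}
```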
I have some questions about how to do this:
Is there anything in the current constraints of ghcrawler that will make this an exceptionally difficult task?
How do I say "traverse the Events for a given User"? Where is an example of code doing something similar?
this._addCollection(request, 'repos', 'repo')
should tell it to look at a user's repos and add those repos to the MongoDB repo collection. But currently, as far as I can tell, it processes the user but doesn't even hit the repo function. Because I care most about events right now, I also tried this._addCollection(request, 'events', 'null');
and this._addCollection(request, 'events', 'events');
but neither seemed to do anything. Will this require an advanced traversal policy? I think that I can use the default traversal policy for now and refine it with an advanced one later to grab fewer things from user, if desired, like using GraphQL to do a query. Is that right?
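For reference, here's roughly what I imagined the processor change would look like. This is untested; the _addCollection signature, the 'event' element type, and the helper calls are my guesses from reading githubProcessor.js, not verified behavior.

```js
// Untested sketch of a user processor (in lib/githubProcessor.js) that
// also traverses the user's events, modeled on how other collections
// appear to be added.
user(request) {
  const document = request.document;
  request.addRootSelfLink();

  // GitHub REST: GET /users/{username}/events lists the user's public
  // events; queue that collection for traversal.
  this._addCollection(request, 'events', 'event', `${document.url}/events`);

  return document;
}
```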