GitHub - kujaomega/crawl_subreddit: This is a subreddit crawler

Assumptions:

In this project I assume that the reddit endpoint ("/r/Python.json") gives you the latest posts ordered by the recent update.

As there is a throttling time in the api calls, I have not implemented parallellism, the bottleneck is the 600 api calls in 600 seconds.

I have choosen to do use AWS lambdas to make this project as It is a stadistics api and will not have a need to be requested every second. For this reason lambdas will decrease the costs of the api. As AWS have got a free tier of service and Mongodb offer a free database, I have implemented the services in a free way.

To use the service I have created first a Mongodb database and I have stored the dabase information in a mongo_db.json file. To execute the files I assume you are using Ubuntu and you have docker and pyhton3.6 installed. You also need to have AWS credentials stored in your "~/.aws/" folder. First of all, to crawl all the data to the database, execute the reddit_crawler/reddit_crawler.py. Then you need to execute the reddit_crawler/deploy_lambda.sh to deploy the lambda for constantly update database. You need to execute the api_endpoint/deploy_lambda.sh to create the api endpoints.

BONUS:

I choosen the explained architecture for an ease to implement this step.

I have implemented top submitters and top commenters as a personal challenge but I have not implemented the most active users due the lack of time.

I have implemented all posts by user and all posts a user commented as a personal challenge.

I have implemented the average comment karma for a user as a challenge but I have not implemented the top 10 most valued users due the lack of time.

I have implemented a script for deploying the api endpoint buy I have not implemented the tests in the continuous integration due the lack of time.

The endpoints are not working ATM

ENDPOINTS: The api calls will be available a week or until a response

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/top10punctuation

queryparam: rank["all", "discussion" or "external"]

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/top10comments

queryparam: rank["all", "discussion" or "external"]

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/top10submitters

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/top10commenters

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/allPostsByUser

queryparam: author

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/allPostsByUserComments

queryparam: author

GET - https://jg6l4dw34i.execute-api.eu-central-1.amazonaws.com/dev/averageCommentKarma

queryparam: author

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
api_endpoint		api_endpoint
reddit_crawler		reddit_crawler
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The endpoints are not working ATM

About

Releases

Packages

Languages

kujaomega/crawl_subreddit

Folders and files

Latest commit

History

Repository files navigation

The endpoints are not working ATM

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages