Docker image for a Scrapy-based spider that scrapes https://www.immobilienscout24.de/ and stores the results in MongoDB.
To start scraping, run the following command:

```
MAX_PAGES=10 docker-compose up
```
By default, the spider scrapes the first 10 pages. Set the MAX_PAGES environment variable to any number of pages, or leave it empty to crawl all pages.
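As a rough sketch of how such a limit might be enforced inside the spider (the variable and function names here are illustrative, not taken from the actual spider code):

```python
import os

# An empty or unset MAX_PAGES means "no limit", i.e. crawl all pages.
# (Illustrative sketch; the real spider may read the variable differently.)
raw_limit = os.environ.get('MAX_PAGES', '')
max_pages = int(raw_limit) if raw_limit else None

def within_limit(page_number: int) -> bool:
    """Return True while the given page is still within the configured limit."""
    return max_pages is None or page_number <= max_pages
```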
MongoDB data is stored on the host in the project directory under ./data/db.
MongoDB can be accessed from the host system using the following settings:

mongodb://localhost:27017 is the MONGO_URI
immoscout24_db is the Mongo database name
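For orientation, a minimal Scrapy item pipeline writing to this database could look like the sketch below (assuming pymongo is installed; the MONGO_DATABASE settings key and the listings collection name are assumptions, not taken from the repository):

```python
import pymongo

class MongoPipeline:
    """Minimal sketch of a pipeline that stores every scraped item in MongoDB."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'immoscout24_db'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store all fields as-is; no validation or filtering is applied.
        self.db['listings'].insert_one(dict(item))
        return item
```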
The spider is a Scrapy spider. It works by sending a POST request to an immoscout24.de endpoint with the following headers:

```python
headers = {
    'Cookie': 'reese84=some_random_string_as_cookie'
}
```
The response is in JSON format; it is parsed and all of its fields are stored in MongoDB.
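A condensed sketch of that request/parse cycle is shown below; the endpoint URL, request payload, response keys, and cookie value are placeholders, since the real values live in the spider code:

```python
import json
import scrapy

class ImmoscoutSpider(scrapy.Spider):
    name = 'immoscout24'

    def start_requests(self):
        headers = {
            'Content-Type': 'application/json',
            # reese84 must carry a valid anti-bot token for the request to succeed.
            'Cookie': 'reese84=some_random_string_as_cookie',
        }
        payload = {'pageNumber': 1}  # placeholder request body
        yield scrapy.Request(
            url='https://www.immobilienscout24.de/search-endpoint',  # placeholder URL
            method='POST',
            headers=headers,
            body=json.dumps(payload),
            callback=self.parse,
        )

    def parse(self, response):
        data = json.loads(response.text)
        # Yield each result unchanged; the pipeline stores every field as-is.
        for result in data.get('results', []):  # 'results' key is a placeholder
            yield result
```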
There is no data validation or field filtering; change the spider code if you only need specific data. Filters for the crawled data can be added by reading the immoscout24.de API docs for the supported query parameters and adding them to the POST request in the spider code, as in the sketch below.
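For example, filter parameters could be merged into the request payload before it is serialized; the parameter names below are hypothetical, so check the API docs for the real ones:

```python
# Hypothetical filter parameters -- consult the immoscout24.de API docs
# for the actual names and accepted values.
payload = {
    'pageNumber': 1,
    'priceMax': 1500,
    'numberOfRooms': 3,
}
# The payload is then sent as the POST body: body=json.dumps(payload)
```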