Immospider is a Python program that crawls the Immoscout24 website. You can also use it to receive an email immediately when new apartments become available on Immoscout24. It is based on ideas from http://mfcabrera.com/data_science/2015/01/17/ichbineinberliner.html and https://github.com/balzer82/immoscraper .
Immospider uses the popular Python framework Scrapy ( https://scrapy.org/ ). For installation you need Python 3. Then clone this repository and install the requirements via
pipenv install
This should install all necessary packages for you.
Let's assume you want to move to Berlin. You are searching for a flat with 2-3 rooms and more than 60 m^2 that should not cost more than 1000 Euro. Enter these requirements on the Immoscout24 website and search. If you search all of Berlin you will probably find more than 500 results. Next, copy the URL of your Immoscout search, because Immospider will use it. For the example given here the URL is https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00 . With this information you can now start Immospider like this:
scrapy crawl immoscout -o apartments.csv -a url=https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00 -L INFO
You should be able to scrape all results within 30 seconds. The results will be stored in the CSV file apartments.csv.
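To take a first look at the output, the CSV file can be read with Python's standard library. The following is a minimal sketch; the helper name load_apartments is hypothetical, and the column names are read from the header row of the file rather than assumed, because they depend on what the spider exports.

```python
import csv
from pathlib import Path


def load_apartments(path):
    """Read a CSV file written via `-o apartments.csv` into a list of dicts.

    Each dict maps a column name (taken from the header row) to a string value.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Example usage (run after a crawl has produced apartments.csv):
#   rows = load_apartments("apartments.csv")
#   print(len(rows), "apartments found")
#   print("columns:", ", ".join(rows[0].keys()))
```

Inspecting the column names first avoids hard-coding fields that may change between versions of the spider.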
- Docker
- Account at SendGrid (for sending out email)
Make a copy of config.tmpl and rename it to config. Edit config and fill out the following environment variables:
URL=<your immoscout search url>
FROM=<from email address>
TO=<to email address>
SENDGRID_API_KEY=<your sendgrid API key>
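Before starting the container it can help to check that no variable was left empty or still holds its template placeholder. The sketch below is not part of Immospider; parse_config and missing_keys are hypothetical helpers, and it assumes the config file consists of plain KEY=value lines as shown above (treating lines starting with # as comments is also an assumption).

```python
REQUIRED_KEYS = ("URL", "FROM", "TO", "SENDGRID_API_KEY")


def parse_config(text):
    """Parse KEY=value lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so URLs containing '=' stay intact.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """Return required keys that are absent, empty, or still a <placeholder>."""
    return [
        key
        for key in REQUIRED_KEYS
        if not settings.get(key) or settings[key].startswith("<")
    ]


# Example usage:
#   with open("config", encoding="utf-8") as handle:
#       problems = missing_keys(parse_config(handle.read()))
#   if problems:
#       print("Please set:", ", ".join(problems))
```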
By default Immospider is configured to run every 10 minutes. To change the interval, edit the following line in the file yacrontab.yaml:
schedule: "*/10 * * * *"
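The schedule uses standard crontab syntax, where the first field selects the minutes of each hour. As an illustration only, the hypothetical helper below expands the simple forms of that field (*, */N, or a single number) into the minutes at which a job would start; full crontab syntax also allows lists and ranges, which this sketch does not cover.

```python
def expand_minute_field(field):
    """Expand a cron minute field of the form '*', '*/N' or 'M' into a list of minutes."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be a positive integer")
        return list(range(0, 60, step))
    minute = int(field)
    if not 0 <= minute < 60:
        raise ValueError("minute must be between 0 and 59")
    return [minute]


schedule = "*/10 * * * *"
minute_field = schedule.split()[0]
print(expand_minute_field(minute_field))  # [0, 10, 20, 30, 40, 50]
```

Changing the line to "*/30 * * * *", for example, would start a crawl at minutes 0 and 30 of every hour.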
To create the Docker container and run it with your configuration, run
$ sh run_docker.sh
This will create a Docker container from the Dockerfile, install the dependencies and Immospider into the container, and run it with your configuration. It will scrape Immoscout24 at regular intervals, store the results, and send out an email when it detects new results it hasn't seen before. Neat, isn't it?
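The idea of reporting only results that have not been seen before can be sketched as a comparison against a set of known identifiers. This is an illustration of the general technique, not Immospider's actual implementation; the function split_new and the "id" field are hypothetical.

```python
def split_new(results, seen_ids, key="id"):
    """Return (new_results, updated_seen_ids) for one crawl run.

    results  -- iterable of dicts, each carrying a unique identifier under `key`
    seen_ids -- set of identifiers known from earlier runs (left unmodified)
    """
    new_results = []
    updated = set(seen_ids)
    for item in results:
        item_id = item[key]
        if item_id not in updated:
            updated.add(item_id)
            new_results.append(item)
    return new_results, updated


# First run: everything is new. Second run: only the unseen apartment is reported.
seen = set()
new, seen = split_new([{"id": "1"}, {"id": "2"}], seen)
print(len(new))  # 2
new, seen = split_new([{"id": "2"}, {"id": "3"}], seen)
print([item["id"] for item in new])  # ['3']
```

An email would then be sent only when the list of new results is non-empty, and the set of seen identifiers would be stored between runs.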
To deploy the Docker container in an Amazon EC2 instance you can do the following:
Create an instance with
- Amazon Linux
- Image size 8 GB
- t2.micro
Configure it to allow SSH access from your machine
- security group (ssh-from-my-ip): SSH, TCP, port 22, source <your IP>/32
Start your instance and log in to it with your private key <your_keyfile>.pem
$ chmod 400 <your_keyfile>.pem
$ ssh -i "<your_keyfile>.pem" ec2-user@ec2-<your_instance_name>.<your_compute_zone>.compute.amazonaws.com
Then install Docker onto your VM instance
$ sudo yum update -y
$ sudo yum install -y docker
$ sudo service docker start
Finally clone this repository, create and run the docker container
$ git clone https://github.com/asmaier/ImmoSpider.git
$ cd ImmoSpider
# don't forget to change your configuration first
$ sudo sh run_docker.sh
See also
- https://www.ybrikman.com/writing/2015/11/11/running-docker-aws-ground-up/#launching-an-ec2-instance
- https://docker-curriculum.com/#docker-on-aws
- https://techsparx.com/software-development/docker/deploy-images-without-registry.html
Finding a good flat that is close to your workplace and also close to, for example, your kids' kindergarten or school or your favorite park can be very difficult. Unfortunately the existing apartment search engines in Germany, such as Immoscout, Immowelt and Immonet, don't support computing the travel time from an apartment to several destinations. Here I want to show you how to use Immospider to do that.
You need an API key for the Google Maps API if you want to compute travel times to several destinations. Follow the instructions at https://github.com/googlemaps/google-maps-services-python#api-keys to get your API key.
Let's assume you want to move to Berlin. You will work at some fancy startup near Alexanderplatz, but your partner likes to go shopping at the KaDeWe. You are searching for a flat with 2-3 rooms and more than 60 m^2 that should not cost more than 1000 Euro. Enter these requirements on the Immoscout24 website and search. If you search all of Berlin you will probably find more than 500 results. Next, copy the URL of your Immoscout search, because Immospider will use it. For the example given here the URL is https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00 . With this information you can now start Immospider like this:
scrapy crawl immoscout -o apartments.csv -s GM_KEY=<Google Maps API Key> -a url=https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00 -a dest="Alexanderplatz, Berlin" -a mode=transit -a dest2="KaDeWe, Berlin" -L INFO
The option -o apartments.csv specifies the output file. The parameter -s GM_KEY=<Google Maps API Key> sets your Google Maps API key. The arguments -a dest="Alexanderplatz, Berlin" -a mode=transit tell Immospider that you want to calculate the travel time from each apartment to Alexanderplatz using public transportation. The argument -a dest2="KaDeWe, Berlin" will additionally compute the travel time by car (the default mode) to KaDeWe. You can have up to three destinations dest1,dest2,dest3 and specify the mode for each destination mode1,mode2,mode3.
The argument -a url=... must hold the search URL from Immoscout. The optional parameter -L INFO can be added to generate more log output.
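Because destinations such as "Alexanderplatz, Berlin" contain spaces, it is easy to get the shell quoting wrong when typing the command by hand. One way around this is to assemble the argument list in Python, as sketched below. The helper build_crawl_command is hypothetical; it only uses the options described above (-o, -s, -a, -L) and the argument names from the example command.

```python
import shlex


def build_crawl_command(url, output="apartments.csv", gm_key=None,
                        spider_args=None, log_level="INFO"):
    """Assemble the `scrapy crawl immoscout` call as a list of arguments."""
    command = ["scrapy", "crawl", "immoscout", "-o", output]
    if gm_key:
        command += ["-s", f"GM_KEY={gm_key}"]
    command += ["-a", f"url={url}"]
    # Each spider argument becomes its own `-a name=value` pair.
    for name, value in (spider_args or {}).items():
        command += ["-a", f"{name}={value}"]
    command += ["-L", log_level]
    return command


command = build_crawl_command(
    "https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Berlin/Berlin/-/2,50-/60,00-/EURO--1000,00",
    gm_key="<Google Maps API Key>",
    spider_args={
        "dest": "Alexanderplatz, Berlin",
        "mode": "transit",
        "dest2": "KaDeWe, Berlin",
    },
)
# shlex.join (Python 3.8+) renders the list with correct shell quoting.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would start the crawl without going through a shell at all, so no quoting is needed.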
If you start Immospider with the parameters given here, it might run for up to 20 minutes, not because the crawler is slow, but because the Google Maps API takes some time to compute the travel time for each of the more than 500 apartments. If that is too slow for you, narrow your search on Immoscout (and copy the new URL) so that the number of search results is lower. If your result set is about 50 apartments, Immospider will only need 1-2 minutes to compute all the travel times.
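To give an idea of what a single travel-time lookup returns, the sketch below extracts the duration from a response shaped like the one the Google Maps Distance Matrix API produces (for example via the distance_matrix method of the googlemaps Python client). Whether Immospider uses exactly this endpoint is an assumption here, and the helper travel_minutes is hypothetical; the sample response is made up for illustration.

```python
def travel_minutes(response):
    """Return the travel time in minutes for the first origin/destination pair.

    Returns None if the request or the individual route lookup did not succeed.
    """
    if response.get("status") != "OK":
        return None
    try:
        element = response["rows"][0]["elements"][0]
    except (KeyError, IndexError):
        return None
    if element.get("status") != "OK":
        return None
    # The API reports durations in seconds.
    return round(element["duration"]["value"] / 60)


# Made-up sample in the shape of a Distance Matrix response.
sample_response = {
    "status": "OK",
    "rows": [
        {"elements": [{"status": "OK", "duration": {"text": "23 mins", "value": 1380}}]}
    ],
}
print(travel_minutes(sample_response))  # 23
```

One such lookup is needed per apartment and destination, which is why the total runtime grows with the number of search results.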
Several Jupyter notebooks show how the search results can be analyzed.