This repository is now archived. The project is being continued but is currently closed to new members. Data for Democracy is a community-driven organization. If you want to start a new project in a similar area, you are welcome to do so! Check out the #refugees channel and rally your fellow data nerds!
Slack Channel: #internal-displacement
Project Description: Classifying, tagging, analyzing and visualizing news articles about internal displacement. Based on a challenge from the IDMC.
The tool we are building carries out a number of functions:
- Ingest a list of URLs
- Scrape content from the respective web pages
- Tag the article as relating to disaster or conflict
- Extract key information from text
- Store information in a database
- Display data in interactive visualisations
The final aim is a simple app that can perform all of these functions with little technical knowledge needed by the user.
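The flow described above can be sketched end to end in a few lines. This is only an illustration under stated assumptions, not the project's implementation: the keyword sets, the `tag_article` and `run_pipeline` names, and the table layout are hypothetical, the `fetch` callable stands in for the real scraper, and SQLite stands in for the project's database.

```python
import re
import sqlite3
from typing import Callable, Iterable

# Hypothetical keyword sets; a trained classifier would replace this step.
DISASTER_WORDS = {"flood", "earthquake", "cyclone", "landslide", "wildfire"}
CONFLICT_WORDS = {"conflict", "violence", "fighting", "attack", "clashes"}


def tag_article(text: str) -> str:
    """Tag text as 'disaster', 'conflict', 'both' or 'other' by keyword match."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    disaster = bool(words & DISASTER_WORDS)
    conflict = bool(words & CONFLICT_WORDS)
    if disaster and conflict:
        return "both"
    if disaster:
        return "disaster"
    if conflict:
        return "conflict"
    return "other"


def run_pipeline(urls: Iterable[str], fetch: Callable[[str], str],
                 conn: sqlite3.Connection) -> int:
    """Ingest URLs, fetch their text, tag it and store one row per URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article "
        "(url TEXT PRIMARY KEY, text TEXT, category TEXT)"
    )
    stored = 0
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the input list
        text = fetch(url)  # the caller supplies the scraping function
        conn.execute(
            "INSERT OR REPLACE INTO article VALUES (?, ?, ?)",
            (url, text, tag_article(text)),
        )
        stored += 1
    conn.commit()
    return stored
```

Passing `fetch` in as an argument keeps the sketch free of network access, so it can be tried with a dictionary of canned pages before wiring in a real scraper.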
Project Lead:
Maintainers: These are the additional people mainly responsible for reviewing pull requests, providing feedback and monitoring issues.
Scraping, processing, NLP
Front end and infrastructure
- Join the Slack channel.
- Read the rest of this page and the IDETECT challenge page to understand the project.
- Check out our issues (small tasks) and milestones. Keep an eye out for help-wanted, beginner-friendly, and discussion tags.
- See something you want to work on? Make a comment on the issue or ping us on Slack to let us know.
- Beginner with GitHub? Make sure you've read the steps for contributing to a D4D project on GitHub.
- Write your code and submit a pull request to add it to the project. Reach out for help any time!
- Beginners are welcome! We're happy to help you get started. (For beginners with Git and GitHub specifically, our github-playground repo and the #github-help Slack channel are good places to start.)
- We believe good code is reviewed code. All commits to this repository are approved by project maintainers and/or leads (listed above). The goal here is not to criticize or judge your abilities! Rather, it is to share insights and achievements. Code reviews help us continually refine the project's scope and direction, and encourage discussion.
- This README belongs to everyone. If we've missed some crucial information or left anything unclear, edit this document and submit a pull request. We welcome the feedback! Up-to-date documentation is critical to what we do, and changes like this are a great way to make your first contribution to the project.
There are millions of articles containing information about displaced people. Each of these is a rich source of information that can be used to analyse the flow of people and reporting about them.
We are looking to record:
- URL
- Number of times URL has been submitted
- Main text
- Source (e.g. New York Times)
- Publication date
- Title
- Author(s)
- Language of article
- Reason for displacement (violence/disaster/both/other)
- The location where the displacement happened
- Reporting term: displaced/evacuated/forced to flee/homeless/in relief camp/sheltered/relocated/destroyed housing/partially destroyed housing/uninhabitable housing
- Reporting unit: people/persons/individuals/children/inhabitants/residents/migrants or families/households/houses/homes
- Number displaced
- Metrics relating to machine learning accuracy and reliability
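One way to picture the fields listed above is as a single record type. The sketch below is an assumption-laden illustration rather than the project's actual schema: the class name, field names, the single `reliability` score and the `register_submission` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class ArticleRecord:
    """One scraped article and the facts extracted from it (hypothetical layout)."""
    url: str
    submission_count: int = 1            # number of times the URL has been submitted
    main_text: Optional[str] = None
    source: Optional[str] = None         # e.g. "New York Times"
    publication_date: Optional[date] = None
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    language: Optional[str] = None
    reason: Optional[str] = None         # violence / disaster / both / other
    location: Optional[str] = None
    reporting_term: Optional[str] = None  # e.g. "displaced", "evacuated"
    reporting_unit: Optional[str] = None  # e.g. "people", "families"
    number_displaced: Optional[int] = None
    reliability: Optional[float] = None  # assumed single score for the ML metrics


def register_submission(records: Dict[str, ArticleRecord], url: str) -> ArticleRecord:
    """Create a record for a new URL, or bump the count for a repeated one."""
    if url in records:
        records[url].submission_count += 1
    else:
        records[url] = ArticleRecord(url=url)
    return records[url]
```

Most fields default to `None` because they are only known after scraping and interpretation; only the URL is available at submission time.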
These are the main parts and functions that make up the project.
- Scraper and Pipeline
- Take lists of URLs as input from input dataset
- Filter irrelevant articles and types of content (videos etc.)
- Scrape the main body text and metadata (publish date, language etc.)
- Store the information in a database
- Interpreter
- Classify URLs as conflict/violence, disaster or other. There is a training dataset to help with tagging.
- Extract information from articles: location and number of reporting units (households or individuals) displaced, date published and reporting term (displaced, evacuated etc.). The larger extended input dataset and the text from articles we have already scraped can be used to help here.
- Visualizer
- A mapping tool to visualize the displacement figures and locations, identify hotspots and trends.
- Other visualizations for a selected region to identify reporting frequency on the area
- Visualizing the excerpts of documents where the relevant information is reported (either looking at the map or browsing the list of URLs).
- Visualise reliability of classification and information extraction algorithms (either overall or by article)
- Some pre-tagged datasets (1, 2) can be used to start exploring visualization options.
- App is in the internal-displacement-web folder
- A non-technical-user friendly front end to wrap around the components above for inputting URLs, managing the databases, verifying data and interacting with visualisations
- Automation of scraping, pipeline and interpreter
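To make the interpreter's extraction step more concrete, here is a minimal regular-expression sketch that pulls a number, a reporting unit and a reporting term out of one sentence. It is a simplified stand-in, not the project's NLP approach: the `extract_report` name is hypothetical, and the term list is a subset of the reporting terms this README names.

```python
import re
from typing import Optional, Tuple

PEOPLE_UNITS = ["people", "persons", "individuals", "children",
                "inhabitants", "residents", "migrants"]
HOUSEHOLD_UNITS = ["families", "households", "houses", "homes"]
# Subset of the reporting terms, chosen for illustration only.
TERMS = ["displaced", "evacuated", "forced to flee", "homeless",
         "sheltered", "relocated"]

# A figure such as "300" or "1,200" followed by one of the reporting units.
_PATTERN = re.compile(
    r"(\d[\d,]*)\s+(" + "|".join(PEOPLE_UNITS + HOUSEHOLD_UNITS) + r")\b",
    re.IGNORECASE,
)


def extract_report(sentence: str) -> Optional[Tuple[int, str, str]]:
    """Return (number, unit, term), or None if any of the three is missing."""
    match = _PATTERN.search(sentence)
    if match is None:
        return None
    lowered = sentence.lower()
    term = next((t for t in TERMS if t in lowered), None)
    if term is None:
        return None
    number = int(match.group(1).replace(",", ""))
    return number, match.group(2).lower(), term
```

A pattern like this misses spelled-out numbers ("two thousand") and figures separated from their unit, which is part of why a trained model and labelled data are useful for this step.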
You can run everything as you're accustomed to by installing dependencies locally, but another option is to run in a Docker container. That way, all of the dependencies will be installed in a controlled, reproducible way.
- Install Docker: https://www.docker.com/products/overview
- Run this command:
docker-compose up
or
docker-compose -f docker-compose-spacy.yml up
The spacy version will include the en_core_web_md 1.2.1 NLP model, which is multiple gigabytes in size. The version without the model is much smaller. Either way, the first run will take some time because it fetches and builds all of its dependencies. Subsequent runs should be much faster.
This will start up several docker containers, running postgres, a Jupyter notebook server, and the node.js front end.
In the output, you should see a line like:
jupyter_1 | http://0.0.0.0:3323/?token=536690ac0b189168b95031769a989f689838d0df1008182c
That URL will connect you to the Jupyter notebook server.
- Visit the node.js server at http://localhost:3322
Note: You can stop the docker containers using Ctrl-C.
Note: If you already have something running on port 3322 or 3323, edit docker-compose.yml and change the first number in the ports config to a free port on your system. For example, for 9999, make it:
ports:
- "9999:3322"
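If you are unsure which port is free, a short check with Python's standard library can help before editing the ports config. This is a convenience sketch, not part of the project: the `port_is_free` and `first_free_port` names and the default loopback host are assumptions.

```python
import socket
from typing import Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to the port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # already in use, or not permitted for this user
        return True


def first_free_port(start: int, stop: int) -> Optional[int]:
    """Return the first free port in [start, stop), or None if all are taken."""
    for port in range(start, stop):
        if port_is_free(port):
            return port
    return None


if __name__ == "__main__":
    for wanted in (3322, 3323):
        state = "free" if port_is_free(wanted) else "in use"
        print(f"port {wanted} is {state}")
```

The result is only a snapshot: another process can take the port between the check and the moment Docker starts.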
Note: If you want to add Python dependencies, add them to requirements.txt and run the jupyter-dev version of the docker-compose file:
docker-compose -f docker-compose-dev.yml up --build
You'll need to use the jupyter-dev version until your dependencies are merged to master and a new version is built. Talk to @aneel on Slack if you need to do this.
Note: If you want to run SQL commands against the database directly, you can do that by starting a Terminal within Jupyter and running the PostgreSQL shell:
psql -h localdb -U tester id_test
Note: If you want to connect to a remote database, edit the docker.env file with the DB URL for your remote database.
- Python 3
- JavaScript/HTML/CSS
- Node.js
- AWS
- Visualisation (D3)
- Try to keep each contribution and pull request focussed mostly on solving the issue at hand. If you see more things that are needed, feel free to let us know and/or make another issue.
- Datasets can be accessed from Dropbox
- We have a working plan for the project.
- Not ready to submit code to the main project? Feel free to play around with notebooks and submit them to the repository.