If you are attending the OpenRefine sessions at Dataharvest and are not using the pre-installed lab computers, this repository will help you install OpenRefine on your own computer.
Please note that a regular OpenRefine install will work fine out-of-the-box for the beginner session.
With this repository, we will help you install a "pimped" version of OpenRefine, running in a Docker container, for the OSINT session.
What do I mean by "pimped"? Well...
This version adds a plugin called vib-bits and some command-line (CLI) tools, such as ddgr and trafilatura, to boost OpenRefine's potential.
The container also embeds a TOR service that allows you to torify command-line instructions (for anonymization).
- You need Docker on your machine. To install it, please visit this page:
https://docs.docker.com/desktop/
- You may also need git (not mandatory but useful)
1 - Clone the repository: git clone https://github.com/openfacto/DataHarvest23-OpenRefine-for-Osint.git
(you can also download the zip archive and unzip it on your hard disk).
2 - Build the image: docker-compose build
We will use the latest version of OpenRefine at the time of writing (3.7.2).
3 - Enjoy a cup of coffee: the image takes up to 4 minutes to build.
This project also includes a docker-compose file that lets you easily pass parameters to the container, such as the OpenRefine version, memory and CPU limits, etc. To start the stack, type:
docker-compose --compatibility up -d
To stop the stack, simply type:
docker-compose down
You can now access OpenRefine by opening the following URL in Firefox or Chrome:
http://127.0.0.1
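If the page does not load, it can help to check from a script whether anything is answering at that address. The following is a minimal sketch using only the Python standard library; the function name `openrefine_is_up` and the 5-second timeout are illustrative assumptions, not part of this repository.

```python
import http.client
import urllib.error
import urllib.request


def openrefine_is_up(url="http://127.0.0.1", timeout=5):
    """Return True if an HTTP server answers at `url` without an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status or malformed reply.
        return False
```

Once the stack has been started, `print(openrefine_is_up())` should print `True` as soon as OpenRefine has finished booting; a `False` right after startup may only mean the container needs a few more seconds.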
If you don't want to use Docker, you can also install OpenRefine, along with the plugins and CLI tools, directly on your computer. Please be aware that not all tools are available in that case.
- OpenRefine : https://openrefine.org
- Python+pip : https://python.org
OpenRefine runs on Java, so you can use it on all major platforms (Mac, Linux and Windows). Installation is detailed here.
We recommend installing the vib-bits plugin, which allows you to easily cross-match data between your different projects (in other words, join tables).
- The vib-bits plugin itself.
- The documentation
For this demo, we also recommend installing several tools on your machine (not mandatory, though):
- It's generally a good idea to have wget, curl, grep (or ripgrep), tar, python3 and python3-pip on your machine
- whois, dnsutils, and geoip-bin
- ddgr, to interact with duckduckgo using the command line
- trafilatura, a tool to scrape text from web pages.
Our Docker version now includes jq, for parsing JSON files, and jc, a tool that converts command-line output to JSON.
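To illustrate how one of these tools can feed structured data into a workflow, here is a hedged Python sketch that runs ddgr with its `--json` option and keeps only the result URLs. The helper names (`ddgr_command`, `result_urls`, `search`) are hypothetical, and the assumption that each result carries a `url` field should be checked against the ddgr version you have installed.

```python
import json
import subprocess


def ddgr_command(query, count=5):
    """Build a ddgr call that prints `count` results as JSON."""
    return ["ddgr", "--json", "--num", str(count), query]


def result_urls(ddgr_json):
    """Extract the `url` field (assumed name) from ddgr's JSON output."""
    return [item["url"] for item in json.loads(ddgr_json) if "url" in item]


def search(query, count=5, timeout=60):
    """Run ddgr and return the list of result URLs (empty if no output)."""
    result = subprocess.run(
        ddgr_command(query, count),
        capture_output=True, text=True, timeout=timeout,
    )
    if not result.stdout.strip():
        return []
    return result_urls(result.stdout)
```

Calling `search("openrefine osint")` requires ddgr to be installed and network access; the two small helpers can be tried on their own without either.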
I usually find it useful to have a TOR service that allows you to torify command-line instructions.
Example: torify curl https://ipinfo.io/ip
will display your IP address as seen through TOR.
Now that OpenRefine is installed on your machine, let's practice using some examples and use-cases.
- ONE - How to Geocode and enrich using APIs.
- TWO - Let's import some OSINT-obtained JSON data and map it.
- THREE - How about using more APIs to enrich your data?
- FOUR - Cherry on the cake; let's enrich our data using the command line!
- FIVE - Scrape & Enrich your data using TRAFILATURA
You can send your commands anonymously by routing them through TOR with the torify command.
Example:
torify curl --silent http://monip.org
In that case, the retrieved IP address should differ from the one returned by a plain curl --silent http://monip.org.
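The comparison described above can be scripted. This is a rough sketch, assuming curl and torify are both on the PATH (as they should be inside the container); the helper names `build_ip_check`, `fetch` and `compare` are made up for illustration.

```python
import subprocess


def build_ip_check(url="http://monip.org", use_tor=False):
    """Build the curl command from the text, optionally wrapped in torify."""
    command = ["curl", "--silent", url]
    return ["torify"] + command if use_tor else command


def fetch(command, timeout=30):
    """Run a command and return its standard output as stripped text."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip()


def compare(url="http://monip.org"):
    """Return True when the direct and torified responses differ."""
    direct = fetch(build_ip_check(url))
    via_tor = fetch(build_ip_check(url, use_tor=True))
    return direct != via_tor
```

`compare()` is expected to return `True` when TOR is running, since the two requests should leave from different IP addresses. Note that it compares the whole response body, not a parsed IP address.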
A good way to anonymize your requests using TOR is to refresh the IP address before every request with this command:
killall -HUP tor
This command refreshes the TOR IP address.
To test this, let's create a new project with two identical lines:
killall -HUP tor && torify curl --silent http://monip.org
killall -HUP tor && torify curl --silent http://monip.org
If we apply the Jython script to these lines, each line should return a different IP address.
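For reference, a Jython script of this kind could look like the sketch below. It is not necessarily the script used in the session: the function name `run_cell` is an assumption, and the code is written so that it runs both under OpenRefine's Jython (Python 2.7) and under Python 3. It uses `shell=True` because the cell contains `&&`, so it should only be applied to commands you wrote yourself.

```python
import subprocess


def run_cell(command):
    """Run the shell command held in a cell and return its output as text."""
    process = subprocess.Popen(
        command,
        shell=True,  # needed so that `&&` and pipes in the cell are interpreted
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    output, _ = process.communicate()
    return output.decode("utf-8", "replace").strip()


# In OpenRefine's expression editor (language: Python / Jython), `value`
# holds the cell content, so the expression would end with:
#     return run_cell(value)
```

Applied to a new column based on the column holding the two commands, each row would then contain the output of its own command.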
Trafilatura is a command-line tool that allows you to scrape web pages (for example, to retrieve the full text of an article).
Install it with pip3 (Python):
pip3 install trafilatura
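Once installed, the command-line tool can be called with its `-u` option followed by a URL. The sketch below wraps that call in Python, assuming the `trafilatura` executable is on the PATH; the helper names `trafilatura_command` and `extract_text` are illustrative only.

```python
import subprocess


def trafilatura_command(url):
    """Build the trafilatura CLI call that extracts the main text of `url`."""
    return ["trafilatura", "-u", url]


def extract_text(url, timeout=60):
    """Run trafilatura on `url` and return the extracted text (may be empty)."""
    result = subprocess.run(
        trafilatura_command(url),
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()
```

For example, `extract_text("https://example.org")` would return whatever text trafilatura manages to extract from that page. The same command string can also be placed in an OpenRefine cell and executed from a Jython expression.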