This tutorial describes the process of performing data cleaning operations with OpenRefine and how to extract transformation rules that we use to clean a specific dataset. These rules can be later used to pass to the Batchrefinebatchrefine transformer.
Requirements for this tutorial:
- Install OpenRefine from here or clone and build it from source git
- As input file we use a CSV file provided by RET.
- Start OpenRefine, a new web browser window should open automatically, or you can point it to http://localhost:3333 (default configuration). On the Create Project tab upload the aforementioned CSV file.
OpenRefine attempts to identify input file format automatically but allows to configure data importing (format, separator type, etc.)
Our input file format should be recognized correctly, so go ahead and click on Create Project.
On the project screen we have a preview of our tabular data entries and we can easily identify the following problems that we want to clean:
- In column "nome" all entries are in uppercase, which we want to transform into CammelCase, such that:
LA VALLE DEI MEDICI ---> La Valle Dei Medici
- In columns "lt" and "lg" we have numerical values that represent latitude, longitude coordinates. Some entries appear have to many digits after the decimal point. Therefore, we want to limit the amount of digits after the decimal points to 2 digits, such that:
37.40425419999999689935066271 ---> 37.40
- To transform all the entries in column "nome" into CamelCase click on the arrow near the column title and select:
Edit cells --> Common transforms --> To titlecase*
- Fixing the length of coordinates in columns "lt" and "lg" is performed in two steps:
- Parse the string values to have them identified as numbers by clicking on the arrow near the column title and selecting:
Edit cells --> Common transforms --> To number
Numeric entries are represented with green color. Perform the same task for both columns.
- In the next step we will round the numeric values to 2 digits after the decimal point. Open the transform window by clicking on the arrow near the column title and selecting:
Edit cells --> Transform In this window we can apply custom transformations on column entries by writing expressions in one of the available languages (Google Refine Expression Language (GREL), Clojure or Jython. In this tutorial we use Jython as an example:
return round(value,2)
You can preview the result of an expression before applying it on the whole dataset. To complete the transform click OK.
OpenRefine keeps track of all changes you apply to a dataset and allows you to undo or redo them using the history entries that are listed in Undo/Redo tab. Moreover, you it allows you to extract transformation rules in as a JSON array. Each element of an array is a JSON object describing a transformation rule.
To save transformation rules for later use click on the extract button of the history panel and copy its contents to a file. Use such transformation rules file to pass to the Batchrefine transformer together with the corresponding dataset.