Skip to content

andreybratus/tutorial

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 

Repository files navigation

Tutorial to create a simple transform with OpenRefine

This tutorial describes the process of performing data cleaning operations with OpenRefine and how to extract transformation rules that we use to clean a specific dataset. These rules can be later used to pass to the Batchrefinebatchrefine transformer.

Requirements for this tutorial:

  • Install OpenRefine from here or clone and build it from source git
  • As input file we use a CSV file provided by RET.

Tutorial

  1. Start OpenRefine, a new web browser window should open automatically, or you can point it to http://localhost:3333 (default configuration). On the Create Project tab upload the aforementioned CSV file.
Create project

OpenRefine attempts to identify input file format automatically but allows to configure data importing (format, separator type, etc.)

Create project

Our input file format should be recognized correctly, so go ahead and click on Create Project.

Identify problems with data

On the project screen we have a preview of our tabular data entries and we can easily identify the following problems that we want to clean:

  • In column "nome" all entries are in uppercase, which we want to transform into CammelCase, such that:

LA VALLE DEI MEDICI ---> La Valle Dei Medici

  • In columns "lt" and "lg" we have numerical values that represent latitude, longitude coordinates. Some entries appear have to many digits after the decimal point. Therefore, we want to limit the amount of digits after the decimal points to 2 digits, such that:

37.40425419999999689935066271 ---> 37.40

Perform cleaning

  1. To transform all the entries in column "nome" into CamelCase click on the arrow near the column title and select:

Edit cells --> Common transforms --> To titlecase*

To CamelCase transform
  1. Fixing the length of coordinates in columns "lt" and "lg" is performed in two steps:
  • Parse the string values to have them identified as numbers by clicking on the arrow near the column title and selecting:

Edit cells --> Common transforms --> To number

Parse number

Numeric entries are represented with green color. Perform the same task for both columns.

  • In the next step we will round the numeric values to 2 digits after the decimal point. Open the transform window by clicking on the arrow near the column title and selecting:

Edit cells --> Transform In this window we can apply custom transformations on column entries by writing expressions in one of the available languages (Google Refine Expression Language (GREL), Clojure or Jython. In this tutorial we use Jython as an example:

return round(value,2)
Jython expression

You can preview the result of an expression before applying it on the whole dataset. To complete the transform click OK.

Extract transformation rules

OpenRefine keeps track of all changes you apply to a dataset and allows you to undo or redo them using the history entries that are listed in Undo/Redo tab. Moreover, you it allows you to extract transformation rules in as a JSON array. Each element of an array is a JSON object describing a transformation rule.

Extract transform rules

To save transformation rules for later use click on the extract button of the history panel and copy its contents to a file. Use such transformation rules file to pass to the Batchrefine transformer together with the corresponding dataset.

About

OpenRefine create transform tutorial

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published