Java Runtime
git clone https://github.com/anjlab/csv2db.git csv2db
cd csv2db
chmod +x gradlew
./gradlew build
./build/libs/run.sh --help
usage: ./run.sh
-b,--batchSize <arg> Override batch size
-c,--config <arg> Configuration file
-d,--driverClass <arg> JDBC driver class name
-g,--progress Display progress
-h,--help Prints this help
-i,--input <arg> Input CSV file, or ZIP containing CSV files,
or path to folder that contains CSV files
-l,--connectionUrl <arg> JDBC connection URL
-m,--mode <arg> Operation mode (INSERT, MERGE, INSERTONLY)
-n,--include <arg> Only process files whose names match this
regexp (matches all files in input ZIP or
input folder by default)
-p,--password <arg> Connection password
-s,--skip <arg> Skip files whose names match this regexp
(skip nothing by default)
-t,--numberOfThreads <arg> Number of threads (default is number of
processors available to JVM)
-u,--username <arg> Connection username
-v,--verbose Verbose output, useful for debugging
-V,--verbose2 Print some internal statistics every 1 minute
Command line options take precedence over values from config file.
run.sh
supports $JAVA_OPTS
environment variable.
This distribution shipped with latest PostgreSQL 9.x JDBC driver.
If you need to connect to different RDBMS try either:
-
(re)build
csv2db
with-PjdbcDriver=<groupId:artifactId:version>
key.For example, to get latest 5.x MySQL Connector/J driver run build with:
./gradlew -PjdbcDriver=mysql:mysql-connector-java:5+ build
- or manually put JDBC driver to
./build/libs/lib
.
You can find names of JDBC drivers in Maven Central.
Only few columns listed here, just to show the format of CSV that were used with config below.
CompanyName, CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1, RegAddress.AddressLine2
"! LTD","08209948","","","METROHOUSE 57 PEPPER ROAD","HUNSLET"
"!BIG IMPACT GRAPHICS LIMITED","07382019","","","335 ROSDEN HOUSE","372 OLD STREET"
"!NFERNO LTD.","04753368","","","FIRST FLOOR THAVIES INN HOUSE 3-4","HOLBORN CIRCUS"
{
"extend": "parent-config.json",
"operationMode": "MERGE",
"driverClass": "org.apache.derby.jdbc.EmbeddedDriver",
"connectionUrl": "jdbc:derby:memory:myDB;create=true",
"connectionProperties": {
"username": { "function": "connectionUsername" },
"password": ""
},
"targetTable": "companies",
"sqlEscapeChar": "\"",
"primaryKeys": [
"companies_house_id"
],
"columnMappings": {
"0": "name",
"1": "companies_house_id",
"2": "address_care_of",
"4": "address_line_1",
"5": "address_line_2",
"6": "post_town",
"9": "post_code"
},
"transientColumns": [
"address_care_of"
],
"syntheticColumns": [
"record_hash"
],
"insertValues": {
"id": { "sql": "nextval('company_id_sequence')" },
"created_at": { "sql": "current_timestamp" }
},
"updateValues": {
"updated_at": { "sql": "current_timestamp" }
},
"transform": {
"post_code": { "function": "uppercase" },
"address_line_2" : { "function": "transformAddressLine2" }
},
"map": {
"function": "filterRows"
},
"scripting": [
"functions.js"
],
"forceUpdate": false,
"batchSize": 100,
"csvOptions": {
"separatorChar": ",",
"quoteChar": "\"",
"escapeChar": "\b",
"skipLines": 1,
"strictQuotes": false,
"ignoreLeadingWhiteSpace": true
}
}
Since version 2.2.0 csv2db will use Nashorn JavaScript engine (built-in to Java 8+) and fallback to Rhino if running on Java 7 and below. There's no switch to control this behavior.
Note: Nashorn is not a drop-in replacement for Rhino, so some things may stop working.
Follow this guide to upgrade your scripts from Rhino to Nashorn.
"use strict";
//load compatibility script when on Java 8
load("nashorn:mozilla_compat.js");
importPackage(org.apache.commons.lang3);
// Map function accepts two arguments:
// - `row` from CSV file containing only values according to `columnMappings`
// - `emit` is a callback function that accepts new value of `row` that will
// be used instead of original value
function filterRows(row, emit) {
if (row.get("companies_house_id")) {
emit(row)
}
}
// For functions used in connection properties
// first argument is the name of evaluating connection property
function connectionUsername(propertyName) {
return java.lang.System.getProperty("user.name")
}
function transformAddressLine2(columnName, row) {
// row.get("address_care_of") available here
return StringUtils.trimToNull(row.get(columnName));
}
function uppercase(columnName, row) {
return row.get(columnName).toUpperCase()
}
As shown in example above, config file should be in JSON format.
extend
allows extending this config definition from parent config. For example, you may extract some common bits from your configs to one shared common file and extend from it. All paths in parent config will be relative to this configuration.
operationMode
may be one of MERGE
, INSERT
or INSERTONLY
.
In MERGE
mode the data will be merged to the target table.
This mode requires that the primaryKeys
property is set.
The tool uses primaryKeys
to find the data in target database table for each CSV row data.
Query by primaryKeys
should return exactly 0 or 1 records.
If 0 records will be found, then CVS row will be INSERTed to the database,
otherwise an UPDATE will be performed and data from CSV will overwrite existing database record.
In INSERTONLY
mode, it operates in a similar way to MERGE
however the row will be ignored if the Query by primaryKeys
returns a row.
The effect is that only the rows that don't already exist in the database will be INSERTed in to the table.
driverClass
, connectionUrl
and connectionProperties
are corresponding values from JDBC documentation for your database. Since version 2.1 it is possible to use JavaScript function references for connection properties.
targetTable
is the name of target table in database. The table should exist before import.
sqlEscapeChar
(optional) if specified then all table and column names will be escaped with this character, i.e. with sqlEscapeChar: "\""
names of columns in generated SQL will be surrounded by quotes. Can be used when name of a column matches one of the SQL reserved words.
primaryKeys
is the set of primary keys on the table. Only used in MERGE
and INSERTONLY
modes.
All primaryKeys
should be present in columnMappings
section, which means that CSV should contain primaryKeys
data.
columnMappings
defines mapping between zero-based column indexes in CSV and target database table column names.
syntheticColumns
lists additional columns that need to be mapped to database columns. These additional columns can appear in rows as a result of map
function (see below).
transientColumns
defines mapped columns as transient, which means they are only available for JavaScript functions in the row
argument,
but these columns won't be mapped to target table columns.
insertValues
and updateValues
allows providing values for columns that are not in CSV (like in example above with required id
field, whose value should be taken from PostgreSQL sequence). insertValues
used in INSERT clauses, updateValues
used in UPDATE clauses. See Value Definitions.
transform
defines transformation rules for imported data. Right now you can define only one transformation for each column. See Value Definitions. Transformations only applied for CSV data and not for the columns defined in insertValues
and updateValues
map
(optional) defines name of a JavaScript function. Every row from input CSV file will be passed through this function. The map
function must accept two arguments: row
and emit
. The value of row
will be JSON object containing key/values according to columnMappings
. emit
is a callback function that accepts new value of row
that will be used instead of original value in further processing. Client code may invoke emit
function 0, 1, or many times, acting like a filter or a splitter.
scripting
defines list of JavaScript file names. The file names are relative to location of the configuration file. You can define your JavaScript functions in these files and reference them from Value Definitions.
forceUpdate
forces executing UPDATE statements for every row even if the data from CSV for this row is the same as in the database table. This may be needed if you want to force applying values from updateValues
section. Default value is false
and it is only used in MERGE
mode.
ignoreNullPK
ignores any row where any of the PK values in the data are null. This may be needed if you want to top up a reference table from a data table with missing reference values where some of them are null
. Default value is false
and it is only used in INSERTONLY
mode.
ignoreDuplicatePK
ignore rows having exactly the same values of PK when they appear in the same insert batch (see batchSize
). Only first record from the batch will be used for insertion.
batchSize
size of INSERT/UPDATE batches. Default value is 100.
limit
number of top records to process. Default is 0: no limit.
csvOptions
is a set of options supported by http://opencsv.sourceforge.net, here's the defaults:
{
"separatorChar": ",",
"quoteChar": "\"",
"escapeChar": "\\",
"skipLines": 0,
"strictQuotes": false,
"ignoreLeadingWhiteSpace": true
}
Value definitions used to define column values in the insertValues
, updateValues
and transform
blocks.
Value definition may be one of:
- Constant
- SQL expression
- JavaScript function reference
Constant value definitions are JSON primitives, and can be strings, numbers or booleans. Use this when you want the same value for all rows of the column:
{
"insertValues": {
"data_source": "CompaniesHouse",
"official_data": true
}
}
Result of SQL expression value definition becomes a part of SQL query, in contrast to "Constants" and "JavaScript function references" whose values becomes SQL query parameters.
{
"insertValues": {
"id": { "sql": "nextval('company_id_sequence')" }
}
}
SQL expressions are only allowed to be used for insertValues
and updateValues
. This is due to performance issues, because INSERT and UPDATE clauses prepared only once and then reused for import of every row.
You can reference one of the functions that declared in JavaScript files from scripting
block:
{
"transform": {
"post_code": { "function": "uppercase" }
}
}
Every function will accept two arguments:
columnName
-- name of the column for which this function should produce a value, androw
-- JSON object representing key-value pairs of currently imported CSV row. Only columns that are incolumnMappings
will be present in this object.
If the function referenced from context of insertValues
or updateValues
then row
argument will be null
.
Note that primary key values will be transformed before being checked against the primary key in both MERGE
and INSERTONLY
modes, enabling transformation of primary key values.