A library for working with Data Packages.
- Package: class for working with data packages
- Resource: class for working with data resources
- Profile: class for working with profiles
- validate: function for validating data package descriptors
- infer: function for inferring data package descriptors
The package uses semantic versioning, which means that major versions could include breaking changes. It's highly recommended to specify a datapackage version range in your package.json file, e.g. datapackage: ^1.0, which npm install --save adds by default.
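To illustrate what a caret range such as ^1.0 admits, here is a minimal sketch of the comparison for versions with a major of 1 or higher. The helper names parseVersion and satisfiesCaret are hypothetical (they are not part of npm or of this library), and the sketch ignores pre-release tags and the special rules for 0.x versions.

```javascript
// Hypothetical helpers: check whether `version` falls inside a caret range.
// Assumes plain "major.minor.patch" strings with a major version >= 1.
function parseVersion(text) {
  const [major = 0, minor = 0, patch = 0] = text.split('.').map(Number)
  return {major, minor, patch}
}

function satisfiesCaret(version, range) {
  const wanted = parseVersion(range.replace(/^\^/, ''))
  const actual = parseVersion(version)
  // A caret range never crosses a major version boundary
  if (actual.major !== wanted.major) return false
  if (actual.minor !== wanted.minor) return actual.minor > wanted.minor
  return actual.patch >= wanted.patch
}

console.log(satisfiesCaret('1.4.2', '^1.0')) // true
console.log(satisfiesCaret('2.0.0', '^1.0')) // false
```

A range like this accepts new features and fixes within the same major version but stops at the next major release, where breaking changes may appear.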
$ npm install datapackage@latest # v1.0
$ npm install datapackage # v0.8
<script src="//unpkg.com/datapackage/dist/datapackage.min.js"></script>
Let's start with a simple example for Node.js:
const {Package} = require('datapackage')
const descriptor = {
resources: [
{
name: 'example',
profile: 'tabular-data-resource',
data: [
['height', 'age', 'name'],
['180', '18', 'Tony'],
['192', '32', 'Jacob'],
],
schema: {
fields: [
{name: 'height', type: 'integer'},
{name: 'age', type: 'integer'},
{name: 'name', type: 'string'},
],
}
}
]
}
const dataPackage = await Package.load(descriptor)
const resource = dataPackage.getResource('example')
await resource.read() // [[180, 18, 'Tony'], [192, 32, 'Jacob']]
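The string values in data come back as numbers because each cell is cast according to its field type in schema. A minimal sketch of that idea, handling only the integer and string types used above; castRow is a hypothetical helper and not the library's implementation:

```javascript
// Hypothetical helper: casts one row of strings using Table Schema field types.
// Only 'integer' and 'string' are handled in this sketch.
function castRow(values, fields) {
  return values.map((value, index) => {
    const {name, type} = fields[index]
    if (type === 'integer') {
      if (!/^-?\d+$/.test(value)) {
        throw new Error(`Field "${name}": "${value}" is not an integer`)
      }
      return parseInt(value, 10)
    }
    // Strings are passed through unchanged
    return value
  })
}

const fields = [
  {name: 'height', type: 'integer'},
  {name: 'age', type: 'integer'},
  {name: 'name', type: 'string'},
]
console.log(castRow(['180', '18', 'Tony'], fields)) // [ 180, 18, 'Tony' ]
```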
And for the browser. After the script is registered, the library will be available as the global variable datapackage:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>datapackage-js</title>
</head>
<body>
<script src="//unpkg.com/datapackage/dist/datapackage.min.js"></script>
<script>
const main = async () => {
const resource = await datapackage.Resource.load({path: 'https://raw.githubusercontent.com/frictionlessdata/datapackage-js/master/data/data.csv'})
const rows = await resource.read()
document.body.innerHTML += `<div>${resource.headers}</div>`
for (const row of rows) {
document.body.innerHTML += `<div>${row}</div>`
}
}
main()
</script>
</body>
</html>
A class for working with data packages. It provides various capabilities such as loading a local or remote data package, inferring a data package descriptor, saving a data package descriptor and many more.
Consider we have some local csv files in a data directory. Let's create a data package based on this data using the Package class:
data/cities.csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,"41.89,12.51"
data/population.csv
city,year,population
london,2017,8780000
paris,2017,2240000
rome,2017,2860000
First we create a blank data package:
const dataPackage = await Package.load()
Now we're ready to infer a data package descriptor based on the data files we have. Because we have two csv files we use the glob pattern **/*.csv:
await dataPackage.infer('**/*.csv')
dataPackage.descriptor
//{ profile: 'tabular-data-package',
// resources:
// [ { path: 'data/cities.csv',
// profile: 'tabular-data-resource',
// encoding: 'utf-8',
// name: 'cities',
// format: 'csv',
// mediatype: 'text/csv',
// schema: [Object] },
// { path: 'data/population.csv',
// profile: 'tabular-data-resource',
// encoding: 'utf-8',
// name: 'population',
// format: 'csv',
// mediatype: 'text/csv',
// schema: [Object] } ] }
The infer method has found all our files and inspected them to extract useful metadata like profile, encoding, format, Table Schema etc. Let's tweak it a little bit:
dataPackage.descriptor.resources[1].schema.fields[1].type = 'year'
dataPackage.commit()
dataPackage.valid // true
Because our resources are tabular we can read them as tabular data:
await dataPackage.getResource('population').read({keyed: true})
//[ { city: 'london', year: 2017, population: 8780000 },
// { city: 'paris', year: 2017, population: 2240000 },
// { city: 'rome', year: 2017, population: 2860000 } ]
Let's save our descriptor to disk. After that we can update our datapackage.json as we want, make some changes etc:
await dataPackage.save('datapackage.json')
To continue working with the data package we just load it again, but this time using the local datapackage.json:
const dataPackage = await Package.load('datapackage.json')
// Continue the work
This was only a basic introduction to the Package class. To learn more, take a look at the Package class API reference.
A class for working with data resources. You can read or iterate tabular resources using the iter/read methods and any resource as bytes using the rawIter/rawRead methods.
Consider we have some local csv file. It could also be inline data or a remote link; all are supported by the Resource class (except local files for in-browser usage, of course). But say it's data.csv for now:
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
Let's create and read a resource. We use the static Resource.load method to instantiate a resource. Because the resource is tabular we can use the resource.read method with the keyed option to get an array of keyed rows:
const resource = await Resource.load({path: 'data.csv'})
resource.tabular // true
resource.headers // ['city', 'location']
await resource.read({keyed: true})
// [
// {city: 'london', location: '51.50,-0.11'},
// {city: 'paris', location: '48.85,2.30'},
// {city: 'rome', location: 'N/A'},
// ]
As we can see, our locations are just strings, but they should be geopoints. Also, Rome's location is not available, yet it is just an N/A string instead of a JavaScript null. First we have to infer resource metadata:
await resource.infer()
resource.descriptor
//{ path: 'data.csv',
// profile: 'tabular-data-resource',
// encoding: 'utf-8',
// name: 'data',
// format: 'csv',
// mediatype: 'text/csv',
// schema: { fields: [ [Object], [Object] ], missingValues: [ '' ] } }
await resource.read({keyed: true})
// Fails with a data validation error
Let's fix the unavailable location. There is a missingValues property in the Table Schema specification. As a first try we set missingValues to N/A in resource.descriptor.schema. The resource descriptor can be changed in-place, but all changes should be committed by resource.commit():
resource.descriptor.schema.missingValues = 'N/A'
resource.commit()
resource.valid // false
resource.errors
// Error: Descriptor validation error:
// Invalid type: string (expected array)
// at "/missingValues" in descriptor and
// at "/properties/missingValues/type" in profile
As good citizens we've decided to check our resource descriptor's validity. And it's not valid! We should use an array for the missingValues property. Also don't forget to include an empty string as a missing value:
resource.descriptor.schema['missingValues'] = ['', 'N/A']
resource.commit()
resource.valid // true
All good. It looks like we're ready to read our data again:
await resource.read({keyed: true})
// [
// {city: 'london', location: [51.50,-0.11]},
// {city: 'paris', location: [48.85,2.30]},
// {city: 'rome', location: null},
// ]
Now we see that:
- locations are arrays with numeric latitude and longitude
- Rome's location is a native JavaScript null
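The effect of missingValues on a geopoint field can be sketched in a few lines. This is a simplified illustration, assuming the "number,number" string form used in data.csv; castGeopoint is a hypothetical helper, not the library's own casting code.

```javascript
// Hypothetical helper: casts a "number,number" string to a numeric pair,
// returning null for any value listed in missingValues.
function castGeopoint(value, missingValues = ['']) {
  if (missingValues.includes(value)) return null
  const parts = value.split(',').map((part) => part.trim())
  // Number('') is 0, so empty parts are rejected explicitly
  const numbers = parts.map((part) => (part === '' ? NaN : Number(part)))
  if (numbers.length !== 2 || numbers.some(Number.isNaN)) {
    throw new Error(`"${value}" is not a valid geopoint`)
  }
  return numbers
}

const missingValues = ['', 'N/A']
console.log(castGeopoint('51.50,-0.11', missingValues)) // [ 51.5, -0.11 ]
console.log(castGeopoint('N/A', missingValues)) // null
```

With the default missingValues of [''], the same N/A cell would throw, which mirrors the data validation error seen before the tweak.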
And because there are no errors on data reading we can be sure that our data is valid against our schema. Let's save our resource descriptor:
await resource.save('dataresource.json')
Let's check the newly created dataresource.json. It contains the path to our data file, the inferred metadata and our missingValues tweak:
{
"path": "data.csv",
"profile": "tabular-data-resource",
"encoding": "utf-8",
"name": "data",
"format": "csv",
"mediatype": "text/csv",
"schema": {
"fields": [
{
"name": "city",
"type": "string",
"format": "default"
},
{
"name": "location",
"type": "geopoint",
"format": "default"
}
],
"missingValues": [
"",
"N/A"
]
}
}
If we decide to improve it even more we can update the dataresource.json file and then open it again. But this time let's read our resource as a byte stream:
const resource = await Resource.load('dataresource.json')
const stream = await resource.rawIter({stream: true})
stream.on('data', (data) => {
// handle data chunk as a Buffer
})
This was only a basic introduction to the Resource class. To learn more, take a look at the Resource class API reference.
A component to represent JSON Schema profile from Profiles Registry:
const profile = await Profile.load('data-package')
profile.name // data-package
profile.jsonschema // JSON Schema contents
const {valid, errors} = profile.validate(descriptor)
for (const error of errors) {
// inspect Error objects
}
A standalone function to validate a data package descriptor:
const {valid, errors} = await validate({name: 'Invalid Datapackage'})
for (const error of errors) {
// inspect Error objects
}
The library supports foreign keys as described in the Table Schema specification. This means that if your data package descriptor uses the resources[].schema.foreignKeys property for some resources, data integrity will be checked on reading operations.
Consider we have a data package:
const DESCRIPTOR = {
'resources': [
{
'name': 'teams',
'data': [
['id', 'name', 'city'],
['1', 'Arsenal', 'London'],
['2', 'Real', 'Madrid'],
['3', 'Bayern', 'Munich'],
],
'schema': {
'fields': [
{'name': 'id', 'type': 'integer'},
{'name': 'name', 'type': 'string'},
{'name': 'city', 'type': 'string'},
],
'foreignKeys': [
{
'fields': 'city',
'reference': {'resource': 'cities', 'fields': 'name'},
},
],
},
}, {
'name': 'cities',
'data': [
['name', 'country'],
['London', 'England'],
['Madrid', 'Spain'],
],
},
],
}
Let's check relations for the teams resource:
const {Package} = require('datapackage')
const package = await Package.load(DESCRIPTOR)
let teams = package.getResource('teams')
await teams.checkRelations()
// tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4"
As we can see, there is a foreign key violation. That's because our lookup table cities doesn't have the city of Munich but we have a team from there. We need to fix it in the cities resource:
package.descriptor['resources'][1]['data'].push(['Munich', 'Germany'])
package.commit()
teams = package.getResource('teams')
await teams.checkRelations()
// true
Fixed! But checking is not the only operation available. We can use the relations argument of the resource.iter/read methods to dereference a resource's relations:
await teams.read({keyed: true, relations: true})
//[{'id': 1, 'name': 'Arsenal', 'city': {'name': 'London', 'country': 'England'}},
// {'id': 2, 'name': 'Real', 'city': {'name': 'Madrid', 'country': 'Spain'}},
// {'id': 3, 'name': 'Bayern', 'city': {'name': 'Munich', 'country': 'Germany'}}]
Instead of a plain city name we've got an object containing the city data. The resource.iter/read methods will fail with the same error as resource.checkRelations if there is an integrity issue, but only if the relations: true flag is passed.
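The check and the dereferencing described above can be sketched for a single-field foreign key in plain JavaScript. The helpers toKeyedRows, findViolations and dereference are hypothetical names used for illustration; they work on inline data arrays and skip type casting, so ids stay strings here.

```javascript
// Hypothetical helpers sketching a single-field foreign key check.
// Data is an array of arrays whose first entry is the header row.
function toKeyedRows([headers, ...rows]) {
  return rows.map((row) => Object.fromEntries(headers.map((header, i) => [header, row[i]])))
}

// Rows whose `field` value has no match in the referenced table
function findViolations(data, field, referenceData, referenceField) {
  const known = new Set(toKeyedRows(referenceData).map((row) => row[referenceField]))
  return toKeyedRows(data).filter((row) => !known.has(row[field]))
}

// Replace each `field` value with the referenced row it points to
function dereference(data, field, referenceData, referenceField) {
  const lookup = new Map(toKeyedRows(referenceData).map((row) => [row[referenceField], row]))
  return toKeyedRows(data).map((row) => ({...row, [field]: lookup.get(row[field])}))
}

const teamsData = [['id', 'name', 'city'], ['1', 'Arsenal', 'London'], ['3', 'Bayern', 'Munich']]
const citiesData = [['name', 'country'], ['London', 'England']]
console.log(findViolations(teamsData, 'city', citiesData, 'name'))
// [ { id: '3', name: 'Bayern', city: 'Munich' } ]
```

Once a Munich row is added to the lookup table, findViolations returns an empty array and dereference can resolve every team's city.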
A standalone function to infer a data package descriptor.
const descriptor = await infer('**/*.csv')
//{ profile: 'tabular-data-package',
// resources:
// [ { path: 'data/cities.csv',
// profile: 'tabular-data-resource',
// encoding: 'utf-8',
// name: 'cities',
// format: 'csv',
// mediatype: 'text/csv',
// schema: [Object] },
// { path: 'data/population.csv',
// profile: 'tabular-data-resource',
// encoding: 'utf-8',
// name: 'population',
// format: 'csv',
// mediatype: 'text/csv',
// schema: [Object] } ] }
Package representation
- Package
  - instance
    - .valid ⇒ Boolean
    - .errors ⇒ Array.<Error>
    - .profile ⇒ Profile
    - .descriptor ⇒ Object
    - .resources ⇒ Array.<Resource>
    - .resourceNames ⇒ Array.<string>
    - .getResource(name) ⇒ Resource | null
    - .addResource(descriptor) ⇒ Resource
    - .removeResource(name) ⇒ Resource | null
    - .infer(pattern) ⇒ Object
    - .commit(strict) ⇒ Boolean
    - .save(target) ⇒ boolean
  - static
    - .load(descriptor, basePath, strict) ⇒ Package
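package.valid ⇒ Boolean
Validation status. It is always true in strict mode.
Returns: Boolean - returns validation status
package.errors ⇒ Array.<Error>
Validation errors. It is always empty in strict mode.
Returns: Array.<Error> - returns validation errors
package.profile ⇒ Profile
Profile
package.descriptor ⇒ Object
Descriptor
Returns: Object - schema descriptor
package.resources ⇒ Array.<Resource>
Resources
package.resourceNames ⇒ Array.<string>
Resource names
package.getResource(name) ⇒ Resource | null
Return a resource
Returns: Resource | null - resource instance if exists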
Param | Type |
---|---|
name | string |
package.addResource(descriptor) ⇒ Resource
Add a resource
Returns: Resource - added resource instance
Param | Type |
---|---|
descriptor | Object |
package.removeResource(name) ⇒ Resource | null
Remove a resource
Returns: Resource | null - removed resource instance if exists
Param | Type |
---|---|
name | string |
package.infer(pattern) ⇒ Object
Infer metadata
Param | Type | Default |
---|---|---|
pattern | string | false |
package.commit(strict) ⇒ Boolean
Update package instance if there are in-place changes in the descriptor.
Returns: Boolean - returns true on success and false if not modified
Throws:
- DataPackageError raises any error occurred in the process
Param | Type | Description |
---|---|---|
strict | boolean | alter strict mode for further work |
Example
const dataPackage = await Package.load({
name: 'package',
resources: [{name: 'resource', data: ['data']}]
})
dataPackage.name // package
dataPackage.descriptor.name = 'renamed-package'
dataPackage.name // package
dataPackage.commit()
dataPackage.name // renamed-package
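The behaviour shown in this example can be modelled with a small class that reads from a snapshot of the descriptor. This is a sketch of the semantics only; CommittedDescriptor is a hypothetical class, and comparing JSON strings is an assumption made for brevity, not how the library detects changes.

```javascript
// Hypothetical class sketching commit semantics: the instance reads from a
// snapshot, so in-place edits to `descriptor` take effect only on commit().
class CommittedDescriptor {
  constructor(descriptor) {
    this.descriptor = descriptor
    this._snapshot = JSON.parse(JSON.stringify(descriptor))
  }

  get name() {
    return this._snapshot.name
  }

  commit() {
    const current = JSON.stringify(this.descriptor)
    if (current === JSON.stringify(this._snapshot)) return false // not modified
    this._snapshot = JSON.parse(current)
    return true
  }
}

const example = new CommittedDescriptor({name: 'package'})
example.descriptor.name = 'renamed-package'
console.log(example.name) // package
console.log(example.commit()) // true
console.log(example.name) // renamed-package
console.log(example.commit()) // false
```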
package.save(target) ⇒ boolean
Save data package to target destination.
If the target path has a zip file extension, the package will be zipped and saved entirely. If it has a json file extension, only the descriptor will be saved.
Returns: boolean - true on success
Throws:
- DataPackageError error if something goes wrong
Param | Type | Description |
---|---|---|
target | string | path where to save a data package |
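The extension rule for saving can be sketched as a small dispatch function. saveMode is a hypothetical helper, and treating any other extension as an error is an assumption of this sketch rather than documented behaviour.

```javascript
// Hypothetical helper: decides what a save would write based on the
// extension of the target path.
function saveMode(target) {
  const dot = target.lastIndexOf('.')
  const extension = dot === -1 ? '' : target.slice(dot)
  if (extension === '.zip') return 'package' // descriptor plus data, zipped
  if (extension === '.json') return 'descriptor' // descriptor only
  // Assumption: anything else is rejected in this sketch
  throw new Error(`Unsupported target extension: "${extension}"`)
}

console.log(saveMode('datapackage.json')) // descriptor
console.log(saveMode('datapackage.zip')) // package
```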
Package.load(descriptor, basePath, strict) ⇒ Package
Factory method to instantiate the Package class.
This method is async and should be used with the await keyword or as a Promise.
Returns: Package - returns data package class instance
Throws:
- DataPackageError raises error if something goes wrong
Param | Type | Description |
---|---|---|
descriptor | string or Object | package descriptor as local path, url or object. If the path has a zip file extension it will be unzipped to the temp directory first. |
basePath | string | base path for all relative paths |
strict | boolean | strict flag to alter validation behavior. Setting it to true leads to throwing errors on any operation with invalid descriptor |
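The difference the strict flag makes can be sketched with a toy validator: in non-strict mode problems are collected and reported, while in strict mode the first problem is thrown. checkDescriptor is a hypothetical function, and the single rule it checks (a non-empty resources array) stands in for full profile validation.

```javascript
// Hypothetical validator sketching strict versus non-strict behaviour.
function checkDescriptor(descriptor, {strict = false} = {}) {
  const errors = []
  if (!Array.isArray(descriptor.resources) || descriptor.resources.length === 0) {
    errors.push(new Error('Descriptor must have a non-empty "resources" array'))
  }
  // Strict mode throws instead of reporting, so a returned result is always valid
  if (strict && errors.length > 0) throw errors[0]
  return {valid: errors.length === 0, errors}
}

console.log(checkDescriptor({name: 'empty'}).valid) // false
```

This also shows why valid is always true and errors is always empty in strict mode: an invalid descriptor never gets as far as returning a result.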
Resource representation
- Resource
  - instance
    - .valid ⇒ Boolean
    - .errors ⇒ Array.<Error>
    - .profile ⇒ Profile
    - .descriptor ⇒ Object
    - .name ⇒ string
    - .inline ⇒ boolean
    - .local ⇒ boolean
    - .remote ⇒ boolean
    - .multipart ⇒ boolean
    - .tabular ⇒ boolean
    - .source ⇒ Array | string
    - .headers ⇒ Array.<string>
    - .schema ⇒ tableschema.Schema
    - .iter(keyed, extended, cast, forceCast, relations, stream) ⇒ AsyncIterator | Stream
    - .read(limit) ⇒ Array.<Array> | Array.<Object>
    - .checkRelations() ⇒ boolean
    - .rawIter(stream) ⇒ Iterator | Stream
    - .rawRead() ⇒ Buffer
    - .infer() ⇒ Object
    - .commit(strict) ⇒ boolean
    - .save(target) ⇒ boolean
  - static
    - .load(descriptor, basePath, strict) ⇒ Resource
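resource.valid ⇒ Boolean
Validation status. It is always true in strict mode.
Returns: Boolean - returns validation status
resource.errors ⇒ Array.<Error>
Validation errors. It is always empty in strict mode.
Returns: Array.<Error> - returns validation errors
resource.profile ⇒ Profile
Profile
resource.descriptor ⇒ Object
Descriptor
Returns: Object - schema descriptor
resource.name ⇒ string
Name
resource.inline ⇒ boolean
Whether resource is inline
resource.local ⇒ boolean
Whether resource is local
resource.remote ⇒ boolean
Whether resource is remote
resource.multipart ⇒ boolean
Whether resource is multipart
resource.tabular ⇒ boolean
Whether resource is tabular
resource.source ⇒ Array | string
Source
The combination of resource.source and resource.inline/local/remote/multipart provides a predictable interface for working with resource data.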
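How these flags could combine can be sketched with a small classifier over a resource descriptor. classifySource is a hypothetical helper; detecting remote paths by an http or https prefix and treating a multi-item path array as multipart are simplifying assumptions of the sketch.

```javascript
// Hypothetical helper: classifies a resource descriptor by where its data lives.
function classifySource(descriptor) {
  const isRemote = (path) => /^https?:\/\//.test(path)
  if (descriptor.data !== undefined) {
    // Inline data is its own source
    return {source: descriptor.data, inline: true, local: false, remote: false, multipart: false}
  }
  const paths = Array.isArray(descriptor.path) ? descriptor.path : [descriptor.path]
  const remote = paths.every(isRemote)
  return {
    source: paths.length > 1 ? paths : paths[0],
    inline: false,
    local: !remote,
    remote,
    multipart: paths.length > 1,
  }
}

console.log(classifySource({path: 'data.csv'}).local) // true
console.log(classifySource({path: ['part1.csv', 'part2.csv']}).multipart) // true
```

A caller can then branch on the flags first and only afterwards decide how to interpret source (inline rows, a file path, a URL or a list of parts).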
resource.headers ⇒ Array.<string>
Headers
Only for tabular resources
Returns: Array.<string> - data source headers
resource.schema ⇒ tableschema.Schema
Schema
Only for tabular resources
resource.iter(keyed, extended, cast, forceCast, relations, stream) ⇒ AsyncIterator | Stream
Iterate through the table data
Only for tabular resources
Emits rows cast based on the table schema (async for loop). With the stream flag, a Node stream will be returned instead of an async iterator. Data casting can be disabled.
Returns: AsyncIterator | Stream - async iterator/stream of rows:
- [value1, value2] - base
- {header1: value1, header2: value2} - keyed
- [rowNumber, [header1, header2], [value1, value2]] - extended
Throws:
- TableSchemaError raises any error occurred in this process
Param | Type | Description |
---|---|---|
keyed | boolean | iter keyed rows |
extended | boolean | iter extended rows |
cast | boolean | disable data casting if false |
forceCast | boolean | instead of raising on the first row with cast error return an error object to replace failed row. It will allow to iterate over the whole data file even if it's not compliant to the schema. Example of output stream: [['val1', 'val2'], TableSchemaError, ['val3', 'val4'], ...] |
relations | boolean | if true foreign key fields will be checked and resolved to its references |
stream | boolean | return Node Readable Stream of table rows |
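The three row shapes selected by the keyed and extended options can be sketched as follows. shapeRow is a hypothetical helper for illustration; giving keyed precedence when both flags are set is an assumption of the sketch.

```javascript
// Hypothetical helper: builds one output row in the base, keyed or extended shape.
function shapeRow(rowNumber, headers, values, {keyed = false, extended = false} = {}) {
  if (keyed) return Object.fromEntries(headers.map((header, i) => [header, values[i]]))
  if (extended) return [rowNumber, headers, values]
  return values
}

const headers = ['city', 'year']
const values = ['london', 2017]
console.log(shapeRow(2, headers, values)) // [ 'london', 2017 ]
console.log(shapeRow(2, headers, values, {keyed: true})) // { city: 'london', year: 2017 }
console.log(shapeRow(2, headers, values, {extended: true}))
// [ 2, [ 'city', 'year' ], [ 'london', 2017 ] ]
```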
resource.read(limit) ⇒ Array.<Array> | Array.<Object>
Read the table data into memory
Only for tabular resources; the API is the same as resource.iter has, except for:
Returns: Array.<Array> | Array.<Object> - list of rows:
- [value1, value2] - base
- {header1: value1, header2: value2} - keyed
- [rowNumber, [header1, header2], [value1, value2]] - extended
Param | Type | Description |
---|---|---|
limit | integer | limit of rows to read |
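Conceptually, reading is iteration collected into an array, with limit cutting the loop short. A minimal sketch using a synchronous generator for brevity (the library's iterator is asynchronous); readRows and exampleRows are hypothetical names.

```javascript
// Hypothetical helper: collects rows from any iterable, stopping at `limit`.
function readRows(rows, {limit} = {}) {
  const result = []
  for (const row of rows) {
    if (limit !== undefined && result.length >= limit) break
    result.push(row)
  }
  return result
}

// Stand-in for a row iterator
function* exampleRows() {
  yield ['london', 2017]
  yield ['paris', 2017]
  yield ['rome', 2017]
}

console.log(readRows(exampleRows(), {limit: 2}).length) // 2
console.log(readRows(exampleRows()).length) // 3
```

Because everything ends up in memory, iterating is usually the better fit for large files, while reading is convenient for small ones.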
resource.checkRelations() ⇒ boolean
It checks foreign keys and raises an exception if there are integrity issues.
Only for tabular resources
Returns: boolean - returns true if no issues
Throws:
- DataPackageError raises if there are integrity issues
resource.rawIter(stream) ⇒ Iterator | Stream
Iterate over data chunks as bytes. If stream is true, a Node Stream will be returned.
Returns: Iterator | Stream - returns Iterator/Stream
Param | Type | Description |
---|---|---|
stream | boolean | Node Stream will be returned |
resource.rawRead() ⇒ Buffer
Returns resource data as bytes.
Returns: Buffer - returns Buffer with resource data
resource.infer() ⇒ Object
Infer resource metadata like name, format, mediatype, encoding, schema and profile.
It commits these changes into the resource instance.
Returns: Object - returns resource descriptor
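Part of this metadata can be derived from the file path alone. The sketch below shows that step only; inferFromPath is a hypothetical helper, its media type table covers just csv and json, and encoding, schema and profile inference (which need the file contents) are left out.

```javascript
// Hypothetical helper: derives name, format and mediatype from a file path.
function inferFromPath(path) {
  const fileName = path.split('/').pop()
  const dot = fileName.lastIndexOf('.')
  const name = dot === -1 ? fileName : fileName.slice(0, dot)
  const format = dot === -1 ? undefined : fileName.slice(dot + 1).toLowerCase()
  // Only two media types are mapped in this sketch
  const mediatypes = {csv: 'text/csv', json: 'application/json'}
  return {path, name: name.toLowerCase(), format, mediatype: mediatypes[format]}
}

const inferred = inferFromPath('data/cities.csv')
console.log(inferred.name) // cities
console.log(inferred.format) // csv
console.log(inferred.mediatype) // text/csv
```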
resource.commit(strict) ⇒ boolean
Update resource instance if there are in-place changes in the descriptor.
Returns: boolean - returns true on success and false if not modified
Throws:
- DataPackageError raises error if something goes wrong
Param | Type | Description |
---|---|---|
strict | boolean | alter strict mode for further work |
resource.save(target) ⇒ boolean
Save resource to target destination.
For now only descriptor will be saved.
Returns: boolean - returns true on success
Throws:
- DataPackageError raises error if something goes wrong
Param | Type | Description |
---|---|---|
target | string | path where to save a resource |
Resource.load(descriptor, basePath, strict) ⇒ Resource
Factory method to instantiate the Resource class.
This method is async and should be used with the await keyword or as a Promise.
Returns: Resource - returns resource class instance
Throws:
- DataPackageError raises error if something goes wrong
Param | Type | Description |
---|---|---|
descriptor | string or Object | resource descriptor as local path, url or object |
basePath | string | base path for all relative paths |
strict | boolean | strict flag to alter validation behavior. Setting it to true leads to throwing errors on any operation with invalid descriptor |
Profile representation
- Profile
  - instance
    - .name ⇒ string
    - .jsonschema ⇒ Object
    - .validate(descriptor) ⇒ Object
  - static
    - .load(profile) ⇒ Profile
profile.name ⇒ string
Name
profile.jsonschema ⇒ Object
JsonSchema
profile.validate(descriptor) ⇒ Object
Validate a data package descriptor against the profile.
Returns: Object - returns a {valid, errors} object
Param | Type | Description |
---|---|---|
descriptor | Object | retrieved and dereferenced data package descriptor |
Profile.load(profile) ⇒ Profile
Factory method to instantiate the Profile class.
This method is async and should be used with the await keyword or as a Promise.
Returns: Profile - returns profile class instance
Throws:
- DataPackageError raises error if something goes wrong
Param | Type | Description |
---|---|---|
profile | string | profile name in registry or URL to JSON Schema |
validate(descriptor) ⇒ Object
This function is async so it has to be used with the await keyword or as a Promise.
Returns: Object - returns a {valid, errors} object
Param | Type | Description |
---|---|---|
descriptor | string or Object | data package descriptor (local/remote path or object) |
infer(pattern) ⇒ Object
This function is async so it has to be used with the await keyword or as a Promise.
Returns: Object - returns data package descriptor
Param | Type | Description |
---|---|---|
pattern | string | glob file pattern |
DataPackageError
Base class for all DataPackage errors.
TableSchemaError
Base class for all TableSchema errors.
The project follows the Open Knowledge International coding standards. There are common commands to work with the project:
$ npm install
$ npm run test
$ npm run build
Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.
Updated behaviour:
- Resource's escapeChar and quoteChar are mutually exclusive now
New API added:
- Added support of zip files for data packages
- Added support of format/encoding/dialect for resources
This version includes various big changes. A migration guide is under development and will be published here.
First stable version of the library.