A CKAN extension that performs stricter validation of resource formats for uploaded files, ensuring that the file extension, file contents, and resource format are all compatible with each other.
- Reduces the workload on back-of-house staff who would otherwise have to fix the format selection on miscategorised files.
- Better restrictions on allowed formats by also running them through magic/type sniffing systems. This ensures that an invalid file can't be uploaded by selecting a random format and changing the file extension.
It is also possible to specify whitelists of allowed file extensions and/or allowed MIME types. Future development may allow a blacklist, but this is harder to make reliable.
See the configuration file for more details.
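As a rough illustration of the three-way check described above, the following Python sketch compares the type implied by the file extension, a type sniffed from the file contents, and the type implied by the resource format. The signature table, the format mapping and the function names are all hypothetical. The extension itself relies on a magic/type sniffing system and supports overrides between types, neither of which is modelled here.

```python
import mimetypes

# Hypothetical signature table, standing in for a real magic/type sniffing library.
MAGIC_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}

# Hypothetical mapping from resource format labels to MIME types.
FORMAT_TYPES = {
    "PDF": "application/pdf",
    "ZIP": "application/zip",
    "TXT": "text/plain",
}


def sniff_type(content):
    """Guess a MIME type from the leading bytes of the file contents."""
    for signature, mime_type in MAGIC_SIGNATURES.items():
        if content.startswith(signature):
            return mime_type
    try:
        content.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def is_consistent(filename, content, resource_format):
    """Return True only if extension, contents and format all map to one type."""
    extension_type, _ = mimetypes.guess_type(filename)
    format_type = FORMAT_TYPES.get(resource_format.upper())
    sniffed_type = sniff_type(content)
    return (
        extension_type is not None
        and extension_type == sniffed_type == format_type
    )
```

With this sketch, a file named report.pdf that actually contains plain text is rejected even if "PDF" is selected as the format, because the sniffed type disagrees with the other two.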
To install ckanext-resource-type-validation:
- Install CKAN <= 2.8. CKAN 2.9 compatibility is not yet verified.
- Activate your CKAN virtual environment, eg:
      . /usr/lib/ckan/default/bin/activate
- Install the extension into your virtual environment:
      pip install -e git+https://github.com/qld-gov-au/ckanext-resource-type-validation.git#egg=ckanext-resource-type-validation
- Install the extension dependencies:
      pip install -r ckanext-resource-type-validation/requirements.txt
- Add resource_type_validation to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/production.ini).
- Restart CKAN. Eg if you've deployed CKAN with Apache on Ubuntu:
      sudo service apache2 reload
ckan.plugins = resource_type_validation
# Path to the configuration file for specifying file types and their
# relationships. Defaults to built-in
# ckanext/resource_type_validation/resources/resource_types.json
ckanext.resource_validation.types_file = /path/to/file.json
# Support contact to list in any error messages
ckanext.resource_validation.support_contact = webmaster@example.com
# Whitelist of allowed mimetypes
ckan.mimetypes_allowed = application/pdf,text/plain,text/xml
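To make the whitelist setting concrete, here is a minimal Python sketch of how a comma-separated value such as the one above could be turned into a set and checked. The function names are hypothetical, and the trimming and case-folding behaviour is an assumption rather than a description of how the extension parses the setting.

```python
def parse_allowed_mimetypes(raw_value):
    """Split a comma-separated config value into a set of lower-cased MIME types."""
    return {item.strip().lower() for item in raw_value.split(",") if item.strip()}


def is_mimetype_allowed(mime_type, allowed):
    """Return True if the MIME type appears in the parsed whitelist."""
    return mime_type.strip().lower() in allowed


# Value copied from the example configuration above.
allowed = parse_allowed_mimetypes("application/pdf,text/plain,text/xml")
```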
The configuration file can contain the following, all optional and in any order:
- allowed_extensions: A list of allowed file extensions, case-insensitive. If this is not specified, any extension is allowed.
- allowed_overrides: A dictionary specifying which MIME types are treated as subtypes of others, eg application/xml is a subtype of text/plain, and anything is a subtype of application/octet-stream. So, a file named example.xml with content that looks like text/plain, and a specified format of "XML", would be accepted. Wildcards are partially supported; an override can be a single asterisk to allow any other type to be a subtype (typically used for application/octet-stream), or it can have the form prefix/* to allow any type with that prefix to be a subtype (eg text/* can override text/plain).
- equal_types: A list of lists of types that are interchangeable, eg text/xml is the same as application/xml. This can be used in a similar manner to allowed_overrides, but will affect the resulting displayed format for the resource. Overrides attempt to use the most specific subtype, whereas equal types take whichever is encountered first. For example, a file named example.rdf and containing XML data, with application/rdf+xml as an override for application/xml, would have a resource mimetype of application/rdf+xml, but if application/xml and application/rdf+xml are configured as equal types, then the resource mimetype might be simply application/xml.
- archive_types: A list of types that are archives and require special handling, eg application/zip. Archives can specify any resource format (since the format might refer to the archive contents), so long as the archive is well-formed (file extension and contents match).
- generic_types: A list of types that are 'generic', ie supertype to many others (eg text/plain and application/octet-stream). File contents of these types can be overridden with a subtype, but if the file extension or format matches them, then that cannot be overridden. Eg a file with text/plain content could specify a CSV extension and format, but a file with .txt extension could not specify a "CSV" format. Similarly, a resource with "TXT" format could not have a .xml extension. This is intended to prevent browser-based content-sniffing attacks, where a file with an innocuous extension like .txt may be handled in a different way by the browser based on the apparent type of its contents.
- extra_mimetypes: A dictionary of additional mappings to add to the Python mimetypes library for guessing types based on file extensions. For example, a site that expects to upload Quartus Tabular Text Files might define the .ttf extension to have text/plain MIME type.
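The options above can be sketched in Python as follows. The key names come from the list above, but the value shapes (in particular for allowed_overrides and extra_mimetypes) are assumptions, as are the helper function names; consult the built-in resource_types.json for the authoritative layout. The sketch shows one possible reading of the wildcard rules and uses the standard mimetypes.add_type call to register an extra extension.

```python
import json
import mimetypes

# Hypothetical example content; the value shapes are assumed, not confirmed.
EXAMPLE_TYPES = {
    "allowed_extensions": ["csv", "pdf", "txt", "xml"],
    "allowed_overrides": {
        "application/octet-stream": ["*"],
        "text/plain": ["text/*", "application/xml"],
    },
    "equal_types": [["text/xml", "application/xml"]],
    "archive_types": ["application/zip"],
    "generic_types": ["text/plain", "application/octet-stream"],
    "extra_mimetypes": {"text/plain": [".ttf"]},
}


def matches_override(candidate, pattern):
    """Check one override pattern: '*', 'prefix/*', or an exact MIME type."""
    if pattern == "*":
        return True
    if pattern.endswith("/*"):
        # Keep the trailing slash so 'text/*' does not match 'textual/foo'.
        return candidate.startswith(pattern[:-1])
    return candidate == pattern


def can_override(supertype, candidate, overrides):
    """Return True if `candidate` may be treated as a subtype of `supertype`."""
    patterns = overrides.get(supertype, [])
    return any(matches_override(candidate, pattern) for pattern in patterns)


def register_extra_mimetypes(extra):
    """Register extra extension mappings with the standard mimetypes module."""
    for mime_type, extensions in extra.items():
        for extension in extensions:
            mimetypes.add_type(mime_type, extension)


# Serialised form, as it might be written to the file named by
# ckanext.resource_validation.types_file.
types_json = json.dumps(EXAMPLE_TYPES, indent=4)
```

Under these assumptions, can_override("text/plain", "application/xml", EXAMPLE_TYPES["allowed_overrides"]) is true, while an image type would not be accepted as a subtype of text/plain.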
To run the tests:
- Activate your CKAN virtual environment, eg:
      . /usr/lib/ckan/default/bin/activate
- Switch to the extension directory, eg:
      cd /usr/lib/ckan/default/src/ckanext-resource-type-validation
- Run the tests. This can be done in multiple ways.
  - Execute the test class directly:
        python ckanext/resource_type_validation/test_mime_type_validation.py
  - Run nosetests
  - Alternative testing with Docker:
        docker run -it -v $(pwd):/home/runner/work openknowledge/ckan-dev:2.9 bash -x /home/runner/work/test.sh