Skip to content

Commit

Permalink
Data: allow source to be passed as a keyword argument (#5163)
Browse files Browse the repository at this point in the history
The `Data` class defines the `source` property which allows to get and
set the `source` attribute. It was designed to hold information on the
source of the data if it came directly from an external database.

Prior to `aiida-core` v1.0 it was possible to set this attribute
directly through the constructor as a keyword argument, but due to the
refactoring of the ORM, this was no longer possible. It went undetected
because it was never tested and the two data types where this feature is
typically used, the `UpfData` and `CifData`, implemented the keyword
argument themselves, hiding the fact that `Data` no longer supported it.

The `Data` constructor signature is updated to once again recognize the
`source` keyword argument, and the custom implementation is removed from
the `UpfData` and `CifData` plugins which now simply rely on the super
class.

A section is added to the "How to work with Data" section in the docs to
document this feature and to recommend its usage when constructing data
nodes by hand from external databases.
  • Loading branch information
sphuber authored Oct 6, 2021
1 parent 5480267 commit 83155d2
Show file tree
Hide file tree
Showing 8 changed files with 85 additions and 55 deletions.
16 changes: 1 addition & 15 deletions aiida/orm/nodes/data/cif.py
Original file line number Diff line number Diff line change
Expand Up @@ -262,25 +262,14 @@ class CifData(SinglefileData):
_values = None
_ase = None

def __init__(
self,
ase=None,
file=None,
filename=None,
values=None,
source=None,
scan_type=None,
parse_policy=None,
**kwargs
):
def __init__(self, ase=None, file=None, filename=None, values=None, scan_type=None, parse_policy=None, **kwargs):
"""Construct a new instance and set the contents to that of the file.
:param file: an absolute filepath or filelike object for CIF.
Hint: Pass io.BytesIO(b"my string") to construct the SinglefileData directly from a string.
:param filename: specify filename to use (defaults to name of provided file).
:param ase: ASE Atoms object to construct the CifData instance from.
:param values: PyCifRW CifFile object to construct the CifData instance from.
:param source:
:param scan_type: scan type string for parsing with PyCIFRW ('standard' or 'flex'). See CifFile.ReadCif
:param parse_policy: 'eager' (parse CIF file on set_file) or 'lazy' (defer parsing until needed)
"""
Expand All @@ -301,9 +290,6 @@ def __init__(
self.set_scan_type(scan_type or CifData._SCAN_TYPE_DEFAULT)
self.set_parse_policy(parse_policy or CifData._PARSE_POLICY_DEFAULT)

if source is not None:
self.set_source(source)

if ase is not None:
self.set_ase(ase)

Expand Down
6 changes: 6 additions & 0 deletions aiida/orm/nodes/data/data.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,12 @@ class Data(Node):
_storable = True
_unstorable_message = 'storing for this node has been disabled'

def __init__(self, *args, source=None, **kwargs):
"""Construct a new instance, setting the ``source`` attribute if provided as a keyword argument."""
super().__init__(*args, **kwargs)
if source is not None:
self.source = source

def __copy__(self):
"""Copying a Data node is not supported, use copy.deepcopy or call Data.clone()."""
raise exceptions.InvalidOperation('copying a Data node is not supported, use copy.deepcopy')
Expand Down
13 changes: 0 additions & 13 deletions aiida/orm/nodes/data/upf.py
Original file line number Diff line number Diff line change
Expand Up @@ -253,19 +253,6 @@ def parse_upf(fname, check_filename=True):
class UpfData(SinglefileData):
"""`Data` sub class to represent a pseudopotential single file in UPF format."""

def __init__(self, file=None, filename=None, source=None, **kwargs):
"""Create UpfData instance from pseudopotential file.
:param file: filepath or filelike object of the UPF potential file to store.
Hint: Pass io.BytesIO(b"my string") to construct directly from a string.
:param filename: specify filename to use (defaults to name of provided file).
:param source: Dictionary with information on source of the potential (see ".source" property).
"""
# pylint: disable=redefined-builtin
super().__init__(file, filename=filename, **kwargs)
if source is not None:
self.set_source(source)

@classmethod
def get_or_create(cls, filepath, use_first=False, store_upf=True):
"""Get the `UpfData` with the same md5 of the given file, or create it if it does not yet exist.
Expand Down
1 change: 1 addition & 0 deletions aiida/plugins/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@
'SchedulerFactory',
'TransportFactory',
'WorkflowFactory',
'get_entry_points',
'load_entry_point',
'load_entry_point_from_string',
'parse_entry_point',
Expand Down
2 changes: 1 addition & 1 deletion aiida/plugins/entry_point.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@

from aiida.common.exceptions import LoadingEntryPointError, MissingEntryPointError, MultipleEntryPointError

__all__ = ('load_entry_point', 'load_entry_point_from_string', 'parse_entry_point')
__all__ = ('load_entry_point', 'load_entry_point_from_string', 'parse_entry_point', 'get_entry_points')

ENTRY_POINT_GROUP_PREFIX = 'aiida.'
ENTRY_POINT_STRING_SEPARATOR = ':'
Expand Down
35 changes: 35 additions & 0 deletions docs/source/howto/data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,41 @@ If none of the currently available data types, as listed by ``verdi plugin list`
For details refer to the next section :ref:`"How to add support for custom data types"<topics:data_types:plugin>`.


.. _how-to:data:import:provenance:

Provenance
----------

While AiiDA will automatically keep the provenance of data that is created by it through calculations and workflows, this is clearly not the case when creating data nodes manually, as described in the previous section.
Typically, the manual creation of data happens at the beginning of a project when data from external databases is imported as a starting point for further calculations.
To still keep some form of provenance, the :class:`~aiida.orm.nodes.data.Data` base class allows to record the _source_ of the data it contains.
When constructing a new data node, of any type, you can pass a dictionary with information of the source under the ``source`` keyword argument:

.. code-block:: python
data = Data(source={'uri': 'http://some.domain.org/files?id=12345', 'id': '12345'})
Once stored, this data can always be retrieved through the ``source`` property:

.. code-block:: python
data.source # Will return the ``source`` dictionary that was passed in the constructor, if any
The following list shows all the keys that are allowed to be set in the ``source`` dictionary:

* 'db_name': The name of the external database.
* 'db_uri': The base URI of the external database.
* 'uri': The exact URI of where the data can be retrieved. Ideally this is a persistent URI.
* 'id': The external ID with which the data is identified in the external database.
* 'version': The version of the data, if any.
* 'extras': Optional dictionary with other fields for source description.
* 'source_md5': MD5 checksum of the data.
* 'description': Human-readable free form description of the data's source.
* 'license': A string with the type of license that applies to the data, if any.

If any other keys are defined, an exception will be raised by the constructor.


.. _how-to:data:organize:

Organizing data
Expand Down
1 change: 1 addition & 0 deletions docs/source/nitpick-exceptions
Original file line number Diff line number Diff line change
Expand Up @@ -181,6 +181,7 @@ py:class Session
py:class Query
py:class BackendQueryBuilder
py:class importlib_metadata.EntryPoint
py:class importlib_metadata.EntryPoints
py:class Command

py:class BooleanClauseList
Expand Down
66 changes: 40 additions & 26 deletions tests/orm/data/test_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,24 +7,23 @@
# For further information on the license, see the LICENSE.txt file #
# For further information please visit http://www.aiida.net #
###########################################################################
"""Tests for the `Data` base class."""

# pylint: disable=redefined-outer-name
"""Tests for the :class:`aiida.orm.nodes.data.data:Data` class."""
import os

import numpy
import pytest

from aiida import orm
from aiida.backends.testbase import AiidaTestCase
from aiida import orm, plugins
from tests.static import STATIC_DIR


class TestData(AiidaTestCase):
"""Test for the `Data` base class."""
@pytest.fixture
@pytest.mark.usefixtures('clear_database_before_test')
def generate_class_instance():
"""Generate a dummy `Data` instance for the given sub class."""

@staticmethod
def generate_class_instance(data_class):
"""Generate a dummy `Data` instance for the given sub class."""
def _generate_class_instance(data_class):
if data_class is orm.CifData:
instance = data_class(file=os.path.join(STATIC_DIR, 'data', 'Si.cif'))
return instance
Expand Down Expand Up @@ -67,26 +66,41 @@ def generate_class_instance(data_class):
'for this data class, add a generator of a dummy instance here'.format(data_class)
)

# Tracked in issue #4281
@pytest.mark.flaky(reruns=2)
def test_data_exporters(self):
"""Verify that the return value of the export methods of all `Data` sub classes have the correct type.
return _generate_class_instance


@pytest.fixture(scope='function', params=plugins.get_entry_points('aiida.data'))
def data_plugin(request):
"""Fixture that parametrizes over all the registered entry points of the ``aiida.data`` entry point group."""
return request.param.load()


@pytest.mark.usefixtures('clear_database_before_test')
def test_constructor():
"""Test the constructor.
Specifically, verify that the ``source`` attribute can be set through a keyword argument.
"""
source = {'id': 1}
node = orm.Data(source=source)
assert isinstance(node, orm.Data)
assert node.source == source

It should be a tuple where the first should be a byte string and the second a dictionary.
"""
from aiida.plugins.entry_point import get_entry_points

for entry_point in get_entry_points('aiida.data'):
@pytest.mark.usefixtures('clear_database_before_test')
def test_data_exporters(data_plugin, generate_class_instance):
"""Verify that the return value of the export methods of all `Data` sub classes have the correct type.
data_class = entry_point.load()
export_formats = data_class.get_export_formats()
It should be a tuple where the first should be a byte string and the second a dictionary.
"""
export_formats = data_plugin.get_export_formats()

if not export_formats:
continue
if not export_formats:
return

instance = self.generate_class_instance(data_class)
instance = generate_class_instance(data_plugin)

for fileformat in export_formats:
content, dictionary = instance._exportcontent(fileformat) # pylint: disable=protected-access
self.assertIsInstance(content, bytes)
self.assertIsInstance(dictionary, dict)
for fileformat in export_formats:
content, dictionary = instance._exportcontent(fileformat) # pylint: disable=protected-access
assert isinstance(content, bytes)
assert isinstance(dictionary, dict)

1 comment on commit 83155d2

@github-actions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'pytest-benchmarks:ubuntu-18.04,django'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 2.

Benchmark suite Current: 83155d2 Previous: 5480267 Ratio
tests/benchmark/test_engine.py::test_workchain_local[basic-loop] 1.8475926434051226 iter/sec (stddev: 0.073821) 3.939000322730887 iter/sec (stddev: 0.057244) 2.13
tests/benchmark/test_engine.py::test_workchain_local[serial-wc-loop] 0.4840469274973136 iter/sec (stddev: 0.13224) 1.0119643034809538 iter/sec (stddev: 0.062996) 2.09
tests/benchmark/test_engine.py::test_workchain_daemon[threaded-calcjob-loop] 0.11455980629188288 iter/sec (stddev: 0.62339) 0.23008759838467183 iter/sec (stddev: 0.12608) 2.01
tests/benchmark/test_importexport.py::test_import[no-objects] 0.9325280990188525 iter/sec (stddev: 0.13076) 1.8653289856124962 iter/sec (stddev: 0.060490) 2.00
tests/benchmark/test_importexport.py::test_import[with-objects] 0.9061610709477295 iter/sec (stddev: 0.14016) 1.837667245498786 iter/sec (stddev: 0.042833) 2.03
tests/benchmark/test_nodes.py::test_store_backend 491.2941864754864 iter/sec (stddev: 0.0016128) 1311.8586722354637 iter/sec (stddev: 0.000046131) 2.67
tests/benchmark/test_nodes.py::test_store 107.6950364028528 iter/sec (stddev: 0.0038981) 266.27955614886065 iter/sec (stddev: 0.00024339) 2.47
tests/benchmark/test_nodes.py::test_store_with_object 48.49644674223217 iter/sec (stddev: 0.0047330) 104.18813886532936 iter/sec (stddev: 0.014142) 2.15
tests/benchmark/test_nodes.py::test_delete_backend 116.52076567541222 iter/sec (stddev: 0.0040502) 246.61716589957334 iter/sec (stddev: 0.00064268) 2.12
tests/benchmark/test_nodes.py::test_delete 23.94078732731629 iter/sec (stddev: 0.021801) 63.96033034992959 iter/sec (stddev: 0.0011335) 2.67
tests/benchmark/test_nodes.py::test_delete_with_object 23.40589224060146 iter/sec (stddev: 0.029405) 57.27830731291876 iter/sec (stddev: 0.017524) 2.45

This comment was automatically generated by workflow using github-action-benchmark.

CC: @chrisjsewell @giovannipizzi

Please sign in to comment.