Separate the data models of EduSharing Service and Crawler Implementation #54

MRuecklCC · 2022-07-20T15:53:46Z

Currently, the data model that is used throughout the whole repository is defined in items.py. This data model is neither really equivalent to the EduSharing service's data model (which is why there are so many transformations in es_connector.py), nor does it really cover what most of the crawlers produce.

I propose to:

define a data model that is oriented towards the problem domain (e.g. it only contains the metadata that is actually relevant for the end users)
define this model with either Python dataclasses or pydantic (the latter adds automatic validation based on the type annotations)
define a second data model which exactly reflects the EduSharing service's requirements
write one big transformation step (possibly as a pipeline) to transform from one into the other.
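
A minimal sketch of how the proposal could look, using only standard-library dataclasses. The class names `LearningResource` and `EduSharingNode`, the function `to_edu_sharing`, the chosen fields, and the property keys are assumptions made for illustration; they are not taken from items.py or es_connector.py. Since plain dataclasses do not validate anything by themselves, the sketch checks the essentials by hand in `__post_init__`; pydantic would derive such checks from the annotations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LearningResource:
    """Domain-side model (hypothetical): only metadata relevant to end users."""

    title: str
    url: str
    description: str = ""
    keywords: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Manual validation, because dataclasses do not enforce annotations.
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"url must be absolute: {self.url!r}")


@dataclass(frozen=True)
class EduSharingNode:
    """Service-side model (hypothetical): assumed to be a map of multi-valued properties."""

    properties: Dict[str, List[str]] = field(default_factory=dict)


def to_edu_sharing(resource: LearningResource) -> EduSharingNode:
    """The single transformation step from the domain model to the service model."""
    # The property keys below are illustrative assumptions.
    props: Dict[str, List[str]] = {
        "cclom:title": [resource.title],
        "ccm:wwwurl": [resource.url],
    }
    # Optional fields are only emitted when they carry a value.
    if resource.description:
        props["cclom:general_description"] = [resource.description]
    if resource.keywords:
        props["cclom:general_keyword"] = list(resource.keywords)
    return EduSharingNode(properties=props)
```

With this split, a crawler would only ever construct `LearningResource` objects, and everything specific to the service would live in `to_edu_sharing`.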

This will

remove a lot of complexity from the crawler/spider layer of the repository
make debugging of mapping / transformation issues a lot easier
allow for automatic, spider-independent unit tests (we can simply populate the input data model and validate the transformed output)
remove complexity from the es_connector.py classes / files
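
To illustrate the testing point, here is a rough sketch of a spider-independent unit test. The `transform` function is a deliberately tiny, hypothetical stand-in for the transformation step, and its property keys are assumptions; the point is only that such a test needs neither a running spider nor a network connection.

```python
import unittest


def transform(item: dict) -> dict:
    """Hypothetical stand-in for the transformation into the service model."""
    return {
        "cclom:title": [item["title"].strip()],
        "ccm:wwwurl": [item["url"]],
    }


class TransformTest(unittest.TestCase):
    def test_title_is_stripped_and_wrapped(self):
        # Populate the input model by hand, then check the transformed output.
        out = transform({"title": "  Photosynthesis ", "url": "https://example.org/p"})
        self.assertEqual(out["cclom:title"], ["Photosynthesis"])

    def test_missing_url_is_reported(self):
        # A mapping problem surfaces here instead of deep inside a crawl.
        with self.assertRaises(KeyError):
            transform({"title": "Photosynthesis"})


def run_tests() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TransformTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` runs both cases; in a real repository the test case would more likely be collected by the project's test runner.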

MRuecklCC · 2022-07-21T16:08:58Z

Summary of the discussion so far:

This is quite significant work, because all crawlers might need to be touched.
It is potentially obsolete in the near future, if the target data model needs to be changed to a new format.
