Separate the data models of EduSharing Service and Crawler Implementation #54

Open
MRuecklCC opened this issue Jul 20, 2022 · 1 comment

Comments

@MRuecklCC

Currently, the data model that is used throughout the whole repository is defined in items.py. This data model is neither really equivalent to the EduSharing service's data model (which is why there are so many transformations in es_connector.py), nor does it really cover what most of the crawlers produce.

I propose to:

  • define a data model that is oriented towards the problem domain (e.g. only contains the metadata that is actually relevant for the end users)
  • define this model with either Python dataclasses or pydantic (which will allow automatic validation via type annotations)
  • define a second data model which exactly reflects the EduSharing service's requirements
  • write one big transformation step (possibly as a pipeline) to transform from one into the other.
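
A minimal sketch of what this could look like, using standard-library dataclasses. All names here (`LearningResource`, `EduSharingNode`, `to_edu_sharing`, and the property keys) are hypothetical placeholders and do not reflect the actual fields in items.py or the real EduSharing schema:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class LearningResource:
    """Domain-oriented model: only metadata relevant for end users (hypothetical fields)."""

    title: str
    url: str
    keywords: tuple[str, ...] = ()
    language: str = "de"

    def __post_init__(self) -> None:
        # Plain dataclasses do not validate anything by themselves, so the
        # checks are written by hand here. With pydantic, type validation
        # would be derived from the annotations instead.
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute http(s) URL: {self.url!r}")


@dataclass(frozen=True)
class EduSharingNode:
    """Target model. The property keys used below are placeholders, not the real API schema."""

    properties: dict[str, list[str]] = field(default_factory=dict)


def to_edu_sharing(resource: LearningResource) -> EduSharingNode:
    """The single transformation step from the domain model to the target model."""
    return EduSharingNode(
        properties={
            "title": [resource.title],
            "location": [resource.url],
            "keywords": list(resource.keywords),
            "language": [resource.language],
        }
    )
```

With this split, a spider would only need to construct a `LearningResource`; everything specific to the target service would live in `to_edu_sharing`.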

This will

  • remove a lot of complexity from the crawler/spider layer of the repository
  • make debugging of mapping / transformation issues a lot easier
  • allow for automatic and spider-independent unit tests (we can simply populate the input data model and validate the transformed output)
  • remove complexity from the es_connector.py classes / files
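
To illustrate the testing point, a spider-independent unit test might look roughly like this. `CrawledItem` and `transform` are hypothetical stand-ins for the input data model and the transformation step, and the output keys are placeholders; no crawler or network access is involved:

```python
from __future__ import annotations

import unittest
from dataclasses import dataclass


@dataclass
class CrawledItem:
    """Stand-in for the input model that a spider would populate (hypothetical)."""

    title: str
    license_url: str


def transform(item: CrawledItem) -> dict[str, list[str]]:
    """Stand-in for the transformation into the target format (placeholder keys)."""
    return {"title": [item.title.strip()], "license": [item.license_url]}


class TransformTest(unittest.TestCase):
    def test_title_is_stripped(self) -> None:
        # Populate the input model directly instead of running a spider.
        item = CrawledItem(title="  Algebra basics ", license_url="https://example.org/cc-by")
        self.assertEqual(transform(item)["title"], ["Algebra basics"])

    def test_license_is_passed_through(self) -> None:
        item = CrawledItem(title="Algebra basics", license_url="https://example.org/cc-by")
        self.assertEqual(transform(item)["license"], ["https://example.org/cc-by"])
```

Such tests can be run with `python -m unittest` and would stay valid regardless of which spider produced the input.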
@MRuecklCC
Author

Summary of some discussion:

  • Quite significant work (because all crawlers might need to be touched)
  • Potentially obsolete in the near future, when the target data model needs to be changed to a new format.
