Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite staff importers (UserImporter, EnrollmentImporter) #1634

Merged
merged 2 commits into from
Aug 29, 2022

Conversation

richardebeling
Copy link
Member

@richardebeling richardebeling commented Oct 15, 2021

Preparation to solving #1574.

Minimal implementation to pass all existing tests. Putting this here already to get some feedback on the overall architecture.

The idea is that importers start with the binary excel content and then put their current state through a series of steps that each gets them a bit closer to their result. On errors, it simply raises an error aborting the import. After each step, checkers can be run to add errors or warnings that are more user friendly. This hopefully allows to understand the existing checks and add new checks very easily.
I tried to separate and encapsulate the different concerns as much as possible: Logic required for an import, errors/warnings, message handling

Its probably best check out how user import and enrollment import files look like before looking at the code.

related code changes that need to be integrated:

@richardebeling richardebeling changed the title Rewrite staff importers Rewrite staff importers (UserImporter, EnrollmentImporter) Oct 15, 2021
@richardebeling richardebeling force-pushed the importer-refactoring branch 2 times, most recently from bfec022 to 02e5ee2 Compare October 16, 2021 18:40
Copy link
Member

@niklasmohrin niklasmohrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't read till the end, but here are some thoughts along the way

evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@karyon karyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • at 1400 lines, this is rather long. Consider splitting this up, e.g. putting every importer in its own file. Since this basically a full rewrite, there's no git history lost either ;)
  • The importer classes are currently only scopes for a set of methods, they are never instantiated. With separate files, there wouldn't be any need for the classes. I suggest to also mark all of their methods but the process one as private.

Generally, this is really a lot of code to understand. I know it was your goal to make this more easily extensible, but I think the current structure achieves this only partially: IF you figure out the structure, it's simple to not make any mistakes. BUT currently you need a lot of time to understand the structure, and people might not be willing to spend this time before extending this.

  • E.g., try to understand the CourseDataAdapter. It's only 14 LOC and a few comments, but it's using/mentioning a total of 7 classes, containing a finalize which does nothing (?), requires explanation why Exceptions are being swallowed, and all it does is basically course_data_checker.check_course_data(row.get_course_data(...)). Can this be inlined into CourseDataMismatchChecker? Or maybe I'm just not used to the adapter pattern and if I were, I could just gloss over this...
  • Also in the enrollment importer, course data is mapped three times: first in Invalid[Degree|CourseType|IsGraded]Checker and CourseNameChecker, then in CourseDataAdapter for the CourseDataMismatchChecker, then in map_enrollment_input_rows. This also uses three Mapper classes and a mixture of exceptions and error messages. All this took a while to understand.
    • Here's a proposal: map_enrollment_input_rows runs first, and somehow uses Invalid[Degree|CourseType|IsGraded]Checker. If those say that the value is not valid, they make it None or an error value. Then CourseNameChecker and CourseDataMismatchChecker would run on parsed rows, ignoring any error value.
    • alternatively the mapping could run after the first three checkers, so that the other two can run on the parsed data.
    • Maybe that's not the best way. But I do have the feeling that dissolving the mappers and adapters would result in a lot less structure to understand, and the remaining classes might not be that much larger.

evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@karyon karyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple Checkers which have no error aggregation, but you seem to know that already. But there are also a couple checkers already doing some aggregation, but not using the Tracker.

evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
evap/staff/importers.py Outdated Show resolved Hide resolved
@karyon
Copy link
Collaborator

karyon commented Oct 25, 2021

Also, I would suggest to not split the classes any further. Smaller code units are not always better ;)

Smaller code units might make the individual parts easier to understand, but it doesn't really remove the complexity, but rather moves it from the code itself into the code structure. kinda like microservices :D

One could argue that small code units are easier to test. In this case here, the individual parts are implementation details and shouldn't be tested anyway in my opinion.

Something something about all the boilerplate code something, and the pendulum swinging from monoliths to small units and back. I'm rather tired now :D

@karyon
Copy link
Collaborator

karyon commented Oct 25, 2021

One last thought. You said you want to have the invariant that the input data is always valid or something like that. I didn't really get that feeling from the code. I think more assertions and less checking-again-for-good-measure would communicate your intention better.

@karyon
Copy link
Collaborator

karyon commented Nov 2, 2021

Random thought, if you want to scope-creep your refactor, you could improve testability of the importer. A big step towards better testability might be having the test data in code instead of less-nice-to-edit xls files. Either the test data from the code could first be converted to an xls file, or the importers would need to enable tests to insert non-xls input somewhere.

@richardebeling richardebeling force-pushed the importer-refactoring branch 3 times, most recently from 1aff4e9 to e8484ec Compare December 20, 2021 17:51
@richardebeling
Copy link
Member Author

@karyon @niklasmohrin all the stuff suggested that I marked as resolved is now incorporated.

The structuring of components across files is still open, there's one TODO in the code for that.

IDK. If you want to review again, you can do that. I can also try to move stuff around files first if you prefer that.

It's not much less code than before, mainly because the distinction between "PartialCourseData", which might be incomplete, and "CourseData" which is guaranteed to be complete, forces a translation step between them. I couldn't bring myself to store invalid values without any reminders/guarantees in CourseData though. I think the current approach allows humans to reason about correctness that wouldn't be given otherwise.

@richardebeling richardebeling force-pushed the importer-refactoring branch 2 times, most recently from d17f073 to c0bebf7 Compare January 10, 2022 17:30
@richardebeling richardebeling marked this pull request as ready for review January 10, 2022 18:33
Copy link
Member

@niklasmohrin niklasmohrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(review in progress)

evap/evaluation/models_logging.py Show resolved Hide resolved
@richardebeling richardebeling force-pushed the importer-refactoring branch 3 times, most recently from b7c5f39 to e7af028 Compare January 17, 2022 19:58
Copy link
Member

@niklasmohrin niklasmohrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what I see 👍 (not completely through yet)

evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
@niklasmohrin
Copy link
Member

I really like the changes, I feel like it really helps when trying to find / understand where something particular is happening without having to backward-slice (swt namedrop, I know) for some importer-wide state modifications 👍

I haven't proof-read whether the logic is actually correctly mapped to the new code, but I would defer that until final review (that is, until @karyon is through again); and the tests should have already done most of this work for me :^)

@richardebeling
Copy link
Member Author

I've removed the importer classes, they are now methods (import_users, import_enrollments, import_persons_from_file). This resolves the last todo that I still had on my list.

Some discussions in this PR are still unresolved, waiting for input there. Apart from that, waiting for review.

@richardebeling
Copy link
Member Author

@janno42 in #1703 we replaced xls import with xlsx import. With the architecture here, we might be able to support both, xls and xlsx. Is that desirable, or do we prefer only supporting xlsx?

@janno42
Copy link
Member

janno42 commented Feb 21, 2022

I'd say only xlsx is fine. No need to add more complexity to the importer :)

Copy link
Member

@niklasmohrin niklasmohrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't read everything (again), good stuff

evap/staff/importers/__init__.py Outdated Show resolved Hide resolved
evap/staff/importers/person.py Outdated Show resolved Hide resolved
evap/staff/importers/person.py Outdated Show resolved Hide resolved
evap/staff/importers/person.py Outdated Show resolved Hide resolved
evap/staff/importers/person.py Show resolved Hide resolved
evap/staff/importers/person.py Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
@richardebeling richardebeling force-pushed the importer-refactoring branch 4 times, most recently from 970339e to a41e9bf Compare March 7, 2022 19:24
Copy link
Member

@niklasmohrin niklasmohrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In person we found that functions / methods that dont have any annotations in the signature are not checked; in a follow up, we / I should add -> None as a default

evap/staff/importers/base.py Outdated Show resolved Hide resolved
evap/staff/importers/base.py Outdated Show resolved Hide resolved
evap/staff/importers/base.py Outdated Show resolved Hide resolved
evap/staff/importers/base.py Outdated Show resolved Hide resolved
evap/staff/importers/base.py Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Show resolved Hide resolved
Copy link
Member

@niklasmohrin niklasmohrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked over user.py, seems good, some notes:

evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/user.py Outdated Show resolved Hide resolved
Copy link
Member

@niklasmohrin niklasmohrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now read until the CourseNameChecker in enrollment.py

evap/staff/importers/user.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
Copy link
Member

@niklasmohrin niklasmohrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done with enrollment

evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
evap/staff/importers/enrollment.py Outdated Show resolved Hide resolved
@niklasmohrin
Copy link
Member

reading import_enrollments, I felt like these abstractions we now have really want to be composed in a more abstract fashion and I thought about something like sklearn's pipeline:

with ConvertExceptionsToMessages(importer_log):
    pipeline = Pipeline(
        ExcelFileRowMapper(skip_first_n_rows=1, row_cls=EnrollmentInputRow),
        RaiseIfErrors(),
        EnrollmentInputRowMapper(),
        TooManyEnrollmentsChecker(test_run),
        CourseDataAdapter(CourseNameChecker(test_run, semester=semester)),
        CourseDataAdapter(CourseDataMismatchChecker(test_run)),
        UserDataAdapter(UserDataEmptyFieldsChecker(test_run)),
        UserDataAdapter(UserDataMismatchChecker(test_run)),
        UserDataAdapter(UserDataValidationChecker(test_run)),
        RaiseIfErrors(),
        importer_log=importer_log,
    )
    parsed_rows = pipeline.run(excel_content)

This will however require some more (typing?) effort without much "business value" - maybe it will be interesting to look into at some point though :^)

@richardebeling
Copy link
Member Author

reading import_enrollments, I felt like these abstractions we now have really want to be composed in a more abstract fashion and I thought about something like sklearn's pipeline:

with ConvertExceptionsToMessages(importer_log):
    pipeline = Pipeline(
        ExcelFileRowMapper(skip_first_n_rows=1, row_cls=EnrollmentInputRow),
        RaiseIfErrors(),
        EnrollmentInputRowMapper(),
        TooManyEnrollmentsChecker(test_run),
        CourseDataAdapter(CourseNameChecker(test_run, semester=semester)),
        CourseDataAdapter(CourseDataMismatchChecker(test_run)),
        UserDataAdapter(UserDataEmptyFieldsChecker(test_run)),
        UserDataAdapter(UserDataMismatchChecker(test_run)),
        UserDataAdapter(UserDataValidationChecker(test_run)),
        RaiseIfErrors(),
        importer_log=importer_log,
    )
    parsed_rows = pipeline.run(excel_content)

This will however require some more (typing?) effort without much "business value" - maybe it will be interesting to look into at some point though :^)

Created #1795 for this.

Copy link
Member

@niklasmohrin niklasmohrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Idea / Model is that importers start with the binary excel content and
then put their current state through a series of steps that each gets
them a bit closer to their result. On errors, it simply raises an error
aborting the import. After each step, checkers can be run to add errors
or warnings that are more user friendly. This should allow to understand
the existing checks and add new checks very easily.
@richardebeling richardebeling merged commit 06f0c1f into e-valuation:main Aug 29, 2022
@richardebeling richardebeling deleted the importer-refactoring branch August 29, 2022 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants