Rewrite staff importers (UserImporter, EnrollmentImporter) #1634

richardebeling · 2021-10-15T18:18:13Z

Preparation for solving #1574.

Minimal implementation to pass all existing tests. Putting this here already to get some feedback on the overall architecture.

The idea is that importers start with the binary Excel content and then put their current state through a series of steps, each of which gets them a bit closer to the result. On errors, a step simply raises an exception that aborts the import. After each step, checkers can be run to add errors or warnings that are more user-friendly. This should hopefully make it easy to understand the existing checks and to add new ones.
I tried to separate and encapsulate the different concerns as much as possible: logic required for an import, errors/warnings, and message handling.
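
A minimal sketch of the step/checker idea described above, not the code from this PR. All names (`ImporterError`, `ImporterLog`, `parse_rows`, `check_emails`, `run_import`) are hypothetical, and comma-separated text stands in for the binary Excel content so the example stays self-contained:

```python
from dataclasses import dataclass, field
from typing import Optional


class ImporterError(Exception):
    """Raised by a step to abort the import (hypothetical name)."""


@dataclass
class ImporterLog:
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def parse_rows(content: bytes) -> list[list[str]]:
    # Step: turn the raw content into rows. Comma-separated text is only a
    # stand-in for reading a real Excel file.
    rows = [line.split(",") for line in content.decode().splitlines()]
    if not rows:
        raise ImporterError("The file is empty.")
    return rows


def check_emails(rows: list[list[str]], log: ImporterLog) -> None:
    # Checker: inspects the current state and records user-friendly messages.
    for index, row in enumerate(rows, start=1):
        if "@" not in row[-1]:
            log.errors.append(f"Row {index}: '{row[-1]}' is not a valid email address.")


def run_import(content: bytes, log: ImporterLog) -> Optional[list[list[str]]]:
    try:
        rows = parse_rows(content)
        check_emails(rows, log)
        if log.errors:
            raise ImporterError("Errors occurred while parsing the input data.")
        return rows
    except ImporterError as error:
        # The abort itself is also reported through the log.
        log.errors.append(str(error))
        return None
```

The point of the split is that steps change the state and may abort, while checkers only read the state and append messages.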

It's probably best to check out what user import and enrollment import files look like before looking at the code.

Related code changes that need to be integrated:

- xls was changed to xlsx in "change user import and tests to xlsX" (#1703).
- "Fix importer error-merge error" (#1714) added a test that didn't really match the code anymore. I adapted it to the new infrastructure, but we might consider removing it.
- "Import enrollment data with existing courses" (#1683) allowed importing into existing courses.
- "Importer messages: More intuitive counts / numbers" (#1597) changes importer success messages to include user lists.
- "Link to user edit pages in importer messages" (#1730) adds links to user profiles in importer messages.

niklasmohrin

didn't read till the end, but here are some thoughts along the way

evap/staff/importers.py

karyon

At 1400 lines, this is rather long. Consider splitting this up, e.g. putting every importer in its own file. Since this is basically a full rewrite, there's no git history lost either ;)
The importer classes are currently only scopes for a set of methods; they are never instantiated. With separate files, there wouldn't be any need for the classes. I also suggest marking all of their methods except the process one as private.

Generally, this is really a lot of code to understand. I know it was your goal to make this more easily extensible, but I think the current structure achieves this only partially: IF you figure out the structure, it's simple to not make any mistakes. BUT currently you need a lot of time to understand the structure, and people might not be willing to spend this time before extending this.

E.g., try to understand the CourseDataAdapter. It's only 14 LOC and a few comments, but it's using/mentioning a total of 7 classes, containing a finalize which does nothing (?), requires explanation why Exceptions are being swallowed, and all it does is basically course_data_checker.check_course_data(row.get_course_data(...)). Can this be inlined into CourseDataMismatchChecker? Or maybe I'm just not used to the adapter pattern and if I were, I could just gloss over this...
Also in the enrollment importer, course data is mapped three times: first in Invalid[Degree|CourseType|IsGraded]Checker and CourseNameChecker, then in CourseDataAdapter for the CourseDataMismatchChecker, then in map_enrollment_input_rows. This also uses three Mapper classes and a mixture of exceptions and error messages. All this took a while to understand.
- Here's a proposal: map_enrollment_input_rows runs first, and somehow uses Invalid[Degree|CourseType|IsGraded]Checker. If those say that the value is not valid, they make it None or an error value. Then CourseNameChecker and CourseDataMismatchChecker would run on parsed rows, ignoring any error value.
- alternatively the mapping could run after the first three checkers, so that the other two can run on the parsed data.
- Maybe that's not the best way. But I do have the feeling that dissolving the mappers and adapters would result in a lot less structure to understand, and the remaining classes might not be that much larger.
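
A rough sketch of the first proposal above (map first, turn invalid values into `None`, let later checkers skip them). The names `ParsedRow`, `map_row`, `check_course_data_mismatch` and the `KNOWN_COURSE_TYPES` set are made up for illustration; the real importer would resolve values against the database:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table standing in for a database query.
KNOWN_COURSE_TYPES = {"Lecture", "Seminar"}


@dataclass
class ParsedRow:
    course_name: str
    course_type: Optional[str]  # None means "invalid, already reported"


def map_row(course_name: str, course_type: str, errors: list[str]) -> ParsedRow:
    # Mapping reports invalid values once and replaces them with None.
    if course_type not in KNOWN_COURSE_TYPES:
        errors.append(f"Unknown course type '{course_type}'.")
        return ParsedRow(course_name, None)
    return ParsedRow(course_name, course_type)


def check_course_data_mismatch(rows: list[ParsedRow], errors: list[str]) -> None:
    # Runs on parsed rows and ignores values that mapping already rejected.
    first_type_by_course: dict[str, str] = {}
    for row in rows:
        if row.course_type is None:
            continue
        first_type = first_type_by_course.setdefault(row.course_name, row.course_type)
        if first_type != row.course_type:
            errors.append(f"Course '{row.course_name}' has conflicting types.")
```

With this shape, each value is mapped exactly once, and a row with an unknown course type produces one error instead of also triggering a mismatch error.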

evap/staff/importers.py

karyon

There are multiple checkers which have no error aggregation, but you seem to know that already. There are also a couple of checkers that already do some aggregation without using the Tracker.
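
To illustrate what shared aggregation could look like, here is a small sketch. `ErrorAggregator` is a hypothetical helper and not the Tracker from this PR, whose interface is not shown in this thread:

```python
from collections import defaultdict


class ErrorAggregator:
    """Collects row numbers per distinct problem and emits one message each."""

    def __init__(self) -> None:
        self._rows_by_key: dict[str, list[int]] = defaultdict(list)

    def add(self, key: str, row_number: int) -> None:
        self._rows_by_key[key].append(row_number)

    def messages(self, template: str) -> list[str]:
        # One message per distinct key, in the order the keys were first seen.
        return [
            template.format(key=key, count=len(rows), rows=", ".join(map(str, rows)))
            for key, rows in self._rows_by_key.items()
        ]
```

A checker would call `add` per offending row and `messages` once at the end, so the user sees one line per distinct problem rather than one per row.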

evap/staff/importers.py

karyon · 2021-10-25T21:30:43Z

Also, I would suggest to not split the classes any further. Smaller code units are not always better ;)

Smaller code units might make the individual parts easier to understand, but that doesn't really remove the complexity. It just moves it from the code itself into the code structure. Kinda like microservices :D

One could argue that small code units are easier to test. In this case here, the individual parts are implementation details and shouldn't be tested anyway in my opinion.

Something something about all the boilerplate code something, and the pendulum swinging from monoliths to small units and back. I'm rather tired now :D

karyon · 2021-10-25T21:33:33Z

One last thought. You said you want to have the invariant that the input data is always valid, or something like that. I didn't really get that feeling from the code. I think more assertions and less checking-again-for-good-measure would communicate your intention better.
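
For example, a later step could state the invariant with an assertion instead of silently re-validating. This is only a sketch; `create_enrollments` and the row layout are invented for illustration:

```python
def create_enrollments(rows: list[dict[str, str]]) -> list[tuple[str, str]]:
    # Earlier checkers are expected to have rejected rows without an email.
    # The assertion documents that invariant instead of handling the case again.
    assert all(row["email"] for row in rows), "checkers must run before this step"
    return [(row["email"], row["course"]) for row in rows]
```

A failing assertion then points at a bug in the importer itself, not at bad user input.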

karyon · 2021-11-02T12:12:12Z

Random thought, if you want to scope-creep your refactor, you could improve testability of the importer. A big step towards better testability might be having the test data in code instead of less-nice-to-edit xls files. Either the test data from the code could first be converted to an xls file, or the importers would need to enable tests to insert non-xls input somewhere.
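
A sketch of the second option (letting tests insert non-xls input), assuming the core logic is separated from the Excel parsing. `import_users_from_rows` and `TEST_ROWS` are hypothetical names:

```python
from typing import Iterable, Sequence

Row = Sequence[str]


def import_users_from_rows(rows: Iterable[Row]) -> list[dict[str, str]]:
    # The core logic works on plain rows and does not care where they came from.
    return [
        {"name": name.strip(), "email": email.strip().lower()}
        for name, email in rows
    ]


# In a test, the data lives next to the assertions instead of in a binary file.
TEST_ROWS = [("Ada Lovelace", " Ada@Example.com ")]
```

Only a thin wrapper that reads the file and yields rows would then need a real spreadsheet in its tests.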

richardebeling · 2021-12-20T18:04:27Z

@karyon @niklasmohrin all the stuff suggested that I marked as resolved is now incorporated.

The structuring of components across files is still open, there's one TODO in the code for that.

IDK. If you want to review again, you can do that. I can also try to move stuff around files first if you prefer that.

It's not much less code than before, mainly because the distinction between "PartialCourseData", which might be incomplete, and "CourseData", which is guaranteed to be complete, forces a translation step between them. I couldn't bring myself to store invalid values without any reminders/guarantees in CourseData though. I think the current approach lets humans reason about correctness in a way that wouldn't be possible otherwise.
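
Roughly, the distinction looks like the following sketch. The class names come from the comment above, but the fields and the `to_course_data` function are assumptions made for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartialCourseData:
    # Fields may be None when the input value could not be resolved.
    name: str
    degree: Optional[str]
    is_graded: Optional[bool]


@dataclass(frozen=True)
class CourseData:
    # No Optional fields: an instance is complete by construction.
    name: str
    degree: str
    is_graded: bool


def to_course_data(partial: PartialCourseData) -> CourseData:
    # The translation step is the only place where completeness is established.
    if partial.degree is None or partial.is_graded is None:
        raise ValueError(f"Course '{partial.name}' is incomplete.")
    return CourseData(partial.name, partial.degree, partial.is_graded)
```

Code that receives a `CourseData` can then rely on the types alone and does not need to check for missing values again.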

evap/staff/importers/base.py

evap/staff/importers/user.py

niklasmohrin

(review in progress)

evap/evaluation/models_logging.py

niklasmohrin

I like what I see 👍 (not completely through yet)

evap/staff/importers/user.py

evap/staff/importers/enrollment.py

niklasmohrin · 2022-01-24T22:38:03Z

I really like the changes, I feel like it really helps when trying to find / understand where something particular is happening without having to backward-slice (swt namedrop, I know) for some importer-wide state modifications 👍

I haven't proof-read whether the logic is actually correctly mapped to the new code, but I would defer that until final review (that is, until @karyon is through again); and the tests should have already done most of this work for me :^)

richardebeling · 2022-02-21T18:39:27Z

I've removed the importer classes; they are now functions (import_users, import_enrollments, import_persons_from_file). This resolves the last TODO that I still had on my list.

Some discussions in this PR are still unresolved, waiting for input there. Apart from that, waiting for review.

richardebeling · 2022-02-21T18:47:40Z

@janno42 in #1703 we replaced xls import with xlsx import. With the architecture here, we might be able to support both xls and xlsx. Is that desirable, or do we prefer only supporting xlsx?

evap/staff/importers/enrollment.py

janno42 · 2022-02-21T18:59:40Z

I'd say only xlsx is fine. No need to add more complexity to the importer :)

niklasmohrin

haven't read everything (again), good stuff

evap/staff/importers/__init__.py

evap/staff/importers/person.py

evap/staff/importers/user.py

niklasmohrin

In person we found that functions / methods that don't have any annotations in the signature are not checked; in a follow-up, we / I should add -> None as a default
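
To make this concrete, assuming the type checker is mypy with default settings (which skips the bodies of functions that have no annotations in their signature):

```python
def unchecked():
    # No annotations in the signature: with mypy's defaults this body is
    # skipped, so the incompatible reassignment below is not reported.
    total = 0
    total = "zero"
    return total


def checked() -> None:
    # Adding "-> None" makes the function count as annotated, so its body is
    # type-checked. The reassignment from unchecked() would be flagged here.
    total = 0
    total += 1
    print(total)
```

Both functions run fine at runtime; the difference only shows up in the type checker's output.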

evap/staff/importers/base.py

evap/staff/importers/user.py

niklasmohrin

Looked over user.py, seems good, some notes:

evap/staff/importers/user.py

niklasmohrin

Now read until the CourseNameChecker in enrollment.py

evap/staff/importers/user.py

evap/staff/importers/enrollment.py

niklasmohrin

done with enrollment

evap/staff/importers/enrollment.py

niklasmohrin · 2022-08-08T20:11:10Z

reading import_enrollments, I felt like these abstractions we now have really want to be composed in a more abstract fashion and I thought about something like sklearn's pipeline:

with ConvertExceptionsToMessages(importer_log):
    pipeline = Pipeline(
        ExcelFileRowMapper(skip_first_n_rows=1, row_cls=EnrollmentInputRow),
        RaiseIfErrors(),
        EnrollmentInputRowMapper(),
        TooManyEnrollmentsChecker(test_run),
        CourseDataAdapter(CourseNameChecker(test_run, semester=semester)),
        CourseDataAdapter(CourseDataMismatchChecker(test_run)),
        UserDataAdapter(UserDataEmptyFieldsChecker(test_run)),
        UserDataAdapter(UserDataMismatchChecker(test_run)),
        UserDataAdapter(UserDataValidationChecker(test_run)),
        RaiseIfErrors(),
        importer_log=importer_log,
    )
    parsed_rows = pipeline.run(excel_content)

This will however require some more (typing?) effort without much "business value" - maybe it will be interesting to look into at some point though :^)
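
For reference, a very small version of such a pipeline could look like this. It is a sketch only: steps are plain callables taking the data and the log, and `split_lines`, `check_not_empty`, `raise_if_errors` and `PipelineAborted` are made-up stand-ins for the mapper and checker classes listed above:

```python
from typing import Any, Callable

# A step receives the current data and the shared log and returns the new data.
Step = Callable[[Any, list[str]], Any]


class PipelineAborted(Exception):
    pass


class Pipeline:
    def __init__(self, *steps: Step, importer_log: list[str]) -> None:
        self.steps = steps
        self.importer_log = importer_log

    def run(self, data: Any) -> Any:
        # Feed the output of each step into the next one.
        for step in self.steps:
            data = step(data, self.importer_log)
        return data


def split_lines(data: str, log: list[str]) -> list[str]:
    return data.splitlines()


def check_not_empty(rows: list[str], log: list[str]) -> list[str]:
    # Checker-style step: adds messages but passes the data through unchanged.
    for index, row in enumerate(rows, start=1):
        if not row.strip():
            log.append(f"Row {index} is empty.")
    return rows


def raise_if_errors(data: Any, log: list[str]) -> Any:
    if log:
        raise PipelineAborted(f"{len(log)} error(s) found")
    return data
```

Using `Any` for the data sidesteps exactly the typing effort mentioned above; typing each step's input and output precisely is where the extra work would go.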

richardebeling · 2022-08-15T18:39:25Z

Created #1795 for this.

niklasmohrin

🎉

The idea / model is that importers start with the binary Excel content and then put their current state through a series of steps, each of which gets them a bit closer to the result. On errors, a step simply raises an exception that aborts the import. After each step, checkers can be run to add errors or warnings that are more user-friendly. This should make it easy to understand the existing checks and to add new ones.

richardebeling changed the title ~~Rewrite staff importers~~ Rewrite staff importers (UserImporter, EnrollmentImporter) Oct 15, 2021

richardebeling force-pushed the importer-refactoring branch 2 times, most recently from bfec022 to 02e5ee2 Compare October 16, 2021 18:40

niklasmohrin reviewed Oct 18, 2021

karyon reviewed Oct 25, 2021

richardebeling force-pushed the importer-refactoring branch from 82b782a to 7174862 Compare November 15, 2021 17:46

richardebeling force-pushed the importer-refactoring branch from 7174862 to 837a516 Compare November 22, 2021 21:09

richardebeling force-pushed the importer-refactoring branch from 837a516 to 6ad9e31 Compare December 6, 2021 18:07

richardebeling force-pushed the importer-refactoring branch 3 times, most recently from 1aff4e9 to e8484ec Compare December 20, 2021 17:51