Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet input to OS #186

Merged
merged 29 commits into from
Dec 21, 2023
Merged

Parquet input to OS #186

merged 29 commits into from
Dec 21, 2023

Conversation

ruaridhg
Copy link
Contributor

@ruaridhg ruaridhg commented Dec 15, 2023

Closes #159

Added:

  • csv_file True/False flag within populate of cli/src/pixl_cli/main.py (tests pass for original functionality with csv_file = True in cli/tests/test_queue_start_and_stop.py)
  • messages_from_parquet function within cli/src/pixl_cli/main.py function which follows code example from OMOP ES files as inputs to PIXL CLI #159 and works locally with directory structure public_dir = ~/resources/public and private_dir = ~/resources/private
  • cli/tests/test_queue_start_and_stop_parquet.py which follows similar structure as csv test, but fails with the following errors which could be because I can't call my local files within the testing (the resources within the test is missing the private directory) or some other Path error I'm not sure about
>       assert result.exit_code == 0
E       assert 2 == 0
E        +  where 2 = <Result SystemExit(2)>.exit_code

test_queue_start_and_stop_parquet.py:28: AssertionError
>       assert state_path.exists()
E       AssertionError: assert False
E        +  where False = <bound method Path.exists of PosixPath('test_down.state')>()
E        +    where <bound method Path.exists of PosixPath('test_down.state')> = PosixPath('test_down.state').exists

test_queue_start_and_stop_parquet.py:43: AssertionError

@milanmlft
Copy link
Member

milanmlft commented Dec 19, 2023

To do

  • Update system test (PIXL/test/run-system-test.sh), which currently fails because

    + pixl populate data/test.csv
    Usage: pixl populate [OPTIONS]
    Try 'pixl populate --help' for help.
    
    Error: Missing option '--parquet-dir'.

* Refactor message serialization and deserialization

Addiing `Message` and `SerialisedMessage` classes in attempt to improve information hiding and decoupling.

* Rename `utils.py` -> `message.py`

* Add `decode()` method for `SerialisedMessage`

* Update docstring

* Use new classes in message testing

* Refactor message processing in the CLI

* Refactor `process_message` to use `SerialisedMessage` class in EHR API

* Refactor `process_message` to use `SerialisedMessage` class in imaging API

* Fix `ImagingStudy` initalisation in  `ImagingStudy.from_message()`

* Fix imports

* Fix test: access serialised message bodies

* Turn `Message` into a `dataclass`

* Fix failing tests

* Use `jsonpickle` for (de)serializing messages

This also removes the need for the `SerialisedMessage` class

* Fix `test_deserialise_datetime()` so it uses the `Message` class to assert the `study_datetime`

* Add `study_datetime` property for `Message`

* No need to test deserialising individual fields, already covered by `test_deserialise()` which deserialises the entire object

* Remove `study_date_from_serialised()`, use the class attribute `study_datetime` instead

* Revert "Add `study_datetime` property for `Message`"

This reverts commit 8719153.

* Remove `Messages` class, use `list[Message]` instead

* Add type checking for messages parsed from parquet input

* Update `test_messages_from_parquet()` to use JSON strings instead of bytes

* Update `PixlProducer.publish()` to use a list of Message objects and handle serialisation

* Convert JSON string to bytes when serialising

* Revert "Update `test_messages_from_parquet()` to use JSON strings instead of bytes"

This reverts commit 0e4fce4.

* `PixlProducer.publish()` should take a `list[Message]` as input in tests

* Update EHR API to use new `Message` design

* Update imaging API to use new `Message` design

* Update deserialise function to accept bytes-encoded JSON string

* Assert messages against list of `Message`s

* Print dataclass in logs

* `jsonpickle.decode()` can handle bytes so no need to decode first

Also add a note about why we ignore ruff rule S301

* Make `deserialisable` a keyword only argument

* Copilot forgot to convert dates to datetimes 🥲

* Refactor PixlConsumer run method to accept Message object as callback parameter and deserialise

* Update consumer in `test_subscriber` to accept Message object instead of bytes
@milanmlft milanmlft marked this pull request as ready for review December 20, 2023 17:57
Copy link
Contributor

@stefpiatek stefpiatek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 Looking good. Thanks for your work on this @ruaridhg @jeremyestein and @milanmlft

could probably slim down test files to only be the ones required? Might also be worth removing more of the public files in the test resources that I added too

cli/README.md Outdated Show resolved Hide resolved
cli/src/pixl_cli/main.py Outdated Show resolved Hide resolved
cli/src/pixl_cli/main.py Show resolved Hide resolved
cli/src/pixl_cli/main.py Outdated Show resolved Hide resolved
cli/src/pixl_cli/main.py Outdated Show resolved Hide resolved
cli/src/pixl_cli/main.py Outdated Show resolved Hide resolved
cli/src/pixl_cli/main.py Outdated Show resolved Hide resolved
pixl_pacs/src/pixl_pacs/_processing.py Show resolved Hide resolved
@milanmlft milanmlft merged commit 573dea7 into main Dec 21, 2023
7 checks passed
@milanmlft milanmlft deleted the rmg/parquet_inp branch December 21, 2023 11:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OMOP ES files as inputs to PIXL CLI
4 participants