Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major refactor: flatten code and object structures #34

Merged
merged 22 commits into from
Aug 20, 2021

Conversation

sempervictus
Copy link
Contributor

Rebuild of PR #31

RageLtMan added 9 commits August 19, 2021 13:50
Instead of deprecating the `index(String)` and `index(&str)` calls
for `Message`, `GenericSegment`, and `Field`, move them behind a
feature flag for `string_index`.
The `query()` semantic has limits to how it handles called params
(attempts to convert the str idx calls to use `query()` reveal that
it converts an &str to str in the process), but the string indexing
approach isn't very Rusty in terms of idiomatic implementation
(despite bringing parity to libs in other languages). So instead of
throwing the baby out with the bath water, we can use feature flags
to allow users to choose how the library works for them.

Testing:
  Builds, passes tests, with the feature enabled and normally.
After converting Field from an enum to a Struct and implementing
various generic mechanisms for common functionality, we are seeing
legacy code and object structures cluttering and breaking efforts
respectively. The `MshSegment` does not behave like a generic one
in terms of structure or implementation, and being "in the line of
fire" whenever segments are used to be parsed out through a match
statement makes it very cumbersome, sometimes unworkable due to
borrow checks on the string references.

Remove the `Segment` enum and convert `GenericSegment` struct to be
the new `Segment` struct - possible because `Separators` live in
the `Message` structure permitting on-demand re-parse of a `Segment`
struct into any other named segment type such as `MshSegment`. This
is how the `msh()` implementation now works for `Message` - finding
the relevant `Segment` and returning a parsed `MshSegment` from it
and the `Message` `Separators` already stored.

This permits `query()` and `Index<&str>` mechanisms to search for
all types of segments and fields, using syntax a la `MSH.F3` for
which a test-case has been added to the feature-gated test.

Now that the object structures and interfaces are generic, simple,
and common - the code structure itself has been reorganized per wokket#25
into a flat set of files within src/.

Testing:
  Updated test object derivations to remove destructuring of enums.
  Added check to string-index for MSH fields.
  All tests passing with `index_string` feature enabled and without
The removal of native implementation initializers for Message has
resulted in users having to `use std::convert::TryFrom` whenever
instantiating `Message` structs. Unlike the more common From trait,
this one is not always in-scope making users import it.

Add `Message::new(&str)` which synchronously returns a `Message`
directly without Result wrapping - intended for use when developers
are sure of what they're parsing and can afford to skip validation
and async dispatch.

Testing:
  Added test to compare the results of `new` and `from_str`
  All tests passing
Address wokket#26 by updating `Field::parse()` to split on both the
component delimeter (as before) and the repeat delimeter to permit
component indexing/querying regardless of subdivision modality.

Testing:
  Add test per wokket#26 to find R1 and R2 from `A~S`, passing with
--features string_index.
The HL7 spec defines the first field of the `MSH` segment to be the
field separator directly following the "MSH" characters in the HL7
string. Machines parsing on that delimeter obviously can't handle
such meatland nonsense in format specification natively and need to
have a workaround implemented to offset by -1 for indices into the
`MSH` segment by field.

Implement the off-by-one calculation, and create a temporary hack of
sorts to return `&"|"` because attempts to slice into the `source`
of an MSH segment to pull out `seg[3..3]` violates borrow checker
constraints failing compilation.

@wokket: do you want me to port the change to your `query()` impl?

Relevant spec:
  https://hl7-definition.caristix.com/v2/HL7v2.8/Segments/MSH

Testing:
  Updated MSH string index test to check for MSH.F2 being the set
of delimeters despite that actually being the first field.
  Tests passing
@sempervictus
Copy link
Contributor Author

I have no idea what the deuce is up with github these last two days - it somehow managed to offset the master reference. Fixed

RageLtMan added 3 commits August 19, 2021 15:38
Under normal `Index<usize>` and friends contexts, indexing out of
bounds is prevented by length checks against the relevant vectors.
However, when extracting a compound index by String or &str, the
decomposition of the compound idx calls into the `fields` Vec when
passing the index operation down to the fields themselves. This
indexing operation does not first check to ensure that it is within
the extents of the vector its touching, resulting in an OOB error
which otherwise appears to be handled in the `usize` versions.

Fix this by adding a bounds check in the compound index destructure
phase and extending tests to cover the condition.

Testing:
  Expanded tests to catch the error
  All tests passing
src/segments.rs Outdated Show resolved Hide resolved
src/fields.rs Show resolved Hide resolved
src/fields.rs Outdated Show resolved Hide resolved
src/fields.rs Show resolved Hide resolved
src/fields.rs Show resolved Hide resolved
src/message.rs Outdated
})
.expect("Failed to find hl7 header");
.find(|s| s.fields[0].source == "MSH")
.unwrap();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just call get_segment_by_name("MSH") here and avoid dupe logic? Or rely on the spec that segments[0] is the MSH and assert that fact (prob faster)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah... about that: i've found that method calls in the Indexers are frowned upon by the borrow checker, so have gotten into the habit of using local primitives within the function (and avoiding vecs) to keep it from yelling at me. Will take a pass at DRYing it up a bit.

@wokket
Copy link
Owner

wokket commented Aug 19, 2021

OK, I have a feeling that one of us isn't quite on the same page with the repeats thing. There's pseudocode in the spec for this, but here's my take on it:

A field has one or more 'occurrence ', separated by the repeat delimiter (default ~).
An occurrence has one or more components, separated by the component delimiter (default ^).
A component has one or more sub-components, separated by the sub component delimiter (default &),

So a field of "aaa^bbb" as 1 occurrence (repeat), and two components, each with a single sub component. You could retrieve "aaa" using a query of "R1.C1", or just "C1" as the R1 is kind of redundant.

A field of "aaa^bbb~ccc^ddd" could also retrieve "aaa" using "R1.C1" or "C1" per above, but would use "R2.C1" to get "ccc".

If we had "aaa^bbb&ccc" then we have one occurrence, 2 components, and the second component has two subcomponents. We could retrieve "ccc" using "C2.SC2".

Do you know something I don't?

@sempervictus
Copy link
Contributor Author

Ah! Thank you!
So the logic really should be more along the lines of having 3 vectors per field:

  1. Repeats
  2. Components of repeats (Vec)
  3. Subcomponents of components? (Vec<Vec>)
    I've never seen notation accessing the subcomponent by string query, so i think i screwed up the implementation of field actually.
    Is the logic above correct? If so, gimme a few and i'll get it implemented for the PR.

@wokket
Copy link
Owner

wokket commented Aug 19, 2021

That's generally correct, but is also where I headed in the first iteration of this library. When the majority of fields are only a "simple" field with a single value ("blah") the cost of maintaing multiple vectors (in terms of brain power as much as any runtime cost) gets really high.

That's why I'd started down the enum path of having a "SimpleField", a "RepeatingField" etc etc, but hadn't got as far as a good working model when you came along with the path based stuff which is great, means you don't need to build and then deconstruct a bunch of nested vecs all the time, you can just query in and split as needed based on the query.

As an example, if I have a field with a simple value "blah" (ie one occurrance, one component, one sub-component), do you have a vec[0][0][0] == "blah" ? or just vec[0] = "blah" ? How do you then index your field? does field[0] == field[0,0,0] == field[0][0][0], or are some of those considered out of bounds? How do you make those behave intuitively the same as a simple set of nested vecs if the are all the same value?

This is why I've been pushing down the external Selecter::query() line of thinking...

@sempervictus
Copy link
Contributor Author

All makes sense, thank you. Impl mostly done, getting test in line with the new structure.
In terms of cost... i was thinking maybe there's some way to get the compiler to agree to let us replace Vec<'a str> types with &[&'a str] slices which should be generally cheaper.
Alternatively, instead of enums, we might be able to do this by having a "simple field" initializer which skips all that work and just assigns empty Vec's to the struct members. Lemme wrap up the refactor for field and push to review, and then i'll see what i can do about optimizing away some of the added cost.

@wokket
Copy link
Owner

wokket commented Aug 19, 2021

Yeah let's not make this change any larger than it already is. The cost for me was mainly brain tax.

Do I have to specify a repeat[0] if I want a component?
If a component has subcomponents but I don't specify which one do I get the raw string value? or an error? or ??
A general rule that you have to include all 'higher' levels if you want a 'lower' value makes sense for indexing, but gets in the way if you want (say) the second subcomponent of the only component, in the only repeat.

When you're doing simple indexing into vecs (or arrays, whatever) those questions are really hard, but the path selection stuff you wrote is a much nicer direction, as we get to describe the semantics of the query. "SC2" vs field[0][0][1] is much easier on the neurons.

Thanks @wokket for spec clarification on the internal structures of
a complex HL7 `Field` - the first set of separation occurs in the
repeat delimeter, each of which is then treated effectively as a
whole field. The repeating segments are split like entire `Field`s
were prior to this commit - dividing them up by component delims,
each product of which is split into subcomponent delims.

Functionally, this introduces a new dimension to the data requiring
updates to the `Index` implementations to handle another level of
depth. `Query()` and string indexing are also impacted since their
syntax now maps correctly to the subdivisions of a `Field` struct.

Testing:
  Updated tests for added depth and separation semantic
  All tests passing
  Requires manual review from @wokket since this is a major change
and he knows the spec better.
@sempervictus
Copy link
Contributor Author

you do not have to specify repeat[0] to get a component, that would just yield the entire field in most cases where repeats are not implemented. Components are now where subcomponents were previously - so component[0] would get you the first repeat's worth of components:

Field {
    source: "x&x^y&y~a&a^b&b",
    delims: Separators {
        segment: '\r',
        field: '|',
        repeat: '~',
        component: '^',
        subcomponent: '&',
        escape_char: '\\',
    },
    repeats: [
        "x&x^y&y",
        "a&a^b&b",
    ],
    components: [
        [
            "x&x",
            "y&y",
        ],
        [
            "a&a",
            "b&b",
        ],
    ],
    subcomponents: [
        [
            [
                "x",
                "x",
            ],
            [
                "y",
                "y",
            ],
        ],
        [
            [
                "a",
                "a",
            ],
            [
                "b",
                "b",
            ],
        ],
    ],
}

With the added dimension of repeats pushing existing field vectors
down one level of nesting, the indexing mechanism to quickly pull
values needs to have an additional level of depth.
The Index implementations push down their called parameters to the
lower-level struct within the one handling the current depth of
indexing, so the only effort required to achieve this is to add one
more possible lentgh of index values to the conditional checking
which `uint` indexer to call.
Implement the additional change, add a test to Message extracting
the `S2` subcomponent of `OBR.F1.R1.C2`.

Testing:
  Added test for new functionality
  All tests passing
@wokket
Copy link
Owner

wokket commented Aug 19, 2021

OK I'm AFK for a bit. Ping me when you've worked through whatever you want to do, had a look at the other points I raised in the PR and run it through a clippy pass, and I'll double check it later, but I'd expect it to be pretty close by then.

@sempervictus
Copy link
Contributor Author

Thank you. Grokking clippy and going over the comments again.

RageLtMan added 6 commits August 19, 2021 19:27
Moved call to `msh()` out of the loop, removed the result matcher
and replaced with `unwrap()` since we're measuring field retreival.
@sempervictus
Copy link
Contributor Author

sempervictus commented Aug 20, 2021

That was educational :), appreciate the references.
I think i broke the benchmark somehow - code looks sane to me, but GH seems to be hanging on the changed bench. Even i'm not so unlucky as to have CI coincidentally hang on the thing i just changed. Probably something i dont get about how benchmarking works. When you're back, could you please peek at that and kick me in the appropriate direction if you know it?

@sempervictus
Copy link
Contributor Author

On the depth of Field structure - trying to make it use borrowed slices a la

diff --git i/src/fields.rs w/src/fields.rs
index 71e7f7f..441cb10 100644
--- i/src/fields.rs
+++ w/src/fields.rs
@@ -9,7 +9,7 @@ use std::ops::Index;
 pub struct Field<'a> {
     pub source: &'a str,
     pub delims: Separators,
-    pub repeats: Vec<&'a str>,
+    pub repeats: &[&'a str],
     pub components: Vec<Vec<&'a str>>,
     pub subcomponents: Vec<Vec<Vec<&'a str>>>,
 }
@@ -40,7 +40,7 @@ impl<'a> Field<'a> {
         let field = Field {
             source: input,
             delims: *delims,
-            repeats,
+            repeats: &repeats[..]
             components,
             subcomponents,

results in some nay-saying by the borrow-checker:

error[E0515]: cannot return value referencing local variable `repeats`
  --> src/fields.rs:47:9
   |
43 |             repeats: &repeats[..],
   |                       ------- `repeats` is borrowed here
...
47 |         Ok(field)
   |         ^^^^^^^^^ returns a value referencing data owned by the current function

error: aborting due to previous error

For more information about this error, try `rustc --explain E0515`.
error: could not compile `rust-hl7`

without the borrow, it has an unknown size issue with this:

error[E0277]: the size for values of type `[&'a str]` cannot be known at compilation time
  --> src/fields.rs:12:18
   |
12 |     pub repeats: [&'a str],
   |                  ^^^^^^^^^ doesn't have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `[&'a str]`
   = note: only the last field of a struct may have a dynamically sized type
   = help: change the field's type to have a statically known size

so i think i'll let sleeping dogs lie for the time being on this.

@sempervictus
Copy link
Contributor Author

Fun note from clippy regarding efficiency: the distinct MshSegment type is ~40x larger than a generic Segment:

warning: large size difference between variants
 --> src/segments.rs:8:5
  |
8 |     MSH(MshSegment<'a>),
  |     ^^^^^^^^^^^^^^^^^^^ this variant is 1952 bytes
  |
  = note: `#[warn(clippy::large_enum_variant)]` on by default
note: and the second-largest variant is 48 bytes:
 --> src/segments.rs:7:5
  |
7 |     Generic(Segment<'a>),
  |     ^^^^^^^^^^^^^^^^^^^^
  = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#large_enum_variant
help: consider boxing the large fields to reduce the total size of the enum
  |
8 |     MSH(Box<MshSegment<'a>>),
  |         ^^^^^^^^^^^^^^^^^^^

warning: 1 warning emitted

I definitely broke the benchmark :( - even when borrowing msh there, its still stuck showing:


Benchmarking Read Field from MSH (variable)
Benchmarking Read Field from MSH (variable): Warming up for 3.0000 s

or just a blank output if you look at it while its running and stuck there.

@sempervictus
Copy link
Contributor Author

Either that benchmark method won the lottery or something weird is going on inside the iterator:

Benchmarking Read Field from MSH (variable)
Benchmarking Read Field from MSH (variable): Warming up for 3.0000 s
Benchmarking Read Field from MSH (variable): Collecting 100 samples in estimated 5.0000 s (18446744074B iterations)
Benchmarking Read Field from MSH (variable): Analyzing
Criterion.rs ERROR: At least one measurement of benchmark Read Field from MSH (variable) took zero time per iteration. This should not be possible. If using iter_custom, please verify that your routine is correctly measured.

while it appears (during benchmarking) to take a rather long time.

@wokket
Copy link
Owner

wokket commented Aug 20, 2021

Benchmark is running fine for me locally, so I'm inclined to merge it and see if the zombie army that powers github actions are back on duty.

Anything else before I merge?

@wokket wokket merged commit 2c2a78e into wokket:master Aug 20, 2021
sempervictus added a commit to sempervictus/rust-hl7 that referenced this pull request Aug 25, 2021
 wokket#26

* Move string indexing behind a feature flag

Instead of deprecating the `index(String)` and `index(&str)` calls
for `Message`, `GenericSegment`, and `Field`, move them behind a
feature flag for `string_index`.
The `query()` semantic has limits to how it handles called params
(attempts to convert the str idx calls to use `query()` reveal that
it converts an &str to str in the process), but the string indexing
approach isn't very Rusty in terms of idiomatic implementation
(despite bringing parity to libs in other languages). So instead of
throwing the baby out with the bath water, we can use feature flags
to allow users to choose how the library works for them.

Testing:
  Builds, passes tests, with the feature enabled and normally.

* Major refactor: flatten code and object structures

After converting Field from an enum to a Struct and implementing
various generic mechanisms for common functionality, we are seeing
legacy code and object structures cluttering and breaking efforts
respectively. The `MshSegment` does not behave like a generic one
in terms of structure or implementation, and being "in the line of
fire" whenever segments are used to be parsed out through a match
statement makes it very cumbersome, sometimes unworkable due to
borrow checks on the string references.

Remove the `Segment` enum and convert `GenericSegment` struct to be
the new `Segment` struct - possible because `Separators` live in
the `Message` structure permitting on-demand re-parse of a `Segment`
struct into any other named segment type such as `MshSegment`. This
is how the `msh()` implementation now works for `Message` - finding
the relevant `Segment` and returning a parsed `MshSegment` from it
and the `Message` `Separators` already stored.

This permits `query()` and `Index<&str>` mechanisms to search for
all types of segments and fields, using syntax a la `MSH.F3` for
which a test-case has been added to the feature-gated test.

Now that the object structures and interfaces are generic, simple,
and common - the code structure itself has been reorganized per wokket#25
into a flat set of files within src/.

Testing:
  Updated test object derivations to remove destructuring of enums.
  Added check to string-index for MSH fields.
  All tests passing with `index_string` feature enabled and without

* Implement `new(&str)` for Message

The removal of native implementation initializers for Message has
resulted in users having to `use std::convert::TryFrom` whenever
instantiating `Message` structs. Unlike the more common From trait,
this one is not always in-scope making users import it.

Add `Message::new(&str)` which synchronously returns a `Message`
directly without Result wrapping - intended for use when developers
are sure of what they're parsing and can afford to skip validation
and async dispatch.

Testing:
  Added test to compare the results of `new` and `from_str`
  All tests passing

* Fix benchmark

* Split field in repeat delim

Address wokket#26 by updating `Field::parse()` to split on both the
component delimeter (as before) and the repeat delimeter to permit
component indexing/querying regardless of subdivision modality.

Testing:
  Add test per wokket#26 to find R1 and R2 from `A~S`, passing with
--features string_index.

* Field convenience method and housekeeping

* Cleanup and test `Field::is_subdivided` method

* CI: test/bench with features

* String index - handle MSH off-by-one problem

The HL7 spec defines the first field of the `MSH` segment to be the
field separator directly following the "MSH" characters in the HL7
string. Machines parsing on that delimeter obviously can't handle
such meatland nonsense in format specification natively and need to
have a workaround implemented to offset by -1 for indices into the
`MSH` segment by field.

Implement the off-by-one calculation, and create a temporary hack of
sorts to return `&"|"` because attempts to slice into the `source`
of an MSH segment to pull out `seg[3..3]` violates borrow checker
constraints failing compilation.

@wokket: do you want me to port the change to your `query()` impl?

Relevant spec:
  https://hl7-definition.caristix.com/v2/HL7v2.8/Segments/MSH

Testing:
  Updated MSH string index test to check for MSH.F2 being the set
of delimeters despite that actually being the first field.
  Tests passing

* Update readme

* Fix OOB index attempt on fields vec of segment

Under normal `Index<usize>` and friends contexts, indexing out of
bounds is prevented by length checks against the relevant vectors.
However, when extracting a compound index by String or &str, the
decomposition of the compound idx calls into the `fields` Vec when
passing the index operation down to the fields themselves. This
indexing operation does not first check to ensure that it is within
the extents of the vector its touching, resulting in an OOB error
which otherwise appears to be handled in the `usize` versions.

Fix this by adding a bounds check in the compound index destructure
phase and extending tests to cover the condition.

Testing:
  Expanded tests to catch the error
  All tests passing

* Add OOB fix to `Segment::query()`

* First pass @ PR review notes

* Refactor `Field` for 3 tiers of subdivision

Thanks @wokket for spec clarification on the internal structures of
a complex HL7 `Field` - the first set of separation occurs in the
repeat delimeter, each of which is then treated effectively as a
whole field. The repeating segments are split like entire `Field`s
were prior to this commit - dividing them up by component delims,
each product of which is split into subcomponent delims.

Functionally, this introduces a new dimension to the data requiring
updates to the `Index` implementations to handle another level of
depth. `Query()` and string indexing are also impacted since their
syntax now maps correctly to the subdivisions of a `Field` struct.

Testing:
  Updated tests for added depth and separation semantic
  All tests passing
  Requires manual review from @wokket since this is a major change
and he knows the spec better.

* Expand str idx to take additional level of depth

With the added dimension of repeats pushing existing field vectors
down one level of nesting, the indexing mechanism to quickly pull
values needs to have an additional level of depth.
The Index implementations push down their called parameters to the
lower-level struct within the one handling the current depth of
indexing, so the only effort required to achieve this is to add one
more possible lentgh of index values to the conditional checking
which `uint` indexer to call.
Implement the additional change, add a test to Message extracting
the `S2` subcomponent of `OBR.F1.R1.C2`.

Testing:
  Added test for new functionality
  All tests passing

* Resolve clippy complaints

* Restore separate `query()` tests

* Simplify get_msh_and_read_field benchmark

Moved call to `msh()` out of the loop, removed the result matcher
and replaced with `unwrap()` since we're measuring field retreival.

* CI: use all features

* Fix typo in benchmark

* Simplify `msh()` per @wokket

* Borrow for benchmark iter

Co-authored-by: RageLtMan <rageltman [at] sempervictus>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants