[EDI] Support repetition separator #212

Closed
manuel-neuhauser-hs opened this issue Jul 20, 2023 · 8 comments · Fixed by #215

Comments

@manuel-neuhauser-hs

For EDI input files, the repetition separator is currently not supported in omniparser. It is used to allow multiple values in a composite element.

Example

The following ISA segment defines ^ as the repetition separator, and segment DMG uses it in the 5th element to list 2 composite values.

ISA*00*          *00*          *ZZ*CMSFFM         *ZZ*987654321      *230614*1605*^*00501*000001001*1*T*:~
...
DMG*D8*19910512*M*M*:RET:2135-2^:RET:2106-3~
$ op transform --schema 834-schema-repetition.json.txt --input 987654321.I834.D230614.T161330125.T.OUT.txt

Actual

[
  {
    "race_ethnicity_code": [
      "2135-2^"
    ]
  }
]

Expected

[
  {
    "race_ethnicity_code": [
      "2135-2",
      "2106-3"
    ]
  }
]
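
To make the expected behavior concrete, here is a minimal Go sketch that is independent of omniparser's actual code. It assumes the 00501 ISA layout used in the example, where ISA11 carries the repetition separator, reads that separator, splits the DMG05 value on it, and takes the third component of each repetition. The function names `repetitionSeparator` and `splitRepeats` are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// repetitionSeparator returns ISA11 from an ISA segment.
// Assumption: the element separator is the character right after "ISA".
func repetitionSeparator(isa string) (string, error) {
	if len(isa) < 4 || !strings.HasPrefix(isa, "ISA") {
		return "", fmt.Errorf("not an ISA segment")
	}
	elems := strings.Split(isa, isa[3:4])
	if len(elems) < 12 {
		return "", fmt.Errorf("ISA segment has %d elements, need at least 12", len(elems))
	}
	return elems[11], nil
}

// splitRepeats splits one element value into repetitions, then splits each
// repetition into its components. Both delimiters must be non-empty.
func splitRepeats(value, repDelim, compDelim string) [][]string {
	var out [][]string
	for _, rep := range strings.Split(value, repDelim) {
		out = append(out, strings.Split(rep, compDelim))
	}
	return out
}

func main() {
	isa := "ISA*00*          *00*          *ZZ*CMSFFM         *ZZ*987654321      *230614*1605*^*00501*000001001*1*T*:~"
	rep, err := repetitionSeparator(isa)
	if err != nil {
		panic(err)
	}
	// DMG05 from the example segment; the code is the third component.
	for _, comps := range splitRepeats(":RET:2135-2^:RET:2106-3", rep, ":") {
		fmt.Println(comps[2])
	}
}
```

Run against the example, the loop prints `2135-2` and `2106-3`, which are the two values expected in `race_ethnicity_code`.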

Files

834-schema-repetition.json.txt

987654321.I834.D230614.T161330125.T.OUT.txt

@manuel-neuhauser-hs manuel-neuhauser-hs changed the title Support repetition separator [EDI] Support repetition separator Jul 20, 2023
@jf-tech
Owner

jf-tech commented Jul 21, 2023

@manuel-neuhauser-hs ack. Will get back to you with a solution early next week.

@jf-tech
Owner

jf-tech commented Jul 24, 2023

@manuel-neuhauser-hs a prototype is WIP. Will keep you updated.

@jf-tech
Owner

jf-tech commented Jul 25, 2023

212.diff.txt

Attached a working prototype diff. Now need to formalize the changes

jf-tech added a commit that referenced this issue Jul 25, 2023
Issue: #212

`repetition_delimiter`: delimiter that separates multiple data instances within an element. For example,
if `^` is the repetition delimiter for a segment `DMG*D8*19690815*M**A^B^C^D~`, then the last
element has 4 pieces of data: `A`, `B`, `C`, and `D`. Any element without `repetition_delimiter`
present has just one piece of data. Similarly, if `^` is the repetition delimiter for a
segment `CLM*A37YH556*500***11:B:1^12:B:2~`, the last element has 2 pieces of data: `11:B:1` and
`12:B:2`, each of which is further delimited by a `component_delimiter` `:`. Note that since
`repetition_delimiter` creates multiple pieces of data under the same element name in the schema,
in most cases the suitable construct type in `transform_declarations` is `array`.

Currently `NonValidatingReader` reads all the elements and their components serially into
a slice, `[]RawSegElem`. Each entry contains the element value, the element index, and the component
index if there is more than one component. With `repetition_delimiter` added, we keep
the same pattern: `NonValidatingReader` still reads everything into the slice, except that now
multiple `RawSegElem` entries can share the same `ElemIndex` and `CompIndex`.

Using the example above: `^` is the rep delim and seg is `CLM*A37YH556*500***11:B:1^12:B:2~`. After
`NonValidatingReader.Read()` is done, we'll have the following `[]RawSegElem` (simplified):

```
{
   {'CLM', ElemIndex: 0, CompIndex: 1},
   {'A37YH556', ElemIndex: 1, CompIndex: 1},
   {'500', ElemIndex: 2, CompIndex: 1},
   {'', ElemIndex: 3, CompIndex: 1},
   {'', ElemIndex: 4, CompIndex: 1},
   {'11', ElemIndex: 5, CompIndex: 1},
   {'B', ElemIndex: 5, CompIndex: 2},
   {'1', ElemIndex: 5, CompIndex: 3},
   {'12', ElemIndex: 5, CompIndex: 1},
   {'B', ElemIndex: 5, CompIndex: 2},
   {'2', ElemIndex: 5, CompIndex: 3},
}
```

Note the last 3 elements have the same `ElemIndex` and `CompIndex` as the previous 3 elements.
This behavior is new and introduced in this PR.
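
The flattening described above can be sketched in Go as follows. This is a simplified illustration and not the actual `NonValidatingReader` code: `rawSegElem` and `readSegment` are hypothetical stand-ins, the segment terminator is assumed to be already stripped, release characters are ignored, and all three delimiters are assumed to be non-empty.

```go
package main

import (
	"fmt"
	"strings"
)

// rawSegElem is a simplified stand-in for the RawSegElem described above.
type rawSegElem struct {
	Data      string
	ElemIndex int
	CompIndex int
}

// readSegment flattens one segment into a slice of rawSegElem.
// Every repetition of an element restarts CompIndex at 1, so repetitions
// of the same element produce entries with identical index pairs.
func readSegment(seg, elemDelim, compDelim, repDelim string) []rawSegElem {
	var out []rawSegElem
	for ei, elem := range strings.Split(seg, elemDelim) {
		for _, rep := range strings.Split(elem, repDelim) {
			for ci, comp := range strings.Split(rep, compDelim) {
				out = append(out, rawSegElem{Data: comp, ElemIndex: ei, CompIndex: ci + 1})
			}
		}
	}
	return out
}

func main() {
	for _, e := range readSegment("CLM*A37YH556*500***11:B:1^12:B:2", "*", ":", "^") {
		fmt.Printf("{%q, ElemIndex: %d, CompIndex: %d}\n", e.Data, e.ElemIndex, e.CompIndex)
	}
}
```

In this sketch the two repetitions `11:B:1` and `12:B:2` both land at `ElemIndex: 5` with `CompIndex` 1 through 3, which is the duplicate-index situation the reader has to handle.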

On the EDI reader side (reader.go), when we previously matched element decls against the raw element
slice, we did only a one-way scan: `ElemIndex` and `CompIndex` were always increasing, so we
never needed to back-scan. With the introduction of potentially duplicate `ElemIndex` and `CompIndex` values,
we now simply do a full `[]RawSegElem` scan for each element decl. Yes, it is a bit more expensive,
but given that the total number of elements and components in a seg is usually very small (around
20), we feel this trade-off is acceptable without making the already-complex code even more so.
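
As a rough illustration of the full-scan matching, and not the actual reader.go code, the Go sketch below collects every raw entry whose indices match a declaration. The types `rawSegElem` and `elemDecl`, the function `matchAll`, and the declaration names `code` and `seq` are all hypothetical.

```go
package main

import "fmt"

// rawSegElem is a simplified stand-in for RawSegElem.
type rawSegElem struct {
	Data      string
	ElemIndex int
	CompIndex int
}

// elemDecl is a hypothetical, simplified element declaration:
// a name plus the element and component indices it matches.
type elemDecl struct {
	Name      string
	ElemIndex int
	CompIndex int
}

// matchAll scans the whole raw slice for each declaration instead of
// advancing a single forward cursor, so every repetition sharing the
// same (ElemIndex, CompIndex) pair is collected under the same name.
func matchAll(decls []elemDecl, raw []rawSegElem) map[string][]string {
	out := make(map[string][]string)
	for _, d := range decls {
		for _, r := range raw {
			if r.ElemIndex == d.ElemIndex && r.CompIndex == d.CompIndex {
				out[d.Name] = append(out[d.Name], r.Data)
			}
		}
	}
	return out
}

func main() {
	// The last element of the CLM example, already flattened.
	raw := []rawSegElem{
		{"11", 5, 1}, {"B", 5, 2}, {"1", 5, 3},
		{"12", 5, 1}, {"B", 5, 2}, {"2", 5, 3},
	}
	decls := []elemDecl{{"code", 5, 1}, {"seq", 5, 3}}
	fmt.Println(matchAll(decls, raw))
}
```

The cost is one pass over the raw slice per declaration, which stays cheap at the segment sizes mentioned above. A forward-only cursor would have stopped at the first `{5, 1}` entry and missed the second repetition.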

With this reader change, the IDR produced can contain child element nodes with the same
element name. Thus, when writing a schema, users of the `repetition_delimiter` feature
practically need to use the `array` type in `transform_declarations`.
@jf-tech jf-tech linked a pull request Jul 25, 2023 that will close this issue
@jf-tech
Owner

jf-tech commented Jul 25, 2023

@manuel-neuhauser-hs PR is out #215

Now the output looks like:

			"race_ethnicity_code": [
				"2135-2",
				"2106-3"
			]

Kindly request you to sync to the dev branch: https://github.com/jf-tech/omniparser/tree/rep2 and do some testing before we merge.

@manuel-neuhauser-hs
Author

@jf-tech – Tested our use case and additionally tried several edge cases, and the outputs and errors are as expected. Thanks for the lightning fast addition of this feature. 🚀

jf-tech added a commit that referenced this issue Jul 25, 2023
@jf-tech
Owner

jf-tech commented Jul 25, 2023

> @jf-tech – Tested our use case and additionally tried several edge cases, and the outputs and errors are as expected. Thanks for the lightning fast addition of this feature. 🚀

@manuel-neuhauser-hs And thank you and your company's sponsorship! Highly appreciate it!

@jf-tech
Owner

jf-tech commented Jul 26, 2023

@manuel-neuhauser-hs are you in a hurry and need a new release cut, or can you live with master for a while? The reason we're a bit hesitant about cutting a new release is that it seems you and some other people are actively working on fairly complex EDI parsing, and there could be more issues/requests coming up. Let us know.

@manuel-neuhauser-hs
Author

We're fetching the latest master branch for our needs. No new release necessary. Thanks for double-checking.
