Improve identifier: sourceSystemId #54

Closed
hurni opened this issue Nov 19, 2024 · 6 comments
Labels
enhancement New feature or request jira Tag to activate JIRA integration major v2

Comments

@hurni

hurni commented Nov 19, 2024

identifierType contains id and sourceSystemId.
sourceSystemId accepts a token of length 50. Can we improve this? For example: use a code list, demand a UID (for CH sources) or UID?

https://github.com/blw-ofag-ufag/eCH-0261/blob/2e70dda84d9971ca7fe699da64985de70d643e9a/src/eCH-0261-1-0.xsd#L373C1-L384C17

@hurni hurni added enhancement New feature or request v2 jira Tag to activate JIRA integration major labels Nov 19, 2024
@AFoletti
Member

identifierType defines a really generic way to handle identifiers (which is good, since we use it across all our standards).

I am not sure I fully understand what you mean by "improve" in this context, but I can react to your proposals:

  • To me, a code list is not really feasible, since we want to point to objects and not categories. I would not go that way.
  • dc:source is semantically not ideal, since it has a kind of "derived from" meaning, which is not our use case.
  • dc:publisher can be a good pick but, due to the generic usage of our identifierType, will only be correct in a few instances (where we actually point to a publisher, which is by far not always the case)

My take is that, given the extreme flexibility we need for this identifierType, the current implementation is an acceptable one. I am, however, fully in favour of improvements should we find some.

@hurni
Author

hurni commented Nov 20, 2024

Sadly, I agree... I was hoping you had a magical solution.

identifierType contains id and sourceSystemId. Dream scenario: sourceSystemId is a URI (hence no need for a curated list). However, not every source system has a URI at the moment, nor will in the foreseeable future.
I will propose a lower-level angle directly linked to zoologicalAnimalType and botanicalPlantType.
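
To make the "dream scenario" concrete, here is a minimal sketch of how a consumer could test whether a given sourceSystemId already is an absolute URI. The class and method names are hypothetical, and nothing in the standard requires this check; it only uses `java.net.URI` from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of eCH-0261.
class SourceSystemUriCheck {

    /** Returns true if the value parses as a URI and has a scheme (for example "https:" or "urn:"). */
    static boolean isAbsoluteUri(String value) {
        try {
            return new URI(value).isAbsolute();
        } catch (URISyntaxException e) {
            // Not a syntactically valid URI at all, e.g. it contains spaces.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAbsoluteUri("https://example.org/systems/a"));
        System.out.println(isAbsoluteUri("SYSTEM_A"));
    }
}
```

A plain token such as "SYSTEM_A" still parses as a relative reference, so the check relies on `isAbsolute()` rather than on parsing alone.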

@hurni hurni closed this as completed Nov 20, 2024
@montanajava
Contributor

Hi all. What do you think of this?

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <!-- Define the enum type for sourceSystemId -->
    <xs:simpleType name="sourceSystemEnumType">
        <xs:restriction base="xs:token">
            <xs:enumeration value="SYSTEM_A"/>
            <xs:enumeration value="SYSTEM_B"/>
            <xs:enumeration value="SYSTEM_C"/>
        </xs:restriction>
    </xs:simpleType>

    <!-- Define an id token type with a length restriction -->
    <xs:simpleType name="idToken">
        <xs:restriction base="xs:token">
            <xs:maxLength value="50"/>
        </xs:restriction>
    </xs:simpleType>

    <!-- Define the union type that combines enum and restricted-length token -->
    <xs:simpleType name="sourceSystemUnionType">
        <xs:union memberTypes="sourceSystemEnumType idToken"/>
    </xs:simpleType>

    <!-- Example element using the union type -->
    <xs:element name="record">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="id" type="xs:string"/>
                <xs:element name="sourceSystemId" type="sourceSystemUnionType"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

This way, you could define common known source systems in sourceSystemEnumType, while allowing freeform IDs for all other systems.

You would have to get consensus from the working group for the entries that would go into sourceSystemEnumType. Subsequent expansion of those entries would be a minor change to the spec. Anything else would be a major change.

The proposed solution is backwards-compatible with current systems, i.e., anything goes.
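
As a rough illustration of "anything goes", the sketch below validates a few values against a trimmed copy of the proposed schema using the JDK's `javax.xml.validation` API. The class name and the reduction to a single top-level `sourceSystemId` element are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class; the type names are taken from the proposal above.
class UnionTypeDemo {

    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:simpleType name='sourceSystemEnumType'>"
        + "<xs:restriction base='xs:token'>"
        + "<xs:enumeration value='SYSTEM_A'/>"
        + "<xs:enumeration value='SYSTEM_B'/>"
        + "<xs:enumeration value='SYSTEM_C'/>"
        + "</xs:restriction></xs:simpleType>"
        + "<xs:simpleType name='idToken'>"
        + "<xs:restriction base='xs:token'><xs:maxLength value='50'/></xs:restriction>"
        + "</xs:simpleType>"
        + "<xs:simpleType name='sourceSystemUnionType'>"
        + "<xs:union memberTypes='sourceSystemEnumType idToken'/>"
        + "</xs:simpleType>"
        + "<xs:element name='sourceSystemId' type='sourceSystemUnionType'/>"
        + "</xs:schema>";

    /** Validates <sourceSystemId>value</sourceSystemId>; the value must not contain XML markup. */
    static boolean isValid(String value) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            String xml = "<sourceSystemId>" + value + "</sourceSystemId>";
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            // The default error handler raises an exception on the first validation error.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("SYSTEM_A"));      // matches the enumeration member
        System.out.println(isValid("SYSTEM_a"));      // matches the idToken member
        System.out.println(isValid("x".repeat(51)));  // longer than 50, matches neither member
    }
}
```

Because a union accepts a value as soon as any member type accepts it, the enumeration adds documentation but no extra restriction at the schema level.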

So where is the improvement?

The improvement comes for consuming systems that wish to constrain allowed values: There, you can create validators that insist that only the entries of the enumeration are used. In the Java world, a rudimentary example might look something like this:

Java Validator Example

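import java.util.Arrays;
import java.util.List;

public class SourceSystemIdValidator {

    // The values would be read from the XSD ...
    private static final List<String> VALID_ENUMS = Arrays.asList("SYSTEM_A", "SYSTEM_B", "SYSTEM_C");

    // Record stands for the consuming application's own class exposing getSourceSystemId().
    // Rejects any record whose sourceSystemId is not one of the enumerated values.
    public static void validate(Record record) throws IllegalArgumentException {
        if (!VALID_ENUMS.contains(record.getSourceSystemId())) {
            throw new IllegalArgumentException("Invalid sourceSystemId value: " + record.getSourceSystemId());
        }
    }
}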
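
The comment "The values would be read from the XSD" could be realised roughly as follows. This is a sketch using the JDK's DOM parser; the class name, the method name, and the choice to look the type up by its `name` attribute are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the proposal.
class EnumValuesFromXsd {

    static final String XS = "http://www.w3.org/2001/XMLSchema";

    /** Collects the xs:enumeration values of the named xs:simpleType, in document order. */
    static List<String> enumerationValues(String xsd, String typeName) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed for the namespace-based lookups below
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xsd)));
            List<String> values = new ArrayList<>();
            NodeList types = doc.getElementsByTagNameNS(XS, "simpleType");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                if (!typeName.equals(type.getAttribute("name"))) {
                    continue;
                }
                NodeList enums = type.getElementsByTagNameNS(XS, "enumeration");
                for (int j = 0; j < enums.getLength(); j++) {
                    values.add(((Element) enums.item(j)).getAttribute("value"));
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read enumeration values from XSD", e);
        }
    }
}
```

Calling `enumerationValues(xsdText, "sourceSystemEnumType")` on the schema above would yield the three placeholder systems, which could then replace the hard-coded list.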

An implementation at the DB level could look something like this:

SQL Implementation Example

CREATE TABLE my_agricultural_imported_data (
    id VARCHAR(255) NOT NULL,
    source_system_id VARCHAR(10) NOT NULL,
    CONSTRAINT chk_source_system_id CHECK (
        source_system_id IN ('SYSTEM_A', 'SYSTEM_B', 'SYSTEM_C')
    )
);

@AFoletti
Member

@montanajava
Hey! Thanks for the proposal.
I may be missing something, but I fail to understand how you can implement a real check if the sourceSystemId is both an enumType AND freeform at the same time.

   <xs:simpleType name="sourceSystemEnumType">
        <xs:restriction base="xs:token">
            <xs:enumeration value="SYSTEM_A"/>
            <xs:enumeration value="SYSTEM_B"/>
            <xs:enumeration value="SYSTEM_C"/>
        </xs:restriction>
    </xs:simpleType>

What if I enter "SYSTEM_a"? Is it a typo that should really be "SYSTEM_A", or is it actually another system and I am using the freeform flexibility given to me? There is no way to differentiate.
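
The ambiguity can be shown in a few lines. In the sketch below, the case-insensitive comparison is a heuristic assumed for the example and is not part of the proposal; the most a consumer can do with it is flag a value as suspect, not decide what the sender meant.

```java
import java.util.List;

// Hypothetical classifier illustrating the ambiguity.
class SourceSystemClassifier {

    enum Kind { KNOWN, SUSPECT, FREEFORM }

    // Placeholder values copied from the proposed enumeration.
    static final List<String> KNOWN_SYSTEMS = List.of("SYSTEM_A", "SYSTEM_B", "SYSTEM_C");

    static Kind classify(String value) {
        if (KNOWN_SYSTEMS.contains(value)) {
            return Kind.KNOWN;
        }
        for (String known : KNOWN_SYSTEMS) {
            // Differs from a known entry only by letter case: typo or a distinct system?
            if (known.equalsIgnoreCase(value)) {
                return Kind.SUSPECT;
            }
        }
        return Kind.FREEFORM;
    }

    public static void main(String[] args) {
        System.out.println(classify("SYSTEM_A"));
        System.out.println(classify("SYSTEM_a"));
        System.out.println(classify("MY_OWN_SYSTEM"));
    }
}
```

All three inputs are valid against the union type; only the consumer-side heuristic separates them, and it cannot resolve the SUSPECT case on its own.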

@montanajava
Contributor

montanajava commented Nov 28, 2024 via email

@AFoletti
Member

AFoletti commented Nov 28, 2024

Understood.
To me, this builds unnecessary complexity into the model. Either we enforce a code list (not possible...) or we leave it freeform, and if someone can or wants to define a set of valid values for a very specific use case, they are still free to do so.

This could of course result in data that conforms to the eCH standard but not to the tighter restrictions imposed by the specific use case, which is in my opinion perfectly acceptable.
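
The two levels of conformance can be sketched as separate checks. The 50-character limit is taken from this thread, the allowed set is a made-up use-case list, and the whitespace handling only approximates how `xs:token` values are normalised.

```java
import java.util.Set;

// Hypothetical two-tier check: standard-level rule plus a stricter, use-case-specific rule.
class TwoTierCheck {

    static final int MAX_LENGTH = 50; // limit mentioned in this thread

    // Made-up allow list for one specific use case.
    static final Set<String> USE_CASE_ALLOWED = Set.of("SYSTEM_A", "SYSTEM_B");

    /** Approximates xs:token normalisation: trim, then collapse whitespace runs. */
    static String normalize(String value) {
        return value.trim().replaceAll("\\s+", " ");
    }

    /** Freeform rule: any token of at most MAX_LENGTH characters. */
    static boolean conformsToStandard(String value) {
        return value != null && normalize(value).length() <= MAX_LENGTH;
    }

    /** Stricter rule that a single consumer may add on top of the standard. */
    static boolean conformsToUseCase(String value) {
        return conformsToStandard(value) && USE_CASE_ALLOWED.contains(normalize(value));
    }

    public static void main(String[] args) {
        String value = "SYSTEM_Z";
        System.out.println(conformsToStandard(value)); // standard-level check
        System.out.println(conformsToUseCase(value));  // use-case-level check
    }
}
```

A value such as "SYSTEM_Z" passes the first check and fails the second, which is exactly the acceptable outcome described above.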
