Annotation Data Design #40
Proposed design:
(Entity details still very much a work in progress) |
Some of the known issues with the current Alpheios morphology service engines can serve as a set of use cases for annotations on morphological data. Reference: alpheios-project/morphsvc#38. Summary: The Whitaker Engine of the Morph Service reports the lemma of 'afore' as 'afore'. While it's possible that 'afore' is an accepted lemma variant of 'absum', our inflection table and full definition for this verb are keyed off the lemma 'absum' as the "canonical" lemma. Entity Nodes:
Edges:
Sample Query: https://gist.github.com/balmas/e7e0e6bc16f2501f3ca06f7462203f70 Reference: alpheios-project/morphsvc#29. Summary: The Whitaker Engine of the Morph Service is missing the identification of the vocative case as a possible inflection of the form senatu of the lemma senatus. Entity Nodes:
Edges:
Sample Query: https://gist.github.com/balmas/f6e55dc3b3551a60d034ef131798ba4d Reference: alpheios-project/morpheus#28. Summary: The Morpheus Engine of the Morph Service parses τίνος with the Lemma τίς, specifying a pos of irregular (along with a parse of a demonstrative pronoun). The irregular lemma should be the interrogative pronoun τίς with one genitive singular inflection. Entity Nodes:
Edges:
Sample Query: https://gist.github.com/balmas/ecc9db3da04fbf32d3e0f8efdf6b2774 Reference: alpheios-project/morpheus#32. Summary: The Morpheus Engine of the Morph Service doesn't parse the word μεμνήμεθα because it only recognizes this word by the alternate spelling μεμνῄμεθα. Entity Nodes:
Edges:
Sample Query: https://gist.github.com/balmas/f402883b85041e5227737509be6adce3 (Additional use cases from the morph bugs can be found at https://docs.google.com/spreadsheets/d/1ej-7dAntWQZVASg7aQp0P-PRo2u9Nkn3LYChclYtDVo/edit?usp=sharing) |
I need more time to read and understand the idea, will work on it tomorrow. |
Need time to study this and mull over it as well |
About the data structure: if I understand correctly, the structure has two main entities - Word (I believe that in terms of the Alpheios extension it is TargetWord) and User - and all other entities are arranged around them:
And here I have some questions:
How are these roles defined in the model? Where would users' rights for words, tokens, and comments be defined?
Judging from the examples, I think it is a worthy structure, but it is really difficult for me to see how it would work with all languages in the GraphQL paradigm. |
These are really good questions.
I think it's not really true that Word and User are the main entities. There can be relationships that don't involve either of them -- for example, inflectionA canBeInflectionOf lemmaA, and so on. The connection points to Alpheios applications will in many cases be specific to a User and a Word, but they are not the core of the data model. User does have a somewhat special place in the model, though, because it is a User's assertion that a relationship between entities is true or false that makes the data usable. But it isn't true that a user of Alpheios will only have access to data that has been asserted as true or false under their own user id. We will have to give users control over what data they do and do not see, based upon who asserted it.

We also have to give users control over whether the data they create is available to other users. For now, we have decided that there are 2 possibilities: public or private. In the future it is very likely we will need to be able to express finer-grained levels of access - such as group-level, site-level, etc. - but to start we are going to support these two.

So, for example, as the "alpheios.net" user, we may publish corrections to the results of the morphological parsers as annotations (these are the use cases I've described above). The assertions of their truth will be made by the "alpheios.net" user (exact identifier TBD) and they will be available to anyone. On the client side, a user will have a choice of which data to retrieve -- they will be able to say, for example, 'give me all data asserted by alpheios.net and by myself, and no other' or 'give me all data that is publicly asserted, but exclude data that is asserted by userx'. Or maybe even 'give me all data asserted by alpheios.net and myself, plus any data that is public and which has been asserted X number of times' (the implicit assumption being that the more people agree with a statement, the more likely it is to be correct).
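To make those retrieval choices concrete, here is a minimal sketch of what such a client-side query might look like. The endpoint URL and the `assertedBy`/`isPublic` argument names are assumptions for illustration, not the actual API:

```js
// Hypothetical query: restrict annotation data to assertions made by
// alpheios.net and the current user. Field and argument names are
// guesses, not a finalized schema.
const query = `
  query AnnotationsForWord($representation: String!, $lang: String!) {
    word(representation: $representation, lang: $lang) {
      relations(assertedBy: ["alpheios.net", "urn:me"], isPublic: true) {
        type
        assertions { userId value }
      }
    }
  }
`;

// Placeholder endpoint; any GraphQL client could be used instead of fetch.
const res = await fetch("https://example.org/graphql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query, variables: { representation: "afore", lang: "lat" } })
});
console.log((await res.json()).data);
```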
I need to think more about this. For the moment, there are 2 roles identified in the data model: (1) creator of data (the user that put the data into the database) and (2) subject or object of an assertion (as in User X assertsTrue Relationship Y). At the moment, I'm thinking that access restrictions would be tied only to the former, and the isPublic property on the assertion is what gates access to it. In this approach, it means that we CANNOT make the data, or a query endpoint to it, directly publicly available; all access would have to be gated through queries that enforce the access restriction rules. This is how the wordlist works currently, although it's not a graph-based model. There are certainly other roles with respect to data that it might be good to record, some of which get quite philosophical (who actually "created" the words of a text that is aligned? the author of the text?).

I should also explain that, right now at least, I don't envision making this database the source of truth for User information. I would like to keep that data separate, as it is much more sensitive. I think we will continue to use Auth0 for our User database, and at a minimum, just the opaque userId would be available in this new shared data store. Additional information about themselves that a user chooses to make public might be retrieved from the Auth0 database at runtime, or synced, depending upon performance. But I want to keep user-identifying data separate from this new data store as much as possible. We WILL have to make enhancements to our use of Auth0 to support this, allowing users to enter and edit profile information.
I thought a lot about this and I'm not sure what makes the most sense. For the initial modeling, I used a document property for language to limit the proliferation of document collections while I was still trying to figure out what they should be, but it could definitely be better to have individual collections of documents per language. I think we'll have to analyze the performance of different queries to decide. The indexing options in ArangoDB are pretty flexible and we can index on property values, but using a language property does introduce potential for error in the data unless we enforce a schema on its values.
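For reference, a minimal sketch of the two options using the arangojs driver (v7-style API; collection names invented for illustration):

```js
import { Database, aql } from "arangojs";

const db = new Database({ url: "http://localhost:8529" });

// Option A: one shared collection with a `lang` property, made
// queryable via a persistent index on that property.
const words = db.collection("words");
await words.ensureIndex({ type: "persistent", fields: ["lang"] });
const latin = await db.query(aql`
  FOR w IN words
    FILTER w.lang == "lat"
    RETURN w
`);

// Option B: one collection per language -- no filter or index needed,
// but also no `lang` property that could be entered inconsistently.
const latinWords = db.collection("words_lat");
```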
I think Token probably does need to be represented because it isn't exactly the same as Word -- i.e. multiple Tokens could be combined to make up a single Word. In order to make that connection though, we would have to retain that information from the tokenizer. It's something that needs to be thought through more. |
Thanks for the diagrams and the detailed description. I have several small questions:
Is the lower lemma a lemma variant of the lemma above?
I like the idea of separating nodes (entities) and edges (connections) very much. It creates a strong concept and a meaningful vocabulary to represent the concepts of handling the lexical data. Many details of the implementation are not defined yet but I think there is already a solid foundation that we can use to move forward. |
Yes, in this case it's showing 2 separate Lemma nodes, connected by the isLemmaVariant edge
I think we will need to have delete protection on anything that can be pointed at, which includes edges (that can function as nodes). Otherwise we will end up with meaningless assertions of the truth or falseness of a relationship. I think this may also mean that edges cannot be edited substantially once they are created, otherwise it would put into doubt the validity of the assertions that point at them. I think anything that can be pointed at might need to be in a frozen state once it participates in a relationship.
Yes, I think the graphs will be built dynamically based upon the query.
Definitely we cannot store everything that might ever be part of the graph, but we need to store things once they become part of the graph. That is, it's at the point at which someone annotates a relationship asserted by an external resource that the relationship (and the nodes in it) will get added to the database. When we have a persistent IRI for an external resource (which right now is rare) we should use it. We can also use properties to identify the original source of data (see, for example, the lemma properties in the sample query at https://gist.github.com/balmas/f6e55dc3b3551a60d034ef131798ba4d where I am specifying that the data I'm looking for annotations on has come from the "net.alpheios:tools:wordsxml.v1" source).
Yes, I have some naive first attempts at the api in the Gists I've linked to above in the sample use cases. |
I'm afraid that delete protection would be pretty hard to manage. I've started to wonder whether we could find a way around it without imposing and maintaining such restrictions. I might be wrong, but it seems to me that the nodes are more "stable" pieces than the edges in the lexical data model. If there is a lexeme, or an inflection, their existence is probably a more-or-less reliable truth. The question usually arises around whether a particular inflection belongs to a particular lexeme, or to several lexemes (in theory it might not belong to anything at all if it is considered incorrect), i.e. whether there should be a relationship (an edge) between the one and the other.

Someone may say that "A is an inflection of lexeme B". This statement not only asserts the relationship, it also establishes the relationship itself (creates an edge) between "inflection A" and "lexeme B" (if such an edge was not already established by another assertion before). The relationship is based solely on the assertion, if we can put it this way. If the assertion is revoked, the relationship should be destroyed too.

I'm wondering if, in situations like this, it would make sense to store the relationship along with the corresponding assertion. If the assertion is edited, the relationship is edited too. If the assertion is destroyed, the relationship should cease to exist too. It's like storing parts of a graph separately. When we construct a graph for a specific word, we can check if there are any relationships for its lexemes that are part of it, and if there are, the final graph will reflect that. I think this would make the management of relationships more flexible. If we want to remove or edit a relationship or an assertion, we can do it all in one place. There will be no need to "lock" the whole graph or parts of it. I think this way it should be easier to manage the lexical data than when it's all in one complex graph. Does such an approach make sense? What do you think? |
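A minimal sketch of the edge-lifecycle idea just described, assuming an ArangoDB-style edge document (all collection and field names are invented for illustration): the assertion lives on the edge itself, so revoking the assertion removes the relationship in the same operation.

```js
// Hypothetical edge document: the relationship and the assertion that
// created it are stored together.
const edge = {
  _from: "inflections/A",
  _to: "lexemes/B",
  type: "isInflectionOf",
  assertion: { userId: "urn:userX", value: true, created: "2020-05-01" }
};

// Revoking the assertion destroys the relationship too -- no orphaned
// edge is left behind, and nothing needs to be "locked".
async function revokeAssertion(db, edgeKey) {
  await db.collection("lexicalRelations").remove(edgeKey);
}
```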
The service for decentralized annotations publishing is dokieli. It uses, among others, the following technologies:
We might also look at Solid, as it is within pretty much the same problem domain as well. |
We should also look at https://www.hypergraphql.org/ when designing the GraphQL API |
https://linkeddatafragments.org/ is relevant to suggestions from both @irina060981 and @kirlat |
Revised sample use cases: Reference: alpheios-project/morphsvc#38. Summary: The Whitaker Engine of the Morph Service reports the lemma of 'afore' as 'afore'. While it's possible that 'afore' is an accepted lemma variant of 'absum', our inflection table and full definition for this verb are keyed off the lemma 'absum' as the "canonical" lemma. LexicalEntity Nodes:
LexicalEntityRelation Edges:
User Collection (not part of the graph)
Prototype GraphQL Query: https://gist.github.com/balmas/e7e0e6bc16f2501f3ca06f7462203f70 Reference: alpheios-project/morphsvc#29. Summary: The Whitaker Engine of the Morph Service is missing the identification of the vocative case as a possible inflection of the form senatu of the lemma senatus. Entity Nodes:
Edges:
User Collection (not part of the graph)
Sample Query: https://gist.github.com/balmas/f6e55dc3b3551a60d034ef131798ba4d Reference: alpheios-project/morpheus#28. Summary: The Morpheus Engine of the Morph Service parses τίνος with the Lemma τίς, specifying a pos of irregular (along with a parse of a demonstrative pronoun). The irregular lemma should be the interrogative pronoun τίς with one genitive singular inflection. Entity Nodes:
LexicalRelation Edges:
Sample Query: https://gist.github.com/balmas/ecc9db3da04fbf32d3e0f8efdf6b2774 Reference: alpheios-project/morpheus#32. Summary: The Morpheus Engine of the Morph Service doesn't parse the word μεμνήμεθα because it only recognizes this word by the alternate spelling μεμνῄμεθα. Entity Nodes:
**LexicalRelation Edges**:
@irina060981 and @kirlat thank you both for your feedback and for talking me off the complexity ledge :-) Above is a revised approach to the data model, based upon your suggestions and the additional reading mentioned above. A few things to point out:
Also, in case it helps with understanding this, my code for the ArangoDB prototype where I have been working through all of this is at https://github.com/alpheios-project/arangodb-svcs |
Thanks for the detailed description! I like the new model; I think it's much more flexible and extendable now. A few comments on it:
I think it does not matter which ontology we use as long as we specify the ontology IRI along with the ontology terms. This will provide the client with a reference to the ontology and make things unambiguous. I think this, even though more verbose, will give us the flexibility to use any ontologies we want without limitations.
I agree with keeping the edges language-agnostic. An edge is a connection between two nodes and it does not, in my opinion, "belong" to any language by itself the way a word does - a word being an entity that carries language-specific data. I think we can use language-based indexes to group nodes into language-based collections. Edges can be included in such collections based on the nodes they connect: if those nodes belong to a certain language, the edges can be included in the collections for that language too.
I think we can also have comments on edges shown on the lexical entities graph if we introduce edges as entities in the query. We can have something like:
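A hypothetical reconstruction of what such a query might look like (the `edge`/`node` field names are guesses, not the actual elided snippet):

```js
// The edge appears as an entity of its own in the query, so its
// comments can be selected alongside the node it points to.
const query = `
  {
    lexeme(id: "lexemes/absum") {
      definitions {
        edge {            # the connection itself, exposed as an entity
          type
          comments { userId text }
        }
        node {            # the definition at the other end of the edge
          text
          lang
        }
      }
    }
  }
`;
```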
I've seen this approach used in GraphQL queries on many occasions, including the popular Gatsby generator. We can do something along those lines.
I think we can add user information to the edge using the approach shown above. Extending the example above, we could have something like:
Or, if there are multiple users, we can show an array of users using the plural form:
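A sketch of what that might look like; the `user`/`users` field names are assumptions:

```js
// Extending the previous query: a single asserting user, or an array
// of all users who made statements about the connection.
const query = `
  {
    lexeme(id: "lexemes/absum") {
      definitions {
        edge {
          type
          user { id }        # the single creator of the edge
          users { id role }  # or every user who touched it
        }
        node { text lang }
      }
    }
  }
`;
```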
Maybe we can use a timestamp as a version? In that case, if we would like to assemble the latest version of a graph, we'll use entities with the most recent timestamp. If we would like to go back to a previous version, we can specify a certain point in time and assemble a graph from entries that have a timestamp below that date. Of course, that would not work if we need to establish specific snapshots that are not time-synced (such as a special version of a graph for some particular purpose). |
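For illustration, a sketch of timestamp-as-version retrieval with the arangojs driver: pick the newest entity at or before a chosen point in time (collection, field, and IRI values are made up):

```js
import { Database, aql } from "arangojs";

const db = new Database({ url: "http://localhost:8529" });
// Reconstruct the graph as it looked on this date.
const snapshot = Date.parse("2020-06-01T00:00:00Z");

const cursor = await db.query(aql`
  FOR e IN lexicalEntities
    FILTER e.iri == "urn:alpheios:lexeme:absum"
       AND e.timestamp <= ${snapshot}
    SORT e.timestamp DESC
    LIMIT 1
    RETURN e
`);
console.log(await cursor.next());
```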
I'm wondering if our GraphQL queries might be simpler if we introduce edges into them. For example, if we take a request from https://gist.github.com/balmas/e7e0e6bc16f2501f3ca06f7462203f70 and change it to something like:
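A guess at what such an edge-aware version of the gist's query might look like (field names are invented):

```js
const query = `
  {
    word(representation: "afore", lang: "lat") {
      lemmas {
        edge { type }                  # the relationship itself
        node { representation lang }   # the lemma it points to
      }
    }
  }
`;
```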
Then the response might be something like:
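And a guess at the corresponding response shape (values are invented for illustration):

```js
const response = {
  data: {
    word: {
      representation: "afore",
      lemmas: [
        {
          edge: { type: "isLemmaVariant" },
          node: { representation: "absum", lang: "lat" }
        }
      ]
    }
  }
};
```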
I might have messed up some syntax and details, but I hope the code above conveys the idea. What do you think? Would it work for us? Would it make things simpler? |
I also have a question about the query parameters of the sample query. Would it make sense to use a less formal but simpler approach, and specify only the fields whose values actually serve as a filter? Something like:
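A sketch of the "fields as filters" style (hypothetical field names): only the properties that actually constrain the search are passed.

```js
const query = `
  {
    words(filter: { representation: "senatu", lang: "lat" }) {
      lemmas {
        node { representation }
      }
    }
  }
`;
```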
This would be consistent with usage examples I've seen and would make the query simpler. What do you think? Does it make sense? |
Yes, I agree -- this is the nice thing about the GraphQL api being separate from the database implementation. Even if in the database the comments are in a separate graph from the data they comment on, we can present them as a single graph in the GraphQL API. |
I agree timestamps make sense as a way to identify versions. I am not sure how many past versions of data we want to keep available in the graph, though. I like the approach of having a non-versioned IRI for data that always returns the latest version, and referencing the prior versions in the data (per the approach described in http://lrec-conf.org/workshops/lrec2018/W23/pdf/2_W23.pdf), but I don't think we will keep unlimited versions of all data points. This may not be too big of an issue, though, because for the most part, at least for the lexical data, we are talking about very small data objects that likely won't change once created. We could also have different statuses for data, such as draft and published, and only allow referencing published data. |
I think this is an interesting suggestion. The types of relationships (edges) that we might query are many and will grow over time, and the same input will feed into many of them. I think we can use variables in GraphQL to keep the queries concise (i.e. to keep from repeating the same expanded lemma object over and over). But I think your suggestion is very much in line with the approach outlined in the linked data fragments proposal, in that it puts it in the hands of the client to know exactly what it is asking for. |
I think generally I agree with you. I'm a little uncertain about the example though. In the data model, word is an abstract object (based upon the Ontolex ontology https://www.w3.org/community/ontolex/wiki/Final_Model_Specification), with a property "representation" that contains the actual letters that make up the written representation of the word. We can hide that detail from the client of course in the GraphQL api, but that's why "representation" is there along with "pos" and "lang" which are also properties. |
I've added #42 for the discussion of the annotation UI concepts. |
See #43 for discussion of PIDs for data objects. |
Note that we might want to consider TEI Lex-0 as a possible export format for the lexical data: https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html# |
I've started to work on the implementation of annotations for short definitions and I think our requirements dictate a change in the way we store and serve lexical data within the application. The major driver for the change is the requirement for the user to specify dynamically which annotations should be applied to the data model displayed in the UI. I think the best way to achieve this is to move from static props to methods that return data dynamically, based on the user preferences supplied to them. Let's take a It seems very similar in approach to GraphQL, where any field requested may have options that specify what data should be returned and how it must be filtered.

If we accept the approach above, the following implementation would make sense, in my opinion. Upon a lexical query request for the target word, a word object containing all information related to the specified target word is returned from the GraphQL facade. This information, as returned, is - maybe with some amendments made by the lexical query - stored within the The user would then use methods of the In order to avoid data duplication and to let data changes be reflected in all instances of the returned objects, the node data has to be references to the "original" objects stored within the instance of the

It could be backward compatible, as the "old" props would be combined with the "new" methods within the same objects. The annotation-aware code would call the "new" methods while the existing code would use data from the props. When the user wants to annotate (edit) a connection (an edge), as when saying "D is not a definition of the lexeme L", a method of the The data objects that were created as results of the The

Would an approach like this allow us to achieve what we want? I think it's more complex but infinitely more flexible. As we were discussing before, the What do you think about the approach? Do you see any pitfalls in it? |
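A minimal sketch of the prop-to-method shift described above. All class and method names are assumptions about a possible future API, not existing Alpheios code:

```js
// Instead of a static `definitions` prop, a method assembles the
// definitions on demand, honoring the caller's annotation preferences.
class LexemeModel {
  constructor({ definitions = [], annotations = [] } = {}) {
    this._definitions = definitions; // raw data from the GraphQL facade
    this._annotations = annotations; // annotations that may alter it
  }

  // Returns definitions with the chosen users' annotations applied:
  // a definition negated by a respected user is filtered out.
  getDefinitions({ respectUsers = [] } = {}) {
    return this._definitions.filter(
      (def) =>
        !this._annotations.some(
          (a) =>
            a.type === "negatesDefinition" &&
            a.definitionId === def.id &&
            respectUsers.includes(a.userId)
        )
    );
  }
}

// Usage: same object, different views of the data.
const lexeme = new LexemeModel({
  definitions: [{ id: "d1", text: "to be away" }],
  annotations: [{ type: "negatesDefinition", definitionId: "d1", userId: "urn:me" }]
});
lexeme.getDefinitions();                             // [ { id: "d1", ... } ]
lexeme.getDefinitions({ respectUsers: ["urn:me"] }); // []
```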
I agree this is the right direction. In #37 we also have proposed changes to the data model to introduce the |
I think this is a good point. If not all clients of the data models need annotations, annotation support should not be in the data model package. I think we should try to keep annotation-related logic and the "regular" business logic separated, if possible. I think it's too hard to build such a model purely theoretically, so we should probably try to implement it in code, keeping in mind a separation of knowledge domains. |
This is how I think we should represent edges. All edges should always represent asserting, not negating, statements, like "D is a definition of lexeme L". The reason for this, I think, is that an asserting statement creates a connection. There is no reason to deny something that does not exist in the first place. So the connection should always be created first, before any statements can be made about it. So here is a statement that defines an edge: Then we can start to gather statements about this edge. The statements could be either assertions, confirming that this connection is valid, or negations, denying the connection's existence. There could be multiple instances of both from various users. The first assertion should probably come from the party that created it (as Statements could be attached to the connection as metadata:
So in this case we have two assertions and three negations, and under normal conditions the connection should not be used during the lemma construction: the definition should not be attached to the lexeme. But if the user sets an option to respect only his/her own statements, then the definition should be attached to the lexeme: we have 1 assertion versus 0 negations. When someone creates an assertion or a negation, it will be passed to the GraphQL API in order to be stored as the edge's metadata. If every edge has its own unique ID, that would be easy to do: we would need to pass an assertion/negation and the ID of the edge it should be attached to. There are also comments. I think we should be able to attach comments to anything that has an ID: to a node, to an edge, or to another comment (that would allow threaded discussions to be created, if necessary). So if the connection has comments, it would look like the below:
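A sketch of what such an edge, with statements and comments as metadata, might look like, plus the decision rule just described (shapes and names are illustrative):

```js
const edge = {
  type: "isDefinitionOf",
  statements: [
    { userId: "alpheios.net", value: true },
    { userId: "urn:user1", value: true },
    { userId: "urn:user2", value: false },
    { userId: "urn:user3", value: false },
    { userId: "urn:user4", value: false }
  ],
  comments: [{ userId: "urn:user2", text: "This sense looks too narrow." }]
};

// Count assertions vs. negations, optionally respecting only one
// user's statements.
function isConnectionValid(edge, { onlyUserId = null } = {}) {
  const stmts = onlyUserId
    ? edge.statements.filter((s) => s.userId === onlyUserId)
    : edge.statements;
  const assertions = stmts.filter((s) => s.value).length;
  return assertions > stmts.length - assertions;
}

isConnectionValid(edge);                              // false: 2 vs 3
isConnectionValid(edge, { onlyUserId: "urn:user1" }); // true: 1 vs 0
```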
Let's say that the
Now let's assume that someone wants to create a new definition for the existing lexeme. In that case we'll need to:
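A guess at the steps the list might contain, expressed as a hypothetical GraphQL mutation (names and shapes are invented): create the Definition node, create the edge connecting it to the existing Lexeme, and attach the creator's first assertion to that edge.

```js
const mutation = `
  mutation {
    addDefinition(
      lexemeId: "lexemes/absum"
      definition: { text: "to be away, be absent", lang: "eng" }
      assertion: { userId: "urn:me", value: true }
    ) {
      id
      edge { id statements { userId value } }
    }
  }
`;
```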
Would something like this work? If so, I will create GraphQL transactions around them. |
Agree this is an interesting point. However, to be clear, it's not just annotations we're talking about here. Another reason for this refactoring is that we need to be able to retrieve resources from a wider variety of sources and let the user choose which to include and how to combine them. But that's also the point of using GraphQL -- it is supposed to address just this use case. We should keep the business logic for combining resources behind the GraphQL facade, but I'm not sure that means we shouldn't have a method on the data model object to specifically request the data according to the user preferences. I think it's good to proceed cautiously here and at each step ask ourselves if we have appropriate separation of concerns. |
I need to think about this a bit. Is the primary difference between this and the original model I proposed (and then revised to remove the assertions as nodes) that, rather than assertions/negations being nodes with a user at one end and an edge (treated as a node) at the other, they are properties of the edge itself? |
I was drawing a lot of diagrams on paper picturing possible ways to express lexical relationships and then trying to match them to our existing data structures and to the possible GraphQL API. What I've described is the simplest way I've found to achieve what we want. There is an edge between the lexical nodes and the user, but it is straightforward: it has no metadata attached, is always one-to-one, and would never be amended once created, so I decided to omit it and show users as properties. Technically it is still an edge; I just exposed it this way for simplicity. I think it's very similar to your approach:
I believe the "main" graph should portray relationships between lexical entities only. Users represent a different concept and I think they probably should not be on the graph. Having them as props that hold a reference to the user object in the user collection (as multiple objects could refer to the same user) should be sufficient for us, I think.
I liked this approach and tried it first, but I think it's too complex and would create too many issues with representing it in both GraphQL and JS objects. So I've tried to replace it with a simpler one: one edge with many assertions attached. The sum of those assertions would decide how "strong", or "valid", the connection is. So I think the major difference is that I suggest replacing multiple relationships, each representing an individual assertion or negation, with a single relationship that has many assertions/negations attached to it (unless I'm missing any other important points). I think it would be much simpler to store it in the DB this way and to present it in GraphQL results.
Similar to users, I think it's simpler not to create an edge, but just to attach a comment object to either a node, an edge, or another comment. First, comments are conceptually different from lexical entities; they probably belong to another, "non-lexical" dimension. Second, we have to be able to add comments to relationships (edges), but then we wouldn't be able to do so, because we'd have to create an edge between an edge (the lexical relationship we want to comment on) and the comment itself (a node), and edges can connect nodes only. So we should not have an edge here, I think. Those are my thoughts on this. What do you think? It's still fresh in my head and not fully formalized, but I think it's probably enough to represent some adjustments to the concept. |
I think your approach to the assertions/negations is worth trying. It is probably easier to support than having edges as negations, and I agree with the philosophical point that creating an edge to say a relationship doesn't exist is counterintuitive. However, we need to be able to have properties on the assertions other than the user -- they also need, for example, a level of confidence and creation dates. For comments, I'm a little less certain. I agree comments are probably in a separate dimension, but we have to also consider the use case of comments on other comments. Maybe they need to go the other way -- i.e. comments are nodes, and there can be a commentsOn relationship between two Comment nodes, but a comment on a LexicalRelationship references the LexicalRelationship edge it comments on as a property? |
Should have no problems with it, I think.
What if, in order to solve this conundrum, we follow the FB approach and split comments into "comments" and "replies"? Comments could be attached to both lexical entities and lexical relationships. But if someone wants to add a comment on a comment, that would be a reply, and there would be an edge connecting the comment and the reply, or two replies if it's a threaded discussion. It would also be in line with the original meanings of the terms: https://ux.stackexchange.com/questions/118624/comment-vs-reply-in-a-social-feed. Here is a piece of documentation confirming that FB treats comments and replies differently. Not sure what the reason is, but maybe they faced issues similar to what we're trying to solve. We could probably think of it as different transparent planes stacked on top of each other. The base plane is the one with the lexical relationships graph. The one on top is comments/replies. A comment prop on the lexical graph plane may become a node on the comments plane to which replies can be attached. So the comments/replies graph would exist only if there are replies to a comment. There would be multiple reply graphs, each having a comment as a root node. |
Hmm. It's not clear to me from that FB link that Facebook really treats comments and replies separately -- to get the comments on a comment you access the /comments edge, and it says that a comment may be a reply. I think we have the following use cases for comments: (1) a comment on a lexical entity node, (2) a comment on a lexical entity relationship (an edge), and (3) a comment on a comment. For (1) and (3) it seems pretty clear to me that the comment should be a separate node, and the relationship between the comment and the thing it comments on is an edge. For (2) it's murkier, but it seems like perhaps we still create the comment as a node, but here it is referenced as a property of the lexical entity relationship, and then comments/replies to it are in the comments/replies graph. I think this is essentially what you were suggesting, except I think the comment should always be a node regardless of whether it has any comments/replies to it. |
I'm not fully familiar with the FB approach but that phrase from documentation
and the way they use See also the user comment in the other link stating that
But those probably are just the terms they use to make the model make more sense. I guess generally we have a plan, and this is just one of the minor points. Technically, since an edge would have an ID, we can attach anything (a node) to it. And if we consider comments to be on a different plane, this would not break the model of the lexical entity relationships graph. But I'm not sure if it's the best way to implement it. Will think more about it. |
I think that could work. Would there also be a "replaces" edge between definition C and definition B? (E.g., definitionC replaces definitionB?) |
Ok. I guess I was thinking of the versioning scenario, where definition C was a correction of the text of definition B. But I think we should not get too bogged down with all of the possible variations right now. I think the structure you have proposed, with the node-in-the-middle, addresses one of the key things that was still troubling me about the data model design and is a reasonable jumping-off point. I will work on introducing that into the prototype ArangoDB model. |
As we've discussed previously, it would not be a good idea to change the existing Data Model objects in order for them to support annotations, because other apps that do not embrace the annotations concept are using them. How about the So my question is: would we ever need an assembly of
If that is not needed, we can simply let the (hopefully) limited amount of annotation knowledge trickle into the components, maybe in the form of plug-ins and/or modules. What would be the best approach to handle that? |
I'm not sure we have concluded that. See my comments at #40 (comment) |
I would rather not think of this as annotations-included or annotation-free. But instead, recognize that the data sources that contribute to produce the final data that the user sees are fluid and both the user and the application may influence not only which data sources are included but also how they are combined. |
Here is the first take at GraphQL type definitions with annotation support: https://gist.github.com/kirlat/5c36baaf26e3ea399bfe36d0a354c7b1. Only some objects are annotatable (Lexeme, Definition, and the connection between them); I think we can add that to other objects later. What do you think? Am I missing anything there? |
Thanks! I added some comments directly in the Gist. |
Please check an updated version with the suggested changes implemented and some mutations added: https://gist.github.com/kirlat/5c36baaf26e3ea399bfe36d0a354c7b1 I've also made types more specific by introducing the |
Comments added to the gist. |
Per discussion on Slack with @balmas: we need a way to integrate annotation data into our existing data model without significant changes to the data model itself (for several reasons). The current situation: the lexical query produces a How could this be changed to accommodate the annotations data? Two approaches come to mind.

One is centralized annotation data storage: annotation data could be retrieved, updated, added, or removed via specialized methods of the Data Model Object class. Lexical elements to which annotation data is connected are referred to by their IDs. In this model, any piece of code has to have a reference to the Data Model Object and can use its methods to retrieve/alter the annotation data.

The other approach is when annotation data is spread across all lexical data objects within the hierarchy. We could keep the structure of the lexical objects (Lexeme, DefinitionSet, Definition) the same as it is now, but add an

Another option, a combination of the two approaches described above, is to integrate not the annotation data, but the annotation API into the lexical data items. These API methods would retrieve and change the data that is located within the Data Model Object: We might use the same "distributed API approach" in other cases. For example, if the lexeme has changed and we want to pull the updated data, we can use something like

Are there any other approaches possible? What do you think would be the best way to go for us?

P.S. After reading this interesting Stack Overflow question, I've started to think that another beneficial approach might be for the Data Model Object to return lexical objects without the annotation data attached, and then for the annotation-aware code to use the annotation API of the Data Model Object (or something else) to pull the corresponding annotation data and, possibly, attach it to the lexical data objects (or not attach it, and use them as separate objects). That would provide the best isolation between the lexical data and the annotations, as only the annotation-aware code would pull the annotation data into the application context. |
I think the 3rd approach is the closest to what we need to do, not only for annotations but also for the lexical data itself. A big problem with the 2nd approach (annotation data kept in the data models) is that it doesn't account for the inter-dependencies between the different parts of the lexical data objects. The work on the treebank disambiguation has made this a little clearer to me, and I think it applies to both the aggregation/disambiguation workflow and the annotation workflow. Looking at that a little more closely, we currently have something like this:
There are a number of problems with this including:
|
In domain-driven design there is a repository pattern that doesn't apply perfectly to our use cases, but I think we need something like that. I think we need a place where we can aggregate the results of individual resource queries (instantiated as Alpheios data model objects) and reach back into as needed to recompose the data objects we make available for the user to view and annotate. |
I think that, based on all of the above, we can conclude that lexical data and annotations are different domain contexts and should be kept separate as much as possible. We do, however, need a context mapping between those two contexts. The lexical data could be the supplier of the information and the annotation data would be a consumer. I think something like the repository pattern may work well. There could be two repositories: one of lexical data and one of annotations. Code that does not use annotations would pull data from the lexical data repository (get lexical data for a specific word). The annotation-aware code would pull data from the annotation repository (annotations for a specific word). The annotation data object's code would then get what it needs from the lexical data repository and combine the information from the two repositories. The lexical data should be assembled so that it is possible to track how the data was combined; it should allow the data to be recombined in a different way at any moment. The same can be said about annotations. |
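A compact sketch of the two-repository split just proposed (all class and method names are assumptions): annotation-unaware code touches only the lexical repository, while annotation-aware code acts as the context mapping and combines both.

```js
class LexicalDataRepository {
  async getWordData(word, lang) {
    // would query morph services, lexicons, etc.
    return { word, lang, lexemes: [] };
  }
}

class AnnotationRepository {
  async getAnnotations(word, lang) {
    // would query the annotation store
    return [];
  }
}

// The consumer pulls from both and records how the data was combined,
// so it can be recombined differently later.
async function getAnnotatedWordData(word, lang, lexRepo, annRepo) {
  const [lexical, annotations] = await Promise.all([
    lexRepo.getWordData(word, lang),
    annRepo.getAnnotations(word, lang)
  ]);
  return { lexical, annotations, combinedAt: Date.now() };
}
```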
I think this is about right.
A small point but I think technically this logic to combine data from different lexical sources actually currently lives in a combination of the lexis module and the data model objects (e.g. Homonym, etc.) I need to think a bit about the question about the lexical objects and the DTOs. |
I would like to summarize what I think we can do in order to support annotations. Here is a detailed diagram showing all the architectural components and the workflow of getting the word data: The object that stores lexical data is the The rule for aggregate roots is that there should be no references from the outside to the objects stored inside the root. All changes to the objects within the aggregate root (i.e. the Keeping object instances within the Actually, the I think our current lexical objects ( The data retrieval workflow could work the following way. When the user selects a word in the UI, the presentation layer (the Vue component) sends a request to the application layer, represented by the Once created, the Once each piece of lexical or annotation data is retrieved, the
The presentation layer (Vue components) tracks changes in those flags and, when the changes affect the data it displays, it runs a method on How does the data flow from the How would data updates be handled in this model? Updates are simpler, in a way. Only annotations can be updated, not the lexical data itself. The update of annotations, however, may affect the resulting lexical data DTOs, so the Vue components that display those DTOs would need to pull updated data. In order to update an annotation, the Vue component in the presentation layer uses a method of the This is how this process looks on the diagram: There should be specialized methods to change, add, or remove each type of annotation. The methods' arguments should be specialized DTOs containing the word ID and the data describing the change to be made to the annotations. So we'll need multiple annotation input DTOs, each for a specific type of operation. That's what I think might work for the purpose. It should be flexible and extendable, but I've also tried not to overcomplicate it. It can reuse many objects of the existing infrastructure and requires only a small number of new objects to be created. Some details are probably missing, but I think we'll be able to figure them out once this goes into implementation. I would greatly appreciate your feedback on this. |
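A rough sketch of the aggregate-root flow summarized above (class and method names are assumptions, not the real implementation): the word object owns all retrieved data, hands out detached DTOs rather than references into the aggregate, and all updates go through its methods.

```js
class WordAggregate {
  constructor(targetWord) {
    this.targetWord = targetWord;
    this._lexical = [];     // normalized results of adapter queries
    this._annotations = []; // annotations retrieved for this word
  }

  addLexicalData(data) { this._lexical.push(data); }
  addAnnotation(ann) { this._annotations.push(ann); }

  // The presentation layer gets a plain copy, never internal references.
  toDTO(preferences = {}) {
    const dto = {
      word: this.targetWord,
      lexemes: this._lexical,        // combination logic, driven by
      annotations: this._annotations // `preferences`, would go here
    };
    return JSON.parse(JSON.stringify(dto)); // detach from the aggregate
  }

  // Updates go through the root; the UI then re-pulls a fresh DTO.
  updateAnnotation(id, changes) {
    const ann = this._annotations.find((a) => a.id === id);
    if (ann) Object.assign(ann, changes);
  }
}
```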
I think this approach makes perfect sense.
This is an important point. One of the difficulties we have right now, when we have only one source of annotations impacting the DTOs, is that there are interdependencies between the components of a DTO that need to be taken into account in order to construct the DTO that is displayed to the user. For example, an inflection can impact a decision about whether a lexeme is equivalent to another and needs to be merged with it. As we increase the number of data sources, the possible permutations will only grow. I think that storing the results from adapter queries as plain (but normalized) JSON objects within the Word repository would probably make it easier to deal with this. |
Related discussions: #33 #38 #24
See also https://github.com/alpheios-project/documentation/blob/master/development/lex-domain-design.tsv which defines some domain requirements for creation of data in the data store
Data design needs to accommodate: