Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SELECT behavior #97

Open
pp86 opened this issue Dec 15, 2017 · 19 comments
Open

SELECT behavior #97

pp86 opened this issue Dec 15, 2017 · 19 comments
Assignees

Comments

@pp86
Copy link
Contributor

pp86 commented Dec 15, 2017

Please consider this query:


DATA_SET_VAR = SELECT(provider == "Campaner") HG19_BED_ANNOTATION;

MATERIALIZE DATA_SET_VAR into test_select;

It returns two samples.
Btw, no sample with provider = "Campaner" exists. The two output samples have the metadata original_provider = "Campaner".

What happens is that the SELECT checks is the attribute name ends with "provider".

I remember we decided to have a SQL-like approach, i.e. the names are splitted on "." (similar to the "mariage"/"age" example we discussed). So the results would be correct if the metadata was "original.provider", but not "original_provider".

Is anything changed in the specification, or should this be fixed?
@marcomass @sunbrn

@sunbrn
Copy link
Collaborator

sunbrn commented Dec 15, 2017

According to what is written in the documentation, in the SELECT predicate on metadata only metadata exactly equal to what is specified should be matched, e.g., only metadata which are exactly "provider" (without prefix of any kind).

I think the matter with the separator (".") is only relevant when we talk about the default/EXACT/FULL options for semijoin/joinby_/_groupby predicates.
I mean that, when using SELECT, we should concern about the separator if we are writing something like
SELECT(...; semijoin: FULL(provider) IN DS)

Therefore, @pp86 I would fix the metadata predicate to only match the indicated metadata.
@marcomass could you please confirm? Am I missing something?

@marcomass
Copy link
Contributor

I confirm; it must be fixed. Thanks @pp86 for spotting it

@marcomass
Copy link
Contributor

Can this be fixed by @OlgaGorlova ?

@pp86
Copy link
Contributor Author

pp86 commented Dec 15, 2017

In the case we have a sample with:
original.provider = "Campaner"
should the query
provider == "Campaner"
select the sample or not?

According to @sunbrn it should not, right?

@marcomass @OlgaGorlova I am taking care of it.

@marcomass marcomass assigned pp86 and unassigned OlgaGorlova and akaitoua Dec 15, 2017
@marcomass
Copy link
Contributor

The definition in the doc is the following:
• metadata_attribute_name: it matches all attributes that are equal to OR end with the specified name;
• EXACT(metadata_attribute_name): it matches all attributes that are equal to the specified name (without any prefixes);
• FULL(metadata_attribute_name): it matches two attributes if they end with the specified name AND their full names are equal.

So, the query original.provider = "Campaner" should match only attributes original.provider, or *.original.provider not just provider

@sunbrn
Copy link
Collaborator

sunbrn commented Dec 15, 2017

@marcomass but these three options should only be defined for the semijoin predicate of the SELECT operator. @pp86's example is concerning the metadata predicate of the SELECT operator.
So, if I understand correctly the documentation, in the case he is mentioning, in which we have a sample with:
original.provider = "Campaner"
the query
provider == "Campaner"
should not select it.
Or are you saying that the def/EXACT/FULL options should be available also in this part?

@Erlaad
Copy link

Erlaad commented Dec 16, 2017

I do believe that the issue at hand is more simply resolved by changing the way that "by metadata_attribute_name" selection (point 1) works in the above definition by @marcomass.

In particular, checking that the metadata value ends in a certain way poses ambiguity cases such as the one found by @pp86 (cf. example below). What was probably intended is that the last token of the metadata name after splitting according to the '.' separator should match the metadata attribute mentioned in the SELECT clause.

Some examples to clarify. Let the following SELECT be of interest:
SELECT(data == 'NICE') SOME_DS (1)

Let S1 have the following metadata:
data NICE
and S2 have instead
wrong_data NICE
finally, let S3 have
OLD.data NICE
for S1, S2 and S3 samples in SOME_DS.

If I use the "ending with" definition as above, all samples will be valid (as the criteria is equivalent to looking for strings in the form "*data", where * is a wildcard). Instead, if I tokenize metadata with the '.' separator and match only the last element, only S1 and S3 will be chosen.

Note that it is entirely possible that a sample S4 having the
reallystupid.data NICE
metadata would be selected by (1), so some ambiguity would remain. Perhaps a requirement could be made that dataset do not contain metadata with '.', another separator could be defined or a small subroutine could be added to the uploader that changes all '.' to some other separator.

  • Stefano

EDIT: some confusion between dataset and sample

@marcomass
Copy link
Contributor

Thanks @sunbrn for the clarification and to @Erlaad for the examples.

We introduced the:
• metadata_attribute_name (default)
• EXACT(metadata_attribute_name)
• FULL(metadata_attribute_name)
to avoid the issue that who write the query should know how previous GMQL operator in the query change the metadata attribute names with prefixing (as some operators do use prefixes with "." separator).
So, already now the option "metadata_attribute_name (default)" is to be intended with regards to prefixes of metadata: attribute_name (i.e., metadata_attribute_name: it matches all attributes that are equal to OR end with the specified name (regardless additional metadata_attribute_name prefixes not explicitly specified)). If it is not working in this way, it must be fixed in this way.
Thus, in @Erlaad example only S1 and S3 must be chosen.
Please note that the "LIKE" option, as described in issue #55 is not (yet) implemented.

As @sunbrn clarified, the three options:
• metadata_attribute_name (default)
• EXACT(metadata_attribute_name)
• FULL(metadata_attribute_name)
have been defined only for the semijoin predicate of the SELECT operator (and joinby and groupby predicates of the other GMQL operators).
This is because we defined such options in the view I described above, when we considered SELECT to be generally the initial statement of a GMQL query, so considering cases where the metadata attribute names have no "." separated prefixes.
Yet, SELECT can be for sure useful also applied within a GMQL query, particularly after a MAP operation (which introduces "." separated prefixes).
So, to avoid the issue mentioned above, I think it is better to extend the application of these 3 options also to the main predicate of the SELECT, not only to its semijoin part.

Please note that in doing so, the issue pointed out by @Erlaad would not hold anymore, since in the provided example case it would be possible to disambiguate between S1 and S3 using the one of the three options (default, EXACT, FULL), according to the desired behavior.

In summary, according to the issue pointed out initially in this thread by @pp86, SELECT behavior should be fixed as here above mentioned.

Please comment if you see anything unclear/wrong in the above.

@pp86
Copy link
Contributor Author

pp86 commented Dec 16, 2017

I can do that, but keep in mind that this is a major change (Compiler, Dag, Executor, API, Python, R have to be touched).
This will require a deep re-testing of the SELECT .

Since in select we can use AND, OR, NOT to compose complex conditions, EXACT and FULL my not be strictly necessary (although surely helpful).

Just fixing the behaviour to split on "." should be much easier (only the implementation will be touched).
I would go with this lighter option if you all agree.

@marcomass
Copy link
Contributor

Yes, the fix of splitting on "." must be done for sure.
It must be fixed in a way to act as the default option
• metadata_attribute_name: it matches all attributes that are equal to OR end with the specified name;
i.e.: in the case we have a sample with:
DS1.provider = "Campaner"
the query
provider == "Campaner"
should select the sample, which is how it is working now with regards to the application of a SELECT after a MAP.
Please note that this would not avoid current ambiguities without implementing the use also of EXACT and FULL in main SELECT predicate.

I understand the other change (introduction also of EXACT and FULL in metadata main predicate) requires more effort. My opinion is that it would be much better to do it now when we are closing the API implementation and contemporary we are also actively working on closing RGMQL and fixing bugs on pyGMQL. Postponing it would then require much more effort to retaken also the other projects (RGMQL, pyGMQL) once they will be closed.
Anyway, I agree this is somehow a lower priority aspect with respect to other issues currently still open, still keeping in mind the ambiguity here above mentioned that it leaves.

pp86 added a commit that referenced this issue Dec 16, 2017
Refactor the Select Execution in Spark, so as to produce a smaller
execution plan
@pp86
Copy link
Contributor Author

pp86 commented Dec 16, 2017

The problem of attribute name splitting has been solved by my last commit.

I still have to work on the EXACT/FULL options.

@pp86
Copy link
Contributor Author

pp86 commented Jan 10, 2018

I started thinking on EXACT/FULL.
I think that in the select we can only have the FULL option. The exact makes no sense, isn't it?

@marcomass
Copy link
Contributor

@pp86
Why do you think so? Can you make examples?
Usually EXACT is useful to consider attributes that in the matadata do not have prefixes, discriminating them from other attributes with same name, but different prefixes. For instance this can occur when the user creates attributes during the processing (extend, project) with same name (e.g. regionNum), but in different points of the query flow.

@pp86
Copy link
Contributor Author

pp86 commented Jan 10, 2018

I am missing the difference between the two.
Should the result of
FULL(antibody)
be different form the result of:
EXACT(antibody)
?

@marcomass
Copy link
Contributor

I copy here below the current definition in the manual for metajoin conditions (we will extend it also to select, when implemented).

● In a metajoin condition (i.e., for all operators that include such condition: SELECT has semijoin, DIFFERENCE, MAP, and JOIN have joinby, GROUP, MERGE, and COVER have groupby), different matching options can be used:
○ metadata_attribute_name: it matches all attributes that are equal to OR end with the dot-separated suffix specified name (regardless additional metadata_attribute_name dot-separated prefixes not explicitly specified);
○ EXACT(metadata_attribute_name): it matches all attributes that are equal to the specified name (without any prefixes);
○ FULL(metadata_attribute_name): it matches two attributes if they end with the specified name AND their full names are equal;
For instance, if we consider the following attributes:

  1. pref1.pref2.att
  2. pref1.att
  3. att
  4. pref1.att
    Then:
    • att matches all of the above attributes;
    • EXACT(att) matches only attribute 3. (i.e., att);
    • FULL(att) matches attributes 2. and 4. (i.e., pref1.att).

@Erlaad
Copy link

Erlaad commented Jan 10, 2018 via email

@pp86
Copy link
Contributor Author

pp86 commented Jan 11, 2018

You did not understand my point. Definition and example describe metajoin or groupby, where we compare two attribute names in order to tell if they are in the same group.
Here we are comparing 1 attribute name and 1 fixed string, so (contrary to what I said previously) only EXACT may help. Am I right?

If not please provide a counter example

@sunbrn
Copy link
Collaborator

sunbrn commented Jan 11, 2018

I think @pp86 is correct.
So, if I understand correctly, I think we should be able to handle these cases:

  1. using the default option SELECT <attribute_name>=="xxxx" you are matching all samples which have the keys some_prefixes.<attribute_name> or just <attribute_name>;
  2. using the option SELECT EXACT(<attribute_name>=="xxxx") you are matching all sample which have key <attribute_name>;
  3. using the option SELECT EXACT(some_prefixes.<attribute_name>) you are matching all sample which have keys some_prefixes.<attribute_name>.

Therefore default and EXACT should be enough to handle the requested cases.
I see no other case that a potential "FULL" option could cover.

@marcomass
Copy link
Contributor

@sunbrn
I agree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants