Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert covidcast signal OR clauses to UNION #763

Closed
krivard opened this issue Nov 8, 2021 · 4 comments
Closed

Convert covidcast signal OR clauses to UNION #763

krivard opened this issue Nov 8, 2021 · 4 comments
Labels
enhancement mysql mysql database related refactor Substantial projects to make the code do the same thing, better. suspect efficacy and usefulness are questionable or unclear

Comments

@krivard
Copy link
Contributor

krivard commented Nov 8, 2021

The following query returns 248 rows and takes 30 minutes to compute:

SELECT t.geo_type, t.geo_value, t.source, t.signal, t.time_value, t.value 
FROM covidcast t 
WHERE ((t.source = 'jhu-csse' 
AND (t.signal = 'confirmed_7dav_incidence_prop' OR t.signal = 'deaths_7dav_incidence_prop'))) 
AND ((t.geo_type = 'nation' AND (t.geo_value = 'us'))) 
AND ((t.time_type = 'day' AND (t.time_value BETWEEN 20210706 AND 20211106))) 
AND (t.is_latest_issue IS TRUE) 
ORDER BY t.geo_type ASC, t.geo_value ASC, t.source ASC, t.signal ASC, t.time_value ASC 
LIMIT 10000001;

Either one of the signals on its own runs in less than 1s each, as does the UNION of the two:

SELECT t.geo_type, t.geo_value, t.source, t.signal, t.time_value, t.value 
FROM covidcast t 
WHERE ((t.source = 'jhu-csse' AND (t.signal = 'confirmed_7dav_incidence_prop'))) 
AND ((t.geo_type = 'nation' AND (t.geo_value = 'us'))) 
AND ((t.time_type = 'day' AND (t.time_value BETWEEN 20210706 AND 20211106))) 
AND (t.is_latest_issue IS TRUE) 
UNION ALL 
SELECT t.geo_type, t.geo_value, t.source, t.signal, t.time_value, t.value 
FROM covidcast t 
WHERE ((t.source = 'jhu-csse' AND (t.signal = 'deaths_7dav_incidence_prop'))) 
AND ((t.geo_type = 'nation' AND (t.geo_value = 'us'))) 
AND ((t.time_type = 'day' AND (t.time_value BETWEEN 20210706 AND 20211106))) 
AND (t.is_latest_issue IS TRUE) 
ORDER BY geo_type ASC, geo_value ASC, source ASC, `signal` ASC, time_value ASC 
LIMIT 10000001;

If possible, we should switch how we build multiple-signal queries to use UNION instead of OR.

@krivard krivard added enhancement refactor Substantial projects to make the code do the same thing, better. labels Nov 8, 2021
@krivard
Copy link
Contributor Author

krivard commented Sep 8, 2022

Here is where we parse source-signal pairs out of the API request:

def parse_source_signal_pairs() -> List[SourceSignalPair]:
ds = request.values.get("data_source")
if ds:
# old version
require_any("signal", "signals")
signals = extract_strings(("signals", "signal"))
return [SourceSignalPair(ds, signals)]
if ":" not in request.values.get("signal", ""):
raise ValidationFailedException("missing parameter: signal or (data_source and signal[s])")
return parse_source_signal_arg()

Here is where we set the source-signal filters in the WHERE clause of the SQL query we're building:

q.where_source_signal_pairs("source", "signal", source_signal_pairs)

We're currently just doing an append, which treats multiple source-signal pairs like an OR:

def where_source_signal_pairs(
self,
type_field: str,
value_field: str,
values: Sequence[SourceSignalPair],
param_key: Optional[str] = None,
) -> "QueryBuilder":
fq_type_field = self._fq_field(type_field)
fq_value_field = self._fq_field(value_field)
self.conditions.append(
filter_source_signal_pairs(
fq_type_field,
fq_value_field,
values,
param_key or type_field,
self.params,
)
)
return self

Here is where a list of conditions get turned into an AND of ORs:

def __str__(self):
where = f"WHERE {self.conditions_clause}" if self.conditions else ""
order = f"ORDER BY {_join_l(self.order)}" if self.order else ""
group_by = f"GROUP BY {_join_l(self.group_by)}" if self.group_by else ""
index = f"USE INDEX ({self.index})" if self.index else ""
return f"SELECT {self.fields_clause} FROM {self.table} {index} {self.subquery} {where} {group_by} {order}"

...where we want signals in particular to turn into UNIONs

@krivard
Copy link
Contributor Author

krivard commented Sep 8, 2022

Tests of the server code should live in here if they do not access a database:

https://github.com/cmu-delphi/delphi-epidata/tree/dev/tests/server

and in here if they do access a database:

https://github.com/cmu-delphi/delphi-epidata/tree/dev/integrations/server

@melange396
Copy link
Collaborator

we now have a different schema with better indexes compared to when this issue was created. its not clear that this change will still be effective.

@melange396 melange396 added the suspect efficacy and usefulness are questionable or unclear label Aug 22, 2024
@melange396
Copy link
Collaborator

I just tested the queries from the opening comment of this issue against the newer epimetric_latest_v and epimetric_full_v VIEWs on our staging database, and the two query versions have virtually identical timing -- all returning in tiny fractions of a second. I am going to declare this "resolved".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement mysql mysql database related refactor Substantial projects to make the code do the same thing, better. suspect efficacy and usefulness are questionable or unclear
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants