Info files from str to jsonb #2743
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #2743 +/- ##
=========================================
+ Coverage 94.28% 94.3% +0.02%
=========================================
Files 166 166
Lines 19900 19939 +39
=========================================
+ Hits 18763 18804 +41
+ Misses 1137 1135 -2
Continue to review full report at Codecov.
Thank you for reviewing!
qiita_db/environment_manager.py
Outdated
# we are going to open and close 2 main transactions; this is a required
# change since patch 68.sql where we transition to jsonb for all info
# files. The 2 main transitions are: (1) get the current settings,
# (2) each patch in their independent trasaction
trasaction -> transaction
Mostly just questions
# this in its own transition
if sql_patch_filename == '68.sql' and test:
    with qdb.sql_connection.TRN:
        _populate_test_db()
would this execute on production as well?
No, line 417: if sql_patch_filename == '68.sql' and test:
return d
return result[0]['sample_values']
Why isn't a `loads` needed here when it is on line 198?
Good question! Took me a minute to think about it. Basically, it's due to the magic of psycopg2 and json. In line 198 (`sample_values->>'columns'`) we are asking specifically for the value within a key of the json object, while here we are asking for all the values. When you ask for all the values, psycopg2 automatically decodes the json; for a specific key, which comes back as a json string, decoding is the dev's responsibility.
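A minimal sketch of the distinction, with the plain `json` module standing in for the driver and made-up sample values:

```python
import json

# A made-up row, standing in for what psycopg2 returns for
# "SELECT sample_values FROM ...": the whole jsonb column arrives as an
# already-decoded Python dict (psycopg2 registers a jsonb adapter).
whole_column = {"columns": ["bool_col", "date_col"]}

# "sample_values->>'columns'" is different: the ->> operator yields *text*,
# so the driver cannot know it is JSON and decoding falls on the dev.
text_value = '["bool_col", "date_col"]'
decoded = json.loads(text_value)

assert decoded == whole_column["columns"]
```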
VALUES (%s, {2})""".format(
    table_name, ", ".join(headers),
    ', '.join(["%s"] * len(headers)))
values = '{"columns": %s}' % dumps(md_template.columns.tolist())
what about: `values = dumps({"columns": md_template.columns.tolist()})`?
ha! that works; changing.
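For the record, the two forms are equivalent for well-behaved input; a quick sketch (column names are made up) of why leaning fully on the encoder is still preferable:

```python
import json

cols = ['bool_col', 'date_col', 'has "quotes"']

# Manual assembly, as in the original line: it works only because dumps()
# already escaped the inner list; one stray brace in the template and the
# payload silently stops being valid JSON.
manual = '{"columns": %s}' % json.dumps(cols)

# Letting the encoder build the whole object: nothing to get wrong.
encoded = json.dumps({"columns": cols})

assert json.loads(manual) == json.loads(encoded) == {"columns": cols}
```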
cols = self.categories()
cols.extend(new_cols)

values = '{"columns": %s}' % dumps(cols)
same here, safer to rely fully on the encoder instead of manually creating json
👍
', '.join(["%s"] * len(headers)))
# inserting new samples to the info file
values = [(k, df.to_json())
          for k, df in md_filtered.iterrows()]
df -> row?
k
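For context on the naming question, a sketch (with hypothetical frame contents) of what `iterrows` yields: the first element is the index label, i.e. the sample id, and the second is the row as a Series, so `row` would indeed be a clearer name than `df`:

```python
import pandas as pd

# Hypothetical metadata, indexed by sample id and loaded as str (dtype=str),
# like the rest of this PR does.
md_filtered = pd.DataFrame({'bool_col': ['true', 'false']},
                           index=['1.Sample1', '1.Sample2'])

# iterrows() yields (index_label, row_Series) pairs: k is the sample id,
# and each row is serialized to its own json string.
values = [(k, row.to_json()) for k, row in md_filtered.iterrows()]

assert values == [('1.Sample1', '{"bool_col":"true"}'),
                  ('1.Sample2', '{"bool_col":"false"}')]
```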
qdb.sql_connection.TRN.add(sql, sample_vals)
new_columns = []
for sid, df in to_update.groupby(level=0):
    values = {k[1]: v for k, v in df.to_dict()['to'].iteritems()}
is "to" a special key? i don't really understand what's going on here. Why not just take a row or a column and cast that to `dict`?
Not really; my guess is that the confusion is how github displays this block. Basically, `to` is created in the line above: `to_update = pd.DataFrame({'to': changed_to}, index=changed.index)`
What I think is going on here:
- the group is cast to a dict
- the key "to" is selected from this dict, which corresponds to the column in the group
- the key/value pairs in the Series representing the column are iterated
- a new dict is created using something. I don't know what that something is. It seems like the index is a hierarchical one, in which case `k[1]` is pulling out some field, but that doesn't make sense given that the index value on line 1315 is described as `sid`, so I'm assuming that's a sample ID. In which case, `k[1]` would be the character at index 1 in the sample ID string.

If the above is correct, then I remain confused :) If the above is not correct, would it be possible to restructure the code, and/or include a comment describing precisely what this block of code does?
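For what it's worth, the index here appears to be a MultiIndex of `(sample_id, category)` tuples, so `k[1]` pulls out the category rather than a character of the sample id; a sketch with a plain dict standing in for `df.to_dict()['to']` (ids and values are made up):

```python
# The grouped frame is indexed by (sample_id, category), so the keys of
# df.to_dict()['to'] are tuples and k[1] is the category name, not a
# character of the sample id.
to_dict = {
    ('1.Sample1', 'bool_col'): 'true',
    ('1.Sample1', 'date_col'): '2015-09-01 00:00:00',
}

values = {k[1]: v for k, v in to_dict.items()}

assert values == {'bool_col': 'true',
                  'date_col': '2015-09-01 00:00:00'}
```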
Thanks! Now I understand, but I think all of this can be deleted in favor of a pivot table?
In [5]: d = pd.DataFrame([['s1', 'cat1', 6], ['s2', 'cat1', 5], ['s2', 'cat2', 'foo'], ['s3', 'cat1', 10], ['s3', 'cat3', 'bar']], columns=['sample_id', 'category', 'value'])
In [6]: d.pivot(index='sample_id', columns='category', values='value')
Out[6]:
category cat1 cat2 cat3
sample_id
s1 6 NaN NaN
s2 5 foo NaN
s3 10 NaN bar
After your comment I realized that this could be simplified much more, hope that's the case.
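Taking the pivot suggestion one step further, a sketch (toy data as in the session above) of how the whole `groupby` block might collapse into per-sample dicts ready for the jsonb update:

```python
import pandas as pd

d = pd.DataFrame(
    [['s1', 'cat1', '6'], ['s2', 'cat1', '5'], ['s2', 'cat2', 'foo']],
    columns=['sample_id', 'category', 'value'])

wide = d.pivot(index='sample_id', columns='category', values='value')

# dropna() keeps only the categories each sample actually has a value for
per_sample = {sid: row.dropna().to_dict() for sid, row in wide.iterrows()}

assert per_sample == {'s1': {'cat1': '6'},
                      's2': {'cat1': '5', 'cat2': 'foo'}}
```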
new_columns = list(set(new_columns).union(set(self.categories())))
table_name = self._table_name(self.id)
values = '{"columns": %s}' % dumps(new_columns)
same same. may be better to decompose this into a helper method since this pattern has come up a bunch
👍
category, self._table_name(self._id))
if category not in self.categories():
    raise qdb.exceptions.QiitaDBColumnError(category)
sql = """SELECT sample_id, sample_values->>'{0}' as {0}
pretty similar to `__getitem__`?
Yes, but again Sample vs SampleTemplate issue ...
ah okay
'2015-09-01 00:00:00']]
exp = [
    ['%s.Sample2' % self.new_study.id, {
        'bool_col': 'true', 'date_col': '2015-09-01 00:00:00'}],
We want to be careful here:
In [11]: json.loads('{"foo": true}')
Out[11]: {'foo': True}
In [12]: json.loads('{"foo": "true"}')
Out[12]: {'foo': 'true'}
Good point! For our peace of mind, we always load the info via pandas as str (`dtype=str`) and use `df.to_json()`
There was a lot of use of `json.dumps`, which is likely going to have a similar issue. I don't see how the use of pandas protects us from the implicit behavior of the json encoders / decoders.
pandas `dtype=str` loads (or should load) everything as a string; thus, pandas will not try to infer datatypes. From my own experience, tools use the pandas dtype to infer the datatype of the column. Thus, if in pandas everything is str, the other tools will use str as the type.
In [1]: import pandas as pd
In [2]: pd.DataFrame.from_dict({'val': {'row': 5}}).dtypes
Out[2]:
val int64
dtype: object
In [3]: pd.DataFrame.from_dict({'val': {'row': 5}}, dtype=str).dtypes
Out[3]:
val object
dtype: object
@@ -1651,22 +1649,22 @@ def get_artifacts_information(artifact_ids, only_biom=True):
             'deprecated': cmd.software.deprecated}

     # now let's get the actual artifacts
-    ts = {}
+    ts = {None: []}
`None` for a key is weird, why is this needed?
`ts` is a cache (prep id : target subfragment) so we don't have to query the target subfragment for a prep info file multiple times. Now, some artifacts (like analyses) do not have a prep info file; thus the `None`. It's possible to do an if/else but this was easier when first written; let me know ...
If `None` corresponds to analyses which do not have a prep (or for which it doesn't make sense to have one), would it be possible to include a comment denoting the interpretation of the key?
@wasade, thanks!
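A minimal sketch of the caching pattern being described (names and values are illustrative, not the PR's actual code): the pre-seeded `None` entry covers artifacts that have no prep info file, so they never trigger a lookup.

```python
# ts maps prep-info id -> target subfragment list; the pre-seeded None
# entry covers artifacts (e.g. analyses) that have no prep info file.
ts = {None: []}
calls = []

def fake_query(prep_id):
    # stands in for the real SQL lookup
    calls.append(prep_id)
    return ['V4']

def target_subfragment(prep_id):
    if prep_id not in ts:          # only hit the database on a cache miss
        ts[prep_id] = fake_query(prep_id)
    return ts[prep_id]

assert target_subfragment(1) == ['V4']
assert target_subfragment(1) == ['V4']   # served from cache
assert calls == [1]                      # queried exactly once
assert target_subfragment(None) == []    # no prep info: no query at all
```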