Info files from str to jsonb #2743

Merged
merged 20 commits into from
Dec 11, 2018

antgonza
Member

No description provided.

@antgonza antgonza changed the title Info files from str to jsonb WIP: Info files from str to jsonb Nov 28, 2018
qiita_db/environment_manager.py (outdated, resolved)
qiita_db/support_files/patches/68.sql (outdated, resolved)
qiita_db/support_files/patches/68.sql (outdated, resolved)
qiita_db/support_files/patches/python_patches/45.py (outdated, resolved)
@codecov-io

codecov-io commented Dec 4, 2018

Codecov Report

Merging #2743 into dev will increase coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##              dev   #2743      +/-   ##
=========================================
+ Coverage   94.28%   94.3%   +0.02%     
=========================================
  Files         166     166              
  Lines       19900   19939      +39     
=========================================
+ Hits        18763   18804      +41     
+ Misses       1137    1135       -2
Impacted Files Coverage Δ
...ta_pet/handlers/artifact_handlers/base_handlers.py 92.81% <ø> (ø) ⬆️
...lers/artifact_handlers/tests/test_base_handlers.py 95.32% <ø> (ø) ⬆️
qiita_db/test/test_util.py 99.69% <ø> (ø) ⬆️
qiita_db/processing_job.py 93.17% <ø> (ø) ⬆️
qiita_db/metadata_template/util.py 89.68% <ø> (ø) ⬆️
...t/handlers/api_proxy/tests/test_sample_template.py 97.76% <100%> (ø) ⬆️
..._db/metadata_template/test/test_sample_template.py 99.85% <100%> (ø) ⬆️
qiita_db/environment_manager.py 42.58% <100%> (+1.12%) ⬆️
qiita_db/util.py 95.49% <100%> (+0.13%) ⬆️
qiita_db/test/test_setup.py 97.87% <100%> (ø) ⬆️
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a9da1d...e15c4c1.

@antgonza antgonza changed the title WIP: Info files from str to jsonb Info files from str to jsonb Dec 4, 2018
qiita_db/environment_manager.py (outdated, resolved)
qiita_db/environment_manager.py (outdated, resolved)
qiita_db/environment_manager.py (resolved)
qiita_db/environment_manager.py (outdated, resolved)
qiita_db/meta_util.py (resolved)
qiita_db/support_files/populate_test_db.sql (outdated, resolved)
Member Author

@antgonza antgonza left a comment

Thank you for reviewing!

qiita_db/environment_manager.py (resolved)
qiita_db/environment_manager.py (outdated, resolved)
qiita_db/environment_manager.py (outdated, resolved)
qiita_db/environment_manager.py (outdated, resolved)
qiita_db/meta_util.py (resolved)
# we are going to open and close 2 main transactions; this is a required
# change since patch 68.sql where we transition to jsonb for all info
# files. The 2 main transitions are: (1) get the current settings,
# (2) each patch in their independent trasaction
Contributor

trasaction -> transaction
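
A minimal sketch of the one-transaction-per-patch idea described in the code comment above. It uses the standard-library sqlite3 module as a stand-in for Qiita's qdb.sql_connection.TRN, and apply_patches, the settings table, and the patch contents are hypothetical, so treat it as an illustration rather than the code in environment_manager.py.

```python
import sqlite3


def apply_patches(conn, patches):
    """Apply each (name, statements) patch in its own transaction.

    Returns the names of the patches that committed. Stops at the first
    failing patch, whose statements are rolled back as a unit.
    """
    applied = []
    for name, statements in patches:
        try:
            # `with conn` commits on success and rolls back on an exception
            with conn:
                for statement in statements:
                    conn.execute(statement)
        except sqlite3.Error:
            break
        applied.append(name)
    return applied


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (name TEXT)")
patches = [
    ("67.sql", ["INSERT INTO settings VALUES ('first')"]),
    # the second statement fails, so this whole patch is rolled back
    ("68.sql", ["INSERT INTO settings VALUES ('second')",
                "INSERT INTO missing_table VALUES ('boom')"]),
]
applied = apply_patches(conn, patches)
rows = [r[0] for r in conn.execute("SELECT name FROM settings")]
```

Because the failing patch is isolated, the row written by the earlier patch stays committed while the partial work of the failing one is discarded.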

Contributor

@wasade wasade left a comment

Mostly just questions

# this in its own transition
if sql_patch_filename == '68.sql' and test:
    with qdb.sql_connection.TRN:
        _populate_test_db()
Contributor

would this execute on production as well?

Member Author

No, line 417: if sql_patch_filename == '68.sql' and test:


return d
return result[0]['sample_values']
Contributor

Why isn't a loads needed here when it is on line 198?

Member Author

Good question! Took me a minute to think about it. Basically, it comes down to how psycopg2 handles json. In line 198 (sample_values->>'columns') we ask for the value of one specific key, and ->> returns it as text, so decoding that string is the dev's responsibility. Here we ask for the whole value, and psycopg2 automatically converts the json into a Python object.
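
To illustrate the distinction, here is a small sketch that assumes psycopg2's default json/jsonb typecasters are active. The two values are hard-coded stand-ins for what the two kinds of query would hand back, not output captured from Qiita.

```python
import json

# Stand-in for: SELECT sample_values->>'columns' ...
# The ->> operator returns text, so the driver hands back a str
# that the caller still has to decode.
text_result = '["bool_col", "date_col"]'
columns = json.loads(text_result)

# Stand-in for: SELECT sample_values ...
# psycopg2 decodes the jsonb column itself, so this is already a dict
# and no loads() call is needed.
whole_result = {"bool_col": "true", "date_col": "2015-09-01 00:00:00"}
```

Note that ->> on a plain string value returns the bare string, so calling loads() on it is only appropriate when the stored value is itself a list or object, as with the columns key here.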

VALUES (%s, {2})""".format(
table_name, ", ".join(headers),
', '.join(["%s"] * len(headers)))
values = '{"columns": %s}' % dumps(md_template.columns.tolist())
Contributor

what about: values = dumps({"columns": md_template.columns.tolist()})?

Member Author

ha! that works; changing.

cols = self.categories()
cols.extend(new_cols)

values = '{"columns": %s}' % dumps(cols)
Contributor

same here, safer to rely fully on the encoder instead of manually creating json
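
A short comparison of the two forms discussed in this thread. The column names are made up for the example; only json.dumps and json.loads from the standard library are used.

```python
from json import dumps, loads

# hypothetical column names, one of which contains quote characters
columns = ["bool_col", 'needs "quoting"']

# hand-built JSON wrapper around an encoded list
manual = '{"columns": %s}' % dumps(columns)

# a single call to the encoder builds the whole document
encoded = dumps({"columns": columns})
```

Both strings decode to the same object. The second form leaves no hand-written JSON to get wrong if the wrapper later grows more keys.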

Member Author

👍

', '.join(["%s"] * len(headers)))
# inserting new samples to the info file
values = [(k, df.to_json())
for k, df in md_filtered.iterrows()]
Contributor

df -> row?

Member Author

k
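
For context on the rename, a sketch of what iterrows yields, using an invented two-sample frame in place of the real md_filtered:

```python
import json

import pandas as pd

# hypothetical stand-in for md_filtered, loaded as strings
md_filtered = pd.DataFrame(
    {"bool_col": ["true", "false"],
     "date_col": ["2015-09-01", "2015-09-02"]},
    index=["1.Sample1", "1.Sample2"], dtype=str)

# iterrows yields (index label, Series) pairs, so `row` names the
# second element more accurately than `df`
values = [(sample_id, row.to_json())
          for sample_id, row in md_filtered.iterrows()]

first_id, first_json = values[0]
first_row = json.loads(first_json)
```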

qdb.sql_connection.TRN.add(sql, sample_vals)
new_columns = []
for sid, df in to_update.groupby(level=0):
values = {k[1]: v for k, v in df.to_dict()['to'].iteritems()}
Contributor

is "to" a special key? i don't really understand what's going on here. Why not just take a row or a column and cast that to dict?

Member Author

Not really; my guess is that the confusion comes from how GitHub displays this block. Basically, 'to' is created in the line above: to_update = pd.DataFrame({'to': changed_to}, index=changed.index)

Contributor

What i think is going on here:

  • the group is cast to a dict
  • the key "to" is selected from this dict, which corresponds to the column in the group
  • the key/value pairs in the Series representing the column are iterated
  • a new dict is created using something. I don't know what that something is. It seems like the index is a hierarchical one, in which case k[1] is pulling out some field, but that doesn't make sense given that the index value on line 1315 is described as sid, so I'm assuming that's a sample ID. In which case, k[1] would be the character at index 1 in the sample ID string

If the above is correct, then I remain confused :) If the above is not correct, would it be possible to restructure the code, and/or include a comment describing precisely what this block of code does?
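
A sketch of what the block appears to do, assuming that changed.index is a two-level (sample_id, column) MultiIndex; that shape is an assumption, since the thread does not show how changed is built. The data is invented, and .items() is used where the quoted code uses the Python 2 .iteritems().

```python
import pandas as pd

# Assumed shape: one row per changed cell, indexed by (sample_id, column)
index = pd.MultiIndex.from_tuples(
    [("s1", "ph"), ("s2", "ph"), ("s2", "depth")],
    names=["sample_id", "column"])
to_update = pd.DataFrame({"to": ["7.1", "6.8", "10"]}, index=index)

updates = {}
for sid, group in to_update.groupby(level=0):
    # group.to_dict()["to"] maps (sample_id, column) tuples to new values,
    # so key[1] is the column name, not a character of the sample ID
    updates[sid] = {key[1]: value
                    for key, value in group.to_dict()["to"].items()}
```

Under that assumption, each sample ID ends up mapped to a dict of the columns that changed and their new values.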

Contributor

Thanks! Now I understand, but I think all of this can be deleted and just use a pivot table?

In [5]: d = pd.DataFrame([['s1', 'cat1', 6], ['s2', 'cat1', 5], ['s2', 'cat2', 'foo'], ['s3', 'cat1', 10], ['s3', 'cat3', 'bar']], columns=['sample_id', 'category', 'value'])

In [6]: d.pivot(index='sample_id', columns='category', values='value')
Out[6]:
category  cat1 cat2 cat3
sample_id
s1           6  NaN  NaN
s2           5  foo  NaN
s3          10  NaN  bar

Member Author

After your comment I realized that this could be simplified much more, hope that's the case.


new_columns = list(set(new_columns).union(set(self.categories())))
table_name = self._table_name(self.id)
values = '{"columns": %s}' % dumps(new_columns)
Contributor

same same. may be better to decompose this into a helper method since this pattern has come up a bunch
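
One possible shape for such a helper, as a sketch only. The name columns_payload and its signature are hypothetical and are not taken from the merged code.

```python
from json import dumps, loads


def columns_payload(existing, new=()):
    """Return the JSON document stored under the "columns" key.

    Columns from `new` that are not already present are appended after
    `existing`, keeping the original order.
    """
    merged = list(existing)
    seen = set(merged)
    for column in new:
        if column not in seen:
            merged.append(column)
            seen.add(column)
    return dumps({"columns": merged})


payload = columns_payload(["bool_col", "date_col"], ["date_col", "ph"])
decoded = loads(payload)
```

Unlike the set union in the quoted code, this variant keeps a stable column order, which is a design choice of the sketch rather than something the thread asks for.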

Member Author

👍

category, self._table_name(self._id))
if category not in self.categories():
raise qdb.exceptions.QiitaDBColumnError(category)
sql = """SELECT sample_id, sample_values->>'{0}' as {0}
Contributor

pretty similar to __getitem__?

Member Author

Yes, but again Sample vs SampleTemplate issue ...

Contributor

ah okay

'2015-09-01 00:00:00']]
exp = [
['%s.Sample2' % self.new_study.id, {
'bool_col': 'true', 'date_col': '2015-09-01 00:00:00'}],
Contributor

We want to be careful here:

In [11]: json.loads('{"foo": true}')
Out[11]: {'foo': True}

In [12]: json.loads('{"foo": "true"}')
Out[12]: {'foo': 'true'}

Member Author

Good point! For our peace of mind, we always load the info via pandas as str (dtype=str) and use the df.to_json

Contributor

There was a lot of use of json.dumps which is going to likely have a similar issue. I don't see how the use of pandas protects us from implicit behavior of the json encoders / decoders

Member Author

With dtype=str, pandas loads (or should load) everything as a string; thus, pandas will not try to infer datatypes. In my experience, tools use the pandas dtype to infer the datatype of a column. Thus, if everything is str in pandas, the other tools will use str as the type.

In [1]: import pandas as pd

In [2]: pd.DataFrame.from_dict({'val': {'row': 5}}).dtypes
Out[2]: 
val    int64
dtype: object

In [3]: pd.DataFrame.from_dict({'val': {'row': 5}}, dtype=str).dtypes
Out[3]: 
val    object
dtype: object
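
The encoder side of the same concern can be checked with the standard library alone. This sketch uses invented values and only shows that strings survive a dumps/loads round trip as strings, while native booleans and integers come back typed.

```python
import json

# every value is already a str, as it would be after a dtype=str load
as_strings = {"bool_col": "true", "int_col": "5"}
round_tripped = json.loads(json.dumps(as_strings))

# native Python types are encoded as JSON true / 5 and decoded back typed
as_native = {"bool_col": True, "int_col": 5}
typed = json.loads(json.dumps(as_native))
```

So the protection comes from the values being str before they reach the encoder, not from the encoder itself.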

@@ -1651,22 +1649,22 @@ def get_artifacts_information(artifact_ids, only_biom=True):
'deprecated': cmd.software.deprecated}

# now let's get the actual artifacts
- ts = {}
+ ts = {None: []}
Contributor

None for a key is weird, why is this needed?

Member Author

ts is a cache (prep id : target subfragment) so we don't have to query the target subfragment of a prep info file multiple times. Now, some artifacts (like analysis) do not have a prep info file; thus the None. It's possible to do an if/else, but this was easier when first written. Let me know ...

Contributor

If None corresponds to analyses which do not have a prep (or it doesn't make sense to have one), would it be possible to include a comment as such denoting the interpretation of the key?
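
A sketch of the caching pattern with the kind of comment being requested. The function target_subfragments, its arguments, and the fake lookup are hypothetical; in Qiita the lookup would be a database query.

```python
def target_subfragments(artifacts, lookup):
    """Map artifact id -> target subfragment list, querying each prep once.

    `artifacts` is an iterable of (artifact_id, prep_id) pairs and
    `lookup` is a callable that fetches the value for a prep id.
    """
    # None stands for "no prep info file" (e.g. analysis artifacts);
    # seeding it here means those artifacts never trigger a lookup
    cache = {None: []}
    result = {}
    for artifact_id, prep_id in artifacts:
        if prep_id not in cache:
            cache[prep_id] = lookup(prep_id)
        result[artifact_id] = cache[prep_id]
    return result


calls = []


def fake_lookup(prep_id):
    # records each call so the caching behaviour can be observed
    calls.append(prep_id)
    return ["V4"]


result = target_subfragments([(1, 10), (2, 10), (3, None)], fake_lookup)
```

Here the two artifacts sharing prep 10 cause a single lookup, and the artifact without a prep gets the empty list from the seeded None entry.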

Member Author

@antgonza antgonza left a comment

@wasade, thanks!

@antgonza antgonza mentioned this pull request Dec 10, 2018
@wasade wasade merged commit 3e63bfd into qiita-spots:dev Dec 11, 2018