Info files from str to jsonb #2743
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #2743 +/- ##
=========================================
+ Coverage 94.28% 94.3% +0.02%
=========================================
Files 166 166
Lines 19900 19939 +39
=========================================
+ Hits 18763 18804 +41
+ Misses 1137 1135 -2
Continue to review full report at Codecov.
Thank you for reviewing!
qiita_db/environment_manager.py
Outdated
# we are going to open and close 2 main transactions; this is a required
# change since patch 68.sql where we transition to jsonb for all info
# files. The 2 main transitions are: (1) get the current settings,
# (2) each patch in their independent trasaction
trasaction -> transaction
Mostly just questions
# this in its own transition
if sql_patch_filename == '68.sql' and test:
    with qdb.sql_connection.TRN:
        _populate_test_db()
would this execute on production as well?
No, line 417: if sql_patch_filename == '68.sql' and test:
return d
return result[0]['sample_values']
Why isn't a `loads` needed here when it is on line 198?
Good question! Took me a minute to think about it. Basically, it's due to the magic of psycopg2 and json. In line 198 (`sample_values->>'columns'`) we are asking specifically for the value within a key of the json object, while here we are asking for all the values. When you ask for all the values, psycopg2 automatically decodes the json; for a specific key, which comes back as a json string, decoding is the dev's responsibility.
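A minimal sketch of the distinction, with the plain `json` module standing in for the driver and made-up sample values:

```python
import json

# A made-up row, standing in for what psycopg2 returns for
# "SELECT sample_values FROM ...": the whole jsonb column arrives as an
# already-decoded Python dict (psycopg2 registers a jsonb adapter).
whole_column = {"columns": ["bool_col", "date_col"]}

# "sample_values->>'columns'" is different: the ->> operator yields *text*,
# so the driver cannot know it is JSON and decoding falls on the dev.
text_value = '["bool_col", "date_col"]'
decoded = json.loads(text_value)

assert decoded == whole_column["columns"]
```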
VALUES (%s, {2})""".format(
    table_name, ", ".join(headers),
    ', '.join(["%s"] * len(headers)))
values = '{"columns": %s}' % dumps(md_template.columns.tolist())
what about: `values = dumps({"columns": md_template.columns.tolist()})`?
ha! that works; changing.
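For the record, the two forms are equivalent for well-behaved input; a quick sketch (column names are made up) of why leaning fully on the encoder is still preferable:

```python
import json

cols = ['bool_col', 'date_col', 'has "quotes"']

# Manual assembly, as in the original line: it works only because dumps()
# already escaped the inner list; one stray brace in the template and the
# payload silently stops being valid JSON.
manual = '{"columns": %s}' % json.dumps(cols)

# Letting the encoder build the whole object: nothing to get wrong.
encoded = json.dumps({"columns": cols})

assert json.loads(manual) == json.loads(encoded) == {"columns": cols}
```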
cols = self.categories()
cols.extend(new_cols)

values = '{"columns": %s}' % dumps(cols)
same here, safer to rely fully on the encoder instead of manually creating json
👍
', '.join(["%s"] * len(headers)))
# inserting new samples to the info file
values = [(k, df.to_json())
          for k, df in md_filtered.iterrows()]
df -> row?
k
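For context on the naming question, a sketch (with hypothetical frame contents) of what `iterrows` yields: the first element is the index label, i.e. the sample id, and the second is the row as a Series, so `row` would indeed be a clearer name than `df`:

```python
import pandas as pd

# Hypothetical metadata, indexed by sample id and loaded as str (dtype=str),
# like the rest of this PR does.
md_filtered = pd.DataFrame({'bool_col': ['true', 'false']},
                           index=['1.Sample1', '1.Sample2'])

# iterrows() yields (index_label, row_Series) pairs: k is the sample id,
# and each row is serialized to its own json string.
values = [(k, row.to_json()) for k, row in md_filtered.iterrows()]

assert values == [('1.Sample1', '{"bool_col":"true"}'),
                  ('1.Sample2', '{"bool_col":"false"}')]
```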
qdb.sql_connection.TRN.add(sql, sample_vals)
new_columns = []
for sid, df in to_update.groupby(level=0):
    values = {k[1]: v for k, v in df.to_dict()['to'].iteritems()}
is "to" a special key? i don't really understand what's going on here. Why not just take a row or a column and cast that to `dict`?
Not really; my guess is that the confusion is how github displays this block. Basically, `to` is created in the line above: `to_update = pd.DataFrame({'to': changed_to}, index=changed.index)`
What I think is going on here:
- the group is cast to a dict
- the key "to" is selected from this dict, which corresponds to the column in the group
- the key/value pairs in the Series representing the column are iterated
- a new dict is created using something. I don't know what that something is. It seems like the index is a hierarchical one, in which case `k[1]` is pulling out some field, but that doesn't make sense given that the index value on line 1315 is described as `sid`, so I'm assuming that's a sample ID. In which case, `k[1]` would be the character at index 1 in the sample ID string.

If the above is correct, then I remain confused :) If the above is not correct, would it be possible to restructure the code, and/or include a comment describing precisely what this block of code does?
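For what it's worth, the index here appears to be a MultiIndex of `(sample_id, category)` tuples, so `k[1]` pulls out the category rather than a character of the sample id; a sketch with a plain dict standing in for `df.to_dict()['to']` (ids and values are made up):

```python
# The grouped frame is indexed by (sample_id, category), so the keys of
# df.to_dict()['to'] are tuples and k[1] is the category name, not a
# character of the sample id.
to_dict = {
    ('1.Sample1', 'bool_col'): 'true',
    ('1.Sample1', 'date_col'): '2015-09-01 00:00:00',
}

values = {k[1]: v for k, v in to_dict.items()}

assert values == {'bool_col': 'true',
                  'date_col': '2015-09-01 00:00:00'}
```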
Thanks! Now I understand, but I think all of this can be deleted in favor of a pivot table?
In [5]: d = pd.DataFrame([['s1', 'cat1', 6], ['s2', 'cat1', 5], ['s2', 'cat2', 'foo'], ['s3', 'cat1', 10], ['s3', 'cat3', 'bar']], columns=['sample_id', 'category', 'value'])
In [6]: d.pivot(index='sample_id', columns='category', values='value')
Out[6]:
category cat1 cat2 cat3
sample_id
s1 6 NaN NaN
s2 5 foo NaN
s3 10 NaN bar
After your comment I realized that this could be simplified much more, hope that's the case.
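Taking the pivot suggestion one step further, a sketch (toy data as in the session above) of how the whole `groupby` block might collapse into per-sample dicts ready for the jsonb update:

```python
import pandas as pd

d = pd.DataFrame(
    [['s1', 'cat1', '6'], ['s2', 'cat1', '5'], ['s2', 'cat2', 'foo']],
    columns=['sample_id', 'category', 'value'])

wide = d.pivot(index='sample_id', columns='category', values='value')

# dropna() keeps only the categories each sample actually has a value for
per_sample = {sid: row.dropna().to_dict() for sid, row in wide.iterrows()}

assert per_sample == {'s1': {'cat1': '6'},
                      's2': {'cat1': '5', 'cat2': 'foo'}}
```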
new_columns = list(set(new_columns).union(set(self.categories())))
table_name = self._table_name(self.id)
values = '{"columns": %s}' % dumps(new_columns)
same same. may be better to decompose this into a helper method since this pattern has come up a bunch
👍
category, self._table_name(self._id))
if category not in self.categories():
    raise qdb.exceptions.QiitaDBColumnError(category)
sql = """SELECT sample_id, sample_values->>'{0}' as {0}
pretty similar to `__getitem__`?
Yes, but again Sample vs SampleTemplate issue ...
ah okay
'2015-09-01 00:00:00']]
exp = [
    ['%s.Sample2' % self.new_study.id, {
        'bool_col': 'true', 'date_col': '2015-09-01 00:00:00'}],
We want to be careful here:
In [11]: json.loads('{"foo": true}')
Out[11]: {'foo': True}
In [12]: json.loads('{"foo": "true"}')
Out[12]: {'foo': 'true'}
Good point! For our peace of mind, we always load the info via pandas as str (`dtype=str`) and use `df.to_json()`
There was a lot of use of `json.dumps`, which is likely going to have a similar issue. I don't see how the use of pandas protects us from the implicit behavior of the json encoders / decoders.
pandas `dtype=str` loads (or should load) everything as a string; thus, pandas will not try to infer datatypes. From my own experience, tools use the pandas dtype to infer the datatype of the column. Thus, if in pandas everything is str, the other tools will use str as the type.
In [1]: import pandas as pd
In [2]: pd.DataFrame.from_dict({'val': {'row': 5}}).dtypes
Out[2]:
val int64
dtype: object
In [3]: pd.DataFrame.from_dict({'val': {'row': 5}}, dtype=str).dtypes
Out[3]:
val object
dtype: object
@@ -1651,22 +1649,22 @@ def get_artifacts_information(artifact_ids, only_biom=True):
             'deprecated': cmd.software.deprecated}

     # now let's get the actual artifacts
-    ts = {}
+    ts = {None: []}
`None` for a key is weird, why is this needed?
`ts` is a cache (prep id : target subfragment) so we don't have to query the target subfragment for a prep info file multiple times. Now, some artifacts (like analyses) do not have a prep info file; thus the `None`. It's possible to do an if/else but this was easier when first written; let me know ...
If `None` corresponds to analyses which do not have a prep (or for which it doesn't make sense to have one), would it be possible to include a comment denoting the interpretation of the key?
@wasade, thanks!
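A minimal sketch of the caching pattern being described (names and values are illustrative, not the PR's actual code): the pre-seeded `None` entry covers artifacts that have no prep info file, so they never trigger a lookup.

```python
# ts maps prep-info id -> target subfragment list; the pre-seeded None
# entry covers artifacts (e.g. analyses) that have no prep info file.
ts = {None: []}
calls = []

def fake_query(prep_id):
    # stands in for the real SQL lookup
    calls.append(prep_id)
    return ['V4']

def target_subfragment(prep_id):
    if prep_id not in ts:          # only hit the database on a cache miss
        ts[prep_id] = fake_query(prep_id)
    return ts[prep_id]

assert target_subfragment(1) == ['V4']
assert target_subfragment(1) == ['V4']   # served from cache
assert calls == [1]                      # queried exactly once
assert target_subfragment(None) == []    # no prep info: no query at all
```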