minor improvements #3067
Conversation
Codecov Report
@@            Coverage Diff            @@
##              dev    #3067     +/-   ##
=========================================
  Coverage   95.02%   95.03%   +0.01%
=========================================
  Files          74       74
  Lines       14279    14295      +16
=========================================
+ Hits        13569    13585      +16
  Misses        710      710
Continue to review full report at Codecov.
# Make sure that we are changing np.NaN by Nones
df = pd.DataFrame(qdb.sql_connection.TRN.execute_fetchindex())
df = pd.concat([df.drop(1, axis=1),
                pd.DataFrame(df[1].tolist())], axis=1)
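The comment in this excerpt refers to converting np.nan values back to None before the data is used downstream; that conversion itself is not shown in the excerpt. A minimal sketch of one common way to do it in pandas (illustrative, not necessarily the PR's exact code):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})
# Cast to object first so numeric columns can actually hold None instead of np.nan.
df = df.astype(object).where(df.notnull(), None)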
Would it be possible to do pd.concat([df.drop(1, axis=1), df[1]])? It's not really obvious why the new DataFrame construction is needed, or the tolist call. Oh wait, is this supposed to be a reorder operation moving the first column to the end of the frame? If so, then I think this does it:
cols = [df.columns[0]] + df.columns[2:].tolist() + [df.columns[1]]
df = df[cols]
Good question. The metadata tables basically have two columns: sample_id and metadata (JSON, which is returned as a dict). In other words, the SQL returns something like [[sample_id, {'col_a': 'a', 'col_b': 'b', ...}], [sample_id, {'col_a': 'a', 'col_b': 'b', ...}], ...], so that line expands the dict into actual columns in the dataframe, like this:
In [61]: data = ['x', {'a': 1, 'b': 2}], ['y', {'a': 1, 'c': 2}]
In [62]: df = pd.DataFrame(data)
In [63]: df
Out[63]:
0 1
0 x {'a': 1, 'b': 2}
1 y {'a': 1, 'c': 2}
In [64]: pd.concat([df.drop(1, axis=1), pd.DataFrame(df[1].tolist())], axis=1)
Out[64]:
0 a b c
0 x 1 2.0 NaN
1 y 1 NaN 2.0
Okay, thanks. What about this? It's a little less juggling and you get the index "for free"?
>>> import pandas as pd
>>> data = ['x', {'a': 1, 'b': 2}], ['y', {'a': 1, 'c': 2}]
>>> pd.DataFrame([d for _, d in data], index=[i for i, _ in data])
a b c
x 1 2.0 NaN
y 1 NaN 2.0
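Applied to the actual query result, that suggestion would look roughly like the sketch below; the names here are illustrative rather than the exact code in _common_to_dataframe_steps, and rows stands in for whatever execute_fetchindex() returns (pairs of sample_id and the metadata dict).

import pandas as pd

# rows stands in for the fetched (sample_id, metadata_dict) pairs.
rows = [('x', {'a': 1, 'b': 2}), ('y', {'a': 1, 'c': 2})]
df = pd.DataFrame([metadata for _, metadata in rows],
                  index=[sample_id for sample_id, _ in rows])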
Thanks, this made the code a bit faster too. If you agree with the changes, and the tests pass, could you merge?
These changes look good. @antgonza I'm not familiar with the pgsql JSON side of things, but this seems like the kind of thing that pgsql could probably do much faster than Pandas/Python. Did you happen to look into querying the data directly into table form, rather than querying a JSON string and then converting it to a table? I'm looking into it now, but just wanted to make sure I shared this observation.
Following up on this, @antgonza shared some benchmarks showing that the pgsql implementation was in fact slower than the changes in this PR.
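For reference, one way the pgsql-side approach discussed above could be written is to expand the JSON column into key/value rows with jsonb_each_text and pivot afterwards. The sketch below is purely illustrative: the table and column names are not Qiita's actual schema, and this is not the approach that was merged (per the benchmarks above, it turned out slower than the pandas version).

import pandas as pd

# Hypothetical query: expand the JSONB metadata column into key/value rows.
# Table and column names are illustrative only.
sql = """
    SELECT sample_id, kv.key, kv.value
    FROM sample_metadata, jsonb_each_text(sample_metadata.sample_values) AS kv
"""
# rows would normally come from executing the query above; toy values here.
rows = [('x', 'a', '1'), ('x', 'b', '2'), ('y', 'a', '1'), ('y', 'c', '2')]
long_df = pd.DataFrame(rows, columns=['sample_id', 'key', 'value'])
# Pivot the long key/value form back into one column per metadata key.
df = long_df.pivot(index='sample_id', columns='key', values='value')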
Two minor improvements:
_common_to_dataframe_steps
~17 seconds for AGP:
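The ~17 second figure refers to timing _common_to_dataframe_steps on the AGP data. A minimal, purely illustrative sketch of how the two DataFrame constructions discussed above could be compared on toy data (not the actual benchmark):

import timeit
import pandas as pd

# Toy data standing in for the (sample_id, metadata_dict) rows.
data = [('s%d' % i, {'a': i, 'b': i * 2}) for i in range(10000)]

def old_way():
    df = pd.DataFrame(data)
    return pd.concat([df.drop(1, axis=1),
                      pd.DataFrame(df[1].tolist())], axis=1)

def new_way():
    return pd.DataFrame([d for _, d in data], index=[i for i, _ in data])

print('old:', timeit.timeit(old_way, number=10))
print('new:', timeit.timeit(new_way, number=10))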