minor improvements #3067

Merged: 3 commits into qiita-spots:dev on Jan 29, 2021

Conversation

antgonza
Member

Two minor improvements:

  1. Improve the performance of `_common_to_dataframe_steps` (~16 seconds faster for AGP):

     In [50]: %timeit SampleTemplate(10317).to_dataframe()
     1min 8s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

     In [51]: %timeit ST(10317).to_dataframe()
     51.8 s ± 259 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

  2. It turns out that jobs append the node where they were submitted, so the previous list comprehension missed jobs submitted from the workers.
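A minimal sketch of the second point (the job records and node names here are hypothetical, not Qiita's actual code): if each job carries the node it was submitted from, a comprehension that only matches the main node silently drops jobs submitted from workers.

```python
# Hypothetical job records; 'node' is where each job was submitted from.
jobs = [
    {'id': 'j1', 'node': 'main'},
    {'id': 'j2', 'node': 'worker-1'},  # submitted from a worker
    {'id': 'j3', 'node': 'main'},
]

# Buggy pattern: filtering on a single known node misses 'j2'.
main_only = [j['id'] for j in jobs if j['node'] == 'main']

# Fixed pattern: do not filter on the submitting node at all.
all_jobs = [j['id'] for j in jobs]
```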

@antgonza antgonza requested a review from ElDeveloper January 26, 2021 19:25
@codecov-io commented Jan 26, 2021

Codecov Report

Merging #3067 (a5b9be5) into dev (36faa57) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##              dev    #3067   +/-   ##
=======================================
  Coverage   95.02%   95.03%           
=======================================
  Files          74       74           
  Lines       14279    14295   +16     
=======================================
+ Hits        13569    13585   +16     
  Misses        710      710           
Impacted Files Coverage Δ
qiita_db/metadata_template/constants.py 100.00% <ø> (ø)
qiita_db/metadata_template/prep_template.py 99.11% <ø> (+0.04%) ⬆️
qiita_db/__init__.py 100.00% <100.00%> (ø)
qiita_db/meta_util.py 88.62% <100.00%> (ø)
...ita_db/metadata_template/base_metadata_template.py 96.25% <100.00%> (-0.08%) ⬇️
qiita_db/metadata_template/sample_template.py 100.00% <100.00%> (ø)
...ta_db/metadata_template/test/test_prep_template.py 99.55% <100.00%> (+<0.01%) ⬆️
..._db/metadata_template/test/test_sample_template.py 99.86% <100.00%> (+<0.01%) ⬆️
qiita_db/processing_job.py 73.80% <100.00%> (ø)
qiita_db/study.py 94.11% <100.00%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6e5d80...a5b9be5.

# Make sure that we are replacing np.NaN with None
df = pd.DataFrame(qdb.sql_connection.TRN.execute_fetchindex())
df = pd.concat([df.drop(1, axis=1),
                pd.DataFrame(df[1].tolist())], axis=1)
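As an aside on the `np.NaN`-to-`None` comment above, a generic way to do that conversion in pandas (a sketch, not necessarily the code Qiita uses) is to cast to object dtype and use `DataFrame.where`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})
# Cast to object first so None can be stored, then keep non-null values
# and substitute None everywhere the frame is null.
clean = df.astype(object).where(df.notnull(), None)
```

After this, `clean` holds real `None` values instead of `NaN`, which is what the comment in the diff above is after.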
Contributor

Would it be possible to do pd.concat([df.drop(1, axis=1), df[1]])? It's not really obvious why the new DataFrame construction or the tolist is needed. Oh wait, is this supposed to be a reorder operation moving the first column to the end of the frame? If so, then I think this does it:

cols = [df.columns[0]] + df.columns[2:].tolist() + [df.columns[1]]
df = df[cols]

Member Author

Good question. The metadata tables are basically 2 columns: sample_id, metadata (json, which is returned as a dict). In other words, the SQL will return [[sample_id, {'col_a': 'a', 'col_b': 'b'}, ...], [sample_id, {'col_a': 'a', 'col_b': 'b'}, ...], ...] so that line basically expands the dict into actual columns in the dataframe, like this:

In [61]: data = ['x', {'a': 1, 'b': 2}], ['y', {'a': 1, 'c': 2}]                                                                                                                                                                                                                            

In [62]: df = pd.DataFrame(data)                                                                                                                                                                                                                                                            

In [63]: df                                                                                                                                                                                                                                                                                 
Out[63]: 
   0                 1
0  x  {'a': 1, 'b': 2}
1  y  {'a': 1, 'c': 2}

In [64]: pd.concat([df.drop(1, axis=1), pd.DataFrame(df[1].tolist())], axis=1)                                                                                                                                                                                                              
Out[64]: 
   0  a    b    c
0  x  1  2.0  NaN
1  y  1  NaN  2.0

Contributor

Okay, thanks. What about this? It's a little less juggling and you get the index "for free"?

>>> import pandas as pd
>>> data = ['x', {'a': 1, 'b': 2}], ['y', {'a': 1, 'c': 2}]
>>> pd.DataFrame([d for _, d in data], index=[i for i, _ in data])
   a    b    c
x  1  2.0  NaN
y  1  NaN  2.0

Member Author

Thanks, this made the code a bit faster too. If you agree with the changes, and the tests pass, could you merge?

@ElDeveloper
Contributor

These changes look good. @antgonza, I'm not familiar with the pgsql JSON side of things, but this seems like the kind of thing that pgsql could probably do much faster than Pandas/Python. Did you happen to look into querying the data directly into table form, rather than querying a JSON string and then converting it to a table?
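For reference, the server-side approach being suggested could look roughly like this (a hypothetical sketch using PostgreSQL's `jsonb_each_text`; the table and column names are made up and are not Qiita's actual schema):

```sql
-- Expand each sample's JSON metadata into (sample_id, key, value) rows
-- inside PostgreSQL, instead of shipping the JSON to Python first.
SELECT s.sample_id, m.key, m.value
FROM sample_metadata s,
     jsonb_each_text(s.metadata) AS m(key, value);
```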

I'm looking into it now, but just want to make sure I shared this observation.

@ElDeveloper
Contributor

Following up on this @antgonza shared some benchmarks showing that the pgsql implementation was in fact slower than the changes in this PR.

@ElDeveloper ElDeveloper merged commit f3fefc6 into qiita-spots:dev Jan 29, 2021
@ElDeveloper
Contributor

Thanks @wasade and @antgonza

antgonza added a commit that referenced this pull request Mar 19, 2021
* adding creation and modification timestamps to prep_templates (#3066)

* adding creation and modification timestamps to prep_templates

* Apply suggestions from code review [skip ci]

Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>

Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>

* fix #3068 (#3069)

* minor improvements (#3067)

* minor improvements

* fix test

* adding @wasade suggestion

* minor fixes (#3073)

* fix #3072 (#3074)

* fix #3072

* flake8

* fix #3076 (#3077)

* fix #3076

* addressing @ElDeveloper comments

* fix #3070 (#3075)

* fix #3070

* Apply suggestions from code review

Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>

* flake8

Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>

* fix-3064 (#3080)

* Create qiita-ci.yml

* add passwd to psql

* nginx + redis

* redis via conda

* redis-server - -> redis-server --

* daemonize -> daemonize

* conda libevent

* apt-get libevent-dev

* adding multiple sections

* rm travis_retry

* add missing activate

* onfig_test.cfg

* rm indent

* add 1 indent

* pgport

* add drop

* change db name

* redis version

* redis -> redbiom version

* split tests

* source .profile

* adding ssh command

* mkdir ~/.ssh/

* only ssh

* pushing to retrigger

* add run

* split ssh conf/run

* 60367309

* add bash

* fix ident

* rm ssh cmd

* strict=no

* localhost

* mod permissions

* user ssh

* rm keys

* echos

* start ssh service

* ssh -vvv

* split setup/test

* localhost -> 127

* localhost -> 127

* rm ssh setup

* -o StrictHostKeyChecking=no

* change key creation

* echo sshd_config

* readd ssh config

* cp sshd_config and then move

* rm echos

* extra qiita folder

* ls test_data

* rm qiita folder

* use key in .ssh

* more ssh params

* tee -> >

* chmod

* fullpath

* ls key

* rm localhost

* params

* rsa -> rsa-sha2-256

* rm root login

* readd -o StrictHostKeyChecking=no

* minimizing params

* permissions

* cleaning up

* comments => echo

* -o StrictHostKeyChecking=no

* export REDBIOM_HOST=http://127.0.0.1:7379

* localhost -> 127.0.0.1

* add runner

* cover package

* cleaning env vars

* renaming jobs

* redbiom host

* add env vars via export

* changing echo

* add env COVER_PACKAGE

* rm user sftp

* just qiita_db

* open_sftp -> SFTPClient

* +1 worker

* only SCPClient

* ssh.SCPClient -> SCPClient

* return sftp/ssh back

* just qiita_db

* rm sftp

* add all tests

* BASE_URL = https://127.0.0.1:8383

* bash -c

* start_biom

* start_biom 127

* populate QIITA_SERVER_CERT

* nginx 127

* adding ports to conf

* lowering ports

* service qiita

* 8383 postgres

* curl test

* mkdir nginx

* moving tests around

* moving curl up

* echos for curl

* sleep 10

* 127.0.0.1 -> localhost

* returning some localhost to 127...

* qiita_pet!

* final clean up

* kill qiita first

* addressing @ElDeveloper comments

* checkout repo for flake8

* split cover_package

* Workflows GUI (#3078)

* fix #3076

* addressing @ElDeveloper comments

* adding workflows GUI

* improve HTML

* db changes and improvements

* comment out new db changes

* rm default_workflows from PluginHandler.get

* Apply suggestions from code review [skip ci]

Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>

Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>

* fix #3081 (#3082)

* fix #3081

* generate_analysis_list_per_study to speed up retrieval

* minor modifications

* split sql

* addressing @ElDeveloper comments

* add Command.processing_jobs (#3085)

* add Command.processing_jobs

* Update qiita_db/software.py [skip ci]

Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>

Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>

* Improve workflows (#3083)

* fix #3076

* addressing @ElDeveloper comments

* adding workflows GUI

* improve HTML

* db changes and improvements

* comment out new db changes

* rm default_workflows from PluginHandler.get

* improving workflows after deployment

* fix error

* clean up

* minor fixes

* improving html

* comments from @justinshaffer

* addressing @ElDeveloper comments

* 2021.03 (#3084)

* 2021.03

* add Command.processing_jobs to changelog

* codecov

* codecov required

* codecov token

Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>