-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add predicate function argument to select_by_tag
#94
Conversation
Click to view CI ResultsGitHub pull request #94 of commit 5a30fba9a0f7265edc99afaee592b4834f4f52a1, no merge conflicts. Running as SYSTEM Setting status of 5a30fba9a0f7265edc99afaee592b4834f4f52a1 to PENDING with url https://10.20.13.93:8080/job/merlin_core/59/console and message: 'Pending' Using context: Jenkins Building on master in workspace /var/jenkins_home/workspace/merlin_core using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5 > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/core > git --version # timeout=10 using GIT_ASKPASS to set credentials login for merlin-systems username and pass > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/94/*:refs/remotes/origin/pr/94/* # timeout=10 > git rev-parse 5a30fba9a0f7265edc99afaee592b4834f4f52a1^{commit} # timeout=10 Checking out Revision 5a30fba9a0f7265edc99afaee592b4834f4f52a1 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 5a30fba9a0f7265edc99afaee592b4834f4f52a1 # timeout=10 Commit message: "Add method select_by_any_tag to find union of columns with tags" > git rev-list --no-walk 2faf07571c187939a09c3083a582188bb90e213f # timeout=10 [merlin_core] $ /bin/bash /tmp/jenkins2789700248353621692.sh Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (61.0.0) Collecting setuptools Downloading setuptools-62.3.2-py3-none-any.whl (1.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 35.0 MB/s eta 0:00:00 Installing collected packages: setuptools Attempting uninstall: setuptools Found existing installation: setuptools 61.0.0 Uninstalling setuptools-61.0.0: Successfully uninstalled setuptools-61.0.0 ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. google-auth 1.35.0 requires cachetools<5.0,>=2.0.0, but you have cachetools 5.0.0 which is incompatible. tensorflow-gpu 2.8.0 requires keras<2.9,>=2.8.0rc0, but you have keras 2.6.0 which is incompatible. tensorflow-gpu 2.8.0 requires tensorboard<2.9,>=2.8, but you have tensorboard 2.6.0 which is incompatible. Successfully installed setuptools-62.3.2 ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 343 items / 1 skipped |
Documentation preview |
Click to view CI ResultsGitHub pull request #94 of commit 5498b20bf5a9ce69f58777d7e5abefa2a7707c7c, no merge conflicts. Running as SYSTEM Setting status of 5498b20bf5a9ce69f58777d7e5abefa2a7707c7c to PENDING with url https://10.20.13.93:8080/job/merlin_core/104/console and message: 'Pending' Using context: Jenkins Building on master in workspace /var/jenkins_home/workspace/merlin_core using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5 > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/core > git --version # timeout=10 using GIT_ASKPASS to set credentials login for merlin-systems username and pass > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/94/*:refs/remotes/origin/pr/94/* # timeout=10 > git rev-parse 5498b20bf5a9ce69f58777d7e5abefa2a7707c7c^{commit} # timeout=10 Checking out Revision 5498b20bf5a9ce69f58777d7e5abefa2a7707c7c (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 5498b20bf5a9ce69f58777d7e5abefa2a7707c7c # timeout=10 Commit message: "Merge branch 'main' into tags-intersection" > git rev-list --no-walk 04fffa5d82b5e9b1ec296d152d585da74ce8879a # timeout=10 [merlin_core] $ /bin/bash /tmp/jenkins11471215182046293802.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 344 items / 1 skipped |
Let's hold off on merging this until after the 22.08 release |
Click to view CI ResultsGitHub pull request #94 of commit 883060528464d84e8b962136950738f95be71305, no merge conflicts. Running as SYSTEM Setting status of 883060528464d84e8b962136950738f95be71305 to PENDING with url https://10.20.13.93:8080/job/merlin_core/115/console and message: 'Pending' Using context: Jenkins Building on master in workspace /var/jenkins_home/workspace/merlin_core using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5 > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/core > git --version # timeout=10 using GIT_ASKPASS to set credentials login for merlin-systems username and pass > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/94/*:refs/remotes/origin/pr/94/* # timeout=10 > git rev-parse 883060528464d84e8b962136950738f95be71305^{commit} # timeout=10 Checking out Revision 883060528464d84e8b962136950738f95be71305 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 883060528464d84e8b962136950738f95be71305 # timeout=10 Commit message: "Merge branch 'main' into tags-intersection" > git rev-list --no-walk 5f0de7073444a544361decebe7a4312f93aead5e # timeout=10 [merlin_core] $ /bin/bash /tmp/jenkins4328059253590593104.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 344 items / 1 skipped |
Click to view CI ResultsGitHub pull request #94 of commit d5bb8dc706554b5d63ad9d0bd7a26665346b7cad, no merge conflicts. Running as SYSTEM Setting status of d5bb8dc706554b5d63ad9d0bd7a26665346b7cad to PENDING with url https://10.20.13.93:8080/job/merlin_core/117/console and message: 'Pending' Using context: Jenkins Building on master in workspace /var/jenkins_home/workspace/merlin_core using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5 > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/core > git --version # timeout=10 using GIT_ASKPASS to set credentials login for merlin-systems username and pass > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/94/*:refs/remotes/origin/pr/94/* # timeout=10 > git rev-parse d5bb8dc706554b5d63ad9d0bd7a26665346b7cad^{commit} # timeout=10 Checking out Revision d5bb8dc706554b5d63ad9d0bd7a26665346b7cad (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f d5bb8dc706554b5d63ad9d0bd7a26665346b7cad # timeout=10 Commit message: "Rename select_by_any_tag to select_by_any_tags" > git rev-list --no-walk b37ca6854bb0e3a165f33e453394daddd54a7093 # timeout=10 [merlin_core] $ /bin/bash /tmp/jenkins6233883287895504341.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 344 items / 1 skipped |
Click to view CI ResultsGitHub pull request #94 of commit b48439af9033f58890565516887b7a26100dacec, no merge conflicts. Running as SYSTEM Setting status of b48439af9033f58890565516887b7a26100dacec to PENDING with url https://10.20.13.93:8080/job/merlin_core/138/console and message: 'Pending' Using context: Jenkins Building on master in workspace /var/jenkins_home/workspace/merlin_core using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5 > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/core > git --version # timeout=10 using GIT_ASKPASS to set credentials login for merlin-systems username and pass > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/94/*:refs/remotes/origin/pr/94/* # timeout=10 > git rev-parse b48439af9033f58890565516887b7a26100dacec^{commit} # timeout=10 Checking out Revision b48439af9033f58890565516887b7a26100dacec (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f b48439af9033f58890565516887b7a26100dacec # timeout=10 Commit message: "Merge branch 'main' into tags-intersection" > git rev-list --no-walk dd7017fdcaaf99afbd93dcfb93cbe044d5723a85 # timeout=10 [merlin_core] $ /bin/bash /tmp/jenkins5416395355702112995.sh GLOB sdist-make: /var/jenkins_home/workspace/merlin_core/core/setup.py test-gpu inst-nodeps: /var/jenkins_home/workspace/merlin_core/core/.tox/.tmp/package/1/merlin-core-0+untagged.115.gb48439a.zip WARNING: Discarding $PYTHONPATH from environment, to override specify PYTHONPATH in 'passenv' in your configuration. test-gpu installed: absl-py==1.2.0,alabaster==0.7.12,anyio==3.6.1,argon2-cffi==21.3.0,argon2-cffi-bindings==21.2.0,astroid==2.5.6,asttokens==2.0.7,astunparse==1.6.3,asv==0.5.1,asvdb==0.4.2,attrs==22.1.0,awscli==1.25.61,Babel==2.10.3,backcall==0.2.0,beautifulsoup4==4.11.1,betterproto==1.2.5,black==22.6.0,bleach==5.0.1,boto3==1.24.51,botocore==1.27.60,Brotli==1.0.9,cachetools==5.2.0,certifi==2019.11.28,cffi==1.15.1,chardet==3.0.4,clang==5.0,click==8.1.3,cloudpickle==2.1.0,colorama==0.4.4,coverage==6.4.4,cuda-python==11.7.1,cudf==22.4.0,cupy-cuda116==10.6.0,cycler==0.11.0,Cython==0.29.32,dask==2022.1.1,dask-cuda==22.4.0,dask-cudf==22.4.0,dbus-python==1.2.16,debugpy==1.6.2,decorator==5.1.1,defusedxml==0.7.1,dill==0.3.5.1,distlib==0.3.5,distributed==2022.3.0,distro==1.7.0,dm-tree==0.1.7,docker-pycreds==0.4.0,docutils==0.16,emoji==1.7.0,entrypoints==0.4,execnet==1.9.0,executing==0.10.0,faiss-gpu==1.7.2,fastai==2.7.9,fastapi==0.80.0,fastavro==1.6.0,fastcore==1.5.21,fastdownload==0.0.7,fastjsonschema==2.16.1,fastprogress==1.0.3,fastrlock==0.8,feast==0.19.4,fiddle==0.2.0,filelock==3.8.0,flatbuffers==1.12,fonttools==4.37.0,fsspec==2022.5.0,gast==0.4.0,gevent==21.12.0,geventhttpclient==2.0,gitdb==4.0.9,GitPython==3.1.27,google==3.0.0,google-api-core==2.8.2,google-auth==2.11.0,google-auth-oauthlib==0.4.6,google-pasta==0.2.0,googleapis-common-protos==1.52.0,graphviz==0.20.1,greenlet==1.1.2,grpcio==1.41.0,grpcio-channelz==1.47.0,grpcio-reflection==1.47.0,grpclib==0.4.3,h11==0.13.0,h2==4.1.0,h5py==3.7.0,HeapDict==1.0.1,hpack==4.0.0,httptools==0.4.0,hugectr2onnx==0.0.0,huggingface-hub==0.8.1,hyperframe==6.0.1,idna==2.8,imagesize==1.4.1,implicit==0.6.0,importlib-metadata==4.12.0,importlib-resources==5.9.0,iniconfig==1.1.1,ipykernel==6.15.1,ipython==8.4.0,ipython-genutils==0.2.0,ipywidgets==7.7.0,jedi==0.18.1,Jinja2==3.1.2,jmespath==1.0.1,joblib==1.1.0,json5==0.9.9,jsonschema==4.9.1,jupyter-cache==0.4.3,jupyter-client==7.3.4,jupyter-core==4.11.1,jupyter-server==1.18.1,jupyter-server-mathjax==0.2.5,jupyter-sphinx==0.3.2,jupyterlab==3.4.5,jupyterlab-pygments==0.2.2,jupyterlab-server==2.15.0,jupyterlab-widgets==1.1.0,keras==2.9.0,Keras-Preprocessing==1.1.2,kiwisolver==1.4.4,lazy-object-proxy==1.7.1,libclang==14.0.6,lightfm==1.16,lightgbm==3.3.2,linkify-it-py==1.0.3,llvmlite==0.39.0,locket==1.0.0,lxml==4.9.1,Markdown==3.4.1,markdown-it-py==1.1.0,MarkupSafe==2.1.1,matplotlib==3.5.3,matplotlib-inline==0.1.3,mdit-py-plugins==0.2.8,merlin-core==0+untagged.115.gb48439a,merlin-models==0.6.0+21.gd247c021c,merlin-systems==0+untagged.82.g1ee951d,mistune==0.8.4,mmh3==3.0.0,mpi4py==3.1.3,msgpack==1.0.4,multidict==6.0.2,myst-nb==0.13.2,myst-parser==0.15.2,natsort==8.1.0,nbclassic==0.4.3,nbclient==0.6.6,nbconvert==6.5.3,nbdime==3.1.1,nbformat==5.4.0,nest-asyncio==1.5.5,notebook==6.4.12,notebook-shim==0.1.0,numba==0.56.0,numpy==1.21.5,nvidia-pyindex==1.0.9,# Editable install with no version control (nvtabular==1.3.3+2.g4cecf3e33),-e /usr/local/lib/python3.8/dist-packages,nvtx==0.2.5,oauthlib==3.2.0,onnx==1.12.0,onnxruntime==1.11.1,opt-einsum==3.3.0,packaging==21.3,pandas==1.3.5,pandavro==1.5.2,pandocfilters==1.5.0,parso==0.8.3,partd==1.3.0,pathtools==0.1.2,pexpect==4.8.0,pickleshare==0.7.5,Pillow==9.2.0,pkgutil_resolve_name==1.3.10,platformdirs==2.5.2,pluggy==1.0.0,prometheus-client==0.14.1,promise==2.3,prompt-toolkit==3.0.30,proto-plus==1.19.6,protobuf==3.19.4,psutil==5.9.1,ptyprocess==0.7.0,pure-eval==0.2.2,py==1.11.0,pyarrow==6.0.0,pyasn1==0.4.8,pyasn1-modules==0.2.8,pybind11==2.10.0,pycparser==2.21,pydantic==1.9.2,pydot==1.4.2,Pygments==2.12.0,PyGObject==3.36.0,pynvml==11.4.1,pyparsing==3.0.9,pyrsistent==0.18.1,pytest==7.1.2,pytest-cov==3.0.0,pytest-forked==1.4.0,pytest-xdist==2.5.0,python-apt==2.0.0+ubuntu0.20.4.7,python-dateutil==2.8.2,python-dotenv==0.20.0,python-rapidjson==1.8,pytz==2022.2.1,PyYAML==5.4.1,pyzmq==23.2.1,regex==2022.7.25,requests==2.22.0,requests-oauthlib==1.3.1,requests-unixsocket==0.2.0,rmm==21.12.0,rsa==4.7.2,s3fs==2022.2.0,s3transfer==0.6.0,sacremoses==0.0.53,scikit-build==0.15.0,scikit-learn==1.1.2,scipy==1.9.0,seedir==0.3.0,Send2Trash==1.8.0,sentry-sdk==1.9.4,setproctitle==1.3.2,setuptools-scm==7.0.5,shortuuid==1.0.9,six==1.15.0,sklearn==0.0,smmap==5.0.0,sniffio==1.2.0,snowballstemmer==2.2.0,sortedcontainers==2.4.0,soupsieve==2.3.2.post1,Sphinx==5.1.1,sphinx-multiversion==0.2.4,sphinx-togglebutton==0.3.1,sphinx_external_toc==0.3.0,sphinxcontrib-applehelp==1.0.2,sphinxcontrib-copydirs @ git+https://github.com/mikemckiernan/sphinxcontrib-copydirs.git@bd8c5d79b3f91cf5f1bb0d6995aeca3fe84b670e,sphinxcontrib-devhelp==1.0.2,sphinxcontrib-htmlhelp==2.0.0,sphinxcontrib-jsmath==1.0.1,sphinxcontrib-qthelp==1.0.3,sphinxcontrib-serializinghtml==1.1.5,SQLAlchemy==1.4.36,stack-data==0.4.0,starlette==0.19.1,stringcase==1.2.0,supervisor==4.1.0,tabulate==0.8.10,tblib==1.7.0,tdqm==0.0.1,tenacity==8.0.1,tensorboard==2.9.1,tensorboard-data-server==0.6.1,tensorboard-plugin-wit==1.8.1,tensorflow==2.6.2,tensorflow-estimator==2.9.0,tensorflow-gpu==2.9.1,tensorflow-io-gcs-filesystem==0.26.0,tensorflow-metadata==1.9.0,termcolor==1.1.0,terminado==0.15.0,testbook==0.4.2,threadpoolctl==3.1.0,tinycss2==1.1.1,tokenizers==0.10.3,toml==0.10.2,tomli==2.0.1,toolz==0.12.0,torch==1.12.1+cu113,torchmetrics==0.3.2,tornado==6.2,tox==3.25.1,tqdm==4.64.0,traitlets==5.3.0,transformers==4.12.0,transformers4rec==0.1.11+4.g741a912f9,treelite==2.3.0,treelite-runtime==2.3.0,tritonclient==2.22.0,typing_extensions==4.3.0,uc-micro-py==1.0.1,urllib3==1.26.11,uvicorn==0.18.2,uvloop==0.16.0,versioneer==0.20,virtualenv==20.16.3,wandb==0.13.1,watchfiles==0.16.1,wcwidth==0.2.5,webencodings==0.5.1,websocket-client==1.3.3,websockets==10.3,Werkzeug==2.2.2,widgetsnbextension==3.6.0,wrapt==1.12.1,xgboost==1.6.1,zict==2.2.0,zipp==3.8.1,zope.event==4.5.0,zope.interface==5.4.0 test-gpu run-test-pre: PYTHONHASHSEED='240378352' test-gpu run-test: commands[0] | python -m pytest --cov-report term --cov merlin -rxs tests/unit ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 cachedir: .tox/test-gpu/.pytest_cache rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 346 items / 1 skipped |
Perhaps a non-breaking way to implement this would be to add an schema.select_by_tag(["a", "b"]) # => returns col1 and col2
schema.select_by_tag(["a", "b"], all=True) # => returns col1 |
On reflection, I kinda like @nv-alaiacano's suggestion about adding a parameter to control whether it matches any or matches all. Normally, I'm not a fan of boolean arguments (which is probably why I suggested separate methods if I did in fact suggest that), but this seems like a case where it may be simpler/clearer to go that route. Another possibility would be to keep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pred_fn
is a pretty slick way to do this API 👍🏻
select_by_tag
I've updated the description and title to match the change as it is now implemented. It's not longer a breaking change since this is now the addition of a new argument with the same default behaviour as before. We might still consider swapping the all/any deafult around if that turns out to be more convenient in a subsequent PR. |
Goals ⚽
When using the
select_by_tag
method on a schema. If passing multiple tags to the method. Enable column selection by tag with both AND (all
) and OR (any
) logic. The current default is OR (any
). When providing multiple tags, the columns selected can may any of the tags provided.A motivating example for this is if we separate something like the
ITEM_ID
tag into two. Instead of performingschema.select_by_tag(Tags.ITEM_ID)
you'd need to do something likeschema.select_by_tag([Tags.ITEM, Tags.ID])
. Currently, this would return all the columns tagged ITEM and all columns tagged with ID, and not the item ID column.Note that it is possible to achieve the selection of all tags with the current API by chaining calls to
select_by_tag
.This change makes this more convenient by supporting a single method call to achieve this. The equivalent of the line above with this PR:
Implementation Details 🚧
This PR adds a new argument to
select_by_tag
that enables control over the predicate function used to determine column selection from tags.Using the same example as above. The PR changes the result of
select_by_tag
:Before
After
We may wish to swap the default value around from any to all in a further PR, however that would be a breaking change and will be clearer to separate in another PR