Describe the bug
`ops.Categorify` raises `ValueError: Column must have no nulls.` when `num_buckets > 1` and the dataset is shuffled by keys.
EDIT: The full error message: https://pastebin.com/GJRQhxAi
Steps/Code to reproduce bug
import gc

import dask.dataframe as dd
import numpy as np
import pandas as pd
import nvtabular as nvt

# Generate synthetic data
N_ROWS = 100_000_000
CHUNK_SIZE = 10_000_000
N = N_ROWS // CHUNK_SIZE
dataframes = []
for i in range(N):
    print(f"{i + 1}/{N}")
    chunk_data = np.random.lognormal(3., 10., int(CHUNK_SIZE)).astype(np.int32)
    chunk_ddf = dd.from_pandas(pd.DataFrame({'session_id': chunk_data // 45, 'item_id': chunk_data}), npartitions=1)
    dataframes.append(chunk_ddf)
ddf = dd.concat(dataframes, axis=0)
del dataframes
gc.collect()

# !!! When `shuffle_by_keys` is commented out, the code finishes successfully
dataset = nvt.Dataset(ddf).shuffle_by_keys(keys=["session_id"])

_categorical_feats = [
    "item_id",
] >> nvt.ops.Categorify(
    freq_threshold=5,
    # !!! When `num_buckets=None`, the code finishes successfully
    num_buckets=100,
)

workflow = nvt.Workflow(_categorical_feats)
workflow.fit(dataset)
workflow.output_schema
Expected behavior
A properly fitted `ops.Categorify` when `num_buckets > 1` and the dataset is shuffled by keys.
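A possible workaround sketch, untested and purely my assumption: since the failure happens inside `workflow.fit`, fitting on the unshuffled dataset and applying `shuffle_by_keys` only afterwards might sidestep the error:

# Untested workaround sketch (assumption): fit Categorify on the unshuffled
# dataset, then shuffle by keys only for the transform-time steps.
workflow = nvt.Workflow(_categorical_feats)
workflow.fit(nvt.Dataset(ddf))  # fit without shuffle_by_keys

shuffled_dataset = nvt.Dataset(ddf).shuffle_by_keys(keys=["session_id"])
transformed = workflow.transform(shuffled_dataset)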
Environment details:
Environment location: JupyterLab in Docker on GCP
Method of NVTabular install: Docker
My Dockerfile:
# AFTER https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai
FROM nvcr.io/nvidia/merlin/merlin-pytorch:23.08
# Install Google Cloud SDK
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && apt-get update -y && apt-get install google-cloud-sdk -y
# Copy your project to the Docker image
COPY . /project
WORKDIR /project
# Install Python dependencies
RUN pip install -U pip
RUN pip install -r requirements/base.txt
# Run Jupyter Lab by default, with no authentication, on port 8080
EXPOSE 8080
CMD ["jupyter-lab", "--allow-root", "--ip=0.0.0.0", "--port=8080", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'"]
Additional context
I need to call `shuffle_by_keys` because I subsequently run a `Groupby` operation on the shuffled data (see the sketch below).
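For reference, a minimal sketch of that downstream step, assuming `nvt.ops.Groupby` with illustrative aggregations (the exact aggs are my assumption, not part of the repro above). `Groupby` aggregates within partitions, which is why each session's rows must first be co-located via `shuffle_by_keys`:

# Illustrative sketch; the aggs dict is an assumption, not my exact pipeline.
# Groupby aggregates within partitions, so shuffle_by_keys(["session_id"])
# must run first so that all rows of a session land in the same partition.
groupby_feats = ["session_id", "item_id"] >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    aggs={"item_id": ["list", "count"]},
)
groupby_workflow = nvt.Workflow(groupby_feats)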