feat: Index the (text) datasets contents to enable full-text search - DuckDB #1296

Merged · 71 commits merged from duckdb-index into main on Jun 27, 2023
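This PR adds a `split-duckdb-index` processing step that builds a DuckDB full-text-search index over a split's parquet files and pushes the resulting index file to the dataset's `refs/convert/parquet` branch on the Hub. As a rough illustration of the technique (a minimal sketch, not the PR's actual job-runner code; the file, table, and column names are assumptions), DuckDB's `fts` extension can index all string columns of a table and answer BM25 queries:

```python
# Minimal sketch of DuckDB full-text search (hypothetical names, not PR code).
import duckdb

con = duckdb.connect("index.duckdb")  # a single-file database, easy to upload
con.execute("INSTALL fts;")
con.execute("LOAD fts;")

# Materialize the split with a stable row id to serve as the FTS document key.
con.execute(
    "CREATE OR REPLACE TABLE data AS "
    "SELECT row_number() OVER () AS __id, * FROM 'split.parquet'"
)

# Build the index over all columns ('*'), keyed by __id.
con.execute("PRAGMA create_fts_index('data', '__id', '*')")

# BM25 full-text search: rows that do not match get a NULL score.
rows = con.execute(
    """
    SELECT * FROM (
        SELECT *, fts_main_data.match_bm25(__id, 'some words') AS score FROM data
    ) WHERE score IS NOT NULL ORDER BY score DESC LIMIT 10
    """
).fetchall()
con.close()
```

Because the whole index lives in one `.duckdb` file, uploading it to a dedicated branch (see the `DUCKDB_INDEX_TARGET_REVISION` settings below) is enough for a later service to download and query it.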
Commits (changes from all 71 commits)
- 1e41964 Draft files (AndreaFrancis, May 31, 2023)
- f37a829 Adding duckdb index job runner (AndreaFrancis, Jun 2, 2023)
- 340d85e Fix style (AndreaFrancis, Jun 2, 2023)
- c53af5f WIP adding fts on API (AndreaFrancis, Jun 2, 2023)
- 8cac1c5 Remove non used code (AndreaFrancis, Jun 2, 2023)
- 31387ba Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 2, 2023)
- 23ce3ee Fix style (AndreaFrancis, Jun 2, 2023)
- ac0a2d9 Adding chart objects (AndreaFrancis, Jun 2, 2023)
- dff50cf Rollback dependency in API (AndreaFrancis, Jun 2, 2023)
- 132d4ca Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 6, 2023)
- 4659117 Depend on parquet an split (AndreaFrancis, Jun 6, 2023)
- f0794a8 Fix libcommon test (AndreaFrancis, Jun 6, 2023)
- ddad27a Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 6, 2023)
- 05d3362 Send index file to dedicated branch (AndreaFrancis, Jun 6, 2023)
- cec74e3 Fix test in first parquet (AndreaFrancis, Jun 7, 2023)
- 96587d8 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 9, 2023)
- b02fa17 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 9, 2023)
- 8679ce9 Fix merge hanges (AndreaFrancis, Jun 9, 2023)
- 163928e Fix poetry files (AndreaFrancis, Jun 9, 2023)
- b1238f5 Adding happy path test (AndreaFrancis, Jun 12, 2023)
- 08e784f Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 12, 2023)
- fd298be Adding other test scenarios (AndreaFrancis, Jun 12, 2023)
- 2afe9f3 Adding chart configuration (AndreaFrancis, Jun 12, 2023)
- 0bfcb62 Apply suggestions from code review (AndreaFrancis, Jun 13, 2023)
- 2ff4f91 Change ParquetFileItem to SplitHubFile (AndreaFrancis, Jun 13, 2023)
- 3c9b4ee Inherit from SplitCachedJobRunner (AndreaFrancis, Jun 13, 2023)
- c78e99a Fix style (AndreaFrancis, Jun 13, 2023)
- 6eba4d9 Depends on info featues instead of parquet schema (AndreaFrancis, Jun 13, 2023)
- 39e7ded Fix libcommon test (AndreaFrancis, Jun 13, 2023)
- e94e1d4 Apply code review suggestions (AndreaFrancis, Jun 14, 2023)
- 4daf93d Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 14, 2023)
- aa68660 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 15, 2023)
- e28142f Some details (AndreaFrancis, Jun 15, 2023)
- a51d7d3 Fix style (AndreaFrancis, Jun 15, 2023)
- 129b8c4 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 16, 2023)
- edd120d Fix test (AndreaFrancis, Jun 16, 2023)
- a65e8dd Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 16, 2023)
- 059c632 Apply code review suggestions (AndreaFrancis, Jun 16, 2023)
- 9ecf923 Update chart/values.yaml (AndreaFrancis, Jun 19, 2023)
- 874fabd Apply suggestions from code review (AndreaFrancis, Jun 19, 2023)
- c36202f Apply code review suggestions (AndreaFrancis, Jun 19, 2023)
- 9b82a66 [docs] Improvements (#1376) (stevhliu, Jun 16, 2023)
- 3326014 Fix closing brackets and GH action link (#1389) (baskrahmer, Jun 19, 2023)
- 1410737 Fix typo in erro rmessage (#1391) (albertvillanova, Jun 19, 2023)
- 1d9574e Add docker internal to extra_hosts (#1390) (baskrahmer, Jun 19, 2023)
- 7971b34 fix: 🐛 support bigger images (#1387) (severo, Jun 19, 2023)
- 431163d Rename dev to staging, and use staging mongodb cluster (#1383) (severo, Jun 19, 2023)
- 80c7b5d feat: 🎸 10x the size of supported images (#1392) (severo, Jun 19, 2023)
- b599b10 Fix exception (AndreaFrancis, Jun 19, 2023)
- a1b3d8e Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 19, 2023)
- 187d7b6 Fix test in libcommon (AndreaFrancis, Jun 19, 2023)
- fd01ec6 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 20, 2023)
- ff4a833 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 20, 2023)
- 5c9639e Apply some code review suggestions (AndreaFrancis, Jun 20, 2023)
- ce4163a Apply code review suggestions (AndreaFrancis, Jun 20, 2023)
- 67e801f Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 20, 2023)
- 9e9e25a Adding close connection (AndreaFrancis, Jun 20, 2023)
- 517a479 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 22, 2023)
- b807613 Upgrade duckdb version (AndreaFrancis, Jun 22, 2023)
- e77b6b4 Apply code review suggestions (AndreaFrancis, Jun 22, 2023)
- 3005e2e Fix style (AndreaFrancis, Jun 22, 2023)
- 84687e0 Adding some test cases (AndreaFrancis, Jun 22, 2023)
- 27743d5 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 22, 2023)
- 021ea34 Remove duplicate code by merge (AndreaFrancis, Jun 22, 2023)
- 2d6b21c Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 26, 2023)
- 80a3c21 Fix imports (AndreaFrancis, Jun 26, 2023)
- b6f3bd9 Apply code review suggestions (AndreaFrancis, Jun 26, 2023)
- 550f118 Apply suggestions from code review (AndreaFrancis, Jun 26, 2023)
- 930f6c0 Add test (AndreaFrancis, Jun 26, 2023)
- ecfa8c5 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 26, 2023)
- a92fe90 Merge branch 'main' into duckdb-index (AndreaFrancis, Jun 27, 2023)
Files changed
4 changes: 2 additions & 2 deletions chart/static-files/openapi.json
@@ -925,11 +925,11 @@
       "properties": {
         "parquet_files": {
           "type": "array",
-          "items": { "$ref": "#/components/schemas/ParquetFileItem" }
+          "items": { "$ref": "#/components/schemas/SplitHubFile" }
         }
       }
     },
-    "ParquetFileItem": {
+    "SplitHubFile": {
       "type": "object",
       "required": ["dataset", "config", "split", "url", "filename", "size"],
       "properties": {
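This schema rename tracks the code change further down in this PR: the `ParquetFileItem` TypedDict is dropped from `parquet_utils.py` in favor of the shared `SplitHubFile` type from `libcommon.utils` (commit 2ff4f91).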
21 changes: 21 additions & 0 deletions chart/templates/_envWorker.tpl
@@ -87,4 +87,25 @@
 - name: CONFIG_NAMES_MAX_NUMBER
   value: {{ .Values.configNames.maxNumber | quote }}

+# specific to 'split-duckdb-index' job runner
+- name: DUCKDB_INDEX_COMMIT_MESSAGE
+  value: {{ .Values.duckDBIndex.commitMessage | quote }}
+- name: DUCKDB_INDEX_COMMITTER_HF_TOKEN
+  {{- if .Values.secrets.appParquetConverterHfToken.fromSecret }}
+  valueFrom:
+    secretKeyRef:
+      name: {{ .Values.secrets.appParquetConverterHfToken.secretName | quote }}
+      key: HF_TOKEN
+      optional: false
+  {{- else }}
+  value: {{ .Values.secrets.appParquetConverterHfToken.value }}
+  {{- end }}
+- name: DUCKDB_INDEX_TARGET_REVISION
+  value: {{ .Values.duckDBIndex.targetRevision | quote }}
+- name: DUCKDB_INDEX_URL_TEMPLATE
+  value: {{ .Values.duckDBIndex.urlTemplate | quote }}
+- name: DUCKDB_INDEX_MAX_PARQUET_SIZE_BYTES
+  value: {{ .Values.duckDBIndex.maxParquetSizeBytes | quote }}
+- name: DUCKDB_INDEX_STORAGE_DIRECTORY
+  value: {{ .Values.duckDBIndex.storageDirectory | quote }}
 {{- end -}}
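These `DUCKDB_INDEX_*` variables are the same ones that `DuckDbIndexConfig.from_env()` reads back (under the `DUCKDB_INDEX_` prefix) in `libs/libcommon/src/libcommon/config.py`, further down in this diff.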
9 changes: 9 additions & 0 deletions chart/templates/_helpers.tpl
@@ -169,6 +169,15 @@ The parquet-metadata/ subpath in the NFS
 {{- printf "%s/%s/%s/" .Chart.Name .Release.Name "parquet-metadata" }}
 {{- end }}

+{{/*
+The duckdb-index/ subpath in the NFS
+- in a subdirectory named as the chart (datasets-server/), and below it,
+- in a subdirectory named as the Release, so that Releases will not share the same dir
+*/}}
+{{- define "duckDBIndex.subpath" -}}
+{{- printf "%s/%s/%s/" .Chart.Name .Release.Name "duckdb-index" }}
+{{- end }}
+
 {{/*
 The datasets library will use this directory as a cache
 - in a subdirectory named as the chart (datasets-server/), and below it,
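For example, with the chart name `datasets-server` and a hypothetical release named `prod`, `duckDBIndex.subpath` renders to `datasets-server/prod/duckdb-index/`, so two releases never share the same directory.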
21 changes: 21 additions & 0 deletions chart/templates/_initContainerDuckDBIndex.tpl
@@ -0,0 +1,21 @@
+# SPDX-License-Identifier: Apache-2.0
+# Copyright 2023 The HuggingFace Authors.
+
+{{- define "initContainerDuckDBIndex" -}}
+- name: prepare-duckdb-index
+  image: ubuntu:focal
+  imagePullPolicy: {{ .Values.images.pullPolicy }}
+  command: ["/bin/sh", "-c"]
+  args:
+    - chown {{ .Values.uid }}:{{ .Values.gid }} /mounted-path;
+  volumeMounts:
+    - mountPath: /mounted-path
+      mountPropagation: None
+      name: data
+      subPath: "{{ include "duckDBIndex.subpath" . }}"
+      readOnly: false
+  securityContext:
+    runAsNonRoot: false
+    runAsUser: 0
+    runAsGroup: 0
+{{- end -}}
10 changes: 10 additions & 0 deletions chart/templates/_volumeMountDuckDBIndex.tpl
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: Apache-2.0
+# Copyright 2023 The HuggingFace Authors.
+
+{{- define "volumeMountDuckDBIndexRW" -}}
+- mountPath: {{ .Values.duckDBIndex.storageDirectory | quote }}
+  mountPropagation: None
+  name: data
+  subPath: "{{ include "duckDBIndex.subpath" . }}"
+  readOnly: false
+{{- end -}}
1 change: 1 addition & 0 deletions chart/templates/worker/_container.tpl
@@ -24,6 +24,7 @@
 {{ include "volumeMountAssetsRW" . | nindent 2 }}
 {{ include "volumeMountCache" . | nindent 2 }}
 {{ include "volumeMountParquetMetadataRW" . | nindent 2 }}
+{{ include "volumeMountDuckDBIndexRW" . | nindent 2 }}
 securityContext:
   allowPrivilegeEscalation: false
 resources: {{ toYaml .workerValues.resources | nindent 4 }}
1 change: 1 addition & 0 deletions chart/templates/worker/_deployment.yaml
@@ -26,6 +26,7 @@ spec:
 {{ include "initContainerAssets" . | nindent 8 }}
 {{ include "initContainerCache" . | nindent 8 }}
 {{ include "initContainerParquetMetadata" . | nindent 8 }}
+{{ include "initContainerDuckDBIndex" . | nindent 8 }}
 containers: {{ include "containerWorker" . | nindent 8 }}
 nodeSelector: {{ toYaml .workerValues.nodeSelector | nindent 8 }}
 tolerations: {{ toYaml .workerValues.tolerations | nindent 8 }}
11 changes: 11 additions & 0 deletions chart/values.yaml
@@ -214,6 +214,17 @@ parquetMetadata:
   # Directory on the shared storage (parquet metadata files used for random access in /rows)
   storageDirectory: "/parquet-metadata"

+duckDBIndex:
+  # Directory on the shared storage (used temporarily to prepare the duckdb indexes before sending to the Hub)
+  storageDirectory: "/duckdb-index"
[Review comment, Member] Using a non-shared temporary directory would also have worked, no? I'm fine with using shared storage though.
[Reply, Collaborator] I think it's due to the size: the pods have nearly no space, while the shared storage has as much space as we need (EFS).
[Reply, Collaborator] (even if, by default, the local duckdb will be small)
+  # the git commit message when the duckdb index file is uploaded to the Hub. Defaults to `Update duckdb index files`.
+  commitMessage: "Update duckdb index files"
+  # the git revision of the dataset where to store the duckdb index file. Defaults to `refs/convert/parquet`.
+  targetRevision: "refs/convert/parquet"
+  # the URL template to build the duckdb index file URL. Defaults to `/datasets/%s/resolve/%s/%s`.
+  urlTemplate: "/datasets/%s/resolve/%s/%s"
+  # the maximum total size in bytes of the split's parquet files; splits with a bigger parquet size are not indexed.
+  maxParquetSizeBytes: "100_000_000"
[Review comment, Member] If we want to increase this, we'll need to know how much time it takes to index datasets and to query them. We can see that later by adding some profiling.

# Directory where the cache data will be stored
cacheDirectory: "/datasets-server-cache"
41 changes: 41 additions & 0 deletions libs/libcommon/src/libcommon/config.py
@@ -24,6 +24,7 @@
     PROCESSING_STEP_DATASET_PARQUET_VERSION,
     PROCESSING_STEP_DATASET_SIZE_VERSION,
     PROCESSING_STEP_DATASET_SPLIT_NAMES_VERSION,
+    PROCESSING_STEP_SPLIT_DUCKDB_INDEX_VERSION,
     PROCESSING_STEP_SPLIT_FIRST_ROWS_FROM_PARQUET_VERSION,
     PROCESSING_STEP_SPLIT_FIRST_ROWS_FROM_STREAMING_VERSION,
     PROCESSING_STEP_SPLIT_IMAGE_URL_COLUMNS_VERSION,
@@ -104,6 +105,39 @@ def from_env(cls) -> "ParquetMetadataConfig":
         )


+DUCKDB_INDEX_STORAGE_DIRECTORY = None
+DUCKDB_INDEX_COMMIT_MESSAGE = "Update duckdb index file"
+DUCKDB_INDEX_COMMITTER_HF_TOKEN = None
+DUCKDB_INDEX_MAX_PARQUET_SIZE_BYTES = 100_000_000
+DUCKDB_INDEX_TARGET_REVISION = "refs/convert/parquet"
+DUCKDB_INDEX_URL_TEMPLATE = "/datasets/%s/resolve/%s/%s"
+
+
+@dataclass(frozen=True)
+class DuckDbIndexConfig:
+    storage_directory: Optional[str] = DUCKDB_INDEX_STORAGE_DIRECTORY
+    commit_message: str = DUCKDB_INDEX_COMMIT_MESSAGE
+    committer_hf_token: Optional[str] = DUCKDB_INDEX_COMMITTER_HF_TOKEN
+    target_revision: str = DUCKDB_INDEX_TARGET_REVISION
+    url_template: str = DUCKDB_INDEX_URL_TEMPLATE
+    max_parquet_size_bytes: int = DUCKDB_INDEX_MAX_PARQUET_SIZE_BYTES
+
+    @classmethod
+    def from_env(cls) -> "DuckDbIndexConfig":
+        env = Env(expand_vars=True)
+        with env.prefixed("DUCKDB_INDEX_"):
+            return cls(
+                storage_directory=env.str(name="STORAGE_DIRECTORY", default=DUCKDB_INDEX_STORAGE_DIRECTORY),
+                commit_message=env.str(name="COMMIT_MESSAGE", default=DUCKDB_INDEX_COMMIT_MESSAGE),
+                committer_hf_token=env.str(name="COMMITTER_HF_TOKEN", default=DUCKDB_INDEX_COMMITTER_HF_TOKEN),
+                target_revision=env.str(name="TARGET_REVISION", default=DUCKDB_INDEX_TARGET_REVISION),
+                url_template=env.str(name="URL_TEMPLATE", default=DUCKDB_INDEX_URL_TEMPLATE),
+                max_parquet_size_bytes=env.int(
+                    name="MAX_PARQUET_SIZE_BYTES", default=DUCKDB_INDEX_MAX_PARQUET_SIZE_BYTES
+                ),
+            )
+
+
 COMMON_HF_ENDPOINT = "https://huggingface.co"
 COMMON_HF_TOKEN = None

@@ -319,6 +353,13 @@ class ProcessingGraphConfig:
                 "triggered_by": ["dataset-config-names", "config-opt-in-out-urls-count"],
                 "job_runner_version": PROCESSING_STEP_DATASET_OPT_IN_OUT_URLS_COUNT_VERSION,
             },
+            "split-duckdb-index": {
+                "input_type": "split",
+                "triggered_by": [
+                    "config-split-names-from-info",
+                ],
+                "job_runner_version": PROCESSING_STEP_SPLIT_DUCKDB_INDEX_VERSION,
+            },
         }
     )
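A short usage sketch (assumed code, not part of the PR) showing how the env vars set by the chart drive this config, and how the URL template could be filled in:

```python
# Sketch (assumed, not PR code): env-driven configuration and URL composition.
import os

os.environ["DUCKDB_INDEX_MAX_PARQUET_SIZE_BYTES"] = "50_000_000"

config = DuckDbIndexConfig.from_env()
assert config.max_parquet_size_bytes == 50_000_000        # overridden by env
assert config.target_revision == "refs/convert/parquet"   # default kept

# url_template takes (dataset, revision, filename); revision quoting is elided
# in this hypothetical composition:
url = "https://huggingface.co" + config.url_template % (
    "user/dataset", config.target_revision, "index.duckdb"
)
```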
2 changes: 2 additions & 0 deletions libs/libcommon/src/libcommon/constants.py
@@ -6,6 +6,7 @@
 CACHE_MONGOENGINE_ALIAS = "cache"
 CACHED_ASSETS_CACHE_APPNAME = "datasets_server_cached_assets"
 PARQUET_METADATA_CACHE_APPNAME = "datasets_server_parquet_metadata"
+DUCKDB_INDEX_CACHE_APPNAME = "datasets_server_duckdb_index"
 METRICS_COLLECTION_CACHE_TOTAL_METRIC = "cacheTotalMetric"
 METRICS_COLLECTION_JOB_TOTAL_METRIC = "jobTotalMetric"
 METRICS_MONGOENGINE_ALIAS = "metrics"
@@ -36,6 +37,7 @@
 PROCESSING_STEP_SPLIT_OPT_IN_OUT_URLS_COUNT_VERSION = 2
 PROCESSING_STEP_SPLIT_OPT_IN_OUT_URLS_SCAN_VERSION = 4
 PROCESSING_STEP_SPLIT_IMAGE_URL_COLUMNS_VERSION = 1
+PROCESSING_STEP_SPLIT_DUCKDB_INDEX_VERSION = 1

 PROCESSING_STEP_CONFIG_PARQUET_AND_INFO_ROW_GROUP_SIZE_FOR_AUDIO_DATASETS = 100
 PROCESSING_STEP_CONFIG_PARQUET_AND_INFO_ROW_GROUP_SIZE_FOR_IMAGE_DATASETS = 100
52 changes: 42 additions & 10 deletions libs/libcommon/src/libcommon/exceptions.py
@@ -73,6 +73,7 @@ def as_response(self) -> ErrorResponse:


 CacheableErrorCode = Literal[
+    "CacheDirectoryNotInitializedError",
     "ConfigNamesError",
     "CreateCommitError",
     "DatasetInBlockListError",
@@ -89,6 +90,7 @@
     "DatasetWithTooManyConfigsError",
     "DatasetWithTooManyParquetFilesError",
     "DisabledViewerError",
+    "DuckDBIndexFileNotFoundError",
     "EmptyDatasetError",
     "ExternalFilesSizeRequestConnectionError",
     "ExternalFilesSizeRequestError",
@@ -102,6 +104,7 @@
     "JobManagerExceededMaximumDurationError",
     "LockedDatasetTimeoutError",
     "MissingSpawningTokenError",
+    "NoIndexableColumnsError",
     "NormalRowsError",
     "ParameterMissingError",
     "ParquetResponseEmptyError",
@@ -112,6 +115,7 @@
     "SplitsNamesError",
     "SplitNamesFromStreamingError",
     "SplitNotFoundError",
+    "SplitWithTooBigParquetError",
     "StreamingRowsError",
     "TooBigContentError",
     "TooManyColumnsError",
@@ -136,6 +140,13 @@ def __init__(
         )


+class CacheDirectoryNotInitializedError(CacheableError):
+    """Raised when the cache directory has not been initialized before job compute."""
+
+    def __init__(self, message: str, cause: Optional[BaseException] = None):
+        super().__init__(message, HTTPStatus.NOT_IMPLEMENTED, "CacheDirectoryNotInitializedError", cause, True)
+
+
 class ConfigNamesError(CacheableError):
     """Raised when the config names could not be fetched."""

@@ -232,6 +243,13 @@ def __init__(self, message: str, cause: Optional[BaseException] = None):
         super().__init__(message, HTTPStatus.NOT_IMPLEMENTED, "DatasetWithTooBigExternalFilesError", cause, True)


+class DatasetWithTooManyConfigsError(CacheableError):
+    """Raised when the number of configs of a dataset exceeded the limit."""
+
+    def __init__(self, message: str, cause: Optional[BaseException] = None):
+        super().__init__(message, HTTPStatus.NOT_IMPLEMENTED, "DatasetWithTooManyConfigsError", cause, True)
+
+
 class DatasetWithTooManyExternalFilesError(CacheableError):
     """Raised when the number of external data files of a dataset is too big."""

@@ -246,11 +264,11 @@ def __init__(self, message: str, cause: Optional[BaseException] = None):
         super().__init__(message, HTTPStatus.NOT_IMPLEMENTED, "DatasetWithTooManyParquetFilesError", cause, True)


-class LockedDatasetTimeoutError(CacheableError):
-    """Raised when a dataset is locked by another job."""
+class DuckDBIndexFileNotFoundError(CacheableError):
+    """Raised when no duckdb index file was found for split."""

     def __init__(self, message: str, cause: Optional[BaseException] = None):
-        super().__init__(message, HTTPStatus.NOT_IMPLEMENTED, "LockedDatasetTimeoutError", cause, True)
+        super().__init__(message, HTTPStatus.INTERNAL_SERVER_ERROR, "DuckDBIndexFileNotFoundError", cause, False)


 class DisabledViewerError(CacheableError):
@@ -355,6 +373,13 @@ def __init__(self, message: str, cause: Optional[BaseException] = None):
         )


+class LockedDatasetTimeoutError(CacheableError):
+    """Raised when a dataset is locked by another job."""
+
+    def __init__(self, message: str, cause: Optional[BaseException] = None):
+        super().__init__(message, HTTPStatus.NOT_IMPLEMENTED, "LockedDatasetTimeoutError", cause, True)
+
+
 class MissingSpawningTokenError(CacheableError):
     """Raised when the spawning.ai token is not set."""

@@ -369,6 +394,13 @@ def __init__(self, message: str, cause: Optional[BaseException] = None):
         super().__init__(message, HTTPStatus.INTERNAL_SERVER_ERROR, "NormalRowsError", cause, True)


+class NoIndexableColumnsError(CacheableError):
+    """Raised when split does not have string columns to index."""
+
+    def __init__(self, message: str, cause: Optional[BaseException] = None):
+        super().__init__(message, HTTPStatus.NOT_IMPLEMENTED, "NoIndexableColumnsError", cause, True)
+
+
 class ParameterMissingError(CacheableError):
     """Raised when request is missing some parameter."""

@@ -450,6 +482,13 @@ def __init__(self, message: str, cause: Optional[BaseException] = None):
         )


+class SplitWithTooBigParquetError(CacheableError):
+    """Raised when the split parquet size (sum of parquet sizes given) is too big."""
+
+    def __init__(self, message: str, cause: Optional[BaseException] = None):
+        super().__init__(message, HTTPStatus.INTERNAL_SERVER_ERROR, "SplitWithTooBigParquetError", cause, False)
+
+
 class StreamingRowsError(CacheableError):
     """Raised when the rows could not be fetched in streaming mode."""

@@ -496,10 +535,3 @@ class UnsupportedExternalFilesError(CacheableError):

     def __init__(self, message: str, cause: Optional[BaseException] = None):
         super().__init__(message, HTTPStatus.NOT_IMPLEMENTED, "UnsupportedExternalFilesError", cause, True)
-
-
-class DatasetWithTooManyConfigsError(CacheableError):
-    """Raised when the number of configs of a dataset exceeded the limit."""
-
-    def __init__(self, message: str, cause: Optional[BaseException] = None):
-        super().__init__(message, HTTPStatus.NOT_IMPLEMENTED, "DatasetWithTooManyConfigsError", cause, True)
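The new split-level errors hint at the validations the `split-duckdb-index` job runner performs before indexing. A minimal sketch under assumed signatures (the runner itself is not part of this diff):

```python
# Sketch (assumed helper, not in this diff): guard clauses using the new errors.
from typing import List

def validate_split(
    string_columns: List[str], parquet_sizes: List[int], max_parquet_size_bytes: int
) -> None:
    if not string_columns:
        # Full-text search needs text: no string columns means nothing to index.
        raise NoIndexableColumnsError("No string columns available to index.")
    total = sum(parquet_sizes)
    if total > max_parquet_size_bytes:
        # Sum of the split's parquet file sizes exceeds the configured limit.
        raise SplitWithTooBigParquetError(
            f"Split parquet size {total} exceeds the limit of {max_parquet_size_bytes} bytes."
        )
```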
12 changes: 2 additions & 10 deletions libs/libcommon/src/libcommon/parquet_utils.py
@@ -20,6 +20,7 @@
 from libcommon.processing_graph import ProcessingGraph
 from libcommon.prometheus import StepProfiler
 from libcommon.simple_cache import get_previous_step_or_raise
+from libcommon.utils import SplitHubFile

 StrPath = Union[str, PathLike[str]]

@@ -36,15 +37,6 @@ class FileSystemError(Exception):
     pass


-class ParquetFileItem(TypedDict):
-    dataset: str
-    config: str
-    split: str
-    url: str
-    filename: str
-    size: int
-
-
 class ParquetFileMetadataItem(TypedDict):
     dataset: str
     config: str
@@ -134,7 +126,7 @@ def query(self, offset: int, length: int) -> pa.Table:

     @staticmethod
     def from_parquet_file_items(
-        parquet_file_items: List[ParquetFileItem],
+        parquet_file_items: List[SplitHubFile],
         dataset: str,
         config: str,
         split: str,
15 changes: 15 additions & 0 deletions libs/libcommon/src/libcommon/storage.py
@@ -12,6 +12,7 @@
 from libcommon.constants import (
     ASSETS_CACHE_APPNAME,
     CACHED_ASSETS_CACHE_APPNAME,
+    DUCKDB_INDEX_CACHE_APPNAME,
     PARQUET_METADATA_CACHE_APPNAME,
 )

@@ -81,6 +82,20 @@ def init_parquet_metadata_dir(directory: Optional[StrPath] = None) -> StrPath:
     return init_dir(directory, appname=PARQUET_METADATA_CACHE_APPNAME)


+def init_duckdb_index_cache_dir(directory: Optional[StrPath] = None) -> StrPath:
+    """Initialize the duckdb index directory.
+
+    If directory is None, it will be set to the default duckdb index location on the machine.
+
+    Args:
+        directory (Optional[Union[str, PathLike[str]]], optional): The directory to initialize. Defaults to None.
+
+    Returns:
+        Union[str, PathLike[str]]: The directory.
+    """
+    return init_dir(directory, appname=DUCKDB_INDEX_CACHE_APPNAME)
+
+
 def exists(path: StrPath) -> bool:
     """Check if a path exists.
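Presumably the worker wires this up the same way as the existing parquet-metadata directory; a sketch under that assumption:

```python
# Sketch (assumed wiring, mirroring the init_parquet_metadata_dir pattern).
duckdb_index_config = DuckDbIndexConfig.from_env()
duckdb_index_cache_directory = init_duckdb_index_cache_dir(
    directory=duckdb_index_config.storage_directory
)
```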