Allow configuring input and output storage options in SEG-Y ingestion #479

Merged · 4 commits · Dec 13, 2024
`docs/usage.md` (22 changes: 11 additions & 11 deletions)
@@ -41,8 +41,8 @@
The protocols that help choose a backend (i.e. `s3://`, `gs://`, or `az://`) can be
prepended to the **_MDIO_** path.

The connection string can be passed to the command-line-interface (CLI) using the
-`-storage, --storage-options` flag as a JSON string or the Python API with the `storage_options`
-keyword argument as a Python dictionary.
+`-storage-{input,output}, --storage-options-{input,output}` flag as a JSON string or the Python API with
+the `storage_options_{input,output}` keyword argument as a Python dictionary.

````{warning}
On Windows clients, JSON strings are passed to the CLI with a special escape character.
@@ -66,7 +66,7 @@
If this is done incorrectly, you will get an invalid JSON string error from the CLI.

Credentials can be automatically fetched from pre-authenticated AWS CLI.
See [here](https://s3fs.readthedocs.io/en/latest/index.html#credentials) for the order `s3fs`
-checks them. If it is not pre-authenticated, you need to pass `--storage-options`.
+checks them. If it is not pre-authenticated, you need to pass `--storage-options-{input,output}`.

**Prefix:**
`s3://`
@@ -82,7 +82,7 @@
mdio segy import \
path/to/my.segy \
s3://bucket/prefix/my.mdio \
--header-locations 189,193 \
---storage-options '{"key": "my_super_private_key", "secret": "my_super_private_secret"}'
+--storage-options-output '{"key": "my_super_private_key", "secret": "my_super_private_secret"}'
```

Using Windows (note the extra escape characters `\`):
@@ -92,14 +92,14 @@
mdio segy import \
path/to/my.segy \
s3://bucket/prefix/my.mdio \
--header-locations 189,193 \
---storage-options "{\"key\": \"my_super_private_key\", \"secret\": \"my_super_private_secret\"}"
+--storage-options-output "{\"key\": \"my_super_private_key\", \"secret\": \"my_super_private_secret\"}"
```

### Google Cloud Provider

Credentials can be automatically fetched from pre-authenticated `gcloud` CLI.
See [here](https://gcsfs.readthedocs.io/en/latest/#credentials) for the order `gcsfs`
-checks them. If it is not pre-authenticated, you need to pass `--storage-options`.
+checks them. If it is not pre-authenticated, you need to pass `--storage-options-{input,output}`.

GCP uses [service accounts](https://cloud.google.com/iam/docs/service-accounts) to pass
authentication information to APIs.
@@ -117,7 +117,7 @@
mdio segy import \
path/to/my.segy \
gs://bucket/prefix/my.mdio \
--header-locations 189,193 \
---storage-options '{"token": "~/.config/gcloud/application_default_credentials.json"}'
+--storage-options-output '{"token": "~/.config/gcloud/application_default_credentials.json"}'
```

Using browser to populate authentication:
@@ -127,14 +127,14 @@
mdio segy import \
path/to/my.segy \
gs://bucket/prefix/my.mdio \
--header-locations 189,193 \
---storage-options '{"token": "browser"}'
+--storage-options-output '{"token": "browser"}'
```

### Microsoft Azure

There are various ways to authenticate with Azure Data Lake (ADL).
See [here](https://github.com/fsspec/adlfs#details) for some details.
-If ADL is not pre-authenticated, you need to pass `--storage-options`.
+If ADL is not pre-authenticated, you need to pass `--storage-options-{input,output}`.

**Prefix:**
`az://` or `abfs://`
@@ -148,7 +148,7 @@
mdio segy import \
path/to/my.segy \
az://bucket/prefix/my.mdio \
--header-locations 189,193 \
---storage-options '{"account_name": "myaccount", "account_key": "my_super_private_key"}'
+--storage-options-output '{"account_name": "myaccount", "account_key": "my_super_private_key"}'
```

### Advanced Cloud Features
@@ -190,7 +190,7 @@
reduces object-store request costs.

When combining advanced protocols like `simplecache` and using a remote store like `s3` the
URL can be chained like `simplecache::s3://bucket/prefix/file.mdio`. When doing this the
-`--storage-options` argument must explicitly state parameters for the cloud backend and the
+`--storage-options-{input,output}` argument must explicitly state parameters for the cloud backend and the
extra protocol. For the above example it would look like this:

```json
…
```
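Stepping back from the diff for a moment: the updated docs describe passing storage options either as a JSON string on the CLI or as a Python dictionary through the API. A minimal sketch of that mapping, using only the standard library (the credential values are placeholders, not real keys):

```python
import json

# Placeholder credentials for illustration only.
storage_options_input = {"key": "my_super_private_key", "secret": "my_super_private_secret"}
storage_options_output = {"token": "browser"}

# The Python API takes the dicts directly; the CLI takes the same
# content serialized as JSON strings.
flag_input = json.dumps(storage_options_input)
flag_output = json.dumps(storage_options_output)

print(flag_input)   # → {"key": "my_super_private_key", "secret": "my_super_private_secret"}
print(flag_output)  # → {"token": "browser"}
```

Round-tripping with `json.loads` recovers the original dict, which is effectively what the CLI does before handing the value to the Python layer.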
`src/mdio/commands/segy.py` (19 changes: 14 additions & 5 deletions)
@@ -115,10 +115,17 @@
show_default=True,
)
@option(
-    "-storage",
-    "--storage-options",
+    "-storage-input",
+    "--storage-options-input",
+    required=False,
+    help="Storage options for SEG-Y input file.",
+    type=JSON,
+)
+@option(
+    "-storage-output",
+    "--storage-options-output",
    required=False,
-    help="Custom storage options for cloud backends",
+    help="Storage options for the MDIO output file.",
type=JSON,
)
@option(
@@ -144,7 +151,8 @@ def segy_import(
chunk_size: list[int],
lossless: bool,
compression_tolerance: float,
-    storage_options: dict[str, Any],
+    storage_options_input: dict[str, Any],
+    storage_options_output: dict[str, Any],
overwrite: bool,
grid_overrides: dict[str, Any],
):
@@ -347,7 +355,8 @@ def segy_import(
chunksize=chunk_size,
lossless=lossless,
compression_tolerance=compression_tolerance,
-        storage_options=storage_options,
+        storage_options_input=storage_options_input,
+        storage_options_output=storage_options_output,
overwrite=overwrite,
grid_overrides=grid_overrides,
)
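The options above rely on a JSON param type so that each flag's value arrives in `segy_import` already parsed into a dict. A rough stand-in for that parsing step using only the standard library (`argparse` here is illustrative; the real CLI is click-based and uses its `JSON` type):

```python
import argparse
import json

parser = argparse.ArgumentParser()
# Mirrors the two new flags; json.loads plays the role of the JSON param type.
parser.add_argument("--storage-options-input", type=json.loads, default=None)
parser.add_argument("--storage-options-output", type=json.loads, default=None)

args = parser.parse_args([
    "--storage-options-input", '{"anon": true}',
    "--storage-options-output", '{"key": "k", "secret": "s"}',
])
print(args.storage_options_input)   # → {'anon': True}
print(args.storage_options_output)  # → {'key': 'k', 'secret': 's'}
```

A malformed JSON string fails at parse time with a usage error, which matches the doc's note about invalid JSON string errors from the CLI.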
`src/mdio/converters/segy.py` (27 changes: 18 additions & 9 deletions)
@@ -4,15 +4,16 @@

import logging
import os
+from collections.abc import Sequence
from datetime import datetime
from datetime import timezone
from importlib import metadata
from typing import Any
-from typing import Sequence

import numpy as np
import zarr
from segy import SegyFile
+from segy.config import SegySettings
from segy.schema import HeaderField

from mdio.api.io_utils import process_url
@@ -113,7 +114,8 @@ def segy_to_mdio(  # noqa: C901
chunksize: Sequence[int] | None = None,
lossless: bool = True,
compression_tolerance: float = 0.01,
-    storage_options: dict[str, Any] | None = None,
+    storage_options_input: dict[str, Any] | None = None,
+    storage_options_output: dict[str, Any] | None = None,
overwrite: bool = False,
grid_overrides: dict | None = None,
) -> None:
@@ -164,7 +166,9 @@ def segy_to_mdio(  # noqa: C901
accuracy mode in ZFP guarantees there won't be any errors larger
than this value. The default is 0.01, which gives about 70%
reduction in size. Will be ignored if `lossless=True`.
-        storage_options: Storage options for the cloud storage backend.
+        storage_options_input: Storage options for SEG-Y input file.
+            Default is `None` (will assume anonymous)
+        storage_options_output: Storage options for the MDIO output file.
            Default is `None` (will assume anonymous)
overwrite: Toggle for overwriting existing store
grid_overrides: Option to add grid overrides. See examples.
@@ -355,20 +359,25 @@ def segy_to_mdio(  # noqa: C901
)
raise ValueError(message)

-    if storage_options is None:
-        storage_options = {}
+    # Handle storage options and check permissions etc
+    if storage_options_input is None:
+        storage_options_input = {}
+
+    if storage_options_output is None:
+        storage_options_output = {}

store = process_url(
url=mdio_path_or_buffer,
mode="w",
-        storage_options=storage_options,
+        storage_options=storage_options_output,
memory_cache_size=0, # Making sure disk caching is disabled,
disk_cache=False, # Making sure disk caching is disabled
)

# Open SEG-Y with MDIO's SegySpec. Endianness will be inferred.
mdio_spec = mdio_segy_spec()
-    segy = SegyFile(url=segy_path, spec=mdio_spec)
+    segy_settings = SegySettings(storage_options=storage_options_input)
+    segy = SegyFile(url=segy_path, spec=mdio_spec, settings=segy_settings)

text_header = segy.text_header
binary_header = segy.binary_header
@@ -380,7 +389,7 @@ def segy_to_mdio(  # noqa: C901
for name, byte, format_ in zip(index_names, index_bytes, index_types): # noqa: B905
index_fields.append(HeaderField(name=name, byte=byte, format=format_))
mdio_spec_grid = mdio_spec.customize(trace_header_fields=index_fields)
-    segy_grid = SegyFile(url=segy_path, spec=mdio_spec_grid)
+    segy_grid = SegyFile(url=segy_path, spec=mdio_spec_grid, settings=segy_settings)

dimensions, chunksize, index_headers = get_grid_plan(
segy_file=segy_grid,
@@ -482,7 +491,7 @@ def segy_to_mdio(  # noqa: C901
store_nocache = process_url(
url=mdio_path_or_buffer,
mode="r+",
-        storage_options=storage_options,
+        storage_options=storage_options_output,
memory_cache_size=0, # Making sure disk caching is disabled,
disk_cache=False, # Making sure disk caching is disabled
)