Parsing local and cloud SEG-Y files with new I/O library (#381)
* Refactor text_header setter type check

Removed the list type check from the text_header setter in accessor.py. The application now expects a string input instead of a list, simplifying the validation process.

* Refactor workers for SEG-Y parsing

Simplify `header_scan_worker` and `trace_worker` in SEG-Y module by removing unused imports and streamlining parameter list. Update functions to work directly with `SegyFile` instances and clean up data handling logic for efficiency.

* Refactor SEG-Y parser and streamline imports

Refactor the parsing functions in `src/mdio/segy/parsers.py` to simplify the codebase and improve maintainability. Redundant functions such as `parse_binary_header`, `parse_text_header`, and `get_trace_count` have been removed, while imports have been condensed to only essential modules. The `NUM_CORES` logic is updated to count logical cores instead of just physical ones.

* Refactor SEG-Y converter and simplify imports

Removed unused imports and functions in the SEG-Y converter module to enhance code maintainability. Simplified the arguments for the `segy_to_mdio` function to increase ease of use and readability. Reduced complexity by utilizing `SegyFile` class for SEG-Y file operations.

* Refactor get_grid_plan and remove unused imports

The get_grid_plan function in utilities.py has been refactored to accept a SegyFile instance instead of individual parameters for the file path. Unused imports were eliminated, and type checking imports are now conditional, improving readability and modularity.

* use NDArray typing since we now return struct

* Refactor to use 'segy' instead of 'segyio'.

The changes involve a major refactoring of the code base to use the 'segy' library instead of 'segyio'. Most notably, this includes updating the handling of SEG-Y dtypes, byte order, and trace headers. Unused imports have been removed to clean up the code. A new multiprocessing chunk size has been introduced, and attributes are now set on the SegyFile instance instead of being passed as function arguments.

* refactor override tests to use ndarray headers instead of a dictionary to make it work with 'segy'.

* Remove unit tests for IBM/IEEE conversions and text headers

* Refactor and simplify 6D tests related to SEG-Y

* Upgrade segy package version

The segy package version has been updated from 0.0.13 to 0.0.14 in the pyproject.toml file. This upgrade was performed to update software dependencies and to integrate the latest bug fixes and features delivered with the new version.

* Refactored segy factory creation in mdio_to_segy function

A new helper function, 'make_segy_factory', has been created to handle the generation of SegyFactory. This function accepts more parameters to provide better control over the creation of the SEG-Y based on the MDIO metadata. Changes also include updates in import declarations and reorganization of some code blocks in the 'mdio_spec_to_segy' function.

* Update segy library version

* Multiply sample_interval by 1000 in SegyFactory

In the SegyFactory initialization within creation.py, the sample_interval parameter has been modified to be multiplied by 1000. This change ensures that the value is correctly represented in microseconds, aligning with the expected data format.

* fix docstring errors

* Update dependency package versions

* update field name for segy data

* import Endianness from new location

* use bleeding edge segy during dev

* allow configuring endianness on export

* update binary header

* Update the 'segy' git repository link

* Update virtualenv version in constraints.txt

* Update poetry version in workflow constraints

* update RtD dependencies

* switch myst-nb to stable

* fix broken tests

* simplify factory usage and fix tests

* add original segyio fields as spec

* Add pytest-dependency to project dev dependencies

* fix: headers were missed due to early return

* streamline mdio segy spec

* simplify mock 4d generation

* enforce mdio segy spec

* update type hints to the correct segy type.

* revert api

* update type hints

* remove endian from segy import because it's inferred

* remove output format from seg-y export. we only export as it's set in "binary header"

* update endian kwarg name

* revert to old api

* enable all tests

* Update get_grid_plan

* Remove unused byte swapping function from segy creation module.

* Remove now unused byte utils module

* Add temporary safety check ignore for specific CVE

The safety check in noxfile.py has been updated to temporarily ignore a specific Common Vulnerabilities and Exposures (CVE) number because it's not deemed critical. A TODO note has been added as a reminder to remove this exception once the issue is resolved.

* fix safety ignore syntax

* make temp zarr files module scoped

* revert to_segy endian api

* simplify changes

* Correct variable in default chunk selection

* Update segy package version in pyproject.toml

* use correct spec for factory

* use new endian inference from `segy`

* get header dtype from spec instead of reading a header

* remove unnecessary cast

* remove commented line

* Implement dynamic CPU count for header parsing

* backward_compat: revert text header to write as list[str] instead of str with newline

* generate spec as needed and avoid singleton bugs

* bump version

* add missing return doc

* Add future annotations import for type hints
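
The sample-interval change in the commit message above (multiplying by 1000 so the value is in microseconds) can be illustrated with a minimal sketch. It assumes the stored interval is in milliseconds; the function name `sample_interval_to_microseconds` is hypothetical and is not taken from `creation.py`.

```python
def sample_interval_to_microseconds(interval_ms: float) -> int:
    """Convert a sample interval from milliseconds to integer microseconds.

    Hypothetical helper for illustration only; not part of MDIO or `segy`.
    """
    # Round before casting so floating-point noise cannot truncate the value.
    return int(round(interval_ms * 1000))


print(sample_interval_to_microseconds(4))     # 4 ms sampling
print(sample_interval_to_microseconds(0.25))  # 0.25 ms sampling
```

Rounding to an integer is a choice made for this sketch, on the assumption that the target field holds whole microseconds.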
tasansal authored Jun 25, 2024
1 parent 6bb1956 commit f4a5ad4
Showing 29 changed files with 1,577 additions and 2,306 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/constraints-poetry.txt
@@ -1 +1 @@
-poetry==1.8.2
+poetry==1.8.3
2 changes: 1 addition & 1 deletion .github/workflows/constraints.txt
@@ -1,4 +1,4 @@
 pip==24.0
 nox==2024.4.15
 nox-poetry==1.0.3
-virtualenv==20.26.1
+virtualenv==20.26.2
5 changes: 2 additions & 3 deletions docs/requirements.txt
@@ -1,7 +1,6 @@
 furo==2024.5.6
 sphinx==7.3.7
-sphinx-click==5.1.0
+sphinx-click==6.0.0
 sphinx-copybutton==0.5.2
-# myst-nb==0.17.2
-myst-nb @ git+https://github.com/executablebooks/MyST-NB@35ebd54
+myst-nb==1.1.0
 linkify-it-py==2.0.3
18 changes: 11 additions & 7 deletions noxfile.py
@@ -144,7 +144,15 @@ def safety(session: Session) -> None:
     """Scan dependencies for insecure packages."""
     requirements = session.poetry.export_requirements()
     session.install("safety")
-    session.run("safety", "check", "--full-report", f"--file={requirements}")
+    # TODO(Altay): Remove the CVE ignore once its resolved. Its not critical, so ignoring now.
+    ignore = ["70612"]
+    session.run(
+        "safety",
+        "check",
+        "--full-report",
+        f"--file={requirements}",
+        f"--ignore={','.join(ignore)}",
+    )


 @session(python=python_versions)
@@ -219,9 +227,7 @@ def docs_build(session: Session) -> None:
         "sphinx-click",
         "sphinx-copybutton",
         "furo",
-        # TODO(Altay): Update this to v1.0.0 when its out. Right now we
-        # use this because myst-nb stable doesn't work with Sphinx 7.
-        "myst-nb@git+https://github.com/executablebooks/MyST-NB@35ebd54",
+        "myst-nb",
         "linkify-it-py",
     )

@@ -243,9 +249,7 @@ def docs(session: Session) -> None:
         "sphinx-click",
         "sphinx-copybutton",
         "furo",
-        # TODO(Altay): Update this to v1.0.0 when its out. Right now we
-        # use this because myst-nb stable doesn't work with Sphinx 7.
-        "myst-nb@git+https://github.com/executablebooks/MyST-NB@35ebd54",
+        "myst-nb",
         "linkify-it-py",
     )
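
As a minimal sketch of the `safety` change in the noxfile.py diff above, the argument list can be built so that the `--ignore` flag is only added when there are advisory IDs to skip. The helper name `build_safety_args` is hypothetical and is not part of the commit; the flags are the ones shown in the diff.

```python
from typing import Sequence


def build_safety_args(requirements: str, ignore: Sequence[str] = ()) -> list[str]:
    """Assemble the arguments that would be passed on to `session.run`.

    Hypothetical helper for illustration only; not part of noxfile.py.
    """
    args = ["safety", "check", "--full-report", f"--file={requirements}"]
    if ignore:
        # Several advisory IDs are joined with commas into a single flag.
        args.append(f"--ignore={','.join(ignore)}")
    return args


print(build_safety_args("requirements.txt", ["70612"]))
```

Keeping the IDs in a list, as the commit does, means the temporary exception can be dropped later by emptying the list.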

1,551 changes: 1,059 additions & 492 deletions poetry.lock

Large diffs are not rendered by default.

62 changes: 31 additions & 31 deletions pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "multidimio"
-version = "0.7.4"
+version = "0.8.0"
 description = "Cloud-native, scalable, and user-friendly multi dimensional energy data!"
 authors = ["TGS <sys-opensource@tgs.com>"]
 maintainers = [
@@ -26,22 +26,21 @@ Changelog = "https://github.com/TGSAI/mdio-python/releases"
 python = ">=3.9,<3.13"
 click = "^8.1.7"
 click-params = "^0.5.0"
-zarr = "^2.16.1"
-dask = ">=2023.10.0"
-tqdm = "^4.66.1"
-segyio = "^1.9.3"
-numba = "^0.59.1"
-psutil = "^5.9.5"
-fsspec = ">=2023.9.1"
+zarr = "^2.18.2"
+dask = ">=2024.6.1"
+tqdm = "^4.66.4"
+psutil = "^6.0.0"
+fsspec = ">=2024.6.0"
+segy = "^0.1.4"
 rich = "^13.7.1"
 urllib3 = "^1.26.18" # Workaround for poetry-plugin-export/issues/183

 # Extras
-distributed = {version = ">=2023.10.0", optional = true}
-bokeh = {version = "^3.2.2", optional = true}
-s3fs = {version = ">=2023.5.0", optional = true}
-gcsfs = {version = ">=2023.5.0", optional = true}
-adlfs = {version = ">=2023.4.0", optional = true}
+distributed = {version = ">=2024.6.1", optional = true}
+bokeh = {version = "^3.4.1", optional = true}
+s3fs = {version = ">=2024.6.0", optional = true}
+gcsfs = {version = ">=2024.6.0", optional = true}
+adlfs = {version = ">=2024.4.1", optional = true}
 zfpy = {version = "^0.5.5", optional = true}

 [tool.poetry.extras]
@@ -51,30 +50,31 @@ lossy = ["zfpy"]

 [tool.poetry.group.dev.dependencies]
 black = "^24.4.2"
-coverage = {version = "^7.4.0", extras = ["toml"]}
+coverage = {version = "^7.5.3", extras = ["toml"]}
 darglint = "^1.8.1"
-flake8 = "^7.0.0"
+flake8 = "^7.1.0"
 flake8-bandit = "^4.1.1"
-flake8-bugbear = "^23.12.2"
+flake8-bugbear = "^24.4.26"
 flake8-docstrings = "^1.7.0"
 flake8-rst-docstrings = "^0.3.0"
-furo = ">=2023.9.10"
+furo = ">=2024.5.6"
 isort = "^5.13.2"
-mypy = "^1.8.0"
-pep8-naming = "^0.13.3"
-pre-commit = "^3.6.0"
-pre-commit-hooks = "^4.5.0"
-pytest = "^7.4.4"
-pyupgrade = "^3.15.0"
-safety = "^2.3.5"
-sphinx-autobuild = "^2021.3.14"
-sphinx-click = "^5.1.0"
+mypy = "^1.10.0"
+pep8-naming = "^0.14.1"
+pre-commit = "^3.7.1"
+pre-commit-hooks = "^4.6.0"
+pytest = "^8.2.2"
+pytest-dependency = "^0.6.0"
+pyupgrade = "^3.16.0"
+safety = "^3.2.3"
+sphinx-autobuild = ">=2024.4.16"
+sphinx-click = "^6.0.0"
 sphinx-copybutton = "^0.5.2"
-typeguard = "^4.1.5"
-xdoctest = {version = "^1.1.2", extras = ["colors"]}
-myst-parser = "^2.0.0"
-Pygments = "^2.17.2"
-Sphinx = "^7.2.6"
+typeguard = "^4.3.0"
+xdoctest = {version = "^1.1.5", extras = ["colors"]}
+myst-parser = "^3.0.1"
+Pygments = "^2.18.0"
+Sphinx = "^7.3.7"

 [tool.poetry.scripts]
 mdio = "mdio.__main__:main"
24 changes: 0 additions & 24 deletions src/mdio/commands/segy.py
@@ -96,16 +96,6 @@
     help="Custom chunk size for bricked storage",
     type=IntListParamType(),
 )
-@option(
-    "-endian",
-    "--endian",
-    required=False,
-    default="big",
-    help="Endianness of the SEG-Y file",
-    type=Choice(["little", "big"]),
-    show_default=True,
-    show_choices=True,
-)
 @option(
     "-lossless",
     "--lossless",
@@ -152,7 +142,6 @@ def segy_import(
     header_types: list[str],
     header_names: list[str],
     chunk_size: list[int],
-    endian: str,
     lossless: bool,
     compression_tolerance: float,
     storage_options: dict[str, Any],
@@ -356,7 +345,6 @@ def segy_import(
         index_types=header_types,
         index_names=header_names,
         chunksize=chunk_size,
-        endian=endian,
         lossless=lossless,
         compression_tolerance=compression_tolerance,
         storage_options=storage_options,
@@ -377,16 +365,6 @@ def segy_import(
     type=STRING,
     show_default=True,
 )
-@option(
-    "-format",
-    "--segy-format",
-    required=False,
-    default="ibm32",
-    help="SEG-Y sample format",
-    type=Choice(["ibm32", "ieee32"]),
-    show_default=True,
-    show_choices=True,
-)
 @option(
     "-storage",
     "--storage-options",
@@ -408,7 +386,6 @@ def segy_export(
     mdio_file: str,
     segy_path: str,
     access_pattern: str,
-    segy_format: str,
     storage_options: dict[str, Any],
     endian: str,
 ):
@@ -438,7 +415,6 @@ def segy_export(
         mdio_path_or_buffer=mdio_file,
         output_segy_path=segy_path,
         access_pattern=access_pattern,
-        out_sample_format=segy_format,
         storage_options=storage_options,
         endian=endian,
     )
21 changes: 4 additions & 17 deletions src/mdio/converters/mdio.py
@@ -12,8 +12,6 @@

 from mdio import MDIOReader
 from mdio.segy.blocked_io import to_segy
-from mdio.segy.byte_utils import ByteOrder
-from mdio.segy.byte_utils import Dtype
 from mdio.segy.creation import concat_files
 from mdio.segy.creation import mdio_spec_to_segy
 from mdio.segy.utilities import segy_export_rechunker
@@ -34,7 +32,6 @@ def mdio_to_segy( # noqa: C901
     output_segy_path: str,
     endian: str = "big",
     access_pattern: str = "012",
-    out_sample_format: str = "ibm32",
     storage_options: dict = None,
     new_chunks: tuple[int, ...] = None,
     selection_mask: np.ndarray = None,
@@ -65,8 +62,6 @@ def mdio_to_segy( # noqa: C901
             endian. Default is 'big'.
         access_pattern: This specificies the chunk access pattern. Underlying
             zarr.Array must exist. Examples: '012', '01'
-        out_sample_format: Output sample format.
-            Currently support: {'ibm32', 'float32'}. Default is 'ibm32'.
         storage_options: Storage options for the cloud storage backend.
             Default: None (will assume anonymous access)
         new_chunks: Set manual chunksize. For development purposes only.
@@ -99,7 +94,6 @@ def mdio_to_segy( # noqa: C901
     ...     mdio_path_or_buffer="prefix2/file.mdio",
     ...     output_segy_path="prefix/file.segy",
     ...     selection_mask=boolean_mask,
-    ...     out_sample_format="float32",
     ... )
     """
@@ -117,25 +111,23 @@ def mdio_to_segy( # noqa: C901
     creation_args = [
         mdio_path_or_buffer,
         output_segy_path,
-        endian,
         access_pattern,
-        out_sample_format,
+        endian,
         storage_options,
         new_chunks,
         selection_mask,
         backend,
     ]

     if client is not None:
         if distributed is not None:
             # This is in case we work with big data
             feature = client.submit(mdio_spec_to_segy, *creation_args)
-            mdio, sample_format = feature.result()
+            mdio, segy_factory = feature.result()
         else:
             msg = "Distributed client was provided, but `distributed` is not installed"
             raise ImportError(msg)
     else:
-        mdio, sample_format = mdio_spec_to_segy(*creation_args)
+        mdio, segy_factory = mdio_spec_to_segy(*creation_args)

     live_mask = mdio.live_mask.compute()

@@ -163,10 +155,6 @@ def mdio_to_segy( # noqa: C901
         selection_mask = selection_mask[dim_slices]
         live_mask = live_mask & selection_mask

-    # Parse output type and byte order
-    out_dtype = Dtype[out_sample_format.upper()]
-    out_byteorder = ByteOrder[endian.upper()]
-
     # tmp file root
     out_dir = path.dirname(output_segy_path)
     tmp_dir = TemporaryDirectory(dir=out_dir)
@@ -177,8 +165,7 @@ def mdio_to_segy( # noqa: C901
                 samples=samples,
                 headers=headers,
                 live_mask=live_mask,
-                out_dtype=out_dtype,
-                out_byteorder=out_byteorder,
+                segy_factory=segy_factory,
                 file_root=tmp_dir.name,
                 axis=tuple(range(1, samples.ndim)),
             )
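
The dispatch in `mdio_to_segy` above, which runs the spec creation locally unless a distributed client is supplied, follows a generic submit-or-call pattern. The sketch below illustrates that pattern only: `run_local_or_submit` and `add` are hypothetical names, and `ThreadPoolExecutor` from the standard library stands in for a `distributed` client because both expose `submit(...)` returning a future with `.result()`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_local_or_submit(func, args, client=None):
    """Run func(*args) in-process, or submit it to `client` when one is given.

    Hypothetical helper for illustration only; not part of MDIO.
    """
    if client is None:
        return func(*args)
    # Any object with submit(...) -> future and future.result() works here.
    future = client.submit(func, *args)
    return future.result()


def add(a, b):
    """Toy workload standing in for the real spec-creation function."""
    return a + b


print(run_local_or_submit(add, [1, 2]))
with ThreadPoolExecutor(max_workers=1) as pool:
    # The executor is an assumption-labeled stand-in for a cluster client.
    print(run_local_or_submit(add, [1, 2], client=pool))
```

Passing the arguments as one list, as `creation_args` does in the diff, keeps the local and remote call sites identical.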