DATAUP-682: Add import spec writers for xsv files #155

Merged · 5 commits · Jan 9, 2022
3 changes: 2 additions & 1 deletion .gitignore
@@ -103,4 +103,5 @@ ENV/
data/
.DS_Store
.virtualenvs/
test.env
run_tests_single.sh
Collaborator:

I don't see this file

Member Author:

Yeah, I put it in the gitignore so it didn't get checked in accidentally. It's just a hack on run_tests.sh to allow running a single file of tests.

188 changes: 188 additions & 0 deletions staging_service/import_specifications/file_writers.py
@@ -0,0 +1,188 @@
"""
Write an import specification to one or more files.

The names of the files will be the datatype suffixed by the file extension unless the writer
handles Excel or similar files that can contain multiple datatypes, in which case the
file name will be import_specification suffixed by the extension.

All the write_* functions in this module have the same function signature:

:param folder: Where the files should be written. The folder must exist.
:param types: The import specifications to write. This is a dictionary of data types as strings
to the specifications for the data type. Each specification has two required keys:
* `order_and_display`: this is a list of lists. Each inner list has two elements:
* The parameter ID of a parameter. This is typically the `id` field from the
KBase app `spec.json` file.
* The display name of the parameter. This is typically the `ui-name` field from the
KBase app `display.yaml` file.
The order of the inner lists in the outer list defines the order of the columns
in the resulting import specification files.
* `data`: this is a list of str->str or number dicts. The keys of the dicts are the
parameter IDs as described above, while the values are the values of the parameters.
Each dict must have exactly the same keys as the `order_and_display` structure. Each
entry in the list corresponds to a row in the resulting import specification,
and the order of the list defines the order of the rows.
Leave the `data` list empty to write an empty template.
:returns: A mapping of the data types to the files to which they were written.
"""
# note that we can't use an f string here to interpolate the variables below, e.g.
# order_and_display, etc.

import collections
import csv
import numbers

from typing import Any
from pathlib import Path

# this version is synonymous with the versions in individual_parsers.py. However, this module
# should only ever write the most recent format for import specifications, while the parsers
# may need to also be able to parse earlier versions.
_VERSION = 1

# these are the same as in individual_parsers.py. They might change from version to version so
# have a separate copy here.
_DATA_TYPE = "Data type:"
_VERSION_STR = "Version:"
_COLUMN_STR = "Columns:"
_HEADER_SEP = ";"

_IMPORT_SPEC_FILE_NAME = "import_specification"

_ORDER_AND_DISPLAY = "order_and_display"
_DATA = "data"
_EXT_CSV = "csv"
_EXT_TSV = "tsv"
_EXT_EXCEL = "xlsx"
_SEP_CSV = ","
_SEP_TSV = "\t"

def _check_import_specification(types: dict[str, dict[str, list[Any]]]):
f"""
Check the structure of an import specification data structure. If the input is empty the
result is a noop.

:param types: The import specifications to check. This is a dictionary of data types as strings
to the specifications for the data type. Each specification has two required keys:
* {_ORDER_AND_DISPLAY}: this is a list of lists. Each inner list has two elements:
* The parameter ID of a parameter. This is typically the `id` field from the
KBase app `spec.json` file.
* The display name of the parameter. This is typically the `ui-name` field from the
KBase app `display.yaml` file.
The order of the inner lists in the outer list defines the order of the columns
in the resulting import specification files.
* {_DATA}: this is a list of str->str or number dicts. The keys of the dicts are the
parameter IDs as described above, while the values are the values of the parameters.
Each dict must have exactly the same keys as the {_ORDER_AND_DISPLAY} structure.
Each entry in the list corresponds to a row in the resulting import specification,
and the order of the list defines the order of the rows.
Leave the {_DATA} list empty to write an empty template.
"""
if not types:
return
for datatype in types:
# replace this with jsonschema? don't worry about it for now
_check_string(datatype, "A data type")
spec = types[datatype]
if type(spec) != dict:
raise ImportSpecWriteException(f"The value for data type {datatype} must be a mapping")
if _ORDER_AND_DISPLAY not in spec:
raise ImportSpecWriteException(
f"Data type {datatype} missing {_ORDER_AND_DISPLAY} key")
_check_is_sequence(
spec[_ORDER_AND_DISPLAY], f"Data type {datatype} {_ORDER_AND_DISPLAY} value")
if not len(spec[_ORDER_AND_DISPLAY]):
raise ImportSpecWriteException(
f"At least one entry is required for {_ORDER_AND_DISPLAY} for type {datatype}")
if _DATA not in spec:
raise ImportSpecWriteException(f"Data type {datatype} missing {_DATA} key")
_check_is_sequence(spec[_DATA], f"Data type {datatype} {_DATA} value")

param_ids = set()
for i, id_display in enumerate(spec[_ORDER_AND_DISPLAY]):
err = (f"Invalid {_ORDER_AND_DISPLAY} entry for datatype {datatype} "
+ f"at index {i} ")
_check_is_sequence(id_display, err + "- the entry")
if len(id_display) != 2:
raise ImportSpecWriteException(err + "- expected 2 item list")
pid = id_display[0]
_check_string(pid, err + "- parameter ID")
_check_string(id_display[1], err + "- parameter display name")
param_ids.add(pid)
for i, datarow in enumerate(spec[_DATA]):
err = f"Data type {datatype} {_DATA} row {i}"
if type(datarow) != dict:
raise ImportSpecWriteException(err + " is not a mapping")
if datarow.keys() != param_ids:
raise ImportSpecWriteException(
err + f" does not have the same keys as {_ORDER_AND_DISPLAY}")
for pid, v in datarow.items():
if v is not None and not isinstance(v, numbers.Number) and not isinstance(v, str):
raise ImportSpecWriteException(
err + f"'s value for parameter {pid} is not a number or a string")


def _check_string(tocheck: Any, errprefix: str):
if not isinstance(tocheck, str) or not tocheck.strip():
raise ImportSpecWriteException(
errprefix + " cannot be a non-string or a whitespace only string")


def _check_is_sequence(tocheck: Any, errprefix: str):
if not (isinstance(tocheck, collections.abc.Sequence) and not isinstance(tocheck, str)):
raise ImportSpecWriteException(errprefix + " is not a list")
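The explicit `str` exclusion in `_check_is_sequence` matters because Python strings are themselves `Sequence`s, so a naive `isinstance` check would wrongly accept a bare string where a list is required. A quick standalone illustration (the helper name here is invented, mirroring the check above):

```python
import collections.abc

# str is a Sequence, so it must be excluded explicitly.
assert isinstance("abc", collections.abc.Sequence)
assert isinstance([1, 2], collections.abc.Sequence)

def is_non_string_sequence(x):
    # mirrors the _check_is_sequence logic above
    return isinstance(x, collections.abc.Sequence) and not isinstance(x, str)

print(is_non_string_sequence("abc"))    # False: a bare string is rejected
print(is_non_string_sequence(["abc"]))  # True: a list is accepted
```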


# TODO WRITE_XSV look into server OOM protection if the user sends a huge JSON packet

def write_csv(folder: Path, types: dict[str, dict[str, list[Any]]]) -> dict[str, Path]:
"""
Writes import specifications to 1 or more csv files. All the writers in this module
have the same function signatures; see the module level documentation.
"""
return _write_xsv(folder, types, _EXT_CSV, _SEP_CSV)


def write_tsv(folder: Path, types: dict[str, dict[str, list[Any]]]) -> dict[str, Path]:
"""
Writes import specifications to 1 or more tsv files. All the writers in this module
have the same function signatures; see the module level documentation.
"""
return _write_xsv(folder, types, _EXT_TSV, _SEP_TSV)


def _write_xsv(folder: Path, types: dict[str, dict[str, list[Any]]], ext: str, sep: str):
_check_write_args(folder, types)
res = {}
for datatype in types:
out = folder / (datatype + "." + ext)
dt = types[datatype]
cols = len(dt[_ORDER_AND_DISPLAY])
with open(out, "w", newline='') as f:
csvw = csv.writer(f, delimiter=sep) # handle sep escaping
csvw.writerow([f"{_DATA_TYPE} {datatype}{_HEADER_SEP} "
+ f"{_COLUMN_STR} {cols}{_HEADER_SEP} {_VERSION_STR} {_VERSION}"])
pids = [i[0] for i in dt[_ORDER_AND_DISPLAY]]
Comment on lines +164 to +165

Collaborator: What does pid stand for?

Member Author: parameter id
csvw.writerow(pids)
csvw.writerow([i[1] for i in dt[_ORDER_AND_DISPLAY]])
for row in dt[_DATA]:
csvw.writerow([row[pid] for pid in pids])
res[datatype] = out
return res
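Regarding the "handle sep escaping" comment in `_write_xsv` above: `csv.writer` handles this automatically with its default minimal quoting, wrapping any field that contains the delimiter in quotes. A quick standalone demonstration with a comma-separated writer:

```python
import csv
import io

buf = io.StringIO()
# QUOTE_MINIMAL is the default: only fields containing the delimiter
# (or quote/newline characters) get quoted.
csv.writer(buf, delimiter=",").writerow(["plain", "has, comma"])
print(buf.getvalue())  # plain,"has, comma"
```

The same applies to the TSV writer with `delimiter="\t"`: a value containing a tab is quoted rather than corrupting the column layout.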


def _check_write_args(folder: Path, types: dict[str, dict[str, list[Any]]]):
if not folder:
# this is a programming error, not a user input error, so not using the custom
# exception here
raise ValueError("The folder cannot be null")
if type(types) != dict:
raise ImportSpecWriteException("The types value must be a mapping")
_check_import_specification(types)


class ImportSpecWriteException(Exception):
    """
    An exception thrown when writing an import specification fails.
    """
    pass

Collaborator:

Probably YAGNI, but we may want to consider creating a StagingServiceException(Exception) and then deriving new exceptions from StagingServiceException instead of Exception, to allow for catching staging service exceptions separately from other exceptions:

try:
    ...
except StagingServiceException:
    ...
except Exception:
    ...

Member Author:

Sounds good, made an issue. Next time I touch the code base I'll do it.