DATAUP-682: Add import spec writers for xsv files #155
Changes from all commits
b22e5c3
f52b390
4d4f878
e5d619e
d3a52f2
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -103,4 +103,5 @@ ENV/ | |
data/ | ||
.DS_Store | ||
.virtualenvs/ | ||
test.env | ||
test.env | ||
run_tests_single.sh | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,188 @@ | ||
""" | ||
Write an import specification to one or more files. | ||
|
||
The names of the files will be the datatype suffixed by the file extension unless the writer | ||
handles Excel or similar files that can contain multiple datatypes, in which case the | ||
file name will be import_specification suffixed by the extension. | ||
|
||
All the write_* functions in this module have the same function signature: | ||
|
||
:param folder: Where the files should be written. The folder must exist. | ||
:param types: The import specifications to write. This is a dictionary of data types as strings | ||
to the specifications for the data type. Each specification has two required keys: | ||
* `order_and_display`: this is a list of lists. Each inner list has two elements: | ||
* The parameter ID of a parameter. This is typically the `id` field from the | ||
KBase app `spec.json` file. | ||
* The display name of the parameter. This is typically the `ui-name` field from the | ||
KBase app `display.yaml` file. | ||
The order of the inner lists in the outer list defines the order of the columns | ||
in the resulting import specification files. | ||
* `data`: this is a list of str->str or number dicts. The keys of the dicts are the | ||
parameter IDs as described above, while the values are the values of the parameters. | ||
Each dict must have exactly the same keys as the `order_and_display` structure. Each | ||
entry in the list corresponds to a row in the resulting import specification, | ||
and the order of the list defines the order of the rows. | ||
Leave the `data` list empty to write an empty template. | ||
:returns: A mapping of the data types to the files to which they were written. | ||
""" | ||
# note that we can't use an f string here to interpolate the variables below, e.g. | ||
# order_and_display, etc. | ||
|
||
import collections | ||
import csv | ||
import numbers | ||
|
||
from typing import Any | ||
from pathlib import Path | ||
|
||
# this version is synonymous to the versions in individual_parsers.py. However, this module | ||
# should only ever write the most recent format for import specifications, while the parsers | ||
# may need to also be able to parse earlier versions. | ||
_VERSION = 1 | ||
|
||
# these are the same as in individual_parsers.py. They might change from version to version so | ||
# have a separate copy here. | ||
_DATA_TYPE = "Data type:" | ||
_VERSION_STR = "Version:" | ||
_COLUMN_STR = "Columns:" | ||
_HEADER_SEP = ";" | ||
|
||
_IMPORT_SPEC_FILE_NAME = "import_specification" | ||
|
||
_ORDER_AND_DISPLAY = "order_and_display" | ||
_DATA = "data" | ||
_EXT_CSV = "csv" | ||
_EXT_TSV = "tsv" | ||
_EXT_EXCEL = "xlsx" | ||
_SEP_CSV = "," | ||
_SEP_TSV = "\t" | ||
|
||
def _check_import_specification(types: dict[str, dict[str, list[Any]]]): | ||
f""" | ||
Check the structure of an import specification data structure. If the input is empty the | ||
result is a noop. | ||
|
||
:param types: The import specifications to check. This is a dictionary of data types as strings | ||
to the specifications for the data type. Each specification has two required keys: | ||
* {_ORDER_AND_DISPLAY}: this is a list of lists. Each inner list has two elements: | ||
* The parameter ID of a parameter. This is typically the `id` field from the | ||
KBase app `spec.json` file. | ||
* The display name of the parameter. This is typically the `ui-name` field from the | ||
KBase app `display.yaml` file. | ||
The order of the inner lists in the outer list defines the order of the columns | ||
in the resulting import specification files. | ||
* {_DATA}: this is a list of str->str or number dicts. The keys of the dicts are the | ||
parameter IDs as described above, while the values are the values of the parameters. | ||
Each dict must have exactly the same keys as the {_ORDER_AND_DISPLAY} structure. | ||
Each entry in the list corresponds to a row in the resulting import specification, | ||
and the order of the list defines the order of the rows. | ||
Leave the {_DATA} list empty to write an empty template. | ||
""" | ||
if not types: | ||
return | ||
for datatype in types: | ||
# replace this with jsonschema? don't worry about it for now | ||
_check_string(datatype, "A data type") | ||
spec = types[datatype] | ||
if type(spec) != dict: | ||
raise ImportSpecWriteException(f"The value for data type {datatype} must be a mapping") | ||
if _ORDER_AND_DISPLAY not in spec: | ||
raise ImportSpecWriteException( | ||
f"Data type {datatype} missing {_ORDER_AND_DISPLAY} key") | ||
_check_is_sequence( | ||
spec[_ORDER_AND_DISPLAY], f"Data type {datatype} {_ORDER_AND_DISPLAY} value") | ||
if not len(spec[_ORDER_AND_DISPLAY]): | ||
raise ImportSpecWriteException( | ||
f"At least one entry is required for {_ORDER_AND_DISPLAY} for type {datatype}") | ||
if _DATA not in spec: | ||
raise ImportSpecWriteException(f"Data type {datatype} missing {_DATA} key") | ||
_check_is_sequence(spec[_DATA], f"Data type {datatype} {_DATA} value") | ||
|
||
param_ids = set() | ||
for i, id_display in enumerate(spec[_ORDER_AND_DISPLAY]): | ||
err = (f"Invalid {_ORDER_AND_DISPLAY} entry for datatype {datatype} " | ||
+ f"at index {i} ") | ||
_check_is_sequence(id_display, err + "- the entry") | ||
if len(id_display) != 2: | ||
raise ImportSpecWriteException(err + "- expected 2 item list") | ||
pid = id_display[0] | ||
_check_string(pid, err + "- parameter ID") | ||
_check_string(id_display[1], err + "- parameter display name") | ||
param_ids.add(pid) | ||
for i, datarow in enumerate(spec[_DATA]): | ||
err = f"Data type {datatype} {_DATA} row {i}" | ||
if type(datarow) != dict: | ||
raise ImportSpecWriteException(err + " is not a mapping") | ||
if datarow.keys() != param_ids: | ||
raise ImportSpecWriteException( | ||
err + f" does not have the same keys as {_ORDER_AND_DISPLAY}") | ||
for pid, v in datarow.items(): | ||
if v is not None and not isinstance(v, numbers.Number) and not isinstance(v, str): | ||
raise ImportSpecWriteException( | ||
err + f"'s value for parameter {pid} is not a number or a string") | ||
|
||
|
||
def _check_string(tocheck: Any, errprefix: str): | ||
if not isinstance(tocheck, str) or not tocheck.strip(): | ||
raise ImportSpecWriteException( | ||
errprefix + " cannot be a non-string or a whitespace only string") | ||
|
||
|
||
def _check_is_sequence(tocheck: Any, errprefix: str): | ||
if not (isinstance(tocheck, collections.abc.Sequence) and not isinstance(tocheck, str)): | ||
raise ImportSpecWriteException(errprefix + " is not a list") | ||
|
||
|
||
# TODO WRITE_XSV look into server OOM protection if the user sends a huge JSON packet | ||
|
||
def write_csv(folder: Path, types: dict[str, dict[str, list[Any]]]) -> dict[str, Path]: | ||
""" | ||
Writes import specifications to 1 or more csv files. All the writers in this module | ||
have the same function signatures; see the module level documentation. | ||
""" | ||
return _write_xsv(folder, types, _EXT_CSV, _SEP_CSV) | ||
|
||
|
||
def write_tsv(folder: Path, types: dict[str, dict[str, list[Any]]]) -> dict[str, Path]: | ||
""" | ||
Writes import specifications to 1 or more tsv files. All the writers in this module | ||
have the same function signatures; see the module level documentation. | ||
""" | ||
return _write_xsv(folder, types, _EXT_TSV, _SEP_TSV) | ||
|
||
|
||
def _write_xsv(folder: Path, types: dict[str, dict[str, list[Any]]], ext: str, sep: str): | ||
_check_write_args(folder, types) | ||
res = {} | ||
for datatype in types: | ||
out = folder / (datatype + "." + ext) | ||
dt = types[datatype] | ||
cols = len(dt[_ORDER_AND_DISPLAY]) | ||
with open(out, "w", newline='') as f: | ||
csvw = csv.writer(f, delimiter=sep) # handle sep escaping | ||
csvw.writerow([f"{_DATA_TYPE} {datatype}{_HEADER_SEP} " | ||
+ f"{_COLUMN_STR} {cols}{_HEADER_SEP} {_VERSION_STR} {_VERSION}"]) | ||
pids = [i[0] for i in dt[_ORDER_AND_DISPLAY]] | ||
Comment on lines +164 to +165
Comment: What does pid stand for?
Reply: parameter id |
||
csvw.writerow(pids) | ||
csvw.writerow([i[1] for i in dt[_ORDER_AND_DISPLAY]]) | ||
for row in dt[_DATA]: | ||
csvw.writerow([row[pid] for pid in pids]) | ||
res[datatype] = out | ||
return res | ||
|
||
|
||
def _check_write_args(folder: Path, types: dict[str, dict[str, list[Any]]]): | ||
if not folder: | ||
# this is a programming error, not a user input error, so not using the custom | ||
# exception here | ||
raise ValueError("The folder cannot be null") | ||
if type(types) != dict: | ||
raise ImportSpecWriteException("The types value must be a mapping") | ||
_check_import_specification(types) | ||
|
||
|
||
class ImportSpecWriteException(Exception): | ||
Comment: Probably YAGNI, but we may want to consider creating a StagingServiceException(Exception) and then deriving new exceptions from StagingServiceException instead of Exception, to allow for catching Staging Service exceptions separately from other Exceptions.
Reply: Sounds good, made an issue. Next time I touch the code base I'll do it. |
||
""" | ||
An exception thrown when writing an import specification fails. | ||
""" | ||
pass |
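The exception hierarchy suggested in the review comment on this class could look roughly like the sketch below. `StagingServiceException` is the name proposed in the comment and does not exist in this PR; the `describe` helper is hypothetical and is only there to show why a shared base class is useful.

```python
class StagingServiceException(Exception):
    """Hypothetical common base class for staging service errors (not part of this PR)."""


class ImportSpecWriteException(StagingServiceException):
    """An exception thrown when writing an import specification fails."""


def describe(exc: Exception) -> str:
    # Hypothetical helper: service errors are caught by the shared base class,
    # everything else falls through to the generic handler.
    try:
        raise exc
    except StagingServiceException as e:
        return f"service error: {e}"
    except Exception as e:
        return f"unexpected error: {e}"


service_msg = describe(ImportSpecWriteException("bad spec"))
other_msg = describe(ValueError("no folder"))
```

With this layout, callers that only care about staging service failures can catch the base class once, while programming errors such as the `ValueError` raised in `_check_write_args` still propagate as ordinary exceptions.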
Comment: I don't see this file.
Reply: Yeah, I put it in the gitignore so it didn't get checked in accidentally. It's just a hack to `run_tests.sh` to allow running a single file of tests.
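For orientation, the input structure described in the module docstring and the file layout that `_write_xsv` produces can be sketched as follows. This is a simplified stand-in, not the module itself: `write_csv_sketch` is a hypothetical name, it skips all of the validation done by `_check_import_specification`, and the data type, parameter IDs and display names are made up.

```python
import csv
import tempfile
from pathlib import Path

# Made-up example input: one data type, two parameters, two data rows.
types = {
    "genome": {
        "order_and_display": [["id1", "Display 1"], ["id2", "Display 2"]],
        "data": [
            {"id1": "foo", "id2": 1},
            {"id1": "a,b", "id2": None},
        ],
    }
}


def write_csv_sketch(folder: Path, types: dict) -> dict:
    """Simplified stand-in for write_csv: same file layout, no input validation."""
    res = {}
    for datatype, spec in types.items():
        out = folder / f"{datatype}.csv"
        pids = [pid for pid, _ in spec["order_and_display"]]
        with open(out, "w", newline="") as f:
            writer = csv.writer(f, delimiter=",")
            # Row 1: type, column count and format version in a single cell.
            writer.writerow([f"Data type: {datatype}; Columns: {len(pids)}; Version: 1"])
            # Row 2: parameter IDs. Row 3: display names.
            writer.writerow(pids)
            writer.writerow([display for _, display in spec["order_and_display"]])
            # Remaining rows: one per data entry, in parameter ID order.
            for row in spec["data"]:
                writer.writerow([row[pid] for pid in pids])
        res[datatype] = out
    return res


with tempfile.TemporaryDirectory() as tmp:
    written = write_csv_sketch(Path(tmp), types)
    lines = written["genome"].read_text().splitlines()
```

The `csv` module quotes the value containing the separator and writes `None` as an empty cell, which is why the writer in the PR can leave separator escaping to `csv.writer`.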