fix JSON bug with data reading #691

Merged

taylorfturner merged 5 commits into capitalone:main from JGSweets:fix-bug-json

Oct 17, 2022

Contributor

JGSweets commented Oct 14, 2022 (edited)

Previously, we could not read line-separated JSON arrays.
Now we can read:

[1, 2]
[2, 3]
[3, 3]

This becomes a JSONData object, which wraps pandas, as:

JGSweets added 2 commits

October 14, 2022 17:37


          fix: mypy issue

2a14fe8


          fix: json bug w/ test

0bb6cda

JGSweets added Bug High Priority labels

JGSweets requested review from ksneab7, taylorfturner, micdavis and tyfarnan as code owners

October 14, 2022 23:17

JGSweets assigned taylorfturner

JGSweets commented


dataprofiler/data_readers/avro_data.py

@@ -1,6 +1,6 @@
               """Contains class for saving and loading spreadsheet data."""
               from io import BytesIO, StringIO
-              from typing import Any, Dict, List, Optional, Union, cast
+              from typing import Any, Dict, List, Optional, Union

Contributor Author

JGSweets Oct 14, 2022

The proper way to annotate the other function, a boolean check on type, is to have it return a TypeGuard.

Contributor Author

JGSweets Oct 14, 2022

no need to cast now

JGSweets commented


dataprofiler/data_readers/avro_data.py

@@ -92,18 +92,12 @@ def is_match(
                       # get current position of stream
                       if data_utils.is_stream_buffer(file_path):
-                          file_path = cast(

Contributor Author

JGSweets Oct 14, 2022

Static typing only, not a functional change; the cast was removed because of the TypeGuard.

JGSweets commented


dataprofiler/data_readers/avro_data.py

                           starting_location = file_path.tell()
                       is_valid_avro = fastavro.is_avro(file_path)
                       # return to original position in stream
                       if data_utils.is_stream_buffer(file_path):
-                          file_path = cast(

Contributor Author

JGSweets Oct 14, 2022

Static typing only, not a functional change; the cast was removed because of the TypeGuard.

JGSweets commented


dataprofiler/data_readers/filepath_or_buffer.py


-              def is_stream_buffer(filepath_or_buffer: Any) -> bool:
+              def is_stream_buffer(filepath_or_buffer: Any) -> TypeGuard[Union[StringIO, BytesIO]]:

Contributor Author

JGSweets Oct 14, 2022

fix to use TypeGuard
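A minimal sketch of why the `TypeGuard` return type makes the `cast` calls unnecessary. The signature is taken from the diff above; the function body and the `current_position` helper are assumptions made for illustration and are not the project's code.

```python
from io import BytesIO, StringIO
from typing import Any, Union

try:
    from typing import TypeGuard  # available in Python 3.10+
except ImportError:  # older interpreters need the typing_extensions backport
    from typing_extensions import TypeGuard


def is_stream_buffer(filepath_or_buffer: Any) -> TypeGuard[Union[StringIO, BytesIO]]:
    """Return True if the argument is an in-memory stream (body is assumed)."""
    return isinstance(filepath_or_buffer, (StringIO, BytesIO))


def current_position(file_path: Union[str, StringIO, BytesIO]) -> int:
    """Hypothetical helper: report the stream position, or 0 for a path."""
    if is_stream_buffer(file_path):
        # Inside this branch a type checker narrows file_path to
        # Union[StringIO, BytesIO], so .tell() needs no cast().
        return file_path.tell()
    return 0
```

With a plain `-> bool` return type, the type checker would still see `file_path` as possibly a `str` inside the branch, which is what the removed `cast` calls were working around.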

JGSweets commented


dataprofiler/data_readers/filepath_or_buffer.py

Comment on lines -30 to +35

    
-                      open_method: str ="r",
-                      encoding: Optional[str]=None,
-                      seek_offset: Optional[int]=None,
-                      seek_whence: int=0,
+                      open_method: str = "r",
+                      encoding: Optional[str] = None,
+                      seek_offset: Optional[int] = None,
+                      seek_whence: int = 0,

Contributor Author

JGSweets Oct 14, 2022

formatting

JGSweets commented


dataprofiler/data_readers/filepath_or_buffer.py

Comment on lines -54 to +58

-                      self.original_type: Union[Type[str], Type[StringIO], Type[BytesIO], Type[IO]] = type(filepath_or_buffer)
+                      self.original_type: Union[
+                          Type[str], Type[StringIO], Type[BytesIO], Type[IO]
+                      ] = type(filepath_or_buffer)

Contributor Author

JGSweets Oct 14, 2022

format

JGSweets commented


dataprofiler/data_readers/filepath_or_buffer.py

Comment on lines -87 to +93

-                          self._filepath_or_buffer = cast(TextIOWrapper, self._filepath_or_buffer) # guaranteed by self._is_wrapped
+                          self._filepath_or_buffer = cast(
+                              TextIOWrapper, self._filepath_or_buffer
+                          )  # guaranteed by self._is_wrapped

Contributor Author

JGSweets Oct 14, 2022

format

JGSweets commented


dataprofiler/data_readers/filepath_or_buffer.py

Comment on lines -95 to +103

-                          self._filepath_or_buffer = cast(IO, self._filepath_or_buffer) # can't be str due to conversion in __enter__
+                          self._filepath_or_buffer = cast(
+                              IO, self._filepath_or_buffer
+                          )  # can't be str due to conversion in __enter__

Contributor Author

JGSweets Oct 14, 2022

format

JGSweets commented


dataprofiler/data_readers/json_data.py

-                      _data = _data.to_dict(orient="records", into=OrderedDict)
-                      for i, sample in enumerate(_data):
-                          _data[i] = json.dumps(
+                      data = self._get_data_as_df(data)

Contributor Author

JGSweets Oct 14, 2022

No longer keeps a second variable, `_data`, for the data.
This fixes type acceptance at the top.

JGSweets commented


dataprofiler/data_readers/json_data.py

                       """
                       Extract the data as a json format.
                       :param data: raw data
                       :type data: list
                       :return: dataframe in json format
                       """
-                      _data: Union[pd.DataFrame, List]

Contributor Author

JGSweets Oct 14, 2022

No longer keeps a second variable, `_data`, for the data.
This fixes type acceptance at the top.


          fix: docstring

JGSweets commented


dataprofiler/data_readers/json_data.py

                       :return:
                       """
                       for key in list(dic.keys()):
+                          if not isinstance(key, str):

Contributor Author

JGSweets Oct 14, 2022

Fix to allow the `[1]` format in JSON reading.
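A rough sketch of the idea behind the `isinstance(key, str)` guard: non-string keys (such as the integer column labels produced by list rows) cannot be split on a separator, so they have to be handled before any string operation. The diff does not show what follows the check; passing such keys through unchanged is an assumption here, and `flat_to_nested` below is an illustrative stand-in, not the project's function.

```python
from typing import Any, Dict


def flat_to_nested(dic: Dict[Any, Any], separator: str = ".") -> Dict[Any, Any]:
    """Expand keys like "a.d.f" into nested dicts; leave non-string keys alone."""
    nested: Dict[Any, Any] = {}
    for key, value in dic.items():
        if not isinstance(key, str):
            # Assumed behavior: a key such as 1 has no separator to split on,
            # so it is copied through as-is.
            nested[key] = value
            continue
        parts = key.split(separator)
        node = nested
        for part in parts[:-1]:
            # Walk (or create) the intermediate dictionaries.
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


result = flat_to_nested({"a.b": "ab", "a.c": "ac", "a.d.f": "adf", "b": "b", 1: 3})
```

Without the guard, calling `key.split(...)` on the integer key would raise an `AttributeError`.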

JGSweets commented


dataprofiler/data_readers/json_data.py

@@ -392,14 +394,16 @@ def is_match(
                               return True
                           except (json.JSONDecodeError, UnicodeDecodeError):
                               data_file.seek(0)
+                          json_identifier_re = re.compile(r"(:|\[)")

Contributor Author

JGSweets Oct 14, 2022

Allows both `:` and `[` as JSON identifiers, differentiating JSON from a plain string.
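A small sketch of what the pattern matches. The regular expression is copied from the diff; how `is_match` applies it is not shown in this review, so the `has_json_identifier` helper and its use of `search` are assumptions for illustration.

```python
import re

# Pattern from the diff: a colon (object key/value) or an opening bracket (array).
json_identifier_re = re.compile(r"(:|\[)")


def has_json_identifier(sample: str) -> bool:
    """Hypothetical helper: True if the text contains ':' or '['."""
    return json_identifier_re.search(sample) is not None


checks = {
    '{"a": 1}': has_json_identifier('{"a": 1}'),   # colon present
    "[1, 2]": has_json_identifier("[1, 2]"),       # bracket present
    "plain text": has_json_identifier("plain text"),
}
```

A check on `:` alone would miss a line such as `[1, 2]`, which is the case this PR adds support for.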

JGSweets commented


dataprofiler/tests/data_readers/test_json_data.py

@@ -12,11 +12,12 @@
               class TestNestedJSON(unittest.TestCase):
                   def test_flat_to_nested_json(self):
-                      dic = {"a.b": "ab", "a.c": "ac", "a.d.f": "adf", "b": "b"}
+                      dic = {"a.b": "ab", "a.c": "ac", "a.d.f": "adf", "b": "b", 1: 3}

Contributor Author

JGSweets Oct 14, 2022

updates test to check for keys that aren't strings


          fix: eof line

458ef6f

JGSweets commented


dataprofiler/tests/data_readers/test_json_data.py

@@ -56,6 +57,11 @@ def setUpClass(cls):
                               encoding="utf-8",
                               count=14,
                           ),
+                          dict(

Contributor Author

JGSweets Oct 14, 2022

adds test which includes the new json format

JGSweets commented


dataprofiler/tests/data/json/simple-list.json

		@@ -0,0 +1,3 @@
		[1]

Contributor Author

JGSweets Oct 14, 2022

new data for testing json reading
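For readers unfamiliar with the format, here is a stdlib-only sketch of parsing line-separated JSON arrays like the ones in the PR description. It is not how `JSONData` reads files; `read_json_lines` is a hypothetical helper and the sample text reuses the three rows from the description.

```python
import json
from io import StringIO
from typing import Any, List, TextIO


def read_json_lines(stream: TextIO) -> List[Any]:
    """Parse one JSON value per non-empty line of a text stream."""
    records: List[Any] = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        records.append(json.loads(line))
    return records


sample = StringIO("[1, 2]\n[2, 3]\n[3, 3]\n")
rows = read_json_lines(sample)
```

Each line is a complete JSON document on its own, which is why the whole file cannot be handed to a single `json.loads` call.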


          Merge branch 'main' into fix-bug-json

49c1324

ksneab7 approved these changes


taylorfturner enabled auto-merge (squash)

October 17, 2022 12:39

taylorfturner approved these changes


taylorfturner merged commit d16b5c8 into capitalone:main

taylorfturner mentioned this pull request

added static typing to data_utils.py #662

Merged

