Implement serialization and deserialization for file dict #2

asikowitz · 2023-03-01T17:52:35Z

Doing the large-pr in acryldata flow, but in your personal branch.

I opted to make serializer and deserializer required arguments. Defaulting them to json.dumps and json.loads respectively would add extra overhead for basic FileBackedDict[str] or FileBackedDict[int] objects, where the identity function would be sufficient. I don't think it's too onerous to have to specify your serializer and deserializer, even if it's just the identity function or the json functions, and in general it's something the user should be thinking about when they're using FileBackedDict.
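To illustrate the trade-off, here is a minimal sketch of a dict-like store that takes its serializer and deserializer as required arguments. `TinyFileDict` and `identity` are hypothetical names made up for this example, backed by an in-memory SQLite table; this is not the actual FileBackedDict implementation or its signature.

```python
import json
import sqlite3
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")
SqliteValue = Union[int, float, str, bytes, None]


class TinyFileDict(Generic[T]):
    """Hypothetical stand-in for a file-backed dict; not DataHub's class."""

    def __init__(
        self,
        serializer: Callable[[T], SqliteValue],
        deserializer: Callable[[SqliteValue], T],
        path: str = ":memory:",
    ) -> None:
        # Both callables are required: the caller decides how values are stored.
        self._serializer = serializer
        self._deserializer = deserializer
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS data (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __setitem__(self, key: str, value: T) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO data (key, value) VALUES (?, ?)",
            (key, self._serializer(value)),
        )

    def __getitem__(self, key: str) -> T:
        row = self._conn.execute(
            "SELECT value FROM data WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return self._deserializer(row[0])


def identity(x):
    """No-op (de)serializer for values SQLite can store natively."""
    return x


# Plain strings need no encoding, so the identity function is enough.
names = TinyFileDict[str](serializer=identity, deserializer=identity)
names["a"] = "alice"

# Structured values go through json explicitly.
configs = TinyFileDict[dict](serializer=json.dumps, deserializer=json.loads)
configs["db"] = {"port": 5432}
```

With this shape, the str case skips JSON encoding entirely, while the dict case opts into it at construction time.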

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

asikowitz · 2023-03-01T17:54:04Z

metadata-ingestion/src/datahub/utilities/file_backed_collections.py

+        n_deleted = self._conn.execute(
+            "DELETE FROM data WHERE key = ?", (key,)
+        ).rowcount
+        if not in_cache and not n_deleted:
+            raise KeyError(key)


Doing self[key] would trigger a deserialization call, so I reimplemented this using the rowcount returned by the delete call.
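A standalone sketch of the same idea, assuming a `data` table keyed by `key` and a plain dict standing in for the in-memory cache. `delete_key` and `cache` are hypothetical names for this example only. The existence check relies on the cursor's `rowcount` after the DELETE, so no stored value is read or deserialized.

```python
import sqlite3
from typing import Any, Dict


def delete_key(conn: sqlite3.Connection, cache: Dict[str, Any], key: str) -> None:
    # Remember whether the key had a pending in-memory value, then drop it.
    in_cache = key in cache
    cache.pop(key, None)
    # rowcount reports how many rows the DELETE removed (0 or 1 here).
    n_deleted = conn.execute(
        "DELETE FROM data WHERE key = ?", (key,)
    ).rowcount
    # The key is missing only if it was in neither the cache nor the table.
    if not in_cache and not n_deleted:
        raise KeyError(key)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (key TEXT PRIMARY KEY, value BLOB)")
conn.execute("INSERT INTO data VALUES (?, ?)", ("a", "1"))
cache: Dict[str, Any] = {"b": 2}

delete_key(conn, cache, "a")  # present only on disk
delete_key(conn, cache, "b")  # present only in the cache

try:
    delete_key(conn, cache, "missing")
    missing_raised = False
except KeyError:
    missing_raised = True
```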

asikowitz · 2023-03-01T17:55:36Z

metadata-ingestion/src/datahub/utilities/file_backed_collections.py

+    def __del__(self) -> None:
+        self.close()


Just in case we forget to close, do it on garbage collection. Have confirmed close() can be called multiple times
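For reference, a minimal sketch of the pattern, assuming close() is written to be idempotent. `ClosableStore` is a hypothetical class for illustration, not the code under review. Note that Python does not guarantee `__del__` runs for objects still alive at interpreter exit, so this is a safety net rather than a replacement for calling close() explicitly.

```python
import sqlite3
from typing import Optional


class ClosableStore:
    """Hypothetical resource holder; not DataHub's class."""

    def __init__(self) -> None:
        self._conn: Optional[sqlite3.Connection] = sqlite3.connect(":memory:")

    def close(self) -> None:
        # Safe to call repeatedly: later calls find no open connection.
        conn = getattr(self, "_conn", None)
        if conn is not None:
            conn.close()
            self._conn = None

    def __del__(self) -> None:
        # Fallback cleanup if the caller never closed explicitly.
        self.close()


store = ClosableStore()
store.close()
store.close()  # second call is a no-op
```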

cool thanks

asikowitz added 2 commits March 1, 2023 12:44

Implement serialization and deserialization for file dict

ec54325

ci

a30ff4e

asikowitz requested a review from hsheth2 March 1, 2023 17:52

github-actions bot added the ingestion label Mar 1, 2023

asikowitz merged commit a0c6d85 into hsheth2:file-dict Mar 1, 2023

asikowitz deleted the file-dict-serialization branch March 1, 2023 19:54
