-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Neptune Data builder Integration #438
feat: Neptune Data builder Integration #438
Conversation
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
@@ -0,0 +1,141 @@ | |||
# Copyright Contributors to the Amundsen project. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you all want this to be placed in: https://github.com/amundsen-io/amundsengremlin
@@ -10,9 +10,9 @@ | |||
from databuilder.extractor.generic_extractor import GenericExtractor | |||
|
|||
|
|||
class Neo4jEsLastUpdatedExtractor(GenericExtractor): | |||
class EsLastUpdatedExtractor(GenericExtractor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please Note: I changed this to be more generic so all data stores can use this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, cc @allisonsuarez from Lyft side
@@ -37,7 +37,7 @@ def __init__(self, | |||
email: str, | |||
first_name: str = '', | |||
last_name: str = '', | |||
name: str = '', | |||
full_name: str = '', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please Note: One of the sample files refers this as full_name and I thought that name made more sense. Up to you all if it should be name.
Please note that this PR brings support for multiple models that the gremlin metadata proxy currently does not support. So some models will not show up on the frontend even though they are ingested. |
will take a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM. Left some code-level comments and questions. This is super neat!
I'm not familiar enough with the testing or and some other aspects to confidently approve, I will let @feng-tao give word on that.
neptune_uri = "wss://{host}/gremlin".format( | ||
host=self._neptune_host | ||
) | ||
self.source_factory = neptune_bulk_loader_api.get_neptune_graph_traversal_source_factory( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why save this factory?
super(NeptuneCSVPublisher, self).__init__() | ||
|
||
def init(self, conf: ConfigTree) -> None: | ||
self._boto_session = Session( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can/should this share code with NeptuneSessionClient
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think they should. The api used by the NeptuneCSVPublisher are used in a different context than the api's used by the NeptuneSessionClient. I feel like it could be combined if you feel strongly. But I feel like it might break the abstraction of the client if the two should be blended together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong feelings, totally fine to leave separate if we feel they may diverge more in the future -- just noticed that a lot of the code reading from config and forming connection string was similar and wondered.
errors=True | ||
) | ||
load_status_payload = load_status_response.get('payload', {}) | ||
if 'status' not in load_status_payload.get('overallStatus', {}): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, and ok to ignore, but I think more pythonic and DRY would be to just attempt the load_status_payload['overallStatus']['status']
and catch the KeyError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this applies in a few other places, but won't comment repeatedly)
file_paths = self._get_file_paths() | ||
for file_location in file_paths: | ||
with open(file_location, 'rb') as file_csv: | ||
file_csv_bytes = BytesIO(file_csv.read()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why wrap the file_csv
in a BytesIO
? can we not pass the file_csv
handle straight to neptune_api_client.upload
, allowing it to stream from disk rather than buffer? If there's a reason why that's not ok, a comment would be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that is not very efficient. Not sure what I was thinking when I did that. Good catch.
def validate(self) -> None: | ||
""" | ||
Validation method. Focused on limit the risk on deleting nodes and relations. | ||
- Check if deleted nodes will be within 10% of total nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is 5% by default?
self.target_nodes = set(conf.get_list(NeptuneStalenessRemovalTask.TARGET_NODES)) | ||
self.target_relations = set(conf.get_list(NeptuneStalenessRemovalTask.TARGET_RELATIONS)) | ||
self.batch_size = conf.get_int(NeptuneStalenessRemovalTask.BATCH_SIZE) | ||
self.dry_run = conf.get_bool(NeptuneStalenessRemovalTask.DRY_RUN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't see where this is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great call out I must had removed it at some point during debugging. Added back in.
.with_fallback(NeptuneStalenessRemovalTask.DEFAULT_CONFIG) | ||
self.target_nodes = set(conf.get_list(NeptuneStalenessRemovalTask.TARGET_NODES)) | ||
self.target_relations = set(conf.get_list(NeptuneStalenessRemovalTask.TARGET_RELATIONS)) | ||
self.batch_size = conf.get_int(NeptuneStalenessRemovalTask.BATCH_SIZE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are deletions being batched? i don't see this used atm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great callout I removed the flag. I could not find a great way to batch delete without it being ridiculously slow. So I found what I think is a okay way to delete the items from the db. It has not caused us any problems so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the batching stuff is unique to neo4j performance characteristics, so nice that we don't need to do batching here, much simpler
self, | ||
total_records: Iterable[Dict[str, Any]], | ||
stale_records: Iterable[Dict[str, Any]], | ||
types: Iterable[str] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be similar to existing code, but i believe this typing won't guarantee that the iterator is replayable -- so doing if type_str not in types
multiple times may fail?
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great, some minor naming nits , thanks for the great contribution!
edge_traversal.next() | ||
|
||
@staticmethod | ||
def _update_entity_properties_on_traversal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason some static method starts with underscore while some don't?
@@ -10,9 +10,9 @@ | |||
from databuilder.extractor.generic_extractor import GenericExtractor | |||
|
|||
|
|||
class Neo4jEsLastUpdatedExtractor(GenericExtractor): | |||
class EsLastUpdatedExtractor(GenericExtractor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, cc @allisonsuarez from Lyft side
yield result | ||
|
||
def get_scope(self) -> str: | ||
return 'extractor.search_data' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extractor.neptune_search_data ? the scope should be different between extractors
self._closer.close() | ||
|
||
def get_scope(self) -> str: | ||
return "loader.filesystem_csv_neptune" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loader.neptune_filesystem_csv ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to be consistent with https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/loader/file_system_neo4j_csv_loader.py#L187
but I think loader.neptune_filesystem_csv sounds better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add a short tutorial on how to use Neptune e2e in https://github.com/amundsen-io/amundsen/tree/master/docs/tutorials ? That would be super helpful for others who want to use neptune , thanks!
Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>
Summary of Changes
Implements the Neptune as a datastore in the Databuilder library.
RFC: https://github.com/amundsen-io/rfcs/blob/master/rfcs/013-neptune-support.md
Tests
I created tests for each model to test the serialization compatibility for Neptune. Also added a test the Neptune data loader.
Documentation
I created a sample script that shows how to use the
FSNeptuneCSVLoader
andNeptuneCSVPublisher
.CheckList
Make sure you have checked all steps below to ensure a timely review.
make test