Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor PUT vs. PATCH in the Ingestion Framework #14040

Closed
5 of 6 tasks
pmbrull opened this issue Nov 20, 2023 · 6 comments
Closed
5 of 6 tasks

Refactor PUT vs. PATCH in the Ingestion Framework #14040

pmbrull opened this issue Nov 20, 2023 · 6 comments
Assignees

Comments

@pmbrull
Copy link
Collaborator

pmbrull commented Nov 20, 2023

We have 2 issues with the ingestion

  • unify patch calls: we don't want multiple backend calls to PATCH different elements. A single trip should be enough.
  • when we do a PUT we can override existing values
    • we have hacks for some of the properties to avoid overwriting them from the ingestion: ownership, description, domain,...
    • every new field becomes a new workaround we need to add on the backend: protecting fields and checking who the caller is (ingestion-bot).

Try:

  • Check how many backend calls we are doing
  • We can remove this call https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/ingestion/api/topology_runner.py#L216 -> entityRef now works with FQN not the ID -> we can update how we keep the context information. We should double-check how we are using the context data to ensure we'll have enough ingredients just keeping processed FQNs.
    • Since we don't need to wait for a response, check if we can use async requests to the server to not wait for certain results and move on. Since we'd still have some dependencies between nodes / stages, we could have a flag in each of them to specify if it can be treated async
  • Move the list_all_entities to the beginning.
    entity_state = metadata.list_all_entities(entity=entity_type, params=params)
    -> we are already doing a "list all" to mark entities as deleted. Move it at the beginning and keep that as a cache/state in the workflow.
    • As a followup task, add a generated column in each *_entity table to have a serviceName , so that we can index it and have better performance when doing the list_all_entities
  • see if we can replace some ES calls with the cache/state
@pmbrull pmbrull moved this to Ingestion - Bugs & Minor Features in Release 1.3.0 Nov 20, 2023
@harshach
Copy link
Collaborator

harshach commented Nov 21, 2023

  • Add times to fetch source data vs time get/write to OM apis
  • Add Updated as another record in the summary, to showcase how many entities are getting updated during an ingestion vs created
  • We should only fetch objects from lowest container level, for example in databsaes, we should fetch per schema level to avoid fetching all of them at once at the beginning of the ingestion process
  • Introduce a pluggable cache layer
    • If there is a persistent local disk avaialble we can use file system to cache the objects that are ingested. We can take a hash of the object and store it
    • Make an option for S3 bucket as storage based cache layer
    • If none of them are available we will use in-memory cache which essentially makes a call to OpenMetadata server
  • Instead of caching what data we have in OpenMetadata, we should cache what we are fetching from the source . That way we can only process this object if it changed from the source

@pmbrull
Copy link
Collaborator Author

pmbrull commented Nov 24, 2023

  • Based on this thread, the PUT vs. PATCH also impacts lineage. The example shared is for View Lineage, where we are rerunning the lineage PUT overwriting any lineage added by hand. Here we could see if PATCHing lineage is an option, or if we want to flag the query as processed and avoid rerunning this

  • Refactor PUT vs. PATCH in the Ingestion Framework #14040

@pmbrull
Copy link
Collaborator Author

pmbrull commented Nov 24, 2023

@OnkarVO7 I believe these changes will automatically fix:

Let's double-check as part of acceptance criteria here

@OnkarVO7
Copy link
Contributor

Part 1 of the fixes: Removing the get_by_name calls from the topology
#14098

@OnkarVO7
Copy link
Contributor

OnkarVO7 commented Nov 29, 2023

  • Feature Task

Ingestion Update for Patching Entities

Schema Changes:

  • Add a new field named sourceHash to all entities.

Ingestion Workflow Changes:

  1. Initial Entity Creation:

    • During the first run using create_entity_request:
      • Convert the create_entity_request model to JSON and store the hash of the create_entity_request json in the sourceHash field of the entity.
  2. Entity Retrieval and Caching:

    • On subsequent runs:
      • Retrieve a list of all entities from the server with the sourceHash field.
      • Store only the Fully Qualified Name (fqn) and the sourceHash field in the cache/memory.
  3. Request Comparison:

    • When building the create_entity_request again, compare the current JSON hash of the create_entity_request with the stored hash in the cache/memory.
  4. Update Determination and Action:

    • If the hashes match:
      • No changes have occurred on the source; no server request needed.
    • If the hashes don't match:
      • Call the get_by_name API to retrieve the entity.
      • Make a patch request with the updated field(s).

@harshach harshach moved this to Ingestion - Bugs & Minor Features in Release 1.2.3 Nov 30, 2023
@harshach harshach removed this from Release 1.2.3 Dec 6, 2023
@harshach harshach closed this as completed Dec 6, 2023
@harshach harshach moved this to Done in Release 1.2.3 Dec 6, 2023
@pmbrull pmbrull reopened this Dec 11, 2023
@OnkarVO7 OnkarVO7 removed this from Release 1.2.3 Jan 17, 2024
@OnkarVO7 OnkarVO7 removed this from Release 1.3.0 Jan 17, 2024
@harshach harshach moved this to Ingestion - Bugs & Minor Features in Release 1.4.0 Jan 26, 2024
@OnkarVO7
Copy link
Contributor

  • Feature Task

Patch Only Changed/Updated fields from ingestion topology

Our current patch logic updates the entire entity from server with the fields of the create request. Then makes a jsonpatch request by comparing original and new entity.

Instead Find a way to implement below logic:

  1. Find out any existing specific fields that are changed.
  2. Find any new updates to specific fields.
  3. Walk through original server payload and update in-place where only the specific fields are changed.

@OnkarVO7 OnkarVO7 removed this from Release 1.4.0 Apr 4, 2024
@pmbrull pmbrull closed this as completed Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants