Atum redesign #28

Open
6 tasks
yruslan opened this issue Apr 17, 2020 · 6 comments

@yruslan
Collaborator

yruslan commented Apr 17, 2020

Background

Currently, Atum relies on the global state of a Spark Application. This complicates the usage of Atum for jobs that are slightly more complicated than just a pipeline of a single dataframe. If there are several dataframes and several reads/writes and not every read and write is associated with control measurements, Atum will try to process all dataframes as if all require measurements.

The current workaround for such use cases is the disableControlMeasuresTracking() method, which is invoked before writing a dataframe that does not require control measurements.

Feature

  • Control measurements should be attached to a dataframe, not to the Spark session. E.g., to turn on control measurements users should call df.enableControlMeasuresTracking() instead of spark.enableControlMeasuresTracking(). The same applies to switching control measurements off.
  • Measurements should apply only to the dataframe on which tracking was enabled and to the dataframes derived from it. Other dataframes shouldn't be affected.
  • Checkpoints and other housekeeping information should not be kept in the global state.
  • Adding metadata should be done as dataframe implicits (e.g. df.setAdditionalInfo(...)).
  • Atum should keep checkpoints for each registered dataframe separately.
  • Atum plugins should have an event that is guaranteed to be sent last. Atum should guarantee that no more events are sent after that.
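
The first five points could be sketched roughly as follows. This is only an illustration of the idea, not Atum's actual API: `Tracked`, `Checkpoint`, `TrackingOps` and the `count` function are hypothetical names, and a plain `Seq` stands in for a Spark dataframe so that the snippet runs without Spark on the classpath.

```scala
object TrackingSketch {
  // One recorded measurement point; recordCount stands in for real control measures.
  final case class Checkpoint(name: String, recordCount: Long)

  // Immutable wrapper (hypothetical): tracking state travels with the data,
  // so nothing is stored in a global or session-wide registry.
  final case class Tracked[A](
      data: A,
      checkpoints: Vector[Checkpoint] = Vector.empty,
      additionalInfo: Map[String, String] = Map.empty
  ) {
    // A derived dataset inherits the history of the dataset it was derived from.
    def map[B](f: A => B): Tracked[B] = Tracked(f(data), checkpoints, additionalInfo)

    def setAdditionalInfo(key: String, value: String): Tracked[A] =
      copy(additionalInfo = additionalInfo + (key -> value))

    // `count` is an assumption of this sketch: the caller says how to measure the data.
    def setCheckpoint(name: String)(count: A => Long): Tracked[A] =
      copy(checkpoints = checkpoints :+ Checkpoint(name, count(data)))

    // Switching tracking off simply returns the untracked data.
    def disableControlMeasuresTracking(): A = data
  }

  // Adds enableControlMeasuresTracking() to the data itself rather than to a session.
  implicit class TrackingOps[A](data: A) {
    def enableControlMeasuresTracking(): Tracked[A] = Tracked(data)
  }
}

object TrackingSketchUsage {
  import TrackingSketch._

  val orders = Seq(1, 2, 3)
    .enableControlMeasuresTracking()
    .setCheckpoint("source")(_.size.toLong)

  // Derived from `orders`, so it carries the "source" checkpoint forward.
  val filtered = orders.map(_.filter(_ > 1)).setCheckpoint("filtered")(_.size.toLong)

  // Never registered, so no measurements are taken for it.
  val lookup = Seq("a", "b")
}
```

Because every operation returns a new `Tracked` value, `orders` still holds only its own checkpoint after `filtered` is created, and each registered dataset keeps its checkpoints separately.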

Additional context

After the new design is confirmed, this issue can be converted to an epic and all subitems to tasks.

@lokm01
Collaborator

lokm01 commented Apr 17, 2020

Makes sense.

@AdrianOlosutean
Contributor

I would also propose redesigning some parts so that they are immutable and written in a functional style. What do you think?

@lokm01
Collaborator

lokm01 commented Apr 18, 2020

Absolutely.

@benedeki
Collaborator

Not sure about the last one as it's described, particularly in regard to the changes above.
If Atum were "attached" to a dataset, it would make sense to send a "last message" for that dataframe. But I am not sure there would be anything to reliably hook such an event to. 🤔

@yruslan
Collaborator Author

yruslan commented Apr 20, 2020

Yeah, it would probably be hard to implement an event that is sent last per dataset. But an event that is sent last during the lifetime of the application could be useful.
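
A minimal sketch of such an application-level final event, assuming a hypothetical `EventBus` that sits between Atum and its plugins (none of these names come from Atum). In a real job, `close()` would have to be triggered from somewhere like a JVM shutdown hook or Spark's `SparkListener.onApplicationEnd`; that wiring is left out here.

```scala
object FinalEventSketch {
  sealed trait Event
  final case class CheckpointSaved(name: String) extends Event
  case object ApplicationFinished extends Event

  trait Plugin { def onEvent(event: Event): Unit }

  final class EventBus(plugins: Seq[Plugin]) {
    private var closed = false

    // Delivers a regular event; returns false if the bus is closed and the event was dropped.
    def publish(event: CheckpointSaved): Boolean = synchronized {
      if (closed) false
      else { plugins.foreach(_.onEvent(event)); true }
    }

    // Sends ApplicationFinished at most once. Both methods share one lock,
    // so no regular event can be delivered after the final one.
    def close(): Unit = synchronized {
      if (!closed) {
        closed = true
        plugins.foreach(_.onEvent(ApplicationFinished))
      }
    }
  }
}

object FinalEventUsage {
  import FinalEventSketch._

  val received = scala.collection.mutable.ArrayBuffer.empty[Event]
  val recorder: Plugin = new Plugin {
    def onEvent(event: Event): Unit = { received += event; () }
  }

  val bus = new EventBus(Seq(recorder))
  val first = bus.publish(CheckpointSaved("raw")) // delivered
  bus.close()                                     // final event
  val late = bus.publish(CheckpointSaved("late")) // dropped
}
```

The final event is only as reliable as whatever calls `close()`, which is the concern raised above about finding something to hook it to.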

@AdrianOlosutean
Contributor

Fields such as Country should be made optional; only the functional ones should be mandatory.
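
One way this could look, as a sketch only: the case class below is hypothetical, and which fields count as "functional" (here `sourceApplication` and `informationDate`) is purely an assumption for illustration, not something decided in this thread.

```scala
object MetadataSketch {
  // Hypothetical metadata model: mandatory fields are plain values,
  // descriptive ones are Option with a None default.
  final case class Metadata(
      sourceApplication: String,          // assumed mandatory
      informationDate: String,            // assumed mandatory
      country: Option[String] = None,     // optional, as suggested above
      historyType: Option[String] = None  // optional (assumption)
  )

  // Callers only supply what they actually have.
  val minimal = Metadata("MyApp", "2020-04-17")
  val withCountry = minimal.copy(country = Some("ZA"))
}
```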

4 participants