
Publishing Non spark Automic/Airflow Job information in spline #1150

Open
thambi1981 opened this issue Jan 11, 2023 · 2 comments

Comments


thambi1981 commented Jan 11, 2023

Background [Optional]

We have a requirement to capture the name of the scheduler job (Airflow, Automic, or any other scheduler) that submitted a Spark job. Right now we capture only the Spark application name, which appears in the progress collection under the applicationName key.
Every Spark job is submitted by a scheduler (Airflow or Automic). We would like to capture this scheduler job name as the top-level source and then show the Spark applicationName beneath it. I do see the extra key in the progress document, but I don't see any reserved field there that is displayed in the UI. It's not only the Automic/Airflow job name; as we progress further, we would like to add more information as well.
Please give some insight into the right way to display this non-Spark information. We would also like to search by this job name in the UI.
The reason for this requirement is to identify which scheduler job submitted a given Spark job. Adding this feature would help more users adopt Spline.

Question

Please give some insight into the right way to display non-Spark, job-level information, and how to make that job searchable.

Contributor

wajda commented Jan 23, 2023

> I do see extra key in the progress document. but I don't see any reserved field to be displayed in the UI. It's not only Automic/airflow job name

Not sure I understand your question correctly, but the execution plan name (aka application name, or job name) is provided by an agent in the optional string property name, and then stored under the same property in the executionPlan collection in ArangoDB.


On the UI, when talking about the execution event list which is formed from the items stored in the progress collection, the application name is displayed in the leftmost column, and is taken from the respective JSON property returned by the Consumer REST API.
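Since the execution-event list is served by the Consumer REST API, a name that encodes the scheduler job can also be searched programmatically. Below is a minimal sketch of building such a query; the base URL, the execution-events endpoint path, and the searchTerm parameter name are assumptions here and should be verified against the Consumer REST API documentation for your Spline version:

```python
from urllib.parse import urlencode

# Hypothetical base URL of the Spline Consumer REST API.
SPLINE_CONSUMER_URL = "http://localhost:8080/consumer"

def execution_events_query(search_term: str, base_url: str = SPLINE_CONSUMER_URL) -> str:
    """Build a Consumer API URL that searches execution events by name.

    The endpoint path and the "searchTerm" parameter are assumptions,
    not confirmed API details.
    """
    return f"{base_url}/execution-events?" + urlencode({"searchTerm": search_term})

url = execution_events_query("my-airflow-task")
```

The resulting URL could then be fetched with any HTTP client to filter the event list by the scheduler-derived name.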


In the database, the application name is additionally stored under the execPlanDetails property of the progress items. That is purely an optimization to avoid extra traversals over the progressOf edge.


Everything under the extra property is a black box for Spline. The extra property exists to store any additional metadata that might be used in custom user queries. Spline itself doesn't know the meaning of what is stored under extra and doesn't touch it at all. The most we could do is display it somewhere on the UI, and possibly search in it. But it certainly isn't needed for displaying the application name, since there is a reserved property for that, as explained above.
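As a concrete illustration of putting scheduler metadata under extra: the Spline Spark agent supports post-processing filters, one of which can inject user-defined extra metadata into the captured lineage. The sketch below assumes the userExtraMeta filter; the exact configuration key names and the rule JSON format should be verified against the Spline Spark agent documentation for your version:

```python
import json

# Hypothetical scheduler metadata to be attached under the "extra" property
# of the execution plan. All names and values here are illustrative.
scheduler_meta = {
    "executionPlan": {
        "extra": {
            "scheduler": "airflow",              # or "automic"
            "schedulerJobName": "daily_ingest",  # hypothetical job name
        }
    }
}

# Assumed Spline agent config keys (verify against the agent docs):
conf_pairs = {
    "spark.spline.postProcessingFilter": "userExtraMeta",
    "spark.spline.postProcessingFilter.userExtraMeta.rules": json.dumps(scheduler_meta),
}

# These pairs would typically be passed to spark-submit, e.g.:
#   spark-submit --conf spark.spline.postProcessingFilter=userExtraMeta ...
```

The metadata would then land under extra in the stored execution plan, where custom queries against ArangoDB can pick it up.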

Not sure if I answered your question.


erdembanak commented Jun 30, 2023

I'm not sure about the question either, but we are planning to use the Airflow task name as the Spark app name to achieve this. Basically we will be calling Spark like this (where airflow_task_name is a string holding the Airflow task name):

spark = SparkSession.builder.appName(airflow_task_name).getOrCreate()

You should be able to pass the name of the Airflow task to your module, though.
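The suggestion above can be sketched as a small helper that composes a searchable app name from the scheduler and job identifiers. The "scheduler:job:task" convention and the example names below are hypothetical; in Airflow, the job and task parts could come from templated fields such as {{ dag.dag_id }} and {{ task.task_id }}:

```python
def spark_app_name(scheduler: str, job_name: str, task_name: str) -> str:
    """Compose a Spark app name that encodes the originating scheduler job.

    Hypothetical convention: "<scheduler>:<job>:<task>". The resulting name
    flows into Spline's reserved name property and is searchable in the UI.
    """
    return f"{scheduler}:{job_name}:{task_name}"

# In the Airflow task, the composed name would then be used as the app name:
#   spark = SparkSession.builder.appName(app_name).getOrCreate()
app_name = spark_app_name("airflow", "daily_ingest", "load_orders")
```

Because the scheduler name is a prefix, searching for "airflow:" (or a specific job name) in the execution-event list narrows results to jobs from that scheduler.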

Hope this helps.
