
/operations/{operationId} output doesn't contain textual transformation expressions #801

Open
metametametameta opened this issue Nov 13, 2020 · 6 comments


metametametameta commented Nov 13, 2020

Background

When a field in a project list is created or transformed in some fashion, there is usually an original "expression" string that represents the transformation. For example, here is a simple field that gets created from two existing fields via a concat expression (Java sample):

df = df.withColumn(
    "name",
    concat(df.col("lname"), lit(", "), df.col("fname")));

So basically "name" is derived from "lname" and "fname" via some transformation. The attribute lineage itself is easy enough to figure out by analyzing the result of the /operations/{operationId} REST call.
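(For reference, here is a minimal sketch of fetching that JSON, assuming Java 11's built-in HTTP client; the base URL is just a placeholder for whatever your Spline consumer API endpoint is:)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hedged sketch: fetch the operation details as raw JSON.
// The base URL is an assumption; substitute your own Spline consumer API endpoint.
public final class FetchOperation {
    public static String fetchOperationJson(String baseUrl, String operationId) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(baseUrl + "/operations/" + operationId))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}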

However, it's not so easy to get back the human-readable "text" of the transformation.

Example

In the JSON returned by /operations/{operationId}, I see something like:

  ...
  "projectList": [
    {
      "_typeHint": "expr.AttrRef",
      "refId": "b9f83e1b-abe9-4555-b6a0-1ca0f6f23d50",
      "expression" = "lname" //very useful, but not available
    },
    {
      "_typeHint": "expr.AttrRef",
      "refId": "a32219e0-eab0-4f0e-95de-d91ce2f43744",
      "expression" = "fname" //very useful, but not available
    },
    {
      "_typeHint": "expr.Alias",
      "alias": "name",
      "expression" : "concat(df.col("lname"), lit(", "), df.col("fname"))", //very useful, but not available
      "child": {
        "children": [
          {
            "_typeHint": "expr.AttrRef",
            "refId": "b9f83e1b-abe9-4555-b6a0-1ca0f6f23d50"
          },
          {
            "_typeHint": "expr.Literal",
            "value": ", ",
            "dataTypeId": "92c3dfe1-e004-4ca7-976c-a15a05a7da59"
          },
          {
            "_typeHint": "expr.AttrRef",
            "refId": "a32219e0-eab0-4f0e-95de-d91ce2f43744"
          }
        ],
        "name": "concat",
        "_typeHint": "expr.Generic",
        "dataTypeId": "28bdd23a-5004-421a-9b36-14458dec80a0",
        "exprType": "Concat"
      }
    }
  ],
  ...

Proposed Solution

Solution Ideas

As you can see, the original expression concat(df.col("lname"), lit(", "), df.col("fname")) is represented in the REST JSON only by its complex, structured version, which is not suitable for display to a user viewing the lineage. There is nothing else in the JSON, as far as I could find, that carries a textual version of the field-level transformation.

P.S. Note that in the Spline Web UI you already seem to do something along these lines by showing an expression.

(Shown in the Spline UI, but doesn't appear to be present in the REST API output:)

Transformations
λ = lname
λ = fname
λ = concat(lname, ', ', fname) AS name
So why not make the textual expressions above available via the REST API itself? That would make it possible to build a custom UI based purely on the REST API. I imagine adding an "expression" attribute (see my JSON snippet above) or similar would do the trick. "concat(lname, ', ', fname) AS name" above is already useful, but having the original "concat(df.col("lname"), lit(", "), df.col("fname"))" would be ideal if available (or even both versions).

wajda (Contributor) commented Nov 18, 2020

Hi @metametametameta,
Thank you for the suggestion. I understand your intention, and your use case, namely attribute-level lineage, is actually one of the primary use cases we will address in Spline 0.6.
Basically, as you mentioned, the current UI is able to show you a readable textual representation of expressions. But it's not taken from any particular attribute in the JSON; instead it's derived from the structural attribute/expression model, which in turn closely represents the original expression graph.
There are several reasons why Spline doesn't carry the string representation verbatim, but rather relies on the ASG model:

  • Versatility. The structured format is more detailed and can be used for creating different and more user-friendly expression representations than just a string. Imagine, for instance, that in the UI lname and fname are rendered as links, so you can click on them and perform actions (e.g. view the attribute type, navigate to the attribute definition, or explore its lineage beyond the current job boundary).
  • De-duplication. Since a string representation can be derived from the expression graph, there is little to no benefit in carrying both at the same time.
  • Representation consistency. Last but not least, the agent might simply not have or provide a consistent representation of expressions. In your Spark example, df is the name of a reference to a DataFrame, which might not exist in the compiled code. Spark uses its own expression model, which doesn't tell you whether the expression was originally written in Scala, Java, Python, or SQL DSL. The best we could get is Spark's own text representation, which might not be consistent from version to version and is not always easily readable. Also, Spline can work with other data processing frameworks, not only Spark, and that requirement makes things even less well defined.

So instead of capturing the expression text, we capture its semantics, so that the most suitable human-friendly representation can be rendered in the UI depending on the use case.
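To illustrate that kind of derivation (this is not how the Spline UI actually implements it, just a hedged sketch), a minimal tree-to-text rendering over the projectList JSON from the original post could look like the following, assuming Jackson for JSON parsing; the refId-to-attribute-name map is a hypothetical lookup you would populate from the rest of the /operations/{operationId} payload:

import com.fasterxml.jackson.databind.JsonNode;
import java.util.Map;

// Hedged sketch only: derives a readable string from the structured expression
// JSON shown earlier in this issue. Field names ("_typeHint", "refId", "alias",
// "value", "name", "child", "children") follow that snippet; the
// refId-to-attribute-name map is hypothetical.
public final class ExpressionRenderer {

    private final Map<String, String> attrNameByRefId;

    public ExpressionRenderer(Map<String, String> attrNameByRefId) {
        this.attrNameByRefId = attrNameByRefId;
    }

    public String render(JsonNode expr) {
        switch (expr.path("_typeHint").asText()) {
            case "expr.AttrRef":
                return attrNameByRefId.getOrDefault(expr.path("refId").asText(), "<unknown>");
            case "expr.Literal":
                return "'" + expr.path("value").asText() + "'";
            case "expr.Alias":
                return render(expr.path("child")) + " AS " + expr.path("alias").asText();
            case "expr.Generic": {
                StringBuilder args = new StringBuilder();
                for (JsonNode child : expr.path("children")) {
                    if (args.length() > 0) args.append(", ");
                    args.append(render(child));
                }
                return expr.path("name").asText() + "(" + args + ")";
            }
            default:
                return expr.toString(); // unknown node type: fall back to raw JSON
        }
    }
}

Applied to the Alias node in the example above, this would produce something like concat(lname, ', ', fname) AS name.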

metametametameta (Author) commented Nov 18, 2020

Hi @wajda, thanks for explaining the rationale behind the structured representation vs. the string representation. I can probably do some tree -> text translation on my end, but I just wanted to know if there's a metamodel I can work with. You mention a structural attribute/expression model and an ASG model, which I'm not familiar with; is there a spec of some sort available for those?

P.S. I looked around in the spline-ui code, and I'm guessing this is your metamodel for OpExpression?

https://github.com/AbsaOSS/spline-ui/blob/develop/ui/projects/spline-api/src/lib/execution-event/models/entities/operation-property/expression/op-expression.models.ts

wajda (Contributor) commented Nov 19, 2020

Yes, pretty much.
To be honest, in version 0.5 there's no universal model for expressions at the server-side level. The expression info is stored as a payload in whatever format it is received from the producer. The UI is then supposed to look at the execution plan's systemInfo and agentInfo properties to select a proper method of parsing and representing that payload (which is stored in the extra property). But that mechanism wasn't actually implemented properly; the whole system is still in a work-in-progress state, so at the moment the UI just assumes that the data comes from the Spark agent and expects the model to be this: https://github.com/AbsaOSS/spline/blob/release/0.3/model/src/main/scala/za/co/absa/spline/model/expr/Expression.scala
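A hedged sketch of that dispatch (purely illustrative; the property names follow the description above, and the returned labels are made up):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hedged sketch: inspect the execution plan's systemInfo/agentInfo to decide
// how to interpret the operation's expression payload, as described above.
public final class ExpressionModelSelector {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String selectModel(String executionPlanJson) throws Exception {
        JsonNode plan = MAPPER.readTree(executionPlanJson);
        String system = plan.path("systemInfo").path("name").asText("");
        if ("spark".equalsIgnoreCase(system)) {
            // In 0.5 the UI effectively hard-codes this branch and parses the
            // payload against the 0.3 Expression model linked above.
            return "spark-agent-expression-model";
        }
        return "unknown-producer"; // no dedicated parser for other producers
    }
}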

But please keep in mind that this will change as of Spline 0.6. We're going to introduce an Expression entity in the persistent model, and it will be propagated to the other models as well. So if you plan to develop something on top of this, I'd recommend taking a look at the first draft of the model here: https://github.com/AbsaOSS/spline/tree/feature/spline-677-persistene/producer-model/src/main/scala/za/co/absa/spline/producer/model/v1_1

metametametameta (Author) commented

I don't expect to be dealing with anything other than Spark lineage right now, and I think I can do something simple with the existing JSON model in 0.5.5 for the operation expressions.

I interpreted your response to mean that new fields (or entirely new REST endpoints) will be added in 0.6.0, which is great. But the REST output itself in 0.6.0 (for the Spark use case) will be backward compatible with 0.5.5, right? I aim to use any new expression-level REST API output when it appears, but I'm also assuming that simply upgrading to 0.6.0 won't break my existing REST calls and JSON analysis. Is that a safe assumption?

wajda (Contributor) commented Nov 19, 2020

We always strive to maintain backward compatibility, although it's not always easy, especially when introducing breaking changes to the domain model.

Server-wise, it will be able to communicate with any agent from version 0.4 onwards. For the database, there is a Spline admin tool that will take care of the migration, so your existing lineage data should be safe. Minor discrepancies could occur in some places, but it should at least be safe at the dataset, job, and operation level. We'll do our best to preserve as many details as possible, but if some part of the migration is too difficult or takes too long to implement, we might decide to cut it there. After all, we start our version numbers with 0 for a reason :)

Agent-wise, 0.6 will be able to talk to 0.5 servers, but expression details will probably be sacrificed (we won't implement model downgrade).

The producer REST model should be (fully or almost) identical up to the operation level.
The consumer REST model will most likely be significantly different.

metametametameta (Author) commented

@wajda Thanks. I'll keep the expected changes in mind when upgrading to 0.6.
