Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Attribute Type Checking for Sentiment, Aggregation, Hash Join Operators in Front-End #1924

Merged
merged 33 commits into from
Jun 17, 2023

Conversation

aahei
Copy link
Contributor

@aahei aahei commented May 23, 2023

Check attribute type and give suggestions to users' selection of attributes.

Included 3 representative operators that require attribute type checking: Sentiment Analysis, Aggregation, and Hash Join.

Attribute Type Rule Schema

The attribute type rule schema format is a custom schema modified from JSON Schema. The schema is injected into the operator schema, to an attribute called attributeTypeRules. The type of the injected attribute is an object that describes the expected attribute type for the operator property whose value is the auto-filled attribute name.

The schema may contain three constraints: enum that defines the possible values of the type, const (const.$data) that is now used for cross-property equality checks, and allOf (for if-then sets) for more complicated cases. The schema may have multiple constraints but must satisfy all of them.

In general, our schema has the following format.

{
  "attributeTypeRules": {
    "<attribute name>": {
      <constraints>
    },
    ...
  }
}

Here are the attribute type schemas for the three operators. Those schemas cover the most common situations.

Check if the attribute type is in a set

The attribute type of "xAttr" and "yAttr" must be numeric (integer, long, or double). This example is for the Linear Regression operator.

{
  "attributeTypeRules": {
    "xAttr":{
      "enum": ["integer", "long", "double"]
    },
    "yAttr":{
      "enum": ["integer", "long", "double"]
    }
  }
}

Check the equality of two attribute types

The attribute type of "buildAttributeName" must equal to that of "probeAttributeName". For Hash Join operator.

{
  "attributeTypeRules": {
    "buildAttributeName": {
      "const": {
        "$data": "probeAttributeName"
      }
    }
}

Note that the $data keyword is used to implement value comparison. Currently, it is the only way to check equality between values in JSON Schema. However, the $data keyword proposal is not in the formal JSON Schema standard. See json-schema-org/json-schema-spec#51, and https://ajv.js.org/guide/combining-schemas.html#data-reference.

Attribute type dependent on other property's value

All if-then in allOf array must be satisfied. For example, if aggFunction's value is "sum", "average", "min", or "max", then the attribute type of "attribute" can only be numeric (integer, long, or double). Similarly, if aggFunction's value is "concat", then the attribute type of "attribute must be "string". If aggFunction has other values, like "count", then it has no constraint. This schema is for the Aggregation operator.

{
  "attributeTypeRules": {
    "attribute": {
      "allOf": [
        {
          "if": {
            "aggFunction": {
              "valEnum": ["sum", "average", "min", "max"]
            }
          },
          "then": {
            "enum": ["integer", "long", "double"]
          }
        },
        {
          "if": {
            "aggFunction": {
              "valEnum": ["concat"]
            }
          },
          "then": {
            "enum": ["string"]
          }
        }
      ]
    }
  }
}

Note: the schema inside then is not a complete attribute type rule schema because it does not refer to the current property name again, but uses enum directly. That is, instead of "then": { "0:attribute": { "enum": ["integer", "long", "double"] } }, it uses "then": { "enum": ["integer", "long", "double"] }.

Moreover, the valEnum in the if clause is comparing the actual literal value of the property, not the attribute type, as opposed to elsewhere in the attribute type schema, so we used the new custom keyword valEnum instead of enum here.

Other differences between the custom attribute type schema and JSON schema

properties is ignored. E.g. an ordinary JSON Schema looks like this:
{ "properties": { "country": { "enum": ["US", "Canada"] } } }
But in our schema, it would look like this:
{ "country": { "enum": ["US", "Canada"] } }

JSON Schema Reference: https://json-schema.org/understanding-json-schema/

UI Change

A suggestion box will appear when the user's input does not match the attribute type schema.

  • sentiment analysis error message
  • hash join error message
  • aggregation error message

Unit Test

The unit test "should change the content of property editor from an empty panel correctly" fails in core\new-gui\src\app\workspace\component\property-editor\operator-property-edit-frame\operator-property-edit-frame.component.spec.ts despite the code works as expected in manual testing. It will be temporarily disabled to merge the PR and be further investigated in the future.

This code is related to

, and .

The main cause of the unit test failure in is that we changed this.formlyFields = field.fieldGroup; to this.formlyFields = [field]; because else the validator in field will not be triggered. However, after changing this, there will be error like this.

Error: NG0100: ExpressionChangedAfterItHasBeenCheckedError: Expression has changed after it was checked. Previous value for 'ng-valid': 'true'. Current value: 'false'.. Find more at https://angular.io/errors/NG0100
        error properties: Object({ code: -100 })

It is then later fixed by 35cc4c1 commit. However, then "should change the content of property editor from an empty panel correctly" fails, which is the unit test disabled.

@aahei aahei linked an issue May 23, 2023 that may be closed by this pull request
Copy link
Collaborator

@Yicong-Huang Yicong-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments on the code.

For a better review, please improve the PR description to

  1. include the two abstracts used to inject the schema check information into the operator metadata.
  2. describe clearly the custom modification of the JSON schema interface, and explain why it is necessary.

@aahei
Copy link
Contributor Author

aahei commented May 31, 2023

The feature was demonstrated during the meeting on 2023-05-22.
TODO:

  • Attribute type checking warning box should change color to yellow or green, not red.
  • Message should be clear.

@aahei
Copy link
Contributor Author

aahei commented Jun 4, 2023

proposed new schema for attribute type:

"attributeType": {
  "<port>:<property name>":{
    "<constraints>"
  }
}

where "<constraints>" are one or more from below:

"enum": ["integer", "double", "<other possible type>"]
"const": {
  "$data": "<port>:<property name>"
}
"allOf": [
  {
    "if": {
      "<property name>": {
        "valEnum": ["<value of property>"]
      }
    },
    "then": {
      "enum": ["integer", "long", "double", "<other possible type>"]
    }
  },
  "<more if-then>"
]

Changes:

  • hardcode port into the schema.
    I used : here in the property name string to inject the port because it's simple. JSON Schema also has special characters (e.g., #, /, ~, etc.) in the property name string too, like the schema pointer, which I avoided using.
  • use valEnum in the if clause to distinguish it from enum, because we actually check the value of the property there, instead of attribute type.
    For example, for the aggregation operator case, we check the value of the aggFunction property.
    [{
     "if": {
       "aggFunction": {
         "valEnum": ["sum", "average", "min", "max"]
       }
     },
     "then": {
       "enum": ["integer", "long", "double"]
     }
    },
    {
     "if": {
       "aggFunction": {
         "valEnum": ["concat"]
       }
     },
     "then": {
       "enum": ["string"]
     }
    }]

@aahei
Copy link
Contributor Author

aahei commented Jun 4, 2023

Examples of the friendly error messages:

  • Sentiment analysis
    The type (double) of 'Column B' is not valid for property 'attribute': must be string
  • Hash Join
    The type (integer) of 'Column A' is not valid for property 'buildAttributeName': must be the same type (string) as property 'probeAttributeName'
  • Aggregation
    The type (string) of 'Group' is not valid for property 'attribute': must be integer or long or double, given that property 'aggFunction' is average

@aahei
Copy link
Contributor Author

aahei commented Jun 5, 2023

The error message box now looks like the first image and was like the second before. However, the current implementation overwrites the style of the Formly error box, not just for this attribute type validation, but also for all others. Yet, currently, other validations are all done on FormField scope not FormGroup scope like this, which means that the style change does not affect other current validations.
After
Before

@aahei
Copy link
Contributor Author

aahei commented Jun 5, 2023

Change JSON object name to AttributeTypeRules

@aahei
Copy link
Contributor Author

aahei commented Jun 5, 2023

Examples of the friendly error messages:

  • Sentiment analysis
    The type (double) of 'Column B' is not valid for property 'attribute': must be string
  • Hash Join
    The type (integer) of 'Column A' is not valid for property 'buildAttributeName': must be the same type (string) as property 'probeAttributeName'
  • Aggregation
    The type (string) of 'Group' is not valid for property 'attribute': must be integer or long or double, given that property 'aggFunction' is average

Since it's suggestion, should change the tone, like "expect" instead of invalid.

@aahei
Copy link
Contributor Author

aahei commented Jun 5, 2023

The error message box now looks like the first image and was like the second before. However, the current implementation overwrites the style of the Formly error box, not just for this attribute type validation, but also for all others. Yet, currently, other validations are all done on FormField scope not FormGroup scope like this, which means that the style change does not affect other current validations. After Before

Change to yellow

@aahei aahei requested a review from Yicong-Huang June 5, 2023 06:36
@aahei
Copy link
Contributor Author

aahei commented Jun 8, 2023

proposed new schema for attribute type:

"attributeType": {
  "<port>:<property name>":{
    "<constraints>"
  }
}

where "" are one or more from below:

"enum": ["integer", "double", "<other possible type>"]
"const": {
  "$data": "<port>:<property name>"
}
"allOf": [
  {
    "if": {
      "<property name>": {
        "valEnum": ["<value of property>"]
      }
    },
    "then": {
      "enum": ["integer", "long", "double", "<other possible type>"]
    }
  },
  "<more if-then>"
]

Changes:

  • hardcode port into the schema.
    I used : here in the property name string to inject the port because it's simple. JSON Schema also has special characters (e.g., #, /, ~, etc.) in the property name string too, like the schema pointer, which I avoided using.
  • use valEnum in the if clause to distinguish it from enum, because we actually check the value of the property there, instead of attribute type.
    For example, for the aggregation operator case, we check the value of the aggFunction property.
    [{
     "if": {
       "aggFunction": {
         "valEnum": ["sum", "average", "min", "max"]
       }
     },
     "then": {
       "enum": ["integer", "long", "double"]
     }
    },
    {
     "if": {
       "aggFunction": {
         "valEnum": ["concat"]
       }
     },
     "then": {
       "enum": ["string"]
     }
    }]

Update: decided not to hardcode the port in the property name but instead use the autofillAttributeOnPort value in the schema.

# Conflicts:
#	core/new-gui/src/app/workspace/service/dynamic-schema/schema-propagation/schema-propagation.service.ts
Copy link
Collaborator

@Yicong-Huang Yicong-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@aahei aahei merged commit 7f29d4f into master Jun 17, 2023
@aahei aahei deleted the cli-attr-type-checking branch June 17, 2023 01:45
aahei added a commit that referenced this pull request Jun 24, 2023
…titions, Type Casting (#2005)

This is the second PR for the attribute type checking feature. The first
one is #1924.

## Description of Attribute Type Rules
### Interval Join
`leftAttributeName` (and `rightAttributeName`) must be `integer`,
`long`, `double`, or `timestamp`.

And, `leftAttributeName` attribute must have the same type as the
`rightAttributeName`.

```JSON
{
  "attributeTypeRules": {
    "leftAttributeName": {
      "enum": ["integer", "long", "double", "timestamp"]
    },
    "rightAttributeName": {
      "const": {
        "$data": "leftAttributeName"
      }
    }
  }
}
```

Note: We intentionally put `enum` test in front of `const` test, because
we want to test whether they have the correct type. Or, if we put the
`const` test first, i.e `rightAttributeName` rule first, and if
`leftAttributeName`'s attribute type is an invalid type like `string`,
then it will prompt the user that `rightAttributeName` should have the
same attribute type as `leftAttributeName` -- `string` -- which is
incorrect since both should not be a `string` type.

### Scatter Plot

`xColumn` and `yColumn` attributes must be of `integer` or `double`
type.

```JSON
{
  "attributeTypeRules": {
    "xColumn":{
      "enum": ["integer", "double"]
    },
    "yColumn":{
      "enum": ["integer", "double"]
    }
  }
}
```

Note: it may support `long` in the future. See
#1954.

### Sort Partitions

`sortAttributeName` attribute type must be `integer`, `long`, or
`double`.

```JSON
{
  "attributeTypeRules": {
    "sortAttributeName":{
      "enum": ["integer", "long", "double"]
    }
  }
}
```

Note: May support `timestamp` in the future. See
#1954.

### Type Casting

For example, if we want to convert an attribute to `integer`, it must
have attribute type of `string`, `long`, `double`, or `boolean`. A type
should not convert to the type itself. See the schema for detail.

```JSON
{
	"attributeTypeRules": {
		"attribute": {
			"allOf": [{
					"if": {
						"resultType": {
							"valEnum": ["integer"]
						}
					},
					"then": {
						"enum": ["string", "long", "double", "boolean"]
					}
				},
				{
					"if": {
						"resultType": {
							"valEnum": ["double"]
						}
					},
					"then": {
						"enum": ["string", "integer", "long", "boolean"]
					}
				},
				{
					"if": {
						"resultType": {
							"valEnum": ["boolean"]
						}
					},
					"then": {
						"enum": ["string", "integer", "long", "double"]
					}
				},
				{
					"if": {
						"resultType": {
							"valEnum": ["long"]
						}
					},
					"then": {
						"enum": ["string", "integer", "double", "boolean", "timestamp"]
					}
				},
				{
					"if": {
						"resultType": {
							"valEnum": ["timestamp"]
						}
					},
					"then": {
						"enum": ["string", "long"]
					}
				}
			]
		}
	}
}
```

Note: The type constraint is enforced in
`core/amber/src/main/scala/edu/uci/ics/texera/workflow/common/tuple/schema/AttributeTypeUtils.scala`.

---------

Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Attribute Type Validation at the Front End
2 participants