[Improvement]: Build a rule of relationship between table and optimizer/resource group #1865

Open
3 tasks done
majin1102 opened this issue Aug 21, 2023 · 7 comments

majin1102 commented Aug 21, 2023

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

Right now, there are two ways to declare that a table is optimized by a particular optimizer group:

  1. Set the default optimizer group of the catalog, and do not declare an optimizer group in the table properties.

  2. Declare the 'self-optimizing.group' property in the table properties (in a CREATE TABLE or ALTER TABLE statement).

In practice, using the default optimizer group gives a better user experience but is inflexible when multiple groups are needed in one catalog. Using a table property is more flexible but sacrifices user experience and security: every table owner would need to know about the resources behind AMS and would have the authority to allocate them, which could be a disaster.

This doesn't seem like a big deal, because in many cases there is only one external/default optimizer group and no need to consider security or isolation. But it is never too late to offer a better way to provide user experience, isolation, and security for self-optimizing.

How should we improve?

Better user experience
Users declare relationships in one place and use them everywhere. Defining a property on the table is a bad idea, because it means the table owner must know the optimizer group concepts and instances.

A better idea is to declare the properties on the optimizer group and use an extensible rule such as a regex.

Better security
The relationship between tables and resources should be fixed and must not be modifiable without the permission of the resource owner. Declaring properties on the optimizer group clearly fulfills this criterion.

Better isolation
When we declare or modify relationships between tables and resources, the rules must be mutually exclusive.

In conclusion, I propose declaring regex rules on the optimizer group to define the relationships between tables and resources. For example:

catalog1.db1.*
catalog2..

This gives a clear definition: the optimizer group can be used by these tables, and only by them.
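To make the proposal concrete, here is a minimal sketch of how group-level rules could map a table identifier to exactly one optimizer group. It is not Amoro code: the `GROUP_RULES` mapping, the group names, and both function names are hypothetical, and glob-style wildcards from Python's `fnmatch` stand in for the regex rules discussed here.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: optimizer group name -> table patterns it owns.
GROUP_RULES = {
    "group_a": ["catalog1.db1.*"],
    "group_b": ["catalog1.db2.*", "catalog3.*"],
}


def matching_groups(table_id, group_rules):
    """Return, sorted by name, every group with a pattern matching the table."""
    return sorted(
        group
        for group, patterns in group_rules.items()
        if any(fnmatchcase(table_id, pattern) for pattern in patterns)
    )


def resolve_group(table_id, group_rules):
    """Return the single owning group, or None if no rule matches.

    Raises ValueError when more than one group claims the table, which is
    how this sketch enforces that rules are mutually exclusive.
    """
    groups = matching_groups(table_id, group_rules)
    if len(groups) > 1:
        raise ValueError(f"{table_id} is claimed by several groups: {groups}")
    return groups[0] if groups else None
```

A real implementation would more likely reject overlapping rules at the moment a rule is saved, rather than when a table is resolved, so that the owner of the resources sees the conflict immediately.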

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Subtasks

No response


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Aug 21, 2024

github-actions bot commented Sep 4, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Sep 4, 2024

Aireed commented Sep 6, 2024

I believe the intended behavior of this feature should be as follows:
Priority of group configuration: table-level configuration > regex rule configuration > catalog default configuration.

If a group is configured manually on the table, that setting should take precedence.
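The priority order above can be sketched as a small lookup. This is an illustration under assumptions, not the project's implementation: the function name, the `rules` list of (pattern, group) pairs, and the use of glob-style matching in place of regex are all hypothetical. Only the property name 'self-optimizing.group' comes from this issue.

```python
from fnmatch import fnmatchcase

GROUP_PROPERTY = "self-optimizing.group"  # property name taken from the issue


def effective_group(table_id, table_properties, rules, catalog_default):
    """Pick the group: table property > first matching rule > catalog default.

    `rules` is a hypothetical list of (pattern, group) pairs.
    """
    explicit = table_properties.get(GROUP_PROPERTY)
    if explicit:
        # A group set directly on the table always wins.
        return explicit
    for pattern, group in rules:
        if fnmatchcase(table_id, pattern):
            return group
    # No table-level setting and no matching rule: use the catalog default.
    return catalog_default
```

For example, with `rules = [("catalog1.db1.*", "group_a")]`, a table `catalog1.db1.t1` with no table-level property resolves to `group_a`, while the same table with the property set to `group_x` resolves to `group_x`.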
Rule persistence

  1. Regex rule configurations should not be written to the underlying table's properties.
  2. Rules are stored in the catalog properties.

Rule change

  1. Changes to regex rules will result in group changes for all affected tables.
  2. After a regex rule is deleted, the catalog's default configuration should take effect.

These changes can take effect through TableRuntimeRefresh.
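As a rough sketch of the rule-change behavior described above, the following computes which tables would move to a different group when the rule set changes, including the fallback to the catalog default after a rule is deleted. All names are hypothetical, rules are (pattern, group) pairs matched glob-style, and tables with an explicit table-level group are assumed to be filtered out beforehand. How a refresh step would apply the result is not shown.

```python
from fnmatch import fnmatchcase


def group_for(table_id, rules, catalog_default):
    """Return the group of the first matching rule, else the catalog default."""
    for pattern, group in rules:
        if fnmatchcase(table_id, pattern):
            return group
    return catalog_default


def affected_tables(table_ids, old_rules, new_rules, catalog_default):
    """Map each table whose group changes to an (old_group, new_group) pair."""
    changes = {}
    for table_id in table_ids:
        before = group_for(table_id, old_rules, catalog_default)
        after = group_for(table_id, new_rules, catalog_default)
        if before != after:
            changes[table_id] = (before, after)
    return changes
```

Passing an empty `new_rules` list models deleting every rule: each previously matched table is reported as moving back to the catalog default.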

Rule queries:

  1. When displaying the optimizer group list, show the rules that affect tables, collected from the catalog properties.

  2. Also display these rules in the catalog's properties.

IMO, based on the issue description, the initial intention was to configure this rule at the group level. I agree with this, but from an implementation standpoint it would involve extensive changes in every properties call.

If we configure the regex rules in the catalog's properties, the effect of this property can be consolidated in the BasicUnkeyedTable::properties call within the MixedCatalogUtil::mergeCatalogPropertiesToTable method, which makes it convenient to implement.

@XBaith @majin1102 @zhoujinsong @nicochen WDYT.

@Aireed Aireed reopened this Sep 6, 2024

majin1102 commented Sep 6, 2024

I don't think rules in the catalog properties are necessary if we can use the optimizer group.

@github-actions github-actions bot removed the stale label Sep 7, 2024

klion26 commented Sep 7, 2024

Will it be possible for multiple types of optimizers to exist in one OptimizerGroup in the future? For example, the same OptimizerGroup may contain both Flink and Spark optimizers. The OptimizerGroup is similar to a logical resource pool, and different types of optimizers will occupy some resources.

majin1102 commented

> Will it be possible for multiple types of optimizers to exist in one OptimizerGroup in the future? For example, the same OptimizerGroup may contain both Flink and Spark optimizers. The OptimizerGroup is similar to a logical resource pool, and different types of optimizers will occupy some resources.

What scenarios would this hybrid resource model be helpful for?
I believe this will introduce considerable complexity.


klion26 commented Sep 20, 2024

@majin1102 Thanks for the reply. I'm asking because of the following scenario: when the Flink optimizer is used for merging, the optimizer may stop or need to catch up on data, or there may be sudden demand for merging. However, the Flink optimizer is not particularly good at automatic scaling (at least on Yarn).

In addition, from a resource perspective, an OptimizerGroup is just a resource pool and an optimizer is an application running in that OptimizerGroup (much as an OptimizerGroup would be a Yarn queue and the optimizer an application). Would this really add too much complexity, or is there something I missed here? Thanks.
