Check interval range to avoid cases where year is inappropriately entered #16945

Open
asdf2014 opened this issue Aug 22, 2024 · 3 comments · May be fixed by #16951

Comments

@asdf2014
Member

Description

In Apache Druid, we need to support a new feature that can check the interval range to avoid cases where the year is inappropriately entered.

Specifically, when dealing with time data, there are instances where incorrect years are entered due to typos or other reasons. For example, entering the year as 20240 instead of 2024. These incorrect years can lead to significant deviations in data processing and analysis results, affecting the accuracy and reliability of the data.

To avoid such situations, we plan to add an interval range check feature in Apache Druid. This feature will allow users to set a reasonable range for years, such as from the year 2000 to 2100. During data input and processing, the system will automatically check whether the year falls within this range. If a year outside this range is detected, the system will issue a warning or error message, prompting the user to make corrections.

The implementation of this new feature will include the following steps:

  1. Define a reasonable year range: Users can set a reasonable year range through configuration files or the interface.
  2. Data input check: During the data input phase, the system will check whether the year of each data entry falls within the set range.
  3. Data processing check: During the data processing phase, the system will also perform year checks to ensure that the years of all data being processed fall within the configured range.
  4. Error handling and notification: If a year outside the range is detected, the system will log the error and issue a warning or error message to the user.
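As a rough illustration of steps 1 to 4, the check could look something like the sketch below. This is not existing Druid code: the class `YearRangeCheck`, its methods, and the message format are all hypothetical, and the range of 2000 to 2100 is only the example range mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Druid: flags years outside a configured range.
final class YearRangeCheck {
    private final int minYear;
    private final int maxYear;

    // Step 1: the allowed range is supplied by the caller (e.g. read from configuration).
    YearRangeCheck(int minYear, int maxYear) {
        if (minYear > maxYear) {
            throw new IllegalArgumentException("minYear must not exceed maxYear");
        }
        this.minYear = minYear;
        this.maxYear = maxYear;
    }

    // Steps 2 and 3: the same check can be applied at input time and at processing time.
    boolean isValid(int year) {
        return year >= minYear && year <= maxYear;
    }

    // Step 4: collect one message per out-of-range year so the caller can log or report them.
    List<String> findViolations(List<Integer> years) {
        List<String> violations = new ArrayList<>();
        for (int year : years) {
            if (!isValid(year)) {
                violations.add("Year " + year + " is outside the allowed range ["
                        + minYear + ", " + maxYear + "]");
            }
        }
        return violations;
    }
}
```

With a range of 2000 to 2100, a typo such as 20240 would be reported while 2024 would pass. Whether a violation should produce a warning or a hard error is left to the caller here.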

By introducing this interval range check feature, we can effectively avoid data issues caused by incorrect year entries, enhancing the accuracy and reliability of data processing. This will provide users with higher quality data analysis services, ensuring that their decisions are based on accurate and error-free data.

@kfaraz
Contributor

kfaraz commented Sep 3, 2024

@asdf2014 , we already support validation of intervals:

Do you want to just filter out such records (which is already supported as listed above) or also raise an alert when an out-of-range record is encountered?

@asdf2014
Member Author

asdf2014 commented Sep 9, 2024

Hi @kfaraz , Apache Druid certainly supports checking data dates. This proposal is about checking at the task payload level, because we have encountered intervals that were filled out incorrectly on the business side, which led to reading a large amount of data from HDFS. It is not the same level of checking as what you mentioned 😅

@kfaraz
Contributor

kfaraz commented Sep 23, 2024

I see, thanks for the clarification, @asdf2014 .

So you want to add a validation on the input time interval while persisting the task payload itself.
I am not entirely sure if we can always safeguard against users making such mistakes. The surface area is too large.
It is always possible to validate things which are semantically incorrect for Druid.
But something which is inadvisable only for a certain use case should ideally be validated on the application side itself.

That said, it does make sense for an admin to allow users to perform only valid actions.

To that effect, the admin could specify a property called, say, validYearRange or validIntervalRange in common.runtime.properties, which would then be used to validate the intervals in all task payloads.
But then again, I am averse to adding a new config for every new validation that we have to perform.
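For illustration only, a payload-level check driven by such a property might be sketched as follows. Nothing here exists in Druid today: the class `TaskIntervalValidator`, the `2000/2100` value format for validYearRange, and the error messages are assumptions. The sketch also assumes the interval is a `start/end` string in which both endpoints are ISO-8601 dates beginning with the year, and it reads the year with a regular expression rather than a date parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical validator, not part of Druid: rejects task intervals whose
// start or end year falls outside an admin-configured range.
final class TaskIntervalValidator {
    // Leading year of a date such as "2024-01-01" or "20240-01-01T00:00:00Z".
    // At most 9 digits so that the value always fits in an int.
    private static final Pattern LEADING_YEAR = Pattern.compile("^(\\d{1,9})-");

    private final int minYear;
    private final int maxYear;

    TaskIntervalValidator(int minYear, int maxYear) {
        if (minYear > maxYear) {
            throw new IllegalArgumentException("minYear must not exceed maxYear");
        }
        this.minYear = minYear;
        this.maxYear = maxYear;
    }

    // Assumed property format "minYear/maxYear", e.g. validYearRange=2000/2100.
    static TaskIntervalValidator fromProperty(String validYearRange) {
        String[] parts = validYearRange.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected minYear/maxYear but got: " + validYearRange);
        }
        return new TaskIntervalValidator(
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim()));
    }

    // Throws IllegalArgumentException if either endpoint's year is out of range.
    void validate(String interval) {
        String[] endpoints = interval.split("/");
        if (endpoints.length != 2) {
            throw new IllegalArgumentException("Expected start/end but got: " + interval);
        }
        for (String endpoint : endpoints) {
            int year = leadingYear(endpoint);
            if (year < minYear || year > maxYear) {
                throw new IllegalArgumentException("Interval " + interval + " has year " + year
                        + " outside the allowed range [" + minYear + ", " + maxYear + "]");
            }
        }
    }

    private static int leadingYear(String endpoint) {
        Matcher matcher = LEADING_YEAR.matcher(endpoint.trim());
        if (!matcher.find()) {
            throw new IllegalArgumentException("Cannot read a year from: " + endpoint);
        }
        return Integer.parseInt(matcher.group(1));
    }
}
```

Under these assumptions, a validator built with `TaskIntervalValidator.fromProperty("2000/2100")` would accept `2024-01-01/2024-02-01` and reject `2024-01-01/20240-02-01` before the task is persisted. A real implementation would presumably reuse Druid's own interval parsing instead of the regular expression.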

cc: @abhishekagarwal87 , what are your thoughts on such validations?
