-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance Interval arithmetic to support String type. #7098
Comments
@berkaysynnada @ozankabak |
#[test]
fn test_normalize_string_interval() -> Result<()> {
// Define a mapping of characters to numerical values
let mut char_map: HashMap<char, u8> = HashMap::new();
for i in 0..128 {
let c = (i as u8) as char;
char_map.insert(c, i as u8);
}
let mut string_intervals = vec!["00", "aaaaa", "AMddist", "!_xyz", "zzz", "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzx", "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"];
// sort the string_intervals by their ascii value
string_intervals.sort_by(|a, b| a.cmp(b));
let first_interval = string_intervals.first().unwrap();
let last_interval = string_intervals.last().unwrap();
let float_vec = string_intervals.iter().map(|interval| {
calcuate_normalize_value(&mut char_map, interval).unwrap_or(0f64)
}).collect::<Vec<_>>();
let min_val = float_vec.iter().fold(std::f64::INFINITY, |acc, x| acc.min(*x));
let max_val = float_vec.iter().fold(std::f64::NEG_INFINITY, |acc, x| acc.max(*x));
let normalized_vector: Vec<f64> = float_vec.iter().map(|x| (x - min_val) / (max_val - min_val)).collect();
println!("normalized_vector {:?}", normalized_vector);
Ok(())
}
fn calcuate_normalize_value(char_map: &HashMap<char, u8>, str_value: &str) -> Option<f64> {
let base = char_map.len() as f64;
if str_value.is_ascii() {
let mut sum = 0.0;
for (i, c) in str_value.chars().enumerate() {
let char_value = *char_map.get(&c).unwrap() as f64;
let weight = base.powf((0-i as i32) as f64);
sum += char_value * weight;
}
let normalized_value = sum / char_map.len() as f64;
Some(normalized_value)
} else {
None
}
} |
Sounds good. However, I'm not entirely sure about the correctness of the weight calculation. Shouldn't the weight of the leading characters be higher (assuming the order of the strings is as in the dictionary)? Once the strings are accurately mapped to floats, I believe there wouldn't be any problem with calculating selectivity as you've proposed. |
This will output: normalized_vector [0.0, 0.16394189376465151, 0.35710980460519687, 0.7175591189901976, 0.9999993427698617, 1.0, 1.0] |
Yes, you are right. The |
This version seems correct to calculate weights. I don't think there will be any problems and it would improve the interval arithmetic and statistics calculations. Thanks |
Started #7882 to collect improvement to interval arithmetic |
Is your feature request related to a problem or challenge?
Currently DataFusion does not support Interval arithmetic on string type, and the stats
analyze
process will be failed during CBO stats estimation process.I think maybe we can normalize the string intervals to numeric intervals first and leverage the numeric intervals to estimate the
selectivity
.Init input Bound:
["aaa", "xyz123"] => [0, 1]
Estimated Bound (example)
["ccc", "ddd"] => [0.2, 0.3]
estimated selectivity:
0.1
The Normalization of the string intervals can be based on ascil code ordering, for any non-ascil char, we can just return early.
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: