Enhance Interval arithmetic to support String type. #7098

mingmwang · 2023-07-26T08:32:45Z

Is your feature request related to a problem or challenge?

Currently DataFusion does not support Interval arithmetic on string type, and the stats analyze process will be failed during CBO stats estimation process.

I think maybe we can normalize the string intervals to numeric intervals first and leverage the numeric intervals to estimate the selectivity.

Init input Bound:
["aaa", "xyz123"] => [0, 1]

Estimated Bound (example)
["ccc", "ddd"] => [0.2, 0.3]

estimated selectivity:
0.1

The Normalization of the string intervals can be based on ascil code ordering, for any non-ascil char, we can just return early.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

mingmwang · 2023-07-26T08:35:43Z

@berkaysynnada @ozankabak
Please share your thoughts on this.

mingmwang · 2023-07-26T08:42:24Z

    #[test]
    fn test_normalize_string_interval() -> Result<()> {
        // Define a mapping of characters to numerical values
        let mut char_map: HashMap<char, u8> = HashMap::new();

        for i in 0..128 {
            let c = (i as u8) as char;
            char_map.insert(c, i as u8);
        }

        let mut string_intervals = vec!["00", "aaaaa", "AMddist", "!_xyz", "zzz", "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzx", "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"];
        // sort the string_intervals by their ascii value
        string_intervals.sort_by(|a, b| a.cmp(b));

        let first_interval = string_intervals.first().unwrap();
        let last_interval = string_intervals.last().unwrap();

        let float_vec = string_intervals.iter().map(|interval| {
            calcuate_normalize_value(&mut char_map, interval).unwrap_or(0f64)
        }).collect::<Vec<_>>();

        let min_val = float_vec.iter().fold(std::f64::INFINITY, |acc, x| acc.min(*x));
        let max_val = float_vec.iter().fold(std::f64::NEG_INFINITY, |acc, x| acc.max(*x));
        let normalized_vector: Vec<f64> = float_vec.iter().map(|x| (x - min_val) / (max_val - min_val)).collect();

        println!("normalized_vector {:?}", normalized_vector);
        Ok(())
    }

    fn calcuate_normalize_value(char_map: &HashMap<char, u8>, str_value: &str) -> Option<f64> {
        let base = char_map.len() as f64;
        if str_value.is_ascii() {
            let mut sum = 0.0;
            for (i, c) in str_value.chars().enumerate() {
                let char_value = *char_map.get(&c).unwrap() as f64;
                let weight = base.powf((0-i as i32) as f64);
                sum += char_value * weight;
            }
            let normalized_value = sum / char_map.len() as f64;
            Some(normalized_value)
        } else {
            None
        }
    }

berkaysynnada · 2023-07-26T11:21:33Z

Sounds good. However, I'm not entirely sure about the correctness of the weight calculation. Shouldn't the weight of the leading characters be higher (assuming the order of the strings is as in the dictionary)? Once the strings are accurately mapped to floats, I believe there wouldn't be any problem with calculating selectivity as you've proposed.

mingmwang · 2023-07-26T11:51:04Z

This will output:

normalized_vector [0.0, 0.16394189376465151, 0.35710980460519687, 0.7175591189901976, 0.9999993427698617, 1.0, 1.0]

mingmwang · 2023-07-26T11:53:19Z

Sounds good. However, I'm not entirely sure about the correctness of the weight calculation. Shouldn't the weight of the leading characters be higher (assuming the order of the strings is as in the dictionary)? Once the strings are accurately mapped to floats, I believe there wouldn't be any problem with calculating selectivity as you've proposed.

Yes, you are right. The weigh is calculated based on the position of the char.

berkaysynnada · 2023-07-26T12:39:11Z

This version seems correct to calculate weights. I don't think there will be any problems and it would improve the interval arithmetic and statistics calculations. Thanks

alamb · 2023-10-20T13:33:38Z

Started #7882 to collect improvement to interval arithmetic

mingmwang added the enhancement New feature or request label Jul 26, 2023

alamb mentioned this issue Oct 20, 2023

[EPIC] A collection of Interval arithmetic (not Intervals of time) improvements #7882

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance Interval arithmetic to support String type. #7098

Enhance Interval arithmetic to support String type. #7098

mingmwang commented Jul 26, 2023

mingmwang commented Jul 26, 2023

mingmwang commented Jul 26, 2023 •

edited

Loading

berkaysynnada commented Jul 26, 2023

mingmwang commented Jul 26, 2023

mingmwang commented Jul 26, 2023

berkaysynnada commented Jul 26, 2023

alamb commented Oct 20, 2023

Enhance Interval arithmetic to support String type. #7098

Enhance Interval arithmetic to support String type. #7098

Comments

mingmwang commented Jul 26, 2023

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

mingmwang commented Jul 26, 2023

mingmwang commented Jul 26, 2023 • edited Loading

berkaysynnada commented Jul 26, 2023

mingmwang commented Jul 26, 2023

mingmwang commented Jul 26, 2023

berkaysynnada commented Jul 26, 2023

alamb commented Oct 20, 2023

mingmwang commented Jul 26, 2023 •

edited

Loading