Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance Interval arithmetic to support String type. #7098

Open
Tracked by #7882
mingmwang opened this issue Jul 26, 2023 · 7 comments
Open
Tracked by #7882

Enhance Interval arithmetic to support String type. #7098

mingmwang opened this issue Jul 26, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@mingmwang
Copy link
Contributor

Is your feature request related to a problem or challenge?

Currently DataFusion does not support Interval arithmetic on string type, and the stats analyze process will be failed during CBO stats estimation process.

I think maybe we can normalize the string intervals to numeric intervals first and leverage the numeric intervals to estimate the selectivity.

Init input Bound:
["aaa", "xyz123"] => [0, 1]

Estimated Bound (example)
["ccc", "ddd"] => [0.2, 0.3]

estimated selectivity:
0.1

The Normalization of the string intervals can be based on ascil code ordering, for any non-ascil char, we can just return early.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@mingmwang mingmwang added the enhancement New feature or request label Jul 26, 2023
@mingmwang
Copy link
Contributor Author

@berkaysynnada @ozankabak
Please share your thoughts on this.

@mingmwang
Copy link
Contributor Author

mingmwang commented Jul 26, 2023

    #[test]
    fn test_normalize_string_interval() -> Result<()> {
        // Define a mapping of characters to numerical values
        let mut char_map: HashMap<char, u8> = HashMap::new();

        for i in 0..128 {
            let c = (i as u8) as char;
            char_map.insert(c, i as u8);
        }

        let mut string_intervals = vec!["00", "aaaaa", "AMddist", "!_xyz", "zzz", "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzx", "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"];
        // sort the string_intervals by their ascii value
        string_intervals.sort_by(|a, b| a.cmp(b));

        let first_interval = string_intervals.first().unwrap();
        let last_interval = string_intervals.last().unwrap();

        let float_vec = string_intervals.iter().map(|interval| {
            calcuate_normalize_value(&mut char_map, interval).unwrap_or(0f64)
        }).collect::<Vec<_>>();

        let min_val = float_vec.iter().fold(std::f64::INFINITY, |acc, x| acc.min(*x));
        let max_val = float_vec.iter().fold(std::f64::NEG_INFINITY, |acc, x| acc.max(*x));
        let normalized_vector: Vec<f64> = float_vec.iter().map(|x| (x - min_val) / (max_val - min_val)).collect();

        println!("normalized_vector {:?}", normalized_vector);
        Ok(())
    }

    fn calcuate_normalize_value(char_map: &HashMap<char, u8>, str_value: &str) -> Option<f64> {
        let base = char_map.len() as f64;
        if str_value.is_ascii() {
            let mut sum = 0.0;
            for (i, c) in str_value.chars().enumerate() {
                let char_value = *char_map.get(&c).unwrap() as f64;
                let weight = base.powf((0-i as i32) as f64);
                sum += char_value * weight;
            }
            let normalized_value = sum / char_map.len() as f64;
            Some(normalized_value)
        } else {
            None
        }
    }

@berkaysynnada
Copy link
Contributor

Sounds good. However, I'm not entirely sure about the correctness of the weight calculation. Shouldn't the weight of the leading characters be higher (assuming the order of the strings is as in the dictionary)? Once the strings are accurately mapped to floats, I believe there wouldn't be any problem with calculating selectivity as you've proposed.

@mingmwang
Copy link
Contributor Author

This will output:

normalized_vector [0.0, 0.16394189376465151, 0.35710980460519687, 0.7175591189901976, 0.9999993427698617, 1.0, 1.0]

@mingmwang
Copy link
Contributor Author

Sounds good. However, I'm not entirely sure about the correctness of the weight calculation. Shouldn't the weight of the leading characters be higher (assuming the order of the strings is as in the dictionary)? Once the strings are accurately mapped to floats, I believe there wouldn't be any problem with calculating selectivity as you've proposed.

Yes, you are right. The weigh is calculated based on the position of the char.

@berkaysynnada
Copy link
Contributor

This version seems correct to calculate weights. I don't think there will be any problems and it would improve the interval arithmetic and statistics calculations. Thanks

@alamb
Copy link
Contributor

alamb commented Oct 20, 2023

Started #7882 to collect improvement to interval arithmetic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants