Fix the parsing result for the special double number #1621

Merged
merged 14 commits on Dec 28, 2024
4 changes: 4 additions & 0 deletions src/dialect/bigquery.rs
@@ -77,4 +77,8 @@ impl Dialect for BigQueryDialect {
fn supports_struct_literal(&self) -> bool {
true
}

fn support_unquoted_hyphenated_identifiers(&self) -> bool {
true
}
}
9 changes: 9 additions & 0 deletions src/dialect/mod.rs
@@ -762,6 +762,15 @@ pub trait Dialect: Debug + Any {
fn supports_table_sample_before_alias(&self) -> bool {
false
}

/// Returns true if the dialect supports unquoted hyphenated identifiers
///
/// For example, BigQuery supports the following query:
/// `SELECT * FROM my-project.my_dataset.my_table`
/// where the project name `my-project` is an unquoted hyphenated identifier.
fn support_unquoted_hyphenated_identifiers(&self) -> bool {
false
}
}

/// This represents the operators for which precedence must be defined
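A minimal sketch of how the new dialect flag surfaces through the crate's public API, using the query from the doc comment above. This is illustrative only and not part of the diff; it assumes the usual `Parser::parse_sql` entry point and the existing `BigQueryDialect`:

```rust
use sqlparser::dialect::BigQueryDialect;
use sqlparser::parser::Parser;

fn main() {
    // BigQueryDialect opts into unquoted hyphenated identifiers, so the
    // hyphenated project name is kept as part of the table's object name.
    let sql = "SELECT * FROM my-project.my_dataset.my_table";
    let statements =
        Parser::parse_sql(&BigQueryDialect {}, sql).expect("BigQuery accepts hyphenated names");
    println!("{statements:?}");
}
```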
30 changes: 21 additions & 9 deletions src/parser/mod.rs
@@ -978,13 +978,6 @@ impl<'a> Parser<'a> {
let _guard = self.recursion_counter.try_decrease()?;
debug!("parsing expr");
let mut expr = self.parse_prefix()?;
// Attempt to parse composite access. Example `SELECT f(x).a`
while self.consume_token(&Token::Period) {
expr = Expr::CompositeAccess {
expr: Box::new(expr),
key: self.parse_identifier(false)?,
}
}

debug!("prefix: {:?}", expr);
loop {
@@ -1150,7 +1143,8 @@ impl<'a> Parser<'a> {
Ok(Some(self.parse_match_against()?))
}
Keyword::STRUCT if self.dialect.supports_struct_literal() => {
Ok(Some(self.parse_struct_literal()?))
let struct_expr = self.parse_struct_literal()?;
Ok(Some(self.parse_compound_field_access(struct_expr, vec![])?))
}
Keyword::PRIOR if matches!(self.state, ParserState::ConnectBy) => {
let expr = self.parse_subexpr(self.dialect.prec_value(Precedence::PlusMinus))?;
@@ -1397,7 +1391,25 @@
}
};
self.expect_token(&Token::RParen)?;
self.try_parse_method(expr)
let expr = self.try_parse_method(expr)?;
if !self.consume_token(&Token::Period) {
Ok(expr)
} else {
let tok = self.next_token();
let key = match tok.token {
Token::Word(word) => word.to_ident(tok.span),
_ => {
return parser_err!(
format!("Expected identifier, found: {tok}"),
tok.span.start
)
}
};
Ok(Expr::CompositeAccess {
expr: Box::new(expr),
key,
})
}
}
Token::Placeholder(_) | Token::Colon | Token::AtSign => {
self.prev_token();
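To make the parser change above concrete, here is a hedged example of the kind of expression this hunk is meant to handle: a field access chained onto a struct literal. The exact AST shape (CompositeAccess vs. CompoundFieldAccess) is covered by the tests further down; this sketch only shows the entry point and is not taken from the PR:

```rust
use sqlparser::dialect::BigQueryDialect;
use sqlparser::parser::Parser;

fn main() {
    // With this change, the `.a` following the STRUCT literal is parsed as
    // part of the same expression rather than being rejected at the period.
    let sql = "SELECT STRUCT(1 AS a, 2 AS b).a";
    let statements =
        Parser::parse_sql(&BigQueryDialect {}, sql).expect("struct field access parses");
    println!("{statements:#?}");
}
```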
80 changes: 47 additions & 33 deletions src/tokenizer.rs
@@ -1141,33 +1141,44 @@ impl<'a> Tokenizer<'a> {
let s2 = peeking_take_while(chars, |ch| ch.is_ascii_hexdigit());
return Ok(Some(Token::HexStringLiteral(s2)));
}

// match one period
if let Some('.') = chars.peek() {
// Check if this actually is a float point number
let mut char_clone = chars.peekable.clone();
char_clone.next();
// Next char should be a digit, otherwise, it is not a float point number
if char_clone
.peek()
.map(|c| c.is_ascii_digit())
.unwrap_or(false)
{
if self.dialect.support_unquoted_hyphenated_identifiers() {
Contributor

@goldmedal @ayman-sigma
I'm wondering if it would make sense to move this special case handling from the tokenizer into the parser. Since this is a BigQuery-specific feature and only applies in a specific context, I feel like it would be more complicated to pull off within the tokenizer without affecting other use cases/dialects. In the parser we have more context on when this syntax is valid, and a lot of that work was done in #1109.
(I'm imagining that since the tokenizer would be left to parse decimals as usual, we won't have the concern raised in #1622?)

Essentially I'm wondering if we can extend that work to cover this double number scenario. Using the example from #1598 it would essentially mean that e.g. the following token stream

[Word("foo"), Minus, Number("123."), Word("bar")]

gets combined into

ObjectName(vec![Word("foo-123"), Word("bar")])

would something like that be feasible?

Contributor Author

Essentially I'm wondering if we can extend that work to cover this double number scenario. Using the example from #1598 it would essentially mean that e.g. the following token stream

[Word("foo"), Minus, Number("123."), Word("bar")]

gets combined into

ObjectName(vec![Word("foo-123"), Word("bar")])

would something like that be feasible?

It's a good point. "Keeping the tokenizer parsing decimals as usual" makes sense to me.
I'll try it. Thanks.
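For readers skimming the thread, here is a standalone sketch of the token-combining idea discussed above. The helper name and shape are hypothetical (this is neither the PR's code nor the approach from #1109 verbatim); it only illustrates turning the token stream from the comment into a hyphenated name plus a trailing member:

```rust
use sqlparser::tokenizer::Token;

/// Hypothetical illustration: combine [Word("foo"), Minus, Number("123.", _), Word("bar")]
/// into ("foo-123", "bar"), i.e. the hyphenated name and the part after the dot.
fn combine_hyphenated(tokens: &[Token]) -> Option<(String, String)> {
    match tokens {
        [Token::Word(first), Token::Minus, Token::Number(num, _), Token::Word(rest)] => {
            // The trailing '.' on the number is really the object-name separator,
            // so strip it before gluing the pieces back together.
            let num = num.strip_suffix('.').unwrap_or(num.as_str());
            Some((format!("{}-{}", first.value, num), rest.value.clone()))
        }
        _ => None,
    }
}
```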

Contributor Author

Addressed it in b1d02f2
The double-value tests are enabled for BigQuery.

if let Some('.') = chars.peek() {
// Check if this actually is a float point number
let mut char_clone = chars.peekable.clone();
char_clone.next();
// Next char should be a digit, otherwise, it is not a float point number
if char_clone
.peek()
.map(|c| c.is_ascii_digit())
.unwrap_or(false)
{
s.push('.');
chars.next();
} else if !s.is_empty() {
// Number might be part of period separated construct. Keep the period for next token
// e.g. a-12.b
return Ok(Some(Token::Number(s, false)));
} else {
// No number -> Token::Period
chars.next();
return Ok(Some(Token::Period));
}
}

s += &peeking_take_while(chars, |ch| ch.is_ascii_digit());
} else {
if let Some('.') = chars.peek() {
s.push('.');
chars.next();
} else if !s.is_empty() {
// Number might be part of period separated construct. Keep the period for next token
// e.g. a-12.b
return Ok(Some(Token::Number(s, false)));
} else {
// No number -> Token::Period
chars.next();
}
s += &peeking_take_while(chars, |ch| ch.is_ascii_digit());

// No number -> Token::Period
if s == "." {
return Ok(Some(Token::Period));
}
}

s += &peeking_take_while(chars, |ch| ch.is_ascii_digit());

let mut exponent_part = String::new();
// Parse exponent as number
if chars.peek() == Some(&'e') || chars.peek() == Some(&'E') {
@@ -2154,6 +2165,7 @@ mod tests {
BigQueryDialect, ClickHouseDialect, HiveDialect, MsSqlDialect, MySqlDialect,
};
use core::fmt::Debug;
use sqlparser::test_utils::all_dialects_where;

#[test]
fn tokenizer_error_impl() {
@@ -2202,18 +2214,20 @@
#[test]
fn tokenize_select_float_hyphenated_identifier() {
let sql = String::from("SELECT a-12.b");
let dialect = GenericDialect {};
let tokens = Tokenizer::new(&dialect, &sql).tokenize().unwrap();
let expected = vec![
Token::make_keyword("SELECT"),
Token::Whitespace(Whitespace::Space),
Token::make_word("a", None),
Token::Minus,
Token::Number(String::from("12"), false),
Token::Period,
Token::make_word("b", None),
];
compare(expected, tokens);
let dialects = all_dialects_where(|d| d.support_unquoted_hyphenated_identifiers());
for dialect in dialects.dialects {
let tokens = Tokenizer::new(dialect.as_ref(), &sql).tokenize().unwrap();
let expected = vec![
Token::make_keyword("SELECT"),
Token::Whitespace(Whitespace::Space),
Token::make_word("a", None),
Token::Minus,
Token::Number(String::from("12"), false),
Token::Period,
Token::make_word("b", None),
];
compare(expected, tokens);
}
}

#[test]
144 changes: 144 additions & 0 deletions tests/sqlparser_common.rs
@@ -2964,6 +2964,115 @@ fn test_compound_expr() {
}
}

#[test]
fn test_double_value() {
// TODO: support double value for dialect that supports unquoted hyphenated identifiers
// see issue: https://github.com/apache/datafusion-sqlparser-rs/issues/1622
let dialects = all_dialects_where(|dialect| !dialect.support_unquoted_hyphenated_identifiers());
let test_cases = vec![
gen_number_case_with_sign("0."),
gen_number_case_with_sign("0.0"),
gen_number_case_with_sign("0000."),
gen_number_case_with_sign("0000.00"),
gen_number_case_with_sign(".0"),
gen_number_case_with_sign(".00"),
gen_number_case_with_sign("0e0"),
gen_number_case_with_sign("0e+0"),
gen_number_case_with_sign("0e-0"),
gen_number_case_with_sign("0.e-0"),
gen_number_case_with_sign("0.e+0"),
gen_number_case_with_sign(".0e-0"),
gen_number_case_with_sign(".0e+0"),
gen_number_case_with_sign("00.0e+0"),
gen_number_case_with_sign("00.0e-0"),
];

for (input, expected) in test_cases {
for (i, expr) in input.iter().enumerate() {
if let Statement::Query(query) =
dialects.one_statement_parses_to(&format!("SELECT {}", expr), "")
{
if let SetExpr::Select(select) = *query.body {
assert_eq!(expected[i], select.projection[0]);
} else {
panic!("Expected a SELECT statement");
}
} else {
panic!("Expected a SELECT statement");
}
}
}
}

fn gen_number_case(value: &str) -> (Vec<String>, Vec<SelectItem>) {
let input = vec![
value.to_string(),
format!("{} col_alias", value),
format!("{} AS col_alias", value),
];
let expected = vec![
SelectItem::UnnamedExpr(Expr::Value(number(value))),
SelectItem::ExprWithAlias {
expr: Expr::Value(number(value)),
alias: Ident::new("col_alias"),
},
SelectItem::ExprWithAlias {
expr: Expr::Value(number(value)),
alias: Ident::new("col_alias"),
},
];
(input, expected)
}

fn gen_sign_number_case(value: &str, op: UnaryOperator) -> (Vec<String>, Vec<SelectItem>) {
match op {
UnaryOperator::Plus | UnaryOperator::Minus => {}
_ => panic!("Invalid sign"),
}

let input = vec![
format!("{}{}", op, value),
format!("{}{} col_alias", op, value),
format!("{}{} AS col_alias", op, value),
];
let expected = vec![
SelectItem::UnnamedExpr(Expr::UnaryOp {
op,
expr: Box::new(Expr::Value(number(value))),
}),
SelectItem::ExprWithAlias {
expr: Expr::UnaryOp {
op,
expr: Box::new(Expr::Value(number(value))),
},
alias: Ident::new("col_alias"),
},
SelectItem::ExprWithAlias {
expr: Expr::UnaryOp {
op,
expr: Box::new(Expr::Value(number(value))),
},
alias: Ident::new("col_alias"),
},
];
(input, expected)
}

/// generate the test cases for signed and unsigned numbers
/// For example, given "0.0", the test cases will be:
/// - "0.0"
/// - "+0.0"
/// - "-0.0"
fn gen_number_case_with_sign(number: &str) -> (Vec<String>, Vec<SelectItem>) {
let (mut input, mut expected) = gen_number_case(number);
for op in [UnaryOperator::Plus, UnaryOperator::Minus] {
let (input_sign, expected_sign) = gen_sign_number_case(number, op);
input.extend(input_sign);
expected.extend(expected_sign);
}
(input, expected)
}

#[test]
fn parse_negative_value() {
let sql1 = "SELECT -1";
@@ -12470,6 +12579,41 @@ fn parse_composite_access_expr() {
all_dialects_where(|d| d.supports_struct_literal()).verified_stmt(
"SELECT * FROM t WHERE STRUCT(STRUCT(1 AS a, NULL AS b) AS c, NULL AS d).c.a IS NOT NULL",
);
let support_struct = all_dialects_where(|d| d.supports_struct_literal());
let stmt = support_struct
.verified_only_select("SELECT STRUCT(STRUCT(1 AS a, NULL AS b) AS c, NULL AS d).c.a");
let expected = SelectItem::UnnamedExpr(Expr::CompoundFieldAccess {
Comment on lines +12580 to +12583
Contributor Author

The access chain for the struct literal is represented by CompoundFieldAccess currently. See #1551 for the details.
cc @ayman-sigma

root: Box::new(Expr::Struct {
values: vec![
Expr::Named {
name: Ident::new("c"),
expr: Box::new(Expr::Struct {
values: vec![
Expr::Named {
name: Ident::new("a"),
expr: Box::new(Expr::Value(Number("1".parse().unwrap(), false))),
},
Expr::Named {
name: Ident::new("b"),
expr: Box::new(Expr::Value(Value::Null)),
},
],
fields: vec![],
}),
},
Expr::Named {
name: Ident::new("d"),
expr: Box::new(Expr::Value(Value::Null)),
},
],
fields: vec![],
}),
access_chain: vec![
AccessExpr::Dot(Expr::Identifier(Ident::new("c"))),
AccessExpr::Dot(Expr::Identifier(Ident::new("a"))),
],
});
assert_eq!(stmt.projection[0], expected);
}

#[test]