diff --git a/README.md b/README.md index 9a67abcf8..fd676d115 100644 --- a/README.md +++ b/README.md @@ -100,13 +100,18 @@ similar semantics are represented with the same AST. We welcome PRs to fix such issues and distinguish different syntaxes in the AST. -## WIP: Extracting source locations from AST nodes +## Source Locations (Work in Progress) -This crate allows recovering source locations from AST nodes via the [Spanned](https://docs.rs/sqlparser/latest/sqlparser/ast/trait.Spanned.html) trait, which can be used for advanced diagnostics tooling. Note that this feature is a work in progress and many nodes report missing or inaccurate spans. Please see [this document](./docs/source_spans.md#source-span-contributing-guidelines) for information on how to contribute missing improvements. +This crate allows recovering source locations from AST nodes via the [Spanned] +trait, which can be used for advanced diagnostics tooling. Note that this +feature is a work in progress and many nodes report missing or inaccurate spans. +Please see [this ticket] for information on how to contribute +improvements for missing spans. -```rust -use sqlparser::ast::Spanned; +[Spanned]: https://docs.rs/sqlparser/latest/sqlparser/ast/trait.Spanned.html +[this ticket]: https://github.com/apache/datafusion-sqlparser-rs/issues/1548 +```rust // Parse SQL let ast = Parser::parse_sql(&GenericDialect, "SELECT A FROM B").unwrap(); @@ -123,9 +128,9 @@ SQL was first standardized in 1987, and revisions of the standard have been published regularly since. Most revisions have added significant new features to the language, and as a result no database claims to support the full breadth of features. This parser currently supports most of the SQL-92 syntax, plus some -syntax from newer versions that have been explicitly requested, plus some MSSQL, -PostgreSQL, and other dialect-specific syntax. Whenever possible, the [online -SQL:2016 grammar][sql-2016-grammar] is used to guide what syntax to accept. +syntax from newer versions that have been explicitly requested, plus various +other dialect-specific syntax. Whenever possible, the [online SQL:2016 +grammar][sql-2016-grammar] is used to guide what syntax to accept. Unfortunately, stating anything more specific about compliance is difficult. There is no publicly available test suite that can assess compliance diff --git a/docs/source_spans.md b/docs/source_spans.md deleted file mode 100644 index 136a4ced2..000000000 --- a/docs/source_spans.md +++ /dev/null @@ -1,52 +0,0 @@ - -## Breaking Changes - -These are the current breaking changes introduced by the source spans feature: - -#### Added fields for spans (must be added to any existing pattern matches) -- `Ident` now stores a `Span` -- `Select`, `With`, `Cte`, `WildcardAdditionalOptions` now store a `TokenWithLocation` - -#### Misc. -- `TokenWithLocation` stores a full `Span`, rather than just a source location. Users relying on `token.location` should use `token.location.start` instead. -## Source Span Contributing Guidelines - -For contributing source spans improvement in addition to the general [contribution guidelines](../README.md#contributing), please make sure to pay attention to the following: - - -### Source Span Design Considerations - -- `Ident` always have correct source spans -- Downstream breaking change impact is to be as minimal as possible -- To this end, use recursive merging of spans in favor of storing spans on all nodes -- Any metadata added to compute spans must not change semantics (Eq, Ord, Hash, etc.) 
- -The primary reason for missing and inaccurate source spans at this time is missing spans of keyword tokens and values in many structures, either due to lack of time or because adding them would break downstream significantly. - -When considering adding support for source spans on a type, consider the impact to consumers of that type and whether your change would require a consumer to do non-trivial changes to their code. - -Example of a trivial change -```rust -match node { - ast::Query { - field1, - field2, - location: _, // add a new line to ignored location -} -``` - -If adding source spans to a type would require a significant change like wrapping that type or similar, please open an issue to discuss. - -### AST Node Equality and Hashes - -When adding tokens to AST nodes, make sure to store them using the [AttachedToken](https://docs.rs/sqlparser/latest/sqlparser/ast/helpers/struct.AttachedToken.html) helper to ensure that semantically equivalent AST nodes always compare as equal and hash to the same value. F.e. `select 5` and `SELECT 5` would compare as different `Select` nodes, if the select token was stored directly. f.e. - -```rust -struct Select { - select_token: AttachedToken, // only used for spans - /// remaining fields - field1, - field2, - ... -} -``` \ No newline at end of file diff --git a/src/ast/helpers/attached_token.rs b/src/ast/helpers/attached_token.rs index ed340359d..6b930b513 100644 --- a/src/ast/helpers/attached_token.rs +++ b/src/ast/helpers/attached_token.rs @@ -19,7 +19,7 @@ use core::cmp::{Eq, Ord, Ordering, PartialEq, PartialOrd}; use core::fmt::{self, Debug, Formatter}; use core::hash::{Hash, Hasher}; -use crate::tokenizer::{Token, TokenWithSpan}; +use crate::tokenizer::TokenWithSpan; #[cfg(feature = "serde")] use serde::{Deserialize, Serialize}; @@ -27,17 +27,65 @@ use serde::{Deserialize, Serialize}; #[cfg(feature = "visitor")] use sqlparser_derive::{Visit, VisitMut}; -/// A wrapper type for attaching tokens to AST nodes that should be ignored in comparisons and hashing. -/// This should be used when a token is not relevant for semantics, but is still needed for -/// accurate source location tracking. +/// A wrapper over [`TokenWithSpan`]s that ignores the token and source +/// location in comparisons and hashing. +/// +/// This type is used when the token and location are not relevant for semantics, +/// but are still needed for accurate source location tracking, for example, for +/// nodes in the [ast](crate::ast) module. +/// +/// Note: **All** `AttachedTokens` are equal. 
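+///
+/// All `AttachedTokens` also hash to the same value, because the `Hash`
+/// impl ignores the wrapped token. A minimal sketch of that consequence
+/// (the example, not the impl, assumes `std` is available for the hasher):
+///
+/// ```
+/// # use std::collections::hash_map::DefaultHasher;
+/// # use std::hash::{Hash, Hasher};
+/// # use sqlparser::ast::helpers::attached_token::AttachedToken;
+/// # use sqlparser::tokenizer::{Token, TokenWithSpan};
+/// fn hash_of(token: &AttachedToken) -> u64 {
+///     let mut hasher = DefaultHasher::new();
+///     token.hash(&mut hasher);
+///     hasher.finish()
+/// }
+///
+/// // Different tokens, same hash
+/// let comma = AttachedToken(TokenWithSpan::wrap(Token::Comma));
+/// let period = AttachedToken(TokenWithSpan::wrap(Token::Period));
+/// assert_eq!(hash_of(&comma), hash_of(&period));
+/// ```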
+/// +/// # Examples +/// +/// Same token, different locations are equal +/// ``` +/// # use sqlparser::ast::helpers::attached_token::AttachedToken; +/// # use sqlparser::tokenizer::{Location, Span, Token, TokenWithSpan}; +/// // comma @ line 1, column 10 +/// let tok1 = TokenWithSpan::new( +/// Token::Comma, +/// Span::new(Location::new(1, 10), Location::new(1, 11)), +/// ); +/// // comma @ line 2, column 20 +/// let tok2 = TokenWithSpan::new( +/// Token::Comma, +/// Span::new(Location::new(2, 20), Location::new(2, 21)), +/// ); +/// +/// assert_ne!(tok1, tok2); // tokens with locations are *not* equal +/// assert_eq!(AttachedToken(tok1), AttachedToken(tok2)); // attached tokens are +/// ``` +/// +/// Different tokens, different locations are equal 🤯 +/// +/// ``` +/// # use sqlparser::ast::helpers::attached_token::AttachedToken; +/// # use sqlparser::tokenizer::{Location, Span, Token, TokenWithSpan}; +/// // comma @ line 1, column 10 +/// let tok1 = TokenWithSpan::new( +/// Token::Comma, +/// Span::new(Location::new(1, 10), Location::new(1, 11)), +/// ); +/// // period @ line 2, column 20 +/// let tok2 = TokenWithSpan::new( +/// Token::Period, +/// Span::new(Location::new(2, 20), Location::new(2, 21)), +/// ); +/// +/// assert_ne!(tok1, tok2); // tokens with locations are *not* equal +/// assert_eq!(AttachedToken(tok1), AttachedToken(tok2)); // attached tokens are +/// ``` #[derive(Clone)] #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] pub struct AttachedToken(pub TokenWithSpan); impl AttachedToken { + /// Return a new `AttachedToken` wrapping an EOF token with an empty span pub fn empty() -> Self { - AttachedToken(TokenWithSpan::wrap(Token::EOF)) + AttachedToken(TokenWithSpan::new_eof()) } } @@ -80,3 +128,9 @@ impl From<TokenWithSpan> for AttachedToken { AttachedToken(value) } } + +impl From<AttachedToken> for TokenWithSpan { + fn from(value: AttachedToken) -> Self { + value.0 + } +} diff --git a/src/ast/mod.rs b/src/ast/mod.rs index 6d35badf9..e52251d52 100644 --- a/src/ast/mod.rs +++ b/src/ast/mod.rs @@ -596,9 +596,21 @@ pub enum CeilFloorKind { /// An SQL expression of any type. /// +/// # Semantics / Type Checking +/// /// The parser does not distinguish between expressions of different types -/// (e.g. boolean vs string), so the caller must handle expressions of -/// inappropriate type, like `WHERE 1` or `SELECT 1=1`, as necessary. +/// (e.g. boolean vs string). The caller is responsible for detecting and +/// validating types as necessary (for example `WHERE 1` vs `SELECT 1=1`). +/// See the [README.md] for more details. +/// +/// [README.md]: https://github.com/apache/datafusion-sqlparser-rs/blob/main/README.md#syntax-vs-semantics +/// +/// # Equality and Hashing Do Not Include Source Locations +/// +/// The `Expr` type implements `PartialEq` and `Eq` based on the semantic value +/// of the expression (not bitwise comparison). This means that `Expr` instances +/// that are semantically equivalent but have different spans (locations in the +/// source text) will compare as equal. 
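+///
+/// For example, two ASTs parsed from texts that differ only in whitespace,
+/// and therefore only in source locations, compare as equal. A minimal
+/// sketch of that behavior:
+///
+/// ```
+/// # use sqlparser::dialect::GenericDialect;
+/// # use sqlparser::parser::Parser;
+/// let lhs = Parser::parse_sql(&GenericDialect, "SELECT a = b").unwrap();
+/// // The extra spaces shift the span of every following token...
+/// let rhs = Parser::parse_sql(&GenericDialect, "SELECT     a = b").unwrap();
+/// // ...but the parsed statements, and the `Expr`s inside, are still equal
+/// assert_eq!(lhs, rhs);
+/// ```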
#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)] #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] #[cfg_attr( diff --git a/src/ast/query.rs b/src/ast/query.rs index f3a76d893..ad7fd261e 100644 --- a/src/ast/query.rs +++ b/src/ast/query.rs @@ -282,6 +282,7 @@ impl fmt::Display for Table { pub struct Select { /// Token for the `SELECT` keyword pub select_token: AttachedToken, + /// `SELECT [DISTINCT] ...` pub distinct: Option<Distinct>, /// MSSQL syntax: `TOP (<N>) [ PERCENT ] [ WITH TIES ]` pub top: Option<Top>, @@ -511,7 +512,7 @@ impl fmt::Display for NamedWindowDefinition { #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] pub struct With { - // Token for the "WITH" keyword + /// Token for the "WITH" keyword pub with_token: AttachedToken, pub recursive: bool, pub cte_tables: Vec<Cte>, @@ -564,7 +565,7 @@ pub struct Cte { pub query: Box<Query>, pub from: Option<Ident>, pub materialized: Option<CteAsMaterialized>, - // Token for the closing parenthesis + /// Token for the closing parenthesis pub closing_paren_token: AttachedToken, } diff --git a/src/ast/spans.rs b/src/ast/spans.rs index 8e8c7b14a..1e0f1bf09 100644 --- a/src/ast/spans.rs +++ b/src/ast/spans.rs @@ -21,21 +21,51 @@ use super::{ /// Given an iterator of spans, return the [Span::union] of all spans. fn union_spans<I: Iterator<Item = Span>>(iter: I) -> Span { - iter.reduce(|acc, item| acc.union(&item)) - .unwrap_or(Span::empty()) + Span::union_iter(iter) } -/// A trait for AST nodes that have a source span for use in diagnostics. +/// Trait for AST nodes that have source location information. /// -/// Source spans are not guaranteed to be entirely accurate. They may -/// be missing keywords or other tokens. Some nodes may not have a computable -/// span at all, in which case they return [`Span::empty()`]. +/// # Notes +/// +/// Source [`Span`]s are not yet complete. They may be missing: +/// +/// 1. keywords or other tokens +/// 2. span information entirely, in which case they return [`Span::empty()`]. +/// +/// Note: some impl blocks (rendered below) are annotated with which nodes are +/// missing spans. See [this ticket] for additional information and status. +/// +/// [this ticket]: https://github.com/apache/datafusion-sqlparser-rs/issues/1548 +/// +/// # Example +/// ``` +/// # use sqlparser::parser::{Parser, ParserError}; +/// # use sqlparser::ast::Spanned; +/// # use sqlparser::dialect::GenericDialect; +/// # use sqlparser::tokenizer::Location; +/// # fn main() -> Result<(), ParserError> { +/// let dialect = GenericDialect {}; +/// let sql = r#"SELECT * +/// FROM table_1"#; +/// let statements = Parser::new(&dialect) +/// .try_with_sql(sql)? +/// .parse_statements()?; +/// // Get the span of the first statement (SELECT) +/// let span = statements[0].span(); +/// // statement starts at line 1, column 1 (1 based, not 0 based) +/// assert_eq!(span.start, Location::new(1, 1)); +/// // statement ends on line 2, column 15 +/// assert_eq!(span.end, Location::new(2, 15)); +/// # Ok(()) +/// # } +/// ``` /// -/// Some impl blocks may contain doc comments with information -/// on which nodes are missing spans. pub trait Spanned { - /// Compute the source span for this AST node, by recursively - /// combining the spans of its children. + /// Return the [`Span`] (the minimum and maximum [`Location`]) for this AST + /// node, by recursively combining the spans of its children. 
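+    ///
+    /// Spans compose recursively: a child node's span lies within its
+    /// parent's span, as in the sketch below (subject to the missing-span
+    /// caveats noted on the trait):
+    ///
+    /// ```
+    /// # use sqlparser::parser::Parser;
+    /// # use sqlparser::ast::{Spanned, Statement};
+    /// # use sqlparser::dialect::GenericDialect;
+    /// let statements = Parser::parse_sql(&GenericDialect, "SELECT a, b").unwrap();
+    /// let statement_span = statements[0].span();
+    /// if let Statement::Query(query) = &statements[0] {
+    ///     let body_span = query.body.span();
+    ///     // the `SELECT` body is contained in the overall statement span
+    ///     assert!(statement_span.start <= body_span.start);
+    ///     assert!(body_span.end <= statement_span.end);
+    /// }
+    /// ```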
+ /// + /// [`Location`]: crate::tokenizer::Location fn span(&self) -> Span; } diff --git a/src/lib.rs b/src/lib.rs index 6c8987b63..5d72f9f0e 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -25,6 +25,9 @@ //! 1. [`Parser::parse_sql`] and [`Parser::new`] for the Parsing API //! 2. [`ast`] for the AST structure //! 3. [`Dialect`] for supported SQL dialects +//! 4. [`Spanned`] for source text locations (see "Source Spans" below for details) +//! +//! [`Spanned`]: ast::Spanned //! //! # Example parsing SQL text //! @@ -61,13 +64,67 @@ //! // The original SQL text can be generated from the AST //! assert_eq!(ast[0].to_string(), sql); //! ``` -//! //! [sqlparser crates.io page]: https://crates.io/crates/sqlparser //! [`Parser::parse_sql`]: crate::parser::Parser::parse_sql //! [`Parser::new`]: crate::parser::Parser::new //! [`AST`]: crate::ast //! [`ast`]: crate::ast //! [`Dialect`]: crate::dialect::Dialect +//! +//! # Source Spans +//! +//! Starting with version `0.53.0`, sqlparser introduced source spans to the +//! AST. This feature provides source information for syntax errors, enabling +//! better error messages. See [issue #1548] for more information and the +//! [`Spanned`] trait to access the spans. +//! +//! [issue #1548]: https://github.com/apache/datafusion-sqlparser-rs/issues/1548 +//! [`Spanned`]: ast::Spanned +//! +//! ## Migration Guide +//! +//! For the next few releases, we will be incrementally adding source spans to the +//! AST nodes, trying to minimize the impact on existing users. Some breaking +//! changes are inevitable, and the following is a summary of the changes: +//! +//! #### New fields for spans (must be added to any existing pattern matches) +//! +//! The primary change is that new fields will be added to AST nodes to store the source `Span` or `TokenWithLocation`. +//! +//! This will require: +//! 1. Adding new fields to existing pattern matches. +//! 2. Filling in the proper span information when constructing AST nodes. +//! +//! For example, since `Ident` now stores a `Span`, to construct an `Ident` you +//! must now provide one: +//! +//! Previously: +//! ```text +//! # use sqlparser::ast::Ident; +//! Ident { +//! value: "name".into(), +//! quote_style: None, +//! } +//! ``` +//! Now: +//! ```rust +//! # use sqlparser::ast::Ident; +//! # use sqlparser::tokenizer::Span; +//! Ident { +//! value: "name".into(), +//! quote_style: None, +//! span: Span::empty(), +//! }; +//! ``` +//! +//! Similarly, when pattern matching on `Ident`, you must now account for the +//! `span` field (see the sketch below). +//! +//! #### Misc. +//! - [`TokenWithLocation`] stores a full `Span`, rather than just a source location. +//! Users relying on `token.location` should use `token.location.start` instead. +//! 
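+//! A sketch of the corresponding pattern-match update for `Ident`:
+//!
+//! ```rust
+//! # use sqlparser::ast::Ident;
+//! # fn name_of(ident: Ident) -> String {
+//! let Ident {
+//!     value,
+//!     quote_style: _,
+//!     span: _, // new field: ignore it unless you need source locations
+//! } = ident;
+//! value
+//! # }
+//! ```
+//!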
+//! [`TokenWithLocation`]: tokenizer::TokenWithLocation #![cfg_attr(not(feature = "std"), no_std)] #![allow(clippy::upper_case_acronyms)] diff --git a/src/tokenizer.rs b/src/tokenizer.rs index 7a79445e0..aacfc16fa 100644 --- a/src/tokenizer.rs +++ b/src/tokenizer.rs @@ -422,13 +422,35 @@ impl fmt::Display for Whitespace { } /// Location in input string +/// +/// # Create an "empty" (unknown) `Location` +/// ``` +/// # use sqlparser::tokenizer::Location; +/// let location = Location::empty(); +/// ``` +/// +/// # Create a `Location` from a line and column +/// ``` +/// # use sqlparser::tokenizer::Location; +/// let location = Location::new(1, 1); +/// ``` +/// +/// # Create a `Location` from a pair +/// ``` +/// # use sqlparser::tokenizer::Location; +/// let location = Location::from((1, 1)); +/// ``` #[derive(Eq, PartialEq, Hash, Clone, Copy, Ord, PartialOrd)] #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] pub struct Location { - /// Line number, starting from 1 + /// Line number, starting from 1. + /// + /// Note: Line 0 is used for empty spans pub line: u64, - /// Line column, starting from 1 + /// Line column, starting from 1. + /// + /// Note: Column 0 is used for empty spans pub column: u64, } @@ -448,10 +470,25 @@ impl fmt::Debug for Location { } impl Location { - pub fn of(line: u64, column: u64) -> Self { + /// Return an "empty" / unknown location + pub fn empty() -> Self { + Self { line: 0, column: 0 } + } + + /// Create a new `Location` for a given line and column + pub fn new(line: u64, column: u64) -> Self { Self { line, column } } + /// Create a new location for a given line and column + /// + /// Alias for [`Self::new`] + // TODO: remove / deprecate in favor of `new` for consistency? + pub fn of(line: u64, column: u64) -> Self { + Self::new(line, column) + } + + /// Combine self and `end` into a new `Span` pub fn span_to(self, end: Self) -> Span { Span { start: self, end } } @@ -463,7 +500,9 @@ impl From<(u64, u64)> for Location { } } -/// A span of source code locations (start, end) +/// A span represents a linear portion of the input string (start, end) +/// +/// See [Spanned](crate::ast::Spanned) for more information. #[derive(Eq, PartialEq, Hash, Clone, PartialOrd, Ord, Copy)] #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] @@ -483,12 +522,15 @@ impl Span { // We need a const instance for pattern matching const EMPTY: Span = Self::empty(); + /// Create a new span from a start and end [`Location`] pub fn new(start: Location, end: Location) -> Span { Span { start, end } } - /// Returns an empty span (0, 0) -> (0, 0) + /// Returns an empty span `(0, 0) -> (0, 0)` + /// /// Empty spans represent no knowledge of source location + /// See [Spanned](crate::ast::Spanned) for more information. 
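+    ///
+    /// # Example
+    ///
+    /// An empty span acts as the identity for [`Span::union`] (a sketch of
+    /// the behavior documented on that method):
+    ///
+    /// ```
+    /// # use sqlparser::tokenizer::{Location, Span};
+    /// let real = Location::new(1, 1).span_to(Location::new(1, 8));
+    /// assert_eq!(Span::empty().union(&real), real);
+    /// ```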
pub const fn empty() -> Span { Span { start: Location { line: 0, column: 0 }, @@ -498,6 +540,19 @@ /// Returns the smallest Span that contains both `self` and `other` /// If either span is [Span::empty], the other span is returned + /// + /// # Examples + /// ``` + /// # use sqlparser::tokenizer::{Span, Location}; + /// // line 1, column 1 -> line 2, column 5 + /// let span1 = Span::new(Location::new(1, 1), Location::new(2, 5)); + /// // line 2, column 3 -> line 3, column 7 + /// let span2 = Span::new(Location::new(2, 3), Location::new(3, 7)); + /// // Union of the two is the min/max of the two spans + /// // line 1, column 1 -> line 3, column 7 + /// let union = span1.union(&span2); + /// assert_eq!(union, Span::new(Location::new(1, 1), Location::new(3, 7))); + /// ``` pub fn union(&self, other: &Span) -> Span { // If either span is empty, return the other // this prevents propagating (0, 0) through the tree @@ -512,6 +567,7 @@ } /// Same as [Span::union] for `Option<Span>` + /// /// If `other` is `None`, `self` is returned pub fn union_opt(&self, other: &Option<Span>) -> Span { match other { @@ -519,13 +575,57 @@ None => *self, } } + + /// Return the [Span::union] of all spans in the iterator + /// + /// If the iterator is empty, an empty span is returned + /// + /// # Example + /// ``` + /// # use sqlparser::tokenizer::{Span, Location}; + /// let spans = vec![ + /// Span::new(Location::new(1, 1), Location::new(2, 5)), + /// Span::new(Location::new(2, 3), Location::new(3, 7)), + /// Span::new(Location::new(3, 1), Location::new(4, 2)), + /// ]; + /// // line 1, column 1 -> line 4, column 2 + /// assert_eq!( + /// Span::union_iter(spans), + /// Span::new(Location::new(1, 1), Location::new(4, 2)) + /// ); + /// ``` + pub fn union_iter<I: IntoIterator<Item = Span>>(iter: I) -> Span { + iter.into_iter() + .reduce(|acc, item| acc.union(&item)) + .unwrap_or(Span::empty()) + } } /// Backwards compatibility struct for [`TokenWithSpan`] #[deprecated(since = "0.53.0", note = "please use `TokenWithSpan` instead")] pub type TokenWithLocation = TokenWithSpan; -/// A [Token] with [Location] attached to it +/// A [Token] with [Span] attached to it +/// +/// This is used to track the location of a token in the input string +/// +/// # Examples +/// ``` +/// # use sqlparser::tokenizer::{Location, Span, Token, TokenWithSpan}; +/// // comma @ line 1, column 10 +/// let tok1 = TokenWithSpan::new( +/// Token::Comma, +/// Span::new(Location::new(1, 10), Location::new(1, 11)), +/// ); +/// assert_eq!(tok1, Token::Comma); // can compare the token +/// +/// // comma @ line 2, column 20 +/// let tok2 = TokenWithSpan::new( +/// Token::Comma, +/// Span::new(Location::new(2, 20), Location::new(2, 21)), +/// ); +/// // same token but different locations are not equal +/// assert_ne!(tok1, tok2); +/// ``` #[derive(Debug, Clone, Hash, Ord, PartialOrd, Eq, PartialEq)] #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] @@ -535,16 +635,24 @@ pub struct TokenWithSpan { } impl TokenWithSpan { - pub fn new(token: Token, span: Span) -> TokenWithSpan { - TokenWithSpan { token, span } + /// Create a new [`TokenWithSpan`] from a [`Token`] and a [`Span`] + pub fn new(token: Token, span: Span) -> Self { + Self { token, span } + } + + /// Wrap a token with an empty span + pub fn wrap(token: Token) -> Self { + Self::new(token, Span::empty()) } - pub fn wrap(token: Token) -> TokenWithSpan { - TokenWithSpan::new(token, Span::empty()) + /// Wrap a token with a location from `start` 
to `end` + pub fn at(token: Token, start: Location, end: Location) -> Self { + Self::new(token, Span::new(start, end)) } - pub fn at(token: Token, start: Location, end: Location) -> TokenWithSpan { - TokenWithSpan::new(token, Span::new(start, end)) + /// Return an EOF token with no location + pub fn new_eof() -> Self { + Self::wrap(Token::EOF) } }
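A quick sketch of the new `TokenWithSpan` constructors working together (the tokens and locations here are illustrative only):

```rust
use sqlparser::tokenizer::{Location, Span, Token, TokenWithSpan};

fn main() {
    // `at` attaches an explicit start/end location to a token
    let comma = TokenWithSpan::at(Token::Comma, Location::new(1, 10), Location::new(1, 11));
    assert_eq!(comma.span, Span::new(Location::new(1, 10), Location::new(1, 11)));

    // `wrap` attaches an empty (unknown) span
    let semicolon = TokenWithSpan::wrap(Token::SemiColon);
    assert_eq!(semicolon.span, Span::empty());

    // `new_eof` is shorthand for wrapping `Token::EOF`
    assert_eq!(TokenWithSpan::new_eof().token, Token::EOF);
}
```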