-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial lexing support for real literals following #143. #273
Conversation
Found by fuzz testing.
lexer/tokenized_buffer.cpp
Outdated
for (std::size_t n = source_text.size(); i != n; ++i) { | ||
char c = source_text[i]; | ||
if (llvm::isAlnum(c) || c == '_') { | ||
if (c >= 'a' && c <= 'z' && result.radix_point != llvm::StringRef::npos && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer to point comparisons in the same direction
'a' <= c && c <= 'z'
it makes them read more easily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I find that harder to read, because I interpret the left-hand side of the comparison as being the variable and the right hand side as being the bound (the subject versus the object in the comparison), and it's harder for me to think about 'a'
being the subject in the comparison. So for me, reading this requires learning an extra rule, that I need to look ahead and pattern match against a <= b && b <= c
before deciding how to interpret the a <=
.
I can get used to this style if that's what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Factored out to a separate function; that at least helped with my reading of the rewritten version.
lexer/tokenized_buffer.cpp
Outdated
if (int_text.getAsInteger(/*Radix=*/0, int_value)) { | ||
|
||
auto check_digit_separator_placement = [&](unsigned | ||
remaining_digit_separators) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: weird spacing, maybe force a newline before the capture?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #269.
lexer/tokenized_buffer.cpp
Outdated
}; | ||
|
||
// For decimal and hexadecimal digit sequences, digit separators must form | ||
// groups of 3 or 4 digits (4 or 5 characters), respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment would be more helpful on line 411
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #269.
assert(token_info.kind == TokenKind::RealLiteral() && | ||
"The token must be a real literal!"); | ||
|
||
// Note that every real literal is at least three characters long, so we can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 0.
not a valid real literal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Per #143, a real literal requires at least one digit on each side of the .
.
lexer/tokenized_buffer.h
Outdated
// fraction (mantissa * 10^exponent). | ||
class RealLiteralValue { | ||
const llvm::APInt *mantissa; | ||
const llvm::APInt *exponent; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is worth noting that these are only valid until an additional token is lexed (at which point the storage vectors could resize).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment to explain the lifetime situation here.
pointers to TokenizedBuffer's vector elements.
lexer/tokenized_buffer.cpp
Outdated
}); | ||
// TODO(zygoloid): Update lexical rules to specify that a numeric literal | ||
// cannot be immediately followed by an alphanumeric character. | ||
std::size_t i = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we switch to using a signed integer for all of this? I am made very nervous working with unsigned integers like this. They both have an edge case in the common value and we can't sanitize them effectively.
I would be fine with using int
, but also totally fine with ssize_t
(however we want to acquire that type) which is probably slightly more efficient (sadly).
I'd do this pretty pervasively as I think that'll make the code cleanest. Especially when capturing the size into a local immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. This value is used to initialize radix_point
and exponent
, which can contain the value npos
(which doesn't fit into an int
or ssize_t
). And I make use of the fact that we get npos
here to simplify calls to substr
elsewhere. Switching away from npos
seems feasible, but doesn't seem like idiomatic use of StringRef
/ string_view
to me -- this seems like fighting against the given API -- but I'm OK with that if we want to systematically avoid unsigned types.
Should we include something in the C++ style guide about avoiding unsigned types (even in cases such as this)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed throughout. I'm not sure I like the static_cast<int>
s, nor the introduced possibility of misbehavior for large inputs, but I think this is on balance an improvement. Should we limit input files to 2GB up-front?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I think we should push back against the use of npos
. I find it ... very difficult to reason about. I would find -1
almost easier, but I agree with the direction you're heading by using size as a somewhat easier to reason about sentinel.
On the other topic, we should maybe chat about this in Discord, but I'd be reasonably supportive of moving to ssize_t and getting such a type that is easily accessed here.
I keep finding frustrating code generation quality issues with int
because of compiler limitations anyways.
That said, I'm perfectly happy with the int
code you have now and considering ssize_t
in a follow-up. And even if we use ssize_t
, I'm happy to insist on code not pushing 32-bit offsets so that it is always valid on a 32-bit system to use int
or ssize_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about consistently using int64_t
rather than ssize_t
for file offsets and buffer positions? We're already doing that in some places (eg, the offset in LineInfo
. I suspect we don't care too much about the extra memory usage for 32-bit compiles of the toolchain, given our priorities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM (but I'd suggest a follow-up patch)
By implication this also means stopping using npos to represent positions not found within the string. In order to avoid adding special cases in substr calls, use text.size() instead.
lexer/tokenized_buffer.cpp
Outdated
} | ||
|
||
// Parse a string that is known to be a valid base-radix integer into an APInt. | ||
// If needs_cleaning is true, the string may additionally contain _ and . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stripping _
makes sense, but stripping .
feels weird. Can you expand on the comment a bit to say why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
purpose. Change GetIntegerValue to return a const& for consistency and to avoid forcing a copy on large (>64 bit) integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One meta comment is that I think clang-format needs to get run...
Some of the comments below are really fine to defer -- I've tried to be explicit in the comment, but if not clear don't hesitate to ask for deferring to a follow-up.
lexer/tokenized_buffer.cpp
Outdated
}); | ||
// TODO(zygoloid): Update lexical rules to specify that a numeric literal | ||
// cannot be immediately followed by an alphanumeric character. | ||
std::size_t i = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I think we should push back against the use of npos
. I find it ... very difficult to reason about. I would find -1
almost easier, but I agree with the direction you're heading by using size as a somewhat easier to reason about sentinel.
On the other topic, we should maybe chat about this in Discord, but I'd be reasonably supportive of moving to ssize_t and getting such a type that is easily accessed here.
I keep finding frustrating code generation quality issues with int
because of compiler limitations anyways.
That said, I'm perfectly happy with the int
code you have now and considering ssize_t
in a follow-up. And even if we use ssize_t
, I'm happy to insist on code not pushing 32-bit offsets so that it is always valid on a 32-bit system to use int
or ssize_t
.
we don't need to worry about iterator invalidation.
NumericLiteralParser to better reflect that it's assigning semantics to literals rather than merely checking their morphology.
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with potentially some minor formatting fixes below. One looks like an odd clang-format thing, the other is just member-order matching the style guide.
I'd double check w/ @fowles to make sure he's fine before landing, but if so feel free to submit with the fixes.
lexer/tokenized_buffer.cpp
Outdated
}); | ||
// TODO(zygoloid): Update lexical rules to specify that a numeric literal | ||
// cannot be immediately followed by an alphanumeric character. | ||
std::size_t i = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM (but I'd suggest a follow-up patch)
lexer/tokenized_buffer.cpp
Outdated
#include <algorithm> | ||
#include <bitset> | ||
#include <cmath> | ||
#include <iterator> | ||
#include <string> | ||
|
||
#include "lexer/tokenized_buffer.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did clang-format
do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Looks like my editor integration predates the addition of the -assume-filename
flag so it had no idea this was the corresponding header. :-/ Should be fixed now.
lexer/tokenized_buffer.cpp
Outdated
DiagnosticEmitter& emitter; | ||
NumericLiteral literal; | ||
|
||
// The radix of the literal: 2, 10, or 16, for a prefix of '0b', no prefix, | ||
// or '0x', respectively. | ||
int radix = 10; | ||
|
||
// The various components of a numeric literal: | ||
// | ||
// [radix] int_part [. fract_part [[ep] [+-] exponent_part]] | ||
llvm::StringRef int_part; | ||
llvm::StringRef fract_part; | ||
llvm::StringRef exponent_part; | ||
|
||
// Do we need to remove any special characters (digit separator or radix | ||
// point) before interpreting the mantissa or exponent as an integer? | ||
bool mantissa_needs_cleaning = false; | ||
bool exponent_needs_cleaning = false; | ||
|
||
// True if we found a `-` before `exponent_part`. | ||
bool exponent_is_negative = false; | ||
|
||
// True if we produced an error but recovered. | ||
bool recovered_from_error = false; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Google's style guide would put these below the public
section I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Initial lexing support for real literals. Currently also includes #269. Let me know if you'd like me to follow the stacked PR process here, but I'm assuming #269 will land soon enough that it's not worthwhile.