Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial lexing support for real literals following #143. #273

Merged
merged 28 commits into from
Feb 27, 2021
Merged

Initial lexing support for real literals following #143. #273

merged 28 commits into from
Feb 27, 2021

Conversation

zygoloid
Copy link
Contributor

Initial lexing support for real literals. Currently also includes #269. Let me know if you'd like me to follow the stacked PR process here, but I'm assuming #269 will land soon enough that it's not worthwhile.

@zygoloid zygoloid requested review from fowles and chandlerc February 18, 2021 00:49
@google-cla google-cla bot added the cla: yes PR meets CLA requirements according to bot. label Feb 18, 2021
for (std::size_t n = source_text.size(); i != n; ++i) {
char c = source_text[i];
if (llvm::isAlnum(c) || c == '_') {
if (c >= 'a' && c <= 'z' && result.radix_point != llvm::StringRef::npos &&
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer to point comparisons in the same direction

'a' <= c && c <= 'z'

it makes them read more easily

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I find that harder to read, because I interpret the left-hand side of the comparison as being the variable and the right hand side as being the bound (the subject versus the object in the comparison), and it's harder for me to think about 'a' being the subject in the comparison. So for me, reading this requires learning an extra rule, that I need to look ahead and pattern match against a <= b && b <= c before deciding how to interpret the a <=.

I can get used to this style if that's what we want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Factored out to a separate function; that at least helped with my reading of the rewritten version.

if (int_text.getAsInteger(/*Radix=*/0, int_value)) {

auto check_digit_separator_placement = [&](unsigned
remaining_digit_separators) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: weird spacing, maybe force a newline before the capture?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #269.

};

// For decimal and hexadecimal digit sequences, digit separators must form
// groups of 3 or 4 digits (4 or 5 characters), respectively.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment would be more helpful on line 411

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #269.

assert(token_info.kind == TokenKind::RealLiteral() &&
"The token must be a real literal!");

// Note that every real literal is at least three characters long, so we can
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 0. not a valid real literal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Per #143, a real literal requires at least one digit on each side of the ..

// fraction (mantissa * 10^exponent).
class RealLiteralValue {
const llvm::APInt *mantissa;
const llvm::APInt *exponent;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is worth noting that these are only valid until an additional token is lexed (at which point the storage vectors could resize).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment to explain the lifetime situation here.

});
// TODO(zygoloid): Update lexical rules to specify that a numeric literal
// cannot be immediately followed by an alphanumeric character.
std::size_t i = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we switch to using a signed integer for all of this? I am made very nervous working with unsigned integers like this. They both have an edge case in the common value and we can't sanitize them effectively.

I would be fine with using int, but also totally fine with ssize_t (however we want to acquire that type) which is probably slightly more efficient (sadly).

I'd do this pretty pervasively as I think that'll make the code cleanest. Especially when capturing the size into a local immediately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. This value is used to initialize radix_point and exponent, which can contain the value npos (which doesn't fit into an int or ssize_t). And I make use of the fact that we get npos here to simplify calls to substr elsewhere. Switching away from npos seems feasible, but doesn't seem like idiomatic use of StringRef / string_view to me -- this seems like fighting against the given API -- but I'm OK with that if we want to systematically avoid unsigned types.

Should we include something in the C++ style guide about avoiding unsigned types (even in cases such as this)?

Copy link
Contributor Author

@zygoloid zygoloid Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed throughout. I'm not sure I like the static_cast<int>s, nor the introduced possibility of misbehavior for large inputs, but I think this is on balance an improvement. Should we limit input files to 2GB up-front?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I think we should push back against the use of npos. I find it ... very difficult to reason about. I would find -1 almost easier, but I agree with the direction you're heading by using size as a somewhat easier to reason about sentinel.

On the other topic, we should maybe chat about this in Discord, but I'd be reasonably supportive of moving to ssize_t and getting such a type that is easily accessed here.

I keep finding frustrating code generation quality issues with int because of compiler limitations anyways.

That said, I'm perfectly happy with the int code you have now and considering ssize_t in a follow-up. And even if we use ssize_t, I'm happy to insist on code not pushing 32-bit offsets so that it is always valid on a 32-bit system to use int or ssize_t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about consistently using int64_t rather than ssize_t for file offsets and buffer positions? We're already doing that in some places (eg, the offset in LineInfo. I suspect we don't care too much about the extra memory usage for 32-bit compiles of the toolchain, given our priorities.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM (but I'd suggest a follow-up patch)

lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.h Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
By implication this also means stopping using npos to represent
positions not found within the string. In order to avoid adding special
cases in substr calls, use text.size() instead.
}

// Parse a string that is known to be a valid base-radix integer into an APInt.
// If needs_cleaning is true, the string may additionally contain _ and .
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stripping _ makes sense, but stripping . feels weird. Can you expand on the comment a bit to say why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
purpose.

Change GetIntegerValue to return a const& for consistency and to avoid
forcing a copy on large (>64 bit) integers.
Copy link
Contributor

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One meta comment is that I think clang-format needs to get run...

Some of the comments below are really fine to defer -- I've tried to be explicit in the comment, but if not clear don't hesitate to ask for deferring to a follow-up.

});
// TODO(zygoloid): Update lexical rules to specify that a numeric literal
// cannot be immediately followed by an alphanumeric character.
std::size_t i = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I think we should push back against the use of npos. I find it ... very difficult to reason about. I would find -1 almost easier, but I agree with the direction you're heading by using size as a somewhat easier to reason about sentinel.

On the other topic, we should maybe chat about this in Discord, but I'd be reasonably supportive of moving to ssize_t and getting such a type that is easily accessed here.

I keep finding frustrating code generation quality issues with int because of compiler limitations anyways.

That said, I'm perfectly happy with the int code you have now and considering ssize_t in a follow-up. And even if we use ssize_t, I'm happy to insist on code not pushing 32-bit offsets so that it is always valid on a 32-bit system to use int or ssize_t.

lexer/tokenized_buffer.h Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer_test.cpp Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
@zygoloid
Copy link
Contributor Author

One meta comment is that I think clang-format needs to get run...

Done.

Copy link
Contributor

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with potentially some minor formatting fixes below. One looks like an odd clang-format thing, the other is just member-order matching the style guide.

I'd double check w/ @fowles to make sure he's fine before landing, but if so feel free to submit with the fixes.

});
// TODO(zygoloid): Update lexical rules to specify that a numeric literal
// cannot be immediately followed by an alphanumeric character.
std::size_t i = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM (but I'd suggest a follow-up patch)

#include <algorithm>
#include <bitset>
#include <cmath>
#include <iterator>
#include <string>

#include "lexer/tokenized_buffer.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did clang-format do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Looks like my editor integration predates the addition of the -assume-filename flag so it had no idea this was the corresponding header. :-/ Should be fixed now.

Comment on lines 238 to 262
DiagnosticEmitter& emitter;
NumericLiteral literal;

// The radix of the literal: 2, 10, or 16, for a prefix of '0b', no prefix,
// or '0x', respectively.
int radix = 10;

// The various components of a numeric literal:
//
// [radix] int_part [. fract_part [[ep] [+-] exponent_part]]
llvm::StringRef int_part;
llvm::StringRef fract_part;
llvm::StringRef exponent_part;

// Do we need to remove any special characters (digit separator or radix
// point) before interpreting the mantissa or exponent as an integer?
bool mantissa_needs_cleaning = false;
bool exponent_needs_cleaning = false;

// True if we found a `-` before `exponent_part`.
bool exponent_is_negative = false;

// True if we produced an error but recovered.
bool recovered_from_error = false;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google's style guide would put these below the public section I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@zygoloid zygoloid merged commit bbcc31a into carbon-language:trunk Feb 27, 2021
@zygoloid zygoloid deleted the real-literals branch February 27, 2021 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes PR meets CLA requirements according to bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants