Initial lexing support for real literals following #143. #273

zygoloid · 2021-02-18T00:49:27Z

Initial lexing support for real literals. Currently also includes #269. Let me know if you'd like me to follow the stacked PR process here, but I'm assuming #269 will land soon enough that it's not worthwhile.

fowles · 2021-02-18T03:30:02Z

lexer/tokenized_buffer.cpp

+  for (std::size_t n = source_text.size(); i != n; ++i) {
+    char c = source_text[i];
+    if (llvm::isAlnum(c) || c == '_') {
+      if (c >= 'a' && c <= 'z' && result.radix_point != llvm::StringRef::npos &&


prefer to point comparisons in the same direction

'a' <= c && c <= 'z'

it makes them read more easily

Interesting. I find that harder to read, because I interpret the left-hand side of the comparison as the variable and the right-hand side as the bound (the subject versus the object of the comparison), and it's harder for me to think of 'a' as the subject. So for me, reading this requires learning an extra rule: I need to look ahead and pattern match against a <= b && b <= c before deciding how to interpret the a <=.

I can get used to this style if that's what we want.

Factored out to a separate function; that at least helped with my reading of the rewritten version.
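For illustration, the "factored out" resolution might look roughly like the sketch below. The helper name IsLower is an assumption; the actual function added in the PR is not shown in this thread.

```cpp
// Hypothetical helper; the name IsLower is an assumption, not taken from the
// PR. Both comparisons point the same way, so the bounds read left to right
// like positions on a number line.
constexpr bool IsLower(char c) { return 'a' <= c && c <= 'z'; }
```

Giving the range check a name means the call site no longer has to be pattern matched, whichever comparison order the body uses.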

fowles · 2021-02-18T03:39:08Z

lexer/tokenized_buffer.cpp

-    if (int_text.getAsInteger(/*Radix=*/0, int_value)) {
+
+    auto check_digit_separator_placement = [&](unsigned
+                                                   remaining_digit_separators) {


nit: weird spacing, maybe force a newline before the capture?

Fixed in #269.

fowles · 2021-02-18T03:40:10Z

lexer/tokenized_buffer.cpp

+    };
+
+    // For decimal and hexadecimal digit sequences, digit separators must form
+    // groups of 3 or 4 digits (4 or 5 characters), respectively.


I think this comment would be more helpful on line 411

Fixed in #269.
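As a rough sketch of the rule in the quoted comment (groups of 3 digits for decimal, 4 for hexadecimal), the check could be written as below. The function name, the use of std::string_view in place of llvm::StringRef, and the choice to leave other radixes unconstrained are all assumptions made for illustration; the real lambda lives in #269.

```cpp
#include <string_view>

// Sketch only: returns true if every '_' in `digits` sits exactly on a group
// boundary when counting from the right-hand end.
// Assumption: radix 10 uses groups of 3 digits, radix 16 uses groups of 4,
// and any other radix places no constraint on separators.
bool HasValidSeparatorPlacement(std::string_view digits, int radix) {
  int stride = radix == 10 ? 4 : radix == 16 ? 5 : 0;  // digits plus one '_'
  if (stride == 0 || digits.find('_') == std::string_view::npos) {
    return true;
  }
  if (digits.front() == '_') {
    return false;  // A separator must follow at least one digit.
  }
  int size = static_cast<int>(digits.size());
  for (int i = 0; i < size; ++i) {
    // Position counted from the right, starting at 1.
    int from_right = size - i;
    bool should_be_separator = from_right % stride == 0;
    if ((digits[i] == '_') != should_be_separator) {
      return false;
    }
  }
  return true;
}
```

The "4 or 5 characters" in the comment corresponds to `stride` here: one group of digits plus its separator.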

fowles · 2021-02-18T03:57:20Z

lexer/tokenized_buffer.cpp

+  assert(token_info.kind == TokenKind::RealLiteral() &&
+         "The token must be a real literal!");
+
+  // Note that every real literal is at least three characters long, so we can


is 0. not a valid real literal?

No. Per #143, a real literal requires at least one digit on each side of the '.'.
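To make the "at least three characters" claim concrete, here is a minimal sketch of the shape being described. It is not the lexer's code: the function name is hypothetical, and it handles only plain decimal digits around a single radix point.

```cpp
#include <string_view>

// Sketch only: accepts text of the form digits '.' digits, with at least one
// decimal digit on each side. Radix prefixes, exponents, and digit separators
// are deliberately ignored here.
bool IsSimpleRealLiteral(std::string_view text) {
  auto is_digit = [](char c) { return '0' <= c && c <= '9'; };
  int size = static_cast<int>(text.size());
  int i = 0;
  while (i < size && is_digit(text[i])) ++i;
  int int_digits = i;
  if (i == size || text[i] != '.') return false;
  ++i;
  int fract_start = i;
  while (i < size && is_digit(text[i])) ++i;
  return int_digits > 0 && i > fract_start && i == size;
}
```

Under this rule the shortest accepted input is one digit, the radix point, and one digit, so "0." and ".5" are both rejected.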

fowles · 2021-02-18T04:00:32Z

lexer/tokenized_buffer.h

+  // fraction (mantissa * 10^exponent).
+  class RealLiteralValue {
+    const llvm::APInt *mantissa;
+    const llvm::APInt *exponent;


it is worth noting that these are only valid until an additional token is lexed (at which point the storage vectors could resize).

I've added a comment to explain the lifetime situation here.
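The hazard raised here is that a pointer to a vector element dangles once the vector reallocates. One common alternative is to hold indices instead, sketched below. All names are hypothetical, and std::int64_t stands in for llvm::APInt; this is not the representation the PR ended up with, only an illustration of the lifetime issue.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the buffer's literal storage; the real code stores llvm::APInt.
struct LiteralStorage {
  std::vector<std::int64_t> values;

  // Returns an index, which stays meaningful however much the vector grows.
  int Add(std::int64_t value) {
    values.push_back(value);
    return static_cast<int>(values.size()) - 1;
  }
};

// Hypothetical handle: holds indices and reads values out on demand, so it
// never dangles when `values` reallocates.
struct RealLiteralHandle {
  int mantissa_index;
  int exponent_index;

  std::int64_t Mantissa(const LiteralStorage& storage) const {
    return storage.values[mantissa_index];
  }
  std::int64_t Exponent(const LiteralStorage& storage) const {
    return storage.values[exponent_index];
  }
};
```

A handle created before further values are added still reads back the same mantissa and exponent afterwards, which a raw `const T*` into the vector would not guarantee.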


chandlerc · 2021-02-20T03:43:16Z

lexer/tokenized_buffer.cpp

-  });
+  // TODO(zygoloid): Update lexical rules to specify that a numeric literal
+  // cannot be immediately followed by an alphanumeric character.
+  std::size_t i = 1;


Could we switch to using a signed integer for all of this? I am made very nervous working with unsigned integers like this. They both have an edge case in the common value and we can't sanitize them effectively.

I would be fine with using int, but also totally fine with ssize_t (however we want to acquire that type) which is probably slightly more efficient (sadly).

I'd do this pretty pervasively as I think that'll make the code cleanest. Especially when capturing the size into a local immediately.

Hm. This value is used to initialize radix_point and exponent, which can contain the value npos (which doesn't fit into an int or ssize_t). And I make use of the fact that we get npos here to simplify calls to substr elsewhere. Switching away from npos seems feasible, but doesn't seem like idiomatic use of StringRef / string_view to me -- this seems like fighting against the given API -- but I'm OK with that if we want to systematically avoid unsigned types.

Should we include something in the C++ style guide about avoiding unsigned types (even in cases such as this)?

Addressed throughout. I'm not sure I like the static_cast<int>s, nor the introduced possibility of misbehavior for large inputs, but I think this is on balance an improvement. Should we limit input files to 2GB up-front?

FWIW, I think we should push back against the use of npos. I find it ... very difficult to reason about. I would find -1 almost easier, but I agree with the direction you're heading by using size as a somewhat easier to reason about sentinel.

On the other topic, we should maybe chat about this in Discord, but I'd be reasonably supportive of moving to ssize_t and getting such a type that is easily accessed here.

I keep finding frustrating code generation quality issues with int because of compiler limitations anyways.

That said, I'm perfectly happy with the int code you have now and considering ssize_t in a follow-up. And even if we use ssize_t, I'm happy to insist on code not pushing 32-bit offsets so that it is always valid on a 32-bit system to use int or ssize_t.

What do you think about consistently using int64_t rather than ssize_t for file offsets and buffer positions? We're already doing that in some places (e.g., the offset in LineInfo). I suspect we don't care too much about the extra memory usage for 32-bit compiles of the toolchain, given our priorities.

SGTM (but I'd suggest a follow-up patch)
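The sentinel choice discussed in this thread can be sketched as follows: report "not found" as the size of the text rather than npos, and keep positions signed. The function names are hypothetical and std::string_view stands in for llvm::StringRef.

```cpp
#include <cstddef>
#include <string_view>

// Sketch only: returns the position of the first '.' in `text`, or
// text.size() when there is none. The result is a signed int, so this assumes
// the input is small enough to fit (the thread floats a 2GB cap on inputs).
int FindRadixPoint(std::string_view text) {
  int size = static_cast<int>(text.size());
  for (int i = 0; i < size; ++i) {
    if (text[i] == '.') return i;
  }
  return size;
}

// With the size sentinel, substr needs no "not found" branch: when there is
// no radix point, the integer part is simply the whole text.
std::string_view IntegerPart(std::string_view text) {
  return text.substr(0, static_cast<std::size_t>(FindRadixPoint(text)));
}
```

The static_cast at the boundary is the cost mentioned in the thread; in exchange, every position in the parsing code is an ordinary signed value with no special maximum.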


fowles · 2021-02-23T04:45:15Z

lexer/tokenized_buffer.cpp

+}
+
+// Parse a string that is known to be a valid base-radix integer into an APInt.
+// If needs_cleaning is true, the string may additionally contain _ and .


stripping _ makes sense, but stripping . feels weird. Can you expand on the comment a bit to say why?
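One plausible reading, inferred from the header comment describing a real literal as mantissa * 10^exponent rather than from an answer given in this thread: the digits on both sides of the radix point form a single integer mantissa, and the position of the '.' is accounted for by adjusting the exponent. A minimal sketch under that assumption, with a hypothetical function name, uppercase-only hexadecimal digits, and a 64-bit result standing in for llvm::APInt:

```cpp
#include <cstdint>
#include <string_view>

// Sketch only: parses `text` as a base-`radix` integer, skipping '_' and '.'.
// Assumes the text was already validated, that hexadecimal digits are
// uppercase, and that the value fits in 64 bits.
std::uint64_t ParseCleanedInteger(std::string_view text, int radix) {
  std::uint64_t value = 0;
  for (char c : text) {
    if (c == '_' || c == '.') continue;  // Separator or radix point: skip.
    int digit = ('0' <= c && c <= '9') ? c - '0' : c - 'A' + 10;
    value = value * static_cast<std::uint64_t>(radix) +
            static_cast<std::uint64_t>(digit);
  }
  return value;
}
```

For example, "1_234.5" cleans to the mantissa 12345; pairing it with an exponent of -1 recovers the original value.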


chandlerc

One meta comment is that I think clang-format needs to get run...

Some of the comments below are fine to defer. I've tried to say so explicitly in each comment, but if it's not clear, don't hesitate to ask about deferring to a follow-up.



zygoloid · 2021-02-25T20:21:20Z

One meta comment is that I think clang-format needs to get run...

Done.

chandlerc

LGTM with potentially some minor formatting fixes below. One looks like an odd clang-format thing, the other is just member-order matching the style guide.

I'd double check w/ @fowles to make sure he's fine before landing, but if so feel free to submit with the fixes.


chandlerc · 2021-02-26T06:33:08Z

lexer/tokenized_buffer.cpp

 #include <algorithm>
 #include <bitset>
 #include <cmath>
 #include <iterator>
 #include <string>

+#include "lexer/tokenized_buffer.h"


Did clang-format do this?

Yes. Looks like my editor integration predates the addition of the -assume-filename flag so it had no idea this was the corresponding header. :-/ Should be fixed now.

chandlerc · 2021-02-26T06:56:47Z

lexer/tokenized_buffer.cpp

+  DiagnosticEmitter& emitter;
+  NumericLiteral literal;
+
+  // The radix of the literal: 2, 10, or 16, for a prefix of '0b', no prefix,
+  // or '0x', respectively.
+  int radix = 10;
+
+  // The various components of a numeric literal:
+  //
+  //     [radix] int_part [. fract_part [[ep] [+-] exponent_part]]
+  llvm::StringRef int_part;
+  llvm::StringRef fract_part;
+  llvm::StringRef exponent_part;
+
+  // Do we need to remove any special characters (digit separator or radix
+  // point) before interpreting the mantissa or exponent as an integer?
+  bool mantissa_needs_cleaning = false;
+  bool exponent_needs_cleaning = false;
+
+  // True if we found a `-` before `exponent_part`.
+  bool exponent_is_negative = false;
+
+  // True if we produced an error but recovered.
+  bool recovered_from_error = false;
+


Google's style guide would put these below the public section I think.
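For reference, the declaration order being asked for puts the public section first and data members at the end of their section. The class below is a cut-down, hypothetical stand-in for the parser in the diff, keeping only the radix rule stated in the quoted comment.

```cpp
#include <string_view>

// Sketch only: a cut-down parser showing the declaration order the Google
// C++ style guide recommends: public section first, data members last.
class NumericLiteralSketch {
 public:
  explicit NumericLiteralSketch(std::string_view text) : text_(text) {}

  // Returns 2 for a "0b" prefix, 16 for "0x", and 10 otherwise.
  int radix() const {
    if (text_.substr(0, 2) == "0b") return 2;
    if (text_.substr(0, 2) == "0x") return 16;
    return 10;
  }

 private:
  std::string_view text_;
};
```

A reader skimming the class then meets its interface before its state.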

zygoloid added 5 commits February 12, 2021 15:51

Initial lexing support for integer literals following #143.

87506d4

Address review feedback.

fcd30d6

Switch to using a bitset to determine digit validity.

94b3e72

Initial lexing support for real literals following #143.

696ac54

Fix crash on digit sequences containing only digit separators.

68f1949

Found by fuzz testing.

zygoloid requested review from fowles and chandlerc February 18, 2021 00:49

google-cla bot added the "cla: yes" label Feb 18, 2021

fowles reviewed Feb 18, 2021

zygoloid added 7 commits February 19, 2021 13:26

Address review comments.

57a7406

Remove redundant assert.

9ff89cc

Merge updates to PR #269.

c5f9a3b

Switch over two comments so the more detailed comment is closer to the relevant detail.

396f487

Merge updates to #269.

952d88a

Merge branch 'trunk' into real-literals

2bed845

Add a comment explaining why it's safe for RealLiteralValue to hold pointers to TokenizedBuffer's vector elements.

67f5aa4

chandlerc requested changes Feb 20, 2021

zygoloid added 2 commits February 22, 2021 16:25

Avoid using unsigned integer types for positions within a string.

f89ddca

By implication this also means stopping using npos to represent positions not found within the string. In order to avoid adding special cases in substr calls, use text.size() instead.

Split numeric literal checking into a bunch of smaller functions.

bb506db

fowles reviewed Feb 23, 2021

zygoloid added 2 commits February 22, 2021 21:37

Rename int_literals to literal_int_storage to better reflect its purpose. Change GetIntegerValue to return a const& for consistency and to avoid forcing a copy on large (>64 bit) integers.

1b74b49

Address review feedback and a TODO.

4f3819c

chandlerc requested changes Feb 23, 2021

zygoloid added 7 commits February 23, 2021 13:43

Switch RealLiteralValue to not hold pointers to container elements, so we don't need to worry about iterator invalidation.

b8b63d9

Move CheckDigitSequence into NumericLiteralLexer.

f3706ab

Decouple NumericLiteralLexer from Lexer.

c9faeb3

Move numeric literal lexer out of the lexer proper.

7e9963a

Clean up, add comments, clang-format. Rename NumericLiteralLexer to NumericLiteralParser to better reflect that it's assigning semantics to literals rather than merely checking their morphology.

975cac7

Improve readability (code review suggestion from @chandlerc).

1c9546c

Add tests for lack of digits in the fractional part of a number.

cc183d1

zygoloid added 2 commits February 23, 2021 15:51

Merge branch 'trunk' into real-literals

f574fa5

Additional clang-format fixes.

7f4d9d9

This was referenced Feb 24, 2021

Simplify error reporting interface #291

Merged

use a proof-of-work return type for Lex* functions #290

Merged

chandlerc approved these changes Feb 26, 2021

fowles approved these changes Feb 26, 2021

zygoloid added 3 commits February 26, 2021 17:27

clang-format with fixed configuration.

01bc2df

Style: reorder private members after public.

56f3a1b

Merge branch 'trunk' into real-literals

628f2a0

zygoloid merged commit bbcc31a into carbon-language:trunk Feb 27, 2021

zygoloid deleted the real-literals branch February 27, 2021 01:34

chandlerc pushed a commit that referenced this pull request Jun 28, 2022

Initial lexing support for real literals following #143. (#273)

a8e4a69
