Character "#" is defined as "pound" #1

jaeggerr · 2024-04-27T19:35:52Z

Hello!

First, thank you for that useful library.

I think there is an error in the next method of BasicTextCharacterSequence.

case "#": return BasicTextCharacter(kind: .pound, range: range)

It should be hash (which does not exist yet).

Let me know if you want me to write a PR, or if you prefer to fix that yourself.

The text was updated successfully, but these errors were encountered:

mattmassicotte · 2024-04-28T11:14:10Z

Well, I'm so happy you find it useful! And thank you for filing this issue.

You made me realize that I did not know what the "correct" name for this symbol is, and it turns out to be regional! There is no standarized name, althought unicode has chosen to name this "number sign".

I'm happy to fix this, or accept a PR. But it isn't clear to me that "hash" is unambiguously the right choice. I'm tempted to follow unicode here. What do you think?

Also, any change here will be source-breaking, so I'm not even sure what the right approach to making the change would be.

jaeggerr · 2024-04-28T17:00:20Z

OK, after asking ChatGPT what would be the right name for the symbol, I understand better the problem.

The "#" character is commonly referred to as the "hash" symbol. It's also known by several other names, including "number sign," "pound sign" (especially in the United States when used in contexts like telephone systems), and "sharp" (in musical notation). On social media and in programming, it's often called a "hashtag" when used to tag keywords or topics.

I thought that you confused # with £. This is indeed a more regional issue...
You are right that changing the code now and referring case .pound to the # character will silently break existing code.

The only solution I have in mind would be to name # hashtag because even though it looks a bit social media oriented, everyone knows what it is and it is not confusing.
As the code won't compile after updating the library, it would be easy to fix and probably the least dangerous solution.

If you decide to add some day the £ symbol, it could be named something like poundCurrency to avoid confusion.

Btw, I was wondering how you decided to populate your BasicTextCharacterKind enum.
Why do you have specific symbols (like pound), and groups of different possible characters (like upperCaseLetter)?
And why did you not associate the actual symbol for those groups cases.
For example

case upperCaseLetter(String)
case digit(Int)
case otherCharacter(String)
...

It could have been a good shortcut to check tokens and when using next and peek.

jaeggerr · 2024-04-28T17:15:53Z

I also realize that indeed, making any change in the BasicTextCharacterKind enum will be a breaking change.
Even adding new symbols will be breaking, as they won't be referred as otherCharacter anymore.

This is getting more complicated than expected.

mattmassicotte · 2024-04-30T10:17:56Z

Ok, so first for the naming. I completely agree that "pound" is a poor choice. I decided to go with the unicode naming "numberSign", which is not regional and seems quite stable. But I also introduced this:

@available(*, deprecated, message: "Please use numberSign")
public static let pound = BasicTextCharacterKind.numberSign

There are situations where this could still be breaking in subtle ways, but I have a feeling it won't be a problem. What do you think?

mattmassicotte · 2024-04-30T10:26:12Z

Why do you have specific symbols (like pound), and groups of different possible characters (like upperCaseLetter)?

The whole idea with BasicTextCharacterSequence was to make it easier to tokenize the kinds of languages I normally encounter. It is possible to have grammars where it will not be appropriate. In particular, upper/lower case is significnat for some languages, but runs of upper-lower are not.

You can always define your own custom sequence if that isn't an appropriate building block.

And why did you not associate the actual symbol for those groups cases?

I honestly don't remember! It does seem like this could be a good idea, though I have not looked closely at it. It would definitely be possible to make a new sequence that does it.

jaeggerr · 2024-05-01T15:30:50Z

There are situations where this could still be breaking in subtle ways, but I have a feeling it won't be a problem. What do you think?

Your solution looks like a good approach to me. The warning could be enough to let people that this is not the £ symbol, and gently invite those how already used the # symbol as pound to update their code.
As long as the pound case keeps on referring to #, and you're not adding new cases for new tokens, I don't think it can break anything.

You can always define your own custom sequence if that isn't an appropriate building block.

Yes, I will create my own sequence based on what you started.
And if I do something that would be worth being included it the lib, I'll let you know.

Thanks for your help, and feel free to close this issue!

mattmassicotte · 2024-05-12T11:47:08Z

Sounds great. And thanks again for bringing this up!

mattmassicotte added a commit that referenced this issue Apr 30, 2024

Introduce numberSign, deprecate pound #1

cc183d1

mattmassicotte closed this as completed May 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Character "#" is defined as "pound" #1

Character "#" is defined as "pound" #1

jaeggerr commented Apr 27, 2024

mattmassicotte commented Apr 28, 2024

jaeggerr commented Apr 28, 2024

jaeggerr commented Apr 28, 2024 •

edited

Loading

mattmassicotte commented Apr 30, 2024

mattmassicotte commented Apr 30, 2024

jaeggerr commented May 1, 2024

mattmassicotte commented May 12, 2024

Character "#" is defined as "pound" #1

Character "#" is defined as "pound" #1

Comments

jaeggerr commented Apr 27, 2024

mattmassicotte commented Apr 28, 2024

jaeggerr commented Apr 28, 2024

jaeggerr commented Apr 28, 2024 • edited Loading

mattmassicotte commented Apr 30, 2024

mattmassicotte commented Apr 30, 2024

jaeggerr commented May 1, 2024

mattmassicotte commented May 12, 2024

jaeggerr commented Apr 28, 2024 •

edited

Loading