Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Character "#" is defined as "pound" #1

Closed
jaeggerr opened this issue Apr 27, 2024 · 7 comments
Closed

Character "#" is defined as "pound" #1

jaeggerr opened this issue Apr 27, 2024 · 7 comments

Comments

@jaeggerr
Copy link

Hello!

First, thank you for that useful library.

I think there is an error in the next method of BasicTextCharacterSequence.

case "#": return BasicTextCharacter(kind: .pound, range: range)

It should be hash (which does not exist yet).

Let me know if you want me to write a PR, or if you prefer to fix that yourself.

@mattmassicotte
Copy link
Contributor

Well, I'm so happy you find it useful! And thank you for filing this issue.

You made me realize that I did not know what the "correct" name for this symbol is, and it turns out to be regional! There is no standarized name, althought unicode has chosen to name this "number sign".

I'm happy to fix this, or accept a PR. But it isn't clear to me that "hash" is unambiguously the right choice. I'm tempted to follow unicode here. What do you think?

Also, any change here will be source-breaking, so I'm not even sure what the right approach to making the change would be.

@jaeggerr
Copy link
Author

OK, after asking ChatGPT what would be the right name for the symbol, I understand better the problem.

The "#" character is commonly referred to as the "hash" symbol. It's also known by several other names, including "number sign," "pound sign" (especially in the United States when used in contexts like telephone systems), and "sharp" (in musical notation). On social media and in programming, it's often called a "hashtag" when used to tag keywords or topics.

I thought that you confused # with £. This is indeed a more regional issue...
You are right that changing the code now and referring case .pound to the # character will silently break existing code.

The only solution I have in mind would be to name # hashtag because even though it looks a bit social media oriented, everyone knows what it is and it is not confusing.
As the code won't compile after updating the library, it would be easy to fix and probably the least dangerous solution.

If you decide to add some day the £ symbol, it could be named something like poundCurrency to avoid confusion.

Btw, I was wondering how you decided to populate your BasicTextCharacterKind enum.
Why do you have specific symbols (like pound), and groups of different possible characters (like upperCaseLetter)?
And why did you not associate the actual symbol for those groups cases.
For example

case upperCaseLetter(String)
case digit(Int)
case otherCharacter(String)
...

It could have been a good shortcut to check tokens and when using next and peek.

@jaeggerr
Copy link
Author

jaeggerr commented Apr 28, 2024

I also realize that indeed, making any change in the BasicTextCharacterKind enum will be a breaking change.
Even adding new symbols will be breaking, as they won't be referred as otherCharacter anymore.

This is getting more complicated than expected.

@mattmassicotte
Copy link
Contributor

Ok, so first for the naming. I completely agree that "pound" is a poor choice. I decided to go with the unicode naming "numberSign", which is not regional and seems quite stable. But I also introduced this:

@available(*, deprecated, message: "Please use numberSign")
public static let pound = BasicTextCharacterKind.numberSign

There are situations where this could still be breaking in subtle ways, but I have a feeling it won't be a problem. What do you think?

@mattmassicotte
Copy link
Contributor

Why do you have specific symbols (like pound), and groups of different possible characters (like upperCaseLetter)?

The whole idea with BasicTextCharacterSequence was to make it easier to tokenize the kinds of languages I normally encounter. It is possible to have grammars where it will not be appropriate. In particular, upper/lower case is significnat for some languages, but runs of upper-lower are not.

You can always define your own custom sequence if that isn't an appropriate building block.

And why did you not associate the actual symbol for those groups cases?

I honestly don't remember! It does seem like this could be a good idea, though I have not looked closely at it. It would definitely be possible to make a new sequence that does it.

@jaeggerr
Copy link
Author

jaeggerr commented May 1, 2024

There are situations where this could still be breaking in subtle ways, but I have a feeling it won't be a problem. What do you think?

Your solution looks like a good approach to me. The warning could be enough to let people that this is not the £ symbol, and gently invite those how already used the # symbol as pound to update their code.
As long as the pound case keeps on referring to #, and you're not adding new cases for new tokens, I don't think it can break anything.

You can always define your own custom sequence if that isn't an appropriate building block.

Yes, I will create my own sequence based on what you started.
And if I do something that would be worth being included it the lib, I'll let you know.

Thanks for your help, and feel free to close this issue!

@mattmassicotte
Copy link
Contributor

Sounds great. And thanks again for bringing this up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants