Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support case insensitive string contains #269

Merged
merged 3 commits into from
May 11, 2023

Conversation

henricook
Copy link
Contributor

I can convert some of my existing projects to use CIString instead of my home rolled CaseInsensitiveString class if we can support contains.

I don't contribute as often as I'd like so please feel free to comment on anything

@henricook henricook force-pushed the string-contains branch 2 times, most recently from 057a3ad to f73fbc8 Compare September 6, 2022 07:52
@armanbilge
Copy link
Member

Thanks for the contribution!

@isomarcte is this something we'll continue to be able to support in your 2.x rewrite?

@isomarcte
Copy link
Member

That should be quite doable.

@henricook
Copy link
Contributor Author

Great! Let me know if you'd like me to port this anywhere

@isomarcte
Copy link
Member

Great! Let me know if you'd like me to port this anywhere

I'll ping this ticket when the core changes in 2.x.x land.

Copy link
Member

@rossabaker rossabaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I've pushed a unit test where this breaks down. I would expect contains to be consistent with equals. I think we could make my tests pass with a Locale.ROOT argument to toLowerCase. (I wish Java would deprecate the versions without a Locale.)

I still fear that's not the entire solution, because equalsIgnoreCase folds in both directions. I just don't have time to publish another failing test. I think we'll need a port of commons-lang3's method to make it bulletproof.

@isomarcte
Copy link
Member

@rossabaker I'm pretty sure this will hold if we define contains to be "Contained in the case folded string" by some specific case folding scheme. For example from my 2.x.x branch.

scala> FullCaseFoldedString("i").toString.contains(FullCaseFoldedString("I").toString)
val res3: Boolean = true

scala> TurkicFullCaseFoldedString("i").toString.contains(TurkicFullCaseFoldedString("I").toString)
val res4: Boolean = false

@isomarcte
Copy link
Member

Oh sorry @armanbilge and @henricook I didn't realize this was a PR, I thought it was an issue only when I first read it.

Yeah, as @rossabaker points out, we can't really define this safely on 1.x.x because we don't actually have a full representation of the case folded value.

@rossabaker
Copy link
Member

rossabaker commented Oct 8, 2022

I think it can be supported in 1.x. The definition of equality compares corresponding characters. contains could work just like it does on String, but using the same character predicate as equalsIgnoreCase. That's exactly what commons-lang does, if I'm not mistaken.

@henricook
Copy link
Contributor Author

henricook commented Oct 8, 2022

Thanks for these thoughtful comments all, i'm processing them. Why's the Turkey Test called the Turkey Test?

Edit: Ah! Found it! https://stackoverflow.com/questions/796986/what-is-the-turkey-test

@henricook
Copy link
Contributor Author

henricook commented Oct 8, 2022

Do we know why I might prefer commons-lang3's regionMatches over the standard java String regionMatches? I've got the turkey test passing by using that but would be interesting to know if CL3's has an advantage or is just an implementation leftover from before it was in Java standard

@henricook henricook force-pushed the string-contains branch 4 times, most recently from 7a1c2c3 to 5f2a620 Compare October 8, 2022 12:57
@henricook
Copy link
Contributor Author

What do we think of c4a0a8f ?

Please note: I've force pushed this branch with a rewritten commit history as my author field was a work address accidentally

Copy link
Member

@rossabaker rossabaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks correct to me. I think all the tests would pass if the haystackIndex maxes out at 0. I wonder if we could have a property like:

forAll { (prefix: String, s: String, suffix: String) => 
  val a = CIString(prefix + s + suffix)
  val b = CIString(s.map(_.toUpperCase)) // something that makes mixed case deterministically would be better still
  a.contains(b)
}

@isomarcte
Copy link
Member

@rossabaker @henricook I'm good with this for 1.x.x, but note that we aren't doing a Unicode character contains, we are doing a UTF-16 char contains. So things like this won't work, but it should for most Unicode definitions of case folding.

scala> CIString("\u0041\u030a")
val res0: org.typelevel.ci.CIString = Å
scala> CIString("\u00c5")
val res1: org.typelevel.ci.CIString = Å
scala> res0.contains(res1)
val res2: Boolean = false

Though I suppose that is consistent with the equality behavior for 1.x.x.

Just please be aware of this caveat until 2.x.x is out.

@henricook
Copy link
Contributor Author

henricook commented Apr 12, 2023

Thanks both, and apologies this fell off my radar. Is this good to go @rossabaker ?

Copy link
Member

@rossabaker rossabaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, mine too. Yes, looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants