-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce PositionalGenerator #2710
Conversation
One of the more common complaints among programmers is memorizing and understanding the regular expression syntax. It is concise but dense. The `regexify` method uses a _subset_ of regex language to generate values. Since it is not exactly the Ruby regex language, this means that it is an additional mini-language to learn, with its own quirks. The PositionalGenerator attempts to solve the same problem but more clearly. It is used for generating fixed-length strings where each byte has a specific value, such as postal codes, VINs, business IDs, and so on. Three examples to show the tradeoffs: `gb_licence_checksum` --------------------- With `regexify`: ```ruby regexify(/[0-9][A-Z][A-Z]/) ``` With `PositionalGenerator`: ```ruby generate(:string) do |g| g.int g.letter(ranges: ['A'..'Z'], length: 2) end ``` `ssn_valid` ----------- With `regexify`: ```ruby ssn = regexify(/[0-8]\d{2}-\d{2}-\d{4}/) INVALID_SSN.any? { |regex| regex =~ ssn } ? ssn_valid : ssn ``` With `PositionalGenerator`: ```ruby generate(:string) do |g| g.int(ranges: [100..665, 667..899]) g.lit('-') g.int(ranges: [10..99]) g.lit('-') g.int(ranges: [1000..9999]) end ``` `vin` ----- With `regexify`: ```ruby front = 8.times.map { VIN_KEYSPACE.sample(random: Faker::Config.random) }.join back = 8.times.map { VIN_KEYSPACE.sample(random: Faker::Config.random) }.join checksum = "#{front}A#{back}".chars.each_with_index.map do |char, i| value = (char =~ /\A\d\z/ ? char.to_i : VIN_TRANSLITERATION[char.to_sym]) value * VIN_WEIGHT[i] end.inject(:+) % 11 checksum = 'X' if checksum == 10 "#{front}#{checksum}#{back}" ``` With `PositionalGenerator`: ```ruby generate(:string) do |g| g.letter(name: :wmi, ranges: ['100'..'199', '400'..'499', '500'..'599', '700'..'799', '7A0'..'7F9']) g.letter(name: :vds, length: 5, ranges: [VIN_KEYSPACE]) g.computed(name: :checksum, deps: %i[wmi vds model_year plant_code vis]) do |wmi, vds, model_year, plant_code, vis| checksum = "#{wmi}#{vds}0#{model_year}#{plant_code}#{vis}".chars.each_with_index.map do |char, i| value = (char =~ /\A\d\z/ ? char.to_i : VIN_TRANSLITERATION[char.to_sym]) value * VIN_WEIGHT[i] end.inject(:+) % 11 if checksum == 10 'X' else checksum end end g.letter(name: :model_year, length: 1, ranges: [VIN_KEYSPACE - %w[U Z 0]]) g.letter(name: :plant_code, length: 1, ranges: [VIN_KEYSPACE]) g.int(name: :vis, length: 6) end ``` Summary ------- As you can see, the `PositionalGenerator` is much more verbose than `regexify`. The tradeoff is understanding and readability.
@mike-burns thank you for the detailed PR and docs in the proposed changes! Another generator that could use benefit from this approach is the password one. The only concern I have is making the app slower. Do you have any idea on how introducing this would impact loading time? |
As a proxy for real-world benchmarks, I ran There were three numbers to pull: the test suite's clock, the computed tests per second and assertions per second, and the amount of time the OS was in user space while running the program.
|
Thanks for working on this, @mike-burns ! I think this would be a solid addition to faker! In my view, the biggest value here would be on rewriting the more complicated generators who use long regex patterns. Those are the ones that are hard to understand and maintain. Using the approach you're providing here could make them a bit more declarative and easier to work with. The vin generator is a great example of this. I'll take a look at the implementation and give you a review in the next few weeks. I'm also adding @faker-ruby/core-contributors as reviewers so they can chime in. Thanks! |
Co-authored-by: Mike Burns <mburns@thoughtbot.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice approach and well documented, thanks for working on this!
I left a few questions and a couple of suggestions.
if @ranges | ||
case s = @ranges.sample(random: Faker::Config.random) | ||
when Range | ||
s.to_a.sample(random: Faker::Config.random) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but it would be nice if we could get away with sampling a character without having to convert the whole range to an array every time. For a range of integers, there's also no bound on the size of it.
There was some work done recently to skip generating these ranges of characters for passwords here (#2725).
Unfortunately, rand
only accepts ranges of numbers. 😞
Maybe convert the range to array once, and pick randomly from it afterwards?
# | ||
# Later we can look up the appropriate component by indexing into the | ||
# +@components+ array. | ||
def build_graph |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting approach!
Is there a need to also detect any cycles in the dependency graph? I guess if you're keeping track of seen
nodes, you would be able to detect it, and maybe raise an error?
Not sure it would be possible to generate all values for these component if there's a dependency cycle, so maybe raising "cycle detected etc." when the cycle is detected could be useful to the user.
By the way, have you considered generating a topological sort of the graph and calculating/generating values based on that? I know there's a gem called tsort
, but I've never used it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need to also detect any cycles in the dependency graph?
Good question, and I'm not sure!
This is an internal class that ideally the user will not control, so hopefully faker devs and people extending faker will be able to wield it responsibly. Currently there is no concept of generating these graph nodes dynamically; they are defined once, and that's the result every time.
That said, it will be a nice debugging tool to catch those cycles. What do you think, should we build that in?
generating a topological sort of the graph
Oh hey, those are the words I should have searched for! I guess I did that manually. Would you rather I bring in the tsort gem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessary to use tsort, your implementation is simple enough to follow. It was just a thought I had when looking at the code. And maybe we don't need to worry about cycle detection right now.
I'm happy with the current state. Let me know if you want to add anything else, or else I'll just merge it. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good from my end. Thank you so much for taking the time to read it and accept it into your repo!
Co-authored-by: Thiago Araujo <thd.araujo@gmail.com>
Co-authored-by: Mike Burns <mburns@thoughtbot.com>
add missing space
Thank you for finishing this off and merging it, @thdaraujo ! |
thanks for building it, @mike-burns ! 😄 |
One of the more common complaints among programmers is memorizing and understanding the regular expression syntax. It is concise but dense.
The
regexify
method uses a subset of regex language to generate values. Since it is not exactly the Ruby regex language, this means that it is an additional mini-language to learn, with its own quirks.The PositionalGenerator attempts to solve the same problem but more clearly. It is used for generating fixed-length strings where each byte has a specific value, such as postal codes, VINs, business IDs, and so on.
Three examples to show the tradeoffs:
gb_licence_checksum
With
regexify
:With
PositionalGenerator
:ssn_valid
With
regexify
:With
PositionalGenerator
:vin
With
regexify
:With
PositionalGenerator
:Summary
As you can see, the
PositionalGenerator
is much more verbose thanregexify
. The tradeoff is understanding and readability.In this commit I replaced all simple uses of
regexify
. The method is still used as part of the i18n framework, though.