Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce PositionalGenerator #2710

Merged
merged 7 commits into from
Jul 11, 2023
Merged

Conversation

mike-burns
Copy link
Contributor

One of the more common complaints among programmers is memorizing and understanding the regular expression syntax. It is concise but dense.

The regexify method uses a subset of regex language to generate values. Since it is not exactly the Ruby regex language, this means that it is an additional mini-language to learn, with its own quirks.

The PositionalGenerator attempts to solve the same problem but more clearly. It is used for generating fixed-length strings where each byte has a specific value, such as postal codes, VINs, business IDs, and so on.

Three examples to show the tradeoffs:

gb_licence_checksum

With regexify:

regexify(/[0-9][A-Z][A-Z]/)

With PositionalGenerator:

generate(:string) do |g|
  g.int
  g.letter(ranges: ['A'..'Z'], length: 2)
end

ssn_valid

With regexify:

ssn = regexify(/[0-8]\d{2}-\d{2}-\d{4}/)
INVALID_SSN.any? { |regex| regex =~ ssn } ? ssn_valid : ssn

With PositionalGenerator:

generate(:string) do |g|
  g.int(ranges: [100..665, 667..899])
  g.lit('-')
  g.int(ranges: [10..99])
  g.lit('-')
  g.int(ranges: [1000..9999])
end

vin

With regexify:

front = 8.times.map { VIN_KEYSPACE.sample(random: Faker::Config.random) }.join
back = 8.times.map { VIN_KEYSPACE.sample(random: Faker::Config.random) }.join
checksum = "#{front}A#{back}".chars.each_with_index.map do |char, i|
  value = (char =~ /\A\d\z/ ? char.to_i : VIN_TRANSLITERATION[char.to_sym])
  value * VIN_WEIGHT[i]
end.inject(:+) % 11
checksum = 'X' if checksum == 10
"#{front}#{checksum}#{back}"

With PositionalGenerator:

generate(:string) do |g|
  g.letter(name: :wmi, ranges: ['100'..'199', '400'..'499', '500'..'599', '700'..'799', '7A0'..'7F9'])
  g.letter(name: :vds, length: 5, ranges: [VIN_KEYSPACE])
  g.computed(name: :checksum, deps: %i[wmi vds model_year plant_code vis]) do |wmi, vds, model_year, plant_code, vis|
    checksum = "#{wmi}#{vds}0#{model_year}#{plant_code}#{vis}".chars.each_with_index.map do |char, i|
      value = (char =~ /\A\d\z/ ? char.to_i : VIN_TRANSLITERATION[char.to_sym])
      value * VIN_WEIGHT[i]
    end.inject(:+) % 11

    if checksum == 10
      'X'
    else
      checksum
    end
  end
  g.letter(name: :model_year, length: 1, ranges: [VIN_KEYSPACE - %w[U Z 0]])
  g.letter(name: :plant_code, length: 1, ranges: [VIN_KEYSPACE])
  g.int(name: :vis, length: 6)
end

Summary

As you can see, the PositionalGenerator is much more verbose than regexify. The tradeoff is understanding and readability.

In this commit I replaced all simple uses of regexify. The method is still used as part of the i18n framework, though.

One of the more common complaints among programmers is memorizing and
understanding the regular expression syntax. It is concise but dense.

The `regexify` method uses a _subset_ of regex language to generate
values. Since it is not exactly the Ruby regex language, this means that
it is an additional mini-language to learn, with its own quirks.

The PositionalGenerator attempts to solve the same problem but more
clearly. It is used for generating fixed-length strings where each byte
has a specific value, such as postal codes, VINs, business IDs, and
so on.

Three examples to show the tradeoffs:

`gb_licence_checksum`
---------------------

With `regexify`:

```ruby
regexify(/[0-9][A-Z][A-Z]/)
```

With `PositionalGenerator`:

```ruby
generate(:string) do |g|
  g.int
  g.letter(ranges: ['A'..'Z'], length: 2)
end
```

`ssn_valid`
-----------

With `regexify`:

```ruby
ssn = regexify(/[0-8]\d{2}-\d{2}-\d{4}/)
INVALID_SSN.any? { |regex| regex =~ ssn } ? ssn_valid : ssn
```

With `PositionalGenerator`:

```ruby
generate(:string) do |g|
  g.int(ranges: [100..665, 667..899])
  g.lit('-')
  g.int(ranges: [10..99])
  g.lit('-')
  g.int(ranges: [1000..9999])
end
```

`vin`
-----

With `regexify`:

```ruby
front = 8.times.map { VIN_KEYSPACE.sample(random: Faker::Config.random) }.join
back = 8.times.map { VIN_KEYSPACE.sample(random: Faker::Config.random) }.join
checksum = "#{front}A#{back}".chars.each_with_index.map do |char, i|
  value = (char =~ /\A\d\z/ ? char.to_i : VIN_TRANSLITERATION[char.to_sym])
  value * VIN_WEIGHT[i]
end.inject(:+) % 11
checksum = 'X' if checksum == 10
"#{front}#{checksum}#{back}"
```

With `PositionalGenerator`:

```ruby
generate(:string) do |g|
  g.letter(name: :wmi, ranges: ['100'..'199', '400'..'499', '500'..'599', '700'..'799', '7A0'..'7F9'])
  g.letter(name: :vds, length: 5, ranges: [VIN_KEYSPACE])
  g.computed(name: :checksum, deps: %i[wmi vds model_year plant_code vis]) do |wmi, vds, model_year, plant_code, vis|
    checksum = "#{wmi}#{vds}0#{model_year}#{plant_code}#{vis}".chars.each_with_index.map do |char, i|
      value = (char =~ /\A\d\z/ ? char.to_i : VIN_TRANSLITERATION[char.to_sym])
      value * VIN_WEIGHT[i]
    end.inject(:+) % 11

    if checksum == 10
      'X'
    else
      checksum
    end
  end
  g.letter(name: :model_year, length: 1, ranges: [VIN_KEYSPACE - %w[U Z 0]])
  g.letter(name: :plant_code, length: 1, ranges: [VIN_KEYSPACE])
  g.int(name: :vis, length: 6)
end
```

Summary
-------

As you can see, the `PositionalGenerator` is much more verbose than
`regexify`.  The tradeoff is understanding and readability.
@stefannibrasil
Copy link
Contributor

stefannibrasil commented Apr 14, 2023

@mike-burns thank you for the detailed PR and docs in the proposed changes!

Another generator that could use benefit from this approach is the password one.

The only concern I have is making the app slower. Do you have any idea on how introducing this would impact loading time?

@mike-burns
Copy link
Contributor Author

As a proxy for real-world benchmarks, I ran time rake test three times on both this branch and main.

There were three numbers to pull: the test suite's clock, the computed tests per second and assertions per second, and the amount of time the OS was in user space while running the program.

a-pattern-language

Test suite clock:

  • 12.367095259s
  • 12.47272084s
  • 12.645565266s

tests/s:

  • 167.14 tests/s
  • 163.46 tests/s
  • 165.72 tests/s

assertions/s:

  • 24518.77 assertions/s
  • 23160.70 assertions/s
  • 23806.84 assertions/s

OS user time:

  • 14.75s
  • 15.06s
  • 15.25s

main

Test suite clock:

  • 12.153501719
  • 12.212515919
  • 12.317378931

tests/s:

  • 169.25 tests/s
  • 170.07 tests/s
  • 167.81 tests/s

assertions/s:

  • 22925.66 assertions/s
  • 22969.06 assertions/s
  • 24784.05 assertions/s

OS user time:

  • 14.58s
  • 14.80s
  • 15.02s

Summary

This branch might be a tad slower. If we compare average wall times, we see a slowdown of 0.2s. However, that's within a margin of error for normal program run fluctuations so it's hard to say anything concrete.

If that's too hand-wave-y please let me know and I can set up a more rigorous benchmark. And if that's too much of a slowdown, also let me know -- I think one potential improvement is to create the PositionalGenerator objects on library loads, making generation faster (at the expense of memory usage for unused generators).

@thdaraujo thdaraujo requested a review from a team May 21, 2023 01:24
@thdaraujo
Copy link
Contributor

Thanks for working on this, @mike-burns ! I think this would be a solid addition to faker!

In my view, the biggest value here would be on rewriting the more complicated generators who use long regex patterns. Those are the ones that are hard to understand and maintain. Using the approach you're providing here could make them a bit more declarative and easier to work with. The vin generator is a great example of this.

I'll take a look at the implementation and give you a review in the next few weeks. I'm also adding @faker-ruby/core-contributors as reviewers so they can chime in.

Thanks!

Co-authored-by: Mike Burns <mburns@thoughtbot.com>
Copy link
Contributor

@thdaraujo thdaraujo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice approach and well documented, thanks for working on this!

I left a few questions and a couple of suggestions.

lib/helpers/positional_generator.rb Outdated Show resolved Hide resolved
lib/helpers/positional_generator.rb Show resolved Hide resolved
lib/helpers/positional_generator.rb Outdated Show resolved Hide resolved
lib/helpers/positional_generator.rb Outdated Show resolved Hide resolved
lib/helpers/positional_generator.rb Outdated Show resolved Hide resolved
lib/helpers/positional_generator.rb Outdated Show resolved Hide resolved
if @ranges
case s = @ranges.sample(random: Faker::Config.random)
when Range
s.to_a.sample(random: Faker::Config.random)
Copy link
Contributor

@thdaraujo thdaraujo May 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but it would be nice if we could get away with sampling a character without having to convert the whole range to an array every time. For a range of integers, there's also no bound on the size of it.

There was some work done recently to skip generating these ranges of characters for passwords here (#2725).

Unfortunately, rand only accepts ranges of numbers. 😞

Maybe convert the range to array once, and pick randomly from it afterwards?

#
# Later we can look up the appropriate component by indexing into the
# +@components+ array.
def build_graph
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting approach!

Is there a need to also detect any cycles in the dependency graph? I guess if you're keeping track of seen nodes, you would be able to detect it, and maybe raise an error?

Not sure it would be possible to generate all values for these component if there's a dependency cycle, so maybe raising "cycle detected etc." when the cycle is detected could be useful to the user.

By the way, have you considered generating a topological sort of the graph and calculating/generating values based on that? I know there's a gem called tsort, but I've never used it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to also detect any cycles in the dependency graph?

Good question, and I'm not sure!

This is an internal class that ideally the user will not control, so hopefully faker devs and people extending faker will be able to wield it responsibly. Currently there is no concept of generating these graph nodes dynamically; they are defined once, and that's the result every time.

That said, it will be a nice debugging tool to catch those cycles. What do you think, should we build that in?

generating a topological sort of the graph

Oh hey, those are the words I should have searched for! I guess I did that manually. Would you rather I bring in the tsort gem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary to use tsort, your implementation is simple enough to follow. It was just a thought I had when looking at the code. And maybe we don't need to worry about cycle detection right now.

I'm happy with the current state. Let me know if you want to add anything else, or else I'll just merge it. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good from my end. Thank you so much for taking the time to read it and accept it into your repo!

mike-burns and others added 2 commits June 9, 2023 18:20
Co-authored-by: Thiago Araujo <thd.araujo@gmail.com>
Co-authored-by: Mike Burns <mburns@thoughtbot.com>
@thdaraujo thdaraujo merged commit 1bc565f into faker-ruby:main Jul 11, 2023
@mike-burns
Copy link
Contributor Author

Thank you for finishing this off and merging it, @thdaraujo !

@thdaraujo
Copy link
Contributor

thdaraujo commented Jul 17, 2023

Thank you for finishing this off and merging it, @thdaraujo !

thanks for building it, @mike-burns ! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants