Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

charlatan #94

Closed
10 of 13 tasks
sckott opened this issue Jan 28, 2017 · 30 comments
Closed
10 of 13 tasks

charlatan #94

sckott opened this issue Jan 28, 2017 · 30 comments

Comments

@sckott
Copy link
Contributor

sckott commented Jan 28, 2017

Summary

  • What does this package do? (explain in 50 words or less):

charlatan makes fake data. It borrows data and modifies code from Python's faker library. charlatan adds more fake data types, in particular to target core audience of scientists.

locale support - done for a few data types; more work to be done on others. Most complete locale support in JobProvider/ch_job and PersonProvider/name.

  • Paste the full DESCRIPTION file inside a code block below:
Package: charlatan
Type: Package
Title: Make Fake Data
Description: Make fake data, supporting addresses, person names, dates, 
    times, colors, coordinates, currencies, digital object identifiers
    (DOIs), jobs, phone numbers, DNA sequences, doubles and integers
    from distributions and within a range.
Version: 0.0.6.9560
Authors@R: c(person("Scott", "Chamberlain", role = c("aut", "cre"),
    email = "myrmecocystus@gmail.com"))
License: MIT + file LICENSE
URL: https://github.com/ropenscilabs/charlatan
BugReports: https://github.com/ropenscilabs/charlatan/issues
LazyData: true
Encoding: UTF-8
VignetteBuilder: knitr
Imports:
    R6 (>= 2.2.0),
    tibble (>= 1.2),
    whisker,
    Rmpfr
Suggests:
    testthat,
    covr,
    knitr,
    rmarkdown
RoxygenNote: 5.0.1
Collate:
    'base-provider.R'
    'datetime-provider.R'
    'address-provider.R'
    ...

bottom of Collate cut off for brevity.

  • URL for the package (the development repository, not a stylized html page):

https://github.com/ropenscilabs/charlatan

  • Who is the target audience?

Students in a classroom learning R, or e.g, in a software carpentry type bootcamp - since it's liteweight wrt dependencies, should be easy install everywhere. Also, if you're doing modeling/simulations, perhaps you could use this to generate some fake data to run models on.

  • Are there other R packages that accomplish the same thing? If so, what is different about yours?

not that i know of

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration with Travis CI and/or another service.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:


Note: This package isn't quite complete in the sense that there's more types of data to add to the package, but the overall package API is set (but can be changed if needed), and tests are there (though more needed) and vignette exists - Anyway, submitting earlyish to get reviewer feedback

@noamross
Copy link
Contributor

noamross commented Feb 9, 2017

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

  • I note that ch_names() can sometimes generate just last names, with blank spaces in front of them. Is this expected behavior? You may want to test for expected name patterns.
  • Here is goodpractice::gp() output. Nothing major here, but a few small clean-ups:

── GP charlatan ────────────────────────────────

It is good practice to

✖ write unit tests for all functions, and all package code in general. 37% of code
lines are covered by test cases.

R/address-provider.R:89:NA
R/address-provider.R:93:NA
R/address-provider.R:97:NA
R/address-provider.R:102:NA
R/address-provider.R:103:NA
... and 267 more lines

✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and
developers are used it and it is easier to read your code for them if you use
'<-'.

R/color-provider.R:1:13
R/credit_card-provider.R:14:17
R/credit_card-provider.R:15:21
R/credit_card-provider.R:16:19
R/credit_card-provider.R:17:26
... and 103 more lines

✖ avoid long code lines, it is bad for readability. Also, many people prefer
editor windows that are about 80 characters wide. Try make your lines shorter
than 80 characters

R/address-provider.R:122:1
R/color-provider-uk_UA.R:6:1
R/color-provider-uk_UA.R:7:1
R/color-provider-uk_UA.R:9:1
R/color-provider-uk_UA.R:11:1
... and 646 more lines

✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...)
expressions. They are error prone and result 1:0 if the expression on the right
hand side is zero. Use seq_len() or seq_along() instead.

R/missing-data-provider.R:26:19

✖ fix this R CMD check NOTE: Namespaces in Imports field not imported from: ‘R6’
‘Rmpfr’ ‘whisker’ All declared Imports should be used.
─────────────────────────────────────────────

devtools::spell_check() also found "inifinity" (infinity) and "dats" (data).


Reviewers: @geanders @tjmahr
Due date: 2017-03-06

@noamross
Copy link
Contributor

Thanks @ganderson and @tjmahr for agreeing to review.

@noamross
Copy link
Contributor

Whoops, that's @geanders

@noamross
Copy link
Contributor

Hi @geanders and @tjmahr. Just a friendly reminder that your reviews are due on Monday, March 6.

@tjmahr
Copy link

tjmahr commented Mar 2, 2017

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 8


Review Comments

This package is a tool for generating fake data. Users can create fake data of a given type by using functions that start with ch_. For example, ch_credit_card_number() creates a fake credit card number or ch_name(n = 4, locale = "fr_FR") creates four fake French-sounding names. Applications for the package include education (creating fake datasets for students), statistical simulation, software testing, and perhaps anonymization.

Charlatan is modeled after Python's Faker library which in turn draws inspiration from PHP Faker, Ruby Faker and Perl Faker. Unfortunately, the name Faker has been taken on CRAN. (It is also worth noting that wakefield is another R package for fake-data. Wakefield provides data-frames of fake data, whereas this package mainly provides vectors of fake data.)

Overall, this package is well designed and well structured. The code is cleanly written and formatted, and it takes advantage of R idioms. This clear, expressive style made reading the code a breeze. That said, I found few bugs and inconsistencies while reviewing it.

This package uses the R6 object system to create classes of data-generators. (Presumably, using objects and methods made porting the code from Python rather straightforward.) The base class is the BaseGenerator class. All other generators inherit from this class. Therefore, I'll review the code by visiting each of the generators in turn.

BaseProvider

BaseProvider contains methods for generating random digits, random letters, and populating templates with random strings.

library(charlatan)
bp <- BaseProvider$new()
bp
#> <BaseProvider>
#>   Public:
#>     bothify: function (text = "## ??") 
#>     check_locale: function (x) 
#>     clone: function (deep = FALSE) 
#>     lexify: function (text = "????") 
#>     numerify: function (text = "###") 
#>     random_digit: function () 
#>     random_digit_not_null: function () 
#>     random_digit_not_null_or_empty: function () 
#>     random_digit_or_empty: function () 
#>     random_element: function (x) 
#>     random_int: function (min = 0, max = 9999) 
#>     random_letter: function ()

bp$random_digit()
#> [1] 0
bp$numerify("I have ## friends")
#> [1] "I have 75 friends"

This package makes extensive use of the sample() function. This function has a flawed default behavior: It expands integers into ranges. Thus, sample(10, 1) is the same as sample(1:10, 1). This package's BaseProvider$random_element() method inherits this flaw.

set.seed(22)
bp$random_element(10)
#> [1] 4

Perhaps, x[sample(seq_along(x), 1)], which samples from a sequence of item positions, would be a better implementation for this method.

The method random_int() has the signature:

str(args(bp$random_int))
#> function (min = 0, max = 9999)

It can never generate the maximum value, and this fact isn't noted anywhere.

set.seed(22)
range(replicate(bp$random_int(min = 0, max = 99), n = 100000))
#> [1]  0 98

It's not clear whether this behavior is intentional or an oversight, but it's the kind of weird detail that should be noted. On further consideration, this behavior is probably a bug because the DOI-generators uses the function with min = 0 and max = 9999.

I'm curious why the digit-or-blank generators don't just sample from a larger set. The code for random_digit_not_null_or_empty() means that half of the generated items should be "".

bp$random_digit_not_null_or_empty
#> function() {
#>       if (sample(0:1, size = 1) == 1) {
#>         sample(1:9, size = 1)
#>       } else {
#>         ''
#>       }
#>     }
#> <environment: 0x0000000008c7dd58>

# Why not sample with this instead?
bp$random_element(c(1:9, ""))
#> [1] "9"

NumericsProvider

This class generates random numbers. It demonstrates nice features of the R6-based design:

  1. grouping similar functionality into a cohesive object
  2. providing a logical extension point for other kinds of number generators
np <- NumericsProvider$new()
np
#> <NumericsProvider>
#>   Inherits from: <BaseProvider>
#>   Public:
#>     beta: function (n = 1, shape1, shape2, ncp = 0) 
#>     bothify: function (text = "## ??") 
#>     check_locale: function (x) 
#>     clone: function (deep = FALSE) 
#>     double: function (n = 1, mean = 0, sd = 1) 
#>     integer: function (n = 1, min = 1, max = 1000) 
#>     lexify: function (text = "????") 
#>     lnorm: function (n = 1, mean = 0, sd = 1) 
#>     norm: function (n = 1, mean = 0, sd = 1) 
#>     numerify: function (text = "###") 
#>     random_digit: function () 
#>     random_digit_not_null: function () 
#>     random_digit_not_null_or_empty: function () 
#>     random_digit_or_empty: function () 
#>     random_element: function (x) 
#>     random_int: function (min = 0, max = 9999) 
#>     random_letter: function () 
#>     unif: function (n = 1, min = 0, max = 9999)

This class powers the various random number generators in the package. The functions ch_integer(), ch_double(), ch_unif(), etc. actually just generate one-off NumericsProvider objects and call a method for the object.

ch_norm
#> function(n = 1, mean = 0, sd = 1) {
#>   assert(n, c('integer', 'numeric'))
#>   NumericsProvider$new()$norm(n, mean, sd)
#> }
#> <environment: namespace:charlatan>

The package rightly calls upon R's built-in number generators (rnorm, runif, rbeta, etc.) for almost all of the number generators. It uses different defaults for the uniform distribution than R, but does use the same defaults as R for the others.

args(runif)
#> function (n, min = 0, max = 1) 
#> NULL
args(NumericsProvider$new()$unif)
#> function (n = 1, min = 0, max = 9999) 
#> NULL

Here I think the package should follow R's defaults and take advantage of what the user may already know about the r* functions.

The integer() method doesn't defer to R, but it instead uses sample(). By default sample() does not sample with replacement, so NumericsProvider$new()$integer() and ch_integer() can behave in unexpected ways.

ch_integer(n = 10000, min = 1, max = 1000)
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

It is also odd that the random_int() and integer() use different techniques for generating random integers.

PersonProvider

This class implements the package's impressive (and fun) random name generator. It can generator locale-specific names. It has one method: render(). This class powers the ch_name() function.

set.seed(100)
person <- PersonProvider$new()
person$render()
#> [1] "Devan Lubowitz"
person$render()
#> [1] " Konopelski"
person$render()
#> [1] "Dr. Gladis Little MD"

ch_name(4)
#> [1] "Stanley Gottlieb"  "Ollie Ortiz MD"    "Ms. Meaghan Lesch"
#> [4] "Jordy Rohan"

As the example above shows, the names can vary in format. (Sometimes there is a blank first name --- this might be a bug.)

This function works by randomly selecting a name format and populating the format with random names/affixes.

# Formats
person$formats
#>  [1] "{{first_name_female}} {{last_names}}"                                         
#>  [2] "{{first_names_female}} {{last_names}}"                                        
#>  [3] "{{first_names_female}} {{last_names}}"                                        
#>  [4] "{{first_names_female}} {{last_names}}"                                        
#>  [5] "{{first_names_female}} {{last_names}}"                                        
#>  [6] "{{prefixes_female}} {{first_names_female}} {{last_names}}"                    
#>  [7] "{{first_names_female}} {{last_names}} {{suffixes_female}}"                    
#>  [8] "{{prefixes_female}} {{first_names_female}} {{last_names}} {{suffixes_female}}"
#>  [9] "{{first_names_male}} {{last_names}}"                                          
#> [10] "{{first_names_male}} {{last_names}}"                                          
#> [11] "{{first_names_male}} {{last_names}}"                                          
#> [12] "{{first_names_male}} {{last_names}}"                                          
#> [13] "{{first_names_male}} {{last_names}}"                                          
#> [14] "{{prefixes_male}} {{first_names_male}} {{last_names}}"                        
#> [15] "{{first_names_male}} {{last_names}} {{suffixes_male}}"                        
#> [16] "{{prefixes_male}} {{first_names_male}} {{last_names}} {{suffixes_male}}"

# Possible slot fillers
str(person$person)
#> List of 8
#>  $ first_names       : chr [1:7203] "Aaden" "Aarav" "Aaron" "Ab" ...
#>  $ first_names_male  : chr [1:3271] "Aaden" "Aarav" "Aaron" "Ab" ...
#>  $ first_names_female: chr [1:3932] "Aaliyah" "Abagail" "Abbey" "Abbie" ...
#>  $ last_names        : chr [1:473] "Abbott" "Abernathy" "Abshire" "Adams" ...
#>  $ prefixes_female   : chr [1:4] "Mrs." "Ms." "Miss" "Dr."
#>  $ prefixes_male     : chr [1:2] "Mr." "Dr."
#>  $ suffixes_female   : chr [1:4] "MD" "DDS" "PhD" "DVM"
#>  $ suffixes_male     : chr [1:11] "Jr." "Sr." "I" "II" ...

The package's locale-specificity is just a matter of selecting the appropriate formats and slot fillers for a locale. This implementation is a clean design win.

The locale-specific names are stored as variables in the package's namespace, and they are stored in R scripts. Thus, there are .R files that contain vectors with thousands of names.

r_dir <- rprojroot::find_rstudio_root_file("R/")
list.files(r_dir, "person-provider-")
#>  [1] "person-provider-bg_BG.R" "person-provider-cs_CZ.R"
#>  [3] "person-provider-de_AT.R" "person-provider-de_DE.R"
#>  [5] "person-provider-dk_DK.R" "person-provider-en_US.R"
#>  [7] "person-provider-es_ES.R" "person-provider-fa_IR.R"
#>  [9] "person-provider-fi_FI.R" "person-provider-fr_CH.R"
#> [11] "person-provider-fr_FR.R" "person-provider-hr_HR.R"
#> [13] "person-provider-it_IT.R"

I wonder if using a data/ folder would be a cleaner way to separate code and package data.

The documentation for ch_name() under-reports the supported locales. For example, Spanish is supported but this Spanish support is not documented.

There is a bug in how double last names (as in Spanish) are generated.

set.seed(103)
spanish <- PersonProvider$new(locale = "es_ES")

# Double last names are common in Spanish
spanish$formats
#>  [1] "{{first_name_male}} {{last_name}} {{last_name}}"                        
#>  [2] "{{first_name_male}} {{last_name}} {{last_name}}"                        
#>  [3] "{{first_name_male}} {{last_name}} {{last_name}}"                        
#>  [4] "{{first_name_male}} {{last_name}} {{last_name}}"                        
#>  [5] "{{first_name_male}} {{last_name}} {{last_name}}"                        
#>  [6] "{{first_name_male}} {{last_name}} {{last_name}}"                        
#>  [7] "{{first_name_male}} {{last_name}}"                                      
#>  [8] "{{first_name_male}} {{prefix}} {{last_name}}"                           
#>  [9] "{{first_name_male}} {{last_name}}-{{last_name}}"                        
#> [10] "{{first_name_male}} {{first_name_male}} {{last_name}} {{last_name}}"    
#> [11] "{{first_name_female}} {{last_name}} {{last_name}}"                      
#> [12] "{{first_name_female}} {{last_name}} {{last_name}}"                      
#> [13] "{{first_name_female}} {{last_name}} {{last_name}}"                      
#> [14] "{{first_name_female}} {{last_name}} {{last_name}}"                      
#> [15] "{{first_name_female}} {{last_name}} {{last_name}}"                      
#> [16] "{{first_name_female}} {{last_name}} {{last_name}}"                      
#> [17] "{{first_name_female}} {{last_name}}"                                    
#> [18] "{{first_name_female}} {{prefix}} {{last_name}}"                         
#> [19] "{{first_name_female}} {{last_name}}-{{last_name}}"                      
#> [20] "{{first_name_female}} {{first_name_female}} {{last_name}} {{last_name}}"

# The two last names should be different
spanish$render()
#> [1] "Jose Antonio Lloret Lloret"
spanish$render()
#> [1] "Javier Gras Gras"

This behavior is a problem with whisker::render(). The following code tweaks the package internals for debugging. It generates four different last names but whisker recycles the first.

pluck_names <- charlatan:::pluck_names

fmt <- "{{last_name}} {{last_name}} {{last_name}} {{last_name}}"
dat <- lapply(
  spanish$person[pluck_names(fmt, spanish$person)], 
  sample, 
  size = 1)

str(dat)
#> List of 4
#>  $ last_name: chr "Alegre"
#>  $ last_name: chr "Delgado"
#>  $ last_name: chr "Heras"
#>  $ last_name: chr "Cordero"
whisker::whisker.render(fmt, data = dat)
#> [1] "Alegre Alegre Alegre Alegre"

ColorProvider

This class generates colors.

Its hex-color generator users the random_int() method to choose a number between 1 and 16777215. The above noted behavior means it will never generate #ffffff, and the lower bound of 1 prevents #000000 from being generated too.

cp <- ColorProvider$new()
cp$color_name()
#> [1] "PowderBlue"
cp$safe_color_name()
#> [1] "fuchsia"

# Some locale sensitivity for color names, but not safe ones
cp2 <- ColorProvider$new(locale = "uk_UA")
cp2$color_name()
#> [1] "<U+0424><U+0456><U+043E><U+043B><U+0435><U+0442><U+043E><U+0432><U+0438><U+0439>"
cp2$safe_color_name()
#> [1] "maroon"

set.seed(26)
cp$hex_color()
#> [1] "#43f66"

It pads zeros onto strings less than 6 characters to create the familiar six- digit format. But padding is done on the right side, so it will never generate "#0000ff". It also counts character length after appending the the pound sign, so the above example shows an incorrect hex color.

I think using sprintf("#%06x", ...) would fix these problems. Actually, I discovered grDevices::rgb() while reading about the colors() for a later comment. That function should power the implementation.

rgb(0, 0, 0, maxColorValue = 255)
#> [1] "#000000"
rgb(255, 255, 255, maxColorValue = 255)
#> [1] "#FFFFFF"

sample_col <- function() sample(0:255, 1)
rgb(sample_col(), sample_col(), sample_col(), maxColorValue = 255)
#> [1] "#4ADFCC"

A similar strategy underlies safe_hex_color(), so the method has the same flaws. Plus, something mysterious happens sometimes:

set.seed(26)
cp$safe_hex_color()
#> [1] "#4400#4400"

For this method I think generating three separate hex digits and duplicating them and concatenating them would be a cleaner implementation. I also find conflicting information about which colors are "safe" --- is this method's definition of safe colors standard?

rbg_color() calls the hex_color() method, so it will inherit that function's bugs. Plus, sometimes I get an error.

set.seed(133)
colors <- replicate(n = 1000, cp$rgb_color())
#> Error: precBits >= 2 are not all TRUE

R has its own family of accepted color names, so those might a natural extension point.

sample(colors(), 3)
#> [1] "springgreen" "gray40"      "gray32"

CoordinateProvider

CoordinateProvider does what it says. It powers ch_lat(), ch_lon(), ch_position(). Notably, it does not inherit from the BaseProvider class.

cp <- CoordinateProvider$new()
CoordinateProvider
#> <CoordinateProvider> object generator
#>   Public:
#>     lon: function () 
#>     lat: function () 
#>     position: function (bbox = NULL) 
#>     clone: function (deep = FALSE) 
#>   Private:
#>     rnd: function () 
#>     coord_in_bbbox: function (bbox) 
#>   Parent env: <environment: namespace:charlatan>
#>   Locked objects: TRUE
#>   Locked class: FALSE
#>   Portable: TRUE

cp$lat()
#> [1] -21.11343
cp$lon()
#> [1] 123.7104
cp$position()
#> [1] 72.99215 12.02990

It can generate coordinates within a boundary box. The box is not checked, so users can get invalid coordinates as a result.

# Specify a bad box
cp$position(c(-12000, 0, 0, 30000))
#> [1] -9588.538 16289.230

CreditCardProvider

This class looks good but there is some commented out Python/R code in place. It looks like the class produces legitimate credit card numbers (i.e., they pass an appropriate checksum that real credit card numbers have). This detail highlights an important use case for this package: Generating fake data for testing code, as one could test some credit-card-validating function with this package.

DateTimeProvider

This class powers ch_timezone(), ch_unix_time(), and ch_date_time(). It is not documented that the date-times are integers converted to POSIX times, so they are all after 1970. The file also features some commented out Python code.

DateTimeProvider$new()$unix_time()
#> [1] 530920243
DateTimeProvider$new()$date_time()
#> [1] "2016-07-26 14:40:45 CDT"

There is a typo in century() method, so it always errors.

body(DateTimeProvider$new()$century)
#> {
#>     super$random_element(self$cenuries)
#> }

TaxonomyProvider

This class randomly samples from prepackaged genus/species names. The basis for the random genus/species names is well detailed in the hidden ?TaxonomyProvider page but not included in the user-facing ?taxonomy page. Now that Roxygen2 supports the new @inherit and @inheritSection fun title directives, I suggest that the user-facing page include this documentation.

set.seed(22)
TaxonomyProvider$new()$genus()
#> [1] "Dialium"
TaxonomyProvider$new()$epithet()
#> [1] "kovacevii"

# A species just genus() + epithet()
set.seed(22)
TaxonomyProvider$new()$species()
#> [1] "Dialium kovacevii"

Do-one-thing generators

CurrencyProvider

This class randomly samples a vector of currency abbreviations.

DOIProvider

This class's main job is to randomly select a DOI format and populate the format with characters/integers. This class does not use its inherited random_int() method to generate the integers, but it should.

JobProvider

The class produces occupation titles using the same locale-specificity covered earlier. It just randomly samples occupations from a vector with each locale's occupation names.

set.seed(36)
JobProvider$new(locale = "fr_FR")$render()
#> [1] "Intégrateur web"
ch_job(n = 3)
#> [1] "Pharmacist, community"         "Restaurant manager, fast food"
#> [3] "Designer, textile"
PhoneNumberProvider

This class randomly selects a format and then populates that format with digits.

head(PhoneNumberProvider$new()$formats, 3)
#> [1] "+##(#)##########" "+##(#)##########" "0##########"
PhoneNumberProvider$new()$render()
#> [1] "1-094-032-6500x4515"
ch_phone_number(4)
#> [1] "333-862-9754"         "(681)356-2399x715"    "1-612-259-7099x07938"
#> [4] "720.912.3811"
SequenceProvider

This class generates gene sequences by sampling letters and concatenating them. I feel the user-facing function should be named ch_gene_sequence(), not ch_sequence()

FraudsterClient

This class wraps all the ch_ functions into a single object for general-purpose fake data generation in a given locale. Because locales are not uniformly supported, this can lead to errors. Also, the name() method ignores the locale even though it can support different locales.

y <- fraudster(locale = "fr_FR")
y
#> <fraudster>
#>   locale: fr_FR

y$job()
#> [1] "Aquaculteur"
y$color_name()
#> Error in sample.int(length(x), size, replace, prob): invalid first argument
y$name()
#> [1] "Teddie McCullough III"

Unfinished providers

I didn't closely review the code for these classes, as they appear to be unfinished and are not user-facing like the other ones. I am documenting them here for completeness.

The package contains exported, in-progress code for generating addresses with AddressProvider. There is no ch_address().

It also contains the company_provider class which is exported but not documented and not linked to any ch_ functions. It has a different R6 design and naming scheme than other classes. This one looks like a lot of fun.

# I think the same whisker::render() bug is happening here
set.seed(27)
company_provider()$company()
#> [1] "Boehm, Boehm and Boehm"
company_provider()$company()
#> [1] "Hansen, Hansen and Hansen"
company_provider()$company()
#> [1] "Jacobi Inc,and Sons,LLC,Group,PLC,Ltd"

company_provider()$bs()
#> [1] "optimize clicks-and-mortar platforms"
company_provider()$bs()
#> [1] "reinvent extensible applications"

company_provider()$catch_phrase()
#> [1] "Object-based optimal structure"
company_provider()$catch_phrase()
#> [1] "Seamless zero-defect archive"

MissingProvider provides a method to inject NA values into a vector. There is no user-facing function (something like ch_missing()) for this class). Nevertheless, it's odd that the n missing values it generates for a vector overwrite the first n values.

set.seed(10)
MissingDataProvider$new()$make_missing(letters)
#>  [1] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  "o" "p" "q"
#> [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

I feel like it should randomly determine n, sample() n position indices, and make the elements in those positions NA so that the NAs are scattered throughout the vector.

ch_generate()

This function creates a data-frame of (en_US) fake data.

ch_generate("name", "job", n = 4)
#> # A tibble: 4 × 2
#>             name                               job
#>            <chr>                             <chr>
#> 1 Jaelynn Pouros                     Meteorologist
#> 2  Dalia Goldner Environmental health practitioner
#> 3  Loyce Keebler               Geologist, wellsite
#> 4   Kellen Bruen            Amenity horticulturist

The standard error for a bad field name is:

ch_generate("badname", n = 2)
#> Error: column name must be selected from allowed options, see docs

A better error might say, see ?ch_generate or print out the all_choices vector.

ch_generate("hex_color") does not work, although the documentations claims to support it.

The documentation for the n argument says "any (integer) number of things to get, 1 to inifinity". I would recommend just saying any non-negative integer. This applies to every documentation page as well.

The documentation for ... should also say that name, job, and phone_number columns are created by default if nothing is specified.

Other comments

parse_eval() should move to zzz.R, where the rest of the utility functions live. Personally, I would avoid using this function by structuring all the data into nested lists so that items can be retrieved with data[[locale_name]] instead of using parse_eval() to create-then-evaluate a variable name.

Some functions have return(); some do not.

The random_digit_not_null() method in the BaseGenerator is better named random_digit_not_zero().

The unit test for colors and safe colors should check that the generated colors meets the color and safe-color formats.

There are only unit tests for the colors, coordinates, credit cards, currency and jobs. I would suggest adding tests for the functions I've described bugs in.

@noamross
Copy link
Contributor

noamross commented Mar 2, 2017

Thanks for the excellent and thorough review, @tjmahr!

@geanders
Copy link

geanders commented Mar 7, 2017

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • [x ] As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • [x ] Installation instructions: for the development version of package and any non-standard dependencies in README
  • [ x] Vignette(s) demonstrating major functionality that runs successfully locally
  • [x ] Function Documentation: for all exported functions in R help
  • [x ] Examples for all exported functions in R Help that run successfully locally
  • [x ] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Functionality

  • [x ] Installation: Installation succeeds as documented.
  • [ x] Functionality: Any functional claims of the software been confirmed.
  • [ x] Performance: Any performance claims of the software been confirmed.
  • [ x] Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • [ x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 4


Review Comments

This package allows a user to create synthetic data of names, addresses, jobs, phone numbers, credit card numbers, taxonomies, etc., in several different languages. The package offers two interfaces, either through fraudster or through the ch_* family of functions. I can imagine this package being very useful for R package maintainers who are trying to develop R packages that wrap open data APIs, as it would allow the package maintainer to test package functions against data that might be input from the API.

There are cases where unreasonable values will be generated by the functions in this package (e.g., phone numbers with illegal area codes, weird names for people, etc.), which is to some extent unavoidable when generating synthetic data. Depending on the intended applications of this package, this may or may not be important. It would be useful to get an idea of more of the intended uses of this package to help assess if any of these unreasonable values (some of which I point out in comments below) would be likely to introduce problems when using this package.

The package name is great (and memorable). Coding style is excellent throughout. The vignette rendered without error locally and all examples also ran without error.

Functions

  • The use of a common prefix for all the ch_ functions is very useful and will make it easier for users to use command completion to search for a function that want to use from the package but can't remember the name of.
  • The ch_* functions are very clearly named.
  • The ability to specify a locale when generating some types of data like names is particularly appealing. However, this will be much more helpful is there is a list provided somewhere (vignette maybe?) linking the possible choices for the locale code to use in this option to a location name. Also, I think that there is a typo in one of the options listed for local: I tried to google "hr_FR" (listed as an option for ch_name and ch_job) and couldn't come up with anything useful. Should this have been "fr_FR", which is already listed?
  • The ch_generate function, which allows a user to create a dataframe with columns generated using other ch_* functions, is a very useful addition. However, the line # ch_generate('job', 'color_name', 'stuff') in the examples of the helpfile is confusing. Could this be removed?
  • The package functions generally seem to have good checks of user inputs and helpful error messages if a user inputs something incorrect.
  • Many of the ch_* functions used for columns allowed in ch_generate can be generated for different locales, but ch_generate itself does not allow a locale argument. Could it? (I would not mark this as a major concern, since a user could create a dataframe using several calls to separate ch_* functions.)
  • Under the es_ES locale setting, it looks like many names get the same last name twice. Should this instead be sampling two different names from the last names vector and then putting them together? For example, the function currently gives values like this:
ch_name(n = 3, locale = "es_ES")
[1] "Nieves Nieves Diaz Diaz" "Sara Girón Girón"        "Clara Villena Villena"  
  • This may not be terribly important for the applications this package would be used for, but the method for names in "fr_FR" of adding random prefixes to some names is going to result in some funky last names in that language ("de la Dufour", "du Olivier", "Le Lemonnier"). Also, why do none of the French last names have accents in them, while the first names do? Will this matter in anticipated applications?
  • For some of these locales (for example, "fr_FR"), many job titles would be different depending on whether the worker is male or female. In the applications you envision for this package, will it be a problem that all the jobs generated with ch_job are for male workers?
  • Could the phone number generator include locale as an option?
  • Are there any advantages to using the functions enabled by numerics-provider.R (e.g. ch_norm, ch_unif, ch_lnorm, ch_beta) beyond using the base R analogs (rnorm, etc.)? If there are important advantages, it would be helpful if they were explained in the helpfile for these packages. If not, are they necessary to include in this package?
  • It may be useful to allow the user to set ranges for ch_date_time, similar to how a bounding box can be set for ch_position.
  • There are a number of files in the R subdirectory of the package for creating company names, but there doesn't seem to be a ch_company function that uses that. Is the company provider meant to be used anywhere by a package user? If not, could this code be removed from the package, or moved to a development branch?

Vignette

  • The formatting of the vignette, with a linked table of contents to the left, is very attractive. However, the text of the vignette is clearly still unfinished. Given that the intended use of the package is pretty straightforward, the vignette doesn't need to be very detailed, but it would be helpful if it included a short paragraph at the beginning introducing the package, for example, as well as a paragraph explaining the differences between using fraudster versus the ch_* family of functions (even just adding the information currently in the helpfile for the package would be useful).
  • It would be very useful to have more details, either in the vignette or in the helpfiles, about how the sythetic data is created and how much variety is available for each category. In particular, how were the values picked that were used in generating vectors for names, jobs, etc., for each locale (e.g., in person-provider-en_US.R)? Are these representative of common names in each locale? The details provided in the helpfile for TaxonomyProvider are very useful and the perfect level of detail-- something similar for other providers, and for the ch_* functions that use them, would be great.
  • For ch_job, it looks like the "en_US" locale values are more consistent with a "en_GB" locale. British spellings are used in these job titles ("maths", "centre"). Does this matter?
  • "available" is misspelled in the error message "not in set of avaiable locales".
  • For the "en_US" last names, at least, it doesn't look like there are any two-word last names (e.g., "von Trapp"). If this package is used to create test data, I think some cases of two-word last names could be useful.
  • Again, if this package is meant, at least as one use, to create "messy" data to use for function tests, it could be useful to introduce some variation in honorifics and degree listings (e.g., both "Mr" and "Mr.", both "PhD" and "Ph.D.") that might show up in real data.

Helpfiles

  • Check for typos throughout.
  • Based on the helpfiles for the ch_* functions, the n parameter can be "to infinity". I found this wording a bit awkard, as the the functions (expectedly) return an error if run with n set to Inf: Error in integer(n) : vector size cannot be infinite. Perhaps reword this in the helpfile to something like "any positive integer"?
  • In the helpfile for ch_generate, it would be useful to include a note on the columns that are included by default (name, job, and phone number). Also, there's currently text that suggests to "See Details", but no Details section.

@sckott
Copy link
Contributor Author

sckott commented Mar 7, 2017

@geanders thanks so much for the review! will respond soon

@sckott
Copy link
Contributor Author

sckott commented Mar 9, 2017

response to @tjmahr - thx much for your review!


BaseProvider

This package makes extensive use of the sample() function. This function has a flawed default behavior: It expands integers into ranges. Thus, sample(10, 1) is the same as sample(1:10, 1). This package's BaseProvider$random_element() method inherits this flaw. Perhaps, x[sample(seq_along(x), 1)], which samples from a sequence of item positions, would be a better implementation for this method.

see ropensci/charlatan#12 - looking into it

The method random_int() has the signature ... It can never generate the maximum value, and this fact isn't noted anywhere.

see ropensci/charlatan#13 - good catch

I'm curious why the digit-or-blank generators don't just sample from a larger set. The code for random_digit_not_null_or_empty() means that half of the generated items should be "". Why not sample with this instead? ...

see ropensci/charlatan#14 - good catch, will fix

NumericsProvider

The integer() method doesn't defer to R, but it instead uses sample(). By default sample() does not sample with replacement, so NumericsProvider$new()$integer() and ch_integer() can behave in unexpected ways.

This is b/c i was following what faker lib was doing - it's based on picking a random integer from a range given bay a user - not from some distribution

It is also odd that the random_int() and integer() use different techniques for generating random integers.

see above comment

PersonProvider

As the example above shows, the names can vary in format. (Sometimes there is a blank first name --- this might be a bug.)

see ropensci/charlatan#15 looking into it

The locale-specific names are stored as variables in the package's namespace, and they are stored in R scripts. Thus, there are .R files that contain vectors with thousands of names. I wonder if using a data/ folder would be a cleaner way to separate code and package data.

I've thought about storing as data - as binary I don't like as the data may need to be updated, changed - so possibly data-raw instead - i can look into it

The documentation for ch_name() under-reports the supported locales. For example, Spanish is supported but this Spanish support is not documented.

right, a symptom of submitting this for review early in development :) will fix

There is a bug in how double last names (as in Spanish) are generated.

see ropensci/charlatan#16 thanks, fixing

ColorProvider

Its hex-color generator users the random_int() method to choose a number between 1 and 16777215. The above noted behavior means it will never generate #ffffff, and the lower bound of 1 prevents #000000 from being generated too.

will fix see ropensci/charlatan#18

Some locale sensitivity for color names, but not safe ones

thanks, will look into it, see ropensci/charlatan#17

It pads zeros onto strings less than 6 characters to create the familiar six- digit format. But padding is done on the right side, so it will never generate "#0000ff". It also counts character length after appending the the pound sign, so the above example shows an incorrect hex color. I think using sprintf("#%06x", ...) would fix these problems. Actually, I discovered grDevices::rgb() while reading about the colors() for a later comment. That function should power the implementation. A similar strategy underlies safe_hex_color(), so the method has the same flaws. Plus, something mysterious happens sometimes: ...

thanks, for all those see ropensci/charlatan#19

CoordinateProvider

It can generate coordinates within a boundary box. The box is not checked, so users can get invalid coordinates as a result.

see ropensci/charlatan#20

CreditCardProvider

... there is some commented out Python/R code in place.

Right, just not done yet

DateTimeProvider

This class powers ch_timezone(), ch_unix_time(), and ch_date_time(). It is not documented that the date-times are integers converted to POSIX times, so they are all after 1970.

will fix that, see ropensci/charlatan#21

The file also features some commented out Python code.

Right, just not done yet

There is a typo in century() method, so it always errors.

will fix, see ropensci/charlatan#22

TaxonomyProvider

The basis for the random genus/species names is well detailed in the hidden ?TaxonomyProvider page but not included in the user-facing ?taxonomy page. Now that Roxygen2 supports the new @inherit and @inheritSection fun title directives, I suggest that the user-facing page include this documentation.

will do, see ropensci/charlatan#23

Do-one-thing generators

DOIProvider

This class does not use its inherited random_int() method to generate the integers, but it should.

good idea, see ropensci/charlatan#24

SequenceProvider

I feel the user-facing function should be named ch_gene_sequence(), not ch_sequence()

probably, see ropensci/charlatan#25

FraudsterClient

Because locales are not uniformly supported, this can lead to errors.

right, again, a symptom of submitting early in dev for review, will try to fail better on that ropensci/charlatan#26

Also, the name() method ignores the locale even though it can support different locales.

good catch, see ropensci/charlatan#26

Unfinished providers

I didn't closely review the code for these classes, as they appear to be unfinished and are not user-facing like the other ones. I am documenting them here for completeness.

right, again symptom of submitting for review earlyish

company_provider ... has a different R6 design and naming scheme than other classes

will fix that to make sure same format as others, see ropensci/charlatan#27

MissingProvider provides a method to inject NA values into a vector. There is no user-facing function (something like ch_missing()) for this class).

will add that, see ropensci/charlatan#28

it's odd that the n missing values it generates for a vector overwrite the first n values. I feel like it should randomly determine n, sample() n position indices, and make the elements in those positions NA so that the NAs are scattered throughout the vector.

see ropensci/charlatan#29

ch_generate()

The standard error for a bad field name is ... A better error might say, see ?ch_generate or print out the all_choices vector.

will point user to ?ch_generate - the list of options is short right now, but i think it will get quite long - in which case cumbersome to print all out, but maybe I will anyway ropensci/charlatan#30

ch_generate("hex_color") does not work, although the documentations claims to support it. The documentation for the n argument says "any (integer) number of things to get, 1 to inifinity". I would recommend just saying any non-negative integer. This applies to every documentation page as well. The documentation for ... should also say that name, job, and phone_number columns are created by default if nothing is specified.

will fix, ropensci/charlatan#31

Other comments

parse_eval() should move to zzz.R, where the rest of the utility functions live.

done

Personally, I would avoid using this function by structuring all the data into nested lists so that items can be retrieved with data[[locale_name]] instead of using parse_eval() to create-then-evaluate a variable name.

why?

The random_digit_not_null() method in the BaseGenerator is better named random_digit_not_zero().

good idea, done

The unit test for colors and safe colors should check that the generated colors meets the color and safe-color formats.

thanks, done

There are only unit tests for the colors, coordinates, credit cards, currency and jobs. I would suggest adding tests for the functions I've described bugs in.

definitely! meaning to fill out tests more, will work on that, ropensci/charlatan#2

@sckott
Copy link
Contributor Author

sckott commented Mar 9, 2017

response to @geanders - thx much for your review!


There are cases where unreasonable values will be generated by the functions in this package (e.g., phone numbers with illegal area codes, weird names for people, etc.), which is to some extent unavoidable when generating synthetic data.

@geanders What do you mean by illegal? e.g., with area codes, do you mean it generates an area code that doesn't exist? the rules in the pkg for PhoneNumberProvider are only for the formats the string takes, not what combinations are in the real set. I don't think our goal here is to create real phone numbers or real anything else. I think that would in fact not be a good idea. However, cases where "illegal" values are created with respect to not following a specific format then we'd want to fix that.

Depending on the intended applications of this package, this may or may not be important. It would be useful to get an idea of more of the intended uses of this package to help assess if any of these unreasonable values (some of which I point out in comments below) would be likely to introduce problems when using this package.

I don't have any specific applications in mind. I think the pkg could be used in A LOT of different use cases: generating datasets in another pkg, datasets for teaching, pkg test suites, generating datasets with a certain structure for a very specific use case, etc. I'll add more to the docs about what possible use cases are ropensci/charlatan#32

Functions

The ability to specify a locale when generating some types of data like names is particularly appealing. However, this will be much more helpful is there is a list provided somewhere (vignette maybe?) linking the possible choices for the locale code to use in this option to a location name.

Also, I think that there is a typo in one of the options listed for local: I tried to google "hr_FR" (listed as an option for ch_name and ch_job) and couldn't come up with anything useful. Should this have been "fr_FR", which is already listed?

sorry, that should be hr_HR (croation, http://www.localeplanet.com/icu/hr-HR/index.html), fixed in both

The ch_generate function, which allows a user to create a dataframe with columns generated using other ch_* functions, is a very useful addition. However, the line # ch_generate('job', 'color_name', 'stuff') in the examples of the helpfile is confusing. Could this be removed?

sure, removed (commented out eg, to show what happens when there's an invalid variable input

Many of the ch_* functions used for columns allowed in ch_generate can be generated for different locales, but ch_generate itself does not allow a locale argument. Could it? (I would not mark this as a major concern, since a user could create a dataframe using several calls to separate ch_* functions.)

sounds good to add it in there, see ropensci/charlatan#33

Under the es_ES locale setting, it looks like many names get the same last name twice. Should this instead be sampling two different names from the last names vector and then putting them together?

Right, will fix, see ropensci/charlatan#34

This may not be terribly important for the applications this package would be used for, but the method for names in "fr_FR" of adding random prefixes to some names is going to result in some funky last names in that language ("de la Dufour", "du Olivier", "Le Lemonnier"). Also, why do none of the French last names have accents in them, while the first names do? Will this matter in anticipated applications?

I'll have a look at this, but i'm only following what the Python library did, see https://github.com/joke2k/faker/blob/master/faker/providers/person/fr_FR/__init__.py I'll see why they don't have an accents in last names - see ropensci/charlatan#35

For some of these locales (for example, "fr_FR"), many job titles would be different depending on whether the worker is male or female. In the applications you envision for this package, will it be a problem that all the jobs generated with ch_job are for male workers?

@geanders What are examples of different titles for female vs. male?

Could the phone number generator include locale as an option?

yep, just haven't gotten to it yet, as you can see at https://github.com/joke2k/faker/tree/master/faker/providers there's a lot in there I haven't gotten to yet in this pkg, see ropensci/charlatan#36

Are there any advantages to using the functions enabled by numerics-provider.R (e.g. ch_norm, ch_unif, ch_lnorm, ch_beta) beyond using the base R analogs (rnorm, etc.)? If there are important advantages, it would be helpful if they were explained in the helpfile for these packages. If not, are they necessary to include in this package?

these were started from this discussion ropensci/charlatan#11 - yes, they aren't adding much value, but i think there's more to do with them

It may be useful to allow the user to set ranges for ch_date_time, similar to how a bounding box can be set for ch_position.

right, datetime provider is very incomplete - there's many methods not added yet - including date times between a range of dates, see https://github.com/joke2k/faker/blob/master/faker/providers/date_time/__init__.py#L362 - so it will be added in time

There are a number of files in the R subdirectory of the package for creating company names, but there doesn't seem to be a ch_company function that uses that. Is the company provider meant to be used anywhere by a package user? If not, could this code be removed from the package, or moved to a development branch?

i will add ch_company - that bit wasn't quite done yet - also related issue above in that I need to rework CompanyProvider to same R6 format as others

Vignette

However, the text of the vignette is clearly still unfinished. Given that the intended use of the package is pretty straightforward, the vignette doesn't need to be very detailed, but it would be helpful if it included a short paragraph at the beginning introducing the package, for example, as well as a paragraph explaining the differences between using fraudster versus the ch_* family of functions (even just adding the information currently in the helpfile for the package would be useful).

Thanks for the feedback on this, will improve it and finish off ropensci/charlatan#37

It would be very useful to have more details, either in the vignette or in the helpfiles, about how the sythetic data is created and how much variety is available for each category. In particular, how were the values picked that were used in generating vectors for names, jobs, etc., for each locale (e.g., in person-provider-en_US.R)? Are these representative of common names in each locale? The details provided in the helpfile for TaxonomyProvider are very useful and the perfect level of detail-- something similar for other providers, and for the ch_* functions that use them, would be great.

Good points. Will add to docs an overview on this. Much of the decisions were made in the python client, but I can see if they have any info on that.

For ch_job, it looks like the "en_US" locale values are more consistent with a "en_GB" locale. British spellings are used in these job titles ("maths", "centre"). Does this matter?

I imagine b/c the maintainer is from europe https://github.com/joke2k (though not GB)

"available" is misspelled in the error message "not in set of avaiable locales".

thx, fixed.

For the "en_US" last names, at least, it doesn't look like there are any two-word last names (e.g., "von Trapp"). If this package is used to create test data, I think some cases of two-word last names could be useful.

issued opened, ropensci/charlatan#38

Again, if this package is meant, at least as one use, to create "messy" data to use for function tests, it could be useful to introduce some variation in honorifics and degree listings (e.g., both "Mr" and "Mr.", both "PhD" and "Ph.D.") that might show up in real data.

Good idea. ropensci/charlatan#39 maybe a global toggle (for certain variables where appropriate) would be good b/c don't want messy on by default

Helpfiles

Check for typos throughout.

will do, i'm a vary baad spellller

Based on the helpfiles for the ch_* functions, the n parameter can be "to infinity". I found this wording a bit awkard, as the the functions (expectedly) return an error if run with n set to Inf: Error in integer(n) : vector size cannot be infinite. Perhaps reword this in the helpfile to something like "any positive integer"?

right, fixed.

In the helpfile for ch_generate, it would be useful to include a note on the columns that are included by default (name, job, and phone number). Also, there's currently text that suggests to "See Details", but no Details section.

fixed, thanks

sckott added a commit to ropensci/charlatan that referenced this issue Mar 9, 2017
update man files for new roxygen2 version, require roxygen2 (>= 6.0.1)
bump dev version
@tjmahr
Copy link

tjmahr commented Mar 10, 2017

NumericsProvider

The integer() method doesn't defer to R, but it instead uses sample(). By default sample() does not sample with replacement, so NumericsProvider$new()$integer() and ch_integer() can behave in unexpected ways.

This is b/c i was following what faker lib was doing - it's based on picking a random integer from a range given bay a user - not from some distribution

Am I allowed to elaborate comments in a review? Anyway, this implementation of ch_integer() behaves unlike ch_double() or ch_unif() because there's a limit to the number of integers that can be generated when sampling without replacement.

charlatan::ch_integer(100, min = 1, max = 10)
#> Error in sample.int(length(x), size, replace, prob) : 
#>   cannot take a sample larger than the population when 'replace = FALSE' 

This makes it behave in unexpected ways compared to its sibling functions. In this way, it's not really a random integer generator, although it's lumped in with those other number generators, but a function that shuffles a pile of numbers and draws from that shuffled pile.

@noamross
Copy link
Contributor

@tjmahr Certainly! We encourage ongoing conversations throughout the review process.

@geanders
Copy link

@geanders What do you mean by illegal? e.g., with area codes, do you mean it generates an area code that doesn't exist? the rules in the pkg for PhoneNumberProvider are only for the formats the string takes, not what combinations are in the real set. I don't think our goal here is to create real phone numbers or real anything else. I think that would in fact not be a good idea. However, cases where "illegal" values are created with respect to not following a specific format then we'd want to fix that.

For this, I mean certain number combinations that you'd never see (like an area code of 000). I agree that a package shouldn't create "real" phone numbers, but clearly the package is trying to create things that look realistic-- otherwise the package could just generate random character strings, with locale-specific encoding, for the names and job titles, right? Do you have links to any examples of the Python version in use? (or I could not be lazy and just google that myself.) I think my main question is how realistic everything generated needs to be to be useful in anticipated applications. If it doesn't need to be very realistic, just recognizable as a try at names, job titles, etc., I think these little things like occasionally weird names are fine. If they need to be plausibly realistic most of the time, it may be worth the effort to try to catch a few more of these issues.

@geanders What are examples of different titles for female vs. male?

accompagnateur / accompagnatrice, acheteur / acheteuse, administrateur / administratrice, etc. Well over half of the jobs listed have different titles for masculine / feminine. Again, this question would come down to how realistic output data needs to look, which will depend on anticipated uses of the package. As with the comment on phone numbers, perhaps this is not a problem if the data just needs to be recognizable as a dataset of names, jobs, etc., rather than one that could pass as a real dataset.

Otherwise, it sounds like you've got good plans that respond to the rest of my comments.

@sckott
Copy link
Contributor Author

sckott commented Mar 10, 2017

@tjmahr

thx for your comment. maybe this is a better fit for the numerics set of fxns https://github.com/ropenscilabs/charlatan/blob/master/R/base-provider.R#L74-L76

@sckott
Copy link
Contributor Author

sckott commented Mar 10, 2017

@geanders

thanks for your comments.

Opened an issue ropensci/charlatan#40 to keep this in mind and discuss about it with community, etc. The short answer is I don't know the answer right now. I sort of want to keep in line with what the Python library is doing so there's a large amount of similarity in what the two libs do. But if there's good reason to deviate and most people want that we can change - or if we can have a toggle to switch on or off more realistic values

thanks for the examples for female/male job titles. I'll have a look at that and fix - i don't want these to be just male, i just wasn't familiar enough with french to know the difference

@noamross
Copy link
Contributor

noamross commented May 9, 2017

Hi @sckott, just checking in on the status of this package. Are you waiting for any answers to your questions to reviewers?

@sckott
Copy link
Contributor Author

sckott commented May 9, 2017

no, not waiting on answers. have addressed many issues so far, but a few more to go (label: review) https://github.com/ropenscilabs/charlatan/issues?q=is%3Aissue+label%3Areview+is%3Aopen

@sckott
Copy link
Contributor Author

sckott commented May 31, 2017

@tjmahr @geanders Okay, sorry for the long delay.

I've now dealt with nearly all of your points - see issues with review label https://github.com/ropenscilabs/charlatan/milestone/1?closed=1 - associated with the v0.1 milestone - for my first cran push of the pkg

I've moved a few of your issues/or parts of your issues to the next milestone https://github.com/ropenscilabs/charlatan/milestone/2 - let me know if you think they're important enough we need to deal with them before going to CRAN.

and @noamross let me know if you see anything I should fix/make better

@noamross
Copy link
Contributor

noamross commented Jun 4, 2017

I am OK with the items moved to the 0.2 milestone if the reviewers are.

I note that test coverage is still modest, including in some areas (colors), where your reviewers identified bugs or inconsistencies. Can you bring that up?

@tjmahr and @geanders, please take a look at the updated version and let us know if @sckott has addressed your comments to your satisfaction.

@geanders
Copy link

geanders commented Jun 5, 2017

This has satisfactorily addressed my comments, and I'm ok with the items mover to the 0.2 milestones, as well.

@sckott
Copy link
Contributor Author

sckott commented Jun 5, 2017

Will roll out more tests

@sckott
Copy link
Contributor Author

sckott commented Jun 9, 2017

Okay, I've added more tests - now up to 91% coverage (yes, caveat that that doesn't necessarily mean every import. thing is tested) https://github.com/ropenscilabs/charlatan#charlatan

anything else? @noamross

@tjmahr
Copy link

tjmahr commented Jun 10, 2017

I read over the closed issues and the code changes. I did not give the package the intense kind of code review and interactive testing as I did for the first review, but I made two pull requests (ropensci/charlatan#43 ropensci/charlatan#47) where the code didn't look quite right. I think the package is ready for users, but the maintainers should be ready to make quick fixes and supporting unit tests if a generator demonstrates some systematic error.

@sckott
Copy link
Contributor Author

sckott commented Jun 10, 2017

thanks for the PR @tjmahr

ready to make quick fixes indeed!

@noamross
Copy link
Contributor

Thank you for your responses, @geanders and @tjmahr!

@sckott: This looks good as soon as you finish addressing ropensci/charlatan#47 .

@sckott
Copy link
Contributor Author

sckott commented Jun 12, 2017

@noamross okay, we're all set on that PR

@sckott
Copy link
Contributor Author

sckott commented Jun 15, 2017

all good @noamross ?

@noamross
Copy link
Contributor

Approved! @geanders and @geanders for your thorough work throughout this review. Good work @sckott, feel free to transfer to the ropensci repository, and consider making a post for the blog.

@sckott
Copy link
Contributor Author

sckott commented Jun 15, 2017

thanks everyone for your work that helped to improve the package!

@sckott
Copy link
Contributor Author

sckott commented Jun 16, 2017

@noamross draft post ropensci/roweb#310

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants