Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added foundation for DE_DE data #75

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from
Draft

Conversation

xoryouyou
Copy link

Hi @cksac,
I've been using your library thanks to LukeMathWalker/zero-to-production

Since I've had the need to generate a bit of german fake data I took a closer look into the project.

I added a de_de.rs module to provide a few basic attributes like State,Company Type,Names and so on from the official government ID test card data

https://persosim.secunet.com/fileadmin/user_upload/PersoSim_Personalisierungsdaten_eID_UB.pdf

It's working with the check_determinism test as far as I can tell.

To verify the data that gets generated I added a de_manual_test.rs just so see some generic company struct on the terminal.

Run via cargo test de_manual_test -- --nocapture it produces some results like

Company { name: "Schuster e.V.", street: "Hillebrandtgang", zipcode: "29346", city: "Musterfrau land", state: "Berlin" }

So far so good. 😄

But that is the reason I opened the PR.

"Musterfrau land" is not a valid city name in german.

Most city names would consist of two nouns or names and a suffix.

Germany and to further extend the "DACH" Region (Germany, Austria, Switzerland) all have a pretty different layout for thinks like street names and addresses.

And I would like to discuss how I could implement it in a way it suites the library and their users.

For example a street name might be separated by dashes if it contains one or multiple names e.g. Karl-Marx-Allee https://en.wikipedia.org/wiki/Karl-Marx-Allee or can contain an apostrophe as seen here in https://de.wikipedia.org/wiki/Stra%C3%9Fenname in Laehr'scher Jagdweg

There are already some impl with random branching for data generation

let name = if Faker.fake_with_rng::<bool, _>(rng) {

And I don't know if I should start adding impls special for the DE_DE locale or how it would fit the overall project structure.

Any feedback or guidance is appreciated and I would gladly implement more of the german locale.

Cheers xoryouyou

@llogiq
Copy link
Collaborator

llogiq commented Dec 11, 2021

Hi there! Thank you for opening this PR! As another German (who didn't have enough time to stem this), I'm particularly happy this finally gets addressed.

With last names, I'd allow for double names by rule instead of putting them into the data. The rule should have special handling for equal last names to just return that name so we can avoid "Müller-Müller". I would avoid "Musterfrau".

Street names can be hyphenated names, last names, city names or a variety of words + any of "Straße", "Weg", "Gasse", "Allee", "Chaussee" selected with decreasing probability.

City names can be built from a variety of suffixes like. "stadt", "berg", "dorf", "kirchen", "burg", "feld", "au", "ing" or "heim" and optional prefixes "Groß ", "Klein ", "Ober", "Unter". That plus a list of name parts will give us a good selection.

@xoryouyou
Copy link
Author

Thanks for the feedback and that is exactly why I opened the PR.

The whole cityname with prefixes, suffixes and some static strings in between opened the question in my head how to integrate it into the library.

Would you enhance the impl here

impl<L: Data + Copy> Dummy<CityName<L>> for String {
and add all locale logic in there to branch and randomly generate a string from fiben parts or would you create a custom impl per locale which could then use its custom TPL with {prefix}{base}{suffix} ?

These are just ideas and I'm a bit lost how it could work best for the library :)

@llogiq
Copy link
Collaborator

llogiq commented Dec 11, 2021

Good question. Perhaps having a kind of templating functionality that we can reuse might be useful, perhaps along with a few rules (e.g. no uppercase letters after lowercase letters).

@cksac
Copy link
Owner

cksac commented Dec 11, 2021

I would prefer templates for the default implementation, in case it does not fit for specific locale, can use impl specialisation. But not sure is it in stable or not

@xoryouyou
Copy link
Author

Small update time.

Data

I've added the 50 most common firstnames for female and male
also the 100 most common lastnames.

Source:

Regarding the City names I've integrated the suggestions from @llogiq

A sample company currently looks like this:

Company {
    name: "Hahn and Beck GmbH",
    address_line_1: "Weißallee 44",
    address_line_2: "Aufgang 79",
    zipcode: "38015",
    city: "Peterheim",
    state: "Nordrhein-Westfalen",
}

Issues

During testing I found that only using templates will not suffice for all of the german specifics and probably for other locales too.

Some edge cases I encountered where:

Proper lowercasing in city names

Example: city: "UnterHeinrich Bauerfurt"
Should be: city:" Unterheinrich Bauerfurt"

Due to the .replace in


the function simply takes a LastName and plugs it into the Template string.
which could be prevented by lowercasing all templated strings except the first one.

Citystates

Example:

    city: "Juttafeld",
    state: "Berlin",

Should be:

    city: "Berlin",
    state: "Berlin",

In germany there are a few Citystates e.g. Berlin, Bremen, Hamburg. These share the same name
for City and State and therefore should not differ.

My understanding from the examples is that the library only provides parts of bigger structures like CompanyName() , CompanySuffix() ,... and the responsibility lies with the user to generate proper aggregates like struct GermanCompany to handle all the specific details in their implementation using the library.

My feeling is that some kind of customization logic might be needed in the future internally in the impls.

Static content

Due to a lot of edge cases which might not be easily generated but provide a benefit for testing purposes it
would be great to have something like ADDRESS_CITY_NAME_STATIC and ADDRESS_STREET_NAME_STATIC which could hold names like Berlin, Frankfurt, etc.
and be randomly chosen instead of a FirstName or Lastname as it is currently done.

Feedback

I would really like to get your feedback on this since I have the feeling a lot of other locales will face similar problems
like ireland with the EIRCODE
and a bunch of other specifications from the INSPIRE project

Therefore we might need a bit more logic in each respective impl<L:Data + Copy> Dummy<Type<L>> for each locale which feels wrong due to the fact that it would force specific locale code in the core of the library.

Or the other way round it would make sense to have impls based in each locale so EN_IE could implement its own version for e.g. PostalCode.

And since I'm only contributing I would like to get your input on this and ask for a bit of help or examples how I could implement either way you want it to be integrated in the future 😄

@llogiq
Copy link
Collaborator

llogiq commented Dec 19, 2021

Regarding the uppercase problem: I would suggest adding a function that lowercases all characters unless a) the first and b) following any whitespace. This would leave e.g. "Gross Annaheim" in uppercase but lowercase "Unterdirkfurt"

@MLNW
Copy link

MLNW commented Oct 16, 2024

Sad to see this is dead in the water. Would have been nice to fake some German data next to the other available languages.

@xoryouyou
Copy link
Author

Yeah unfortunately my work focus changed and I didn't have any time to maintain the PR or even keep it up to date.
But if anyone is willing feel free to take it over.
I can't foresee when I'll have time to work on it again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants