Added foundation for DE_DE data #75

xoryouyou · 2021-12-11T13:43:32Z

Hi @cksac,
I've been using your library thanks to LukeMathWalker/zero-to-production

Since I've had the need to generate a bit of german fake data I took a closer look into the project.

I added a de_de.rs module to provide a few basic attributes like State,Company Type,Names and so on from the official government ID test card data

https://persosim.secunet.com/fileadmin/user_upload/PersoSim_Personalisierungsdaten_eID_UB.pdf

It's working with the check_determinism test as far as I can tell.

To verify the data that gets generated I added a de_manual_test.rs just so see some generic company struct on the terminal.

Run via cargo test de_manual_test -- --nocapture it produces some results like

Company { name: "Schuster e.V.", street: "Hillebrandtgang", zipcode: "29346", city: "Musterfrau land", state: "Berlin" }

So far so good. 😄

But that is the reason I opened the PR.

"Musterfrau land" is not a valid city name in german.

Most city names would consist of two nouns or names and a suffix.

Germany and to further extend the "DACH" Region (Germany, Austria, Switzerland) all have a pretty different layout for thinks like street names and addresses.

And I would like to discuss how I could implement it in a way it suites the library and their users.

For example a street name might be separated by dashes if it contains one or multiple names e.g. Karl-Marx-Allee https://en.wikipedia.org/wiki/Karl-Marx-Allee or can contain an apostrophe as seen here in https://de.wikipedia.org/wiki/Stra%C3%9Fenname in Laehr'scher Jagdweg

There are already some impl with random branching for data generation

fake-rs/fake/src/faker/impls/address.rs

Line 108 in 3264649

let name = if Faker.fake_with_rng::<bool, _>(rng) {

And I don't know if I should start adding impls special for the DE_DE locale or how it would fit the overall project structure.

Any feedback or guidance is appreciated and I would gladly implement more of the german locale.

Cheers xoryouyou

llogiq · 2021-12-11T14:26:11Z

Hi there! Thank you for opening this PR! As another German (who didn't have enough time to stem this), I'm particularly happy this finally gets addressed.

With last names, I'd allow for double names by rule instead of putting them into the data. The rule should have special handling for equal last names to just return that name so we can avoid "Müller-Müller". I would avoid "Musterfrau".

Street names can be hyphenated names, last names, city names or a variety of words + any of "Straße", "Weg", "Gasse", "Allee", "Chaussee" selected with decreasing probability.

City names can be built from a variety of suffixes like. "stadt", "berg", "dorf", "kirchen", "burg", "feld", "au", "ing" or "heim" and optional prefixes "Groß ", "Klein ", "Ober", "Unter". That plus a list of name parts will give us a good selection.

xoryouyou · 2021-12-11T14:39:11Z

Thanks for the feedback and that is exactly why I opened the PR.

The whole cityname with prefixes, suffixes and some static strings in between opened the question in my head how to integrate it into the library.

Would you enhance the impl here

fake-rs/fake/src/faker/impls/address.rs

Line 35 in 3264649

impl<L: Data + Copy> Dummy<CityName<L>> for String {

and add all locale logic in there to branch and randomly generate a string from fiben parts or would you create a custom impl per locale which could then use its custom TPL with {prefix}{base}{suffix} ?

These are just ideas and I'm a bit lost how it could work best for the library :)

llogiq · 2021-12-11T14:46:50Z

Good question. Perhaps having a kind of templating functionality that we can reuse might be useful, perhaps along with a few rules (e.g. no uppercase letters after lowercase letters).

cksac · 2021-12-11T15:37:45Z

I would prefer templates for the default implementation, in case it does not fit for specific locale, can use impl specialisation. But not sure is it in stable or not

Added german CityPrefix and Suffixes

xoryouyou · 2021-12-18T16:36:58Z

Small update time.

Data

I've added the 50 most common firstnames for female and male
also the 100 most common lastnames.

Source:

Regarding the City names I've integrated the suggestions from @llogiq

A sample company currently looks like this:

Company {
    name: "Hahn and Beck GmbH",
    address_line_1: "Weißallee 44",
    address_line_2: "Aufgang 79",
    zipcode: "38015",
    city: "Peterheim",
    state: "Nordrhein-Westfalen",
}

Issues

During testing I found that only using templates will not suffice for all of the german specifics and probably for other locales too.

Some edge cases I encountered where:

Proper lowercasing in city names

Example: city: "UnterHeinrich Bauerfurt"
Should be: city:" Unterheinrich Bauerfurt"

Due to the .replace in

fake-rs/fake/src/faker/impls/address.rs

Line 43 in 3264649

.replace(

the function simply takes a LastName and plugs it into the Template string.
which could be prevented by lowercasing all templated strings except the first one.

Citystates

Example:

    city: "Juttafeld",
    state: "Berlin",

Should be:

    city: "Berlin",
    state: "Berlin",

In germany there are a few Citystates e.g. Berlin, Bremen, Hamburg. These share the same name
for City and State and therefore should not differ.

My understanding from the examples is that the library only provides parts of bigger structures like CompanyName() , CompanySuffix() ,... and the responsibility lies with the user to generate proper aggregates like struct GermanCompany to handle all the specific details in their implementation using the library.

My feeling is that some kind of customization logic might be needed in the future internally in the impls.

Static content

Due to a lot of edge cases which might not be easily generated but provide a benefit for testing purposes it
would be great to have something like ADDRESS_CITY_NAME_STATIC and ADDRESS_STREET_NAME_STATIC which could hold names like Berlin, Frankfurt, etc.
and be randomly chosen instead of a FirstName or Lastname as it is currently done.

Feedback

I would really like to get your feedback on this since I have the feeling a lot of other locales will face similar problems
like ireland with the EIRCODE
and a bunch of other specifications from the INSPIRE project

Therefore we might need a bit more logic in each respective impl<L:Data + Copy> Dummy<Type<L>> for each locale which feels wrong due to the fact that it would force specific locale code in the core of the library.

Or the other way round it would make sense to have impls based in each locale so EN_IE could implement its own version for e.g. PostalCode.

And since I'm only contributing I would like to get your input on this and ask for a bit of help or examples how I could implement either way you want it to be integrated in the future 😄

llogiq · 2021-12-19T11:08:51Z

Regarding the uppercase problem: I would suggest adding a function that lowercases all characters unless a) the first and b) following any whitespace. This would leave e.g. "Gross Annaheim" in uppercase but lowercase "Unterdirkfurt"

MLNW · 2024-10-16T13:46:51Z

Sad to see this is dead in the water. Would have been nice to fake some German data next to the other available languages.

xoryouyou · 2024-10-16T14:07:11Z

Yeah unfortunately my work focus changed and I didn't have any time to maintain the PR or even keep it up to date.
But if anyone is willing feel free to take it over.
I can't foresee when I'll have time to work on it again.

Added foundation for DE_DE data

a2a9f7b

xoryouyou added 2 commits December 18, 2021 16:45

Added common german first and lastnames

b9b45bd

Added german CityPrefix and Suffixes

Changed address format and println output

177373e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added foundation for DE_DE data #75

Added foundation for DE_DE data #75

xoryouyou commented Dec 11, 2021

llogiq commented Dec 11, 2021

xoryouyou commented Dec 11, 2021

llogiq commented Dec 11, 2021

cksac commented Dec 11, 2021

xoryouyou commented Dec 18, 2021

llogiq commented Dec 19, 2021

MLNW commented Oct 16, 2024

xoryouyou commented Oct 16, 2024

Added foundation for DE_DE data #75

Are you sure you want to change the base?

Added foundation for DE_DE data #75

Conversation

xoryouyou commented Dec 11, 2021

llogiq commented Dec 11, 2021

xoryouyou commented Dec 11, 2021

llogiq commented Dec 11, 2021

cksac commented Dec 11, 2021

xoryouyou commented Dec 18, 2021

Data

Issues

Proper lowercasing in city names

Citystates

Static content

Feedback

llogiq commented Dec 19, 2021

MLNW commented Oct 16, 2024

xoryouyou commented Oct 16, 2024