Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional Argument Validation Fails for non-latin characters #99

Closed
robmeek opened this issue Mar 17, 2020 · 14 comments
Closed

Optional Argument Validation Fails for non-latin characters #99

robmeek opened this issue Mar 17, 2020 · 14 comments
Assignees
Labels
bug quality refactoring Improving existing code for the better

Comments

@robmeek
Copy link

robmeek commented Mar 17, 2020

The addOptionalArguments method throws an InvalidArgumentException if an argument contains exclusively non-ascii characters.

e.g. if the traderName is ΕΛΛΑΣ

– note that the E and the A are Greek, and not Latin, unicode characters here. For test purposes something like Äß also throws the exception.

The filterArgument strips away all greek letters (and accented characters such as ä and é). In the case of ΕΛΛΑΣ – this means an empty string is sent to validateArgument.

The regex in validateArgument is probably also problematic. Changing to this:

/^[a-zA-Z0-9α-ωΑ-Ω\s\.\-,]+$/

– Will allow the greek alphabet but I don’t know whether that’s sufficient.

@robmeek
Copy link
Author

robmeek commented Mar 19, 2020

I wonder whether this library should be validating or filtering input to the API at all.

DragonBe added a commit that referenced this issue Mar 21, 2020
In this commit I changed the way we're filtering and validating characters so that we allow non-Latin characters as we have in Germany, Poland, Greece and other countries.
@DragonBe
Copy link
Owner

Hello @robmeek,

I tested the hypothesis you provided here where I have the following data set for testing

'Greek Trader Name' => [
    [
        'countryCode'          => 'EL',
        'vatNumber'            => '999645865',
        'requesterCountryCode' => 'EL',
        'requesterVatNumber'   => '999645865',
        'traderName'           => 'ΤΡΑΙΝΟΣΕ',
        'traderCompanyType'    => 'AE',
        'traderStreet'         => 'ΚΑΡΟΛΟΥ 1-3',
        'traderPostcode'       => '10437',
        'traderCity'           => 'ΑΘΗΝΑ',
    ],
],

I've modified the filterArgument method in such a way it no longer strips out non-Latin characters by using FILTER_FLAG_STRIP_LOW.

private function filterArgument(string $argumentValue): string
{
    $argumentValue = str_replace(['"', '\''], '', $argumentValue);
    return filter_var($argumentValue, FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);
}

And modified the validation regex for validateArgument as you suggested, but a bit different so I can support more unicode languages using \pL and /u as is described at https://www.php.net/manual/en/regexp.reference.unicode.php.

    private function validateArgument(string $argumentValue): bool
    {
        if (false === filter_var($argumentValue, FILTER_VALIDATE_REGEXP, [
            'options' => ['regexp' => '/^[a-zA-Z0-9\s\.\-,\pL]+$/u']
        ])) {
            return false;
        }
        return true;
    }

Now the tests are running OK and we didn't break anything in the process, so that's always good.

Please have a look at my PR #100 and see if it solves your issue.

@robmeek
Copy link
Author

robmeek commented Mar 22, 2020

@DragonBe Yes, that looks good to me.
Thanks for looking into this so promptly!

@DragonBe
Copy link
Owner

Awaiting a code review from @krzaczek to make sure… once approved, it will be merged in.

@DragonBe DragonBe added bug refactoring Improving existing code for the better quality labels Mar 22, 2020
@fidelo-developer
Copy link

fidelo-developer commented May 26, 2020

@DragonBe can you please add "+" to the white list too?

/^[a-zA-Z0-9\s\.\-,\pL\+]+$/U

Example of a valid company name in the VIES database: ACME GmbH + Co. KG

Thanks a lot!
Mark

@fidelo-developer
Copy link

I just discovered, that parentheses need to be allowed too. Maybe it is better to remove this filter as @robmeek already suggested. I can see no need for it.
Thanks!

@DragonBe
Copy link
Owner

DragonBe commented May 27, 2020

I just discovered, that parentheses need to be allowed too. Maybe it is better to remove this filter as @robmeek already suggested. I can see no need for it.
Thanks!

I'm not really a fan of allowing all sorts of entries to go in to prevent malicious code to be executed. Given the feedback up to this point, what I might do instead is allowing all characters except those I consider harmful by using htmlspecialchars.

@fidelo-developer
Copy link

Thanks for your feedback.
I'm afraid that is not enough either, because ampersand (definitely) and double quotation marks (maybe) are regular characters in company names.

@DragonBe
Copy link
Owner

I was just reviewing the code bit again and there are two steps in the validation process:

  1. filtering input
  2. validating input

I could add the htmlspecialchars bit to the filtering process, but validation remains a critical step to consider. I'm running a couple of test scenarios now to figure out a good whitelist/blacklist approach as I don't want to forfeit on validation of input.

@DragonBe
Copy link
Owner

Hmm, up to this point I have most of the issues cleared, except for the ampesant issue.

Test case: VAN AERDE & PARTNERS
Filter returns: VAN AERDE & PARTNERS
Input Validation succeeds
VIES Validation fails -> because & does not match & 😕

@DragonBe
Copy link
Owner

Double checked the SoapCall result:

(object) array(
   'countryCode' => 'BE',
   'vatNumber' => '0458591947',
   'requestDate' => '2020-05-27+02:00',
   'valid' => false,
   'traderName' => '---',
   'traderCompanyType' => '---',
   'traderAddress' => '---',
   'requestIdentifier' => '',
)

Looks to me like the validation is ok, but the filtering should not filter the special characters

DragonBe added a commit that referenced this issue May 27, 2020
After some discussion on issue #99 I improved the way we filter arguments, especially for trader company names and added a couple of additional test cases to make sure we allow special characters inside the trader names.
@DragonBe
Copy link
Owner

OK, in 7ef8962 I was able to improve the filtering with additonal test cases and was even able to remove a potential security risk.

@fidelo-software can you have a look at it to see if your use cases are now covered? If not, please add some examples here in the thread or immediately to the test data provider.

@fidelo-developer
Copy link

Thank you for responding so quickly!
I tested 7ef8962 and it works fine with our company names, but I have an address which is not working yet:

Speak & Fun España SL
C/ Asunción 80, 8ºB Valid
41011 SEVILLA
Espana
ESB91648113

@DragonBe
Copy link
Owner

Correct, abbreviations in Spanish are often marked with a / which cannot be filtered out if we want to validate against VIES because Spain requires full trader details for validation purposes.

I know it's a bit annoying, but until I find a permanent solution I think I need to take it case by case. I've adjusted the validation rule in b971690.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug quality refactoring Improving existing code for the better
Projects
None yet
Development

No branches or pull requests

4 participants