Classify data as anomalous #88
Comments
It's difficult to define precisely what should be classed as anomalous by default; however, one thing that might be worth considering is to include [...] Not a huge problem, since, as you suggest, the user can simply add these chars to the vocabulary like so:
Just wondering whether email addresses and URLs should work straight out of the box (given how commonly they occur in datasets), without any user adjustment required?
Hi @BenjaminFraser, Thanks for the question! I assume that this issue occurs when a data column contains values that are not supported by the string type (e.g., email addresses). If so, there may be an alternative solution! You can let ptype treat such values as normal values by modifying the data types it considers. For example, the following initialization lets ptype take the email address type into account (although I should say that we haven't experimented with it extensively):
With this, you should be able to annotate data columns with the "EmailAddress" label when appropriate, and treat such values as normal rather than anomalous. Does this sound helpful? Note that the "EmailAddress" type is already supported, but we'd need to create a new PFSM for URLs.
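To make the two options in this thread concrete (extending the string vocabulary versus adding a dedicated email type), here is a toy sketch. It is not ptype's API and does not use its PFSMs: `DEFAULT_VOCAB`, `EMAIL_RE`, and `classify` are hypothetical names, and the email pattern is deliberately simplified.

```python
import re
import string

# Hypothetical default vocabulary for a "string" type: letters, digits, space.
# This is an assumption for illustration, not ptype's actual alphabet.
DEFAULT_VOCAB = set(string.ascii_letters + string.digits + " ")

# Deliberately simplified email pattern, standing in for a real email PFSM.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def classify(value, vocab=DEFAULT_VOCAB, allow_email=False):
    """Return 'EmailAddress', 'string', or 'anomalous' for a single value."""
    # Option B: a dedicated type claims the value before the string check.
    if allow_email and EMAIL_RE.fullmatch(value):
        return "EmailAddress"
    # A value is a normal string only if every character is in the vocabulary.
    if all(ch in vocab for ch in value):
        return "string"
    return "anomalous"


column = ["alice", "bob smith", "carol@example.com"]

# Default behaviour: the email contains '@' and '.', so it is flagged.
default_labels = [classify(v) for v in column]

# Option A: widen the string vocabulary so '@' and '.' are accepted.
widened_labels = [classify(v, vocab=DEFAULT_VOCAB | set("@.")) for v in column]

# Option B: keep the vocabulary, but recognise emails as their own type.
typed_labels = [classify(v, allow_email=True) for v in column]

print(default_labels)
print(widened_labels)
print(typed_labels)
```

Widening the vocabulary accepts any string containing `@` or `.`, whereas a dedicated type only accepts values with the overall shape of an email address, which is why the type-based route tends to be the neater of the two.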
Hi @tahaceritli, Thanks for the speedy response! The solution you provided is great, and neater than manually adding [...] I hadn't realised the EmailAddress field was already supported (apologies, I should have looked more diligently!).
No worries. Hope it works (please let me know if you run into a problem). Perhaps I should prepare another notebook to demonstrate that.
See notebook. Summary:
Part 1 (False Negatives):
Part 2 (False Positives):
pandas.read_csv
To do: