DataHarmonization Document improvements #185

ghost · 2015-05-28T10:58:32Z

Hi,

the DataHarmonization document needs some improvements:

When viewed in a browser, the table is too wide to be displayed in the browser window.
The document itself is not easy to understand, e.g. when should a key be prefixed with reported, and when not? How is someone supposed to add a new field, such as the HTTP method? Should that then be reported_http_method?

aaronkaplan · 2015-05-28T11:03:08Z

Thx nice catch. What you saw was an intermediate version/commit on github. Will update it.


SYNchroACK · 2015-05-28T13:27:20Z

True. Regarding the reported keys: IMHO they should be removed, because in the new version (https://github.com/certtools/intelmq/tree/v1.0-beta) we have a raw field that holds the raw event.

I think it's easy to agree on this one.

Feedback?

ghost · 2015-05-29T18:30:50Z

@SYNchroACK, still not read. Trying to focus on that one :(

sebix · 2015-08-20T08:49:14Z

I need your help @aaronkaplan and @SYNchroACK: When working with postgres (see my branch) I saw that the Data-Harmonization document uses underscores to subclass values, but the code uses dots. Replacing all underscores with dots does not make sense in all cases, e.g. with source.reverse_domain_name.

Additionally, some fields have other names in code and documentation, as with source.time and time.source. This is kind of blocking any work that involves harmonization.

EDIT:

source.reverse_domain_name is not documented
time.source in docs as source_time
time.observation in docs as observation_time
raw not in docs
classification.type probably malware, type or taxonomy
description.text probably description
feed.name is maybe feed or feed_code
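
To illustrate why a blanket underscore-to-dot replacement breaks names such as source.reverse_domain_name, here is a minimal Python sketch that renames only explicitly mapped keys. The two pairs come from the list above; DOC_TO_CODE and rename_keys are hypothetical names and not part of intelmq.

```python
# Hypothetical mapping from documented names to the names used in code.
# Only the two pairs listed above are included.
DOC_TO_CODE = {
    "source_time": "time.source",
    "observation_time": "time.observation",
}


def rename_keys(event, mapping=DOC_TO_CODE):
    """Return a copy of event with documented names replaced by code names."""
    return {mapping.get(key, key): value for key, value in event.items()}


event = {
    "source_time": "2015-08-20T08:49:14Z",
    "source.reverse_domain_name": "example.com",
}

# A blanket replacement also splits the underscores that belong to a name.
naive = {key.replace("_", "."): value for key, value in event.items()}
renamed = rename_keys(event)

print(sorted(naive))
print(sorted(renamed))
```

With an explicit mapping, source.reverse_domain_name stays intact while the documented time fields get their dotted names.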

aaronkaplan · 2015-08-20T10:44:15Z


On 20.08.2015, at 10:49, Sebastian notifications@github.com wrote:

I need your help @aaronkaplan and @SYNchroACK: When working with postgres (see my branch) I saw that the Data-Harmonization document uses underscores to subclass values, but the code uses dots. Replacing all underscores with dots does not make sense in all cases, e.g. with source.reverse_domain_name.

Got it. Well the underscores come from abusehelper compatibility. I don't think we still need that anymore since we'll need a mapping anyway by now. So I would also be fine with camelCase or some other renaming. BUT! Let's be very careful about this! It means we will have to refactor everything and every bot.

@SYNchroACK: what do you say?

Additionally, some fields have other names in code and documentation, as with source.time and time.source. This is kind of blocking any work that involves harmonization.

That seems like a bug then.


sebix · 2015-08-20T13:59:56Z

On 08/20/2015 12:44 PM, AaronK wrote:

On 20.08.2015, at 10:49, Sebastian notifications@github.com wrote:

I need your help @aaronkaplan and @SYNchroACK: When working with postgres (see my branch) I saw that the Data-Harmonization document uses underscores to subclass values, but the code uses dots. Replacing all underscores with dots does not make sense in all cases, e.g. with source.reverse_domain_name.

Got it. Well the underscores come from abusehelper compatibility. I don't think we still need that anymore since we'll need a mapping anyway by now. So I would also be fine with camelCase or some other renaming. BUT! Let's be very careful about this! It means we will have to refactor everything and every bot.

@SYNchroACK: what do you say?

If we don't want to break too much, we just redefine the names in the docs to dot-style. I can do that if you are fine with it.

Additionally, some fields have other names in code and documentation, as with source.time and time.source. This is kind of blocking any work that involves harmonization.

That seems like a bug then.

I have a list of undocumented field names in my post above.

SYNchroACK · 2015-08-23T15:45:37Z

The current fields are:
https://raw.githubusercontent.com/certtools/intelmq/master/intelmq/conf/harmonization.conf

time.source is correct. The idea is to have levels:

time . [source | observation]
source . [ip | fqdn | etc...]
destination . [ip | fqdn | etc...]

..etc...

We should create a script to get the json from the file above and generate a harmonization.md document.
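
A minimal sketch of such a generator, assuming harmonization.conf is a flat JSON object that maps each field name to an object with type and description keys (the real file layout may differ). The function name, the sample fields and their types are made up for illustration.

```python
import json


def harmonization_to_markdown(conf_text):
    """Render a field-definition JSON document as a Markdown table."""
    fields = json.loads(conf_text)
    lines = [
        "| Field | Type | Description |",
        "|-------|------|-------------|",
    ]
    # Sort by name so that fields of the same level end up next to each other.
    for name in sorted(fields):
        spec = fields[name]
        lines.append("| {} | {} | {} |".format(
            name, spec.get("type", ""), spec.get("description", "")))
    return "\n".join(lines)


# Made-up sample input, not the content of the real configuration file.
SAMPLE_CONF = """
{
    "time.source": {"type": "DateTime", "description": "Time reported by the source."},
    "source.ip": {"type": "IPAddress", "description": "IP observed as the source."}
}
"""

markdown = harmonization_to_markdown(SAMPLE_CONF)
print(markdown)
```

Writing the result to a file in the docs folder would then be a single extra step.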

aaronkaplan · 2015-08-23T15:46:44Z


Tomas, did you see my other mail concerning the document?

SYNchroACK · 2015-08-23T15:48:54Z

which email? subject and timestamp?

sebix · 2015-08-24T07:10:40Z

On 08/23/2015 05:45 PM, Tomás Lima wrote:

We should create a script to get the json from the file above and generate a harmonization.md document.

Yes, that's my idea. @aaronkaplan and I discussed this issue on Friday, trying to harmonize and define the Harmonization fields. I will upload more fixes today to my postgres branch.

sebix · 2015-08-24T09:28:48Z

Another issue that just came up: In the geolocation section we have cymru_cc and geoip_cc. But how do we deal with other sources? Grepping for '.cc' reveals that AlienVault and CleanMX both have geolocation information. Currently it is written to cc. Should we have cc, cymru_cc and geoip_cc?

BTW: I would like to rename geoip_cc to maxmind_cc, as geoip is a very generic term.

SYNchroACK · 2015-08-24T10:08:36Z

@sebix , cymru_cc and geoip_cc don't exist. All current fields are here:
https://github.com/certtools/intelmq/blob/master/intelmq/conf/harmonization.conf

I don't see the need to have multiple fields for that... in the pipeline we can order the cymru and maxmind bots the way we want...

cymru -> maxmind = means, if maxmind has a value, will overwrite cymru
maxmind -> cymru = means, if cymru has a value, will overwrite maxmind

In my perspective, I think we should minimize the number of fields that does the same thing... but @aaronkaplan and @sebix , whats your feedback?
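
The ordering idea can be sketched in Python as follows. The bots here are plain stand-in functions, not real intelmq bots, and the lookup tables contain made-up values; the only point is that the last bot with a value wins.

```python
def make_cc_bot(lookup_table):
    """Build a stand-in bot that sets 'cc' only when it has a value."""
    def bot(event):
        cc = lookup_table.get(event.get("source.ip"))
        if cc is not None:
            event = dict(event, cc=cc)  # copy, then overwrite any earlier value
        return event
    return bot


def run_pipeline(event, bots):
    """Pass the event through the bots in the given order."""
    for bot in bots:
        event = bot(event)
    return event


# Made-up lookup results, not real geolocation data.
cymru = make_cc_bot({"192.0.2.1": "AT", "192.0.2.2": "PT"})
maxmind = make_cc_bot({"192.0.2.1": "DE"})

first = {"source.ip": "192.0.2.1"}
second = {"source.ip": "192.0.2.2"}

print(run_pipeline(first, [cymru, maxmind])["cc"])   # maxmind ran last
print(run_pipeline(first, [maxmind, cymru])["cc"])   # cymru ran last
print(run_pipeline(second, [cymru, maxmind])["cc"])  # only cymru had a value
```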

sebix · 2015-08-24T10:57:21Z

On 08/24/2015 12:08 PM, Tomás Lima wrote:

@sebix , cymru_cc or geoip_cc doesnt exist.
All current fields are here:
https://github.com/certtools/intelmq/blob/master/intelmq/conf/harmonization.conf

These are the implemented fields, the theoretically standardized fields
are here: https://github.com/certtools/intelmq/wiki/Data-Harmonization
Yes, they are not congruent, therefore we have this discussion trying to
unify them and re-standardize the fields again.

I dont see the need to have multiple fields for that... in the
pipeline we can put the bots cymru and maxmind in the way we want...

AFAIK the idea was/is to use more than one data source for geolocation
as they do not give identical results and geolocation is no exact
science. Having two independent localizations probably results in more
accurate location information.

cymru -> maxmind = means, if maxmind has a value, will overwrite cymru
maxmind -> cymru = means, if cymru has a value, will overwrite maxmind

What do you want to say? Is this a proposal for configuration?

In my perspective, I think we should minimize the number of fields that does the same thing... but @aaronkaplan and @sebix , whats your feedback?

Right. More empty fields are overhead. But they could probably be handy to filter data more easily (instead of searching for them in additional_data). I'm not directly involved in the standardization process itself; I can just give hints and raise helpful questions.

SYNchroACK · 2015-08-24T11:48:09Z

On Mon, Aug 24, 2015 at 11:57 AM, Sebastian notifications@github.com
wrote:

On 08/24/2015 12:08 PM, Tomás Lima wrote:

@sebix , cymru_cc or geoip_cc doesnt exist.
All current fields are here:

https://github.com/certtools/intelmq/blob/master/intelmq/conf/harmonization.conf

These are the implemented fields, the theoretically standardized fields
are here: https://github.com/certtools/intelmq/wiki/Data-Harmonization
Yes, they are not congruent, therefore we have this discussion trying to
unify them and re-standardize the fields again.

Some fields are deprecated, others were just updated... well... like I mentioned, I think we need to get the descriptions of most of the fields and put them in harmonization.conf. Then we can create a quick script to convert harmonization.conf into a harmonization.md that we can put in the docs folder. Yes, discuss and propose something, then we will try to improve it together.

I dont see the need to have multiple fields for that... in the
pipeline we can put the bots cymru and maxmind in the way we want...

AFAIK the idea was/is to use more than one data source for geolocation
as they do not give identical results and geolocation is no exact
science. Having two independent localizations probably results in more
accurate location information.

True, but having multiple fields related to geolocation seems excessive.

cymru -> maxmind = means, if maxmind has a value, will overwrite cymru
maxmind -> cymru = means, if cymru has a value, will overwrite maxmind

What do you want to say? Is this a proposal for configuration?

No, I'm just presenting the possibilities depending on your confidence in each source. For example, if you "trust" cymru more, put the cymru bot after maxmind and the Cymru bot will overwrite the geolocation information...

In my perspective, I think we should minimize the number of fields that does the same thing... but @aaronkaplan and @sebix , whats your feedback?

Right. More empty fields are overhead. But they could probably be handy to filter data more easily (instead of searching for them in additional_data). I'm not directly involved in the standardization process itself; I can just give hints and raise helpful questions.

No, I'm not suggesting to put them in 'description.' (old 'additional_data'). I'm just trying to keep the fields that are important and remove the unnecessary ones.

aaronkaplan · 2015-08-24T15:22:40Z

It's a tough decision.

I'd say we keep geoip/maxmind_cc and cymru_cc. Why?

Because these have different values for many IPs. Cymru_cc is essentially the RIR data and the IP to country code mapping. Maxmind is more geographic, while the RIRs capture the organization owning the netblock and which country it is in.
So we need both, in different contexts.
Maxmind is better for geographic mapping (world map etc.) while Cymru is IMHO sometimes better for getting the right country of the organisation which needs to be contacted.

I am for leaving it and documenting when we recommend to use which one.


aaronkaplan · 2015-08-24T15:23:55Z

Sebix and I did that on Friday. Please contact him for details.


sebix · 2015-08-25T12:07:14Z

Current proposal is to have only one abuse_contact. The abusix-expert adds contacts for both source and destination currently. So the abusix bot has to decide which side is to blame based on the classification. This has to be defined before making this proposal final.

Current proposal of fields is here: https://github.com/sebix/intelmq/blob/postgres/docs/Harmonization-fields.md
generated from conf-file

SYNchroACK · 2015-08-25T13:12:54Z

@sebix from my point of view, I think it's not the Abusix bot's responsibility to decide which "side"... the procedure that sends the emails should have the intelligence to understand the event and choose... so it's not a problem, and source.abuse_contact and destination.abuse_contact should stay...

Do you agree @sebix ?

sebix · 2015-08-25T13:15:34Z

The idea by @aaronkaplan was to have only one abuse_contact, as both are never relevant at the same time. Based on the classification, the source or the destination contact can be chosen.

Aaron, your comment please.

aaronkaplan · 2015-08-25T13:40:25Z

On Aug 25, 2015, at 3:15 PM, Sebastian notifications@github.com wrote:

The idea by @aaronkaplan was to have only one abuse_contact, as never both are relevant.

ACK

Based on the classification, source or destination contact can be chosen.

Aaron, your comment please.

100% ACK.
you want to send mails to essentially one list of email addresses . And that should be in the abuse_contact.
Otherwise you might end up with complexity again and no decision what to use.

Best,
a.


SYNchroACK · 2015-08-25T14:40:09Z

Again, that intelligence should not be implemented in abusix, and I'm 100% sure this is the correct approach. Why? Imagine that tomorrow you have an AbuseContactDB bot and a RIPE AbuseContactDB bot, etc., and you would repeat the same intelligence in every bot. The bots should fill those keys (source.abuse_contact and destination.abuse_contact) with the abuse contacts associated with the corresponding IPs. At the end of the pipeline, you will have a script or a bot, or whatever, that is responsible for sending the events depending on the classification. In that procedure (script, bot, platform, etc.) you build the intelligence that chooses whether to use source.abuse_contact or destination.abuse_contact.

Advantages of this approach:

less complexity in the implementation of each abuse_something_contact bot
more generic approach
if you need both contacts from source and destination, they are already there

Disadvantage:

one more query for each event (if the event has source and destination, which is probably 30% of the cases)

If you still disagree, lets schedule a confcall to discuss. :)
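
As a rough sketch of what "the sender chooses" could look like in Python: the helper below reads both contact fields and picks one based on the classification. The SIDE_BY_TYPE table and its entries are purely illustrative assumptions, not an agreed mapping, and choose_abuse_contact is a hypothetical name.

```python
# Illustrative mapping only: which side to notify per classification type.
SIDE_BY_TYPE = {
    "botnet drone": "source",
    "ddos": "destination",
}


def choose_abuse_contact(event, side_by_type=SIDE_BY_TYPE, default_side="source"):
    """Pick the contact of the side that the classification points to."""
    side = side_by_type.get(event.get("classification.type"), default_side)
    return event.get(side + ".abuse_contact")


event = {
    "classification.type": "ddos",
    "source.abuse_contact": "abuse@example.com",
    "destination.abuse_contact": "abuse@example.net",
}
print(choose_abuse_contact(event))
```

The lookup bots stay unaware of the classification; only this one place has to change when the mapping changes.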

sebix · 2015-08-26T10:38:43Z

To document what @SYNchroACK and I just discussed; this issue was brought up by #298.

additional_information should be used by parsers if there's no existing field in harmonization. Instead of a silent discard, the parsers should save this data.
Tomas was not fine with the name. Our proposal is to rename it to additional or extra, as it means 'additional/extra fields'.

For the content my proposal is to use JSON. JSON is machine-readable, so the additional information can be easily extracted by existing parsers. Even in postgres it is very easy to query that data, as postgres supports JSON.
Also, without a proper definition of the format, the meaning of the saved data is not clear. Using JSON, proper keys are mandatory.

SYNchroACK · 2015-08-26T11:02:13Z

@sebix agreed. An 'extra' field should be added to harmonization.conf.

So, if some feed has the 'zzz' and 'yyy' fields, intelmq should support the following harmonization:

    "extra.zzz": {
      "type": "String",
      "description": "<description>"
    },
    "extra.yyy": {
      "type": "String",
      "description": "<description>"
    }

sebix · 2015-08-26T11:27:36Z

I meant that extra should be of type String and holding a JSON object. This does not require any adaption of the message class and the output bots.
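
A minimal sketch of that idea in Python, using only the standard json module. The helper names and the sample values are made up; the event is modelled as a plain dict.

```python
import json


def pack_extra(unmapped):
    """Serialize fields without a harmonized name into one JSON string."""
    return json.dumps(unmapped, sort_keys=True)


def unpack_extra(event):
    """Decode the extra field again; an absent field yields an empty dict."""
    return json.loads(event.get("extra", "{}"))


event = {"source.ip": "192.0.2.1"}
event["extra"] = pack_extra({"zzz": "some value", "yyy": "other value"})

print(event["extra"])
print(unpack_extra(event)["zzz"])
```

The message still carries a single String value, so output bots that do not care about the content can pass it through unchanged.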

sebix · 2015-08-26T11:34:52Z

Another open question affects abuse_contact, irrespective of the source-destination discussion:

For one IP or host, multiple abuse contacts are always possible: domain (probably more than one), whois, AS. But abuse_contact is currently only defined for one email address. Should it instead be a list of addresses? Should they be grouped by responsibility (e.g. domain, AS)? I can think of some scenarios where it is appropriate to contact the AS/ISP (malware), but in case of a defacement or a vulnerable service, the domain/host owner should (also) be contacted. As with the source-destination discussion, should this be decided at the end of the pipeline?

sebix · 2015-08-26T11:39:47Z

And next question: we have os.name, os.version, user_agent and account. I thought of grouping them together. What about victim?

SYNchroACK · 2015-08-26T13:43:28Z

1) Contacts for domains, IPs / ASNs:

Proposal:

source.contact.ip
source.contact.fqdn
source.contact.asn

destination.contact.ip
destination.contact.fqdn
destination.contact.asn

2) Type of values of the abuse_contact field - string vs string list

Well, at the moment harmonization does not support list values. Reason: CSV compatibility.
@sebix , what's your perspective regarding the legacy formats compatibility issue?

3) 'source-destination discussion'

As I already mentioned, source and destination should be automatically filled and the intelligence needed to send emails should be in the sender code (script, platform, mailer expert bot, mailer output bot, etc...).

4) Regarding the 'extra' field, I would like to propose a new improvement that will be used in 5):

source.extra.<field>

destination.extra.<field>

5) 'os.name, os.version, user_agent and account' fields:

Note: the 'account' field will be used to store email accounts (ex: email compromised) or website usernames (ex: github username compromised). So it should be kept as 'source.account' and 'destination.account'.

Proposal:

source.account
destination.account

source.extra.os.name
source.extra.os.version
source.extra.user_agent

destination.extra.os.name
destination.extra.os.version
destination.extra.user_agent

Regarding your proposal, I think 'victim' will not fit the needs, because you would have to put the intelligence that decides whether the event is related to a victim or not into some 'middle bot'... so, let's keep it generic and leave that intel part to another bot...

aaronkaplan · 2015-08-26T19:24:17Z

On Aug 26, 2015, at 1:34 PM, Sebastian notifications@github.com wrote:

Another open question affects abuse_contact, irrespective of the source-destination discussion:

For one IP or host, multiple abuse contacts are always possible: domain (probably more than one), whois, AS. But abuse_contact is currently only defined for one email address. Should it instead be a list of addresses?

Sure, there is no reason why it can't be a comma separated list.
You can give a mail program a comma separated list of recipients, no problem!

Should they be grouped by responsibility (e.g. domain, AS)?

Nah... mail gets delivered :) The mail server decides the order anyway.

I can think of some scenarios where it is appropriate to contact the AS/ISP (malware), but in case of a defacement or a vulnerable service, the domain/host owner should (also) be contacted. As with the source-destination discussion, should this be decided at the end of the pipeline?

This is the decision of the bots processing the data...
It depends on the case (feed, type, ...) so the abuse contact lookup bot needs to give the right answer ("whom to contact") based on the data it has in the event.
That's ok.

Best,
a.

aaronkaplan · 2015-08-26T19:26:10Z

On Aug 25, 2015, at 4:40 PM, Tomás Lima notifications@github.com wrote:

again, that intelligence should not be implemented in abusix and Im 100% this is the correct approach. Why? Imagine, tomorrow you will have a AbuseContactDB bot and a RIPE AbuseContactDB Bot, etc... and you will repeat in every bot the same intelligence. The bots should fill that keys (source.abuse_contact and destination.abuse_contact)

Tomas I am not sure if I agree with source.abuse_contact and destination.abuse_contact.

I went through all the cases manually with @sebix and it only makes sense to have "abuse_contact".
Whatever is source and destination depends on the feed, type etc. It really depends, and if you add source.abuse_contact as well as destination.abuse_contact, then things only get unnecessarily confusing.
Let's keep it simple and stick with our KISS principle.

My 2 cents,
a.


SYNchroACK · 2015-08-26T21:07:04Z

Well, I think that to follow your approach you need to give some solutions for the following problems:

Who will have the responsibility to implement the algorithm in each AbuseContact(RIPE/Abusix/etc)Expert Bot to know which IP (source or destination) should be chosen?
Who will maintain all the code in all AbuseContact(RIPE/Abusix/etc)Expert Bots? If we want to change something, who will change it in all bots? And how?
If some team wants to use intelmq with some harmonization and taxonomy modifications, does it make sense to force them to customize all AbuseContactExpert Bots? If we fix some bug in the repo in some AbuseContactExpert Bot, does it make sense to force them to manually merge the fix?

IMHO, the solutions for these problems will not follow the KISS principle...

aaronkaplan · 2015-08-26T21:28:38Z

On Aug 26, 2015, at 11:07 PM, Tomás Lima notifications@github.com wrote:

well, I think that to follow your approach you need to give some solutions for the following problems:

Who will have the responsability to implement the algorithm in each AbuseContactExpert Bot to know which IP (source or destination) should be choose?

It is up to the individual implementor to choose the abuse contact lookup strategy.
We can provide a sample.

Who will maintain all code in all AbuseContactExpert Bots? If we want to change something, who will change in all bots? and How?

It is up to the individual implementor to choose the abuse contact lookup strategy.
We can provide a sample.
Why "all AbuseContactExpert bots"? As said - one sample. Should a local installation want a different solution, go for it.

Did you take a look at
https://github.com/certtools/contactdb/blob/rest/doc/datasets.mkd
and
https://github.com/certtools/contactdb/blob/rest/doc/abuse-lookups.mkd

?
A lot of ideas are already there for a generic way - which of course can be different in each installation.

I have the very strong feeling that some ideas are not aligned at all yet regarding the abuse contact lookups. Let's have a conf call please!
Writing this all down via github "email" is rather impractical.

okay?

Thx,
a.

If some team wants to use intelmq with some harmonization and taxonomy modifications, make sense to force them to customize all AbuseContactExpert Bots? If we fix some bug in the repo in some AbuseContactExpert Bot, make sense to force them to manually merge the fix?

I think we talk about different things...

IMHO, the solutions for these problems will not follow the KISS principle...


sebix · 2015-08-27T09:32:10Z

I try to summarize the current discussions on the harmonization by grouping, citations and explanations from my side. The citations are not literal; I adapted most of them.

source.abuse_contact and destination.abuse_contact, or one abuse_contact?
- One abuse_contact:
  Aaron's proposal to have only one field is based on the observation that for every classification type, only one of source and destination has to be contacted. It never makes sense to contact both.
  - Aaron: You want to send mails to essentially one list of email addresses. And that should be in the abuse_contact. Otherwise you might end up with complexity again and no decision what to use.
  - I went through all the cases manually and it only makes sense to have one abuse_contact. Whatever is source and destination depends on the feed, type etc. It really depends, and if you add source.abuse_contact as well as destination.abuse_contact, then things only get unnecessarily confusing.
  - It is up to the individual implementor to choose the abuse contact lookup strategy. We can provide a sample.
  - Should a local installation want a different solution, go for it.
- Two abuse_contacts:
  - Tomas: From my point of view, it is not the Abusix bot's responsibility to decide which "side" should be contacted. The procedure that sends the emails should have the intelligence to understand the event and choose.
  - Imagine that tomorrow you have an AbuseContactDB bot and a RIPE AbuseContactDB bot, etc. You would have to repeat the same intelligence in every bot. The bots should fill those keys (source.abuse_contact and destination.abuse_contact) with the abuse contacts associated with the corresponding IPs. At the end of the pipeline, you will have a script or a bot, or whatever, that is responsible for sending the events depending on the classification. In that procedure (script, bot, platform, etc.) you build the intelligence that chooses whether to use source.abuse_contact or destination.abuse_contact. Advantages of this approach:
    - Less complexity in the implementation of each AbuseSomethingExpertBot.
    - More generic approach.
    - If you need both contacts from source and destination, they are already there.
  - The intelligence choosing the recipient should be in the sender code (script, platform, mailer expert bot, mailer output bot, etc...).
  - Aaron's approach needs solutions for the following problems:
    1. Who will have the responsibility to implement the algorithm in each AbuseContact(RIPE/Abusix/etc)Expert Bot to know which IP (source or destination) should be chosen?
    2. Who will maintain all the code in all AbuseContact(RIPE/Abusix/etc)Expert Bots? If we want to change something, who will change it in all bots? And how?
    3. If some team wants to use intelmq with some harmonization and taxonomy modifications, does it make sense to force them to customize all AbuseContactExpert Bots? If we fix some bug in the repository in some AbuseContactExpert Bot, does it make sense to force them to manually merge the fix?
- Comment by Sebastian: This could be made configurable, along with the definition of which side of the event should be contacted. Helper functions can be created either in harmonization.py or in utils.py. So no bots have to be adapted, but it results in a bigger configuration and a bigger library.

Should abuse_contact be a single value, a python/JSON list, comma separated or a key-value store (dictionary)? contact or abuse_contact?
For one IP or host, multiple abuse contacts are always possible: domain (probably more than one), whois, AS. But abuse_contact is currently only defined for one email address. Should it instead be a list of addresses? Should they be grouped by responsibility (e.g. domain, AS)? Should all abuse addresses that were found always be contacted?
- Proposal by Tomas is to create sub-fields for IP, FQDN and AS:
  - source.contact.ip
  - source.contact.fqdn
  - source.contact.asn
  - destination.contact.ip
  - destination.contact.fqdn
  - destination.contact.asn
- Aaron: Comma-separated list. Mail programs can easily deal with it without adaptation.
- Whom to contact? Aaron: This is the decision of the bots processing the data. It depends on the case (feed, type, ...) so the abuse contact lookup bot needs to give the right answer ("whom to contact") based on the data it has in the event. That's ok.
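
A minimal sketch of the comma-separated variant, using Python's standard email package. The helper name, addresses and texts are placeholders, and nothing is actually sent.

```python
from email.message import EmailMessage


def build_notification(abuse_contact, subject, body):
    """Create a mail addressed to every entry of a comma-separated field."""
    recipients = [addr.strip() for addr in abuse_contact.split(",") if addr.strip()]
    message = EmailMessage()
    message["To"] = ", ".join(recipients)
    message["Subject"] = subject
    message.set_content(body)
    return message, recipients


message, recipients = build_notification(
    "abuse@example.com, noc@example.net",
    "Incident report",
    "Example body.",
)
print(recipients)
```

Keeping the field a plain string also keeps it compatible with CSV output, at the cost of having to split it whenever single addresses are needed.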
additional_information or additional or extra?
additional_information should be used by parser if there's no existing field in harmonization. Instead of a silent discard, the parsers should save this data.
- Thomas is not fine with the name. Proposal is to rename it to additional or extra, as it means 'additional/extra fields'.
Should additional_information be freely formated text, an unstandardized key-value store (dictionary), or always a JSON object?
We are in line that the content of this field should not be unparsed data but rather be machine-readable. Thus this should be a key-value store as all the other fields.
- Proposal by Tomas: If some feed has the 'zzz' and 'yyy' fields, the intelmq should support the following harmonization:
```
"extra.zzz": {
    "type": "String",
    "description": "<description>"
},
"extra.yyy": {
    "type": "String",
    "description": "<description>"
}
```
- Comment by Sebastian completing the proposal: Names are free to choose (general rules: lower case, no whitespace) and the type is always String. At the end of the pipeline, the output bots have to decide how to treat this data. For postgres, for example, it makes sense to write the data as JSON.
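To make the "output bots decide" part concrete, here is a minimal sketch of how an output step could collapse all `extra.*` keys into one JSON object before writing to a database. The function name `split_extra` and the sample values are assumptions for illustration, not existing IntelMQ code.

```python
import json


def split_extra(event):
    """Separate harmonized fields from 'extra.*' fields.

    Returns (core, extra_json): the event without the extra.* keys, and
    the extra.* values serialized as one JSON object string.
    """
    prefix = "extra."
    core = {k: v for k, v in event.items() if not k.startswith(prefix)}
    extra = {k[len(prefix):]: v for k, v in event.items() if k.startswith(prefix)}
    return core, json.dumps(extra, sort_keys=True)


# Hypothetical event using the 'zzz' and 'yyy' fields from the proposal.
event = {"source.ip": "192.0.2.1", "extra.zzz": "foo", "extra.yyy": "bar"}
core, extra_json = split_extra(event)
```

Here `core` would go into the regular columns and `extra_json` into a single JSON column, so new feed-specific fields do not require a schema change.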
Should os.name, os.version and user_agent be moved to extra?
- Proposal by Tomas:
  - source.account
  - destination.account
  - source.extra.os.name
  - source.extra.os.version
  - source.extra.user_agent
  - destination.extra.os.name
  - destination.extra.os.version
  - destination.extra.user_agent
Should time.observation be added by collectors or by parsers?
Current definition is:

The time a source bot saw the event.

According to this definition, the collectors should add this field, but actually the parsers are doing it.
- Tomas and Sebastian: Should be done by collectors. While the time of data retrieval is relevant, the time of processing is not.
- Aaron: Should be done by parsers, adjust definition in docs to processed.
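A minimal sketch of the collector-side variant, assuming a report is a plain dictionary with a `raw` field. The functions `collect` and `parse` are hypothetical stand-ins, not the actual bot classes.

```python
from datetime import datetime, timezone


def collect(raw, now=None):
    # Collector: stamp the report at retrieval time (UTC).
    if now is None:
        now = datetime.now(timezone.utc)
    return {"raw": raw, "time.observation": now.isoformat()}


def parse(report):
    # Parser: derive an event from the report and keep the collector's
    # timestamp instead of generating a new one at processing time.
    return {
        "raw": report["raw"],
        "time.observation": report["time.observation"],
    }


report = collect("198.51.100.7,botnet")
event = parse(report)
```

The point of this variant is that a report sitting in a queue for an hour still carries the time of retrieval, not the time the parser got around to it.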
Geolocation: Save two country codes from Maxmind and Cymru, or just one? All other geolocation fields only exist once (lat, long, state, etc.)
- One field, Tomas:
  - Having multiple fields related to localization seems excessive.
  - If you "trust" Cymru more, put the Cymru bot after the Maxmind bot and the Cymru bot will overwrite the geolocation information.
- Two fields, Aaron:
  - The idea is to use more than one data source for geolocation, because they give different values for many IPs. Cymru is essentially the RIR data and an IP to country code mapping. Maxmind is more geographic, while the RIRs capture the organization owning the netblock and which country it is in. So we need both, in different contexts. Maxmind is better for geographic mapping (world map etc.), while Cymru is IMHO sometimes better for getting the right country of the organization which needs to be contacted. I am for leaving it and documenting when we recommend using which one.
  - Comment by Sebastian: How to deal with other sources? Grepping for .cc reveals that AlienVault and CleanMX both have geolocation information. Currently it is written to cc. What should they do, if cc is dropped? Write this data to additional_fields?
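The difference between the two positions can be sketched as follows. Everything here is a stand-in: the lookup functions, the returned country codes and the key names `maxmind.cc` and `cymru.cc` are made up for illustration and are not taken from IntelMQ.

```python
def maxmind_lookup(event):
    # Hypothetical stand-in for a Maxmind expert bot; "AT" is a made-up result.
    return dict(event, cc="AT")


def cymru_lookup(event):
    # Hypothetical stand-in for a Cymru expert bot; "DE" is a made-up result.
    return dict(event, cc="DE")


def run_pipeline(event, bots):
    # Apply each bot in order; a later bot overwrites keys set by an earlier one.
    for bot in bots:
        event = bot(event)
    return event


def keep_both(event):
    # Two-field variant: each source writes to its own (hypothetical) key.
    result = dict(event)
    result["maxmind.cc"] = maxmind_lookup(event)["cc"]
    result["cymru.cc"] = cymru_lookup(event)["cc"]
    return result


event = {"source.ip": "192.0.2.1"}
one_field = run_pipeline(event, [maxmind_lookup, cymru_lookup])
two_fields = keep_both(event)
```

In the one-field variant the bot order alone decides which source wins. In the two-field variant both answers survive and the consumer has to pick one.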

sebix · 2015-09-02T08:32:20Z

Two discussions are resolved so far:

source.abuse_contact and destination.abuse_contact, or one abuse_contact?
resolved, two fields
Should time.observation be added by collectors or by parsers?
fixed, collectors

We have no further comments on:

Should abuse_contact be a single value, python/JSON-list, comma separated or a key-value store (dictionary)? contact or abuse_contact?
additional_information or additional or extra?
Should additional_information be freely formatted text, an unstandardized key-value store (dictionary), or always a JSON object?
Should os.name, os.version and user_agent be moved to extra?
Geolocation: Save two country codes from Maxmind and Cymru, or just one? All other geolocation fields only exist once (lat, long, state, etc.)

SYNchroACK · 2015-09-02T10:39:37Z

My proposal:

comma separated ; contact.ip / contact.domain / contact.asn
extra (string field)
json object
yes
one

aaronkaplan · 2015-09-02T19:27:50Z

On Sep 2, 2015, at 12:39 PM, Tomás Lima notifications@github.com wrote:

My proposal:

comma separated ; contact.ip / contact.domain / contact.asn
Why three? I don't like three fields here. The abuse contact expert should look at the whole event.
Based on an analysis of it (that's its job!) it needs to find one contact to which this info needs to be delivered.
By "one contact" I mean that this can also be a CC: list (comma-separated string). But it should be only one JSON field.

Why?
KISS, and do not leave all the options in there to be decided by other programmers later in the chain.
That will only create confusion. Which field should a later program choose then? It will also need logic to know which field to look at... not good.

If the abuse contact lookup bot does not do what you want it to do, copy & paste & modify it :)
That's what I think should be done.

So: IMHO one JSON field with a list (comma separated) of email addresses.

My 2 cents...

extra (string field)
okay with me

json object
okay with me

yes
ACK

one
But which one to take then? Sometimes cymru is better, sometimes maxmind...


SYNchroACK · 2015-09-03T09:37:17Z

My new proposal:

'contact' field ; json object
extra (string field)
json object
yes
two - keep the cymru and maxmind

SYNchroACK · 2015-10-22T12:47:33Z

Include also:

include a default intelmq Event message in JSON format

aaronkaplan · 2015-12-18T21:58:07Z

I believe the user_agent topic came up again in a recent discussion at our jour fixe meeting.
It seems to be important enough for the handlers (and common enough according to them) that it deserves to have its own field. Yes, we could name it extra.user_agent of course. However, it turns out that there are no extra.* fields in the DHO. There is only an extra field.

So, the whole discussion about this particular point (user_agent in extra.) seems to miss the reality.

SYNchroACK added the feature label Jun 6, 2015

SYNchroACK mentioned this issue Jun 6, 2015

Rename 'source_time' and 'observation_time' #107

Closed

SYNchroACK added the documentation label Aug 7, 2015

sebix mentioned this issue Aug 20, 2015

Doc - Improve DataHarmonization.md #189

Closed

sebix mentioned this issue Aug 21, 2015

abuse.ch collectors and parsers #293

Merged

sebix mentioned this issue Sep 10, 2015

Modify expert bot added #315

Merged

SYNchroACK added this to the Release 1 - v1.0 milestone Oct 9, 2015

SYNchroACK mentioned this issue Oct 22, 2015

Improve DataHarmonization document #250

Closed

sebix mentioned this issue Nov 3, 2015

BOT: DNS lookup #373

Closed

sebix added the data-format label Dec 15, 2015

sebix mentioned this issue Dec 18, 2015

harmonization: move the user_agent info out from extra and make it it's own proper field #419

Open

sebix closed this as completed Apr 5, 2016
