Types of user ids #5775

Closed
jdwieland8282 opened this issue Sep 21, 2020 · 27 comments · Fixed by #5767
Labels
feature pinned won't be closed by stalebot

Comments

@jdwieland8282
Member

jdwieland8282 commented Sep 21, 2020

Type of issue

Question

Description

Do consumers of user ids set within either a new or existing userid module need to know more about how the UUID was generated, or is the id itself sufficient? Would more DSPs integrate against a particular user id if they knew more about how it was generated?

We should consider a new attribute called "stype" (source type). The stype would be passed alongside the UUID to SSPs & DSPs.

Steps to reproduce

NA

Test page

NA

Expected results

pbjs.setConfig({
    userSync: {
        userIds: [{
            name: "publisherProvided",
            params: {
                eids: [{
                    source: "example.com",
                    atype: 1,
                    uids: [{
                        id: "value read from cookie or local storage",
                        ext: {
                            stype: "sha256email"
                        }
                    }]
                }, {
                    source: "id-partner.com",
                    atype: 1,
                    uids: [{
                        id: "value read from cookie or local storage",
                        ext: {
                            stype: "ppuid"
                        }
                    }]
                }]
            }
        }]
    }
});
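
For a consumer of this config, reading the proposed `stype` back out of an `eids` array might look roughly like the sketch below. Note that `stype` is only the attribute proposed in this issue, not part of any released spec, and `collectSourceTypes` and `exampleEids` are hypothetical names used for illustration.

```javascript
// Hypothetical helper: map each eid source to the stype values found on its uids.
// Assumes the proposed shape above, where stype lives at uids[].ext.stype.
function collectSourceTypes(eids) {
  const bySource = {};
  for (const eid of eids || []) {
    const stypes = (eid.uids || [])
      .map((uid) => uid.ext && uid.ext.stype)
      .filter((stype) => typeof stype === "string");
    bySource[eid.source] = stypes;
  }
  return bySource;
}

// Placeholder ids; the sources and stype values mirror the config above.
const exampleEids = [
  { source: "example.com", atype: 1, uids: [{ id: "abc", ext: { stype: "sha256email" } }] },
  { source: "id-partner.com", atype: 1, uids: [{ id: "def", ext: { stype: "ppuid" } }] },
];

console.log(collectSourceTypes(exampleEids));
```

Uids without an `ext.stype` are simply skipped, so a source with no declared stype maps to an empty list.
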
@smenzer
Collaborator

smenzer commented Sep 24, 2020

I would suggest NOT including any form of hashed email in the options for stype. a hashed email is like a fingerprint in that you can't "reset" it - once you have it, you will always have the same link to a set of user data, even if they've asked someone upstream to reset/clear their data. since there's no efficient way (today) to tell EVERY platform in the industry to wipe data for a user, the best way today is simply to generate a new user id. this is just like apple and android allowing you to reset your MAID. identity providers that base an ID off of a hashed email are fine since they can change the ID they generate if the user has asked to reset/opt out at some point.

@jdwieland8282
Member Author

So far we have:

  • DMP - added by a 3rd party id provider like ID5, Liveramp, Lotame, etc..
  • PPUID - added by the publisher, the publisher can be identified in eids.source

@dmdabbs

dmdabbs commented Oct 1, 2020

FWIW, in late July there was a discussion on https://openrtb-iabtechlab.slack.com/archives/C3Y6GHUTH/p1595433523254200 (TechLab Programmatic, general Slack). The use case was similar to the one here, signaling a stable, publisher-generated UID:

As a DSP, I want a site-specific/publisher-provided ID, so to enable basic per-site frequency-capping at least, in the absence of cross-site identifier, though there are probably other uses. I want this to be an ID generated by the site, common to all traffic for that site, i.e. perhaps generated by the PubCommon prebid.js module or similar, but how it gets made is outside of OpenRTB's scope I think.
...
FWIW, we accept eids with a "source" attribute of "pubcid.org" for that scenario.
...
re: eids, I think probably this should happen..
add an "agent type" for site-specific IDs. Somehow deal with that there won't be a "source", necessarily, if they self-generate them. "source" is defined currently as "Source or technology provider responsible for the set of included IDs. Expressed as a top-level domain."

The following was sketched but I don't recall seeing this discussion thread picked up again.

// Agent Type
// 0   A stable, publisher/site-provided identifier.
// ... etc from OpenRTB spec
//
"eids":[
{
   "source": "localhost",
   "uids": [
      { "id": "c4a4c843-2368-4b5e-b3b1-6ee4702b9ad6", "atype": 0 }
   ]
},   
...

I pinged the channel to see if there was more discussion on eids enhancements.

@smenzer
Collaborator

smenzer commented Oct 1, 2020

to me, DMP is too generic, and also identifiable simply by looking in the source field. I'm not sure exactly what all the right values are, but I think it's important to get some ideas from the consumers of the IDs (i.e. DSPs) to make sure it's useful.

On the ID5 side for example, we provide a field we call linkType that we use to signal how we linked two 1p IDs together - through no link (i.e. it's a publisher-only ID), through our probabilistic algo, or via deterministic signals. This would let consumers of the ID know the strength of the cross-domain reconciliation and allow them to make decisions on it. Perhaps standardizing something along these lines would be useful for the DSPs?

@joshuakoran

I agree that when describing IDs, it would be useful to distinguish among the various "dimensions" of IDs:

Describes Person or Device/App

  • Directly-Identifiable (e.g., email)
  • Pseudonymous (e.g., alphanumeric string)

Set/Link of IDs

  • Device graph
  • First-party sets?

Source type

  • Publisher or Brand (who from the consumer's point of view is also a publisher)
  • Vendor to publisher or brand (or their agents)

Actual source

  • Which domain of which organization generated/controls ID

Age of ID

  • Creation date
  • Last seen date

I would keep all the "dimensions" distinct from uses of ID

  • Preference management (e.g., opt-in/-out of personalization)
  • Engagement (or restrictions like frequency cap)
  • Measurement (distinct counting)

@jdcauley

jdcauley commented Oct 2, 2020

Re @dmdabbs note, I think this is what you might be referring to?

This is the spec we're currently using with OpenRTB, https://github.com/Advertising-ID-Consortium/IdentityLink-in-RTB

@jdwieland8282
Member Author

@joshuakoran Just working my way through your list. I think the adcom "atype" field handles

  1. Directly-Identifiable (e.g., email)
  2. Pseudonymous (e.g., alphanumeric string)

-Can you further clarify what you meant by "Set/Link of IDs"? Do you expect this to be an array of other ids?
-Same question for "source type", what values do you expect here?
-"Actual source" would be the user id module name value, or "source" if different from name.

  • age of ID, seems easy enough.
  • Re dimensions, are you advocating a second array where uses of ID would be enumerated? or just point out that we shouldn't add these at all yet?

@joshuakoran

@jdwieland8282

The "set/link of IDs" concept relates to sharing a common ID that maps one ID to other IDs (e.g., x-device link, link across two 1P domains such as required by first-party sets).

The source (or perhaps better name is "controller") of ID is due to permissions/permitted uses tied to ID.

The dimensions defining the ID (e.g., source/controller, type, time) are orthogonal to information tied to the ID (audience attributes/cohorts, restrictions against use for personalization, event-aggregates such as frequency counter, etc.)

@jdwieland8282
Member Author

Hey @joshuakoran, mind adding a few example values for each that you think may be relevant? I want to be sure I'm clear on what you are suggesting.

@joshuakoran

Sorry for the delay, finally coming up for air. Agree that adcom is the right model to improve upon.

As we think about reducing discrepancies and adopting cross-publisher common ID schemes, such as being discussed here and in IAB TL, it seems we can improve how we annotate the interoperable IDs being used to improve engagement, measurement and optimization.

The original question as I understood it was to provide enhanced standard descriptors (metadata) around user IDs + information associated with them, rather than the attribute data (e.g., interest taxonomies, demographic taxonomies, geo taxonomies) or event data (activity_type, optional value of activity such as a purchase transaction).

I think the broad classification of ID metadata can be classed into two buckets of better describing the what-ness of the ID and “provenance” of the ID.

WHAT concepts
Some IDs describe people/households (such as home address); others describe web clients (like the alphanumeric strings stored in cookies). Privacy language calls the former “directly-identifiable” (to replace the more ambiguous term “PII”); when web activity is not associated with directly-identifiable IDs, privacy language calls these IDs “pseudonymous IDs.”

Thus example one might be to define whether the ID is pseudonymous or not.

The second type of ID is one that merely links other IDs, such as a “cluster ID.” This ID is generated server-side to associate various IDs together, either probabilistically OR deterministically. Marketers often use this for “x-device” or even same device “x-app” use cases. When publishers operating different domains link their IDs deterministically they may wish to create a shared ID for their use, which is analogous to the proposed “first-party” sets.

Thus example two might be to define whether the ID is deterministically associated with other IDs or not.

FROM WHERE “provenance” concepts
An orthogonal dimension to the ID we are discussing is its provenance. Which organization created it? Privacy regulations tend to call this the “data controller.”

When was it created? When was the last time it was verified as still active?

Syndicating “stale” IDs to be activated in a walled garden or across the Open Web is technically feasible, but not adding value to marketers. Yet most marketers do not have visibility on the age or last seen date of the data syndicated on their behalf to improve media buying.

Ensuring we know where IDs come from likely requires ensuring compact description and perhaps even signing the data.

USE concepts

I also recommend we keep the above annotations about IDs distinct from what processing operations are associated with them:

  • Preference management (e.g., opt-in/-out of personalization)
  • Engagement (or restrictions like frequency cap)
  • Measurement (distinct counting)
  • Audit (which ID was sent from which org to which other org, when, and what use restrictions were communicated)

Examples (purely for illustration and not in formal spec format or optimized for transport efficiency):
Zeta_Pseudonymous_ID=123, pseudonymous, created 20200915, last_event=20201025
Zeta_Pseudonymous_ID=234, pseudonymous, created 20201001, last_event=20201026
Zeta_Email_ID=pomacedon@gmail.com, directly-identifiable, created 20201001, last_event=20201027
Zeta_Household_ID=abc, pseudonymous, probabilistic_set {ZPID=123, ZPID=234}, created=20201027

@jdwieland8282
Member Author

Thanks @joshuakoran, what you're describing is going to be tough to express in JSON in a way that makes sense to everyone. Let me take a first stab and we can iterate. Wrt provenance, I feel like the source and stype values do a good job describing that, so I'm going to leave them out for now.

@jdwieland8282
Member Author

jdwieland8282 commented Nov 3, 2020

how about something like this? Anything else to add?

{
   "ext":{
      "eids":[
         {
            "source":"sharedid.org",
            "uids":[
               {
                  "id":"d88c96-5cb6-410d-827d-b019e476",
                  "atype":1,
                  "ext":[
                     {
                        "stype":"ppuid", //ppuid,dmp,sha256email
                        "origin":"person", //person, household, browser, device, gaming console
                        "pseudonymous":true, //boolean
                        "deterministic":false, //boolean
                        "created":"1604429992", //UNIX timestamp
                        "lastseen":"1604430025", //UNIX timestamp
                        "signature":[
                           {
                              "signedby":"cryptoboi",
                              "signature":"cryptostring"
                           }
                        ]
                     }
                  ]
               }
            ]
         }
      ]
   }
}

@abhinavsinha001

I am assuming all these params and values have to be well defined for any consumer to make sense of them. Wouldn't it be better if we mapped combinations of origin, pseudonymous and deterministic to custom atype values and published them? That would reduce payload and be easy to extend without adding extra parameters.

@jdwieland8282
Member Author

Hi @abhinavsinha001, I think you've raised a very good point. To be clear, I don't have a strong opinion yet about what this should look like; I'm channeling the Identity PMC. But to your point about well defined values, you are exactly right. We need a way to ensure that creators don't declare their ID deterministic when it isn't. Wrt pseudonymous, all ids except email address are pseudonymous, and even email can be pseudonymous. So in my mind pseudonymous should go entirely.

The consumer in this scenario is a DSP.

As far as mapping pseudonymous and deterministic to a custom atype, atype isn't well understood or used. In theory that sounds like a good idea to me, but in practice I'm not sure it would work. Thanks for your comments; what would be really helpful is a modified example. I don't want to be the only one doing the data modeling.

@joshuakoran

Hi Jeff -

"Wrt pseudonymous, all ids except email address are pseudonymous"

I think that while many IDs we rely on may begin as "pseudonymous," I believe the regulations require organizations to have appropriate technical and/or operational measures in place to keep people's activity distinct from their offline identity (directly-identifiable ID, fkna PII) to be classed as "pseudonymous."

@jdwieland8282
Member Author

sure, no disagreement from me on that pt.

@smenzer
Collaborator

smenzer commented Nov 5, 2020

since the primary consumers here are DSPs, can we get some of them to weigh in on what they'd want to see and whether they want the granularity of separate fields or a single field like atype?

@abhinavsinha001

I agree we should get feedback from DSPs on this. I feel most of the parameters do not have any significance individually and can be represented broadly using atype values.

Sample request leveraging atype value

{
  "eids": [
    {
      "source": "sharedid.org",
      "uids": [
        {
          "id": "d88c96-5cb6-410d-827d-b019e476",
          "atype": 501,
          "ext": [
            {
              "created": "1604429992",
              "lastseen": "1604430025",
              "signature": [
                {
                  "signedby": "cryptoboi",
                  "signature": "cryptostring"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Here is how we can maintain metadata for atype and parameters that define a particular atype value.

Atype Metadata

| Atype | Description |
|-------|-------------|
| 1 | An ID which is tied to a specific web browser or device (cookie-based, probabilistic, or other). |
| 2 | In-app impressions, which will typically contain a type of device ID (or rather, the privacy-compliant versions of device IDs). |
| 3 | A person-based ID, i.e., that is the same across devices. |
| 500x | All the IDs generated by publishers (stype:ppuid) |
| 501 | stype:ppuid, origin:browser, deterministic:true, method:login, scope:individual, duration:short |
| 501 | stype:ppuid, origin:browser, deterministic:true, method:localstore, scope:individual, duration:medium |
| 600x | All the IDs acquired from some DMP (stype:dmp) |
| 601 | stype:dmp, origin:browser, deterministic:true, method:transaction, scope:individual, duration:long |
| 601 | stype:dmp, origin:browser, deterministic:true, method:transaction, scope:household, duration:long |
| 700x | All identifiers generated using some link like IP/Device (stype:probabilistic) |
| 701 | stype:probabilistic, origin:gaming-console, deterministic:false, method:algo, scope:household, duration:short |

ID metadata Params

| Parameter | Description |
|-----------|-------------|
| stype | Type of source which generated this ID |
| origin | Where this ID was generated / stored |
| deterministic | Whether the ID can be confidently tied to a browser/person |
| method | How the ID was acquired: login, some transaction like a purchase, algorithm, or traditional sync |
| scope | Whether this ID represents an individual / household |
| duration | The time this ID can typically last: short < 7 days, medium < 30 days, long > 30 days |
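
To illustrate how a consumer might apply these tables, here is a rough sketch that classifies an atype value into its family and buckets a lifetime into the duration labels. It reads "500x", "600x" and "700x" as the ranges 500-599, 600-699 and 700-799, which is an assumption, and `atypeFamily` and `durationBucket` are hypothetical names rather than anything defined by Prebid or OpenRTB.

```javascript
// Sketch only: families follow the atype table above.
// Assumption: "500x" means the whole 500-599 range (likewise for 600x and 700x).
function atypeFamily(atype) {
  if (atype >= 1 && atype <= 3) return "standard"; // existing browser/device, in-app, person-based values
  if (atype >= 500 && atype < 600) return "ppuid";
  if (atype >= 600 && atype < 700) return "dmp";
  if (atype >= 700 && atype < 800) return "probabilistic";
  return "unknown";
}

// Duration labels from the parameter table: short < 7 days, medium < 30 days, long > 30 days.
// The table does not say what exactly 30 days is; this sketch treats it as "long".
function durationBucket(days) {
  if (days < 7) return "short";
  if (days < 30) return "medium";
  return "long";
}

console.log(atypeFamily(501), durationBucket(3));
```

A range check keeps the lookup small, but it only works if each family keeps to its own block of a hundred values.
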

@abhinavsinha001

Update: Just realized while on IAB-TL meeting - most of the fields and data are part of Data Transparency Standard 1.0 and there is an active discussion to map these fields to oRTB User object - we can use the same standards for eids type as well.

@jdwieland8282
Member Author

ok, so sounds like we have something that describes the type of user id in the atype field and it's just a matter of defining how we want to support the atype designation:

  • created
  • last seen
  • signed

I'd like to pause here, now that we have some firmer requirements and wait for DSPs to weigh in. Any disagreement with that approach?

@jdwieland8282
Member Author

jdwieland8282 commented Nov 11, 2020

@abhinavsinha001 I like your example. For anyone who missed the 11/11 Identity PMC meeting, we agreed to move forward with this feature. The group felt we should proactively provide some real time metadata about the id to buyers in preparation for a future state with diminished 3rd party cookie availability.

Each UserId module sub adapter will need to decide whether to support these fields. The PMC will define the standard. Are there any objections to @abhinavsinha001's data model? I'll cross post on our slack channel as well.

{
  "eids": [
    {
      "source": "sharedid.org",
      "uids": [
        {
          "id": "d88c96-5cb6-410d-827d-b019e476",
          "atype": 501,
          "ext": [
            {
              "created": "1604429992",
              "lastseen": "1604430025",
              "signature": [
                {
                  "signedby": "cryptoboi",
                  "signature": "cryptostring"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
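
As a sketch of how a buyer could use the `created` and `lastseen` fields in this model, the snippet below computes how long each ID has existed and how long it has been idle. It assumes `ext` is an array of metadata objects and that timestamps are UNIX seconds stored as strings, as in the example above; `uidAges` and the `nowSeconds` parameter are hypothetical and not part of any agreed standard.

```javascript
// Hypothetical helper: for each uid carrying created/lastseen metadata, report
// its lifetime (lastseen - created) and idle time (now - lastseen), in seconds.
function uidAges(eids, nowSeconds) {
  const ages = [];
  for (const eid of eids || []) {
    for (const uid of eid.uids || []) {
      // The example models ext as an array; anything else is skipped.
      const metas = Array.isArray(uid.ext) ? uid.ext : [];
      for (const meta of metas) {
        const created = Number(meta.created);
        const lastSeen = Number(meta.lastseen);
        if (!Number.isFinite(created) || !Number.isFinite(lastSeen)) continue;
        ages.push({
          source: eid.source,
          id: uid.id,
          lifetime: lastSeen - created,
          idle: nowSeconds - lastSeen,
        });
      }
    }
  }
  return ages;
}

const sampleEids = [{
  source: "sharedid.org",
  uids: [{
    id: "d88c96-5cb6-410d-827d-b019e476",
    atype: 501,
    ext: [{ created: "1604429992", lastseen: "1604430025" }],
  }],
}];

// The "now" value is passed in explicitly so the result is reproducible.
console.log(uidAges(sampleEids, 1604430125));
```

Passing the current time as an argument, rather than reading the clock inside the function, keeps the calculation easy to test.
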

@joshuakoran

Just FYI PRAM is suggesting three types of Identifiers:

  1. system-generated pseudonymous ID (e.g., cookie or MAID),
  2. user-provided ID (e.g., hashed email) and
  3. directly-identifiable identity (publisher-agnostic offline identity)

We can augment this by creation/last seen as described above + source (e.g., publisher, vendor, marketer), such that vendor=apple provides IDFA, and vendor=sharedid.org provides cookie ID.

@smenzer
Collaborator

smenzer commented Nov 20, 2020

@joshuakoran I don't really understand the difference between 1. and 2. ... could you please explain a bit?

@joshuakoran

Even if the output is a pseudonymous ID, the input mechanism has different friction/control for users.

The user has binary control of generating / resetting ID in 1), but limited technical control over how the ID can be shared across domains.

The user has 100% technical control of providing (different/same) ID to be shared across domains for 2). Once the ID in 2) is generated it has the same limits as 1), but the generation using different IDs (work email, home email as one example) is different than using the same laptop with same browser cookies at home and work.

@stale

stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 25, 2020
@gglas gglas removed the stale label Jan 13, 2021
@gglas

gglas commented May 10, 2021

@jdwieland8282 did we land on a solution here?

@gglas gglas added feature pinned won't be closed by stalebot labels May 10, 2021
@jdwieland8282
Member Author

This hasn't come up lately; my recollection is that we would use the atype field and leave it at that. If anyone else has a different recollection, feel free to reopen and propose a standard.

7 participants