Sensitive data handling

The ALA handles potentially sensitive data by generalising record data. In general, this means rounding off latitude and longitude values so that the location is only accurate to a specified level (eg. rounded to 10km) and nulling information that could be used to reconstruct the locality (eg. locality, municipality, verbatim coordinates etc.). In theory, other generalisations could be applied, eg. reducing the date to year-month accuracy.
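
As a rough illustration of what this generalisation means in practice, the sketch below rounds the coordinates to one decimal place (roughly 10km) and nulls the locality-related fields. The field names are Darwin Core terms; the class and method names are illustrative only and are not part of the SDS.

```java
// Illustrative only: a minimal sketch of generalising a record, not the actual SDS code.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GeneraliseExample {

    // Round a coordinate to one decimal place; 0.1 of a degree is roughly 10 km.
    static String roundToTenKm(String value) {
        return String.format("%.1f", Double.parseDouble(value));
    }

    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put("decimalLatitude", "-37.824712");
        record.put("decimalLongitude", "144.953513");
        record.put("locality", "Royal Botanic Gardens, Melbourne");
        record.put("verbatimCoordinates", "37 49 29 S 144 57 13 E");

        // Generalise the coordinates and null fields that could reconstruct the site.
        record.computeIfPresent("decimalLatitude", (k, v) -> roundToTenKm(v));
        record.computeIfPresent("decimalLongitude", (k, v) -> roundToTenKm(v));
        for (String field : List.of("locality", "municipality", "verbatimCoordinates")) {
            record.put(field, null);
        }
        System.out.println(record); // coordinates now only accurate to about 10 km
    }
}
```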

The ALA sensitive data system can also detect instances of pest species and modify records to allow pest occurrences to be identified. This is not normally applied, since it also requires a notification mechanism.

Design

The basic principles that underlie the sensitive data library are shown in the diagram below.

There is a table of zone descriptions, each with a unique identifier. These can be either countries (AU, NZ, Other), states (VIC, NSW, etc.), territories (Australian Antarctic Territory), quarantine zones (Fire Ant Exclusion Zone) or Pest Notification Areas. Zones are not mutually exclusive.

There is a table of sensitivity categories, each with a unique identifier. These might be conservation/endangerment ratings or pest categories.

There is a schema of rules. These take the form (taxon, zone) → (category, generalisations). A specific taxon in a particular zone can be assigned to zero or more categories and sets of generalisations. The rules are constructed by querying the list server for lists specifically flagged for inclusion in the SDS: https://lists.ala.org.au/public/speciesLists?&max=25&sort=listName&order=asc&isSDS=eq:true Each list is associated with a zone and has defaults for the categories/generalisations to apply. Individual records can override these defaults.
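
The sketch below illustrates the shape of this model: zones, categories, and rules keyed on the (taxon, zone) pair. The record types and the example taxon identifier are made up for illustration; the real SDS classes differ.

```java
// Illustrative data shapes only; the real SDS classes and identifiers differ.
import java.util.List;
import java.util.Map;

public class RuleModelExample {
    // A zone: a country, state, territory, quarantine zone or pest notification area.
    record Zone(String id, String name) {}

    // A sensitivity category, e.g. a conservation rating or a pest category.
    record Category(String id, String name) {}

    // A single generalisation instruction, e.g. round coordinates or null a field.
    record Generalisation(String field, String action) {}

    // A rule maps a (taxon, zone) pair to a category and a set of generalisations.
    record Rule(String taxonId, String zoneId, Category category,
                List<Generalisation> generalisations) {}

    public static void main(String[] args) {
        Rule rule = new Rule(
            "taxon-123",                                  // hypothetical taxon identifier
            "VIC",
            new Category("VU", "Vulnerable"),
            List.of(new Generalisation("decimalLatitude", "roundTo10km"),
                    new Generalisation("locality", "null")));

        // Rules are looked up by the (taxon, zone) pair; a taxon may have zero or more rules.
        Map<String, Rule> rules = Map.of(rule.taxonId() + "|" + rule.zoneId(), rule);
        System.out.println(rules.keySet());
    }
}
```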

An occurrence record arrives at the SDS as a list of properties, based on Darwin Core terms, eg. decimalLatitude, year, etc. The SDS takes the following steps (a rough sketch of the flow follows the list):

  • First, the taxon is matched, either directly via taxonConceptID or by matching the scientificName. If there are no rules that might apply to that taxon, the generalised record is the same as the supplied record.
  • If there are rule instances attached to the taxon, the zones for the record are computed. This is done by supplying the latitude and longitude to a set of layers, either directly or via the layers web service (I'm really not sure how this is chosen). The result is a list of zone identifiers that apply to the record. Setting the sampledProvided flag to true indicates that values for specific layers are provided as part of the properties, using the layer ID as a key.
  • The rules list is searched for each (taxon, zone) pair. If there is a rule instance, then the generalisations given by the rule are applied to properties in the occurrence record. The result is an occurrence record where certain interesting properties have been fuzzed or removed.
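
A rough sketch of these three steps, assuming a rule table keyed on (taxon, zone) and a stubbed zone lookup (the real SDS intersects layers or calls the layers web service):

```java
// A rough sketch of the SDS steps; the rule table and zone lookup are stand-ins.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SdsFlowExample {
    record Rule(String taxonId, String zoneId, List<String> generalisations) {}

    // Stand-in rule table keyed on (taxon, zone); really built from the SDS lists.
    static final Map<String, Rule> RULES = Map.of(
        "taxon-123|VIC", new Rule("taxon-123", "VIC",
            List.of("roundCoordinatesTo10km", "nullLocalityFields")));

    // Stand-in for the spatial lookup (direct layer intersection or the layers web service).
    static List<String> computeZones(double lat, double lon) {
        return List.of("AU", "VIC");
    }

    public static void main(String[] args) {
        String taxonConceptID = "taxon-123";                    // step 1: taxon already matched
        List<String> zones = computeZones(-37.8247, 144.9535);  // step 2: zones for the record

        // Step 3: collect the generalisations from every matching (taxon, zone) rule.
        List<String> generalisations = new ArrayList<>();
        for (String zone : zones) {
            Rule rule = RULES.get(taxonConceptID + "|" + zone);
            if (rule != null) {
                generalisations.addAll(rule.generalisations());
            }
        }
        // An empty list means the generalised record is identical to the supplied record.
        System.out.println(generalisations);
    }
}
```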

What does this look like API-wise? You submit a dictionary of properties and get back a sensitivity report, saying whether and how the record is sensitive, along with a set of new properties to apply to the record.
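
For illustration, the request and report might be shaped like the sketch below. The field names (sensitive, categories, updatedProperties) and the example species are assumptions, not the actual service contract.

```java
// Hypothetical request/report shapes; the real service contract may differ.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SensitivityReportExample {
    // Input: a dictionary of Darwin Core style properties.
    record SensitivityRequest(Map<String, String> properties) {}

    // Output: whether/how the record is sensitive, plus new property values to apply.
    record SensitivityReport(boolean sensitive,
                             List<String> categories,
                             Map<String, String> updatedProperties) {}

    public static void main(String[] args) {
        SensitivityRequest request = new SensitivityRequest(Map.of(
            "scientificName", "Caladenia example",   // hypothetical name
            "decimalLatitude", "-37.8247",
            "decimalLongitude", "144.9535",
            "locality", "Royal Botanic Gardens, Melbourne"));

        // A report the caller might get back and apply to the occurrence record.
        Map<String, String> updated = new HashMap<>();
        updated.put("decimalLatitude", "-37.8");   // generalised
        updated.put("decimalLongitude", "144.9");  // generalised
        updated.put("locality", null);             // nulled so the site cannot be reconstructed
        SensitivityReport report = new SensitivityReport(true, List.of("Vulnerable"), updated);

        System.out.println(request);
        System.out.println(report);
    }
}
```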

Sensitive data handling in the biocache-store

The biocache-store has an instance of the SDS library linked into it. The full occurrence record is converted into properties and supplied to the library, and any changes are applied to the record.

The SensitivityProcessor takes the raw record and converts it into a key-value dictionary. It then adds decimalLatitude, decimalLongitude, coordinatePrecision, coordinateUncertaintyInMeters, year, month and day from the processed record, since these will have been converted to WGS84 and have had dates parsed. The coordinates are also supplied to the SpatialLayerDAO and any results from the (configured) sdsLayerList are added to the dictionary, using the layer ID as a key. A sampled-values-provided flag is then set, to stop the SDS from doing its own sampling. The properties are then supplied to the SDS library and the resulting values are applied to the record.
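
As a loose sketch of the input that gets assembled (not the actual SensitivityProcessor code), the dictionary handed to the SDS might look like this. The layer IDs, layer values and the flag name are assumptions.

```java
// A loose sketch of assembling SDS input; layer IDs, values and the flag name are made up.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SdsInputExample {
    // Stand-in for a spatial lookup against the configured sdsLayerList.
    static Map<String, String> sampleLayers(double lat, double lon, List<String> layerIds) {
        Map<String, String> samples = new HashMap<>();
        for (String layerId : layerIds) {
            samples.put(layerId, "zone-value-for-" + layerId); // placeholder sample result
        }
        return samples;
    }

    public static void main(String[] args) {
        // Start from the raw record, then overlay processed values (WGS84 coordinates, parsed dates).
        Map<String, String> properties = new HashMap<>();
        properties.put("scientificName", "Caladenia example");          // from the raw record
        properties.put("locality", "Royal Botanic Gardens, Melbourne"); // from the raw record
        properties.put("decimalLatitude", "-37.8247");                  // from the processed record
        properties.put("decimalLongitude", "144.9535");
        properties.put("year", "2023");
        properties.put("month", "5");
        properties.put("day", "6");

        // Add sampled layer values keyed by layer ID, then flag that sampling is already done.
        properties.putAll(sampleLayers(-37.8247, 144.9535, List.of("cl-example-1", "cl-example-2")));
        properties.put("sampledValuesProvided", "true"); // assumed flag name

        System.out.println(properties); // this dictionary is what gets handed to the SDS library
    }
}
```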

Sensitive data handling in the pipeline (v1)

The pipeline uses the ala-sensitive-data-service to provide data to a transform, shown in the diagram below.

  • Data is collected from the verbatim data and the various interpreted records.
  • The transform first makes a query to the service to see whether this taxon is sensitive at all. This query is coarse-grained and can be saved in a KV cache.
  • If the taxon is sensitive, the transform builds a properties list from the records supplied.
  • The properties list is supplied to the service, which performs the processing shown in the basic SDS diagram above. The result is a sensitivity report. This is not really cacheable, since it relies on the SDS to look up the provided lat/longs, and these are likely to vary across records (see the sketch after this list).
  • The report is used to build a new sensitivity record, describing the sensitive status and the original/altered values for the occurrence.
  • Each record is then modified to contain any new values.
  • The resulting generalised records are then written to a new, generalised repository.
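
A rough sketch of the v1 flow, with a KV cache for the coarse taxon check and a stand-in for the per-record service call (all names are illustrative):

```java
// A rough sketch of the v1 transform flow; the service client and cache are stand-ins.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class SensitiveTransformV1Example {
    // Coarse-grained "is this taxon sensitive at all?" answers, cacheable in a KV store.
    static final Map<String, Boolean> kvCache = new HashMap<>();
    static final Set<String> sensitiveTaxa = Set.of("taxon-123"); // stand-in for the service

    static boolean isSensitive(String taxonId) {
        return kvCache.computeIfAbsent(taxonId, sensitiveTaxa::contains);
    }

    // Stand-in for the per-record web service call that returns a sensitivity report.
    // Not cacheable in v1, because it depends on the record's lat/long.
    static Map<String, String> fetchReport(Map<String, String> properties) {
        Map<String, String> changes = new HashMap<>();
        changes.put("decimalLatitude", "-37.8"); // generalised value
        changes.put("locality", null);           // nulled value
        return changes;
    }

    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put("taxonConceptID", "taxon-123");
        record.put("decimalLatitude", "-37.8247");
        record.put("locality", "Royal Botanic Gardens, Melbourne");

        if (isSensitive(record.get("taxonConceptID"))) {
            // One web service call per potentially sensitive record: the v1 bottleneck.
            record.putAll(fetchReport(record));
        }
        System.out.println(record);
    }
}
```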

What's wrong with this?

The key problem with this is that the processing step has to invoke the web service for each potentially sensitive record. This is a bit inefficient.

Sensitive data handling in the pipeline (v2)

The key aim here is to make the processing queries cacheable. It relies on the processing pipeline being able to quickly provide a list of zones that the record belongs to, by looking them up in a geographical bitmap. The web service then returns instructions on how to process the records, which are applied in the pipeline (a sketch of this flow follows the list). So:

  • Data is collected from the verbatim data and the various interpreted records.
  • The transform first makes a query to the service to see whether this taxon is sensitive at all. This query is coarse-grained and can be saved in a KV cache.
  • If the record is potentially sensitive, a list of zones that the record is in is computed from a geo bitmap. Note that the record can have multiple zones.
  • The taxon and the zones list are supplied to the web service. Note that this is cacheable, since the zones are fairly large.
  • The web service returns a report with a list of processing instructions in it, which may come from the cache.
  • Each record is modified according to the processing instructions. (These are quite simple - generalise to 5km, or delete the locality field.)
  • The report is also used to build a new sensitivity record, describing the sensitive status and the original/altered values drawn from the records.
  • The resulting generalised records are then written to a new, generalised repository.
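
A rough sketch of the v2 idea, with a stubbed geo bitmap lookup and a cache keyed on the (taxon, zones) pair; the instruction format and all names are assumptions:

```java
// A rough sketch of the v2 flow; the bitmap lookup, cache and instruction format are stand-ins.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SensitiveTransformV2Example {
    record Instruction(String field, String action) {}

    // Cache keyed on (taxon, zones); effective because zones are large and shared across records.
    static final Map<String, List<Instruction>> kvCache = new HashMap<>();

    // Stand-in for the geo bitmap lookup done inside the pipeline.
    static List<String> lookupZones(double lat, double lon) {
        return List.of("AU", "VIC");
    }

    // Stand-in for the web service returning processing instructions for (taxon, zones).
    static List<Instruction> fetchInstructions(String taxonId, List<String> zones) {
        String key = taxonId + "|" + String.join(",", zones);
        return kvCache.computeIfAbsent(key, k -> List.of(
            new Instruction("decimalLatitude", "generalise:5km"),
            new Instruction("decimalLongitude", "generalise:5km"),
            new Instruction("locality", "delete")));
    }

    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put("taxonConceptID", "taxon-123");
        record.put("decimalLatitude", "-37.8247");
        record.put("decimalLongitude", "144.9535");
        record.put("locality", "Royal Botanic Gardens, Melbourne");

        List<String> zones = lookupZones(-37.8247, 144.9535);
        for (Instruction i : fetchInstructions(record.get("taxonConceptID"), zones)) {
            if ("delete".equals(i.action())) {
                record.put(i.field(), null);       // drop the field entirely
            } else if (i.action().startsWith("generalise")) {
                record.computeIfPresent(i.field(), // crude rounding stands in for 5 km generalisation
                    (k, v) -> String.format("%.1f", Double.parseDouble(v)));
            }
        }
        System.out.println(record);
    }
}
```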

Diagram: pipelines sensitive data processing
