#22 Split the "All About Analytics Data" guide in smaller guides and a category about "Data Model"
mnapoli committed Jan 26, 2015
1 parent 61a381c commit d2eb4b2
Showing 7 changed files with 418 additions and 312 deletions.
4 changes: 1 addition & 3 deletions app/helpers/Content/Category/DevelopCategory.php
@@ -48,9 +48,7 @@ public function getItems()
 new Guide('piwiks-reporting-api'),
 ]),
 new Guide('piwik-on-the-command-line'),
-new EmptySubCategory('Analytics Data', [
-new Guide('all-about-analytics-data'),
-]),
+new Guide('all-about-analytics-data'),
 new EmptySubCategory('Database', [
 new Guide('persistence-and-the-mysql-backend'),
 new Guide('extending-database'),
319 changes: 13 additions & 306 deletions docs/all-about-analytics-data.md

Large diffs are not rendered by default.

197 changes: 197 additions & 0 deletions docs/archive-data.md
@@ -0,0 +1,197 @@
---
category: Develop
previous: archiving
next: reports
---
# Archive data

**Archive data** is created during the [archiving process](/guides/archiving) by aggregating **log data**.

Piwik aggregates and persists two types of archive data:

- **metrics**, which are single numeric values
- **reports**, which are two-dimensional arrays of values

Reports will normally contain metric values, but they can also contain other data (either additionally or in lieu of metric values).

Reports and metrics are defined by plugins, letting any plugin extend the data analyzed by Piwik. However, several metrics, called **core metrics**, are defined by Piwik Core.

## Subset parameters

Reports and metrics provide analytics data about a set of things. This set is defined by three constraints:

- a website ID
- a period
- a segment

The **website ID** selects visits that were tracked for a specific website. This ID is specified in all HTTP requests with the `idSite` query parameter.

The **period** selects visits that were tracked within a specific date range. The period is specified in all HTTP requests with the `date` and `period` query parameters.

The **segment** selects visits based on a boolean expression that uses visit properties. It is specified in all HTTP requests by the `segment` query parameter and can be used to select almost any conceivable subset of visits.

These subset parameters are stored with each report as [DataTable](/api-reference/Piwik/DataTable) metadata.
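
As an illustration, the three subset parameters can be combined into a single Reporting API request. This is a minimal sketch: the host name, the site ID, the date and the chosen segment are assumptions made for the example, not values taken from this guide.

```php
<?php
// Hypothetical Piwik host; replace it with your own installation.
$piwikUrl = 'http://piwik.example.org/index.php';

$params = [
    'module'  => 'API',
    'method'  => 'VisitsSummary.get',
    'format'  => 'JSON',
    'idSite'  => 1,                  // website ID (example value)
    'period'  => 'day',              // period type
    'date'    => '2015-01-26',       // date the period is based on (example value)
    'segment' => 'browserCode==FF',  // example segment: visits made with Firefox
];

// http_build_query() URL-encodes the values, including the "==" in the segment.
$requestUrl = $piwikUrl . '?' . http_build_query($params, '', '&');

echo $requestUrl, "\n";
```

Leaving out `segment` requests the data for all visits of the site and period.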

## Metrics

### Core metrics

**Core metrics** are metrics that are not defined by plugins but by **Piwik Core**.

New reports that analyze visits, action types or conversions should contain these metrics.

#### Visit metrics

Core metrics for a set of visits:

Name | Metric ID | Description
-----------------|-----------------------|------------
Visits | `nb_visits` | The number of tracked visits. <br> A visit is a series of events in which consecutive events happened no more than 30 minutes apart.
Unique visitors | `nb_uniq_visitors` | The number of unique sources of visits. <br> A visit source is an entity that causes a visit to be tracked.
Actions | `nb_actions` | The number of tracked actions. <br> An action is an event tracked by Piwik.
Max Actions | `max_actions` | The maximum number of actions that occurred in one visit.
Sum Visit Length | `sum_visit_length` | The sum of each visit's elapsed time.
Bounce Count | `bounce_count` | The number of visits that consisted of only one action.
Converted Visits | `nb_visits_converted` | The number of visits that caused at least one conversion. <br> Includes conversions for every goal of a site.
Conversions | `nb_conversions` | The number of conversions tracked for this set of visits. <br> Includes conversions for every goal of a site.
Revenue | `revenue` | The total revenue generated by these visits. <br> Includes revenue for every goal of a site plus its ecommerce revenue.

#### Action metrics

Core metrics for a single action type:

Name | Metric ID | Description
--------------------------|--------------------------------|------------
Hits | `nb_hits` | The number of times this action was done.
Sum Time Spent | `sum_time_spent` | The total amount of time the user spent doing this action.
Sum Page Generation Time | `sum_time_generation` | The total amount of time a server spent serving this action.
Hits With Generation Time | `nb_hits_with_time_generation` | The number of hits that included generation time information.
Min Page Generation Time | `min_time_generation` | The minimum amount of time a server spent serving this action.
Max Page Generation Time | `max_time_generation` | The maximum amount of time a server spent serving this action.
Unique Exit Visitors | `exit_nb_uniq_visitors` | The number of unique visitors that ever exited a site after this action.
Exit Visits | `exit_nb_visits` | The total number of visits that ended with this action.
Unique Entry Visitors | `entry_nb_uniq_visitors` | The total number of unique visitors that started a visit with this action.
Entry Visits | `entry_nb_visits` | The total number of visits that started with this action.
Entry Actions | `entry_nb_actions` | <!-- TODO: isn't this the same as entry visits? -->
Entry Sum Visit Length | `entry_sum_visit_length` | The sum of each entry visit's elapsed time.
Entry Bounce Count | `entry_bounce_count` | The number of visits that consisted of this action and no other.
Hits From Search | `nb_hits_following_search` | The number of times this action was done after a site search.

#### E-commerce metrics

Core metrics for the set of ecommerce conversions (either all orders or all abandoned carts) recorded for a set of visits:

Name | Metric ID | Description
---------------------|--------------------|------------
Revenue Subtotal | `revenue_subtotal` | The total cost of every item that was a part of these orders or abandoned carts.
Revenue Tax | `revenue_tax` | The total tax amount applied to these orders/abandoned carts.
Revenue Shipping | `revenue_shipping` | The total amount of shipping applied to these orders/abandoned carts.
Revenue Discount | `revenue_discount` | The total amount of discounts applied to these orders/abandoned carts.
Ecommerce Item Count | `items` | The total number of items in these orders/abandoned carts.

#### Goal metrics

Core metrics for a set of visits and one goal of a site:

Name | Metric ID | Description
-----------------|--------------------------------|------------
Goal Conversions | `goal_<idGoal>_nb_conversions` | The conversions tracked for a specific goal and this set of visits.
Goal Revenue | `goal_<idGoal>_revenue` | The total revenue generated by the conversions for a specific goal.

_Note: `<idGoal>` should be replaced with the ID of a goal._

Goal-specific metrics are stored in the database in the `goals` column of serialized reports. The column contains a PHP array mapping goal IDs to arrays of goal-specific metric values. The [AddColumnsProcessedMetricsGoal](/api-reference/Piwik/DataTable/Filter/AddColumnsProcessedMetricsGoal) DataTable filter turns these values into normal column values, using the metric names described above.
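
The mapping from the `goals` array to flat `goal_<idGoal>_...` columns can be pictured with plain PHP arrays. This is only a sketch of the idea, not the filter's implementation: the function name `flattenGoalMetrics()` is hypothetical, and the string keys inside `goals` are an assumption made for readability.

```php
<?php
/**
 * Copies goal-specific values from a row's "goals" array into flat
 * goal_<idGoal>_<metric> columns. Plain-array sketch, not Piwik's filter.
 */
function flattenGoalMetrics(array $row)
{
    $goals = isset($row['goals']) ? $row['goals'] : [];
    unset($row['goals']);

    foreach ($goals as $idGoal => $goalMetrics) {
        foreach ($goalMetrics as $metricName => $value) {
            $row["goal_{$idGoal}_{$metricName}"] = $value;
        }
    }

    return $row;
}

// Hypothetical row holding metrics for two goals (IDs 1 and 3).
$row = [
    'label'     => 'Firefox',
    'nb_visits' => 120,
    'goals'     => [
        1 => ['nb_conversions' => 4, 'revenue' => 40.0],
        3 => ['nb_conversions' => 1, 'revenue' => 15.5],
    ],
];

print_r(flattenGoalMetrics($row));
```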

### Processed metrics

In the interests of [archiving](/guides/archiving) and database size efficiency, some metrics are not stored in the database. They are instead calculated when needed using other metrics. These metrics are called **processed metrics**.

Below is the list of processed metrics that are calculated using *core metrics*. New reports that analyze visits, action types or conversions should have these metrics added when possible.

_Note: Some processed metrics will appear multiple times in the lists below. These metrics have different meanings based on the reports they are in._

Processed metrics for a set of visits:

Name | Metric ID | Description
---------------------|------------------------|------------
Conversion Rate | `conversion_rate` | The percent of visits that had at least one conversion.
Actions Per Visit | `nb_actions_per_visit` | The average number of actions for a single visit.
Average Time On Site | `avg_time_on_site` | The average amount of time spent per visit, in seconds.
Bounce Rate | `bounce_rate` | The percent of visits that resulted in a bounce.
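
The visit-level processed metrics above can be derived from core metrics with simple divisions. The sketch below assumes the straightforward formulas implied by the descriptions; the function name `computeVisitProcessedMetrics()` is hypothetical, and Piwik's own code may round or format the values differently.

```php
<?php
/**
 * Derives the visit-level processed metrics from core metrics.
 * Rates are returned as fractions (0.4 means 40%); formatting is left to the caller.
 */
function computeVisitProcessedMetrics(array $core)
{
    $names  = ['conversion_rate', 'nb_actions_per_visit', 'avg_time_on_site', 'bounce_rate'];
    $visits = $core['nb_visits'];

    if ($visits == 0) {
        // Avoid a division by zero for an empty set of visits.
        return array_fill_keys($names, 0);
    }

    return [
        'conversion_rate'      => $core['nb_visits_converted'] / $visits,
        'nb_actions_per_visit' => $core['nb_actions'] / $visits,
        'avg_time_on_site'     => $core['sum_visit_length'] / $visits,
        'bounce_rate'          => $core['bounce_count'] / $visits,
    ];
}

// Hypothetical core metrics for one site and one day.
$processed = computeVisitProcessedMetrics([
    'nb_visits'           => 200,
    'nb_visits_converted' => 10,
    'nb_actions'          => 700,
    'sum_visit_length'    => 36000,
    'bounce_count'        => 80,
]);

print_r($processed);
```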

Processed metrics for a single action type:

Name | Metric ID | Description
---------------------------------------------|-----------------------|------------
Average Generation Time | `avg_time_generation` | The average amount of time it took for a server to serve this action.
Average Number of Search Result Pages Viewed | `nb_pages_per_search` | The average number of search result pages viewed after a site search. <br> Only valid for site search keywords and site search categories.
Average Time On Page | `avg_time_on_page` | The average amount of time users spent doing this action.
Entry Bounce Rate | `bounce_rate` | The percent of all visits that consisted of this action and no other.
Exit Rate | `exit_rate` | The percent of all visits that ended with this action.

Processed metrics for the set of ecommerce orders recorded for a set of visits:

Name | Metric ID | Description
----------------------|---------------------|------------
Average Order Revenue | `avg_order_revenue` | The average revenue of each order.

Processed metrics for the set of ecommerce items in a set of orders or abandoned carts:

Name | Metric ID | Description
------------------------|-------------------|------------
Average Price | `avg_price` | The average price of each item.
Average Quantity | `avg_quantity` | The average number of each item in an order/abandoned cart.
Product Conversion Rate | `conversion_rate` | The percent of orders/abandoned carts that include this item.

The following is a list of processed metrics that are also specific to one goal of one site:

Name | Metric ID | Description
--------------------------|-----------------------------------|------------
Average Revenue per Visit | `goal_<idGoal>_revenue_per_visit` | The average amount of revenue generated per visit for this goal.

### Naming convention

Metrics calculated and persisted by plugins **must** be named with the following format: `PluginName_metricName`. For example: `MyPlugin_myFancyMetric`.

Core metrics have special names and do not follow this convention.

## Reports

Reports are stored in memory using the [`DataTable`](/api-reference/Piwik/DataTable) class. A `DataTable` is a two-dimensional array composed of rows and columns.

Each row contains metrics that relate to a set of visits, actions, conversions… That set is defined and described by a special **label** column. How the column describes the set depends entirely upon the specific report. For example, in the `UserSettings.getBrowser` report, a row with the label *Firefox* would hold metrics for visits that used the Firefox browser.

Some reports like `VisitsSummary.get` will not have a label column: they have only one row that refers to the entire set of entities.

### Report metadata

In addition to metrics, each row can also contain **metadata**. This metadata will usually assist the label column in describing the set of things the row represents.

Some metadata have special meaning in Piwik, for example:

- `logo`: the value can be a path to an image that will be shown alongside each row in the UI
- `url`: the value can be a URL to which the row will link in the UI

### Subtables

Reports can be hierarchical: each row can have another DataTable attached to it. Tables that are attached to rows are called **subtables**.

Subtables provide further analytics for the set of visits that a row represents. For example, the `Referrers.getSearchEngines` report has one row per search engine. Each row has a subtable that describes keywords used with that search engine. Here is a schematic representation:

```
Search Engine | Keyword (subtable) | Visitors
--------------|--------------------|----------
Google        |                    | 207
--------------|-------------------------------
              | piwik              | 11
              | libre analytics    | 6
              | ...
----------------------------------------------
Duck Duck Go  |                    | 121
--------------|-------------------------------
              | ...
```
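
The same hierarchy can be pictured with nested PHP arrays. This sketch deliberately uses plain arrays rather than the `DataTable` API; the `subtable` key and the `renderRows()` helper are hypothetical names chosen for the example, and only the keywords shown in the schematic are included.

```php
<?php
// Each row is an array; an optional 'subtable' key holds the nested rows.
$searchEngines = [
    [
        'label'    => 'Google',
        'visitors' => 207,
        'subtable' => [
            ['label' => 'piwik', 'visitors' => 11],
            ['label' => 'libre analytics', 'visitors' => 6],
        ],
    ],
    [
        'label'    => 'Duck Duck Go',
        'visitors' => 121,
        'subtable' => [],
    ],
];

/** Returns one line per row, indenting subtable rows under their parent. */
function renderRows(array $rows, $depth = 0)
{
    $lines = [];
    foreach ($rows as $row) {
        $lines[] = str_repeat('  ', $depth) . $row['label'] . ': ' . $row['visitors'];
        if (!empty($row['subtable'])) {
            $lines = array_merge($lines, renderRows($row['subtable'], $depth + 1));
        }
    }

    return $lines;
}

echo implode("\n", renderRows($searchEngines)), "\n";
```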

### Naming convention

Reports **must** be named like metrics are: `PluginName_reportName`. For example: `MyPlugin_myFancyReport`.
134 changes: 134 additions & 0 deletions docs/archiving.md
@@ -0,0 +1,134 @@
---
category: Develop
previous: log-data
next: archive-data
---
# The Archiving Process

**Log data** cannot be used directly for end-user reports because that would require processing an enormous amount of data every time a report is requested.

To solve that problem, the **archiving process** aggregates log data into **archive data**. Reports are then built using archive data.

## Example

Let's take as an example a website that received 1000 page views in one day. The **log data** would be the list of those 1000 events along with other information, for example:

```
URL           Time       ...
/homepage     17:00:19   ...
/about        17:01:10   ...
/homepage     17:05:30   ...
/categories   17:06:14   ...
/homepage     17:10:03   ...
...
```

The **archiving process** aggregates this raw data into archive data.

For example, to build the report of the number of views per page (to see the most popular pages), the archiving will list all pages and sum the number of views for each page:

```
URL           Page views
/homepage     205
/categories   67
/about        5
...
```

That data is the **archive data**.
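
The aggregation in this example can be sketched in a few lines of PHP. The sketch works on an in-memory array that holds only the five log entries shown above; it illustrates the idea and is not how Piwik queries its log tables.

```php
<?php
// Log data: one entry per tracked page view (the excerpt shown above).
$logData = [
    ['url' => '/homepage',   'time' => '17:00:19'],
    ['url' => '/about',      'time' => '17:01:10'],
    ['url' => '/homepage',   'time' => '17:05:30'],
    ['url' => '/categories', 'time' => '17:06:14'],
    ['url' => '/homepage',   'time' => '17:10:03'],
];

// Archive data: number of page views per URL, most viewed first.
$pageViews = array_count_values(array_column($logData, 'url'));
arsort($pageViews);

print_r($pageViews);
```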

Pre-computing archive data is superfluous for 1000 page views, but it becomes necessary when dealing with much larger amounts of data.

## When?

By default, archive data is calculated and cached **on-demand**. When a specific report is requested, Piwik checks whether the required archive data exists and generates it if it does not.
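
The on-demand behaviour follows a compute-if-missing pattern, sketched below. Everything in the sketch is an assumption made for illustration: the `$cache` array stands in for the archive storage, and the `getArchiveData()` function and the key format are hypothetical.

```php
<?php
/**
 * Returns the archive data stored under $key, computing and storing it
 * first when it is missing. $cache stands in for the archive storage.
 */
function getArchiveData(array &$cache, $key, callable $archive)
{
    if (!array_key_exists($key, $cache)) {
        $cache[$key] = $archive(); // the expensive aggregation of log data
    }

    return $cache[$key];
}

$cache = [];
$runs  = 0;

// Stand-in for the archiving process; counts how often it is executed.
$archive = function () use (&$runs) {
    $runs++;
    return ['nb_visits' => 1000];
};

// Hypothetical key combining site, period, date and (empty) segment.
$key = 'idSite=1|period=day|date=2015-01-26|segment=';

$first  = getArchiveData($cache, $key, $archive); // archives, then caches
$second = getArchiveData($cache, $key, $archive); // served from the cache

echo "archiving ran $runs time(s)\n";
```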

### Pre-archiving

When tracking a website with a lot of traffic, on-demand archiving might take too much time. In those situations, on-demand archiving must be disabled and [pre-archiving needs to run in the background at a scheduled time](http://piwik.org/docs/setup-auto-archiving/).

Pre-archiving can be run for every site and period (except custom date ranges) using the `core:archive` console command:

```
$ ./console core:archive
```

A usual setup is to run that command at a fixed interval using `cron`.

The command will remember when it was last executed and will only archive a website if there have been new visits.

## How?

Log data is aggregated into archive data for each:

- site
- period: day, week, month, year or custom date range (custom date ranges cannot be pre-archived)
- [segment](http://piwik.org/docs/segmentation/)

Archiving logic (i.e. the way of aggregating log data) is defined by plugins. All reports defined by a plugin are archived together rather than individually.

If no segment is given in the query and data cannot be found, every report of every plugin will be generated and cached all at once. If a segment is supplied, then the reports that belong to the same plugins as the requested data will be generated and cached.

### Period aggregations

Archive data is calculated differently based on the period type:

- "day" periods are aggregations of log data
- "week", "month", "year" and custom date ranges are aggregations of "day" reports

For example, archive data for a week is created by aggregating the archive data of the 7 days of that week. This is much faster than aggregating log data.
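
For metrics that can simply be added up, building a week from its days amounts to a sum, as in the sketch below. The numbers and the `aggregateDayArchives()` function are hypothetical. Summing is only valid for additive metrics: a metric such as `max_actions` would need a maximum instead, and unique visitors cannot be obtained by adding daily values, since the same visitor may return on several days.

```php
<?php
/** Sums each metric across a list of per-day archives. */
function aggregateDayArchives(array $dayArchives)
{
    $total = [];
    foreach ($dayArchives as $day) {
        foreach ($day as $metric => $value) {
            $total[$metric] = (isset($total[$metric]) ? $total[$metric] : 0) + $value;
        }
    }

    return $total;
}

// Hypothetical "day" archives for the 7 days of one week.
$days = [
    ['nb_visits' => 100, 'nb_actions' => 300],
    ['nb_visits' => 120, 'nb_actions' => 350],
    ['nb_visits' => 90,  'nb_actions' => 270],
    ['nb_visits' => 80,  'nb_actions' => 240],
    ['nb_visits' => 110, 'nb_actions' => 330],
    ['nb_visits' => 60,  'nb_actions' => 180],
    ['nb_visits' => 40,  'nb_actions' => 120],
];

$week = aggregateDayArchives($days);

print_r($week);
```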

### Plugin Archivers

Plugins that want to archive reports and metrics define a class called `Archiver` that extends [`Piwik\Plugin\Archiver`](/api-reference/Piwik/Plugin/Archiver). This class will be automatically detected and called during the archiving process.

Log data aggregation is handled by the [`LogAggregator`](/api-reference/Piwik/DataAccess/LogAggregator) class. Archive data aggregation is handled by the [`ArchiveProcessor::aggregateDataTableRecords()`](/api-reference/Piwik/ArchiveProcessor#aggregatedatatablerecords) and [`ArchiveProcessor::aggregateNumericMetrics()`](/api-reference/Piwik/ArchiveProcessor#aggregatenumericmetrics) methods.

Plugins can access a [`LogAggregator`](/api-reference/Piwik/DataAccess/LogAggregator) instance and an [`ArchiveProcessor`](/api-reference/Piwik/ArchiveProcessor) instance through [`Piwik\Plugin\Archiver`](/api-reference/Piwik/Plugin/Archiver).

To learn more about how aggregation is accomplished with Piwik's MySQL backend, read about the [database schema](/guides/persistence-and-the-mysql-backend).

## Persisting archive data

Archive data is persisted using [`ArchiveProcessor`](/api-reference/Piwik/ArchiveProcessor).

Metrics are inserted using [`ArchiveProcessor::insertNumericRecord()`](/api-reference/Piwik/ArchiveProcessor#insertnumericrecord).

Reports are first serialized using [`DataTable::getSerialized()`](/api-reference/Piwik/DataTable#getserialized) and then inserted using [`ArchiveProcessor::insertBlobRecord()`](/api-reference/Piwik/ArchiveProcessor#insertblobrecord):

```php
use Piwik\Config;
use Piwik\DataTable;
use Piwik\Metrics;

// $archiveProcessor is assumed to be the ArchiveProcessor instance
// available to the plugin's Archiver class

// insert a numeric metric
$myFancyMetric = 0; // placeholder: calculate the metric value here
$archiveProcessor->insertNumericRecord('MyPlugin_myFancyMetric', $myFancyMetric);

// insert a record (with all of its subtables)
$maxRowsInTable = Config::getInstance()->General['datatable_archiving_maximum_rows_standard'];

$dataTable = new DataTable(); // placeholder: build by aggregating visits
$serializedData = $dataTable->getSerialized(
    $maxRowsInTable,
    $maxRowsInSubtable = $maxRowsInTable,
    $columnToSortBy = Metrics::INDEX_NB_VISITS
);

$archiveProcessor->insertBlobRecord('MyPlugin_myFancyReport', $serializedData);
```

Persisted reports and metrics are indexed by the website ID, period and segment. The date and time of archiving is also attached to the data. To learn the specifics of how this is done with MySQL see the [database schema](/guides/persistence-and-the-mysql-backend).

### Reports vs Records

When a report is archived, it is called a **record**, not a report. We make this distinction because multiple reports can sometimes be generated from one **record**.

For example, the *UserSettings* plugin uses one record to hold browser details of visitors. This record is used to generate both the `UserSettings.getBrowserVersion` and `UserSettings.getBrowser` reports. The second report simply processes the first to make a new report. The plugin could have archived both reports, but this would have been a **massive** waste of space, considering the new report would be cached for every website/period/segment combination.
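
Deriving one report from another report's record can be sketched with plain arrays. The record layout below, with labels of the form "browser version", and the `buildBrowserReport()` function are assumptions made for the example; they do not reproduce the *UserSettings* plugin's actual storage format.

```php
<?php
// Hypothetical record: one row per browser version.
$browserVersionRecord = [
    ['label' => 'Firefox 34.0', 'nb_visits' => 40],
    ['label' => 'Firefox 35.0', 'nb_visits' => 25],
    ['label' => 'Chrome 40.0',  'nb_visits' => 70],
];

/** Builds a per-browser report by grouping version rows by browser name. */
function buildBrowserReport(array $versionRows)
{
    $report = [];
    foreach ($versionRows as $row) {
        // Strip the trailing version number to get the browser name.
        $browser = preg_replace('/\s+[\d.]+$/', '', $row['label']);

        if (!isset($report[$browser])) {
            $report[$browser] = ['label' => $browser, 'nb_visits' => 0];
        }
        $report[$browser]['nb_visits'] += $row['nb_visits'];
    }

    return array_values($report);
}

print_r(buildBrowserReport($browserVersionRecord));
```

Only `$browserVersionRecord` would need to be persisted; the grouped report is rebuilt from it whenever it is requested.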

<a name="record-storage-guidelines"></a>

<div markdown="1" class="alert alert-warning">
**Record storage guidelines**

Care must be taken to store as little as possible when persisting records. Make sure to follow the guidelines below before inserting records as archive data:

* **Records should not be stored with string column names.** Instead, string column names should be replaced with integer column IDs (see [Metrics](/api-reference/Piwik/Metrics) for a list of existing ones).
* **Metadata that can be added using existing data should not be stored with reports.** Instead, it should be added in API methods when turning records into reports.
</div>
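
The effect of the first guideline can be seen by comparing serialized sizes. In this sketch the `$columnIds` mapping and the `replaceColumnNames()` function are hypothetical; in real code the IDs would come from the `Metrics::INDEX_*` constants rather than from hard-coded numbers.

```php
<?php
// Illustrative mapping only; real code would use the Metrics::INDEX_* constants.
$columnIds = [
    'nb_uniq_visitors' => 1,
    'nb_visits'        => 2,
    'nb_actions'       => 3,
];

/** Replaces known string column names with their integer IDs. */
function replaceColumnNames(array $row, array $columnIds)
{
    $compact = [];
    foreach ($row as $column => $value) {
        $key = isset($columnIds[$column]) ? $columnIds[$column] : $column;
        $compact[$key] = $value;
    }

    return $compact;
}

$row     = ['nb_uniq_visitors' => 95, 'nb_visits' => 120, 'nb_actions' => 340];
$compact = replaceColumnNames($row, $columnIds);

// The row with integer keys serializes to fewer bytes.
echo strlen(serialize($row)), ' bytes vs ', strlen(serialize($compact)), " bytes\n";
```

The saving per row is small, but it is multiplied by every row of every record stored for each website, period and segment.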