Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare with Wikibase tabular data model #1059

Merged
merged 5 commits into from
Dec 17, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
61 changes: 61 additions & 0 deletions content/docs/guides/mediawiki-tabular-data.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
---
title: Comparison with MediaWiki Tabular Data
sidebar:
order: 3
---

<table>
<tr>
<th>Authors</th>
<td>Jakob Voß</td>
</tr>
</table>

[MediaWiki](https://www.mediawiki.org/) is the software used to run Wikipedia and related projects of the Wikimedia Foundation, including the media file repository [Wikimedia Commons](https://commons.wikimedia.org/). Commons hosts mostly images but also some records with tabular data. The [MediaWiki Tabular Data Model](https://www.mediawiki.org/wiki/Help:Tabular_data) was inspired by Data Package version 1 but it slightly differs from current Data Package specification, as described below.

## Property Comparison

A [MediaWiki tabular data page](https://www.mediawiki.org/wiki/Help:Tabular_data) describes and contains an individual table of data similar to a [Data Resource](/standard/data-resource/) with inline tabular data. Both are serialized as JSON objects, but the former comes as a page with unique name in a MediaWiki instance (such as Wikimedia Commons).

### Top-level Properties

MediaWiki Tabular Data has three required and two optional top-level properties. Most of these properties map to corresponding properties of a Data Resource:

| MediaWiki Tabular Data | Data Package Table Schema |
nichtich marked this conversation as resolved.
Show resolved Hide resolved
| ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------- |
| - (implied by page name) | [name](/standard/data-resource/#name) (required) is a string |
| [description](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (optional) is a localized string | [description](/standard/data-resource/#description) (optional) is a CommonMark string |
| [data](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) | [data](/standard/data-resource/#name) (optional) |
| [license](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) is the string `CC0-1.0` or another known identifier | [licenses](/standard/data-resource/#licenses) (optional) is an array |
| [schema](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) as [described below](#schema-properties) | [schema](/standard/data-resource/#schema) (optional) can have multiple forms |
nichtich marked this conversation as resolved.
Show resolved Hide resolved
| [sources](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (optional) is a string with Wiki markup | [sources](/standard/data-resource/#sources) (optional) is an array of objects |

The differences are:
nichtich marked this conversation as resolved.
Show resolved Hide resolved

- property `name` does not exist but can be implied from page name
- property `description` and `sources` have another format
- property `data` is always an array of arrays and [data types](#data-types) of individual values can differ
- property `schema` is required but it differs in definion of [schema properties](#schema-properties)
- there is no property `licenses` but `license` fixed to plain string value `CC0-1.0` (other license indicators may be possible)

### Data Types

Tabular Data supports four data types that overlap with [Table Schema data types](/standard/table-schema/#field-types):

- `number` subset of Table Schema [number](/standard/table-schema/#number) (no `NaN`, `INF`, or `-INF`)
- `boolean` same as Table Schema [boolean](/standard/table-schema/#boolean)
- `string` subset of Table Schema [string](/standard/table-schema/#string) (limited to 400 characters at most and must not include `\n` or `\t`)
- `localized ` refers to an object that maps language codes to strings with same limitations as `string` type.
This type is not supported in Table Schema.

Individual values in a MediaWiki Tabular Data table can always be `null`, while in Table Schema you need to explicitly list values that should be considered missing in [schema.missingValues](/standard/table-schema/#missingValues).

### Schema Properties

The `schema` property of MediaWiki tabular contains an object with property `fields` just like [Table Schema](/standard/table-schema/) but no other properties are allowed. Elements of this array are like Table Schema [field descriptors](/standard/table-schema/#field) limited to three properties and different value spaces:

| MediaWiki Tabular Data | Data Package Table Schema |
| ---------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| [name](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) must be a string matching `^[a-zA-Z_][a-zA-Z_0-9]*` | [name](/standard/table-schema/#name) (required) can be any string |
| [type](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) is one of the [Data Types above](#data-types) | [type](/standard/table-schema/#type) (optional) with [different data types](#data-types) |
| [title](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (optional) is a localized string | [title](/standard/table-schema/#title) (optional) is a plain string |
Loading