diff --git a/docs/api/competition.dataset.md b/docs/api/competition.dataset.md
new file mode 100644
index 00000000..8018af79
--- /dev/null
+++ b/docs/api/competition.dataset.md
@@ -0,0 +1,5 @@
+::: polaris.dataset.CompetitionDataset
+ options:
+ filters: ["!^_"]
+
+---
\ No newline at end of file
diff --git a/docs/api/competition.evaluation.md b/docs/api/competition.evaluation.md
new file mode 100644
index 00000000..2a66956f
--- /dev/null
+++ b/docs/api/competition.evaluation.md
@@ -0,0 +1,7 @@
+::: polaris.evaluate.CompetitionPredictions
+
+---
+
+::: polaris.evaluate.CompetitionResults
+
+---
\ No newline at end of file
diff --git a/docs/api/competition.md b/docs/api/competition.md
new file mode 100644
index 00000000..4729af63
--- /dev/null
+++ b/docs/api/competition.md
@@ -0,0 +1,3 @@
+::: polaris.competition.CompetitionSpecification
+
+---
\ No newline at end of file
diff --git a/docs/api/evaluation.md b/docs/api/evaluation.md
index 9187a763..7be2479a 100644
--- a/docs/api/evaluation.md
+++ b/docs/api/evaluation.md
@@ -1,3 +1,12 @@
+::: polaris.evaluate.ResultsMetadata
+ options:
+ filters: ["!^_"]
+
+---
+
+::: polaris.evaluate.EvaluationResult
+
+---
::: polaris.evaluate.BenchmarkResults
diff --git a/docs/tutorials/competition.participate.ipynb b/docs/tutorials/competition.participate.ipynb
new file mode 100644
index 00000000..0ee6f223
--- /dev/null
+++ b/docs/tutorials/competition.participate.ipynb
@@ -0,0 +1,253 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "40f99374-b47e-4f84-bdb9-148a11f9c07d",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "source": [
+ "# Participating in a Competition\n",
+ "\n",
+    "<div class=\"admonition abstract highlight\">\n",
+    "    <p class=\"admonition-title\">In short</p>\n",
+    "    <p>This tutorial walks you through how to fetch an active competition from Polaris, prepare your predictions and then submit them for secure evaluation by the Polaris Hub.</p>\n",
+    "</div>\n",
+ "\n",
+ "Participating in a competition on Polaris is very similar to participating in a standard benchmark. The main difference lies in how predictions are prepared and how they are evaluated. We'll touch on each of these topics later in the tutorial. \n",
+ "\n",
+ "Before continuing, please ensure you are logged into Polaris."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "3d66f466",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "remove_cell"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "# Note: Cell is tagged to not show up in the mkdocs build\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "9b465ea4-7c71-443b-9908-3f9e567ee4c4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\u001b[32m2024-08-09 18:05:23.205\u001b[0m | \u001b[32m\u001b[1mSUCCESS \u001b[0m | \u001b[36mpolaris.hub.client\u001b[0m:\u001b[36mlogin\u001b[0m:\u001b[36m267\u001b[0m - \u001b[32m\u001b[1mYou are successfully logged in to the Polaris Hub.\u001b[0m\n"
+ ]
+ }
+ ],
+ "source": [
+ "import polaris as po\n",
+ "from polaris.hub.client import PolarisHubClient\n",
+ "\n",
+ "# Don't forget to add your Polaris Hub username below!\n",
+ "MY_POLARIS_USERNAME = \"\"\n",
+ "\n",
+ "client = PolarisHubClient()\n",
+ "client.login()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5edee39f-ce29-4ae6-91ce-453d9190541b",
+ "metadata": {},
+ "source": [
+ "## Fetching a Competition\n",
+ "\n",
+ "As with standard benchmarks, Polaris provides simple APIs that allow you to quickly fetch a competition from the Polaris Hub. All you need is the unique identifier for the competition which follows the format of `competition_owner`/`competition_name`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "4e004589-6c48-4232-b353-b1700536dde6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "competition_id = \"polaris/hello-world-competition\"\n",
+ "competition = po.load_competition(competition_id)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "36f3e829",
+ "metadata": {},
+ "source": [
+ "## Participate in the Competition\n",
+    "The Polaris library is designed to make it easy to participate in a competition. In just a few lines of code, we can get the train and test partitions, access the associated data in various ways, and evaluate our predictions. There are two main API endpoints.\n",
+ "\n",
+ "- `get_train_test_split()`: For creating objects through which we can access the different dataset partitions.\n",
+ "- `evaluate()`: For evaluating a set of predictions in accordance with the competition protocol."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d8605928",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train, test = competition.get_train_test_split()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e78bf878",
+ "metadata": {},
+ "source": [
+    "The train and test objects created above support several ways of accessing the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7b17bb31",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The objects are iterable\n",
+ "for x, y in train:\n",
+ " pass\n",
+ "\n",
+ "# The objects can be indexed\n",
+ "for i in range(len(train)):\n",
+ " x, y = train[i]\n",
+ "\n",
+ "# The objects have properties to access all data at once\n",
+ "x = train.inputs\n",
+ "y = train.targets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5ec12825",
+ "metadata": {},
+ "source": [
+ "Now, let's create some predictions against the official Polaris `hello-world-competition`. We will train a simple random forest model on the ECFP representation through scikit-learn and datamol, and then we will submit our results for secure evaluation by the Polaris Hub."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "902353bc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import datamol as dm\n",
+ "from sklearn.ensemble import RandomForestRegressor\n",
+ "\n",
+ "# Load the competition (automatically loads the underlying dataset as well)\n",
+    "competition = po.load_competition(\"polaris/hello-world-competition\")\n",
+ "\n",
+    "# Get the split and convert SMILES to ECFP fingerprints by specifying a featurization function.\n",
+ "train, test = competition.get_train_test_split(featurization_fn=dm.to_fp)\n",
+ "\n",
+ "# Define a model and train\n",
+ "model = RandomForestRegressor(max_depth=2, random_state=0)\n",
+ "model.fit(train.X, train.y)\n",
+ "\n",
+ "predictions = model.predict(test.X)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1a36e334",
+ "metadata": {},
+ "source": [
+ "Now that we have created some predictions, we can construct a `CompetitionPredictions` object that will prepare our predictions for evaluation by the Polaris Hub. Here, you can also add metadata to your predictions to better describe your results and how you achieved them. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2b36e09b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from polaris.evaluate import CompetitionPredictions\n",
+ "\n",
+ "competition_predictions = CompetitionPredictions(\n",
+ " name=\"hello-world-result\",\n",
+ " predictions=predictions,\n",
+ " github_url=\"https://github.com/polaris-hub/polaris-hub\",\n",
+ " paper_url=\"https://polarishub.io/\",\n",
+ " description=\"Hello, World!\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5ff06a9c",
+ "metadata": {},
+ "source": [
+    "Once your `CompetitionPredictions` object is created, you're ready to submit it for evaluation! This will automatically save your result to the Polaris Hub, but it will be private. You can choose to make it public through the Polaris web application."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e684c611",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results = competition.evaluate(competition_predictions)\n",
+ "\n",
+ "client.close()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "44973556",
+ "metadata": {},
+ "source": [
+    "That's it! Just like that, you have taken part in your first Polaris competition. Keep an eye on that leaderboard and best of luck in your future competitions!\n",
+ "\n",
+ "The End.\n",
+ "\n",
+ "---"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/tutorials/custom_dataset_benchmark.ipynb b/docs/tutorials/custom_dataset_benchmark.ipynb
index ef1ff5f0..ddbc003a 100644
--- a/docs/tutorials/custom_dataset_benchmark.ipynb
+++ b/docs/tutorials/custom_dataset_benchmark.ipynb
@@ -393,7 +393,7 @@
},
"outputs": [],
"source": [
- "from polaris.hub.client import PolarisHubClient\n",
+ "# from polaris.hub.client import PolarisHubClient\n",
"\n",
"# NOTE: Commented out to not flood the DB\n",
"# with PolarisHubClient() as client:\n",
@@ -491,11 +491,11 @@
"evalue": "1 validation error for MultiTaskBenchmarkSpecification\ntarget_cols\n Value error, A multi-task benchmark should specify at least two target columns [type=value_error, input_value='LOG SOLUBILITY PH 6.8 (ug/mL)', input_type=str]\n For further information visit https://errors.pydantic.dev/2.4/v/value_error",
"output_type": "error",
"traceback": [
- "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
- "\u001B[0;31mValidationError\u001B[0m Traceback (most recent call last)",
- "\u001B[1;32m/Users/cas.wognum/Documents/repositories/polaris/docs/tutorials/custom_dataset_benchmark.ipynb Cell 25\u001B[0m line \u001B[0;36m3\n\u001B[1;32m 1\u001B[0m \u001B[39mfrom\u001B[39;00m \u001B[39mpolaris\u001B[39;00m\u001B[39m.\u001B[39;00m\u001B[39mbenchmark\u001B[39;00m \u001B[39mimport\u001B[39;00m MultiTaskBenchmarkSpecification\n\u001B[0;32m----> 3\u001B[0m benchmark \u001B[39m=\u001B[39m MultiTaskBenchmarkSpecification(\n\u001B[1;32m 4\u001B[0m dataset\u001B[39m=\u001B[39;49mdataset,\n\u001B[1;32m 5\u001B[0m target_cols\u001B[39m=\u001B[39;49m\u001B[39m\"\u001B[39;49m\u001B[39mLOG SOLUBILITY PH 6.8 (ug/mL)\u001B[39;49m\u001B[39m\"\u001B[39;49m,\n\u001B[1;32m 6\u001B[0m input_cols\u001B[39m=\u001B[39;49m\u001B[39m\"\u001B[39;49m\u001B[39mSMILES\u001B[39;49m\u001B[39m\"\u001B[39;49m,\n\u001B[1;32m 7\u001B[0m split\u001B[39m=\u001B[39;49msplit,\n\u001B[1;32m 8\u001B[0m metrics\u001B[39m=\u001B[39;49m\u001B[39m\"\u001B[39;49m\u001B[39mmean_absolute_error\u001B[39;49m\u001B[39m\"\u001B[39;49m,\n\u001B[1;32m 9\u001B[0m )\n",
- "File \u001B[0;32m~/micromamba/envs/polaris/lib/python3.11/site-packages/pydantic/main.py:164\u001B[0m, in \u001B[0;36mBaseModel.__init__\u001B[0;34m(__pydantic_self__, **data)\u001B[0m\n\u001B[1;32m 162\u001B[0m \u001B[39m# `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks\u001B[39;00m\n\u001B[1;32m 163\u001B[0m __tracebackhide__ \u001B[39m=\u001B[39m \u001B[39mTrue\u001B[39;00m\n\u001B[0;32m--> 164\u001B[0m __pydantic_self__\u001B[39m.\u001B[39;49m__pydantic_validator__\u001B[39m.\u001B[39;49mvalidate_python(data, self_instance\u001B[39m=\u001B[39;49m__pydantic_self__)\n",
- "\u001B[0;31mValidationError\u001B[0m: 1 validation error for MultiTaskBenchmarkSpecification\ntarget_cols\n Value error, A multi-task benchmark should specify at least two target columns [type=value_error, input_value='LOG SOLUBILITY PH 6.8 (ug/mL)', input_type=str]\n For further information visit https://errors.pydantic.dev/2.4/v/value_error"
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mValidationError\u001b[0m Traceback (most recent call last)",
+ "\u001b[1;32m/Users/cas.wognum/Documents/repositories/polaris/docs/tutorials/custom_dataset_benchmark.ipynb Cell 25\u001b[0m line \u001b[0;36m3\n\u001b[1;32m 1\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mpolaris\u001b[39;00m\u001b[39m.\u001b[39;00m\u001b[39mbenchmark\u001b[39;00m \u001b[39mimport\u001b[39;00m MultiTaskBenchmarkSpecification\n\u001b[0;32m----> 3\u001b[0m benchmark \u001b[39m=\u001b[39m MultiTaskBenchmarkSpecification(\n\u001b[1;32m 4\u001b[0m dataset\u001b[39m=\u001b[39;49mdataset,\n\u001b[1;32m 5\u001b[0m target_cols\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mLOG SOLUBILITY PH 6.8 (ug/mL)\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 6\u001b[0m input_cols\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mSMILES\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 7\u001b[0m split\u001b[39m=\u001b[39;49msplit,\n\u001b[1;32m 8\u001b[0m metrics\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mmean_absolute_error\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 9\u001b[0m )\n",
+ "File \u001b[0;32m~/micromamba/envs/polaris/lib/python3.11/site-packages/pydantic/main.py:164\u001b[0m, in \u001b[0;36mBaseModel.__init__\u001b[0;34m(__pydantic_self__, **data)\u001b[0m\n\u001b[1;32m 162\u001b[0m \u001b[39m# `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks\u001b[39;00m\n\u001b[1;32m 163\u001b[0m __tracebackhide__ \u001b[39m=\u001b[39m \u001b[39mTrue\u001b[39;00m\n\u001b[0;32m--> 164\u001b[0m __pydantic_self__\u001b[39m.\u001b[39;49m__pydantic_validator__\u001b[39m.\u001b[39;49mvalidate_python(data, self_instance\u001b[39m=\u001b[39;49m__pydantic_self__)\n",
+ "\u001b[0;31mValidationError\u001b[0m: 1 validation error for MultiTaskBenchmarkSpecification\ntarget_cols\n Value error, A multi-task benchmark should specify at least two target columns [type=value_error, input_value='LOG SOLUBILITY PH 6.8 (ug/mL)', input_type=str]\n For further information visit https://errors.pydantic.dev/2.4/v/value_error"
]
}
],
diff --git a/docs/tutorials/optimization.ipynb b/docs/tutorials/optimization.ipynb
index d479f596..d086821f 100644
--- a/docs/tutorials/optimization.ipynb
+++ b/docs/tutorials/optimization.ipynb
@@ -72,7 +72,7 @@
"metadata": {},
"outputs": [],
"source": [
- "# Let's create a dummy dataset with two columns \n",
+ "# Let's create a dummy dataset with two columns\n",
"rng = np.random.default_rng(0)\n",
"col_a = rng.choice(list(range(100)), 10000)\n",
"col_b = rng.random(10000)\n",
@@ -195,10 +195,10 @@
"import zarr\n",
"from tempfile import mkdtemp\n",
"\n",
- "tmpdir = mkdtemp()\n",
+ "tmpdir = mkdtemp()\n",
"\n",
- "# For the ones familiar with Zarr, this is not optimized at all. \n",
- "# If you wouldn't want to convert to NumPy, you would want to \n",
+ "# For the ones familiar with Zarr, this is not optimized at all.\n",
+ "# If you wouldn't want to convert to NumPy, you would want to\n",
"# optimize the chunking / compression.\n",
"\n",
"path = os.path.join(tmpdir, \"data.zarr\")\n",
@@ -276,7 +276,7 @@
],
"source": [
"%%timeit\n",
- "for batch in dataloader: \n",
+ "for batch in dataloader:\n",
" pass"
]
},
@@ -314,7 +314,7 @@
],
"source": [
"%%timeit\n",
- "for batch in dataloader: \n",
+ "for batch in dataloader:\n",
" pass"
]
},
@@ -336,6 +336,7 @@
"outputs": [],
"source": [
"from shutil import rmtree\n",
+ "\n",
"rmtree(tmpdir)"
]
},
diff --git a/mkdocs.yml b/mkdocs.yml
index 128a204e..02b32073 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -23,6 +23,8 @@ nav:
- Zarr Datasets: tutorials/dataset_zarr.ipynb
- Dataset Factory: tutorials/dataset_factory.ipynb
- Optimization: tutorials/optimization.ipynb
+ - Competitions:
+ - tutorials/competition.participate.ipynb
- API Reference:
- Load: api/load.md
- Core:
@@ -30,6 +32,10 @@ nav:
- Benchmark: api/benchmark.md
- Subset: api/subset.md
- Evaluation: api/evaluation.md
+ - Competitions:
+ - Competition Dataset: api/competition.dataset.md
+ - Competition: api/competition.md
+        - Competition Evaluation: api/competition.evaluation.md
- Hub:
- Client: api/hub.client.md
- PolarisFileSystem: api/hub.polarisfs.md
diff --git a/polaris/__init__.py b/polaris/__init__.py
index ddb0f44a..339df6ef 100644
--- a/polaris/__init__.py
+++ b/polaris/__init__.py
@@ -4,9 +4,9 @@
from loguru import logger
from ._version import __version__
-from .loader import load_benchmark, load_dataset
+from .loader import load_benchmark, load_dataset, load_competition
-__all__ = ["load_dataset", "load_benchmark", "__version__"]
+__all__ = ["load_dataset", "load_benchmark", "__version__", "load_competition"]
# Configure the default logging level
os.environ["LOGURU_LEVEL"] = os.environ.get("LOGURU_LEVEL", "INFO")
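
A minimal usage sketch for the new `load_competition` entry point; the identifier below is the one used in the tutorial added by this PR and is illustrative, not guaranteed to exist on every deployment:

```python
import polaris as po

# The identifier format is "owner/competition-name" (illustrative value below).
competition = po.load_competition("polaris/hello-world-competition")

# A CompetitionSpecification exposes the familiar benchmark-style API.
train, test = competition.get_train_test_split()
```
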
diff --git a/polaris/benchmark/_base.py b/polaris/benchmark/_base.py
index 3dd0c5b1..45f9cc15 100644
--- a/polaris/benchmark/_base.py
+++ b/polaris/benchmark/_base.py
@@ -5,7 +5,6 @@
import fsspec
import numpy as np
-import pandas as pd
from datamol.utils import fs
from loguru import logger
from pydantic import (
@@ -20,10 +19,10 @@
from polaris._artifact import BaseArtifactModel
from polaris.mixins import ChecksumMixin
-from polaris.dataset import Dataset, Subset
-from polaris.evaluate import BenchmarkResults, Metric, ResultsType
+from polaris.dataset import Dataset, Subset, CompetitionDataset
+from polaris.evaluate import BenchmarkResults, Metric
+from polaris.evaluate.utils import evaluate_benchmark
from polaris.hub.settings import PolarisHubSettings
-from polaris.utils.context import tmp_attribute_change
from polaris.utils.dict2html import dict2html
from polaris.utils.errors import InvalidBenchmarkError
from polaris.utils.misc import listit
@@ -97,18 +96,16 @@ class BenchmarkSpecification(BaseArtifactModel, ChecksumMixin):
# Public attributes
# Data
- dataset: Union[Dataset, str, dict[str, Any]]
+ dataset: Union[Dataset, CompetitionDataset, str, dict[str, Any]]
target_cols: ColumnsType
input_cols: ColumnsType
split: SplitType
- metrics: Union[str, Metric, list[Union[str, Metric]]]
- main_metric: Optional[Union[str, Metric]] = None
+ metrics: Union[str, Metric, list[str | Metric]]
+ main_metric: str | Metric | None = None
# Additional meta-data
readme: str = ""
- target_types: dict[str, Optional[Union[TargetType, str]]] = Field(
- default_factory=dict, validate_default=True
- )
+ target_types: dict[str, Union[TargetType, str, None]] = Field(default_factory=dict, validate_default=True)
@field_validator("dataset")
def _validate_dataset(cls, v):
@@ -164,25 +161,29 @@ def _validate_main_metric(cls, v):
v = Metric[v]
return v
- @field_validator("split")
- def _validate_split(cls, v, info: ValidationInfo):
+ @model_validator(mode="after")
+ def _validate_split(cls, m: "BenchmarkSpecification"):
"""
Verifies that:
1) There are no empty test partitions
2) All indices are valid given the dataset
3) There is no duplicate indices in any of the sets
- 3) There is no overlap between the train and test set
+ 4) There is no overlap between the train and test set
+ 5) No row exists in the test set where all labels are missing/empty
"""
+ split = m.split
# Train partition can be empty (zero-shot)
# Test partitions cannot be empty
- if (isinstance(v[1], dict) and any(len(v) == 0 for v in v[1].values())) or (
- not isinstance(v[1], dict) and len(v[1]) == 0
+ if (isinstance(split[1], dict) and any(len(v) == 0 for v in split[1].values())) or (
+ not isinstance(split[1], dict) and len(split[1]) == 0
):
raise InvalidBenchmarkError("The predefined split contains empty test partitions")
- train_idx_list = v[0]
- full_test_idx_list = list(chain.from_iterable(v[1].values())) if isinstance(v[1], dict) else v[1]
+ train_idx_list = split[0]
+ full_test_idx_list = (
+ list(chain.from_iterable(split[1].values())) if isinstance(split[1], dict) else split[1]
+ )
if len(train_idx_list) == 0:
logger.info(
@@ -203,8 +204,8 @@ def _validate_split(cls, v, info: ValidationInfo):
# Check for duplicate indices within a given test set. Because a user can specify
# multiple test sets for a given benchmark and it is acceptable for indices to be shared
# across test sets, we check for duplicates in each test set independently.
- if isinstance(v[1], dict):
- for test_set_name, test_set_idx_list in v[1].items():
+ if isinstance(split[1], dict):
+ for test_set_name, test_set_idx_list in split[1].items():
if len(test_set_idx_list) != len(set(test_set_idx_list)):
raise InvalidBenchmarkError(
f'Test set with name "{test_set_name}" contains duplicate indices'
@@ -213,12 +214,13 @@ def _validate_split(cls, v, info: ValidationInfo):
raise InvalidBenchmarkError("The test set contains duplicate indices")
# All indices are valid given the dataset
- if info.data["dataset"] is not None:
- max_i = len(info.data["dataset"])
+ dataset = m.dataset
+ if dataset is not None:
+ max_i = len(dataset)
if any(i < 0 or i >= max_i for i in chain(train_idx_list, full_test_idx_set)):
raise InvalidBenchmarkError("The predefined split contains invalid indices")
- return v
+ return m
@field_validator("target_types")
def _validate_target_types(cls, v, info: ValidationInfo):
@@ -353,6 +355,36 @@ def task_type(self) -> str:
v = TaskType.MULTI_TASK if len(self.target_cols) > 1 else TaskType.SINGLE_TASK
return v.value
+ def _get_subset(self, indices, hide_targets=True, featurization_fn=None):
+ """Returns a [`Subset`][polaris.dataset.Subset] using the given indices. Used
+ internally to construct the train and test sets."""
+ return Subset(
+ dataset=self.dataset,
+ indices=indices,
+ input_cols=self.input_cols,
+ target_cols=self.target_cols,
+ hide_targets=hide_targets,
+ featurization_fn=featurization_fn,
+ )
+
+ def _get_test_set(
+ self, hide_targets=True, featurization_fn: Optional[Callable] = None
+ ) -> Union["Subset", dict[str, Subset]]:
+ """Construct the test set(s), given the split in the benchmark specification. Used
+ internally to construct the test set for client use and evaluation.
+ """
+
+ def make_test_subset(vals):
+ return self._get_subset(vals, hide_targets=hide_targets, featurization_fn=featurization_fn)
+
+ test_split = self.split[1]
+ if isinstance(test_split, dict):
+ test = {k: make_test_subset(v) for k, v in test_split.items()}
+ else:
+ test = make_test_subset(test_split)
+
+ return test
+
def get_train_test_split(
self, featurization_fn: Optional[Callable] = None
) -> tuple[Subset, Union["Subset", dict[str, Subset]]]:
@@ -372,21 +404,8 @@ def get_train_test_split(
an associated name. The targets of the test set can not be accessed.
"""
- def _get_subset(indices, hide_targets):
- return Subset(
- dataset=self.dataset,
- indices=indices,
- input_cols=self.input_cols,
- target_cols=self.target_cols,
- hide_targets=hide_targets,
- featurization_fn=featurization_fn,
- )
-
- train = _get_subset(self.split[0], hide_targets=False)
- if isinstance(self.split[1], dict):
- test = {k: _get_subset(v, hide_targets=True) for k, v in self.split[1].items()}
- else:
- test = _get_subset(self.split[1], hide_targets=True)
+ train = self._get_subset(self.split[0], hide_targets=False, featurization_fn=featurization_fn)
+ test = self._get_test_set(hide_targets=True, featurization_fn=featurization_fn)
return train, test
@@ -399,6 +418,16 @@ def evaluate(
Contrary to other frameworks that you might be familiar with, we opted for a signature that includes just
the predictions. This reduces the chance of accidentally using the test targets during training.
+ info: Expected structure for `y_pred` and `y_prob` arguments
+ The supplied `y_pred` and `y_prob` arguments must adhere to a certain structure depending on the number of
+ tasks and test sets included in the benchmark. Refer to the following for guidance on the correct structure when
+            creating your `y_pred` and `y_prob` objects:
+
+ - Single task, single set: `[values...]`
+ - Multi-task, single set: `{task_name_1: [values...], task_name_2: [values...]}`
+ - Single task, multi-set: `{test_set_1: {task_name: [values...]}, test_set_2: {task_name: [values...]}}`
+ - Multi-task, multi-set: `{test_set_1: {task_name_1: [values...], task_name_2: [values...]}, test_set_2: {task_name_1: [values...], task_name_2: [values...]}}`
+
For this method, we make the following assumptions:
1. There can be one or multiple test set(s);
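
To make the documented structures concrete, here is a sketch of what `y_pred` could look like in each of the four cases; the test-set and target names are placeholders, not real benchmark columns:

```python
import numpy as np

# Single task, single test set: a plain array of predictions.
y_pred = np.array([0.1, 0.4, 0.9])

# Multi-task, single test set: one array per target column.
y_pred = {"target_a": np.array([0.1, 0.4]), "target_b": np.array([1.2, 0.8])}

# Single task, multiple test sets: nest the target under each test-set name.
y_pred = {"test_1": {"target_a": np.array([0.1])}, "test_2": {"target_a": np.array([0.9])}}

# Multi-task, multiple test sets: nest both levels.
y_pred = {
    "test_1": {"target_a": np.array([0.1]), "target_b": np.array([1.2])},
    "test_2": {"target_a": np.array([0.9]), "target_b": np.array([0.3])},
}
```
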
@@ -417,7 +446,6 @@ def evaluate(
Returns:
A `BenchmarkResults` object. This object can be directly submitted to the Polaris Hub.
-
Examples:
1. For regression benchmarks:
pred_scores = your_model.predict_score(molecules) # predict continuous score values
@@ -432,72 +460,31 @@ def evaluate(
"""
# Instead of having the user pass the ground truth, we extract it from the benchmark spec ourselves.
- # This simplifies the API, but also was added to make accidental access to the test set targets less likely.
- # See also the `hide_targets` parameter in the `Subset` class.
- test = self.get_train_test_split()[1]
-
- if not isinstance(test, dict):
- test = {"test": test}
-
- y_true = {}
- for k, test_subset in test.items():
- with tmp_attribute_change(test_subset, "_hide_targets", False):
- y_true[k] = test_subset.targets
-
- if not isinstance(y_pred, dict) or all(k in self.target_cols for k in y_pred):
- y_pred = {"test": y_pred}
-
- if not isinstance(y_prob, dict) or all(k in self.target_cols for k in y_prob):
- y_prob = {"test": y_prob}
-
- if any(k not in y_pred for k in test.keys()) and any(k not in y_prob for k in test.keys()):
- raise KeyError(
- f"Missing keys for at least one of the test sets. Expecting: {sorted(test.keys())}"
- )
-
- # Results are saved in a tabular format. For more info, see the BenchmarkResults docs.
- scores: ResultsType = pd.DataFrame(columns=BenchmarkResults.RESULTS_COLUMNS)
-
- # For every test set...
- for test_label, y_true_subset in y_true.items():
- # For every metric...
- for metric in self.metrics:
- if metric.is_multitask:
- # Multi-task but with a metric across targets
- score = metric(
- y_true=y_true_subset, y_pred=y_pred.get(test_label), y_prob=y_prob.get(test_label)
- )
- scores.loc[len(scores)] = (test_label, "aggregated", metric, score)
- continue
-
- if not isinstance(y_true_subset, dict):
- # Single task
- score = metric(
- y_true=y_true_subset, y_pred=y_pred.get(test_label), y_prob=y_prob.get(test_label)
- )
- scores.loc[len(scores)] = (
- test_label,
- self.target_cols[0],
- metric,
- score,
- )
- continue
+ # The `evaluate_benchmark` function expects the benchmark labels to be of a certain structure which
+ # depends on the number of tasks and test sets defined for the benchmark. Below, we build the structure
+ # of the benchmark labels based on the aforementioned factors.
+ test = self._get_test_set(hide_targets=False)
+ if isinstance(test, dict):
+ #
+ # For multi-set benchmarks
+ y_true = {}
+ for test_set_name, values in test.items():
+ y_true[test_set_name] = {}
+ if isinstance(values.targets, dict):
+ #
+ # For multi-task, multi-set benchmarks
+ for task_name, values in values.targets.items():
+ y_true[test_set_name][task_name] = values
+ else:
+ #
+ # For single task, multi-set benchmarks
+ y_true[test_set_name][self.target_cols[0]] = values.targets
+ else:
+ #
+ # For single set benchmarks (single and multiple task)
+ y_true = test.targets
- # Otherwise, for every target...
- for target_label, y_true_target in y_true_subset.items():
- # Single-task metrics for a multi-task benchmark
- # In such a setting, there can be NaN values, which we thus have to filter out.
- mask = ~np.isnan(y_true_target)
- score = metric(
- y_true=y_true_target[mask],
- y_pred=y_pred[test_label][target_label][mask]
- if y_pred[test_label] is not None
- else None,
- y_prob=y_prob[test_label][target_label][mask]
- if y_prob[test_label] is not None
- else None,
- )
- scores.loc[len(scores)] = (test_label, target_label, metric, score)
+ scores = evaluate_benchmark(self.target_cols, self.metrics, y_true, y_pred=y_pred, y_prob=y_prob)
return BenchmarkResults(results=scores, benchmark_name=self.name, benchmark_owner=self.owner)
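
For reference, a hedged end-to-end sketch of the refactored flow, assuming the public `polaris/hello-world-benchmark` single-task benchmark and a scikit-learn regressor (any model would do):

```python
import datamol as dm
import polaris as po
from sklearn.ensemble import RandomForestRegressor

# Placeholder benchmark; any single-task regression benchmark follows the same flow.
benchmark = po.load_benchmark("polaris/hello-world-benchmark")
train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)

model = RandomForestRegressor(max_depth=2, random_state=0)
model.fit(train.X, train.y)

# Only the predictions are passed in; evaluate() rebuilds y_true from the split
# and delegates the metric computation to evaluate_benchmark().
results = benchmark.evaluate(y_pred=model.predict(test.X))
```
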
@@ -506,7 +493,7 @@ def upload_to_hub(
settings: Optional[PolarisHubSettings] = None,
cache_auth_token: bool = True,
access: Optional[AccessType] = "private",
- owner: Optional[Union[HubOwner, str]] = None,
+ owner: Union[HubOwner, str, None] = None,
**kwargs: dict,
):
"""
diff --git a/polaris/competition/__init__.py b/polaris/competition/__init__.py
new file mode 100644
index 00000000..376773e8
--- /dev/null
+++ b/polaris/competition/__init__.py
@@ -0,0 +1,3 @@
+from polaris.competition._competition import CompetitionSpecification
+
+__all__ = ["CompetitionSpecification"]
diff --git a/polaris/competition/_competition.py b/polaris/competition/_competition.py
new file mode 100644
index 00000000..b7d6e444
--- /dev/null
+++ b/polaris/competition/_competition.py
@@ -0,0 +1,43 @@
+from datetime import datetime
+from typing import Optional
+
+from polaris.benchmark import BenchmarkSpecification
+from polaris.evaluate._results import CompetitionPredictions
+from polaris.hub.settings import PolarisHubSettings
+from polaris.utils.types import HubOwner
+
+
+class CompetitionSpecification(BenchmarkSpecification):
+    """Much of the underlying data model and logic is shared across benchmarks and competitions;
+    anything defined in this class is specific to competitions and serves to differentiate the two.
+
+ Attributes:
+ owner: A slug-compatible name for the owner of the competition. This is redefined such
+ that it is required.
+ start_time: The time at which the competition becomes active and interactable.
+ end_time: The time at which the competition ends and is no longer interactable.
+ """
+
+ # Additional properties specific to Competitions
+ owner: HubOwner
+ start_time: datetime | None = None
+ end_time: datetime | None = None
+
+ def evaluate(
+ self,
+ predictions: CompetitionPredictions,
+ settings: Optional[PolarisHubSettings] = None,
+ cache_auth_token: bool = True,
+ **kwargs: dict,
+ ):
+ """Light convenience wrapper around
+ [`PolarisHubClient.evaluate_competition`][polaris.hub.client.PolarisHubClient.evaluate_competition].
+ """
+ from polaris.hub.client import PolarisHubClient
+
+ with PolarisHubClient(
+ settings=settings,
+ cache_auth_token=cache_auth_token,
+ **kwargs,
+ ) as client:
+ return client.evaluate_competition(self, predictions)
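
Since `evaluate` is a thin wrapper, the same call can be made explicitly through the client. A sketch under the same assumptions as the tutorial above (illustrative identifier, placeholder predictions):

```python
import polaris as po
from polaris.evaluate import CompetitionPredictions
from polaris.hub.client import PolarisHubClient

# Illustrative identifier, taken from the tutorial added in this PR.
competition = po.load_competition("polaris/hello-world-competition")
train, test = competition.get_train_test_split()

# Placeholder predictions; a real submission would come from a trained model.
predictions = CompetitionPredictions(name="example-submission", predictions=[0.0] * len(test))

# competition.evaluate(predictions) does exactly this under the hood.
with PolarisHubClient() as client:
    results = client.evaluate_competition(competition, predictions)
```
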
diff --git a/polaris/dataset/__init__.py b/polaris/dataset/__init__.py
index fbd2037f..2f2ab41f 100644
--- a/polaris/dataset/__init__.py
+++ b/polaris/dataset/__init__.py
@@ -2,10 +2,12 @@
from polaris.dataset._dataset import Dataset
from polaris.dataset._factory import DatasetFactory, create_dataset_from_file
from polaris.dataset._subset import Subset
+from polaris.dataset._competition_dataset import CompetitionDataset
__all__ = [
"ColumnAnnotation",
"Dataset",
+ "CompetitionDataset",
"Subset",
"Modality",
"DatasetFactory",
diff --git a/polaris/dataset/_column.py b/polaris/dataset/_column.py
index 0474d930..6bb7500e 100644
--- a/polaris/dataset/_column.py
+++ b/polaris/dataset/_column.py
@@ -36,7 +36,7 @@ class ColumnAnnotation(BaseModel):
modality: Union[str, Modality] = Modality.UNKNOWN
description: Optional[str] = None
user_attributes: Dict[str, str] = Field(default_factory=dict)
- dtype: Optional[Union[np.dtype, str]] = None
+ dtype: Union[np.dtype, str, None] = None
model_config = ConfigDict(arbitrary_types_allowed=True, alias_generator=to_camel, populate_by_name=True)
diff --git a/polaris/dataset/_competition_dataset.py b/polaris/dataset/_competition_dataset.py
new file mode 100644
index 00000000..2f224c22
--- /dev/null
+++ b/polaris/dataset/_competition_dataset.py
@@ -0,0 +1,21 @@
+from pydantic import model_validator
+from polaris.dataset import Dataset
+from polaris.utils.errors import InvalidCompetitionError
+
+_CACHE_SUBDIR = "datasets"
+
+
+class CompetitionDataset(Dataset):
+ """Dataset subclass for Polaris competitions.
+
+ In addition to the data model and logic of the base Dataset class,
+ this class adds additional functionality which validates certain aspects
+ of the training data for a given competition.
+ """
+
+    @model_validator(mode="after")
+    def _validate_model(cls, m: "CompetitionDataset"):
+        """We reject the instantiation of competition datasets which leverage Zarr for the time being."""
+
+        if m.uses_zarr:
+            raise InvalidCompetitionError("Pointer columns are not currently supported in competitions.")
+
+        return m
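
A minimal sketch of the distinction this validator enforces, assuming a plain pandas table is enough to instantiate the model (column names are illustrative):

```python
import pandas as pd
from polaris.dataset import CompetitionDataset

# A small in-memory table; the column names are illustrative.
table = pd.DataFrame({"SMILES": ["CCO", "CCN"], "target": [0.1, 0.7]})

# Plain tabular data passes validation...
dataset = CompetitionDataset(table=table)

# ...whereas a dataset whose columns point into a Zarr archive (uses_zarr == True)
# would raise InvalidCompetitionError at construction time.
```
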
diff --git a/polaris/dataset/_dataset.py b/polaris/dataset/_dataset.py
index ebd3538b..268f14a5 100644
--- a/polaris/dataset/_dataset.py
+++ b/polaris/dataset/_dataset.py
@@ -368,7 +368,7 @@ def get_data(self, row: str | int, col: str, adapters: Optional[List[Adapter]] =
return arr
def upload_to_hub(
- self, access: Optional[AccessType] = "private", owner: Optional[Union[HubOwner, str]] = None
+ self, access: Optional[AccessType] = "private", owner: Union[HubOwner, str, None] = None
):
"""
Very light, convenient wrapper around the
diff --git a/polaris/evaluate/__init__.py b/polaris/evaluate/__init__.py
index a5e739bb..95e61efc 100644
--- a/polaris/evaluate/__init__.py
+++ b/polaris/evaluate/__init__.py
@@ -1,4 +1,22 @@
from polaris.evaluate._metric import Metric, MetricInfo
-from polaris.evaluate._results import BenchmarkResults, ResultsType
+from polaris.evaluate._results import (
+ BenchmarkResults,
+ ResultsType,
+ CompetitionResults,
+ CompetitionPredictions,
+ ResultsMetadata,
+ EvaluationResult,
+)
+from polaris.evaluate.utils import evaluate_benchmark
-__all__ = ["Metric", "MetricInfo", "BenchmarkResults", "ResultsType"]
+__all__ = [
+ "Metric",
+ "MetricInfo",
+ "ResultsMetadata",
+ "EvaluationResult",
+ "BenchmarkResults",
+ "CompetitionResults",
+ "ResultsType",
+ "evaluate_benchmark",
+ "CompetitionPredictions",
+]
diff --git a/polaris/evaluate/_results.py b/polaris/evaluate/_results.py
index b7eccf64..252cc597 100644
--- a/polaris/evaluate/_results.py
+++ b/polaris/evaluate/_results.py
@@ -2,6 +2,7 @@
from datetime import datetime
from typing import ClassVar, Optional, Union
+import numpy as np
import pandas as pd
from pydantic import (
BaseModel,
@@ -20,7 +21,15 @@
from polaris.utils.dict2html import dict2html
from polaris.utils.errors import InvalidResultError
from polaris.utils.misc import sluggify
-from polaris.utils.types import AccessType, HttpUrlString, HubOwner, HubUser, SlugCompatibleStringType
+from polaris.utils.types import (
+ AccessType,
+ CompetitionPredictionsType,
+ HttpUrlString,
+ HubOwner,
+ HubUser,
+ PredictionsType,
+ SlugCompatibleStringType,
+)
# Define some helpful type aliases
TestLabelType = str
@@ -58,15 +67,48 @@ def serialize_scores(self, value: dict):
return {metric.name: score for metric, score in value.items()}
-ResultsType = Union[pd.DataFrame, list[Union[ResultRecords, dict]]]
+ResultsType = Union[pd.DataFrame, list[ResultRecords | dict]]
-class BenchmarkResults(BaseArtifactModel):
- """Class for saving benchmarking results
+class ResultsMetadata(BaseArtifactModel):
+ """Base class for evaluation results
- This object is returned by [`BenchmarkSpecification.evaluate`][polaris.benchmark.BenchmarkSpecification.evaluate].
- In addition to the metrics on the test set, it contains additional meta-data and logic to integrate
- the results with the Polaris Hub.
+ Attributes:
+ github_url: The URL to the GitHub repository of the code used to generate these results.
+ paper_url: The URL to the paper describing the methodology used to generate these results.
+ contributors: The users that are credited for these results.
+ _created_at: The time-stamp at which the results were created. Automatically set.
+ For additional meta-data attributes, see the [`BaseArtifactModel`][polaris._artifact.BaseArtifactModel] class.
+ """
+
+ # Additional meta-data
+ github_url: Optional[HttpUrlString] = None
+ paper_url: Optional[HttpUrlString] = None
+ contributors: Optional[list[HubUser]] = None
+
+ # Private attributes
+ _created_at: datetime = PrivateAttr(default_factory=datetime.now)
+
+ def _repr_dict_(self) -> dict:
+ """Utility function for pretty-printing to the command line and jupyter notebooks"""
+ repr_dict = self.model_dump(exclude=["results"])
+
+ df = self.results.copy(deep=True)
+ df["Metric"] = df["Metric"].apply(lambda x: x.name if isinstance(x, Metric) else x)
+ repr_dict["results"] = json.loads(df.to_json(orient="records"))
+
+ return repr_dict
+
+ def _repr_html_(self):
+ """For pretty-printing in Jupyter Notebooks"""
+ return dict2html(self._repr_dict_())
+
+ def __repr__(self):
+ return json.dumps(self._repr_dict_(), indent=2)
+
+
+class EvaluationResult(ResultsMetadata):
+ """Class for saving evaluation results
The actual results are saved in the `results` field using the following tabular format:
@@ -79,43 +121,20 @@ class BenchmarkResults(BaseArtifactModel):
question: Categorizing methods
An open question is how to best categorize a methodology (e.g. a model).
- This is needed since we would like to be able to aggregate results across benchmarks too,
+ This is needed since we would like to be able to aggregate results across benchmarks/competitions too,
to say something about which (type of) methods performs best _in general_.
Attributes:
- results: Benchmark results are stored directly in a dataframe or in a serialized, JSON compatible dict
+ results: Evaluation results are stored directly in a dataframe or in a serialized, JSON compatible dict
that can be decoded into the associated tabular format.
- benchmark_name: The name of the benchmark for which these results were generated.
- Together with the benchmark owner, this uniquely identifies the benchmark on the Hub.
- benchmark_owner: The owner of the benchmark for which these results were generated.
- Together with the benchmark name, this uniquely identifies the benchmark on the Hub.
- github_url: The URL to the GitHub repository of the code used to generate these results.
- paper_url: The URL to the paper describing the methodology used to generate these results.
- contributors: The users that are credited for these results.
- _created_at: The time-stamp at which the results were created. Automatically set.
- For additional meta-data attributes, see the [`BaseArtifactModel`][polaris._artifact.BaseArtifactModel] class.
+ For additional meta-data attributes, see the [`ResultsMetadata`][polaris.evaluate._results.ResultsMetadata] class.
"""
# Define the columns of the results table
RESULTS_COLUMNS: ClassVar[list[str]] = ["Test set", "Target label", "Metric", "Score"]
- # Data
+ # Results attribute
results: ResultsType
- benchmark_name: SlugCompatibleStringType = Field(..., frozen=True)
- benchmark_owner: Optional[HubOwner] = Field(None, frozen=True)
-
- # Additional meta-data
- github_url: Optional[HttpUrlString] = None
- paper_url: Optional[HttpUrlString] = None
- contributors: Optional[list[HubUser]] = None
-
- # Private attributes
- _created_at: datetime = PrivateAttr(default_factory=datetime.now)
-
- @computed_field
- @property
- def benchmark_artifact_id(self) -> str:
- return f"{self.benchmark_owner}/{sluggify(self.benchmark_name)}"
@field_validator("results")
def _validate_results(cls, v):
@@ -175,12 +194,34 @@ def _serialize_results(self, value: ResultsType):
return serialized
+
+class BenchmarkResults(EvaluationResult):
+ """Class specific to results for standard benchmarks.
+
+ This object is returned by [`BenchmarkSpecification.evaluate`][polaris.benchmark.BenchmarkSpecification.evaluate].
+ In addition to the metrics on the test set, it contains additional meta-data and logic to integrate
+ the results with the Polaris Hub.
+
+    Attributes:
+        benchmark_name: The name of the benchmark for which these results were generated.
+            Together with the benchmark owner, this uniquely identifies the benchmark on the Hub.
+        benchmark_owner: The owner of the benchmark for which these results were generated.
+            Together with the benchmark name, this uniquely identifies the benchmark on the Hub.
+ """
+
+ benchmark_name: SlugCompatibleStringType = Field(..., frozen=True)
+ benchmark_owner: Optional[HubOwner] = Field(None, frozen=True)
+
+ @computed_field
+ @property
+ def benchmark_artifact_id(self) -> str:
+ return f"{self.benchmark_owner}/{sluggify(self.benchmark_name)}"
+
def upload_to_hub(
self,
settings: Optional[PolarisHubSettings] = None,
cache_auth_token: bool = True,
access: Optional[AccessType] = "private",
- owner: Optional[Union[HubOwner, str]] = None,
+ owner: Union[HubOwner, str, None] = None,
**kwargs: dict,
):
"""
@@ -192,22 +233,68 @@ def upload_to_hub(
with PolarisHubClient(settings=settings, cache_auth_token=cache_auth_token, **kwargs) as client:
return client.upload_results(self, access=access, owner=owner)
- def _repr_dict_(self) -> dict:
- """Utility function for pretty-printing to the command line and jupyter notebooks"""
- repr_dict = self.model_dump(exclude=["results"])
- df = self.results.copy(deep=True)
- df["Metric"] = df["Metric"].apply(lambda x: x.name if isinstance(x, Metric) else x)
- repr_dict["results"] = json.loads(df.to_json(orient="records"))
+class CompetitionResults(EvaluationResult):
+ """Class specific to results for competition benchmarks.
- return repr_dict
+ This object is returned by [`CompetitionSpecification.evaluate`][polaris.competition.CompetitionSpecification.evaluate].
+ In addition to the metrics on the test set, it contains additional meta-data and logic to integrate
+ the results with the Polaris Hub.
- def _repr_html_(self):
- """For pretty-printing in Jupyter Notebooks"""
- return dict2html(self._repr_dict_())
+ Attributes:
+ competition_name: The name of the competition for which these results were generated.
+ Together with the competition owner, this uniquely identifies the competition on the Hub.
+ competition_owner: The owner of the competition for which these results were generated.
+ Together with the competition name, this uniquely identifies the competition on the Hub.
+ """
- def __len__(self):
- return len(self.table)
+ competition_name: SlugCompatibleStringType = Field(..., frozen=True)
+ competition_owner: Optional[HubOwner] = Field(None, frozen=True)
- def __repr__(self):
- return json.dumps(self._repr_dict_(), indent=2)
+ @computed_field
+ @property
+ def competition_artifact_id(self) -> str:
+ return f"{self.competition_owner}/{sluggify(self.competition_name)}"
+
+
+class CompetitionPredictions(ResultsMetadata):
+ """Class specific to predictions for competition benchmarks.
+
+ This object is to be used as input to [`CompetitionSpecification.evaluate`][polaris.competition.CompetitionSpecification.evaluate].
+    It is used to ensure that the structure of the predictions is compatible with evaluation methods on the Polaris Hub.
+
+ Attributes:
+ predictions: The predictions created for a given competition's test set(s).
+ """
+
+ predictions: Union[PredictionsType, CompetitionPredictionsType]
+ access: Optional[AccessType] = "private"
+
+ @field_validator("predictions")
+ @classmethod
+ def _convert_predictions(cls, value: Union[PredictionsType, CompetitionPredictionsType]):
+ """Convert prediction arrays from a list type to a numpy array. This is required for certain
+ operations during prediction evaluation"""
+
+ if isinstance(value, list):
+ return np.array(value)
+ elif isinstance(value, np.ndarray):
+ return value
+ elif isinstance(value, dict):
+ for key, val in value.items():
+ value[key] = cls._convert_predictions(val)
+ return value
+
+ @field_serializer("predictions")
+ def _serialize_predictions(self, value: PredictionsType):
+ """Used to serialize a Predictions object such that it can be sent over the wire during
+ external evaluation for competitions"""
+
+ if isinstance(value, np.ndarray):
+ return value.tolist()
+ elif isinstance(value, list):
+ return value
+ elif isinstance(value, dict):
+ for key, val in value.items():
+ value[key] = self._serialize_predictions(val)
+ return value
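
A small sketch of the round-trip behaviour of `_convert_predictions` and `_serialize_predictions`, assuming `name` and `predictions` are enough to instantiate the object:

```python
from polaris.evaluate import CompetitionPredictions

# Predictions supplied as a plain list are converted to a numpy array on
# validation, and serialized back to a list for transport to the Hub.
preds = CompetitionPredictions(name="round-trip-example", predictions=[0.1, 0.2, 0.3])

print(type(preds.predictions))            # numpy.ndarray after validation
print(preds.model_dump()["predictions"])  # plain list again after serialization
```
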
diff --git a/polaris/evaluate/utils.py b/polaris/evaluate/utils.py
new file mode 100644
index 00000000..566abc37
--- /dev/null
+++ b/polaris/evaluate/utils.py
@@ -0,0 +1,96 @@
+import numpy as np
+import pandas as pd
+from typing import Optional
+
+from polaris.evaluate import BenchmarkResults, ResultsType
+from polaris.utils.types import PredictionsType
+from polaris.evaluate import Metric
+from numpy.typing import NDArray
+
+
+def is_multi_task_single_test_set(vals: PredictionsType, target_cols: list[str]):
+ """Check if the given values are for a multiple-task benchmark with a single
+ test set. This is inferred by comparing the target names with the keys of the
+ given data. If all keys in the given data match the target column names, we
+ assume they are target names (as opposed to test set names for a single-task,
+ multiple test set benchmark)."""
+ return all(k in target_cols for k in vals)
+
+
+def normalize_predictions_type(vals: PredictionsType, target_cols: list[str]):
+ if isinstance(vals, dict):
+ if is_multi_task_single_test_set(vals, target_cols):
+ return {"test": vals}
+ else:
+ return vals
+ elif vals is None:
+ return None
+ else:
+ return {"test": {target_cols[0]: vals}}
+
+
+def safe_mask(
+ input_values: dict | dict[str, dict], test_label: str, target_label: str, mask: NDArray[np.bool_]
+):
+ if (
+ input_values is None
+ or input_values.get(test_label) is None
+ or input_values[test_label].get(target_label) is None
+ ):
+ return None
+ else:
+ return input_values[test_label][target_label][mask]
+
+
+def evaluate_benchmark(
+ target_cols: list[str],
+ metrics: list[Metric],
+ y_true: PredictionsType,
+ y_pred: Optional[PredictionsType] = None,
+ y_prob: Optional[PredictionsType] = None,
+):
+ y_true = normalize_predictions_type(y_true, target_cols)
+ y_pred = normalize_predictions_type(y_pred, target_cols)
+ y_prob = normalize_predictions_type(y_prob, target_cols)
+
+ if y_pred and set(y_true.keys()) != set(y_pred.keys()):
+ raise KeyError(f"Missing keys for at least one of the test sets. Expecting: {sorted(y_true.keys())}")
+
+ # Results are saved in a tabular format. For more info, see the BenchmarkResults docs.
+ scores: ResultsType = pd.DataFrame(columns=BenchmarkResults.RESULTS_COLUMNS)
+
+ # For every test set...
+ for test_label, y_true_subset in y_true.items():
+ # For every metric...
+ for metric in metrics:
+ if metric.is_multitask:
+ # Multi-task but with a metric across targets
+ score = metric(
+ y_true=y_true_subset, y_pred=y_pred.get(test_label), y_prob=y_prob.get(test_label)
+ )
+
+ scores.loc[len(scores)] = (test_label, "aggregated", metric, score)
+ continue
+
+ if not isinstance(y_true_subset, dict):
+ # Single task
+ score = metric(
+ y_true=y_true_subset, y_pred=y_pred.get(test_label), y_prob=y_prob.get(test_label)
+ )
+ scores.loc[len(scores)] = (test_label, target_cols[0], metric, score)
+ continue
+
+ # Otherwise, for every target...
+ for target_label, y_true_target in y_true_subset.items():
+ # Single-task metrics for a multi-task benchmark
+ # In such a setting, there can be NaN values, which we thus have to filter out.
+ mask = ~np.isnan(y_true_target)
+ score = metric(
+ y_true=y_true_target[mask],
+ y_pred=safe_mask(y_pred, test_label, target_label, mask),
+ y_prob=safe_mask(y_prob, test_label, target_label, mask),
+ )
+
+ scores.loc[len(scores)] = (test_label, target_label, metric, score)
+
+ return scores
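
A hedged sketch of calling `evaluate_benchmark` directly for the simplest case (the target column name is a placeholder):

```python
import numpy as np
from polaris.evaluate import Metric, evaluate_benchmark

# Single-task, single test set: plain arrays are normalized to {"test": {target: values}}.
y_true = np.array([0.0, 0.5, 1.0])
y_pred = np.array([0.1, 0.4, 0.9])

scores = evaluate_benchmark(
    target_cols=["my_target"],  # illustrative target column name
    metrics=[Metric.mean_absolute_error],
    y_true=y_true,
    y_pred=y_pred,
)
print(scores)  # one row per (test set, target label, metric)
```
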
diff --git a/polaris/hub/client.py b/polaris/hub/client.py
index 83682367..dca7c239 100644
--- a/polaris/hub/client.py
+++ b/polaris/hub/client.py
@@ -3,6 +3,7 @@
from hashlib import md5
from io import BytesIO
from typing import Callable, get_args
+from typing import Union
from urllib.parse import urljoin
import certifi
@@ -24,8 +25,12 @@
)
from polaris.dataset import Dataset
from polaris.evaluate import BenchmarkResults
+from polaris.evaluate._results import CompetitionPredictions
from polaris.hub.external_auth_client import ExternalAuthClient
from polaris.hub.oauth import CachedTokenAuth
+from polaris.dataset import CompetitionDataset
+from polaris.evaluate import CompetitionResults
+from polaris.competition import CompetitionSpecification
from polaris.hub.polarisfs import PolarisFileSystem
from polaris.hub.settings import PolarisHubSettings
from polaris.utils.context import ProgressIndicator, tmp_attribute_change
@@ -39,6 +44,7 @@
from polaris.utils.misc import should_verify_checksum
from polaris.utils.types import (
AccessType,
+ ArtifactSubtype,
ChecksumStrategy,
HubOwner,
IOMode,
@@ -274,12 +280,12 @@ def list_datasets(self, limit: int = 100, offset: int = 0) -> list[str]:
A list of dataset names in the format `owner/dataset_name`.
"""
with ProgressIndicator(
- start_msg="Fetching datasets...",
- success_msg="Fetched datasets.",
+ start_msg="Fetching artifacts...",
+ success_msg="Fetched artifacts.",
error_msg="Failed to fetch datasets.",
):
response = self._base_request_to_hub(
- url="/dataset", method="GET", params={"limit": limit, "offset": offset}
+ url="/v1/dataset", method="GET", params={"limit": limit, "offset": offset}
)
dataset_list = [bm["artifactId"] for bm in response["data"]]
@@ -291,7 +297,7 @@ def get_dataset(
name: str,
verify_checksum: ChecksumStrategy = "verify_unless_zarr",
) -> Dataset:
- """Load a dataset from the Polaris Hub.
+ """Load a standard dataset from the Polaris Hub.
Args:
owner: The owner of the dataset. Can be either a user or organization from the Polaris Hub.
@@ -302,13 +308,37 @@ def get_dataset(
Returns:
A `Dataset` instance, if it exists.
"""
+ return self._get_dataset(owner, name, ArtifactSubtype.STANDARD.value, verify_checksum)
+ def _get_dataset(
+ self,
+ owner: Union[str, HubOwner],
+ name: str,
+ artifact_type: ArtifactSubtype,
+ verify_checksum: bool = True,
+ ) -> Dataset:
+ """Loads either a standard or competition dataset from Polaris Hub
+
+ Args:
+ owner: The owner of the dataset. Can be either a user or organization from the Polaris Hub.
+ name: The name of the dataset.
+ artifact_type: indicates whether the artifact is of the standard or competition type.
+ verify_checksum: Whether to use the checksum to verify the integrity of the dataset.
+
+ Returns:
+ A `Dataset` instance, if it exists.
+ """
with ProgressIndicator(
- start_msg="Fetching dataset...",
- success_msg="Fetched dataset.",
+ start_msg="Fetching artifact...",
+ success_msg="Fetched artifact.",
error_msg="Failed to fetch dataset.",
):
- response = self._base_request_to_hub(url=f"/dataset/{owner}/{name}", method="GET")
+ url = (
+ f"/v1/dataset/{owner}/{name}"
+ if artifact_type == ArtifactSubtype.STANDARD.value
+ else f"/v2/competition/dataset/{owner}/{name}"
+ )
+ response = self._base_request_to_hub(url=url, method="GET")
storage_response = self.get(response["tableContent"]["url"])
# This should be a 307 redirect with the signed URL
@@ -326,12 +356,17 @@ def get_dataset(
response["table"] = self._load_from_signed_url(url=url, headers=headers, load_fn=pd.read_parquet)
- dataset = Dataset(**response)
+ if artifact_type == ArtifactSubtype.COMPETITION:
+ dataset = CompetitionDataset(**response)
+ md5Sum = response["maskedMd5Sum"]
+ else:
+ dataset = Dataset(**response)
+ md5Sum = response["md5Sum"]
if should_verify_checksum(verify_checksum, dataset):
- dataset.verify_checksum()
+ dataset.verify_checksum(md5Sum)
else:
- dataset.md5sum = response["md5Sum"]
+ dataset.md5sum = md5Sum
return dataset
@@ -383,13 +418,13 @@ def list_benchmarks(self, limit: int = 100, offset: int = 0) -> list[str]:
A list of benchmark names in the format `owner/benchmark_name`.
"""
with ProgressIndicator(
- start_msg="Fetching benchmarks...",
- success_msg="Fetched benchmarks.",
+ start_msg="Fetching artifacts...",
+ success_msg="Fetched artifacts.",
error_msg="Failed to fetch benchmarks.",
):
# TODO (cwognum): What to do with pagination, i.e. limit and offset?
response = self._base_request_to_hub(
- url="/benchmark", method="GET", params={"limit": limit, "offset": offset}
+ url="/v1/benchmark", method="GET", params={"limit": limit, "offset": offset}
)
benchmarks_list = [f"{HubOwner(**bm['owner'])}/{bm['name']}" for bm in response["data"]]
@@ -412,11 +447,11 @@ def get_benchmark(
A `BenchmarkSpecification` instance, if it exists.
"""
with ProgressIndicator(
- start_msg="Fetching benchmark...",
- success_msg="Fetched benchmark.",
+ start_msg="Fetching artifact...",
+ success_msg="Fetched artifact.",
error_msg="Failed to fetch benchmark.",
):
- response = self._base_request_to_hub(url=f"/benchmark/{owner}/{name}", method="GET")
+ response = self._base_request_to_hub(url=f"/v1/benchmark/{owner}/{name}", method="GET")
# TODO (jstlaurent): response["dataset"]["artifactId"] is the owner/name unique identifier,
# but we'd need to change the signature of get_dataset to use it
@@ -473,8 +508,8 @@ def upload_results(
owner: Which Hub user or organization owns the artifact. Takes precedence over `results.owner`.
"""
with ProgressIndicator(
- start_msg="Uploading result...",
- success_msg="Uploaded result.",
+ start_msg="Uploading artifact...",
+ success_msg="Uploaded artifact.",
error_msg="Failed to upload result.",
) as progress_indicator:
# Get the serialized model data-structure
@@ -483,13 +518,13 @@ def upload_results(
# Make a request to the hub
response = self._base_request_to_hub(
- url="/result", method="POST", json={"access": access, **result_json}
+ url="/v1/result", method="POST", json={"access": access, **result_json}
)
# Inform the user about where to find their newly created artifact.
result_url = urljoin(
self.settings.hub_url,
- f"benchmarks/{results.benchmark_owner}/{results.benchmark_name}/{response['id']}",
+ f"/v1/benchmarks/{results.benchmark_owner}/{results.benchmark_name}/{response['id']}",
)
progress_indicator.update_success_msg(
@@ -505,6 +540,20 @@ def upload_dataset(
timeout: TimeoutTypes = (10, 200),
owner: HubOwner | str | None = None,
if_exists: ZarrConflictResolution = "replace",
+ ):
+ """Wrapper method for uploading standard datasets to Polaris Hub"""
+ return self._upload_dataset(
+ dataset, ArtifactSubtype.STANDARD.value, access, timeout, owner, if_exists
+ )
+
+ def _upload_dataset(
+ self,
+ dataset: Dataset,
+ artifact_type: ArtifactSubtype,
+ access: AccessType = "private",
+ timeout: TimeoutTypes = (10, 200),
+ owner: Union[HubOwner, str, None] = None,
+ if_exists: ZarrConflictResolution = "replace",
):
"""Upload the dataset to the Polaris Hub.
@@ -531,8 +580,8 @@ def upload_dataset(
an error, 'replace' to overwrite, or 'skip' to proceed without altering the existing files.
"""
with ProgressIndicator(
- start_msg="Uploading dataset...",
- success_msg="Uploaded dataset.",
+ start_msg="Uploading artifact...",
+ success_msg="Uploaded artifact.",
error_msg="Failed to upload dataset.",
) as progress_indicator:
# Check if a dataset license was specified prior to upload
@@ -571,7 +620,11 @@ def upload_dataset(
# Step 1: Upload meta-data
# Instead of directly uploading the data, we announce to the hub that we intend to upload it.
# We do so separately for the Zarr archive and Parquet file.
- url = f"/dataset/{dataset.owner}/{dataset.name}"
+ url = (
+ f"/v1/dataset/{dataset.artifact_id}"
+ if artifact_type == ArtifactSubtype.STANDARD.value
+ else f"/v2/competition/dataset/{dataset.owner}/{dataset.name}"
+ )
response = self._base_request_to_hub(
url=url,
method="PUT",
@@ -597,6 +650,7 @@ def upload_dataset(
"Content-type": "application/vnd.apache.parquet",
},
timeout=timeout,
+ json={"artifactType": artifact_type},
)
if hub_response.status_code == 307:
@@ -646,10 +700,12 @@ def upload_dataset(
log=logger.debug,
if_exists=if_exists,
)
-
+ base_artifact_url = (
+ "datasets" if artifact_type == ArtifactSubtype.STANDARD.value else "/competition/datasets"
+ )
progress_indicator.update_success_msg(
- "Your dataset has been successfully uploaded to the Hub. "
- f"View it here: {urljoin(self.settings.hub_url, f'datasets/{dataset.owner}/{dataset.name}')}"
+ f"Your {artifact_type} dataset has been successfully uploaded to the Hub. "
+ f"View it here: {urljoin(self.settings.hub_url, f'{base_artifact_url}/{dataset.owner}/{dataset.name}')}"
)
return response
@@ -681,9 +737,40 @@ def upload_benchmark(
access: Grant public or private access to result
owner: Which Hub user or organization owns the artifact. Takes precedence over `benchmark.owner`.
"""
+ return self._upload_benchmark(benchmark, ArtifactSubtype.STANDARD.value, access, owner)
+
+ def _upload_benchmark(
+ self,
+ benchmark: BenchmarkSpecification | CompetitionSpecification,
+ artifact_type: ArtifactSubtype,
+ access: AccessType = "private",
+ owner: Union[HubOwner, str, None] = None,
+ ):
+ """Upload a standard or competition benchmark to the Polaris Hub.
+
+ Info: Owner
+ You have to manually specify the owner in the benchmark data model. Because the owner could
+ be a user or an organization, we cannot automatically infer this from the logged-in user.
+
+ Note: Required meta-data
+ The Polaris client and hub maintain different requirements as to which meta-data is required.
+ The requirements by the hub are stricter, so when uploading to the hub you might
+ get some errors on missing meta-data. Make sure to fill-in as much of the meta-data as possible
+ before uploading.
+
+ Note: Non-existent datasets
+ The client will _not_ upload the associated dataset to the hub if it does not yet exist.
+ Make sure to specify an existing dataset or upload the dataset first.
+
+ Args:
+ benchmark: The benchmark to upload.
+ artifact_type: Indicates whether the artifact is of the standard or competition type.
+ access: Grant public or private access to result
+ owner: Which Hub user or organization owns the artifact. Takes precedence over `benchmark.owner`.
+ """
with ProgressIndicator(
- start_msg="Uploading benchmark...",
- success_msg="Uploaded benchmark.",
+ start_msg="Uploading artifact...",
+ success_msg="Uploaded artifact.",
error_msg="Failed to upload benchmark.",
) as progress_indicator:
# Get the serialized data-model
@@ -693,12 +780,107 @@ def upload_benchmark(
benchmark_json["datasetArtifactId"] = benchmark.dataset.artifact_id
benchmark_json["access"] = access
- url = f"/benchmark/{benchmark.owner}/{benchmark.name}"
+ path_params = (
+ "/v1/benchmark" if artifact_type == ArtifactSubtype.STANDARD.value else "/v2/competition"
+ )
+ url = f"{path_params}/{benchmark.owner}/{benchmark.name}"
response = self._base_request_to_hub(url=url, method="PUT", json=benchmark_json)
progress_indicator.update_success_msg(
- "Your benchmark has been successfully uploaded to the Hub. "
- f"View it here: {urljoin(self.settings.hub_url, f'benchmarks/{benchmark.owner}/{benchmark.name}')}"
+ f"Your {artifact_type} benchmark has been successfully uploaded to the Hub. "
+ f"View it here: {urljoin(self.settings.hub_url, url)}"
)
-
return response
+
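Similarly, a hedged sketch of calling the public `upload_benchmark` wrapper, which delegates to the method above with the standard subtype. The `benchmark` object is assumed to be a locally built `BenchmarkSpecification` whose dataset already exists on the Hub, since the client will not upload it on your behalf.

from polaris.hub.client import PolarisHubClient

client = PolarisHubClient()
client.login()

# `benchmark` is a hypothetical, pre-built BenchmarkSpecification with its
# owner and other required meta-data already filled in.
client.upload_benchmark(benchmark, access="public", owner="my-org")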
+ def get_competition(
+ self, owner: Union[str, HubOwner], name: str, verify_checksum: bool = True
+ ) -> CompetitionSpecification:
+ """Load a competition from the Polaris Hub.
+
+ Args:
+ owner: The owner of the competition. Can be either a user or organization from the Polaris Hub.
+ name: The name of the competition.
+ verify_checksum: Whether to use the checksum to verify the integrity of the dataset.
+
+ Returns:
+ A `CompetitionSpecification` instance, if it exists.
+ """
+ response = self._base_request_to_hub(url=f"/v2/competition/{owner}/{name}", method="GET")
+
+ # TODO (jstlaurent): response["dataset"]["artifactId"] is the owner/name unique identifier,
+ # but we'd need to change the signature of get_dataset to use it
+ response["dataset"] = self._get_dataset(
+ response["dataset"]["owner"]["slug"],
+ response["dataset"]["name"],
+ ArtifactSubtype.COMPETITION,
+ verify_checksum=verify_checksum,
+ )
+
+ if not verify_checksum:
+ response.pop("md5Sum", None)
+
+ return CompetitionSpecification.model_construct(**response)
+
+ def list_competitions(self, limit: int = 100, offset: int = 0) -> list[str]:
+ """List all available competitions on the Polaris Hub.
+
+ Args:
+ limit: The maximum number of competitions to return.
+ offset: The offset from which to start returning competitions.
+
+ Returns:
+ A list of competition names in the format `owner/competition_name`.
+ """
+ with ProgressIndicator(
+ start_msg="Fetching artifacts...",
+ success_msg="Fetched artifacts.",
+ error_msg="Failed to fetch artifacts.",
+ ):
+ # TODO (cwognum): What to do with pagination, i.e. limit and offset?
+ response = self._base_request_to_hub(
+ url="/v2/competition", method="GET", params={"limit": limit, "offset": offset}
+ )
+ competitions_list = [f"{HubOwner(**bm['owner'])}/{bm['name']}" for bm in response["data"]]
+ return competitions_list
+
+ def evaluate_competition(
+ self,
+ competition: CompetitionSpecification,
+ competition_predictions: CompetitionPredictions,
+ ) -> CompetitionResults:
+ """Evaluate the predictions for a competition on the Polaris Hub. Target labels are fetched
+ by Polaris Hub and used only internally.
+
+ Args:
+ competition: The competition to evaluate the predictions for.
+ competition_predictions: The predictions and associated metadata to be submitted for evaluation by the Hub.
+
+ Returns:
+ A `CompetitionResults` object.
+ """
+ with ProgressIndicator(
+ start_msg="Evaluating competition predictions...",
+ success_msg="Evaluated competition predictions.",
+ error_msg="Failed to evaluate competition predictions.",
+ ) as progress_indicator:
+ competition.owner = HubOwner(**competition.owner)
+
+ response = self._base_request_to_hub(
+ url=f"/v2/competition/{competition.owner}/{competition.name}/evaluate",
+ method="POST",
+ json=competition_predictions.model_dump(),
+ )
+
+ # Inform the user about where to find their newly created artifact.
+ result_url = urljoin(
+ self.settings.hub_url,
+ f"/v2/competition/{competition.owner}/{competition.name}/{response['id']}",
+ )
+ progress_indicator.update_success_msg(
+ f"Your competition result has been successfully uploaded to the Hub. View it here: {result_url}"
+ )
+
+ scores = response["results"]
+ return CompetitionResults(
+ results=scores, competition_name=competition.name, competition_owner=competition.owner
+ )
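Taken together, the new competition methods support a participation flow along the following lines. The competition slug is a placeholder, and the `predictions=` field on `CompetitionPredictions` is an assumption to verify against the class definition; in practice the prediction payload must match the competition's test set.

import numpy as np

from polaris.evaluate import CompetitionPredictions
from polaris.hub.client import PolarisHubClient

client = PolarisHubClient()
client.login()

# Discover competitions, then fetch one by its owner/name identifier.
print(client.list_competitions(limit=10))
competition = client.get_competition("some-org", "some-competition")

# Package the predictions; `predictions=` is an assumed field name here.
predictions = CompetitionPredictions(predictions=np.random.rand(100).tolist())

# Evaluation runs server-side; only the predictions leave the client.
results = client.evaluate_competition(competition, predictions)
print(results.results)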
diff --git a/polaris/hub/settings.py b/polaris/hub/settings.py
index 1780a4d3..1192b805 100644
--- a/polaris/hub/settings.py
+++ b/polaris/hub/settings.py
@@ -54,7 +54,7 @@ class PolarisHubSettings(BaseSettings):
@field_validator("api_url", mode="before")
def validate_api_url(cls, v, info: ValidationInfo):
if v is None:
- v = urljoin(str(info.data["hub_url"]), "/api/v1")
+ v = urljoin(str(info.data["hub_url"]), "/api")
return v
@field_validator("hub_token_url", mode="before")
diff --git a/polaris/loader/__init__.py b/polaris/loader/__init__.py
index 980f6dcf..835fe561 100644
--- a/polaris/loader/__init__.py
+++ b/polaris/loader/__init__.py
@@ -1,3 +1,3 @@
-from .load import load_benchmark, load_dataset
+from .load import load_benchmark, load_dataset, load_competition
-_all__ = ["load_benchmark", "load_dataset"]
+_all__ = ["load_benchmark", "load_dataset", "load_competition"]
diff --git a/polaris/loader/load.py b/polaris/loader/load.py
index 49fea5bb..6e152f68 100644
--- a/polaris/loader/load.py
+++ b/polaris/loader/load.py
@@ -93,3 +93,21 @@ def load_benchmark(path: str, verify_checksum: ChecksumStrategy = "verify_unless
benchmark.verify_checksum()
return benchmark
+
+
+def load_competition(slug: str, verify_checksum: bool = True):
+ """
+ Loads a Polaris competition.
+
+ In Polaris, a competition can be thought of as a more secure version of a standard benchmark.
+ In competitions, the target labels never exist on the client and all results are evaluated
+ through Polaris' servers.
+
+ Note: Dataset is automatically loaded
+ The dataset underlying the competition is automatically loaded when pulling the competition.
+
+ """
+
+ # Load from the Hub
+ client = PolarisHubClient()
+ return client.get_competition(*slug.split("/"), verify_checksum=verify_checksum)
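For completeness, a minimal sketch of the loader entry point; the slug below is a placeholder in the `owner/name` format.

from polaris.loader import load_competition

# Fetches the competition, and its underlying dataset, from the Polaris Hub.
competition = load_competition("some-org/some-competition")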
diff --git a/polaris/utils/errors.py b/polaris/utils/errors.py
index baa6d418..3e800847 100644
--- a/polaris/utils/errors.py
+++ b/polaris/utils/errors.py
@@ -11,6 +11,10 @@ class InvalidBenchmarkError(ValueError):
pass
+class InvalidCompetitionError(ValueError):
+ pass
+
+
class InvalidResultError(ValueError):
pass
diff --git a/polaris/utils/types.py b/polaris/utils/types.py
index fbee9d0e..e1cac444 100644
--- a/polaris/utils/types.py
+++ b/polaris/utils/types.py
@@ -29,10 +29,16 @@
A prediction is one of three things:
- A single array (single-task, single test set)
-- A dictionary of arrays (single-task, multiple test sets)
+- A dictionary of arrays (single-task, multiple test sets)
- A dictionary of dictionaries of arrays (multi-task, multiple test sets)
"""
+CompetitionPredictionsType: TypeAlias = Union[list, dict[str, Union[list, dict[str, list]]]]
+"""
+A competition-specific variant of the prediction structure. It uses plain, JSON-serializable lists
+so that predictions can be sent over the wire to the Hub for external evaluation.
+"""
+
DatapointPartType = Union[Any, tuple[Any], dict[str, Any]]
DatapointType: TypeAlias = tuple[DatapointPartType, DatapointPartType]
"""
@@ -55,14 +61,14 @@
"""
A URL-compatible string that can be turned into a slug by the hub.
-Can only use alpha-numeric characters, underscores and dashes.
+Can only use alpha-numeric characters, underscores and dashes.
The string must be at least 4 and at most 64 characters long.
"""
HubUser: TypeAlias = SlugCompatibleStringType
"""
-A user on the Polaris Hub is identified by a username,
+A user on the Polaris Hub is identified by a username,
which is a [`SlugCompatibleStringType`][polaris.utils.types.SlugCompatibleStringType].
"""
@@ -150,3 +156,10 @@ class TaskType(Enum):
MULTI_TASK = "multi_task"
SINGLE_TASK = "single_task"
+
+
+class ArtifactSubtype(Enum):
+ """The major artifact types which Polaris supports"""
+
+ STANDARD = "standard"
+ COMPETITION = "competition"
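To illustrate the two additions to the types module, a short sketch of the enum values and the prediction shapes that `CompetitionPredictionsType` admits; the sample values are made up.

from polaris.utils.types import ArtifactSubtype, CompetitionPredictionsType

# The enum members wrap plain strings, which is what the client compares against.
assert ArtifactSubtype.STANDARD.value == "standard"
assert ArtifactSubtype.COMPETITION.value == "competition"

# CompetitionPredictionsType mirrors the usual prediction shapes, but with
# JSON-serializable lists so predictions can be sent to the Hub for evaluation.
single_task: CompetitionPredictionsType = [0.1, 0.2, 0.3]
multi_task: CompetitionPredictionsType = {"col1": [0.1, 0.2], "col2": [0.3, 0.4]}
multi_set: CompetitionPredictionsType = {"test1": {"col1": [0.1]}, "test2": {"col1": [0.2]}}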
diff --git a/pyproject.toml b/pyproject.toml
index b1787c6c..55136937 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -90,7 +90,7 @@ Documentation = "https://polaris-hub.github.io/polaris/"
include-package-data = true
[tool.setuptools_scm]
-fallback_version = "dev"
+fallback_version = "0.0.0.dev1"
[tool.setuptools.packages.find]
where = ["."]
diff --git a/tests/conftest.py b/tests/conftest.py
index 2170a62a..fce4ec2f 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -9,7 +9,8 @@
MultiTaskBenchmarkSpecification,
SingleTaskBenchmarkSpecification,
)
-from polaris.dataset import ColumnAnnotation, Dataset
+from polaris.competition import CompetitionSpecification
+from polaris.dataset import ColumnAnnotation, Dataset, CompetitionDataset
from polaris.utils.types import HubOwner
@@ -77,6 +78,24 @@ def test_dataset(test_data, test_org_owner):
return dataset
+@pytest.fixture(scope="function")
+def test_competition_dataset(test_data, test_org_owner):
+ dataset = CompetitionDataset(
+ table=test_data,
+ name="test-competition-dataset",
+ source="https://www.example.com",
+ annotations={"expt": ColumnAnnotation(user_attributes={"unit": "kcal/mol"})},
+ tags=["tagA", "tagB"],
+ user_attributes={"attributeA": "valueA", "attributeB": "valueB"},
+ owner=test_org_owner,
+ license="CC-BY-4.0",
+ curation_reference="https://www.example.com",
+ )
+
+ check_version(dataset)
+ return dataset
+
+
@pytest.fixture(scope="function")
def zarr_archive(tmp_path):
tmp_path = fs.join(tmp_path, "data.zarr")
@@ -92,7 +111,7 @@ def test_single_task_benchmark(test_dataset):
train_indices = list(range(90))
test_indices = list(range(90, 100))
benchmark = SingleTaskBenchmarkSpecification(
- name="single-task-benchmark",
+ name="single-task-single-set-benchmark",
dataset=test_dataset,
metrics=[
"mean_absolute_error",
@@ -117,7 +136,7 @@ def test_single_task_benchmark_clf(test_dataset):
train_indices = list(range(90))
test_indices = list(range(90, 100))
benchmark = SingleTaskBenchmarkSpecification(
- name="single-task-benchmark",
+ name="single-task-single-set-benchmark",
dataset=test_dataset,
main_metric="accuracy",
metrics=["accuracy", "f1", "roc_auc", "pr_auc", "mcc", "cohen_kappa", "balanced_accuracy"],
@@ -138,7 +157,7 @@ def test_single_task_benchmark_multi_clf(test_dataset):
test_indices = indices[80:]
benchmark = SingleTaskBenchmarkSpecification(
- name="single-task-benchmark",
+ name="single-task-single-set-benchmark",
dataset=test_dataset,
main_metric="accuracy",
metrics=[
@@ -165,7 +184,7 @@ def test_single_task_benchmark_multiple_test_sets(test_dataset):
train_indices = list(range(90))
test_indices = {"test_1": list(range(90, 95)), "test_2": list(range(95, 100))}
benchmark = SingleTaskBenchmarkSpecification(
- name="single-task-benchmark",
+ name="single-task-multi-set-benchmark",
dataset=test_dataset,
metrics=[
"mean_absolute_error",
@@ -193,7 +212,7 @@ def test_single_task_benchmark_clf_multiple_test_sets(test_dataset):
train_indices = indices[:80]
test_indices = {"test_1": indices[80:90], "test_2": indices[90:]}
benchmark = SingleTaskBenchmarkSpecification(
- name="single-task-benchmark-clf",
+ name="single-task-multi-set-benchmark-clf",
dataset=test_dataset,
metrics=["accuracy", "f1", "roc_auc", "pr_auc", "mcc", "cohen_kappa"],
main_metric="pr_auc",
@@ -248,3 +267,53 @@ def test_multi_task_benchmark_clf(test_dataset):
)
check_version(benchmark)
return benchmark
+
+
+@pytest.fixture(scope="function")
+def test_competition(test_competition_dataset, test_org_owner):
+ train_indices = list(range(90))
+ test_indices = list(range(90, 100))
+ competition = CompetitionSpecification(
+ name="test-competition",
+ dataset=test_competition_dataset,
+ owner=test_org_owner,
+ metrics=[
+ "mean_absolute_error",
+ "mean_squared_error",
+ "r2",
+ "spearmanr",
+ "pearsonr",
+ "explained_var",
+ ],
+ main_metric="mean_absolute_error",
+ split=(train_indices, test_indices),
+ target_cols="expt",
+ input_cols="smiles",
+ )
+ check_version(competition)
+ return competition
+
+
+@pytest.fixture(scope="function")
+def test_multi_task_benchmark_multiple_test_sets(test_dataset):
+ train_indices = list(range(90))
+ test_indices = {"test_1": list(range(90, 95)), "test_2": list(range(95, 100))}
+ benchmark = MultiTaskBenchmarkSpecification(
+ name="multi-task-multi-set-benchmark",
+ dataset=test_dataset,
+ metrics=[
+ "mean_absolute_error",
+ "mean_squared_error",
+ "r2",
+ "spearmanr",
+ "pearsonr",
+ "explained_var",
+ "absolute_average_fold_error",
+ ],
+ main_metric="r2",
+ split=(train_indices, test_indices),
+ target_cols=["expt", "calc"],
+ input_cols="smiles",
+ )
+ check_version(benchmark)
+ return benchmark
diff --git a/tests/test_competition.py b/tests/test_competition.py
new file mode 100644
index 00000000..e6421585
--- /dev/null
+++ b/tests/test_competition.py
@@ -0,0 +1,103 @@
+import numpy as np
+import pandas as pd
+
+from polaris.evaluate.utils import evaluate_benchmark, normalize_predictions_type
+from polaris.competition import CompetitionSpecification
+
+
+def test_competition_from_json(test_competition, tmpdir):
+ """Test whether we can successfully save and load a competition from JSON."""
+ path = test_competition.to_json(str(tmpdir))
+ new_competition = CompetitionSpecification.from_json(path)
+ assert new_competition == test_competition
+
+
+def test_multi_col_competition_evaluation(test_competition):
+ """Test that multi-column competitions will be evaluated properly when when
+ target labels are read as a pandas dataframe from a file."""
+ data = np.random.randint(2, size=(6, 3))
+ labels = pd.DataFrame(data, columns=["Column1", "Column2", "Column3"])
+ labels_as_from_hub = {col: np.array(labels[col]) for col in labels.columns}
+ predictions = {target_col: np.random.randint(2, size=labels.shape[0]) for target_col in labels.columns}
+
+ result = evaluate_benchmark(
+ ["Column1", "Column2", "Column3"], test_competition.metrics, labels_as_from_hub, y_pred=predictions
+ )
+
+ assert isinstance(result, pd.DataFrame)
+ assert set(result.columns) == {
+ "Test set",
+ "Target label",
+ "Metric",
+ "Score",
+ }
+
+
+def test_single_col_competition_evaluation(test_competition):
+ """Test that multi-column competitions will be evaluated properly when when
+ target labels are read as a pandas dataframe from a file."""
+ data = np.array(
+ [
+ 1.15588236,
+ 1.56414507,
+ 1.04828639,
+ 0.98362629,
+ 1.22613572,
+ 2.56594576,
+ 0.67568671,
+ 0.86099644,
+ 0.67568671,
+ 2.28213589,
+ 1.06617679,
+ 1.05709529,
+ 0.67568671,
+ 0.67568671,
+ 0.67568671,
+ ]
+ )
+ labels = {"LOG HLM_CLint (mL/min/kg)": data}
+ predictions = data + np.random.uniform(0, 3, size=len(data))
+
+ result = evaluate_benchmark(
+ ["LOG HLM_CLint (mL/min/kg)"], test_competition.metrics, labels, y_pred=predictions
+ )
+
+ assert isinstance(result, pd.DataFrame)
+ assert set(result.columns) == {
+ "Test set",
+ "Target label",
+ "Metric",
+ "Score",
+ }
+
+
+def test_normalize_predictions_type():
+ "Single column, single test set"
+ assert {"test": {"col1": [1, 2, 3]}} == normalize_predictions_type([1, 2, 3], ["col1"])
+ assert {"test": {"col1": [1, 2, 3]}} == normalize_predictions_type({"col1": [1, 2, 3]}, ["col1"])
+ assert {"test": {"col1": [1, 2, 3]}} == normalize_predictions_type(
+ {"test": {"col1": [1, 2, 3]}}, ["col1"]
+ )
+
+ "Multi-column, single test set"
+ assert {"test": {"col1": [1, 2, 3], "col2": [4, 5, 6]}} == normalize_predictions_type(
+ {"col1": [1, 2, 3], "col2": [4, 5, 6]}, ["col1", "col2"]
+ )
+
+ assert {"test": {"col1": [1, 2, 3], "col2": [4, 5, 6]}} == normalize_predictions_type(
+ {"test": {"col1": [1, 2, 3], "col2": [4, 5, 6]}}, ["col1", "col2"]
+ )
+
+ "Single column, multi-test set"
+ assert {"test1": {"col1": [1, 2, 3]}, "test2": {"col1": [4, 5, 6]}} == normalize_predictions_type(
+ {"test1": {"col1": [1, 2, 3]}, "test2": {"col1": [4, 5, 6]}}, ["col1"]
+ )
+
+ "Multi-column, multi-test set"
+ assert {
+ "test1": {"col1": [1, 2, 3], "col2": [4, 5, 6]},
+ "test2": {"col1": [7, 8, 9], "col2": [10, 11, 12]},
+ } == normalize_predictions_type(
+ {"test1": {"col1": [1, 2, 3], "col2": [4, 5, 6]}, "test2": {"col1": [7, 8, 9], "col2": [10, 11, 12]}},
+ ["col1", "col2"],
+ )
diff --git a/tests/test_integration.py b/tests/test_integration.py
index 5a6c1f69..fbcb7e58 100644
--- a/tests/test_integration.py
+++ b/tests/test_integration.py
@@ -33,9 +33,10 @@ def test_single_task_benchmark_loop_with_multiple_test_sets(test_single_task_ben
model.fit(X=x_train, y=y)
y_pred = {}
+ task_name = test_single_task_benchmark_multiple_test_sets.target_cols[0]
for k, test_subset in test.items():
x_test = np.array([dm.to_fp(dm.to_mol(smi)) for smi in test_subset.inputs])
- y_pred[k] = model.predict(x_test)
+ y_pred[k] = {task_name: model.predict(x_test)}
scores = test_single_task_benchmark_multiple_test_sets.evaluate(y_pred)
assert isinstance(scores, BenchmarkResults)
@@ -56,10 +57,11 @@ def test_single_task_benchmark_clf_loop_with_multiple_test_sets(
y_prob = {}
y_pred = {}
+ task_name = test_single_task_benchmark_clf_multiple_test_sets.target_cols[0]
for k, test_subset in test.items():
x_test = np.array([dm.to_fp(dm.to_mol(smi)) for smi in test_subset.inputs])
- y_prob[k] = model.predict_proba(x_test)[:, :1] # for binary classification
- y_pred[k] = model.predict(x_test)
+ y_prob[k] = {task_name: model.predict_proba(x_test)[:, :1]} # for binary classification
+ y_pred[k] = {task_name: model.predict(x_test)}
scores = test_single_task_benchmark_clf_multiple_test_sets.evaluate(y_prob=y_prob, y_pred=y_pred)
assert isinstance(scores, BenchmarkResults)
@@ -83,3 +85,26 @@ def test_multi_task_benchmark_loop(test_multi_task_benchmark):
scores = test_multi_task_benchmark.evaluate(y_pred)
assert isinstance(scores, BenchmarkResults)
+
+
+def test_multi_task_benchmark_loop_with_multiple_test_sets(test_multi_task_benchmark_multiple_test_sets):
+ """Tests the integrated API for a multi-task benchmark with multiple test sets."""
+ train, test = test_multi_task_benchmark_multiple_test_sets.get_train_test_split()
+ smiles, multi_y = train.as_array("xy")
+
+ x_train = np.array([dm.to_fp(dm.to_mol(smi)) for smi in smiles])
+
+ y_pred = {}
+ for test_set_name, test_subset in test.items():
+ y_pred[test_set_name] = {}
+ x_test = np.array([dm.to_fp(dm.to_mol(smi)) for smi in test_subset.inputs])
+
+ for task_name, y in multi_y.items():
+ model = RandomForestRegressor()
+
+ mask = ~np.isnan(y)
+ model.fit(X=x_train[mask], y=y[mask])
+ y_pred[test_set_name][task_name] = model.predict(x_test)
+
+ scores = test_multi_task_benchmark_multiple_test_sets.evaluate(y_pred)
+ assert isinstance(scores, BenchmarkResults)