✨ New in-memory dataset
francois-rozet committed Feb 28, 2023
1 parent 83fd155 commit f1a003a
Showing 3 changed files with 257 additions and 67 deletions.
107 changes: 89 additions & 18 deletions docs/tutorials/simulators.ipynb
@@ -248,17 +248,72 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Saving on disk\n",
- "\n",
- "If the simulator is fast or inexpensive, it is reasonable to generate pairs $(\\theta, x)$ on demand. Otherwise, the pairs have to be generated and stored on disk ahead of time. The [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) file format is commonly used for this purpose, as it was specifically designed to hold large amounts of numerical data.\n",
+ "## Loading in memory\n",
+ "\n",
- "The [`lampe.data`](lampe.data) module provides the [`H5Dataset`](lampe.data.H5Dataset) class to help load and store pairs $(\\theta, x)$ in HDF5 files. The [`H5Dataset.store`](lampe.data.H5Dataset.store) function takes an iterable of batched pairs $(\\theta, x)$ as input and stores them into a new HDF5 file. The iterable can be a precomputed list, a custom generator or even a `JointLoader` instance."
+ "If the simulator is fast or inexpensive, it is reasonable to generate pairs $(\\theta, x)$ on demand. Otherwise, the pairs have to be generated ahead of time. The [`lampe.data`](lampe.data) module provides the [`JointDataset`](lampe.data.JointDataset) class to interact with in-memory pairs $(\\theta, x)$. This is ideal when your data fits in RAM."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([ 0.6757, -0.3787, 0.9581])\n",
"tensor([0.3410, 0.7707])\n"
]
}
],
"source": [
"theta = prior.sample((1024,))\n",
"x = simulator(theta)\n",
"\n",
"dataset = lampe.data.JointDataset(theta, x)\n",
"\n",
"print(*dataset[42], sep='\\n')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1024/1024 [00:00<00:00, 282991.85it/s]\n"
]
}
],
"source": [
"for theta, x in tqdm(dataset):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`JointDataset` can be wrapped in a [`DataLoader`](torch.utils.data.DataLoader) to enable batching and shuffling."
]
},
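The cells above rely on `JointDataset` behaving as a PyTorch map-style dataset: `len(dataset)` counts the stored pairs and `dataset[i]` returns the i-th pair $(\theta_i, x_i)$, which is exactly what a `DataLoader` needs for batching and shuffling. A minimal pure-Python sketch of that protocol (`PairDataset` is a hypothetical stand-in, not the `lampe` implementation):

```python
# Illustrative sketch of the map-style dataset protocol that
# JointDataset follows; PairDataset is a hypothetical stand-in.
class PairDataset:
    def __init__(self, theta, x):
        assert len(theta) == len(x), "theta and x must pair up one-to-one"
        self.theta = theta
        self.x = x

    def __len__(self):
        # Number of stored pairs.
        return len(self.theta)

    def __getitem__(self, i):
        # The i-th pair (theta_i, x_i), like dataset[42] above.
        return self.theta[i], self.x[i]

pairs = PairDataset([[0.1, 0.2], [0.3, 0.4]], [[1.0], [2.0]])
print(len(pairs))  # 2
print(pairs[1])    # ([0.3, 0.4], [2.0])
```

Any object implementing `__len__` and `__getitem__` this way can be consumed by a `DataLoader`.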
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving on disk\n",
"\n",
"If your data does not fit in RAM or you need to reuse it later, you may want to store it on disk. The [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) file format is commonly used for this purpose, as it was specifically designed to hold large amounts of numerical data. The [`lampe.data`](lampe.data) module provides the [`H5Dataset`](lampe.data.H5Dataset) class to help load and store pairs $(\\theta, x)$ in HDF5 files. The [`H5Dataset.store`](lampe.data.H5Dataset.store) function takes an iterable of batched pairs $(\\theta, x)$ as input and stores them into a new HDF5 file. The iterable can be a precomputed list, a custom generator or even a `JointLoader` instance."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
@@ -269,20 +324,20 @@
}
],
"source": [
- "data = []\n",
+ "pairs = []\n",
"\n",
"for _ in range(256):\n",
" theta = prior.sample((256,))\n",
" x = simulator(theta)\n",
"\n",
- " data.append((theta, x))\n",
+ " pairs.append((theta, x))\n",
"\n",
- "lampe.data.H5Dataset.store(data, 'data_0.h5', size=2**16)"
+ "lampe.data.H5Dataset.store(pairs, 'data_0.h5', size=2**16)"
]
},
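Since `H5Dataset.store` only requires an iterable of batched pairs, the precomputed list in the cell above could equally be a generator, so batches are produced lazily rather than all held in memory before storage. A rough pure-Python sketch of that pattern (`batches` and `fake_store` are hypothetical stand-ins, not part of `lampe`):

```python
# Sketch: store() consumes any iterable of (theta, x) batches, so a
# generator avoids materializing all batches at once.
# fake_store is a hypothetical stand-in for lampe.data.H5Dataset.store.
def batches(n_batches, batch_size):
    for _ in range(n_batches):
        theta = [[0.0] * 3 for _ in range(batch_size)]  # stand-in for prior.sample
        x = [[0.0] * 2 for _ in range(batch_size)]      # stand-in for simulator(theta)
        yield theta, x

def fake_store(iterable, size):
    stored = 0
    for theta, x in iterable:  # batches are pulled one at a time
        stored += len(theta)
        if stored >= size:     # stop once `size` pairs have been seen
            break
    return stored

print(fake_store(batches(256, 256), size=2**16))  # 65536
```

The real `store` writes each incoming batch to the HDF5 file as it arrives, which is why a `JointLoader` instance also works as the iterable.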
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -306,7 +361,7 @@
},
{
"cell_type": "code",
- "execution_count": 11,
+ "execution_count": 13,
"metadata": {},
"outputs": [
{
@@ -332,7 +387,7 @@
},
{
"cell_type": "code",
- "execution_count": 12,
+ "execution_count": 14,
"metadata": {},
"outputs": [
{
@@ -359,7 +414,7 @@
},
{
"cell_type": "code",
- "execution_count": 13,
+ "execution_count": 15,
"metadata": {},
"outputs": [
{
@@ -375,6 +430,22 @@
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, if your data fits in memory, you can load it at once with the `to_memory` method, which returns a `JointDataset`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"dataset = lampe.data.H5Dataset('data_0.h5').to_memory()"
]
},
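The idea behind `to_memory` can be sketched as reading every pair from the on-disk source once, then serving them from a plain in-memory sequence afterwards. A pure-Python illustration (`DiskSource` is a hypothetical stand-in for `H5Dataset`, not the `lampe` implementation):

```python
# Sketch of to_memory(): stream pairs from a (possibly on-disk)
# source once, then keep them in RAM for fast repeated access.
# DiskSource is a hypothetical stand-in for H5Dataset.
class DiskSource:
    def __init__(self, records):
        self._records = records  # pretend these live in an HDF5 file

    def __iter__(self):
        # Streaming access: one pair at a time, low memory footprint.
        yield from self._records

    def to_memory(self):
        # Load everything at once into an in-memory sequence.
        return list(self._records)

source = DiskSource([(0, 10), (1, 11), (2, 12)])
dataset = source.to_memory()
print(dataset[2])  # (2, 12)
```

After the one-time load, access no longer touches the disk, at the cost of holding the whole dataset in RAM.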
{
"cell_type": "markdown",
"metadata": {},
@@ -386,7 +457,7 @@
},
{
"cell_type": "code",
- "execution_count": 14,
+ "execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
@@ -395,7 +466,7 @@
},
{
"cell_type": "code",
- "execution_count": 15,
+ "execution_count": 18,
"metadata": {},
"outputs": [
{
@@ -421,7 +492,7 @@
},
{
"cell_type": "code",
- "execution_count": 16,
+ "execution_count": 19,
"metadata": {},
"outputs": [
{
@@ -456,7 +527,7 @@
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": 20,
"metadata": {},
"outputs": [
{
@@ -483,7 +554,7 @@
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 21,
"metadata": {},
"outputs": [
{
@@ -502,7 +573,7 @@
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": 22,
"metadata": {},
"outputs": [
{
Expand Down Expand Up @@ -537,7 +608,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.9.15"
+ "version": "3.9.16"
},
"vscode": {
"interpreter": {
