✨ New in-memory dataset
francois-rozet committed Feb 28, 2023
1 parent 83fd155 commit f1a003a
Showing 3 changed files with 257 additions and 67 deletions.
107 changes: 89 additions & 18 deletions docs/tutorials/simulators.ipynb
@@ -248,17 +248,72 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Saving on disk\n",
- "\n",
- "If the simulator is fast or inexpensive, it is reasonable to generate pairs $(\\theta, x)$ on demand. Otherwise, the pairs have to be generated and stored on disk ahead of time. The [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) file format is commonly used for this purpose, as it was specifically designed to hold large amounts of numerical data.\n",
+ "## Loading in memory\n",
+ "\n",
- "The [`lampe.data`](lampe.data) module provides the [`H5Dataset`](lampe.data.H5Dataset) class to help load and store pairs $(\\theta, x)$ in HDF5 files. The [`H5Dataset.store`](lampe.data.H5Dataset.store) function takes an iterable of batched pairs $(\\theta, x)$ as input and stores them into a new HDF5 file. The iterable can be a precomputed list, a custom generator or even a `JointLoader` instance."
+ "If the simulator is fast or inexpensive, it is reasonable to generate pairs $(\\theta, x)$ on demand. Otherwise, the pairs have to be generated ahead of time. The [`lampe.data`](lampe.data) module provides the [`JointDataset`](lampe.data.JointDataset) class to interact with in-memory pairs $(\\theta, x)$. This is ideal when your data fits in RAM."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([ 0.6757, -0.3787, 0.9581])\n",
"tensor([0.3410, 0.7707])\n"
]
}
],
"source": [
"theta = prior.sample((1024,))\n",
"x = simulator(theta)\n",
"\n",
"dataset = lampe.data.JointDataset(theta, x)\n",
"\n",
"print(*dataset[42], sep='\\n')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1024/1024 [00:00<00:00, 282991.85it/s]\n"
]
}
],
"source": [
"for theta, x in tqdm(dataset):\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`JointDataset` can be wrapped in a [`DataLoader`](torch.utils.data.DataLoader) to enable batching and shuffling."
]
},
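The cells above rely on `JointDataset` behaving as a PyTorch map-style dataset: `len(dataset)` counts the stored pairs and `dataset[i]` returns the i-th pair $(\theta_i, x_i)$, which is exactly what a `DataLoader` needs for batching and shuffling. A minimal pure-Python sketch of that protocol (`PairDataset` is a hypothetical stand-in, not the `lampe` implementation):

```python
# Illustrative sketch of the map-style dataset protocol that
# JointDataset follows; PairDataset is a hypothetical stand-in.
class PairDataset:
    def __init__(self, theta, x):
        assert len(theta) == len(x), "theta and x must pair up one-to-one"
        self.theta = theta
        self.x = x

    def __len__(self):
        # Number of stored pairs.
        return len(self.theta)

    def __getitem__(self, i):
        # The i-th pair (theta_i, x_i), like dataset[42] above.
        return self.theta[i], self.x[i]

pairs = PairDataset([[0.1, 0.2], [0.3, 0.4]], [[1.0], [2.0]])
print(len(pairs))  # 2
print(pairs[1])    # ([0.3, 0.4], [2.0])
```

Any object implementing `__len__` and `__getitem__` this way can be consumed by a `DataLoader`.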
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving on disk\n",
"\n",
"If your data does not fit in RAM or you need to reuse it later, you may want to store it on disk. The [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) file format is commonly used for this purpose, as it was specifically designed to hold large amounts of numerical data. The [`lampe.data`](lampe.data) module provides the [`H5Dataset`](lampe.data.H5Dataset) class to help load and store pairs $(\\theta, x)$ in HDF5 files. The [`H5Dataset.store`](lampe.data.H5Dataset.store) function takes an iterable of batched pairs $(\\theta, x)$ as input and stores them into a new HDF5 file. The iterable can be a precomputed list, a custom generator or even a `JointLoader` instance."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
@@ -269,20 +324,20 @@
}
],
"source": [
- "data = []\n",
+ "pairs = []\n",
"\n",
"for _ in range(256):\n",
" theta = prior.sample((256,))\n",
" x = simulator(theta)\n",
"\n",
- " data.append((theta, x))\n",
+ " pairs.append((theta, x))\n",
"\n",
- "lampe.data.H5Dataset.store(data, 'data_0.h5', size=2**16)"
+ "lampe.data.H5Dataset.store(pairs, 'data_0.h5', size=2**16)"
]
},
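Since `H5Dataset.store` only requires an iterable of batched pairs, the precomputed list in the cell above could equally be a generator, so batches are produced lazily rather than all held in memory before storage. A rough pure-Python sketch of that pattern (`batches` and `fake_store` are hypothetical stand-ins, not part of `lampe`):

```python
# Sketch: store() consumes any iterable of (theta, x) batches, so a
# generator avoids materializing all batches at once.
# fake_store is a hypothetical stand-in for lampe.data.H5Dataset.store.
def batches(n_batches, batch_size):
    for _ in range(n_batches):
        theta = [[0.0] * 3 for _ in range(batch_size)]  # stand-in for prior.sample
        x = [[0.0] * 2 for _ in range(batch_size)]      # stand-in for simulator(theta)
        yield theta, x

def fake_store(iterable, size):
    stored = 0
    for theta, x in iterable:  # batches are pulled one at a time
        stored += len(theta)
        if stored >= size:     # stop once `size` pairs have been seen
            break
    return stored

print(fake_store(batches(256, 256), size=2**16))  # 65536
```

The real `store` writes each incoming batch to the HDF5 file as it arrives, which is why a `JointLoader` instance also works as the iterable.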
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -306,7 +361,7 @@
},
{
"cell_type": "code",
- "execution_count": 11,
+ "execution_count": 13,
"metadata": {},
"outputs": [
{
@@ -332,7 +387,7 @@
},
{
"cell_type": "code",
- "execution_count": 12,
+ "execution_count": 14,
"metadata": {},
"outputs": [
{
@@ -359,7 +414,7 @@
},
{
"cell_type": "code",
- "execution_count": 13,
+ "execution_count": 15,
"metadata": {},
"outputs": [
{
@@ -375,6 +430,22 @@
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, if your data fits in memory, you can load it at once with the `to_memory` method, which returns a `JointDataset`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"dataset = lampe.data.H5Dataset('data_0.h5').to_memory()"
]
},
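The idea behind `to_memory` can be sketched as reading every pair from the on-disk source once, then serving them from a plain in-memory sequence afterwards. A pure-Python illustration (`DiskSource` is a hypothetical stand-in for `H5Dataset`, not the `lampe` implementation):

```python
# Sketch of to_memory(): stream pairs from a (possibly on-disk)
# source once, then keep them in RAM for fast repeated access.
# DiskSource is a hypothetical stand-in for H5Dataset.
class DiskSource:
    def __init__(self, records):
        self._records = records  # pretend these live in an HDF5 file

    def __iter__(self):
        # Streaming access: one pair at a time, low memory footprint.
        yield from self._records

    def to_memory(self):
        # Load everything at once into an in-memory sequence.
        return list(self._records)

source = DiskSource([(0, 10), (1, 11), (2, 12)])
dataset = source.to_memory()
print(dataset[2])  # (2, 12)
```

After the one-time load, access no longer touches the disk, at the cost of holding the whole dataset in RAM.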
{
"cell_type": "markdown",
"metadata": {},
@@ -386,7 +457,7 @@
},
{
"cell_type": "code",
- "execution_count": 14,
+ "execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
@@ -395,7 +466,7 @@
},
{
"cell_type": "code",
- "execution_count": 15,
+ "execution_count": 18,
"metadata": {},
"outputs": [
{
@@ -421,7 +492,7 @@
},
{
"cell_type": "code",
- "execution_count": 16,
+ "execution_count": 19,
"metadata": {},
"outputs": [
{
@@ -456,7 +527,7 @@
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": 20,
"metadata": {},
"outputs": [
{
@@ -483,7 +554,7 @@
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 21,
"metadata": {},
"outputs": [
{
@@ -502,7 +573,7 @@
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": 22,
"metadata": {},
"outputs": [
{
Expand Down Expand Up @@ -537,7 +608,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.9.15"
+ "version": "3.9.16"
},
"vscode": {
"interpreter": {
