
Commit

Update documentation
actions-user committed Feb 8, 2024
1 parent 93a81e5 commit 2766355
Showing 4 changed files with 121 additions and 1 deletion.
66 changes: 66 additions & 0 deletions _sources/spec.rst.txt
@@ -138,3 +138,69 @@ from the given URL, at an offset of 1000.

<script data-goatcounter="https://kerchunk.goatcounter.com/count"
async src="//gc.zgo.at/count.js"></script>

Parquet references
------------------

Since JSON is rather verbose, with enough chunks it is easy to end up with a references
file that is too big: slow to load and heavy on memory. Although the former can be
alleviated by compression (I recommend Zstd), the latter cannot. This becomes
particularly apparent during the combine phase, when many reference sets are loaded.
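The memory point can be illustrated with plain Python: compression shrinks a references file on disk, but the parsed dict occupies the same space in memory either way. This sketch uses ``zlib`` as a stand-in for Zstd, with generated version 1 style references; the bucket URL is made up.

```python
import json
import zlib

# Build a version-1 style references dict with many generated chunk entries
refs = {f"var/{i}": ["s3://bucket/data.nc", i * 1000, 1000] for i in range(10_000)}
text = json.dumps({"version": 1, "refs": refs}).encode()

packed = zlib.compress(text)  # much smaller on disk...
# ...but decompressing and parsing rebuilds the same large in-memory dict
restored = json.loads(zlib.decompress(packed))
```

The compressed bytes are far smaller, yet ``restored`` is the full dict again: compression helps only the on-disk/in-transit representation.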

The class `fsspec.implementations.reference.LazyReferenceMapper`_ provides an
alternative *implementation*, and its on-disk layout is effectively a new reference
spec, described here. The class itself has a dict mapper interface, just like
references rendered from JSON files, except that it assumes it is working on a
zarr dataset. This is because the references are split across several files, and
an array's shape/chunk information is used to determine which reference file
to load.

.. _fsspec.implementations.reference.LazyReferenceMapper: https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=lazyreference#fsspec.implementations.reference.LazyReferenceMapper

The following code

.. code-block:: python

    import fsspec.implementations.reference
    import zarr

    lz = fsspec.implementations.reference.LazyReferenceMapper.create("ref.parquet")
    z = zarr.open_group(lz, mode="w")
    d = z.create_dataset("name", shape=(1,))
    d[:] = 1
    g2 = z.create_group("deep")
    d = g2.create_dataset("name", shape=(1,))
    d[:] = 1

produces the files

.. code-block:: text

    ref.parquet/deep/name/refs.0.parq
    ref.parquet/name/refs.0.parq
    ref.parquet/.zmetadata

Here, ``.zmetadata`` holds the metadata of all subgroups/arrays (similar to
zarr "consolidated metadata"), with two top-level fields: "metadata" (a
dict[str, str] of all the zarr metadata key/values) and "record_size", an
integer set during ``.create()``.
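As an illustrative sketch of that layout (the actual zarr metadata values depend on the dataset being referenced):

```python
import json

# Hedged sketch of the .zmetadata structure described above: zarr metadata
# documents stored as JSON strings under "metadata", plus "record_size".
zmetadata = {
    "metadata": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "name/.zarray": json.dumps(
            {"shape": [1], "chunks": [1], "dtype": "<f8",
             "compressor": None, "fill_value": 0, "filters": None,
             "order": "C", "zarr_format": 2}
        ),
    },
    "record_size": 10000,
}
```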

Each parquet file contains the references for keys under its corresponding path.
For example, key "name/0" is the zeroth reference in "./name/refs.0.parq". If
there are multiple dimensions, normal C (row-major) ordering is used to find the
Nth reference. Each file holds up to "record_size" references (default 10000),
so reference N lives in file number N // record_size; for example, with the
default record_size, reference 25000 would be in "./name/refs.2.parq". Each file
is (for now) padded to record_size rows, but they compress really well.
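The lookup arithmetic above can be sketched in a few lines. ``locate_reference`` and its arguments are illustrative names for this sketch, not part of fsspec or the spec:

```python
def locate_reference(key, chunks_per_dim, record_size=10000):
    """Map a chunk key like "name/2.3" to (parquet file, row within it)."""
    path, _, idx = key.rpartition("/")
    coords = [int(c) for c in idx.split(".")]
    # Flatten the chunk coordinates in C (row-major) order
    n = 0
    for c, size in zip(coords, chunks_per_dim):
        n = n * size + c
    # record_size references per file, so file number is n // record_size
    return f"{path}/refs.{n // record_size}.parq", n % record_size
```

For example, ``locate_reference("deep/name/12345", [20000])`` lands in the second file, ``"deep/name/refs.1.parq"``, at row 2345.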

Each row of the parquet data contains the fields

.. code-block:: text

    path: optional str/categorical, remote location URL
    offset: int, start location of block
    size: int, number of bytes in block
    raw: optional bytes, binary data

If ``raw`` is populated, it is the data of the key. If ``path`` is
populated but ``size`` is 0, the reference is the whole of the indicated file.
Otherwise, it is a byte range within the indicated file. If both ``raw`` and
``path`` are NULL, the key does not exist.
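These rules can be summarised in a small decision function. This is a hedged sketch of how a consumer might classify one row; the name ``interpret_row`` and the returned tags are illustrative, not part of the spec:

```python
def interpret_row(path, offset, size, raw):
    """Classify one parquet reference row per the rules above."""
    if raw is not None:
        return ("inline", raw)              # raw bytes are the key's data
    if path is None:
        return ("missing", None)            # both raw and path NULL: no key
    if size == 0:
        return ("whole-file", path)         # the entire remote file
    return ("range", (path, offset, offset + size))  # byte block in the file
```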

We reserve the possibility to store small array data in .zmetadata instead
of creating a small/mostly empty parquet file for each.
1 change: 1 addition & 0 deletions index.html
@@ -161,6 +161,7 @@ <h2>Introduction<a class="headerlink" href="#introduction" title="Link to this h
<li class="toctree-l1"><a class="reference internal" href="spec.html">References specification</a><ul>
<li class="toctree-l2"><a class="reference internal" href="spec.html#version-0">Version 0</a></li>
<li class="toctree-l2"><a class="reference internal" href="spec.html#version-1">Version 1</a></li>
<li class="toctree-l2"><a class="reference internal" href="spec.html#parquet-references">Parquet references</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="beyond.html">Beyond Python</a></li>
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

53 changes: 53 additions & 0 deletions spec.html
@@ -57,6 +57,7 @@
<li class="toctree-l1 current"><a class="current reference internal" href="#">References specification</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#version-0">Version 0</a></li>
<li class="toctree-l2"><a class="reference internal" href="#version-1">Version 1</a></li>
<li class="toctree-l2"><a class="reference internal" href="#parquet-references">Parquet references</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="beyond.html">Beyond Python</a></li>
@@ -214,6 +215,58 @@ <h2>Version 1<a class="headerlink" href="#version-1" title="Link to this heading
from the given URL, at an offset of 1000.</p>
<script data-goatcounter="https://kerchunk.goatcounter.com/count"
async src="//gc.zgo.at/count.js"></script></section>
</section>


