
Commit

Update documentation
actions-user committed Feb 8, 2024
1 parent 93a81e5 commit 2766355
Showing 4 changed files with 121 additions and 1 deletion.
66 changes: 66 additions & 0 deletions _sources/spec.rst.txt
@@ -138,3 +138,69 @@ from the given URL, at an offset of 1000.

<script data-goatcounter="https://kerchunk.goatcounter.com/count"
async src="//gc.zgo.at/count.js"></script>

Parquet references
------------------

Since JSON is rather verbose, with enough chunks it is easy to end up with a references
file that is too big: slow to load and heavy on memory. Although the former can be
alleviated by compression (I recommend Zstd), the latter cannot. This becomes
particularly apparent during the combine phase, when many reference sets are loaded.
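The memory point can be illustrated with plain Python: compression shrinks a references file on disk, but the parsed dict occupies the same space in memory either way. This sketch uses ``zlib`` as a stand-in for Zstd, with generated version 1 style references; the bucket URL is made up.

```python
import json
import zlib

# Build a version-1 style references dict with many generated chunk entries
refs = {f"var/{i}": ["s3://bucket/data.nc", i * 1000, 1000] for i in range(10_000)}
text = json.dumps({"version": 1, "refs": refs}).encode()

packed = zlib.compress(text)  # much smaller on disk...
# ...but decompressing and parsing rebuilds the same large in-memory dict
restored = json.loads(zlib.decompress(packed))
```

The compressed bytes are far smaller, yet ``restored`` is the full dict again: compression helps only the on-disk/in-transit representation.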

The class `fsspec.implementations.reference.LazyReferenceMapper`_ provides an
alternative *implementation*, and its on-disk layout is effectively a new reference
spec, described here. The class itself has a dict mapper interface, just like
references rendered from JSON files, except that it assumes it is working on a
zarr dataset. This is because the references are split across several files, and
an array's shape/chunk information is used to determine which reference file
to load.

.. _fsspec.implementations.reference.LazyReferenceMapper: https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=lazyreference#fsspec.implementations.reference.LazyReferenceMapper

The following code

.. code-block:: python

    import fsspec.implementations.reference
    import zarr

    lz = fsspec.implementations.reference.LazyReferenceMapper.create("ref.parquet")
    z = zarr.open_group(lz, mode="w")
    d = z.create_dataset("name", shape=(1,))
    d[:] = 1
    g2 = z.create_group("deep")
    d = g2.create_dataset("name", shape=(1,))
    d[:] = 1

produces the files

.. code-block:: text

    ref.parquet/deep/name/refs.0.parq
    ref.parquet/name/refs.0.parq
    ref.parquet/.zmetadata

Here, ``.zmetadata`` holds the metadata of all subgroups/arrays (similar to
zarr "consolidated metadata"), with two top-level fields: "metadata" (a
dict[str, str] of all the zarr metadata key/values) and "record_size", an
integer set during ``.create()``.
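As an illustrative sketch of that layout (the actual zarr metadata values depend on the dataset being referenced):

```python
import json

# Hedged sketch of the .zmetadata structure described above: zarr metadata
# documents stored as JSON strings under "metadata", plus "record_size".
zmetadata = {
    "metadata": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "name/.zarray": json.dumps(
            {"shape": [1], "chunks": [1], "dtype": "<f8",
             "compressor": None, "fill_value": 0, "filters": None,
             "order": "C", "zarr_format": 2}
        ),
    },
    "record_size": 10000,
}
```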

Each parquet file contains the references for keys under its corresponding path.
For example, key "name/0" is the zeroth reference in "./name/refs.0.parq". If
there are multiple dimensions, normal C (row-major) ordering is used to find the
Nth reference. Each file holds up to "record_size" references (default 10000),
so reference N lives in file number N // record_size; for example, with the
default record_size, reference 25000 would be in "./name/refs.2.parq". Each file
is (for now) padded to record_size rows, but they compress really well.
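The lookup arithmetic above can be sketched in a few lines. ``locate_reference`` and its arguments are illustrative names for this sketch, not part of fsspec or the spec:

```python
def locate_reference(key, chunks_per_dim, record_size=10000):
    """Map a chunk key like "name/2.3" to (parquet file, row within it)."""
    path, _, idx = key.rpartition("/")
    coords = [int(c) for c in idx.split(".")]
    # Flatten the chunk coordinates in C (row-major) order
    n = 0
    for c, size in zip(coords, chunks_per_dim):
        n = n * size + c
    # record_size references per file, so file number is n // record_size
    return f"{path}/refs.{n // record_size}.parq", n % record_size
```

For example, ``locate_reference("deep/name/12345", [20000])`` lands in the second file, ``"deep/name/refs.1.parq"``, at row 2345.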

Each row of the parquet data contains the fields

.. code-block:: text

    path: optional str/categorical, remote location URL
    offset: int, start location of block
    size: int, number of bytes in block
    raw: optional bytes, binary data

If ``raw`` is populated, it is the data of the key. If ``path`` is
populated but ``size`` is 0, the reference is the whole of the indicated file.
Otherwise, it is a byte range within the indicated file. If both ``raw`` and
``path`` are NULL, the key does not exist.
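These rules can be summarised in a small decision function. This is a hedged sketch of how a consumer might classify one row; the name ``interpret_row`` and the returned tags are illustrative, not part of the spec:

```python
def interpret_row(path, offset, size, raw):
    """Classify one parquet reference row per the rules above."""
    if raw is not None:
        return ("inline", raw)              # raw bytes are the key's data
    if path is None:
        return ("missing", None)            # both raw and path NULL: no key
    if size == 0:
        return ("whole-file", path)         # the entire remote file
    return ("range", (path, offset, offset + size))  # byte block in the file
```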

We reserve the possibility to store small array data in .zmetadata instead
of creating a small/mostly empty parquet file for each.
1 change: 1 addition & 0 deletions index.html
@@ -161,6 +161,7 @@ <h2>Introduction<a class="headerlink" href="#introduction" title="Link to this h
<li class="toctree-l1"><a class="reference internal" href="spec.html">References specification</a><ul>
<li class="toctree-l2"><a class="reference internal" href="spec.html#version-0">Version 0</a></li>
<li class="toctree-l2"><a class="reference internal" href="spec.html#version-1">Version 1</a></li>
<li class="toctree-l2"><a class="reference internal" href="spec.html#parquet-references">Parquet references</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="beyond.html">Beyond Python</a></li>
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

53 changes: 53 additions & 0 deletions spec.html
@@ -57,6 +57,7 @@
<li class="toctree-l1 current"><a class="current reference internal" href="#">References specification</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#version-0">Version 0</a></li>
<li class="toctree-l2"><a class="reference internal" href="#version-1">Version 1</a></li>
<li class="toctree-l2"><a class="reference internal" href="#parquet-references">Parquet references</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="beyond.html">Beyond Python</a></li>
@@ -214,6 +215,58 @@ <h2>Version 1<a class="headerlink" href="#version-1" title="Link to this heading
from the given URL, at an offset of 1000.</p>
<script data-goatcounter="https://kerchunk.goatcounter.com/count"
async src="//gc.zgo.at/count.js"></script></section>
</section>


