Merge pull request #358 from Qbeast-io/update-md-files
Update .md qb-spark files
jorgeMarin1 authored Jul 24, 2024
2 parents b685bd3 + 5221277 commit 47f271e
Showing 4 changed files with 28 additions and 24 deletions.
30 changes: 17 additions & 13 deletions docs/CloudStorages.md
@@ -10,19 +10,23 @@ We currently support Hadoop 2.7 and 3.2 (recommended), so feel free to use any o
Nevertheless, if you use Hadoop 2.7 you'll need to add some **extra** configurations depending on the provider, which you can find below.
Note that some versions may not work for a cloud provider, so please read carefully.

### Configs for Hadoop 2.7
<details><summary>AWS S3</summary>
There's no known working version of Hadoop 2.7 for AWS S3. However, you can try to use it.<br />
Remember to include the following option if using Hadoop 2.7:<br />
<code>--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem</code>
</details>
## Configs for Hadoop 2.7

### AWS S3

There's no known working version of Hadoop 2.7 for AWS S3. However, you can try to use it.

Remember to include the following option if using Hadoop 2.7:
``` --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem ```
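
For reference, the same option can also be set when building the Spark session. Below is a minimal, hedged sketch (the app name, the toy DataFrame, and the bucket name are placeholders, not part of the original docs):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: set the S3A filesystem implementation programmatically,
// equivalent to passing --conf on the spark-shell command line.
val spark = SparkSession.builder()
  .appName("qbeast-s3-sketch")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()

import spark.implicits._

// Toy data; the bucket name below is a placeholder.
val df = Seq((1, 10.0), (2, 20.0)).toDF("x", "y")
df.write
  .format("qbeast")
  .option("columnsToIndex", "x,y")
  .save("s3a://my-bucket/tmp/qbeast")
```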

### Azure Blob Storage

- You can use this provider with Hadoop 2.7. To do so, you need to change the Hadoop library to 2.7 (remember to change your Spark installation as well):
``` org.apache.hadoop:hadoop-azure:2.7.4 ```

- In addition, you must include the following config to use the _wasb_ filesystem:
``` --conf spark.hadoop.fs.AbstractFileSystem.wasb.impl=org.apache.hadoop.fs.azure.Wasb ```

<details><summary>Azure Blob Storage</summary>
- You can use this provider with Hadoop 2.7. To do so, you need to change the hadoop library to 2.7 (remember to change your Spark
installation as well):<br />
<code>org.apache.hadoop:hadoop-azure:2.7.4</code><br>
- In addition you must include the following config to use the _wasb_ filesystem:<br /><code>--conf spark.hadoop.fs.AbstractFileSystem.wasb.impl=org.apache.hadoop.fs.azure.Wasb</code>
</details>
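
The corresponding programmatic form, as a hedged sketch (container and account names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: register the wasb filesystem programmatically,
// equivalent to the --conf flag above.
val spark = SparkSession.builder()
  .appName("qbeast-wasb-sketch")
  .config("spark.hadoop.fs.AbstractFileSystem.wasb.impl",
          "org.apache.hadoop.fs.azure.Wasb")
  .getOrCreate()

// Container and account names below are placeholders.
val df = spark.read.format("qbeast")
  .load("wasb://container@account.blob.core.windows.net/tmp/qbeast")
```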

## AWS S3
Amazon Web Services S3 does not work with Hadoop 2.7. For this provider you'll need Hadoop 3.2.
@@ -66,4 +70,4 @@ $SPARK_HOME/bin/spark-shell \
--packages io.qbeast:qbeast-spark_2.12:0.3.2,\
io.delta:delta-core_2.12:1.2.0,\
org.apache.hadoop:hadoop-azure:3.2.0
```
6 changes: 3 additions & 3 deletions docs/FAQ.md
@@ -1,12 +1,12 @@
# FAQ: Frequently Asked Questions
<hr>
<hr />

Q - I get an error like this when first indexing with qbeast following the steps from Quickstart:
```
java.io.IOException: (null) entry in command string: null chmod 0644
```
A - You can find the solution [here](https://stackoverflow.com/questions/48010634/why-does-spark-application-fail-with-ioexception-null-entry-in-command-strin/48012285#48012285)
<hr>
<hr />

Q - I run into an "out of memory" error when indexing with qbeast format.

@@ -24,4 +24,4 @@ Try to `repartition` the `DataFrame` before writing in your Spark application:
```scala
df.repartition(200).write.format("qbeast").option("columnsToIndex", "x,y").save("/tmp/qbeast")
```
<hr>
<hr />
8 changes: 4 additions & 4 deletions docs/OTreeAlgorithm.md
@@ -8,7 +8,7 @@ The two primary goals of the **OTree algorithm** are
### Recursive Space Division
One of the most important techniques used to build a **multi-dimensional index** is through **recursive space division**; a bounded vector space initially containing all the data is **recursively divided** into **equal-sized**, **non-overlapping** subspaces, as long as they exceed the predefined **capacity**.

For a dataset indexed with `n` columns, the constructed index is an n-dimensional vector space composed of <img src="https://render.githubusercontent.com/render/math?math=2^n"> subspaces, or what we call `cubes`, with **non-overlapping** boundaries. Each cube can contain a predefined number of elements, `cap`, and exceeding it triggers **recursively dividing** the cube into child cubes by halving the ranges in all dimensions until the number of elements included no longer exceeds `cap`.
For a dataset indexed with `n` columns, the constructed index is an n-dimensional vector space composed of <img src="https://render.githubusercontent.com/render/math?math=2^n" /> subspaces, or what we call `cubes`, with **non-overlapping** boundaries. Each cube can contain a predefined number of elements, `cap`, and exceeding it triggers **recursively dividing** the cube into child cubes by halving the ranges in all dimensions until the number of elements included no longer exceeds `cap`.

Say that we use two columns, `x` and `y`, to build the index, and that the `cap` parameter for each cube is 2. The first image in the figure below is the **root cube**, containing more than two elements. The cube is split into four **equal-sized**, **non-overlapping** child cubes with one space division step, as shown in the middle image. Three of the four cubes are already within capacity as a result of the division.
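
To make the division rule concrete, here is a hedged sketch of a plain quadtree-style split. It illustrates recursive space division in general, not the exact OTree implementation (which handles overflow differently):

```scala
// Illustrative sketch only, not the actual qbeast-spark implementation.
case class Point(x: Double, y: Double)
case class Cube(xMin: Double, yMin: Double, xMax: Double, yMax: Double)

val cap = 2 // maximum number of elements per cube

// Split a cube into 2^n children (n = 2 here) by halving every dimension,
// recursing while a cube still holds more than `cap` elements.
def divide(cube: Cube, points: Seq[Point]): Seq[(Cube, Seq[Point])] =
  if (points.size <= cap) Seq((cube, points))
  else {
    val xMid = (cube.xMin + cube.xMax) / 2
    val yMid = (cube.yMin + cube.yMax) / 2
    val children = Seq(
      Cube(cube.xMin, cube.yMin, xMid, yMid),
      Cube(xMid, cube.yMin, cube.xMax, yMid),
      Cube(cube.xMin, yMid, xMid, cube.yMax),
      Cube(xMid, yMid, cube.xMax, cube.yMax))
    children.flatMap { child =>
      // Half-open boundaries keep the children non-overlapping:
      // each point belongs to exactly one child cube.
      val inside = points.filter(p =>
        p.x >= child.xMin && p.x < child.xMax &&
        p.y >= child.yMin && p.y < child.yMax)
      divide(child, inside)
    }
  }
```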

@@ -92,7 +92,7 @@ The rest of the page describes the theoretical details about the OTree, includin


<p align="center">
<img src="./images/proper-cube.png">
<img src="./images/proper-cube.png" />
</p>


@@ -113,7 +113,7 @@ The following image depicts the three possible states, and whether a cube is of


<p align="center">
<img src="./images/states-and-transitions.png">
<img src="./images/states-and-transitions.png" />
</p>


@@ -148,4 +148,4 @@ The following image depicts the three possible states, and whether a cube is of
- READ:
  - `f >= maxWeight`: don't read anything
  - `f < maxWeight`: read elements from the `payload` with `weight <= f` (see the sketch below)
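
A hedged sketch of this read rule, with `Block` and `Element` as simplified stand-ins for the real metadata classes:

```scala
// Hedged sketch of the READ rule above; Block and Element are simplified
// stand-ins, not the actual qbeast-spark classes.
case class Element(weight: Double, data: String)
case class Block(maxWeight: Double, payload: Seq[Element])

// For a sampling fraction f: skip the block entirely when f >= maxWeight,
// otherwise keep only the payload elements whose weight <= f.
def read(block: Block, f: Double): Seq[Element] =
  if (f >= block.maxWeight) Seq.empty
  else block.payload.filter(_.weight <= f)
```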


8 changes: 4 additions & 4 deletions docs/QbeastFormat.md
@@ -8,7 +8,7 @@ A **transaction log** in Delta Lake holds information about what objects compris


<p align="center">
<img src="./images/delta.png" width=600 height=500>
<img src="./images/delta.png" width="600" height="500" />
</p>
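
Since qbeast-spark writes on top of Delta Lake, you can peek at this log yourself. A hedged sketch for a spark-shell session (the table path is a placeholder):

```scala
// Hypothetical inspection of the Delta transaction log of a qbeast table;
// each JSON file under _delta_log is one commit. The path is a placeholder.
val log = spark.read.json("/tmp/qbeast/_delta_log/*.json")

// Rows carry different actions (add, remove, metaData, commitInfo);
// here we look at the files added to the table.
log.select("add.path", "add.size").where("add is not null").show(false)
```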


@@ -281,7 +281,7 @@ revisions.foreach(revision =>
```
> Note that **Revision ID number 0 is reserved for the Staging Area** (non-indexed files). This ensures compatibility with underlying table formats.

## Compaction (<v0.6.0)
## Compaction (&lt;v0.6.0)

> Compaction is **NOT available from version 0.6.0**. Although it is present, it calls the `optimize` command underneath.
> Read all the reasoning and changes on the [Qbeast Format 0.6.0](./QbeastFormat0.6.0.md) document and check the issue [#294](https://github.com/Qbeast-io/qbeast-spark/issues/294) for more info.
@@ -304,7 +304,7 @@ table.compact(0)
```


## Index Replication (<v0.6.0)
## Index Replication (&lt;v0.6.0)


> Analyze and Replication operations are **NOT available from version 0.6.0**. Read all the reasoning and changes on the [Qbeast Format 0.6.0](./QbeastFormat0.6.0.md) document.
