Auto Indexing #247

Merged: osopardo1 merged 36 commits into Qbeast-io:main-1.0.0 on Dec 15, 2023

osopardo1 (Member) commented on Dec 5, 2023

Description

Adds the new Auto Indexing functionality (#244).

Type of change

A new feature that enables indexing a DataFrame/Table without specifying columnsToIndex.

The feature is disabled by default. To use it, add the following configuration:

spark.qbeast.index.columnsToIndex.auto=true
spark.qbeast.index.columnsToIndex.auto.max=10
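
As a rough illustration of how these two settings behave, the sketch below reads them from a plain key/value map. Only the two keys come from this PR. `AutoIndexingConf`, `fromMap`, and the `fallbackMax` parameter are hypothetical names introduced here, and the fallback value is an assumption because the description does not state a default for `auto.max`.

```scala
// Hypothetical holder for the two settings; this class is not part of the PR.
final case class AutoIndexingConf(enabled: Boolean, maxColumns: Int)

object AutoIndexingConf {
  // The two configuration keys named in the PR description.
  val AutoKey = "spark.qbeast.index.columnsToIndex.auto"
  val MaxKey = "spark.qbeast.index.columnsToIndex.auto.max"

  // Auto indexing stays off unless the key is present and set to "true".
  // fallbackMax is an assumed default used when no maximum is configured.
  def fromMap(conf: Map[String, String], fallbackMax: Int): AutoIndexingConf =
    AutoIndexingConf(
      enabled = conf.get(AutoKey).exists(_.trim.toBoolean),
      maxColumns = conf.get(MaxKey).map(_.trim.toInt).getOrElse(fallbackMax))
}

object ConfExample {
  import AutoIndexingConf.{AutoKey, MaxKey}

  // Both keys set, as in the configuration shown above.
  val enabled: AutoIndexingConf =
    AutoIndexingConf.fromMap(Map(AutoKey -> "true", MaxKey -> "10"), fallbackMax = 3)

  // Nothing set: the feature is off and the assumed fallback applies.
  val disabled: AutoIndexingConf =
    AutoIndexingConf.fromMap(Map.empty, fallbackMax = 3)
}
```

With an empty map the feature stays off, which matches the "disabled by default" behaviour described above.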

Code

The idea is to have a ColumnsToIndexSelector interface in the core project with the following information:

trait ColumnsToIndexSelector[DATA] {

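  /**
   * The maximum number of columns to index.
   * @return
   *   the maximum number of columns this selector may choose
   */
  def MAX_COLUMNS_TO_INDEX: Int

  /**
   * Selects the columns to index for the given data, using
   * MAX_COLUMNS_TO_INDEX as the number of columns to index
   * @param data
   *   the data to index
   * @return
   *   A sequence with the names of the columns to index
   */
  def selectColumnsToIndex(data: DATA): Seq[String] =
    selectColumnsToIndex(data, MAX_COLUMNS_TO_INDEX)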

  /**
   * Selects the columns to index with a given number of columns to index
   * @param data
   *   the data to index
   * @param numColumnsToIndex
   *   the number of columns to index
   * @return
   *   A sequence with the names of the columns to index
   */
  def selectColumnsToIndex(data: DATA, numColumnsToIndex: Int): Seq[String]

}
  • MAX_COLUMNS_TO_INDEX: the maximum number of columns to index for this implementation (could be a configurable parameter).
  • selectColumnsToIndex: called once, when a new table is built, unless we later want to update the selection.
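
To make the contract concrete, here is a minimal sketch of one possible implementation. It restates the trait so the block compiles on its own. `Schema`, `FirstColumnsSelector`, and the "take the first N columns" rule are hypothetical illustrations, not the PCA-based default listed in the checklist below and not classes from this PR.

```scala
// Restated from the description above so the sketch is self-contained.
trait ColumnsToIndexSelector[DATA] {
  def MAX_COLUMNS_TO_INDEX: Int

  def selectColumnsToIndex(data: DATA): Seq[String] =
    selectColumnsToIndex(data, MAX_COLUMNS_TO_INDEX)

  def selectColumnsToIndex(data: DATA, numColumnsToIndex: Int): Seq[String]
}

// Hypothetical stand-in for a DataFrame schema: an ordered list of column names.
final case class Schema(columnNames: Seq[String])

// Hypothetical selector that keeps the first N columns in schema order.
// A real selector would rank columns by some statistic instead.
final class FirstColumnsSelector(maxColumns: Int)
    extends ColumnsToIndexSelector[Schema] {
  require(maxColumns > 0, "maxColumns must be positive")

  override val MAX_COLUMNS_TO_INDEX: Int = maxColumns

  override def selectColumnsToIndex(data: Schema, numColumnsToIndex: Int): Seq[String] =
    data.columnNames.take(math.max(numColumnsToIndex, 0))
}

object SelectorExample {
  val schema: Schema = Schema(Seq("user_id", "price", "event_time", "country"))
  val selector = new FirstColumnsSelector(maxColumns = 2)

  // The one-argument overload falls back to MAX_COLUMNS_TO_INDEX.
  val defaultSelection: Seq[String] = selector.selectColumnsToIndex(schema)

  // The two-argument overload lets the caller pick the count explicitly.
  val explicitSelection: Seq[String] = selector.selectColumnsToIndex(schema, 3)
}
```

Because only the two-argument method is abstract, an implementation supplies the selection logic once and inherits the one-argument overload from the trait.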

Checklist:

Here is the list of things you should do before submitting this pull request:

  • Add Skeleton for extensible AutoIndexing code (support for different algorithms and engines)
  • Add default PCA code
  • Add tests.
  • Add comments to the code (make it easier for the community!).
  • Change the documentation.
  • Your branch is updated to the main-1.0.0 branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

This should be tested individually (each AutoIndexer should have unit tests) as well as integrated with the Spark API (DataFrame reads and writes).
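
A per-selector unit test could check a few general expectations, sketched below without any test framework. The trait is restated so the block stands alone. `SelectorContract`, `TakeFirst`, and `IgnoresLimit` are hypothetical names, and the expectations themselves (no more columns than requested, no duplicates, only existing columns) are assumptions about what a well-behaved selector should do, not requirements stated in this PR.

```scala
// Restated from the PR description so the sketch compiles on its own.
trait ColumnsToIndexSelector[DATA] {
  def MAX_COLUMNS_TO_INDEX: Int
  def selectColumnsToIndex(data: DATA): Seq[String] =
    selectColumnsToIndex(data, MAX_COLUMNS_TO_INDEX)
  def selectColumnsToIndex(data: DATA, numColumnsToIndex: Int): Seq[String]
}

object SelectorContract {
  // Returns a description of every violated expectation; empty means all checks passed.
  def violations[DATA](
      selector: ColumnsToIndexSelector[DATA],
      data: DATA,
      availableColumns: Set[String],
      numColumnsToIndex: Int): Seq[String] = {
    val selected = selector.selectColumnsToIndex(data, numColumnsToIndex)
    val checks = Seq(
      (selected.size <= numColumnsToIndex) -> "selected more columns than requested",
      (selected.distinct.size == selected.size) -> "selected a column more than once",
      selected.forall(availableColumns.contains) -> "selected a column that is not in the data",
      (selector.selectColumnsToIndex(data) ==
        selector.selectColumnsToIndex(data, selector.MAX_COLUMNS_TO_INDEX)) ->
        "one-argument overload disagrees with MAX_COLUMNS_TO_INDEX")
    checks.collect { case (false, message) => message }
  }
}

// Hypothetical selectors over a plain list of column names, used only to exercise the checks.
object TakeFirst extends ColumnsToIndexSelector[Seq[String]] {
  val MAX_COLUMNS_TO_INDEX: Int = 2
  def selectColumnsToIndex(data: Seq[String], numColumnsToIndex: Int): Seq[String] =
    data.take(numColumnsToIndex)
}

// Deliberately faulty: returns every column regardless of the requested count.
object IgnoresLimit extends ColumnsToIndexSelector[Seq[String]] {
  val MAX_COLUMNS_TO_INDEX: Int = 2
  def selectColumnsToIndex(data: Seq[String], numColumnsToIndex: Int): Seq[String] = data
}
```

Returning the list of violations, rather than throwing on the first one, keeps the helper usable from whichever test framework the project prefers.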

…ipping

Add Delta Data Skipping on Staging Area
@osopardo1 osopardo1 changed the base branch from main to main-1.0.0 December 5, 2023 08:09
@osopardo1 osopardo1 changed the title Auto Indexing Auto Indexing [1.0.0] Dec 5, 2023
@osopardo1 osopardo1 added 1.0.0 and removed 1.0.0 labels Dec 5, 2023
@osopardo1 osopardo1 changed the title Auto Indexing [1.0.0] Auto Indexing Dec 5, 2023
codecov bot commented Dec 12, 2023

Codecov Report

Attention: 17 lines in your changes are missing coverage. Please review.

Comparison is base (18b4534) 91.76% compared to head (e423a66) 91.02%.
Report is 52 commits behind head on main-1.0.0.

Files                                                  Patch %  Lines
.../main/scala/io/qbeast/spark/delta/IndexFiles.scala  94.20%   4 Missing ⚠️
...cala/io/qbeast/spark/index/OTreeDataAnalyzer.scala  94.73%   3 Missing ⚠️
...cala/io/qbeast/core/model/CubeDomainsBuilder.scala  97.53%   2 Missing ⚠️
...o/qbeast/spark/delta/writer/RollupDataWriter.scala  98.44%   2 Missing ⚠️
...re/src/main/scala/io/qbeast/core/model/Block.scala  94.44%   1 Missing ⚠️
src/main/scala/io/qbeast/spark/QbeastTable.scala       50.00%   1 Missing ⚠️
...ala/io/qbeast/spark/delta/writer/BlockWriter.scala  96.96%   1 Missing ⚠️
...in/scala/io/qbeast/spark/delta/writer/Rollup.scala  96.00%   1 Missing ⚠️
...la/io/qbeast/spark/index/query/QueryExecutor.scala  90.90%   1 Missing ⚠️
src/main/scala/io/qbeast/spark/utils/Params.scala      50.00%   1 Missing ⚠️

Additional details and impacted files
@@              Coverage Diff               @@
##           main-1.0.0     #247      +/-   ##
==============================================
- Coverage       91.76%   91.02%   -0.75%     
==============================================
  Files              91       95       +4     
  Lines            2258     2528     +270     
  Branches          167      323     +156     
==============================================
+ Hits             2072     2301     +229     
- Misses            186      227      +41     


@osopardo1 osopardo1 marked this pull request as ready for review December 13, 2023 08:54
osopardo1 (Member, Author) commented

Codecov is failing, but the tests pass and the PR is ready to be reviewed.

docs/ColumnsToIndexSelector.md: two review threads, both outdated and resolved
@osopardo1 osopardo1 merged commit dd71a00 into Qbeast-io:main-1.0.0 Dec 15, 2023
2 of 3 checks passed
fpj pushed a commit that referenced this pull request Mar 27, 2024
Main added features:
- Rollup
- Domain-driven appends
- Auto-indexing
- Multi-block file

Additionally, this merge performs the following:
- Updates documentation according to the new version.
- Removes unnecessary classes (e.g., CubeInfo).
- Resolves inconsistencies with the Auto Indexing #247 and CREATE EXTERNAL TABLE without OPTIONS #248 changes.
- compact() is no longer necessary, but we are leaving it to avoid additional changes to the staging area. We have issue 294 open to resolve it later.

---------

Co-authored-by: Alexey Akimov
Co-authored-by: Jiawei
Co-authored-by: osopardo1
Co-authored-by: SrTangente
Co-authored-by: jiawei
3 participants