perf: speed up fsst decompression #2626

Merged
westonpace merged 5 commits into lance-format:main from broccoliSpicy:decompression_perf
Jul 25, 2024

@broccoliSpicy broccoliSpicy commented Jul 21, 2024

before:
[Screenshot 2024-07-21 at 4:24:44 PM: benchmark output, image not reproduced]

after:
[Screenshot 2024-07-21 at 4:28:49 PM: benchmark output, image not reproduced]

to reproduce, run the following in rust/lance-encoding/compression-algo/fsst:
cargo run --release --example benchmark

machine info:
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Linux 192 5.10.0-28-amd64 #1 SMP Debian 5.10.209-2 (2024-01-31) x86_64 GNU/Linux

codecov-commenter commented Jul 21, 2024

Codecov Report

Attention: Patch coverage is 98.22485% with 3 lines in your changes missing coverage. Please review.

Project coverage is 79.39%. Comparing base (02294a1) to head (2a40120).
Report is 1 commit behind head on main.

Files                                                   Patch %   Lines
...t/lance-encoding/compression-algo/fsst/src/fsst.rs   98.22%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2626      +/-   ##
==========================================
+ Coverage   79.35%   79.39%   +0.04%     
==========================================
  Files         213      213              
  Lines       62520    62706     +186     
  Branches    62520    62706     +186     
==========================================
+ Hits        49610    49784     +174     
- Misses       9996    10005       +9     
- Partials     2914     2917       +3     
Flag        Coverage Δ
unittests   79.39% <98.22%> (+0.04%) ⬆️


@broccoliSpicy (Contributor, Author) commented:

end-to-end test:

[Screenshot 2024-07-22 at 5:34:13 PM: end-to-end test output, image not reproduced]

to reproduce:

  1. disable dictionary encoding
  2. download the dataset:
     wget https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/data/CC-MAIN-2013-20/000_00000.parquet\?download\=true
  3. run the script below (set the LANCE_USE_FSST environment variable to enable fsst, unset it to disable)
import datetime
import os

import pyarrow.parquet as pq
from lance.file import LanceFileReader, LanceFileWriter

# Path to the Parquet file downloaded in step 2 (adjust for your machine)
parquet_file_path = "/home/x/data.parquet"
lance_file_path = "/home/x/lance-experiments/fineweb/output.lance"

# Convert the Parquet data to a Lance file
data = pq.read_table(parquet_file_path)
with LanceFileWriter(lance_file_path) as writer:
    writer.write_batch(data)

# Time a full read of the Parquet file
start = datetime.datetime.now()
tab = pq.read_table(parquet_file_path)
end = datetime.datetime.now()
elapsed = (end - start).total_seconds()
print(f"Parquet elapsed: {elapsed}s")

# Time a full read of the Lance file
start = datetime.datetime.now()
tab2 = LanceFileReader(lance_file_path).read_all().to_table()
end = datetime.datetime.now()
elapsed = (end - start).total_seconds()

# File sizes in MiB (1 MiB = 1048576 bytes), rounded down
lance_file_size_mib = os.path.getsize(lance_file_path) // 1048576
parquet_file_size_mib = os.path.getsize(parquet_file_path) // 1048576

# The Parquet file is the same in both runs; only the Lance file depends on LANCE_USE_FSST
label = "fsst" if os.getenv("LANCE_USE_FSST") is not None else "no fsst"
print(f"Parquet file size: {parquet_file_size_mib} MiB")
print(f"Lance file size({label}): {lance_file_size_mib} MiB")
print(f"Lv2({label}) elapsed: {elapsed}s")

# Both readers must return identical data
assert tab == tab2

print("Tables are equal")

@westonpace westonpace left a comment

Sorry for the delay in getting back to this. I don't yet fully understand FSST, so I don't understand exactly what this change does. However, the tests pass and FSST is still guarded by an env variable, so let's merge this and I will try to set some time aside next week to really dig through it in more detail.

@westonpace westonpace merged commit fbf7a4a into lance-format:main Jul 25, 2024
@broccoliSpicy (Contributor, Author) commented:

Ha, sorry that I didn't explain this PR. It is basically a Rust translation of the decompression routine from the original C++ implementation, and it uses Rust's ptr::write_unaligned as a performant way to do the stores.
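
For readers who have not seen the technique, here is a minimal sketch of how an FSST-style decoder can use ptr::write_unaligned. It is not the code from this PR: the names SymbolTable, ESCAPE and decompress are hypothetical, and the layout (symbols of 1 to 8 bytes packed into a u64, code 255 as an escape for a literal byte) is an assumption based on the general FSST design. The idea is to always store a full 8-byte word and then advance the output position by the real symbol length, so each code costs one store instead of a variable-length copy.

```rust
use std::ptr;

/// Escape code (assumed): the byte that follows it is copied as a literal.
const ESCAPE: u8 = 255;

/// Hypothetical symbol table: each symbol is 1 to 8 bytes long.
pub struct SymbolTable {
    /// Symbol bytes packed in native byte order, zero padded to 8 bytes.
    symbols: Vec<u64>,
    /// Real length of each symbol.
    lens: Vec<u8>,
}

impl SymbolTable {
    pub fn new(syms: &[&str]) -> Self {
        assert!(syms.len() < ESCAPE as usize);
        let mut symbols = Vec::with_capacity(syms.len());
        let mut lens = Vec::with_capacity(syms.len());
        for s in syms {
            let bytes = s.as_bytes();
            assert!(!bytes.is_empty() && bytes.len() <= 8);
            let mut buf = [0u8; 8];
            buf[..bytes.len()].copy_from_slice(bytes);
            // Native order, so storing the u64 reproduces `buf` in memory.
            symbols.push(u64::from_ne_bytes(buf));
            lens.push(bytes.len() as u8);
        }
        Self { symbols, lens }
    }
}

/// Decodes `input`; returns None on an unknown code or a truncated escape.
pub fn decompress(table: &SymbolTable, input: &[u8]) -> Option<Vec<u8>> {
    // Each code expands to at most 8 bytes; 8 extra bytes of slack keep the
    // wide store inside the allocation.
    let mut out = vec![0u8; input.len() * 8 + 8];
    let mut pos = 0usize;
    let mut i = 0usize;
    while i < input.len() {
        let code = input[i];
        if code == ESCAPE {
            out[pos] = *input.get(i + 1)?;
            pos += 1;
            i += 2;
        } else {
            let idx = code as usize;
            let sym = *table.symbols.get(idx)?;
            // SAFETY: pos <= 8 * i and i < input.len(), so pos + 8 <= out.len().
            // The destination is not 8-byte aligned in general, hence
            // write_unaligned rather than a plain pointer write.
            unsafe {
                ptr::write_unaligned(out.as_mut_ptr().add(pos).cast::<u64>(), sym);
            }
            // Advance by the real length; the next store overwrites the padding.
            pos += table.lens[idx] as usize;
            i += 1;
        }
    }
    out.truncate(pos);
    Some(out)
}
```

With a table built from the symbols "hello", " wor" and "ld", decoding the codes 0, 1, 2 followed by an escaped '!' yields the bytes of "hello world!". The price of the wide store is the over-allocated output buffer, which this sketch trims with truncate at the end.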

@broccoliSpicy broccoliSpicy deleted the decompression_perf branch October 16, 2024 16:42