
Commit 1900a30

youngale-pingthings, justinGilmer, davidkonigsberg, jleifnf, and andrewchambers authored
Merge Arrow into Main for Release (#37)
* Threadpool executor (#22)
  * Release v5.15.0
  * Update protobuf to v4.22.3
  * Add threaded streamset calls using concurrent.futures.ThreadPoolExecutor
  * Blacken code
  * Update for failing tests
  * Ignore flake8 as part of testing; pytest-flake8 seems to have issues with the later versions of flake8 (tholo/pytest-flake8#92)
  * Update .gitignore
  * Update ignore and remove extra print.
  * Remove idea folder (pycharm)

  Co-authored-by: David Konigsberg <72822263+davidkonigsberg@users.noreply.github.com>
  Co-authored-by: Jeff Lin <42981468+jleifnf@users.noreply.github.com>

* Threaded arrow (#23)
  * Release v5.15.0
  * Update protobuf to v4.22.3
  * Add threaded streamset calls using concurrent.futures.ThreadPoolExecutor
  * Blacken code
  * Update for failing tests
  * Ignore flake8 as part of testing; pytest-flake8 seems to have issues with the later versions of flake8 (tholo/pytest-flake8#92)
  * Update .gitignore
  * Update proto definitions.
  * Update endpoint to support arrow methods
  * Support arrow endpoints
  * Additional arrow updates
  * Update transformers, add polars conversion
  * Update .gitignore
  * Update ignore and remove extra print.
  * Remove idea folder (pycharm)
  * Update requirements.txt
  * Update btrdb/transformers.py
  * Update the way to check for arrow-enabled btrdb. This has not been "turned on" yet, since we don't know the version number this will be enabled for. The method is currently commented out, but can be re-enabled pretty easily.
  * Use IPC streams to send the arrow bytes for insert (see the sketch after this message). Instead of writing out feather files to an `io.BytesIO` stream and then sending the feather files over the wire, this creates a buffered output stream and then sends that data back as bytes to btrdb.
  * Create arrow-specific stream methods.
  * Update test conn object to support minor version
  * Update tests and migrate arrow code.
  * Arrow and standard streamset insert
  * Create basic arrow to dataframe transformer
  * Support multirawvalues, arrow transformers
  * Multivalue arrow queries, in progress
  * Update stream filter to properly filter for sampling frequency
  * Update arrow values queries for multivalues
  * Update param passing for sampling frequency
  * Update index passing, and ignore depth
  * Benchmark raw values queries for arrow and current api
  * Add aligned windows and run func
  * Streamset read benchmarks (WIP). In addition: update streamset.count to support the `precise` boolean flag.
  * Update mock return value for versionMajor
  * In-progress validation of stream benchmarks

  Co-authored-by: David Konigsberg <72822263+davidkonigsberg@users.noreply.github.com>
  Co-authored-by: Jeff Lin <42981468+jleifnf@users.noreply.github.com>

* Add 3.10 python to the testing matrix (#21)
  * Add 3.10 python to the testing matrix
  * Fix yaml parsing
  * Update requirements to support 3.10, using the pip-tools `pip-compile` cli tool to generate requirements.txt files from the updated pyproject.toml file
  * Include pyproject.toml with basic features to support proper extra deps
  * Support different ways to install btrdb from pip: `btrdb, btrdb[data], btrdb[all], btrdb[testing], btrdb[ray]`
  * Update transformers.py to build up a numpy array when the subarrays are not the same size (number of entries); this converts the main array's dtype to `object`, and tests still pass with this change
  * Recompile the btrdb proto files with latest protobuf and grpc plugins
  * Create multiple requirements.txt files for easier updating in the future, as well as a locked version with pinned dependencies
  * Ignore protoc-generated flake errors
  * Update test requirements
  * Include pre-commit and setup.
  * Pre-commit lints.
  * Update pre-commit.yaml: add staging to pre-commit checks

* Fix missing logging import, rerun pre-commit (#24)
* Add basic doc string to endpoint object (#25)
* Update benchmark scripts.
* Multistream read bench, insert bench (#26)
  * Fix multistream endpoint bugs: the streamset was passing the incorrect params to the endpoint, and the endpoint does not return a `version` in its response, just `stat` and `arrowBytes`. Params have been updated and a NoneType is passed around to ignore the lack of version info, which lets us use the same logic for all bytes decoding.
  * Add multistream benchmark methods for timesnap and no timesnap.
* Add insert benchmarking methods (#27). Benchmarking methods added for:
  * stream inserts using tuples of time, value data
  * stream inserts using pyarrow tables of timestamps, value columns
  * streamset inserts using a dict map of streamset stream uuids, and lists of tuples of time, value data
  * streamset inserts using a dict map of streamset stream uuids, and pyarrow tables of timestamps, values.
* Fix arrow inserts (#28)
  * Include nullable false in pyarrow schema inserts; this was the only difference in the schemas between go and python. Also using a bytesIO stream to act as the sink for the ipc bytes.
  * Start integration test suite
  * Add more streamset integration tests.
  * Add support for authenticated requests without encryption.
* Optimize logging calls (#30). Previously, the debug logging in the api would create the f-strings no matter whether logging.DEBUG was the current log level or not. This can impact performance, especially for benchmarking. Now, a cached IS_DEBUG flag is created for the stream operations, and in other locations the logger.isEnabledFor boolean is checked. Note that in stream.py this call is only executed once, and the result is cached for the rest of the logic.
* Add more arrow tests and minor refactoring.
* More integration test cases
* Restructure tests.
* Mark new failing tests as expected failures for now.
* Disable gzip compression, it is very slow.
* Reenable test, server has been fixed.
* Update pandas testing and fix flake8 issues (#31)
  * Update stream logic for unpacking arrow tables, update integration tests.
  * Add __init__.py for integration tests.
  * Add additional tests for arrow methods vs their old api counterparts.
* Add tests for timesnap boundary conditions. (#32)
  * Add more integration tests.
  * Add additional integration tests, modify the name_callable ability of the arrow_values.
  * Remove extraneous prints.
  * Include retry logic.
  * Update statpoint order in arrow, fix some bugs with the arrow methods.
  * Update testing to account for NaNs.
  * Update github action versions.
  * Update tests, add in a test for duplicate values.
  * Remove empty test, remove extraneous prints

  Co-authored-by: andrewchambers <andrewchamberss@gmail.com>

* Update docs for arrow (#35)
  * Update docs, add in final enhanced edits.
* Only enable arrow-endpoints when version >= 5.30 (#36). Once we have a v5.30 tag of the server with arrow/multistream, we can merge this and complete the ticket.
* Update arrow notes, small doc changes. (#38)

Co-authored-by: Justin Gilmer <justin@pingthings.io>
Co-authored-by: David Konigsberg <72822263+davidkonigsberg@users.noreply.github.com>
Co-authored-by: Jeff Lin <42981468+jleifnf@users.noreply.github.com>
Co-authored-by: Andrew Chambers <andrew@pingthings.io>
Co-authored-by: andrewchambers <andrewchamberss@gmail.com>
1 parent 74e7567 commit 1900a30
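The "Use IPC streams to send the arrow bytes for insert" and "Include nullable false in pyarrow schema inserts" items above describe how tables are serialized in memory before being handed to the gRPC endpoint. Below is a minimal sketch of that idea using only public pyarrow APIs; the "time"/"value" column names and the nanosecond timestamp unit are illustrative assumptions, not the client's confirmed wire schema.

import pyarrow as pa

def table_to_ipc_bytes(table: pa.Table) -> bytes:
    """Serialize a pyarrow Table to Arrow IPC stream bytes held in memory."""
    sink = pa.BufferOutputStream()           # in-memory sink instead of a feather file
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)            # write the whole table as an IPC stream
    return sink.getvalue().to_pybytes()      # raw bytes to hand to the insert endpoint

# Non-nullable fields mirror the "nullable false" schema note above (assumed layout).
schema = pa.schema([
    pa.field("time", pa.timestamp("ns"), nullable=False),
    pa.field("value", pa.float64(), nullable=False),
])
table = pa.table({"time": [1, 2, 3], "value": [1.0, 2.0, 3.0]}, schema=schema)
ipc_bytes = table_to_ipc_bytes(table)

On the receiving side the same bytes can be turned back into a table with pa.ipc.open_stream(ipc_bytes).read_all(), which is why an in-memory buffered stream is enough and no feather file needs to be written.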


62 files changed (+7396, -5018 lines)

.github/workflows/pre-commit.yaml

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
name: pre-commit

on:
  pull_request:
    branches:
      - master
      - staging
    types:
      - opened
      - reopened
      - ready_for_review
      - synchronize

env:
  SKIP: pytest-check

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          fetch-depth: 0 # get full git history
      - uses: actions/setup-python@v3
        with:
          cache: 'pip'
      - name: Install pre-commit
        run: |
          pip install pre-commit
      - name: Get changed files
        id: changed-files
        uses: tj-actions/changed-files@v21
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Run pre-commit
        uses: pre-commit/action@v2.0.3
        with:
          extra_args: --files ${{ steps.changed-files.outputs.all_changed_files }}

.github/workflows/release.yaml

Lines changed: 6 additions & 6 deletions
@@ -14,13 +14,13 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
-        python-version: [3.7, 3.8, 3.9]
+        python-version: [3.7, 3.8, 3.9, '3.10']
         os: [ubuntu-latest, macos-latest, windows-latest]

     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
       - name: Set up Python ${{ matrix.python-version }} ${{ matrix.os }}
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v4
         with:
           python-version: ${{ matrix.python-version }}
       - name: Install dependencies
@@ -39,7 +39,7 @@ jobs:
     if: startsWith(github.ref, 'refs/tags/')
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
       - name: Create Release
         id: create_release
         uses: actions/create-release@v1
@@ -59,9 +59,9 @@ jobs:
     runs-on: ubuntu-latest

     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
       - name: Set up Python
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v4
         with:
           python-version: '3.8'
       - name: Install dependencies

.gitignore

Lines changed: 10 additions & 0 deletions
@@ -118,3 +118,13 @@ dmypy.json

 # Pyre type checker
 .pyre/
+
+# arrow parquet files
+*.parquet
+
+.idea
+.idea/misc.xml
+.idea/vcs.xml
+.idea/inspectionProfiles/profiles_settings.xml
+.idea/inspectionProfiles/Project_Default.xml
+/.idea/

.pre-commit-config.yaml

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
        exclude: ^(setup.cfg|btrdb/grpcinterface)
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black-jupyter
        args: [--line-length=88]
        exclude: btrdb/grpcinterface/.*\.py
  - repo: https://github.com/pycqa/isort
    rev: 5.11.5
    hooks:
      - id: isort
        name: isort (python)
        args: [--profile=black, --line-length=88]
        exclude: btrdb/grpcinterface/.*\.py
  - repo: https://github.com/PyCQA/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
        args: [--config=setup.cfg]
        exclude: ^(btrdb/grpcinterface|tests|setup.py|btrdb4|docs|benchmarks)
  - repo: local
    hooks:
      - id: pytest-check
        name: pytest-check
        entry: pytest
        language: system
        pass_filenames: false
        always_run: true

MANIFEST.in

Lines changed: 1 addition & 1 deletion
@@ -20,4 +20,4 @@ global-exclude *.py[co]
 global-exclude .ipynb_checkpoints
 global-exclude .DS_Store
 global-exclude .env
-global-exclude .coverage.*
+global-exclude .coverage.*
(The visible text is unchanged; this is a whitespace-only change, most likely the trailing newline added at end of file by the new pre-commit hooks.)
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
from time import perf_counter
from typing import Dict, List, Tuple, Union

import pyarrow

import btrdb


def time_stream_insert(
    stream: btrdb.stream.Stream,
    data: List[Tuple[int, float]],
    merge_policy: str = "never",
) -> Dict[str, Union[int, float, str]]:
    """Insert raw data into a single stream, where data is a List of tuples of int64 timestamps and float64 values.

    Parameters
    ----------
    stream : btrdb.stream.Stream, required
        The stream to insert data into.
    data : List[Tuple[int, float]], required
        The data to insert into stream.
    merge_policy : str, optional, default = 'never'
        How should the platform handle duplicated data?
        Valid policies:
        `never`: the default, no points are merged
        `equal`: points are deduplicated if the time and value are equal
        `retain`: if two points have the same timestamp, the old one is kept
        `replace`: if two points have the same timestamp, the new one is kept
    """
    prev_ver = stream.version()
    tic = perf_counter()
    new_ver = stream.insert(data, merge=merge_policy)
    toc = perf_counter()
    run_time = toc - tic
    n_points = len(data)
    result = {
        "uuid": stream.uuid,
        "previous_version": prev_ver,
        "new_version": new_ver,
        "points_to_insert": n_points,
        "total_time_seconds": run_time,
        "merge_policy": merge_policy,
    }
    return result


def time_stream_arrow_insert(
    stream: btrdb.stream.Stream, data: pyarrow.Table, merge_policy: str = "never"
) -> Dict[str, Union[int, float, str]]:
    """Insert raw data into a single stream, where data is a pyarrow Table of timestamps and float values.

    Parameters
    ----------
    stream : btrdb.stream.Stream, required
        The stream to insert data into.
    data : pyarrow.Table, required
        The table of data to insert into stream.
    merge_policy : str, optional, default = 'never'
        How should the platform handle duplicated data?
        Valid policies:
        `never`: the default, no points are merged
        `equal`: points are deduplicated if the time and value are equal
        `retain`: if two points have the same timestamp, the old one is kept
        `replace`: if two points have the same timestamp, the new one is kept
    """
    prev_ver = stream.version()
    tic = perf_counter()
    new_ver = stream.arrow_insert(data, merge=merge_policy)
    toc = perf_counter()
    run_time = toc - tic
    n_points = data.num_rows
    result = {
        "uuid": stream.uuid,
        "previous_version": prev_ver,
        "new_version": new_ver,
        "points_to_insert": n_points,
        "total_time_seconds": run_time,
        "merge_policy": merge_policy,
    }
    return result
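A hypothetical driver for the two benchmark helpers above, included only as a sketch: the connection string, API key, stream UUID, and the two-column ("time", "value") table layout are placeholder assumptions rather than values taken from this commit.

import pyarrow as pa

import btrdb

# Placeholders: swap in a real endpoint, API key, and stream UUID.
# Assumes time_stream_insert and time_stream_arrow_insert (defined above) are in scope.
conn = btrdb.connect("api.example.pingthings.io:4411", apikey="MY_API_KEY")
stream = conn.stream_from_uuid("00000000-0000-0000-0000-000000000000")

# Benchmark the tuple-based insert path.
points = [(1_600_000_000_000_000_000 + i * 1_000_000, float(i)) for i in range(10_000)]
print(time_stream_insert(stream, points, merge_policy="never"))

# Benchmark the Arrow-based insert path over the same data (assumed column layout).
table = pa.table({
    "time": pa.array([t for t, _ in points], type=pa.timestamp("ns")),
    "value": pa.array([v for _, v in points], type=pa.float64()),
})
print(time_stream_arrow_insert(stream, table, merge_policy="never"))

Both helpers return a dict with the elapsed wall-clock time and the stream versions, so the two paths can be compared point-for-point on the same data.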
