diff --git a/README.md b/README.md index c1be217f..7537d99a 100644 --- a/README.md +++ b/README.md @@ -4,221 +4,501 @@ [![GoDoc](https://godoc.org/github.com/v3io/frames?status.svg)](https://godoc.org/github.com/v3io/frames) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) -V3IO Frames is a high-speed server and client library for accessing time-series (TSDB), NoSQL, and streaming data in the [Iguazio Data Science Platform](https://www.iguazio.com). +V3IO Frames (**"Frames"**) is a multi-model open-source data-access library, developed by Iguazio, which provides a unified high-performance DataFrame API for working with data in the data store of the [Iguazio Data Science Platform](https://www.iguazio.com) (**"the platform"**). -## Documentation +#### In This Document -Frames currently supports 3 backends and basic CRUD functionality for each. +- [Client Python API Reference](#client-python-api-reference) +- [Contributing](#contributing) +- [LICENSE](#license) -Supported Backends: -1. TSDB -2. KV -3. Stream -4. CSV - for testing purposes + +## Client Python API Reference +- [Overview](#overview) +- [User Authentication](#user-authentication) +- [Client Constructor](#client-constructor) +- [Common Client Method Parameters](#client-common-method-params) +- [The create Method](#method-create) +- [The write Method](#method-write) +- [The read Method](#method-read) +- [The delete Method](#method-delete) +- [The execute Method](#method-execute) + + +### Overview + +To use Frames, you first need to import the **v3io_frames** Python library. +For example: +```python +import v3io_frames as v3f +``` +Then, you need to create and initialize an instance of the `Client` class; see [Client Constructor](#client-constructor). 
+You can then use the client methods to perform different data operations on the supported backend types: + + +#### Client Methods + +The `Client` class features the following methods for supporting basic data operations: + +- [`create`](#method-create) — create a new NoSQL or TSDB table or a stream ("the backend"). +- [`delete`](#method-delete) — delete the backend. +- [`read`](#method-read) — read data from the backend (as a pandas DataFrame or DataFrame iterator). +- [`write`](#method-write) — write one or more DataFrames to the backend. +- [`execute`](#method-execute) — execute a command on the backend. Each backend may support multiple commands. + + +#### Backend Types + +All Frames client methods receive a [`backend`](#client-method-param-backend) parameter for setting the Frames backend type. +Frames supports the following backend types: + +- `kv` — a platform NoSQL (key/value) table. +- `stream` — a platform data stream. +- `tsdb` — a time-series database (TSDB). +- `csv` — a comma-separated-value (CSV) file. + This backend type is used only for testing purposes. + +> **Note:** Some method parameters are common to all backend types and some are backend-specific, as detailed in this reference. + + +### User Authentication + +When creating a Frames client, you must provide valid platform credentials for accessing the backend data, which Frames will use to identify the identity of the user. +This can be done by using any of the following alternative methods (documented in order of precedence): + +- Provide the authentication credentials in the [`Client` constructor parameters](#client-constructor-parameters) by using either of the following methods: + + - Set the [`token`](#client-param-token) constructor parameter to a valid platform access key with the required data-access permissions. 
+ - Set the [`user`](#client-param-user) and [`password`](#client-param-password) constructor parameters to the username and password of a platform user with the required data-access permissions. +
+ + > **Note:** You can't use both methods concurrently: setting the `token` parameter together with the `user` and `password` parameters in the same constructor call produces an error. + +- Set the authentication credentials in environment variables, by using either of the following methods: + + - Set the `V3IO_ACCESS_KEY` environment variable to a valid platform access key with the required data-access permissions. + - Set the `V3IO_USERNAME` and `V3IO_PASSWORD` environment variables to the username and password of a platform user with the required data-access permissions. +
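The order of precedence described in this section can be sketched in plain Python. This is only an illustration: `resolve_credentials` is a hypothetical helper that is not part of `v3io_frames`, and the exact error that Frames raises for conflicting parameters is an assumption.

```python
import os


def resolve_credentials(user="", password="", token="", environ=None):
    """Pick credentials in the documented order of precedence.

    Hypothetical helper for illustration only; not part of v3io_frames.
    Returns ("token", key) or ("password", (user, password)).
    """
    env = os.environ if environ is None else environ

    # Constructor parameters come first; mixing both kinds is an error.
    if token and (user or password):
        raise ValueError("set either token or user/password, not both")
    if token:
        return ("token", token)
    if user or password:
        if not (user and password):
            raise ValueError("user and password must be set together")
        return ("password", (user, password))

    # Environment variables are consulted only when no parameters are set,
    # and the access key takes precedence over the username/password pair.
    if env.get("V3IO_ACCESS_KEY"):
        return ("token", env["V3IO_ACCESS_KEY"])
    if env.get("V3IO_USERNAME") and env.get("V3IO_PASSWORD"):
        return ("password", (env["V3IO_USERNAME"], env["V3IO_PASSWORD"]))

    raise ValueError("no platform credentials found")
```

Passing a plain dictionary as `environ` makes the sketch easy to try without touching the real process environment.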
+ + > **Note:** + > - When the client constructor is called with [authentication parameters](#user-auth-client-const-params), the authentication-credentials environment variables (if defined) are ignored. + > - When `V3IO_ACCESS_KEY` is defined, `V3IO_USERNAME` and `V3IO_PASSWORD` are ignored. + > - The platform's Jupyter Notebook service automatically defines the `V3IO_ACCESS_KEY` environment variable and initializes it to a valid access key for the running user of the service. + + +### Client Constructor + +All Frames operations are executed via an object of the `Client` class. + +- [Syntax](#client-constructor-syntax) +- [Parameters and Data Members](#client-constructor-parameters) +- [Example](#client-constructor-example) + + +#### Syntax + +```python +Client(address, user, password, token, container) +``` + + +#### Parameters and Data Members + +- **address** — the address of the Frames service (`framesd`). +
+ When running locally on the platform (for example, from a Jupyter Notebook service), set this parameter to `framesd:8081`. +
+ When connecting to the platform remotely, set this parameter to the API address of the Frames platform service of the parent tenant. + You can copy this address from the **API** column of the V3IO Frames service on the **Services** platform dashboard page. + + - **Type:** String + - **Requirement:** Required + +- **container** — the name of the platform data container that contains the backend data. + For example, `"bigdata"` or `"users"`. + + - **Type:** String + - **Requirement:** Required + +- **user** — the username of a platform user with permissions to access the backend data. + + - **Type:** String + - **Requirement:** Required when neither the [`token`](#client-param-token) parameter nor the authentication environment variables are set. + See [User Authentication](#user-authentication). +
+ When the `user` parameter is set, the [`password`](#client-param-password) parameter must also be set to a matching user password. + +- **password** — a platform password for the user configured in the [`user`](#client-param-user) parameter. + + - **Type:** String + - **Requirement:** Required when the [`user`](#client-param-user) parameter is set. + See [User Authentication](#user-authentication). + +- **token** — a valid platform access key that allows access to the backend data. + To get this access key, select the user profile icon on any platform dashboard page, select **Access Tokens**, and copy an existing access key or create a new key. + + - **Type:** String + - **Requirement:** Required when neither the [`user`](#client-param-user) or [`password`](#client-param-password) parameters or the authentication environment variables are set. + See [User Authentication](#user-authentication). + + +#### Example + +The following example, for local platform execution, creates a Frames client for accessing data in the "users" container by using the authentication credentials of the user "iguazio": -All of frames operations are executed via the `client` object. To create a client object simply provide the Iguazio web-api endpoint and optional credentials. ```python import v3io_frames as v3f -client = v3f.Client('framesd:8081', user='user1', password='pass') +client = v3f.Client("framesd:8081", user="iguazio", password="mypass", container="users") ``` -Note: When running from within the managed jupyter notebook on the iguazio platform there is no need to add credentials as this is handled by the platform. -Next, for every operation we need to provide a `backend`, and a `table` parameters and optionally other function specific arguments. -### Create -Creates a new table for the desired backend. Not all backends require a table to be created prior to ingestion. 
For example KV table will be created while ingesting new data, on the other hand since TSDB tables have mandatory fields we need to create a table before ingesting new data. + +### Common Client Method Parameters + +All client methods receive the following common parameters: + +- **backend** — the backend data type for the operation. + See the backend-types descriptions in the [overview](#backend-types). + + - **Type:** String + - **Valid Values:** `"csv"` | `"kv"` | `"stream"` | `"tsdb"` + - **Requirement:** Required + +- **table** — the relative path to the backend data — a directory in the target platform data container (as configured for the client object) that represents a platform data collection, such as a TSDB or NoSQL table or a stream. + For example, `"mytable"` or `"examples/tsdb/my_metrics"`. + + - **Type:** String + - **Requirement:** Required + +Additional method-specific parameters are described for each method. + + +### create Method + +Creates a new data collection (table/stream) in a platform data container according to the configured backend. + +The `create` method is supported by the `tsdb` and `stream` backends, but not by the `kv` backend, because NoSQL tables in the platform don't need to be created prior to ingestion; when ingesting data into a table that doesn't exist, the table is automatically created. + +- [Syntax](#method-create-syntax) +- [`tsdb` backend `create` parameters](#method-create-params-tsdb) +- [`stream` backend `create` parameters](#method-create-params-stream) + + +#### Syntax + ```python -client.create(backend=, table=, attrs=) +create(backend=, table=
, attrs=) ``` -#### backend specific parameters -##### TSDB -* rate -* aggregates (optional) -* aggregation-granularity (optional) + +#### tsdb Backend create Parameters + +- **rate** (Required) — `string` — the ingestion rate of the TSDB's metric samples, as `"[0-9]+/[smh]"` (where `s` = seconds, `m` = minutes, and `h` = hours); for example, `"1/s"` (one sample per second). + The rate should be calculated according to the slowest expected ingestion rate. +- **aggregates** (Optional) +- **aggregation-granularity** (Optional) + +For detailed information about these parameters, refer to the [V3IO TSDB documentation](https://github.com/v3io/v3io-tsdb#v3io-tsdb). -For detailed info on these parameters please visit [TSDB](https://github.com/v3io/v3io-tsdb#v3io-tsdb) docs. Example: ```python -client.create('tsdb', '/mytable', attrs={'rate': '1/m'}) +client.create("tsdb", "/mytable", attrs={"rate": "1/m"}) ``` -##### Stream -* shards=1 (optional) -* retention_hours=24 (optional) + +#### stream Backend create Parameters + +- **shards** (Optional) (default: `1`) — `int` — the number of stream shards to create. +- **retention_hours** (Optional) (default: `24`) — `int` — the stream's retention period, in hours. + +For detailed information about these parameters, refer to the [platform streams documentation](https://www.iguazio.com/docs/concepts/latest-release/streams). -For detailed info on these parameters please visit [Stream](https://www.iguazio.com/docs/concepts/latest-release/streams) docs. Example: ```python -client.create('stream', '/mystream', attrs={'shards': '6'}) +client.create("stream", "/mystream", attrs={"shards": 6}) +``` + + +### write Method + +Writes data from a DataFrame to a data collection (table/stream) in a platform data container according to the configured backend. 
+ +- [Syntax](#method-write-syntax) +- [Common parameters](#method-write-backend-common-params) +- [`kv` backend `write` parameters](#method-write-params-kv) + + +#### Syntax + +```python +write(backend=, table=
, attrs=) ``` -### Write -Writes a Dataframe into one of the supported backends. -Common write parameters: -* dfs - list of Dataframes to write -* index_cols=None (optional) - specify specific index columns, by default Dataframe's index columns will be used. -* labels=None (optional) -* max_in_message=0 (optional) -* partition_keys=None (Not yet supported) + +#### Common write Parameters + +All Frames backends that support the `write` method support the following common parameters, which can be set in the `attrs` method parameter: + +- **dfs** — list of DataFrames to write. +- **index_cols** (Optional) (default: none) — the columns to use as index columns; by default, the DataFrame's index columns are used. +- **labels** (Optional) (default: none) +- **max_in_message** (Optional) (default: `0`) +- **partition_keys** (Optional) (default: none) (**Not yet supported**) Example: ```python -data = [['tom', 10], ['nick', 15], ['juli', 14]] -df = pd.DataFrame(data, columns = ['name', 'age']) -df.set_index('name') -client.write(backend='kv', table='mytable', dfs=df) +import pandas as pd +data = [["tom", 10], ["nick", 15], ["juli", 14]] +df = pd.DataFrame(data, columns=["name", "age"]) +df = df.set_index("name") +client.write(backend="kv", table="mytable", dfs=df) ``` -#### backend specific parameters -##### KV + +#### kv Backend write Parameters + -* condition=' ' (optional) - for detailed information on condition expressions see [docs](https://www.iguazio.com/docs/reference/latest-release/expressions/condition-expression/) + + +- **condition** (Optional) (default: none) — a platform condition expression that defines conditions for performing the update. + For detailed information about platform condition expressions, see the [platform documentation](https://www.iguazio.com/docs/reference/latest-release/expressions/condition-expression/). 
Example: ```python -data = [['tom', 10, 'TLV'], ['nick', 15, 'Berlin'], ['juli', 14, 'NY']] -df = pd.DataFrame(data, columns = ['name', 'age', 'city']) -df.set_index('name') -v3c.write(backend='kv', table='mytable', dfs=df, condition='age>14') +data = [["tom", 10, "TLV"], ["nick", 15, "Berlin"], ["juli", 14, "NY"]] +df = pd.DataFrame(data, columns=["name", "age", "city"]) +df = df.set_index("name") +client.write(backend="kv", table="mytable", dfs=df, condition="age>14") ``` -### Read -Reads data from a backend. -Common read parameters: -* iterator: bool - Return iterator of DataFrames or (if False) just one DataFrame -* filter: string - Query filter (can't be used with query) -* columns: []str - List of columns to pass (can't be used with query) -* data_format: string - Data format (Not yet supported) -* marker: string - Query marker (Not yet supported) -* limit: int - Maximal number of rows to return (Not yet supported) -* row_layout: bool - Weather to use row layout (vs the default column layout) (Not yet supported) - - -#### backend specific parameters -##### TSDB -* start: string -* end: string -* step: string -* aggregators: string -* aggregationWindow: string -* query: string - Query in SQL format -* group_by: string - Query group by (can't be used with query) -* multi_index: bool - Get the results as a multi index data frame where the labels are used as indexes - in addition to the timestamp, or if `False` (default behavior) only the timestamp will function as the index. - -For detailed info on these parameters please visit [TSDB](https://github.com/v3io/v3io-tsdb#v3io-tsdb) docs. -Example: + +### read Method + +Reads data from a data collection (table/stream) in a platform data container to a DataFrame according to the configured backend. 
+ +- [Syntax](#method-read-syntax) +- [Common parameters](#method-read-backend-common-params) +- [`tsdb` backend `read` parameters](#method-read-params-tsdb) +- [`kv` backend `read` parameters](#method-read-params-kv) +- [`stream` backend `read` parameters](#method-read-params-stream) + +Reads data from a backend. + + +#### Syntax + ```python -df = client.read(backend='tsdb', query="select avg(cpu) as cpu, avg(diskio), avg(network)from mytable", start='now-1d', end='now', step='2h') +read(backend=, table=
, attrs=) ``` -##### KV -* reset_index: bool - Reset the index. When set to `false` (default), the dataframe will have the key column of the v3io kv as the index column. -When set to `true`, the index will be reset to a range index. -* max_in_message: int - Maximal number of rows per message -* sharding_keys: []string (Experimental)- list of specific sharding keys to query. For range scan formatted tables only. -* segments: []int64 (Not yet supported) -* total_segments: int64 (Not yet supported) -* sort_key_range_start: string (Not yet supported) -* sort_key_range_end: string (Not yet supported) + +#### Common read Parameters + +All Frames backends that support the `read` method support the following common parameters, which can be set in the `attrs` method parameter: -For detailed info on these parameters please visit KV docs. +- **iterator** — `bool` — return an iterator of DataFrames or (if `False`) a single DataFrame +- **filter** — `string` — query filter (can't be used with `query`) +- **columns** — `[]str` — list of columns to pass (can't be used with `query`) +- **data_format** — `string` — data format (**Not yet supported**) +- **marker** — `string` — query marker (**Not yet supported**) +- **limit** — `int` — maximum number of rows to return (**Not yet supported**) +- **row_layout** — `bool` — whether to use row layout (vs the default column layout) (**Not yet supported**) + + +#### tsdb Backend read Parameters + +- **start** — `string` +- **end** — `string` +- **step** — `string` +- **aggregators** — `string` +- **aggregationWindow** — `string` +- **query** — `string` — query in SQL format +- **group_by** — `string` — query group by (can't be used with `query`) +- **multi_index** — `bool` — get the results as a multi-index DataFrame in which the labels are used as indexes in addition to the timestamp; when `False` (default), only the timestamp functions as the index. 
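The parameter lists above mark `filter`, `columns`, and `group_by` as unusable together with `query`. A minimal client-side sketch of that rule follows. `check_read_args` is a hypothetical helper that is not part of `v3io_frames`, and how Frames itself reports such a conflict is not documented here.

```python
# Parameters that the reference marks as "can't be used with query".
QUERY_EXCLUSIVE = ("filter", "columns", "group_by")


def check_read_args(**kwargs):
    """Reject read arguments that conflict with an SQL query.

    Hypothetical pre-flight check for illustration; not part of v3io_frames.
    Returns the keyword arguments unchanged when they are compatible.
    """
    if not kwargs.get("query"):
        return kwargs
    conflicts = [name for name in QUERY_EXCLUSIVE if kwargs.get(name)]
    if conflicts:
        raise ValueError("can't combine query with: " + ", ".join(conflicts))
    return kwargs
```

The checked arguments could then be forwarded to the client, for example `client.read(backend="tsdb", **check_read_args(query="select avg(cpu) from mytable", start="now-1d"))`.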
+ +For detailed information about these parameters, refer to the [V3IO TSDB documentation](https://github.com/v3io/v3io-tsdb#v3io-tsdb). Example: ```python -df = client.read(backend='kv', table='mytable', filter='col1>666') +df = client.read(backend="tsdb", query="select avg(cpu) as cpu, avg(diskio), avg(network)from mytable", start="now-1d", end="now", step="2h") ``` -##### Stream -* seek: string - excepted values: time | seq/sequence | latest | earliest. -if `seq` seek type is requested, need to provide the desired sequence id via `sequence` parameter. -if `time` seek type is requested, need to provide the desired start time via `start` parameter. -* shard_id: string -* sequence: int64 (optional) + +#### kv Backend read Parameters -For detailed info on these parameters please visit [Stream](https://www.iguazio.com/docs/concepts/latest-release/streams) docs. +- **reset_index** — `bool` — Reset the index. When set to `false` (default), the DataFrame will have the key column of the v3io kv as the index column. + When set to `true`, the index will be reset to a range index. +- **max_in_message** — `int` — Maximal number of rows per message +- **sharding_keys** — `[]string` (**Experimental**) — a list of specific sharding keys to query, for range-scan formatted tables only. +- **segments** — `[]int64` (**Not yet supported**) +- **total_segments** — `int64` (**Not yet supported**) +- **sort_key_range_start** — `string` (**Not yet supported**) +- **sort_key_range_end** — `string` (**Not yet supported**) + +For detailed information about these parameters, refer to the platform's NoSQL documentation. Example: ```python -df = client.read(backend='stream', table='mytable', seek='latest', shard_id='5') +df = client.read(backend="kv", table="mytable", filter="col1>666") ``` -### Delete -Deletes a table of a specific backend. + +#### stream Backend read Parameters + +- **seek** — `string` — valid values: `"time" | "seq"/"sequence" | "latest" | "earliest"`. +
+ If the `"seq"|"sequence"` seek type is set, you need to provide the desired record sequence ID via the [`sequence`](#method-read-stream-param-sequence) parameter. +
+ If the `time` seek type is set, you need to provide the desired start time via the `start` parameter. +- **shard_id** — `string` +- **sequence** — `int64` (Optional) + +For detailed information about these parameters, refer to the [platform streams documentation](https://www.iguazio.com/docs/concepts/latest-release/streams). Example: ```python -df = client.delete(backend='', table='mytable') +df = client.read(backend="stream", table="mytable", seek="latest", shard_id="5") ``` -#### backend specific parameters -##### TSDB -* start: string - delete since start -* end: string - delete since start + +### delete Method + +Deletes a data collection (table/stream) in a platform data container according to the configured backend. +
+The `kv` backend also supports an optional [`filter`](#method-delete-kv-param-filter) parameter that can be used to delete only specific items in a NoSQL table. + +- [Syntax](#method-delete-syntax) +- [`tsdb` backend `delete` parameters](#method-delete-params-tsdb) +- [`kv` backend `delete` parameters](#method-delete-params-kv) + + +#### Syntax + +```python +delete(backend=, table=
, attrs=) +``` + + +#### tsdb Backend delete Parameters + +- **start** — `string` — the start of the time range to delete +- **end** — `string` — the end of the time range to delete + +> **Note:** When neither the `start` nor the `end` parameter is set, the entire TSDB table is deleted. + +For detailed information about these parameters, refer to the [V3IO TSDB](https://github.com/v3io/v3io-tsdb#v3io-tsdb) documentation. -Note: if both `start` and `end` are not specified **all** the TSDB table will be deleted. -For detailed info on these parameters please visit [TSDB](https://github.com/v3io/v3io-tsdb#v3io-tsdb) docs. Example: ```python -df = client.delete(backend='tsdb', table='mytable', start='now-1d', end='now-5h') +df = client.delete(backend="tsdb", table="mytable", start="now-1d", end="now-5h") ``` -##### KV -* filter: string - Filter for selective delete + + +#### kv Backend delete Parameters + +- **filter** — `string` — a platform filter expression that identifies specific items to delete. + For detailed information about platform filter expressions, see the [platform documentation](https://www.iguazio.com/docs/reference/latest-release/expressions/condition-expression/#filter-expression). + +> **Note:** When the `filter` parameter isn't set, the entire table is deleted. Example: ```python -df = client.delete(backend='kv', table='mytable', filter='age>40') +df = client.delete(backend="kv", table="mytable", filter="age > 40") ``` -### Execute -Provides additional functions that are not covered in the basic CRUD functionality. + +### execute Method -##### TSDB -Currently no `execute` commands are available for the TSDB backend. +Extends the basic CRUD functionality of the other client methods via custom commands: + +- [tsdb backend commands](#method-execute-tsdb-cmds) +- [kv backend commands](#method-execute-kv-cmds) +- [stream backend commands](#method-execute-stream-cmds) + + +### tsdb Backend execute Commands + +Currently, no `execute` commands are available for the `tsdb` backend. 
+ + +### kv Backend execute Commands + +- **infer | inferschema** — infers the data schema of a given NoSQL table and creates a schema file for the table. + + Example: + ```python + client.execute(backend="kv", table="mytable", command="infer") + ```` -##### KV -* infer, inferschema - inferring and creating a schema file for a given kv table. - Example: `client.execute(backend='kv', table='mytable', command='infer')` -##### Stream -* put - putting a new object to a stream. -Example: `client.execute(backend='stream', table='mystream', command='put', args={'data': 'this a record', 'clientinfo': 'some_info', 'partition': 'partition_key'})` + +### stream Backend execute Commands +- **put** — adds records to a stream. + + Example: + ```python + client.execute(backend="stream", table="mystream", command="put", args={"data": "this a record", "clientinfo": "some_info", "partition": "partition_key"}) + ``` + + ## Contributing +To contribute to V3IO Frames, you need to be aware of the following: + +- [Components](#components) +- [Development](#development) + - [Adding and Changing Dependencies](#adding-and-changing-dependencies) + - [Travis CI](#travis-ci) +- [Docker Image](#docker-image) + - [Building the Image](#building-the-image) + - [Running the Image](#running-the-image) + + ### Components +The following components are required for building Frames code: + - Go server with support for both the gRPC and HTTP protocols - Go client - Python client + ### Development The core is written in [Go](https://golang.org/). The development is done on the `development` branch and then released to the `master` branch. +Before submitting changes, test the code: + - To execute the Go tests, run `make test`. - To execute the Python tests, run `make test-python`. -#### Adding/Changing Dependencies + +#### Adding and Changing Dependencies - If you add Go dependencies, run `make update-go-deps`. - If you add Python dependencies, update **clients/py/Pipfile** and run `make update-py-deps`. 
+ #### Travis CI Integration tests are run on [Travis CI](https://travis-ci.org/). @@ -227,13 +507,13 @@ See **.travis.yml** for details. The following environment variables are defined in the [Travis settings](https://travis-ci.org/v3io/frames/settings): - Docker Container Registry ([Quay.io](https://quay.io/)) - - `DOCKER_PASSWORD` — Password for pushing images to Quay.io. - - `DOCKER_USERNAME` — Username for pushing images to Quay.io. + - `DOCKER_PASSWORD` — password for pushing images to Quay.io. + - `DOCKER_USERNAME` — username for pushing images to Quay.io. - Python Package Index ([PyPI](https://pypi.org/)) - - `V3IO_PYPI_PASSWORD` — Password for pushing a new release to PyPi. - - `V3IO_PYPI_USER` — Username for pushing a new release to PyPi. + - `V3IO_PYPI_PASSWORD` — password for pushing a new release to PyPi. + - `V3IO_PYPI_USER` — username for pushing a new release to PyPi. - Iguazio Data Science Platform - - `V3IO_SESSION` — A JSON encoded map with session information for running tests. + - `V3IO_SESSION` — a JSON encoded map with session information for running tests. For example: ``` @@ -241,8 +521,10 @@ The following environment variables are defined in the [Travis settings](https:/ ``` > **Note:** Make sure to embed the JSON object within single quotes (`'{...}'`). + ### Docker Image + #### Building the Image Use the following command to build the Docker image: @@ -251,6 +533,7 @@ Use the following command to build the Docker image: make build-docker ``` + #### Running the Image Use the following command to run the Docker image: @@ -261,6 +544,7 @@ docker run \ quay.io/v3io/frames:unstable ``` + ## LICENSE [Apache 2](LICENSE)