diff --git a/docs/install.rst b/docs/basic/index.rst similarity index 99% rename from docs/install.rst rename to docs/basic/index.rst index 5063948..32fbf35 100644 --- a/docs/install.rst +++ b/docs/basic/index.rst @@ -1,10 +1,10 @@ .. highlight:: bash -.. _install: +.. _install_basic: -============ -Installation -============ +=============== +Getting started +=============== .. contents:: :local: diff --git a/docs/cloud/aws/aws-crate-gui.png b/docs/cloud/aws/aws-crate-gui.png new file mode 100644 index 0000000..7aeab3e Binary files /dev/null and b/docs/cloud/aws/aws-crate-gui.png differ diff --git a/docs/cloud/aws/aws-terminal-output.png b/docs/cloud/aws/aws-terminal-output.png new file mode 100644 index 0000000..7d2ea75 Binary files /dev/null and b/docs/cloud/aws/aws-terminal-output.png differ diff --git a/docs/cloud/aws/aws-terraform-setup.rst b/docs/cloud/aws/aws-terraform-setup.rst new file mode 100644 index 0000000..f208bed --- /dev/null +++ b/docs/cloud/aws/aws-terraform-setup.rst @@ -0,0 +1,201 @@ +.. _aws_terraform_setup: + +============================= +Running CrateDB via Terraform +============================= + +In :ref:`ec2_setup`, we elaborated on how to leverage EC2's functionality to set +up a CrateDB cluster. Here, we will explore how to automate this kind of setup. + +`Terraform`_ is an infrastructure as code tool, often used as an abstraction +layer on top of a cloud's management APIs. Instead of creating cloud resources +manually, the target state is specified via configuration files which can also +be managed in a version control system. 
This brings several advantages, including: + +- Reproducible deployments, e.g., across different accounts or in case of + disaster recovery +- Support for common development workflows like code reviews, automated testing, + and so on +- Better prediction and tracing of infrastructure changes + +The `crate-terraform`_ repository provides a predefined configuration template +of various AWS resources to form a CrateDB cluster on AWS (such as EC2 +instances, a load balancer, etc.). This eliminates the need to manually compose +all required resources and their interactions. + +.. SEEALSO:: + + Engage with us in the `community post`_ on Terraform deployments for any + questions or feedback! + +.. CAUTION:: + + The provided configuration is meant to be used for development or testing + purposes and does not aim to fulfil all needs of a production environment. + +Prerequisites +============= + +Before creating the configuration to launch your CrateDB cluster, the following +prerequisites should be fulfilled: + +1. The Terraform CLI is installed as per + `Terraform's installation guide`_ +2. The git CLI is installed as per `git's installation guide`_ +3. AWS credentials are configured for Terraform. If you already have a + configured AWS CLI setup, Terraform will reuse this configuration. If not, + see the `AWS provider`_ documentation on authentication. + +Deployment configuration +======================== + +The CrateDB Terraform configuration consists of a set of variables to customize +your deployment. Create a new file ``main.tf`` with the following content and +adjust variable values as needed: + +.. 
code-block:: + + module "cratedb-cluster" { + source = "github.com/crate/crate-terraform.git/aws" + + # Global configuration items for naming/tagging resources + config = { + project_name = "example-project" + environment = "test" + owner = "Crate.IO" + team = "Customer Engineering" + } + + # CrateDB-specific configuration + crate = { + # Java Heap size in GB available to CrateDB + heap_size_gb = 2 + + cluster_name = "crate-cluster" + + # The number of nodes the cluster will consist of + cluster_size = 2 + + # Enables a self-signed SSL certificate + ssl_enable = true + } + + # The disk size in GB to use for CrateDB's data directory + disk_size_gb = 512 + + # The AWS region + region = "eu-central-1" + + # The VPC to deploy to + vpc_id = "vpc-1234567" + + # Applicable subnets of the VPC + subnet_ids = ["subnet-123456", "subnet-123457"] + + # The corresponding availability zones of above subnets + availability_zones = ["eu-central-1b", "eu-central-1a"] + + # The SSH key pair for EC2 instances + ssh_keypair = "cratedb-cluster" + + # Enable SSH access to EC2 instances + ssh_access = true + } + + output "cratedb" { + value = module.cratedb-cluster + sensitive = true + } + +The AWS-specific variables need to be adjusted according to your environment: + ++------------------------+--------------------------------------------------------------+----------------------------------+ +| Variable | Explanation | How to obtain | ++========================+==============================================================+==================================+ +| ``region`` | The geographic region in which to create the AWS resources | `List of available AWS regions`_ | ++------------------------+--------------------------------------------------------------+----------------------------------+ +| ``vpc_id`` | The ID of the Virtual Private Cloud (VPC) in which the EC2 | `How to view VPC properties`_ | +| | instances will be launched | | 
++------------------------+--------------------------------------------------------------+----------------------------------+ +| ``subnet_ids`` | Each VPC consists of multiple subnets, typically distributed | `How to view subnet properties`_ | +| | across availability zones. Choose the ones you want to | | +| | launch EC2 instances in. | | ++------------------------+--------------------------------------------------------------+----------------------------------+ +| ``availability_zones`` | The availability zones of the above subnets. | `How to view subnet properties`_ | +| | The positions in the ``availability_zones`` array must match | | +| | the corresponding elements in ``subnet_ids``. | | +| | In the example above, ``subnet-123456`` is in | | +| | ``eu-central-1b``, and ``subnet-123457`` in | | +| | ``eu-central-1a``. | | ++------------------------+--------------------------------------------------------------+----------------------------------+ +| ``ssh_keypair`` | The EC2 key pair used for SSH access. This must be the | `How to create EC2 key pairs`_ | +| | name of an existing key pair. | | ++------------------------+--------------------------------------------------------------+----------------------------------+ + +Execution +========= + +Once all variables are configured properly, Terraform needs to be initialized: + +.. code-block:: bash + + terraform init + +To create the resources, apply the configuration. +There will be a final confirmation prompt before any changes are applied to your +AWS account: + +.. code-block:: bash + + terraform apply + +Once the execution has succeeded, a message similar to the one below is shown: + +.. code-block:: bash + + Apply complete! Resources: 22 added, 0 changed, 0 destroyed. + + Outputs: + + cratedb = <sensitive> + +Terraform internally tracks the state of each resource it manages, including +certain outputs with details on the created cluster. 
As those details include +credentials, they are marked as sensitive and not shown in the output above. +To view the output, run: + +.. code-block:: bash + + terraform output cratedb + +The output variable ``cratedb_application_url`` points to the load balancer with +the port of CrateDB's Admin UI. Opening that URL in your browser should show a +password prompt where you can authenticate using ``cratedb_username`` and +``cratedb_password``. + +Deprovisioning +============== + +If the CrateDB cluster is not needed anymore, you can easily instruct Terraform +to destroy all associated resources: + +.. code-block:: bash + + terraform destroy + +.. CAUTION:: + + Destroying the cluster will permanently delete all data stored on it. Use + :ref:`snapshots <s3_setup>` to create a backup on S3 if needed. + +.. _Terraform: https://www.terraform.io +.. _crate-terraform: https://github.com/crate/crate-terraform +.. _Terraform's installation guide: https://www.terraform.io/downloads.html +.. _git's installation guide: https://git-scm.com/downloads +.. _AWS provider: https://registry.terraform.io/providers/hashicorp/aws/latest/docs +.. _List of available AWS regions: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions +.. _How to view VPC properties: https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html#view-vpc +.. _How to view subnet properties: https://docs.aws.amazon.com/vpc/latest/userguide/working-with-subnets.html#view-subnet +.. _How to create EC2 key pairs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html +.. 
_community post: https://community.crate.io/t/deploying-cratedb-to-the-cloud-via-terraform/849 diff --git a/docs/cloud/aws/crate-ami-search.png b/docs/cloud/aws/crate-ami-search.png new file mode 100644 index 0000000..d80c60a Binary files /dev/null and b/docs/cloud/aws/crate-ami-search.png differ diff --git a/docs/cloud/aws/ec2-discovery-security-groups.png b/docs/cloud/aws/ec2-discovery-security-groups.png new file mode 100644 index 0000000..d00461a Binary files /dev/null and b/docs/cloud/aws/ec2-discovery-security-groups.png differ diff --git a/docs/cloud/aws/ec2-discovery-tags.png b/docs/cloud/aws/ec2-discovery-tags.png new file mode 100644 index 0000000..7517f16 Binary files /dev/null and b/docs/cloud/aws/ec2-discovery-tags.png differ diff --git a/docs/cloud/aws/ec2-setup.rst b/docs/cloud/aws/ec2-setup.rst new file mode 100644 index 0000000..406c502 --- /dev/null +++ b/docs/cloud/aws/ec2-setup.rst @@ -0,0 +1,223 @@ +.. highlight:: yaml +.. _ec2_setup: + +============================= +Running CrateDB on Amazon EC2 +============================= + +.. rubric:: Table of contents + +.. contents:: + :local: + +Introduction +============ + +When running CrateDB in a cloud environment such as `Amazon EC2`_ (Elastic +Compute Cloud), you usually face the problem that CrateDB's default discovery +mechanism does not work out of the box. + +Luckily, CrateDB has several built-in mechanisms for unicast host discovery, +including one for EC2. EC2 discovery uses the `EC2 API`_ to look up other EC2 +hosts that are then used as unicast hosts for node discovery (see +`Unicast Host Discovery`_). + +.. NOTE:: + + This best practice only describes how to use EC2 discovery and its + settings, not how to set up a cluster on EC2 securely. + +Basic Configuration +=================== + +The most important step for EC2 discovery is to launch your EC2 instances +within the same security group. 
The rules of that security group must +at least allow traffic on CrateDB's transport port (default ``4300``). This +will allow CrateDB to accept and respond to pings from other CrateDB instances +with the same cluster name and form a cluster. + +Once you have your instances running and CrateDB installed, you can enable EC2 +discovery: + ++-----------------+-------------------+---------------------------------------+ +| CrateDB Version | Reference | Example | ++=================+===================+=======================================+ +| >=4.x | `latest`_ | :: | +| | | | +| | | discovery.seed_providers: ec2 | ++-----------------+-------------------+---------------------------------------+ +| <=3.x | `3.3`_ | :: | +| | | | +| | | discovery.zen.hosts_provider: ec2 | ++-----------------+-------------------+---------------------------------------+ + +To be able to use the EC2 API, CrateDB must `sign the requests`_ by using +AWS credentials consisting of an access key and a secret key. To avoid +distributing your AWS credentials to the instances, AWS provides `IAM roles`_. + +CrateDB binds to the loopback interface by default. To get EC2 discovery +working, you need to update the `Hosts`_ setting to bind to and publish the +site-local address:: + + network.host: _site_ + +.. NOTE:: + + The requirement to explicitly configure CrateDB to bind to and publish the + site-local address is new in `1.2.0`_. + +.. _ec2_authentication: + +Authentication +-------------- + +It is recommended to grant CrateDB only the permissions necessary to describe +instances. First, create an IAM role that you can later assign to the +instances. This `AWS guide`_ gives you a +short description of how you can create a policy via the CLI or AWS management +console. An example policy file is attached below and should at least contain +these API permissions/actions: + +.. 
code-block:: json + + { + "Statement": [ + { + "Action": [ + "ec2:DescribeInstances" + ], + "Effect": "Allow", + "Resource": [ + "*" + ] + } + ], + "Version": "2012-10-17" + } + +This policy allows the instances to find each other if they have been assigned +to this role on startup. + +If you cannot use IAM roles, the AWS credentials can instead be provided as the +environment variables ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY``. You +could also provide them as system properties or as settings in ``crate.yml``, +but the advantage of environment variables is that ``COPY FROM`` and +``COPY TO`` statements use the same environment variables. This means that if +you want to use these statements, you will have to extend the permissions +accordingly. + +.. NOTE:: + + The environment variables need to be provided for the user that runs the + CrateDB process, which is usually the user ``crate`` in production. + +Now you are ready to start your CrateDB instances and they will discover each +other automatically. Use the `AWS CLI`_ or the `AWS Console`_ to run instances +and assign them an IAM role. Note that all CrateDB instances of the same +region will join the cluster as long as they share the same cluster name and +can reach each other over the transport port. + +Production Setup +================ + +For a production setup, the best way to filter instances for discovery is via +a security group. This requires that you create a separate security group for +each cluster and allow TCP traffic on transport port ``4300`` (or whichever +transport port is configured) only from within the group. + + .. image:: ec2-discovery-security-groups.png + :alt: Assign security group on instance launch + :width: 100% + +Since the instances that belong to the same CrateDB cluster then share the same +security group, you can easily filter instances by that group. 
+ +For example, when you launch your instances with the security group +``sg-crate-demo``, your CrateDB setting would be:: + + discovery.ec2.groups: sg-crate-demo + +The combination with the unique cluster name makes the production setup very +simple yet secure. + +See also `discovery.ec2.groups`_. + +Optional Filters +================ + +Sometimes, however, you will want to have a more flexible setup. In this case, +there are a few other configuration settings that can be adjusted. + +.. _filter-by-tags: + +Filter by Tags +-------------- + +The EC2 discovery mechanism can additionally filter machines by instance tags. +Tags are key-value pairs that can be assigned to an instance as metadata when +it is launched. + +A good example use of tags is to assign environment and usage type +information. + +Let's assume you have a pool of several instances tagged with ``env`` and +``type``, where ``env`` is either ``dev`` or ``production`` and ``type`` is +either ``app`` or ``database``. + + .. image:: ec2-discovery-tags.png + :alt: Adding tags on instance launch + :width: 100% + +Setting ``discovery.ec2.tag.env`` to ``production`` will filter machines with +the tag key ``env`` set to ``production``, excluding machines that have the +same key set to ``dev`` (and vice versa). + +To further exclude ``app`` instances from discovery, you can add the +setting ``discovery.ec2.tag.type: database``. + +This way, any number of tags can be used for filtering, using the +``discovery.ec2.tag.`` prefix for the setting name. + +Filtering by tags can help when you want to launch several CrateDB clusters +within the same security group, e.g.:: + + discovery.ec2: + groups: sg-crate-demo + tag.env: production + tag.type: database + +See also `discovery.ec2.tags`_. + +Filter by Availability Zones +---------------------------- + +A third possible way to filter instances is via availability zones. 
Say you have +several clusters for the same tenant in different availability zones +(e.g. ``us-west-1a`` and ``us-west-1b``). You can launch the instances with the +same security group (e.g. ``sg-crate-demo``) and filter the instances used for +discovery by availability zone:: + + discovery.ec2: + groups: sg-crate-demo + availability_zones: us-west-1a + +See also `discovery.ec2.availability_zones`_. + +.. _1.2.0: https://crate.io/docs/crate/reference/en/latest/appendices/release-notes/1.2.0.html +.. _3.3: https://crate.io/docs/crate/reference/en/3.3/config/cluster.html#discovery +.. _Amazon EC2: https://aws.amazon.com/ec2/ +.. _AWS CLI: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#launch-instance-with-role-cli +.. _AWS Console: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#launch-instance-with-role-console +.. _AWS guide: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html +.. _discovery.ec2.availability_zones: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#discovery-ec2-availability-zones +.. _discovery.ec2.groups: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#discovery-ec2-groups +.. _discovery.ec2.tags: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#discovery-ec2-tag-name +.. _EC2 API: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/Welcome.html +.. _Hosts: https://crate.io/docs/crate/reference/en/latest/config/node.html#hosts +.. _IAM roles: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html +.. _latest: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#discovery +.. _sign the requests: https://docs.aws.amazon.com/general/latest/gr/signing_aws_api_requests.html +.. 
_Unicast Host Discovery: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#unicast-host-discovery diff --git a/docs/cloud/aws/index.rst b/docs/cloud/aws/index.rst new file mode 100644 index 0000000..f74a035 --- /dev/null +++ b/docs/cloud/aws/index.rst @@ -0,0 +1,18 @@ +======================================== +Run CrateDB on Amazon Web Services (AWS) +======================================== + +Amazon Web Services (AWS) offers a wide range of cloud services, allowing you to +easily run and scale applications such as CrateDB. + +In this section, we explain the particularities of setting up CrateDB on AWS to +make the best use of its capabilities. + +.. rubric:: Table of contents + +.. toctree:: + :maxdepth: 1 + + ec2-setup + aws-terraform-setup + s3-setup diff --git a/docs/cloud/aws/s3-setup.rst b/docs/cloud/aws/s3-setup.rst new file mode 100644 index 0000000..372a302 --- /dev/null +++ b/docs/cloud/aws/s3-setup.rst @@ -0,0 +1,107 @@ +.. highlight:: yaml +.. _s3_setup: + +======================================== +Using Amazon S3 as a snapshot repository +======================================== + +CrateDB supports using `Amazon S3`_ (Amazon Simple Storage Service) as a +snapshot repository. + +.. rubric:: Table of contents + +.. contents:: + :local: + +Basic configuration +=================== + +Support for *Snapshot* and *Restore* to the `Amazon S3`_ service is enabled by +default in CrateDB. If you need to explicitly turn it off, disable the cloud +setting in the ``crate.yml`` file:: + + cloud.enabled: false + +To be able to use the S3 API, CrateDB must `sign the requests`_ by using AWS +credentials consisting of an access key and a secret key. To avoid +distributing your AWS credentials to the instances, AWS provides `IAM roles`_. + +.. 
_s3_authentication: + +Authentication +-------------- + +It is recommended to restrict CrateDB's permissions on S3 to only the +required extent. First, an IAM role is required. This `AWS guide`_ gives a +short description of how to create a policy using the CLI or the AWS +management console. Further, access to the S3 bucket used for snapshots should +be restricted. An example policy file granting anybody access to a bucket +called ``snaps.example.com`` is attached below: + +.. code-block:: json + + { + "Statement": [ + { + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation", + "s3:ListBucketMultipartUploads", + "s3:ListBucketVersions" + ], + "Effect": "Allow", + "Principal": "*", + "Resource": [ + "arn:aws:s3:::snaps.example.com" + ] + }, + { + "Action": [ + "s3:GetObject", + "s3:PutObject", + "s3:DeleteObject", + "s3:AbortMultipartUpload", + "s3:ListMultipartUploadParts" + ], + "Effect": "Allow", + "Principal": "*", + "Resource": [ + "arn:aws:s3:::snaps.example.com/*" + ] + } + ], + "Version": "2012-10-17" + } + +Access permissions can be further restricted to a specific AWS Principal by +changing the ``Statement.Principal`` setting. Please refer to `AWS principals`_ +for more information. + +For further AWS policy examples and detailed information, please refer to +`AWS policy examples`_ and the links therein. + +Note that the bucket needs to exist before registering a +repository for snapshots within CrateDB. CrateDB can also be allowed to create +the bucket. However, this requires the following permissions to be contained +within the policy: + +.. code-block:: json + + { + "Action": [ + "s3:CreateBucket" + ], + "Effect": "Allow", + "Resource": [ + "arn:aws:s3:::snaps.example.com" + ] + } + +.. _`Amazon S3`: https://aws.amazon.com/s3/ +.. _`sign the requests`: https://docs.aws.amazon.com/general/latest/gr/signing_aws_api_requests.html +.. _`IAM roles`: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html +.. 
_`AWS guide`: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html +.. _`AWS principals`: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements_principal.html +.. _`AWS policy examples`: https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html diff --git a/docs/cloud/azure/azure-create-vn.png b/docs/cloud/azure/azure-create-vn.png new file mode 100644 index 0000000..c79036b Binary files /dev/null and b/docs/cloud/azure/azure-create-vn.png differ diff --git a/docs/cloud/azure/azure-envvar.png b/docs/cloud/azure/azure-envvar.png new file mode 100644 index 0000000..02272de Binary files /dev/null and b/docs/cloud/azure/azure-envvar.png differ diff --git a/docs/cloud/azure/azure-inbound-rules.png b/docs/cloud/azure/azure-inbound-rules.png new file mode 100644 index 0000000..e4a6445 Binary files /dev/null and b/docs/cloud/azure/azure-inbound-rules.png differ diff --git a/docs/cloud/azure/azure-new-nsg.png b/docs/cloud/azure/azure-new-nsg.png new file mode 100644 index 0000000..f8368d4 Binary files /dev/null and b/docs/cloud/azure/azure-new-nsg.png differ diff --git a/docs/cloud/azure/azure-new-resource-group.png b/docs/cloud/azure/azure-new-resource-group.png new file mode 100644 index 0000000..7e60060 Binary files /dev/null and b/docs/cloud/azure/azure-new-resource-group.png differ diff --git a/docs/cloud/azure/azure-nsg-inbound.png b/docs/cloud/azure/azure-nsg-inbound.png new file mode 100644 index 0000000..e938c49 Binary files /dev/null and b/docs/cloud/azure/azure-nsg-inbound.png differ diff --git a/docs/cloud/azure/azure-port.gif b/docs/cloud/azure/azure-port.gif new file mode 100644 index 0000000..61d44e7 Binary files /dev/null and b/docs/cloud/azure/azure-port.gif differ diff --git a/docs/cloud/azure/azure-terraform-setup.rst b/docs/cloud/azure/azure-terraform-setup.rst new file mode 100644 index 0000000..1df4178 --- /dev/null +++ b/docs/cloud/azure/azure-terraform-setup.rst @@ -0,0 +1,190 @@ 
+.. _azure_terraform_setup: + +============================= +Running CrateDB via Terraform +============================= + +In :ref:`azure_vm_setup`, we elaborated on how to leverage Azure's functionality to +set up a CrateDB cluster. Here, we will explore how to automate this kind of +setup. + +`Terraform`_ is an infrastructure as code tool, often used as an abstraction +layer on top of a cloud's management APIs. Instead of creating cloud resources +manually, the target state is specified via configuration files which can also +be managed in a version control system. This brings several advantages, +including: + +- Reproducible deployments, e.g., across different accounts or in case of + disaster recovery +- Support for common development workflows like code reviews, automated testing, + and so on +- Better prediction and tracing of infrastructure changes + +The `crate-terraform`_ repository provides a predefined configuration template +of various Azure resources to form a CrateDB cluster on Azure (such as VMs, +a load balancer, etc.). This eliminates the need to manually compose all +required resources and their interactions. + +.. SEEALSO:: + + Engage with us in the `community post`_ on Terraform deployments for any + questions or feedback! + +.. CAUTION:: + + The provided configuration is meant to be used for development or testing + purposes and does not aim to fulfil all needs of a production environment. + +Prerequisites +============= + +Before creating the configuration to launch your CrateDB cluster, the following +prerequisites should be fulfilled: + +1. The Terraform CLI is installed as per + `Terraform's installation guide`_ +2. The git CLI is installed as per `git's installation guide`_ +3. Azure credentials are configured for Terraform. If you already have a + configured Azure CLI setup, Terraform will reuse this configuration. If not, + see the `Azure provider`_ documentation on authentication. 
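The three prerequisites above can be verified with a small pre-flight check before running any Terraform commands. This is a hypothetical helper script, not part of the crate-terraform repository; it only reports which of the required tools are on the ``PATH``:

```shell
# Hypothetical pre-flight check for the prerequisites listed above.
# It reports whether each required CLI is installed, without failing,
# so you can see everything that is missing in one pass.
report=""
for tool in terraform git az; do
  if command -v "$tool" >/dev/null 2>&1; then
    state="found"
  else
    state="MISSING"
  fi
  report="${report}${tool}: ${state}
"
done
printf '%s' "$report"
```

Running this before ``terraform init`` saves a round of trial-and-error on a fresh machine.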
+ +Deployment configuration +======================== + +The CrateDB Terraform configuration consists of a set of variables to customize +your deployment. Create a new file ``main.tf`` with the following content and +adjust variable values as needed: + +.. code-block:: + + module "cratedb-cluster" { + source = "github.com/crate/crate-terraform.git/azure" + + # The Azure subscription ID + subscription_id = "x-y-z" + + # Global configuration items for naming/tagging resources + config = { + project_name = "example-project" + environment = "test" + owner = "Crate.IO" + team = "Customer Engineering" + + # Run "az account list-locations" for a full list + location = "westeurope" + } + + # CrateDB-specific configuration + crate = { + # Java Heap size in GB available to CrateDB + heap_size_gb = 2 + + cluster_name = "crate-cluster" + + # The number of nodes the cluster will consist of + cluster_size = 2 + + # Enables a self-signed SSL certificate + ssl_enable = true + } + + # Azure VM specific configuration + vm = { + # The size of the disk storing CrateDB's data directory + disk_size_gb = 512 + storage_account_type = "Premium_LRS" + size = "Standard_DS12_v2" + + # Enabling SSH access + ssh_access = true + # Username to connect via SSH to the nodes + user = "cratedb-vmadmin" + } + } + + output "cratedb" { + value = module.cratedb-cluster + sensitive = true + } + +The Azure-specific variables need to be adjusted according to your environment: + ++--------------------------+--------------------------------------------------------------+----------------------------------+ +| Variable | Explanation | How to obtain | ++==========================+==============================================================+==================================+ +| ``subscription_id`` | The ID of the Azure subscription to use for creating the | ``az account list`` | +| | resource group in | | 
++--------------------------+--------------------------------------------------------------+----------------------------------+ +| ``location`` | The geographic region in which to create the Azure | ``az account list-locations`` | +| | resources | | ++--------------------------+--------------------------------------------------------------+----------------------------------+ +| ``storage_account_type`` | Storage Account Type of the disk containing the CrateDB | `List of Storage Account Types`_ | +| | data directory | | ++--------------------------+--------------------------------------------------------------+----------------------------------+ +| ``size`` | Specifies the size of the VM | ``az vm list-sizes`` | ++--------------------------+--------------------------------------------------------------+----------------------------------+ + +Execution +========= + +Once all variables are configured properly, Terraform needs to be initialized: + +.. code-block:: bash + + terraform init + +To create the resources, apply the configuration. +There will be a final confirmation prompt before any changes are applied to your +Azure account: + +.. code-block:: bash + + terraform apply + +Once the execution has succeeded, a message similar to the one below is shown: + +.. code-block:: bash + + Apply complete! Resources: 22 added, 0 changed, 0 destroyed. + + Outputs: + + cratedb = <sensitive> + +Terraform internally tracks the state of each resource it manages, including +certain outputs with details on the created cluster. As those details include +credentials, they are marked as sensitive and not shown in the output above. +To view the output, run: + +.. code-block:: bash + + terraform output cratedb + +The output variable ``cratedb_application_url`` points to the load balancer with +the port of CrateDB's Admin UI. Opening that URL in your browser should show a +password prompt where you can authenticate using ``cratedb_username`` and +``cratedb_password``. 
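Because the whole ``cratedb`` output object is marked sensitive, it is often more convenient to extract a single field than to print everything. A sketch of that pipeline, using a hypothetical sample value in place of a live ``terraform output -json cratedb`` call (the field names follow the outputs described above):

```shell
# In a real run you would populate the sample from Terraform instead:
#   sample=$(terraform output -json cratedb)
# The JSON below is a made-up stand-in with the same shape.
sample='{"cratedb_application_url":"https://lb.example.com:4200","cratedb_username":"admin"}'

# Extract just the Admin UI URL from the output object.
url=$(printf '%s' "$sample" \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["cratedb_application_url"])')

echo "$url"   # → https://lb.example.com:4200
```

This keeps credentials out of your terminal scrollback while still surfacing the value you need.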
+ +Deprovisioning +============== + +If the CrateDB cluster is not needed anymore, you can easily instruct Terraform +to destroy all associated resources: + +.. code-block:: bash + + terraform destroy + +.. CAUTION:: + + Destroying the cluster will permanently delete all data stored on it. Use + :ref:`snapshots ` to create a backup on Azure Blob storage + if needed. + +.. _Terraform: https://www.terraform.io +.. _crate-terraform: https://github.com/crate/crate-terraform +.. _Terraform's installation guide: https://www.terraform.io/downloads.html +.. _git's installation guide: https://git-scm.com/downloads +.. _Azure provider: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs +.. _List of Storage Account Types: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/virtualmachines?tabs=bicep#manageddiskparameters +.. _community post: https://community.crate.io/t/deploying-cratedb-to-the-cloud-via-terraform/849 diff --git a/docs/cloud/azure/azure-vm-setup.rst b/docs/cloud/azure/azure-vm-setup.rst new file mode 100644 index 0000000..6cd6972 --- /dev/null +++ b/docs/cloud/azure/azure-vm-setup.rst @@ -0,0 +1,188 @@ +.. _azure_vm_setup: + +============================ +Running CrateDB on Azure VMs +============================ + +Getting CrateDB working on Azure with Linux or Windows is a simple process. You +can use Azure's management console or CLI interface (`Learn how to install +here`_). + +.. rubric:: Table of contents + +.. contents:: + :local: + +Azure and Linux +=============== + +Create a resource group +----------------------- + +Azure uses 'Resource Groups' to group together related services and resources +for easier management. + +Create a resource group for the CrateDB cluster by selecting *Resource groups* +under the *new* left hand panel of the Azure portal. + +.. 
image:: azure-new-resource-group.png + :alt: Create Resource Group + +Create a network security group +------------------------------- + +CrateDB uses two ports, one for inter-node communication (``4300``) and one for +its HTTP endpoint (``4200``), so access to these needs to be opened. + +Create a *New Security Group*, giving it a name and assigning it to the +'Resource Group' just created. + +.. image:: azure-new-nsg.png + :alt: Create New Security Group + +Find that security group in your resources list and open its settings, +navigating to the *Inbound security rules* section. + +.. image:: azure-nsg-inbound.png + :alt: Inbound security rules + +Add a rule for each port: + +.. image:: azure-inbound-rules.png + :alt: Inbound security rules for each port + +Create a virtual network +------------------------ + +On some cloud hosting providers, CrateDB relies on unicast for inter-node +communication to form a cluster. + +The easiest way to get unicast communication working with Azure is to create a +Virtual Network (*+ -> Networking -> Virtual Network*) so that all the cluster +nodes exist on the same IP range. Give the network a name and a region, and let +Azure handle all the remaining settings by clicking the next arrow on each +screen. + +.. image:: azure-create-vn.png + :alt: Create Virtual Network + +Once the Virtual Network has been created, find it in your resources list, open +the edit screen and the *Subnets* setting. Add the security group created +earlier to the subnet. + +.. image:: azure-vn-subnet-sg.png + :alt: Add Security Group + +Create virtual machines +----------------------- + +Next, create virtual machines to act as your CrateDB nodes. In this tutorial, I +chose two low-specification Ubuntu 14.04 servers, but you likely have your own +preferred configurations. + +Most importantly, make sure you select the Virtual Network created earlier. 
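The portal steps above can also be scripted with the Azure CLI. Below is a sketch printed as a dry run so the plan can be reviewed before executing anything; the resource names and location are examples, not prescribed by CrateDB:

```shell
# Dry run: each az command is echoed instead of executed, so nothing is
# created yet. Remove the `run` wrapper to execute for real.
cmds=""
run() { cmds="$cmds$* ; "; echo "$@"; }

run az group create --name crate-rg --location westeurope
run az network nsg create --resource-group crate-rg --name crate-nsg

# Open CrateDB's HTTP (4200) and transport (4300) ports.
for port in 4200 4300; do
  run az network nsg rule create --resource-group crate-rg \
    --nsg-name crate-nsg --name "crate-port-$port" \
    --priority "$((port / 10))" --destination-port-ranges "$port" \
    --protocol Tcp --access Allow
done

# Virtual network with a subnet associated with the security group.
run az network vnet create --resource-group crate-rg --name crate-vnet \
  --subnet-name crate-subnet --network-security-group crate-nsg
```

Scripting these steps makes it easy to tear the environment down and recreate it consistently.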


Install CrateDB
---------------

*Note that these instructions should be followed on each VM in your cluster.*

To install CrateDB, ssh into your VMs and follow `the standard process for
Linux installation`_. This will automatically start an instance of CrateDB,
which we will need to restart after the next step.


Configure CrateDB
-----------------

*Note that these instructions should be followed on each VM in your cluster.*

To set the unicast hosts for the CrateDB cluster, edit the default
configuration file at */etc/crate/crate.yml*.

Uncomment or add these lines, depending on your CrateDB version:

+-----------------+-----------+---------------------------------------+
| CrateDB Version | Reference | Configuration Example                 |
+=================+===========+=======================================+
| >=4.x           | `latest`_ | .. code-block:: yaml                  |
|                 |           |                                       |
|                 |           |     discovery.seed_hosts:             |
|                 |           |     - node1.example.com:4300          |
|                 |           |     - node2.example.com:4300          |
|                 |           |     - 10.0.1.102:4300                 |
|                 |           |     - 10.0.1.103:4300                 |
+-----------------+-----------+---------------------------------------+
| <=3.x           | `3.3`_    | .. code-block:: yaml                  |
|                 |           |                                       |
|                 |           |     discovery.zen.ping.unicast.hosts: |
|                 |           |     - node1.example.com:4300          |
|                 |           |     - node2.example.com:4300          |
|                 |           |     - 10.0.1.102:4300                 |
|                 |           |     - 10.0.1.103:4300                 |
+-----------------+-----------+---------------------------------------+

.. NOTE::

   You might want to try DNS-based discovery for inter-node communication;
   `find more details`_ in our documentation.

Uncomment and set the cluster name:

.. code-block:: yaml

    cluster.name: crate

Restart CrateDB with ``service crate restart``.

Azure and Windows
=================

Initial setup
-------------

To create a resource group, network security group, and virtual network, follow
the same steps as for Azure and Linux.


Create virtual machines
-----------------------

Follow similar steps to creating virtual machines for Azure and Linux, but
create the VM based on the 'Windows Server 2012 R2 Datacenter' image.

Install CrateDB
---------------

*Note that these instructions should be followed on each VM in your cluster.*

To install CrateDB on Windows Server, you will need a `Java JDK installed`_.
Ensure that the ``JAVA_HOME`` environment variable is set.

.. image:: azure-envvar.png
   :alt: Environment Variables

Next, `download the CrateDB Tarball`_, expand it, and move it to a convenient
location.


Configure CrateDB and Windows
-----------------------------

*Note that these instructions need to be followed on each VM in your cluster.*

Edit the *config/crate.yml* configuration file in the expanded directory to
make the same changes noted above for running CrateDB on Azure and Linux.

Allow the ports CrateDB uses through the Windows Firewall:

.. image:: azure-port.gif
   :alt: Firewall configuration

Start CrateDB by running ``bin/crate``.


.. _3.3: https://crate.io/docs/crate/reference/en/3.3/config/cluster.html#discovery
.. _download the CrateDB Tarball: https://crate.io/docs/crate/tutorials/en/latest/install.html#install-adhoc
.. _find more details: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#discovery-via-dns
.. _Java JDK installed: https://www.oracle.com/java/technologies/downloads/#java8
.. _latest: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#discovery
.. _Learn how to install here: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
.. 
_the standard process for Linux installation: https://crate.io/docs/crate/tutorials/en/latest/install.html diff --git a/docs/cloud/azure/azure-vn-subnet-sg.png b/docs/cloud/azure/azure-vn-subnet-sg.png new file mode 100644 index 0000000..52e6187 Binary files /dev/null and b/docs/cloud/azure/azure-vn-subnet-sg.png differ diff --git a/docs/cloud/azure/index.rst b/docs/cloud/azure/index.rst new file mode 100644 index 0000000..63b1f3b --- /dev/null +++ b/docs/cloud/azure/index.rst @@ -0,0 +1,15 @@ +============================== +Run CrateDB on Microsoft Azure +============================== + +Microsoft Azure is the second largest and fastest growing provider of Cloud +Services in the world. It offers a wide variety of options including Windows +servers, containers, application images and much more. + +.. rubric:: Table of contents + +.. toctree:: + :maxdepth: 1 + + azure-vm-setup + azure-terraform-setup diff --git a/docs/cloud/index.rst b/docs/cloud/index.rst new file mode 100644 index 0000000..ec1a1bd --- /dev/null +++ b/docs/cloud/index.rst @@ -0,0 +1,15 @@ +========================= +CrateDB and cloud hosting +========================= + +CrateDB provides packages and executables that will work on any operating +system capable of running Java. + +.. rubric:: Table of contents + +.. 
toctree:: + :maxdepth: 3 + :titlesonly: + + aws/index + azure/index diff --git a/docs/cloud/packet-project.png b/docs/cloud/packet-project.png new file mode 100644 index 0000000..349f33c Binary files /dev/null and b/docs/cloud/packet-project.png differ diff --git a/docs/cloud/packet-servers.png b/docs/cloud/packet-servers.png new file mode 100644 index 0000000..6968eb7 Binary files /dev/null and b/docs/cloud/packet-servers.png differ diff --git a/docs/containers/containership-add-crate.png b/docs/containers/containership-add-crate.png new file mode 100644 index 0000000..f059744 Binary files /dev/null and b/docs/containers/containership-add-crate.png differ diff --git a/docs/containers/containership-create-cluster.png b/docs/containers/containership-create-cluster.png new file mode 100644 index 0000000..7f86044 Binary files /dev/null and b/docs/containers/containership-create-cluster.png differ diff --git a/docs/containers/containership-menu.png b/docs/containers/containership-menu.png new file mode 100644 index 0000000..1b86699 Binary files /dev/null and b/docs/containers/containership-menu.png differ diff --git a/docs/containers/containership-providers.png b/docs/containers/containership-providers.png new file mode 100644 index 0000000..c8f2017 Binary files /dev/null and b/docs/containers/containership-providers.png differ diff --git a/docs/containers/dc-create-cluster.png b/docs/containers/dc-create-cluster.png new file mode 100644 index 0000000..802f5be Binary files /dev/null and b/docs/containers/dc-create-cluster.png differ diff --git a/docs/containers/docker-cloud-create-cluster.png b/docs/containers/docker-cloud-create-cluster.png new file mode 100644 index 0000000..802f5be Binary files /dev/null and b/docs/containers/docker-cloud-create-cluster.png differ diff --git a/docs/containers/docker-cloud-droplets.png b/docs/containers/docker-cloud-droplets.png new file mode 100644 index 0000000..64cb5b0 Binary files /dev/null and 
b/docs/containers/docker-cloud-droplets.png differ diff --git a/docs/containers/docker-cloud-endpoint.png b/docs/containers/docker-cloud-endpoint.png new file mode 100644 index 0000000..1a1d6ec Binary files /dev/null and b/docs/containers/docker-cloud-endpoint.png differ diff --git a/docs/containers/docker-cloud-node-dashboard-spread.png b/docs/containers/docker-cloud-node-dashboard-spread.png new file mode 100644 index 0000000..9376b67 Binary files /dev/null and b/docs/containers/docker-cloud-node-dashboard-spread.png differ diff --git a/docs/containers/docker-cloud-nodes.png b/docs/containers/docker-cloud-nodes.png new file mode 100644 index 0000000..1b3c188 Binary files /dev/null and b/docs/containers/docker-cloud-nodes.png differ diff --git a/docs/containers/docker-cloud-running-services.png b/docs/containers/docker-cloud-running-services.png new file mode 100644 index 0000000..2978544 Binary files /dev/null and b/docs/containers/docker-cloud-running-services.png differ diff --git a/docs/containers/docker.rst b/docs/containers/docker.rst new file mode 100644 index 0000000..f5e4c36 --- /dev/null +++ b/docs/containers/docker.rst @@ -0,0 +1,513 @@ +.. highlight:: sh + +.. _cratedb-docker: + +===================== +Run CrateDB on Docker +===================== + +CrateDB and `Docker`_ are a great match thanks to CrateDB’s `horizontally +scalable`_ `shared-nothing architecture`_ that lends itself well to +`containerization`_. + +This document covers the essentials of running CrateDB on Docker. + +.. NOTE:: + + If you are just getting started with CrateDB and Docker, check out the + introductory guides for `spinning up your first CrateDB instance`_. + +.. SEEALSO:: + + A guide for running CrateDB on :ref:`Kubernetes `. + + The official `CrateDB Docker image`_. + +.. rubric:: Table of contents + +.. 
contents:: + :local: + + +Quick start +=========== + + +Creating a cluster +------------------ + +To get started with CrateDB and Docker, you will create a three-node cluster +on your dev machine. The cluster will run on a dedicated network and will +require the first two nodes, ``crate01`` and ``crate02``, to vote which one +is the master. The third node, ``crate03``, will simply join the cluster +with no vote. + +To create the `user-defined network`_, run the command:: + + sh$ docker network create crate + +You should then be able to see something like this: + +.. code-block:: text + + sh$ docker network ls + NETWORK ID NAME DRIVER SCOPE + 1bf1b7acd66f bridge bridge local + 51cebbdf7d2b crate bridge local + 5b8e6fbe9ab6 host host local + 8baa149b6986 none null local + +Any CrateDB container put into the ``crate`` network will be able to resolve +other CrateDB containers by name. Each container will run a single node, which +is identified by its node name. In this guide, container ``crate01`` will run +node ``crate01``, container ``crate02`` will run node ``crate02``, and +container ``crate03`` will run cluster node ``crate03``. + +You can then create your first CrateDB container and node, like this:: + + sh$ docker run --rm -d \ + --name=crate01 \ + --net=crate \ + -p 4201:4200 \ + --env CRATE_HEAP_SIZE=2g \ + crate -Cnetwork.host=_site_ \ + -Cnode.name=crate01 \ + -Cdiscovery.seed_hosts=crate02,crate03 \ + -Ccluster.initial_master_nodes=crate01,crate02 \ + -Cgateway.expected_nodes=3 -Cgateway.recover_after_nodes=3 + +Breaking the command down: + +- Creates and runs a container called ``crate01`` (--name) in detached + mode (-d). The container will automatically be removed on exit (--rm), + and all its internal data will be lost. If you would like to avoid this, + you can mount a dedicated volume (-v) for the container (each container + would need its own dedicated folder on your dev machine, see + :ref:`docker-compose` as reference). 

- Puts the container into the ``crate`` network and maps port ``4201`` on your
  host machine to port ``4200`` on the container (admin UI).
- Defines the environment variable ``CRATE_HEAP_SIZE``, which is used by
  CrateDB to allocate 2G for its heap.
- Runs the command ``crate`` inside the container with parameters:

  * ``network.host``: The ``_site_`` value results in the binding of the
    CrateDB process to a site-local IP address.
  * ``node.name``: Defines the node's name as ``crate01`` (used by master
    election).
  * ``discovery.seed_hosts``: This parameter lists the other hosts in the
    cluster. The format is a comma-separated list of ``host:port`` entries,
    where the port defaults to the ``transport.tcp.port`` setting. Each
    node's list must contain all the other hosts in the cluster. Because the
    nodes may be started in any order, you might see connection exceptions in
    the log files until all nodes are running and interconnected.
  * ``cluster.initial_master_nodes``: Defines the list of master-eligible
    node names that will participate in the election of the first master
    (cluster bootstrap). If this parameter is not defined, the node is
    expected to join an already-formed cluster. This parameter is only
    relevant for the first election.
  * ``gateway.expected_nodes`` and ``gateway.recover_after_nodes``: Specify
    how many nodes you expect in the cluster and how many nodes must be
    discovered before the cluster state is recovered.

.. NOTE::

   If this command aborts with an error, consult the
   :ref:`docker-troubleshooting` section for help.

Verify that the node is running with ``docker ps``; you should see something
like this:

.. code-block:: text

    sh$ docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    f79116373877 crate "/docker-entrypoin..." 
16 seconds ago Up 15 seconds 4300/tcp, 5432-5532/tcp, 0.0.0.0:4201->4200/tcp crate01 + +You can have a look at the container's logs in tail mode like this: + +.. code-block:: text + + sh$ docker logs -f crate01 + +.. NOTE:: + + To exit the logs view, press ctrl+C. + +You can visit the admin UI in your browser with this URL: + +.. code-block:: text + + http://localhost:4201/ + +Select the *Cluster* icon from the left-hand navigation, and you should see a +page that lists a single node. + +Now add the second node, ``crate02``, to the cluster:: + + sh$ docker run --rm -d \ + --name=crate02 \ + --net=crate \ + -p 4202:4200 \ + --env CRATE_HEAP_SIZE=2g \ + crate -Cnetwork.host=_site_ \ + -Cnode.name=crate02 \ + -Cdiscovery.seed_hosts=crate01,crate03 \ + -Ccluster.initial_master_nodes=crate01,crate02 \ + -Cgateway.expected_nodes=3 -Cgateway.recover_after_nodes=2 + +Notice here that: + +- You updated the container and node name to ``crate02``. +- You updated the port mapping, so that port ``4202`` on your host is mapped + to ``4200`` on the container. +- You set the parameter ``discovery.seed_hosts`` to contain the other hosts of + the cluster. +- ``cluster.initial_master_nodes``: Since only nodes ``crate01`` and ``crate02`` + will participate in the election of the first master, this setting is unchanged. + +Now, if you go back to the admin UI you opened earlier, or visit the admin UI +of the node you just created (located at ``http://localhost:4202/``) you +should see two nodes. + +You can now add ``crate03`` like this:: + + sh$ docker run --rm -d \ + --name=crate03 \ + --net=crate -p 4203:4200 \ + --env CRATE_HEAP_SIZE=2g \ + crate -Cnetwork.host=_site_ \ + -Cnode.name=crate03 \ + -Cdiscovery.seed_hosts=crate01,crate02 \ + -Cgateway.expected_nodes=3 -Cgateway.recover_after_nodes=2 + +Notice here that: + +- You updated the container and node name to ``crate03``. +- You updated the port mapping, so that port ``4203`` on your host is mapped + to ``4200`` on the container. 
+- You set parameter ``discovery.seed_hosts`` to contain the other hosts of the + cluster. +- ``cluster.initial_master_nodes``: This setting is removed since only nodes + ``crate01`` and ``crate02`` will participate in the election of the first + master. + + +Success! You just created a three-node CrateDB cluster with Docker. + +.. NOTE:: + + This is only a quick start example and you will notice some failing checks + in the admin UI. For a more robust cluster, you should, at the very least, + configure the `Metadata Gateway`_ and `Discovery`_ settings. + + +.. _docker-troubleshooting: + +Troubleshooting +--------------- + +The most common issue when running CrateDB on Docker is a failing +:ref:`bootstrap check ` because the *memory map limit* +is too low. This can be :ref:`adjusted on the host system `. + +If the limit cannot be adjusted on the host system, the memory map limit check +can be bypassed by passing the ``-Cnode.store.allow_mmapfs=false`` option to +the ``crate`` command:: + + sh$ docker run -d --name=crate01 \ + --net=crate -p 4201:4200 --env CRATE_HEAP_SIZE=2g \ + crate -Cnetwork.host=_site_ \ + -Cnode.store.allow_mmapfs=false + +.. CAUTION:: + + This will result in degraded performance. + +You can also start a single node without any bootstrap checks by passing the +``-Cdiscovery.type=single-node`` option:: + + sh$ docker run -d --name=crate01 \ + --net=crate -p 4201:4200 \ + --env CRATE_HEAP_SIZE=2g \ + crate -Cnetwork.host=_site_ \ + -Cdiscovery.type=single-node + +.. NOTE:: + + This means that the node cannot form a cluster with any other nodes. + + +Taking it further +----------------- + +`CrateDB settings `_ are set +using the ``-C`` flag, as shown in the examples above. + +Check out the `Docker docs `_ +for more Docker-specific features that CrateDB can leverage. + + +CrateDB Shell +------------- + +The CrateDB Shell, ``crash``, is bundled with the Docker image. 
+ +If you wanted to run ``crash`` inside a user-defined network called ``crate`` +and connect to three hosts named ``crate01``, ``crate02``, and ``crate03`` +(i.e. the example covered in the `Creating a Cluster`_ section) you could run:: + + $ docker run --rm -ti \ + --net=crate crate \ + crash --hosts crate01 crate02 crate03 + + +.. _docker-compose: + +Docker Compose +============== + +Docker's Compose tool allows developers to define and run multi-container +Docker applications that can be started with a single ``docker-compose up`` +command. + +Read about Docker Compose specifics `here `_. + +You can define the services that make up your app in a `docker-compose.yml` +file. To recreate the three-node cluster in the previous example, you can +define your services like this: + +.. code-block:: yaml + + version: '3.8' + services: + cratedb01: + image: crate:latest + ports: + - "4201:4200" + volumes: + - /tmp/crate/01:/data + command: ["crate", + "-Ccluster.name=crate-docker-cluster", + "-Cnode.name=cratedb01", + "-Cnode.data=true", + "-Cnetwork.host=_site_", + "-Cdiscovery.seed_hosts=cratedb02,cratedb03", + "-Ccluster.initial_master_nodes=cratedb01,cratedb02,cratedb03", + "-Cgateway.expected_nodes=3", + "-Cgateway.recover_after_nodes=2"] + deploy: + replicas: 1 + restart_policy: + condition: on-failure + environment: + - CRATE_HEAP_SIZE=2g + + cratedb02: + image: crate:latest + ports: + - "4202:4200" + volumes: + - /tmp/crate/02:/data + command: ["crate", + "-Ccluster.name=crate-docker-cluster", + "-Cnode.name=cratedb02", + "-Cnode.data=true", + "-Cnetwork.host=_site_", + "-Cdiscovery.seed_hosts=cratedb01,cratedb03", + "-Ccluster.initial_master_nodes=cratedb01,cratedb02,cratedb03", + "-Cgateway.expected_nodes=3", + "-Cgateway.recover_after_nodes=2"] + deploy: + replicas: 1 + restart_policy: + condition: on-failure + environment: + - CRATE_HEAP_SIZE=2g + + cratedb03: + image: crate:latest + ports: + - "4203:4200" + volumes: + - /tmp/crate/03:/data + command: 
["crate",
            "-Ccluster.name=crate-docker-cluster",
            "-Cnode.name=cratedb03",
            "-Cnode.data=true",
            "-Cnetwork.host=_site_",
            "-Cdiscovery.seed_hosts=cratedb01,cratedb02",
            "-Ccluster.initial_master_nodes=cratedb01,cratedb02,cratedb03",
            "-Cgateway.expected_nodes=3",
            "-Cgateway.recover_after_nodes=2"]
        deploy:
          replicas: 1
          restart_policy:
            condition: on-failure
        environment:
          - CRATE_HEAP_SIZE=2g

In the file above:

- You specified the latest `compose file version`_.
- You created three CrateDB services that pull the latest CrateDB Docker
  image and map the ports manually.
- You created a file system volume per instance and defined a set of
  configuration parameters (``-C``).
- You defined some deploy settings and an environment variable for the heap
  size.
- Network settings no longer need to be defined in the latest compose file
  version because a `default bridge network`_ will be created. If you are
  using multiple hosts and want to use an overlay network, you will need to
  explicitly define that.
- The start order of the containers is not deterministic, but all three
  containers should be up and running before the election of the first
  master node takes place.


Best practices
==============


One container per host
----------------------

For performance reasons, we strongly recommend that you only run one container
per host machine.

If you are running one container per machine, you can map the container ports
to the host ports so that the host acts like a native installation. For
example::

    $ docker run -d -p 4200:4200 -p 4300:4300 -p 5432:5432 crate \
        crate -Cnetwork.host=_site_


Persistent data directory
-------------------------

Docker containers are ephemeral, meaning that containers are expected to come
and go, and any data inside them is lost when the container is removed.
For this reason, you should mount a persistent ``data`` directory on your host
machine to the ``/data`` directory inside the container::

    $ docker run -d -v /srv/crate/data:/data crate \
        crate -Cnetwork.host=_site_

Here, ``/srv/crate/data`` is an example path, and should be replaced with the
path to your host machine's ``data`` directory.

See the `Docker volume`_ documentation for more help.


Custom configuration
--------------------

If you want to use a custom configuration, it is recommended that you mount
configuration files on the host machine to the appropriate path inside the
container. That way, your configuration will not be lost if the container is
removed.

Here is an example of how you could mount the ``crate.yml`` config file::

    $ docker run -d \
        -v /srv/crate/config/crate.yml:/crate/config/crate.yml crate \
        crate -Cnetwork.host=_site_

Here, ``/srv/crate/config/crate.yml`` is an example path, and should be
replaced with the path to your host machine's ``crate.yml`` file.


Troubleshooting
===============

The official `CrateDB Docker image`_ ships with a liveness `healthcheck`_
configured.

This healthcheck will flag a problem if the CrateDB process crashes or hangs
inside the container without terminating.

If you use `Docker Swarm`_ and are experiencing trouble starting your Docker
containers, try deactivating the healthcheck.

You can do that by editing your `Docker Stack YAML file`_:

.. code-block:: yaml

    healthcheck:
      disable: true


.. _resource_constraints:

Resource constraints
====================

To avoid overallocation of resources, you may want to consider setting
constraints on CPU and memory if you plan to run multiple CrateDB containers
on a single machine.


Bootstrap checks
----------------

When using Docker, CrateDB binds by default to a site-local IP address on the
system (e.g., ``192.168.0.1``) and performs a number of checks during
bootstrap. 
The settings listed in `Bootstrap Checks`_ must be addressed on +the Docker **host system** in order to start CrateDB successfully and when +`going into production`_. + + +Memory +------ + +You must calculate and explicitly `set the maximum memory`_ that the container +can use. This is dependent on your host system and should typically be as high +as possible. + +You must then calculate the appropriate heap size (typically half the container's +memory limit, see `CRATE_HEAP_SIZE`_ for details) and pass this to CrateDB, +which in turn passes it to the JVM. + +It is not necessary to configure swap memory since CrateDB does not use swap. + + +CPU +--- + +You must calculate and explicitly `set the maximum number of CPUs`_ that the +container can use. This is dependent on your host system and should typically +be as high as possible. + + +Combined configuration +---------------------- + +If you want the container to use a maximum of 1.5 CPUs, a maximum of 2 GB +memory, with a heap size of 1 GB, you could configure everything at once. For +example:: + + $ docker run -d \ + --cpus 1.5 \ + --memory 2g \ + --env CRATE_HEAP_SIZE=1g \ + crate \ + crate -Cnetwork.host=_site_ + + +.. _Bootstrap Checks: https://crate.io/docs/crate/howtos/en/latest/admin/bootstrap-checks.html +.. _compose file version: https://docs.docker.com/compose/compose-file/compose-versioning/ +.. _containerization: https://www.docker.com/resources/what-container +.. _CRATE_HEAP_SIZE: https://crate.io/docs/crate/reference/en/latest/config/environment.html#conf-env-heap-size +.. _CrateDB Docker image: https://hub.docker.com/_/crate/ +.. _default bridge network: https://docs.docker.com/network/#network-drivers +.. _Discovery: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#discovery +.. _Docker Stack YAML file: https://docs.docker.com/docker-cloud/apps/stack-yaml-reference/ +.. _Docker Swarm: https://docs.docker.com/engine/swarm/ +.. 
_Docker volume: https://docs.docker.com/engine/tutorials/dockervolumes/
.. _Docker: https://www.docker.com/
.. _going into production: https://crate.io/docs/crate/howtos/en/latest/going-into-production.html
.. _healthcheck: https://docs.docker.com/engine/reference/builder/#healthcheck
.. _horizontally scalable: https://en.wikipedia.org/wiki/Scalability#Horizontal_(scale_out)_and_vertical_scaling_(scale_up)
.. _Metadata Gateway: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#metadata-gateway
.. _running Docker locally: https://crate.io/docs/crate/tutorials/en/latest/install.html#docker
.. _set the maximum memory: https://docs.docker.com/config/containers/resource_constraints/#memory
.. _set the maximum number of CPUs: https://docs.docker.com/config/containers/resource_constraints/#cpu
.. _shared-nothing architecture : https://en.wikipedia.org/wiki/Shared-nothing_architecture
.. _spinning up your first CrateDB instance: https://crate.io/docs/crate/tutorials/en/latest/install.html#docker
.. _user-defined network: https://docs.docker.com/network/bridge/
diff --git a/docs/containers/index.rst b/docs/containers/index.rst
new file mode 100644
index 0000000..178c274
--- /dev/null
+++ b/docs/containers/index.rst
@@ -0,0 +1,14 @@
======================
CrateDB and containers
======================

CrateDB is ideal for containerized environments: creating and scaling a cluster
takes minutes, and your valuable data is always in sync and available.

.. rubric:: Table of contents

.. toctree::
   :maxdepth: 1

   docker
   kubernetes
diff --git a/docs/containers/kubernetes.rst b/docs/containers/kubernetes.rst
new file mode 100644
index 0000000..0da6fe2
--- /dev/null
+++ b/docs/containers/kubernetes.rst
@@ -0,0 +1,414 @@
.. 
_cratedb-kubernetes:

=========================
Run CrateDB on Kubernetes
=========================

CrateDB and `Docker`_ are a great match thanks to CrateDB’s `horizontally
scalable`_ `shared-nothing architecture`_ that lends itself well to
`containerization`_.

`Kubernetes`_ is an open-source container orchestration system for the
management, deployment, and scaling of containerized systems.

Together, Docker and Kubernetes are a fantastic way to deploy and scale CrateDB.

.. NOTE::

   While Kubernetes works with a variety of container technologies, this
   document only covers its use with Docker.

.. SEEALSO::

   A complementary blog post miniseries that walks you through the process of
   `setting up your first CrateDB cluster on Kubernetes`_.

   A lower-level introduction to :ref:`running CrateDB on Docker `.

   A guide to :ref:`scaling CrateDB on Kubernetes `.

   The official `CrateDB Docker image`_.

.. rubric:: Table of contents

.. contents::
   :local:


Prerequisites
=============

This document assumes `familiarity with Kubernetes`_.

Before continuing, you should already have a Kubernetes cluster up and running
with at least one master node and one worker node.

.. SEEALSO::

   You can use `kubeadm`_ to bootstrap a Kubernetes cluster by hand.

   Alternatively, cloud services such as `Azure Kubernetes Service`_ or the
   `Amazon Kubernetes Service`_ can do this for you.


Managing Kubernetes
===================

Kubernetes deployments can be `managed`_ in many different ways. Which one
makes sense for you will depend on your situation.

This section shows you three basic commands you can use to create and update a
resource.

You can create a resource like so:

.. 
code-block:: console + + sh$ kubectl create -f crate-controller.yaml --namespace crate + statefulset.apps/crate-controller created + +Here, we are creating a `StatefulSet`_ controller in the ``crate`` namespace +using a configuration file named ``crate-controller.yaml``. + +You can update the resource after editing the configuration file, like so: + +.. code-block:: console + + sh$ kubectl replace -f crate-controller.yaml --namespace crate + statefulset.apps/crate replaced + +If your StatefulSet uses the default `rolling update strategy`_, this command will +restart your pods with the new configuration one-by-one. + +.. WARNING:: + + If you use a regular ``replace`` command, pods are restarted, and any + `persistent volumes`_ will still be intact. + + If, however, you pass the ``--force`` option to the ``replace`` command, + resources are deleted and recreated, and the pods will come back up with no + data. + + +Configuration +============= + +This section provides four Kubernetes `configuration`_ snippets that can be +used to create a three-node CrateDB cluster. + + +Services +-------- + +A Kubernetes pod is ephemeral and so are its network addresses. Typically, this +means that it is inadvisable to connect to pods directly. + +A Kubernetes `service`_ allows you to define a network access policy for a set +of pods. You can then use the network address of the service to communicate +with the pods. The network address of the service remains static even though the +constituent pods may come and go. + +For our purposes, we define two services: an `internal service`_ and an +`external service`_. + + +Internal service +................ + +CrateDB uses the internal service for `node discovery via DNS`_ and +:ref:`inter-node communication `. + +Here's an example configuration snippet: + +.. code-block:: yaml + + kind: Service + apiVersion: v1 + metadata: + name: crate-internal-service + labels: + app: crate + spec: + # A static IP address is assigned to this service. 
This IP address is + # only reachable from within the Kubernetes cluster. + type: ClusterIP + ports: + # Port 4300 for inter-node communication. + - port: 4300 + name: crate-internal + selector: + # Apply this to all nodes with the `app:crate` label. + app: crate + + +External service +................ + +The external service provides a stable network address for external clients. + +Here's an example configuration snippet: + +.. code-block:: yaml + + kind: Service + apiVersion: v1 + metadata: + name: crate-external-service + labels: + app: crate + spec: + # Create an externally reachable load balancer. + type: LoadBalancer + ports: + # Port 4200 for HTTP clients. + - port: 4200 + name: crate-web + # Port 5432 for PostgreSQL wire protocol clients. + - port: 5432 + name: postgres + selector: + # Apply this to all nodes with the `app:crate` label. + app: crate + +.. NOTE:: + + In production, a `LoadBalancer`_ service type is typically only available on + hosted cloud platforms that provide externally managed load balancers. + However, an `ingress`_ resource can be used to provide internally managed + load balancers. + + For local development, `Minikube`_ provides a LoadBalancer service. + + +Controller +---------- + +A Kubernetes `pod`_ is a group of one or more containers. Pods are designed to +provide discrete units of functionality. + +CrateDB nodes are self-contained, so we don't need to use more than one +container in a pod. We can configure our pods as a single container running +CrateDB. + +Pods are designed to be fungible computing units, meaning they can be created or +destroyed at will. 
This, in turn, means that: + +- A cluster can be scaled in or out by destroying or creating pods + +- A cluster can be healed by replacing pods + +- A cluster can be rebalanced by rescheduling pods (i.e., destroying the pod on + one Kubernetes node and recreating it on a new node) + +However, CrateDB nodes that leave and then want to rejoin a cluster must retain +their state. That is, they must continue to use the same name and must continue +to use the same data on disk. + +For this reason, we use the `StatefulSet`_ controller to define our cluster, +which ensures that CrateDB nodes retain state across restarts or rescheduling. + +The following configuration snippet defines a controller for a three-node +CrateDB 3.0.5 cluster: + +.. code-block:: yaml + + kind: StatefulSet + apiVersion: "apps/v1" + metadata: + # This is the name used as a prefix for all pods in the set. + name: crate + spec: + serviceName: "crate-set" + # Our cluster has three nodes. + replicas: 3 + selector: + matchLabels: + # The pods in this cluster have the `app:crate` app label. + app: crate + template: + metadata: + labels: + app: crate + spec: + # InitContainers run before the main containers of a pod are + # started, and they must terminate before the primary containers + # are initialized. Here, we use one to set the correct memory + # map limit. + initContainers: + - name: init-sysctl + image: busybox + imagePullPolicy: IfNotPresent + command: ["sysctl", "-w", "vm.max_map_count=262144"] + securityContext: + privileged: true + # This final section is the core of the StatefulSet configuration. + # It defines the container to run in each pod. + containers: + - name: crate + # Use the CrateDB 4.2.4 Docker image. + image: crate:4.2.4 + # Pass in configuration to CrateDB via command-line options. + # We are setting the name of the node's explicitly, which is + # needed to determine the initial master nodes. These are set to + # the name of the pod. 
            # We are using the SRV records provided by Kubernetes to discover
            # nodes within the cluster.
            args:
              - -Cnode.name=${POD_NAME}
              - -Ccluster.name=${CLUSTER_NAME}
              - -Ccluster.initial_master_nodes=crate-0,crate-1,crate-2
              - -Cdiscovery.seed_providers=srv
              - -Cdiscovery.srv.query=_crate-internal._tcp.crate-internal-service.${NAMESPACE}.svc.cluster.local
              - -Cgateway.recover_after_nodes=2
              - -Cgateway.expected_nodes=${EXPECTED_NODES}
              - -Cpath.data=/data
            volumeMounts:
              # Mount the `/data` directory as a volume named `data`.
              - mountPath: /data
                name: data
            resources:
              limits:
                # How much memory each pod gets.
                memory: 512Mi
            ports:
              # Port 4300 for inter-node communication.
              - containerPort: 4300
                name: crate-internal
              # Port 4200 for HTTP clients.
              - containerPort: 4200
                name: crate-web
              # Port 5432 for PostgreSQL wire protocol clients.
              - containerPort: 5432
                name: postgres
            # Environment variables passed through to the container.
            env:
              # This variable is detected by CrateDB.
              - name: CRATE_HEAP_SIZE
                value: "256m"
              # The rest of these variables are used in the command-line
              # options.
              - name: EXPECTED_NODES
                value: "3"
              - name: CLUSTER_NAME
                value: "my-crate"
              - name: POD_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: NAMESPACE
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.namespace
      volumeClaimTemplates:
        # Use persistent storage.
        - metadata:
            name: data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi

.. CAUTION::

    If you are not running CrateDB 4.2.4, you must adapt this example
    configuration to your specific CrateDB version.

    Specifically, the ``discovery.zen.minimum_master_nodes`` setting is :ref:`no
    longer used ` in CrateDB versions 4.x and above.

.. SEEALSO::

    CrateDB supports `configuration via command-line options`_ and `node
    discovery via DNS`_.

    :ref:`Configure memory ` by hand for optimum performance.
    You must set memory map limits correctly. Consult the :ref:`bootstrap checks
    ` documentation for more information.


Persistent volume
-----------------

As mentioned in the `Controller`_ section, CrateDB containers must be able to
retain state between restarts and rescheduling. Stateful containers can be
achieved with `persistent volumes`_.

Persistent volumes can be provisioned in many different ways, so the specific
configuration will depend on your setup.


Microsoft Azure
...............

You can create a `StorageClass`_ for `Azure Managed Disks`_ with a
configuration snippet like this:

.. code-block:: yaml

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        app.kubernetes.io/managed-by: kube-addon-manager
        app.kubernetes.io/name: crate-premium
        app.kubernetes.io/part-of: infrastructure
        app.kubernetes.io/version: "0.1"
        storage-tier: premium
        volume-type: ssd
      name: crate-premium
    parameters:
      kind: Managed
      storageaccounttype: Premium_LRS
    provisioner: kubernetes.io/azure-disk
    reclaimPolicy: Delete
    volumeBindingMode: Immediate

You can then use this in your controller configuration with something like this:

.. code-block:: yaml

    [...]
    volumeClaimTemplates:
      - metadata:
          name: persistent-data
        spec:
          # This will create one 100GB read-write Azure Managed Disks volume
          # for every CrateDB pod.
          accessModes: [ "ReadWriteOnce" ]
          storageClassName: crate-premium
          resources:
            requests:
              storage: 100G


.. _Amazon Kubernetes Service: https://aws.amazon.com/eks/
.. _Azure Kubernetes Service: https://azure.microsoft.com/en-us/services/kubernetes-service/
.. _Azure Managed Disks: https://azure.microsoft.com/en-us/pricing/details/managed-disks/
.. _configuration via command-line options: https://crate.io/docs/crate/reference/en/latest/config/index.html
..
_configuration: https://kubernetes.io/docs/concepts/configuration/overview/ +.. _containerization: https://www.docker.com/resources/what-container +.. _CrateDB Docker image: https://hub.docker.com/_/crate/ +.. _Docker: https://www.docker.com/ +.. _familiarity with Kubernetes: https://kubernetes.io/docs/tutorials/kubernetes-basics/ +.. _horizontally scalable: https://en.wikipedia.org/wiki/Scalability#Horizontal_(scale_out)_and_vertical_scaling_(scale_up) +.. _Ingress: https://kubernetes.io/docs/concepts/services-networking/ingress/ +.. _kubeadm: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/ +.. _Kubernetes: https://kubernetes.io/ +.. _LoadBalancer: https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer +.. _managed: https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/ +.. _Minikube: https://kubernetes.io/docs/setup/minikube/ +.. _node discovery via DNS: https://crate.io/docs/crate/reference/en/latest/config/cluster.html#discovery-via-dns +.. _persistent volume: https://kubernetes.io/docs/concepts/storage/persistent-volumes/ +.. _persistent volumes: https://kubernetes.io/docs/concepts/storage/persistent-volumes/ +.. _pod: https://kubernetes.io/docs/concepts/workloads/pods/ +.. _rolling update strategy: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates +.. _service: https://kubernetes.io/docs/concepts/services-networking/service/ +.. _services: https://kubernetes.io/docs/concepts/services-networking/service/ +.. _setting up your first CrateDB cluster on Kubernetes: https://crate.io/a/run-your-first-cratedb-cluster-on-kubernetes-part-one/ +.. _shared-nothing architecture : https://en.wikipedia.org/wiki/Shared-nothing_architecture +.. _StatefulSet: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/ +.. 
_StorageClass: https://kubernetes.io/docs/concepts/storage/storage-classes/

diff --git a/docs/create-sharded-table.rst b/docs/create-sharded-table.rst
deleted file mode 100644
index b104520..0000000
--- a/docs/create-sharded-table.rst
+++ /dev/null
@@ -1,126 +0,0 @@

.. _create-sharded-table:

====================
Create sharded table
====================

One core concept CrateDB uses to distribute data across a cluster is
:ref:`sharding `. CrateDB splits every table into a
configured number of shards, which are distributed evenly across the cluster.
You can think of shards as a self-contained part of a table that includes both
a subset of records and corresponding indexing structures. If we
:ref:`create a table ` like the following:

.. code-block:: psql

    CREATE TABLE first_table (
        ts TIMESTAMP,
        val DOUBLE PRECISION
    );

The table is by default split into several shards, even on a single-node
cluster. You can check this by running:

.. code-block:: psql

    SHOW CREATE TABLE first_table;

Which should output the following:

.. code-block:: psql

    CREATE TABLE IF NOT EXISTS "doc"."first_table" (
       "ts" TIMESTAMP WITH TIME ZONE,
       "val" DOUBLE PRECISION
    )
    CLUSTERED INTO 4 SHARDS

By default, ingested data is distributed evenly across all available shards.
Although you can influence that distribution by specifying a routing column, in
many cases it is best to keep the default settings.


Partitioning
============

CrateDB also supports splitting up data across another dimension with
:ref:`partitioning `. You can think of a
partition as a set of shards. For each partition, the number of shards defined
by ``CLUSTERED INTO x SHARDS`` is created when the first row with a specific
``partition key`` is inserted.

In the following example - which represents a very simple time-series use case
- we added another column ``part``, which automatically generates the current
month upon insert from the ``ts`` column.
The ``part`` column is further used
as the ``partition key``.

.. code-block:: psql

    CREATE TABLE second_table (
        ts TIMESTAMP,
        val DOUBLE PRECISION,
        part GENERATED ALWAYS AS date_trunc('month', ts)
    ) PARTITIONED BY (part);

When inserting the first row with the following statement:

.. code-block:: psql

    INSERT INTO second_table (ts, val) VALUES (1617823229974, 1.23);

and then querying for the total number of shards for the table:

.. code-block:: psql

    SELECT COUNT(*) FROM sys.shards
    WHERE table_name = 'second_table';

we can see that the table is split into 4 shards.

Adding another row to the table with a different partition key (i.e., a
different month):

.. code-block:: psql

    INSERT INTO second_table (ts, val) VALUES (1620415701974, 2.31);

we can see that there are now 8 shards for the table ``second_table`` in the
cluster.


.. danger::

    **Over-sharding and over-partitioning**

    Sharding can drastically improve performance on large datasets.
    However, having too many small shards will most likely degrade performance.
    Over-sharding and over-partitioning are common flaws leading to overall
    poor performance.

    **As a rule of thumb, a single shard should hold somewhere between 5 - 100
    GB of data.**

    To avoid over-sharding, CrateDB by default limits the number of shards per
    node to 1000. Any operation that would exceed that limit leads to an
    exception.



.. tip::

    **Example**: You want to create a *partitioned table* on your *single-node
    cluster* to store time-series data with the following assumptions:

    - Inserts: 1,000 records / s
    - Record size: 128 byte / record
    - Throughput: 125 KB / s or 10.3 GB / day

    Depending on query patterns, a good partition key would most likely be the
    extracted week or month (considering 4 shards per partition). This would
    give an average shard size between 18 GB and 80 GB.

..
note::

    An optimal sharding and partitioning strategy always depends on the
    specific use case and should typically be determined by conducting
    benchmarks across various strategies.

diff --git a/docs/create-user.rst b/docs/create-user.rst
deleted file mode 100644
index f868dd6..0000000
--- a/docs/create-user.rst
+++ /dev/null
@@ -1,70 +0,0 @@

.. _create-user:

===========
Create user
===========


------------
Introduction
------------

This part of the documentation sheds some light on the topics of
:ref:`crate-reference:administration_user_management` and
:ref:`crate-reference:administration-privileges`.

CrateDB ships with a superuser account called "``crate``", which has the
privileges to perform any action. However, with the default configuration, this
superuser can only access CrateDB from the local machine CrateDB has been
installed on. If you are trying to connect from another machine, you are
prompted to enter a username and password.

In order to create a user that can be used to authenticate from a remote
machine, first :ref:`install crash ` or other
:ref:`crate-clients-tools:index` on the same machine you installed CrateDB on.
Then, connect to CrateDB running on ``localhost``.

While you can also perform the steps outlined below within the
:ref:`crate-admin-ui:index` itself, this walkthrough will outline how to do it
using the :ref:`crate-crash:index` on the command line.


-------
Details
-------

Invoke Crash within the terminal of your choice:

.. code-block:: console

    sh$ crash

Add your first user with a secure password to the database:

.. code-block:: psql

    cr> CREATE USER username WITH (password = 'a_secret_password');

Grant all privileges to the newly created user:

.. code-block:: psql

    cr> GRANT ALL PRIVILEGES TO username;

.. image:: _assets/img/create-user.png

Now try navigating to the :ref:`crate-admin-ui:index` in your browser.
In the URL
below, please replace ``cratedb.example.org`` with the host name or IP address
of the machine CrateDB is running on and sign in with your newly created user
account::

    http://cratedb.example.org:4200/

You should see something like this:

.. image:: _assets/img/first-use/admin-ui.png


After creating the user and granting all privileges, you should be able to
continue with :ref:`the guided tour `, connecting to CrateDB from a remote
machine.

diff --git a/docs/first-use.rst b/docs/first-use.rst
deleted file mode 100644
index a0214d2..0000000
--- a/docs/first-use.rst
+++ /dev/null
@@ -1,88 +0,0 @@

.. _use:

=========
First use
=========

Once CrateDB is :ref:`installed and running `, you can start to
interact with the database for the first time.

.. rubric:: Table of contents

.. contents::
    :local:


.. _use-admin-ui:

Introducing the Admin UI
========================

CrateDB ships with a browser-based administration interface called the
:ref:`Admin UI `.

The CrateDB Admin UI runs on every CrateDB node, and you can use it to inspect
and interact with the whole CrateDB cluster in a number of ways.

We will use the Admin UI throughout this section.

Access the Admin UI in your browser using a URL like this::

    http://localhost:4200/

If CrateDB is not running locally, replace ``localhost`` with the hostname
CrateDB is running on.

You should see something like this:

.. image:: _assets/img/first-use/admin-ui.png


.. _use-crash:

Introducing the CrateDB Shell
=============================

The CrateDB Shell (aka Crash) is an interactive command-line interface (CLI)
program for working with CrateDB on your favorite terminal. For further
information about it, please follow up on its documentation at
:ref:`crate-crash:index`.

.. NOTE::

    If you are running CrateDB on a remote machine, you will have to create a
    dedicated user account for accessing the Admin UI. See :ref:`create-user`.


..
_use-more-tutorials: - -Follow more tutorials to get a sense of CrateDB -=============================================== - -If you want to get a feel for using CrateDB to work with time series data, you -are going to need a source of time series data. Fortunately, there are many -ways to generate time series data by sampling the systems running on your local -computer. - -The :ref:`next collection of tutorials ` shows how to generate mock -time series data about the International Space Station (ISS) and write it to -CrateDB using the client of your choice. - - -.. _use-start-building: - -Start building with CrateDB clients and tools -============================================= - -If you'd like to skip the tutorials and start building with CrateDB, you can -find a list of :ref:`crate-clients-tools:index` in a different section of the -documentation. - - -.. _use-dive-in: - -Dive into CrateDB -================= - -Check out the :ref:`crate-howtos:index` for goal oriented topics. Alternatively, -check out the :ref:`crate-reference:index` for a complete reference manual. diff --git a/docs/generate-time-series/cli.rst b/docs/generate-time-series/cli.rst deleted file mode 100644 index ebdb6b8..0000000 --- a/docs/generate-time-series/cli.rst +++ /dev/null @@ -1,299 +0,0 @@ -.. _gts-cli: - -=============================================== -Generate time series data from the command line -=============================================== - -This tutorial will show you how to generate :ref:`mock time series data -` about the `International Space Station`_ (ISS) using the -:ref:`crate-crash:index` and a little bit of `shell scripting`_. - -.. SEEALSO:: - - :ref:`gen-ts` - -.. rubric:: Table of contents - -.. contents:: - :local: - - -Prerequisites -============= - -CrateDB must be :ref:`installed and running `. - -Crash is available as `pip`_ package. :ref:`Install ` it -like this: - -.. 
code-block:: console

    sh$ pip install crash

We have designed the commands in this tutorial to be run directly from the
`command line`_ so that you can experiment with them as you see fit.

You will need the `curl`_ and `jq`_ tools installed.

.. NOTE::

    This tutorial should work in most POSIX-compatible environments (e.g.,
    Linux, macOS, and Windows Cygwin). Please `let us know`_ if you run into
    issues.


Get the current position of the ISS
===================================

`Open Notify`_ is a third-party service that provides an API to consume data
about the current position, or `ground point`_, of the ISS.

The endpoint for this API is ``http://api.open-notify.org/iss-now.json``.

You can query this endpoint using ``curl``:

.. code-block:: console

    sh$ curl -s -w "\n" http://api.open-notify.org/iss-now.json

    {"message": "success", "iss_position": {"latitude": "23.1703", "longitude": "-105.4034"}, "timestamp": 1590394500}

As shown, the endpoint returns a JSON payload, which contains an
``iss_position`` object with ``latitude`` and ``longitude`` data.


Parse the ISS position
======================

The ``jq`` command is a convenient tool to parse JSON payloads on the command
line. You can use the ``|`` character to `pipe`_ the output from ``curl`` into
``jq`` for processing.

For example, to return the whole payload, do this:

.. code-block:: console

    sh$ curl -s http://api.open-notify.org/iss-now.json | jq '.'

    {
      "message": "success",
      "iss_position": {
        "latitude": "21.9711",
        "longitude": "-104.3298"
      },
      "timestamp": 1590394525
    }

The most useful information is the latitude and longitude coordinates. You can
use ``jq`` with a filter to isolate those data points:

.. code-block:: console

    sh$ curl -s http://api.open-notify.org/iss-now.json | \
            jq -r '[.iss_position.longitude, .iss_position.latitude] | @tsv'

    -103.4015  20.9089

You can encapsulate this command with a `shell function`_:

..
code-block:: console - - sh$ position () { \ - curl -s http://api.open-notify.org/iss-now.json | \ - jq -r '[.iss_position.longitude, .iss_position.latitude] | @tsv'; \ - } - -Now, when you want the position, run ``position``: - -.. code-block:: console - - sh$ position - - -102.3230 19.6460 - -To insert these values into an SQL query, you need to format them into a `WKT`_ -string, like so: - -.. code-block:: console - - sh$ echo "POINT ($(position | expand -t 1))" - - POINT (-101.2633 18.3756) - -Encapsulate this command with a function: - -.. code-block:: console - - sh$ wkt_position () { \ - echo "POINT ($(position | expand -t 1))"; \ - } - -Which you can now call using ``wkt_position``: - -.. code-block:: console - - sh$ wkt_position - - POINT (-96.4784 12.3053) - - -Set up CrateDB -============== - -Start an interactive Crash session: - -.. code-block:: console - - sh$ crash --hosts localhost:4200 - -.. NOTE:: - - You can omit the ``--hosts`` argument if CrateDB is running on - ``localhost:4200``. We have included it here for the sake of clarity. - Modify the argument if you wish to connect to a CrateDB node on a different - host or port number. - -Then, :ref:`create a table ` suitable for writing -load averages. - -.. code-block:: psql - - cr> CREATE TABLE iss ( - timestamp TIMESTAMP GENERATED ALWAYS AS CURRENT_TIMESTAMP, - position GEO_POINT - ); - - CREATE OK, 1 row affected (0.726 sec) - -In the :ref:`crate-admin-ui:index`, you should see the new table when you navigate -to the *Tables* screen using the left-hand navigation menu: - -.. image:: ../_assets/img/generate-time-series/table.png - - -Record the ISS position -======================= - -With the table in place, you can start recording the position of the ISS. - -Crash provides a non-interactive mode that you can use to execute SQL -statements directly from the command line. - -First, exit from the interactive Crash session (or open a new terminal). 
Then,
use ``crash`` with the ``--command`` argument to execute an :ref:`INSERT
` query.

.. code-block:: console

    sh$ crash --hosts localhost:4200 \
            --command "INSERT INTO iss (position) VALUES ('$(wkt_position)')"

    CONNECT OK
    INSERT OK, 1 row affected (0.037 sec)

.. WARNING::

    For any real-world application, you must always sanitize your data before
    interpolating it into an SQL query.

Press the up arrow on your keyboard and hit *Enter* to run the same command a
few more times.

When you're done, you can :ref:`select ` that data
back out of CrateDB.

.. code-block:: console

    sh$ crash --hosts localhost:4200 \
            --command 'SELECT * FROM iss ORDER BY timestamp DESC'

    +---------------+---------------------+
    | timestamp     | position            |
    +---------------+---------------------+
    | 1590395103748 | [-82.6328, -6.9134] |
    | 1590395102176 | [-82.6876, -6.8376] |
    | 1590395018584 | [-85.7139, -2.6095] |
    +---------------+---------------------+
    SELECT 3 rows in set (0.105 sec)

Here you have recorded three sets of ISS position coordinates.


Automate the process
====================

Now that you have the key components, you can automate the data collection.

Create a file named ``iss-position.sh``, like this:

.. code-block:: sh

    # Exit immediately if a pipeline returns a non-zero status
    set -e

    position () {
        curl -s http://api.open-notify.org/iss-now.json |
            jq -r '[.iss_position.longitude, .iss_position.latitude] | @tsv';
    }

    wkt_position () {
        echo "POINT ($(position | expand -t 1))";
    }

    while true; do
        crash --hosts localhost:4200 \
            --command "INSERT INTO iss (position) VALUES ('$(wkt_position)')"
        echo 'Sleeping for 10 seconds...'
        sleep 10
    done

Here, the script sleeps for 10 seconds after each sample. Accordingly, the time
series data will have a *resolution* of 10 seconds. You may want to configure
your script differently.

Run it from the command line, like so:

..
code-block:: console - - $ sh iss-position.sh - - CONNECT OK - INSERT OK, 1 row affected (0.029 sec) - Sleeping for 10 seconds... - CONNECT OK - INSERT OK, 1 row affected (0.033 sec) - Sleeping for 10 seconds... - CONNECT OK - INSERT OK, 1 row affected (0.038 sec) - Sleeping for 10 seconds... - -As this runs, you should see the table filling up in the CrateDB Admin UI: - -.. image:: ../_assets/img/generate-time-series/rows.png - -Lots of freshly generated time series data, ready for use. - -And, for bonus points, if you select the arrow next to the location data, it -will open up a map view showing the current position of the ISS: - -.. image:: ../_assets/img/generate-time-series/map.png - -.. TIP:: - - The ISS passes over large bodies of water. If the map looks empty, try - zooming out. - - -.. _command line: https://en.wikipedia.org/wiki/Command-line_interface -.. _curl: https://curl.se/ -.. _data sanitization: https://xkcd.com/327/ -.. _ground point: https://en.wikipedia.org/wiki/Ground_track -.. _International Space Station: https://www.nasa.gov/mission_pages/station/main/index.html -.. _jq: https://stedolan.github.io/jq/ -.. _let us know: https://github.com/crate/crate-tutorials/issues/new -.. _open notify: http://open-notify.org/ -.. _pip: https://pypi.org/project/pip/ -.. _pipe: https://www.geeksforgeeks.org/piping-in-unix-or-linux/ -.. _shell function: https://www.gnu.org/software/bash/manual/html_node/Shell-Functions.html -.. _shell scripting: https://en.wikipedia.org/wiki/Shell_script -.. _WKT: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry diff --git a/docs/generate-time-series/go.rst b/docs/generate-time-series/go.rst deleted file mode 100644 index ef755cd..0000000 --- a/docs/generate-time-series/go.rst +++ /dev/null @@ -1,432 +0,0 @@ - -.. 
_gen-ts-go: - -================================== -Generate time series data using Go -================================== - -This tutorial will show you how to generate some :ref:`mock time series data -` about the `International Space Station`_ (ISS) using `Go`_. - -.. SEEALSO:: - - :ref:`gen-ts` - -.. rubric:: Table of contents - -.. contents:: - :local: - - -Prerequisites -============= - -CrateDB must be :ref:`installed and running `. - -Make sure you are running an up-to-date version of `Go`_. We recommend Go 1.11 -or higher since you will be making use of `modules`_. - -Most of this tutorial is designed to be run as a local project using Go -tooling since the `compilation`_ unit is the package and not a single line. - -To begin, create a project directory and navigate into it: - -.. code-block:: console - - sh$ mkdir time-series-go - sh$ cd time-series-go - -Next, choose a module path and create a ``go.mod`` file that declares it. A -module is a collection of Go packages stored in a file hierarchy with a -``go.mod`` file at the root. This file defines the module’s module path, which -is also the import path for the root directory and its dependency requirements. - -Without a ``go.mod`` file, your project contains a package, but no module and -the ``go`` command will make up a fake import path based on the directory name. - -Make the current directory the root of a module by using the -``go mod init`` command to create a ``go.mod`` file there: - -.. code-block:: console - - sh$ go mod init example.com/time-series-go - -You should see a ``go.mod`` file in the current directory with contents similar -to: - -.. code-block:: console - - module example.com/time-series-go - - go 1.14 - -Next, create a file named ``main.go`` in the same directory: - -.. code-block:: console - - sh$ touch main.go - -Open this file in your favorite code editor. 
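Before wiring up the API calls, you can sanity-check the module setup with a
minimal ``main.go``. The ``formatPosition`` helper below is purely
illustrative and not part of the final program; it mimics the ``(lon, lat)``
geo_point string that the later sections build inline:

```go
package main

import "fmt"

// formatPosition renders longitude/latitude strings in the
// "(lon, lat)" form used as a geo_point literal later on.
func formatPosition(lon, lat string) string {
	return fmt.Sprintf("(%s, %s)", lon, lat)
}

func main() {
	// Prints: (84.9504, 41.6582)
	fmt.Println(formatPosition("84.9504", "41.6582"))
}
```

If ``go run main.go`` prints the coordinate pair, the module and toolchain are
working, and you can replace this stub with the real logic in the next
sections.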
Get the current position of the ISS
===================================

`Open Notify`_ is a third-party service that provides an API to consume data
about the current position, or `ground point`_, of the ISS.

The endpoint for this API is ``http://api.open-notify.org/iss-now.json``.

In the ``main.go`` file, declare the main package at the top (to tell the
compiler that the program is an executable) and import some packages from the
`standard library`_ that will be used in this tutorial. Declare a main
function which will be the entry point of the executable program:

.. code-block:: go

    package main

    import (
        "encoding/json"
        "fmt"
        "io/ioutil"
        "log"
        "net/http"
    )

    func main() {

    }

Then, read the current position of the ISS by opening the Open Notify API
endpoint at ``http://api.open-notify.org/iss-now.json`` in your browser:

.. code-block:: json

    {
      "message": "success",
      "timestamp": 1591703638,
      "iss_position": {
        "longitude": "84.9504",
        "latitude": "41.6582"
      }
    }

As shown, the endpoint returns a JSON payload, which contains an
``iss_position`` object with ``latitude`` and ``longitude`` data.


Parse the ISS position
======================

To parse the JSON payload, you can create a `struct`_ to `unmarshal`_ the data
into. When you unmarshal JSON into a struct, the function matches incoming
object keys to the keys in the struct field name or its tag. By default, object
keys which don't have a corresponding struct field are ignored.

.. code-block:: go

    type issInfo struct {
        IssPosition struct {
            Longitude string `json:"longitude"`
            Latitude  string `json:"latitude"`
        } `json:"iss_position"`
    }

Now, create a function that makes an HTTP GET request to the Open Notify API
endpoint and returns longitude and latitude as a
:ref:`crate-reference:data-types-geo` declaration.

..
code-block:: go - - func getISSPosition() (string, error) { - var i issInfo - - response, err := http.Get("http://api.open-notify.org/iss-now.json") - if err != nil { - return "", fmt.Errorf("unable to retrieve request: %v", err) - } - defer response.Body.Close() - - if response.StatusCode/100 != 2 { - return "", fmt.Errorf("bad response status: %s", response.Status) - } - - responseData, err := ioutil.ReadAll(response.Body) - if err != nil { - return "", fmt.Errorf("unable to read response body: %v", err) - } - - err = json.Unmarshal(responseData, &i) - if err != nil { - return "", fmt.Errorf("unable to unmarshal response body: %v", err) - } - - s := fmt.Sprintf("(%s, %s)", i.IssPosition.Longitude, i.IssPosition.Latitude) - return s, nil - } - -Above, the ``getISSPosition()`` function: - -.. rst-class:: open - - * Uses the `net/http`_ package from the Go standard library to issue an - HTTP GET request to the API endpoint - - * Implements some basic error handling and checks to see whether the - response code is in the 200 range - - * Reads the response body and unmarshals the JSON into the defined - struct ``issInfo`` - - * Formats the return string and returns it - -Then in the main function, call the ``getISSPosition()`` function and print -out the result: - -.. code-block:: go - - func main() { - pos, err := getISSPosition() - if err != nil { - log.Fatal(err) - } - - fmt.Println(pos) - } - -Save your changes and run the code: - -.. code-block:: console - - sh$ go run main.go - -The result should contain your geo_point string: - -.. code-block:: go - - (104.7298, 5.0335) - -You can run this multiple times to get the new position of the ISS each time. - - -Set up CrateDB -============== - -First, import the `context`_ package from the standard library and the `pgx`_ -client: - -.. 
code-block:: go

    import (
        "context"
        "encoding/json"
        "flag"
        "fmt"
        "io/ioutil"
        "log"
        "net/http"

        "github.com/jackc/pgx/v4"
    )

Then, in your main function, connect to CrateDB using the
:ref:`crate-reference:interface-postgresql` port (``5432``) and
:ref:`create a table ` suitable for writing ISS
position coordinates.

.. code-block:: go

    var conn *pgx.Conn

    func main() {
        var err error
        conn, err = pgx.Connect(context.Background(), "postgresql://crate@localhost:5432/doc")
        if err != nil {
            log.Fatalf("unable to connect to database: %v\n", err)
        } else {
            fmt.Println("CONNECT OK")
        }
        defer conn.Close(context.Background())

        // Use a raw string literal so the statement can span multiple lines.
        _, err = conn.Exec(context.Background(),
            `CREATE TABLE IF NOT EXISTS iss (
                timestamp TIMESTAMP GENERATED ALWAYS AS CURRENT_TIMESTAMP,
                position GEO_POINT
            )`)
        if err != nil {
            log.Fatalf("unable to create table: %v\n", err)
        }
    }

Save your changes and run the code:

.. code-block:: console

    sh$ go run main.go

When you run the script this time, the ``go`` command will look up the module
containing the `pgx`_ package and add it to ``go.mod``.

In the :ref:`crate-admin-ui:index`, you should see the new table when you navigate
to the *Tables* screen using the left-hand navigation menu:

.. image:: ../_assets/img/generate-time-series/table.png


Record the ISS position
=======================

With the table in place, you can start recording the position of the ISS.

Create some logic that calls your ``getISSPosition`` function and :ref:`insert
` the result into the ``iss`` table.

.. code-block:: go

    ...

    func main() {
        ...

        pos, err := getISSPosition()
        if err != nil {
            log.Fatalf("unable to get ISS position: %v\n", err)
        } else {
            _, err := conn.Exec(context.Background(),
                "INSERT INTO iss (position) VALUES ($1)", pos)
            if err != nil {
                log.Fatalf("unable to insert data: %v\n", err)
            } else {
                fmt.Println("INSERT OK")
            }
        }
    }

Save your changes and run the code:

..
code-block:: console

    sh$ go run main.go

Press the up arrow on your keyboard and hit *Enter* to run the same command a
few more times.

When you're done, you can :ref:`select ` that data
back out of CrateDB with this query:

.. code-block:: psql

    SELECT * FROM "doc"."iss"

.. TIP::

    You can run ad-hoc SQL queries directly from the *Console* screen in the
    Admin UI. You can navigate to the console from the left-hand navigation
    menu, as before.


Automate the process
====================

Now that you have the key components, you can automate the data collection.

In your file ``main.go``, create a function that encapsulates data insertion:

.. code-block:: go

    func insertData(position string) error {
        _, err := conn.Exec(context.Background(),
            "INSERT INTO iss (position) VALUES ($1)", position)
        return err
    }

Then in the script's ``main`` function, create an infinite loop that gets the
latest ISS position and inserts the data into the database.

.. code-block:: go

    ...

    func main() {
        ...

        for {
            pos, err := getISSPosition()
            if err != nil {
                log.Fatalf("unable to get ISS position: %v\n", err)
            } else {
                err = insertData(pos)
                if err != nil {
                    log.Fatalf("unable to insert data: %v\n", err)
                } else {
                    fmt.Println("INSERT OK")
                }
            }
            fmt.Println("Sleeping for 10 seconds...")
            time.Sleep(time.Second * 10)
        }
    }

.. SEEALSO::

    `The completed script source`_

Above, the ``main()`` function:

.. rst-class:: open

* Retrieves the latest ISS position through the ``getISSPosition()`` function

* Inserts the ISS position into CrateDB through the ``insertData()`` function

* Implements some basic error handling, in case either the API query or the
  CrateDB operation fails

* Sleeps for 10 seconds after each sample using the `time`_ package

Accordingly, the time series data will have a *resolution* of 10 seconds.
If -you wish to change this resolution, you may want to configure your script -differently. - -Run the script from the command line: - -.. code-block:: console - - $ go run main.go - - INSERT OK - Sleeping for 10 seconds... - INSERT OK - Sleeping for 10 seconds... - INSERT OK - Sleeping for 10 seconds... - -As the script runs, you should see the table filling up in the -:ref:`crate-admin-ui:index`. - -.. image:: ../_assets/img/generate-time-series/rows.png - -Lots of freshly generated time series data, ready for use. - -And, for bonus points, if you select the arrow next to the location data, it -will open up a map view showing the current position of the ISS: - -.. image:: ../_assets/img/generate-time-series/map.png - -.. TIP:: - - The ISS passes over large bodies of water. If the map looks empty, try - zooming out. - - -.. _compilation: https://www.geeksforgeeks.org/difference-between-compiled-and-interpreted-language/ -.. _context: https://golang.org/pkg/context/ -.. _Go: https://golang.org/ -.. _ground point: https://en.wikipedia.org/wiki/Ground_track -.. _International Space Station: https://www.nasa.gov/mission_pages/station/main/index.html -.. _modules: https://blog.golang.org/migrating-to-go-modules -.. _net/http: https://golang.org/pkg/net/http/ -.. _open notify: http://open-notify.org/ -.. _pgx: https://github.com/jackc/pgx/tree/v4 -.. _standard library: https://golang.org/pkg/ -.. _struct: https://golang.org/ref/spec#Struct_types -.. _The completed script source: https://play.golang.org/p/2HoBzpBn-iF -.. _time: https://golang.org/pkg/time/ -.. _unmarshal: https://pkg.go.dev/encoding/json#Unmarshal diff --git a/docs/generate-time-series/index.rst b/docs/generate-time-series/index.rst deleted file mode 100644 index 28f609a..0000000 --- a/docs/generate-time-series/index.rst +++ /dev/null @@ -1,33 +0,0 @@ -.. 
_gen-ts: - -========================= -Generate time series data -========================= - -CrateDB is purpose-built for working with massive amounts of time series data, -like the type of data produced by smart sensors and other `Internet of Things`_ -(IoT) devices. - -If you want to get a feel for using CrateDB to work with time series data, you -are going to need a source of time series data. Fortunately, there are many -ways to generate time series data by sampling the systems running on your local -computer. - -This collection of tutorials will show you how to generate mock time series -data about the `International Space Station`_ (ISS) and write it to CrateDB -using the client of your choice. - -.. rubric:: Table of contents - -.. toctree:: - :maxdepth: 2 - :titlesonly: - - cli - python - node - go - -.. _International Space Station: https://www.nasa.gov/mission_pages/station/main/index.html -.. _Internet of Things: https://en.wikipedia.org/wiki/Internet_of_things -.. _system load: https://en.wikipedia.org/wiki/Load_(computing) diff --git a/docs/generate-time-series/node.rst b/docs/generate-time-series/node.rst deleted file mode 100644 index aca82a7..0000000 --- a/docs/generate-time-series/node.rst +++ /dev/null @@ -1,357 +0,0 @@ -.. _gen-ts-javascript: - -======================================= -Generate time series data using Node.js -======================================= - -This tutorial will show you how to generate :ref:`mock time series data -` about the `International Space Station`_ (ISS) using `Node.js`_. - -.. SEEALSO:: - - :ref:`gen-ts` - -.. rubric:: Table of contents - -.. contents:: - :local: - - -Prerequisites -============= - -You must have CrateDB :ref:`installed and running `. - -Make sure you're running an up-to-date version of `Node.js`_. - -Then, upgrade to the latest `npm`_ version: - -.. code-block:: console - - sh$ npm install -g npm@latest - -Install the `node-postgres`_ and `Axios`_ libraries: - -.. 
code-block:: console - - sh$ npm install pg axios - -The ``node-postgres`` and ``axios`` libraries both use `promises`_ when -performing network operations. Promises are a way of encapsulating the eventual -result of an asynchronous operation. - -.. seealso:: - - If you're not familiar with asynchronous operations and promises, check out - Mozilla's `detailed guide`_ on the topic. - -Most of this tutorial is designed for Node's `interactive REPL mode`_ so that -you can experiment with the commands as you see fit. Since both libraries use -promises, you should start ``node`` with support for the `await`_ operator: - -.. code-block:: console - - sh$ node --experimental-repl-await - - -Get the current position of the ISS -==================================== - -`Open Notify`_ is a third-party service that provides an API to consume data -about the current position, or `ground point`_, of the ISS. - -The endpoint for this API is ``_. - -Start an interactive Node session (as above). - -Next, import the `Axios`_ library: - -.. code-block:: js - - > const axios = require('axios').default; - -Then, read the current position of the ISS with an HTTP GET request to the Open -Notify API endpoint: - -.. code-block:: js - - > let response = await axios.get('http://api.open-notify.org/iss-now.json') - -.. code-block:: js - - > response.data - { - iss_position: { longitude: '-107.0497', latitude: '42.5431' }, - message: 'success', - timestamp: 1582568638 - } - -As shown, the endpoint returns a JSON payload, which contains an -``iss_position`` object with ``latitude`` and ``longitude`` data. - -You can encapsulate this operation with a function that returns longitude and -latitude as a `WKT`_ string: - -.. code-block:: js - - > async function position() { - ... let response = await axios.get('http://api.open-notify.org/iss-now.json') - ... return `POINT (${response.data.iss_position.longitude} ${response.data.iss_position.latitude})` - ... 
}
-
-When you run this function, it should return your point string:
-
-.. code-block:: js
-
-    > await position()
-
-.. code-block:: js
-
-    'POINT (-99.4196 38.1642)'
-
-Set up CrateDB
-==============
-
-First, import the `node-postgres`_ client:
-
-.. code-block:: js
-
-    > const { Client } = require('pg')
-
-Then `connect`_ to CrateDB, using the :ref:`crate-reference:interface-postgresql` port
-(``5432``):
-
-.. code-block:: js
-
-    > const client = new Client({connectionString: 'postgresql://crate@localhost:5432/doc'})
-
-.. code-block:: js
-
-    > await client.connect()
-
-Finally, :ref:`create a table ` suitable for writing
-ISS position coordinates.
-
-.. code-block:: js
-
-    > var query = `
-    ... CREATE TABLE iss (
-    ...     timestamp TIMESTAMP GENERATED ALWAYS AS CURRENT_TIMESTAMP,
-    ...     position GEO_POINT)`
-
-.. code-block:: js
-
-    > await client.query(query)
-
-.. code-block:: js
-
-    Result {
-      command: 'CREATE',
-      rowCount: 1,
-      oid: null,
-      rows: [],
-      fields: [],
-      _parsers: undefined,
-      _types: TypeOverrides {
-        _types: {
-          getTypeParser: [Function: getTypeParser],
-          setTypeParser: [Function: setTypeParser],
-          arrayParser: [Object],
-          builtins: [Object]
-        },
-        text: {},
-        binary: {}
-      },
-      RowCtor: null,
-      rowAsArray: false
-    }
-
-Success!
-
-In the :ref:`crate-admin-ui:index`, you should see the new table when you navigate to
-the *Tables* screen using the left-hand navigation menu:
-
-.. image:: ../_assets/img/generate-time-series/table.png
-
-
-Record the ISS position
-=======================
-
-With the table in place, you can start recording the position of the ISS.
-
-The following command calls your ``position`` function and will :ref:`insert
-` the result into the ``iss`` table.
-
-.. code-block:: js
-
-    > await client.query("INSERT INTO iss (position) VALUES ($1)", [await position()])
-
-..
code-block:: js - - Result { - command: 'INSERT', - rowCount: 1, - oid: 0, - rows: [], - fields: [], - _parsers: undefined, - _types: TypeOverrides { - _types: { - getTypeParser: [Function: getTypeParser], - setTypeParser: [Function: setTypeParser], - arrayParser: [Object], - builtins: [Object] - }, - text: {}, - binary: {} - }, - RowCtor: null, - rowAsArray: false - } - -Press the up arrow on your keyboard and hit *Enter* to run the same command a -few more times. - -When you're done, you can :ref:`select ` that data -back out of CrateDB. - -.. code-block:: js - - > let result = await client.query('SELECT * FROM iss') - -.. code-block:: js - - > result.rows - [ - { - timestamp: 2020-02-24T18:32:09.744Z, - position: { x: -80.7016, y: 21.5174 } - }, - { - timestamp: 2020-02-24T18:31:43.542Z, - position: { x: -81.8096, y: 22.7667 } - }, - { - timestamp: 2020-02-24T18:32:03.622Z, - position: { x: -80.9554, y: 21.8065 } - } - ] - -Here you have recorded three sets of ISS position coordinates. - - -Automate the process -==================== - -Now you have key components, you can automate the data collection. Doing this -will require a change of approach. - -Previously, you were using a `client`_ to connect to and insert data into -CrateDB. However, clients are ephemeral, and once closed, you need to recreate -them. Creating a new client requires a handshake with CrateDB, and this -overhead cost can be prohibitive if you are rapidly creating new clients. - -Instead, use a `connection pool`_ to manage your connections. Connection pools -manage a collection of connected clients that you can request, use, and return -to the pool. - -Create a new file called ``iss-position.js``: - -.. 
code-block:: javascript
-
-    const axios = require('axios').default;
-    const { Pool } = require('pg')
-    const pool = new Pool({connectionString: 'postgresql://crate@localhost:5432/doc'})
-
-    // Sampling resolution
-    const seconds = 10
-
-    // Get data from the API, and, if successful, insert it into CrateDB
-    function insert() {
-        axios.get('http://api.open-notify.org/iss-now.json')
-            .then(response => {
-                const longitude = response.data.iss_position.longitude
-                const latitude = response.data.iss_position.latitude
-                const current_position = `POINT (${longitude} ${latitude})`
-                return pool.query(
-                    "INSERT INTO iss (position) VALUES ($1)", [current_position])
-            })
-            .then(_ => console.log("INSERT OK"))
-            .catch(err => console.error("INSERT ERROR", err))
-    }
-
-    // Loop indefinitely
-    async function loop() {
-        while (true) {
-            insert()
-            console.log(`Sleeping for ${seconds} seconds...`)
-            await new Promise(r => setTimeout(r, seconds * 1000))
-        }
-    }
-
-    loop()
-
-In the above script, you have merged the ``position`` function with the
-insertion. It uses `promise chaining`_ so that the API query and the CrateDB
-insertion can happen sequentially, yet asynchronously.
-
-You also have some basic error handling, in case either the API query or the
-CrateDB operation fails.
-
-Here, the script sleeps for 10 seconds after each sample. Accordingly, the time
-series data will have a *resolution* of 10 seconds. If you wish to change this
-resolution, you may want to configure your script differently.
-
-Run the script from the command line:
-
-.. code-block:: console
-
-    sh$ node iss-position.js
-    INSERT OK
-    Sleeping for 10 seconds...
-    INSERT OK
-    Sleeping for 10 seconds...
-    INSERT OK
-    Sleeping for 10 seconds...
-
-.. TIP::
-
-    If you get a ``MODULE_NOT_FOUND`` error when trying to run this script,
-    make sure you are running it from the same directory where the npm
-    libraries are installed.
-
-As the script runs, you should see the table filling up in the CrateDB Admin
-UI:
-
-..
image:: ../_assets/img/generate-time-series/rows.png - -Lots of freshly generated time series data, ready for use. - -And, for bonus points, if you select the arrow next to the location data, it -will open up a map view showing the current position of the ISS: - -.. image:: ../_assets/img/generate-time-series/map.png - -.. TIP:: - - The ISS passes over large bodies of water. If the map looks empty, try - zooming out. - - -.. _await: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/await -.. _axios: https://www.npmjs.com/package/axios -.. _Client: https://node-postgres.com/api/client -.. _connect: https://node-postgres.com/features/connecting -.. _Connection Pool: https://node-postgres.com/api/pool -.. _detailed guide: https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/Promises -.. _ground point: https://en.wikipedia.org/wiki/Ground_track -.. _input values: https://node-postgres.com/features/queries#Parameterized%20query -.. _interactive REPL mode: https://www.oreilly.com/library/view/learning-node-2nd/9781491943113/ch04.html -.. _International Space Station: https://www.nasa.gov/mission_pages/station/main/index.html -.. _node-postgres: https://www.npmjs.com/package/pg -.. _Node.js: https://nodejs.org/en/ -.. _npm: https://www.npmjs.com/ -.. _open notify: http://open-notify.org/ -.. _promise chaining: https://javascript.info/promise-chaining -.. _promises: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise -.. _WKT: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry diff --git a/docs/generate-time-series/python.rst b/docs/generate-time-series/python.rst deleted file mode 100644 index 1680418..0000000 --- a/docs/generate-time-series/python.rst +++ /dev/null @@ -1,249 +0,0 @@ -.. 
_gts-python: - -====================================== -Generate time series data using Python -====================================== - -This tutorial will show you how to generate :ref:`mock time series data -` about the `International Space Station`_ (ISS) using `Python`_. - -.. SEEALSO:: - - :ref:`gen-ts` - -.. rubric:: Table of contents - -.. contents:: - :local: - - -Prerequisites -============= - -CrateDB must be :ref:`installed and running `. - -Make sure you're running an up-to-date version of Python (we recommend 3.7 or -higher). - -Then, use `pip`_ to install the `requests`_ and :ref:`crate-python:index` libraries: - -.. code-block:: console - - sh$ pip install requests crate - -The rest of this tutorial is designed for Python's `interactive mode`_ so that -you can experiment with the commands as you see fit. The `standard -Python interpreter`_ works fine for this, but we recommend `IPython`_ for a more -user-friendly experience. - -You can install IPython with Pip: - -.. code-block:: console - - sh$ pip install ipython - -Once installed, you can start an interactive IPython session like this: - -.. code-block:: console - - sh$ ipython - - -Get the current position of the ISS -==================================== - -`Open Notify`_ is a third-party service that provides an API to consume data -about the current position, or `ground point`_, of the ISS. - -The endpoint for this API is ``_. - -Start an interactive Python session (as above). 
- -Next, import the `requests`_ library:: - - >>> import requests - -Then, read the current position of the ISS with an HTTP GET request to the Open -Notify API endpoint, like this: - - >>> response = requests.get("http://api.open-notify.org/iss-now.json") - >>> response.json() - {'message': 'success', - 'timestamp': 1582730500, - 'iss_position': {'latitude': '33.3581', 'longitude': '-57.3929'}} - -As shown, the endpoint returns a JSON payload, which contains an -``iss_position`` object with ``latitude`` and ``longitude`` data. - -You can encapsulate this operation with a function that returns longitude and -latitude as a `WKT`_ string: - - >>> def position(): - ... response = requests.get("http://api.open-notify.org/iss-now.json") - ... position = response.json()["iss_position"] - ... return f'POINT ({position["longitude"]} {position["latitude"]})' - -When you run this function, it should return your point string:: - - >>> position() - 'POINT (-30.9188 42.8036)' - -Set up CrateDB -============== - -First, import the :ref:`crate-python:index` client: - - >>> from crate import client - -Then, :ref:`crate-python:connect`: - - >>> connection = client.connect("localhost:4200") - -.. NOTE:: - - You can omit the function argument if CrateDB is running on - ``localhost:4200``. We have included it here for the sake of clarity. - Modify the argument if you wish to connect to a CrateDB node on a different - host or port number. - -Get a :ref:`cursor `: - - >>> cursor = connection.cursor() - -Finally, :ref:`create a table ` suitable for writing -ISS position coordinates. - - >>> cursor.execute( - ... """CREATE TABLE iss ( - ... timestamp TIMESTAMP GENERATED ALWAYS AS CURRENT_TIMESTAMP, - ... position GEO_POINT)""" - ... ) - -In the :ref:`crate-admin-ui:index`, you should see the new table when you navigate to -the *Tables* screen using the left-hand navigation menu: - -.. 
image:: ../_assets/img/generate-time-series/table.png - - -Record the ISS position -======================= - -With the table in place, you can start recording the position of the ISS. - -The following command calls your ``position`` function and will :ref:`insert -` the result into the ``iss`` table: - - >>> cursor.execute("INSERT INTO iss (position) VALUES (?)", [position()]) - -Press the up arrow on your keyboard and hit *Enter* to run the same command a -few more times. - -When you're done, you can :ref:`select ` that data -back out of CrateDB. - - >>> cursor.execute('SELECT * FROM iss ORDER BY timestamp DESC') - -Then, :ref:`fetch all ` the result rows at once: - - >>> cursor.fetchall() - [[1582295967721, [-8.0689, 25.8967]], - [1582295966383, [-8.1371, 25.967]], - [1582295926523, [-9.9662, 27.8032]]] - -Here you have recorded three sets of ISS position coordinates. - - -Automate the process -==================== - -Now you have key components, you can automate the data collection. - -Create a new file called ``iss-position.py``, like this: - -.. code-block:: python - - import time - - import requests - from crate import client - - - def position(): - response = requests.get("http://api.open-notify.org/iss-now.json") - position = response.json()["iss_position"] - return f'POINT ({position["longitude"]} {position["latitude"]})' - - - def insert(): - # New connection each time - try: - connection = client.connect("localhost:4200") - print("CONNECT OK") - except Exception as err: - print("CONNECT ERROR: %s" % err) - return - cursor = connection.cursor() - try: - cursor.execute( - "INSERT INTO iss (position) VALUES (?)", [position()], - ) - print("INSERT OK") - except Exception as err: - print("INSERT ERROR: %s" % err) - return - - - # Loop indefinitely - while True: - insert() - print("Sleeping for 10 seconds...") - time.sleep(10) - - -Here, the script sleeps for 10 seconds after each sample. 
Accordingly, the time
-series data will have a *resolution* of 10 seconds. You may want to configure
-your script differently.
-
-Run the script from the command line, like so:
-
-.. code-block:: console
-
-    sh$ python iss-position.py
-    CONNECT OK
-    INSERT OK
-    Sleeping for 10 seconds...
-    CONNECT OK
-    INSERT OK
-    Sleeping for 10 seconds...
-    CONNECT OK
-    INSERT OK
-    Sleeping for 10 seconds...
-
-As the script runs, you should see the table filling up in the CrateDB Admin
-UI:
-
-.. image:: ../_assets/img/generate-time-series/rows.png
-
-Lots of freshly generated time series data, ready for use.
-
-And, for bonus points, if you select the arrow next to the location data, it
-will open up a map view showing the current position of the ISS:
-
-.. image:: ../_assets/img/generate-time-series/map.png
-
-.. TIP::
-
-    The ISS passes over large bodies of water. If the map looks empty, try
-    zooming out.
-
-
-.. _ground point: https://en.wikipedia.org/wiki/Ground_track
-.. _interactive mode: https://docs.python.org/3/tutorial/interpreter.html#interactive-mode
-.. _International Space Station: https://www.nasa.gov/mission_pages/station/main/index.html
-.. _IPython: https://ipython.org/
-.. _open notify: http://open-notify.org/
-.. _pip: https://pypi.org/project/pip/
-.. _Python: https://www.python.org/
-.. _requests: https://docs.python-requests.org/en/master/
-.. _standard Python interpreter: https://docs.python.org/3/tutorial/interpreter.html
-.. _WKT: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry
diff --git a/docs/index.rst b/docs/index.rst
index 61fa70c..7bea21f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,33 +1,75 @@
-.. _index:
+.. highlight:: bash
-
-=================
-CrateDB Tutorials
+.. _install:
+
+============
+Installation
+============
+
+Security and upgrade notes
+==========================
+
+.. WARNING::
+
+   CrateDB versions prior to 4.6.6 are susceptible to the `Log4Shell issue`_
+   published on December 12, 2021.
Please consider upgrading to the most
+   recent version and follow up by reading `CrateDB Log4Shell mitigations`_.
+
+
+Introduction
+============
+
+This part of the documentation covers the installation of CrateDB on Linux,
+macOS and Windows systems.
+The first step to using any software package is getting it properly installed.
+Please read this section carefully.
+
+Try CrateDB Cloud
 =================
-CrateDB is a distributed SQL database that makes it simple to store and analyze
-massive amounts of machine data in real-time.
+The easiest way to get started with CrateDB is to use a 30-day free CrateDB
+Cloud cluster, no credit card required. Visit the `sign up page`_ to start your
+CrateDB cluster today.
-.. rubric:: Table of contents
+Try CrateDB locally
+===================
-.. toctree::
-   :maxdepth: 1
-   :titlesonly:
+If you want to try out CrateDB locally on Linux or macOS but would prefer to
+avoid the hassle of manual installation or extracting release archives, you can
+get a fresh CrateDB node up and running in your current working directory with a
+single command:
-   install
-   first-use
-   create-user
-   create-sharded-table
-   generate-time-series/index
-   normalize-intervals
+.. code-block:: console
-.. SEEALSO::
+   sh$ bash -c "$(curl -L https://try.crate.io/)"
-   Check out the :ref:`crate-howtos:index` and the :ref:`crate-reference:index`.
 .. NOTE::
-   This is an open source documentation project. You can view the source code,
-   create pull requests, and report issues on `GitHub`_.
+   This is a quick way to *try out* CrateDB. It is not the recommended method
+   to *install* CrateDB in a durable way. The following sections will outline
+   that method.
+
+
+Installing CrateDB
+==================
+
+This section of the documentation shows you how to deploy CrateDB in different
+environments.
+
+.. rubric:: Table of contents
+
+.. toctree::
+   :maxdepth: 3
+   :titlesonly:
+
+   basic/index
+   linux/index
+   containers/index
+   cloud/index
+
-..
_GitHub: https://github.com/crate/crate-tutorials +.. _CrateDB Log4Shell mitigations: https://community.crate.io/t/security-vulnerability-log4shell-rce-0-day-exploit/935 +.. _Log4Shell issue: https://www.lunasec.io/docs/blog/log4j-zero-day/ +.. _sign up page: https://crate.io/lp-free-trial diff --git a/docs/linux/debian.rst b/docs/linux/debian.rst new file mode 100644 index 0000000..797ac96 --- /dev/null +++ b/docs/linux/debian.rst @@ -0,0 +1,163 @@ +.. _debian: + +=============================== +Run CrateDB on Debian GNU/Linux +=============================== + +CrateDB actively maintains packages for the following Debian versions: + +- `Bullseye`_ (11.x) +- `Buster`_ (10.x) +- `Stretch`_ (9.x) + +This guide will show you how to install, control, and configure a single-node +CrateDB on a local Debian system. + +.. rubric:: Table of contents + +.. contents:: + :local: + + +Configure Apt +============= + +You need to configure `Apt`_ (the package manager) to trust and to add the +CrateDB repositories: + +.. code-block:: sh + + # Add HTTPS support + sh$ sudo apt install apt-transport-https + + # Download the CrateDB GPG key + sh$ wget https://cdn.crate.io/downloads/apt/DEB-GPG-KEY-crate + + # Add the key to Apt + sh$ sudo apt-key add DEB-GPG-KEY-crate + + # Add CrateDB repositories to Apt + # `lsb_release -cs` returns the codename of your OS + echo "deb https://cdn.crate.io/downloads/apt/stable/ $(lsb_release -cs) main" | + sudo tee /etc/apt/sources.list.d/crate-stable.list + + +.. NOTE:: + + CrateDB provides a *stable release* and a *testing release* channel. To use + the testing channel, replace ``stable`` with ``testing`` in the command + above. You can read more about our `release workflow`_. + +Now update Apt: + +.. code-block:: sh + + sh$ sudo apt update + +You should see a success message. This indicates that the CrateDB release +channel is correctly configured and the ``crate`` package has been registered +locally. 
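+As a sanity check, you can inspect the repository definition that the ``tee``
+command wrote. On Debian 11 (Bullseye), for example,
+``/etc/apt/sources.list.d/crate-stable.list`` would contain a single line like
+this (the codename varies with your release):
+
+.. code-block:: text
+
+   deb https://cdn.crate.io/downloads/apt/stable/ bullseye main
+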
+
+
+Install CrateDB
+===============
+
+You can now install CrateDB:
+
+.. code-block:: sh
+
+   sh$ sudo apt install crate
+
+After the installation is finished, the ``crate`` service should be
+up-and-running.
+
+You should be able to access it by visiting::
+
+   http://localhost:4200/
+
+.. CAUTION::
+   When you install via Apt, CrateDB automatically starts as a single-node
+   cluster and you won't be able to add additional nodes. In order to form a
+   multi-node cluster, you will need to remove the cluster state after
+   changing the configuration.
+
+
+Control CrateDB
+===============
+
+You can control the ``crate`` service with the ``systemctl`` utility:
+
+.. code-block:: sh
+
+   sh$ sudo systemctl COMMAND crate
+
+Replace ``COMMAND`` with ``start``, ``stop``, ``restart``, ``status``, and so
+on.
+
+.. CAUTION::
+
+   Be sure to read the guide to :ref:`rolling upgrades ` and
+   :ref:`full restart upgrades ` before attempting to
+   upgrade a running cluster.
+
+
+Configure CrateDB
+=================
+
+In order to configure CrateDB, take note of the configuration file
+location and the available environment variables.
+
+
+Configuration files
+-------------------
+
+The main CrateDB `configuration files`_ are located in the ``/etc/crate``
+directory.
+
+
+Environment
+-----------
+
+The CrateDB startup script `sources`_ `environment variables`_ from the
+``/etc/default/crate`` file. Here is an example:
+
+.. code-block:: sh
+
+   # Heap Size (defaults to 256m min, 1g max)
+   CRATE_HEAP_SIZE=2g
+
+   # Maximum number of open files, defaults to 65535.
+   # MAX_OPEN_FILES=65535
+
+   # Maximum locked memory size. Set to "unlimited" if you use the
+   # bootstrap.mlockall option in crate.yml. You must also set
+   # CRATE_HEAP_SIZE.
MAX_LOCKED_MEMORY=unlimited
+
+   # Additional Java OPTS
+   # CRATE_JAVA_OPTS=
+
+   # Force the JVM to use IPv4 stack
+   CRATE_USE_IPV4=true
+
+
+Customized setups
+=================
+
+A full list of package files can be obtained with this command::
+
+   sh$ dpkg-query -L crate
+
+If you want to deviate from the way that the ``crate`` package integrates with
+your system, we recommend that you go with a `basic tarball installation`_.
+
+
+.. _Apt: https://wiki.debian.org/Apt
+.. _basic tarball installation: https://crate.io/docs/crate/tutorials/en/latest/install.html#install-adhoc
+.. _Bullseye: https://www.debian.org/releases/bullseye/
+.. _Buster: https://www.debian.org/releases/buster/
+.. _configuration files: https://crate.io/docs/crate/reference/en/latest/config/index.html
+.. _environment variables: https://crate.io/docs/crate/reference/en/latest/config/environment.html
+.. _release workflow: https://github.com/crate/crate/blob/master/devs/docs/release.rst
+.. _sources: https://en.wikipedia.org/wiki/Source_(command)
+.. _Stretch: https://www.debian.org/releases/stretch/
diff --git a/docs/linux/index.rst b/docs/linux/index.rst
new file mode 100644
index 0000000..9fcfd66
--- /dev/null
+++ b/docs/linux/index.rst
@@ -0,0 +1,16 @@
+.. _linux:
+
+=================
+CrateDB and Linux
+=================
+
+CrateDB provides a number of Linux packages.
+
+.. rubric:: Table of contents
+
+.. toctree::
+   :maxdepth: 1
+
+   debian
+   red-hat
+   ubuntu
diff --git a/docs/linux/red-hat.rst b/docs/linux/red-hat.rst
new file mode 100644
index 0000000..fc4085f
--- /dev/null
+++ b/docs/linux/red-hat.rst
@@ -0,0 +1,155 @@
+.. _red-hat:
+
+============================
+Run CrateDB on Red Hat Linux
+============================
+
+CrateDB maintains the official RPM repositories for:
+
+- `Red Hat Enterprise Linux`_
+
+These packages work with Red Hat Enterprise Linux, CentOS, and Scientific
+Linux.
+
+.. rubric:: Table of contents
+
+..
contents::
+   :local:
+
+
+Configure YUM
+=============
+
+All CrateDB packages are signed with GPG.
+
+To get started, you must import the CrateDB public key, like so:
+
+.. code-block:: sh
+
+   sh$ sudo rpm --import https://cdn.crate.io/downloads/yum/RPM-GPG-KEY-crate
+
+You must then install the CrateDB repository definition.
+
+For Red Hat Enterprise Linux, run:
+
+.. code-block:: sh
+
+   sh$ sudo rpm -Uvh https://cdn.crate.io/downloads/yum/7/x86_64/crate-release-7.0-1.x86_64.rpm
+
+For CrateDB versions < 4.2.0, run:
+
+.. code-block:: sh
+
+   sh$ sudo rpm -Uvh https://cdn.crate.io/downloads/yum/7/noarch/crate-release-7.0-1.noarch.rpm
+
+The above commands will create the ``/etc/yum.repos.d/crate.repo``
+configuration file.
+
+CrateDB provides a stable and a testing release channel. At this point, you
+should select which one you wish to use.
+
+By default, `YUM`_ (Red Hat's package manager) will use the stable repository.
+This is because the testing repository's configuration marks it as disabled.
+
+If you would like to enable the testing repository, open the ``crate.repo``
+file and set ``enabled=1`` under the ``[crate-testing]`` section.
+
+
+Install CrateDB
+===============
+
+With everything set up, you can install CrateDB, like so:
+
+.. code-block:: sh
+
+   sh$ sudo yum install crate
+
+CrateDB is now installed, but not running.
+
+
+Running and controlling CrateDB
+===============================
+
+With Red Hat Enterprise Linux, you can control the ``crate`` service like so:
+
+.. code-block:: sh
+
+   sh$ sudo systemctl COMMAND crate
+
+Here, replace ``COMMAND`` with ``start``, ``stop``, ``restart``, ``status`` and
+so on.
+
+After you run the appropriate command with the ``start`` argument, the
+``crate`` service should be up-and-running.
+
+You should be able to access it by visiting::
+
+   http://localhost:4200/
+
+.. SEEALSO::
+
+   If you're new to CrateDB, check out our `first use`_ documentation.
+
+..
CAUTION:: + + Be sure to read the guide to :ref:`rolling upgrades ` and + :ref:`full restart upgrades ` before attempting to + upgrade a running cluster. + + +Configuration +============= + + +Configuration files +------------------- + +The main CrateDB configuration files are located in the ``/etc/crate`` +directory. + + +Environment +----------- + +The CrateDB startup script `sources`_ environment variables from the +``/etc/sysconfig/crate`` file. + +You can use this mechanism to configure CrateDB. + +Here's one example: + +.. code-block:: sh + + # Heap Size (defaults to 256m min, 1g max) + CRATE_HEAP_SIZE=2g + + # Maximum number of open files, defaults to 65535. + # MAX_OPEN_FILES=65535 + + # Maximum locked memory size. Set to "unlimited" if you use the + # bootstrap.mlockall option in crate.yml. You must also set + # CRATE_HEAP_SIZE. + MAX_LOCKED_MEMORY=unlimited + + # Additional Java OPTS + # CRATE_JAVA_OPTS= + + # Force the JVM to use IPv4 stack + CRATE_USE_IPV4=true + + +Customized setups +================= + +A full list of package files can be obtained with this command:: + + sh$ rpm -ql crate + +If you want to deviate from the way that the ``crate`` package integrates with +your system, we recommend that you go with a `basic tarball installation`_. + + +.. _basic tarball installation: https://crate.io/docs/crate/tutorials/en/latest/install.html#install-adhoc +.. _first use: https://crate.io/docs/crate/tutorials/en/latest/first-use.html +.. _Red Hat Enterprise Linux: https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux +.. _sources: https://en.wikipedia.org/wiki/Source_(command) +.. _YUM: https://access.redhat.com/solutions/9934 diff --git a/docs/linux/ubuntu.rst b/docs/linux/ubuntu.rst new file mode 100644 index 0000000..327bb51 --- /dev/null +++ b/docs/linux/ubuntu.rst @@ -0,0 +1,159 @@ +.. 
_ubuntu: + +===================== +Run CrateDB on Ubuntu +===================== + +CrateDB maintains packages for the following Ubuntu versions: + +- `Ubuntu 20.04 LTS`_ (Focal Fossa) +- `Ubuntu 18.04.5 LTS`_ (Bionic Beaver) +- `Ubuntu 16.04.7 LTS`_ (Xenial Xerus) + +This guide will show you how to install, control, and configure a single-node +CrateDB on a local Ubuntu system. + +.. rubric:: Table of contents + +.. contents:: + :local: + + +Configure Apt +============= + +You need to configure `Apt`_ (the package manager) to trust and to add the +CrateDB repositories: + +.. code-block:: sh + + # Download the CrateDB GPG key + sh$ wget https://cdn.crate.io/downloads/deb/DEB-GPG-KEY-crate + + # Add the key to Apt + sh$ sudo apt-key add DEB-GPG-KEY-crate + + # Add CrateDB repositories to Apt + # `lsb_release -cs` returns the codename of your OS + sh$ sudo add-apt-repository "deb https://cdn.crate.io/downloads/deb/stable/ $(lsb_release -cs) main" + + +.. NOTE:: + + CrateDB provides a *stable release* and a *testing release* channel. To use + the testing channel, replace ``stable`` with ``testing`` in the command + above. You can read more about our `release workflow`_. + +Now update Apt: + +.. code-block:: sh + + sh$ sudo apt update + +You should see a success message. This indicates that the CrateDB release +channel is correctly configured and the ``crate`` package has been registered +locally. + + +Install CrateDB +=============== + +You can now install CrateDB: + +.. code-block:: sh + + sh$ sudo apt install crate + +After the installation is finished, the ``crate`` service should be +up-and-running. + +You should be able to access it by visiting:: + + http://localhost:4200/ + +.. CAUTION:: + When you install via Apt, CrateDB automatically starts as a single-node + cluster and you won't be able to add additional nodes. In order to form a + multi-node cluster, you will need to remove the cluster state after + changing the configuration. 
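+
+For illustration, on a single node this reset might look like the following
+sketch. It assumes the default data directory ``/var/lib/crate``; adjust the
+path if you changed ``path.data``. Note that removing the cluster state
+destroys all data stored on that node:
+
+.. code-block:: sh
+
+   sh$ sudo systemctl stop crate
+   sh$ sudo rm -rf /var/lib/crate/*   # destroys local data!
+   sh$ sudo systemctl start crate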
+
+
+Control CrateDB
+===============
+
+You can control the ``crate`` service with the ``systemctl`` utility:
+
+.. code-block:: sh
+
+    sh$ sudo systemctl COMMAND crate
+
+Replace ``COMMAND`` with ``start``, ``stop``, ``restart``, ``status``, and so on.
+
+.. CAUTION::
+
+    Be sure to read the guide to :ref:`rolling upgrades ` and
+    :ref:`full restart upgrades ` before attempting to
+    upgrade a running cluster.
+
+
+Configure CrateDB
+=================
+
+To configure CrateDB, take note of the configuration file location and the
+available environment variables.
+
+
+Configuration files
+-------------------
+
+The main CrateDB `configuration files`_ are located in the ``/etc/crate``
+directory.
+
+
+Environment variables
+---------------------
+
+The CrateDB startup script `sources`_ `environment variables`_ from the
+``/etc/default/crate`` file. Here is an example:
+
+.. code-block:: sh
+
+    # Heap Size (defaults to 256m min, 1g max)
+    CRATE_HEAP_SIZE=2g
+
+    # Maximum number of open files, defaults to 65535.
+    # MAX_OPEN_FILES=65535
+
+    # Maximum locked memory size. Set to "unlimited" if you use the
+    # bootstrap.mlockall option in crate.yml. You must also set
+    # CRATE_HEAP_SIZE.
+    MAX_LOCKED_MEMORY=unlimited
+
+    # Additional Java OPTS
+    # CRATE_JAVA_OPTS=
+
+    # Force the JVM to use IPv4 stack
+    CRATE_USE_IPV4=true
+
+
+Customized setups
+=================
+
+A full list of package files can be obtained with this command::
+
+    sh$ dpkg-query -L crate
+
+If you want to deviate from the way that the ``crate`` package integrates with
+your system, you can do a `basic tarball installation`_.
+
+
+.. _Apt: https://wiki.debian.org/Apt
+.. _basic tarball installation: https://crate.io/docs/crate/tutorials/en/latest/install.html#install-adhoc
+.. _configuration files: https://crate.io/docs/crate/reference/en/latest/config/index.html
+.. _environment variables: https://crate.io/docs/crate/reference/en/latest/config/environment.html
+.. 
_release workflow: https://github.com/crate/crate/blob/master/devs/docs/release.rst +.. _sources: https://en.wikipedia.org/wiki/Source_(command) +.. _Ubuntu 14.04.6: https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes +.. _Ubuntu 16.04.7 LTS: https://wiki.ubuntu.com/XenialXerus/ReleaseNotes +.. _Ubuntu 18.04.5 LTS: https://wiki.ubuntu.com/BionicBeaver/ReleaseNotes +.. _Ubuntu 20.04 LTS: https://wiki.ubuntu.com/FocalFossa/ReleaseNotes diff --git a/docs/normalize-intervals.rst b/docs/normalize-intervals.rst deleted file mode 100644 index 79b3aff..0000000 --- a/docs/normalize-intervals.rst +++ /dev/null @@ -1,789 +0,0 @@ -.. _normalize-intervals: - -==================================== -Normalize time series data intervals -==================================== - -.. sidebar:: The ISS Orbit - - .. image:: _assets/img/normalize-intervals/orbit.gif - :alt: An animated visualization of the ISS orbit - :target: https://en.wikipedia.org/wiki/International_Space_Station#Orbit - - The ISS travels at 27,724 kilometers per hour and orbits Earth - approximately once every 90 minutes. - -If you followed one of the tutorials in the :ref:`previous section `, -you should have some mock time series data about the position, or `ground -point`_, of the `International Space Station`_ (ISS). - -It is common to visualize time series data by graphing values over time. -However, you may run into the following issues: - -1. The resolution of your data does not match the resolution you want for your - visualization. - - *For example, you want to plot a single value per minute, but your data is - spaced in 10-second intervals. You will need to resample the data.* - -2. Your data is non-continuous, but you want to visualize a continuous time - series. - - *For example, you want to plot every minute for the past 24 hours, but you - are missing data for some intervals. 
You will need to fill in the missing - values.* - -This tutorial demonstrates the shortcomings of visualizing the non-normalized -data and shows you how to address these shortcomings by normalizing your data -using SQL. - -.. NOTE:: - - This tutorial focuses on the use of SQL. Code examples demonstrate the use - of the CrateDB Python client. However, the following guidelines will work - with any language that allows for the execution of SQL. - -.. SEEALSO:: - - :ref:`Tutorials for generating mock time series data ` - -.. rubric:: Table of contents - -.. contents:: - :local: - - -.. _ni-prereq: - -Prerequisites -============= - - -.. _ni-mock-data: - -Mock data ---------- - -You must have CrateDB :ref:`installed and running `. - -This tutorial works with ISS location data. Before continuing, you should have -generated some ISS data by following one of the tutorials in the :ref:`previous -section `. - - -.. _ni-python: - -Python setup ------------- - -You should be using the latest stable version of `Python`_. - -You must have the following Python libraries installed: - -- `pandas`_ -- querying and data manipulation -- `SQLAlchemy`_ -- a powerful database abstraction layer -- The :ref:`crate-python:index` -- SQLAlchemy support for CrateDB -- `Matplotlib`_ -- data visualization -- `geojson`_ -- Functions for encoding and decoding GeoJSON formatted data - -You can install (or upgrade) the necessary libraries with `Pip`_: - -.. code-block:: console - - sh$ pip3 install --upgrade pandas sqlalchemy crate matplotlib geojson - - -.. _ni-jupyter: - -Using Jupyter Notebook -~~~~~~~~~~~~~~~~~~~~~~ - -This tutorial shows you how to use `Jupyter Notebook`_ so that you can display -data visually and experiment with the commands as you see fit. - -Jupyter Notebook allows you to create and share documents containing live code, -equations, visualizations, and narrative text. - -You can install Jupyter with Pip: - -.. 
code-block:: console - - sh$ pip3 install --upgrade notebook - -Once installed, you can start a new Jupyter Notebook session, like this: - -.. code-block:: console - - sh$ jupyter notebook - -This command should open a new browser window. In this window, select *New* (in -the top right-hand corner), then *Notebook* → *Python 3*. - -Type your Python code at the input prompt. Then, select *Run* (Shift-Enter ⇧⏎) -to evaluate the code: - -.. image:: _assets/img/normalize-intervals/jupyter-hello-world.png - -You can re-evaluate input blocks as many times as you like. - -.. SEEALSO:: - - `Jupyter Notebook basics`_ - - -.. _ni-alt-shells: - -Alternative shells -~~~~~~~~~~~~~~~~~~ - -Jupyter mimics Python's `interactive mode`_. - -If you're more comfortable in a text-based environment, you can use the -`standard Python interpreter`_. However, we recommend `IPython`_ (the kernel -used by Jupyter) for a more user-friendly experience. - -You can install IPython with Pip: - -.. code-block:: console - - sh$ pip3 install --upgrade ipython - -Once installed, you can start an interactive IPython session like this: - -.. code-block:: console - - sh$ ipython - - Python 3.9.10 (main, Jan 15 2022, 11:48:04) - Type 'copyright', 'credits' or 'license' for more information - IPython 8.0.1 -- An enhanced Interactive Python. Type '?' for help. - - In [1]: - - -.. _ni-steps: - -Steps -===== - -To follow along with this tutorial, copy and paste the example Python code into -Jupyter Notebook and evaluate the input one block at a time. - - -.. _ni-query-raw: - -Query the raw data ------------------- - -This tutorial uses `pandas`_ to query CrateDB and manipulate the results. - -To get started, import the ``pandas`` library: - -.. code-block:: python - - import pandas - -Pandas uses `SQLAlchemy`_ and the :ref:`crate-python:index` to provide support -for ``crate://`` style :ref:`connection strings `. - -Then, query the raw data: - -.. 
code-block:: python
-
-    pandas.read_sql('SELECT * FROM doc.iss', 'crate://localhost:4200')
-
-.. NOTE::
-
-    By default, CrateDB binds to port ``4200`` on ``localhost``.
-
-    Edit the connection string as needed.
-
-If you evaluate the :py:func:`read_sql() ` call above, the
-Jupyter notebook should eventually display a table like this:
-
-.. csv-table::
-   :header: "", "timestamp", "position"
-   :widths: auto
-
-   "0", "1591865682133", "[144.0427, 22.7383]"
-   "1", "1591865702975", "[144.9187, 21.7528]"
-   "2", "1591865775973", "[147.9357, 18.2015]"
-   "3", "1591865818387", "[149.6088, 16.1326]"
-   "4", "1591865849756", "[150.8377, 14.5709]"
-   "…", "…", "…"
-   "59", "1591866131684", "[161.2033, 0.4045]"
-   "60", "1591866236187", "[164.9696, -4.896]"
-   "61", "1591866016657", "[157.0666, 6.21]"
-   "62", "1591866267764", "[166.1145, -6.4896]"
-   "63", "1591866278210", "[166.4979, -7.0202]"
-
-Here are a few ways to improve this result:
-
-.. rst-class:: open
-
-  * The current query returns all data. At first, this is probably okay for
-    visualization purposes. However, as you generate more data, you will
-    find it more useful to limit the results to a specific time window.
-
-  * The ``timestamp`` column isn't human-readable. It would be easier to
-    understand the results if this value were shown as a human-readable time.
-
-  * The ``position`` column is a :ref:`crate-reference:data-types-geo`. This data
-    type isn't easy to plot on a traditional graph. However, you can use the
-    :ref:`distance() ` function to calculate the
-    distance between two ``geo_point`` values. If you compare ``position`` to a
-    fixed place, you can plot distance over time for a graph showing you how far
-    away the ISS is from some location of interest.
-
-Here's an improvement that wraps the code in a function named ``raw_data()`` so
-that you can execute this query multiple times:
-
-.. 
code-block:: python - - import pandas - - def raw_data(): - # From - berlin_position = [52.520008, 13.404954] - # Returns distance in kilometers (division by 1000) - sql = f''' - SELECT iss.timestamp AS time, - DISTANCE(iss.position, {berlin_position}) / 1000 AS distance - FROM doc.iss - WHERE iss.timestamp >= CURRENT_TIMESTAMP - INTERVAL '1' DAY - ORDER BY time ASC - ''' - return pandas.read_sql(sql, 'crate://localhost:4200', parse_dates={'time': 'ms'}) - -Specifically: - -.. rst-class:: open - - * You can define the `location`_ of Berlin and interpolate that into the query - to calculate the ``DISTANCE()`` of the ISS ground point in kilometers. - - * You can use :ref:`CURRENT_TIMESTAMP ` with an - interval :ref:`value expression ` - (``INTERVAL '1' DAY``) to calculate a timestamp that is 24 hours in the - past. You can then use a :ref:`WHERE clause ` - to filter out records with a ``timestamp`` older than one day. - - An :ref:`ORDER BY clause ` sorts the results - by ``timestamp``, oldest first. - - * You can use the ``parse_dates`` argument to specify which columns - ``read_sql()`` should parse as datetimes. Here, a dictionary with the value - of ``ms`` is used to specify that ``time`` is a millisecond integer. - -Execute the ``raw_data()`` function: - -.. code-block:: python - - raw_data() - -Jupyter should display a table like this: - -.. 
csv-table:: - :header: "", "time", "distance" - :widths: auto - - "0", "2020-06-11 08:54:21.153", "9472.748594" - "1", "2020-06-11 08:54:31.675", "9530.500793" - "2", "2020-06-11 08:54:42.133", "9588.243498" - "3", "2020-06-11 08:54:52.559", "9643.233027" - "4", "2020-06-11 08:55:02.975", "9700.967306" - "…", "…", "…" - "444", "2020-06-11 10:11:51.812", "4249.557635" - "445", "2020-06-11 10:12:02.273", "4251.786695" - "446", "2020-06-11 10:12:12.698", "4254.968453" - "447", "2020-06-11 10:12:23.147", "4259.121566" - "448", "2020-06-11 10:12:33.699", "4264.223073" - -Above, notice the query used by the ``raw_data()`` function produces: - - * Fewer rows than the previous query (limited by the 24 hour time window) - - * A human-readable time (instead of a timestamp) - - * The distance of the ISS ground point in kilometers (instead of a - ``geo_point`` object) - - -.. _ni-plot: - -Plot the data -------------- - -You can plot the data returned by the previous query using `Matplotlib`_. - -Here's an example function that plots the data: - -.. 
code-block:: python - - import matplotlib.pyplot as plt - import matplotlib.dates as mdates - - def plot(data): - fig, ax = plt.subplots(figsize=(12, 6)) - ax.scatter(data['time'], data['distance']) - ax.set( - xlabel='Time', - ylabel='Distance (km)', - title='ISS Ground Point Distance (Past 24 Hours)') - ax.xaxis_date() - ax.xaxis.set_major_locator(mdates.HourLocator()) - ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:00')) - # Plot the whole date range (null time values are trimmed by default) - ax.set_xlim(data.min()['time'], data.max()['time']) - fig.autofmt_xdate() - -Above, the ``plot()`` function: - - * Generates a :py:func:`figure ` that measures 12 × 6 (inches) - * Plots ``data`` as a :py:meth:`scatter ` diagram (distance over time) - * Sets the :py:class:`axes ` labels and title - * Sets up the x-axis to :py:meth:`handle datetimes ` - * Configures major :py:meth:`tick locations ` - every :py:class:`hour ` - * Configures major :py:meth:`tick formatting ` - with a :py:class:`time string ` (``%H:00``) - * Forces Matplotlib to plot the whole data set, including null ``time`` - values, by manually setting the :py:meth:`limits of the x-axis ` - (which are trimmed by default) - * Activates x-axis tick label :py:meth:`auto-formatting ` - (rotates them for improved readability) - - -.. SEEALSO:: - - The full `Matplotlib documentation`_ - -You can test the ``plot()`` function by passing in the return value of -``raw_data()``: - -.. code-block:: python - - plot(raw_data()) - -Jupyter should display a plot like this: - -.. image:: _assets/img/normalize-intervals/raw-data.png - -Above, notice that: - - * This plot looks more like a :py:func:`line chart ` - than a :py:func:`scatter diagram `. That's - because the raw data appears in intervals of 10 seconds. At this - resolution, such a high sampling frequency produces so many data points that - they appear to be a continuous line. - - * The x-axis does not cover a full 24 hours. 
-
-    Matplotlib is plotting the whole data set, as requested. However, the
-    data generation script has only been running for a short period.
-
-    The query used by ``raw_data()`` only filters out records older than 24
-    hours (using a ``WHERE`` clause). The query does not fill in data for any
-    missing time intervals. As a result, the visualization may be inaccurate if
-    there is any missing data (in the sense that it will not indicate the
-    presence of missing data).
-
-
-.. _ni-resample:
-
-Resample the data
------------------
-
-When plotting a longer timeframe, a sampling frequency of 10 seconds can be too
-high, creating an unnecessarily large number of data points. Therefore, here is
-a basic approach to resampling data at a lower frequency:
-
-  1. Place values of the ``time`` column into bins for a given interval (using
-     :ref:`DATE_BIN() `).
-
-     In this example, we are resampling the data per minute. This means that all
-     rows with an identical ``time`` value on minute-level are placed into the
-     same date bin.
-
-  2. Group rows per date bin (using
-     :ref:`GROUP BY `).
-
-     The position index ``1`` is a reference to the first column of the
-     ``SELECT`` clause so we don't need to repeat the whole ``DATE_BIN``
-     function call.
-
-  3. Calculate an :ref:`aggregate ` value across the
-     grouped rows.
-
-     For example, if you have six rows with six distances, you can calculate the
-     average distance (using :ref:`crate-reference:aggregation-avg`) and return a
-     single value.
-
-.. TIP::
-
-    *Date bin* is short for *date binning*, or `data binning`_ in general.
-    It is sometimes also referred to as *time bucketing*.
-
-Here's a new function with a rewritten query that implements the three steps
-above and resamples the raw data by the minute:
-
-.. 
code-block:: python - - def data_by_minute(): - # From - berlin_position = [52.520008, 13.404954] - # Returns distance in kilometers (division by 1000) - sql = f''' - SELECT DATE_BIN('1 minute'::INTERVAL, iss.timestamp, 0) AS time, - COUNT(*) AS records, - AVG(DISTANCE(iss.position, {berlin_position}) / 1000.0) AS distance - FROM doc.iss - WHERE iss.timestamp >= CURRENT_TIMESTAMP - '1 day'::INTERVAL - GROUP BY 1 - ORDER BY 1 ASC - ''' - return pandas.read_sql(sql, 'crate://localhost:4200', parse_dates={'time': 'ms'}) - -.. NOTE:: - - The ``DATE_BIN`` function is available in CrateDB versions >= 4.7.0. In - older versions, you can use ``DATE_TRUNC('minute', "timestamp")`` instead. - - The ``records`` column produced by this query will tell you how many source - rows have been grouped by the query per result row. - -Check the output: - -.. code-block:: python - - data_by_minute() - -.. csv-table:: - :header: "", "time", "records", "distance" - :widths: auto - - "0", "2020-06-11 08:54:00", "4", "9558.681475" - "1", "2020-06-11 08:55:00", "6", "9844.287176" - "2", "2020-06-11 08:56:00", "6", "10188.625052" - "3", "2020-06-11 08:57:00", "5", "10504.130406" - "4", "2020-06-11 08:58:00", "6", "10816.039363" - "…", "…", "…", "…" - "130", "2020-06-11 11:04:00", "6", "15800.416911" - "131", "2020-06-11 11:05:00", "5", "15716.643869" - "132", "2020-06-11 11:06:00", "6", "15605.661046" - "133", "2020-06-11 11:07:00", "6", "15457.347545" - "134", "2020-06-11 11:08:00", "1", "15358.879053" - -.. TIP:: - - Despite an ideal time series interval of 10 seconds, some result rows may - be aggregating values over fewer than six records. - - Irregularities may occur when: - - * Data collection started or stopped during that period - * There were delays in the data collection (e.g., caused by network - latency, CPU latency, disk latency, and so on) - -You can plot this data like before: - -.. code-block:: python - - plot(data_by_minute()) - -.. 
image:: _assets/img/normalize-intervals/data-by-minute.png
-
-Here, notice that the individual data points are now visible (i.e., the
-apparent line in the previous diagram is now discernible as a series of
-discrete values).
-
-
-.. _ni-interpolate:
-
-Interpolate missing records
----------------------------
-
-The ``data_by_minute()`` function resamples data by the minute. However, the
-query used can only resample data for minutes with one or more records.
-
-If you want one data point per minute interval irrespective of the number of
-``records``, you must `interpolate`_ those values.
-
-You can interpolate data in many ways, some more advanced than others. For this
-tutorial, we will show you how to achieve the simplest possible type of
-interpolation: *null interpolation*.
-
-Null interpolation works by filling in any gaps in the time series with
-``NULL`` values. ``NULL`` is a value used to indicate missing data. The result
-is a time series that indicates the presence of missing data, lending
-itself well to accurate visualization.
-
-You can perform null interpolation like so:
-
-.. rst-class:: open
-
-  1. Generate continuous null data for the same period as the left-hand table
-     of a join. You should sample this data at the frequency most appropriate
-     for your visualization.
-
-  2. Select the data for the period you are interested in as the right-hand
-     table of a join. You should resample this data at the same frequency as
-     your null data.
-
-  3. Join both tables with a left :ref:`outer join ` on
-     ``time`` to pull across any non-null values from the right-hand table.
-
-The result is a row set that has one row per interval for a fixed period with
-null values filling in for missing data.
-
-.. SEEALSO::
-
-    Read more about :ref:`how joins work `.
-
-.. 
_ni-brief-example:
-
-A brief example
-~~~~~~~~~~~~~~~
-
-To illustrate how null interpolation works with a brief example, imagine that
-you are interested in a specific five-minute period between 07:00 and 07:05.
-
-Here's your resampled data:
-
-.. csv-table::
-   :header: "", "time", "records", "distance"
-   :widths: auto
-
-   "0", "2020-06-11 07:00:00", "5", "11871.619396"
-   "1", "2020-06-11 07:02:00", "6", "12415.473163"
-   "2", "2020-06-11 07:03:00", "3", "13055.554924"
-
-Notice that rows for 07:01 and 07:04 are missing. Perhaps the data collection
-process ran into issues during those time windows.
-
-If you generate null data for the same period, it will look like this:
-
-.. csv-table::
-   :header: "", "time", "distance"
-   :widths: auto
-
-   "0", "2020-06-11 07:00:00", "None"
-   "1", "2020-06-11 07:01:00", "None"
-   "2", "2020-06-11 07:02:00", "None"
-   "3", "2020-06-11 07:03:00", "None"
-   "4", "2020-06-11 07:04:00", "None"
-
-.. NOTE::
-
-    A column full of null values will be :py:meth:`cast
-    ` to `None`_ values by pandas.
-    That's why this table displays ``None`` instead of ``NULL``.
-
-If you perform a left outer join with those two result sets (on the ``time``
-column), you will end up with the following:
-
-.. csv-table::
-   :header: "", "time", "records", "distance"
-   :widths: auto
-
-   "0", "2020-06-11 07:00:00", "5", "11871.619396"
-   "1", "2020-06-11 07:01:00", "0", "NaN"
-   "2", "2020-06-11 07:02:00", "6", "12415.473163"
-   "3", "2020-06-11 07:03:00", "3", "13055.554924"
-   "4", "2020-06-11 07:04:00", "0", "NaN"
-
-Here, notice that:
-
-.. rst-class:: open
-
-  * There is one result row per minute interval, even when there are no
-    corresponding ``records``.
-
-  * Missing data results in a ``distance`` value of :py:obj:`NaN
-    ` (Not a Number). Pandas will cast ``NULL`` values to
-    ``NaN`` when a column contains numeric data.
-
-.. SEEALSO::
-
-    Read more about :ref:`pandas:missing_data` using pandas.
-
-
-.. 
_ni-null-data: - -Generate continuous null data for the past 24 hours -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can generate continuous null data with the :ref:`generate_series() -` table function. A :ref:`table -function ` is a function that produces a set -of rows. - -For example, this query generates null values for every minute in the past 24 -hours: - -.. code-block:: python - - def null_by_minute_24h(): - sql = ''' - SELECT time, - NULL AS distance - FROM generate_series( - DATE_TRUNC('minute', CURRENT_TIMESTAMP) - INTERVAL '24 hours', - DATE_TRUNC('minute', CURRENT_TIMESTAMP), - '1 minute'::INTERVAL - ) AS series(time) - ''' - return pandas.read_sql(sql, 'crate://localhost:4200', parse_dates={'time': 'ms'}) - -Test the function, like so: - -.. code-block:: python - - null_by_minute_24h() - -.. csv-table:: - :header: "", "time", "distance" - :widths: auto - - "0", "2020-06-10 07:09:00", "None" - "1", "2020-06-10 07:10:00", "None" - "2", "2020-06-10 07:11:00", "None" - "3", "2020-06-10 07:12:00", "None" - "4", "2020-06-10 07:13:00", "None" - "…", "…", "…" - "1436", "2020-06-11 07:05:00", "None" - "1437", "2020-06-11 07:06:00", "None" - "1438", "2020-06-11 07:07:00", "None" - "1439", "2020-06-11 07:08:00", "None" - "1440", "2020-06-11 07:09:00", "None" - -Plot the data: - -.. code-block:: python - - plot(null_by_minute_24h()) - -.. image:: _assets/img/normalize-intervals/null-by-minute-24h.png - -This plot displays null values for a full 24 hour period. - -Conceptually, all that remains is to combine this null plot with the plot that -includes your resampled data. - - -.. _ni-bring-together: - -Bring it all together -~~~~~~~~~~~~~~~~~~~~~ - -To combine the null data with your resampled data, you can write a new query -that performs a left :ref:`crate-reference:inner-joins`, as per the previous -:ref:`introductions `. - -.. 
code-block:: python
-
-    def data_24h():
-        # From
-        berlin_position = [52.520008, 13.404954]
-        # Returns distance in kilometers (division by 1000)
-        sql = f'''
-            SELECT time,
-                   COUNT(iss.timestamp) AS records,
-                   AVG(DISTANCE(iss.position, {berlin_position}) / 1000) AS distance
-            FROM generate_series(
-                DATE_TRUNC('minute', CURRENT_TIMESTAMP) - INTERVAL '24 hours',
-                DATE_TRUNC('minute', CURRENT_TIMESTAMP),
-                '1 minute'::INTERVAL
-            ) AS series(time)
-            LEFT JOIN doc.iss ON DATE_BIN('1 minute'::INTERVAL, iss.timestamp, 0) = time
-            GROUP BY time
-            ORDER BY time ASC
-        '''
-        return pandas.read_sql(sql, 'crate://localhost:4200', parse_dates={'time': 'ms'})
-
-In the code above:
-
-.. rst-class:: open
-
-  * The :ref:`generate_series() `
-    table function creates a row set called ``time`` that has one row per minute
-    for the past 24 hours.
-
-  * The ``iss`` table can be joined to the ``time`` series by placing the
-    ``iss.timestamp`` column into one-minute date bins for the :ref:`join condition
-    `.
-
-  * Like before, a :ref:`GROUP BY ` clause can be
-    used to collapse multiple rows per minute into a single row per minute.
-
-    Similarly, the :ref:`crate-reference:aggregation-avg` function can be used to
-    compute an aggregate ``DISTANCE`` value across multiple rows. There is no
-    need to check for null values here because the ``AVG()`` function discards
-    null values.
-
-Test the function:
-
-.. code-block:: python
-
-    data_24h()
-
-.. 
csv-table:: - :header: "", "time", "records", "distance" - :widths: auto - - "0", "2020-06-11 12:23:00", "0", "NaN" - "1", "2020-06-11 12:24:00", "0", "NaN" - "2", "2020-06-11 12:25:00", "0", "NaN" - "3", "2020-06-11 12:26:00", "0", "NaN" - "4", "2020-06-11 12:27:00", "0", "NaN" - "…", "…", "…", "…" - "1436", "2020-06-12 12:19:00", "5", "9605.382566" - "1437", "2020-06-12 12:20:00", "5", "9229.775335" - "1438", "2020-06-12 12:21:00", "4", "8880.479672" - "1439", "2020-06-12 12:22:00", "5", "8536.238527" - "1440", "2020-06-12 12:23:00", "0", "8318.402324" - -Plot the data: - -.. code-block:: python - - plot(data_24h()) - -.. image:: _assets/img/normalize-intervals/data-24h.png - -And here's what it looks like if you wait a few more hours: - -.. image:: _assets/img/normalize-intervals/data-24h-more.png - -The finished result is a visualization that uses time series normalization and -resamples raw data to regular intervals with the interpolation of missing values. - -This visualization resolves both original issues: - -.. rst-class:: open - -1. *You want to plot a single value per minute, but your data is spaced in - 10-second intervals. You will need to resample the data.* - -2. *You want to plot every minute for the past 24 hours, but you are missing - data for some intervals. You will need to fill in the missing values.* - -.. _data binning: https://en.wikipedia.org/wiki/Data_binning -.. _ground point: https://en.wikipedia.org/wiki/Ground_track -.. _interactive mode: https://docs.python.org/3/tutorial/interpreter.html#interactive-mode -.. _International Space Station: https://www.nasa.gov/mission_pages/station/main/index.html -.. _Internet of Things: https://en.wikipedia.org/wiki/Internet_of_things -.. _interpolate: https://en.wikipedia.org/wiki/Interpolation -.. _IPython: https://ipython.org/ -.. _Jupyter Notebook basics: https://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb -.. 
_Jupyter Notebook: https://jupyter.org/ -.. _location: https://www.latlong.net/ -.. _Matplotlib documentation: https://matplotlib.org/stable/ -.. _Matplotlib: https://matplotlib.org/ -.. _None: https://docs.python.org/3/library/constants.html#None -.. _pandas: https://pandas.pydata.org/ -.. _Pip: https://pypi.org/project/pip/ -.. _Python: https://www.python.org/ -.. _SQLAlchemy: https://www.sqlalchemy.org/ -.. _standard Python interpreter: https://docs.python.org/3/tutorial/interpreter.html -.. _system load: https://en.wikipedia.org/wiki/Load_(computing) -.. _geojson: https://github.com/jazzband/geojson