Merge pull request #48 from shaunkirthan/main
Data_Storage_and_Wearhouse_Lab
raminmohammadi authored Sep 18, 2024
2 parents 6da302e + 86d1548 commit 40b8125
Showing 4 changed files with 303 additions and 0 deletions.
92 changes: 92 additions & 0 deletions Labs/Data_Storage_Warehouse_Labs/Lab1/BigQuery/README.md
@@ -0,0 +1,92 @@
---

# **Lab 2: Data Warehousing using Google Cloud (GCP)**

---

## **Introduction to Google BigQuery**

### **What is Google BigQuery?**
Google BigQuery is a fully managed, serverless data warehouse from Google Cloud. It lets you store large datasets and analyze them with SQL without provisioning or maintaining servers. BigQuery is designed for large-scale analytics, and queries over billions of rows often finish in seconds.

---

## **Step 1: Set Up BigQuery in Google Cloud Console**

### **Steps to Set Up BigQuery:**

1. **Go to Google Cloud Console**:
- Navigate to the [Google Cloud Console](https://console.cloud.google.com/).

2. **Enable the BigQuery API**:
- From the **Navigation Menu** on the left, scroll down and select **BigQuery** under the **Big Data** section.
- If this is your first time using BigQuery, you may be prompted to enable the BigQuery API. Click **Enable**.

3. **Create a New BigQuery Dataset**:
- Once in BigQuery, you will see your project listed in the explorer pane on the left side.
- Click on your project, then click the **Create Dataset** button.
- Name your dataset (e.g., `lab2_dataset`), choose a **data location** (e.g., `us-east1`, ideally the same region as the GCS bucket you will load data from), and leave the other settings as default.
- Click **Create Dataset**.
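
The same dataset can also be created from a terminal with the `bq` command-line tool that ships with the Google Cloud SDK. The following is a minimal sketch, not part of the original lab: `your-project-id` is a placeholder, and the small `run` helper is a hypothetical convenience that only prints each command unless you set `DRY_RUN=0`.

```bash
# Placeholder values; replace with your own project and dataset.
PROJECT_ID="your-project-id"
DATASET="lab2_dataset"
LOCATION="us-east1"

# bq refers to a dataset as PROJECT_ID:DATASET.
DATASET_REF="${PROJECT_ID}:${DATASET}"

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the dataset in the chosen location.
run bq --location="$LOCATION" mk --dataset "$DATASET_REF"

# List the datasets in the default project to confirm it exists.
run bq ls
```

Review the printed commands first, then rerun the script with `DRY_RUN=0` once `bq` is authenticated against your project.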

---

## **Step 2: Load Data into BigQuery**

### **What is Data Loading in BigQuery?**
Loading data into BigQuery means taking data from a file (e.g., CSV) stored in a source like Google Cloud Storage (GCS) and uploading it to a BigQuery table so that you can analyze it using SQL.

In this step, you will load data from a CSV file into a table in BigQuery.

### **Steps to Load Data into BigQuery:**

1. **Prepare Your Dataset**:
- You should already have your dataset uploaded to a Google Cloud Storage bucket (as covered in the previous lab).
- For example, let’s assume your dataset is stored as `gs://<your-bucket-name>/data/dataset.csv`.

2. **Create a BigQuery Table**:
- In the BigQuery Console, under your dataset, click on **Create Table**.
- **Source**: Choose **Google Cloud Storage** as the source.
- **File Path**: In the **URI** field, enter the path to your file (e.g., `gs://<your-bucket-name>/data/dataset.csv`).
- **File Format**: Choose `CSV` as the file format.
- **Table Name**: Name your table (e.g., `lab2_table`).
- **Schema**: Select **Auto Detect** if you want BigQuery to automatically determine the schema, or enter the schema manually by specifying field names and data types.
- Click **Create Table** to load the data into BigQuery.
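
The console steps above have a command-line counterpart in `bq load`. This sketch assumes the file path and table names used as examples in this lab; the `run` helper is a hypothetical dry-run wrapper that prints commands unless `DRY_RUN=0` is set.

```bash
# Placeholder values; replace with your own bucket, dataset, and table.
BUCKET="your-bucket-name"
DATASET="lab2_dataset"
TABLE="lab2_table"

SOURCE_URI="gs://${BUCKET}/data/dataset.csv"
TABLE_REF="${DATASET}.${TABLE}"

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Load the CSV and let BigQuery infer column names and types.
run bq load --source_format=CSV --autodetect "$TABLE_REF" "$SOURCE_URI"

# Inspect the schema that was detected.
run bq show --schema --format=prettyjson "$TABLE_REF"
```

Checking the detected schema right after the load is a cheap way to catch columns that auto-detection typed differently from what you expected.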

---

## **Step 3: Querying Data in BigQuery**

### **What is Querying in BigQuery?**
Querying in BigQuery means running SQL statements against your tables to retrieve specific data. Once the data is loaded, you can run SQL queries to perform operations like filtering, aggregating, or summarizing the data.

### **Steps to Query Data in BigQuery:**

1. **Go to BigQuery Editor**:
- Once the data is loaded, you can query the table directly in BigQuery's SQL editor.

2. **Write a Basic Query**:
- Here’s a simple SQL query that selects the first 10 rows from the table:

```sql
SELECT * FROM `<your-project-id>.<lab2_dataset>.<lab2_table>` LIMIT 10;
```

- Replace `<your-project-id>`, `<lab2_dataset>`, and `<lab2_table>` with your actual project, dataset, and table names.

3. **Run the Query**:
- Click the **Run** button in the SQL editor.
- BigQuery will process the query and return the results in seconds.
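
The query can also be run from a terminal with `bq query`. The sketch below is an illustration under assumed names (`your-project-id`, `lab2_dataset`, `lab2_table`); it first asks BigQuery for a dry run, which reports how many bytes the query would process without executing it. The `run` helper is a hypothetical wrapper that prints commands unless `DRY_RUN=0` is set.

```bash
# Placeholder values; replace with your own project, dataset, and table.
PROJECT_ID="your-project-id"
DATASET="lab2_dataset"
TABLE="lab2_table"

# Backticks are escaped so the shell does not treat them as command substitution.
QUERY="SELECT * FROM \`${PROJECT_ID}.${DATASET}.${TABLE}\` LIMIT 10"

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Ask BigQuery to estimate the bytes processed without running the query.
run bq query --use_legacy_sql=false --dry_run "$QUERY"

# Run the query and print the first 10 rows.
run bq query --use_legacy_sql=false "$QUERY"
```

Note that `LIMIT 10` caps the rows returned, not necessarily the data scanned, so the estimate for `SELECT *` can still cover the whole table.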

---

## **Conclusion**

By completing this lab, you have learned the following:
1. How to create a BigQuery dataset and table.
2. How to load data from Google Cloud Storage into BigQuery.
3. How to query data in BigQuery using SQL.

BigQuery is a powerful tool for handling large-scale data analytics, and this lab provides the foundational skills needed to start working with BigQuery on real-world datasets.

165 changes: 165 additions & 0 deletions Labs/Data_Storage_Warehouse_Labs/Lab1/Buckets/README.md
@@ -0,0 +1,165 @@
# **Lab 1: Data Storage and Warehouse using Google Cloud (GCP)**

---

## **Set Up Google Cloud Storage (GCS) Bucket**

### **What is GCS?**
Google Cloud Storage (GCS) is a scalable and secure storage service that lets you store any type of data and easily integrate with other Google services such as BigQuery, Machine Learning, and more.

---

### **Steps to Create a GCS Bucket:**

1. **Go to Google Cloud Console**:
- Navigate to the [Google Cloud Console](https://console.cloud.google.com/).
- On the left sidebar, scroll down to **Cloud Storage** and click on **Buckets** (shown as **Storage** > **Browser** in older versions of the console).

2. **Create a New Bucket**:
- Click on **Create Bucket**.
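- Assign a **globally unique name** to your bucket. Bucket names share one namespace across all of Google Cloud, so a generic name such as `gcp-lab-bucket` is probably taken; add a distinguishing suffix such as your project ID.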
- For **Location**, choose `us-east1`.
- Click **Create**.

3. **Set Permissions**:
- Go to **IAM & Admin** in the console.
- Under **Service Accounts**, click **Create Service Account**.
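- Name the service account (e.g., `lab1-service-account`), then select the **Owner** role and click **Done**. Owner grants full control of the project, which is acceptable in a disposable lab project; elsewhere, prefer a narrower role such as **Storage Admin**.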

4. **Download Credentials**:
- In the **Service Accounts** page, click the three dots (⋮) next to your service account and select **Manage Keys**.
- Click **Add Key**, then **Create New Key** and choose **JSON**.
- Download the JSON file and store it securely; never commit it to version control. Applications and tools use this key file to authenticate as the service account.
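
For reference, the console steps above can be approximated from a terminal. This is a sketch under assumptions, not the lab's prescribed method: it grants `roles/storage.admin` instead of Owner as a deliberately narrower substitute, the key file name `key.json` is hypothetical, and the bucket-name check is a simplified version of the real naming rules. The `run` helper prints commands unless `DRY_RUN=0` is set.

```bash
# Placeholder values; replace with your own.
PROJECT_ID="your-project-id"
BUCKET="gcp-lab-bucket"
LOCATION="us-east1"
SA_NAME="lab1-service-account"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Simplified name check (assumption): 3-63 characters, lowercase letters,
# digits, dots, dashes, underscores; must start and end with a letter or digit.
is_valid_bucket_name() {
  printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the bucket only if the name passes the local check.
if is_valid_bucket_name "$BUCKET"; then
  run gsutil mb -l "$LOCATION" "gs://${BUCKET}"
else
  echo "Invalid bucket name: $BUCKET" >&2
fi

# Create the service account and grant it a storage-scoped role.
run gcloud iam service-accounts create "$SA_NAME" --display-name="Lab 1 service account"
run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${SA_EMAIL}" --role="roles/storage.admin"

# Create and download a JSON key for the service account.
run gcloud iam service-accounts keys create key.json --iam-account="$SA_EMAIL"
```

The local name check only catches formatting mistakes; whether the name is still available is decided by Google Cloud when the bucket is created.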

---

## **Step 1: Establish Connection from Local System to GCS Bucket**

In this step, you will connect your local machine to the GCS bucket to enable data storage.

### **Steps to Connect Local System to GCS Bucket:**

1. **Install Google Cloud SDK** (if not already installed):
- The `gcloud` and `gsutil` commands used below ship with the Google Cloud SDK. If you don't have it installed on your local system, install it by following [this guide](https://cloud.google.com/sdk/docs/install).

2. **Authenticate with GCP**:
On your local machine, authenticate to Google Cloud by running:

```bash
gcloud auth login
```

This will open a web browser where you need to log in with your Google account.

3. **Set Your GCP Project**:
Once authenticated, set the project that contains your GCS bucket:

```bash
gcloud config set project <your-project-id>
```

Replace `<your-project-id>` with the ID of your Google Cloud project.

4. **Verify Access to the GCS Bucket**:
Test your access to the bucket by listing the contents (if any):

```bash
gsutil ls gs://<your-bucket-name>
```

Replace `<your-bucket-name>` with the actual name of your bucket (e.g., `gcp-lab-bucket`). If this returns no errors, your local system is now connected to the GCS bucket.
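
Before running the commands in this step, it can help to confirm that the tools are on your `PATH` and that the expected account and project are active. The script below is a small sketch of such a check; `missing_tools` and `run` are hypothetical helpers, and `run` only prints commands unless `DRY_RUN=0` is set.

```bash
# Placeholder value; replace with your own bucket.
BUCKET="your-bucket-name"

# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

MISSING="$(missing_tools gcloud gsutil)"
if [ -n "$MISSING" ]; then
  echo "Install the Google Cloud SDK first. Missing tools:"
  echo "$MISSING"
fi

# Show the active account and project, then list the bucket.
run gcloud auth list
run gcloud config get-value project
run gsutil ls "gs://${BUCKET}"
```

If the account or project printed is not the one that owns the bucket, repeat the authentication and project steps before retrying the listing.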

---

## **Step 2: Storing a Dataset in the GCS Bucket**

Now that the connection between your local system and GCS is set, you can store a dataset in the bucket. Let's add some additional features for better management of the data.

### **Steps to Upload Dataset to GCS Bucket:**

1. **Prepare the Dataset**:
Assume you have a dataset named `dataset.csv` in a folder called `data`. First, make sure your dataset is in place:

```bash
mkdir -p data
mv <path-to-your-dataset> data/dataset.csv
```

2. **Upload the Dataset to GCS**:
Using `gsutil`, you can upload your dataset to the GCS bucket:

```bash
gsutil cp data/dataset.csv gs://<your-bucket-name>/data/dataset.csv
```

This command uploads the `dataset.csv` file to the specified GCS bucket under the `data/` prefix, which the console displays as a folder.

3. **Enable Versioning on the GCS Bucket**:
To track multiple versions of the dataset (useful for large-scale projects), enable object versioning on the GCS bucket:

```bash
gsutil versioning set on gs://<your-bucket-name>
```

With versioning enabled, GCS will maintain previous versions of the file when you update or overwrite it.

4. **Apply Object Lifecycle Management** (optional but recommended):
You can configure lifecycle management policies for the objects in your bucket, such as deleting older versions after a certain number of days.

- Create a lifecycle configuration file `lifecycle.json`:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
```

- Then apply it to the bucket:

```bash
gsutil lifecycle set lifecycle.json gs://<your-bucket-name>
```

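This configuration deletes any object more than 30 days old, measured from its creation time, so the lab dataset itself is affected. With versioning enabled, deleting a live object first turns it into a noncurrent version; a noncurrent version that matches the rule is removed permanently.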
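
If the goal is to clean up only old versions while keeping the current dataset, the rule can be restricted with the `isLive` condition. The following is a sketch of that variant, not the lab's configuration: the file name `lifecycle-noncurrent.json` is hypothetical, and `run` is a dry-run wrapper that prints commands unless `DRY_RUN=0` is set.

```bash
# Placeholder value; replace with your own bucket.
BUCKET="your-bucket-name"
RULES_FILE="lifecycle-noncurrent.json"  # hypothetical file name

# Delete only noncurrent versions whose creation date is over 30 days old.
cat > "$RULES_FILE" <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30, "isLive": false}
    }
  ]
}
EOF

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Confirm versioning is on, apply the rule, and read it back.
run gsutil versioning get "gs://${BUCKET}"
run gsutil lifecycle set "$RULES_FILE" "gs://${BUCKET}"
run gsutil lifecycle get "gs://${BUCKET}"

# List every stored version of the dataset, including noncurrent ones.
run gsutil ls -a "gs://${BUCKET}/data/dataset.csv"
```

Applying a lifecycle configuration replaces whatever configuration the bucket already has, so keep all the rules you want in one file.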

---

## **Track the Dataset with DVC (Optional)**

If you already have DVC set up, you can track the dataset with DVC as follows. Note that `dvc push` uploads to whichever DVC remote is configured for the repository.
(If you haven't set up DVC, you can find the steps under Labs --> Data Labs --> DVC_Labs.)

```bash
dvc add data/dataset.csv
git add data/dataset.csv.dvc data/.gitignore
git commit -m "Track dataset with DVC"
dvc push
```
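
If the repository has no DVC remote yet, one option is to point DVC at the lab bucket. The sketch below is an assumption about how you might wire this up, not a step from the original lab: the remote name `gcs_remote` and the `dvcstore` prefix are hypothetical, and pushing to GCS generally requires DVC's Google Cloud Storage support to be installed. The `run` helper prints commands unless `DRY_RUN=0` is set.

```bash
# Placeholder values; replace with your own.
BUCKET="your-bucket-name"
DVC_REMOTE_NAME="gcs_remote"              # hypothetical remote name
DVC_REMOTE_URL="gs://${BUCKET}/dvcstore"  # hypothetical prefix inside the bucket

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# To authenticate with the service account key from Lab 1, point the
# standard credentials variable at it before pushing, for example:
# export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"

# Register the bucket as the default DVC remote and confirm it is listed.
run dvc remote add -d "$DVC_REMOTE_NAME" "$DVC_REMOTE_URL"
run dvc remote list

# Upload the tracked data to the remote.
run dvc push
```

Using a dedicated prefix keeps DVC's content-addressed cache separate from the plain `data/dataset.csv` copy uploaded earlier.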

---

## **Access Your Dataset from GCS:**

To download or share the dataset later, you can use:

```bash
gsutil cp gs://<your-bucket-name>/data/dataset.csv ./data/dataset.csv
```

---

## **Additional Features:**

- **Encryption**: GCS encrypts data at rest by default with Google-managed keys. If you need control over the keys, you can use customer-managed or customer-supplied encryption keys instead.
- **Access Control**: Use GCS bucket policies and IAM roles to control who can access the data. You can grant access to specific users or service accounts.
- **Monitoring**: Enable logging and monitoring to track access patterns and any changes to your data over time using Google Cloud’s monitoring tools.
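
As an illustration of the access-control point, the sketch below grants one user read-only access to the objects in the bucket. The email address is a made-up example, and `run` is a hypothetical dry-run wrapper that prints commands unless `DRY_RUN=0` is set.

```bash
# Placeholder values; replace with your own.
BUCKET="your-bucket-name"
VIEWER="user:teammate@example.com"  # hypothetical member

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Manage access through IAM only, rather than per-object ACLs.
run gsutil uniformbucketlevelaccess set on "gs://${BUCKET}"

# Grant read-only access to objects in the bucket.
run gsutil iam ch "${VIEWER}:objectViewer" "gs://${BUCKET}"

# Review the resulting IAM policy.
run gsutil iam get "gs://${BUCKET}"
```

Granting `objectViewer` on a single bucket is narrower than a project-level role, which keeps access limited to the data the collaborator actually needs.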

---

With these steps, your local system is now connected to the GCS bucket, and you can store and version datasets securely. Additionally, by enabling versioning and lifecycle management, you gain fine-grained control over how your data is stored and managed in the cloud.

45 changes: 45 additions & 0 deletions Labs/Data_Storage_Warehouse_Labs/Lab1/README.md
@@ -0,0 +1,45 @@
# Data Storage and Warehousing using Google Cloud (GCP)

## Introduction

In today’s data-driven landscape, efficiently storing, managing, and analyzing large volumes of data is critical. Google Cloud Platform (GCP) offers powerful services to simplify these tasks. Two essential services in this space are **Google Cloud Storage (GCS)** and **Google BigQuery**.

**Google Cloud Storage (GCS)** provides scalable and secure storage for any type of data, acting as a foundation for cloud-based storage solutions. It integrates seamlessly with other Google services like BigQuery, making it ideal for building data pipelines.

**Google BigQuery** is a serverless data warehouse designed to process large datasets using SQL queries. Its fully managed infrastructure scales to petabyte-sized datasets and can scan terabytes of data in seconds, making it a key tool for businesses seeking quick insights from their data.

This lab introduces you to how GCS and BigQuery work together to provide a full solution for cloud-based data storage and analysis.

---

## Google Cloud Storage (GCS)

- **Scalable Data Storage**
GCS is designed to store data of any size, from small files to large datasets. It provides a centralized location for securely storing data that can be easily accessed from anywhere.

- **Bucket Creation and Management**
You will learn how to create and manage GCS buckets, which serve as containers for your data. You’ll set permissions, manage access with service accounts, and upload datasets.

- **Versioning and Lifecycle Policies**
GCS supports object versioning to keep earlier copies of a file and lifecycle rules to delete objects automatically once they meet conditions such as age. These features help you manage data effectively over time.

---

## Google BigQuery

- **Fast, Serverless Analytics**
BigQuery allows you to analyze massive datasets using SQL, without needing to manage any infrastructure. You’ll explore how to run queries and extract insights quickly and efficiently.

- **Dataset and Table Creation**
You will create datasets and tables in BigQuery, and load data from GCS. This data can then be queried using SQL to perform analysis and aggregations.

- **Seamless Integration with GCS**
BigQuery can load data directly from your storage bucket for analysis, making it easy to combine the two services for efficient data workflows.

---

## Conclusion

By completing these labs, you'll gain practical experience in setting up cloud storage with GCS and analyzing data with BigQuery. Together, they provide a robust solution for managing and processing large-scale datasets in the cloud.


1 change: 1 addition & 0 deletions Labs/Data_Storage_Warehouse_Labs/README.md
@@ -0,0 +1 @@
