From 44149ea47815ff9bc5bc4161ccc1b839db54f817 Mon Sep 17 00:00:00 2001 From: Daniel Schwartz Date: Thu, 2 Nov 2023 09:47:29 -0400 Subject: [PATCH 1/8] Initial draft of clustering module --- python_clustering/python_clustering.md | 265 +++++++++++++++++++++++++ 1 file changed, 265 insertions(+) create mode 100644 python_clustering/python_clustering.md diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md new file mode 100644 index 000000000..c685151ef --- /dev/null +++ b/python_clustering/python_clustering.md @@ -0,0 +1,265 @@ + + +# Python Lesson on Clustering for Machine Learning + +@overview + +### What is clustering? +- Clustering is an unsupervised machine learning technique that groups unlabeled data points into clusters based on their similarity. The goal of clustering is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Clustering algorithms work by measuring the similarity between data points and then grouping similar data points together. There are many different clustering algorithms, each with its own strengths and weaknesses. Some of the most common clustering algorithms include K-Means clustering, hierarchical clustering, and Gaussian Mixture Models (GMMs). + + + + +[True/False] Clustering algorithms are always able to find the "correct" clusters in the data. + + +[( )] True +[(X)] False +*** +
Clustering algorithms are heuristics: they do not guarantee finding the "correct" clusters in the data, and for most datasets there is no single correct clustering to find. The results of a clustering algorithm depend on the distance metric used, how the algorithm is initialized, and the values of its parameters.
+*** + + +[True/False] Clustering algorithms can be used to detect outliers in the data. + + +[( )] True +[(X)] False +*** +
Clustering and anomaly detection address different problems. Clustering algorithms group similar data points together, while anomaly detection algorithms identify data points that differ significantly from the rest of the data. Clustering may incidentally surface outliers, but it is not designed for that purpose.
+*** + +### Unsupervised vs. Supervised Learning +- **Unsupervised learning** is a type of machine learning where the algorithm is trained on unlabeled data. This means that the data does not have pre-defined labels or categories. The goal of unsupervised learning is to identify patterns and relationships in the data without any prior knowledge of the data. +- **Supervised learning** is a type of machine learning where the algorithm is trained on labeled data. This means that the data has pre-defined labels or categories. The goal of supervised learning is to train a model to predict the labels for new data points. + + + +Which of the following is a goal of supervised learning? + + + +[( )] Identify patterns and relationships in the data +[( )] Group similar data points together +[( )] Detect outliers in the data +[(X)] Predict the labels for new data points +[( )] Understand the underlying structure of the data +*** +
Predicting the labels for new data points is the goal of supervised learning. The other options describe unsupervised learning, which identifies patterns and relationships in data without predefined labels. That makes unsupervised learning useful for tasks such as market segmentation, fraud detection, and anomaly detection.
+*** + + + + + +### Applications of clustering in machine learning +Clustering can be used for a variety of tasks, such as: + +- **Customer segmentation:** Clustering can be used to segment customers into different groups based on their demographics, purchase behavior, or other characteristics. This information can then be used to target marketing campaigns or product development efforts to specific customer segments. +- **Product grouping:** Clustering can be used to group products with similar characteristics, such as price, features, or customer reviews. This information can be used to improve product recommendations or to identify opportunities for cross-selling and up-selling. +- **Image segmentation:** Clustering can be used to segment images into different objects or regions. This information can be used in tasks such as object detection, image classification, and image compression. +- **Anomaly detection:** Clustering can be used to identify anomalous data points that are different from the rest of the data. This information can be used to detect fraud, identify errors in data collection, or predict future events. +- **Medical diagnosis:** Clustering can be used to group patients with similar symptoms or medical histories together. This information can be used to improve the accuracy of medical diagnosis and to develop more personalized treatment plans. +- **Scientific research:** Clustering can be used to identify patterns and relationships in scientific data. This information can be used to advance scientific knowledge and to develop new technologies. + +### Examples of clustering in real-world applications +- **Netflix uses clustering to recommend movies and TV shows to its users.** Netflix clusters its users based on their viewing history and then recommends movies and TV shows to users based on the clusters they belong to. +- **Amazon uses clustering to recommend products to its customers.** Amazon clusters its products based on customer reviews and purchase behavior. Amazon then recommends products to customers based on the clusters the products belong to and the customer's past purchase history. +- **Google uses clustering to improve the accuracy of its search results.** Google clusters search results based on the relevance of the results to the search query. Google then displays the most relevant results at the top of the search results page. +- **Banks use clustering to detect fraudulent transactions.** Banks cluster transactions based on their characteristics, such as the amount of money involved, the type of transaction, and the location of the transaction. Banks then flag anomalous transactions as potentially fraudulent. +- **Medical researchers use clustering to identify new biomarkers for diseases.** Medical researchers cluster patients based on their medical histories and symptoms. Researchers then look for patterns in the clusters to identify new biomarkers that can be used to diagnose and treat diseases. + +## K-Means Clustering Algorithm +- The K-Means clustering algorithm works by iteratively assigning data points to clusters based on their distance to the cluster centroids. The cluster centroids are the average values of all the data points in a cluster. + +``` +1. Choose the number of clusters (K): + - This is an important step, as it will determine the outcome of the clustering process. + - There is no one-size-fits-all answer to the question of how to choose K. 
One approach is to use the elbow method, which involves plotting the within-cluster sum of squares (WCSS) for different values of K. + - The elbow point on the plot is the point where the WCSS starts to flatten out, and this is often a good choice for K. +2. Initialize the cluster centroids + - The cluster centroids can be initialized randomly or by choosing K data points from the dataset. +3. Assign each data point to the cluster with the nearest centroid + - The distance between a data point and a cluster centroid can be measured using any distance metric, such as Euclidean distance or Manhattan distance. +4. Recalculate the cluster centroids + - The cluster centroids are recalculated by taking the average of all the data points in each cluster. +5. Repeat steps 3 and 4 until the cluster assignments no longer change + +``` + + + +What is the goal of the K-Means clustering algorithm? + + +[( )] To identify the clusters with the highest within-cluster sum of squares (WCSS) +[( )] To identify the clusters with the lowest within-cluster sum of squares (WCSS) +[( )] To identify the clusters with the highest between-cluster sum of squares (BCSS) +[( )] To identify the clusters with the lowest between-cluster sum of squares (BCSS) +[(X)] To group similar data points together +*** +
+ +The goal of the K-Means clustering algorithm is to group similar data points together. This is achieved by iteratively assigning data points to clusters based on their distance to the cluster centroids. + + + +
+*** + + + +### Python Implementation of K-Means Clustering + + +``` +import numpy as np # Library for math manipulation, loading data +import matplotlib.pyplot as plt # Library for plotting +from sklearn.cluster import KMeans # Library for KMeans clustering + +# Load the data +data = np.loadtxt("data.csv", delimiter=",") + +# Choose the number of clusters +n_clusters = 3 + +# Initialize the KMeans model +kmeans = KMeans(n_clusters=n_clusters) + +# Fit the model to the data +kmeans.fit(data) + +# Predict the cluster labels for each data point +cluster_labels = kmeans.predict(data) + +plt.scatter(data[:, 0], data[:, 1], c=cluster_labels) +plt.xlabel("Feature 1") +plt.ylabel("Feature 2") +plt.title("K-Means Clustering") +plt.show() +``` + + +### Applying K-Means Clustering to a Real-World Dataset + +- **Loading and cleaning the data:** The first step is to load the data into Python and clean it as needed. This may involve removing outliers, handling missing values, and scaling the data. +- **Scaling the data:** It is important to scale the data before applying K-Means clustering. This helps to ensure that all features have equal importance in the clustering process. +- **Choosing the number of clusters (K):** There is no one-size-fits-all answer to the question of how to choose the number of clusters (K). One approach is to use the elbow method, which involves plotting the within-cluster sum of squares (WCSS) for different values of K. The elbow point on the plot is the point where the WCSS starts to flatten out, and this is often a good choice for K. +- **Training and evaluating the K-Means model:** Once you have chosen the number of clusters, you can train the K-Means model on the data. You can then evaluate the model by computing the silhouette score. The silhouette score is a measure of how well the data points are clustered, and a higher score indicates better clustering. +- **Visualizing the clusters:** Once you have trained and evaluated the K-Means model, you can visualize the clusters using a scatter plot. This can help you to understand how the data is clustered and to identify any outliers. + +### Important Notes +Clustering is a machine learning technique that groups unlabeled data points into clusters based on their similarity. It is a powerful tool that can be used to solve a variety of problems, such as customer segmentation, product grouping, and anomaly detection. However, clustering also has some limitations. Here are some of the most important limitations of clustering: + +- **Sensitivity to the initialization:** Many clustering algorithms, such as k-means clustering, are sensitive to the initialization of the cluster centroids. If the cluster centroids are not initialized correctly, the clustering algorithm may not be able to find the optimal clusters. +- **Difficulty in choosing the number of clusters:** K-means clustering requires the user to specify the number of clusters (k) in advance. However, there is no one-size-fits-all answer to the question of how to choose k. Choosing the wrong number of clusters can lead to inaccurate results. +- **Inability to handle outliers:** Clustering algorithms are often sensitive to outliers, which are data points that are significantly different from the rest of the data. Outliers can have a large impact on the clustering results and can lead to inaccurate clusters. +- **Difficulty in interpreting the results:** It can be difficult to interpret the results of clustering algorithms, especially when the data is high-dimensional. 
It can be difficult to understand what the clusters represent and why particular data points ended up in the clusters they did.


Which of the following techniques can be used to mitigate the sensitivity of clustering algorithms to the initialization?


[( )] Running the clustering algorithm multiple times with different initializations and selecting the best results
[( )] Using a more robust clustering algorithm that is less sensitive to the initialization
[( )] Preprocessing the data to remove outliers
[(X)] All of the above
***
All of the above techniques can be used to mitigate the sensitivity of clustering algorithms to the initialization. For example, scikit-learn's `KMeans` accepts an `n_init` parameter that reruns the algorithm with different centroid seeds and keeps the best result, and removing outliers beforehand reduces the chance that a single extreme point pulls a centroid away from a genuine cluster.
+*** + + +## Conclusion + +At the end of the lesson, students should have a good understanding of the concept of clustering and how to implement the K-Means clustering algorithm in Python. They should also be able to apply K-Means clustering to real-world datasets to identify patterns and insights. + +## Additional Resources + +## Feedback + +@feedback From be925e84eb352cedaebdfe1edb3e275f0850db04 Mon Sep 17 00:00:00 2001 From: Daniel Schwartz Date: Sat, 11 Nov 2023 16:01:43 -0500 Subject: [PATCH 2/8] Added heart data for clustering python exercise --- python_clustering/data/heart.csv | 304 +++++++++++++++++++++++++++++++ 1 file changed, 304 insertions(+) create mode 100644 python_clustering/data/heart.csv diff --git a/python_clustering/data/heart.csv b/python_clustering/data/heart.csv new file mode 100644 index 000000000..0966e67b5 --- /dev/null +++ b/python_clustering/data/heart.csv @@ -0,0 +1,304 @@ +age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output +63,1,3,145,233,1,0,150,0,2.3,0,0,1,1 +37,1,2,130,250,0,1,187,0,3.5,0,0,2,1 +41,0,1,130,204,0,0,172,0,1.4,2,0,2,1 +56,1,1,120,236,0,1,178,0,0.8,2,0,2,1 +57,0,0,120,354,0,1,163,1,0.6,2,0,2,1 +57,1,0,140,192,0,1,148,0,0.4,1,0,1,1 +56,0,1,140,294,0,0,153,0,1.3,1,0,2,1 +44,1,1,120,263,0,1,173,0,0,2,0,3,1 +52,1,2,172,199,1,1,162,0,0.5,2,0,3,1 +57,1,2,150,168,0,1,174,0,1.6,2,0,2,1 +54,1,0,140,239,0,1,160,0,1.2,2,0,2,1 +48,0,2,130,275,0,1,139,0,0.2,2,0,2,1 +49,1,1,130,266,0,1,171,0,0.6,2,0,2,1 +64,1,3,110,211,0,0,144,1,1.8,1,0,2,1 +58,0,3,150,283,1,0,162,0,1,2,0,2,1 +50,0,2,120,219,0,1,158,0,1.6,1,0,2,1 +58,0,2,120,340,0,1,172,0,0,2,0,2,1 +66,0,3,150,226,0,1,114,0,2.6,0,0,2,1 +43,1,0,150,247,0,1,171,0,1.5,2,0,2,1 +69,0,3,140,239,0,1,151,0,1.8,2,2,2,1 +59,1,0,135,234,0,1,161,0,0.5,1,0,3,1 +44,1,2,130,233,0,1,179,1,0.4,2,0,2,1 +42,1,0,140,226,0,1,178,0,0,2,0,2,1 +61,1,2,150,243,1,1,137,1,1,1,0,2,1 +40,1,3,140,199,0,1,178,1,1.4,2,0,3,1 +71,0,1,160,302,0,1,162,0,0.4,2,2,2,1 +59,1,2,150,212,1,1,157,0,1.6,2,0,2,1 +51,1,2,110,175,0,1,123,0,0.6,2,0,2,1 +65,0,2,140,417,1,0,157,0,0.8,2,1,2,1 +53,1,2,130,197,1,0,152,0,1.2,0,0,2,1 +41,0,1,105,198,0,1,168,0,0,2,1,2,1 +65,1,0,120,177,0,1,140,0,0.4,2,0,3,1 +44,1,1,130,219,0,0,188,0,0,2,0,2,1 +54,1,2,125,273,0,0,152,0,0.5,0,1,2,1 +51,1,3,125,213,0,0,125,1,1.4,2,1,2,1 +46,0,2,142,177,0,0,160,1,1.4,0,0,2,1 +54,0,2,135,304,1,1,170,0,0,2,0,2,1 +54,1,2,150,232,0,0,165,0,1.6,2,0,3,1 +65,0,2,155,269,0,1,148,0,0.8,2,0,2,1 +65,0,2,160,360,0,0,151,0,0.8,2,0,2,1 +51,0,2,140,308,0,0,142,0,1.5,2,1,2,1 +48,1,1,130,245,0,0,180,0,0.2,1,0,2,1 +45,1,0,104,208,0,0,148,1,3,1,0,2,1 +53,0,0,130,264,0,0,143,0,0.4,1,0,2,1 +39,1,2,140,321,0,0,182,0,0,2,0,2,1 +52,1,1,120,325,0,1,172,0,0.2,2,0,2,1 +44,1,2,140,235,0,0,180,0,0,2,0,2,1 +47,1,2,138,257,0,0,156,0,0,2,0,2,1 +53,0,2,128,216,0,0,115,0,0,2,0,0,1 +53,0,0,138,234,0,0,160,0,0,2,0,2,1 +51,0,2,130,256,0,0,149,0,0.5,2,0,2,1 +66,1,0,120,302,0,0,151,0,0.4,1,0,2,1 +62,1,2,130,231,0,1,146,0,1.8,1,3,3,1 +44,0,2,108,141,0,1,175,0,0.6,1,0,2,1 +63,0,2,135,252,0,0,172,0,0,2,0,2,1 +52,1,1,134,201,0,1,158,0,0.8,2,1,2,1 +48,1,0,122,222,0,0,186,0,0,2,0,2,1 +45,1,0,115,260,0,0,185,0,0,2,0,2,1 +34,1,3,118,182,0,0,174,0,0,2,0,2,1 +57,0,0,128,303,0,0,159,0,0,2,1,2,1 +71,0,2,110,265,1,0,130,0,0,2,1,2,1 +54,1,1,108,309,0,1,156,0,0,2,0,3,1 +52,1,3,118,186,0,0,190,0,0,1,0,1,1 +41,1,1,135,203,0,1,132,0,0,1,0,1,1 +58,1,2,140,211,1,0,165,0,0,2,0,2,1 +35,0,0,138,183,0,1,182,0,1.4,2,0,2,1 +51,1,2,100,222,0,1,143,1,1.2,1,0,2,1 +45,0,1,130,234,0,0,175,0,0.6,1,0,2,1 
+44,1,1,120,220,0,1,170,0,0,2,0,2,1 +62,0,0,124,209,0,1,163,0,0,2,0,2,1 +54,1,2,120,258,0,0,147,0,0.4,1,0,3,1 +51,1,2,94,227,0,1,154,1,0,2,1,3,1 +29,1,1,130,204,0,0,202,0,0,2,0,2,1 +51,1,0,140,261,0,0,186,1,0,2,0,2,1 +43,0,2,122,213,0,1,165,0,0.2,1,0,2,1 +55,0,1,135,250,0,0,161,0,1.4,1,0,2,1 +51,1,2,125,245,1,0,166,0,2.4,1,0,2,1 +59,1,1,140,221,0,1,164,1,0,2,0,2,1 +52,1,1,128,205,1,1,184,0,0,2,0,2,1 +58,1,2,105,240,0,0,154,1,0.6,1,0,3,1 +41,1,2,112,250,0,1,179,0,0,2,0,2,1 +45,1,1,128,308,0,0,170,0,0,2,0,2,1 +60,0,2,102,318,0,1,160,0,0,2,1,2,1 +52,1,3,152,298,1,1,178,0,1.2,1,0,3,1 +42,0,0,102,265,0,0,122,0,0.6,1,0,2,1 +67,0,2,115,564,0,0,160,0,1.6,1,0,3,1 +68,1,2,118,277,0,1,151,0,1,2,1,3,1 +46,1,1,101,197,1,1,156,0,0,2,0,3,1 +54,0,2,110,214,0,1,158,0,1.6,1,0,2,1 +58,0,0,100,248,0,0,122,0,1,1,0,2,1 +48,1,2,124,255,1,1,175,0,0,2,2,2,1 +57,1,0,132,207,0,1,168,1,0,2,0,3,1 +52,1,2,138,223,0,1,169,0,0,2,4,2,1 +54,0,1,132,288,1,0,159,1,0,2,1,2,1 +45,0,1,112,160,0,1,138,0,0,1,0,2,1 +53,1,0,142,226,0,0,111,1,0,2,0,3,1 +62,0,0,140,394,0,0,157,0,1.2,1,0,2,1 +52,1,0,108,233,1,1,147,0,0.1,2,3,3,1 +43,1,2,130,315,0,1,162,0,1.9,2,1,2,1 +53,1,2,130,246,1,0,173,0,0,2,3,2,1 +42,1,3,148,244,0,0,178,0,0.8,2,2,2,1 +59,1,3,178,270,0,0,145,0,4.2,0,0,3,1 +63,0,1,140,195,0,1,179,0,0,2,2,2,1 +42,1,2,120,240,1,1,194,0,0.8,0,0,3,1 +50,1,2,129,196,0,1,163,0,0,2,0,2,1 +68,0,2,120,211,0,0,115,0,1.5,1,0,2,1 +69,1,3,160,234,1,0,131,0,0.1,1,1,2,1 +45,0,0,138,236,0,0,152,1,0.2,1,0,2,1 +50,0,1,120,244,0,1,162,0,1.1,2,0,2,1 +50,0,0,110,254,0,0,159,0,0,2,0,2,1 +64,0,0,180,325,0,1,154,1,0,2,0,2,1 +57,1,2,150,126,1,1,173,0,0.2,2,1,3,1 +64,0,2,140,313,0,1,133,0,0.2,2,0,3,1 +43,1,0,110,211,0,1,161,0,0,2,0,3,1 +55,1,1,130,262,0,1,155,0,0,2,0,2,1 +37,0,2,120,215,0,1,170,0,0,2,0,2,1 +41,1,2,130,214,0,0,168,0,2,1,0,2,1 +56,1,3,120,193,0,0,162,0,1.9,1,0,3,1 +46,0,1,105,204,0,1,172,0,0,2,0,2,1 +46,0,0,138,243,0,0,152,1,0,1,0,2,1 +64,0,0,130,303,0,1,122,0,2,1,2,2,1 +59,1,0,138,271,0,0,182,0,0,2,0,2,1 +41,0,2,112,268,0,0,172,1,0,2,0,2,1 +54,0,2,108,267,0,0,167,0,0,2,0,2,1 +39,0,2,94,199,0,1,179,0,0,2,0,2,1 +34,0,1,118,210,0,1,192,0,0.7,2,0,2,1 +47,1,0,112,204,0,1,143,0,0.1,2,0,2,1 +67,0,2,152,277,0,1,172,0,0,2,1,2,1 +52,0,2,136,196,0,0,169,0,0.1,1,0,2,1 +74,0,1,120,269,0,0,121,1,0.2,2,1,2,1 +54,0,2,160,201,0,1,163,0,0,2,1,2,1 +49,0,1,134,271,0,1,162,0,0,1,0,2,1 +42,1,1,120,295,0,1,162,0,0,2,0,2,1 +41,1,1,110,235,0,1,153,0,0,2,0,2,1 +41,0,1,126,306,0,1,163,0,0,2,0,2,1 +49,0,0,130,269,0,1,163,0,0,2,0,2,1 +60,0,2,120,178,1,1,96,0,0,2,0,2,1 +62,1,1,128,208,1,0,140,0,0,2,0,2,1 +57,1,0,110,201,0,1,126,1,1.5,1,0,1,1 +64,1,0,128,263,0,1,105,1,0.2,1,1,3,1 +51,0,2,120,295,0,0,157,0,0.6,2,0,2,1 +43,1,0,115,303,0,1,181,0,1.2,1,0,2,1 +42,0,2,120,209,0,1,173,0,0,1,0,2,1 +67,0,0,106,223,0,1,142,0,0.3,2,2,2,1 +76,0,2,140,197,0,2,116,0,1.1,1,0,2,1 +70,1,1,156,245,0,0,143,0,0,2,0,2,1 +44,0,2,118,242,0,1,149,0,0.3,1,1,2,1 +60,0,3,150,240,0,1,171,0,0.9,2,0,2,1 +44,1,2,120,226,0,1,169,0,0,2,0,2,1 +42,1,2,130,180,0,1,150,0,0,2,0,2,1 +66,1,0,160,228,0,0,138,0,2.3,2,0,1,1 +71,0,0,112,149,0,1,125,0,1.6,1,0,2,1 +64,1,3,170,227,0,0,155,0,0.6,1,0,3,1 +66,0,2,146,278,0,0,152,0,0,1,1,2,1 +39,0,2,138,220,0,1,152,0,0,1,0,2,1 +58,0,0,130,197,0,1,131,0,0.6,1,0,2,1 +47,1,2,130,253,0,1,179,0,0,2,0,2,1 +35,1,1,122,192,0,1,174,0,0,2,0,2,1 +58,1,1,125,220,0,1,144,0,0.4,1,4,3,1 +56,1,1,130,221,0,0,163,0,0,2,0,3,1 +56,1,1,120,240,0,1,169,0,0,0,0,2,1 +55,0,1,132,342,0,1,166,0,1.2,2,0,2,1 +41,1,1,120,157,0,1,182,0,0,2,0,2,1 +38,1,2,138,175,0,1,173,0,0,2,4,2,1 
+38,1,2,138,175,0,1,173,0,0,2,4,2,1 +67,1,0,160,286,0,0,108,1,1.5,1,3,2,0 +67,1,0,120,229,0,0,129,1,2.6,1,2,3,0 +62,0,0,140,268,0,0,160,0,3.6,0,2,2,0 +63,1,0,130,254,0,0,147,0,1.4,1,1,3,0 +53,1,0,140,203,1,0,155,1,3.1,0,0,3,0 +56,1,2,130,256,1,0,142,1,0.6,1,1,1,0 +48,1,1,110,229,0,1,168,0,1,0,0,3,0 +58,1,1,120,284,0,0,160,0,1.8,1,0,2,0 +58,1,2,132,224,0,0,173,0,3.2,2,2,3,0 +60,1,0,130,206,0,0,132,1,2.4,1,2,3,0 +40,1,0,110,167,0,0,114,1,2,1,0,3,0 +60,1,0,117,230,1,1,160,1,1.4,2,2,3,0 +64,1,2,140,335,0,1,158,0,0,2,0,2,0 +43,1,0,120,177,0,0,120,1,2.5,1,0,3,0 +57,1,0,150,276,0,0,112,1,0.6,1,1,1,0 +55,1,0,132,353,0,1,132,1,1.2,1,1,3,0 +65,0,0,150,225,0,0,114,0,1,1,3,3,0 +61,0,0,130,330,0,0,169,0,0,2,0,2,0 +58,1,2,112,230,0,0,165,0,2.5,1,1,3,0 +50,1,0,150,243,0,0,128,0,2.6,1,0,3,0 +44,1,0,112,290,0,0,153,0,0,2,1,2,0 +60,1,0,130,253,0,1,144,1,1.4,2,1,3,0 +54,1,0,124,266,0,0,109,1,2.2,1,1,3,0 +50,1,2,140,233,0,1,163,0,0.6,1,1,3,0 +41,1,0,110,172,0,0,158,0,0,2,0,3,0 +51,0,0,130,305,0,1,142,1,1.2,1,0,3,0 +58,1,0,128,216,0,0,131,1,2.2,1,3,3,0 +54,1,0,120,188,0,1,113,0,1.4,1,1,3,0 +60,1,0,145,282,0,0,142,1,2.8,1,2,3,0 +60,1,2,140,185,0,0,155,0,3,1,0,2,0 +59,1,0,170,326,0,0,140,1,3.4,0,0,3,0 +46,1,2,150,231,0,1,147,0,3.6,1,0,2,0 +67,1,0,125,254,1,1,163,0,0.2,1,2,3,0 +62,1,0,120,267,0,1,99,1,1.8,1,2,3,0 +65,1,0,110,248,0,0,158,0,0.6,2,2,1,0 +44,1,0,110,197,0,0,177,0,0,2,1,2,0 +60,1,0,125,258,0,0,141,1,2.8,1,1,3,0 +58,1,0,150,270,0,0,111,1,0.8,2,0,3,0 +68,1,2,180,274,1,0,150,1,1.6,1,0,3,0 +62,0,0,160,164,0,0,145,0,6.2,0,3,3,0 +52,1,0,128,255,0,1,161,1,0,2,1,3,0 +59,1,0,110,239,0,0,142,1,1.2,1,1,3,0 +60,0,0,150,258,0,0,157,0,2.6,1,2,3,0 +49,1,2,120,188,0,1,139,0,2,1,3,3,0 +59,1,0,140,177,0,1,162,1,0,2,1,3,0 +57,1,2,128,229,0,0,150,0,0.4,1,1,3,0 +61,1,0,120,260,0,1,140,1,3.6,1,1,3,0 +39,1,0,118,219,0,1,140,0,1.2,1,0,3,0 +61,0,0,145,307,0,0,146,1,1,1,0,3,0 +56,1,0,125,249,1,0,144,1,1.2,1,1,2,0 +43,0,0,132,341,1,0,136,1,3,1,0,3,0 +62,0,2,130,263,0,1,97,0,1.2,1,1,3,0 +63,1,0,130,330,1,0,132,1,1.8,2,3,3,0 +65,1,0,135,254,0,0,127,0,2.8,1,1,3,0 +48,1,0,130,256,1,0,150,1,0,2,2,3,0 +63,0,0,150,407,0,0,154,0,4,1,3,3,0 +55,1,0,140,217,0,1,111,1,5.6,0,0,3,0 +65,1,3,138,282,1,0,174,0,1.4,1,1,2,0 +56,0,0,200,288,1,0,133,1,4,0,2,3,0 +54,1,0,110,239,0,1,126,1,2.8,1,1,3,0 +70,1,0,145,174,0,1,125,1,2.6,0,0,3,0 +62,1,1,120,281,0,0,103,0,1.4,1,1,3,0 +35,1,0,120,198,0,1,130,1,1.6,1,0,3,0 +59,1,3,170,288,0,0,159,0,0.2,1,0,3,0 +64,1,2,125,309,0,1,131,1,1.8,1,0,3,0 +47,1,2,108,243,0,1,152,0,0,2,0,2,0 +57,1,0,165,289,1,0,124,0,1,1,3,3,0 +55,1,0,160,289,0,0,145,1,0.8,1,1,3,0 +64,1,0,120,246,0,0,96,1,2.2,0,1,2,0 +70,1,0,130,322,0,0,109,0,2.4,1,3,2,0 +51,1,0,140,299,0,1,173,1,1.6,2,0,3,0 +58,1,0,125,300,0,0,171,0,0,2,2,3,0 +60,1,0,140,293,0,0,170,0,1.2,1,2,3,0 +77,1,0,125,304,0,0,162,1,0,2,3,2,0 +35,1,0,126,282,0,0,156,1,0,2,0,3,0 +70,1,2,160,269,0,1,112,1,2.9,1,1,3,0 +59,0,0,174,249,0,1,143,1,0,1,0,2,0 +64,1,0,145,212,0,0,132,0,2,1,2,1,0 +57,1,0,152,274,0,1,88,1,1.2,1,1,3,0 +56,1,0,132,184,0,0,105,1,2.1,1,1,1,0 +48,1,0,124,274,0,0,166,0,0.5,1,0,3,0 +56,0,0,134,409,0,0,150,1,1.9,1,2,3,0 +66,1,1,160,246,0,1,120,1,0,1,3,1,0 +54,1,1,192,283,0,0,195,0,0,2,1,3,0 +69,1,2,140,254,0,0,146,0,2,1,3,3,0 +51,1,0,140,298,0,1,122,1,4.2,1,3,3,0 +43,1,0,132,247,1,0,143,1,0.1,1,4,3,0 +62,0,0,138,294,1,1,106,0,1.9,1,3,2,0 +67,1,0,100,299,0,0,125,1,0.9,1,2,2,0 +59,1,3,160,273,0,0,125,0,0,2,0,2,0 +45,1,0,142,309,0,0,147,1,0,1,3,3,0 +58,1,0,128,259,0,0,130,1,3,1,2,3,0 +50,1,0,144,200,0,0,126,1,0.9,1,0,3,0 +62,0,0,150,244,0,1,154,1,1.4,1,0,2,0 
+38,1,3,120,231,0,1,182,1,3.8,1,0,3,0 +66,0,0,178,228,1,1,165,1,1,1,2,3,0 +52,1,0,112,230,0,1,160,0,0,2,1,2,0 +53,1,0,123,282,0,1,95,1,2,1,2,3,0 +63,0,0,108,269,0,1,169,1,1.8,1,2,2,0 +54,1,0,110,206,0,0,108,1,0,1,1,2,0 +66,1,0,112,212,0,0,132,1,0.1,2,1,2,0 +55,0,0,180,327,0,2,117,1,3.4,1,0,2,0 +49,1,2,118,149,0,0,126,0,0.8,2,3,2,0 +54,1,0,122,286,0,0,116,1,3.2,1,2,2,0 +56,1,0,130,283,1,0,103,1,1.6,0,0,3,0 +46,1,0,120,249,0,0,144,0,0.8,2,0,3,0 +61,1,3,134,234,0,1,145,0,2.6,1,2,2,0 +67,1,0,120,237,0,1,71,0,1,1,0,2,0 +58,1,0,100,234,0,1,156,0,0.1,2,1,3,0 +47,1,0,110,275,0,0,118,1,1,1,1,2,0 +52,1,0,125,212,0,1,168,0,1,2,2,3,0 +58,1,0,146,218,0,1,105,0,2,1,1,3,0 +57,1,1,124,261,0,1,141,0,0.3,2,0,3,0 +58,0,1,136,319,1,0,152,0,0,2,2,2,0 +61,1,0,138,166,0,0,125,1,3.6,1,1,2,0 +42,1,0,136,315,0,1,125,1,1.8,1,0,1,0 +52,1,0,128,204,1,1,156,1,1,1,0,0,0 +59,1,2,126,218,1,1,134,0,2.2,1,1,1,0 +40,1,0,152,223,0,1,181,0,0,2,0,3,0 +61,1,0,140,207,0,0,138,1,1.9,2,1,3,0 +46,1,0,140,311,0,1,120,1,1.8,1,2,3,0 +59,1,3,134,204,0,1,162,0,0.8,2,2,2,0 +57,1,1,154,232,0,0,164,0,0,2,1,2,0 +57,1,0,110,335,0,1,143,1,3,1,1,3,0 +55,0,0,128,205,0,2,130,1,2,1,1,3,0 +61,1,0,148,203,0,1,161,0,0,2,1,3,0 +58,1,0,114,318,0,2,140,0,4.4,0,3,1,0 +58,0,0,170,225,1,0,146,1,2.8,1,2,1,0 +67,1,2,152,212,0,0,150,0,0.8,1,0,3,0 +44,1,0,120,169,0,1,144,1,2.8,0,0,1,0 +63,1,0,140,187,0,0,144,1,4,2,2,3,0 +63,0,0,124,197,0,1,136,1,0,1,0,2,0 +59,1,0,164,176,1,0,90,0,1,1,2,1,0 +57,0,0,140,241,0,1,123,1,0.2,1,0,3,0 +45,1,3,110,264,0,1,132,0,1.2,1,0,3,0 +68,1,0,144,193,1,1,141,0,3.4,1,2,3,0 +57,1,0,130,131,0,1,115,1,1.2,1,1,3,0 +57,0,1,130,236,0,0,174,0,0,1,1,2,0 From c0436530d846c2eaad3478152a00cf6a42cde508 Mon Sep 17 00:00:00 2001 From: Daniel Schwartz Date: Sat, 11 Nov 2023 17:08:14 -0500 Subject: [PATCH 3/8] Added clustering python exercise --- python_clustering/python_clustering.md | 92 +++++++++++++++++++++----- 1 file changed, 74 insertions(+), 18 deletions(-) diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md index c685151ef..9cf95254e 100644 --- a/python_clustering/python_clustering.md +++ b/python_clustering/python_clustering.md @@ -58,7 +58,8 @@ Previous versions: @end import: https://raw.githubusercontent.com/arcus/education_modules/main/_module_templates/macros.md -import: https://raw.githubusercontent.com/arcus/education_modules/main/_module_templates/macros_python.md +import: https://raw.githubusercontent.com/arcus/education_modules/pyodide_testing/_module_templates/macros_python.md +import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md --> # Python Lesson on Clustering for Machine Learning @@ -189,32 +190,87 @@ The goal of the K-Means clustering algorithm is to group similar data points tog ### Python Implementation of K-Means Clustering +To implement k-means clustering in Python using Scikit-learn, we can follow these steps: + +1. Import the necessary libraries: +```python +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from sklearn.model_selection import train_test_split +from sklearn.cluster import KMeans +from scipy.spatial import distance ``` -import numpy as np # Library for math manipulation, loading data -import matplotlib.pyplot as plt # Library for plotting -from sklearn.cluster import KMeans # Library for KMeans clustering +@Pyodide.eval + + +2. 
Load the data:
```python @Pyodide.exec

import pandas as pd
import io
from pyodide.http import open_url

url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/heart.csv"

url_contents = open_url(url)
text = url_contents.read()
file = io.StringIO(text)

data = pd.read_csv(file)


# Analyze data and features
data.info()
```


3. Visualize the data:
```python
# Create the scatter plot
data.plot.scatter(x='chol', y='trtbps', c='output', colormap='viridis')
plt.xlabel("Cholesterol")
plt.ylabel("Resting Blood Pressure")
plt.title("Scatter Plot of Cholesterol vs. Blood Pressure")
plt.show()
```
@Pyodide.eval

4. Normalize the data so that every feature is scaled to the same 0 to 1 range:
```python
# Normalize dataframe using min-max scaling
def normalize(df, features):
    result = df.copy()
    for feature_name in features:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

normalized_data = normalize(data, data.columns)
```
@Pyodide.eval

5. Train the clustering model and visualize the clusters:
```python
# Run KMeans
kmeans = KMeans(n_clusters = 2, max_iter = 500, n_init = 40, random_state = 2)

# Predict clusters
identified_clusters = kmeans.fit_predict(normalized_data.values)
results = normalized_data.copy()
results['cluster'] = identified_clusters

# Compute each point's Euclidean distance from its cluster centroid
distance_from_centroid = [distance.euclidean(val[:-1],kmeans.cluster_centers_[int(val[-1])]) for val in results.values]
results['dist'] = distance_from_centroid
results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist')
plt.xlabel("Cholesterol")
plt.ylabel("Resting Blood Pressure")
plt.show()
```
@Pyodide.eval

From 46656def5fac5c0e89e06742abc26fc785af7da2 Mon Sep 17 00:00:00 2001
From: Schwartz
Date: Thu, 21 Mar 2024 09:08:24 -0400
Subject: [PATCH 4/8] Added polyps data for clustering real world example

---
 python_clustering/data/polyps.csv | 23 +++++++++++++++++++++
 1 file changed, 23 insertions(+)
 create mode 100644 python_clustering/data/polyps.csv

diff --git a/python_clustering/data/polyps.csv b/python_clustering/data/polyps.csv
new file mode 100644
index 000000000..54f327919
--- /dev/null
+++ b/python_clustering/data/polyps.csv
@@ -0,0 +1,23 @@
"","participant_id","sex","age","baseline","treatment","number3m","number12m"
"1","001","female",17,7,"sulindac",6,NA
"2","002","female",20,77,"placebo",67,63
"3","003","male",16,7,"sulindac",4,2
"4","004","female",18,5,"placebo",5,28
"5","005","male",22,23,"sulindac",16,17
"6","006","female",13,35,"placebo",31,61
"7","007","female",23,11,"sulindac",6,1
"8","008","male",34,12,"placebo",20,7
"9","009","male",50,7,"placebo",7,15
"10","010","male",19,318,"placebo",347,44
"11","011","male",17,160,"sulindac",142,25
"12","012","female",23,8,"sulindac",1,3
"13","013","male",22,20,"placebo",16,28
"14","014","male",30,11,"placebo",20,10
"15","015","male",27,24,"placebo",26,40
+"16","016","male",23,34,"sulindac",27,33 +"17","017","female",22,54,"placebo",45,46 +"18","018","male",13,16,"sulindac",10,NA +"19","019","male",34,30,"placebo",30,50 +"20","020","female",23,10,"sulindac",6,3 +"21","021","female",22,20,"sulindac",5,1 +"22","022","male",42,12,"sulindac",8,4 From 3af59e39d9dba071a9c6682c2740d97a79fb4f9e Mon Sep 17 00:00:00 2001 From: Schwartz Date: Thu, 21 Mar 2024 09:15:19 -0400 Subject: [PATCH 5/8] Updated module to include real-world example --- python_clustering/python_clustering.md | 96 ++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md index 9cf95254e..bf7e0ef50 100644 --- a/python_clustering/python_clustering.md +++ b/python_clustering/python_clustering.md @@ -66,6 +66,12 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md @overview + + + + + + ### What is clustering? - Clustering is an unsupervised machine learning technique that groups unlabeled data points into clusters based on their similarity. The goal of clustering is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Clustering algorithms work by measuring the similarity between data points and then grouping similar data points together. There are many different clustering algorithms, each with its own strengths and weaknesses. Some of the most common clustering algorithms include K-Means clustering, hierarchical clustering, and Gaussian Mixture Models (GMMs). @@ -190,6 +196,8 @@ The goal of the K-Means clustering algorithm is to group similar data points tog ### Python Implementation of K-Means Clustering +This dataset contains various clinical attributes of patients, including their age, sex, chest pain type (cp), resting blood pressure (trtbps), serum cholesterol level (chol), fasting blood sugar (fbs) level, resting electrocardiographic results (restecg), maximum heart rate achieved (thalachh), exercise-induced angina (exng), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slp), number of major vessels (caa) colored by fluoroscopy, thalassemia (thall) type, and the presence of heart disease (output). The data seems to be related to the diagnosis of heart disease, with the output variable indicating whether a patient has heart disease (1) or not (0). Each row represents a different patient, with their respective clinical characteristics recorded. + To implement k-means clustering in Python using Scikit-learn, we can follow these steps: 1. Import the necessary libraries: @@ -310,6 +318,94 @@ All of the above techniques can be used to mitigate the sensitivity of clusterin *** + +### Real World Code Example + +This dataset, derived and refined from a landmark study published in the New England Journal of Medicine in 1993, investigates the effectiveness of sulindac treatment in individuals with familial adenomatous polyposis (FAP), a hereditary condition characterized by the development of numerous adenomatous polyps in the colon and rectum. Enhanced from the original datasets "polyps" and "polyps3" in the {HSAUR} package, this dataset includes crucial variables such as participant ID, sex, age, baseline polyp count, assigned treatment (sulindac or placebo), and polyp counts at 3 and 12 months post-treatment. 
These enhancements, based on careful reference to the original paper, offer improved granularity and completeness for analyzing the impact of sulindac treatment on polyp progression in FAP patients. This dataset serves as a valuable resource for further research and analysis in the field of gastrointestinal medicine and pharmacology.

1. Import the necessary packages:
```python @Pyodide.exec

import pandas as pd
import io
from pyodide.http import open_url
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
```

2. Load the data:
```python
# Load dataset and read to pandas dataframe
url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/polyps.csv"
url_contents = open_url(url)
text = url_contents.read()
file = io.StringIO(text)
df = pd.read_csv(file)

# Analyze data and features
df.info()

# Select features for clustering; take a copy so filling missing values does not modify df
features = ['age', 'baseline', 'number3m', 'number12m']
X = df[features].copy()

# Fill missing values with the mean of each column
X = X.fillna(X.mean())

# Standardize the feature values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
@Pyodide.eval


3. Cluster the data:
```python
# Define the number of clusters
num_clusters = 3

# Apply KMeans clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(X_scaled)

# Assign cluster labels to the original dataframe
df['cluster'] = kmeans.labels_
```
@Pyodide.eval


4. Visualize the clusters:
```python
# Visualize clusters for 'number3m' vs 'number12m'
plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green'] # Change colors as needed for more clusters

for i in range(num_clusters):
    cluster_data = df[df['cluster'] == i]
    plt.scatter(cluster_data['number3m'], cluster_data['number12m'],
                color=colors[i], label=f'Cluster {i}')

plt.xlabel('Number of Polyps at 3 Months')
plt.ylabel('Number of Polyps at 12 Months')
plt.title('K-Means Clustering of Polyp Data: Number of Polyps at 3 Months vs Number of Polyps at 12 Months')
plt.legend()
plt.show()
```
@Pyodide.eval

If the K-Means algorithm identified distinct clusters with minimal overlap, it suggests there might be three underlying patient groups regarding polyp count progression:

- **Cluster 1 (Low Progression):** This cluster might represent participants who have a relatively low number of polyps at 3 months and a stable or slightly increased number at 12 months. This could be associated with effective treatment or naturally slow polyp growth.
- **Cluster 2 (Moderate Progression):** This cluster could include participants with a moderate number of polyps at 3 months and a somewhat steeper increase by 12 months. This might indicate a less effective treatment or a faster natural growth rate for polyps.
- **Cluster 3 (High Progression):** This cluster might contain participants with a high number of polyps at 3 months and a substantial increase by 12 months. This could be linked to factors like a particularly aggressive polyp type or treatment resistance.
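The example above fixes `num_clusters = 3` before looking at the data. One way to sanity-check that choice is the silhouette score mentioned earlier in this module; the short sketch below is an illustration rather than part of the original exercise, and it assumes the `X_scaled` array produced in step 2 is still available. It compares a few candidate values of K, where higher scores indicate better-separated clusters.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare a few candidate cluster counts on the standardized features from step 2.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f"K = {k}: silhouette score = {score:.3f}")
```

If one value of K stands out with a clearly higher score, that is evidence, though not proof, that the data support that many groups.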
+ +**While clustering provides valuable insights into potential patient subgroups, further analysis of treatment effects and other relevant features is necessary to fully understand the underlying factors influencing polyp count progression.** + + + + + ## Conclusion At the end of the lesson, students should have a good understanding of the concept of clustering and how to implement the K-Means clustering algorithm in Python. They should also be able to apply K-Means clustering to real-world datasets to identify patterns and insights. From 981f599a3d818dd048c8f365d32387444498de1b Mon Sep 17 00:00:00 2001 From: Schwartz Date: Thu, 18 Apr 2024 12:19:54 -0400 Subject: [PATCH 6/8] Addressed Rose's comments from the following module --- python_clustering/python_clustering.md | 106 +++++++++++++++++++++---- 1 file changed, 90 insertions(+), 16 deletions(-) diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md index bf7e0ef50..8a41eb7cd 100644 --- a/python_clustering/python_clustering.md +++ b/python_clustering/python_clustering.md @@ -75,7 +75,20 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md ### What is clustering? - Clustering is an unsupervised machine learning technique that groups unlabeled data points into clusters based on their similarity. The goal of clustering is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Clustering algorithms work by measuring the similarity between data points and then grouping similar data points together. There are many different clustering algorithms, each with its own strengths and weaknesses. Some of the most common clustering algorithms include K-Means clustering, hierarchical clustering, and Gaussian Mixture Models (GMMs). +
+A little encouragement...
+As in many fields, machine learning involves a lot of technical language, some of which is unclear, redundant, or downright confusing. +For example: + +**Outcome** variables are also called **response variables**, **dependent variables**, or **labels**. + +**Input** variables are also called **predictors**, **features**, **independent variables**, or even just **variables**. + +To make matters worse, sometimes the same words are used to mean different things in different subfields. +If you find yourself stumbling on vocabulary as you read about machine learning, know you're not alone! + +
[True/False] Clustering algorithms are always able to find the "correct" clusters in the data. @@ -86,7 +99,11 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md ***
-This question is designed to test the test-taker's understanding of the limitations of clustering algorithms. Clustering algorithms are heuristics, which means that they do not guarantee to find the "correct" clusters in the data. The results of a clustering algorithm will depend on the distance metric used, the initialization of the algorithm, and the parameters of the algorithm. +Clustering algorithms are helpful tools, but they're not magic. Here's why this statement is false: + +- Clustering isn't about "right" or "wrong": There's often no single "correct" way to group data. Clustering depends on how you measure similarity and the type of patterns you're interested in finding. +- Different setups, different results: The clusters you get can change based on the clustering algorithm you choose, how you measure distances between data points, and even the starting settings of the algorithm. +Key takeaway: Clustering is an exploratory process. It can suggest interesting groupings in your data, but it's up to you to decide if those groupings make sense and are useful for your analysis.
@@ -101,7 +118,10 @@ This question is designed to test the test-taker's understanding of the limitati ***
-This question is designed to test the test-taker's understanding of the difference between clustering and anomaly detection. Clustering algorithms are used to group similar data points together, while anomaly detection algorithms are used to identify data points that are significantly different from the rest of the data. +While clustering algorithms can sometimes help identify potential outliers, they are not specifically designed for this purpose. Here's why: + +- Clustering focuses on grouping: Clustering algorithms aim to find groups of similar data points. Outliers, by definition, don't fit well into any group. +- Outliers might influence clusters: A significant outlier might distort the clustering process, either by forming its own tiny cluster or being forced into a larger cluster where it doesn't truly belong.
@@ -140,18 +160,25 @@ Predicting the labels for new data points is a goal of supervised learning, not Clustering can be used for a variety of tasks, such as: - **Customer segmentation:** Clustering can be used to segment customers into different groups based on their demographics, purchase behavior, or other characteristics. This information can then be used to target marketing campaigns or product development efforts to specific customer segments. -- **Product grouping:** Clustering can be used to group products with similar characteristics, such as price, features, or customer reviews. This information can be used to improve product recommendations or to identify opportunities for cross-selling and up-selling. -- **Image segmentation:** Clustering can be used to segment images into different objects or regions. This information can be used in tasks such as object detection, image classification, and image compression. -- **Anomaly detection:** Clustering can be used to identify anomalous data points that are different from the rest of the data. This information can be used to detect fraud, identify errors in data collection, or predict future events. -- **Medical diagnosis:** Clustering can be used to group patients with similar symptoms or medical histories together. This information can be used to improve the accuracy of medical diagnosis and to develop more personalized treatment plans. -- **Scientific research:** Clustering can be used to identify patterns and relationships in scientific data. This information can be used to advance scientific knowledge and to develop new technologies. - -### Examples of clustering in real-world applications -- **Netflix uses clustering to recommend movies and TV shows to its users.** Netflix clusters its users based on their viewing history and then recommends movies and TV shows to users based on the clusters they belong to. -- **Amazon uses clustering to recommend products to its customers.** Amazon clusters its products based on customer reviews and purchase behavior. Amazon then recommends products to customers based on the clusters the products belong to and the customer's past purchase history. -- **Google uses clustering to improve the accuracy of its search results.** Google clusters search results based on the relevance of the results to the search query. Google then displays the most relevant results at the top of the search results page. -- **Banks use clustering to detect fraudulent transactions.** Banks cluster transactions based on their characteristics, such as the amount of money involved, the type of transaction, and the location of the transaction. Banks then flag anomalous transactions as potentially fraudulent. -- **Medical researchers use clustering to identify new biomarkers for diseases.** Medical researchers cluster patients based on their medical histories and symptoms. Researchers then look for patterns in the clusters to identify new biomarkers that can be used to diagnose and treat diseases. +### Applications of Clustering in Biomedical Research + +Clustering is an invaluable machine learning technique with wide-ranging applications in biomedical research. Here are some key areas where it can be used : + +- **Patient Stratification:** Identify distinct subgroups within patient populations based on gene expression profiles, clinical data, or disease biomarkers. This can lead to insights into disease subtypes and more personalized treatment options. 
+ - Specifically in research, ["Use of Latent Class Analysis and k-Means Clustering to Identify Complex Patient Profiles"](https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2774074) employs statistical techniques to categorize patients into specific groups based on their gene expression profiles, clinical data, or biomarkers, allowing for the identification of unique disease subtypes and facilitating personalized treatment options. This approach aligns with patient stratification by utilizing clustering methods to segregate patients into distinct categories, enabling healthcare professionals to tailor interventions based on individualized characteristics and needs. +- **Drug Development:** Clustering can help group compounds based on chemical structure, efficacy, or target interactions. This facilitates the identification of novel drug candidates or the repurposing of existing drugs. + - ["Integration of k-means clustering algorithm with network analysis for drug-target interactions network prediction"](https://www.inderscienceonline.com/doi/abs/10.1504/IJDMB.2018.094776) combines k-means clustering with network analysis to predict drug-target interactions, aiding in the identification of potential drug candidates or repurposing existing drugs by grouping compounds based on their interactions and properties. This study directly aligns with drug development goals by leveraging clustering to categorize compounds and enhance the understanding of their interactions, thereby facilitating the discovery and optimization of therapeutic agents. + +- **Gene Expression Analysis:** Clustering genes with similar expression patterns across different conditions or time points can help uncover regulatory networks and potential therapeutic targets. + - ["Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data"](https://link.springer.com/article/10.1186/s13059-018-1536-8) automates the extraction of co-expressed gene clusters from gene expression data, aiding in the identification of regulatory networks and potential therapeutic targets by clustering genes with similar expression patterns across different conditions or time points. This tool directly aligns with gene expression analysis goals by utilizing clustering to group genes based on their expression profiles, facilitating the discovery of underlying biological mechanisms and potential targets for intervention. + +- **Medical Image Analysis:** Segment medical images (MRI, CT scans) to differentiate tissues, identify tumors and other abnormalities. Clustering can aid in diagnosis and disease tracking. + - ["Diagnosis of Brain Tumor Using Combination of K-Means Clustering and Genetic Algorithm"](http://www.ijmi.ir/index.php/IJMI/article/view/159) utilizes a combination of k-means clustering and genetic algorithms to accurately diagnose brain tumors by segmenting medical images, demonstrating how clustering techniques can aid in medical image analysis to differentiate tissues and identify abnormalities such as tumors, aligning with the objective of leveraging clustering for diagnosis and disease tracking in medical imaging. + +- **Disease-Risk Prediction:** Analyze patient data to cluster individuals based on risk factors and medical history, enabling the prediction of susceptibility to various diseases. 
+ - ["A K-Means Approach to Clustering Disease Progressions"](https://ieeexplore.ieee.org/document/8031156) utilizes k-means clustering to categorize individuals based on disease progression patterns, facilitating disease-risk prediction by analyzing patient data to cluster individuals according to their risk factors and medical histories. This study directly relates to the objective of disease-risk prediction by employing clustering techniques to identify distinct groups of patients with similar disease progressions, thereby enabling more accurate predictions of susceptibility to various diseases based on individualized characteristics. + + ## K-Means Clustering Algorithm - The K-Means clustering algorithm works by iteratively assigning data points to clusters based on their distance to the cluster centroids. The cluster centroids are the average values of all the data points in a cluster. @@ -170,7 +197,12 @@ Clustering can be used for a variety of tasks, such as: 5. Repeat steps 3 and 4 until the cluster assignments no longer change ``` +
+Learning connection
To learn more about K-Means clustering and for a visual explanation, watch [StatQuest: K-means clustering](https://youtu.be/4b5d3muPQmA?si=KMQxx23Ru8w7GOFP).
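To make the five steps concrete in code, here is a minimal from-scratch sketch of the algorithm using NumPy. It is an illustration only: it takes K as a given (step 1 is the elbow-method choice discussed above), assumes a small 2-D array of points, and does not handle empty clusters. In practice you would reach for a tested implementation such as `sklearn.cluster.KMeans`.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by choosing k data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recalculate each centroid as the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids (and therefore the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Tiny synthetic example: two well-separated groups of points.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)     # cluster index for each point, e.g. [0 0 0 1 1 1]
print(centroids)  # one centroid per cluster
```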
What is the goal of the K-Means clustering algorithm? @@ -192,6 +224,25 @@ The goal of the K-Means clustering algorithm is to group similar data points tog *** + + + +### Understanding Machine Learning Techniques + +Before diving into the example, it's valuable to understand some key concepts used in machine learning. These techniques help us build more accurate and reliable models for clustering. + +- **Normalization:** Normalization is crucial for scaling the features of the dataset to a uniform range, typically between 0 and 1, ensuring that each feature contributes equally to the clustering process. + + - By ensuring equitable treatment of all features, normalization prevents features with larger magnitudes from dominating distance calculations in clustering algorithms. This fosters the identification of clusters based on similarity across multiple dimensions and enhances the discovery of meaningful patterns and relationships within the data. + +- **Computing Distance from Cluster Centroid:** Calculating the distance from each data point to its assigned cluster centroid provides a quantitative measure of the data point's fit within its cluster. + + - Distance metrics aid in assessing the compactness of clusters and the separation between clusters. In applications, distance calculations play a pivotal role in cluster validation and refinement, quantifying the similarity of data points within clusters and improving the overall efficacy of clustering algorithms in delineating coherent and distinct groups within the dataset. + +- **Visualization:** Visualizing clustering results facilitates intuitive interpretation and assessment of identified clusters. + + - Visual representations, such as scatter plots, enable the identification of inherent data patterns, outliers, and delineation of cluster boundaries. In applications, visualization aids in informed decision-making by providing stakeholders with insights into the data's structure and characteristics, fostering actionable insights and informed decisions. + ### Python Implementation of K-Means Clustering @@ -244,17 +295,26 @@ plt.show() ``` @Pyodide.eval -3. Split the data into training and testing sets: +3. This code defines a function called normalize that performs min-max scaling normalization on a DataFrame df, specifically on the features specified by the features parameter. The normalized DataFrame is returned as the output. Then, it calls this function to normalize all columns of a DataFrame data and assigns the result to a variable named normalized_data. ```python # Normalize dataframe def normalize(df, features): + # Create a copy of the DataFrame to avoid modifying the original data. result = df.copy() + + # Iterate through each feature specified for normalization. for feature_name in features: + # Find the maximum and minimum values of the current feature. max_value = df[feature_name].max() min_value = df[feature_name].min() + + # Normalize the current feature using min-max scaling formula. + # This ensures that all values of the feature are scaled between 0 and 1. result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value) return result +# Call the normalize function with the entire DataFrame 'data' and all its columns. +# Store the result in 'normalized_data'. 
normalized_data = normalize(data, data.columns)
```
@Pyodide.eval

4. Initialize the K-Means model:
```python
# Run KMeans
kmeans = KMeans(n_clusters = 2, max_iter = 500, n_init = 40, random_state = 2)
```
@Pyodide.eval

5. Fit the model and assign each data point to a cluster:
```python
# Predict clusters
identified_clusters = kmeans.fit_predict(normalized_data.values)
results = normalized_data.copy()
results['cluster'] = identified_clusters
```
@Pyodide.eval

6. Compute each data point's distance from its cluster centroid:
```python
# Compute distance from cluster. Loop through each data point and calculate the Euclidean distance between the data point and its assigned cluster centroid.
distance_from_centroid = [distance.euclidean(val[:-1],kmeans.cluster_centers_[int(val[-1])]) for val in results.values]
results['dist'] = distance_from_centroid
```
@Pyodide.eval


7. Visualize the clusters: a scatter plot of 'chol' (Cholesterol) against 'trtbps' (Resting Blood Pressure), colored by cluster, with marker size proportional to the distance from the cluster centroid.
```python
results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist')
plt.xlabel("Cholesterol")
plt.ylabel("Resting Blood Pressure")
plt.show()
```
@Pyodide.eval

From a53001b4b105a7968e969bdcf2dff8b4a7aa6 Mon Sep 17 00:00:00 2001
From: Schwartz
Date: Mon, 27 May 2024 15:37:14 -0400
Subject: [PATCH 7/8] Split up clustering module and extract overview

---
 .../intro_clustering_ml.md | 198 +-----
 python_clustering/data/heart.csv | 304 ------
 python_clustering/data/polyps.csv | 23 --
 3 files changed, 4 insertions(+), 521 deletions(-)
 rename python_clustering/python_clustering.md => intro_clustering_ml/intro_clustering_ml.md (68%)
 delete mode 100644 python_clustering/data/heart.csv
 delete mode 100644 python_clustering/data/polyps.csv

diff --git a/python_clustering/python_clustering.md b/intro_clustering_ml/intro_clustering_ml.md
similarity index 68%
rename from python_clustering/python_clustering.md
rename to intro_clustering_ml/intro_clustering_ml.md
index 8a41eb7cd..3e6e02b16 100644
--- a/python_clustering/python_clustering.md
+++ b/intro_clustering_ml/intro_clustering_ml.md
@@ -243,115 +243,6 @@ Before diving into the example, it's valuable to understand some key concepts us

### Python Implementation of K-Means Clustering


This dataset contains various clinical attributes of patients, including their age, sex, chest pain type (cp), resting blood pressure (trtbps), serum cholesterol level (chol), fasting blood sugar (fbs) level, resting electrocardiographic results (restecg), maximum heart rate achieved (thalachh), exercise-induced angina (exng), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slp), number of major vessels (caa) colored by fluoroscopy, thalassemia (thall) type, and the presence of heart disease (output). The data seems to be related to the diagnosis of heart disease, with the output variable indicating whether a patient has heart disease (1) or not (0).
Each row represents a different patient, with their respective clinical characteristics recorded. - -To implement k-means clustering in Python using Scikit-learn, we can follow these steps: - -1. Import the necessary libraries: -```python -import numpy as np -import pandas as pd -import matplotlib.pyplot as plt -from sklearn.model_selection import train_test_split -from sklearn.cluster import KMeans -from scipy.spatial import distance -``` -@Pyodide.eval - - -2. Load the data: -```python @Pyodide.exec - -import pandas as pd -import io -from pyodide.http import open_url - -url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/heart.csv" - -url_contents = open_url(url) -text = url_contents.read() -file = io.StringIO(text) - -data = pd.read_csv(file) - - -# Analyze data and features -data.info() -``` - - -3. Visualize data -```python -# Create the scatter plot -data.plot.scatter(x='chol', y='trtbps', c='output', colormap='viridis') -plt.xlabel("Cholesterol") -plt.ylabel("Resting Blood Pressure") -plt.title("Scatter Plot of Cholesterol vs. Blood Pressure") -plt.show() -``` -@Pyodide.eval - -3. This code defines a function called normalize that performs min-max scaling normalization on a DataFrame df, specifically on the features specified by the features parameter. The normalized DataFrame is returned as the output. Then, it calls this function to normalize all columns of a DataFrame data and assigns the result to a variable named normalized_data. -```python -# Normalize dataframe -def normalize(df, features): - # Create a copy of the DataFrame to avoid modifying the original data. - result = df.copy() - - # Iterate through each feature specified for normalization. - for feature_name in features: - # Find the maximum and minimum values of the current feature. - max_value = df[feature_name].max() - min_value = df[feature_name].min() - - # Normalize the current feature using min-max scaling formula. - # This ensures that all values of the feature are scaled between 0 and 1. - result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value) - return result - -# Call the normalize function with the entire DataFrame 'data' and all its columns. -# Store the result in 'normalized_data'. -normalized_data = normalize(data, data.columns) -``` -@Pyodide.eval - -4. Train the clustering model and visualize: -```python -# Run KMeans -kmeans = KMeans(n_clusters = 2, max_iter = 500, n_init = 40, random_state = 2) -``` -@Pyodide.eval - -5. Train the clustering model and visualize: -```python -# Predict clusters -identified_clusters = kmeans.fit_predict(normalized_data.values) -results = normalized_data.copy() -results['cluster'] = identified_clusters -``` -@Pyodide.eval - -6. Train the clustering model and visualize: -```python -# Compute distance from cluster. Loop through each data point and calculate the Euclidean distance between the data point and its assigned cluster centroid. -distance_from_centroid = [distance.euclidean(val[:-1],kmeans.cluster_centers_[int(val[-1])]) for val in results.values] -results['dist'] = distance_from_centroid -``` -@Pyodide.eval - - -7. Train the clustering model and visualize. Scatter plot of 'chol' (Cholesterol) against 'trtbps' (Resting Blood Pressure), colored by cluster, with marker size proportional to the distance from the cluster centroid. 
-```python
-results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist')
-plt.xlabel("Cholesterol")
-plt.ylabel("Resting Blood Pressure")
-plt.show()
-```
-@Pyodide.eval
@@ -393,96 +284,15 @@ All of the above techniques can be used to mitigate the sensitivity of clustering
 
-### Real World Code Example
-This dataset, derived and refined from a landmark study published in the New England Journal of Medicine in 1993, investigates the effectiveness of sulindac treatment in individuals with familial adenomatous polyposis (FAP), a hereditary condition characterized by the development of numerous adenomatous polyps in the colon and rectum. Built on the original "polyps" and "polyps3" datasets in the {HSAUR} package and cross-checked against the original paper, it records each participant's ID, sex, age, baseline polyp count, assigned treatment (sulindac or placebo), and polyp counts at 3 and 12 months post-treatment. The added granularity and completeness make it well suited to analyzing the impact of sulindac treatment on polyp progression in FAP patients.
-
-1. Import the packages:
-```python @Pyodide.exec
-
-import pandas as pd
-import io
-from pyodide.http import open_url
-from sklearn.cluster import KMeans
-from sklearn.preprocessing import StandardScaler
-import matplotlib.pyplot as plt
-```
-
-2. Load the data:
-```python
-# Load the dataset into a pandas dataframe
-url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/polyps.csv"
-url_contents = open_url(url)
-text = url_contents.read()
-file = io.StringIO(text)
-df = pd.read_csv(file)
-
-# Analyze data and features
-df.info()
-
-# Select features for clustering
-features = ['age', 'baseline', 'number3m', 'number12m']
-X = df[features]
-
-# Fill missing values with the mean of each column.
-# Reassigning, rather than calling inplace=True on a slice, avoids pandas' chained-assignment warning.
-X = X.fillna(X.mean())
-
-# Standardize the feature values
-scaler = StandardScaler()
-X_scaled = scaler.fit_transform(X)
-```
-@Pyodide.eval
-
-3. Cluster the data:
-```python
-# Define the number of clusters
-num_clusters = 3
-
-# Apply KMeans clustering
-kmeans = KMeans(n_clusters=num_clusters, random_state=42)
-kmeans.fit(X_scaled)
-
-# Assign cluster labels to the original dataframe
-df['cluster'] = kmeans.labels_
-```
-@Pyodide.eval
-
-4. Visualize the clusters:
-```python
-# Visualize clusters for 'number3m' vs 'number12m'
-plt.figure(figsize=(10, 8))
-colors = ['red', 'blue', 'green'] # Change colors as needed for more clusters
-
-for i in range(num_clusters):
-    cluster_data = df[df['cluster'] == i]
-    plt.scatter(cluster_data['number3m'], cluster_data['number12m'],
-                color=colors[i], label=f'Cluster {i}')
-
-plt.xlabel('Number of Polyps at 3 Months')
-plt.ylabel('Number of Polyps at 12 Months')
-plt.title('K-Means Clustering of Polyp Counts at 3 vs. 12 Months')
-plt.legend()
-plt.show()
-```
-@Pyodide.eval
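
Before interpreting the clusters, it can help to look at the cluster centers on the original measurement scale rather than in standardized units. Below is a minimal sketch of one way to do that, reusing the scaler, kmeans, features, and num_clusters objects from the steps above; inverse_transform simply undoes the standardization.
```python
# Convert the cluster centers from standardized units back to the original units
centers = scaler.inverse_transform(kmeans.cluster_centers_)

# Label rows and columns so each center reads as a typical profile for its cluster
centers_df = pd.DataFrame(centers, columns=features,
                          index=[f'Cluster {i}' for i in range(num_clusters)])
print(centers_df)
```
@Pyodide.eval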
If the K-Means algorithm identified distinct clusters with minimal overlap, it suggests there might be three underlying patient groups regarding polyp count progression:

- **Cluster 1 (Low Progression):** This cluster might represent participants who have a relatively low number of polyps at 3 months and a stable or slightly increased number at 12 months. This could be associated with effective treatment or naturally slow polyp growth.
- **Cluster 2 (Moderate Progression):** This cluster could include participants with a moderate number of polyps at 3 months and a somewhat steeper increase by 12 months. This might indicate a less effective treatment or a faster natural growth rate for polyps.
- **Cluster 3 (High Progression):** This cluster might contain participants with a high number of polyps at 3 months and a substantial increase by 12 months. This could be linked to factors like a particularly aggressive polyp type or treatment resistance.

**While clustering provides valuable insights into potential patient subgroups, further analysis of treatment effects and other relevant features is necessary to fully understand the underlying factors influencing polyp count progression.**


+## Conclusion
+Through this lesson, we've explored the fundamental concept of clustering as an unsupervised machine learning technique. We've delved into the inner workings of the K-Means algorithm, seeing how it iteratively groups similar data points together. We've also applied this knowledge to real-world scenarios, showcasing the potential of clustering in fields like customer segmentation and biomedical research.
-## Conclusion
+However, clustering is not without its challenges. We've examined the importance of data preprocessing, including normalization, to ensure fair and accurate clustering results. We've also discussed the limitations of clustering algorithms, such as sensitivity to initialization and the difficulty of determining the optimal number of clusters. By understanding these limitations, you're better equipped to make informed decisions and interpret clustering results with a critical eye.
-At the end of the lesson, students should have a good understanding of the concept of clustering and how to implement the K-Means clustering algorithm in Python. They should also be able to apply K-Means clustering to real-world datasets to identify patterns and insights.
+With this foundational knowledge, you're now prepared to explore the wider landscape of clustering techniques, including variations like hierarchical clustering and density-based clustering. As you continue your journey in machine learning, remember that clustering is a versatile tool with far-reaching applications in data analysis, pattern recognition, and decision-making across diverse domains.
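
One limitation mentioned above is worth making concrete: choosing the number of clusters. Below is a minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available and using X_scaled as a stand-in for any standardized feature matrix. It fits K-Means for a range of k values, records the inertia (the within-cluster sum of squared distances), and looks for the "elbow" where adding more clusters stops paying off.
```python
inertias = []
k_values = range(1, 10)

for k in k_values:
    # n_init=10 repeats each fit from different random centroids and keeps the best result,
    # which also mitigates the initialization sensitivity discussed earlier
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method for Choosing k')
plt.show()
```
@Pyodide.eval

The bend in this curve is a heuristic, not a rule; silhouette scores or domain knowledge make useful cross-checks.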
## Additional Resources diff --git a/python_clustering/data/heart.csv b/python_clustering/data/heart.csv deleted file mode 100644 index 0966e67b5..000000000 --- a/python_clustering/data/heart.csv +++ /dev/null @@ -1,304 +0,0 @@ -age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output -63,1,3,145,233,1,0,150,0,2.3,0,0,1,1 -37,1,2,130,250,0,1,187,0,3.5,0,0,2,1 -41,0,1,130,204,0,0,172,0,1.4,2,0,2,1 -56,1,1,120,236,0,1,178,0,0.8,2,0,2,1 -57,0,0,120,354,0,1,163,1,0.6,2,0,2,1 -57,1,0,140,192,0,1,148,0,0.4,1,0,1,1 -56,0,1,140,294,0,0,153,0,1.3,1,0,2,1 -44,1,1,120,263,0,1,173,0,0,2,0,3,1 -52,1,2,172,199,1,1,162,0,0.5,2,0,3,1 -57,1,2,150,168,0,1,174,0,1.6,2,0,2,1 -54,1,0,140,239,0,1,160,0,1.2,2,0,2,1 -48,0,2,130,275,0,1,139,0,0.2,2,0,2,1 -49,1,1,130,266,0,1,171,0,0.6,2,0,2,1 -64,1,3,110,211,0,0,144,1,1.8,1,0,2,1 -58,0,3,150,283,1,0,162,0,1,2,0,2,1 -50,0,2,120,219,0,1,158,0,1.6,1,0,2,1 -58,0,2,120,340,0,1,172,0,0,2,0,2,1 -66,0,3,150,226,0,1,114,0,2.6,0,0,2,1 -43,1,0,150,247,0,1,171,0,1.5,2,0,2,1 -69,0,3,140,239,0,1,151,0,1.8,2,2,2,1 -59,1,0,135,234,0,1,161,0,0.5,1,0,3,1 -44,1,2,130,233,0,1,179,1,0.4,2,0,2,1 -42,1,0,140,226,0,1,178,0,0,2,0,2,1 -61,1,2,150,243,1,1,137,1,1,1,0,2,1 -40,1,3,140,199,0,1,178,1,1.4,2,0,3,1 -71,0,1,160,302,0,1,162,0,0.4,2,2,2,1 -59,1,2,150,212,1,1,157,0,1.6,2,0,2,1 -51,1,2,110,175,0,1,123,0,0.6,2,0,2,1 -65,0,2,140,417,1,0,157,0,0.8,2,1,2,1 -53,1,2,130,197,1,0,152,0,1.2,0,0,2,1 -41,0,1,105,198,0,1,168,0,0,2,1,2,1 -65,1,0,120,177,0,1,140,0,0.4,2,0,3,1 -44,1,1,130,219,0,0,188,0,0,2,0,2,1 -54,1,2,125,273,0,0,152,0,0.5,0,1,2,1 -51,1,3,125,213,0,0,125,1,1.4,2,1,2,1 -46,0,2,142,177,0,0,160,1,1.4,0,0,2,1 -54,0,2,135,304,1,1,170,0,0,2,0,2,1 -54,1,2,150,232,0,0,165,0,1.6,2,0,3,1 -65,0,2,155,269,0,1,148,0,0.8,2,0,2,1 -65,0,2,160,360,0,0,151,0,0.8,2,0,2,1 -51,0,2,140,308,0,0,142,0,1.5,2,1,2,1 -48,1,1,130,245,0,0,180,0,0.2,1,0,2,1 -45,1,0,104,208,0,0,148,1,3,1,0,2,1 -53,0,0,130,264,0,0,143,0,0.4,1,0,2,1 -39,1,2,140,321,0,0,182,0,0,2,0,2,1 -52,1,1,120,325,0,1,172,0,0.2,2,0,2,1 -44,1,2,140,235,0,0,180,0,0,2,0,2,1 -47,1,2,138,257,0,0,156,0,0,2,0,2,1 -53,0,2,128,216,0,0,115,0,0,2,0,0,1 -53,0,0,138,234,0,0,160,0,0,2,0,2,1 -51,0,2,130,256,0,0,149,0,0.5,2,0,2,1 -66,1,0,120,302,0,0,151,0,0.4,1,0,2,1 -62,1,2,130,231,0,1,146,0,1.8,1,3,3,1 -44,0,2,108,141,0,1,175,0,0.6,1,0,2,1 -63,0,2,135,252,0,0,172,0,0,2,0,2,1 -52,1,1,134,201,0,1,158,0,0.8,2,1,2,1 -48,1,0,122,222,0,0,186,0,0,2,0,2,1 -45,1,0,115,260,0,0,185,0,0,2,0,2,1 -34,1,3,118,182,0,0,174,0,0,2,0,2,1 -57,0,0,128,303,0,0,159,0,0,2,1,2,1 -71,0,2,110,265,1,0,130,0,0,2,1,2,1 -54,1,1,108,309,0,1,156,0,0,2,0,3,1 -52,1,3,118,186,0,0,190,0,0,1,0,1,1 -41,1,1,135,203,0,1,132,0,0,1,0,1,1 -58,1,2,140,211,1,0,165,0,0,2,0,2,1 -35,0,0,138,183,0,1,182,0,1.4,2,0,2,1 -51,1,2,100,222,0,1,143,1,1.2,1,0,2,1 -45,0,1,130,234,0,0,175,0,0.6,1,0,2,1 -44,1,1,120,220,0,1,170,0,0,2,0,2,1 -62,0,0,124,209,0,1,163,0,0,2,0,2,1 -54,1,2,120,258,0,0,147,0,0.4,1,0,3,1 -51,1,2,94,227,0,1,154,1,0,2,1,3,1 -29,1,1,130,204,0,0,202,0,0,2,0,2,1 -51,1,0,140,261,0,0,186,1,0,2,0,2,1 -43,0,2,122,213,0,1,165,0,0.2,1,0,2,1 -55,0,1,135,250,0,0,161,0,1.4,1,0,2,1 -51,1,2,125,245,1,0,166,0,2.4,1,0,2,1 -59,1,1,140,221,0,1,164,1,0,2,0,2,1 -52,1,1,128,205,1,1,184,0,0,2,0,2,1 -58,1,2,105,240,0,0,154,1,0.6,1,0,3,1 -41,1,2,112,250,0,1,179,0,0,2,0,2,1 -45,1,1,128,308,0,0,170,0,0,2,0,2,1 -60,0,2,102,318,0,1,160,0,0,2,1,2,1 -52,1,3,152,298,1,1,178,0,1.2,1,0,3,1 -42,0,0,102,265,0,0,122,0,0.6,1,0,2,1 -67,0,2,115,564,0,0,160,0,1.6,1,0,3,1 -68,1,2,118,277,0,1,151,0,1,2,1,3,1 
-46,1,1,101,197,1,1,156,0,0,2,0,3,1 -54,0,2,110,214,0,1,158,0,1.6,1,0,2,1 -58,0,0,100,248,0,0,122,0,1,1,0,2,1 -48,1,2,124,255,1,1,175,0,0,2,2,2,1 -57,1,0,132,207,0,1,168,1,0,2,0,3,1 -52,1,2,138,223,0,1,169,0,0,2,4,2,1 -54,0,1,132,288,1,0,159,1,0,2,1,2,1 -45,0,1,112,160,0,1,138,0,0,1,0,2,1 -53,1,0,142,226,0,0,111,1,0,2,0,3,1 -62,0,0,140,394,0,0,157,0,1.2,1,0,2,1 -52,1,0,108,233,1,1,147,0,0.1,2,3,3,1 -43,1,2,130,315,0,1,162,0,1.9,2,1,2,1 -53,1,2,130,246,1,0,173,0,0,2,3,2,1 -42,1,3,148,244,0,0,178,0,0.8,2,2,2,1 -59,1,3,178,270,0,0,145,0,4.2,0,0,3,1 -63,0,1,140,195,0,1,179,0,0,2,2,2,1 -42,1,2,120,240,1,1,194,0,0.8,0,0,3,1 -50,1,2,129,196,0,1,163,0,0,2,0,2,1 -68,0,2,120,211,0,0,115,0,1.5,1,0,2,1 -69,1,3,160,234,1,0,131,0,0.1,1,1,2,1 -45,0,0,138,236,0,0,152,1,0.2,1,0,2,1 -50,0,1,120,244,0,1,162,0,1.1,2,0,2,1 -50,0,0,110,254,0,0,159,0,0,2,0,2,1 -64,0,0,180,325,0,1,154,1,0,2,0,2,1 -57,1,2,150,126,1,1,173,0,0.2,2,1,3,1 -64,0,2,140,313,0,1,133,0,0.2,2,0,3,1 -43,1,0,110,211,0,1,161,0,0,2,0,3,1 -55,1,1,130,262,0,1,155,0,0,2,0,2,1 -37,0,2,120,215,0,1,170,0,0,2,0,2,1 -41,1,2,130,214,0,0,168,0,2,1,0,2,1 -56,1,3,120,193,0,0,162,0,1.9,1,0,3,1 -46,0,1,105,204,0,1,172,0,0,2,0,2,1 -46,0,0,138,243,0,0,152,1,0,1,0,2,1 -64,0,0,130,303,0,1,122,0,2,1,2,2,1 -59,1,0,138,271,0,0,182,0,0,2,0,2,1 -41,0,2,112,268,0,0,172,1,0,2,0,2,1 -54,0,2,108,267,0,0,167,0,0,2,0,2,1 -39,0,2,94,199,0,1,179,0,0,2,0,2,1 -34,0,1,118,210,0,1,192,0,0.7,2,0,2,1 -47,1,0,112,204,0,1,143,0,0.1,2,0,2,1 -67,0,2,152,277,0,1,172,0,0,2,1,2,1 -52,0,2,136,196,0,0,169,0,0.1,1,0,2,1 -74,0,1,120,269,0,0,121,1,0.2,2,1,2,1 -54,0,2,160,201,0,1,163,0,0,2,1,2,1 -49,0,1,134,271,0,1,162,0,0,1,0,2,1 -42,1,1,120,295,0,1,162,0,0,2,0,2,1 -41,1,1,110,235,0,1,153,0,0,2,0,2,1 -41,0,1,126,306,0,1,163,0,0,2,0,2,1 -49,0,0,130,269,0,1,163,0,0,2,0,2,1 -60,0,2,120,178,1,1,96,0,0,2,0,2,1 -62,1,1,128,208,1,0,140,0,0,2,0,2,1 -57,1,0,110,201,0,1,126,1,1.5,1,0,1,1 -64,1,0,128,263,0,1,105,1,0.2,1,1,3,1 -51,0,2,120,295,0,0,157,0,0.6,2,0,2,1 -43,1,0,115,303,0,1,181,0,1.2,1,0,2,1 -42,0,2,120,209,0,1,173,0,0,1,0,2,1 -67,0,0,106,223,0,1,142,0,0.3,2,2,2,1 -76,0,2,140,197,0,2,116,0,1.1,1,0,2,1 -70,1,1,156,245,0,0,143,0,0,2,0,2,1 -44,0,2,118,242,0,1,149,0,0.3,1,1,2,1 -60,0,3,150,240,0,1,171,0,0.9,2,0,2,1 -44,1,2,120,226,0,1,169,0,0,2,0,2,1 -42,1,2,130,180,0,1,150,0,0,2,0,2,1 -66,1,0,160,228,0,0,138,0,2.3,2,0,1,1 -71,0,0,112,149,0,1,125,0,1.6,1,0,2,1 -64,1,3,170,227,0,0,155,0,0.6,1,0,3,1 -66,0,2,146,278,0,0,152,0,0,1,1,2,1 -39,0,2,138,220,0,1,152,0,0,1,0,2,1 -58,0,0,130,197,0,1,131,0,0.6,1,0,2,1 -47,1,2,130,253,0,1,179,0,0,2,0,2,1 -35,1,1,122,192,0,1,174,0,0,2,0,2,1 -58,1,1,125,220,0,1,144,0,0.4,1,4,3,1 -56,1,1,130,221,0,0,163,0,0,2,0,3,1 -56,1,1,120,240,0,1,169,0,0,0,0,2,1 -55,0,1,132,342,0,1,166,0,1.2,2,0,2,1 -41,1,1,120,157,0,1,182,0,0,2,0,2,1 -38,1,2,138,175,0,1,173,0,0,2,4,2,1 -38,1,2,138,175,0,1,173,0,0,2,4,2,1 -67,1,0,160,286,0,0,108,1,1.5,1,3,2,0 -67,1,0,120,229,0,0,129,1,2.6,1,2,3,0 -62,0,0,140,268,0,0,160,0,3.6,0,2,2,0 -63,1,0,130,254,0,0,147,0,1.4,1,1,3,0 -53,1,0,140,203,1,0,155,1,3.1,0,0,3,0 -56,1,2,130,256,1,0,142,1,0.6,1,1,1,0 -48,1,1,110,229,0,1,168,0,1,0,0,3,0 -58,1,1,120,284,0,0,160,0,1.8,1,0,2,0 -58,1,2,132,224,0,0,173,0,3.2,2,2,3,0 -60,1,0,130,206,0,0,132,1,2.4,1,2,3,0 -40,1,0,110,167,0,0,114,1,2,1,0,3,0 -60,1,0,117,230,1,1,160,1,1.4,2,2,3,0 -64,1,2,140,335,0,1,158,0,0,2,0,2,0 -43,1,0,120,177,0,0,120,1,2.5,1,0,3,0 -57,1,0,150,276,0,0,112,1,0.6,1,1,1,0 -55,1,0,132,353,0,1,132,1,1.2,1,1,3,0 -65,0,0,150,225,0,0,114,0,1,1,3,3,0 -61,0,0,130,330,0,0,169,0,0,2,0,2,0 
-58,1,2,112,230,0,0,165,0,2.5,1,1,3,0 -50,1,0,150,243,0,0,128,0,2.6,1,0,3,0 -44,1,0,112,290,0,0,153,0,0,2,1,2,0 -60,1,0,130,253,0,1,144,1,1.4,2,1,3,0 -54,1,0,124,266,0,0,109,1,2.2,1,1,3,0 -50,1,2,140,233,0,1,163,0,0.6,1,1,3,0 -41,1,0,110,172,0,0,158,0,0,2,0,3,0 -51,0,0,130,305,0,1,142,1,1.2,1,0,3,0 -58,1,0,128,216,0,0,131,1,2.2,1,3,3,0 -54,1,0,120,188,0,1,113,0,1.4,1,1,3,0 -60,1,0,145,282,0,0,142,1,2.8,1,2,3,0 -60,1,2,140,185,0,0,155,0,3,1,0,2,0 -59,1,0,170,326,0,0,140,1,3.4,0,0,3,0 -46,1,2,150,231,0,1,147,0,3.6,1,0,2,0 -67,1,0,125,254,1,1,163,0,0.2,1,2,3,0 -62,1,0,120,267,0,1,99,1,1.8,1,2,3,0 -65,1,0,110,248,0,0,158,0,0.6,2,2,1,0 -44,1,0,110,197,0,0,177,0,0,2,1,2,0 -60,1,0,125,258,0,0,141,1,2.8,1,1,3,0 -58,1,0,150,270,0,0,111,1,0.8,2,0,3,0 -68,1,2,180,274,1,0,150,1,1.6,1,0,3,0 -62,0,0,160,164,0,0,145,0,6.2,0,3,3,0 -52,1,0,128,255,0,1,161,1,0,2,1,3,0 -59,1,0,110,239,0,0,142,1,1.2,1,1,3,0 -60,0,0,150,258,0,0,157,0,2.6,1,2,3,0 -49,1,2,120,188,0,1,139,0,2,1,3,3,0 -59,1,0,140,177,0,1,162,1,0,2,1,3,0 -57,1,2,128,229,0,0,150,0,0.4,1,1,3,0 -61,1,0,120,260,0,1,140,1,3.6,1,1,3,0 -39,1,0,118,219,0,1,140,0,1.2,1,0,3,0 -61,0,0,145,307,0,0,146,1,1,1,0,3,0 -56,1,0,125,249,1,0,144,1,1.2,1,1,2,0 -43,0,0,132,341,1,0,136,1,3,1,0,3,0 -62,0,2,130,263,0,1,97,0,1.2,1,1,3,0 -63,1,0,130,330,1,0,132,1,1.8,2,3,3,0 -65,1,0,135,254,0,0,127,0,2.8,1,1,3,0 -48,1,0,130,256,1,0,150,1,0,2,2,3,0 -63,0,0,150,407,0,0,154,0,4,1,3,3,0 -55,1,0,140,217,0,1,111,1,5.6,0,0,3,0 -65,1,3,138,282,1,0,174,0,1.4,1,1,2,0 -56,0,0,200,288,1,0,133,1,4,0,2,3,0 -54,1,0,110,239,0,1,126,1,2.8,1,1,3,0 -70,1,0,145,174,0,1,125,1,2.6,0,0,3,0 -62,1,1,120,281,0,0,103,0,1.4,1,1,3,0 -35,1,0,120,198,0,1,130,1,1.6,1,0,3,0 -59,1,3,170,288,0,0,159,0,0.2,1,0,3,0 -64,1,2,125,309,0,1,131,1,1.8,1,0,3,0 -47,1,2,108,243,0,1,152,0,0,2,0,2,0 -57,1,0,165,289,1,0,124,0,1,1,3,3,0 -55,1,0,160,289,0,0,145,1,0.8,1,1,3,0 -64,1,0,120,246,0,0,96,1,2.2,0,1,2,0 -70,1,0,130,322,0,0,109,0,2.4,1,3,2,0 -51,1,0,140,299,0,1,173,1,1.6,2,0,3,0 -58,1,0,125,300,0,0,171,0,0,2,2,3,0 -60,1,0,140,293,0,0,170,0,1.2,1,2,3,0 -77,1,0,125,304,0,0,162,1,0,2,3,2,0 -35,1,0,126,282,0,0,156,1,0,2,0,3,0 -70,1,2,160,269,0,1,112,1,2.9,1,1,3,0 -59,0,0,174,249,0,1,143,1,0,1,0,2,0 -64,1,0,145,212,0,0,132,0,2,1,2,1,0 -57,1,0,152,274,0,1,88,1,1.2,1,1,3,0 -56,1,0,132,184,0,0,105,1,2.1,1,1,1,0 -48,1,0,124,274,0,0,166,0,0.5,1,0,3,0 -56,0,0,134,409,0,0,150,1,1.9,1,2,3,0 -66,1,1,160,246,0,1,120,1,0,1,3,1,0 -54,1,1,192,283,0,0,195,0,0,2,1,3,0 -69,1,2,140,254,0,0,146,0,2,1,3,3,0 -51,1,0,140,298,0,1,122,1,4.2,1,3,3,0 -43,1,0,132,247,1,0,143,1,0.1,1,4,3,0 -62,0,0,138,294,1,1,106,0,1.9,1,3,2,0 -67,1,0,100,299,0,0,125,1,0.9,1,2,2,0 -59,1,3,160,273,0,0,125,0,0,2,0,2,0 -45,1,0,142,309,0,0,147,1,0,1,3,3,0 -58,1,0,128,259,0,0,130,1,3,1,2,3,0 -50,1,0,144,200,0,0,126,1,0.9,1,0,3,0 -62,0,0,150,244,0,1,154,1,1.4,1,0,2,0 -38,1,3,120,231,0,1,182,1,3.8,1,0,3,0 -66,0,0,178,228,1,1,165,1,1,1,2,3,0 -52,1,0,112,230,0,1,160,0,0,2,1,2,0 -53,1,0,123,282,0,1,95,1,2,1,2,3,0 -63,0,0,108,269,0,1,169,1,1.8,1,2,2,0 -54,1,0,110,206,0,0,108,1,0,1,1,2,0 -66,1,0,112,212,0,0,132,1,0.1,2,1,2,0 -55,0,0,180,327,0,2,117,1,3.4,1,0,2,0 -49,1,2,118,149,0,0,126,0,0.8,2,3,2,0 -54,1,0,122,286,0,0,116,1,3.2,1,2,2,0 -56,1,0,130,283,1,0,103,1,1.6,0,0,3,0 -46,1,0,120,249,0,0,144,0,0.8,2,0,3,0 -61,1,3,134,234,0,1,145,0,2.6,1,2,2,0 -67,1,0,120,237,0,1,71,0,1,1,0,2,0 -58,1,0,100,234,0,1,156,0,0.1,2,1,3,0 -47,1,0,110,275,0,0,118,1,1,1,1,2,0 -52,1,0,125,212,0,1,168,0,1,2,2,3,0 -58,1,0,146,218,0,1,105,0,2,1,1,3,0 -57,1,1,124,261,0,1,141,0,0.3,2,0,3,0 
-58,0,1,136,319,1,0,152,0,0,2,2,2,0 -61,1,0,138,166,0,0,125,1,3.6,1,1,2,0 -42,1,0,136,315,0,1,125,1,1.8,1,0,1,0 -52,1,0,128,204,1,1,156,1,1,1,0,0,0 -59,1,2,126,218,1,1,134,0,2.2,1,1,1,0 -40,1,0,152,223,0,1,181,0,0,2,0,3,0 -61,1,0,140,207,0,0,138,1,1.9,2,1,3,0 -46,1,0,140,311,0,1,120,1,1.8,1,2,3,0 -59,1,3,134,204,0,1,162,0,0.8,2,2,2,0 -57,1,1,154,232,0,0,164,0,0,2,1,2,0 -57,1,0,110,335,0,1,143,1,3,1,1,3,0 -55,0,0,128,205,0,2,130,1,2,1,1,3,0 -61,1,0,148,203,0,1,161,0,0,2,1,3,0 -58,1,0,114,318,0,2,140,0,4.4,0,3,1,0 -58,0,0,170,225,1,0,146,1,2.8,1,2,1,0 -67,1,2,152,212,0,0,150,0,0.8,1,0,3,0 -44,1,0,120,169,0,1,144,1,2.8,0,0,1,0 -63,1,0,140,187,0,0,144,1,4,2,2,3,0 -63,0,0,124,197,0,1,136,1,0,1,0,2,0 -59,1,0,164,176,1,0,90,0,1,1,2,1,0 -57,0,0,140,241,0,1,123,1,0.2,1,0,3,0 -45,1,3,110,264,0,1,132,0,1.2,1,0,3,0 -68,1,0,144,193,1,1,141,0,3.4,1,2,3,0 -57,1,0,130,131,0,1,115,1,1.2,1,1,3,0 -57,0,1,130,236,0,0,174,0,0,1,1,2,0 diff --git a/python_clustering/data/polyps.csv b/python_clustering/data/polyps.csv deleted file mode 100644 index 54f327919..000000000 --- a/python_clustering/data/polyps.csv +++ /dev/null @@ -1,23 +0,0 @@ -"","participant_id","sex","age","baseline","treatment","number3m","number12m" -"1","001","female",17,7,"sulindac",6,NA -"2","002","female",20,77,"placebo",67,63 -"3","003","male",16,7,"sulindac",4,2 -"4","004","female",18,5,"placebo",5,28 -"5","005","male",22,23,"sulindac",16,17 -"6","006","female",13,35,"placebo",31,61 -"7","007","female",23,11,"sulindac",6,1 -"8","008","male",34,12,"placebo",20,7 -"9","009","male",50,7,"placebo",7,15 -"10","010","male",19,318,"placebo",347,44 -"11","011","male",17,160,"sulindac",142,25 -"12","012","female",23,8,"sulindac",1,3 -"13","013","male",22,20,"placebo",16,28 -"14","014","male",30,11,"placebo",20,10 -"15","015","male",27,24,"placebo",26,40 -"16","016","male",23,34,"sulindac",27,33 -"17","017","female",22,54,"placebo",45,46 -"18","018","male",13,16,"sulindac",10,NA -"19","019","male",34,30,"placebo",30,50 -"20","020","female",23,10,"sulindac",6,3 -"21","021","female",22,20,"sulindac",5,1 -"22","022","male",42,12,"sulindac",8,4 From 5e436b180471d8104446126c3f16a93daa95b208 Mon Sep 17 00:00:00 2001 From: Schwartz Date: Sun, 23 Jun 2024 20:42:29 -0400 Subject: [PATCH 8/8] Updated changes based off Elizabeth's comments --- intro_clustering_ml/intro_clustering_ml.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/intro_clustering_ml/intro_clustering_ml.md b/intro_clustering_ml/intro_clustering_ml.md index 3e6e02b16..26334c022 100644 --- a/intro_clustering_ml/intro_clustering_ml.md +++ b/intro_clustering_ml/intro_clustering_ml.md @@ -2,7 +2,7 @@ author: Daniel Schwartz email: des338@drexel.edu -version: 0.0.0 +version: 1.0.0 current_version_description: Initial version module_type: standard docs_version: 3.0.0