check in notebook code (#23)
* check in notebook code

* update notebook

* update notebook code; the previous version was not the latest one

* add some functions

* refactor evaluation_pipeline code, pulling the get_score function out of get_tuning_score.py into get_scores.py

* add kmeans after running hierarchical clustering to re-build the model; testing only, using user 1

* adding kmeans only in the test step

* check in code for generating results for Gabriel's function; will move it to the server repo after refactoring

* change the line of importing label_processing

* address the problems from the previous commit; the geometric shape work is not yet done

* Update tour_model_eval/build_save_model.py

Co-authored-by: shankari <shankari@eecs.berkeley.edu>

* refactor notebook code; the plot is not done yet

* update build_save_model according to notebook refactoring

* check in the notebook changes; the original clustering code has been moved into evaluation_pipeline

* delete output from the notebook

* modify test notebook and add comments on it, remove extraneous files

* add plot code, read filename directly from user id, add another way to read all data from the database in the notebook

Co-authored-by: shankari <shankari@eecs.berkeley.edu>
corinne-hcr and shankari authored Jul 24, 2021
1 parent d9c8f3f commit 9cdb64e
Showing 4 changed files with 407 additions and 145 deletions.
133 changes: 133 additions & 0 deletions tour_model_eval/first_second_round_evaluation.ipynb
@@ -0,0 +1,133 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "chubby-lightning",
"metadata": {},
"source": [
"## This notebook shows the evaluation (scatter plots) of the two rounds of clustering"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "mighty-ukraine",
"metadata": {},
"outputs": [],
"source": [
"import emission.core.get_database as edb\n",
"import emission.analysis.modelling.tour_model.get_users as gu\n",
"import emission.storage.timeseries.abstract_timeseries as esta\n",
"import matplotlib.pyplot as plt\n",
"import get_plot as plot"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cathedral-pointer",
"metadata": {},
"outputs": [],
"source": [
"# read data from the mini-pilot\n",
"participant_uuid_obj = list(edb.get_profile_db().find({\"install_group\": \"participant\"}, {\"user_id\": 1, \"_id\": 0}))\n",
"all_users = [u[\"user_id\"] for u in participant_uuid_obj]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "extraordinary-fetish",
"metadata": {},
"outputs": [],
"source": [
"# # read all data in the database\n",
"# all_users = esta.TimeSeries.get_uuid_list()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "exotic-livestock",
"metadata": {},
"outputs": [],
"source": [
"radius = 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "czech-establishment",
"metadata": {},
"outputs": [],
"source": [
"# get all/valid user list\n",
"user_ls, valid_users = gu.get_user_ls(all_users, radius)"
]
},
{
"cell_type": "markdown",
"id": "stainless-zoning",
"metadata": {},
"source": [
"### Get scatter plot from the 1st round of clustering"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "steady-bradley",
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"# e.g. the full path is \"/Users/chuang/Desktop/e-mission-server/user_576e37c7-ab7e-4c03-add7-02486bc3f42e.csv\"\n",
"# here, file_path should be passed as \"/Users/chuang/Desktop/e-mission-server/\"\n",
"file_path = \"<path before filename>\"\n",
"plot.get_scatter(valid_users, file_path, first_round=True, second_round=False)"
]
},
{
"cell_type": "markdown",
"id": "distant-vietnamese",
"metadata": {},
"source": [
"### Get scatter plot from the 2nd round of clustering"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "polyphonic-reply",
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"file_path = \"<path before filename>\"\n",
"plot.get_scatter(valid_users, file_path, first_round=False, second_round=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
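The scatter-plot cells above pass `file_path` as a directory and let `get_scatter` derive each per-user filename from the user id (the commit message notes the filename is read directly from the user id). The following is a minimal sketch of that naming convention, assuming the `user_<uuid>.csv` pattern shown in the notebook comment; the helper name is hypothetical:

```python
from pathlib import Path

def per_user_csv(file_path, user_id):
    # Hypothetical helper: build the per-user result filename directly
    # from the user id, matching the example path in the notebook:
    # <file_path>/user_<uuid>.csv
    return Path(file_path) / f"user_{user_id}.csv"

# e.g. per_user_csv("/Users/chuang/Desktop/e-mission-server/",
#                   "576e37c7-ab7e-4c03-add7-02486bc3f42e")
# -> /Users/chuang/Desktop/e-mission-server/user_576e37c7-ab7e-4c03-add7-02486bc3f42e.csv
```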
232 changes: 232 additions & 0 deletions tour_model_eval/first_second_round_evaluation_test.ipynb
@@ -0,0 +1,232 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "convinced-cedar",
"metadata": {},
"source": [
"### This notebook compares the results of scipy hierarchical clustering and sklearn KMeans clustering."
]
},
{
"cell_type": "markdown",
"id": "integrated-wholesale",
"metadata": {},
"source": [
"We run two rounds of clustering. The first test uses only hierarchical clustering for both rounds. The second test adds KMeans clustering in the 2nd round, after running hierarchical clustering. We cannot directly save a model from scipy hierarchical clustering, and sklearn's corresponding AgglomerativeClustering method does not support separate fit() and predict() functions, so we need a clustering algorithm such as KMeans to build and save the model and then use the saved model to predict labels for new trips.\n",
"The result of this notebook shows that adding KMeans does not change the result of scipy hierarchical clustering."
]
},
{
"cell_type": "markdown",
"id": "historical-michael",
"metadata": {},
"source": [
"We use user 1 from the mini-pilot program here. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "nearby-assist",
"metadata": {},
"outputs": [],
"source": [
"import emission.core.get_database as edb\n",
"import emission.analysis.modelling.tour_model.similarity as similarity\n",
"import pandas as pd\n",
"import numpy as np\n",
"import emission.analysis.modelling.tour_model.get_request_percentage as grp\n",
"import emission.analysis.modelling.tour_model.get_scores as gs\n",
"import emission.analysis.modelling.tour_model.label_processing as lp\n",
"import emission.analysis.modelling.tour_model.get_users as gu\n",
"import emission.analysis.modelling.tour_model.data_preprocessing as preprocess\n",
"import emission.analysis.modelling.tour_model.evaluation_pipeline as ep\n",
"import matplotlib.pyplot as plt\n",
"import emission.analysis.modelling.tour_model.get_plot as plot\n",
"import emission.core.common as ecc"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "rational-enough",
"metadata": {},
"outputs": [],
"source": [
"np.set_printoptions(suppress=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "matched-cylinder",
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_colwidth', 200)\n",
"pd.set_option('display.max_rows', None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "central-basic",
"metadata": {},
"outputs": [],
"source": [
"participant_uuid_obj = list(edb.get_profile_db().find({\"install_group\": \"participant\"}, {\"user_id\": 1, \"_id\": 0}))\n",
"all_users = [u[\"user_id\"] for u in participant_uuid_obj]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "breathing-beginning",
"metadata": {},
"outputs": [],
"source": [
"radius = 100"
]
},
{
"cell_type": "markdown",
"id": "continuous-novel",
"metadata": {},
"source": [
"### using scipy hierarchical clustering"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "positive-tradition",
"metadata": {},
"outputs": [],
"source": [
"for a in range(1):\n",
"    user = all_users[a]\n",
"    df = pd.DataFrame(columns=['user','user_id','percentage of 1st round','homogeneity score of 1st round',\n",
"                               'percentage of 2nd round','homogeneity score of 2nd round','scores','lower boundary','distance percentage'])\n",
"    trips = preprocess.read_data(user)\n",
"    filter_trips = preprocess.filter_data(trips, radius)\n",
"\n",
"    # filter out users that don't have enough valid labeled trips\n",
"    if not gu.valid_user(filter_trips, trips):\n",
"        continue\n",
"    tune_idx, test_idx = preprocess.split_data(filter_trips)\n",
"\n",
"    # choose tuning/test sets to run the model\n",
"    # this step uses KFold (5 splits) to split the data into different subsets\n",
"    # - tune: tuning set\n",
"    # - test: test set\n",
"    # here we use the bigger part of the data for testing and the smaller part for tuning\n",
"    tune_data = preprocess.get_subdata(filter_trips, test_idx)\n",
"    test_data = preprocess.get_subdata(filter_trips, tune_idx)\n",
"\n",
"    # tuning\n",
"    for j in range(len(tune_data)):\n",
"        low, dist_pct = ep.tune(tune_data[j], radius, kmeans=False)\n",
"        df.loc[j, 'lower boundary'] = low\n",
"        df.loc[j, 'distance percentage'] = dist_pct\n",
"\n",
"    # testing\n",
"    for k in range(len(test_data)):\n",
"        low = df.loc[k, 'lower boundary']\n",
"        dist_pct = df.loc[k, 'distance percentage']\n",
"\n",
"        homo_first, percentage_first, homo_second, percentage_second, scores = ep.test(test_data[k], radius, low,\n",
"                                                                                      dist_pct, kmeans=False)\n",
"        df.loc[k, 'percentage of 1st round'] = percentage_first\n",
"        df.loc[k, 'homogeneity score of 1st round'] = homo_first\n",
"        df.loc[k, 'percentage of 2nd round'] = percentage_second\n",
"        df.loc[k, 'homogeneity score of 2nd round'] = homo_second\n",
"        df.loc[k, 'scores'] = scores\n",
"    df['user_id'] = user\n",
"    df['user'] = 'user' + str(a+1)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "muslim-keyboard",
"metadata": {},
"source": [
"### using kmeans"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "finite-visitor",
"metadata": {},
"outputs": [],
"source": [
"for a in range(1):\n",
"    user = all_users[a]\n",
"    df1 = pd.DataFrame(columns=['user','user_id','percentage of 1st round','homogeneity score of 1st round',\n",
"                                'percentage of 2nd round','homogeneity score of 2nd round','scores','lower boundary','distance percentage'])\n",
"    trips = preprocess.read_data(user)\n",
"    filter_trips = preprocess.filter_data(trips, radius)\n",
"\n",
"    # filter out users that don't have enough valid labeled trips\n",
"    if not gu.valid_user(filter_trips, trips):\n",
"        continue\n",
"    tune_idx, test_idx = preprocess.split_data(filter_trips)\n",
"\n",
"    # choose tuning/test sets to run the model\n",
"    # this step uses KFold (5 splits) to split the data into different subsets\n",
"    # - tune: tuning set\n",
"    # - test: test set\n",
"    # here we use the bigger part of the data for testing and the smaller part for tuning\n",
"    tune_data = preprocess.get_subdata(filter_trips, test_idx)\n",
"    test_data = preprocess.get_subdata(filter_trips, tune_idx)\n",
"\n",
"    # tuning\n",
"    for j in range(len(tune_data)):\n",
"        low, dist_pct = ep.tune(tune_data[j], radius, kmeans=False)\n",
"        df1.loc[j, 'lower boundary'] = low\n",
"        df1.loc[j, 'distance percentage'] = dist_pct\n",
"\n",
"    # testing\n",
"    # for testing, we add kmeans to re-build the model; kmeans runs after hierarchical clustering,\n",
"    # with n_clusters passed in as a parameter that comes from the result of hierarchical clustering\n",
"    for k in range(len(test_data)):\n",
"        low = df1.loc[k, 'lower boundary']\n",
"        dist_pct = df1.loc[k, 'distance percentage']\n",
"\n",
"        homo_first, percentage_first, homo_second, percentage_second, scores = ep.test(test_data[k], radius, low,\n",
"                                                                                      dist_pct, kmeans=True)\n",
"        df1.loc[k, 'percentage of 1st round'] = percentage_first\n",
"        df1.loc[k, 'homogeneity score of 1st round'] = homo_first\n",
"        df1.loc[k, 'percentage of 2nd round'] = percentage_second\n",
"        df1.loc[k, 'homogeneity score of 2nd round'] = homo_second\n",
"        df1.loc[k, 'scores'] = scores\n",
"    df1['user_id'] = user\n",
"    df1['user'] = 'user' + str(a+1)\n",
"df1"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
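The idea the second notebook tests, re-fitting KMeans with the cluster count discovered by hierarchical clustering so that a model with separate fit() and predict() can be saved, can be sketched on synthetic data. This is a minimal illustration under assumed inputs (the sample points, average linkage, and the distance threshold are all assumptions), not the repository's evaluation_pipeline code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# synthetic 2-D points standing in for trip coordinates (assumed data)
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc, 0.1, size=(20, 2))
                    for loc in ((0, 0), (3, 3), (0, 3))])

# first pass: scipy hierarchical clustering; cut the dendrogram at a
# distance threshold to discover the number of clusters
labels_hier = fcluster(linkage(points, method="average"),
                       t=1.0, criterion="distance")
n_clusters = len(set(labels_hier))

# second pass: KMeans with the discovered cluster count -- unlike the scipy
# hierarchy (and sklearn's AgglomerativeClustering), KMeans supports separate
# fit() and predict(), so the fitted model can be saved and reused on new trips
model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
new_trip = np.array([[2.9, 3.1]])
predicted = model.predict(new_trip)
```

Because KMeans here only re-labels the partition the hierarchy already found, the saved model can assign a cluster to a new trip without re-running the full pipeline, which is consistent with the notebook's observation that adding KMeans does not change the hierarchical result.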
