\n",
+ "______________________________________________________\n",
+ "\n",
+ "Despite major technological breakthroughs in cybersecurity and privacy in recent years, secure off-premises data science collaboration has remained out of reach. This is a major problem for the health sector, which has so much to gain from the power of data but also so much at risk when it comes to patients' highly sensitive medical records.\n",
+ "\n",
+ "We are on a mission to make remote data science collaboration safe for the health sector. Using BastionLab, data owners can set strict access policies on datasets for collaborators, allowing them to run privacy-friendly queries and train and deploy ML models on datasets whilst blocking access to raw data.\n",
+ "\n",
+ "In this how-to guide, we will explore a dataset of diabetic patients admitted to hospital in the US over a ten-year period. Diabetes is a disease that affects over 10% of the US population and can lead to serious health complications. The dataset contains 50 columns of data, including readmission to hospital, changes to medication and primary, secondary and tertiary patient diagnoses.\n",
+ "\n",
+ "In part I of this two-part data exploration, we will see how the data owner can upload a dataset to BastionLab and how a data scientist can then connect to BastionLab and **clean the dataset**.\n",
+ "\n",
+ "But before we can do that, we first need to get everything set up!\n",
+ "\n",
+ "## Pre-requisites\n",
+ "___________________________________________\n",
+ "\n",
+ "### Installation and dataset\n",
+ "\n",
+ "In order to run this notebook, we need to:\n",
+ "- Ensure we have [Python 3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed\n",
+ "- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) and the [BastionLab server](https://pypi.org/project/bastionlab-server/0.3.7/) pip packages\n",
+ "- [Download the dataset](https://drive.google.com/file/d/1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI/view?usp=share_link) we will be using in this notebook.\n",
+ "\n",
+ "You can download the BastionLab pip packages and the dataset by running the following code block.\n",
+ "\n",
+ ">To find out about other ways you can install and run BastionLab, see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "hK-HDaMI_G1j"
+ },
+ "outputs": [],
+ "source": [
+ "# installing BastionLab client & server packages\n",
+ "!pip install bastionlab\n",
+ "!pip install bastionlab_server\n",
+ "\n",
+ "# downloading the dataset using the Google Drive tool gdown\n",
+ "!pip install gdown\n",
+ "!pip install --upgrade --no-cache-dir gdown\n",
+ "!gdown --id \"1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NJ67chDB_G1l"
+ },
+ "source": [
+ "The dataset we are using for this how-to guide is based on the Diabetes 130-US hospitals for years 1999-2008 dataset. It contains 10 years of data on diabetes admissions from 130 US hospitals. It includes over 50 features representing patient and hospital outcomes.\n",
+ "\n",
+ ">For more detailed information on the dataset, you can check out the description and full dataset by following this [link](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).\n",
+ "\n",
+ "However, this dataset had already been pre-processed before publication which stopped us from showing you some key data cleaning steps. We therefore made a few modifications to replace some pre-grouped data columns with randomly populated data. You can check out exactly how we did this using Polars [here](https://colab.research.google.com/drive/174EJvK8u8mGGWb6ypLH9SKaeRnX-pEou?usp=share_link). "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OjL01I5c_G1m"
+ },
+ "source": [
+ "## Data owner's POV\n",
+ "___________________________________________\n",
+ "\n",
+ "### Launching the server\n",
+ "\n",
+ "Let's start by putting ourselves in the shoes of the data owner.\n",
+ "\n",
+ "But before we can do anything more, the BastionLab server must be running.\n",
+ "\n",
+ "In production we recommend this is done using our Docker image, but for testing purposes you can use our `bastionlab_server` package, which removes the need for user authentication."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "A85GsYOi_G1o",
+ "outputId": "29d2505e-8106-4311-cba2-05d1ae6101ac"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "BastionLab server (version 0.3.7) already installed\n",
+ "Libtorch (version 1.13.1) already installed\n",
+ "TLS certificates already generated\n",
+ "Bastionlab server is now running on port 50056\n"
+ ]
+ }
+ ],
+ "source": [
+ "# launch bastionlab_server test package\n",
+ "import bastionlab_server\n",
+ "\n",
+ "srv = bastionlab_server.start()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IBWNyTnz_G1p"
+ },
+ "source": [
+ ">*For more details on how you can set up the server using our Docker image, check out our [Installation Tutorial](../getting-started/installation.md).*\n",
+ "\n",
+ "### Connecting to the server\n",
+ "Next, we will connect to the server in order to be able to upload the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "id": "6zzV7xrs_G1q"
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "[2023-02-17T16:18:15Z INFO bastionlab] Authentication is disabled.\n",
+ "[2023-02-17T16:18:15Z INFO bastionlab] Telemetry is enabled.\n",
+ "[2023-02-17T16:18:15Z INFO bastionlab] BastionLab server listening on 0.0.0.0:50056.\n",
+ "[2023-02-17T16:18:15Z INFO bastionlab] Server ready to take requests\n"
+ ]
+ }
+ ],
+ "source": [
+ "# connecting to the server\n",
+ "from bastionlab import Connection\n",
+ "\n",
+ "connection = Connection(\"localhost\")\n",
+ "client = connection.client"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "K9DO7gVt_G1r"
+ },
+ "source": [
+ "### Creating a custom privacy policy\n",
+ "\n",
+ "We can now create a [custom access policy](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/) for the dataset which determines how much access collaborators will get to the dataset. \n",
+ "\n",
+ "In this example, we create a policy with the following configuration:\n",
+ "\n",
+ "-> `Aggregation(min_agg_size=10)`: Any data extracted from the dataset should be the result of an aggregation of at least ten rows.\n",
+ "\n",
+ "-> `unsafe_handling=Reject()`: Any attempted query which breaches this policy will be rejected by the server.\n",
+ "\n",
+ "-> `savable=True`: The data scientist can save changes made to the dataset in BastionLab (this will create a new dataset; it will not overwrite the original dataset).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "id": "mRJjgd1C_G1t"
+ },
+ "outputs": [],
+ "source": [
+ "from bastionlab.polars.policy import Policy, Aggregation, Reject\n",
+ "\n",
+ "# defining the dataset's privacy policy\n",
+ "policy = Policy(Aggregation(min_agg_size=10), unsafe_handling=Reject(), savable=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Q7HHSM3e_G1v"
+ },
+ "source": [
+ "### Uploading the dataset\n",
+ "\n",
+ "Now that the policy has been created, we can upload the dataset to the BastionLab server instance.\n",
+ "\n",
+ "Firstly, we need to convert our CSV file into a Polars DataFrame by using the Polars `read_csv` function, supplying the path to the CSV file as a string argument.\n",
+ "\n",
+ "Next, we use BastionLab's `client.polars.send_df` to upload the dataframe with our custom policy.\n",
+ "\n",
+ "Finally, we save the FetchableLazyFrame using the `save` method with no arguments. We can make a note of the FetchableLazyFrame's identifier to be shared with data scientists to help them to remotely access the FetchableLazyFrame!\n",
+ "\n",
+ ">Note we need to save FetchableLazyFrames to avoid them being lost when the server is stopped and restarted or crashes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "lkMIl0ar_G1w",
+ "outputId": "836022c7-273a-4f98-cc84-baf1721c3412"
+ },
+ "outputs": [],
+ "source": [
+ "import polars as pl\n",
+ "\n",
+ "# converting the dataset into a Polars dataframe\n",
+ "df = pl.read_csv(\"updated_diabetes_data.csv\")\n",
+ "\n",
+ "# uploading the dataframe to BastionLab's server\n",
+ "# together with the custom privacy policy\n",
+ "rdf = client.polars.send_df(df, policy=policy)\n",
+ "\n",
+ "# saving the FetchableLazyFrame\n",
+ "rdf.save()\n",
+ "# get and print out a copy of the RDF identifier string\n",
+ "ID = rdf.identifier\n",
+ "print(ID)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ywAyp-2y_G1y"
+ },
+ "source": [
+ "`send_df()` will return a FetchableLazyFrame instance, which we will work with directly from now on. \n",
+ "\n",
+ ">Note that we talk about two types of LazyFrames in BastionLab: `RemoteLazyFrames` and `FetchableLazyFrames`. \n",
+ "\n",
+ "A `RemoteLazyFrame` just means we have called some functions and not yet `collected` the results, which means the operations have not yet been run on the server-side. When we call `collect()` these operations are run server-side and the result of this is our `FetchableLazyFrame`!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YRC1y4uX_G10"
+ },
+ "source": [
+ "Let's finish off by testing what happens if we breach our security policy by trying to display an entire column from our dataset with the `collect().fetch()` methods. \n",
+ "\n",
+ ">*You can learn more about how to use both of those methods in [our quick tour](https://bastionlab.readthedocs.io/en/latest/docs/quick-tour/quick-tour/#running-queries).*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "C7j4vdDd_G10",
+ "outputId": "cfa18f9b-5606-4e38-ba96-b6bcbf4b44a7"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[31mThe query has been rejected by the data owner.\u001b[37m\n"
+ ]
+ }
+ ],
+ "source": [
+ "rdf.select(\"age\").collect().fetch()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "x1Zu2YQi_G11"
+ },
+ "source": [
+ "Instead of getting back the results of our query, we see an error message: `The query has been rejected by the data owner.`\n",
+ "\n",
+ "We cannot view the output of the query because it does not aggregate at least 10 rows of data as specified in our privacy policy. It tries to print out individual rows instead!\n",
+ "\n",
+ "Now that the dataset has been uploaded, it's time for our data scientists to get working... \n",
+ "\n",
+ "The data owner can now close their connection to the server."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "mcM4pR6D_G11"
+ },
+ "outputs": [],
+ "source": [
+ "connection.close()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HJzNveFG_G13"
+ },
+ "source": [
+ "## Data scientist #1's POV\n",
+ "__________________________________________\n",
+ "\n",
+ "### Connecting to the dataset\n",
+ "\n",
+ "We'll now jump into the role of the data scientist responsible for cleaning the dataset for this data analysis project.\n",
+ "\n",
+ "We first need to connect to the BastionLab server and get a FetchableLazyFrame instance of the dataset. To do this, we'll use the `get_df()` method and supply it with the identifier shared with us by the data owner.\n",
+ "\n",
+ "We store our FetchableLazyFrame in the `rdf` variable which we'll be working with from here on."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "TT3mSjII_G13",
+ "outputId": "3e048fa0-5f0f-4244-f369-a9d87580b225"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "FetchableLazyFrame(identifier=0c7f2bcc-5afc-4a0a-b10f-24d796195045)"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "connection = Connection(\"localhost\")\n",
+ "client = connection.client\n",
+ "\n",
+ "# selecting the FetchableLazyFrame(s) we'll be working with\n",
+ "rdf = client.polars.get_df(ID)\n",
+ "rdf"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AEFbeESX_G14"
+ },
+ "source": [
+ "Let's display the dataset's columns to confirm we are connected to the correct one."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "G-g8rOnj_G15",
+ "outputId": "8538d93f-bf36-456e-c028-90cd724dd829"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted']\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(rdf.columns)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "H55DTcKn_G15"
+ },
+ "source": [
+ "Everything is as expected! We can now start our data exploration. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CQC7tfaF_G15"
+ },
+ "source": [
+ "## Data cleaning\n",
+ "__________________________________________\n",
+ "\n",
+ "\n",
+ "### Dropping columns\n",
+ "As you may have noticed, this dataset contains a lot of columns! This is great as it gives us a wide choice of correlations to explore. However, we will not have time to explore all of them in this analysis. We can therefore drop the columns that we won't be using, either because they are irrelevant or because they didn't lead us to the most interesting correlations for this analysis.\n",
+ "\n",
+ "We can do this by using the `drop` method, providing it with a list of the names of columns to be dropped. This is a RemoteLazyFrame method which corresponds directly to the [Polars drop() function](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.drop.html#polars.LazyFrame.drop)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "s0NI6rTqOKWN"
+ },
+ "outputs": [],
+ "source": [
+ "# list of column names we wish to remove from our RemoteLazyFrame\n",
+ "to_drop = [\n",
+ " \"encounter_id\",\n",
+ " \"patient_nbr\",\n",
+ " \"weight\",\n",
+ " \"discharge_disposition_id\",\n",
+ " \"admission_source_id\",\n",
+ " \"time_in_hospital\",\n",
+ " \"payer_code\",\n",
+ " \"medical_specialty\",\n",
+ " \"num_lab_procedures\",\n",
+ " \"num_procedures\",\n",
+ " \"num_medications\",\n",
+ " \"number_outpatient\",\n",
+ " \"number_inpatient\",\n",
+ " \"number_diagnoses\",\n",
+ " \"diabetesMed\",\n",
+ "]\n",
+ "\n",
+ "# replace rdf with our updated RemoteLazyFrame with to_drop columns deleted\n",
+ "rdf = rdf.drop(to_drop)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vabmc_jjOQCo"
+ },
+ "source": [
+ "There are now 35 columns to work with instead of 50, which will make the RemoteLazyFrame a little easier to work with!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7ausY-PC_G16"
+ },
+ "source": [
+ "\n",
+ "### Checking for null values\n",
+ "\n",
+ "We now want to assess how many null values we have in each column. This will help us to know if we have enough data to draw meaningful conclusions from each column and gives us the chance to fill or delete null values if relevant.\n",
+ "\n",
+ "However, based on the description of the dataset shared with us by the data owner, we know that some column cells have been filled with '?' instead of being left blank.\n",
+ "\n",
+ "Before we can get an accurate picture of null values, we first need to replace all these '?' values with null values. We will do this by using the [Polars `when().then().otherwise()` functions](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.when.html). \n",
+ "\n",
+ "One final hurdle is that we can only search and replace '?' strings in string columns, which have the 'Utf8' datatype; applying the comparison to any other column type will produce an error. We must therefore only apply our search and replace operation to string columns!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "F2KwhZB_fTC3"
+ },
+ "outputs": [],
+ "source": [
+ "# step one: getting a list of all Utf8/string columns\n",
+ "selects = []\n",
+ "for x in rdf.columns:\n",
+ " if rdf.select(x).dtypes == [pl.datatypes.Utf8]:\n",
+ " selects.append(x)\n",
+ "\n",
+ "# step two: we replace all '?' cells in these columns with null values\n",
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(x) == \"?\").then(None).otherwise(pl.col(x)).keep_name()\n",
+ " for x in selects\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "c1Frpi9GUtdW"
+ },
+ "source": [
+ "In step two, we use the Polars `with_columns` function to add our new columns with null values instead of question marks to our RemoteLazyFrame. By using the `keep_name` function, these columns keep their original column name and therefore replace the original columns in the dataset. We save the result as `rdf`, storing the updated version of the dataset in our `rdf` variable."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vMMX8JZnKitA"
+ },
+ "source": [
+ "Now that this is done, we can go ahead and calculate how many null values each column contains.\n",
+ "\n",
+ "We do this by applying one expression to every column with `pl.all()`: we `sum` the values for which `is_null` returns `True` and express that sum as a percentage of the column's row `count`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "SAqqUz6I_G16"
+ },
+ "outputs": [],
+ "source": [
+ "# getting each column's percentage of null values in the RemoteLazyFrame\n",
+ "percent_missing = rdf.select(\n",
+ " [\n",
+ " pl.all().is_null().sum() * 100 / pl.all().count(),\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3uMcNqVZWhdN"
+ },
+ "source": [
+ "We can then view the percentage of null values for each column as a two-column table by using the Polars `melt` function to reshape the query results from a single wide row (one column per dataset column) into a long table with one row per dataset column. We use the `sort` function to order the columns from the highest percentage of null values to the lowest.\n",
+ "\n",
+ "Finally, we remove any columns with no null values from our output since they are not of interest to us here."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 285
+ },
+ "id": "Pzz5qvSJWd2V",
+ "outputId": "26229368-72d0-4630-f8e2-5a12f480f297"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<small>shape: (7, 2)</small>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "<thead>\n",
+ "<tr><th>column name</th><th>null values (%)</th></tr>\n",
+ "<tr><td>str</td><td>f64</td></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><td>&quot;max_glu_serum&quot;</td><td>94.746772</td></tr>\n",
+ "<tr><td>&quot;A1Cresult&quot;</td><td>83.277322</td></tr>\n",
+ "<tr><td>&quot;readmitted&quot;</td><td>53.911916</td></tr>\n",
+ "<tr><td>&quot;race&quot;</td><td>2.233555</td></tr>\n",
+ "<tr><td>&quot;diag_3&quot;</td><td>1.398306</td></tr>\n",
+ "<tr><td>&quot;diag_2&quot;</td><td>0.351787</td></tr>\n",
+ "<tr><td>&quot;diag_1&quot;</td><td>0.020636</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ "shape: (7, 2)\n",
+ "┌───────────────┬─────────────────┐\n",
+ "│ column name ┆ null values (%) │\n",
+ "│ --- ┆ --- │\n",
+ "│ str ┆ f64 │\n",
+ "╞═══════════════╪═════════════════╡\n",
+ "│ max_glu_serum ┆ 94.746772 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ A1Cresult ┆ 83.277322 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ readmitted ┆ 53.911916 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ race ┆ 2.233555 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ diag_3 ┆ 1.398306 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ diag_2 ┆ 0.351787 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ diag_1 ┆ 0.020636 │\n",
+ "└───────────────┴─────────────────┘"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# melt into a two-column table ('column name', 'null values (%)') and sort by percentage of null values in descending order\n",
+ "percent_missing = percent_missing.melt(\n",
+ " variable_name=\"column name\",\n",
+ " value_name=\"null values (%)\",\n",
+ ").sort(pl.col(\"null values (%)\"), reverse=True)\n",
+ "\n",
+ "# filter out columns with no null values and display\n",
+ "percent_missing.filter(pl.col(\"null values (%)\") > 0).collect().fetch()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4n0jnBPyYLjf"
+ },
+ "source": [
+ "There are several strategies for dealing with null values, such as deleting the affected rows from the dataset with the `drop_nulls` method or filling null values with the `fill_null` method. In our case, we are happy simply to know which columns include null values and to what extent, so that we can handle and analyse these columns with this in mind."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-WUugovwve2c"
+ },
+ "source": [
+ "### Grouping data: ICD-9 medical codes\n",
+ "Grouping data is going to be the largest and most crucial task in this data cleaning job. This is a dataset with a lot of wide-ranging numerical values which need to be grouped so that our data analysts can gain meaningful insights.\n",
+ "\n",
+ "Let's start with our diagnoses columns: `diag_1`, `diag_2` and `diag_3`.\n",
+ "\n",
+ "These columns contain the primary, secondary and tertiary diagnoses given to patients. These diagnoses are given using [ICD-9 medical codes](https://en.wikipedia.org/wiki/List_of_ICD-9_codes), which are three-digit codes ranging from 001 to 999, as well as E800–E999 codes and V01–V82 codes.\n",
+ "\n",
+ "By grabbing all the unique values in the `diag_1` column and counting them, we can see that we have over 700 different values in this column!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 121
+ },
+ "id": "7pVHpmLWj6_w",
+ "outputId": "c7d50a9f-f919-4893-a1f4-50b1ba7d20c5"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<small>shape: (1, 1)</small>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "<thead>\n",
+ "<tr><th>diag_1</th></tr>\n",
+ "<tr><td>u32</td></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><td>717</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ "shape: (1, 1)\n",
+ "┌────────┐\n",
+ "│ diag_1 │\n",
+ "│ --- │\n",
+ "│ u32 │\n",
+ "╞════════╡\n",
+ "│ 717 │\n",
+ "└────────┘"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tmp = rdf.select(\"diag_1\").unique()\n",
+ "tmp.select(pl.col(\"diag_1\").count()).collect().fetch()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cPsmfkBpkPCv"
+ },
+ "source": [
+ "Standard groupings of these codes have already been designed. What we want to do is replace the hundreds of unique codes we have in our diagnoses columns with these groupings!\n",
+ "\n",
+ "To do this, we will again use Polars `when().then().otherwise()` functions to perform a find and replace operation. We will use `when()` to check if the codes in each cell are either E or V codes or fall within a certain numerical range.\n",
+ "\n",
+ "However, these diagnoses columns are currently string columns, since the E and V codes are not entirely numerical. This is problematic since we cannot perform numerical comparisons on these cells and we cannot convert the column type to a numerical one because of these 'E' and 'V' values!\n",
+ "\n",
+ "We will solve this problem in three steps:\n",
+ "\n",
+ "1) We will find and replace all E codes with a \"-1\" value and V codes with a \"-2\" value.\n",
+ "\n",
+ "2) We will `select()` our columns and `cast()` all values in these columns to float values.\n",
+ "\n",
+ "3) We will perform the find and replace operation to group all ICD-9 codes into their associated groups, of which there are 17, plus E codes and V codes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "xPNFpZ7lW8qR"
+ },
+ "outputs": [],
+ "source": [
+ "# iterate over the three diagnoses columns\n",
+ "for col in [\"diag_1\", \"diag_2\", \"diag_3\"]:\n",
+ " # step one: replace troublesome E and V codes with temporary -1 and -2 codes\n",
+ " rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(\n",
+ " pl.col(col).str.starts_with(\"E\")\n",
+ " ) # use Polars str.starts_with method to identify E codes\n",
+ " .then(\"-1\")\n",
+ " .when(pl.col(col).str.starts_with(\"V\"))\n",
+ " .then(\"-2\")\n",
+ " .otherwise(pl.col(col))\n",
+ " .keep_name()\n",
+ " ]\n",
+ " )\n",
+ "\n",
+ " # step two: cast all values in column to float values\n",
+ " rdf = rdf.with_columns([pl.col(col).cast(pl.Float64)])\n",
+ "\n",
+ " # step three: replace all codes with their corresponding group\n",
+ " rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(col) >= 800)\n",
+ " .then(\"injury and poisoning\")\n",
+ " .when(pl.col(col) >= 780)\n",
+ " .then(\"symptoms, signs & ill-defined\")\n",
+ " .when(pl.col(col) >= 760)\n",
+ " .then(\"perinatal\")\n",
+ " .when(pl.col(col) >= 740)\n",
+ " .then(\"congenital anomalies\")\n",
+ " .when(pl.col(col) >= 710)\n",
+ " .then(\"musculoskeletal & connective tissue\")\n",
+ " .when(pl.col(col) >= 680)\n",
+ " .then(\"skin\")\n",
+ " .when(pl.col(col) >= 630)\n",
+ "        .then(\"pregnancy, childbirth and puerperium\")\n",
+ " .when(pl.col(col) >= 580)\n",
+ " .then(\"genitourinary\")\n",
+ " .when(pl.col(col) >= 520)\n",
+ " .then(\"digestive\")\n",
+ " .when(pl.col(col) >= 460)\n",
+ " .then(\"respiratory\")\n",
+ " .when(pl.col(col) >= 390)\n",
+ " .then(\"circulatory\")\n",
+ " .when(pl.col(col) >= 320)\n",
+ " .then(\"nervous system and sense organs\")\n",
+ " .when(pl.col(col) >= 290)\n",
+ " .then(\"mental disorders\")\n",
+ " .when(pl.col(col) >= 280)\n",
+ " .then(\"blood and blood-forming organs\")\n",
+ "        .when(pl.col(col) >= 240)\n",
+ "        .then(\"endocrine, nutritional, metabolic and immunity\")\n",
+ "        .when(pl.col(col) >= 140)\n",
+ "        .then(\"neoplasms\")\n",
+ " .when(pl.col(col) >= 1)\n",
+ " .then(\"infectious and parasitic\")\n",
+ " .when(pl.col(col) == -1)\n",
+ "        .then(\"E code (injury)\")\n",
+ " .when(pl.col(col) == -2)\n",
+ " .then(\"V code (other)\")\n",
+ " .otherwise(\n",
+ " pl.col(col)\n",
+ " ) # otherwise (null values) keep original value from the column\n",
+ " .alias(\n",
+ " col\n",
+ " ) # give resulting column same name as previously- therefore replacing old columns\n",
+ " ]\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "P1MquUrNlXDO"
+ },
+ "source": [
+ "By performing the same query as previously to count `diag_1`'s unique values, we see there are now a much more manageable 19 labels in our data column! This will be similar for the `diag_2` and `diag_3` columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 121
+ },
+ "id": "YfC9CmWWdu0n",
+ "outputId": "c81284d2-8e09-49b6-f411-512da2421902"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<small>shape: (1, 1)</small>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "<thead>\n",
+ "<tr><th>diag_1</th></tr>\n",
+ "<tr><td>u32</td></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><td>19</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ "shape: (1, 1)\n",
+ "┌────────┐\n",
+ "│ diag_1 │\n",
+ "│ --- │\n",
+ "│ u32 │\n",
+ "╞════════╡\n",
+ "│ 19 │\n",
+ "└────────┘"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tmp = rdf.select(\"diag_1\").unique()\n",
+ "tmp.select(pl.col(\"diag_1\").count()).collect().fetch()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BvdGu7GmsZVu"
+ },
+ "source": [
+ "### Grouping data: A1C, max glucose levels and readmittance\n",
+ "\n",
+ "We want to group together data in three other columns using the same `when().then().otherwise()` methods.\n",
+ "\n",
+ "The first two are `A1Cresult`, which contains patients' HbA1c level, and `max_glu_serum`, which contains their blood glucose level. We want to group these into `very high`, `high` and `normal` groups based on levels defined in our project brief.\n",
+ "\n",
+ "These columns are both currently string columns, so we will also need to convert them to float values in order to perform numerical comparisons on them."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "FgyrnPAlsZ0u"
+ },
+ "outputs": [],
+ "source": [
+ "# cast `max_glu_serum` and `A1Cresult` columns to float values\n",
+ "rdf = rdf.with_columns(\n",
+ " [pl.col(\"max_glu_serum\").cast(pl.Float64), pl.col(\"A1Cresult\").cast(pl.Float64)]\n",
+ ")\n",
+ "\n",
+ "# group values in A1Cresult column\n",
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(\"A1Cresult\") >= 8)\n",
+ " .then(\"very high\")\n",
+ " .when(pl.col(\"A1Cresult\") >= 7)\n",
+ " .then(\"high\")\n",
+ " .when(pl.col(\"A1Cresult\") >= 0)\n",
+ " .then(\"normal\")\n",
+ " .otherwise(pl.col(\"A1Cresult\"))\n",
+ " .keep_name()\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "# group values in max_glu_serum column\n",
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(\"max_glu_serum\") >= 300)\n",
+ " .then(\"very high\")\n",
+ " .when(pl.col(\"max_glu_serum\") >= 200)\n",
+ " .then(\"high\")\n",
+ " .when(pl.col(\"max_glu_serum\") >= 0)\n",
+ " .then(\"normal\")\n",
+ " .otherwise(pl.col(\"max_glu_serum\"))\n",
+ " .keep_name()\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Buu2nja5w6Db"
+ },
+ "source": [
+ "The final column we want to group is the `readmitted` column which records the number of days before any further re-hospitalization linked to the patients' diabetic condition.\n",
+ "\n",
+ "We will group this column into `short-term`, `long-term` and `n/a` (not applicable) groups.\n",
+ "\n",
+ "As in the previous examples, we must first convert values in this column from strings to integer values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "9cca9DhSw6We"
+ },
+ "outputs": [],
+ "source": [
+ "# cast readmitted column to integer values\n",
+ "rdf = rdf.with_columns([pl.col(\"readmitted\").cast(pl.Int64)])\n",
+ "\n",
+ "# group values\n",
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(\"readmitted\") < 31)\n",
+ " .then(\"short-term\")\n",
+ " .when(pl.col(\"readmitted\") >= 31)\n",
+ " .then(\"long-term\")\n",
+ " .otherwise(\"n/a\")\n",
+ " .keep_name()\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kuwxAGYBoOQJ"
+ },
+ "source": [
+ "### Grouping data: binning ages\n",
+ "The next grouping task we will perform is to group ages into intervals of 10 years. We do this both to increase data privacy and to more easily draw correlations linked to broader age groups.\n",
+ "\n",
+ "We won't need to perform a `when().then().otherwise()` query here since BastionLab has its own `ApplyBins` tool.\n",
+ "\n",
+ "`ApplyBins` is a PyTorch module and the grouping of numbers takes place in its `forward` function. We can pass PyTorch modules to BastionLab's `apply_udf` function which will apply the `forward` function to any specified columns.\n",
+ "\n",
+ "All in all, we need just three steps to bin our age column data:\n",
+ "\n",
+ "1) We import `ApplyBins` from `bastionlab.polars.utils`.\n",
+ "2) We instantiate our `ApplyBins` PyTorch module class with our bin interval given as the only argument.\n",
+ "3) We use `apply_udf`, providing a list of the columns we want to modify and the PyTorch module, `ApplyBins`, that we wish to apply to these columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2EC3smnWc06Q"
+ },
+ "outputs": [],
+ "source": [
+ "from bastionlab.polars.utils import ApplyBins\n",
+ "\n",
+ "# get an instance of ApplyBins module which will bin data into groups of 10\n",
+ "model = ApplyBins(10)\n",
+ "\n",
+ "# apply bins to \"age\" column\n",
+ "rdf = rdf.apply_udf([\"age\"], model)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1pOQYPYSsVns"
+ },
+ "source": [
+ "> Note, you can create your own custom PyTorch modules and apply them to columns using `apply_udf`. This is BastionLab's way of allowing you to apply custom functions to datasets whilst restricting what you can do for security reasons. Features such as `lambda`, `map` and `apply` are blocked by BastionLab as they are too permissive and could be misused."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gYRVmqTitckT"
+ },
+ "source": [
+ "### Adding columns\n",
+ "\n",
+ "Up until this point we have been using the `.when().then().otherwise()` and `with_columns` methods to make changes to existing columns, but by providing a new column name to the `alias` method, we can create a new column.\n",
+ "\n",
+ "In the following example, we will create an `is_readmitted` column which will store `False` for all the \"n/a\" values in our original `readmitted` column and `True` for any other values. This will allow us to quickly query whether certain groups of data have been readmitted or not!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "B2JGdBhmteAz"
+ },
+ "outputs": [],
+ "source": [
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(\"readmitted\") == \"n/a\")\n",
+ " .then(False)\n",
+ " .otherwise(True)\n",
+ " .alias(\n",
+ " \"is_readmitted\"\n",
+ " ) # ending the .when().then().otherwise() pattern with .alias() allows us to provide a new column name\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "edoL2_uy_G19"
+ },
+ "source": [
+ "### Converting column types\n",
+ "\n",
+ "We have already seen examples where we have *explicitly* converted the datatype of our columns using the `cast` method. Here we will *implicitly* convert the datatype by replacing the \"Yes\" and \"No\" values in our `change` column, which represent whether a patient's medication has been changed, with boolean `True` or `False` values.\n",
+ "\n",
+ "The datatype of this column will be changed automatically by this operation as we can see below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "fMhSrD8__G19",
+ "outputId": "5230be79-58b9-4318-c5bb-052cd03e35d1"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[polars.datatypes.Utf8]"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# print out initial datatype of \"change\" column\n",
+ "\n",
+ "rdf.select(\"change\").dtypes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "hYWJ9FB70mcM",
+ "outputId": "cc2736c7-e4be-48dd-805d-352ba0d6196e"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[polars.datatypes.Boolean]"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# replaces Yes/No values with True/False\n",
+ "rdf = rdf.with_columns(\n",
+ " [pl.when(pl.col(\"change\") == \"No\").then(False).otherwise(True).keep_name()]\n",
+ ")\n",
+ "\n",
+ "# print out datatype of column post find and replace operation\n",
+ "rdf.select(\"change\").dtypes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CYS-Mkl1tD8t"
+ },
+ "source": [
+ "### Saving our RemoteLazyFrame and disconnecting\n",
+ "\n",
+ "Our dataframe is all clean and ready for the next step: data analysis and visualization. Data scientist #1 is going to be reassigned to another task. They will save their cleaned RemoteLazyFrame and make a note of the identifier to share with data scientist #2.\n",
+ "\n",
+ "We need to perform `collect()` before saving or getting an identifier for our RemoteLazyFrame since the `save` method and `identifier` attribute are only available for FetchableLazyFrames.\n",
+ "\n",
+ ">Note, the data owner must have set the `savable` option to `True` when uploading the dataframe for this operation to be possible!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "id": "DWu6ToX53bm9",
+ "outputId": "3063c7ae-df03-4b74-d7a3-e2ceffc56083"
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'49b66d7a-6c80-45fb-8278-9992c91f8666'"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# collect once so that the frame we save is the same one whose identifier we share\n",
+ "rdf = rdf.collect()\n",
+ "rdf.save()\n",
+ "saved_identifier = rdf.identifier\n",
+ "saved_identifier"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NgkiBinG6DJ2"
+ },
+ "source": [
+ "They can now close their connection to the BastionLab server."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "qoiADM1W6OC_"
+ },
+ "outputs": [],
+ "source": [
+ "connection.close()"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "base",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.13 (main, Aug 25 2022, 23:26:10) \n[GCC 11.2.0]"
+ },
+ "orig_nbformat": 4,
+ "vscode": {
+ "interpreter": {
+ "hash": "d130ca42b532f14c740c9405384e6a25814bad609bad1a40b3b3f26954036080"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/mkdocs.yml b/mkdocs.yml
index 7bd70590..f2d8b068 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -107,6 +107,7 @@ nav:
- Data exploration:
- Covid-19 cleaning and exploration: "docs/how-to-guides/covid_cleaning_exploration.ipynb"
- Fraud detection cleaning and exploration: "docs/how-to-guides/fraud_detection.ipynb"
+ - Diabetes cleaning and exploration- part one: "docs/how-to-guides/diabetes_p1.ipynb"
- Deep learning:
- Fine Tuning Distilbert on BastionLab: "docs/how-to-guides/distilbert_example_notebook.ipynb"
- 🛠️ API reference: "docs/resources/bastionlab/index.html"
From 9bf86e736c5a6c1d53c2e47ba8500ba2226677f7 Mon Sep 17 00:00:00 2001
From: lyie28
Date: Wed, 22 Feb 2023 09:25:27 +0100
Subject: [PATCH 02/22] Updated diabetes
---
.../how-to-guides/diabetes_exploration.ipynb | 2753 +++++++++++++++++
mkdocs.yml | 2 +-
2 files changed, 2754 insertions(+), 1 deletion(-)
create mode 100644 docs/docs/how-to-guides/diabetes_exploration.ipynb
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
new file mode 100644
index 00000000..29dfaab5
--- /dev/null
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -0,0 +1,2753 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jdvo0Bjb_G1c"
+ },
+ "source": [
+ "\n",
+ "# Data exploration of diabetes hospital admissions: Part I\n",
+ "______________________________________________________\n",
+ "\n",
+ "Despite major technological breakthroughs in cybersecurity and privacy in recent years, secure off-premises data science collaboration has remained out of reach. This is a major problem for the health sector which has so much to gain from the power of data but also so much at risk when it comes to patients' highly sensitive medical records.\n",
+ "\n",
+ "We are on a mission to make remote data science collaboration safe for the health sector. Using BastionLab, data owners can set strict access policies on datasets for collaborators, allowing them to run privacy-friendly queries and train and deploy ML models on datasets whilst blocking access to raw data.\n",
+ "\n",
+ "In this how-to guide, we will explore a dataset of diabetic patients admitted to hospital in the US over a ten-year period. Diabetes is a disease that affects over 10% of the US population and can lead to serious health complications. The dataset contains 50 columns of data, including readmission to hospital, changes to medication and primary, secondary and tertiary patient diagnoses.\n",
+ "\n",
+ "In part I of this two-part data exploration, we will see how the data owner can upload a dataset to BastionLab and how a data scientist can then connect to BastionLab and **clean the dataset**.\n",
+ "\n",
+ "But before we can do that, we first need to get everything set up!\n",
+ "\n",
+ "## Pre-requisites\n",
+ "___________________________________________\n",
+ "\n",
+ "### Installation and dataset\n",
+ "\n",
+ "In order to run this notebook, we need to:\n",
+ "- Ensure we have [Python3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed\n",
+ "- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) and the [BastionLab server](https://pypi.org/project/bastionlab-server/0.3.7/) pip packages\n",
+ "- [Download the dataset](https://drive.google.com/file/d/1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI/view?usp=share_link) we will be using in this notebook.\n",
+ "\n",
+ "You can download the BastionLab pip packages and the dataset by running the following code block.\n",
+ "\n",
+ ">To find out about other ways you can install and run BastionLab, see our [Installation page](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "hK-HDaMI_G1j"
+ },
+ "outputs": [],
+ "source": [
+ "# installing BastionLab client & server packages\n",
+ "!pip install bastionlab\n",
+ "!pip install bastionlab_server\n",
+ "\n",
+ "# downloading the dataset using the Google Drive tool gdown\n",
+ "!pip install gdown\n",
+ "!pip install --upgrade --no-cache-dir gdown\n",
+ "!gdown \"1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NJ67chDB_G1l"
+ },
+ "source": [
+ "The dataset we are using for this how-to guide is based on the Diabetes 130-US hospitals for years 1999-2008 dataset. It contains 10 years of data on diabetes admissions from 130 US hospitals. It includes over 50 features representing patient and hospital outcomes.\n",
+ "\n",
+ ">For more detailed information on the dataset, you can check out the description and full dataset by following this [link](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).\n",
+ "\n",
+ "However, this dataset had already been pre-processed before publication which stopped us from showing you some key data cleaning steps. We therefore made a few modifications to replace some pre-grouped data columns with randomly populated data. You can check out exactly how we did this using Polars [here](https://colab.research.google.com/drive/174EJvK8u8mGGWb6ypLH9SKaeRnX-pEou?usp=share_link). "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OjL01I5c_G1m"
+ },
+ "source": [
+ "## Data owner's POV\n",
+ "___________________________________________\n",
+ "\n",
+ "### Launching the server\n",
+ "\n",
+ "Let's start by putting ourselves in the shoes of the data owner.\n",
+ "\n",
+ "But before we can do anything more, the BastionLab server must be running.\n",
+ "\n",
+ "In production we recommend this is done using our Docker image, but for testing purposes you can use our `bastionlab_server` package, which removes the need for user authentication."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 193,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "A85GsYOi_G1o",
+ "outputId": "97b964bd-61b6-4cc6-e5e7-b9f2a2587bd7"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "BastionLab server (version 0.3.7) already installed\n",
+ "Libtorch (version 1.13.1) already installed\n",
+ "TLS certificates already generated\n",
+ "Bastionlab server is now running on port 50056\n"
+ ]
+ }
+ ],
+ "source": [
+ "# launch bastionlab_server test package\n",
+ "import bastionlab_server\n",
+ "\n",
+ "srv = bastionlab_server.start()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IBWNyTnz_G1p"
+ },
+ "source": [
+ ">*For more details on how you can set up the server using our Docker image, check out our [Installation Tutorial](../getting-started/installation.md).*\n",
+ "\n",
+ "### Connecting to the server\n",
+ "Next, we will connect to the server in order to be able to upload the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 194,
+ "metadata": {
+ "id": "6zzV7xrs_G1q"
+ },
+ "outputs": [],
+ "source": [
+ "# connecting to the server\n",
+ "from bastionlab import Connection\n",
+ "\n",
+ "connection = Connection(\"localhost\")\n",
+ "client = connection.client"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "K9DO7gVt_G1r"
+ },
+ "source": [
+ "### Creating a custom privacy policy\n",
+ "\n",
+ "We can now create a [custom access policy](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/) for the dataset which determines how much access collaborators will get to the dataset. \n",
+ "\n",
+ "In this example, we create a policy with the following configuration:\n",
+ "\n",
+ "-> `Aggregation(min_agg_size=10)`: Any data extracted from the dataset should be the result of an aggregation of at least ten rows.\n",
+ "\n",
+ "-> `unsafe_handling=Reject()`: Any attempted query which breaches this policy will be rejected by the server.\n",
+ "\n",
+ "-> `savable=True`: The data scientist can save changes made to the dataset in BastionLab (this will create a new dataset; it will not overwrite the original dataset).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 195,
+ "metadata": {
+ "id": "mRJjgd1C_G1t"
+ },
+ "outputs": [],
+ "source": [
+ "from bastionlab.polars.policy import Policy, Aggregation, Reject\n",
+ "\n",
+ "# defining the dataset's privacy policy\n",
+ "policy = Policy(Aggregation(min_agg_size=10), unsafe_handling=Reject(), savable=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Q7HHSM3e_G1v"
+ },
+ "source": [
+ "### Uploading the dataset\n",
+ "\n",
+ "Now that the policy has been created, we can upload the dataset to the BastionLab server instance.\n",
+ "\n",
+ "Firstly, we need to convert our CSV file into a Polars DataFrame by using the Polars `read_csv` function, supplying the path to the CSV file as a string argument.\n",
+ "\n",
+ "Next, we use BastionLab's `client.polars.send_df` to upload the dataframe with our custom policy.\n",
+ "\n",
+ "Finally, we save the FetchableLazyFrame using the `save` method with no arguments. We can make a note of the FetchableLazyFrame's identifier to be shared with data scientists to help them to remotely access the FetchableLazyFrame!\n",
+ "\n",
+ ">Note that we need to save FetchableLazyFrames to avoid losing them when the server crashes or is stopped and restarted."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 196,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "lkMIl0ar_G1w",
+ "outputId": "d8505f8e-5853-4adf-97f5-d6105db30761"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "63c8152d-f5af-41ec-b22c-aea51a8465b5\n"
+ ]
+ }
+ ],
+ "source": [
+ "import polars as pl\n",
+ "\n",
+ "# converting the dataset into a Polars dataframe\n",
+ "df = pl.read_csv(\"updated_diabetes_data.csv\")\n",
+ "\n",
+ "# uploading the dataframe and the custom privacy policy\n",
+ "# to BastionLab's server\n",
+ "rdf = client.polars.send_df(df, policy=policy)\n",
+ "\n",
+ "# saving the FetchableLazyFrame\n",
+ "rdf.save()\n",
+ "# get and print out a copy of the RDF identifier string\n",
+ "ID = rdf.identifier\n",
+ "print(ID)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ywAyp-2y_G1y"
+ },
+ "source": [
+ "`send_df()` will return a FetchableLazyFrame instance, which we will work with directly from now on. \n",
+ "\n",
+ ">Note that we talk about two types of LazyFrames in BastionLab: `RemoteLazyFrames` and `FetchableLazyFrames`. \n",
+ "\n",
+ "> In BastionLab, when we run a query, it is not immediately executed. As with Polars' LazyFrames, pending queries are only executed when we call `collect`. `FetchableLazyFrames` are BastionLab's remote lazy frames with no pending queries to run, either because we have just uploaded the dataframe or retrieved it using `get_df`, or because we have already run `collect` after our latest query. To display these lazy frames we call the `fetch` method, which will verify that the data frame is safe to display, i.e. that it is the result of a safe aggregated query as specified in the privacy policy.\n",
+ "\n",
+ "> A `RemoteLazyFrame` is just a `FetchableLazyFrame` with pending queries still to run (as they have not yet been `collected`). When we call `collect()` these operations are run server-side and the result of this is our `FetchableLazyFrame`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YRC1y4uX_G10"
+ },
+ "source": [
+ "Let's finish off by testing what happens if we breach our security policy by trying to display an entire column from our dataset with the `collect().fetch()` methods. \n",
+ "\n",
+ ">*You can learn more about how to use both of those methods in [our quick tour](https://bastionlab.readthedocs.io/en/latest/docs/quick-tour/quick-tour/#running-queries).*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 197,
+ "metadata": {
+ "id": "C7j4vdDd_G10",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "7941b960-a0e4-4e9d-f0a4-13ef5c9ba296"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "\u001b[31mThe query has been rejected by the data owner.\u001b[37m\n"
+ ]
+ }
+ ],
+ "source": [
+ "rdf.select(\"age\").collect().fetch()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "x1Zu2YQi_G11"
+ },
+ "source": [
+ "Instead of getting back the results of our query, we see an error message: `The query has been rejected by the data owner.`\n",
+ "\n",
+ "We cannot view the output of the query because it does not aggregate at least 10 rows of data as specified in our privacy policy. It tries to print out individual rows instead!\n",
+ "\n",
+ "Now that the dataset has been uploaded, it's time for our data scientists to get working... \n",
+ "\n",
+ "The data owner can now close their connection to the server."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 198,
+ "metadata": {
+ "id": "mcM4pR6D_G11"
+ },
+ "outputs": [],
+ "source": [
+ "connection.close()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HJzNveFG_G13"
+ },
+ "source": [
+ "## Data scientist #1's POV\n",
+ "__________________________________________\n",
+ "\n",
+ "### Connecting to the dataset\n",
+ "\n",
+ "We'll now jump into the role of the data scientist responsible for cleaning the dataset for this data analysis project.\n",
+ "\n",
+ "We first need to connect to the BastionLab server and get a FetchableLazyFrame instance of the dataset. To do this, we'll use the `get_df()` method and supply it with the identifier shared with us by the data owner.\n",
+ "\n",
+ "We store our FetchableLazyFrame in the `rdf` variable which we'll be working with from here on."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 199,
+ "metadata": {
+ "id": "TT3mSjII_G13",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "4a463355-2753-40d6-ce62-a2c8fa30c63a"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "FetchableLazyFrame(identifier=63c8152d-f5af-41ec-b22c-aea51a8465b5)"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 199
+ }
+ ],
+ "source": [
+ "connection = Connection(\"localhost\")\n",
+ "client = connection.client\n",
+ "\n",
+ "# selecting the FetchableLazyFrame(s) we'll be working with\n",
+ "rdf = client.polars.get_df(ID)\n",
+ "rdf"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "AEFbeESX_G14"
+ },
+ "source": [
+ "Let's display the dataset's columns to confirm we are connected to the correct one."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 200,
+ "metadata": {
+ "id": "G-g8rOnj_G15",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "797dedc6-f4c5-4bb7-8830-3c2b8295fbbc"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted']\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(rdf.columns)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "H55DTcKn_G15"
+ },
+ "source": [
+ "Everything is as expected! We can now start our data exploration. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CQC7tfaF_G15"
+ },
+ "source": [
+ "## Data cleaning\n",
+ "__________________________________________\n",
+ "\n",
+ "\n",
+ "### Dropping columns\n",
+ "As you may have noticed, this dataset contains a lot of columns! This is great as it gives us a wide choice of correlations to explore. However, we will not have time to explore all of them in this analysis! We can therefore drop the columns that we won't be using, either because they are irrelevant or because they didn't lead us to the most interesting correlations for this analysis!\n",
+ "\n",
+ "We can do this by using the `drop` method, providing it with a list of the names of columns to be dropped. This is a RemoteLazyFrame method which corresponds directly to the [Polars drop() function](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.drop.html#polars.LazyFrame.drop)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 201,
+ "metadata": {
+ "id": "s0NI6rTqOKWN"
+ },
+ "outputs": [],
+ "source": [
+ "# list of column names we wish to remove from our RemoteLazyFrame\n",
+ "to_drop = [\n",
+ " \"encounter_id\",\n",
+ " \"patient_nbr\",\n",
+ " \"weight\",\n",
+ " \"discharge_disposition_id\",\n",
+ " \"admission_source_id\",\n",
+ " \"time_in_hospital\",\n",
+ " \"payer_code\",\n",
+ " \"medical_specialty\",\n",
+ " \"num_lab_procedures\",\n",
+ " \"num_procedures\",\n",
+ " \"num_medications\",\n",
+ " \"number_outpatient\",\n",
+ " \"number_inpatient\",\n",
+ " \"number_diagnoses\",\n",
+ " \"diabetesMed\",\n",
+ "]\n",
+ "\n",
+ "# replace rdf with our updated RemoteLazyFrame with to_drop columns deleted\n",
+ "rdf = rdf.drop(to_drop)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vabmc_jjOQCo"
+ },
+ "source": [
+ "There are now 35 columns to work with instead of 50. This will make the RemoteLazyFrame a little easier to work with!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7ausY-PC_G16"
+ },
+ "source": [
+ "\n",
+ "### Checking for null values\n",
+ "\n",
+ "We now want to assess how many null values we have in each column. This will help us to know if we have enough data to draw meaningful conclusions from each column and gives us the chance to fill or delete null values if relevant.\n",
+ "\n",
+ "However, based on the description of the dataset shared with us by the data owner, we know that some column cells have been filled with '?' instead of being left blank.\n",
+ "\n",
+ "Before we can get an accurate picture of null values, we first need to replace all these '?' values with null values. We will do this by using [Polars `.when().then().otherwise()` functions](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.when.html). \n",
+ "\n",
+ "One final hurdle is that we can only search and replace '?' strings in columns with the 'Utf8' (string) datatype- otherwise an error will be produced. We must therefore firstly grab pl.Utf8 columns only and apply our search and replace operation to these strings!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 202,
+ "metadata": {
+ "id": "F2KwhZB_fTC3"
+ },
+ "outputs": [],
+ "source": [
+ "# step one: getting a list of all Utf8/string columns\n",
+ "selects = rdf.select(pl.col(pl.Utf8)).columns\n",
+ "\n",
+ "# step two: we replace all '?' cells in these columns with null values\n",
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(x) == \"?\").then(None).otherwise(pl.col(x)).keep_name()\n",
+ " for x in selects\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "c1Frpi9GUtdW"
+ },
+ "source": [
+ "In step two, we use the Polars `with_columns` function to add our new columns with null values instead of question marks to our RemoteLazyFrame. By using the `keep_name` function, these columns keep their original column name and therefore replace the original columns in the dataset. We save the result as `rdf`, storing the updated version of the dataset in our `rdf` variable."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vMMX8JZnKitA"
+ },
+ "source": [
+ "Now that this is done, we can go ahead and calculate how many null values each column contains.\n",
+ "\n",
+ "We do this by counting, for every column, the values for which `is_null` returns `True` (using `sum`) and expressing that count as a percentage of the column's total number of values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 203,
+ "metadata": {
+ "id": "SAqqUz6I_G16"
+ },
+ "outputs": [],
+ "source": [
+ "# getting each column's percentage of null values in the RemoteLazyFrame\n",
+ "percent_missing = rdf.select(\n",
+ " [\n",
+ " pl.all().is_null().sum() * 100 / pl.all().count(),\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3uMcNqVZWhdN"
+ },
+ "source": [
+ "We can then view the percentage of null values for each column as a two-column table by using the Polars `melt` function, which reshapes the query result from a single wide row (one value per dataset column) into a long table with one row per dataset column. We use the `sort` function to order the columns from the highest percentage of null values to the lowest.\n",
+ "\n",
+ "Finally, we remove any columns with no null values from our output since they are not of interest to us here."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 204,
+ "metadata": {
+ "id": "Pzz5qvSJWd2V",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 285
+ },
+ "outputId": "d0316533-6304-4dce-8357-e0caa0d897da"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "shape: (7, 2)\n",
+ "┌───────────────┬─────────────────┐\n",
+ "│ column name ┆ null values (%) │\n",
+ "│ --- ┆ --- │\n",
+ "│ str ┆ f64 │\n",
+ "╞═══════════════╪═════════════════╡\n",
+ "│ max_glu_serum ┆ 94.746772 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ A1Cresult ┆ 83.277322 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ readmitted ┆ 53.911916 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ race ┆ 2.233555 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ diag_3 ┆ 1.398306 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ diag_2 ┆ 0.351787 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ diag_1 ┆ 0.020636 │\n",
+ "└───────────────┴─────────────────┘"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 204
+ }
+ ],
+ "source": [
+ "# melt into a two-column table holding each column's name ('column name') and its percentage of null values ('null values (%)'), then sort in descending order\n",
+ "percent_missing = percent_missing.melt(\n",
+ " variable_name=\"column name\",\n",
+ " value_name=\"null values (%)\",\n",
+ ").sort(pl.col(\"null values (%)\"), reverse=True)\n",
+ "\n",
+ "# filter out columns with no null values and display\n",
+ "percent_missing.filter(pl.col(\"null values (%)\") > 0).collect().fetch()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4n0jnBPyYLjf"
+ },
+ "source": [
+ "There are several strategies for dealing with null values, such as deleting these rows from the dataset with the `drop_nulls` method or filling null values with the `fill_null` method. But in our case, we are just happy to have visibility over which columns include null values, and to what extent, so that we can handle and analyse these columns with this in mind."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-WUugovwve2c"
+ },
+ "source": [
+ "### Grouping data: ICD-9 medical codes\n",
+ "Grouping data is going to be the largest and most crucial task in this data cleaning job. This dataset contains a lot of wide-ranging numerical values which need to be grouped so that our data analysts can gain meaningful insights.\n",
+ "\n",
+ "Let's start with our diagnoses columns: `diag_1`, `diag_2` and `diag_3`.\n",
+ "\n",
+ "These columns contain the primary, secondary and tertiary diagnoses given to patients. These diagnoses are recorded using [ICD-9 medical codes](https://en.wikipedia.org/wiki/List_of_ICD-9_codes), which are three-digit codes ranging from 001 to 999 (optionally followed by a decimal subdivision), as well as E800–E999 codes and V01–V82 codes.\n",
+ "\n",
+ "By grabbing all the unique values in the `diag_1` column and counting them, we can see that we have over 700 different values in this column!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 205,
+ "metadata": {
+ "id": "7pVHpmLWj6_w",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 121
+ },
+ "outputId": "bdc449f2-0e05-4595-c1fd-8935d6de4722"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "shape: (1, 1)\n",
+ "┌────────┐\n",
+ "│ diag_1 │\n",
+ "│ --- │\n",
+ "│ u32 │\n",
+ "╞════════╡\n",
+ "│ 717 │\n",
+ "└────────┘"
+ ],
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ "shape: (1, 1)\n",
+ "\n",
+ "
\n",
+ "
\n",
+ "diag_1\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "u32\n",
+ "
\n",
+ "
\n",
+ "\n",
+ "\n",
+ "
\n",
+ "
\n",
+ "717\n",
+ "
\n",
+ "
\n",
+ "\n",
+ "
\n",
+ "
"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 205
+ }
+ ],
+ "source": [
+ "tmp = rdf.select(\"diag_1\").unique()\n",
+ "tmp.select(pl.col(\"diag_1\").count()).collect().fetch()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cPsmfkBpkPCv"
+ },
+ "source": [
+ "Standard groupings of these codes have already been designed. What we want to do is replace the hundreds of unique codes we have in our diagnoses columns with these groupings!\n",
+ "\n",
+ "To do this, we will again use Polars `when().then().otherwise()` functions to perform a find and replace operation. We will use `when()` to check if the codes in each cell are either E or V codes or fall within a certain numerical range.\n",
+ "\n",
+ "However, these diagnoses columns are currently string columns, since the E and V codes are not entirely numerical. This is problematic since we cannot perform numerical comparisons on these cells and we cannot convert the column type to a numerical one because of these 'E' and 'V' values!\n",
+ "\n",
+ "We will solve this problem in three steps:\n",
+ "\n",
+ "1) We will find and replace all E codes with a \"-1\" value and V codes with a \"-2\" value.\n",
+ "\n",
+ "2) We will `select()` our columns and `cast()` all values in these columns to float values.\n",
+ "\n",
+ "3) We will perform the find and replace operation to assign every ICD-9 code to its associated group, of which there are 17, plus E codes and V codes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 206,
+ "metadata": {
+ "id": "xPNFpZ7lW8qR"
+ },
+ "outputs": [],
+ "source": [
+ "# iterate over the three diagnoses columns\n",
+ "for col in [\"diag_1\", \"diag_2\", \"diag_3\"]:\n",
+ " # step one: replace troublesome E and V codes with temporary -1 and -2 codes\n",
+ " rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(\n",
+ " pl.col(col).str.starts_with(\"E\")\n",
+ " ) # use Polars str.starts_with method to identify E codes\n",
+ " .then(\"-1\")\n",
+ " .when(pl.col(col).str.starts_with(\"V\"))\n",
+ " .then(\"-2\")\n",
+ " .otherwise(pl.col(col))\n",
+ " .keep_name()\n",
+ " ]\n",
+ " )\n",
+ "\n",
+ " # step two: cast all values in column to float values\n",
+ " rdf = rdf.with_columns([pl.col(col).cast(pl.Float64)])\n",
+ "\n",
+ " # step three: replace all codes with their corresponding group\n",
+ " rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(col) >= 800)\n",
+ " .then(\"injury and poisoning\")\n",
+ " .when(pl.col(col) >= 780)\n",
+ " .then(\"symptoms, signs & ill-defined\")\n",
+ " .when(pl.col(col) >= 760)\n",
+ " .then(\"perinatal\")\n",
+ " .when(pl.col(col) >= 740)\n",
+ " .then(\"congenital anomalies\")\n",
+ " .when(pl.col(col) >= 710)\n",
+ " .then(\"musculoskeletal & connective tissue\")\n",
+ " .when(pl.col(col) >= 680)\n",
+ " .then(\"skin\")\n",
+ " .when(pl.col(col) >= 630)\n",
+ " .then(\"pregnancy, childbirth and puerperium\")\n",
+ " .when(pl.col(col) >= 580)\n",
+ " .then(\"genitourinary\")\n",
+ " .when(pl.col(col) >= 520)\n",
+ " .then(\"digestive\")\n",
+ " .when(pl.col(col) >= 460)\n",
+ " .then(\"respiratory\")\n",
+ " .when(pl.col(col) >= 390)\n",
+ " .then(\"circulatory\")\n",
+ " .when(pl.col(col) >= 320)\n",
+ " .then(\"nervous system and sense organs\")\n",
+ " .when(pl.col(col) >= 290)\n",
+ " .then(\"mental disorders\")\n",
+ " .when(pl.col(col) >= 280)\n",
+ " .then(\"blood and blood-forming organs\")\n",
+ " .when(pl.col(col) >= 240)\n",
+ " .then(\"endocrine, nutritional, metabolic and immunity\")\n",
+ " .when(pl.col(col) >= 140)\n",
+ " .then(\"neoplasms\")\n",
+ " .when(pl.col(col) >= 1)\n",
+ " .then(\"infectious and parasitic\")\n",
+ " .when(pl.col(col) == -1)\n",
+ " .then(\"E code (injury)\")\n",
+ " .when(pl.col(col) == -2)\n",
+ " .then(\"V code (other)\")\n",
+ " .otherwise(\n",
+ " None\n",
+ " ) # anything else (null values) is left as null\n",
+ " .alias(\n",
+ " col\n",
+ " ) # give resulting column same name as previously- therefore replacing old columns\n",
+ " ]\n",
+ " )"
+ ]
+ },
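+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the range logic easier to check, here is a minimal plain-Python sketch of the same grouping applied to a single code. It runs locally without BastionLab or Polars and assumes the standard ICD-9 chapter ranges; `icd9_group`, `LOWER_BOUNDS` and `LABELS` are hypothetical names used only for this illustration.\n",
+ "\n",
+ "```python\n",
+ "from bisect import bisect_right\n",
+ "\n",
+ "# lower bound of each ICD-9 chapter, in ascending order (assumed standard ranges)\n",
+ "LOWER_BOUNDS = [1, 140, 240, 280, 290, 320, 390, 460, 520,\n",
+ "                580, 630, 680, 710, 740, 760, 780, 800]\n",
+ "LABELS = [\n",
+ "    'infectious and parasitic',\n",
+ "    'neoplasms',\n",
+ "    'endocrine, nutritional, metabolic and immunity',\n",
+ "    'blood and blood-forming organs',\n",
+ "    'mental disorders',\n",
+ "    'nervous system and sense organs',\n",
+ "    'circulatory',\n",
+ "    'respiratory',\n",
+ "    'digestive',\n",
+ "    'genitourinary',\n",
+ "    'pregnancy, childbirth and puerperium',\n",
+ "    'skin',\n",
+ "    'musculoskeletal & connective tissue',\n",
+ "    'congenital anomalies',\n",
+ "    'perinatal',\n",
+ "    'symptoms, signs & ill-defined',\n",
+ "    'injury and poisoning',\n",
+ "]\n",
+ "\n",
+ "\n",
+ "def icd9_group(code):\n",
+ "    # hypothetical helper: map one raw ICD-9 code string to its group label\n",
+ "    if code is None:\n",
+ "        return None\n",
+ "    if code.startswith('E'):\n",
+ "        return 'E code (injury)'\n",
+ "    if code.startswith('V'):\n",
+ "        return 'V code (other)'\n",
+ "    value = float(code)\n",
+ "    if value < 1:\n",
+ "        return None\n",
+ "    # pick the last lower bound that does not exceed the code\n",
+ "    return LABELS[bisect_right(LOWER_BOUNDS, value) - 1]\n",
+ "\n",
+ "\n",
+ "print(icd9_group('250.83'))\n",
+ "print(icd9_group('V57'))\n",
+ "```\n",
+ "\n",
+ "Looking up the last lower bound that does not exceed the code is equivalent to a descending chain of `when()` checks over the same thresholds."
+ ]
+ },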
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "P1MquUrNlXDO"
+ },
+ "source": [
+ "By performing the same query as previously to count `diag_1`'s unique values, we see there are now a much more manageable 19 labels in our data column! The same applies to the `diag_2` and `diag_3` columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 207,
+ "metadata": {
+ "id": "YfC9CmWWdu0n",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 121
+ },
+ "outputId": "04223b57-76a5-4108-99ef-20eb5862b907"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "shape: (1, 1)\n",
+ "┌────────┐\n",
+ "│ diag_1 │\n",
+ "│ --- │\n",
+ "│ u32 │\n",
+ "╞════════╡\n",
+ "│ 19 │\n",
+ "└────────┘"
+ ],
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ "shape: (1, 1)\n",
+ "\n",
+ "
\n",
+ "
\n",
+ "diag_1\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "u32\n",
+ "
\n",
+ "
\n",
+ "\n",
+ "\n",
+ "
\n",
+ "
\n",
+ "19\n",
+ "
\n",
+ "
\n",
+ "\n",
+ "
\n",
+ "
"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 207
+ }
+ ],
+ "source": [
+ "tmp = rdf.select(\"diag_1\").unique()\n",
+ "tmp.select(pl.col(\"diag_1\").count()).collect().fetch()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We notice in our project brief that there is only 1 E code value in the `diag_1` column, so we will remove this value from our dataset before continuing by using the `filter` function."
+ ],
+ "metadata": {
+ "id": "pKAp3OvKcwuX"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "rdf = rdf.filter(pl.col(\"diag_1\") != \"E code (injury)\")"
+ ],
+ "metadata": {
+ "id": "yIRH-_QNdEwL"
+ },
+ "execution_count": 208,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BvdGu7GmsZVu"
+ },
+ "source": [
+ "### Grouping data: A1C, max glucose levels and readmittance\n",
+ "\n",
+ "We want to group data in three other columns using the same `.when().then().otherwise()` methods.\n",
+ "\n",
+ "The first two are `A1Cresult`, which contains patients' HbA1c level, and `max_glu_serum`, which contains their blood glucose level. We want to group these into `very high`, `high` and `normal` groups based on levels defined in our project brief.\n",
+ "\n",
+ "These columns are both currently string columns, so we will also need to convert them to float values in order to perform numerical comparisons on them."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 209,
+ "metadata": {
+ "id": "FgyrnPAlsZ0u"
+ },
+ "outputs": [],
+ "source": [
+ "# cast `max_glu_serum` and `A1Cresult` columns to float values\n",
+ "rdf = rdf.with_columns(\n",
+ " [pl.col(\"max_glu_serum\").cast(pl.Float64), pl.col(\"A1Cresult\").cast(pl.Float64)]\n",
+ ")\n",
+ "\n",
+ "# group values in A1Cresult column\n",
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(\"A1Cresult\") >= 8)\n",
+ " .then(\"very high\")\n",
+ " .when(pl.col(\"A1Cresult\") >= 7)\n",
+ " .then(\"high\")\n",
+ " .when(pl.col(\"A1Cresult\") >= 0)\n",
+ " .then(\"normal\")\n",
+ " .otherwise(pl.col(\"A1Cresult\"))\n",
+ " .keep_name()\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "# group values in max_glu_serum column\n",
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(\"max_glu_serum\") >= 300)\n",
+ " .then(\"very high\")\n",
+ " .when(pl.col(\"max_glu_serum\") >= 200)\n",
+ " .then(\"high\")\n",
+ " .when(pl.col(\"max_glu_serum\") >= 0)\n",
+ " .then(\"normal\")\n",
+ " .otherwise(pl.col(\"max_glu_serum\"))\n",
+ " .keep_name()\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Buu2nja5w6Db"
+ },
+ "source": [
+ "The final column we want to group is the `readmitted` column which records the number of days before any further re-hospitalization linked to the patients' diabetic condition.\n",
+ "\n",
+ "We will group this column into `short-term`, `long-term` and `n/a` (not applicable) groups.\n",
+ "\n",
+ "As in previous examples, we must first convert values in this column from strings to integer values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 210,
+ "metadata": {
+ "id": "9cca9DhSw6We"
+ },
+ "outputs": [],
+ "source": [
+ "# cast readmitted column to integer values\n",
+ "rdf = rdf.with_columns([pl.col(\"readmitted\").cast(pl.Int64)])\n",
+ "\n",
+ "# group values\n",
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(\"readmitted\") < 31)\n",
+ " .then(\"short-term\")\n",
+ " .when(pl.col(\"readmitted\") >= 31)\n",
+ " .then(\"long-term\")\n",
+ " .otherwise(\"n/a\")\n",
+ " .keep_name()\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kuwxAGYBoOQJ"
+ },
+ "source": [
+ "### Grouping data: binning ages\n",
+ "The next grouping task we will perform is to group ages into intervals of 10 years. We do this both to increase data privacy and to more easily draw correlations linked to broader age groups.\n",
+ "\n",
+ "We won't need to perform a `when().then().otherwise()` query here since BastionLab has its own `ApplyBins` tool.\n",
+ "\n",
+ "`ApplyBins` is a PyTorch module and the grouping of numbers takes place in its `forward` function. We can pass PyTorch modules to BastionLab's `apply_udf` function which will apply the `forward` function to any specified columns.\n",
+ "\n",
+ "All in all, we need just three steps to bin our age column data:\n",
+ "\n",
+ "1) We import `ApplyBins` from `bastionlab.polars.utils`.\n",
+ "2) We instantiate our `ApplyBins` PyTorch module class with our bin interval given as the only argument.\n",
+ "3) We use `apply_udf`, providing a list of the columns we want to modify and the PyTorch module, `ApplyBins`, that we wish to apply to these columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 211,
+ "metadata": {
+ "id": "2EC3smnWc06Q"
+ },
+ "outputs": [],
+ "source": [
+ "from bastionlab.polars.utils import ApplyBins\n",
+ "\n",
+ "# get an instance of ApplyBins module which will bin data into groups of 10\n",
+ "model = ApplyBins(10)\n",
+ "\n",
+ "# apply bins to \"age\" column\n",
+ "rdf = rdf.apply_udf([\"age\"], model)"
+ ]
+ },
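+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For intuition, here is a plain-Python sketch of fixed-width binning on a toy list of ages. It assumes each value is mapped to the lower bound of its interval; the exact rule `ApplyBins` applies on the server may differ, and `bin_value` is a hypothetical helper that is not part of BastionLab.\n",
+ "\n",
+ "```python\n",
+ "def bin_value(value, width):\n",
+ "    # hypothetical helper: map a value to the lower bound of its fixed-width interval\n",
+ "    return (value // width) * width\n",
+ "\n",
+ "\n",
+ "# toy ages, invented for illustration\n",
+ "ages = [3, 27, 30, 64, 99]\n",
+ "binned = [bin_value(age, 10) for age in ages]\n",
+ "print(binned)\n",
+ "```\n",
+ "\n",
+ "Under this assumption, every age within the same decade collapses to one value, which is what makes individual patients harder to single out."
+ ]
+ },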
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1pOQYPYSsVns"
+ },
+ "source": [
+ "> Note, you can create your own custom PyTorch modules and apply them to columns using `apply_udf`. This is BastionLab's way of allowing you to apply custom functions on datasets, whilst restricting what you can do for security reasons. Functionality like `lambda`, `map` and `apply` is blocked by BastionLab as it is too permissive and could be misused."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gYRVmqTitckT"
+ },
+ "source": [
+ "### Adding columns\n",
+ "\n",
+ "Up until this point we have been using the `.when().then().otherwise()` and `with_columns` methods to make changes to existing columns, but by providing a new column name to the `alias` method, we can create a new column.\n",
+ "\n",
+ "In the following example, we will create an `is_readmitted` column which will store `False` for all the \"n/a\" values in our original `readmitted` column and `True` for any other values. This will allow us to quickly query whether certain groups of data have been readmitted or not!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 212,
+ "metadata": {
+ "id": "B2JGdBhmteAz"
+ },
+ "outputs": [],
+ "source": [
+ "rdf = rdf.with_columns(\n",
+ " [\n",
+ " pl.when(pl.col(\"readmitted\") == \"n/a\")\n",
+ " .then(False)\n",
+ " .otherwise(True)\n",
+ " .alias(\n",
+ " \"is_readmitted\"\n",
+ " ) # ending the .when().then().otherwise() pattern with .alias() allows us to provide a new column name\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "edoL2_uy_G19"
+ },
+ "source": [
+ "### Converting column types\n",
+ "\n",
+ "We have already seen examples where we explicitly converted the datatype of our columns using the `cast` method. Here we will implicitly convert the datatype by replacing the \"yes\" and \"no\" values in our `change` column, which represent whether a patient's medication has been changed, with boolean True or False values.\n",
+ "\n",
+ "The datatype of this column will be changed automatically by this operation as we can see below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 213,
+ "metadata": {
+ "id": "fMhSrD8__G19",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "fb2c31c6-c60c-4d2e-e712-30f502d436b0"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "[polars.datatypes.Utf8]"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 213
+ }
+ ],
+ "source": [
+ "# print out initial datatype of \"change\" column\n",
+ "\n",
+ "rdf.select(\"change\").dtypes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 214,
+ "metadata": {
+ "id": "hYWJ9FB70mcM",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "f9ebdad2-98f9-4216-d931-0c868d11a9ab"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "[polars.datatypes.Boolean]"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 214
+ }
+ ],
+ "source": [
+ "# replaces Yes/No values with True/False\n",
+ "rdf = rdf.with_columns(\n",
+ " [pl.when(pl.col(\"change\") == \"No\").then(False).otherwise(True).keep_name()]\n",
+ ")\n",
+ "\n",
+ "# print out datatype of column post find and replace operation\n",
+ "rdf.select(\"change\").dtypes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CYS-Mkl1tD8t"
+ },
+ "source": [
+ "### Saving our RemoteLazyFrame and disconnecting\n",
+ "\n",
+ "Our dataframe is all clean and ready for the next step: data analysis and visualization. Data scientist #1 is going to be reassigned to another task. They will save their cleaned RemoteLazyFrame and make a note of the identifier to share with data scientist #2.\n",
+ "\n",
+ "We need to perform `collect()` before saving or getting an identifier for our RemoteLazyFrame since the `save` method and `identifier` attribute are only available for FetchableLazyFrames.\n",
+ "\n",
+ ">Note, the data owner must have set the `savable` option to `True` when uploading the dataframe for this operation to be possible!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 215,
+ "metadata": {
+ "id": "DWu6ToX53bm9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "outputId": "8980b3c6-c49e-4180-a607-50db2cc9f0b1"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'57a34384-f881-4059-ae5a-6a7d1483a1ae'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 215
+ }
+ ],
+ "source": [
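+ "# collect once so that the frame we save and the identifier we share refer to the same FetchableLazyFrame\n",
+ "saved_rdf = rdf.collect()\n",
+ "saved_rdf.save()\n",
+ "saved_identifier = saved_rdf.identifier\n",
+ "saved_identifier"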
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NgkiBinG6DJ2"
+ },
+ "source": [
+ "They can now close their connection to the BastionLab server."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 216,
+ "metadata": {
+ "id": "qoiADM1W6OC_"
+ },
+ "outputs": [],
+ "source": [
+ "connection.close()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Part II: data analysis and visualization\n",
+ "\n",
+ "Data scientist #2 is now ready to begin their analysis of the cleaned dataset. Just like data scientist #1, they first need to connect to the server and get the FetchableLazyFrame saved by data scientist #1."
+ ],
+ "metadata": {
+ "id": "3Gvx_sK5ypgD"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# connecting to the server\n",
+ "from bastionlab import Connection\n",
+ "\n",
+ "connection = Connection(\"localhost\")\n",
+ "client = connection.client\n",
+ "\n",
+ "# get the previously saved dataframe\n",
+ "rdf = client.polars.get_df(saved_identifier)\n",
+ "rdf"
+ ],
+ "metadata": {
+ "id": "YXDgnPiayf-b",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "86ba51d8-ae36-4a55-ac36-9734ed1e0e9b"
+ },
+ "execution_count": 217,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "FetchableLazyFrame(identifier=57a34384-f881-4059-ae5a-6a7d1483a1ae)"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 217
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can again confirm that the original privacy policy is still in place by running a non-aggregated query that would violate the policy."
+ ],
+ "metadata": {
+ "id": "z7-wG7DfzSyI"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "rdf.select(pl.col(\"age\")).collect().fetch()"
+ ],
+ "metadata": {
+ "id": "X_cUUQwqzw4B",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "423254aa-caca-45d5-ec8b-81dee515e3bd"
+ },
+ "execution_count": 218,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "\u001b[31mThe query has been rejected by the data owner.\u001b[37m\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that we are all set up, we can dive into the analysis.\n",
+ "\n",
+ "### Age as a factor in readmission and emergency trips\n",
+ "\n",
+ "Let's start by visualizing the number of patients who were readmitted to hospital for diabetes-related issues during the study.\n",
+ "\n",
+ "To do this we group data by `age` and aggregate the `sum` of those who were readmitted. We then generate a barplot for this query."
+ ],
+ "metadata": {
+ "id": "NfRexmoN0X9h"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "total_readmitted = rdf.groupby(\"age\").agg(\n",
+ " pl.col(\"is_readmitted\").sum().alias(\"total readmitted\")\n",
+ ")\n",
+ "total_readmitted.barplot(x=\"age\", y=\"total readmitted\")"
+ ],
+ "metadata": {
+ "id": "-5f35-7l_bUG",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 279
+ },
+ "outputId": "da47e386-065f-4b25-9029-43cad1e3e4fc"
+ },
+ "execution_count": 219,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "
"
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In terms of the number of readmissions, we see a clear trend for readmission cases to increase with age, before dropping down in the 80-90 and 90-100 age groups. This may be due to increased mortality in these age ranges.\n",
+ "\n",
+ "However, if we look at the number of patients per age group using `histplot`, we see that it follows the same trend. The rise in readmissions may therefore not reflect a higher risk of readmission for older patients, but rather a much larger number of diabetes patients in older age groups."
+ ],
+ "metadata": {
+ "id": "BdRDwT74BOrr"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "rdf.histplot(x=\"age\")"
+ ],
+ "metadata": {
+ "id": "DBML4fIUAID2",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 279
+ },
+ "outputId": "e068d27f-95d9-4230-f8d3-ab4a42920fe3"
+ },
+ "execution_count": 220,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "
"
+ ],
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZEAAAEGCAYAAACkQqisAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAV/klEQVR4nO3df/BddZ3f8efLAK7iDxLJUiQhya6pNsssATMQq92ysIXAUIOWdWC6klpqtrNQoWOnC9tpcXXt6MyqK11lBksEdhREEUkzrJiydLc65UfQCAkRSQnkxwDJEgS7zIrQd/+4n697G76Bb05yf3zJ8zFz5nvu+55zz/vee8KL8+Oek6pCkqQuXjPqBiRJ05chIknqzBCRJHVmiEiSOjNEJEmdHTLqBobtyCOPrPnz54+6DUmaVu67776/rqrZe9YPuhCZP38+69atG3UbkjStJHlssrq7syRJnRkikqTODBFJUmeGiCSpM0NEktSZISJJ6swQkSR1ZohIkjozRCRJnRkiksbOsfPmkWTkw7Hz5o36oxh7B91lTySNv21bt7Jmw5ZRt8HZxy0YdQtjzy0RSVJnhogkqTNDRJLU2cBCJMncJHcmeTDJxiSXtPrHkuxIsr4NZ/XNc3mSzUkeSnJGX31Zq21OcllffUGSu1v9a0kOG9T7kSS91CC3RF4APlpVi4ClwEVJFrXnPldVi9twG0B77jzg14BlwBeTzEgyA/gCcCawCDi/73U+3V7rbcDTwIUDfD+SpD0MLESq6vGq+n4b/ymwCTjmZWZZDtxYVT+rqi3AZuCkNmyuqkeq6nngRmB5kgCnAt9o818HnDOYdyNJmsxQjokkmQ+cANzdShcnuT/JqiQzW+0YYFvfbNtbbW/1twA/qaoX9qhPtvyVSdYlWbdr164D8I4kSTCEEEnyBuBm4NKqeha4CvhVYDHwOPCZQfdQVVdX1ZKqWjJ79ktuESxJ6migPzZMcii9APlKVX0ToKqe7Hv+S8Ca9nAHMLdv9jmtxl7qTwFHJDmkbY30Ty9JGoJBnp0V4BpgU1V9tq9+dN9k7wM2tPHVwHlJXptkAbAQuAe4F1jYzsQ6jN7B99VVVcCdwLlt/hXArYN6P5Kklxrklsi7gQ8CDyRZ32p/QO/sqsVAAY8CvwtQVRuT3AQ8SO/Mrouq6kWAJBcDtwMzgFVVtbG93u8DNyb5I+AH9EJLkjQkAwuRqvoukEmeuu1l5vkk8MlJ6rdNNl9VPULv7C1J0gj4i3VJUmeGiCSpM0NEktSZISJJ6swQkSR1ZohIkjozRCRJnRkikqTODBFJUmeGiCSpM0NEktSZISJJ6swQkSR1ZohIkjob6J0NJU0/x86bx7atW0fdhqYJQ0TS/2fb1q2s2bBlpD2cfdyCkS5fU+fuLElSZ4aIJKkzQ0SS1JkhIknqzBCRJHVmiEiSOjNEJEmdGSKSpM4MEUlSZ4aIJKkzQ0SS1JkhIknqzBCRJHVmiEiSOjNEJEmdDSxEksxNcmeSB5NsTHJJq89KsjbJw+3vzFZPkiuTbE5yf5IT+15rRZv+4SQr+urvTPJAm+fKJBnU+5EkvdQgt0ReAD5aVYuApcBFSRYBlwF3VNVC4I72GOBMYGEbVgJXQS90gCuAk4GTgCsmgqdN8+G++ZYN8P1IkvYwsBCpqser6vtt/KfAJuAYYDlwXZvsOuCcNr4cuL567gKOSHI0cAawtqp2V9XTwFpgWXvuTVV1V1UVcH3fa0mShmAox0SSzAdOAO4Gjqqqx9tTTwBHtfFjgG19s21vtZerb5+kPtnyVyZZl2Tdrl279uu9SJL+zsBDJMkbgJuBS6vq2f7n2hZEDbqHqrq6qpZU1ZLZs2cPenGSdNAYaIgkOZRegHylqr7Zyk+2XVG0vztbfQcwt2/2Oa32cvU5k9QlSUMyyLOzAlwDbKqqz/Y9tRqYOMNqBXBrX/2CdpbWUuCZ
ttvrduD0JDPbAfXTgdvbc88mWdqWdUHfa0mShuCQAb72u4EPAg8kWd9qfwB8CrgpyYXAY8AH2nO3AWcBm4HngA8BVNXuJJ8A7m3Tfbyqdrfx3wOuBV4H/HkbJElDMrAQqarvAnv73cZpk0xfwEV7ea1VwKpJ6uuA4/ajTUnSfvAX65KkzgwRSVJnhogkqTNDRJLUmSEiSerMEJEkdWaISJI6M0QkSZ0ZIpKkzgwRSVJnhogkqTNDRJLUmSEiSerMEJEkdWaISJI6M0QkSZ0N8s6GkvbBsfPmsW3r1lG3oT6HHnYYvbtvj87cY49l62OPjbSHl2OISGNi29atrNmwZdRtcPZxC0bdwtj4+fPPj/w7Gffvw91ZkqTODBFJUmeGiCSpM0NEktSZISJJ6swQkSR1ZohIkjozRCRJnRkikqTODBFJUmeGiCSpM0NEktTZwEIkyaokO5Ns6Kt9LMmOJOvbcFbfc5cn2ZzkoSRn9NWXtdrmJJf11RckubvVv5bksEG9F0nS5Aa5JXItsGyS+ueqanEbbgNIsgg4D/i1Ns8Xk8xIMgP4AnAmsAg4v00L8On2Wm8DngYuHOB7kSRNYmAhUlV/Beye4uTLgRur6mdVtQXYDJzUhs1V9UhVPQ/cCCxP7wL/pwLfaPNfB5xzQN+AJOkVTSlEktwxldoUXZzk/ra7a2arHQNs65tme6vtrf4W4CdV9cIedUnSEL1siCT5pSSzgCOTzEwyqw3z6fYf7auAXwUWA48Dn+nwGvssycok65Ks27Vr1zAWKUkHhVe6s+HvApcCbwXuAybuE/ks8Kf7urCqenJiPMmXgDXt4Q5gbt+kc1qNvdSfAo5IckjbGumffrLlXg1cDbBkyZLa174lSZN72S2Rqvp8VS0A/l1V/UpVLWjD8VW1zyGS5Oi+h+8DJs7cWg2cl+S1SRYAC4F7gHuBhe1MrMPoHXxfXVUF3Amc2+ZfAdy6r/1IkvbPlO6xXlX/Jck/BOb3z1NV1+9tniQ3AKfQ2xW2HbgCOCXJYqCAR+lt6VBVG5PcBDwIvABcVFUvtte5GLgdmAGsqqqNbRG/D9yY5I+AHwDXTO0tS5IOlCmFSJI/o3csYz3wYisXsNcQqarzJynv9T/0VfVJ4JOT1G8Dbpuk/gi9s7ckSSMypRABlgCL2m4kSZKAqf9OZAPw9wbZiCRp+pnqlsiRwINJ7gF+NlGsqvcOpCtJ0rQw1RD52CCbkCRNT1M9O+svB92IJGn6merZWT+ldzYWwGHAocDfVNWbBtWYJGn8TXVL5I0T4+3ih8uBpYNqSpI0PezzVXyr51vAGa84sSTpVW2qu7Pe3/fwNfR+N/K3A+lIkjRtTPXsrH/aN/4CvUuWLD/g3UiSppWpHhP50KAbkSRNP1O9KdWcJLe0e6bvTHJzkjmDbk6SNN6memD9y/Qu1/7WNvy3VpMkHcSmGiKzq+rLVfVCG64FZg+wL0nSNDDVEHkqye8kmdGG36F3d0FJ0kFsqiHyL4EPAE/Quzf6ucC/GFBPkqRpYqqn+H4cWFFVTwMkmQX8Mb1wkSQdpKa6JfLrEwECUFW7gRMG05IkabqYaoi8JsnMiQdtS2SqWzGSpFepqQbBZ4D/leTr7fFvM8n90CVJB5ep/mL9+iTrgFNb6f1V9eDg2pIkTQdT3iXVQsPgkCT9wj5fCl6SpAmGiCSpM0NEktSZISJJ6swQkSR1ZohIkjozRCRJnRkikqTODBFJUmcDC5Ekq9r92Df01WYlWZvk4fZ3ZqsnyZVJNie5P8mJffOsaNM/nGRFX/2dSR5o81yZJIN6L5KkyQ1yS+RaYNketcuAO6pqIXBHewxwJrCwDSuBq+AXVwu+AjgZOAm4ou9qwlcBH+6bb89lSZIGbGAhUlV/Bezeo7wcuK6NXwec01e/vnruAo5IcjRwBrC2qna3+5msBZa1595UVXdVVQHX972WJGlIhn1M5KiqeryNPwEc1caPAbb1Tbe91V6uvn2S
+qSSrEyyLsm6Xbt27d87kCT9wsgOrLctiBrSsq6uqiVVtWT27NnDWKQkHRSGHSJPtl1RtL87W30HMLdvujmt9nL1OZPUJUlDNOwQWQ1MnGG1Ari1r35BO0trKfBM2+11O3B6kpntgPrpwO3tuWeTLG1nZV3Q91rSPjt23jySjHSQpqOB3Sc9yQ3AKcCRSbbTO8vqU8BNSS4EHgM+0Ca/DTgL2Aw8B3wIoKp2J/kEcG+b7uNVNXGw/vfonQH2OuDP2yB1sm3rVtZs2DLSHs4+bsFIly91MbAQqarz9/LUaZNMW8BFe3mdVcCqSerrgOP2p0dJ0v7xF+uSpM4MEUlSZ4aIJKkzQ0SS1JkhIknqzBCRJHVmiEiSOjNEJEmdGSKSpM4MEUlSZ4aIJKkzQ0SS1JkhIknqzBCRJHVmiEiSOjNEJEmdGSKSpM4MEUlSZ4aIJKkzQ0SS1JkhIknqzBCRJHVmiEiSOjNEJEmdGSKSpM4MEUlSZ4aIJKkzQ0SS1JkhIknqzBCRJHVmiEiSOhtJiCR5NMkDSdYnWddqs5KsTfJw+zuz1ZPkyiSbk9yf5MS+11nRpn84yYpRvBdJOpiNckvkN6tqcVUtaY8vA+6oqoXAHe0xwJnAwjasBK6CXugAVwAnAycBV0wEjyRpOMZpd9Zy4Lo2fh1wTl/9+uq5CzgiydHAGcDaqtpdVU8Da4Flw25akg5mowqRAr6T5L4kK1vtqKp6vI0/ARzVxo8BtvXNu73V9lZ/iSQrk6xLsm7Xrl0H6j1I0kHvkBEt9z1VtSPJLwNrk/yo/8mqqiR1oBZWVVcDVwMsWbLkgL2uJB3sRrIlUlU72t+dwC30jmk82XZT0f7ubJPvAOb2zT6n1fZWlyQNydBDJMnhSd44MQ6cDmwAVgMTZ1itAG5t46uBC9pZWkuBZ9pur9uB05PMbAfUT281SdKQjGJ31lHALUkmlv/Vqvp2knuBm5JcCDwGfKBNfxtwFrAZeA74EEBV7U7yCeDeNt3Hq2r38N6GJGnoIVJVjwDHT1J/CjhtknoBF+3ltVYBqw50j5KkqRmnU3wlSdOMISJJ6swQkSR1ZohIkjob1Y8NJQCOnTePbVu3jroNSR0ZIhqpbVu3smbDllG3wdnHLRh1C9K05O4sSVJnhogkqTNDRJLUmSEiSerMEJEkdWaISJI68xRfSRpjhx52GO2q52PJEJGkMfbz558f699SuTtLktSZISJJ6swQkSR1ZohIkjozRCRJnRkikqTODBFJUmf+TuQg5g2hJO0vQ+QgNg43hPJmUNL05u4sSVJnhogkqTNDRJLUmSEiSerMEJEkdebZWSPgqbWSXi0MkREYh1NrwdNrJe0/d2dJkjqb9iGSZFmSh5JsTnLZqPuRpIPJtA6RJDOALwBnAouA85MsGm1XknTwmO7HRE4CNlfVIwBJbgSWAw/ubYb7H3hgrG96L0nTSapq1D10luRcYFlV/av2+IPAyVV18R7TrQRWtodvBx7aj8UeCfz1fsx/oIxDH+PQA4xHH+PQA4xHH+PQA4xHH+PQAxyYPuZV1ew9i9N9S2RKqupq4OoD8VpJ1lXVkgPxWtO9j3HoYVz6GIcexqWPcehhXPoYhx4G3ce0PiYC7ADm9j2e02qSpCGY7iFyL7AwyYIkhwHnAatH3JMkHTSm9e6sqnohycXA7cAMYFVVbRzwYg/IbrEDYBz6GIceYDz6GIceYDz6GIceYDz6GIceYIB9TOsD65Kk0Zruu7MkSSNkiEiSOjNE9sGoLrGSZFWSnUk29NVmJVmb5OH2d+aAe5ib5M4kDybZmOSSYfeR5JeS3JPkh62HP2z1BUnubt/L19pJFgOXZEaSHyRZM4o+kjya5IEk65Osa7WhrhdtmUck+UaSHyXZlORdQ14v3t4+g4nh2SSXjuiz+Ldt3dyQ5Ia2zg57vbikLX9jkktbbWCfhSEyRSO+xMq1wLI9apcBd1TVQuCO9niQXgA+WlWLgKXARe39
D7OPnwGnVtXxwGJgWZKlwKeBz1XV24CngQsH2EO/S4BNfY9H0cdvVtXivt8ADHu9APg88O2qegdwPL3PZGh9VNVD7TNYDLwTeA64ZZg9ACQ5BvgIsKSqjqN3ss95DHG9SHIc8GF6V/M4Hjg7ydsY5GdRVQ5TGIB3Abf3Pb4cuHyIy58PbOh7/BBwdBs/GnhoyJ/HrcA/GVUfwOuB7wMn0/sl7iGTfU8DXP6c9o/xVGANkGH3ATwKHLlHbajfB/BmYAvtJJ1R9dG33NOB743oszgG2AbMonfm6xrgjGGuF8BvA9f0Pf6PwL8f5GfhlsjUTawgE7a32qgcVVWPt/EngKOGteAk84ETgLuH3UfbhbQe2AmsBf438JOqeqFNMqzv5U/o/eP8v+3xW0bQRwHfSXJfu7QPDH+9WADsAr7cdu391ySHj6CPCecBN7TxofZQVTuAPwa2Ao8DzwD3Mdz1YgPwj5K8JcnrgbPo/SB7YJ+FIfIqUL3/vRjKudpJ3gDcDFxaVc8Ou4+qerF6uy3m0Ntkf8cglzeZJGcDO6vqvmEvew/vqaoT6e1ivSjJb/Q/OaT14hDgROCqqjoB+Bv22FUyrPWzHWt4L/D1PZ8bRg/tOMNyesH6VuBwXrobeqCqahO93WffAb4NrAde3GOaA/pZGCJTN26XWHkyydEA7e/OQS8wyaH0AuQrVfXNUfUBUFU/Ae6kt3vgiCQTP5wdxvfybuC9SR4FbqS3S+vzw+6j/Z8vVbWT3jGAkxj+97Ed2F5Vd7fH36AXKqNYL84Evl9VT7bHw+7ht4AtVbWrqn4OfJPeujLs9eKaqnpnVf0GvWMwP2aAn4UhMnXjdomV1cCKNr6C3jGKgUkS4BpgU1V9dhR9JJmd5Ig2/jp6x2Q20QuTc4fRA0BVXV5Vc6pqPr314C+q6p8Ps48khyd548Q4vWMBGxjyelFVTwDbkry9lU6jdyuGofbRnM/f7cpiBD1sBZYmeX379zLxWQx1/Uzyy+3vscD7ga8yyM9ikAeaXm0Dvf2LP6a3H/4/DHG5N9Dbx/pzev/ndyG9ffB3AA8D/x2YNeAe3kNvE/h+epvI69vnMbQ+gF8HftB62AD8p1b/FeAeYDO9XRmvHeJ3cwqwZth9tGX9sA0bJ9bHYa8XbZmLgXXte/kWMHME6+fhwFPAm/tqo/gs/hD4UVs//wx47bDXT+B/0guvHwKnDfqz8LInkqTO3J0lSerMEJEkdWaISJI6M0QkSZ0ZIpKkzgwRSVJnhogkqTNDRBqSJN9qF0vcOHHBxCQXJvlxevdJ+VKSP2312UluTnJvG9492u6lyfljQ2lIksyqqt3tki330rtM+PfoXWvqp8BfAD+sqouTfBX4YlV9t12+4vaq+gcja17ai0NeeRJJB8hHkryvjc8FPgj8ZVXtBkjydeDvt+d/C1jUuwQTAG9K8oaq+j/DbFh6JYaINARJTqEXDO+qqueS/A9611ja29bFa4ClVfW3w+lQ6sZjItJwvBl4ugXIO+jdYvhw4B8nmdkuFf7P+qb/DvBvJh4kWTzUbqUpMkSk4fg2cEiSTcCngLvo3VfiP9O7wuv36N3u9pk2/UeAJUnuT/Ig8K+H3rE0BR5Yl0Zo4jhH2xK5BVhVVbeMui9pqtwSkUbrY+2e8RuALfTuxyFNG26JSJI6c0tEktSZISJJ6swQkSR1ZohIkjozRCRJnf0/B6JGrTglnSQAAAAASUVORK5CYII=\n"
+ },
+ "metadata": {
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "If we zoom in on `short-term` and `long-term` readmittance individually and get the percentage of patients in these groups who are readmitted, rather than the count, we get a rather different picture.\n",
+ "\n",
+ "To get these percentage values, we divide the total number of short-term or long-term values in the `readmitted` column by the total number of values in this column, then multiply by 100.\n",
+ "\n",
+ "To get the total short-term or long-term values, we use the `str.count_match` function, which returns 1 where the contents of the cell match `short-term` or `long-term` respectively and 0 for any other value. We can then use the `sum` function to add up these matches.\n",
+ "\n",
+ "To get the total number of values in the `readmitted` column, we select the column and use the `count()` function.\n",
+ "\n",
+ "We can then set the column name to whatever we like using the `alias` function."
+ ],
+ "metadata": {
+ "id": "xPFiho5eEKNT"
+ }
+ },
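+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The arithmetic can be checked on a toy example before running it remotely. The sketch below uses plain Python and invented rows rather than the real dataset; `percent_with_label` is a hypothetical helper that mirrors the sum-of-matches divided by count, times 100, calculation per age group.\n",
+ "\n",
+ "```python\n",
+ "from collections import defaultdict\n",
+ "\n",
+ "# toy (age group, readmitted label) pairs, invented for illustration\n",
+ "rows = [\n",
+ "    (20, 'short-term'),\n",
+ "    (20, 'n/a'),\n",
+ "    (70, 'long-term'),\n",
+ "    (70, 'short-term'),\n",
+ "    (70, 'n/a'),\n",
+ "    (70, 'n/a'),\n",
+ "]\n",
+ "\n",
+ "\n",
+ "def percent_with_label(rows, label):\n",
+ "    # hypothetical helper: percentage of rows per age group whose label matches\n",
+ "    matches = defaultdict(int)\n",
+ "    totals = defaultdict(int)\n",
+ "    for age, value in rows:\n",
+ "        totals[age] += 1\n",
+ "        matches[age] += int(value == label)  # a match counts as 1, anything else as 0\n",
+ "    return {age: matches[age] / totals[age] * 100 for age in totals}\n",
+ "\n",
+ "\n",
+ "short_term_pct = percent_with_label(rows, 'short-term')\n",
+ "long_term_pct = percent_with_label(rows, 'long-term')\n",
+ "print(short_term_pct)\n",
+ "print(long_term_pct)\n",
+ "```\n",
+ "\n",
+ "Dividing by the group size rather than reporting raw counts is what stops large age groups from dominating the picture."
+ ]
+ },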
+ {
+ "cell_type": "code",
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "short_term = rdf.groupby(\"age\").agg(\n",
+ " (\n",
+ " pl.col(\"readmitted\").str.count_match(\"short-term\").sum()\n",
+ " / pl.col(\"readmitted\").count()\n",
+ " * 100\n",
+ " ).alias(\"short-term readmitted\")\n",
+ ")\n",
+ "long_term = rdf.groupby(\"age\").agg(\n",
+ " (\n",
+ " pl.col(\"readmitted\").str.count_match(\"long-term\").sum()\n",
+ " / pl.col(\"readmitted\").count()\n",
+ " * 100\n",
+ " ).alias(\"long-term readmitted\")\n",
+ ")\n",
+ "\n",
+ "short_term.barplot(x=\"age\", y=\"short-term readmitted\")\n",
+ "plt.show()\n",
+ "long_term.barplot(x=\"age\", y=\"long-term readmitted\")"
+ ],
+ "metadata": {
+ "id": "KRBIqBXs8pda",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 541
+ },
+ "outputId": "6f484d34-88f4-460c-9870-737315a2a263"
+ },
+ "execution_count": 221,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "
"
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We see a slight trend of increased long-term readmissions as age increases, but interestingly, a much higher risk of short-term readmission in 20-30 year olds. This could be explained by younger patients perhaps not having yet found the correct treatment or lifestyle to manage their diabetes."
+ ],
+ "metadata": {
+ "id": "nlHmjMozE38p"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Similarly, if we take a look at the average number of emergency visits in the past year for patients in each age category, the 20-30 group is the most at-risk.\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "F4c69wn0CjV4"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "rdf.barplot(x=\"age\", y=\"number_emergency\")"
+ ],
+ "metadata": {
+ "id": "9sg9nYzE_vXu",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 281
+ },
+ "outputId": "ed0a1a83-9a7d-431d-e257-9a94062a731b"
+ },
+ "execution_count": 222,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "
"
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {
+ "needs_background": "light"
+ }
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "These trends provide actionable insights, where advice or follow-ups with 20-30 year olds could be tailored based on their increased risk of short-term hospital readmission and emergency visits."
+ ],
+ "metadata": {
+ "id": "bINBHE7zCuPd"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### A1C examinations as a factor in changes to medication and readmittance"
+ ],
+ "metadata": {
+ "id": "UUBMSWbuHxf-"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Next, let's look at whether checking a patient's A1C level during their hospital admission affects the likelihood of their medication being changed. The higher the A1C level, the greater the risk of developing diabetes complications."
+ ],
+ "metadata": {
+ "id": "S11nqyg0oufI"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# group data by A1C result group\n",
+ "ret = rdf.groupby(pl.col(\"A1Cresult\")).agg(\n",
+ " [\n",
+ " # get percentage of patients in each group who changed medication\n",
+ " (pl.col(\"change\").sum() / pl.col(\"change\").count() * 100).alias(\"change\"),\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "# display as a sorted list\n",
+ "ret.sort(pl.col(\"change\"), reverse=True).collect().fetch()"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 203
+ },
+ "id": "RkJK-DmHhq37",
+ "outputId": "d9ed2ec7-4e14-4a64-fdf4-e4b6d2b6fab8"
+ },
+ "execution_count": 223,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "shape: (4, 2)\n",
+ "┌───────────┬───────────┐\n",
+ "│ A1Cresult ┆ change │\n",
+ "│ --- ┆ --- │\n",
+ "│ str ┆ f64 │\n",
+ "╞═══════════╪═══════════╡\n",
+ "│ very high ┆ 65.067994 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ high ┆ 50.738007 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ null ┆ 44.272954 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ normal ┆ 44.226143 │\n",
+ "└───────────┴───────────┘"
+ ],
+ "text/html": [
+ "<div>\n",
+ "<small>shape: (4, 2)</small>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "<thead>\n",
+ "<tr><th>A1Cresult</th><th>change</th></tr>\n",
+ "<tr><td>str</td><td>f64</td></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><td>&quot;very high&quot;</td><td>65.067994</td></tr>\n",
+ "<tr><td>&quot;high&quot;</td><td>50.738007</td></tr>\n",
+ "<tr><td>null</td><td>44.272954</td></tr>\n",
+ "<tr><td>&quot;normal&quot;</td><td>44.226143</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 223
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Perhaps as expected, patients with a high or very high A1C result were more likely to have their medication changed. Interestingly, patients whose A1C level was not examined were only about as likely to have a medication change as those with normal A1C levels. This suggests that doctors are less likely to change medication unless an exam shows that A1C levels are higher than expected.\n",
+ "\n",
+ "What we now want to know is whether this has an impact on the likelihood of patient readmission in the short and long term."
+ ],
+ "metadata": {
+ "id": "xX672CJwpAYv"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# percentages of those readmitted within a month of initial hospital visit by A1C result group\n",
+ "ret = rdf.groupby(pl.col(\"A1Cresult\")).agg(\n",
+ " [\n",
+ " (\n",
+ " pl.col(\"readmitted\").str.count_match(\"short-term\").sum()\n",
+ " / pl.col(\"readmitted\").count()\n",
+ " * 100\n",
+ " ).alias(\"short-term readmitted\")\n",
+ " ]\n",
+ ")\n",
+ "ret.sort(pl.col(\"short-term readmitted\"), reverse=True).collect().fetch()"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 203
+ },
+ "id": "LFDwp7G2lfqh",
+ "outputId": "5290f2dd-74a6-4f82-bcea-b5b3a0a14bec"
+ },
+ "execution_count": 224,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "shape: (4, 2)\n",
+ "┌───────────┬───────────────────────┐\n",
+ "│ A1Cresult ┆ short-term readmitted │\n",
+ "│ --- ┆ --- │\n",
+ "│ str ┆ f64 │\n",
+ "╞═══════════╪═══════════════════════╡\n",
+ "│ null ┆ 11.436393 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ high ┆ 10.042172 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ very high ┆ 9.907722 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ normal ┆ 9.703288 │\n",
+ "└───────────┴───────────────────────┘"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 225
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We see that patients who did not have their A1C level taken are the most likely to be readmitted in the short term and almost as likely as their \"very high\" counterparts to be readmitted in the long term. This suggests that taking patients' A1C levels can encourage doctors to make changes in medication, which may be a factor in reducing hospital readmissions."
+ ],
+ "metadata": {
+ "id": "HmsKOU2bqz6F"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Dosage increases and decreases as factors on overall readmission"
+ ],
+ "metadata": {
+ "id": "TjaMlnbQIWcJ"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We will now investigate whether increases or decreases in the dosage of specific medications are associated with patient readmission.\n",
+ "\n",
+ "Let's start by getting a list of the medications we want to look at. We will narrow these lists down to drugs with more than 20 results to exclude any medication with only a handful of results."
+ ],
+ "metadata": {
+ "id": "4dzRa_RN2_Oq"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# list of all medications in study\n",
+ "all_meds = [\n",
+ " \"metformin\",\n",
+ " \"repaglinide\",\n",
+ " \"nateglinide\",\n",
+ " \"chlorpropamide\",\n",
+ " \"glimepiride\",\n",
+ " \"acetohexamide\",\n",
+ " \"glipizide\",\n",
+ " \"glyburide\",\n",
+ " \"tolbutamide\",\n",
+ " \"pioglitazone\",\n",
+ " \"rosiglitazone\",\n",
+ " \"acarbose\",\n",
+ " \"miglitol\",\n",
+ " \"troglitazone\",\n",
+ " \"tolazamide\",\n",
+ " \"examide\",\n",
+ " \"citoglipton\",\n",
+ " \"insulin\",\n",
+ " \"glyburide-metformin\",\n",
+ " \"glipizide-metformin\",\n",
+ " \"glimepiride-pioglitazone\",\n",
+ " \"metformin-rosiglitazone\",\n",
+ " \"metformin-pioglitazone\",\n",
+ "]\n",
+ "\n",
+ "# get the number of increased doses per medication and flip the output vertically\n",
+ "increased_meds = rdf.select(\n",
+ " pl.col(x).str.count_match(\"Up\").sum() for x in all_meds\n",
+ ").melt(variable_name=\"medication\", value_name=\"count\")\n",
+ "\n",
+ "# remove any medications that don't have more than 20 rows of data and get this result as a Polars dataframe\n",
+ "increased_meds = increased_meds.filter(pl.col(\"count\") > 20).collect().fetch()\n",
+ "\n",
+ "# convert output to a list via Pandas API\n",
+ "increased_meds = increased_meds.to_pandas()[\"medication\"].tolist()\n",
+ "increased_meds"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "eBulsGoG3Z5I",
+ "outputId": "9ccbd70d-809d-43f1-9aae-4b41592b2232"
+ },
+ "execution_count": 226,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "['metformin',\n",
+ " 'repaglinide',\n",
+ " 'nateglinide',\n",
+ " 'glimepiride',\n",
+ " 'glipizide',\n",
+ " 'glyburide',\n",
+ " 'pioglitazone',\n",
+ " 'rosiglitazone',\n",
+ " 'insulin']"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 226
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We now do exactly the same for decreased medications."
+ ],
+ "metadata": {
+ "id": "GFA1-rz29W0X"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# get the number of decreased doses per medication and flip the output vertically\n",
+ "decreased_meds = rdf.select(\n",
+ " pl.col(x).str.count_match(\"Down\").sum() for x in all_meds\n",
+ ").melt(variable_name=\"medication\", value_name=\"count\")\n",
+ "\n",
+ "# remove any medications that don't have more than 20 rows of data and get this result as a Polars dataframe\n",
+ "decreased_meds = decreased_meds.filter(pl.col(\"count\") > 20).collect().fetch()\n",
+ "\n",
+ "# convert output to a list via Pandas API\n",
+ "decreased_meds = decreased_meds.to_pandas()[\"medication\"].tolist()\n",
+ "decreased_meds"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "Ctt-Bg4f9dVH",
+ "outputId": "5b122acc-26ea-483b-9008-79b121fab3ed"
+ },
+ "execution_count": 227,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "['metformin',\n",
+ " 'repaglinide',\n",
+ " 'glimepiride',\n",
+ " 'glipizide',\n",
+ " 'glyburide',\n",
+ " 'pioglitazone',\n",
+ " 'rosiglitazone',\n",
+ " 'insulin']"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 227
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The next step is to loop over our list of `increased_meds` and get the percentage of patients who were readmitted to hospital during the study after their dose of the drug was increased. We use the `vstack` function to append the result for each drug to a single table.\n",
+ "\n",
+ "We then simply add a column with the list of medicines in the same order and sort the table from lowest to highest."
+ ],
+ "metadata": {
+ "id": "Y_9as5FlqoZV"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# create a null table value for later use\n",
+ "table = None\n",
+ "\n",
+ "# iterate over medications list\n",
+ "for drugs in increased_meds:\n",
+ "    # filter data down to cases where dosage increased\n",
+ "    tmp = rdf.filter(pl.col(drugs) == \"Up\")\n",
+ "    # get a RemoteLazyFrame with the percentage of patients on an increased dose of this drug who were readmitted to hospital during the study\n",
+ "    percentages = tmp.select(\n",
+ "        [\n",
+ "            (\n",
+ "                pl.col(\"is_readmitted\").sum() / pl.col(\"is_readmitted\").count() * 100\n",
+ "            ).alias(\"overall readmitted %\"),\n",
+ "        ]\n",
+ "    )\n",
+ "    # on the first iteration, table is assigned the percentages table\n",
+ "    if table is None:\n",
+ "        table = percentages\n",
+ "    # otherwise we use vstack to add the new row of percentages\n",
+ "    else:\n",
+ "        table = table.vstack(percentages)\n",
+ "\n",
+ "# convert table to Polars dataframe\n",
+ "table = table.collect().fetch()\n",
+ "\n",
+ "# create and add new column with medication names\n",
+ "new_col = pl.Series(\"medication\", increased_meds)\n",
+ "table = table.with_columns([new_col])\n",
+ "table.select([\"medication\", \"overall readmitted %\"]).sort(\n",
+ " pl.col(\"overall readmitted %\")\n",
+ ")"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 340
+ },
+ "id": "aEU6wEvYAhz6",
+ "outputId": "e51f858a-8992-4944-fb4c-74834b950207"
+ },
+ "execution_count": 228,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "shape: (9, 2)\n",
+ "┌───────────────┬──────────────────────┐\n",
+ "│ medication ┆ overall readmitted % │\n",
+ "│ --- ┆ --- │\n",
+ "│ str ┆ f64 │\n",
+ "╞═══════════════╪══════════════════════╡\n",
+ "│ rosiglitazone ┆ 39.88764 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ metformin ┆ 40.76851 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ glimepiride ┆ 42.507645 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ glyburide ┆ 44.211823 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ ... ┆ ... │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ repaglinide ┆ 48.181818 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ glipizide ┆ 50.0 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ insulin ┆ 51.537646 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ pioglitazone ┆ 51.709402 │\n",
+ "└───────────────┴──────────────────────┘"
+ ],
+ "text/html": [
+ "<div>\n",
+ "<small>shape: (9, 2)</small>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "<thead>\n",
+ "<tr><th>medication</th><th>overall readmitted %</th></tr>\n",
+ "<tr><td>str</td><td>f64</td></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><td>&quot;rosiglitazone&quot;</td><td>39.88764</td></tr>\n",
+ "<tr><td>&quot;metformin&quot;</td><td>40.76851</td></tr>\n",
+ "<tr><td>&quot;glimepiride&quot;</td><td>42.507645</td></tr>\n",
+ "<tr><td>&quot;glyburide&quot;</td><td>44.211823</td></tr>\n",
+ "<tr><td>&quot;nateglinide&quot;</td><td>45.833333</td></tr>\n",
+ "<tr><td>&quot;repaglinide&quot;</td><td>48.181818</td></tr>\n",
+ "<tr><td>&quot;glipizide&quot;</td><td>50.0</td></tr>\n",
+ "<tr><td>&quot;insulin&quot;</td><td>51.537646</td></tr>\n",
+ "<tr><td>&quot;pioglitazone&quot;</td><td>51.709402</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 228
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This gives us significant results that merit further exploration: there is a difference of around 12 percentage points between the lowest- and highest-placed drugs on the list in the likelihood of an increased dose being followed by readmission during the course of the study.\n",
+ "\n",
+ "We can run the same query for a reduction in medication, again leading to significant results."
+ ],
+ "metadata": {
+ "id": "PtGj-w4OrU_5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# create a null table value for later use\n",
+ "table = None\n",
+ "\n",
+ "# iterate over medications list\n",
+ "for drugs in decreased_meds:\n",
+ "    # filter data down to cases where dosage decreased\n",
+ "    tmp = rdf.filter(pl.col(drugs) == \"Down\")\n",
+ "    # get a RemoteLazyFrame with the percentage of patients on a decreased dose of this drug who were readmitted to hospital during the study\n",
+ "    percentages = tmp.select(\n",
+ "        [\n",
+ "            (\n",
+ "                pl.col(\"is_readmitted\").sum() / pl.col(\"is_readmitted\").count() * 100\n",
+ "            ).alias(\"overall readmitted %\"),\n",
+ "        ]\n",
+ "    )\n",
+ "    # on the first iteration, table is assigned the percentages table\n",
+ "    if table is None:\n",
+ "        table = percentages\n",
+ "    # otherwise we use vstack to add the new row of percentages\n",
+ "    else:\n",
+ "        table = table.vstack(percentages)\n",
+ "\n",
+ "table = table.collect().fetch()\n",
+ "new_col = pl.Series(\"medication\", decreased_meds)\n",
+ "table = table.with_columns([new_col])\n",
+ "table.select([\"medication\", \"overall readmitted %\"]).sort(\n",
+ " pl.col(\"overall readmitted %\")\n",
+ ")"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 312
+ },
+ "id": "nnGQB49dhwqE",
+ "outputId": "f0f36f25-88fa-452a-a34e-6d65e2d981d2"
+ },
+ "execution_count": 229,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "shape: (8, 2)\n",
+ "┌───────────────┬──────────────────────┐\n",
+ "│ medication ┆ overall readmitted % │\n",
+ "│ --- ┆ --- │\n",
+ "│ str ┆ f64 │\n",
+ "╞═══════════════╪══════════════════════╡\n",
+ "│ rosiglitazone ┆ 31.034483 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ metformin ┆ 45.043478 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ glimepiride ┆ 47.938144 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ glyburide ┆ 48.758865 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ repaglinide ┆ 48.888889 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ insulin ┆ 52.790964 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ glipizide ┆ 52.857143 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ pioglitazone ┆ 53.389831 │\n",
+ "└───────────────┴──────────────────────┘"
+ ],
+ "text/html": [
+ "<div>\n",
+ "<small>shape: (8, 2)</small>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "<thead>\n",
+ "<tr><th>medication</th><th>overall readmitted %</th></tr>\n",
+ "<tr><td>str</td><td>f64</td></tr>\n",
+ "</thead>\n",
+ "<tbody>\n",
+ "<tr><td>&quot;rosiglitazone&quot;</td><td>31.034483</td></tr>\n",
+ "<tr><td>&quot;metformin&quot;</td><td>45.043478</td></tr>\n",
+ "<tr><td>&quot;glimepiride&quot;</td><td>47.938144</td></tr>\n",
+ "<tr><td>&quot;glyburide&quot;</td><td>48.758865</td></tr>\n",
+ "<tr><td>&quot;repaglinide&quot;</td><td>48.888889</td></tr>\n",
+ "<tr><td>&quot;insulin&quot;</td><td>52.790964</td></tr>\n",
+ "<tr><td>&quot;glipizide&quot;</td><td>52.857143</td></tr>\n",
+ "<tr><td>&quot;pioglitazone&quot;</td><td>53.389831</td></tr>\n",
+ "</tbody>\n",
+ "</table>\n",
+ "</div>"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 229
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We see that the first four drugs (rosiglitazone, metformin, glimepiride and glyburide) and last four drugs (repaglinide, insulin, glipizide and pioglitazone) in each list are the same, which could suggest that the first four are more effective treatments overall, regardless of whether their dosage is increased or decreased.\n",
+ "\n",
+ "However, this could also be explained by the first four medications being prescribed for milder diabetes or being the go-to drugs, i.e. the bottom four drugs may only be prescribed when patients are not responding well to other medication. This would need to be considered by the client."
+ ],
+ "metadata": {
+ "id": "DuMXzt7zIHp3"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Conclusions\n",
+ "\n",
+ "This brings us to the end of our data exploration. We gained meaningful insights:\n",
+ "\n",
+ " - 20-30 year olds are the most at-risk age group of short-term hospital readmission and emergency visits.\n",
+ " \n",
+ " - The number of hospital admissions per age group increases with age.\n",
+ "\n",
+ " - Taking patients' A1C levels may encourage doctors to make changes in medication.\n",
+ " \n",
+ " - Not taking patients' A1C levels may increase the likelihood of hospital readmissions.\n",
+ "\n",
+ " - Regardless of dosage increases or decreases, the following medications appear most effective at reducing hospital readmissions: rosiglitazone, metformin, glimepiride and glyburide.\n",
+ "\n",
+ " - Regardless of dosage increases or decreases, the following medications appear less effective at reducing hospital readmissions: repaglinide, insulin, glipizide and pioglitazone.\n",
+ "\n",
+ "\n",
+ "This is a rich dataset with many avenues to explore, so feel free to continue exploring!\n",
+ "\n",
+ "However in our case, that's all we've got time for! Let's close our connection and stop the server. \n",
+ "\n",
+ "(Leave this next block commented if you want to continue to run queries on the dataset instead!)\n"
+ ],
+ "metadata": {
+ "id": "cSU9X1fDvvZQ"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# connection.close()\n",
+ "# bastionlab_server.stop(srv)"
+ ],
+ "metadata": {
+ "id": "xROO5Oxzvev-"
+ },
+ "execution_count": 230,
+ "outputs": []
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "base",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.13 (main, Aug 25 2022, 23:26:10) \n[GCC 11.2.0]"
+ },
+ "orig_nbformat": 4,
+ "vscode": {
+ "interpreter": {
+ "hash": "d130ca42b532f14c740c9405384e6a25814bad609bad1a40b3b3f26954036080"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/mkdocs.yml b/mkdocs.yml
index f2d8b068..22c22bbb 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -107,7 +107,7 @@ nav:
- Data exploration:
- Covid-19 cleaning and exploration: "docs/how-to-guides/covid_cleaning_exploration.ipynb"
- Fraud detection cleaning and exploration: "docs/how-to-guides/fraud_detection.ipynb"
- - Diabetes cleaning and exploration- part one: "docs/how-to-guides/diabetes_p1.ipynb"
+ - Diabetes cleaning and exploration- part one: "docs/how-to-guides/diabetes_exploration.ipynb"
- Deep learning:
- Fine Tuning Distilbert on BastionLab: "docs/how-to-guides/distilbert_example_notebook.ipynb"
- 🛠️ API reference: "docs/resources/bastionlab/index.html"
From b07674dcce6cffe4fcc7b7790a13668f6b1e60a8 Mon Sep 17 00:00:00 2001
From: Knulpinette
Date: Wed, 22 Feb 2023 10:33:46 +0100
Subject: [PATCH 03/22] intro done
---
docs/docs/how-to-guides/diabetes_p1.ipynb | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_p1.ipynb b/docs/docs/how-to-guides/diabetes_p1.ipynb
index 072b6b31..bc64892f 100644
--- a/docs/docs/how-to-guides/diabetes_p1.ipynb
+++ b/docs/docs/how-to-guides/diabetes_p1.ipynb
@@ -8,21 +8,21 @@
},
"source": [
"
\n",
- "
Data exploration of diabetes hospital admissions: Part I
\n",
"______________________________________________________\n",
"\n",
- "Despite major technological breakthroughs in cybersecurity and privacy in recent years, secure off-premises data science collaboration has remained out of reach. This is a major problem for the health sector which has so much to gain from the power of data but also so much at risk when it comes to patients' highly sensitive medical records.\n",
+ "Despite major recent technological breakthroughs in cybersecurity and privacy, secure off-premises data science collaboration has remained out of reach. This is a major problem for the health sector which has so much to gain from the power of data but so much at risk when it comes to patients' highly sensitive records.\n",
"\n",
- "We are on a mission to make remote data science collaboration safe for the health sector. Using BastionLab, data owners can set strict access policies on datasets for collaborators, allowing them to run privacy-friendly queries and train and deploy ML models on datasets whilst blocking access to raw data.\n",
+ "BastionLab's goal is to make this issue disappear so that remote data science collaborations can happen safely in the medical industry. Its framework lets data owners set strict access policies on datasets for collaborators and ensures that data scientists can explore data or train ML models without ever accessing the raw data.\n",
"\n",
- "In this how-to guide, we will explore a dataset of diabetic patients admitted to hospital in the US over a ten year period. Diabetes is a disease that affects over 10% of the US population and can lead to serious health complications. The dataset contains 51 columns of data, including readmission to hospital, changes to medication and primary, secondary and terciary patient diagnoses.\n",
+ "In this guide, we will explore a dataset of diabetic patients admitted to hospital in the US over a ten-year period. Diabetes is a disease that affects over 10% of the US population and can lead to serious health complications. The dataset contains 51 columns of data, including readmission to hospital, changes to medication and primary, secondary and tertiary patient diagnoses.\n",
"\n",
- "In part I of this two-part data exploration. We will see how the data owner can upload a dataset to BastionLab and how a data scientist can then connect to BastionLab and **clean the dataset**.\n",
+ "First, we will see how the data owner can upload a dataset to BastionLab and how a data scientist can then connect to BastionLab and **clean the dataset**. Then we'll go on to analyse it, showing that it is possible to do normal data science work without accessing the data in the clear.\n",
"\n",
- "But before we can do that, we first need to get everything set up!\n",
+ "But before we can do that, let's get everything set up!\n",
"\n",
"## Pre-requisites\n",
"___________________________________________\n",
From cda26b1ba23432d38cb97dc03c60da825cbd89ef Mon Sep 17 00:00:00 2001
From: Knulpinette
Date: Wed, 22 Feb 2023 10:38:33 +0100
Subject: [PATCH 04/22] saving progress
---
docs/docs/how-to-guides/diabetes_p1.ipynb | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_p1.ipynb b/docs/docs/how-to-guides/diabetes_p1.ipynb
index bc64892f..1429a544 100644
--- a/docs/docs/how-to-guides/diabetes_p1.ipynb
+++ b/docs/docs/how-to-guides/diabetes_p1.ipynb
@@ -51,32 +51,34 @@
"!pip install bastionlab\n",
"!pip install bastionlab_server\n",
"\n",
- "# dowloading the dataset using Google Drive tool dgown\n",
+ "# downloading the dataset using Google Drive tool gdown\n",
"!pip install gdown\n",
"!pip install --upgrade --no-cache-dir gdown\n",
"!gdown --id \"1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI\""
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "NJ67chDB_G1l"
},
"source": [
- "The dataset we are using for this how-to guide is based on the Diabetes 130-US hospitals for years 1999-2008 dataset. It contains 10 years of data on diabetes admissions from 130 US hospitals. It includes over 50 features representing patient and hospital outcomes.\n",
+ "The dataset we are using for this how-to guide is based on the Diabetes 130-US hospitals for years 1999-2008 dataset. It contains 10 years of information on diabetes admissions from 130 US hospitals. It includes over 50 features representing patient and hospital outcomes.\n",
"\n",
">For more detailed information on the dataset, you can check out the description and full dataset by following this [link](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).\n",
"\n",
- "However, this dataset had already been pre-processed before publication which stopped us from showing you some key data cleaning steps. We therefore made a few modifications to replace some pre-grouped data columns with randomly populated data. You can check out exactly how we did this using Polars [here](https://colab.research.google.com/drive/174EJvK8u8mGGWb6ypLH9SKaeRnX-pEou?usp=share_link). "
+ "This dataset had already been pre-processed before publication which stopped us from showing you some key data cleaning steps. We made a few modifications to replace some pre-grouped data columns with randomly populated data. You can check out exactly how we did this using the Polars data science library [here](https://colab.research.google.com/drive/174EJvK8u8mGGWb6ypLH9SKaeRnX-pEou?usp=share_link). "
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "OjL01I5c_G1m"
},
"source": [
- "## Data owner's POV\n",
+ "## Data owner's side\n",
"___________________________________________\n",
"\n",
"### Launching the server\n",
From 69f1ee9e58cbda43b92a0d5645ef2d26fa87bde3 Mon Sep 17 00:00:00 2001
From: Knulpinette
Date: Wed, 22 Feb 2023 10:41:04 +0100
Subject: [PATCH 05/22] changed modifications to right document
---
.../how-to-guides/diabetes_exploration.ipynb | 1012 +++++++++--------
docs/docs/how-to-guides/diabetes_p1.ipynb | 26 +-
2 files changed, 520 insertions(+), 518 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index 29dfaab5..ff161fed 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -1,27 +1,28 @@
{
"cells": [
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "jdvo0Bjb_G1c"
},
"source": [
"
\n",
- "
Data exploration of diabetes hospital admissions: Part I
\n",
"______________________________________________________\n",
"\n",
- "Despite major technological breakthroughs in cybersecurity and privacy in recent years, secure off-premises data science collaboration has remained out of reach. This is a major problem for the health sector which has so much to gain from the power of data but also so much at risk when it comes to patients' highly sensitive medical records.\n",
+ "Despite major recent technological breakthroughs in cybersecurity and privacy, secure off-premises data science collaboration has remained out of reach. This is a major problem for the health sector which has so much to gain from the power of data but so much at risk when it comes to patients' highly sensitive records.\n",
"\n",
- "We are on a mission to make remote data science collaboration safe for the health sector. Using BastionLab, data owners can set strict access policies on datasets for collaborators, allowing them to run privacy-friendly queries and train and deploy ML models on datasets whilst blocking access to raw data.\n",
+ "BastionLab's goal is to make this issue disappear so that remote data science collaborations can happen safely in the medical industry. Its framework lets data owners set strict access policies on datasets for collaborators and ensures that data scientists can explore data or train ML models without ever accessing the raw data.\n",
"\n",
- "In this how-to guide, we will explore a dataset of diabetic patients admitted to hospital in the US over a ten year period. Diabetes is a disease that affects over 10% of the US population and can lead to serious health complications. The dataset contains 51 columns of data, including readmission to hospital, changes to medication and primary, secondary and terciary patient diagnoses.\n",
+ "In this guide, we will explore a dataset of diabetic patients admitted to hospital in the US over a ten-year period. Diabetes is a disease that affects over 10% of the US population and can lead to serious health complications. The dataset contains 51 columns of data, including readmission to hospital, changes to medication and primary, secondary and tertiary patient diagnoses.\n",
"\n",
- "In part I of this two-part data exploration. We will see how the data owner can upload a dataset to BastionLab and how a data scientist can then connect to BastionLab and **clean the dataset**.\n",
+ "First, we will see how the data owner can upload a dataset to BastionLab and how a data scientist can then connect to BastionLab and **clean the dataset**. Then we'll go on to analyse it, showing that it is possible to do normal data science work without accessing the data in the clear.\n",
"\n",
- "But before we can do that, we first need to get everything set up!\n",
+ "But before we can do that, let's get everything set up!\n",
"\n",
"## Pre-requisites\n",
"___________________________________________\n",
@@ -50,23 +51,24 @@
"!pip install bastionlab\n",
"!pip install bastionlab_server\n",
"\n",
- "# dowloading the dataset using Google Drive tool dgown\n",
+ "# downloading the dataset using Google Drive tool gdown\n",
"!pip install gdown\n",
"!pip install --upgrade --no-cache-dir gdown\n",
"!gdown \"1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI\""
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "NJ67chDB_G1l"
},
"source": [
- "The dataset we are using for this how-to guide is based on the Diabetes 130-US hospitals for years 1999-2008 dataset. It contains 10 years of data on diabetes admissions from 130 US hospitals. It includes over 50 features representing patient and hospital outcomes.\n",
+ "The dataset we are using for this how-to guide is based on the Diabetes 130-US hospitals for years 1999-2008 dataset. It contains 10 years of information on diabetes admissions from 130 US hospitals. It includes over 50 features representing patient and hospital outcomes.\n",
"\n",
">For more detailed information on the dataset, you can check out the description and full dataset by following this [link](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008).\n",
"\n",
- "However, this dataset had already been pre-processed before publication which stopped us from showing you some key data cleaning steps. We therefore made a few modifications to replace some pre-grouped data columns with randomly populated data. You can check out exactly how we did this using Polars [here](https://colab.research.google.com/drive/174EJvK8u8mGGWb6ypLH9SKaeRnX-pEou?usp=share_link). "
+ "This dataset had already been pre-processed before publication which stopped us from showing you some key data cleaning steps. We made a few modifications to replace some pre-grouped data columns with randomly populated data. You can check out exactly how we did this using the Polars data science library [here](https://colab.research.google.com/drive/174EJvK8u8mGGWb6ypLH9SKaeRnX-pEou?usp=share_link). "
]
},
{
@@ -99,8 +101,8 @@
},
"outputs": [
{
- "output_type": "stream",
"name": "stdout",
+ "output_type": "stream",
"text": [
"BastionLab server (version 0.3.7) already installed\n",
"Libtorch (version 1.13.1) already installed\n",
@@ -207,8 +209,8 @@
},
"outputs": [
{
- "output_type": "stream",
"name": "stdout",
+ "output_type": "stream",
"text": [
"63c8152d-f5af-41ec-b22c-aea51a8465b5\n"
]
@@ -261,16 +263,16 @@
"cell_type": "code",
"execution_count": 197,
"metadata": {
- "id": "C7j4vdDd_G10",
"colab": {
"base_uri": "https://localhost:8080/"
},
+ "id": "C7j4vdDd_G10",
"outputId": "7941b960-a0e4-4e9d-f0a4-13ef5c9ba296"
},
"outputs": [
{
- "output_type": "stream",
"name": "stdout",
+ "output_type": "stream",
"text": [
"\u001b[31mThe query has been rejected by the data owner.\u001b[37m\n"
]
@@ -328,22 +330,22 @@
"cell_type": "code",
"execution_count": 199,
"metadata": {
- "id": "TT3mSjII_G13",
"colab": {
"base_uri": "https://localhost:8080/"
},
+ "id": "TT3mSjII_G13",
"outputId": "4a463355-2753-40d6-ce62-a2c8fa30c63a"
},
"outputs": [
{
- "output_type": "execute_result",
"data": {
"text/plain": [
"FetchableLazyFrame(identifier=63c8152d-f5af-41ec-b22c-aea51a8465b5)"
]
},
+ "execution_count": 199,
"metadata": {},
- "execution_count": 199
+ "output_type": "execute_result"
}
],
"source": [
@@ -368,16 +370,16 @@
"cell_type": "code",
"execution_count": 200,
"metadata": {
- "id": "G-g8rOnj_G15",
"colab": {
"base_uri": "https://localhost:8080/"
},
+ "id": "G-g8rOnj_G15",
"outputId": "797dedc6-f4c5-4bb7-8830-3c2b8295fbbc"
},
"outputs": [
{
- "output_type": "stream",
"name": "stdout",
+ "output_type": "stream",
"text": [
"['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted']\n"
]
@@ -541,39 +543,16 @@
"cell_type": "code",
"execution_count": 204,
"metadata": {
- "id": "Pzz5qvSJWd2V",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 285
},
+ "id": "Pzz5qvSJWd2V",
"outputId": "d0316533-6304-4dce-8357-e0caa0d897da"
},
"outputs": [
{
- "output_type": "execute_result",
"data": {
- "text/plain": [
- "shape: (7, 2)\n",
- "┌───────────────┬─────────────────┐\n",
- "│ column name ┆ null values (%) │\n",
- "│ --- ┆ --- │\n",
- "│ str ┆ f64 │\n",
- "╞═══════════════╪═════════════════╡\n",
- "│ max_glu_serum ┆ 94.746772 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ A1Cresult ┆ 83.277322 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ readmitted ┆ 53.911916 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ race ┆ 2.233555 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ diag_3 ┆ 1.398306 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ diag_2 ┆ 0.351787 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ diag_1 ┆ 0.020636 │\n",
- "└───────────────┴─────────────────┘"
- ],
"text/html": [
"
\n",
"\n",
- "
\n",
- "shape: (7, 2)\n",
- "\n",
- "
\n",
- "
\n",
- "column name\n",
- "
\n",
- "
\n",
- "null values (%)\n",
- "
\n",
- "
\n",
- "
\n",
- "
\n",
- "str\n",
- "
\n",
- "
\n",
- "f64\n",
- "
\n",
- "
\n",
- "\n",
- "\n",
- "
\n",
- "
\n",
- ""max_glu_serum"\n",
- "
\n",
- "
\n",
- "94.746772\n",
- "
\n",
- "
\n",
- "
\n",
- "
\n",
- ""A1Cresult"\n",
- "
\n",
- "
\n",
- "83.277322\n",
- "
\n",
- "
\n",
- "
\n",
- "
\n",
- ""readmitted"\n",
- "
\n",
- "
\n",
- "53.911916\n",
- "
\n",
- "
\n",
- "
\n",
- "
\n",
- ""race"\n",
- "
\n",
- "
\n",
- "2.233555\n",
- "
\n",
- "
\n",
- "
\n",
- "
\n",
- ""diag_3"\n",
- "
\n",
- "
\n",
- "1.398306\n",
- "
\n",
- "
\n",
- "
\n",
- "
\n",
- ""diag_2"\n",
- "
\n",
- "
\n",
- "0.351787\n",
- "
\n",
- "
\n",
- "
\n",
- "
\n",
- ""diag_1"\n",
- "
\n",
- "
\n",
- "0.020636\n",
- "
\n",
- "
\n",
- "\n",
- "
\n",
- "
"
- ],
- "text/plain": [
- "shape: (7, 2)\n",
- "┌───────────────┬─────────────────┐\n",
- "│ column name ┆ null values (%) │\n",
- "│ --- ┆ --- │\n",
- "│ str ┆ f64 │\n",
- "╞═══════════════╪═════════════════╡\n",
- "│ max_glu_serum ┆ 94.746772 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ A1Cresult ┆ 83.277322 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ readmitted ┆ 53.911916 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ race ┆ 2.233555 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ diag_3 ┆ 1.398306 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ diag_2 ┆ 0.351787 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ diag_1 ┆ 0.020636 │\n",
- "└───────────────┴─────────────────┘"
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# melt table to a two-column table with the column name 'column' and corresponding percetage of null values 'null values', sort in descending order and display\n",
- "percent_missing = percent_missing.melt(\n",
- " variable_name=\"column name\",\n",
- " value_name=\"null values (%)\",\n",
- ").sort(pl.col(\"null values (%)\"), reverse=True)\n",
- "\n",
- "# filter out columns with no null values and display\n",
- "percent_missing.filter(pl.col(\"null values (%)\") > 0).collect().fetch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "4n0jnBPyYLjf"
- },
- "source": [
- "There are several strategies for dealing with null values such as deleting these rows from the dataset with the `drop_nulls` method or filling null values with the `fill_null` method. But in our case, we are just happy to have visibility over which columns including null values and to what extent so that we can handle and analyse these columns with this in mind."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "-WUugovwve2c"
- },
- "source": [
- "### Grouping data: ICD-9 medical codes\n",
- "Grouping data is going to be the largest and most crucial task in this data cleaning job. This is a dataset with a low of wide-ranging numerical values which need to be grouped so that our data analysts can gain meaningul insights.\n",
- "\n",
- "Let's start with our diagnoses columns: `diag_1`, `diag_2` and `diag_3`.\n",
- "\n",
- "These columns contain the primary, secondary and terciary diagnoses given to patients. These diagnoses are given using [ICD-9 medical codes](https://en.wikipedia.org/wiki/List_of_ICD-9_codes) which are three digit codes ranging from 1 to 1000, as well as E800–E999 codes and V01–V82 codes.\n",
- "\n",
- "By grabbing all the unique values in the `diag_1` column and counting them, we can see that we have over 700 different values in this column!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 121
- },
- "id": "7pVHpmLWj6_w",
- "outputId": "c7d50a9f-f919-4893-a1f4-50b1ba7d20c5"
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- "shape: (1, 1)\n",
- "\n",
- "
\n",
- "
\n",
- "diag_1\n",
- "
\n",
- "
\n",
- "
\n",
- "
\n",
- "u32\n",
- "
\n",
- "
\n",
- "\n",
- "\n",
- "
\n",
- "
\n",
- "717\n",
- "
\n",
- "
\n",
- "\n",
- "
\n",
- "
"
- ],
- "text/plain": [
- "shape: (1, 1)\n",
- "┌────────┐\n",
- "│ diag_1 │\n",
- "│ --- │\n",
- "│ u32 │\n",
- "╞════════╡\n",
- "│ 717 │\n",
- "└────────┘"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tmp = rdf.select(\"diag_1\").unique()\n",
- "tmp.select(pl.col(\"diag_1\").count()).collect().fetch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "cPsmfkBpkPCv"
- },
- "source": [
- "Standard groupings of these codes have already been designed. What we want to do is replace the hundreds of unique codes we have in our our diagnoses columns with these groupings!\n",
- "\n",
- "To do this, we will again use Polars `when().then().otherwise()` functions to perform a find and replace operation. We will use `when()` to check if the codes in each cell are either E or V codes or fall within a certain numerical range.\n",
- "\n",
- "However, these diagnoses columns are currently string columns, since the E and V codes are not entirely numerical. This is problematic since we cannot perform numerical comparisons on these cells and we cannot convert the column type to a numerical one because of these 'E' and 'V' values!\n",
- "\n",
- "We will solve this problem in three steps:\n",
- "\n",
- "1) We will find and replace all E codes with a \"-1\" value and V codes with a \"-2\" value.\n",
- "\n",
- "2) We will `select()` our columns and `cast()` all values in these columns to float values.\n",
- "\n",
- "3) We will perform the find and replace operation to group all ICD-9 codes into their associated group- of which there are 17, plus E codes and V codes."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "xPNFpZ7lW8qR"
- },
- "outputs": [],
- "source": [
- "# iterate over the three diagnoses columns\n",
- "for col in [\"diag_1\", \"diag_2\", \"diag_3\"]:\n",
- " # step one: replace troublesome E and V codes with temporary -1 and -2 codes\n",
- " rdf = rdf.with_columns(\n",
- " [\n",
- " pl.when(\n",
- " pl.col(col).str.starts_with(\"E\")\n",
- " ) # use Polars str.starts_with method to identify E codes\n",
- " .then(\"-1\")\n",
- " .when(pl.col(col).str.starts_with(\"V\"))\n",
- " .then(\"-2\")\n",
- " .otherwise(pl.col(col))\n",
- " .keep_name()\n",
- " ]\n",
- " )\n",
- "\n",
- " # step two: cast all values in column to float values\n",
- " rdf = rdf.with_columns([pl.col(col).cast(pl.Float64)])\n",
- "\n",
- " # step three: replace all codes with their corresponding group\n",
- " rdf = rdf.with_columns(\n",
- " [\n",
- " pl.when(pl.col(col) >= 800)\n",
- " .then(\"injury and poisoning\")\n",
- " .when(pl.col(col) >= 780)\n",
- " .then(\"symptoms, signs & ill-defined\")\n",
- " .when(pl.col(col) >= 760)\n",
- " .then(\"perinatal\")\n",
- " .when(pl.col(col) >= 740)\n",
- " .then(\"congenital anomalies\")\n",
- " .when(pl.col(col) >= 710)\n",
- " .then(\"musculoskeletal & connective tissue\")\n",
- " .when(pl.col(col) >= 680)\n",
- " .then(\"skin\")\n",
- " .when(pl.col(col) >= 630)\n",
- " .then(\"pregnancy, childbirth and peurperium\")\n",
- " .when(pl.col(col) >= 580)\n",
- " .then(\"genitourinary\")\n",
- " .when(pl.col(col) >= 520)\n",
- " .then(\"digestive\")\n",
- " .when(pl.col(col) >= 460)\n",
- " .then(\"respiratory\")\n",
- " .when(pl.col(col) >= 390)\n",
- " .then(\"circulatory\")\n",
- " .when(pl.col(col) >= 320)\n",
- " .then(\"nervous system and sense organs\")\n",
- " .when(pl.col(col) >= 290)\n",
- " .then(\"mental disorders\")\n",
- " .when(pl.col(col) >= 280)\n",
- " .then(\"blood and blood-forming organs\")\n",
- " .when(pl.col(col) >= 240)\n",
- " .then(\"neoplasms\")\n",
- " .when(pl.col(col) >= 140)\n",
- " .then(\"endocrine, nutritional, metabolic and immunity\")\n",
- " .when(pl.col(col) >= 1)\n",
- " .then(\"infectious and parasitic\")\n",
- " .when(pl.col(col) == -1)\n",
- " .then(\"E code (injury\")\n",
- " .when(pl.col(col) == -2)\n",
- " .then(\"V code (other)\")\n",
- " .otherwise(\n",
- " pl.col(col)\n",
- " ) # otherwise (null values) keep original value from the column\n",
- " .alias(\n",
- " col\n",
- " ) # give resulting column same name as previously- therefore replacing old columns\n",
- " ]\n",
- " )"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "P1MquUrNlXDO"
- },
- "source": [
- "By performing the same query as previously to count `diag_1`'s unique values, we see there is now a much more manageable 19 labels in our data column! This will be similar for the `diag_2` and `diag_3` columns."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 121
- },
- "id": "YfC9CmWWdu0n",
- "outputId": "c81284d2-8e09-49b6-f411-512da2421902"
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- "shape: (1, 1)\n",
- "\n",
- "
\n",
- "
\n",
- "diag_1\n",
- "
\n",
- "
\n",
- "
\n",
- "
\n",
- "u32\n",
- "
\n",
- "
\n",
- "\n",
- "\n",
- "
\n",
- "
\n",
- "19\n",
- "
\n",
- "
\n",
- "\n",
- "
\n",
- "
"
- ],
- "text/plain": [
- "shape: (1, 1)\n",
- "┌────────┐\n",
- "│ diag_1 │\n",
- "│ --- │\n",
- "│ u32 │\n",
- "╞════════╡\n",
- "│ 19 │\n",
- "└────────┘"
- ]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tmp = rdf.select(\"diag_1\").unique()\n",
- "tmp.select(pl.col(\"diag_1\").count()).collect().fetch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "BvdGu7GmsZVu"
- },
- "source": [
- "### Grouping data: A1C, max glucose levels and readmittance\n",
- "\n",
- "We want to group together data in another three other columns using the same `.then().when().otherwise()` methods.\n",
- "\n",
- "The first two are `A1Cresult`, which contains patients' HbA1c level, and `max_glu_serum`, which contains their blood glucose level. We want to group these into `very high`, `high`, `normal` groups based on levels defined in our project brief.\n",
- "\n",
- "These columns are both currently string columns, so we will also need to convert them to float values in order to perform numerical comparisons on them."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "FgyrnPAlsZ0u"
- },
- "outputs": [],
- "source": [
- "# cast `max_glu_serum` and `A1Cresult` columns to float values\n",
- "rdf = rdf.with_columns(\n",
- " [pl.col(\"max_glu_serum\").cast(pl.Float64), pl.col(\"A1Cresult\").cast(pl.Float64)]\n",
- ")\n",
- "\n",
- "# group values in A1Cresult column\n",
- "rdf = rdf.with_columns(\n",
- " [\n",
- " pl.when(pl.col(\"A1Cresult\") >= 8)\n",
- " .then(\"very high\")\n",
- " .when(pl.col(\"A1Cresult\") >= 7)\n",
- " .then(\"high\")\n",
- " .when(pl.col(\"A1Cresult\") >= 0)\n",
- " .then(\"normal\")\n",
- " .otherwise(pl.col(\"A1Cresult\"))\n",
- " .keep_name()\n",
- " ]\n",
- ")\n",
- "\n",
- "# group values in max_glu_serum column\n",
- "rdf = rdf.with_columns(\n",
- " [\n",
- " pl.when(pl.col(\"max_glu_serum\") >= 300)\n",
- " .then(\"very high\")\n",
- " .when(pl.col(\"max_glu_serum\") >= 200)\n",
- " .then(\"high\")\n",
- " .when(pl.col(\"max_glu_serum\") >= 0)\n",
- " .then(\"normal\")\n",
- " .otherwise(pl.col(\"max_glu_serum\"))\n",
- " .keep_name()\n",
- " ]\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Buu2nja5w6Db"
- },
- "source": [
- "The final column we want to group is the `readmitted` column which records the number of days before any further re-hospitalization linked to the patients' diabetic condition.\n",
- "\n",
- "We will group this column into `short-term` and `long-term` and `n/a` (not applicable) groups.\n",
- "\n",
- "Simiar to in previous examples, we must first convert values in this column from strings to integer values."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "9cca9DhSw6We"
- },
- "outputs": [],
- "source": [
- "# cast readmitted column to integer values\n",
- "rdf = rdf.with_columns([pl.col(\"readmitted\").cast(pl.Int64)])\n",
- "\n",
- "# group values\n",
- "rdf = rdf.with_columns(\n",
- " [\n",
- " pl.when(pl.col(\"readmitted\") < 31)\n",
- " .then(\"short-term\")\n",
- " .when(pl.col(\"readmitted\") >= 31)\n",
- " .then(\"long-term\")\n",
- " .otherwise(\"n/a\")\n",
- " .keep_name()\n",
- " ]\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "kuwxAGYBoOQJ"
- },
- "source": [
- "### Grouping data: binning ages\n",
- "The next grouping task we will perform is to group ages into intervals of 10 years. We do this both to increase data privacy and to more easily draw correlations linked to broader age groups.\n",
- "\n",
- "We won't need to perform an `when().then().otherwise()` query here since BastionLab has its own `ApplyBins` tool.\n",
- "\n",
- "`ApplyBins` is a PyTorch module and the grouping of numbers takes place in its `forward` function. We can pass PyTorch modules to BastionLab's `apply_udf` function which will apply the `forward` function to any specified columns.\n",
- "\n",
- "All in all, we just three steps to bin our age column data:\n",
- "\n",
- "1) We import `ApplyBins` from `bastionlab.polars.utils`.\n",
- "1) We instantiate our `ApplyBins` PyTorch module class with our bins interval given as the only argument.\n",
- "2) We use `apply_udf`, providing a list of the column we want to modify and the PyTorch module, `ApplyBins`, that we wish to apply to these columns."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "2EC3smnWc06Q"
- },
- "outputs": [],
- "source": [
- "from bastionlab.polars.utils import ApplyBins\n",
- "\n",
- "# get an instance of ApplyBins module which will bin data into groups of 10\n",
- "model = ApplyBins(10)\n",
- "\n",
- "# apply bins to \"age\" column\n",
- "rdf = rdf.apply_udf([\"age\"], model)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "1pOQYPYSsVns"
- },
- "source": [
- "> Note, you can create your own custom PyTorch modules and apply them to columns using `apply_udf`. This is BastionLab's way of allowing you to apply custom functions on datasets, whilst restricting what you can do for security reasons. Functionality like `lambda`, `map` and `apply` are blocked by BastionLab as they are too permissive and could be misused."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "gYRVmqTitckT"
- },
- "source": [
- "### Adding columns\n",
- "\n",
- "Up until this point we have been using the `.when().then().otherwise()` and `with_columns` methods to make changes to existing columns, but by providing a new column name to the `alias` method, we can create a new column.\n",
- "\n",
- "In the following example, we will create a `is_readmitted` column which will store `False` for all the \"n/a\" values in our original `readmitted` column and `True` for any other values. This will allow us to quickly query whether certain groups of data have been readmitted or not!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "B2JGdBhmteAz"
- },
- "outputs": [],
- "source": [
- "rdf = rdf.with_columns(\n",
- " [\n",
- " pl.when(pl.col(\"readmitted\") == \"n/a\")\n",
- " .then(False)\n",
- " .otherwise(True)\n",
- " .alias(\n",
- " \"is_readmitted\"\n",
- " ) # ending the .when().then().otherwise() pattern with .alias() allows us to provide a new column name\n",
- " ]\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "edoL2_uy_G19"
- },
- "source": [
- "### Converting column types\n",
- "\n",
- "We have already seen examples where we have `explicity` converted the datatype of our columns using the `cast` method. Here we will `implicity` convert the datatype by replacing the \"yes\" and \"no\" values in our `change` column, which represent whether a patient's medication has been changed, to a boolean True or False value. \n",
- "\n",
- "The datatype of this column will be changed automatically by this operation as we can see below."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "fMhSrD8__G19",
- "outputId": "5230be79-58b9-4318-c5bb-052cd03e35d1"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[polars.datatypes.Utf8]"
- ]
- },
- "execution_count": 21,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# print out initial datatype of \"change\" column\n",
- "\n",
- "rdf.select(\"change\").dtypes"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "hYWJ9FB70mcM",
- "outputId": "cc2736c7-e4be-48dd-805d-352ba0d6196e"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[polars.datatypes.Boolean]"
- ]
- },
- "execution_count": 22,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# replaces Yes/No values with True/False\n",
- "rdf = rdf.with_columns(\n",
- " [pl.when(pl.col(\"change\") == \"No\").then(False).otherwise(True).keep_name()]\n",
- ")\n",
- "\n",
- "# print out datatype of column post find and replace operation\n",
- "rdf.select(\"change\").dtypes"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "CYS-Mkl1tD8t"
- },
- "source": [
- "### Saving our RemoteLazyFrame and disconnecting\n",
- "\n",
- "Our dataframe is all clean and ready for the next step: data analysis/ visualization. Data scientist #1 is going to be reassigned to another task. They will save their cleaned RemoteLazyFrame and make a note of the identifier to share with data scientist #2.\n",
- "\n",
- "We need to perform `collect()` before saving or getting an identifier for our RemoteLazyFrame since the `save` method and `identifier` attribute are only available for FetchableLazyFrames.\n",
- "\n",
- ">Note, the data owner must have set the `savable` option to `True` when uploading the dataframe for this operation to be possible!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 35
- },
- "id": "DWu6ToX53bm9",
- "outputId": "3063c7ae-df03-4b74-d7a3-e2ceffc56083"
- },
- "outputs": [
- {
- "data": {
- "application/vnd.google.colaboratory.intrinsic+json": {
- "type": "string"
- },
- "text/plain": [
- "'49b66d7a-6c80-45fb-8278-9992c91f8666'"
- ]
- },
- "execution_count": 23,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "rdf.collect().save()\n",
- "saved_identifier = rdf.collect().identifier\n",
- "saved_identifier"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "NgkiBinG6DJ2"
- },
- "source": [
- "They can now close their connection to the BastionLab server."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "qoiADM1W6OC_"
- },
- "outputs": [],
- "source": [
- "connection.close()"
- ]
- }
- ],
- "metadata": {
- "colab": {
- "provenance": []
- },
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- },
- "orig_nbformat": 4,
- "vscode": {
- "interpreter": {
- "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
-}
From 2f061a97d281a170b4bf0d40bbd40a5d7bd687ab Mon Sep 17 00:00:00 2001
From: Knulpinette
Date: Wed, 22 Feb 2023 11:00:17 +0100
Subject: [PATCH 09/22] changed mkdocs order and name
---
docs/docs/how-to-guides/diabetes_exploration.ipynb | 10 +++++++---
mkdocs.yml | 2 +-
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index fe25f454..ce04ad57 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -236,6 +236,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "ywAyp-2y_G1y"
@@ -243,11 +244,14 @@
"source": [
"`send_df()` will return a FetchableLazyFrame instance, which we will work with directly from now on. \n",
"\n",
- ">Note that we talk about two types of LazyFrames in BastionLab: `RemoteLazyFrames` and `FetchableLazyFrames`. \n",
+ "
\n",
+ "
Note
\n",
+ "
We talk about two types of LazyFrames in BastionLab: `RemoteLazyFrames` and `FetchableLazyFrames`. \n",
"\n",
- "> In BastionLab, when we run a query, it is not immediately executed. Like with Polar's LazyFrames, pending queries are only executed when we call `collect`. `FetchableLazyFrames` are BastionLab's remote lazy frames when there are no pending queries to run, either because we have just uploaded or got the dataframe using `get_df` or because we have already ran `collect` after our latest query. To display these lazy frames we call the `fetch` method, which will verify that the data frame is safe to display, i.e. is it the result of a safe aggregated query as specified in the privacy policy.\n",
+ "In BastionLab, when we run a query, it is not immediately executed. Like with Polar's LazyFrames, pending queries are only executed when we call `collect`. `FetchableLazyFrames` are BastionLab's remote lazy frames when there are no pending queries to run, either because we have just uploaded or got the dataframe using `get_df` or because we have already ran `collect` after our latest query. To display these lazy frames we call the `fetch` method, which will verify that the data frame is safe to display, i.e. is it the result of a safe aggregated query as specified in the privacy policy.\n",
"\n",
- "> A `RemoteLazyFrame` is just a `FetchableLazyFrame` with pending queries still to run (as they have not yet been `collected`). When we call `collect()` these operations are run server-side and the result of this is our `FetchableLazyFrame`."
+ "A `RemoteLazyFrame` is just a `FetchableLazyFrame` with pending queries still to run (as they have not yet been `collected`). When we call `collect()` these operations are run server-side and the result of this is our `FetchableLazyFrame`.
\n",
+ "
"
]
},
{
diff --git a/mkdocs.yml b/mkdocs.yml
index 1809e971..e387fb3d 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -112,8 +112,8 @@ nav:
- 🌍 How-to-guides:
- Data exploration:
- Covid-19 cleaning and exploration: "docs/how-to-guides/covid_cleaning_exploration.ipynb"
+ - Diabetes cleaning and exploration: "docs/how-to-guides/diabetes_exploration.ipynb"
- Fraud detection cleaning and exploration: "docs/how-to-guides/fraud_detection.ipynb"
- - Diabetes cleaning and exploration- part one: "docs/how-to-guides/diabetes_exploration.ipynb"
- Deep learning:
- Fine Tuning Distilbert on BastionLab: "docs/how-to-guides/distilbert_example_notebook.ipynb"
- 💡 Concepts:
From 34dbd306caaae0467d7f3a022cec34afefc8109d Mon Sep 17 00:00:00 2001
From: Knulpinette
Date: Wed, 22 Feb 2023 11:06:38 +0100
Subject: [PATCH 10/22] saving changes
---
docs/docs/how-to-guides/diabetes_exploration.ipynb | 2 +-
mkdocs.yml | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index ce04ad57..ac6a2026 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -8,7 +8,7 @@
},
"source": [
"
\n",
+ "\n",
"- [Download the dataset](https://drive.google.com/file/d/1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI/view?usp=share_link) we will be using in this notebook.\n",
"\n",
"You can download the BastionLab pip packages and the dataset by running the following code block.\n",
@@ -83,9 +89,9 @@
"\n",
"### Launching the server\n",
"\n",
- "Let's start by putting ourselves in the shoes of the data owner.\n",
+ "Let's start by putting ourselves in the shoes of the data owner. \n",
"\n",
- "But first, let's get the BastionLab server running.\n",
+ "First, we need to get the BastionLab server running.\n",
"\n",
"In production we recommend this is done using our Docker image, but for testing purposes you can use our `bastionlab_server` package, which removes the need for user authentication."
]
@@ -120,6 +126,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "IBWNyTnz_G1p"
@@ -128,7 +135,8 @@
">*For more details on how you can set up the server using our Docker image, check out our [Installation Tutorial](../getting-started/installation.md).*\n",
"\n",
"### Connecting to the server\n",
- "Next, we will connect to the server in order to be able to upload the dataset."
+ "\n",
+ "Then we connect to the server to upload the dataset."
]
},
{
@@ -155,15 +163,17 @@
"source": [
"### Creating a custom privacy policy\n",
"\n",
- "We can now create a [custom access policy](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/) for the dataset which determines how much access collaborators will get to the dataset. \n",
+ "On to the fun parts!\n",
"\n",
- "In this example, we create a policy with the following configuration:\n",
+ "BastionLab's main feature is that it lets you create a [custom access policy](https://bastionlab.readthedocs.io/en/latest/docs/tutorials/defining_policy_privacy/) for the dataset. It determines how much access collaborators will get, so it needs to be carefully set up. \n",
+ "\n",
+ "For this guide, we create a policy with the following configuration:\n",
"\n",
"-> `Aggregation(min_agg_size=10):` Any data extracted from the dataset should be the result of an aggregation of at least ten rows.\n",
"\n",
"-> `unsafe_handling=Reject()`: Any attempted query which breaches this policy will be rejected by the server.\n",
"\n",
- "-> `savable=True`: The data scientist can save changes made to the dataset in BastionLab (this will create a new dataset - it will not overwrite the original dataset).\n"
+ "-> `savable=True`: The data scientist can save changes made to the dataset in BastionLab. (This will create a new dataset. It will *not* overwrite the original dataset.)\n"
]
},
{
@@ -181,6 +191,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Q7HHSM3e_G1v"
@@ -190,13 +201,13 @@
"\n",
"Now that the policy has been created, we can upload the dataset to the BastionLab server instance.\n",
"\n",
- "Firstly, we need to convert our CSV file into a Polars DataFrame by using the Polars `read_csv` function, supplying the path to the CSV file as a string argument.\n",
+ "We need to convert our CSV file into a Polars DataFrame by using the Polars `read_csv` function, supplying the path to the CSV file as a string argument.\n",
"\n",
"Next, we use BastionLab's `client.polars.send_df` to upload the dataframe with our custom policy.\n",
"\n",
- "Finally, we save the FetchableLazyFrame using the `save` method with no arguments. We can make a note of the FetchableLazyFrame's identifier to be shared with data scientists to help them to remotely access the FetchableLazyFrame!\n",
+ "Finally, we save the FetchableLazyFrame using the `save` method with no arguments. We'll need to keep the FetchableLazyFrame's identifier, so we can share it with data scientists to help them remotely access the frame.\n",
"\n",
- ">Note we need to save FetchableLazyFrames to avoid them being lost when the server is stopped and restarted or crashes."
+ ">We need to save FetchableLazyFrames to avoid them being lost when the server is stopped and restarted or crashes."
]
},
{
@@ -245,22 +256,21 @@
"`send_df()` will return a FetchableLazyFrame instance, which we will work with directly from now on. \n",
"\n",
"
\n",
- "
Note
\n",
+ "
Note: Frames in BastionLab
\n",
"
We talk about two types of LazyFrames in BastionLab: `RemoteLazyFrames` and `FetchableLazyFrames`. \n",
"\n",
- "In BastionLab, when we run a query, it is not immediately executed. Like with Polar's LazyFrames, pending queries are only executed when we call `collect`. `FetchableLazyFrames` are BastionLab's remote lazy frames when there are no pending queries to run, either because we have just uploaded or got the dataframe using `get_df` or because we have already ran `collect` after our latest query. To display these lazy frames we call the `fetch` method, which will verify that the data frame is safe to display, i.e. is it the result of a safe aggregated query as specified in the privacy policy.\n",
+ "In BastionLab, when we run a query, it is not immediately executed. Like with Polar's LazyFrames, pending queries are only executed when we call collect. FetchableLazyFrames are BastionLab's remote lazy frames when there are no pending queries to run, either because we have just uploaded or got the dataframe using get_df or because we have already ran collect after our latest query. To display these lazy frames we call the fetch method, which will verify that the data frame is safe to display, i.e. is it the result of a safe aggregated query as specified in the privacy policy.\n",
"\n",
- "A `RemoteLazyFrame` is just a `FetchableLazyFrame` with pending queries still to run (as they have not yet been `collected`). When we call `collect()` these operations are run server-side and the result of this is our `FetchableLazyFrame`.
\n",
- "
"
+ "A RemoteLazyFrame is just a FetchableLazyFrame with pending queries still to run (as they have not yet been collected). When we call collect() these operations are run server-side and the result of this is our FetchableLazyFrame.
\n",
+ "\n"
]
},
{
+ "attachments": {},
"cell_type": "markdown",
- "metadata": {
- "id": "YRC1y4uX_G10"
- },
+ "metadata": {},
"source": [
- "Let's finish off by testing what happens if we breach our security policy by trying to display an entire column from our dataset with the `collect().fetch()` methods. \n",
+ "Let's complete the set up by testing what happens if we breach our security policy! We'll try to display an entire column from our dataset with the `collect().fetch()` methods. \n",
"\n",
">*You can learn more about how to use both of those methods in [our quick tour](https://bastionlab.readthedocs.io/en/latest/docs/quick-tour/quick-tour/#running-queries).*"
]
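The difference between a frame with pending queries and a fetchable one, as described in the note above, can be pictured with a toy model. This is only a sketch in plain Python: the `ToyLazyFrame` class and its `is_fetchable` property are hypothetical names made up for illustration, not BastionLab's implementation or API.

```python
class ToyLazyFrame:
    """Toy stand-in for a remote lazy frame (hypothetical, not BastionLab code)."""

    def __init__(self, rows, pending=None):
        self.rows = list(rows)
        self.pending = list(pending or [])  # queued operations, not yet run

    def filter(self, predicate):
        # Queue the operation instead of running it right away.
        def step(rows):
            return [row for row in rows if predicate(row)]

        return ToyLazyFrame(self.rows, self.pending + [step])

    def collect(self):
        # Run every pending operation; the result has nothing left to run.
        rows = self.rows
        for step in self.pending:
            rows = step(rows)
        return ToyLazyFrame(rows)

    @property
    def is_fetchable(self):
        # A frame is "fetchable" only once no query is pending.
        return not self.pending


frame = ToyLazyFrame([1, 2, 3, 4])                   # freshly uploaded: fetchable
query = frame.filter(lambda value: value % 2 == 0)   # pending query: not fetchable
result = query.collect()                             # executed: fetchable again
```

In this picture, `query` plays the role of a RemoteLazyFrame and `result` the role of a FetchableLazyFrame. The real `fetch` additionally checks the privacy policy, which this sketch leaves out.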
@@ -289,6 +299,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "x1Zu2YQi_G11"
@@ -298,7 +309,7 @@
"\n",
"We cannot view the output of the query because it does not aggregate at least 10 rows of data as specified in our privacy policy. It tries to print out individual rows instead!\n",
"\n",
- "Now that the dataset has been uploaded, it's time for our data scientists to get working... \n",
+ "All is working, so now that the dataset has been uploaded, it's time for our data scientists to start their exploration... \n",
"\n",
"The data owner can now connection their connection to the server."
]
@@ -321,14 +332,14 @@
"id": "HJzNveFG_G13"
},
"source": [
- "## Data scientist #1 setup\n",
+ "## Data scientist setup\n",
"__________________________________________\n",
"\n",
"### Connecting to the dataset\n",
"\n",
"We'll now jump into the role of the data scientist responsible for cleaning the dataset for this data analysis project.\n",
"\n",
- "We first need to connect to the `bastion_lab` server and get a FetchableLazyFrame instance of the dataset. We'll use' the `get_df()` method and supply it with the id shared with us by the data owner to do this.\n",
+ "We (the data scientist) will first need to connect to the `bastion_lab` server and get a `FetchableLazyFrame` instance of the dataset. We'll use' the `get_df()` method and supply it with the id shared with us by the data owner.\n",
"\n",
"We store our FetchableLazyFrame in the `rdf` variable which we'll be working with from here on."
]
@@ -406,6 +417,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "CQC7tfaF_G15"
@@ -416,9 +428,14 @@
"\n",
"\n",
"### Dropping columns\n",
- "You may have noticed, this dataset contains a lot of columns! This is great as it it gives us a wide choice of correlations to explore. However, we will not have time to explore all of them in this analysis! We can therefore drop the columns that we won't be using- either because they are irrelavant, or because they didn't lead us to the most interesting correlations for this analysis!\n",
"\n",
- "We can do this by using the`drop` method, providing it with a list of the names of columns to be dropped. This is a RemoteLazyFrame method which corresponds directly to the [Polars drop() function](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.drop.html#polars.LazyFrame.drop)."
+ "You may have noticed that this dataset contains *a lot* of columns! This is great as it it gives us a wide choice of correlations to explore. But we don't want to bore you to death with a 50 pages long tutorial, so we will not explore all of them in this analysis.\n",
+ "\n",
+ "So we'll drop the columns that we won't be using - either because they are irrelevant, or because they didn't lead us to the most interesting correlations.\n",
+ "\n",
+ "We can do this by using the `drop()` method and providing it with a list of the names of columns to be dropped. \n",
+ "\n",
+ ">This is a RemoteLazyFrame method which works the same as the [Polars drop()](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.drop.html#polars.LazyFrame.drop) function."
]
},
{
@@ -453,15 +470,17 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "vabmc_jjOQCo"
},
"source": [
- "There are now 36 columns to work with intead of 51- this will make the RemoteLazyFrame a little easier to work with!"
+ "There are now 36 columns to work with intead of 51! This will make the RemoteLazyFrame a little easier to work with."
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "7ausY-PC_G16"
@@ -470,13 +489,13 @@
"\n",
"### Checking for null values\n",
"\n",
- "We now want to assess how many null values we have in each column. This will help us to know if we have enough data to draw meaningful conclusions from each column and gives us the chance to fill or delete null values if relevant.\n",
+ "Next step: assessing how many null values we have in each column. This will help us know if we have enough data to draw meaningful conclusions from each column. We can also fill or delete null values if relevant.\n",
"\n",
- "However, based on the description of the dataset shared with us by the data owner, we know that some column cells have been filled with '?' instead of being left blank.\n",
+ "In this particular case, the data owner shared with us a description of the dataset and we know that some column cells have been filled with `?` instead of being left blank.\n",
"\n",
- "Before we can get an accurate picture of null values, we first need to replace all these '?' values with null values. We will do this by using [Polars .when().then().otherwise()` functions](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.when.html). \n",
+ "Before we can get an accurate picture of null values, we first need to replace all these `?` values with `null` values. We will do this by using [Polars .when().then().otherwise()`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.when.html) functions. \n",
"\n",
- "One final hurdle is that we can only search and replace '?' strings in columns with the 'Utf8' (string) datatype- otherwise an error will be produced. We must therefore firstly grab pl.Utf8 columns only and apply our search and replace operation to these strings!"
+ "One final hurdle is that we can only search and replace `?` strings in columns with the `Utf8` (string) datatype - otherwise an error will be produced. This is why we'll grab `pl.Utf8` columns only before we apply our search and replace operation to these strings!"
]
},
{
@@ -490,7 +509,7 @@
"# step one: getting a list of all Utf8/string columns\n",
"selects = rdf.select(pl.col(pl.Utf8)).columns\n",
"\n",
- "# step two: we replace all '? cells in these columns with null values\n",
+ "# step two: we replace all '?' cells in these columns with null values\n",
"rdf = rdf.with_columns(\n",
" [\n",
" pl.when(pl.col(x) == \"?\").then(None).otherwise(pl.col(x)).keep_name()\n",
@@ -500,23 +519,25 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "c1Frpi9GUtdW"
},
"source": [
- "In step two, we use the Polars `with_columns` function to add our new columns with null values instead of question marks to our RemoteLazyFrame. By using the `keep_name` function, these columns keep their original column name and therefore replace the original columns in the dataset. We save the result as `rdf`, storing the updated version of the dataset in our `rdf` variable."
+ "In step two, we used the `with_columns` function to add our new columns with `null` values, instead of `?` to `rdf`. By using the `keep_name` function, the columns keep their original name and replace the old ones in the dataset. "
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "vMMX8JZnKitA"
},
"source": [
- "Now that this is done, we can go ahead and calculate how many null values each column contains.\n",
+ "Finally, we can go ahead and calculate how many null values each column contains.\n",
"\n",
- "We do this by iterating over all the columns and getting a percentage of the `sum` of all the value that return `True` to the `is_null` function."
+ "We do this by iterating over all the columns and getting a percentage of the `sum` of all the values that return `True` to the `is_null` function."
]
},
{
From ed985b62fec34e03e76045b0361abf3330348ece Mon Sep 17 00:00:00 2001
From: Knulpinette
Date: Wed, 22 Feb 2023 13:25:03 +0100
Subject: [PATCH 13/22] reviewed part 1
---
.../how-to-guides/diabetes_exploration.ipynb | 90 +++++++++++--------
1 file changed, 55 insertions(+), 35 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index 39196d90..4c903efc 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -495,7 +495,7 @@
"\n",
"Before we can get an accurate picture of null values, we first need to replace all these `?` values with `null` values. We will do this by using [Polars .when().then().otherwise()`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.when.html) functions. \n",
"\n",
- "One final hurdle is that we can only search and replace `?` strings in columns with the `Utf8` (string) datatype - otherwise an error will be produced. This is why we'll grab `pl.Utf8` columns only before we apply our search and replace operation to these strings!"
+ "One final hurdle is that we can only search and replace `?` strings in columns with the `Utf8` (string) datatype - otherwise an error will be produced. This is why we'll grab `pl.Utf8` columns only before we apply our search and replace operation to these strings!\n"
]
},
{
@@ -557,14 +557,15 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "3uMcNqVZWhdN"
},
"source": [
- "We can then view the percentage of null values for each column as a two-column list by using Polars `melt` function to flip the query results from a 2 row by 5 column grid, to a 2 column by 5 row grid. We use the `sort` function to show the columns in order from the column with the highest percentage of null values to the lowest.\n",
+ "We can then view the percentage of null values in each column of our dataset as a two-column table. We'll use [Polars `melt()`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.melt.html) function to flip the query results from a '2 row / 5 column' table, to a '2 column / 5 row' table. We use the `sort()` function to show the columns in order from the column with the highest percentage of null values to the lowest.\n",
"\n",
- "Finally, we remove any columns with no null values from our output since they are not of interest to us here."
+ "Then we print our the table excluding any results where the percentage of null values is `0%`, since these results are not of interest to us."
]
},
{
@@ -722,7 +723,9 @@
}
],
"source": [
- "# melt table to a two-column table with the column name 'column' and corresponding percetage of null values 'null values', sort in descending order and display\n",
+ "# melt table to a two-column table with the column name 'column' \n",
+ "# and corresponding percetage of null values 'null values', sort\n",
+ "# in descending order and display\n",
"percent_missing = percent_missing.melt(\n",
" variable_name=\"column name\",\n",
" value_name=\"null values (%)\",\n",
@@ -733,26 +736,33 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "4n0jnBPyYLjf"
},
"source": [
- "There are several strategies for dealing with null values such as deleting these rows from the dataset with the `drop_nulls` method or filling null values with the `fill_null` method. But in our case, we are just happy to have visibility over which columns including null values and to what extent so that we can handle and analyse these columns with this in mind."
+ "BastionLab supports various strategies for removing or filling null values from our dataset, but in this case, we don't want to remove the null values from the dataset. We just want to be aware of the amount of null values in each column of our dataset so we can query accordingly."
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "-WUugovwve2c"
},
"source": [
- "### Grouping data: ICD-9 medical codes\n",
- "Grouping data is going to be the largest and most crucial task in this data cleaning job. This is a dataset with a low of wide-ranging numerical values which need to be grouped so that our data analysts can gain meaningul insights.\n",
+ "### Grouping data\n",
+ "\n",
+ "Grouping data is going to be the largest and most crucial task in this data cleaning job. \n",
"\n",
- "Let's start with our diagnoses columns: `diag_1`, `diag_2` and `diag_3`.\n",
+ "This diabetes dataset has a lot of of wide-ranging numerical values. They need to be grouped so that our data analysts can gain meaningul insights from them.\n",
"\n",
- "These columns contain the primary, secondary and terciary diagnoses given to patients. These diagnoses are given using [ICD-9 medical codes](https://en.wikipedia.org/wiki/List_of_ICD-9_codes) which are three digit codes ranging from 1 to 1000, as well as E800–E999 codes and V01–V82 code.\n",
+ "#### ICD-9 medical codes\n",
+ "\n",
+ "Let's start with the diagnoses columns: `diag_1`, `diag_2` and `diag_3`.\n",
+ "\n",
+ "They contain the primary, secondary and terciary diagnoses given to patients. These diagnoses are given using [**ICD-9** medical codes](https://en.wikipedia.org/wiki/List_of_ICD-9_codes) which are three digit codes ranging from **1** to **1000**, as well as **E800–E999** codes and **V01–V82** code.\n",
"\n",
"By grabbing all the unique values in the `diag_1` column and counting them, we can see that we have over 700 different values in this column!"
]
@@ -848,24 +858,25 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "cPsmfkBpkPCv"
},
"source": [
- "Standard groupings of these codes have already been designed. What we want to do is replace the hundreds of unique codes we have in our our diagnoses columns with these groupings!\n",
+ "Standard groupings of these codes have already been designed. So what we want to do is replace the hundreds of unique codes we have in our our diagnoses columns with these groupings.\n",
"\n",
- "To do this, we will again use Polars `when().then().otherwise()` functions to perform a find and replace operation. We will use `when()` to check if the codes in each cell are either E or V codes or fall within a certain numerical range.\n",
+ "To do this, we will again use Polars `when().then().otherwise()` functions to perform a find and replace operation. We will use `when()` to check if the codes in each cell are either **E** or **V** codes or fall within a certain numerical range.\n",
"\n",
- "However, these diagnoses columns are currently string columns, since the E and V codes are not entirely numerical. This is problematic since we cannot perform numerical comparisons on these cells and we cannot convert the column type to a numerical one because of these 'E' and 'V' values!\n",
+ "The problem is that these diagnoses columns are currently string columns because **E** and **V** codes are not entirely numerical. Until it's solved, we cannot perform numerical comparisons on these cells and we cannot convert the column type to a numerical one.\n",
"\n",
- "We will solve this problem in three steps:\n",
+ "Here's how we'll handle this:\n",
"\n",
- "1) We will find and replace all E codes with a \"-1\" value and V codes with a \"-2\" value.\n",
+ "1) We will find and replace all **E** codes with a `-1` value and **V** codes with a `-2` value.\n",
"\n",
"2) We will `select()` our columns and `cast()` all values in these columns to float values.\n",
"\n",
- "3) We will perform the find and replace operation to group all ICD-9 codes into their associated group- of which there are 17, plus E codes and V codes."
+ "3) We will perform the find and replace operation to group all **ICD-9** codes into their associated group - of which there are 17, plus **E** codes and **V** codes."
]
},
{
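The three steps listed in this cell can be sketched on single values in plain Python. This is an illustration only, not the notebook's Polars query: the helper names `to_numeric` and `icd9_group` are hypothetical, and the group labels are shorthand for the standard ICD-9 chapters, so they may not match the labels the notebook uses.

```python
import math

# The 17 standard ICD-9 chapters as (low, high, label) ranges.
ICD9_CHAPTERS = [
    (1, 139, "infectious and parasitic"),
    (140, 239, "neoplasms"),
    (240, 279, "endocrine, nutritional and metabolic"),
    (280, 289, "blood"),
    (290, 319, "mental disorders"),
    (320, 389, "nervous system and sense organs"),
    (390, 459, "circulatory"),
    (460, 519, "respiratory"),
    (520, 579, "digestive"),
    (580, 629, "genitourinary"),
    (630, 679, "pregnancy and childbirth"),
    (680, 709, "skin"),
    (710, 739, "musculoskeletal"),
    (740, 759, "congenital anomalies"),
    (760, 779, "perinatal conditions"),
    (780, 799, "symptoms and ill-defined conditions"),
    (800, 999, "injury and poisoning"),
]


def to_numeric(code):
    # Steps 1 and 2: replace E codes with -1 and V codes with -2,
    # then cast everything to a float.
    if code.startswith("E"):
        return -1.0
    if code.startswith("V"):
        return -2.0
    return float(code)


def icd9_group(code):
    # Step 3: map the numeric value to its group.
    value = to_numeric(code)
    if value == -1.0:
        return "E codes"
    if value == -2.0:
        return "V codes"
    whole = math.floor(value)  # e.g. 250.83 falls in the 240-279 range
    for low, high, label in ICD9_CHAPTERS:
        if low <= whole <= high:
            return label
    return "unknown"
```

With 17 chapters plus the two sentinel groups, this yields the 19 labels mentioned later in the notebook.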
@@ -947,12 +958,13 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "P1MquUrNlXDO"
},
"source": [
- "By performing the same query as previously to count `diag_1`'s unique values, we see there is now a much more manageable 19 labels in our data column! This will be similar for the `diag_2` and `diag_3` columns."
+ "By performing the same query as previously to count `diag_1`'s unique values, we see there is now a much more manageable 19 labels in our data column. This will be similar for the `diag_2` and `diag_3` columns."
]
},
{
@@ -1046,12 +1058,13 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "pKAp3OvKcwuX"
},
"source": [
- "We notice in our project brief that there is only 1 E code value in the `diag_1` column, so we will remove this value from our dataset before continuing by using the `filter` function."
+ "We notice in our project brief that there is only one **E** code value in the `diag_1` column, so we will remove this value from our dataset before continuing by using the `filter` function."
]
},
{
@@ -1066,16 +1079,17 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "BvdGu7GmsZVu"
},
"source": [
- "### Grouping data: A1C, max glucose levels and readmittance\n",
+ "#### A1C, max glucose levels and readmittance\n",
"\n",
"We want to group together data in another three other columns using the same `.then().when().otherwise()` methods.\n",
"\n",
- "The first two are `A1Cresult`, which contains patients' HbA1c level, and `max_glu_serum`, which contains their blood glucose level. We want to group these into `very high`, `high`, `normal` groups based on levels defined in our project brief.\n",
+ "The first two are **`A1Cresult`**, which contains patients' **HbA1c** level, and `max_glu_serum`, which contains their blood glucose level. We want to group these into `very high`, `high`and `normal` groups based on levels defined in our project brief.\n",
"\n",
"These columns are both currently string columns, so we will also need to convert them to float values in order to perform numerical comparisons on them."
]
@@ -1093,7 +1107,7 @@
" [pl.col(\"max_glu_serum\").cast(pl.Float64), pl.col(\"A1Cresult\").cast(pl.Float64)]\n",
")\n",
"\n",
- "# group values in A1Cresult column\n",
+ "# group values in `A1Cresult` column\n",
"rdf = rdf.with_columns(\n",
" [\n",
" pl.when(pl.col(\"A1Cresult\") >= 8)\n",
@@ -1107,7 +1121,7 @@
" ]\n",
")\n",
"\n",
- "# group values in max_glu_serum column\n",
+ "# group values in `max_glu_serum` column\n",
"rdf = rdf.with_columns(\n",
" [\n",
" pl.when(pl.col(\"max_glu_serum\") >= 300)\n",
@@ -1123,6 +1137,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Buu2nja5w6Db"
@@ -1132,7 +1147,7 @@
"\n",
"We will group this column into `short-term` and `long-term` and `n/a` (not applicable) groups.\n",
"\n",
- "Simiar to in previous examples, we must first convert values in this column from strings to integer values."
+ "In the same way as in the previous examples, we must first convert values in this column from strings to integer values."
]
},
{
@@ -1160,23 +1175,24 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "kuwxAGYBoOQJ"
},
"source": [
- "### Grouping data: binning ages\n",
- "The next grouping task we will perform is to group ages into intervals of 10 years. We do this both to increase data privacy and to more easily draw correlations linked to broader age groups.\n",
+ "#### Binning ages\n",
+ "The next grouping task we will perform is to group ages into intervals of 10 years. We do this both to increase data privacy and draw correlations linked to broader age groups more easily.\n",
"\n",
- "We won't need to perform an `when().then().otherwise()` query here since BastionLab has its own `ApplyBins` tool.\n",
+ "We won't need to perform a `when().then().otherwise()` query here because BastionLab has its own `ApplyBins` tool.\n",
"\n",
"`ApplyBins` is a PyTorch module and the grouping of numbers takes place in its `forward` function. We can pass PyTorch modules to BastionLab's `apply_udf` function which will apply the `forward` function to any specified columns.\n",
"\n",
- "All in all, we just three steps to bin our age column data:\n",
+ "All in all, we just need three steps to bin our age column data:\n",
"\n",
"1) We import `ApplyBins` from `bastionlab.polars.utils`.\n",
- "1) We instantiate our `ApplyBins` PyTorch module class with our bins interval given as the only argument.\n",
- "2) We use `apply_udf`, providing a list of the column we want to modify and the PyTorch module, `ApplyBins`, that we wish to apply to these columns."
+ "1) We instantiate the `ApplyBins` PyTorch module class with our bins interval.\n",
+ "2) We use `apply_udf`, providing a list of the column we want to modify, and the PyTorch module `ApplyBins` that we wish to apply to these columns."
]
},
{
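To make the idea of binning concrete, here is a minimal sketch in plain Python of grouping ages into intervals of 10. It assumes each value is replaced by the lower bound of its interval; the `bin_value` helper is hypothetical, and `ApplyBins` itself may label its bins differently.

```python
def bin_value(value, interval=10):
    # Assumption: a bin is represented by its lower bound,
    # so with interval=10 every age from 30 to 39 becomes 30.
    return (value // interval) * interval


ages = [7, 34, 39, 40, 81]
binned = [bin_value(age) for age in ages]
```

After binning, ages 34 and 39 are indistinguishable, which is the privacy benefit the cell above refers to.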
@@ -1197,15 +1213,17 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "1pOQYPYSsVns"
},
"source": [
- "> Note, you can create your own custom PyTorch modules and apply them to columns using `apply_udf`. This is BastionLab's way of allowing you to apply custom functions on datasets, whilst restricting what you can do for security reasons. Functionality like `lambda`, `map` and `apply` are blocked by BastionLab as they are too permissive and could be misused."
+ "> Note, you can create your own custom PyTorch modules and apply them to columns using `apply_udf`. This is BastionLab's way of allowing you to apply custom functions on datasets, while restricting what you can do for security reasons. Functionality like `lambda`, `map` and `apply` are blocked by BastionLab as they are too permissive and could be misused."
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "gYRVmqTitckT"
@@ -1213,9 +1231,9 @@
"source": [
"### Adding columns\n",
"\n",
- "Up until this point we have been using the `.when().then().otherwise()` and `with_columns` methods to make changes to existing columns, but by providing a new column name to the `alias` method, we can create a new column.\n",
+ "Up until this point we have been using the `.when().then().otherwise()` and `with_columns` methods to make changes to existing columns. But we can also provide a new column name to the `alias` method to create a new column.\n",
"\n",
- "In the following example, we will create a `is_readmitted` column which will store `False` for all the \"n/a\" values in our original `readmitted` column and `True` for any other values. This will allow us to quickly query whether certain groups of data have been readmitted or not!"
+ "In the following example, we will create a `is_readmitted` column which will store `False` for all the `n/a` values in our original `readmitted` column, and `True` for any other values. This will allow us to quickly query whether certain groups of data have been readmitted or not."
]
},
{
@@ -1239,6 +1257,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "edoL2_uy_G19"
@@ -1246,7 +1265,7 @@
"source": [
"### Converting column types\n",
"\n",
- "We have already seen examples where we have `explicity` converted the datatype of our columns using the `cast` method. Here we will `implicity` convert the datatype by replacing the \"yes\" and \"no\" values in our `change` column, which represent whether a patient's medication has been changed, to a boolean True or False value. \n",
+ "We have already seen examples where we have explicity converted the datatype of our columns using the `cast` method. Here we will implicity convert the datatype by replacing the `yes` and `no` values in our `change` column (which represent whether a patient's medication has been changed) to a boolean `True` or `False` value. \n",
"\n",
"The datatype of this column will be changed automatically by this operation as we can see below."
]
@@ -1312,6 +1331,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "CYS-Mkl1tD8t"
@@ -1319,11 +1339,11 @@
"source": [
"### Saving our RemoteLazyFrame and disconnecting\n",
"\n",
- "Our dataframe is all clean and ready for the next step: data analysis/ visualization. Data scientist #1 is going to be reassigned to another task. They will save their cleaned RemoteLazyFrame and make a note of the identifier to share with data scientist #2.\n",
+ "Our dataframe is all clean and ready for the next step: data analysis and visualization. We, Data scientist #1, are going to be reassigned to another task. We will save our cleaned RemoteLazyFrame and make note of the identifier so we can share it with Data scientist #2.\n",
"\n",
- "We need to perform `collect()` before saving or getting an identifier for our RemoteLazyFrame since the `save` method and `identifier` attribute are only available for FetchableLazyFrames.\n",
+ "We need to perform `collect()` before saving or getting an identifier for our RemoteLazyFrame, because the `save` method and `identifier` attribute are only available for FetchableLazyFrames.\n",
"\n",
- ">Note, the data owner must have set the `savable` option to `True` when uploading the dataframe for this operation to be possible!"
Note, the data owner must">
+ ">Note, the data owner must have set the `savable` option to `True` when uploading the dataframe for this operation to be possible! Here they did, so we won't run into an issue."
]
},
{
From c12216e977340b205387872660c2367abee5940f Mon Sep 17 00:00:00 2001
From: lyie28
Date: Wed, 22 Feb 2023 17:46:36 +0100
Subject: [PATCH 14/22] added pies
---
client/src/bastionlab/polars/remote_polars.py | 81 ++---
.../how-to-guides/diabetes_exploration.ipynb | 284 ++++++++++--------
2 files changed, 175 insertions(+), 190 deletions(-)
diff --git a/client/src/bastionlab/polars/remote_polars.py b/client/src/bastionlab/polars/remote_polars.py
index 94f8fd1f..f6b2f31d 100644
--- a/client/src/bastionlab/polars/remote_polars.py
+++ b/client/src/bastionlab/polars/remote_polars.py
@@ -430,61 +430,6 @@ def with_row_count(self: LDF, name: str = "index") -> LDF:
# because if not this leads to panics etc. when we follow this with other operations that use the new column before next using collect()
return ret.collect()
- def describe(self: LDF) -> pl.DataFrame:
- """
- Provides the following summary statistics for our RemoteLazyFrame:
- - count
- - null count
- - mean
- - std
- - min
- - max
- - median
- Raises:
- Exception: Where necessary queries to get statistical information for the operation are rejected by the data owner
- Returns:
- A Polars DataFrame containing statistical information
- """
- ret = self.select(
- [
- pl.col("*").count().suffix("_count"),
- pl.col("*").null_count().suffix("_null_count"),
- pl.col("*").mean().suffix("_mean"),
- pl.col("*").std().suffix("_std"),
- pl.col("*").min().suffix("_min"),
- pl.col("*").max().suffix("_max"),
- pl.col("*").median().suffix("_median"),
- ]
- )
- stats = ret.collect().fetch()
- RequestRejected.check_valid_df(stats)
- description = pl.DataFrame(
- {
- "describe": [
- "count",
- "null_count",
- "mean",
- "std",
- "min",
- "max",
- "median",
- ],
- **{
- x: [
- stats.select(f"{x}_count")[0, 0],
- stats.select(f"{x}_null_count")[0, 0],
- stats.select(f"{x}_mean")[0, 0],
- stats.select(f"{x}_std")[0, 0],
- stats.select(f"{x}_min")[0, 0],
- stats.select(f"{x}_max")[0, 0],
- stats.select(f"{x}_median")[0, 0],
- ]
- for x in self.columns
- },
- }
- )
- return description
-
def join(
self: LDF,
other: LDF,
@@ -624,14 +569,24 @@ def pieplot(
various exceptions: Note that exceptions may be raised from matplotlib pyplot's pie or subplots functions, for example if fig_kwargs keywords are not valid.
"""
+ tmp = self
if parts not in self.columns:
raise ValueError("Parts column not found in dataframe")
if type(labels) == str and labels not in self.columns:
raise ValueError("Labels column not found in dataframe")
+ # run previous operations to ensure order of columns are as expected
+ if type(labels) == str and type(parts) == str:
+ tmp = tmp.collect()
# get list of values in parts column
- parts_tmp = self.select(pl.col(parts)).collect().fetch().to_numpy()
- parts_list = [x[0] for x in parts_tmp]
+ parts_list = (
+ tmp.select(pl.col(parts))
+ .collect()
+ .fetch()
+ .select(parts)
+ .to_series(0)
+ .to_list()
+ )
# get total for calculating percentages
total = sum(parts_list)
@@ -641,8 +596,14 @@ def pieplot(
# get labels list
if type(labels) == str:
- labels_tmp = self.select(pl.col(labels)).collect().fetch().to_numpy()
- labels_list = [x[0] for x in labels_tmp]
+ labels_list = (
+ tmp.select(labels)
+ .collect()
+ .fetch()
+ .select(labels)
+ .to_series(0)
+ .to_list()
+ )
else:
labels_list = labels
@@ -651,7 +612,7 @@ def pieplot(
if fig_kwargs == None:
fig, ax = plt.subplots(figsize=(7, 4), subplot_kw=dict(aspect="equal"))
else:
- if "figsize" not in self.kwargs:
+ if "figsize" not in fig_kwargs:
fig_kwargs["figsize"] = (7, 4)
fig, ax = plt.subplots(**fig_kwargs)
if pie_labels == True:
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index 4c903efc..2524bee6 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -98,7 +98,7 @@
},
{
"cell_type": "code",
- "execution_count": 193,
+ "execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -106,18 +106,7 @@
"id": "A85GsYOi_G1o",
"outputId": "97b964bd-61b6-4cc6-e5e7-b9f2a2587bd7"
},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "BastionLab server (version 0.3.7) already installed\n",
- "Libtorch (version 1.13.1) already installed\n",
- "TLS certificates already generated\n",
- "Bastionlab server is now running on port 50056\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"# launch bastionlab_server test package\n",
"import bastionlab_server\n",
@@ -141,7 +130,7 @@
},
{
"cell_type": "code",
- "execution_count": 194,
+ "execution_count": 3,
"metadata": {
"id": "6zzV7xrs_G1q"
},
@@ -178,7 +167,7 @@
},
{
"cell_type": "code",
- "execution_count": 195,
+ "execution_count": 4,
"metadata": {
"id": "mRJjgd1C_G1t"
},
@@ -212,7 +201,7 @@
},
{
"cell_type": "code",
- "execution_count": 196,
+ "execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -225,7 +214,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "63c8152d-f5af-41ec-b22c-aea51a8465b5\n"
+ "801444d3-0742-43e2-a199-a454cce00928\n"
]
}
],
@@ -277,7 +266,7 @@
},
{
"cell_type": "code",
- "execution_count": 197,
+ "execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -316,7 +305,7 @@
},
{
"cell_type": "code",
- "execution_count": 198,
+ "execution_count": 7,
"metadata": {
"id": "mcM4pR6D_G11"
},
@@ -346,7 +335,7 @@
},
{
"cell_type": "code",
- "execution_count": 199,
+ "execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -358,10 +347,10 @@
{
"data": {
"text/plain": [
- "FetchableLazyFrame(identifier=63c8152d-f5af-41ec-b22c-aea51a8465b5)"
+ "FetchableLazyFrame(identifier=801444d3-0742-43e2-a199-a454cce00928)"
]
},
- "execution_count": 199,
+ "execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
@@ -386,7 +375,7 @@
},
{
"cell_type": "code",
- "execution_count": 200,
+ "execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -440,7 +429,7 @@
},
{
"cell_type": "code",
- "execution_count": 201,
+ "execution_count": 10,
"metadata": {
"id": "s0NI6rTqOKWN"
},
@@ -500,7 +489,7 @@
},
{
"cell_type": "code",
- "execution_count": 202,
+ "execution_count": 11,
"metadata": {
"id": "F2KwhZB_fTC3"
},
@@ -542,7 +531,7 @@
},
{
"cell_type": "code",
- "execution_count": 203,
+ "execution_count": 12,
"metadata": {
"id": "SAqqUz6I_G16"
},
@@ -570,7 +559,7 @@
},
{
"cell_type": "code",
- "execution_count": 204,
+ "execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -608,10 +597,6 @@
" .dataframe td {\n",
" padding-bottom: 0;\n",
" }\n",
- "\n",
- " .dataframe td {\n",
- " line-height: 95%;\n",
- " }\n",
"\n",
"
\n",
"shape: (7, 2)\n",
@@ -717,13 +702,13 @@
"└───────────────┴─────────────────┘"
]
},
- "execution_count": 204,
+ "execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "# melt table to a two-column table with the column name 'column' \n",
+ "# melt table to a two-column table with the column name 'column'\n",
"# and corresponding percetage of null values 'null values', sort\n",
"# in descending order and display\n",
"percent_missing = percent_missing.melt(\n",
@@ -769,7 +754,7 @@
},
{
"cell_type": "code",
- "execution_count": 205,
+ "execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -807,10 +792,6 @@
" .dataframe td {\n",
" padding-bottom: 0;\n",
" }\n",
- "\n",
- " .dataframe td {\n",
- " line-height: 95%;\n",
- " }\n",
"\n",
"
\n",
"shape: (4, 2)\n",
@@ -1833,7 +1793,7 @@
"└───────────┴───────────┘"
]
},
- "execution_count": 223,
+ "execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
@@ -1851,6 +1811,38 @@
"ret.sort(pl.col(\"change\"), reverse=True).collect().fetch()"
]
},
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can visualize this as a pie chart using the `pieplot` method, passing it the name of the column to use for `labels` and the name of the column to use for the pie chart `parts` (slices)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "
"
]
@@ -2202,7 +1841,7 @@
},
{
"cell_type": "code",
- "execution_count": 38,
+ "execution_count": 34,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -2225,7 +1864,7 @@
" 'insulin']"
]
},
- "execution_count": 38,
+ "execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
@@ -2282,7 +1921,7 @@
},
{
"cell_type": "code",
- "execution_count": 39,
+ "execution_count": 35,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -2304,7 +1943,7 @@
" 'insulin']"
]
},
- "execution_count": 39,
+ "execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
@@ -2336,7 +1975,7 @@
},
{
"cell_type": "code",
- "execution_count": 40,
+ "execution_count": 36,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -2499,7 +2138,7 @@
"└───────────────┴──────────────────────┘"
]
},
- "execution_count": 40,
+ "execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
@@ -2551,7 +2190,7 @@
},
{
"cell_type": "code",
- "execution_count": 41,
+ "execution_count": 37,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -2704,7 +2343,7 @@
"└───────────────┴──────────────────────┘"
]
},
- "execution_count": 41,
+ "execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
@@ -2783,7 +2422,7 @@
},
{
"cell_type": "code",
- "execution_count": 42,
+ "execution_count": 38,
"metadata": {
"id": "xROO5Oxzvev-"
},
From ad515bd1307c8c2ecd1a34e59431805e1dfada27 Mon Sep 17 00:00:00 2001
From: Knulpinette
Date: Thu, 23 Feb 2023 18:03:51 +0100
Subject: [PATCH 16/22] saving progress
---
.../how-to-guides/diabetes_exploration.ipynb | 28 +++++++++++++------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index a0788359..243c71b7 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -486,7 +486,7 @@
"\n",
"In this particular case, the data owner shared with us a description of the dataset and we know that some column cells have been filled with `?` instead of being left blank.\n",
"\n",
- "Before we can get an accurate picture of null values, we first need to replace all these `?` values with `null` values. We will do this by using [Polars .when().then().otherwise()`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.when.html) functions. \n",
+ "Before we can get an accurate picture of null values, we first need to replace all these `?` values with `null` values. We will do this by using [Polars `.when().then().otherwise()`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.when.html) functions. \n",
"\n",
"One final hurdle is that we can only search and replace `?` strings in columns with the `Utf8` (string) datatype - otherwise an error will be produced. This is why we'll grab `pl.Utf8` columns only before we apply our search and replace operation to these strings!\n"
]
@@ -1012,14 +1012,16 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "3Gvx_sK5ypgD"
},
"source": [
- "### Part II: data analysis and visualization\n",
+ "## Data analysis and visualization\n",
+ "_________________________________________\n",
"\n",
- "So data scientist #2 is now ready to begin their analysis of the cleaned dataset. Just like data scientist #1, they will first need to connect to the server and get the FetchableLazyFrame saved by data scientist #1."
+ "The dataset is clean and Data scientist #2 is now ready to begin their analysis. Just like Data scientist #1, they will first need to connect to the server and get the FetchableLazyFrame that was previously saved."
]
},
{
@@ -1057,12 +1059,13 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "z7-wG7DfzSyI"
},
"source": [
- "We can again confirm that the original privacy policy is still in place by running a non-aggreagted query that would violate the policy."
+ "We'll confirm that the original privacy policy is still in place by running a non-aggregated query that would violate the policy:"
]
},
{
@@ -1089,18 +1092,19 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "NfRexmoN0X9h"
},
"source": [
- "Now that we are all set-up, we can dive into the analysis.\n",
+ "We are all set up, so let's dive into the analysis.\n",
"\n",
"### Age as a factor in readmission and emergency trips\n",
"\n",
"Let's start by visualizing the number of patients who were readmitted to hospital for diabetes-related issues during the study.\n",
"\n",
- "To do this we group data by `age` and aggregate the `sum` of those who were readmitted. We then generate a barplot for this query."
+ "To do this we group data by `age` and aggregate the `sum` of those who were readmitted. We'll generate a barplot for this query."
]
},
{
@@ -1134,14 +1138,17 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "BdRDwT74BOrr"
},
"source": [
- "In terms of the number of readmissions, we see a clear trend for readmission cases to increase with age, before dropping down in the 80-90 and 90-100 age groups. This may be due to increased mortality in these age ranges.\n",
+ "In terms of the number of readmissions, we see a clear trend: readmission cases increase with age, before dropping down in the 80-90 and 90-100 age groups. This could be due to increased mortality in these age ranges.\n",
"\n",
- "However, if we take a look at the mean number of cases per age group using `histplot`, we see that it follows the same trend, showing that this trend may not represent a higher risk of readmission for older patients, but rather a much increased number of diabetes patients in older age groups."
+ "If we look at the mean number of cases per age group using `histplot`, we see that it follows the same trend. This suggests that the trend may reflect a much larger number of diabetes patients in older age groups rather than a higher risk of readmission for older patients.\n",
+ "\n",
+ "***# LAST SENTENCE IS UNCLEAR. WHAT DO YOU MEAN?***"
]
},
{
@@ -1172,12 +1179,15 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "xPFiho5eEKNT"
},
"source": [
- "If we zoom in on `short-term` and `long-term` readmittance individually and get the percentage of patients in these groups who are readmitted, rather than the count, we get a rather different picture.\n",
+ "Let's zoom in on `short-term` and `long-term` readmittance individually. If we get the percentage of patients in these groups who are readmitted instead of the count, we get a rather different picture.\n",
+ "\n",
+ "***# SENTENCE IS TOO LONG. Maybe try to use more direct formulations and cut the sentences more so each one says one thing (2 tops)?***\n",
"\n",
"To get these percentage values, we divide the total number of short-term or long-term values in the readmitted column by the total values in this column.\n",
"\n",
From 5e16a051a77f1c9fdbbde9e186bdbbf2c7ada581 Mon Sep 17 00:00:00 2001
From: lyie28
Date: Fri, 24 Feb 2023 09:58:53 +0100
Subject: [PATCH 17/22] resolve
---
.../how-to-guides/diabetes_exploration.ipynb | 843 +++++++++++++-----
1 file changed, 608 insertions(+), 235 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index 243c71b7..ae5d29c1 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -51,7 +51,83 @@
"metadata": {
"id": "hK-HDaMI_G1j"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: bastionlab in /home/laura/anaconda3/lib/python3.9/site-packages (0.3.7)\n",
+ "Requirement already satisfied: pyarrow~=10.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (10.0.1)\n",
+ "Requirement already satisfied: numpy~=1.21 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (1.24.2)\n",
+ "Requirement already satisfied: seaborn~=0.12.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (0.12.2)\n",
+ "Requirement already satisfied: six~=1.16.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (1.16.0)\n",
+ "Requirement already satisfied: cryptography~=38.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (38.0.4)\n",
+ "Requirement already satisfied: grpcio==1.47.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (1.47.0)\n",
+ "Requirement already satisfied: pyserde~=0.9 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (0.9.8)\n",
+ "Requirement already satisfied: colorama~=0.4.6 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (0.4.6)\n",
+ "Requirement already satisfied: tokenizers==0.13.2 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (0.13.2)\n",
+ "Requirement already satisfied: typing-extensions~=4.4 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (4.5.0)\n",
+ "Requirement already satisfied: protobuf==3.20.2 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (3.20.2)\n",
+ "Requirement already satisfied: grpcio-tools==1.47.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (1.47.0)\n",
+ "Requirement already satisfied: torch==1.13.1 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (1.13.1)\n",
+ "Requirement already satisfied: tqdm~=4.64 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (4.64.1)\n",
+ "Requirement already satisfied: matplotlib==3.6.3 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (3.6.3)\n",
+ "Requirement already satisfied: polars==0.14.24 in /home/laura/anaconda3/lib/python3.9/site-packages (from bastionlab) (0.14.24)\n",
+ "Requirement already satisfied: setuptools in /home/laura/anaconda3/lib/python3.9/site-packages (from grpcio-tools==1.47.0->bastionlab) (67.3.3)\n",
+ "Requirement already satisfied: contourpy>=1.0.1 in /home/laura/anaconda3/lib/python3.9/site-packages (from matplotlib==3.6.3->bastionlab) (1.0.7)\n",
+ "Requirement already satisfied: cycler>=0.10 in /home/laura/anaconda3/lib/python3.9/site-packages (from matplotlib==3.6.3->bastionlab) (0.11.0)\n",
+ "Requirement already satisfied: pillow>=6.2.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from matplotlib==3.6.3->bastionlab) (9.4.0)\n",
+ "Requirement already satisfied: kiwisolver>=1.0.1 in /home/laura/anaconda3/lib/python3.9/site-packages (from matplotlib==3.6.3->bastionlab) (1.4.4)\n",
+ "Requirement already satisfied: pyparsing>=2.2.1 in /home/laura/anaconda3/lib/python3.9/site-packages (from matplotlib==3.6.3->bastionlab) (3.0.9)\n",
+ "Requirement already satisfied: fonttools>=4.22.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from matplotlib==3.6.3->bastionlab) (4.38.0)\n",
+ "Requirement already satisfied: packaging>=20.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from matplotlib==3.6.3->bastionlab) (23.0)\n",
+ "Requirement already satisfied: python-dateutil>=2.7 in /home/laura/anaconda3/lib/python3.9/site-packages (from matplotlib==3.6.3->bastionlab) (2.8.2)\n",
+ "Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 in /home/laura/anaconda3/lib/python3.9/site-packages (from torch==1.13.1->bastionlab) (11.7.99)\n",
+ "Requirement already satisfied: nvidia-cublas-cu11==11.10.3.66 in /home/laura/anaconda3/lib/python3.9/site-packages (from torch==1.13.1->bastionlab) (11.10.3.66)\n",
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.7.99 in /home/laura/anaconda3/lib/python3.9/site-packages (from torch==1.13.1->bastionlab) (11.7.99)\n",
+ "Requirement already satisfied: nvidia-cudnn-cu11==8.5.0.96 in /home/laura/anaconda3/lib/python3.9/site-packages (from torch==1.13.1->bastionlab) (8.5.0.96)\n",
+ "Requirement already satisfied: wheel in /home/laura/anaconda3/lib/python3.9/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch==1.13.1->bastionlab) (0.38.4)\n",
+ "Requirement already satisfied: cffi>=1.12 in /home/laura/anaconda3/lib/python3.9/site-packages (from cryptography~=38.0->bastionlab) (1.15.1)\n",
+ "Requirement already satisfied: casefy in /home/laura/anaconda3/lib/python3.9/site-packages (from pyserde~=0.9->bastionlab) (0.1.7)\n",
+ "Requirement already satisfied: typing_inspect>=0.4.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from pyserde~=0.9->bastionlab) (0.8.0)\n",
+ "Requirement already satisfied: jinja2 in /home/laura/anaconda3/lib/python3.9/site-packages (from pyserde~=0.9->bastionlab) (3.0.3)\n",
+ "Requirement already satisfied: pandas>=0.25 in /home/laura/anaconda3/lib/python3.9/site-packages (from seaborn~=0.12.0->bastionlab) (1.5.3)\n",
+ "Requirement already satisfied: pycparser in /home/laura/anaconda3/lib/python3.9/site-packages (from cffi>=1.12->cryptography~=38.0->bastionlab) (2.21)\n",
+ "Requirement already satisfied: pytz>=2020.1 in /home/laura/anaconda3/lib/python3.9/site-packages (from pandas>=0.25->seaborn~=0.12.0->bastionlab) (2022.7.1)\n",
+ "Requirement already satisfied: mypy-extensions>=0.3.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from typing_inspect>=0.4.0->pyserde~=0.9->bastionlab) (0.4.3)\n",
+ "Requirement already satisfied: MarkupSafe>=2.0 in /home/laura/anaconda3/lib/python3.9/site-packages (from jinja2->pyserde~=0.9->bastionlab) (2.0.1)\n",
+ "Requirement already satisfied: bastionlab_server in /home/laura/anaconda3/lib/python3.9/site-packages (0.3.7)\n",
+ "Requirement already satisfied: gdown in /home/laura/anaconda3/lib/python3.9/site-packages (4.6.4)\n",
+ "Requirement already satisfied: filelock in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (3.6.0)\n",
+ "Requirement already satisfied: six in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (1.16.0)\n",
+ "Requirement already satisfied: requests[socks] in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (2.28.1)\n",
+ "Requirement already satisfied: beautifulsoup4 in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (4.11.1)\n",
+ "Requirement already satisfied: tqdm in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (4.64.1)\n",
+ "Requirement already satisfied: soupsieve>1.2 in /home/laura/anaconda3/lib/python3.9/site-packages (from beautifulsoup4->gdown) (2.3.1)\n",
+ "Requirement already satisfied: charset-normalizer<3,>=2 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (2.0.4)\n",
+ "Requirement already satisfied: certifi>=2017.4.17 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (2022.9.14)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (3.3)\n",
+ "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (1.26.11)\n",
+ "Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (1.7.1)\n",
+ "Requirement already satisfied: gdown in /home/laura/anaconda3/lib/python3.9/site-packages (4.6.4)\n",
+ "Requirement already satisfied: six in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (1.16.0)\n",
+ "Requirement already satisfied: requests[socks] in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (2.28.1)\n",
+ "Requirement already satisfied: beautifulsoup4 in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (4.11.1)\n",
+ "Requirement already satisfied: filelock in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (3.6.0)\n",
+ "Requirement already satisfied: tqdm in /home/laura/anaconda3/lib/python3.9/site-packages (from gdown) (4.64.1)\n",
+ "Requirement already satisfied: soupsieve>1.2 in /home/laura/anaconda3/lib/python3.9/site-packages (from beautifulsoup4->gdown) (2.3.1)\n",
+ "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (1.26.11)\n",
+ "Requirement already satisfied: certifi>=2017.4.17 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (2022.9.14)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (3.3)\n",
+ "Requirement already satisfied: charset-normalizer<3,>=2 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (2.0.4)\n",
+ "Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /home/laura/anaconda3/lib/python3.9/site-packages (from requests[socks]->gdown) (1.7.1)\n",
+ "Downloading...\n",
+ "From: https://drive.google.com/uc?id=1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI\n",
+ "To: /home/laura/bl4/docs/docs/how-to-guides/updated_diabetes_data.csv\n",
+ "100%|██████████████████████████████████████| 17.8M/17.8M [00:00<00:00, 35.1MB/s]\n"
+ ]
+ }
+ ],
"source": [
"# installing BastionLab client & server packages\n",
"!pip install bastionlab\n",
@@ -106,7 +182,18 @@
"id": "A85GsYOi_G1o",
"outputId": "97b964bd-61b6-4cc6-e5e7-b9f2a2587bd7"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "BastionLab server (version 0.3.7) already installed\n",
+ "Libtorch (version 1.13.1) already installed\n",
+ "TLS certificates already generated\n",
+ "Bastionlab server is now running on port 50056\n"
+ ]
+ }
+ ],
"source": [
"# launch bastionlab_server test package\n",
"import bastionlab_server\n",
@@ -171,12 +258,28 @@
"metadata": {
"id": "mRJjgd1C_G1t"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "[2023-02-23T12:14:14Z INFO bastionlab] Authentication is disabled.\n",
+ "[2023-02-23T12:14:14Z INFO bastionlab] Telemetry is enabled.\n",
+ "[2023-02-23T12:14:14Z INFO bastionlab] BastionLab server listening on 0.0.0.0:50056.\n",
+ "[2023-02-23T12:14:14Z INFO bastionlab] Server ready to take requests\n",
+ "Error: transport error\n",
+ "\n",
+ "Caused by:\n",
+ " 0: error creating server listener: Address already in use (os error 98)\n",
+ " 1: Address already in use (os error 98)\n"
+ ]
+ }
+ ],
"source": [
"from bastionlab.polars.policy import Policy, Aggregation, Reject\n",
"\n",
"# defining the dataset's privacy policy\n",
- "policy = Policy(Aggregation(min_agg_size=10), unsafe_handling=Reject(), savable=True)"
+ "policy = Policy(Aggregation(min_agg_size=1), unsafe_handling=Reject(), savable=True)"
]
},
{
@@ -214,7 +317,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "4ab1725e-7ecc-4750-b004-1df14a191cb9\n"
+ "fa9a68d1-891c-417e-aea6-44bb1dcbf777\n"
]
}
],
@@ -276,11 +379,208 @@
},
"outputs": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[31mThe query has been rejected by the data owner.\u001b[37m\n"
- ]
+ "data": {
+ "text/html": [
+ "
"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "rdf.groupby(\"A1Cresult\").agg(pl.count().alias(\"count\")).pieplot(parts=\"count\", labels=\"A1Cresult\", key=\"False\")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We see here that in the vast majority of cases, A1C levels were not checked."
+ ]
+ },
+ {
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "S11nqyg0oufI"
},
"source": [
- "Next, let's take a look at the impact of A1C levels being checked during the hospital admission on the likelihood of a patient's medication being changed. The higher the level of A1C, the greater the risk of developing diabetes complications is."
+ "Next, let's take a look at the impact of A1C levels being checked during the hospital admission on the likelihood of a patient's medication being changed."
]
},
{
@@ -1460,38 +1798,6 @@
"ret.sort(pl.col(\"change\"), reverse=True).collect().fetch()"
]
},
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can visualize this as a pie chart using the `pieplot` method and pasing it the name of the columns that should be used as `labels` and the name of the column that should be used for the pie chart `parts` or slices."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- "
"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "ret.pieplot(\n",
- " parts=\"long-term readmitted\",\n",
- " labels=\"A1Cresult\",\n",
- " title=\"percentage of long-term readmissons per A1Cresult group\",\n",
- ")"
- ]
- },
{
"cell_type": "markdown",
"metadata": {
@@ -1830,28 +2088,110 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "TjaMlnbQIWcJ"
},
"source": [
- "### Dosage increases and decreases as factors on overall readmission"
+ "### Medication as a factor in overall readmission\n",
+ "\n",
+ "For the next part of our analysis, we will look at how treatment with different medications led to above or below average patient readmission.\n",
+ "\n",
+ "Let's start by getting the percentage of all patients in the study who were readmitted to hospital in the short or long term."
]
},
{
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ "shape: (1, 1)\n",
+ "\n",
+ "
\n",
+ "
\n",
+ "readmitted %\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "f64\n",
+ "
\n",
+ "
\n",
+ "\n",
+ "\n",
+ "
\n",
+ "
\n",
+ "46.088084\n",
+ "
\n",
+ "
\n",
+ "\n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ "shape: (1, 1)\n",
+ "┌──────────────┐\n",
+ "│ readmitted % │\n",
+ "│ --- │\n",
+ "│ f64 │\n",
+ "╞══════════════╡\n",
+ "│ 46.088084 │\n",
+ "└──────────────┘"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rdf.select((pl.col(\"is_readmitted\").sum() / pl.col(\"is_readmitted\").count() * 100).alias(\"readmitted %\")).collect().fetch()"
+ ]
+ },
+ {
+ "attachments": {},
"cell_type": "markdown",
- "metadata": {
- "id": "4dzRa_RN2_Oq"
- },
+ "metadata": {},
"source": [
- "We will now investigate the likelihood of increases or decreases of specific medications leading to short-term patient readmission.\n",
+ "Next let's get the percentage of patients readmitted to hospital for each medication, regardless of whether dosage was increased, decreased or remained the same.\n",
"\n",
- "Let's start by getting a list of the medications we want to look at. We will these lists down to drugs with more than 20 results to remove any medication with only a handful of results."
+ "Let's start by getting a list of the medications we want to look at. We will narrow this list down to drugs with more than 30 rows of data (\"increased\", \"steady\" or \"decreased\" dosage) to remove any medication with only a handful of results."
]
},
{
"cell_type": "code",
- "execution_count": 34,
+ "execution_count": 69,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -1866,15 +2206,20 @@
"['metformin',\n",
" 'repaglinide',\n",
" 'nateglinide',\n",
+ " 'chlorpropamide',\n",
" 'glimepiride',\n",
" 'glipizide',\n",
" 'glyburide',\n",
" 'pioglitazone',\n",
" 'rosiglitazone',\n",
- " 'insulin']"
+ " 'acarbose',\n",
+ " 'miglitol',\n",
+ " 'tolazamide',\n",
+ " 'insulin',\n",
+ " 'glyburide-metformin']"
]
},
- "execution_count": 34,
+ "execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
@@ -1907,85 +2252,34 @@
" \"metformin-pioglitazone\",\n",
"]\n",
"\n",
- "# get the number of increased doses per medication and flip the output vertically\n",
- "increased_meds = rdf.select(\n",
- " pl.col(x).str.count_match(\"Up\").sum() for x in all_meds\n",
- ").melt(variable_name=\"medication\", value_name=\"count\")\n",
- "\n",
- "# remove any medications that don't have at least 100 rows of data and get this result as a Polars dataframe\n",
- "increased_meds = increased_meds.filter(pl.col(\"count\") > 20).collect().fetch()\n",
- "\n",
- "# convert output to a list via Pandas API\n",
- "increased_meds = increased_meds.to_pandas()[\"medication\"].tolist()\n",
- "increased_meds"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GFA1-rz29W0X"
- },
- "source": [
- "We now do exactly the same for decreased medications."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "Ctt-Bg4f9dVH",
- "outputId": "5b122acc-26ea-483b-9008-79b121fab3ed"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['metformin',\n",
- " 'repaglinide',\n",
- " 'glimepiride',\n",
- " 'glipizide',\n",
- " 'glyburide',\n",
- " 'pioglitazone',\n",
- " 'rosiglitazone',\n",
- " 'insulin']"
- ]
- },
- "execution_count": 35,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# get the number of increased doses per medication and flip the output vertically\n",
- "decreased_meds = rdf.select(\n",
- " pl.col(x).str.count_match(\"Down\").sum() for x in all_meds\n",
+ "# get the number of increased, decreased and stable doses per medication and flip the output vertically\n",
+ "meds = rdf.select(\n",
+ " pl.col(x).count() - pl.col(x).str.count_match(\"No\").sum() for x in all_meds\n",
").melt(variable_name=\"medication\", value_name=\"count\")\n",
"\n",
"# remove any medications that don't have at least 100 rows of data and get this result as a Polars dataframe\n",
- "decreased_meds = decreased_meds.filter(pl.col(\"count\") > 20).collect().fetch()\n",
+ "meds = meds.filter(pl.col(\"count\") > 30).collect().fetch()\n",
"\n",
"# convert output to a list via Pandas API\n",
- "decreased_meds = decreased_meds.to_pandas()[\"medication\"].tolist()\n",
- "decreased_meds"
+ "meds = meds.to_pandas()[\"medication\"].tolist()\n",
+ "meds"
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Y_9as5FlqoZV"
},
"source": [
- "The next step is to loop over our list of `increased_meds` and get the percentage of patients who were readmitted to hospital within the following month after their dose of the drug was increased. We are able to use the `vstack` function to append each result for each drug into one table.\n",
+ "The next step is to loop over our list of `meds`. On each iteration, we filter out any \"no\" values, indicating that a patient did not follow this treatment, and get the percentage of patients who were readmitted to hospital. We are able to use the `vstack` function to append each result for each drug into one table.\n",
"\n",
- "We then simply add a column with the list of medicines in the same order and sort the list from highest to lowest."
+ "We then add a column with the list of medicines in the same order and sort the list from lowest to highest percentage of readmissions."
]
},
{
"cell_type": "code",
- "execution_count": 36,
+ "execution_count": 70,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -2025,7 +2319,7 @@
" }\n",
"\n",
"
"
]
@@ -1644,7 +1848,7 @@
}
],
"source": [
- "rdf.groupby(\"A1Cresult\").agg(pl.count().alias(\"count\")).pieplot(parts=\"count\", labels=\"A1Cresult\", key=\"False\")"
+ "rdf.groupby(\"A1Cresult\").agg(pl.count().alias(\"count\")).pieplot(parts=\"count\", labels=\"A1Cresult\", key=\"False\", title=\"A1Cresult group distribution\")"
]
},
{
@@ -1667,7 +1871,7 @@
},
{
"cell_type": "code",
- "execution_count": 28,
+ "execution_count": 68,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -1780,7 +1984,7 @@
"└───────────┴───────────┘"
]
},
- "execution_count": 28,
+ "execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
@@ -1799,6 +2003,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "xX672CJwpAYv"
@@ -1806,12 +2011,72 @@
"source": [
"Perhaps as expected, those with a very high or high A1Cresult were more likely to have a medication change. Interestingly, those who do not have their A1C level examined are only as likely to change medication as those with normal A1C levels. This shows doctors are less likely to change medication unless they know that A1C levels are higher than expected via exams.\n",
"\n",
+ "We can visualize this with the `pieplot` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 89,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "for group in [\"very high\", \"high\", None, \"normal\"]:\n",
+ " tmp = rdf.filter(pl.col(\"A1Cresult\") == group)\n",
+ " tmp.groupby(\"change\").agg(pl.count().alias(\"count\")).pieplot(parts=\"count\", labels=\"change\", title=(\"medication change for \" + (group if group else \"null\") + \" A1C group\"))"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
"What we now want to know is whether this has an impact on the likelihood of patient readmission in the short and long term."
]
},
{
"cell_type": "code",
- "execution_count": 30,
+ "execution_count": 69,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -1924,7 +2189,7 @@
"└───────────┴───────────────────────┘"
]
},
- "execution_count": 30,
+ "execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
@@ -1945,7 +2210,7 @@
},
{
"cell_type": "code",
- "execution_count": 32,
+ "execution_count": 70,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -2058,7 +2323,7 @@
"└───────────┴──────────────────────┘"
]
},
- "execution_count": 32,
+ "execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
@@ -2103,7 +2368,7 @@
},
{
"cell_type": "code",
- "execution_count": 34,
+ "execution_count": 71,
"metadata": {},
"outputs": [
{
@@ -2170,7 +2435,7 @@
"└──────────────┘"
]
},
- "execution_count": 34,
+ "execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
@@ -2191,7 +2456,7 @@
},
{
"cell_type": "code",
- "execution_count": 69,
+ "execution_count": 72,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -2219,7 +2484,7 @@
" 'glyburide-metformin']"
]
},
- "execution_count": 69,
+ "execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
@@ -2272,14 +2537,23 @@
"id": "Y_9as5FlqoZV"
},
"source": [
- "The next step is to loop over our list of `meds`. On each iteration, we filter out any \"no\" values, indicating that a patient did not follow this treatment, and get the percentage of patients who were readmitted to hospital. We are able to use the `vstack` function to append each result for each drug into one table.\n",
+ "Now we are ready to perform our query to get a table containing the percentage of patients following a treatment with each drug who were readmitted to hospital.\n",
+ "\n",
+ "The query will work by iterating over all the drugs we want to include in our final table.\n",
+ "\n",
+ "For each iteration we will get a row to add to our final table, containing the percentage of readmitted patients for that drug.\n",
"\n",
- "We then add a column with the list of medicines in the same order and sort the list from lowest to highest percentage of readmissions."
+ "To do this, we filter down that medication's column to rows that do not contain \"No\". This gives us rows where the patient was following some sort of treatment with the drug.\n",
+ "We then caclulate the percentage of those patients who were readmitted and give the result a column name `overall readmitted`.\n",
+ "\n",
+ "Then we add this row to the table using `vstack`. If the table doesn't yet exist, our query result becomes the table, which we will then add to!\n",
+ "\n",
+ "We finally use collect().fetch() to get out output as a Polars dataframe that we can display."
]
},
{
"cell_type": "code",
- "execution_count": 70,
+ "execution_count": 73,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -2288,6 +2562,47 @@
"id": "aEU6wEvYAhz6",
"outputId": "e51f858a-8992-4944-fb4c-74834b950207"
},
+ "outputs": [],
+ "source": [
+ "# create a null table value for later use\n",
+ "table = None\n",
+ "\n",
+ "# iterate over medications list\n",
+ "for drugs in meds:\n",
+ " # filter data down to cases where dosage was steady, increased or decreased\n",
+ " tmp = rdf.filter(pl.col(drugs) != \"No\").select(\n",
+ " [\n",
+ " (\n",
+ " pl.col(\"is_readmitted\").sum() / pl.col(\"is_readmitted\").count() * 100\n",
+ " ).alias(\"overall readmitted %\"),\n",
+ " ]\n",
+ " )\n",
+ " # on the first iteration, table takes the value of our tmp query; otherwise tmp is appended to the end of table\n",
+ " if table is None:\n",
+ " table = tmp\n",
+ " else:\n",
+ " table = table.vstack(tmp)\n",
+ "\n",
+ "# convert table to Polars dataframe\n",
+ "table = table.collect().fetch()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This table currently contains the percentage of readmissions for each medication we iterated over, but it does not include a column with the medication names. \n",
+ "\n",
+ "We will add this by converting out list of medications into a Polars Series with the column name `medication` and adding it to our table using the `with_columns` method.\n",
+ "\n",
+ "We now have an `overall readmitted %` and `medication` column in our table. To swap the order so that the `medication` column goes first, we can use `select` and select the columns in the order we want. We then sort the table by lowest to highest `overall readmitted %` value."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 74,
+ "metadata": {},
"outputs": [
{
"data": {
@@ -2482,37 +2797,17 @@
"└────────────────┴──────────────────────┘"
]
},
- "execution_count": 70,
+ "execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "# create a null table value for later use\n",
- "table = None\n",
- "\n",
- "# iterate over medications list\n",
- "for drugs in meds:\n",
- " # filter data down to cases where dosage was steady, increased or decreased\n",
- " tmp = rdf.filter(pl.col(drugs) != \"No\").select(\n",
- " [\n",
- " (\n",
- " pl.col(\"is_readmitted\").sum() / pl.col(\"is_readmitted\").count() * 100\n",
- " ).alias(\"overall readmitted %\"),\n",
- " ]\n",
- " )\n",
- " # if first iteration, table give value of our tmp query, otherwise tmp query appended to end of table\n",
- " if table == None:\n",
- " table = tmp\n",
- " else:\n",
- " table = table.vstack(tmp)\n",
- "\n",
- "# convert table to Polars dataframe\n",
- "table = table.collect().fetch()\n",
- "\n",
"# create and add new column with medication names in same order as iteration and disaply results\n",
"new_col = pl.Series(\"medication\", meds)\n",
- "table.with_columns([new_col]).select([\"medication\", \"overall readmitted %\"]).sort(\n",
+ "table = table.with_columns([new_col])\n",
+ "\n",
+ "table.select([\"medication\", \"overall readmitted %\"]).sort(\n",
" pl.col(\"overall readmitted %\")\n",
")"
]
@@ -2528,12 +2823,13 @@
"\n",
"There may be medical explanations for this such as certain drugs being linked to more complex cases.\n",
"\n",
+ "### NOTE TO OPHELIE- WE COULD DROP THE NEXT EXAMPLE AND JUST STICK WITH THE FIRST ONE FOR THIS SECTION?\n",
"We can equally repeat the same process but zoom in just on dosages that were increased (`Up`), `decreased` (`Down`) or `steady` (`Steady`). In this case, let's take a look at dosages that were `decreased`. Feel free to replace the DOSAGE variable with `decreased` or `steady` or the condition to check for short or long-term readmissions only and re-run the cell if you want to take a look at how they compare."
]
},
{
"cell_type": "code",
- "execution_count": 66,
+ "execution_count": 75,
"metadata": {},
"outputs": [
{
@@ -2549,7 +2845,7 @@
" 'insulin']"
]
},
- "execution_count": 66,
+ "execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
@@ -2571,7 +2867,7 @@
},
{
"cell_type": "code",
- "execution_count": 67,
+ "execution_count": 76,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -2724,7 +3020,7 @@
"└───────────────┴──────────────────────┘"
]
},
- "execution_count": 67,
+ "execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
@@ -2732,8 +3028,6 @@
"source": [
"# create a null table value for later use\n",
"table = None\n",
- "\n",
- "# iterate over medications list\n",
"for drugs in meds:\n",
" # filter data down to cases where dosage increased\n",
" tmp = rdf.filter(pl.col(drugs) == DOSAGE)\n",
@@ -2805,7 +3099,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 77,
"metadata": {
"id": "xROO5Oxzvev-"
},
From 5b8cd86bb4ce408550139aa208b2acf57aa85c17 Mon Sep 17 00:00:00 2001
From: lyie28
Date: Fri, 24 Feb 2023 15:42:05 +0100
Subject: [PATCH 19/22] fmt
---
client/src/bastionlab/polars/remote_polars.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
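Reviewer note: the only change in this patch is the trailing comma after `**kwargs`, which is presumably what the formatter emits when a signature is split one parameter per line (the subject only says `fmt`). A minimal sketch of why this is behavior-neutral, using a hypothetical stand-in function `describe_plot` rather than the real `pieplot` signature:

```python
# Hypothetical stand-in for a plotting helper; this is not the real `pieplot`.
def describe_plot(
    parts: str,
    title: str = None,
    **kwargs,  # a trailing comma after **kwargs is valid syntax on Python 3.6+
) -> dict:
    """Return the received arguments as a dict so the call can be inspected."""
    return {"parts": parts, "title": title, **kwargs}


# The call behaves exactly as it would without the trailing comma.
result = describe_plot("count", title="A1Cresult group distribution", key="False")
```

Since the notebook asks for Python 3.7 or greater, the trailing comma should not affect supported interpreters.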
diff --git a/client/src/bastionlab/polars/remote_polars.py b/client/src/bastionlab/polars/remote_polars.py
index de70c548..e7f438a7 100644
--- a/client/src/bastionlab/polars/remote_polars.py
+++ b/client/src/bastionlab/polars/remote_polars.py
@@ -548,7 +548,7 @@ def pieplot(
key_loc: str = "center left",
key_title: str = None,
key_bbox=(1, 0, 0.5, 1),
- **kwargs
+ **kwargs,
) -> None:
"""Draws a pie chart based on values within single column.
pieplot collects necessary data only and calculates percentage values before calling matplotlib pyplot's pie function to create a pie chart.
From bcf28bd1f191cbbe0d898f8d64f2d9f37641f8f9 Mon Sep 17 00:00:00 2001
From: Knulpinette
Date: Mon, 27 Feb 2023 16:24:30 +0100
Subject: [PATCH 20/22] reviewed part 2
---
.../how-to-guides/diabetes_exploration.ipynb | 319 +++---------------
1 file changed, 41 insertions(+), 278 deletions(-)
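Reviewer note: the bullet list this patch introduces for the per-medication query (filter out `No`, compute the readmitted percentage, stack one row per drug, then sort) can be sketched in plain Python on made-up rows. This only illustrates the logic and is not BastionLab's API; the toy `rows`, the two-drug `meds` list and the `readmitted_percentage` helper are assumptions for the example.

```python
# Toy rows standing in for the remote dataset; all values are made up.
rows = [
    {"metformin": "Steady", "insulin": "No", "is_readmitted": 1},
    {"metformin": "No", "insulin": "Up", "is_readmitted": 0},
    {"metformin": "Up", "insulin": "Down", "is_readmitted": 0},
    {"metformin": "Down", "insulin": "Steady", "is_readmitted": 1},
]
meds = ["metformin", "insulin"]


def readmitted_percentage(rows, drug):
    """Percentage of readmitted patients among those treated with `drug`."""
    # keep only rows where the patient followed some treatment with the drug
    treated = [row for row in rows if row[drug] != "No"]
    return sum(row["is_readmitted"] for row in treated) / len(treated) * 100


# one row per medication, mirroring the row-by-row stacking in the notebook
table = [
    {"medication": drug, "overall readmitted %": readmitted_percentage(rows, drug)}
    for drug in meds
]

# sort from lowest to highest percentage of readmissions
table.sort(key=lambda row: row["overall readmitted %"])
```

In the notebook itself the same steps run remotely, and the medication names are attached afterwards as a separate column.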
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index 58270ccc..8afea5f4 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -1096,13 +1096,15 @@
"id": "NfRexmoN0X9h"
},
"source": [
- "We are all set-up, so let's dive into the analysis.\n",
+ "We are all set-up! Let's dive into the analysis.\n",
"\n",
"### Age as a factor in readmission and emergency trips\n",
"\n",
- "Let's start by visualizing the number of patients who were readmitted to hospital for diabetes-related issues during the study.\n",
+ "We'll start by visualizing the number of patients who were readmitted to hospital for diabetes-related issues during the study.\n",
"\n",
- "To do this we group data by `age` and aggregate the `sum` of those who were readmitted. We'll generate a barplot for this query."
+ "To do this we group data by `age` and aggregate the `sum` of those who were readmitted. We'll generate a barplot for this query.\n",
+ "\n",
+ "***# missing comment(s) in code =)***"
]
},
{
@@ -1136,6 +1138,7 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "BdRDwT74BOrr"
@@ -1145,7 +1148,7 @@
"\n",
"If we take a look at the mean number of cases per age group using `histplot`, we see that it follows the same trend. But it shows that it may not represent a higher risk of readmission for older patients, rather a much increased number of diabetes patients in older age groups.\n",
"\n",
- "***# LAST SENTENCE IS UNCLEAR. WHAT DO YOU MEAN?***"
+ "***# Last sentence is unclear. What do you mean?***"
]
},
{
@@ -1184,15 +1187,17 @@
"source": [
"If we zoom in on `short-term` and `long-term` readmittance individually and get the percentage of patients in these groups who are readmitted instead of the count, we get a rather different picture.\n",
"\n",
- "***# SENTENCE IS TOO LONG. Maybe try to use more direct formulations and cut the sentences more so each one says one thing (2 tops)?***\n",
+ "***# Sentence is too long. Maybe try to use more direct formulations and cut the sentences more so each one says one thing (2 tops)?***\n",
"\n",
"To get these percentage values, we divide the total number of short-term or long-term values in the readmitted column by the total values in this column.\n",
"\n",
- "To get the total short-term or long-term values, we use the str.count_match function to fill the readmitted column with True (1) values where the contents of the cell are short-term or long-term respectively and False (0) for any other values. We can use the sum function to count up all of these True values.\n",
+ "To get the total short-term or long-term values, we use the `str.count_match` function to fill the readmitted column with True (`1`) values where the contents of the cell are short-term or long-term respectively and False (`0`) for any other values. We can use the `sum()` function to count up all of these True values.\n",
"\n",
- "To get the total values in the readmitted column, we select the column and use count() function.\n",
+ "To get the total values in the readmitted column, we select the column and use `count()` function.\n",
"\n",
- "We can then set the column name to whatever we like using the alias function."
+ "We can then set the column name to whatever we like using the alias function.\n",
+ "\n",
+ "***# Here you could maybe put some of the previous info as comments in code and get the paragraph a bit easier to read? Or just put the info again so it's easier to go through the code?***\n"
]
},
{
@@ -1252,12 +1257,13 @@
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "nlHmjMozE38p"
},
"source": [
- "We see a slight trend of increased long-term readmissions as age increases, but interestingly, a much higher risk of short-term readmission in 20-30 year olds. This could be explained by younger patients perhaps not having yet found the correct treatment or lifestyle to manage their diabetes."
+ "We see a slight trend of increased long-term readmissions as age increases. But interestingly, there is a much higher risk of short-term readmission in 20-30 year olds. This could be explained by younger patients perhaps not having yet found the correct treatment or lifestyle to manage their diabetes."
]
},
{
@@ -1320,11 +1326,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "High or very high A1C levels can indicate an increased risk of diabetes complications. In this dataset, A1C levels are grouped into \"very high\", \"high\", \"normal\" and \"null\" (not taken).\n",
+ "High or very high A1C levels can indicate an increased risk of diabetes complications. In this dataset, A1C levels are grouped into `very high`, `high`, `normal` and `null` (A1C levels not taken).\n",
"\n",
- "Let's take a look at the impact of A1C levels being checked during the hospital admission on the likelihood of a patient's medication being changed.\n",
+ "Let's take a look at the impact A1C levels being checked during the hospital admission can have on the likelihood of a patient's medication being changed.\n",
"\n",
- "We group the dataset by A1Cresult group and get the percentage of patients in each of these groups who had a change to their medication during their hospital stay."
+ "We group the dataset by `A1Cresult` group and get the percentage of patients in each of these groups who had a change to their medication during their hospital stay."
]
},
{
@@ -1467,14 +1473,14 @@
"id": "xX672CJwpAYv"
},
"source": [
- "Perhaps as expected, those with a very high or high A1Cresult were more likely to have a medication change. Interestingly, those who do not have their A1C level examined are only as likely to change medication as those with normal A1C levels. This shows doctors are less likely to change medication unless they know that A1C levels are higher than expected via exams.\n",
+ "As expected, those with a very high or high A1Cresult were more likely to have a medication change. But interestingly, those who do not have their A1C level examined are as likely to change medication as those with normal A1C levels. This shows doctors are less likely to change medication unless they know that A1C levels are higher than expected via exams.\n",
"\n",
"We can visualize this trend by comparing a piechart of medication change for those who did and did not have their A1C level recorded.\n",
"\n",
"To do this:\n",
"\n",
- "- We first create a subplot grid with space for two plots, ax1 and ax2.\n",
- "- We then filter the dataset down into two datasets, a `taken` group which filters out any null data from the `A1Cresult` column and a `non_taken` group which filters out any non-null data from teh `A1Cresult` column.\n",
+ "- We first create a subplot grid with space for two plots, `ax1` and `ax2`.\n",
+ "- We filter the dataset down to two datasets: a `taken` group which filters out any null data from the `A1Cresult` column, and a `non_taken` group which filters out any non-null data from the `A1Cresult` column.\n",
"- We group our two datasets by the `change` column and create a `count` column for our two change groups, `True` and `False`. \n",
"- Finally we call `pieplot`.\n",
"\n",
@@ -1533,7 +1539,7 @@
"\n",
"What we now want to know is whether this has an impact on the likelihood of patient readmission in the short and long term.\n",
"\n",
- "Let's start by getting the percentage of patients in each group who were readmitted to hispital within the following month after their hospital stay."
+ "Let's start by getting the percentage of patients in each group who were readmitted to hospital within the following month after their first stay."
]
},
{
@@ -1657,7 +1663,8 @@
}
],
"source": [
- "# percentages of those readmitted within a month of initial hospital visit by A1C result group\n",
+ "# percentages of those readmitted within a month of initial hospital\n",
+ "# visit by A1C result group\n",
"ret = rdf.groupby(pl.col(\"A1Cresult\")).agg(\n",
" [\n",
" (\n",
@@ -1820,7 +1827,7 @@
"id": "HmsKOU2bqz6F"
},
"source": [
- "We see that patients who did not have their A1C level taken are the most likely to be readmitted within a month of their hospital admission. They were also almost as likely as their \"very high\" counterparts to be readmitted in the long-term. \n",
+ "We see that patients who did not have their A1C level taken are the most likely to be readmitted within a month of their hospital admission. They were also almost as likely as their `very high` counterparts to be readmitted in the long-term. \n",
"\n",
"Our findings suggests that:\n",
"- Taking patients' A1C levels may help encourage doctors to make changes in medication.\n",
@@ -1838,7 +1845,9 @@
"\n",
"For the next part of our analysis, we will look at how treatment with different medications led to above or below average patient readmission.\n",
"\n",
- "Let's start by getting the percentage of all patients in the study who were readmitted to hospital in the short or long-term."
+ "Let's start by getting the percentage of all patients in the study who were readmitted to hospital in the short or long-term.\n",
+ "\n",
+ "***# some comment in the code maybe here?***"
]
},
{
@@ -1930,7 +1939,7 @@
"source": [
"Next let's get the percentage of patients readmitted to hospital for each medication, regardless of whether dosage was increased, decreased or remained the same.\n",
"\n",
- "Let's start by getting a list of the medications we want to look at. We will narrow this list down to drugs with more than 30 rows of data (\"increased\", \"steady\" or \"decreased\" dosage) to remove any medication with only a handful of results."
+ "Let's start by getting a list of the medications we want to look at. We will narrow this list down to drugs with more than 30 rows of data (`increased`, `steady` or `decreased` dosage) to remove any medication with only a handful of results."
]
},
{
@@ -2022,12 +2031,13 @@
"\n",
"For each iteration we will get a row to add to our final table, containing the percentage of readmitted patients for that drug.\n",
"\n",
- "To do this, we filter down that medication's column to rows that do not contain \"No\". This gives us rows where the patient was following some sort of treatment with the drug.\n",
- "We then caclulate the percentage of those patients who were readmitted and give the result a column name `overall readmitted`.\n",
+ "- To do this, we filter down that medication's column to rows that do not contain `No`. This gives us rows where the patient was following some sort of treatment with the drug.\n",
+ "\n",
+ "- We calculate the percentage of those patients who were readmitted and give the result the column name `overall readmitted %`.\n",
"\n",
- "Then we add this row to the table using `vstack`. If the table doesn't yet exist, our query result becomes the table, which we will then add to!\n",
+ "- We add this row to the table using `vstack`. If the table doesn't yet exist, our query result becomes the table, and we'll add to it!\n",
"\n",
- "We finally use collect().fetch() to get out output as a Polars dataframe that we can display."
+ "- We use `collect().fetch()` to get the output as a Polars dataframe that we can display."
]
},
{
@@ -2075,7 +2085,7 @@
"\n",
"We will add this by converting out list of medications into a Polars Series with the column name `medication` and adding it to our table using the `with_columns` method.\n",
"\n",
- "We now have an `overall readmitted %` and `medication` column in our table. To swap the order so that the `medication` column goes first, we can use `select` and select the columns in the order we want. We then sort the table by lowest to highest `overall readmitted %` value."
+ "Our table will have the `overall readmitted %` column before the `medication` one. To swap the order so `medication` goes first, we'll use `select` and select the columns in the order we want. We'll then sort the table by lowest to highest `overall readmitted %` value."
]
},
{
@@ -2282,7 +2292,7 @@
}
],
"source": [
- "# create and add new column with medication names in same order as iteration and disaply results\n",
+ "# create and add new column with medication names in same order as iteration and display results\n",
"new_col = pl.Series(\"medication\", meds)\n",
"table = table.with_columns([new_col])\n",
"\n",
@@ -2298,251 +2308,9 @@
"id": "PtGj-w4OrU_5"
},
"source": [
- "This gives us significiant results with 12-13% less patients taking `tolazamide` or `tolbutamide` readmitted to hospital than the overall average, while 17% more patients taking `miglitol` were readmitted!\n",
- "\n",
- "There may be medical explanations for this such as certain drugs being linked to more complex cases.\n",
- "\n",
- "### NOTE TO OPHELIE- WE COULD DROP THE NEXT EXAMPLE AND JUST STICK WITH THE FIRST ONE FOR THIS SECTION IF WE FEEL LIKE THIS IS ALREADY LONG ENOUGH?\n",
- "We can equally repeat the same process but zoom in just on dosages that were increased (`Up`), `decreased` (`Down`) or `steady` (`Steady`). In this case, let's take a look at dosages that were `decreased`. Feel free to replace the DOSAGE variable with `decreased` or `steady` or the condition to check for short or long-term readmissions only and re-run the cell if you want to take a look at how they compare."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['metformin',\n",
- " 'repaglinide',\n",
- " 'glimepiride',\n",
- " 'glipizide',\n",
- " 'glyburide',\n",
- " 'pioglitazone',\n",
- " 'rosiglitazone',\n",
- " 'insulin']"
- ]
- },
- "execution_count": 36,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# get the number of decreased doses per medication and flip the output vertically\n",
- "DOSAGE = \"Down\"\n",
- "meds = rdf.select(pl.col(x).str.count_match(\"Down\").sum() for x in all_meds).melt(\n",
- " variable_name=\"medication\", value_name=\"count\"\n",
- ")\n",
- "\n",
- "# remove any medications that don't have at least 100 rows of data and get this result as a Polars dataframe\n",
- "meds = meds.filter(pl.col(\"count\") > 30).collect().fetch()\n",
- "\n",
- "# convert output to a list via Pandas API\n",
- "meds = meds.to_pandas()[\"medication\"].tolist()\n",
- "meds"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 37,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 312
- },
- "id": "nnGQB49dhwqE",
- "outputId": "f0f36f25-88fa-452a-a34e-6d65e2d981d2"
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- "shape: (8, 2)\n",
- "┌───────────────┬──────────────────────┐\n",
- "│ medication ┆ overall readmitted % │\n",
- "│ --- ┆ --- │\n",
- "│ str ┆ f64 │\n",
- "╞═══════════════╪══════════════════════╡\n",
- "│ rosiglitazone ┆ 31.034483 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ metformin ┆ 45.043478 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ glimepiride ┆ 47.938144 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ glyburide ┆ 48.758865 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ repaglinide ┆ 48.888889 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ insulin ┆ 52.790964 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ glipizide ┆ 52.857143 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ pioglitazone ┆ 53.389831 │\n",
- "└───────────────┴──────────────────────┘"
- ]
- },
- "execution_count": 37,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# create a null table value for later use\n",
- "table = None\n",
- "for drugs in meds:\n",
- " # filter data down to cases where dosage increased\n",
- " tmp = rdf.filter(pl.col(drugs) == DOSAGE)\n",
- " # get a RemoteLazyFrame of percentages of patients where each drug is increased were readmitted to hospital during study\n",
- " percentages = tmp.select(\n",
- " [\n",
- " (\n",
- " pl.col(\"is_readmitted\").sum() / pl.col(\"is_readmitted\").count() * 100\n",
- " ).alias(\"overall readmitted %\"),\n",
- " ]\n",
- " )\n",
- " # if first iteration, table and data_avilable are assigned percentages and row_count tables\n",
- " if table == None:\n",
- " table = percentages\n",
- " # else we use vstack to add new row of percentages\n",
- " else:\n",
- " table = table.vstack(percentages)\n",
- "\n",
- "table = table.collect().fetch()\n",
- "new_col = pl.Series(\"medication\", meds)\n",
- "table = table.with_columns([new_col])\n",
- "table.select([\"medication\", \"overall readmitted %\"]).sort(\n",
- " pl.col(\"overall readmitted %\")\n",
- ")"
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {
- "id": "DuMXzt7zIHp3"
- },
- "source": [
- "Here we see that patients with a decreased dosage of `rosiglitazone` were readmitted at a rate well below avergae. Meanwhile, patients who decreased their dosage of `insulin`, `glipizide` or `pioglitazone` were all at least 6% more likely to be readmitted to hospital.\n",
+ "This gives us significant results: 12-13% fewer patients taking `tolazamide` or `tolbutamide` were readmitted to hospital compared to the overall average, while 17% more patients taking `miglitol` were readmitted!\n",
"\n",
- "These results could helpe us to flag medications which are riskier to decrease or encourage us to decrease the dosages of others."
+ "There may be medical explanations for this, such as certain drugs being linked to more complex cases. In any case, these results could help flag medications whose dosages are riskier to decrease, or encourage decreasing the dosages of others."
]
},
{
@@ -2554,7 +2322,7 @@
"source": [
"### Conclusions\n",
"\n",
- "This brings us to the end of our data exploration. We gained meaningful insights:\n",
+ "We gained meaningful insights from this exploration:\n",
"\n",
"- 20-30 year olds are the most at-risk age group of short-term hospital readmission and emergency visits.\n",
" \n",
@@ -2566,14 +2334,9 @@
"\n",
"- Patients following a treatment of `tolazamide` or `tolbutamide` were readmitted at a well below avergae rate, while those taking `miglitol` were readmitted at a rate well above average.\n",
"\n",
- "- Patients with a decreased dose of `rosiglitazone` were readmitted well below the average rate of readmission, while those with with a decreased dose of `insulin`, `glipizide` or `pioglitazone` readmitted at a significantly above average rate.\n",
- "\n",
- "\n",
- "This is a rich dataset with many avenues to explore, so feel free to continue exploring!\n",
- "\n",
- "However in our case, that's all we've got time for! Let's close our connection and stop the server. \n",
+ "This is a rich dataset with many avenues to explore, so feel free to continue exploring and running more queries in this notebook!\n",
"\n",
- "(Leave this next block commented if you want to continue to run queries on the dataset instead!)\n"
+ "But once you're done, you can close the connection and stop the server. To do so, uncomment the following code block:"
]
},
{
From dbefea13a470d3473a1edf31f5e4945dbb660540 Mon Sep 17 00:00:00 2001
From: lyie28
Date: Mon, 27 Feb 2023 16:56:40 +0100
Subject: [PATCH 21/22] Updated
---
.../how-to-guides/diabetes_exploration.ipynb | 160 +++++++++++-------
1 file changed, 95 insertions(+), 65 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index 8afea5f4..d1b65ffd 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -214,7 +214,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "d38b27a2-9ad4-451e-bf8c-f0d280eee561\n"
+ "4d6d50ed-7d46-48a4-ad24-a87bec236b80\n"
]
}
],
@@ -347,7 +347,7 @@
{
"data": {
"text/plain": [
- "FetchableLazyFrame(identifier=d38b27a2-9ad4-451e-bf8c-f0d280eee561)"
+ "FetchableLazyFrame(identifier=4d6d50ed-7d46-48a4-ad24-a87bec236b80)"
]
},
"execution_count": 8,
@@ -975,7 +975,7 @@
{
"data": {
"text/plain": [
- "'4e458c87-76c9-4900-9b4f-e8a93e2f7bdf'"
+ "'2d3b0bbe-0d8b-4e0a-88e6-97e73e64ca7e'"
]
},
"execution_count": 20,
@@ -1036,7 +1036,7 @@
{
"data": {
"text/plain": [
- "FetchableLazyFrame(identifier=4e458c87-76c9-4900-9b4f-e8a93e2f7bdf)"
+ "FetchableLazyFrame(identifier=2d3b0bbe-0d8b-4e0a-88e6-97e73e64ca7e)"
]
},
"execution_count": 22,
@@ -1102,9 +1102,7 @@
"\n",
"We'll start by visualizing the number of patients who were readmitted to hospital for diabetes-related issues during the study.\n",
"\n",
- "To do this we group data by `age` and aggregate the `sum` of those who were readmitted. We'll generate a barplot for this query.\n",
- "\n",
- "***# missing comment(s) in code =)***"
+ "To do this we group data by `age` and aggregate the `sum` of those who were readmitted. We'll generate a barplot for this query."
]
},
{
@@ -1131,9 +1129,12 @@
}
],
"source": [
+ "# get total number of patients readmitted per each age group\n",
"total_readmitted = rdf.groupby(\"age\").agg(\n",
" pl.col(\"is_readmitted\").sum().alias(\"total readmitted\")\n",
")\n",
+ "\n",
+ "# visualize this query with barplot\n",
"total_readmitted.barplot(x=\"age\", y=\"total readmitted\")"
]
},
@@ -1144,11 +1145,9 @@
"id": "BdRDwT74BOrr"
},
"source": [
- "In terms of the number of readmissions, we see a clear trend: readmission cases increase with age, before dropping down in the 80-90 and 90-100 age groups. This could be due to increased mortality in these age ranges.\n",
- "\n",
- "If we take a look at the mean number of cases per age group using `histplot`, we see that it follows the same trend. But it shows that it may not represent a higher risk of readmission for older patients, rather a much increased number of diabetes patients in older age groups.\n",
+ "In terms of the number of readmissions, we see a clear trend: the number of patients readmitted to the hospital increased with age, before dropping down in the 80-90 and 90-100 age groups. This could be due to increased mortality in these age ranges.\n",
"\n",
- "***# Last sentence is unclear. What do you mean?***"
+ "However, this trend does not indicate an increased risk of readmission; rather, it reflects the number of patients in each age group who took part in this study. If we look at the overall age distribution of the patients in the study, we see that it matches the trend seen in our previous bar plot."
]
},
{
@@ -1175,6 +1174,7 @@
}
],
"source": [
+ "# get age distribution of patients\n",
"rdf.histplot(x=\"age\")"
]
},
@@ -1185,19 +1185,11 @@
"id": "xPFiho5eEKNT"
},
"source": [
- "If we zoom in on `short-term` and `long-term` readmittance individually and get the percentage of patients in these groups who are readmitted instead of the count, we get a rather different picture.\n",
- "\n",
- "***# Sentence is too long. Maybe try to use more direct formulations and cut the sentences more so each one says one thing (2 tops)?***\n",
- "\n",
- "To get these percentage values, we divide the total number of short-term or long-term values in the readmitted column by the total values in this column.\n",
+ "Let's now look at the percentage of patients in each age category who were readmitted within a month of their hospital admission.\n",
"\n",
- "To get the total short-term or long-term values, we use the `str.count_match` function to fill the readmitted column with True (`1`) values where the contents of the cell are short-term or long-term respectively and False (`0`) for any other values. We can use the `sum()` function to count up all of these True values.\n",
+ "To get these percentage values, we divide the total number of short-term values in the readmitted column by the total number of values in this column.\n",
"\n",
- "To get the total values in the readmitted column, we select the column and use `count()` function.\n",
- "\n",
- "We can then set the column name to whatever we like using the alias function.\n",
- "\n",
- "***# Here you could maybe put some of the previous info as comments in code and get the paragraph a bit easier to read? Or just put the info again so it's easier to go through the code?***\n"
+ "We can then set the column name to whatever we like using the `alias` function."
]
},
{
@@ -1221,7 +1213,43 @@
},
"metadata": {},
"output_type": "display_data"
- },
+ }
+ ],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "short_term = rdf.groupby(\"age\").agg(\n",
+ " (\n",
+ " pl.col(\"readmitted\")\n",
+ " .str.count_match(\"short-term\")\n",
+ " .sum() # get number of patients in short-term readmitted category\n",
+ " / pl.col(\n",
+ " \"readmitted\"\n",
+ " ).count() # get number of all patients in this age category regardless of readmitted status\n",
+ " * 100\n",
+ " ).alias(\n",
+ " \"short-term readmitted\"\n",
+ " ) # set name for our new percentage column\n",
+ ")\n",
+ "\n",
+ "# display as bar plot\n",
+ "short_term.barplot(x=\"age\", y=\"short-term readmitted\")\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will now do the same but for the percentage of patients in each age group who were readmitted at least a month after their first hospital visit."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {},
+ "outputs": [
{
"data": {
"image/png": "",
@@ -1234,26 +1262,23 @@
}
],
"source": [
- "import matplotlib.pyplot as plt\n",
- "\n",
- "short_term = rdf.groupby(\"age\").agg(\n",
- " (\n",
- " pl.col(\"readmitted\").str.count_match(\"short-term\").sum()\n",
- " / pl.col(\"readmitted\").count()\n",
- " * 100\n",
- " ).alias(\"short-term readmitted\")\n",
- ")\n",
"long_term = rdf.groupby(\"age\").agg(\n",
" (\n",
- " pl.col(\"readmitted\").str.count_match(\"long-term\").sum()\n",
- " / pl.col(\"readmitted\").count()\n",
+ " pl.col(\"readmitted\")\n",
+ " .str.count_match(\"long-term\")\n",
+ " .sum() # get number of patients in long-term readmitted category\n",
+ " / pl.col(\n",
+ " \"readmitted\"\n",
+ " ).count() # get number of all patients in this age category regardless of readmitted status\n",
" * 100\n",
- " ).alias(\"long-term readmitted\")\n",
+ " ).alias(\n",
+ " \"long-term readmitted\"\n",
+ " ) # set name for our new percentage column\n",
")\n",
"\n",
- "short_term.barplot(x=\"age\", y=\"short-term readmitted\")\n",
- "plt.show()\n",
- "long_term.barplot(x=\"age\", y=\"long-term readmitted\")"
+ "# display as bar plot\n",
+ "long_term.barplot(x=\"age\", y=\"long-term readmitted\")\n",
+ "plt.show()"
]
},
{
@@ -1263,7 +1288,7 @@
"id": "nlHmjMozE38p"
},
"source": [
- "We see a slight trend of increased long-term readmissions as age increases. But interestingly, there is a much higher risk of short-term readmission in 20-30 year olds. This could be explained by younger patients perhaps not having yet found the correct treatment or lifestyle to manage their diabetes."
+ "We see a slight trend of increased long-term readmissions as age increases. But most interestingly, there is a much higher risk of short-term readmission in 20-30 year olds. This could be explained by younger patients perhaps not having yet found the correct treatment or lifestyle to manage their diabetes."
]
},
{
@@ -1278,7 +1303,7 @@
},
{
"cell_type": "code",
- "execution_count": 27,
+ "execution_count": 28,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -1335,7 +1360,7 @@
},
{
"cell_type": "code",
- "execution_count": 28,
+ "execution_count": 29,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -1448,7 +1473,7 @@
"└───────────┴───────────┘"
]
},
- "execution_count": 28,
+ "execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
@@ -1491,7 +1516,7 @@
},
{
"cell_type": "code",
- "execution_count": 29,
+ "execution_count": 30,
"metadata": {},
"outputs": [
{
@@ -1544,7 +1569,7 @@
},
{
"cell_type": "code",
- "execution_count": 30,
+ "execution_count": 31,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -1657,7 +1682,7 @@
"└───────────┴───────────────────────┘"
]
},
- "execution_count": 30,
+ "execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
@@ -1687,7 +1712,7 @@
},
{
"cell_type": "code",
- "execution_count": 31,
+ "execution_count": 32,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -1800,7 +1825,7 @@
"└───────────┴──────────────────────┘"
]
},
- "execution_count": 31,
+ "execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
@@ -1845,14 +1870,12 @@
"\n",
"For the next part of our analysis, we will look at how treatment with different medications led to above or below average patient readmission.\n",
"\n",
- "Let's start by getting the percentage of all patients in the study who were readmitted to hospital in the short or long-term.\n",
- "\n",
- "***# some comment in the code maybe here?***"
+ "Let's start by getting the percentage of all patients in the study who were readmitted to hospital in the short or long-term."
]
},
{
"cell_type": "code",
- "execution_count": 32,
+ "execution_count": 33,
"metadata": {},
"outputs": [
{
@@ -1919,17 +1942,24 @@
"└──────────────┘"
]
},
- "execution_count": 32,
+ "execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "rdf.select(\n",
- " (pl.col(\"is_readmitted\").sum() / pl.col(\"is_readmitted\").count() * 100).alias(\n",
- " \"readmitted %\"\n",
+ "# get overall percentage of patients who were readmitted\n",
+ "(\n",
+ " rdf.select(\n",
+ " (\n",
+ " pl.col(\"is_readmitted\").sum() / pl.col(\"is_readmitted\").count() * 100\n",
+ " ).alias( # calculate percentage\n",
+ " \"readmitted %\"\n",
+ " ) # set column name\n",
" )\n",
- ").collect().fetch()"
+ " .collect()\n",
+ " .fetch()\n",
+ ")"
]
},
{
@@ -1944,7 +1974,7 @@
},
{
"cell_type": "code",
- "execution_count": 33,
+ "execution_count": 34,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -1972,7 +2002,7 @@
" 'glyburide-metformin']"
]
},
- "execution_count": 33,
+ "execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
@@ -2042,7 +2072,7 @@
},
{
"cell_type": "code",
- "execution_count": 34,
+ "execution_count": 35,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -2090,7 +2120,7 @@
},
{
"cell_type": "code",
- "execution_count": 35,
+ "execution_count": 36,
"metadata": {},
"outputs": [
{
@@ -2286,7 +2316,7 @@
"└────────────────┴──────────────────────┘"
]
},
- "execution_count": 35,
+ "execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
@@ -2341,7 +2371,7 @@
},
{
"cell_type": "code",
- "execution_count": 38,
+ "execution_count": 37,
"metadata": {
"id": "xROO5Oxzvev-"
},
@@ -2357,7 +2387,7 @@
"provenance": []
},
"kernelspec": {
- "display_name": "Python 3",
+ "display_name": "base",
"language": "python",
"name": "python3"
},
@@ -2371,12 +2401,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.8.10"
+ "version": "3.9.13"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
- "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
+ "hash": "d130ca42b532f14c740c9405384e6a25814bad609bad1a40b3b3f26954036080"
}
}
},
From 201611d9f6e7449f0c455cd6b802283612c3ab20 Mon Sep 17 00:00:00 2001
From: Knulpinette
Date: Wed, 1 Mar 2023 11:54:33 +0100
Subject: [PATCH 22/22] reviewed! good to go :)
---
.../how-to-guides/diabetes_exploration.ipynb | 34 +++++++++----------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/docs/docs/how-to-guides/diabetes_exploration.ipynb b/docs/docs/how-to-guides/diabetes_exploration.ipynb
index d1b65ffd..6a3b1b38 100644
--- a/docs/docs/how-to-guides/diabetes_exploration.ipynb
+++ b/docs/docs/how-to-guides/diabetes_exploration.ipynb
@@ -20,7 +20,7 @@
"\n",
"In this guide, we will explore a dataset of diabetic patients admitted to hospital in the US over a ten year period. Diabetes is a disease that affects over 10% of the US population and can lead to serious health complications. The dataset contains 51 columns of data, including readmission to hospital, changes to medication and primary, secondary and terciary patient diagnoses.\n",
"\n",
- "First, we will see how the data owner can upload a dataset to BastionLab and how a data scientist can then connect to BastionLab and **clean the dataset**. Then we'll go on analysing it - showing it is possible to do normal data science work without accessing the data in clear.\n",
+ "First, we will see how the data owner can upload a dataset to BastionLab and how a data scientist can then connect to BastionLab and **clean the dataset**. Then we'll go on to **analyse it**, showing that it is possible to do classic data science work without accessing the data in the clear.\n",
"\n",
"But before we can do that, let's get everything set up!\n",
"\n",
@@ -33,9 +33,9 @@
"- Ensure we have [Python3.7](https://www.python.org/downloads/) (or greater) and [Python Pip](https://pypi.org/project/pip/) installed.\n",
"- Install [BastionLab](https://bastionlab.readthedocs.io/en/latest/docs/getting-started/installation/) and the [BastionLab server](https://pypi.org/project/bastionlab-server/0.3.7/) pip packages. \n",
"\n",
- "
\n",
"\n",
"- [Download the dataset](https://drive.google.com/file/d/1NPQoKKG3CdvXTNkHVNYhRQZ8GGiPNlvI/view?usp=share_link) we will be using in this notebook.\n",
@@ -244,13 +244,13 @@
"source": [
"`send_df()` will return a FetchableLazyFrame instance, which we will work with directly from now on. \n",
"\n",
- "
\n",
- "
Note: Frames in BastionLab
\n",
- "
We talk about two types of LazyFrames in BastionLab: `RemoteLazyFrames` and `FetchableLazyFrames`. \n",
- "\n",
- "In BastionLab, when we run a query, it is not immediately executed. Like with Polar's LazyFrames, pending queries are only executed when we call collect. FetchableLazyFrames are BastionLab's remote lazy frames when there are no pending queries to run, either because we have just uploaded or got the dataframe using get_df or because we have already ran collect after our latest query. To display these lazy frames we call the fetch method, which will verify that the data frame is safe to display, i.e. is it the result of a safe aggregated query as specified in the privacy policy.\n",
- "\n",
- "A RemoteLazyFrame is just a FetchableLazyFrame with pending queries still to run (as they have not yet been collected). When we call collect() these operations are run server-side and the result of this is our FetchableLazyFrame.
\n",
+ ">
\n",
+ ">
Note: Frames in BastionLab
\n",
+ ">
We talk about two types of LazyFrames in BastionLab: `RemoteLazyFrames` and `FetchableLazyFrames`. \n",
+ ">\n",
+ ">In BastionLab, when we run a query, it is not immediately executed. As with Polars' LazyFrames, pending queries are only executed when we call collect. FetchableLazyFrames are BastionLab's remote lazy frames when there are no pending queries to run, either because we have just uploaded or got the dataframe using get_df or because we have already run collect after our latest query. To display these lazy frames we call the fetch method, which will verify that the data frame is safe to display, i.e. that it is the result of a safe aggregated query as specified in the privacy policy.\n",
+ ">\n",
+ ">A RemoteLazyFrame is just a FetchableLazyFrame with pending queries still to run (as they have not yet been collected). When we call collect() these operations are run server-side and the result of this is our FetchableLazyFrame.
\n",
"
\n"
]
},
@@ -298,7 +298,7 @@
"\n",
"We cannot view the output of the query because it does not aggregate at least 10 rows of data as specified in our privacy policy. It tries to print out individual rows instead!\n",
"\n",
- "All is working, so now that the dataset has been uploaded, it's time for our data scientists to start their exploration... \n",
+ "All is working, so now that the dataset has been uploaded, it's time for our data scientists to start their exploration!\n",
"\n",
"The data owner can now connection their connection to the server."
]
@@ -719,7 +719,7 @@
"\n",
"We want to group together data in another three other columns using Polars `.then().when().otherwise()` methods to replace values meeting certain criteria with a new value.\n",
"\n",
- "The first two are **`A1Cresult`**, which contains patients' **HbA1c** level. We want to group these into `very high`, `high`and `normal` groups based on levels defined in our project brief.\n",
+ "The first two are **`A1Cresult`**, which contains patients' **HbA1c** level. We want to group these into `very high`, `high` and `normal` groups based on levels defined in our project brief.\n",
"\n",
"These columns are both currently string columns, so we will also need to convert them to float values in order to perform numerical comparisons on them."
]
@@ -1147,7 +1147,7 @@
"source": [
"In terms of the number of readmissions, we see a clear trend: the number of patients readmitted to the hospital increased with age, before dropping down in the 80-90 and 90-100 age groups. This could be due to increased mortality in these age ranges.\n",
"\n",
- "However, this trend is not representative of an increased risk of readmission, but rather it is relative to the number of patients in each group who took part in this study. If we look at the overall age distribution of the patients in the study, we see that it matches the trends seen in our previous bar plot."
+ "However, this trend is not representative of an increased risk of readmission; it reflects the number of patients in each group who took part in this study. If we look at the overall age distribution of the patients in the study, we see that it matches the trends seen in our previous bar plot."
]
},
{
@@ -2065,7 +2065,7 @@
"\n",
"- We calculate the percentage of those patients who were readmitted and give the result a column name `overall readmitted`.\n",
"\n",
- "- We add this row to the table using `vstack`. If the table doesn't yet exist, our query result becomes the table, and we'll add to it!\n",
+ "- We add this row to the table using `vstack`. If the table doesn't yet exist, our query result becomes the table, and we'll add to it.\n",
"\n",
"- We use `collect().fetch()` to get the output as a Polars dataframe that we can display."
]
@@ -2111,9 +2111,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "This table current contains the percentage of readmissions for each medication we iterated over, but it does not include a column with the medication names. \n",
+ "The current table contains the percentage of readmissions for each medication we iterated over, but it does not include a column with the medication names. \n",
"\n",
- "We will add this by converting out list of medications into a Polars Series with the column name `medication` and adding it to our table using the `with_columns` method.\n",
+ "We will add this by converting our list of medications into a `Polars Series` with the column name `medication` and adding it to our table using the `with_columns` method.\n",
"\n",
"Our table will have the `overall readmitted %` column before the `medication` one. To swap the order so `medication` goes first, we'll use `select` and select the columns in the order we want. We'll then sort the table by lowest to highest `overall readmitted %` value."
]