Matrix Testing Different Python Versions for Polars Descriptive Statistics Script and Jupyter Notebook
This repo contains the Python script and notebook in the `codes/project_codes` folder, which perform the following general operations for the selected column of the dataset:
- calculate the descriptive statistics (mean, median and standard deviation)
- generate a histogram of the selected column
Matrix testing is done to ensure that the code and repository function as expected in different versions of Python.
The code reads the data from the CSV file and stores it as a Polars DataFrame for the analysis.
Note: The script returns the descriptive statistics as a list and also saves the graph and the statistics to files, while the notebook displays them within the notebook.
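For reference, a minimal sketch of how this behaviour could be implemented with Polars is shown below; the function name, column handling and output paths here are illustrative assumptions, not the exact code in `main_script.py`:

```python
# Illustrative sketch only - reads a CSV into a Polars DataFrame, computes the
# descriptive statistics of one column and saves a histogram plus a summary file.
import polars as pl
import matplotlib.pyplot as plt

def descriptive_stats_sketch(fname, col_name):
    df = pl.read_csv(fname)                     # data is stored as a Polars DataFrame
    series = df[col_name]
    stats = [series.mean(), series.median(), series.std()]

    plt.hist(series.to_list())                  # histogram of the selected column
    plt.savefig("outputs/output.png")

    with open("outputs/summary.md", "w") as f:  # statistics written to a markdown file
        f.write(f"mean: {stats[0]}, median: {stats[1]}, std dev: {stats[2]}\n")

    return stats                                # [mean, median, standard deviation]
```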
This repo has been created using the IDS-706_rg361_ind-proj-1 template which was created as Individual Project-1.
Date Created: 2023-09-24
This repo uses GitHub matrix testing to automatically test the code against the following versions of Python (an illustrative workflow sketch follows the list):
- 3.7
- 3.8
- 3.9
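For illustration, a matrix strategy covering these versions typically looks like the sketch below; this is a generic example, not the exact contents of this repository's workflow files:

```yaml
# Generic GitHub Actions matrix sketch (job and step details are assumptions)
jobs:
  matrix-test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.7", "3.8", "3.9"]
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -r requirements.txt
      - run: pytest
```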
If the code works with all the specified versions, the following will be visible in the Matrix Testing workflow in GitHub Actions:
Create a Codespace on main, which will initialize the environment with the required packages and settings to execute the code.
The main project files are present in the `codes/project_codes` folder.
The `descriptive_stats` function in `main_script.py` returns a list which contains the [mean, median, standard deviation] of the selected column in the data. The code also writes these results to a `summary.md` file and stores the histogram as an image named `output.png`.
Note: The code saves `summary.md` and `output.png` in the `outputs` folder by default; please change this file path within `main_script.py` if required.
The function takes in the following 2 parameters:
- fname (required) - path or link to the CSV file with the desired data
- col (optional) - column number for which the statistics need to be calculated. If no input is given, the last column in the data is considered for analysis
Notes
- Count the column numbers starting at 1
- The code assumes that the data has a header row, which is the default behaviour of the `read_csv` function from Polars, which is used to read the data and create a DataFrame
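As a hedged example of how the function described above might be called (the import path and CSV file name are illustrative assumptions, not paths from this repository):

```python
from main_script import descriptive_stats  # assuming the script is importable as a module

stats = descriptive_stats("resources/sample_data.csv", 4)    # analyse column 4 (1-indexed)
stats_last = descriptive_stats("resources/sample_data.csv")  # defaults to the last column
print(stats)  # [mean, median, standard deviation]
```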
The Jupyter notebook performs the same operations as `main_script.py`; however, it does not save the graph or the summary file in a folder, it displays them inline in the notebook.
The notebook can be executed in the virtual environment, and the values for the dataset and columns can be modified as required.
All the Python scripts and notebooks are present in the `codes` folder and organised into the following 2 directories:
This folder contains the main scripts and notebooks which perform the functions in the repository:
- `main_script.py`: contains the `descriptive_stats` function which returns the descriptive statistics and writes the `summary.md` and `output.png` files as explained earlier
- `main_notebook.ipynb`: Jupyter notebook which computes the descriptive statistics, plots the graph and displays them within the notebook
- `lib.py`: contains the functions which are used in the main script and notebook files (a rough sketch of these helpers follows the list):
  - `select_col`: returns the column name for the selected column number (or the last column) after verifying that it is a numeric column, else returns an error code
  - `summary_stats`: returns the [mean, median, standard deviation] of the DataFrame as a list
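A possible shape for these two helpers is sketched below; the exact signatures, numeric-type check and error codes in `lib.py` may differ:

```python
# Illustrative sketch of the lib.py helpers (assumes a recent Polars release
# where DataType.is_numeric() is available).
import polars as pl

def select_col(df, col_num=None):
    """Return the name of the requested column (1-indexed), or of the last column by default."""
    if col_num is None:
        name = df.columns[-1]
    elif 1 <= col_num <= len(df.columns):
        name = df.columns[col_num - 1]
    else:
        return "error"                   # invalid column number
    if not df[name].dtype.is_numeric():
        return "error"                   # non-numeric column selected
    return name

def summary_stats(df, name):
    """Return [mean, median, standard deviation] of the selected column as a list."""
    s = df[name]
    return [s.mean(), s.median(), s.std()]
```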
This folder contains the following test files, which are used by the GitHub Actions to verify the scripts in the `project_codes` folder (a rough illustration of such tests follows the list):
- `test_lib`: tests the correct functioning of the `lib.py` file in the `project_codes` folder
- `test_script`: tests the correct functioning of the `main_script.py` file in the `project_codes` folder
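As a hedged illustration, the tests are roughly of the following form; the data values and asserted numbers are made up for this sketch and use the helper signatures assumed earlier, not the repository's actual dataset or test code:

```python
# Illustrative pytest sketch - not the repository's actual test code.
import polars as pl
from lib import select_col, summary_stats

def test_summary_stats_returns_mean_median_std():
    df = pl.DataFrame({"a": [1, 2, 3], "b": [2.0, 4.0, 6.0]})
    assert summary_stats(df, "b") == [4.0, 4.0, 2.0]  # mean, median, sample std

def test_select_col_defaults_to_last_column():
    df = pl.DataFrame({"a": [1, 2, 3], "b": [2.0, 4.0, 6.0]})
    assert select_col(df) == "b"

def test_invalid_column_returns_error_code():
    df = pl.DataFrame({"a": [1, 2, 3]})
    assert select_col(df, 99) == "error"
```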
contains the information about the repository and instructions for using it
contains the list of packages and libraries which are required for running the project. These are installed and used in the virtual environment and GitHub Actions.
Different GitHub Actions are used to automate the following 4 actions whenever a change is made to the files in the repository:
- `install.yml`: installs the packages and libraries mentioned in the `requirements.txt`
- `test.yml`: uses `pytest` to test the Python script and Jupyter notebook (also uses `nbval`) using the test_* files in the `codes/test_codes` folder.
  - Note: this action also has the trigger for automatically generating the `output.png` and `summary.md` files whenever any changes are made in the repository
- `format.yml`: uses `black` to format the Python script and Jupyter notebooks (also uses `nbqa`)
- `lint.yml`: uses `ruff` to lint the Python script and Jupyter notebooks (also uses `nbqa`)
Note: if all the processes run successfully, the following output will be visible in GitHub Actions:
contains the instructions and sequences for the processes used in GitHub Actions and in the .devcontainer for creating the virtual environment
contains the `dockerfile` and `devcontainer.json` files which are used to build and define the settings of the virtual environment (Codespaces - Python) for running the code.
contains the `summary.md` and `output.png` files generated by the `main_script.py` file; these are automatically updated by GitHub Actions whenever any changes are made in the repository
contains the test dataset and other files which are used in the README
A sample dataset of blood pressure from GitHub has been loaded into the `resources` folder and is used for testing the code.
The following test cases are run to check the proper functioning of the code and lib files:
- We specify the column number (in this test, column 4 is passed as an argument to the function)
- We do not specify a column number (in this test, no argument is passed to the function)
- An invalid column number is given; the lib and main files return error codes
- A non-numeric column is selected; the lib and main files return error codes
The code runs as expected and passes the test cases:
The following output files are automatically generated by GitHub Actions and stored in the `outputs` folder whenever there is a change in the directory:
Visualization:
Sample of the `summary.md` file; the actual file can be accessed at this Link:
Note: Only the last graph and summary are stored, since the test file calls the function multiple times and the function clears the previous output before saving a new one.