IBM PROJECT-HEALTHCARE DISEASE PREDICTION MODEL
AIM: TO DEVELOP A HEALTHCARE DISEASE AND SPECIALIST PREDICTION MODEL
INFORMATION ABOUT DATASET: This dataset consists of various major diseases and their symptoms.
1-PANDAS- It is a library written for the Python programming language for data manipulation and analysis.
2-MATPLOTLIB-Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
3-TKINTER-For User Interface
4-SCIKIT-LEARN-It is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
5-NUMPY-For numerical array operations used when cleaning and analyzing data.
Decision trees are among the most widely used machine learning algorithms, and they are used for both classification and regression. As the name suggests, a decision tree works on a set of decisions derived from the data and its behavior.
The decision trees use the CART (Classification and Regression Trees) algorithm. In both classification and regression, decisions are based on conditions on the features. The internal nodes represent the conditions and the leaf nodes represent the decision reached by following those conditions.
A decision tree is a graphical representation of all possible solutions to a decision based on certain conditions. At each node of a classification tree, we try to form a condition on the features that separates the labels or classes in the dataset as purely as possible.
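The idea of choosing the condition that gives the purest separation can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes binary (0/1) symptom features and uses Gini impurity as the purity measure (the measure commonly paired with CART). The feature columns, labels, and function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after splitting on one binary (0/1) feature."""
    left = [y for x, y in zip(rows, labels) if x[feature] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature] == 1]
    total = len(labels)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)

def best_feature(rows, labels):
    """Index of the feature whose split leaves the least impurity."""
    n_features = len(rows[0])
    return min(range(n_features), key=lambda f: split_impurity(rows, labels, f))

# Hypothetical toy data: columns are [fever, cough, rash].
rows = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
labels = ["flu", "flu", "measles", "measles"]

chosen = best_feature(rows, labels)
```

In this toy data, splitting on fever separates the two diseases completely, while splitting on cough leaves both branches mixed, so the first feature is chosen for the root node.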
Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model’s prediction.
The fundamental concept behind random forest is simple but powerful: the wisdom of crowds. In data science terms, the reason that the random forest model works so well is:
A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction.
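The voting step and the committee effect can be illustrated with a small sketch. It rests on idealized assumptions that are not from the project: two classes, fully independent trees, and the same accuracy for every tree (real trees are only partly uncorrelated). The predictions and accuracy figures are hypothetical.

```python
from collections import Counter
from math import comb

def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

def majority_accuracy(n_trees, p_correct):
    """Probability that more than half of n independent trees are right.

    Assumes a two-class problem and an odd number of trees.
    """
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Hypothetical votes from five trees for one patient.
tree_predictions = ["flu", "malaria", "flu", "flu", "malaria"]
forest_prediction = majority_vote(tree_predictions)

# Each tree is right 60% of the time; compare committees of different sizes.
single = majority_accuracy(1, 0.6)
small_forest = majority_accuracy(3, 0.6)
large_forest = majority_accuracy(101, 0.6)
```

Under these assumptions the committee's accuracy rises with the number of trees, which is the effect the paragraph above describes.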
A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The classifier is based on Bayes' theorem.
Using Bayes' theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent. That is, the presence of one particular feature does not affect the others. Hence it is called naive.
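For reference, the theorem and the independence assumption can be written as follows. The symbols y (a class) and x_1, ..., x_n (the feature values) are introduced here only for illustration.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The second expression follows from the first by treating the features as conditionally independent given the class and dropping the evidence term, which is the same for every class.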
Naive Bayes is mainly used in text classification, which involves high-dimensional training datasets. It is a probabilistic classifier, which means it predicts on the basis of the probability that an object belongs to each class.
Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Working of the Naïve Bayes classifier:
1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
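The three steps above can be sketched in plain Python. This is an illustrative sketch rather than the project's implementation: it assumes binary (0/1) symptom features and adds Laplace smoothing (the `alpha` parameter) so that an unseen feature value does not force a probability of zero. The data and function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_frequency_tables(rows, labels):
    """Step 1: count classes and, per class, each feature value."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def posterior(row, class_counts, feature_counts, alpha=1.0):
    """Steps 2 and 3: likelihoods from the tables, then Bayes' theorem."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Smoothed likelihood P(value | class); 2 = values of a binary feature.
            seen = feature_counts[label][i][value]
            score *= (seen + alpha) / (count + 2 * alpha)
        scores[label] = score
    evidence = sum(scores.values())  # normalizing constant P(row)
    return {label: s / evidence for label, s in scores.items()}

# Hypothetical toy data: columns are [fever, rash].
rows = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
labels = ["flu", "flu", "flu", "measles", "measles", "measles"]

class_counts, feature_counts = build_frequency_tables(rows, labels)
result = posterior([1, 0], class_counts, feature_counts)
predicted = max(result, key=result.get)
```

For a patient with fever and no rash, the posterior favours "flu" in this toy data, and the probabilities over all classes sum to one.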
Python implementation of the Naïve Bayes algorithm:
Steps to implement:
1. Data pre-processing
2. Fitting Naive Bayes to the training set
3. Predicting the test result
4. Testing the accuracy of the result (creation of the confusion matrix)
5. Visualizing the test set result
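Step 4 can be illustrated independently of the classifier. The sketch below builds a confusion matrix and an accuracy score by hand from hypothetical true and predicted labels; scikit-learn offers `sklearn.metrics.confusion_matrix` and `sklearn.metrics.accuracy_score` for the same purpose, and the helper names used here are only illustrative.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical test-set labels and model predictions.
y_true = ["flu", "flu", "malaria", "malaria", "flu"]
y_pred = ["flu", "malaria", "malaria", "malaria", "flu"]

matrix = confusion_counts(y_true, y_pred)
score = accuracy(y_true, y_pred)

# Print the matrix with actual labels as rows and predicted labels as columns.
classes = sorted(set(y_true) | set(y_pred))
for actual in classes:
    print(actual, [matrix[(actual, predicted)] for predicted in classes])
```

The off-diagonal entries show which diseases are confused with each other, which a single accuracy figure hides.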
K-Nearest Neighbour is one of the simplest machine learning algorithms and is based on the supervised learning technique. The K-NN algorithm assumes similarity between the new case and the available cases, and puts the new case into the category it is most similar to. The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
The working of K-NN can be explained by the following algorithm:
1: Select the number K of neighbors.
2: Calculate the Euclidean distance from the new data point to each available data point.
3: Take the K nearest neighbors as per the calculated Euclidean distance.
4: Among these K neighbors, count the number of data points in each category.
5: Assign the new data point to the category with the largest number of neighbors.
6: Model is ready.
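The steps above can be sketched directly in Python. This is a minimal illustration with hypothetical two-dimensional points and category names, not the project's code; it uses `math.dist` (Python 3.8 or later) for the Euclidean distance.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the query to every stored point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]
    # Steps 4 and 5: count the categories and pick the most common one.
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical training data: two well-separated categories.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(train_points, train_labels, (2, 2), k=3)
near_b = knn_predict(train_points, train_labels, (6, 5), k=3)
```

An odd K is a common choice for two categories because it avoids tied votes.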
- To integrate the predicted disease and specialist with a nearby hospital that has the specialist
- Deploying the model on a cloud platform
- Provision for entering more symptoms