A Project from a Medical Student Who Completed the "No Code/Low Code AI in Healthcare" Course

It gives me great pleasure to share this blog with you. Rishi Gadepally is a medical student who completed our AI/ML course and was able to build his first project right away. It is amazing to me how fast this happened :). Congratulations to Rishi. I am looking forward to seeing more of these projects from all of you.

Getting Started

Using existing machine learning models can be very intimidating for a first-timer, especially if you are not familiar with Python libraries such as pandas and numpy. As much as the models themselves are talked about, the most important learning point for a beginner is becoming familiar with the data and understanding how to process it so that a model can use it to make accurate and reliable predictions. The guide below is intended for learners with no coding background. It is by no means comprehensive, and I encourage anyone following along with their own dataset to look up anything that doesn’t make sense to them. Having some familiarity with Python syntax can be helpful, but I was able to do the following just by using Google to find the code I needed for each step. For the purposes of this post, I will not be delving into how the models themselves work. Rather, this is intended to be an introduction to taking existing data and training a model to create a classifier.

Step 1: Find a dataset you want to use. Public repositories of open and free-to-use datasets are a great resource for beginners who may not have access to large datasets of their own and want to get some hands-on experience. For the sake of this post, I will be discussing how to use data that is in CSV format. If you have your own data in an Excel file, you can export it as a CSV and follow along, or convert it in Python as shown below.
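If you'd rather do that conversion in Python, here is a minimal sketch (the file names are hypothetical placeholders; replace them with your own):
import pandas as pd
# Hypothetical file names; replace with your own Excel file
excel_df = pd.read_excel("my_data.xlsx")
excel_df.to_csv("my_data.csv", index=False)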

Looking at the Data

Step 1: Upload the CSV file
from google.colab import files
uploaded = files.upload()

Step 2: Import the pandas and numpy libraries. These will allow you to view, sort, filter, and manipulate the dataframe (table) that contains the variables of interest.
import numpy as np
import pandas as pd

Step 3: Create a dataframe from the CSV file that can then be viewed and manipulated using pandas
df = pd.read_csv("stroke_data.csv")
df.head()
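As an optional extra check (not part of the original walkthrough), df.shape and df.info() give a quick sense of how many rows and columns the dataframe has and what type each column is:
print(df.shape)   # (number of rows, number of columns)
df.info()         # column names, non-null counts, and data types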

Step 4: Modify the dataframe to make it more amenable to processing by the model. This involves assigning each categorical variable a number rather than a string, which is essentially a sequence of characters strung together. For example, ‘Yes’ is a string as it contains the characters ‘Y’, ‘e’, and ‘s’. As seen below, ‘Yes’ can also be represented as ‘1’ and ‘No’ can be represented as ‘0’.
#Gender
df['gender'] = df['gender'].replace({'Male': 1, 'Female': 0})
#Ever Married
df['ever_married'] = df['ever_married'].replace({'Yes': 1, 'No': 0})
#Work Type
df['work_type'] = df['work_type'].replace({'children': 3, 'Private': 2, 'Self-employed': 1, 'Govt_job': 0})
#Residence Type
df['Residence_type'] = df['Residence_type'].replace({'Urban': 1, 'Rural': 0})
#Smoking Status
df['smoking_status'] = df['smoking_status'].replace({'never smoked': 0, 'formerly smoked': 1, 'smokes': 2})

Step 5: Take a look at your data. ‘df.describe()’ lets you quickly view summary statistics for each numeric column. These include:
‘count’, which gives the total number of entries for that column
‘mean’, ‘std’ (standard deviation), ‘min’ (the minimum value of all entries), ‘25%, 50%, 75%’ (the quartile values across the data), and ‘max’ (the highest value) in the column.
As seen below, this is a quick way to get a broad overview of the data, especially for variables like age. On the other hand, it may not be helpful for columns such as ‘id’, since statistics on patient ID numbers are unlikely to provide any insight into the data.
df.describe()

(Screenshot: output of df.describe() showing the summary statistics for each column.)

Step 6: Check for null values. Null values are essentially ‘missing’ values in your dataset. When a model sees a missing value, it may treat it as its own category because it does not understand the context.
df.isna().any()
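If you want to see how many values are missing in each column, rather than just whether any are missing, an optional follow-up is:
df.isna().sum()   # number of missing values in each column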

Step 7: Explore your data to see where the null values are and drop rows accordingly. Sometimes a single column will be the only one with null values. In that case you can just drop the rows where that column is missing, which leaves a dataframe that contains values for every column.
df = df.dropna(subset=['bmi'])
df.describe()
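As an optional sanity check, you can confirm that the drop worked by re-running the null check and comparing the row count to the original dataframe:
df.isna().any()   # every column should now show False
df.shape          # row count after dropping rows with missing ‘bmi’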

Step 8: Take time to explore the variables in the dataframe. These are the variables that will be fed into the model to train it. If you don’t want the model to ‘learn’ to use a variable to make predictions, it is important to drop it. The example above contains patient ID numbers; in the case of creating a binary classifier, these would not be helpful and may hinder training or lead to inaccurate results. Below is an example of how to drop a column from the dataframe (in this case the patient ID column, ‘id’).
df = df.drop('id', axis=1)
df.head()

One Hot Encoding

While models such as LightGBM (being used here) can accept categorical variables, others like XGBoost accept only numerical features. While it may not be strictly necessary to use one hot encoding in this example, it is worth understanding what it is, why it is important, and how to do it. In the last section, we replaced each categorical variable with numbers. In the ‘smoking_status’ column, ‘never smoked’ was replaced by 0, ‘formerly smoked’ was replaced by 1, and ‘smokes’ was replaced by 2. The problem with leaving things that way is that 0<1<2, but the original categories (never smoked, formerly smoked, and smokes) do not have an order associated with them. This is where one hot encoding becomes a valuable tool. One hot encoding turns each option within a column (in this case never smoked, formerly smoked, and smokes) into its own column. Entries where someone never smoked get a ‘1’ in the ‘never smoked’ column and a ‘0’ everywhere else; entries with ‘formerly smoked’ get a ‘1’ in that column and a ‘0’ everywhere else. One hot encoding allows each possible entry to have its own column and be considered its own variable.
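To make this concrete, here is a tiny toy example (separate from the stroke data) showing what pandas’ get_dummies does to a single categorical column. Depending on your pandas version, the new columns may print as True/False rather than 1/0:
# Toy dataframe with one categorical column
toy = pd.DataFrame({'smoking_status': ['never smoked', 'formerly smoked', 'smokes']})
# Each category becomes its own 0/1 column
pd.get_dummies(toy, columns=['smoking_status'])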

Step 1: Start by looking at every possible type of entry for each categorical column. (Here the original, unmodified CSV has been read into a dataframe named ‘data’, so the entries are still strings rather than the numbers we substituted earlier.)
print(data['gender'].unique())
print(data['work_type'].unique())
print(data['smoking_status'].unique())
print(data['ever_married'].unique())
print(data['Residence_type'].unique())

The code above should yield something that looks like the following:
['Male' 'Female' 'Other']
['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
['formerly smoked' 'never smoked' 'smokes' 'Unknown']
['Yes' 'No']
['Urban' 'Rural']

Step 2: Check the label counts (the number of times each value appears) for the categorical columns. For example, if we look specifically at the ‘Residence_type’ column, we can see how many ‘Urban’ and how many ‘Rural’ entries are in the dataset. Likewise, it is good to get an idea of the counts for every variable that will be used.
print(data['gender'].value_counts())
print(data['work_type'].value_counts())
print(data['smoking_status'].value_counts())
print(data['ever_married'].value_counts())
print(data['Residence_type'].value_counts())

Step 3: Take the original data frame (named ‘data’ in this case) and create a one hot encoded version of it
one_hot_encoded_data = pd.get_dummies(data, columns = ['gender', 'work_type', 'smoking_status', 'ever_married', 'Residence_type'])

Step 4: View the one hot encoded data frame to ensure that it looks the way it should
one_hot_encoded_data.head()
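An optional extra check is to list the new column names and the dataframe’s shape to confirm that each category became its own column:
print(one_hot_encoded_data.columns.tolist())
print(one_hot_encoded_data.shape)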

Step 5: Create a CSV file with the one hot encoded data that can be used to train the model
from google.colab import files
one_hot_encoded_data.to_csv('One_Hot_Stroke_Data.csv')
files.download('One_Hot_Stroke_Data.csv')

Using the Model

Step 1: Upload the CSV file that you will be using
from google.colab import files
uploaded = files.upload()

Step 2: Import the Libraries
import numpy as np
import pandas as pd

Step 3: Create a dataframe (table) from the csv file that can be viewed in the notebook. The df.head() line will pull up the first 5 rows of the dataframe by default
df = pd.read_csv("One_Hot_Stroke_Data.csv")
df.head()

Step 4: Install LightGBM
!pip install lightgbm

Step 5: Separate the data into two pieces: ‘x’ (the input variables) and ‘y’ (the output, stroke vs. no stroke). The model will be trained to predict y based on x.
x = df.drop(['stroke'], axis=1)
y = df['stroke']

Step 6: Display x to make sure that it no longer contains y (the column that indicates stroke vs. no stroke). This is vital, as you do not want the model to have the answer before making the prediction. ‘x.head(10)’ will show the first 10 rows of the new dataframe ‘x’.
x.head(10)

Step 7: Split the data into training and test sets. Here, we are importing the tool ‘train_test_split’ from the library ‘sklearn’. This tool splits up the dataset so that the model has adequate data to train on while still leaving some to test. Here the training set comprises 80% of the data while the test set contains the remaining 20%. The ‘random_state’ parameter sets the seed used to shuffle the data before splitting. It can take on any integer value; by assigning it a specific value such as ‘10’ in this instance, we ensure that the training and test sets are the same across different executions.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 10)
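As an optional check (not in the original walkthrough), calling value_counts() on the labels shows how many stroke and no-stroke examples ended up in each split:
print(y_train.value_counts())
print(y_test.value_counts())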

Step 8: Import the LightGBM model. This creates a classifier that uses LightGBM. The classifier is ‘fit’ to the training data. The test data is then fed to the model, and the resulting predictions are stored in ‘y_pred’.
import lightgbm as lgb
classifier = lgb.LGBMClassifier()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
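If you want to eyeball a few predictions before computing any metrics, one optional way is to line them up against the true labels:
# Compare the first 10 predictions with the actual test labels
comparison = pd.DataFrame({'actual': y_test.values[:10], 'predicted': y_pred[:10]})
comparison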

Assessing the Results

Step 9: Assess the model’s performance. ‘accuracy_score’ compares y_pred (the results of the model’s predictions) with y_test (the actual answers). Remember, this accuracy score is for the test set.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print('LightGBM accuracy score: {0:0.4f}'.format(accuracy))

Step 10: Compare model performance between training and test set. This is important as a model that does significantly better on a training set compared to the test set has likely ‘overfit’. This is a commonly used term in machine learning to describe a model that has learned the features of a training set so well that it cannot generalize to a test set. Think of this as the equivalent of memorizing practice questions and answers without understanding anything. If you see any new questions, you will have trouble answering them.
#Training Set Score
y_pred_train = classifier.predict(x_train)
print('LightGBM Model training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))
#Testing Set Score
print('LightGBM Model testing-set accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
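Accuracy alone can be misleading when one class (no stroke) is much more common than the other, so an optional extra step, not covered in the original walkthrough, is to look at a confusion matrix and per-class metrics:
from sklearn.metrics import confusion_matrix, classification_report
# Rows are the actual classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))
# Precision, recall, and F1 score for each class
print(classification_report(y_test, y_pred))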

Understanding How the Model Made Predictions

Step 11: Install shap, a tool that is used to better understand what factors the model focused on when making predictions.
!pip install shap

Step 12: Import shap, create an explainer for the trained classifier, and compute the SHAP values for the data.
import shap
explainer = shap.Explainer(classifier)
shap_values = explainer.shap_values(x)

Step 13: Create a summary plot to easily visualize which variables played the biggest role in the model’s predictions.
shap.summary_plot(shap_values, x)
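If the default summary plot is hard to read, one optional variant is the bar version, which simply ranks the variables by their average impact on the model’s output:
shap.summary_plot(shap_values, x, plot_type="bar")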

Congratulations, you just learned how to download a dataset and use pandas to view, analyze, clean, and format your data to feed into a machine learning model! This guide was by no means comprehensive or expert-level but I hope it gives you the confidence and motivation to dive in and learn more!