Lung Cancer Prediction and Data Visualization


Sections

1. Data Meaning
2. Importing and Initial Analysis
3. Visualizing the Data
4. Model Building

    4.1 Training-Testing Split
    4.2 Defining a Confusion Matrix Function
    4.3 Random Forest Classifier

by Yousef Qasim


In [11]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from scipy import stats
from scipy.stats import norm, skew

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

%config InlineBackend.figure_format = 'retina'

# Model Accuracies
ml_accuracies = dict()
In [12]:
# Colors
colors = ['lightcoral',
          'brown',
          'lightseagreen',
          'maroon',
          'deeppink',
          'darkorange',
          'royalblue',
          'darkviolet',
          'gold',
          'crimson',
          'lightsteelblue',
          'salmon',
          'mediumseagreen',
          'olivedrab',
          'blue',
          'limegreen',
          'slateblue',
          'red',
          'steelblue',
          'teal',
          'peru',
          'dimgray',
          'violet',
          'cyan']

1 | Data Meaning

  • The dataset contains information about patients with lung cancer, including their age, gender, and lifestyle factors. The table below explains the columns we are dealing with.
Column                      Meaning
------                      -------
Age                         The age of the patient. (Numeric)
Gender                      The gender of the patient. (Categorical)
Air Pollution               The level of air pollution exposure of the patient. (Categorical)
Alcohol use                 The level of alcohol use of the patient. (Categorical)
Dust Allergy                The level of dust allergy of the patient. (Categorical)
OccuPational Hazards        The level of occupational hazards of the patient. (Categorical)
Genetic Risk                The level of genetic risk of the patient. (Categorical)
chronic Lung Disease        The level of chronic lung disease of the patient. (Categorical)
Balanced Diet               The level of balanced diet of the patient. (Categorical)
Obesity                     The level of obesity of the patient. (Categorical)
Smoking                     The level of smoking of the patient. (Categorical)
Passive Smoker              The level of passive smoking of the patient. (Categorical)
Chest Pain                  The level of chest pain of the patient. (Categorical)
Coughing of Blood           The level of coughing of blood of the patient. (Categorical)
Fatigue                     The level of fatigue of the patient. (Categorical)
Weight Loss                 The level of weight loss of the patient. (Categorical)
Shortness of Breath         The level of shortness of breath of the patient. (Categorical)
Wheezing                    The level of wheezing of the patient. (Categorical)
Swallowing Difficulty       The level of swallowing difficulty of the patient. (Categorical)
Clubbing of Finger Nails    The level of clubbing of finger nails of the patient. (Categorical)
Frequent Cold               The level of frequent colds of the patient. (Categorical)
Dry Cough                   The level of dry cough of the patient. (Categorical)
Snoring                     The level of snoring of the patient. (Categorical)
  • The cancer severity column, level, is divided into three classes: Low, Medium, and High. After exploring the data, this notebook aims to develop a predictive method for the severity level. Analysing this data should also give insight into what contributes to lung cancer and how best to prevent it.

2 | Importing and Initial Analysis

In [13]:
df = pd.read_csv("/content/cancer patient data sets.csv", index_col='index')

# Drop the Patient Id column: the index already identifies each patient,
# and an arbitrary ID string adds no predictive information.
df.drop("Patient Id", axis=1, inplace=True)

# Clean the column names so we can work with them programmatically.
df.rename(columns=str.lower, inplace=True)
df.rename(columns={col: col.replace(" ", "_") for col in df.columns}, inplace=True)

# View the first 5 rows of the database.
df.head(5)
Out[13]:
age gender air_pollution alcohol_use dust_allergy occupational_hazards genetic_risk chronic_lung_disease balanced_diet obesity ... fatigue weight_loss shortness_of_breath wheezing swallowing_difficulty clubbing_of_finger_nails frequent_cold dry_cough snoring level
index
0 33 1 2 4 5 4 3 2 2 4 ... 3 4 2 2 3 1 2 3 4 Low
1 17 1 3 1 5 3 4 2 2 2 ... 1 3 7 8 6 2 1 7 2 Medium
2 35 1 4 5 6 5 5 4 6 7 ... 8 7 9 2 1 4 6 7 2 High
3 37 1 7 7 7 7 6 7 7 7 ... 4 2 3 1 4 5 6 7 5 High
4 46 1 6 8 7 7 7 6 7 7 ... 3 2 4 1 4 2 4 2 3 High

5 rows × 24 columns

In [14]:
# To understand the data further, we need to identify the data type and view if there's any corrupt/missing data fields.
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   age                       1000 non-null   int64 
 1   gender                    1000 non-null   int64 
 2   air_pollution             1000 non-null   int64 
 3   alcohol_use               1000 non-null   int64 
 4   dust_allergy              1000 non-null   int64 
 5   occupational_hazards      1000 non-null   int64 
 6   genetic_risk              1000 non-null   int64 
 7   chronic_lung_disease      1000 non-null   int64 
 8   balanced_diet             1000 non-null   int64 
 9   obesity                   1000 non-null   int64 
 10  smoking                   1000 non-null   int64 
 11  passive_smoker            1000 non-null   int64 
 12  chest_pain                1000 non-null   int64 
 13  coughing_of_blood         1000 non-null   int64 
 14  fatigue                   1000 non-null   int64 
 15  weight_loss               1000 non-null   int64 
 16  shortness_of_breath       1000 non-null   int64 
 17  wheezing                  1000 non-null   int64 
 18  swallowing_difficulty     1000 non-null   int64 
 19  clubbing_of_finger_nails  1000 non-null   int64 
 20  frequent_cold             1000 non-null   int64 
 21  dry_cough                 1000 non-null   int64 
 22  snoring                   1000 non-null   int64 
 23  level                     1000 non-null   object
dtypes: int64(23), object(1)
memory usage: 195.3+ KB
None
In [15]:
# What levels did the researchers categorize in this dataset?
print('Cancer Levels: ', df['level'].unique())

# Encode the target as ordered integers so the model can consume it.
mapping = {'High': 2, 'Medium': 1, 'Low': 0}
df["level"] = df["level"].replace(mapping)
print('Cancer Levels: ', df['level'].unique())
Cancer Levels:  ['Low' 'Medium' 'High']
Cancer Levels:  [0 1 2]
In [16]:
# Apply a horizontal background gradient to the transposed descriptive statistics,
# rounded to three decimal places, so each feature's distribution and variability
# can be read at a glance.
round(df.describe().iloc[1:, ].T, 3).style.background_gradient(axis=1)
Out[16]:
  mean std min 25% 50% 75% max
age 37.174000 12.005000 14.000000 27.750000 36.000000 45.000000 73.000000
gender 1.402000 0.491000 1.000000 1.000000 1.000000 2.000000 2.000000
air_pollution 3.840000 2.030000 1.000000 2.000000 3.000000 6.000000 8.000000
alcohol_use 4.563000 2.620000 1.000000 2.000000 5.000000 7.000000 8.000000
dust_allergy 5.165000 1.981000 1.000000 4.000000 6.000000 7.000000 8.000000
occupational_hazards 4.840000 2.108000 1.000000 3.000000 5.000000 7.000000 8.000000
genetic_risk 4.580000 2.127000 1.000000 2.000000 5.000000 7.000000 7.000000
chronic_lung_disease 4.380000 1.849000 1.000000 3.000000 4.000000 6.000000 7.000000
balanced_diet 4.491000 2.136000 1.000000 2.000000 4.000000 7.000000 7.000000
obesity 4.465000 2.125000 1.000000 3.000000 4.000000 7.000000 7.000000
smoking 3.948000 2.496000 1.000000 2.000000 3.000000 7.000000 8.000000
passive_smoker 4.195000 2.312000 1.000000 2.000000 4.000000 7.000000 8.000000
chest_pain 4.438000 2.280000 1.000000 2.000000 4.000000 7.000000 9.000000
coughing_of_blood 4.859000 2.428000 1.000000 3.000000 4.000000 7.000000 9.000000
fatigue 3.856000 2.245000 1.000000 2.000000 3.000000 5.000000 9.000000
weight_loss 3.855000 2.207000 1.000000 2.000000 3.000000 6.000000 8.000000
shortness_of_breath 4.240000 2.285000 1.000000 2.000000 4.000000 6.000000 9.000000
wheezing 3.777000 2.042000 1.000000 2.000000 4.000000 5.000000 8.000000
swallowing_difficulty 3.746000 2.270000 1.000000 2.000000 4.000000 5.000000 8.000000
clubbing_of_finger_nails 3.923000 2.388000 1.000000 2.000000 4.000000 5.000000 9.000000
frequent_cold 3.536000 1.833000 1.000000 2.000000 3.000000 5.000000 7.000000
dry_cough 3.853000 2.039000 1.000000 2.000000 4.000000 6.000000 7.000000
snoring 2.926000 1.475000 1.000000 2.000000 3.000000 4.000000 7.000000
level 1.062000 0.815000 0.000000 0.000000 1.000000 2.000000 2.000000

2.1 | Splitting the Data into Input and Output

In [17]:
# As we are using a supervised machine learning method, the data must be identified correctly.
X = df.drop(columns='level')
y = df.level

display(X.head(), y[:5])
age gender air_pollution alcohol_use dust_allergy occupational_hazards genetic_risk chronic_lung_disease balanced_diet obesity ... coughing_of_blood fatigue weight_loss shortness_of_breath wheezing swallowing_difficulty clubbing_of_finger_nails frequent_cold dry_cough snoring
index
0 33 1 2 4 5 4 3 2 2 4 ... 4 3 4 2 2 3 1 2 3 4
1 17 1 3 1 5 3 4 2 2 2 ... 3 1 3 7 8 6 2 1 7 2
2 35 1 4 5 6 5 5 4 6 7 ... 8 8 7 9 2 1 4 6 7 2
3 37 1 7 7 7 7 6 7 7 7 ... 8 4 2 3 1 4 5 6 7 5
4 46 1 6 8 7 7 7 6 7 7 ... 9 3 2 4 1 4 2 4 2 3

5 rows × 23 columns

index
0    0
1    1
2    2
3    2
4    2
Name: level, dtype: int64

3 | Visualizing the Data

One of the most important considerations when building a model is keeping the input data balanced across output classes, so that the model does not acquire a bias towards whichever class has more samples. For example, if we trained only on patients with High-level cancer, the model could wrongly conclude that everyone who smokes develops High-level cancer.

To identify whether any class needs to be oversampled, we visualize how the samples are distributed across the three levels.
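A quick numeric check of the class proportions complements the pie chart below (a minimal sketch, reusing the already-encoded df):

In [ ]:
# Share of samples in each severity class; values near one third mean the
# dataset is roughly balanced and no oversampling is needed.
print(df['level'].value_counts(normalize=True).round(3))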

In [20]:
plt.figure(figsize=(10, 8))
plt.title('Training Data Output Labels', fontsize=20)
counts = df.level.value_counts()
inv_mapping = {v: k for k, v in mapping.items()}

plt.pie(counts,
    labels=[inv_mapping[lvl] for lvl in counts.index],  # keep labels aligned with counts
    colors=['#97C1A9', '#FFB8B1', '#55CBCD'],
    autopct=lambda p: '{:.2f}%\n{:,.0f}'.format(p, p * counts.sum() / 100),
    explode=tuple(0.01 for i in range(3)),
    textprops={'fontsize': 20}
)
plt.show()

A common visualization here is the seaborn heatmap, which lets us inspect the correlation matrix of many variables at once in a single graph.

The heatmap surfaces several useful observations, which are insightful but warrant further exploration.

In [ ]:
plt.figure(figsize=(20,15))
sns.heatmap(df.corr(), annot=True, cmap=plt.cm.PuBu)
plt.show()

We can see that:

Diagonal: The diagonal line shows a perfect positive correlation (1.0) since it represents each variable's correlation with itself.

Age: Age appears to have a moderate positive correlation with factors like 'air_pollution', 'alcohol_use', and 'chronic_lung_disease', suggesting that as age increases, so do these factors.

Gender: Gender has some level of negative correlation with 'dust_allergy' and 'balanced_diet' and a mild positive correlation with 'genetic_risk'.

Lifestyle Factors: 'Smoking' and 'passive_smoker' are moderately positively correlated, which is expected as they are related behaviors. 'Smoking' also has a mild positive correlation with 'alcohol_use'.

Symptoms and Conditions: 'Chest_pain', 'coughing_of_blood', 'fatigue', and 'weight_loss' are all strongly correlated with each other, which might indicate they often occur together in patients.

Negative Correlations: There are several negative correlations present, such as between 'gender' and 'dust_allergy', though these are relatively weak.
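To back these visual readings with numbers, we can sort the correlations against the target directly (a minimal sketch, assuming the encoded df from above):

In [ ]:
# Features ranked by correlation with the encoded severity level.
level_corr = df.corr()['level'].drop('level').sort_values(ascending=False)
print(level_corr.round(3))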

Violin Plots (Another Form of Visualization)

Violin plots show the range, distribution, quartiles, and median of the data in a single, highly readable plot.

In [35]:
fig, ax = plt.subplots(ncols=4, nrows=6, figsize=(20, 20))
ax = ax.flatten()

for i, col in enumerate(df.columns):
    # Map the encoded levels back to their names for readable x-axis labels.
    sns.violinplot(x=df['level'].replace(dict(zip(mapping.values(), mapping.keys()))),
                   y=col, data=df, palette='tab20c', ax=ax[i])
    ax[i].set_title(col.title())

plt.tight_layout(pad=0.2, w_pad=0.2, h_pad=2.5)

Next are distribution plots (histograms with fitted normal curves), annotated with the mean and standard deviation of each feature.

In [22]:
fig, ax = plt.subplots(ncols=8, nrows=3, figsize=(20, 10))
ax = ax.flatten()
# Fit a normal distribution to each feature and annotate its mean and std.
for i, (k, v) in enumerate(df.items()):
    mu, sigma = norm.fit(v)
    sns.histplot(v,
                 color=colors[i],
                 kde=True,
                 bins=25,
                 ax=ax[i],
                 label=f'$\\mu={mu:.1f}$\n$\\sigma={sigma:.1f}$')
    ax[i].set_title(f'{k}')
    ax[i].legend()

plt.tight_layout(pad=0.2, w_pad=0.2, h_pad=2.5)
plt.show()

4 | Model Building

4.1 | Training-Testing Split

To build the model, we will not split the data into three parts with a designated validation set; instead we use a two-way split into training and testing sets. The random forest repeatedly partitions the training data at random, selecting a random subset of features and/or samples each time. On each subset it learns its own logic for classifying the data, producing a weak classifier we call a tree. Building many such trees yields a "Random Forest" classifier.

Some parameters we can control are how deep each tree grows, how wide it branches, and how many trees we build; the sketch below shows where these knobs live in scikit-learn. In this notebook we will only explore the number of trees. The reason we are not performing complex hyperparameter tuning here is that, as some of you might guess, 1000 people are not an accurate representation of a disease that affects millions.
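A minimal sketch of these knobs, with illustrative placeholder values rather than tuned settings:

In [ ]:
from sklearn.ensemble import RandomForestClassifier

# n_estimators: how many trees the forest builds.
# max_depth: how deep each tree may grow.
# max_features: how many features each split may consider.
rf_sketch = RandomForestClassifier(n_estimators=100,
                                   max_depth=5,
                                   max_features='sqrt',
                                   random_state=40)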

The split will be as follows: 70% of the data will be used for training, and the model's accuracy will then be tested on the remaining 30%.

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
print(f'Shapes - X Training: {X_train.shape} and X Testing {X_test.shape}')
print(f'Shapes - Y Training: {y_train.shape} and Y Testing {y_test.shape}')

print(f'\nTraining output counts\n{y_train.value_counts()}')
Shapes - X Training: (700, 23) and X Testing (300, 23)
Shapes - Y Training: (700,) and Y Testing (300,)

Training output counts
2    248
1    233
0    219
Name: level, dtype: int64

4.2 | Defining a Confusion Matrix Function

In [25]:
def CM(y_test, y_pred_test, col_names):

    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as pl
    import numpy as np

    # Form the confusion matrix and draw it as an annotated heatmap.
    cm = confusion_matrix(y_test, y_pred_test)
    pl.figure(figsize=(10,8))
    sns.heatmap(cm, annot=True, annot_kws={'size':15}, fmt=".0f",
                cmap=pl.cm.Blues, linewidths=5)

    # labels in plot
    tick_marks = np.arange(len(col_names))
    pl.xticks(tick_marks + 0.5, col_names)
    pl.yticks(tick_marks + 0.5, col_names, rotation=0)
    pl.xlabel('Predicted label')
    pl.ylabel('True label')
    pl.title('Confusion Matrix for Random Forest Model')
    pl.show()

4.3 | Random Forest Classifier

In [26]:
def random_forest_n_best(X_train, y_train, X_test, y_test, n_list):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    scores = []

    for n in n_list:
        RF = RandomForestClassifier(n_estimators=n, random_state=40)
        RF.fit(X_train, y_train)
        RF_pred = RF.predict(X_test)

        scores.append(accuracy_score(y_test, RF_pred))

    plt.plot(n_list, scores)
    plt.xlabel('Value of n_estimators for Random Forest Classifier')
    plt.ylabel('Testing Accuracy')
    plt.grid(alpha=0.1)
    plt.show()
In [27]:
random_forest_n_best(X_train, y_train, X_test, y_test, n_list=np.arange(1,20,1))
In [28]:
# Define model and set random_state
RF = RandomForestClassifier(n_estimators=3, random_state=40)

# fitting model
RF.fit(X_train, y_train)

# predicting with model
RF_pred = RF.predict(X_test)
pd.Series(RF_pred).value_counts()
Out[28]:
2    117
1     99
0     84
dtype: int64
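Before evaluating the predictions, it is worth a glance at which features the fitted forest leans on most. A quick sketch using the standard feature_importances_ attribute:

In [ ]:
# Impurity-based importance of each feature in the fitted forest.
importances = pd.Series(RF.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10).round(3))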

4.3.1 | Confusion Matrix of the Random Forest Model

In [29]:
CM(y_test, RF_pred, col_names=['Low', 'Medium', 'High'])

4.3.2 | Random Forest Classifier Plot

We will now plot each individual tree in the random forest.

In [30]:
from sklearn import tree

trees = len(RF.estimators_)
cn = ['Low', 'Medium', 'High']

fig, ax = plt.subplots(trees, 1, figsize=(30,10*trees))

for i, estimator in enumerate(RF.estimators_):
    if trees > 1:
        tree.plot_tree(estimator,
                       feature_names=X.columns,
                       class_names=cn,
                       filled=True,
                       fontsize=11,
                       ax=ax[i])
    else:
        tree.plot_tree(estimator,
                       feature_names=X.columns,
                       class_names=cn,
                       filled=True,
                       fontsize=11)

plt.tight_layout(h_pad=-10)
plt.show()

4.3.3 | Classification Report of the Random Forest Model

In [31]:
# View the classification report for test data and predictions
ml_accuracies['Random Forest'] = accuracy_score(y_test, RF_pred)
print(classification_report(y_test, RF_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        84
           1       1.00      1.00      1.00        99
           2       1.00      1.00      1.00       117

    accuracy                           1.00       300
   macro avg       1.00      1.00      1.00       300
weighted avg       1.00      1.00      1.00       300

The model achieved 100% accuracy. This figure should not be taken at face value: the dataset is small and may well be fabricated, since no official source is credited for it on https://kaggle.com. Nevertheless, this notebook demonstrates how a classifier is explored before developing a pipeline for big data.
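One way to sanity-check such a suspiciously perfect score is cross-validation over the full dataset, as in the sketch below; if every fold also scores near 1.0, the problem lies in the data rather than in a lucky split.

In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the same model specification on the full data.
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=3, random_state=40),
                            X, y, cv=5)
print(cv_scores.round(3), '| mean:', round(cv_scores.mean(), 3))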



Thank you and if you found this useful please let me know! 👍🏼