Research question: Can measurements of a breast tumor predict whether it is malignant?

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score 
from io import StringIO  # sklearn.externals.six has been removed from scikit-learn
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
import matplotlib.pyplot as plt 
In [2]:
# header=None: the file has no header row, so columns are labeled 0..31
cancer_data = pd.read_csv('breast_cancer.csv', header=None)
print(cancer_data.shape)
(569, 32)

In this dataset, M = malignant (a harmful tumor) and B = benign (a harmless tumor).

Ten real-valued features are computed for each cell nucleus:

  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension ("coastline approximation" - 1)
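
The file has 32 columns although only ten features are listed. One likely explanation, assuming `breast_cancer.csv` follows the layout of the Wisconsin Diagnostic Breast Cancer (WDBC) data, is that each feature is stored three times (mean, standard error, and worst value), preceded by an ID and the diagnosis. The sketch below builds column names under that assumption; the names and the helper `wdbc_column_names` are illustrative and not part of the file.

```python
# Assumption: columns are ID, diagnosis, then 10 means, 10 standard errors,
# and 10 "worst" values, as in the WDBC layout. Names are illustrative only.
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
STATS = ["mean", "se", "worst"]


def wdbc_column_names():
    """Return 32 hypothetical column names: id, diagnosis, 3 x 10 features."""
    names = ["id", "diagnosis"]
    for stat in STATS:
        for feature in FEATURES:
            names.append(f"{feature}_{stat}")
    return names


column_names = wdbc_column_names()
print(len(column_names))
print(column_names[:4])
```

The notebook keeps the default integer column labels (0 for the ID, 1 for the diagnosis), so these names serve only as a reading aid.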
In [3]:
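# Drop rows with missing values, then drop column 0 (assumed to be the sample ID)
cancer_data = cancer_data.dropna()
del cancer_data[0]
In [4]:
# Features: every remaining column except the diagnosis label in column 1
x = cancer_data.copy()
del x[1]
In [5]:
# Target: the diagnosis label (M or B)
y = cancer_data[1].copy()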
In [24]:
scores = []
values = []
# An integer test_size is an absolute number of test samples, not a percentage,
# so this loop evaluates test sets of 5, 10, ..., 100 samples.
for value in range(5, 101, 5):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=value, random_state=42)
    cancer_tree = DecisionTreeClassifier(max_leaf_nodes=20, random_state=42)
    cancer_tree.fit(x_train, y_train)
    predictions = cancer_tree.predict(x_test)
    scores.append(accuracy_score(y_true=y_test, y_pred=predictions))
    values.append(value)
In [20]:
scores
Out[20]:
[1.0,
 0.9,
 0.9333333333333333,
 0.95,
 0.92,
 0.9333333333333333,
 0.9428571428571428,
 0.925,
 0.9111111111111111,
 0.92,
 0.9090909090909091,
 0.8833333333333333,
 0.8923076923076924,
 0.9,
 0.96,
 0.9125,
 0.9058823529411765,
 0.9111111111111111,
 0.9157894736842105,
 0.92]
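
Because `test_size` was passed as an integer, each score above comes from a test set of 5, 10, ..., 100 samples rather than from a percentage of the data. A quick check, sketched below with the scores copied from `Out[20]`, multiplies each score by its test-set size: every product is a whole number of correct predictions. The helper name `correct_counts` is hypothetical.

```python
# Scores copied from Out[20]; sizes are the test_size values used in the loop.
scores = [
    1.0, 0.9, 0.9333333333333333, 0.95, 0.92,
    0.9333333333333333, 0.9428571428571428, 0.925, 0.9111111111111111, 0.92,
    0.9090909090909091, 0.8833333333333333, 0.8923076923076924, 0.9, 0.96,
    0.9125, 0.9058823529411765, 0.9111111111111111, 0.9157894736842105, 0.92,
]
values = list(range(5, 101, 5))


def correct_counts(scores, sizes):
    """Recover the number of correct predictions behind each accuracy score."""
    return [round(score * size) for score, size in zip(scores, sizes)]


counts = correct_counts(scores, values)
for size, score, hits in zip(values, scores, counts):
    # One extra error changes the accuracy by 1/size, so small test sets are coarse.
    print(f"{size:3d} test samples: {hits:3d} correct, "
          f"accuracy {score:.3f}, step {1 / size:.3f}")
```

With only 5 test samples a single mistake moves the accuracy by 0.2, which is one reason the first few scores jump around more than the later ones.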
In [21]:
# cancer_tree and x_test come from the last loop iteration (100 test samples)
predictions = cancer_tree.predict(x_test)
predictions[:10]
Out[21]:
array(['B', 'M', 'M', 'B', 'B', 'M', 'M', 'M', 'M', 'M'], dtype=object)
In [22]:
y_test[:10]
Out[22]:
204    B
70     M
131    M
431    B
540    B
567    M
369    M
29     M
81     B
477    B
Name: 1, dtype: object
In [23]:
accuracy_score(y_true = y_test, y_pred = predictions)
Out[23]:
0.92
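
Accuracy alone does not show which kind of mistake the tree makes, and that matters when the goal is to find malignant tumors. As a small illustration, the sketch below counts outcomes for the ten predictions and labels displayed in `Out[21]` and `Out[22]`, treating M as the positive class. It covers only those ten rows, not the full 100-sample test set, and the function name `count_outcomes` is hypothetical.

```python
# First ten predictions (Out[21]) and true labels (Out[22]) from the notebook.
predictions = ['B', 'M', 'M', 'B', 'B', 'M', 'M', 'M', 'M', 'M']
actual = ['B', 'M', 'M', 'B', 'B', 'M', 'M', 'M', 'B', 'B']


def count_outcomes(y_true, y_pred, positive='M'):
    """Count true/false positives and negatives for one positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            key = "tp" if truth == positive else "fp"
        else:
            key = "fn" if truth == positive else "tn"
        counts[key] += 1
    return counts


outcomes = count_outcomes(actual, predictions)
print(outcomes)  # {'tp': 5, 'fp': 2, 'fn': 0, 'tn': 3}

# Recall: share of malignant tumors that were flagged as malignant.
recall = outcomes["tp"] / (outcomes["tp"] + outcomes["fn"])
print(recall)  # 1.0
```

On these ten rows both errors are benign tumors flagged as malignant; no malignant tumor was missed.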
In [15]:
# Export the last fitted tree to Graphviz DOT text and render it as a PNG
dot_data = StringIO()
export_graphviz(cancer_tree, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Out[15]:
[Figure: rendered decision tree]
In [37]:
# values holds absolute test-set sizes (5..100 samples); scores are fractions in [0, 1]
plt.plot(values, scores)
plt.ylabel('accuracy score')
plt.xlabel('number of samples used for testing')
plt.title('number of test samples vs accuracy score')
plt.axis([0, 100, 0, 1])
Out[37]:
[0, 100, 0, 1]