
Here is a very simple example of categorizing zoo animals.
Import necessary modules¶
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import graphviz
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas versions
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
import warnings
warnings.simplefilter("ignore")
Load the dataset from the local folder¶
Read the dataset, which was downloaded from https://www.kaggle.com/uciml/zoo-animal-classification
zoo_df = pd.read_csv('zoo.csv', encoding='utf-8', delimiter = ',')
Data Features¶
The dataset has the following features (class_type is the target; a sketch for turning its numeric codes into class names follows the list):
- Animal name
- Hair
- Feathers
- Eggs
- Milk
- Airborne
- Aquatic
- Predator
- Toothed
- Backbone
- Breathes
- Venomous
- Fins
- Legs
- Tail
- Domestic
- Catsize
- Class type
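class_type is a numeric code from 1 to 7. The Kaggle download also ships a class.csv that maps these codes to class names; a minimal sketch, assuming class.csv sits next to zoo.csv and has the columns Class_Number and Class_Type:
# map numeric class codes to readable names (assumes class.csv from the same Kaggle page)
class_df = pd.read_csv('class.csv')
class_names = dict(zip(class_df['Class_Number'], class_df['Class_Type']))
print(class_names)  # e.g. {1: 'Mammal', 2: 'Bird', ...}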
Check the dataset head¶
zoo_df.head()
zoo_df.info()
# check unique animal name
zoo_df.animal_name.value_counts().count()
Drop animal name¶
The animal_name column has object dtype. There are 101 samples, and its values are almost all unique (only one name repeats), so the feature cannot help the model generalize; it should be removed.
zoo_df.drop(columns='animal_name',inplace=True)
zoo_df.head()
Histograms¶
plt.rcParams['figure.figsize'] = (30.0, 30.0)
zoo_df.hist()
plt.show()
zoo_df.class_type.value_counts()
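The value counts suggest the seven classes are imbalanced. A quick sketch showing the same information as a bar chart, reusing the seaborn import above:
# visualize the class imbalance of the target
plt.rcParams['figure.figsize'] = (10.0, 5.0)
sns.countplot(x='class_type', data=zoo_df)
plt.show()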
Compute the correlations and plot them¶
corr = zoo_df.corr()
# plot the heatmap
plt.rcParams['figure.figsize'] = (10.0, 10.0)
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
corr.class_type
class_type appears to be almost uncorrelated with airborne, predator, and fins.
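To make that reading concrete, the correlations with class_type can be ranked by absolute value (a quick sketch):
# rank features by the strength of their linear correlation with the target
corr['class_type'].drop('class_type').abs().sort_values(ascending=False)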
Separate X and Y values from zoo_df¶
X = zoo_df.drop('class_type', axis=1)
Y = zoo_df['class_type']
validation_size = 0.20
seed = 7
Run RFECV to find important features¶
For feature selection, recursive feature elimination with cross-validation (RFECV) can be used: it repeatedly fits the estimator, drops the weakest features, and keeps the subset with the best cross-validated score.
estimator = DecisionTreeClassifier()
rfecv = RFECV(estimator, step=1, cv=5)
rfecv = rfecv.fit(X, Y)
feature_names = X.columns[rfecv.support_]
feature_names
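Besides the boolean mask in support_, the fitted selector also reports how many features it kept and a ranking of every candidate (rank 1 means selected); a short sketch:
# number of selected features and the elimination ranking of each column
print(rfecv.n_features_)
print(dict(zip(X.columns, rfecv.ranking_)))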
Drop unimportant features from X¶
X.drop(columns=X.columns[~rfecv.support_], inplace=True)
X.head()
Split train and test data¶
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
Build some models¶
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
scoring = 'accuracy'
results = []
names = []
for name, model in models:
    cv_results = cross_val_score(model, X_train, Y_train, cv=10, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
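The means and standard deviations are easier to compare visually; a minimal sketch of a boxplot over the ten cross-validation scores of each model:
# compare the spread of cross-validation scores across models
plt.rcParams['figure.figsize'] = (10.0, 6.0)
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()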
Evaluate the decision tree on the validation set¶
dtc = DecisionTreeClassifier()
dtc.fit(X_train, Y_train)
y_pred = dtc.predict(X_validation)
confusion_matrix(Y_validation, y_pred)
accuracy_score(Y_validation, y_pred)
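Accuracy alone can hide weak minority classes; scikit-learn's classification_report adds per-class precision and recall. A quick sketch:
# per-class precision, recall, and F1 on the validation split
from sklearn.metrics import classification_report
print(classification_report(Y_validation, y_pred))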
Decision Tree Rule¶
The following figure shows the decision tree rules.
For example, if the milk value is greater than 0.5, the sample belongs to the first class.
If the milk value is lower than 0.5 and the feathers value is greater than 0.5, the sample belongs to the second class.
...
dot_data = export_graphviz(dtc, out_file=None,
                           feature_names=feature_names,
                           filled=True, rounded=True,
                           special_characters=True)
graph = graphviz.Source(dot_data)
graph
Gaussian Naive Bayes¶
nb = GaussianNB()
nb.fit(X_train, Y_train)
y_pred_nb = nb.predict(X_validation)
confusion_matrix(Y_validation, y_pred_nb)
accuracy_score(Y_validation, y_pred_nb)
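To close, a small sketch printing the hold-out accuracy of both fitted models side by side:
# compare the two fitted models on the same validation split
for label, pred in [('CART', y_pred), ('NB', y_pred_nb)]:
    print(label, accuracy_score(Y_validation, pred))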