This example uses a decision tree model to decide, from a few features, whether a mushroom is poisonous.
import numpy as np
import matplotlib.pyplot as plt
from public_tests import *
%matplotlib inline
You will start by loading the dataset for this task. The dataset you have collected is as follows:
| Cap Color | Stalk Shape | Solitary | Edible |
|---|---|---|---|
| Brown | Tapering | Yes | 1 |
| Brown | Enlarging | Yes | 1 |
| Brown | Enlarging | No | 0 |
| Brown | Enlarging | No | 0 |
| Brown | Tapering | Yes | 1 |
| Red | Tapering | Yes | 0 |
| Red | Enlarging | No | 0 |
| Brown | Enlarging | Yes | 1 |
| Red | Tapering | No | 1 |
| Brown | Enlarging | No | 0 |
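
You have 10 examples of mushrooms. For each example, you have

- Three features
  - Cap Color (Brown or Red),
  - Stalk Shape (Tapering or Enlarging), and
  - Solitary (Yes or No)
- Label
  - Edible (1 indicating yes or 0 indicating poisonous)

For ease of implementation, we have one-hot encoded the features (turned them into 0 or 1 valued features)
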
| Brown Cap | Tapering Stalk Shape | Solitary | Edible |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 |

Therefore,

- X_train contains three features for each example
  - Brown Cap (1 indicates "Brown" cap color and 0 indicates "Red" cap color)
  - Tapering Stalk Shape (1 indicates "Tapering Stalk Shape" and 0 indicates "Enlarging" stalk shape)
  - Solitary (1 indicates "Yes" and 0 indicates "No")
- y_train is whether the mushroom is edible
  - y = 1 indicates edible
  - y = 0 indicates poisonous

Some features may take n distinct values rather than just two. One-hot encoding such a feature requires n columns, one per value.
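The encoding described above can be sketched as follows. This is a minimal illustration, not the lab's own loading code: the `raw` list simply retypes the first table, the `one_hot` helper is a hypothetical name introduced here, and the three-valued cap color ("White") is made up to show the n-column case.

```python
import numpy as np

# Raw table from above: (cap color, stalk shape, solitary, edible)
raw = [
    ("Brown", "Tapering", "Yes", 1),
    ("Brown", "Enlarging", "Yes", 1),
    ("Brown", "Enlarging", "No", 0),
    ("Brown", "Enlarging", "No", 0),
    ("Brown", "Tapering", "Yes", 1),
    ("Red", "Tapering", "Yes", 0),
    ("Red", "Enlarging", "No", 0),
    ("Brown", "Enlarging", "Yes", 1),
    ("Red", "Tapering", "No", 1),
    ("Brown", "Enlarging", "No", 0),
]

# A two-valued feature needs only one 0/1 column: 1 for the named value.
X_train = np.array([
    [int(cap == "Brown"), int(stalk == "Tapering"), int(sol == "Yes")]
    for cap, stalk, sol, _ in raw
])
y_train = np.array([label for *_, label in raw])


def one_hot(values):
    """Hypothetical helper: one column per distinct category value."""
    categories = sorted(set(values))
    encoded = np.array([[int(v == c) for c in categories] for v in values])
    return categories, encoded


# Made-up three-valued feature (not in the dataset): needs three columns.
categories, encoded = one_hot(["Brown", "Red", "White", "Red"])
print(categories)
print(encoded)
```

Each row of `encoded` contains exactly one 1, which is what distinguishes a one-hot encoding from the single-column shortcut used for the binary features.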
It is good practice to start by printing the data and its type.

print("First few elements of X_train:\n", X_train[:5])
print("Type of X_train:",type(X_train))

First few elements of X_train:
 [[1 1 1]
 [1 0 1]
 [1 0 0]
 [1 0 0]
 [1 1 1]]
Type of X_train: <class 'numpy.ndarray'>

print("First few elements of y_train:", y_train[:5])
print("Type of y_train:",type(y_train))

First few elements of y_train: [1 1 0 0 1]
Type of y_train: <class 'numpy.ndarray'>

Print the dimensions as well.

print ('The shape of X_train is:', X_train.shape)
print ('The shape of y_train is: ', y_train.shape)
print ('Number of training examples (m):', len(X_train))

The shape of X_train is: (10, 3)
The shape of y_train is: (10,)
Number of training examples (m): 10

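In this exercise, you will build a decision tree from the dataset provided.

The steps for building a decision tree are as follows:

- Start with all examples at the root node
- Compute the information gain for splitting on each possible feature, and pick the feature with the highest information gain
- Split the dataset according to the selected feature, creating left and right branches of the tree
- Keep repeating the splitting process until a stopping criterion is met (here, a maximum depth)

In this lab, you will implement the following functions, which let you split a node into left and right branches using the feature with the highest information gain:

- Compute the entropy at a node
- Split the dataset at a node into left and right branches based on a given feature
- Compute the information gain from splitting on a given feature
- Choose the feature that maximizes information gain

First, you will write a helper function called compute_entropy that computes the entropy (a measure of impurity) at a node.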
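The compute_entropy() function takes a numpy array (y) that indicates whether each example at the node is edible (1) or poisonous (0). It should:

- Compute $p_1$, the fraction of examples that are edible (i.e. have value = 1 in y)
- Compute the entropy as

$$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)$$

- Treat the entropy as 0 when $p_1 = 0$ or $p_1 = 1$, since $\log_2(0)$ is undefined
- Check that the data at the node is not empty (i.e. len(y) != 0). Return 0 if it is; handle this as a special case

The code is as follows:
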
# UNQ_C1
# GRADED FUNCTION: compute_entropy

def compute_entropy(y):
    """
    Computes the entropy at a node

    Args:
       y (ndarray): Numpy array indicating whether each example at a node is
           edible (`1`) or poisonous (`0`)

    Returns:
        entropy (float): Entropy at that node
    """
    # You need to return the following variables correctly
    entropy = 0.

    ### START CODE HERE ###
    if len(y) != 0:
        # Fraction of examples at the node with y == 1;
        # y[y == 1] selects only the entries equal to 1
        p1 = len(y[y == 1]) / len(y)
        if p1 != 1 and p1 != 0:
            entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
        else:
            entropy = 0.
    ### END CODE HERE ###

    return entropy

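As a quick sanity check of the entropy formula, the sketch below evaluates it on the root node and on two edge cases. The compact `compute_entropy` here is a restatement for illustration (it uses `np.mean` instead of counting), not the graded version.

```python
import numpy as np


def compute_entropy(y):
    """Binary entropy of the labels at a node (0 for empty or pure nodes)."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y == 1)  # fraction of edible examples
    if p1 == 0 or p1 == 1:
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))


y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])

root_entropy = compute_entropy(y_train)            # 5 of 10 edible -> 1.0
pure_entropy = compute_entropy(np.array([1, 1]))   # pure node -> 0.0
empty_entropy = compute_entropy(np.array([]))      # empty node -> 0.0
print(root_entropy, pure_entropy, empty_entropy)
```

The root node is maximally impure (half edible, half poisonous), so its entropy is exactly 1.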
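Next, you will write a helper function called split_dataset that takes in the data at a node and a feature to split on, and splits it into left and right branches. Later in the lab, you will implement code to measure how good the split is.

- For example, say we start at the root node (so node_indices = [0,1,2,3,4,5,6,7,8,9]) and choose to split on feature 0, which is whether or not the example has a brown cap.
- The output of the function is then left_indices = [0,1,2,3,4,7,9] and right_indices = [5,6,8]

| Index | Brown Cap | Tapering Stalk Shape | Solitary | Edible |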
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 |
| 5 | 0 | 1 | 1 | 0 |
| 6 | 0 | 0 | 0 | 0 |
| 7 | 1 | 0 | 1 | 1 |
| 8 | 0 | 1 | 0 | 1 |
| 9 | 1 | 0 | 0 | 0 |

Please complete the split_dataset() function shown below:

- For each index in node_indices
  - If the value of X at that index for that feature is 1, add the index to left_indices
  - If the value of X at that index for that feature is 0, add the index to right_indices

Code implementation:

# UNQ_C2
# GRADED FUNCTION: split_dataset

def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into
    left and right branches

    Args:
        X (ndarray):             Data matrix of shape(n_samples, n_features)
        node_indices (ndarray):  List containing the active indices. I.e, the samples being considered at this step.
        feature (int):           Index of feature to split on

    Returns:
        left_indices (ndarray): Indices with feature value == 1
        right_indices (ndarray): Indices with feature value == 0
    """
    # You need to return the following variables correctly
    left_indices = []
    right_indices = []

    ### START CODE HERE ###
    for i in node_indices:
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    ### END CODE HERE ###

    return left_indices, right_indices

Calling the function:

root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Feel free to play around with these variables
# The dataset only has three features, so this value can be 0 (Brown Cap), 1 (Tapering Stalk Shape) or 2 (Solitary)
feature = 0

left_indices, right_indices = split_dataset(X_train, root_indices, feature)

print("Left indices: ", left_indices)
print("Right indices: ", right_indices)

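To confirm the split behaves as described for feature 0, here is a self-contained sketch. The list-comprehension form of `split_dataset` is an equivalent restatement for brevity, and `X_train` is retyped from the encoded table.

```python
import numpy as np

X_train = np.array([[1, 1, 1], [1, 0, 1], [1, 0, 0], [1, 0, 0], [1, 1, 1],
                    [0, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0]])


def split_dataset(X, node_indices, feature):
    """Indices whose feature value is 1 go left; all others go right."""
    left = [i for i in node_indices if X[i, feature] == 1]
    right = [i for i in node_indices if X[i, feature] != 1]
    return left, right


root_indices = list(range(10))
left_indices, right_indices = split_dataset(X_train, root_indices, 0)
print("Left indices: ", left_indices)    # brown caps
print("Right indices: ", right_indices)  # red caps
```

Splitting the root on Brown Cap sends the seven brown-capped mushrooms left and the three red-capped ones right, matching the indices given in the example above.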
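Next, you will write a function called compute_information_gain that takes in the training data, the indices at a node, and a feature to split on, and returns the information gain from the split.

Please complete the compute_information_gain() function shown below to compute

$$\text{Information Gain} = H(p_1^\text{node}) - \left(w^{\text{left}} H(p_1^\text{left}) + w^{\text{right}} H(p_1^\text{right})\right)$$

where

- $H(p_1^\text{node})$ is the entropy at the node
- $H(p_1^\text{left})$ and $H(p_1^\text{right})$ are the entropies at the left and right branches resulting from the split
- $w^{\text{left}}$ and $w^{\text{right}}$ are the proportions of the node's examples that go to the left and right branch, respectively

Note that len(X_node) = len(X_left) + len(X_right), so the two weights sum to 1.

Code implementation:
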
# UNQ_C3
# GRADED FUNCTION: compute_information_gain

def compute_information_gain(X, y, node_indices, feature):
    """
    Compute the information gain of splitting the node on a given feature

    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        feature (int):          Index of feature to split on

    Returns:
        information_gain (float): Information gain of the split
    """
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)

    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]

    # You need to return the following variables correctly
    information_gain = 0

    ### START CODE HERE ###
    # Weights
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)

    # Entropies; remember these are computed on the y values (labels)
    H_p1_node = compute_entropy(y_node)
    H_p1_left = compute_entropy(y_left)
    H_p1_right = compute_entropy(y_right)

    # Information gain
    information_gain = H_p1_node - (w_left * H_p1_left + w_right * H_p1_right)
    ### END CODE HERE ###

    return information_gain

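To make the formula concrete, the sketch below evaluates the information gain of each of the three features at the root node. The helper functions are compact restatements for illustration, and `gains` is a name introduced here; the data is retyped from the encoded table.

```python
import numpy as np

X_train = np.array([[1, 1, 1], [1, 0, 1], [1, 0, 0], [1, 0, 0], [1, 1, 1],
                    [0, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0]])
y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])


def compute_entropy(y):
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y == 1)
    if p1 == 0 or p1 == 1:
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))


def split_dataset(X, node_indices, feature):
    left = [i for i in node_indices if X[i, feature] == 1]
    right = [i for i in node_indices if X[i, feature] != 1]
    return left, right


def compute_information_gain(X, y, node_indices, feature):
    left, right = split_dataset(X, node_indices, feature)
    w_left = len(left) / len(node_indices)
    w_right = len(right) / len(node_indices)
    weighted = (w_left * compute_entropy(y[left])
                + w_right * compute_entropy(y[right]))
    return compute_entropy(y[node_indices]) - weighted


root_indices = list(range(10))
names = ["Brown Cap", "Tapering Stalk Shape", "Solitary"]
gains = [compute_information_gain(X_train, y_train, root_indices, f)
         for f in range(3)]
for name, gain in zip(names, gains):
    print(f"{name}: {gain:.4f}")
# Brown Cap: 0.0349
# Tapering Stalk Shape: 0.1245
# Solitary: 0.2781
```

For example, splitting on Solitary puts 5 examples on each side with edible fractions 4/5 and 1/5, so the weighted entropy is about 0.7219 and the gain is 1 - 0.7219 ≈ 0.2781, the largest of the three.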
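To find the best feature to split on, compute the information gain of each feature as described above and return the feature that gives the maximum information gain.

Please complete the get_best_split() function shown below.

- You can use the compute_information_gain() function to iterate through the features and compute the information gain for each one

The code is as follows:
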
# UNQ_C4
# GRADED FUNCTION: get_best_split

def get_best_split(X, y, node_indices):
    """
    Returns the optimal feature to split the node data

    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split
    """
    # Some useful variables
    num_features = X.shape[1]

    # You need to return the following variables correctly
    best_feature = -1

    ### START CODE HERE ###
    # best_feature stays -1 if no feature yields a positive gain
    max_gain = 0
    for i in range(num_features):
        info_gain = compute_information_gain(X, y, node_indices, i)
        if info_gain > max_gain:
            max_gain = info_gain
            best_feature = i
    ### END CODE HERE ###

    return best_feature

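The selection loop can be exercised end to end with the sketch below. The helpers are compact restatements for illustration; the second call uses a node that is already pure to show the -1 return value, which is an illustrative case chosen here rather than one taken from the lab.

```python
import numpy as np

X_train = np.array([[1, 1, 1], [1, 0, 1], [1, 0, 0], [1, 0, 0], [1, 1, 1],
                    [0, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0]])
y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])


def compute_entropy(y):
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y == 1)
    if p1 == 0 or p1 == 1:
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))


def compute_information_gain(X, y, node_indices, feature):
    left = [i for i in node_indices if X[i, feature] == 1]
    right = [i for i in node_indices if X[i, feature] != 1]
    weighted = (len(left) * compute_entropy(y[left])
                + len(right) * compute_entropy(y[right])) / len(node_indices)
    return compute_entropy(y[node_indices]) - weighted


def get_best_split(X, y, node_indices):
    """Feature index with the largest positive gain, or -1 if none."""
    best_feature, max_gain = -1, 0
    for feature in range(X.shape[1]):
        gain = compute_information_gain(X, y, node_indices, feature)
        if gain > max_gain:
            max_gain, best_feature = gain, feature
    return best_feature


best_at_root = get_best_split(X_train, y_train, list(range(10)))
best_at_pure = get_best_split(X_train, y_train, [0, 1, 4, 7])  # all edible
print(best_at_root, best_at_pure)
```

At the root the function picks feature 2 (Solitary). On a node whose labels are all identical, every split has zero gain, so the strict `>` comparison leaves the result at -1; a caller could use that as a signal to stop splitting.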
Building the decision tree recursively
# Not graded
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that splits the dataset into 2 subgroups at each node.
    This function just prints the tree.

    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree.
        current_depth (int):    Current depth. Parameter used during recursive call.
    """
    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return

    # Otherwise, get best split and split the data
    # Get the best feature at this node
    best_feature = get_best_split(X, y, node_indices)
    tree.append((current_depth, branch_name, best_feature, node_indices))

    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))

    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)

    # Continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)

build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)

 Depth 0, Root: Split on feature: 2
- Depth 1, Left: Split on feature: 0
  -- Left leaf node with indices [0, 1, 4, 7]
  -- Right leaf node with indices [5]
- Depth 1, Right: Split on feature: 1
  -- Left leaf node with indices [8]
  -- Right leaf node with indices [2, 3, 6, 9]
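
The recursive function only prints the leaves. One way to read predictions off the printed tree is to take the majority label at each leaf. That rule is an assumption added here for illustration and is not part of the code above; the leaf names in `leaves` are also made up for this sketch, while the index lists are copied from the printed output.

```python
import numpy as np

y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])

# Leaf index lists copied from the printed tree (max_depth=2).
leaves = {
    "Solitary=1, Brown Cap=1": [0, 1, 4, 7],
    "Solitary=1, Brown Cap=0": [5],
    "Solitary=0, Tapering=1": [8],
    "Solitary=0, Tapering=0": [2, 3, 6, 9],
}

predictions = {}
for name, indices in leaves.items():
    frac_edible = y_train[indices].mean()          # share of edible examples
    predictions[name] = int(frac_edible >= 0.5)    # assumed majority-vote rule
    print(f"{name}: {len(indices)} examples, predict {predictions[name]}")
```

On this training set every leaf happens to be pure, so the depth-2 tree classifies all 10 training examples correctly under this rule.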