35 | AdaBoost (Part 2): How Do You Use AdaBoost to Predict House Prices?
Lecturer: Chen Yang
Duration: 08:37 | Size: 7.90M
How to use the AdaBoost tool
How to predict house prices with AdaBoost
Comparing AdaBoost with a decision tree model
Summary
Featured comments (23)
- TKbook (2019-03-05): In the source code:

```python
# Take the first 2000 of the 12000 rows as the test set, the rest as the training set
test_x, test_y = X[2000:], y[2000:]
train_x, train_y = X[:2000], y[:2000]
```

Isn't this part wrong? It should be:

```python
test_x, test_y = X[:2000], y[:2000]
train_x, train_y = X[2000:], y[2000:]
```
Editor's reply: Hello, the article has been corrected. Thank you for the feedback.
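For readers following along, the corrected split can be sketched on a synthetic array (a minimal illustration; the variable names mirror the article's, but the data here is made up):

```python
import numpy as np

# Synthetic stand-in for the 12,000-row dataset discussed above.
X = np.arange(12000)
y = np.arange(12000)

# First 2000 rows become the test set, the remaining 10,000 the training set.
test_x, test_y = X[:2000], y[:2000]
train_x, train_y = X[2000:], y[2000:]

print(len(test_x), len(train_x))  # 2000 10000
```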
- third (2019-03-04): AdaBoost still comes out best. I also noticed that the first two classifiers produce results very quickly. Why AdaBoost wins, in my view:
1. AdaBoost does more computation, especially in iterating over and combining the weak classifiers.
2. Well-combined individuals create more value than any one alone.

Results:
Decision tree weak classifier accuracy: 0.7867
Decision tree classifier accuracy: 0.7891
AdaBoost classifier accuracy: 0.8138

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction import DictVectorizer

# 1. Load the data
train_data = pd.read_csv('./Titanic_Data/train.csv')
test_data = pd.read_csv('./Titanic_Data/test.csv')

# 2. Clean the data
# Fill NaN ages with the mean age
train_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
test_data['Age'].fillna(test_data['Age'].mean(), inplace=True)
# Fill missing fares with the mean fare
train_data['Fare'].fillna(train_data['Fare'].mean(), inplace=True)
test_data['Fare'].fillna(test_data['Fare'].mean(), inplace=True)
# Fill missing embarkation ports with the most common port
train_data['Embarked'].fillna('S', inplace=True)
test_data['Embarked'].fillna('S', inplace=True)

# Feature selection
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]

# Turn symbolic columns such as Embarked into 0/1 features
dvec = DictVectorizer(sparse=False)
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))
test_features = dvec.transform(test_features.to_dict(orient='records'))

# Decision tree weak classifier
dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
dt_stump.fit(train_features, train_labels)
print('Decision tree weak classifier accuracy %.4f' % np.mean(cross_val_score(dt_stump, train_features, train_labels, cv=10)))

# Decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(train_features, train_labels)
print('Decision tree classifier accuracy %.4f' % np.mean(cross_val_score(dt, train_features, train_labels, cv=10)))

# AdaBoost classifier
ada = AdaBoostClassifier(base_estimator=dt_stump, n_estimators=200)
ada.fit(train_features, train_labels)
print('AdaBoost classifier accuracy %.4f' % np.mean(cross_val_score(ada, train_features, train_labels, cv=10)))
```
Editor's reply: The result is correct. AdaBoost generally scores slightly better than a plain decision tree classifier.
- 王彬成 (2019-03-04): Since the passenger test set is missing the ground-truth labels, I used K-fold cross-validation for the accuracy.

Results:
Decision tree weak classifier accuracy: 0.7867
Decision tree classifier accuracy: 0.7813
AdaBoost classifier accuracy: 0.8138

Code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score

# Number of AdaBoost iterations
n_estimators = 200

# Load the data
train_data = pd.read_csv('./Titanic_Data/train.csv')
test_data = pd.read_csv('./Titanic_Data/test.csv')

# Module 2: clean the data
# Fill NaN ages with the mean age
train_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
test_data['Age'].fillna(test_data['Age'].mean(), inplace=True)
# Fill NaN fares with the mean fare
train_data['Fare'].fillna(train_data['Fare'].mean(), inplace=True)
test_data['Fare'].fillna(test_data['Fare'].mean(), inplace=True)
# Fill NaN embarkation ports with the most common port
train_data['Embarked'].fillna('S', inplace=True)
test_data['Embarked'].fillna('S', inplace=True)

# Feature selection
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]

# Turn the symbolic Embarked column into 0/1 features
dvec = DictVectorizer(sparse=False)
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))
test_features = dvec.transform(test_features.to_dict(orient='records'))

# Decision tree weak classifier
dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
dt_stump.fit(train_features, train_labels)
print('Decision tree weak classifier accuracy %.4f' % np.mean(cross_val_score(dt_stump, train_features, train_labels, cv=10)))

# Decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(train_features, train_labels)
print('Decision tree classifier accuracy %.4f' % np.mean(cross_val_score(dt, train_features, train_labels, cv=10)))

# AdaBoost classifier
ada = AdaBoostClassifier(base_estimator=dt_stump, n_estimators=n_estimators)
ada.fit(train_features, train_labels)
print('AdaBoost classifier accuracy %.4f' % np.mean(cross_val_score(ada, train_features, train_labels, cv=10)))
```
Author's reply: Good job.
- 梁林松 (2019-03-04): Running the second code block requires importing two extra modules:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
```
Editor's reply: Right, you need to import the corresponding regressor classes.
- 滢 (2019-04-21): My results:
CART decision tree K-fold cross-validation accuracy: 0.39480897860892333
AdaBoost K-fold cross-validation accuracy: 0.4376641797318339

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_predict
import pandas as pd
import numpy as np

# Load the data
path = '/Users/apple/Desktop/GitHubProject/Read mark/数据分析/geekTime/data/'
train_data = pd.read_csv(path + 'Titannic_Data_train.csv')
test_data = pd.read_csv(path + 'Titannic_Data_test.csv')

# Clean the data
train_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
test_data['Age'].fillna(test_data['Age'].mean(), inplace=True)
train_data['Embarked'].fillna('S', inplace=True)
test_data['Embarked'].fillna('S', inplace=True)

# Feature selection
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']
train_features = train_data[features]
train_result = train_data['Survived']
test_features = test_data[features]
devc = DictVectorizer(sparse=False)
train_features = devc.fit_transform(train_features.to_dict(orient='records'))
test_features = devc.transform(test_features.to_dict(orient='records'))

# Build a decision tree and predict
tree_regressor = DecisionTreeRegressor()
tree_regressor.fit(train_features, train_result)
predict_tree = tree_regressor.predict(test_features)
# Cross-validation accuracy
print('CART decision tree K-fold cross-validation accuracy:', np.mean(cross_val_predict(tree_regressor, train_features, train_result, cv=10)))

# Build AdaBoost
ada_regressor = AdaBoostRegressor()
ada_regressor.fit(train_features, train_result)
predict_ada = ada_regressor.predict(test_features)
# Cross-validation accuracy
print('AdaBoost K-fold cross-validation accuracy:', np.mean(cross_val_predict(ada_regressor, train_features, train_result, cv=10)))
```
Editor's reply: Accuracy is normally not this low, so check your code for errors. Note that you should use DecisionTreeClassifier and AdaBoostClassifier here, because Titanic survival prediction is a classification problem (discrete values), not a regression problem (continuous values). Also, for K-fold cross-validation you should use cross_val_score:
cross_val_score returns the evaluation accuracy;
cross_val_predict returns the predicted class labels.
Adjust these two places and run the code again.
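The difference between the two functions can be illustrated on scikit-learn's built-in iris dataset (a minimal sketch; the course uses the Titanic data, but the API behavior is the same):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict

X, y = load_iris(return_X_y=True)  # 150 samples, 3 classes
clf = DecisionTreeClassifier(random_state=42)

# cross_val_score returns one accuracy per fold; average them for the overall accuracy.
scores = cross_val_score(clf, X, y, cv=10)
print('mean CV accuracy:', np.mean(scores))

# cross_val_predict returns one predicted label per sample;
# averaging class labels, as in the comment above, is not an accuracy.
preds = cross_val_predict(clf, X, y, cv=10)
print('first 5 predicted labels:', preds[:5])
```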
- KokutoDa (2021-05-22): My accuracies:
AdaBoost cross-validation: 0.81147315855181
Decision tree cross-validation: 0.7812484394506866
AdaBoost (accuracy_score): 0.8484848484848485
Decision tree (accuracy_score): 0.9820426487093153

Why do the cross-validation accuracies and the accuracy_score accuracies rank the two models in opposite order?

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import numpy as np

# AdaBoost on Titanic survival prediction
# Load the data
train_data = pd.read_csv('../Titanic_Data/train.csv')
test_data = pd.read_csv('../Titanic_Data/test.csv')

# Clean the data
train_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
test_data['Age'].fillna(test_data['Age'].mean(), inplace=True)
train_data['Fare'].fillna(train_data['Fare'].mean(), inplace=True)
test_data['Fare'].fillna(test_data['Fare'].mean(), inplace=True)
# Fill missing embarkation ports with the most common port
train_data['Embarked'].fillna('S', inplace=True)
test_data['Embarked'].fillna('S', inplace=True)

# Feature selection
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X_train = train_data[features]
y_train = train_data['Survived']
X_test = test_data[features]

# Encode the features numerically
dvec = DictVectorizer(sparse=False)
X_train = dvec.fit_transform(X_train.to_dict(orient='records'))

# AdaBoost
ada = AdaBoostClassifier(n_estimators=200)
ada.fit(X_train, y_train)
ada_pred = ada.predict(X_train)

# Decision tree
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
clf_pred = clf.predict(X_train)

print(np.mean(cross_val_score(ada, X_train, y_train, cv=10)))
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))
print(accuracy_score(y_train, ada_pred))
print(accuracy_score(y_train, clf_pred))
```
- Liam (2021-03-26):

```python
ax = fig.add_subplot(111)
ax.plot([1, n_estimators], [dt_stump_err] * 2, 'k-', label='decision tree weak classifier error rate')
ax.plot([1, n_estimators], [dt_err] * 2, 'k--', label='decision tree model error rate')
ada_err = np.zeros((n_estimators,))
```

Question: what does the `* 2` here mean? Could you walk through this code?
Author's reply: print([0.8] * 2) prints [0.8, 0.8]. Multiplying a list by n repeats the list n times. The * operator is often used to quickly initialize a list, but there is a pitfall: the repeated elements all point to the same object, so if the list's elements will hold different values, use a for loop instead. Here it is only used for plotting a horizontal line, which is why this shortcut is fine.
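The behavior described in the reply can be verified in a few lines:

```python
# A list multiplied by n is the list repeated n times.
dt_err = 0.8
line = [dt_err] * 2
print(line)  # [0.8, 0.8]

# Pitfall: with mutable elements, every copy references the same object.
rows = [[0]] * 3
rows[0].append(1)
print(rows)  # [[0, 1], [0, 1], [0, 1]] -- all three "rows" changed

# A list comprehension (or for loop) creates independent objects.
rows = [[0] for _ in range(3)]
rows[0].append(1)
print(rows)  # [[0, 1], [0], [0]]
```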
- 小晨 (2021-03-10): Results:
Weak classifier accuracy: 0.7868
Decision tree classifier accuracy: 0.7823
AdaBoost classifier accuracy: 0.8115

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author: Peter
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Number of AdaBoost iterations
n_estimators = 200

train_data = pd.read_csv(r'data/Titanic_Data_train.csv')
test_data = pd.read_csv(r'data/Titanic_Data_Test.csv')

# Fill missing ages with the mean age
train_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
test_data['Age'].fillna(test_data['Age'].mean(), inplace=True)
# Fill missing fares with the mean fare
train_data['Fare'].fillna(train_data['Fare'].mean(), inplace=True)
test_data['Fare'].fillna(test_data['Fare'].mean(), inplace=True)
# Fill missing embarkation ports with the most common port, S
train_data['Embarked'].fillna('S', inplace=True)
test_data['Embarked'].fillna('S', inplace=True)

# Put the usable classification features into the training set
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]

# Normalize string columns into numeric features
dvec = DictVectorizer(sparse=False)
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))
test_features = dvec.transform(test_features.to_dict(orient='records'))

# Weak classifier (scored on the training set, not cross-validated)
dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
dt_stump.fit(train_features, train_labels)
print('Weak classifier accuracy %.4f' % dt_stump.score(train_features, train_labels))

# Decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(train_features, train_labels)
print('Decision tree classifier accuracy %.4f' % np.mean(cross_val_score(dt, train_features, train_labels, cv=10)))

# AdaBoost classifier
ada = AdaBoostClassifier(base_estimator=dt_stump, n_estimators=n_estimators)
ada.fit(train_features, train_labels)
ada_score = np.mean(cross_val_score(ada, train_features, train_labels, cv=10))
print('AdaBoost classifier accuracy: %.4f' % ada_score)
```
Author's reply: The result is correct. AdaBoost generally scores slightly better than a plain decision tree classifier.
- 非同凡想 (2020-11-26): Submitting my homework:

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import feature_extraction
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

# load dataset
train_data = pd.DataFrame(pd.read_csv('~/Documents/titanic_data/train.csv'))
test_data = pd.DataFrame(pd.read_csv('~/Documents/titanic_data/test.csv'))

# data cleaning
train_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
test_data['Age'].fillna(test_data['Age'].mean(), inplace=True)
test_data['Fare'].fillna(test_data['Fare'].mean(), inplace=True)
train_data['Embarked'].fillna('S', inplace=True)
test_data['Embarked'].fillna('S', inplace=True)

# select features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_x = train_data[features]
train_y = train_data['Survived']
test_x = test_data[features]

# one-hot
dict_vec = feature_extraction.DictVectorizer(sparse=False)
train_x = dict_vec.fit_transform(train_x.to_dict(orient='records'))
test_x = dict_vec.transform(test_x.to_dict(orient='records'))
print(dict_vec.feature_names_)

# decision tree
dtc = tree.DecisionTreeClassifier()
dtc.fit(train_x, train_y)
print("Decision tree accuracy", dtc.score(train_x, train_y))
print("Decision tree k-fold cross-validation accuracy:", np.mean(cross_val_score(dtc, train_x, train_y, cv=10)))

# adaboost
ada = AdaBoostClassifier(n_estimators=50)
ada.fit(train_x, train_y)
print("AdaBoost accuracy", ada.score(train_x, train_y))
print("AdaBoost k-fold cross-validation accuracy:", np.mean(cross_val_score(ada, train_x, train_y, cv=10)))
```

Output:
Decision tree accuracy 0.9820426487093153
Decision tree k-fold cross-validation accuracy: 0.7744943820224719
AdaBoost accuracy 0.8338945005611672
AdaBoost k-fold cross-validation accuracy: 0.8070037453183521
- 萌辰 (2020-07-05): In the AdaBoost vs. decision tree regression vs. KNN house-price comparison, I found that the random seed affects the decision tree's predictions. I tested three different seeds:

```python
dec_regressor = DecisionTreeRegressor(random_state=1)
dec_regressor = DecisionTreeRegressor(random_state=20)
dec_regressor = DecisionTreeRegressor(random_state=30)
```

Results:
Decision tree MSE (seed 1) = 36.65
Decision tree MSE (seed 20) = 25.54
Decision tree MSE (seed 30) = 37.19

Thoughts: since the seed's randomness is not controlled here, the comparison may be too random to reflect the algorithms' true performance, and the two algorithms use the random seed differently internally. Would averaging the MSE over multiple random runs be a better basis for comparison? KNN is unaffected by the random seed.
Author's reply: DecisionTreeRegressor's random_state parameter is the random seed that controls the estimator's randomness. Even when the splitter is set to "best", the features considered at each split are always permuted randomly. When max_features < n_features, the algorithm randomly selects max_features features at each split and then finds the best split among them. But even when max_features = n_features, the best split found can vary between runs: if several candidate splits tie on the criterion improvement, one must be chosen at random. To get deterministic behavior during fitting, random_state must be fixed to an integer. With the same random_state, the same splitting strategy is used every time, which is why different seeds give different results.
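The determinism point can be sanity-checked on synthetic data (illustrative only): two fits with the same random_state produce identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = rng.rand(100)

# Same random_state -> identical fitted trees -> identical predictions.
a = DecisionTreeRegressor(random_state=1).fit(X, y).predict(X)
b = DecisionTreeRegressor(random_state=1).fit(X, y).predict(X)
print(np.array_equal(a, b))  # True
```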
- even (2020-06-30): Different algorithms have different characteristics. Could you do a summary and comparison, for example how you would choose an algorithm in real work or projects based on experience and each algorithm's strengths, and why you would choose it? I would love to hear your experience with real application scenarios.
- §mc²ompleXWr (2020-06-09): If we use the built-in dataset, do we not need to normalize the data?
Author's reply: Whether normalization is needed depends on the model you use and the range of the feature values. For example, tree models and naive Bayes do not need normalization; and if the features are already all in the 0-1 range, or already standardized, no normalization is needed.
- 鲨鱼鲸鱼鳄鱼 (2020-05-25): Does an AdaBoost model need the data to be standardized or normalized before prediction? What are the benefits of doing it, and of not doing it?
Author's reply: AdaBoost's default weak classifier is a decision tree. Tree models only look at the relative position of points and do not compute distances between them, so no data normalization (standardization, min-max scaling, etc.) is needed.
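This scale-invariance of tree models can be sanity-checked on a built-in dataset (a sketch: multiplying the features by a positive constant preserves their ordering, so with a fixed random_state the learned splits partition the data identically and the predictions match):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train the same tree on raw features and on features scaled by 1000.
p1 = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
p2 = DecisionTreeClassifier(random_state=0).fit(X * 1000, y).predict(X * 1000)
print(np.array_equal(p1, p2))  # True: rescaling does not change the predictions
```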
- 张贺 (2020-03-27): Very clearly explained.
Author's reply: Thank you, 张贺!
- 热水泡面不会做 (2020-03-10): Could you explain where the learning rate comes into play here? The earlier explanation of the algorithm's principles didn't seem to use a learning rate.
- Untitled (2020-03-08): Results:
ada train precision = 0.8338945005611672
ada 10k precison = 0.8070037453183521
clf train precision = 0.9820426487093153
clf 10k precision = 0.7767041198501872

```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Fill missing ages with the training-set mean age
train_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
test_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
# Fill missing fares with the training-set mean fare
train_data['Fare'].fillna(train_data['Fare'].mean(), inplace=True)
test_data['Fare'].fillna(train_data['Fare'].mean(), inplace=True)
train_data['Embarked'].fillna('S', inplace=True)
test_data['Embarked'].fillna('S', inplace=True)

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]

dvec = DictVectorizer(sparse=False)
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))
test_features = dvec.transform(test_features.to_dict(orient='records'))

ada = AdaBoostClassifier()
ada.fit(train_features, train_labels)
print("ada train precision = ", ada.score(train_features, train_labels))
print("ada 10k precison = ", np.mean(cross_val_score(ada, train_features, train_labels, cv=10)))

clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(train_features, train_labels)
print("clf train precision = ", clf.score(train_features, train_labels))
print("clf 10k precision = ", np.mean(cross_val_score(clf, train_features, train_labels, cv=10)))
```
- 骑行的掌柜J (2019-08-14): I mistyped earlier. Mr. Chen was right: it's the regression algorithm 😂 that lacks the classification algorithm's `algorithm` parameter.
- 滨滨 (2019-04-21): Both classification and regression make predictions; classification predicts discrete values, regression predicts continuous values.
Author's reply: Correct.
- hlz-123 (2019-03-27): In the AdaBoost vs. decision tree comparison example, the weak classifier is dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1). Why are both parameters set to 1? That gives just one root node and two leaf nodes. And the plain decision tree classifier has no parameters set at all. Why is that?
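For context on the question above: max_depth=1 deliberately produces a "decision stump", the classic AdaBoost weak learner with exactly one split, while the unparameterized DecisionTreeClassifier grows a full tree for comparison. The stump's shape can be checked directly (a sketch on scikit-learn's built-in iris data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth=1: a "decision stump" -- one root split, two leaves.
stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1).fit(X, y)
print(stump.get_depth())     # 1
print(stump.get_n_leaves())  # 2

# With no depth limit, the tree grows much deeper.
full = DecisionTreeClassifier().fit(X, y)
print(full.get_depth() > 1)  # True
```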
- 叮当猫 (2019-03-19): When is fit_transform needed for unified processing of the data? Without fit_transform, my accuracies were:
Decision tree weak classifier: 0.7867
Decision tree classifier: 0.7734
AdaBoost classifier: 0.8161
With fit_transform applied to the data, my accuracies were:
Decision tree weak classifier: 0.7867
Decision tree classifier: 0.7745
AdaBoost classifier: 0.8138

Case 1 (manual mapping, DictVectorizer commented out):

```python
train_data['Embarked'] = train_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
test_data['Embarked'] = test_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})
test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})
train_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
test_data['Age'].fillna(test_data['Age'].mean(), inplace=True)
train_data['Fare'].fillna(train_data['Fare'].mean(), inplace=True)
test_data['Fare'].fillna(test_data['Fare'].mean(), inplace=True)
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]
# train_features = dvec.fit_transform(train_features.to_dict(orient='records'))
# test_features = dvec.transform(test_features.to_dict(orient='records'))
```

Case 2 (DictVectorizer, manual mapping commented out):

```python
# train_data['Embarked'] = train_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# test_data['Embarked'] = test_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})
# test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})
train_data['Age'].fillna(train_data['Age'].mean(), inplace=True)
test_data['Age'].fillna(test_data['Age'].mean(), inplace=True)
train_data['Fare'].fillna(train_data['Fare'].mean(), inplace=True)
test_data['Fare'].fillna(test_data['Fare'].mean(), inplace=True)
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))
test_features = dvec.transform(test_features.to_dict(orient='records'))
```