Tags: Josh Starmer, StatQuest, Machine Learning, Statistics, Data Science
NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: https://statquest.gumroad.com/l/tzxoh
This webinar was recorded 2020-05-28 at 11:00am (New York time).
NOTE: This StatQuest assumes you are already familiar with:
Decision Trees: https://youtu.be/7VeUPuFGJHk
Cross Validation: https://youtu.be/fSytzGwwBVw
Confusion Matrices: https://youtu.be/Kdsp6soqA7o
Cost Complexity Pruning: https://youtu.be/D0efHEJsfHo
Bias and Variance and Overfitting: https://youtu.be/EuBBz3bI-aA
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying my book, The StatQuest Illustrated Guide to Machine Learning:
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
https://twitter.com/joshuastarmer
0:00 Awesome song and introduction
5:23 Import Modules
7:40 Import Data
11:18 Missing Data Part 1: Identifying
15:57 Missing Data Part 2: Dealing with it
21:16 Format Data Part 1: X and y
23:33 Format Data Part 2: One-Hot Encoding
37:29 Build Preliminary Tree
46:31 Pruning Part 1: Visualize Alpha
51:22 Pruning Part 2: Cross Validation
56:46 Build and Draw Final Tree
#StatQuest #ML #ClassificationTrees
We have data scientists out there. We have a "data artist" right here in this video.
I loved your Brazil polo shirt! Triple bam!!! Thank you for your videos. Regards from Brazil!
How do you know the order in which to list the display labels in the plot_confusion_matrix function?
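For context (not from the original thread): in scikit-learn the label order is not arbitrary. The display labels must follow the order of the classifier's classes_ attribute, which is the sorted order of the unique values in y. The video uses plot_confusion_matrix, which has since been removed from scikit-learn; the sketch below uses its replacement, ConfusionMatrixDisplay.from_estimator, and the data and label names are made-up assumptions.

```python
# A minimal sketch (not the video's code) showing where the label order
# comes from: display_labels must match clf.classes_, which scikit-learn
# stores in sorted order of the unique values in y.
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 1, 0, 1, 0, 1])  # hypothetical labels: 0 = no HD, 1 = has HD

clf = DecisionTreeClassifier().fit(X, y)
print(clf.classes_)  # [0 1] -> list the display labels in this order

ConfusionMatrixDisplay.from_estimator(
    clf, X, y, display_labels=["Does not have HD", "Has HD"]
)
```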
Hi Josh, thanks for the video again! I have some questions I hope you don't mind clarifying, regarding pruning and hyperparameter tuning in general. I see that the video does the following to find the best alpha:
1) After the train/test split, find the best alpha by comparing test and training accuracy (single split). @50:32
2) Recheck the best alpha by doing cross validation @52:33. This reveals huge variation in the accuracy, which implies that the best alpha is sensitive to the particular training set.
3) Redo the cross validation to find the best alpha by taking the mean accuracy for each alpha.
a) At step 2, do we still need to plot the training set accuracy to check for overfitting? (It is always mentioned that we should compare training and testing accuracy to check for overfitting.) But there is a debate on this as well: some argue that, given model A with 99%/90% training/test accuracy versus model B with 85%/85%, we should pick model A because its 90% test accuracy is higher, even though model B has no gap (i.e., no overfitting) between train and test. What's your thought on this?
b) What if I skip steps 1) and 2) and go straight to step 3)? Is that bad practice? Do I still need to plot the training accuracy to compare with the test accuracy if I skip steps 1 and 2? Thanks.
c) I always see that the final hyperparameter is decided by the highest mean accuracy across all k folds. Do we also need to consider the variance across the folds? Surely we don't want our accuracy to jump all over the place once in production. If yes, what is a general rule of thumb for deciding when the variance in accuracy is bad?
Sorry for the long post. Thanks!
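Regarding (c), one way to look at both the mean and the spread at once is sketched below. This loosely follows the video's approach but is not its exact code; the synthetic data and cv=5 are assumptions.

```python
# A minimal sketch of cross validating candidate ccp_alpha values and
# tracking both the mean and the spread of the per-fold accuracies.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate alphas from cost complexity pruning on the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
alphas = path.ccp_alphas[:-1]  # drop the last alpha (it prunes down to the root)

for alpha in alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    # A high mean with a low standard deviation is the goal; a high mean
    # with a large spread suggests the alpha is sensitive to the split.
    print(f"alpha={alpha:.4f}  mean={scores.mean():.3f}  std={scores.std():.3f}")
```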
Wow, this is super helpful!
Hi Josh, I have a question: at 1:01:03, if we interpret the tree, on the right split from the root node we first go from a node with a Gini score of 0.346 (cp_4.0 <= 0.5) to a node with a Gini score of 0.4999 (oldpeak <= 0.55). We learned that Gini scores are supposed to decrease as we descend the tree, so why does the Gini score increase here?
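A short worked example (not from the video) helps here: a split is chosen so that the weighted average of the children's impurities is lower than the parent's, but an individual child can still be more impure than the parent. The counts below are made up.

```python
# A tiny worked example showing that one child's Gini score can be higher
# than the parent's, as long as the *weighted average* of the children's
# Gini scores is lower. All counts are hypothetical.
def gini(pos, neg):
    total = pos + neg
    p = pos / total
    return 1 - p**2 - (1 - p)**2

# Hypothetical parent node: 80 positives, 20 negatives
parent = gini(80, 20)                  # 0.32
# Hypothetical split: left child is nearly pure, right child is mixed
left, n_left = gini(75, 5), 80         # ~0.117
right, n_right = gini(5, 15), 20       # 0.375 -> higher than the parent!
weighted = (n_left * left + n_right * right) / (n_left + n_right)
print(parent, left, right, weighted)   # weighted (~0.169) < parent (0.32)
```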
Thank you so much, love your videos
Hi Josh, such an awesome, helpful video, again! May I ask you a basic question? If I do an initial decision tree model build using a train/test split and evaluate the training and test accuracy scores, and then start over doing k-fold cross validation on the same training set and evaluate on the same test set as in the initial step, is that a proper method? Because I used the same test set for evaluation twice: first with the initial train/test split and second with cross validation. I read you should use your test (or hold-out) set only once... Last question: should you use exactly the same training/test split when comparing different algorithms (decision trees, random forests, logistic regression, kNN, etc.)? Thanks so much for a short feedback and quest on! Thanks and BAM!!!
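For what it's worth, the usual convention matches the commenter's reading: tune with cross validation on the training set only, and touch the test set exactly once at the end. A minimal sketch under those assumptions (the data, the parameter grid, and the use of GridSearchCV are illustrative, not the video's code):

```python
# All tuning happens via cross validation on the training set; the
# held-out test set is scored exactly once, at the very end.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
# Fixing random_state also lets you reuse the identical split when
# comparing different algorithms later.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.005, 0.01, 0.02]},
    cv=5,
)
search.fit(X_train, y_train)         # cross validation sees only the training set
print(search.best_params_)
print(search.score(X_test, y_test))  # the single, final look at the test set
```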
Awesome StatQuest! Great channel! Make more videos like this one for the other topics. Thank you for your time!
Greetings from Brazil!
Hi Josh, thank you so much for this awesome posting! Quick question: when doing the cross validation, should cross_val_score() use [X_train, y_train] or [X_encoded, y]? I'm wondering: if the point of cross validation is to let each chunk of the data set serve as testing data, should we then use the full data set (X_encoded and y) for the cross validation? Thank you!!
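A small sketch of the distinction (not from the video; the synthetic stand-ins for X_encoded and y are assumptions): cross validation here is used for model selection, so it should see only the training data. Folding the test rows in would let them influence the choice of alpha and make the final test accuracy look better than it is.

```python
# Cross validate on the training data only: each fold's held-out chunk
# comes from X_train, so the real test set never influences tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the video's X_encoded and y (any encoded feature matrix
# and labels behave the same way here).
X_encoded, y = make_classification(n_samples=300, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=1)

clf = DecisionTreeClassifier(random_state=1, ccp_alpha=0.01)
print(cross_val_score(clf, X_train, y_train, cv=5).mean())

# Passing X_encoded / y instead would mean the held-out test set helps
# pick the hyperparameters, biasing the final evaluation.
```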