Classification Trees in Python from Start to Finish

Published At: 2020-06-07
Tags: Josh Starmer, StatQuest, Machine Learning, Statistics, Data Science
NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: https://statquest.gumroad.com/l/tzxoh

This webinar was recorded on 2020-05-28 at 11:00am (New York time).

NOTE: This StatQuest assumes you are already familiar with:
Decision Trees: https://youtu.be/7VeUPuFGJHk
Cross Validation: https://youtu.be/fSytzGwwBVw
Confusion Matrices: https://youtu.be/Kdsp6soqA7o
Cost Complexity Pruning: https://youtu.be/D0efHEJsfHo
Bias and Variance and Overfitting: https://youtu.be/EuBBz3bI-aA

For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/

If you'd like to support StatQuest, please consider...

Buying my book, The StatQuest Illustrated Guide to Machine Learning:
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC

Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join

...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/

...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/

...or just donating to StatQuest!
https://www.paypal.me/statquest

Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
https://twitter.com/joshuastarmer

0:00 Awesome song and introduction
5:23 Import Modules
7:40 Import Data
11:18 Missing Data Part 1: Identifying
15:57 Missing Data Part 2: Dealing with it
21:16 Format Data Part 1: X and y
23:33 Format Data Part 2: One-Hot Encoding
37:29 Build Preliminary Tree
46:31 Pruning Part 1: Visualize Alpha
51:22 Pruning Part 2: Cross Validation
56:46 Build and Draw Final Tree

#StatQuest #ML #ClassificationTrees
Comments with generated responses
@randyluong6275
2022-03-16

We have data scientists out there. We have a "data artist" right in this video.

@statquest

Wow! Thank you!

VidTutorAssistant (GPT-4)

Haha, thanks @randyluong6275! I'm glad you see the artistry in our work. After all, visualization is a crucial aspect of data science. Whether it's creating decision trees or plotting data, we're always drawing. Stay tuned for more 'data art' sessions. Quest on!


@joaomanoellins2219
2020-05-29

I loved your Brazil polo shirt! Triple bam!!! Thank you for your videos. Regards from Brazil!

@statquest

Muito obrigado!!!

VidTutorAssistant (GPT-4)

Triple Bam!!! Thanks, @joaomanoellins2219! I'm thrilled to hear you loved my Brazil polo shirt. It's even cooler to know I've got friends and fans in Brazil. Keep questing on, and more videos will be on the way. Regards from here to Brazil!


@ProjectDataHub
2021-07-31

How do you know the order to list the display labels in the plot_confusion_matrix function?

@statquest

In this case the rows and columns are ordered, from low to high, by the thing we want to predict, which is 0 for "does not have heart disease" and 1 for "has heart disease". So the values I passed to "display labels" are in the same order.

VidTutorAssistant (GPT-4)

Hello @ProjectDataHub, thanks for your question! The order of the display labels in the plot_confusion_matrix function follows the sorted class values of the target (here 0, then 1), so the values you pass to display_labels must be listed in that same order. If you want to customize the row/column order itself, you can do so with the 'labels' parameter. Remember, the order of these labels dictates the order in which the outcomes are displayed in your confusion matrix. Let's keep the questions coming and keep questing on!
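
To make that ordering concrete, here is a minimal, self-contained sketch (not the video's exact code: it uses toy data in place of the heart-disease features, and ConfusionMatrixDisplay.from_estimator, the newer scikit-learn equivalent of plot_confusion_matrix):

# The confusion-matrix rows/columns follow the sorted class values of y
# (0 = "does not have HD", 1 = "has HD"), so display_labels must be listed
# in that same 0-then-1 order.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import ConfusionMatrixDisplay

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))                       # toy stand-in for the encoded features
y = (X[:, 0] + X[:, 1] > 0).astype(int)             # 0/1 target, like "no HD" / "HD"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf_dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

ConfusionMatrixDisplay.from_estimator(
    clf_dt, X_test, y_test,
    display_labels=["Does not have HD", "Has HD"],  # same order as classes 0, 1
)
plt.show()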


@josephgan1262
2021-09-08

Hi Josh, thanks for the video again!! I have some questions I hope you don't mind clarifying about pruning and hyperparameter tuning in general. I see that the video does the following to find the best alpha:

1) After the train/test split, find the best alpha by comparing test and training accuracy (single split). @50:32

2) Recheck the best alpha by doing CV @52:33. It turns out there is huge variation in the accuracy, which implies that alpha is sensitive to the particular training set.

3) Redo the CV to find the best alpha by taking the mean accuracy for each alpha.

a) At step 2, do we still need to plot the training set accuracy to check for overfitting? (It is always mentioned that we should compare training and testing accuracy to check for overfitting.) But there is a debate on this as well, where some argue that for a model A with 99%/90% training/test accuracy vs. a model B with 85%/85%, we should pick model A because its 90% testing accuracy is higher than 85%, even though model B has no gap (overfitting) between train and test. What's your thought on this?

b) What if I don't do steps 1) and 2) and go straight to step 3)? Is this bad practice? Do I still need to plot the training accuracy to compare with the test accuracy if I skip steps 1 and 2? Thanks.

c) I always see that the final hyperparameter is decided by the highest mean accuracy across all k folds. Do we need to consider the impact of variance across the k folds? Surely we don't want our accuracy to jump all over the place if the model is taken into production. If yes, what is the general rule of thumb for when the variance in accuracy is considered bad?

Sorry for the long post. Thanks!

@statquest

a) Ultimately the optimal model depends on a lot of things - and often domain knowledge is one of those things - so there are no hard rules and you have to be flexible about the model you pick.
b) You can skip the first two steps - those were just there to illustrate the need for using cross validation.
c) It's probably a good idea to also look at the variation.

VidTutorAssistant (GPT-4)

Hi @josephgan1262! I'm glad you found the video helpful and these are great questions!

a) Regarding your first question, absolutely! It's a good approach to plot and compare both training and testing accuracies to check for overfitting. You're right that there's often a trade-off between model accuracy and generalization. My perspective is a bit more nuanced: yes, a model with 99%/90% accuracy might prove more accurate in its predictions, but the gap does indicate overfitting, which could lead to poorer performance on different data sets. A model with a smaller gap between training and testing, like 85%/85%, might be less accurate but could be more reliable. It's a balancing act!

b) As for skipping steps 1) and 2) and going straight to step 3), it is not necessarily bad practice. The steps in this video follow a sequence designed to gradually build understanding: first comparing training and testing datasets, then using cross-validation to find the best alpha. If you already understand these concepts, you can start with cross-validation immediately. Nevertheless, as a general practice, plotting training and testing accuracies for a thorough comparison is typically helpful.

c) Finally, yes, variance in k-fold cross-validation results is indeed another important consideration. If your model's performance fluctuates wildly across different folds, it could be an indication that your model is sensitive to the particular data samples and might not generalize well. I don't have a strict rule of thumb for what constitutes "bad variance", but above all a good model should have stable performance.

I hope my answers were useful. Let's continue to quest on!
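
As a rough sketch of steps 2) and 3) above (toy data stands in for the video's training split; variable names are illustrative), you can cross-validate each candidate alpha and look at both the mean accuracy and its spread across folds:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                                             # toy features
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# candidate alphas from cost-complexity pruning of a full (unpruned) tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0, None)   # drop the last alpha (prunes to a single node)
                                                      # and guard against tiny negative rounding errors

rows = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    scores = cross_val_score(clf, X_train, y_train, cv=5)                 # 5-fold CV accuracy
    rows.append({"alpha": alpha, "mean_acc": scores.mean(), "std_acc": scores.std()})

results = pd.DataFrame(rows)
# a reasonable pick: a high mean accuracy whose std isn't wildly larger than its neighbours'
print(results.sort_values("mean_acc", ascending=False).head())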


@pfunknoondawg
2020-10-07

Wow, this is super helpful!

@statquest

Glad you think so!

VidTutorAssistant (GPT-4)

Hip hip hooray! I'm glad you found the video helpful, @pfunknoondawg! Stay tuned to the channel for more easy-to-understand stats and machine learning concepts. Don't forget to 'Quest On'!


@6223086
2021-07-24

Hi Josh, I have a question about 1:01:03. If we interpret the tree, on the right split from the root node we go from a node with a Gini score of 0.346 (cp_4.0 <= 0.5) to a node with a Gini score of 0.4999 (oldpeak <= 0.55). We learned that Gini scores are supposed to decrease as we descend the tree, so why did the Gini score increase here?

Thank you so much, love ur videos

@statquest

For each split we calculate the Weighted Average of the individual Gini scores for each leaf and we pick the one with the lowest weighted average. In this case, although the leaf on the left has a higher Gini score than the node above it, it has fewer samples, 31, than the leaf on the right, which has a much lower Gini score, 0.126, and more samples, 59. If we calculate the weighted average of the Gini scores for these two leaves it will be lower than the node above them.

VidTutorAssistant (GPT-4)

Hi @6223086, fantastic question! You're right in that we usually expect the Gini Score to decrease as we descend the tree because splits are typically made to purify the child nodes. However, this is not always the case.

Rather than looking for an instantaneous decrease of the Gini Score from one step to the next, consider the process in aggregate over the whole tree. In other words, the goal is for the leaf nodes at the end of the tree to have lower Gini Scores (higher purity) compared to the root node. So, while local increases in the Gini Score like this might seem perplexing, what's important is that our tree as a whole is doing a good job splitting the data into purer child nodes.

I hope that clears things up a bit. Thanks for your engagement and please keep those insightful questions coming! And as always, until next time, quest on.
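
For what it's worth, plugging the numbers quoted above into the weighted average (taking the 0.4999 from the question as the left leaf's Gini score) shows the split does reduce impurity overall:

# left leaf: Gini ~0.4999 with 31 samples; right leaf: Gini 0.126 with 59 samples
gini_left, n_left = 0.4999, 31
gini_right, n_right = 0.126, 59

weighted_gini = (n_left * gini_left + n_right * gini_right) / (n_left + n_right)
print(round(weighted_gini, 3))   # ~0.255, lower than the parent node's 0.346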


@Nico.75
2020-08-25

Hi Josh, such an awesome, helpful video, again! May I ask you a basic question? When I do an initial decision tree model build using a train/test split and evaluate the training and test accuracy scores, and then start over doing k-fold cross-validation on the same training set and evaluate it on the same test set as in the initial step, is that a proper method? I used the same test set for evaluation twice, first with the initial train/test split and second with the cross-validation method, and I read you should use your test (or hold-out) set only once… Last question: should you use the exact same training/test set for comparing different algorithms (decision trees, random forests, logistic regression, kNN, etc.)? Thanks so much for a short feedback and quest on! Thanks and BAM!!!

@statquest

Yes, I think it's OK to use the same testing set to compare the model before optimization and after optimization.
Ideally, if you are comparing different algorithms, you will use cross validation and pick the one that has the best, on average, score. Think of picking an algorithm like picking a hyperparameter.

VidTutorAssistant (GPT-4)

Hey @Nico.75, thanks for the awesome feedback and don't worry, there are no basic questions here! We're all learning together. You bring up a great point about using your hold-out (test) set only once. Using it more than once can lead to "data leakage", where your model gets too familiar with the test set, which may eventually lead to an over-optimistic performance measure. Typically, when you're tuning your model parameters (like our alpha in pruning), you'd want to use a separate validation set, or use techniques like k-fold cross-validation on your training set.

To answer your second question - yes, you should compare different algorithms (e.g., decision trees vs. random forests) using the same train/test split. This ensures that the comparisons between models are fair and made under the same circumstances, giving you a reliable comparison. You're on track, keep going. Quest on and BAM!!!
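
To illustrate "picking an algorithm like picking a hyperparameter", here is a rough sketch that scores several candidate models with the same cross-validation splits on the training data (toy data stands in for the encoded heart-disease features, and the model list is just illustrative):

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)   # same folds for every model
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")

The held-out test set is then used only once, for a final check of whichever model wins.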


@robertmitru7234
2021-01-01

Awesome StatQuest! Great channel! Make more videos like this one for the other topics. Thank you for your time!

@statquest

Thanks! Will do!

VidTutorAssistant (GPT-4)

Thanks a ton, @robertmitru7234! I'm thrilled that you found the video helpful. It's comments like yours that really make it worthwhile. I'll certainly continue to make more tutorials on a wide range of topics, just like this one. Until then, keep questing! Cheers.


@fernandosicos
2022-04-12

Greetings from Brazil!

@statquest

Muito obrigado! :)

VidTutorAssistant (GPT-4)

Hello @fernandosicos! Greetings from StatQuest! Glad to have you watching from Brazil. I hope you found the video useful and that it made learning decision trees with Python easier. If you have any questions, don't hesitate to ask. Quest on!


@alexyuan1622
2020-09-27

Hi Josh, thank you so much for this awesome posting! Quick question: when doing the cross-validation, should cross_val_score() use [X_train, y_train] or [X_encoded, y]? I'm wondering, if the point of doing cross-validation is to let each chunk of the data set be the testing data, should we then use the full data set, X_encoded and y, for the cross-validation? Thank you!!

@statquest

There are different ideas about how to do this, and they depend on how much data you have. If you have a lot of data, it is common to hold out a portion of the data to only be used for the final evaluation of the model (after optimizing and cross validation) as demonstrated here. When you have less data, it might make sense to use all of the data for cross validation.

VidTutorAssistant (GPT-4)

Hey @alexyuan1622! That's an excellent question and I'm thrilled you're delving into the details. The function cross_val_score() in our current context is applied on the training set [X_train, y_train]. This is because it helps us tune parameters, compare and select a model, and then see how it performs. That is, in essence, what cross-validation is about: it assists us in determining how well our model does on unseen data while we're still in the model training phase. The training set, hence, is split into a number of 'folds', and each fold in turn is used as a validation set while the others are used for training.

However, you bring up an interesting point about using the entire data set [X_encoded, y]. Generally, it's best practice to reserve some data solely for the final test, to simulate an unbiased representation of future data. If we included our entire data set in the cross validation process, it could create information leakage, where our final test set had some influence on the model training. By keeping a separate test set, you ensure an 'unseen' data set for your final model test. Great thought, and keep them coming! Quest On!
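
As a rough sketch of that workflow (toy data stands in for the one-hot encoded heart-disease features; the 0.01 alpha is just a placeholder), cross-validation touches only the training portion, and the held-out test set is used exactly once at the end:

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_encoded = rng.normal(size=(300, 6))                               # stand-in for the encoded features
y = (X_encoded[:, 0] - X_encoded[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=42)

clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=0.01)    # placeholder alpha

# cross-validation uses only the training data...
cv_scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
print("CV accuracy:", round(cv_scores.mean(), 3))

# ...and the test set is used exactly once, at the very end
clf_dt.fit(X_train, y_train)
print("Held-out test accuracy:", round(clf_dt.score(X_test, y_test), 3))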