Last January I decided to compete in my first Kaggle competition, ELO Merchant Category Recommendation. The competition was hosted by the Brazilian payments company ELO, and competitors were tasked with predicting a customer loyalty score from previous credit card transactions. Competing taught me a ton, not just about effective competition techniques but also about how the Kaggle community functions as a whole. The entire project is available on GitHub, and if you're interested I encourage you to check it out. The following are four things I wish I knew before I entered the competition:
1) The Kaggle community is amazing.
Seriously, I was blown away by how great the community of data scientists and data enthusiasts is on Kaggle. The community is incredibly collaborative and is an amazing resource for almost any data science question you might have. Initially, because of the competitive nature of the problem, I was worried that most people wouldn't be willing to share much information. The reality, however, is that almost everyone on Kaggle is simply there to learn. People might keep a finding to themselves if it is particularly useful and could majorly affect leaderboard positions, but more often than not people are happy to help you improve your models.
2) Try to avoid public kernels in the early stages.
For each competition there are public kernels available where people share some of their EDA and present baseline models to the community. This is a great way to learn a lot about the data set in a relatively short period of time; however, I recommend you hold off until you have done your own EDA and built your own baseline. Diving right into public kernels biased me towards their solutions and made it more difficult to come to my own conclusions. By holding off in the initial stages you will not only improve your own skills, but you will also likely reach conclusions beyond those found in the public kernels. By the end you will have your own solid understanding of the data that you can compare and contrast with the public kernels, which will help you build a much more complete picture altogether.
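For reference, that first pass doesn't need to be elaborate. Here is a sketch of what an initial look might cover in pandas; the file name train.csv and the "target" column are assumptions based on the competition's data layout, so adjust them to whatever you actually downloaded.

```python
import pandas as pd

# Hypothetical file name; in the ELO competition the training file
# carries the loyalty score in a "target" column.
train = pd.read_csv("train.csv")

# First-pass EDA: size, types, missingness, and target distribution
print(train.shape)
print(train.dtypes)
print(train.isnull().sum().sort_values(ascending=False).head(10))
print(train["target"].describe())
```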
3) Nothing is going to work and that’s okay.
This is something you will hear a lot from people on Kaggle. If one in ten ideas works, you should consider yourself very lucky. The point is to always keep trying new things and, whether they work or not, to understand why. I can't tell you the number of times I thought I had the next big thing to push me up the leaderboard, only to have it lower my overall score. The important thing is to not get discouraged and to treat every failed idea as a learning experience.
4) Trust your cross validation.
Setting up a cross validation score you can trust is one of the most important steps in any Kaggle competition. If you don't have a way to reliably validate your model, it's going to be exceptionally difficult to make any sort of progress. Kaggle lets you check your results against a portion of the test set, but the more you check it the more you risk overfitting to the public leaderboard. To give you some insight: going with the solution that had the highest CV score placed me in the top 3%, while the solution that did best on the public test set didn't even make the top 10%. Because Kaggle lets you submit two predictions, it's generally recommended that you submit the one with the best CV score and the one that scores best on the public test set. This way you should be reasonably protected from leaderboard shake-ups.
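To make this concrete, here is a minimal sketch of an out-of-fold CV setup in scikit-learn. The random placeholder data, the GradientBoostingRegressor, and RMSE as the metric are assumptions for illustration (RMSE was the metric in this competition); in practice X and y would come from your engineered features and the loyalty target.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Placeholder data so the sketch runs on its own; substitute your
# engineered features (X) and the competition's loyalty target (y).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))
y = rng.normal(size=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
oof_preds = np.zeros(len(y))  # out-of-fold predictions

for train_idx, val_idx in kf.split(X):
    model = GradientBoostingRegressor(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    oof_preds[val_idx] = model.predict(X[val_idx])

# A single RMSE over all out-of-fold predictions is the CV score you
# track from one experiment to the next.
print(f"CV RMSE: {np.sqrt(mean_squared_error(y, oof_preds)):.4f}")
```

The key property is that every training row is scored exactly once by a model that never saw it, so the CV number moves for the same reasons the private leaderboard will.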
Wrap-up:
These are just a few things I wish I knew going into my first competition. The overall experience was still very positive and I learned a ton. For me, Kaggle will continue to be a way to grow my skills and interact with the data science community. It's one of the best online communities around, and I am happy to be a part of it.