Winning Kaggle
Competitions
Hendrik Jacob van Veen - Nubank Brasil
About Kaggle
Biggest platform for competitive data science in the world
Currently 500k+ competitors
Great platform to learn about the latest techniques and to avoid overfitting
Great platform to share and meet up with other data freaks
Approach
Get a good score as fast as possible
Using versatile libraries
Model ensembling
Get a good score as fast as possible
Get the raw data into a universal format like SVMlight or NumPy arrays.
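As a sketch of this step using scikit-learn's built-in SVMlight helpers (the toy array stands in for real parsed competition data):

```python
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Toy matrix standing in for parsed raw competition data.
X = np.array([[1.0, 0.0, 3.5],
              [0.0, 2.0, 0.0]])
y = np.array([1, 0])

# Persist once in SVMlight format; later experiments just load this file.
dump_svmlight_file(X, y, "train.svmlight")

# Comes back as a SciPy sparse matrix usable across the toolchain
# (scikit-learn, XGBoost, and VW all read this format).
X_loaded, y_loaded = load_svmlight_file("train.svmlight")
```

Converting once up front means every later iteration skips the parsing step, which supports the fail-fast loop described next.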
Failing fast and failing often / Agile sprint / Iteration
Sub-linear debugging:
“output enough intermediate information as a calculation is progressing to determine before it finishes whether you've injected a major defect or a significant improvement.” - Paul Mineiro
Using versatile libraries
Scikit-learn
Vowpal Wabbit
XGBoost
Keras
Other tools get Scikit-learn API wrappers
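The pattern behind such wrappers is small: inherit from `BaseEstimator` and expose `fit`/`predict`. A minimal sketch (the `MedianRegressor` below is a made-up stand-in for an external tool being wrapped):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score

class MedianRegressor(BaseEstimator, RegressorMixin):
    """Toy estimator that predicts the training median everywhere.
    The same fit/predict skeleton is how external tools get wrapped."""

    def fit(self, X, y):
        self.median_ = np.median(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.median_)

# Because it follows the scikit-learn API, it composes with the
# rest of the ecosystem, e.g. cross-validation:
X = np.arange(20).reshape(-1, 1)
y = np.arange(20, dtype=float)
scores = cross_val_score(MedianRegressor(), X, y, cv=2)
```

Once a tool speaks this interface, it plugs into pipelines, grid search, and the ensembling machinery below without special-casing.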
Model Ensembling
Voting
Averaging
Bagging
Boosting
Binning
Blending
Stacking
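Averaging and voting, the simplest of these, can be sketched in a few lines (the probability vectors below are toy stand-ins for real model outputs):

```python
import numpy as np

# Predicted class-1 probabilities from three hypothetical models,
# one row per model, one column per test sample.
preds = np.array([[0.9, 0.1, 0.8],
                  [0.7, 0.3, 0.6],
                  [0.8, 0.2, 0.9]])

# Averaging: mean of the predicted probabilities per sample.
avg = preds.mean(axis=0)

# Voting: threshold each model at 0.5, then take the majority class.
votes = (preds > 0.5).sum(axis=0)
majority = (votes > preds.shape[0] / 2).astype(int)
```

The other methods build on the same idea: bagging and boosting vary the training data or weights, while blending and stacking train a second-level model on these first-level predictions.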
General Strategy
Try to create “machine learning”-learning algorithms with optimized pipelines that are:
Data agnostic (Sparse, dense, missing values, larger than memory)
Problem agnostic (Classification, regression, clustering)
Solution agnostic (Production-ready, PoC, latency)
Automated (Turn on and go to bed)
Memory-friendly (Don’t want to pay for AWS)
Robust (Good generalization, concept drift, consistent)
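One concrete way to cover part of the data-agnostic requirement with scikit-learn is a pipeline that tolerates missing values and keeps sparse input sparse (a sketch on toy data; the component choices are illustrative, not the author's exact setup):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data with missing values standing in for messy raw features.
X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 1.5],
              [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

pipe = make_pipeline(
    SimpleImputer(strategy="median"),   # handles missing values
    StandardScaler(with_mean=False),    # with_mean=False avoids densifying sparse input
    LogisticRegression(),
)
pipe.fit(X, y)
```

Wrapping every preprocessing decision inside the pipeline is also what makes the “turn on and go to bed” automation feasible: one object to fit, tune, and persist.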
First Overview I
Classification? Regression?
Evaluation Metric
Description
Benchmark code
“Predict human activities based on their smartphone usage. Predict if a user is sitting, walking, etc.” - Smartphone User Activity Prediction
“Given the HTML of ~337k websites served to users of StumbleUpon, identify the paid content disguised as real content.” - Dato Truly Native?