Kaggle 技术秘籍.pdf (Kaggle Technical Secrets)

Winning Kaggle Competitions
Hendrik Jacob van Veen - Nubank Brasil

About Kaggle
  Biggest platform for competitive data science in the world
  Currently 500k+ competitors
  Great platform to learn about the latest techniques and how to avoid overfitting
  Great platform to share and meet up with other data freaks

Approach
  Get a good score as fast as possible
  Using versatile libraries
  Model ensembling

Get a good score as fast as possible
  Get the raw data into a universal format like SVMlight or NumPy arrays.
  Failing fast and failing often / Agile sprint / Iteration
  Sub-linear debugging:
    “output enough intermediate information as a calculation is progressing to determine before it finishes whether you've injected a major defect or a significant improvement.” (Paul Mineiro)
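
As a rough illustration of the sub-linear debugging idea quoted above, the sketch below trains a toy one-parameter model and prints the holdout error every few epochs, so a defect or an improvement shows up long before the run finishes. The function name, the toy data (y = 3x), and the learning rate are all assumptions made for this example; none of them come from the slides.

```python
import random


def train_with_progress(n_epochs=20, report_every=5, lr=0.01, seed=0):
    """Toy SGD fit of y = w * x that reports holdout error while it runs."""
    rng = random.Random(seed)
    # Synthetic, noise-free data: the true weight is 3.0 (assumption for the demo).
    data = [(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(200))]
    train, holdout = data[:150], data[150:]

    w = 0.0
    history = []  # (epoch, holdout MSE) pairs, kept for later inspection
    for epoch in range(1, n_epochs + 1):
        for x, y in train:
            # Gradient step on the squared error (w * x - y) ** 2.
            w -= lr * 2 * (w * x - y) * x
        if epoch == 1 or epoch % report_every == 0:
            mse = sum((w * x - y) ** 2 for x, y in holdout) / len(holdout)
            history.append((epoch, mse))
            # Intermediate output: enough to judge the run before it ends.
            print(f"epoch {epoch:3d}  holdout MSE {mse:.6f}")
    return w, history


if __name__ == "__main__":
    train_with_progress()
```

If the printed holdout error stops falling or jumps after a code change, the run can be aborted early instead of waiting for the final score.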

Using versatile libraries
  Scikit-learn
  Vowpal Wabbit
  XGBoost
  Keras
  Other tools get Scikit-learn API wrappers
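
One way to read "Scikit-learn API wrappers" is a thin class that exposes an external tool through the fit/predict/get_params/set_params conventions that scikit-learn estimators follow. The sketch below is only an illustration: `external_train` and `external_predict` are hypothetical stand-ins for whatever calls a real tool provides, and the class does not import or depend on scikit-learn itself.

```python
# Hypothetical stand-ins for an external tool's own training/prediction calls.
def external_train(rows, labels, shrinkage):
    """Pretend tool: learns one constant, shrunk toward zero."""
    mean = sum(labels) / len(labels)
    return {"constant": mean * (1.0 - shrinkage)}


def external_predict(model, rows):
    """Pretend tool: predicts the learned constant for every row."""
    return [model["constant"] for _ in rows]


class ExternalToolRegressor:
    """Wrapper following scikit-learn estimator conventions (duck-typed)."""

    def __init__(self, shrinkage=0.0):
        # Hyperparameters are stored unchanged under their own names.
        self.shrinkage = shrinkage

    def get_params(self, deep=True):
        return {"shrinkage": self.shrinkage}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Learned state gets a trailing underscore, as in scikit-learn.
        self.model_ = external_train(X, y, self.shrinkage)
        return self  # returning self allows call chaining

    def predict(self, X):
        return external_predict(self.model_, X)
```

With this shape, different tools can be swapped in and out of the same pipeline code, since each one is driven through the same small set of methods.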

Model Ensembling
  Voting
  Averaging
  Bagging
  Boosting
  Binning
  Blending
  Stacking
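
The two simplest entries in the list above, voting and averaging, can be sketched in a few lines. This is a minimal illustration in plain Python, assuming each model's predictions arrive as a list with one entry per sample; the function names and the optional `weights` parameter are choices made for this example, not something the slides specify.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine class labels: pick the most common label for each sample.

    predictions: list of per-model label lists, all of the same length.
    Ties go to the label that appears first among the models.
    """
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]


def average(predictions, weights=None):
    """Combine numeric predictions (e.g. probabilities) by weighted mean.

    weights: one non-negative weight per model; equal weights if omitted.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, sample)) / total
        for sample in zip(*predictions)
    ]


if __name__ == "__main__":
    labels = [[1, 0, 0], [1, 1, 0], [0, 1, 0]]  # three models, three samples
    print(majority_vote(labels))
    probs = [[0.9, 0.2], [0.7, 0.4]]  # two models, two samples
    print(average(probs))
    print(average(probs, weights=[3, 1]))
```

Voting suits hard class labels, while averaging suits scores or probabilities; the weights give better-performing models more say.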

General Strategy
  Try to create “machine learning”-learning algorithms with optimized pipelines that are:
    Data agnostic (Sparse, dense, missing values, larger than memory)
    Problem agnostic (Classification, regression, clustering)
    Solution agnostic (Production-ready, PoC, latency)
    Automated (Turn on and go to bed)
    Memory-friendly (Don’t want to pay for AWS)
    Robust (Good generalization, concept drift, consistent)

First Overview I
  Classification? Regression?
  Evaluation Metric
  Description
  Benchmark code
  “Predict human activities based on their smartphone usage. Predict if a user is sitting, walking etc.” - Smartphone User Activity Prediction
  “Given the HTML of ~337k websites served to users of StumbleUpon, identify the paid content disguised as real content.” - Dato Truly Native?