Jeremy Howard on winning the Predict Grant Applications Competition

Kaggle Team · Kaggle Blog · Feb 21, 2011


Because I have recently started employment with Kaggle, I am not eligible to win any prizes, which means the prize-winner for this comp is Quan Sun (team ‘student1’). Congratulations!

My approach to this competition was to first analyze the data in Excel PivotTables, looking for groups with unusually high or low application success rates. In this way I found a large number of strong predictors, including date-based ones (New Year’s Day is a strong predictor, as are applications processed on a Sunday), and for many fields a null value was highly predictive.
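In code terms, that pivot-table analysis boils down to grouping applications and comparing success rates per group. Here is a minimal C# sketch of the same question for day of week; the Application fields and the CSV-loading step are placeholders, not the competition’s actual column names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record for one row of the competition data; the real
// file has many more columns and different names.
class Application
{
    public DateTime StartDate;
    public bool Successful;
}

class PivotSketch
{
    static void Main()
    {
        var apps = new List<Application>(); // populate from the competition CSV

        // Group by day of week and compare success rates: the same
        // question a pivot table answers.
        var byDay = apps
            .GroupBy(a => a.StartDate.DayOfWeek)
            .Select(g => new
            {
                Day = g.Key,
                N = g.Count(),
                Rate = g.Average(a => a.Successful ? 1.0 : 0.0)
            })
            .OrderByDescending(x => x.Rate);

        foreach (var row in byDay)
            Console.WriteLine("{0}: {1:P1} ({2} applications)", row.Day, row.Rate, row.N);
    }
}
```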

I then used C# to normalize the data into Grants and Persons objects, and constructed a dataset for modeling with these features: CatCode, NumPerPerson, PersonId, NumOnDate, AnyHasPhd, Country, Dept, DayOfWeek, HasPhd, IsNY, Month, NoClass, NoSpons, RFCD, Role, SEO, Sponsor, ValueBand, HasID, AnyHasID, AnyHasSucc, HasSucc, People.Count, AStarPapers, APapers, BPapers, CPapers, Papers, MaxAStarPapers, MaxCPapers, MaxPapers, NumSucc, NumUnsucc, MinNumSucc, MinNumUnsucc, PctRFCD, PctSEO, MaxYearBirth, MinYearUni, YearBirth, YearUni.
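A minimal sketch of what those normalized objects might look like; the field layouts here are assumptions, and the real classes carried many more columns from the raw data:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the normalized objects (member names beyond the feature
// list above are guesses, not the actual library code).
class Person
{
    public int PersonId;
    public bool HasPhd;
    public int APapers;     // A-ranked journal articles
    public int NumSucc;     // prior successful grants
    public int? YearBirth;  // nullable: missing for many people
}

class Grant
{
    public int GrantId;
    public DateTime StartDate;
    public string Sponsor;
    public string CatCode;
    public bool Successful; // the target variable
    public List<Person> People = new List<Person>();
}
```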

Most of these names are self-explanatory. Field names starting with ‘Any’ are true if any person attached to the grant has that feature (e.g. ‘AnyHasPhd’). For most fields I had one predictor that looks only at person 1 (e.g. ‘APapers’ is the number of A papers from person 1), and one for the maximum over all people on the application (e.g. ‘MaxAPapers’).
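Building on the Grant and Person sketch above, the person-1 and any/max feature variants could be computed along these lines (again a sketch, not the actual library code):

```csharp
using System.Collections.Generic;
using System.Linq;

static class FeatureSketch
{
    // One value taken from person 1, and one aggregated over all
    // people on the application, as described above.
    public static Dictionary<string, object> Build(Grant g)
    {
        var first = g.People.FirstOrDefault();
        return new Dictionary<string, object>
        {
            ["HasPhd"]       = first != null && first.HasPhd,
            ["AnyHasPhd"]    = g.People.Any(p => p.HasPhd),
            ["APapers"]      = first != null ? first.APapers : 0,
            ["MaxAPapers"]   = g.People.Select(p => p.APapers).DefaultIfEmpty(0).Max(),
            ["People.Count"] = g.People.Count,
        };
    }
}
```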

Once I had created these features, I used a generalization of the random forest algorithm to build a model. I’ll try to write up some details on how this algorithm works when I have more time, but really, the difference between it and a regular random forest is not that great.
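Since the post doesn’t describe the generalization, a regular random forest is the reference point: bootstrap-sampled trees, each splitting on a random subset of features. A rough sketch of that baseline follows; for brevity it tries one random threshold per candidate feature rather than searching all of them, and every name here is an assumption, not the competition code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Node
{
    public int Feature = -1;  // split feature index; -1 marks a leaf
    public double Threshold;
    public double LeafValue;  // mean target at the leaf
    public Node Left, Right;
}

class RandomForestSketch
{
    readonly List<Node> trees = new List<Node>();
    readonly Random rng = new Random(0);

    public void Fit(double[][] X, double[] y, int nTrees = 100, int maxDepth = 6)
    {
        int n = X.Length;
        for (int t = 0; t < nTrees; t++)
        {
            // Bootstrap sample of the training rows.
            var idx = Enumerable.Range(0, n).Select(_ => rng.Next(n)).ToArray();
            trees.Add(Grow(idx.Select(i => X[i]).ToArray(),
                           idx.Select(i => y[i]).ToArray(), maxDepth));
        }
    }

    Node Grow(double[][] X, double[] y, int depth)
    {
        var node = new Node { LeafValue = y.Average() };
        if (depth == 0 || y.Length < 5) return node;

        // Random subspace: consider only sqrt(nFeat) random features per split.
        int nFeat = X[0].Length;
        int k = Math.Max(1, (int)Math.Sqrt(nFeat));
        var feats = Enumerable.Range(0, nFeat).OrderBy(_ => rng.Next()).Take(k);

        double bestScore = double.MaxValue;
        foreach (int f in feats)
        {
            double thr = X[rng.Next(X.Length)][f]; // one random threshold per feature
            var left = Enumerable.Range(0, X.Length).Where(i => X[i][f] < thr).ToArray();
            var right = Enumerable.Range(0, X.Length).Where(i => X[i][f] >= thr).ToArray();
            if (left.Length == 0 || right.Length == 0) continue;
            double score = Sse(left.Select(i => y[i])) + Sse(right.Select(i => y[i]));
            if (score < bestScore) { bestScore = score; node.Feature = f; node.Threshold = thr; }
        }
        if (node.Feature < 0) return node;

        var l = Enumerable.Range(0, X.Length).Where(i => X[i][node.Feature] < node.Threshold).ToArray();
        var r = Enumerable.Range(0, X.Length).Where(i => X[i][node.Feature] >= node.Threshold).ToArray();
        node.Left = Grow(l.Select(i => X[i]).ToArray(), l.Select(i => y[i]).ToArray(), depth - 1);
        node.Right = Grow(r.Select(i => X[i]).ToArray(), r.Select(i => y[i]).ToArray(), depth - 1);
        return node;
    }

    static double Sse(IEnumerable<double> ys)
    {
        var a = ys.ToArray();
        double m = a.Average();
        return a.Sum(v => (v - m) * (v - m));
    }

    // Average the trees' predictions (a probability for a 0/1 target).
    public double Predict(double[] x) => trees.Average(t => PredictTree(t, x));

    static double PredictTree(Node n, double[] x) =>
        n.Feature < 0 ? n.LeafValue
                      : PredictTree(x[n.Feature] < n.Threshold ? n.Left : n.Right, x);
}
```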

I pre-processed the data before running it through the model in two ways: small categories in categorical variables were grouped together, and each continuous column containing nulls was replaced with two columns (a binary predictor that is true exactly where the original value is null, and the original column with nulls replaced by the median). Other than the Excel PivotTables at the start, all the pre-processing and modeling was done in C#, using libraries I developed during this competition. I hope to document and release these libraries at some point, perhaps after tuning them in future comps.
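Both transformations are simple to express in C#. A sketch of the two steps, assuming column-at-a-time arrays and a guessed rarity threshold (the post doesn’t say what cutoff was used):

```csharp
using System.Collections.Generic;
using System.Linq;

static class PreprocessSketch
{
    // Merge rare levels of a categorical column into one "Other" bucket.
    public static string[] GroupSmallLevels(string[] col, int minCount = 20)
    {
        var clean = col.Select(v => v ?? "Null").ToArray(); // nulls become their own level
        var counts = clean.GroupBy(v => v).ToDictionary(g => g.Key, g => g.Count());
        return clean.Select(v => counts[v] >= minCount ? v : "Other").ToArray();
    }

    // Replace a continuous column containing nulls with two columns: a binary
    // is-null indicator, and the original values with nulls set to the median.
    public static (bool[] IsNull, double[] Filled) SplitNullColumn(double?[] col)
    {
        var present = col.Where(v => v.HasValue).Select(v => v.Value)
                         .OrderBy(v => v).ToList();
        double median = present.Count == 0 ? 0.0 : present[present.Count / 2]; // simple upper median
        return (col.Select(v => !v.HasValue).ToArray(),
                col.Select(v => v ?? median).ToArray());
    }
}
```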

Originally published at blog.kaggle.com on February 21, 2011.
