In many domains such as medicine, training data is in short supply. In such
cases, external knowledge is often helpful in building predictive models. We
propose a novel method to incorporate publicly available domain expertise to
build accurate models. Specifically, we use word2vec models trained on a
domain-specific corpus to estimate the relevance of each feature's text
description to the prediction problem. We use these relevance estimates to
rescale the features, causing more important features to experience weaker
We apply our method to predict the onset of five chronic diseases in the next
five years in two genders and two age groups. Our rescaling approach improves
the accuracy of the model, particularly when there are few positive examples.
Furthermore, our method selects 60% fewer features, easing interpretation by
physicians. Our method is applicable to other domains where feature and outcome
descriptions are available.