The integrated classifier uses functional and semantic data to predict which genes within T2D GWAS loci are likely relevant to T2D risk.
Starting from genetic credible sets, variants were mapped to genes and annotated with functional genomic evidence such as ChIP-seq, ATAC-seq, and transcription factor binding sites. These variant annotations were then combined with gene-based annotations, expression in T2D-related tissues, and a neural-network-derived score measuring the strength of evidence in the published literature on T2D etiology. A random forest model was then trained on genes known to influence T2D risk through perturbation of the gene product by coding variation, dysregulation of expression by non-coding variation, and influence on related traits such as insulin secretion and fasting insulin levels.
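As a rough illustration of how variant-level annotations might be rolled up into gene-level features, the sketch below weights each variant's annotations by its posterior inclusion probability (PIP) from the credible set and sums them per gene. The PIP weighting, the `Variant` type, the annotation names, and the example values are all assumptions made for illustration; the description above does not specify how the aggregation was done.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Variant:
    """Hypothetical credible-set variant record (illustrative only)."""
    variant_id: str
    gene: str    # gene the variant was mapped to
    pip: float   # posterior inclusion probability within the credible set
    annotations: dict = field(default_factory=dict)  # e.g. {"atac_seq": 1.0}


def gene_features(variants):
    """Sum PIP-weighted variant annotations per gene (an assumed scheme)."""
    features = defaultdict(lambda: defaultdict(float))
    for v in variants:
        for name, value in v.annotations.items():
            features[v.gene][name] += v.pip * value
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {gene: dict(feats) for gene, feats in features.items()}


# Made-up example data, not taken from the study.
credible_set = [
    Variant("var1", "GENE_A", 0.5, {"atac_seq": 1.0, "tf_binding": 1.0}),
    Variant("var2", "GENE_A", 0.25, {"atac_seq": 1.0}),
    Variant("var3", "GENE_B", 0.25, {"chip_seq": 1.0}),
]

features = gene_features(credible_set)
```

In a sketch like this, the resulting per-gene dictionaries would be joined with the gene-based annotations, tissue expression, and literature score to form one feature row per gene.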
Predictions were then made for each gene within a T2D locus that did not harbor a known T2D effector gene, resulting in a probability that a given gene is relevant to T2D risk. For each prediction, the most influential model features were extracted.
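The selection and reporting steps can be sketched as follows: keep only loci with no known effector gene, then report each candidate gene's predicted probability alongside its most influential features. The sketch assumes a trained model has already produced a probability and per-feature contribution scores for each gene. The data layout, the helper names, and all values are hypothetical, and ranking by absolute contribution is one possible choice, not necessarily the method used here.

```python
def loci_without_known_effector(locus_genes, known_effectors):
    """Keep only loci in which no gene is a known effector gene."""
    return {
        locus: genes
        for locus, genes in locus_genes.items()
        if not any(gene in known_effectors for gene in genes)
    }


def top_features(contributions, k=2):
    """Return the k feature names with the largest absolute contribution."""
    ranked = sorted(
        contributions,
        key=lambda name: abs(contributions[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up inputs, not taken from the study.
locus_genes = {
    "locus_1": ["GENE_A", "GENE_B"],
    "locus_2": ["GENE_C", "GENE_D"],  # excluded below: GENE_C is a known effector
}
known_effectors = {"GENE_C"}
predictions = {
    "GENE_A": {"prob": 0.8,
               "contrib": {"expression": 0.3, "literature": 0.4, "atac_seq": -0.1}},
    "GENE_B": {"prob": 0.1,
               "contrib": {"expression": -0.2, "literature": 0.05, "atac_seq": 0.1}},
}

# Map each candidate gene to (probability, most influential features).
report = {}
for locus, genes in loci_without_known_effector(locus_genes, known_effectors).items():
    for gene in genes:
        pred = predictions[gene]
        report[gene] = (pred["prob"], top_features(pred["contrib"]))
```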
A preprint with a full description of the input data, method, and prediction results will be posted to medRxiv.