Mining insights from GWAS in the CDKP

Genetic association data from genome-wide association studies (GWAS) are foundational for our understanding of complex diseases and traits. But in order to apply these results to diagnosis, drug development, and treatment, we need to identify the effector genes that explain those genetic associations. This is rarely straightforward: most SNPs associated with disease are located outside of coding regions of the genome, so that their impact on genes is not obvious; and even a variant located in a protein-coding gene may actually affect a different gene. And to complicate things further, a variant that is strongly associated with disease may not have a direct impact on a gene, but may rather be "along for the ride" with a tightly linked causal variant.

To help bridge the gap between genetic association results and the effector genes that are directly involved in disease, we are aggregating additional data types (for example, transcriptional regulation, tissue specificity, and curated biological annotations, among others) and integrating them with computational methods in order to mine insights from GWAS data. We present the results of these methods in interactive FOCUS (Find Orthogonal Computational Support) tables.

As a first step in implementing these methods, we needed a way to store many different connections between variants, genes, tissues, phenotypes, and biological annotations. We decided to use a Neo4j graph database, which holds data as nodes and the relationships between them and can support complex, scientifically meaningful queries.


Neo4j graph showing variants on chromosome 8 that are associated with glycemic phenotypes. Orange circles represent variants; pink, p-values; blue, phenotypes; red, phenotype groups; green and brown, variant annotations.
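
To make the graph model more concrete, here is a minimal in-memory sketch in Python of the kind of node-and-relationship structure described above. It does not use Neo4j and does not reflect the portal's actual schema: the node labels, relationship names, identifiers, and p-values are all hypothetical placeholders. As a simplification, p-values are stored as properties on the association edges rather than as separate nodes as in the figure.

```python
from collections import defaultdict


class TinyGraph:
    """A toy property graph: labeled nodes plus typed, directed edges."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"label": ..., other properties}
        self.edges = defaultdict(list)  # node_id -> [(rel_type, target_id, props)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, source, rel_type, target, **props):
        self.edges[source].append((rel_type, target, props))

    def neighbors(self, source, rel_type):
        """Yield (target_id, edge_props) for edges of one type leaving source."""
        for rel, target, props in self.edges[source]:
            if rel == rel_type:
                yield target, props


def variants_for_group(graph, group_id, max_p):
    """Find variants associated (p <= max_p) with any phenotype in a group."""
    hits = []
    for node_id, node in graph.nodes.items():
        if node["label"] != "Variant":
            continue
        for phenotype, props in graph.neighbors(node_id, "ASSOCIATED_WITH"):
            groups = [g for g, _ in graph.neighbors(phenotype, "IN_GROUP")]
            if group_id in groups and props["p_value"] <= max_p:
                hits.append((node_id, phenotype, props["p_value"]))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical placeholder data, not real association results.
g = TinyGraph()
g.add_node("group:glycemic", "PhenotypeGroup")
g.add_node("pheno:fasting_glucose", "Phenotype")
g.add_node("pheno:height", "Phenotype")
g.add_edge("pheno:fasting_glucose", "IN_GROUP", "group:glycemic")
g.add_node("var:chr8_example_1", "Variant", chromosome="8")
g.add_node("var:chr8_example_2", "Variant", chromosome="8")
g.add_edge("var:chr8_example_1", "ASSOCIATED_WITH", "pheno:fasting_glucose", p_value=1e-9)
g.add_edge("var:chr8_example_2", "ASSOCIATED_WITH", "pheno:fasting_glucose", p_value=0.2)
g.add_edge("var:chr8_example_2", "ASSOCIATED_WITH", "pheno:height", p_value=1e-12)

glycemic_hits = variants_for_group(g, "group:glycemic", max_p=5e-8)
print(glycemic_hits)
```

In a real graph database the same question would be expressed as a declarative query over nodes and relationships; the traversal here only illustrates why a graph is a natural fit for chained questions such as "variant, to phenotype, to phenotype group".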


We have also created pipelines to apply computational methods to the genetic association data in the Knowledge Portal Network. In brief, we are currently running:

  • MetaXcan, which integrates tissue-specific expression data from GTEx and genetic association data to predict the potential that a gene is causal for a phenotype in a given tissue;
  • DEPICT, which integrates multiple data sources including transcriptional co-regulation, Gene Ontology annotations, model organism phenotypes, and more to make several predictions: membership of a gene in a pathway; the probability of its association with a given phenotype; and the tissues or cell types that are likely to be relevant for a given trait;
  • eCAVIAR and COLOC, two methods that quantify the probability that a variant is causal in both genetic association and eQTL studies;
  • GREGOR, which integrates chromatin states with genetic associations derived from meta-analysis of the Knowledge Portal Network to generate p-values representing the significance of association between a tissue and trait; 
  • LD score regression (LDSR), which uses cell type-specific annotations and genetic association summary statistics in the Knowledge Portal Network to generate p-values representing the significance of association between a tissue and trait.

The Gene FOCUS table is accessible via the "Genes in region" tab on the Gene page:

Gene FOCUS table for PITX2

The table shows the results of the methods for each gene across the region. It has two alternative views and supports versatile sorting. The methods, data types, and table navigation are described in more detail in our downloadable help documentation for the new interface. Note that not all results are currently available for all genes in the Cerebrovascular Disease Knowledge Portal (CDKP), but the Gene FOCUS table will become increasingly populated as more datasets are added to the CDKP and the methods are re-run.
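
As a rough illustration of the idea behind such a table, the sketch below pivots per-method results into one row per gene and sorts genes by a chosen method, leaving gaps where a method has no result yet. This is not the portal's implementation: the function names, gene and method labels, and values are hypothetical placeholders, and the sort assumes that lower values rank first (as for p-values), which will not hold for every method.

```python
def build_gene_table(results, methods):
    """Pivot (gene, method, value) records into one row per gene.

    Methods with no result for a gene are left as None.
    """
    table = {}
    for gene, method, value in results:
        row = table.setdefault(gene, {m: None for m in methods})
        row[method] = value
    return table


def sort_genes(table, method):
    """Order genes by ascending value for one method; missing results go last."""
    return sorted(
        table,
        key=lambda gene: (table[gene][method] is None, table[gene][method] or 0.0),
    )


# Hypothetical placeholder values, not real results.
methods = ["method_A", "method_B"]
results = [
    ("GENE1", "method_A", 0.04),
    ("GENE2", "method_A", 0.001),
    ("GENE2", "method_B", 0.5),
    ("GENE3", "method_B", 0.01),
]

table = build_gene_table(results, methods)
order_by_a = sort_genes(table, "method_A")
print(order_by_a)
```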

The Tissue FOCUS table, accessible via a link on the home page, presents results that can suggest which tissues or cell types may be relevant for a disease or trait of interest. 
To use the table, choose a phenotype of interest to see p-values for different tissues, denoting the significance with which variants associated with that phenotype are enriched in each tissue. Find complete details about the table and methods in our downloadable documentation.
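
For readers who want to work with this kind of output themselves, the following sketch ranks tissues by p-value for one phenotype and flags those that pass a multiple-testing threshold. The record layout, tissue and phenotype names, and p-values are invented placeholders, and the Bonferroni correction is an assumption made for this example rather than a description of how the portal reports significance.

```python
def rank_tissues(records, phenotype, alpha=0.05):
    """Return tissues for one phenotype sorted by p-value, with a Bonferroni flag."""
    rows = [r for r in records if r["phenotype"] == phenotype]
    if not rows:
        return []
    # Bonferroni: divide alpha by the number of tissues tested for this phenotype.
    threshold = alpha / len(rows)
    ranked = sorted(rows, key=lambda r: r["p_value"])
    return [
        {
            "tissue": r["tissue"],
            "p_value": r["p_value"],
            "significant": r["p_value"] <= threshold,
        }
        for r in ranked
    ]


# Hypothetical placeholder values, not real enrichment results.
records = [
    {"phenotype": "example_trait", "tissue": "tissue_A", "p_value": 0.2},
    {"phenotype": "example_trait", "tissue": "tissue_B", "p_value": 0.001},
    {"phenotype": "example_trait", "tissue": "tissue_C", "p_value": 0.03},
    {"phenotype": "other_trait", "tissue": "tissue_A", "p_value": 0.0001},
]

ranked = rank_tissues(records, "example_trait")
for row in ranked:
    print(row)
```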


This system, from data storage through the computational pipelines to the user interface, has been designed to be flexible and modular so that we will be able to add new methods and data types easily and rapidly. As we actively develop these tools, we are very interested in feedback from researchers about how to improve them. Please try them out and let us know what you think!