Selecting datasets
Our criteria for including genetic association datasets in the Knowledge Portals are:
- relevance to a Knowledge Portal disease area
- requests by disease-specific researcher communities
We prioritize datasets that:
- are considered authoritative by each disease community
- are recently generated
- have large sample sizes
- include subjects of diverse ancestry
We incorporate older or smaller datasets if they 1) increase the diversity of ancestries in our database; 2) include associations for phenotypes that are currently under-represented in the portals; or 3) are requested by collaborators.
We occasionally remove a dataset from the portals when its investigators advise us that all of its samples are represented in a newer analysis of the same phenotype from the same group.
Dataset sources
We identify genetic association datasets to incorporate from several sources:
- weekly PubMed searches to find new studies in disease areas of interest
- contributions from collaborators in disease-specific research areas
- quarterly searches of the GWAS Catalog to find newly available summary statistics for traits of interest
Harmonization and quality control
Datasets from the GWAS Catalog are checked and harmonized at the European Bioinformatics Institute, using the GWAS Catalog's own pipelines, before transfer to the Knowledge Portal Data Coordinating Center (DCC) at the Broad Institute. Datasets obtained by all other routes undergo harmonization and quality control at the DCC. We ensure that each dataset includes, at minimum, chromosome, position, genome build, effect allele, and p-value; we also obtain effect sizes, standard errors, and effect allele frequencies if available. Datasets that refer to genome builds other than hg19 (GRCh37) are converted to hg19 using the liftOver software. We harmonize datasets to the reference genome by 1) determining whether the effect allele is the reference or alternate allele and 2) if the effect allele is the reference allele, changing it to the non-reference allele and inverting the effect size and allele frequency.
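The required-field check and the allele harmonization step described above can be sketched roughly as follows. This is only an illustration, not the DCC pipeline: the field names (chromosome, beta, effect_allele_frequency, and so on) and both function names are hypothetical, and the sketch assumes the effect size is a log-scale beta, so that "inverting" it means negating it. An odds ratio would instead need its reciprocal.

```python
# Hypothetical field names; the real pipeline's column names are not given in the text.
REQUIRED_FIELDS = ("chromosome", "position", "genome_build", "effect_allele", "p_value")


def missing_required_fields(record):
    """Return the required fields that are absent or empty in one variant record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def harmonize_to_alt(record, ref, alt):
    """Return a copy of the record whose effect allele is the non-reference allele.

    Assumes `beta` is on a log scale, so inverting it means flipping its sign,
    and that inverting an allele frequency means replacing it with 1 - frequency.
    """
    out = dict(record)
    effect = record["effect_allele"].upper()
    if effect == alt.upper():
        return out  # already expressed relative to the non-reference allele
    if effect != ref.upper():
        raise ValueError("effect allele matches neither reference nor alternate allele")
    out["effect_allele"] = alt
    if out.get("beta") is not None:
        out["beta"] = -out["beta"]
    if out.get("effect_allele_frequency") is not None:
        out["effect_allele_frequency"] = 1.0 - out["effect_allele_frequency"]
    return out
```

For example, a record reporting the reference allele A as the effect allele with beta 0.5 and frequency 0.2 would come back with effect allele G (the alternate), beta -0.5, and frequency 0.8. Optional fields that are missing are left untouched.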
Quality control checks include:
- determining how many variants are new to our database (to detect misspecification of the genome build)
- if effect allele frequency is available, comparing it to the allele frequency for the same ancestry in the 1000 Genomes reference panel (to detect strand flips)
- visually inspecting Manhattan and Q-Q plots, and comparing them to those in the publication accompanying the study if available
- plotting distributions of effect sizes and standard errors (to determine whether they are consistent with p-values)
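Two of the checks above lend themselves to a small numerical illustration. The sketch below is not the DCC's implementation: the function names and tolerance values are hypothetical, and it assumes p-values come from a two-sided Wald test on a normally distributed statistic (z = effect size / standard error). The first check recomputes the p-value implied by an effect size and standard error and compares it with the reported one. The second flags a variant whose reported frequency disagrees with the reference panel but matches it once replaced by 1 - frequency.

```python
import math


def wald_p_value(beta, se):
    """Two-sided p-value for z = beta / se under a standard normal distribution."""
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_value_consistent(beta, se, reported_p, log10_tol=0.5):
    """True if the reported p-value is within log10_tol orders of magnitude
    of the p-value implied by beta and se. The tolerance is an arbitrary choice."""
    floor = 1e-300  # avoid log10(0) when a p-value underflows
    expected = max(wald_p_value(beta, se), floor)
    reported = max(reported_p, floor)
    return abs(math.log10(expected) - math.log10(reported)) <= log10_tol


def frequency_looks_flipped(reported_af, panel_af, tol=0.1):
    """True if the reported frequency disagrees with the panel frequency
    but agrees with it after being replaced by 1 - reported_af."""
    disagrees = abs(reported_af - panel_af) > tol
    agrees_when_flipped = abs((1.0 - reported_af) - panel_af) <= tol
    return disagrees and agrees_when_flipped
```

With an effect size of 0.1 and a standard error of 0.05, z is 2 and the implied p-value is about 0.0455, so a reported p-value of 1e-8 for that variant would be flagged as inconsistent. Variants with frequencies near 0.5 cannot be resolved by the frequency check, since both orientations fall within tolerance of the panel.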
Dataset descriptions in the Knowledge Portals
Every dataset whose results are displayed in a Knowledge Portal is listed on the "Genetic association datasets" page, accessible via the "Data" menu in the upper menu bar. Clicking a dataset name on that page opens a "Dataset Inspector" page that lists the accompanying publication, the phenotypes for which associations were analyzed, and a brief summary of the analysis. If associations were analyzed for more than one phenotype, a pull-down menu at the top of the page lets you select a phenotype and view its Manhattan and Q-Q plots and a table of its top associations.