Promoter-like regions based on DNase and H3K4me3 signals

DNase hypersensitivity and the histone modification H3K4me3 are well-known indicators of promoters. We have developed an unsupervised method that combines DNase and H3K4me3 signals from the same cell type to predict promoter-like regions across all ENCODE cell and tissue types. To evaluate candidate ranking schemes, we used ENCODE RNA-seq data from mouse limb, hindbrain, midbrain and neural tube at embryonic day 11.5. For each tissue, we ranked all TSS-proximal DNase peaks (less than 2 kb from a TSS) by the combined expression of all proximal transcripts. Using linear regression models, we then sought the ranking scheme that best predicted this expression-based ranking. We tested ranking by H3K4me3, DNase or H3K27ac signals, as well as various combinations of these signals.
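The evaluation described above can be illustrated with a minimal sketch. It assumes that each scheme is scored by the R² of a simple least-squares regression of expression rank on signal rank (which equals the squared Pearson correlation of the two rank vectors); the exact regression setup is not specified in the text, and the function names `ranks` and `r_squared` are hypothetical.

```python
def ranks(values, descending=True):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson r).

    Assumes neither x nor y is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Toy values for five TSS-proximal peaks (illustrative, not real data).
h3k4me3_signal = [9.0, 7.5, 3.1, 0.4, 5.2]
expression = [120.0, 80.0, 10.0, 1.0, 15.0]

score = r_squared(ranks(h3k4me3_signal), ranks(expression))
```

Working on ranks rather than raw values keeps the comparison insensitive to the very different dynamic ranges of ChIP-seq, DNase-seq and RNA-seq signals.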

As shown in Figure 4, the best single feature was ranking by the H3K4me3 signal (average R² of 0.64). DNase on its own performed less well (R² = 0.37), followed by H3K27ac (R² = 0.31). Although combining DNase and H3K27ac is highly accurate in predicting enhancer-like regions and identifies some promoters, that combination is not as predictive of gene expression as the combination of H3K4me3 and DNase. The best model combined the H3K4me3 and DNase rankings in a ratio of 1 to 0.28.
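One way to read the 1 to 0.28 ratio is as a weighted sum of the two per-peak ranks. The sketch below assumes that rank 1 denotes the strongest signal and that a lower combined score is better; the function names and constants are hypothetical, and the text does not state how ties or rank direction were handled.

```python
# Weights taken from the ratio reported in the text (1 : 0.28).
W_H3K4ME3 = 1.0
W_DNASE = 0.28


def rank_descending(signal):
    """Return 1-based ranks, with rank 1 assigned to the strongest signal."""
    order = sorted(range(len(signal)), key=lambda i: signal[i], reverse=True)
    result = [0] * len(signal)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def combined_score(h3k4me3, dnase):
    """Weighted sum of H3K4me3 and DNase ranks; lower means more promoter-like."""
    rank_h = rank_descending(h3k4me3)
    rank_d = rank_descending(dnase)
    return [W_H3K4ME3 * h + W_DNASE * d for h, d in zip(rank_h, rank_d)]


def best_first(scores):
    """Indices of peaks ordered from best (lowest score) to worst."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Toy signals for three DNase peaks (illustrative, not real data).
scores = combined_score([9.0, 3.0, 6.0], [2.0, 8.0, 5.0])
ordering = best_first(scores)  # peak 0 first, then peak 2, then peak 1
```

With these weights the H3K4me3 rank dominates, and the DNase rank mainly separates peaks whose H3K4me3 ranks are close.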

To predict promoter-like regions, we then applied this ranking scheme to all DNase peaks and selected the top 10,000 TSS-proximal regions. We also included distal regions that ranked above the 10,000th TSS-proximal prediction, as they may correspond to unannotated TSSs or actively transcribed enhancer-like regions.
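The selection rule above can be sketched as follows, assuming each peak carries a combined score (lower is better) and a flag marking whether it is TSS-proximal. The dictionary keys and the function name `select_promoter_like` are hypothetical, and the handling of tied scores is not specified in the text.

```python
def select_promoter_like(peaks, n_proximal=10_000):
    """Return IDs of all peaks ranked at or above the n-th TSS-proximal peak.

    `peaks` is a list of dicts with keys "id", "score" (lower is better)
    and "proximal" (True if the peak lies within 2 kb of a TSS).
    If fewer than `n_proximal` proximal peaks exist, the cutoff falls at
    the last proximal peak, so lower-ranked distal peaks are excluded.
    """
    ordered = sorted(peaks, key=lambda p: p["score"])
    cutoff = -1  # index of the last peak to keep; -1 keeps nothing
    seen = 0
    for index, peak in enumerate(ordered):
        if peak["proximal"]:
            seen += 1
            cutoff = index
            if seen == n_proximal:
                break
    return [peak["id"] for peak in ordered[: cutoff + 1]]


# Toy example with a cutoff of two proximal peaks (illustrative data).
toy_peaks = [
    {"id": "a", "score": 1.0, "proximal": True},
    {"id": "b", "score": 2.0, "proximal": False},
    {"id": "c", "score": 3.0, "proximal": True},
    {"id": "d", "score": 4.0, "proximal": False},
    {"id": "e", "score": 5.0, "proximal": True},
]
selected = select_promoter_like(toy_peaks, n_proximal=2)
```

In the toy example the distal peak "b" is kept because it outranks the second proximal peak "c", while "d" and "e" fall below the cutoff.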

We have applied this method to 107 human cell types and 14 mouse cell types for which both DNase and H3K4me3 data were generated by the ENCODE and Roadmap Epigenomics consortia. For cell and tissue types with only H3K4me3 data, we centered predictions on H3K4me3 peaks and ranked them by H3K4me3 signal. Users can query these promoter-like regions by genomic location, nearby genes, or SNPs, and visualize them in the UCSC and WashU genome browsers. We have also made these regions available for download.

Figure 4.