Introduction
measure genome-wide biochemical signals (chromatin accessibility, histone modifications, DNA methylation, binding of transcription factors) → candidate cis-regulatory elements
use evolutionary conservation to identify potential regulatory regions
combine the above approaches, examining how different functional classes of regulatory elements respond to evolutionary pressures
Finding
Genes near constrained elements perform fundamental cellular processes
Genes near primate-specific elements are involved in environmental interaction (odor perception, immune response)
~20% of TFBSs are derived from transposable elements (TEs), exhibiting intricate patterns of gain and loss during evolution
Sequence variants associated with complex traits are enriched in constrained TFBSs
Background
Cis Regulatory Elements
regions of non-coding DNA — regulate transcription of neighboring genes
typically regulate gene transcription by binding TFs
a single TF might bind to many CREs
Classification
relatively short sequences
~ 35 bp upstream/downstream from the initiation site
Initiate transcription of downstream genes
Must bind with TFs
enhance transcription of genes
upstream, downstream, within the introns, far from the gene ...
multiple enhancers can coordinate to regulate
have activation marks (H3K4me1, H3K27ac)
exert regulatory function to increase transcription of target genes
exist in a primed state prior to activation (do not yield RNA)
marked with the activation histone modification H3K4me1
similar to a primed enhancer, but carry a repression mark (H3K27me3)
must be removed for transition to an active enhancer state
prevent transcription of a gene
Trans-Regulatory Elements
DNA sequences that encode upstream regulators (trans-acting factors)
Metazoan genomes have roughly the same number of nucleotides encoding proteins
The higher levels of organismal complexity achieved in mammals are attributed to regulatory systems
ENCODE and Roadmap Epigenomics Consortia measure genome-wide biochemical signals
transcription factor binding sites
Goal
identify the cCREs and TFBSs conserved in the mammalian lineage
characterize the evolutionary histories of cCREs and TFBSs and identify the driving forces behind their gains and losses
assess the likelihood that conserved cCREs and TFBSs are functional in humans and other mammals
Fig. 1. Subsets of cCREs show distinct patterns of evolutionary conservation
cCREs fall into three groups with distinct patterns of mammalian conservation
using Zoonomia’s reference-free alignment across 241 mammalian genomes
compute the number of other mammalian genomes to which each human cCRE could be aligned for ≥ 90% of its positions (N1) or ≤ 10% of its positions (N2)
G1: highly conserved cCREs
G2: actively evolving cCREs (≥ 90% → only to primate genomes; ≤ 10% → fewer than half of the mammalian genomes)
G3: primate-specific cCREs
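The two counts N1 and N2 can be sketched in a few lines. This is a minimal illustration only: the input layout (a mapping from species name to the fraction of the cCRE's positions that align) and the function name are assumptions made for the example, and the cutoffs on N1 and N2 that separate G1, G2, and G3 are not reproduced here because these notes do not state them in full.

```python
def alignment_counts(aligned_fraction, high=0.90, low=0.10):
    """Return (N1, N2) for one human cCRE.

    aligned_fraction: hypothetical mapping of species name -> fraction of the
    cCRE's positions that align to that species' genome (0.0 to 1.0).
    N1 counts genomes aligned for >= 90% of positions,
    N2 counts genomes aligned for <= 10% of positions.
    """
    n1 = sum(1 for f in aligned_fraction.values() if f >= high)
    n2 = sum(1 for f in aligned_fraction.values() if f <= low)
    return n1, n2


# Toy input with five species (made-up fractions, for illustration only).
example = {"chimp": 0.98, "macaque": 0.93, "mouse": 0.05, "dog": 0.40, "bat": 0.00}
n1, n2 = alignment_counts(example)
print(n1, n2)
```

A cCRE with high N1 across all 240 other genomes would look like G1, while one with high N1 only among primates and high N2 elsewhere would look like G3.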
Functional categories of cCREs show varying distributions among the three groups
PLS: promoter-like signature
pELS: proximal enhancer-like signature
dELS: distal enhancer-like signature
cCREs with PLSs — highest G1, lowest G3
DNase-H3K4me3 and CTCF-only — lowest G1, highest G3
In summary, cCREs fall into three distinct groups based on their conservation levels across the 241 mammalian genomes, and a cCRE’s functional category influences the likelihood that it falls within a given conservation group.
an epigenetic modification → DNA packaging protein Histone H3
commonly associated with the activation of transcription of nearby genes
an epigenetic modification → DNA packaging protein Histone H3
associated with higher activation of transcription → defined as an active enhancer mark
transcriptional repressor CTCF; regulate 3D structure of DNA
block interaction between enhancer and promoter
both to promote and repress gene expression
Mammalian genome alignments place cCREs in a landscape of evolutionary profiles
perform Uniform Manifold Approximation and Projection (UMAP) for dimension reduction
the remaining G3 cCREs break off to form dozens of small clusters (fig.C)
the G1 cCREs at the end of the large cluster have the highest phyloP scores (fig.F)
TSS-proximal cCREs occupy “ridges” of the large cluster at the G1 end, overlapping a subset of high-phyloP locations (fig.G)
almost all G3 clusters exhibit strong overlap with TEs, whereas the other groups do not (fig.H)
Although TEs have been a driving force for regulatory elements throughout evolution, they have been particularly instrumental in the evolution of primate-specific elements.
The immune pathway adapts by evolving new exons and cCREs, whereas olfaction and transposon control pathways adapt mainly by evolving cCREs
perform GO enrichment analysis on the genes near each group of the cCREs
near G1: functionally important for all cells
near G2: diverse biological processes
near G3: interaction with the environment
using the same approach, group the exons of protein-coding genes into 3 groups on the basis of mammalian conservation
Genes containing G3 exons are enriched in immune pathways
The immune pathway responds to viral infection by evolving both new exons and regulatory elements, whereas olfaction and transposon control pathways adapt mainly by evolving regulatory elements.
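GO enrichment of the kind described above is commonly scored with a one-sided hypergeometric test. The sketch below shows that calculation; it is an assumption that this test resembles what was used, since these notes do not name the method, and the function name and toy counts are hypothetical.

```python
from math import comb


def hypergeom_pvalue(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_annotated carry the GO term."""
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 with the term, 4 genes near a cCRE
# group, 3 of which carry the term.
p = hypergeom_pvalue(20, 5, 4, 3)
print(round(p, 4))
```

In practice the p-values would also need multiple-testing correction across GO terms, which is omitted here.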
Fig.2 Identification of TFBSs constrained in the mammalian lineage.
The binding sites of 367 transcription factors show diverse evolutionary profiles
ChIP-seq peak sets → sequence motifs
(implemented a convolutional neural network architecture)
Information content at individual positions in these motifs is positively correlated with conservation scores, while both quantities are negatively correlated with both DNase I cleavage (DNase-seq) and Tn5 insertion (ATAC-seq), supporting the motifs’ accuracy
DNase I
a nuclease that cleaves DNA
a method to identify the location of regulatory regions (open chromatin)
based on genome-wide sequencing of regions sensitive to cleavage by DNase I
merge and align instances of the same motif → final set of TFBSs
classify TFBSs using the same approach used for grouping cCREs
TFBSs show much smaller percentages of G1 and G2 and a much larger percentage of G3 than cCREs in the corresponding groups
Does this difference in distribution arise because different groups of cCREs contain distinct groups of TFBSs?
G1 cCREs contain G1 TFBSs and ungrouped (other) TFBSs
G2 cCREs contain a mixture of G1, G2, G3, and other TFBSs
G3 cCREs predominantly contain G3 TFBSs
In other words, G1 cCREs have conserved most of their constituent TFBSs throughout mammalian evolution, whereas G2 cCREs have undergone greater turnover in their constituent TFBSs.
When accounting for mutation rate on a per TF basis, only a third of highly conserved TFBSs are constrained across mammals
Fit a two-component Gaussian mixture model to the phyloP scores of the TFBSs for each TF individually to classify its binding sites as constrained or unconstrained.
Constrained sites are preferentially located in conserved regions but are even more conserved than their flanking regions
A second model is fit to the difference in phyloP scores between the TFBS and the average score of its two flanks
Across the 367 TFs, the two models yielded two sets of highly overlapping sites. Use the union of the two (2 M sites, 0.8% of the human genome) as constrained TFBSs for subsequent analyses.
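The per-TF two-component Gaussian mixture can be illustrated with a small expectation-maximization (EM) fit on one-dimensional scores. This is a sketch under stated assumptions, not the authors' implementation: the EM routine, the median-split initialization, the 0.5 posterior cutoff, and the synthetic scores are all choices made for the example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(scores, n_iter=200, min_sd=1e-3):
    """Fit a 1-D two-component Gaussian mixture by EM.

    Returns (weights, means, sds); index 0 is the lower-mean component.
    """
    xs = sorted(scores)
    n = len(xs)
    half = n // 2
    # Initialise the two means from the lower and upper halves of the data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand_mean = sum(xs) / n
    pooled_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n)
    sds = [max(pooled_sd, min_sd)] * 2
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: responsibility of the higher-mean component for each score.
        resp = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            resp.append(p1 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        r1 = sum(resp)
        r0 = n - r1
        if min(r0, r1) < 1e-9:
            break  # one component emptied out; keep the previous parameters
        # M-step: responsibility-weighted means, standard deviations, weights.
        pairs = list(zip(resp, xs))
        means = [sum((1 - r) * x for r, x in pairs) / r0,
                 sum(r * x for r, x in pairs) / r1]
        var0 = sum((1 - r) * (x - means[0]) ** 2 for r, x in pairs) / r0
        var1 = sum(r * (x - means[1]) ** 2 for r, x in pairs) / r1
        sds = [max(math.sqrt(var0), min_sd), max(math.sqrt(var1), min_sd)]
        weights = [r0 / n, r1 / n]

    if means[0] > means[1]:  # keep index 1 as the higher-mean component
        weights, means, sds = weights[::-1], means[::-1], sds[::-1]
    return weights, means, sds


def is_constrained(score, weights, means, sds, threshold=0.5):
    """Assign a site to the higher-mean component if its posterior exceeds threshold."""
    p0 = weights[0] * normal_pdf(score, means[0], sds[0])
    p1 = weights[1] * normal_pdf(score, means[1], sds[1])
    return (p0 + p1) > 0 and p1 / (p0 + p1) > threshold


# Synthetic phyloP-like scores: 700 "unconstrained" and 300 "constrained" sites.
random.seed(0)
scores = [random.gauss(0.2, 0.5) for _ in range(700)]
scores += [random.gauss(4.0, 1.0) for _ in range(300)]
weights, means, sds = fit_two_gaussians(scores)
print(weights, means, sds)
```

Here means[0] and means[1] play the roles of μ1 and μ2, and weights[1] estimates the fraction of sites in the constrained subset. The same routine could be applied to TFBS-minus-flank score differences for the second model.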
color TFBSs in HNF4A-bound sites according to constraint
construct sequence logos for G1 HNF4A-bound sites:
logos for constrained G1 sites maintain high information content, unconstrained G1 sites show much lower
Across the TFs, the difference in mean phyloP scores between the constrained and unconstrained sets (μ2 and μ1) correlates strongly with the fraction of the sites in the constrained subset
TFs vary greatly in the fraction of their sites which are constrained (0 to 60%), although the C2H2 zinc finger family shows the largest range
Of all C2H2 factors, KRAB-ZFPs exhibit the lowest percentages of constrained sites (pink dots at the bottom-left corner)
examine TFBS/cCRE intersection in five cell lines → evaluate TFBS overlap with cell type-specific regulatory elements 100bp?
most TFBSs are located near regulatory elements having regulatory functions in the same cell type
Fig.3 Almost all primate-specific TFBSs overlap TEs
Almost all primate-specific TFBSs overlap TEs
according to Fig.2.D, almost all primate-specific G3 TFBS clusters overlap TEs
illustrate six clusters of HNF4A sites and their presence or absence in the primate lineage
The HNF4A sites in these clusters are enriched in specific subfamilies of TEs
LTR (median age 95 Myr), LINE1 (97 Myr), and SINE/Alu (54 Myr) are the three youngest TE families, and they overlap the youngest clusters of HNF4A sites
cluster (i): HNF4A-bound sites restricted to great apes cluster (ii), (iii), and (iv): shared between apes and monkeys cluster (v) and (vi): contain even older G3 HNF4A sites All six clusters of G3 HNF4A sites are highly enriched in LTRs (28.4 to 51.7%), indicating that LTRs have contributed substantially to the spread of HNF4A sites during primate evolution
Only 7.1% of non-G3 HNF4A sites (167,311 in total) overlap LTRs, similar to the level in the genomic background (8.8%).
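Overlap percentages like those quoted for HNF4A sites and LTRs come down to counting how many sites intersect at least one element. A minimal sketch for a single chromosome follows, assuming half-open (start, end) coordinates; the function names and toy coordinates are hypothetical, and real analyses typically use a dedicated interval tool rather than hand-rolled code.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_fraction(sites, elements):
    """Fraction of sites that overlap at least one element by >= 1 bp."""
    merged = merge_intervals(elements)
    starts = [s for s, _ in merged]
    hits = 0
    for site_start, site_end in sites:
        # Last merged element that starts before the site ends.
        i = bisect_right(starts, site_end - 1) - 1
        if i >= 0 and merged[i][1] > site_start:
            hits += 1
    return hits / len(sites)


# Toy coordinates on one chromosome (illustrative only).
te_elements = [(100, 200), (150, 300), (1000, 1100)]
tfbs_sites = [(90, 110), (300, 320), (1050, 1060), (500, 510)]
fraction = overlap_fraction(tfbs_sites, te_elements)
print(fraction)
```

Dividing such a fraction by the element family's genomic footprint gives a fold enrichment; with the figures in these notes, 28.4% against a background of 8.8% is roughly 3.2-fold.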
Fig. 4 TFs with binding sites most enriched in each TE family
Among the 367 TFs investigated, 24.6% of the 15.6 M binding sites are classified as G3. 86.1% of the G3 TFBSs overlap TEs
9.1% of cCREs are primate-specific and driven by TEs; this is lower than the percentage for TFBSs (21.2%, i.e. 24.6% × 86.1%), because each cCRE, particularly those in G2, may contain multiple TFBSs classified in different groups
Constrained TFBSs are largely depleted of TEs, whereas unconstrained TFBSs have similar TE distributions as the genomic background
For C2H2 zinc finger TFs, constrained TFBS fraction (x axis) is plotted against the fraction of TFBS/TE overlap (y axis). Each dot represents a TF; all 114 C2H2 zinc-finger TFs in this study are included. KRAB-ZFPs are colored orange.
Fractions of TFBSs for each TF which overlap families of TEs. Dots indicate outlier TFs; KRAB-ZFPs are in orange. Horizontal gray lines denote each TE family’s overall genomic footprint. Green numbers denote the number of TFs whose overlap fraction exceeds genomic background (outside parentheses) and the total number of outlier TFs (inside parentheses) for each TE family.
KRAB-ZFPs are the most enriched TFs in binding to each TE family
TEs bound by KRAB-ZFPs tend to be younger than unbound TEs, indicating that KRAB-ZFPs repress the activity of these young TEs.
KRAB?
Fig. 5. Epigenetic signals are enriched at constrained and unconstrained TFBSs in mammalian species
Constrained human TFBSs are bound by TFs in other mammals and exhibit epigenetic signals indicative of regulatory functions
Assess whether epigenomic data in other mammals support the TFBSs
More than 90% of constrained human HNF4A binding sites are also present in macaque, dog, mouse, and rat
By contrast, 53% of unconstrained human sites are present in the dog genome, and only 36% of unconstrained human sites are present in mouse and rat
examine ChIP-seq data to assess binding signals
Constrained TFBSs show higher ChIP-seq signals than unconstrained TFBSs across all five species
Most unconstrained TFBSs still show some evidence of binding
Constrained and unconstrained TFBSs are highly enriched in the corresponding sequence motifs
The information content of the sequence logos is lower for unconstrained TFBSs in more distant species
The information content is measured in bits and, in the case of DNA sequences, ranges from 0 to 2 bits. A position in the motif at which all nucleotides occur with equal probability has an information content of 0 bits, while a position at which only a single nucleotide can occur has an information content of 2 bits.
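The per-position information content described above can be written as 2 bits minus the Shannon entropy of the nucleotide frequencies. The sketch below shows that calculation; it omits small-sample and background-frequency corrections, and the function name is an assumption made for the example.

```python
import math


def information_content(probs):
    """Information content (bits) of one motif position.

    probs: probabilities of A, C, G, T at the position (summing to 1).
    For four nucleotides the maximum is log2(4) = 2 bits.
    """
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(len(probs)) - entropy


uniform = information_content([0.25, 0.25, 0.25, 0.25])  # all bases equally likely
fixed = information_content([1.0, 0.0, 0.0, 0.0])        # only one base occurs
two_way = information_content([0.5, 0.5, 0.0, 0.0])      # two bases equally likely
print(uniform, fixed, two_way)
```

Summing this quantity over all positions of a motif gives the total information content shown by the height of a sequence logo.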
sort of possibility
examine protection against cleavage by DNase I in DNase-seq data
Constrained TFBSs bound in both cell lines according to ChIP-seq show the highest baseline DNase signal and the deepest DNase protection profile in both cell lines
In summary, TFBSs show cell-type-specific protection against DNase cleavage, with conserved TFBSs showing greater protection than unconstrained TFBSs.
Further evaluate three histone modifications around TFBSs. These modifications, H3K4me3, H3K27ac, and H3K4me1, are enriched at active promoters, active enhancers, and all enhancers, respectively
Higher fractions are observed for constrained HNF4A binding sites than for unconstrained sites in all species
Examine DNA CpG methylation at TFBSs using whole-genome bisulfite sequencing data
Low DNA methylation typically corresponds to active regulatory elements, whereas high DNA methylation leads to repression
Analyze 94 normal tissue and primary cell samples separately from 18 cancer samples
constrained TFBSs are ubiquitously unmethylated
although most unconstrained TFBSs are methylated in most samples, they exhibit considerable variation
In normal samples, constrained TFBSs tend to be ubiquitously unmethylated and likely active, and unconstrained TFBSs tend to be variably methylated and likely active in specific cell and tissue types.
constrained TFBSs remain unmethylated, although a small fraction of them become methylated in some samples
most unconstrained TFBSs become methylated in most cancer samples
An increase in the methylation of a subset of TFBSs likely leads to their repression in cancer.
Fig. 6. GWAS SNPs are enriched in constrained cCREs and TFBSs
Disease- and trait-associated variants are most enriched in highly conserved cCREs and constrained TFBSs
Aiming to interpret trait-associated variants identified by genome-wide association studies (GWASs) using our highly conserved cCREs and constrained TFBSs.
a measure of how well differences in people’s genes account for differences in other traits
ranging from 0 (env.) to 1 (genetic)
explore the genetic architecture of complex traits in human genetics
definition: the proportion of heritability explained by an SNP subset, divided by the proportion of SNPs in that subset
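The enrichment ratio defined above is simple arithmetic. The sketch below spells it out; the function name, argument names, and the numbers in the example are hypothetical and chosen only to make the ratio easy to follow.

```python
def heritability_enrichment(h2_subset, h2_total, n_snps_subset, n_snps_total):
    """Proportion of heritability explained by a SNP subset divided by the
    proportion of SNPs that fall in the subset."""
    prop_h2 = h2_subset / h2_total
    prop_snps = n_snps_subset / n_snps_total
    return prop_h2 / prop_snps


# Made-up example: an annotation holds 1% of SNPs but explains 20% of the
# SNP heritability, giving a 20-fold enrichment.
enrichment = heritability_enrichment(
    h2_subset=0.08, h2_total=0.40, n_snps_subset=10_000, n_snps_total=1_000_000
)
print(round(enrichment, 2))
```

A value near 1 means the subset explains heritability in proportion to its size; values well above 1 indicate enrichment.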
These results remain robust after removing coding nucleotides from all partitions
supporting the utility of this set of constrained TFBSs in prioritizing candidate functional variants
Heritability enrichment within TFBSs is most significant in cell-type–specific regulatory elements
GWAS variants are known to be enriched within regulatory elements specific to disease- and trait-relevant cell types
Are the TFBSs driving the aforementioned enrichment cell-type specific?
Heritability enrichment for seven immune-mediated traits and sixteen erythroid traits in partitions of cCREs-dELS that are chromatin accessible in six distinct cell lines.
Constrained disease-associated TFBSs affect regulatory activity in a cell-type-specific manner.