ChIP-seq Analysis

Evaluation of ChIP-seq data

Establishing Transcriptional Regulatory Network

Explore

Evaluation of ChIP-seq data

刘运哲

Alignment quality

1. Checking the total percent of reads aligning to the genome. A mapping rate of 70% or higher is great, and lower than 50% is concerning especially if total read depth is low.

Low mapping rates can be a result of low quality reads, contaminating sequences, inappropriate alignment parameters chosen, or a poor quality reference genome.

2. Determining the percent uniquely mapping reads, ideally this would be > 60% of the total read depth of your sample. The higher the percentage the more usable data you have to work with. (image source:

Land et, al, 2012⁠

)

High duplication can be a result of over-amplification.

⁠

High number of multi-mapping reads can be due to mappability.

⁠

3. Identify percent of reads mapping in blacklist regions. This shouldn’t be higher than 10% of your high quality mapped (and filtered) read count. (haven’t been done yet)

Here’s the code for summarizing alignment quality information from log.txt

# make .sh file

########### get summary of mapping (bowtie2) and samtools rmdup

########### bash get_bowtie2_log_info.sh

echo "Sample" > bowtie2_mapping.tmp

echo "#total" >> bowtie2_mapping.tmp

echo "#align0" >> bowtie2_mapping.tmp

echo "#align1" >> bowtie2_mapping.tmp

echo "#align>1" >> bowtie2_mapping.tmp

echo "#mapping_ratio" >> bowtie2_mapping.tmp

echo "#duplicated" >> bowtie2_mapping.tmp

echo "#all_Q10_reads" >> bowtie2_mapping.tmp

echo "#duplicated" >> bowtie2_mapping.tmp

ls *_mapping.log | while read i; do

echo $i > $i.log.tmp

sed -n '1p' $i | awk -F" " '{print $1}' >> $i.log.tmp

sed -n '3p' $i | awk -F" " '{print $1$2}' >> $i.log.tmp

sed -n '4p' $i | awk -F" " '{print $1$2}' >> $i.log.tmp

sed -n '5p' $i | awk -F" " '{print $1$2}' >> $i.log.tmp

sed -n '15p' $i | awk -F" " '{print $1}' >> $i.log.tmp

tail -n 1 $i | awk -F" " '{print $2}' >> $i.log.tmp

tail -n 1 $i | awk -F" " '{print $4}' >> $i.log.tmp

tail -n 1 $i | awk -F" " '{print $6}' >> $i.log.tmp

done

paste *.log.tmp > all.log.tmp

paste bowtie2_mapping.tmp all.log.tmp > bowtie2_mapping_1.info

awk '{i=1;while(i <= NF){col[i]=col[i] $i " ";i=i+1}} END {i=1;while(i<=NF){print col[i];i=i+1}}' bowtie2_mapping_1.info | sed 's/[ \t]*$//g' > bowtie2_mapping.info

rm *.tmp bowtie2_mapping_1.info

########################

bash get_bowtie2_log_info.sh

Cross-Correlation Analysis

A high-quality ChIP-seq experiment will produce significant clustering of enriched DNA sequence tags/reads at locations bound by the protein of interest; the expectation is that we can observe a bimodal enrichment of reads (sequence tags) on both the forward and the reverse strands.

⁠

How are the Cross-Correlation scores calculated?

Using a small genomic window as an example, let’s walk through the details of the cross-correlation below. It is important to note that the cross-correlation metric is computed as the Pearson’s linear correlation between coverage for each complementary base (i.e. on the minus strand and the plus strands), by systematically shifting minus strand by k base pairs at a time. This shift is performed over and over again to obtain the correlations for a given area of the genome.

Plot 1: At strand shift of zero, the Pearson correlation between the two vectors is 0.

⁠

Plot 2: At strand shift of 100bp, the Pearson correlation between the two vectors is 0.389.

⁠

Plot 3: At strand shift of 175bp, the Pearson correlation between the two vectors is 0.831.

⁠

When this process is completed we will have a table of values mapping each base pair shift to a Pearson correlation value. These Pearson correlation values are computed for every peak for each chromosome and values are multiplied by a scaling factor and then summed across all chromosomes.

Once the final cross-correlation values have been calculated, they can be plotted (Y-axis) against the shift value (X-axis) to generate a cross-correlation plot. The cross-correlation plot typically produces two peaks: a peak of enrichment corresponding to the predominant fragment length (highest correlation value) and a peak corresponding to the read length (“phantom” peak).

Examples of Cross-Correlation Plots

Strong signal:

High-quality ChIP-seq data sets tend to have a larger fragment-length peak compared with the read-length peak. The red vertical line shows the dominant peak at the true peak shift, with a small bump at the blue vertical line representing the read length.

⁠

Weak signal:

An example of weaker signal is demonstrated below with a Pol2 data. Here, this particular antibody is not very efficient and these are broad scattered peaks. We observe two peaks in the cross-correlation profile: one at the true peak shift (~185-200 bp) and the other at read length. For weak signal datasets, the read-length peak will start to dominate.

⁠

No signal:

A failed experiment will resemble a cross-correlation plot using input only, in which we observe little or no peak for fragment length. Note in the example below the strongest peak is the blue line (read length) and there is basically no other significant peak in the profile. The absence of a peak is expected since there should be no significant clustering of fragments around specific target sites (except potentially weak biases in open chromatin regions depending on the protocol used).

⁠

Normalized strand cross-correlation coefficient (NSC):

The ratio of the maximal cross-correlation value divided by the background cross-correlation (minimum cross-correlation value over all possible strand shifts).

higher NSC values indicate more enrichment (better signal:noise)

low signal-to-noise: NSC values < 1.1

minimum possible NSC value: 1 (no enrichment)

Relative strand cross-correlation coefficient (RSC):

The ratio of the fragment-length cross-correlation value minus the background cross-correlation value, divided by the phantom-peak cross-correlation value minus the background cross-correlation value.

high enrichment: RSC values > 1

low signal-to-noise: RSC values < 0.8

minimum possible RSC value: 0 (no enrichment)

NOTE Low NSC and RSC values can be due to failed and poor quality ChIP, low read sequence quality and hence lots of mismappings, shallow sequencing depth or a combination of these. Also, datasets with few binding sites (< 200) which could be due to biological reasons (i.e. a factor that truly binds only a few sites in a particular tissue type) would output low NSC and RSC scores.

Here’s the code for cross-correlation analysis

# phantompeakqualtools是一个用于计算ChIP-Seq数据富集和质量度量值的R包

less ./List/Sample_list.txt |while read list;

less ./List/${list} |while read id;

Rscript --max-ppsize=500000 \

/home/octavia/miniconda3/envs/chip-qc/pkgs/phantompeakqualtools-1.2.2-hdfd78af_1/bin/run_spp.R \

-c=./bam/${id}_rmdup.bam -out=./bam_qc/${id}_bam.qc -p 3 > ./bam_qc_log/${id}_bam.rlog 2>&1

done &

done

cat *.qc >> bam_qc.txt

'''

* COL9: Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8

* COL10: Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8)

* COL11: QualityTag: Quality tag based on thresholded RSC

(codes: -2:veryLow,-1:Low,0:Medium,1:High,2:veryHigh)

'''

Library Complexity

Library complexity refers to the number of unique DNA fragments present in a given library.

Reductions in complexity resulting from PCR amplification during library preparation will ultimately compromise downstream analyses via an elevation in the number of duplicate reads.

PCR-associated duplication artifacts can result from: inadequate amounts of starting material (genomic DNA, cDNA, etc.), losses during cleanups, and size selection issues. Duplicate reads can also arise from optical duplicates resulting from sequencing-machine optical sensor artifacts.

⁠

⁠

PCR Bottlenecking Coefficient 1 (PBC1)

PBC1=M1/MDISTINCT where

M1: number of genomic locations where exactly one read maps uniquely

MDISTINCT: number of distinct genomic locations to which some read maps uniquely

PCR Bottlenecking Coefficient 2 (PBC2)

PBC2=M1/M2 where

M1: number of genomic locations where only one read maps uniquely

M2: number of genomic locations where two reads map uniquely

Non-Redundant Fraction (NRF)

Number of distinct uniquely mapping reads (i.e. after removing duplicates) / Total number of reads.

⁠

⁠

# library complexity for single sample

bedtools bamtobed -i ./bam/SRR15036576_rmdup.bam | \

awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$6}' | grep -v 'chrM' | sort | uniq -c | \

awk 'BEGIN{mt=0;m0=0;m1=0;m2=0} ($1==1){m1=m1+1} ($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1} END{m1_m2=-1.0; if(m2>0) m1_m2=m1/m2; ...

... printf "%s\t%d\t%d\t%d\t%d\t%f\t%f\t%f\n","SRR15036576",mt,m0,m1,m2,m0/mt,m1/m0,m1_m2}' > SRR15036576_pbc.qc 2>&1 &

# library complexity for multiple samples

less SRR_Acc_List.txt |while read id;

bedtools bamtobed -i ./bam/SRR15036576_rmdup.bam | \

awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$6}' | grep -v 'chrM' | sort | uniq -c | \

awk -v label="%{id}" 'BEGIN{mt=0;m0=0;m1=0;m2=0} ($1==1){m1=m1+1} ($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1} END{m1_m2=-1.0; if(m2>0) m1_m2=m1/m2; ...

... printf "%s\t%d\t%d\t%d\t%d\t%f\t%f\t%f\n",label,mt,m0,m1,m2,m0/mt,m1/m0,m1_m2}' >> lib.qc 2>&1

done &

# m0/mt = NRF,m1/m0 = PBC1, m1/m2 = PBC2

scp lib.qc lab@137.189.43.55:/home/lab/tmp_oct

Measuring global ChIP enrichment

Fraction of Reads in Peaks(FRiP/RiP)

This metric represents the percentage of reads that overlap ‘called peaks’. It can be considered a “signal-to-noise” measure of what proportion of the library consists of fragments from binding sites vs. background reads.

RiP (also called FRiP) values will vary depending on the protein of interest:

A typical good quality TF (sharp/narrow peaks) with successful enrichment would exhibit a RiP around 5% or higher.

A good quality Pol2 (mix of sharp/narrow and dispersed/broad peaks) would exhibit a RiP of 30% or higher.

There are also known examples of good datasets with RiP < 1% (i.e. RNAPIII or a protein that binds few sites).

FRiP(RiP) the percentage of reads that overlap ‘call peaks’

FRiP score ＝ reads_in_peaks / total_reads

FRiP for singal sample

# FRiP(RiP) the percentage of reads that overlap ‘call peaks’

Want to print your doc?
This is not the way.

Try clicking the ⋯ next to your doc name or using a keyboard shortcut (

CtrlP

) instead.

Evaluation of ChIP-seq data

Alignment quality

Cross-Correlation Analysis

How are the Cross-Correlation scores calculated?

​Normalized strand cross-correlation coefficient (NSC):

Relative strand cross-correlation coefficient (RSC):

Library Complexity

Measuring global ChIP enrichment

Fraction of Reads in Peaks(FRiP/RiP)

Normalized strand cross-correlation coefficient (NSC):