icon picker
ChIP-seq Data Analysis

From the downloading of SRR in NCBI to IDR of repetitions

Workflow

屏幕截图(327).png

the toolkit

屏幕截图(328).png

plug and play approach

屏幕截图(329).png

the process

屏幕截图(330).png

Download

First we need to download ChIP-seq data from ncbi
屏幕截图(323).png
Click BioSample
屏幕截图(324).png
Click the bottom SRA
屏幕截图(325).png
Click All runs
屏幕截图(326).png
Using Accession List to download a "SRR_Acc_List.txt"
# download from ncbi using sratools

nohup prefetch --option-file SRR_Acc_List.txt &

#nohup ... & hang in background

#sra --> fq.gz

less SRR_Acc_List.txt |while read id;
do
fastq-dump --split-3 --gzip ${id} > ${id}.log 2>&1
done &

#remove original file

less SRR_Acc_List.txt |while read id;
do
mv ${id} ${id}.rm
done

rm -rf *rm

Trim Adapter

why need to trim adapter

Adapters are short DNA sequences that are added to the ends of DNA fragments during library preparation. These adapters contain necessary sequences for downstream processes, such as priming sites for amplification or binding sites for sequencing platforms.
image.png
In common short read sequencing, the DNA insert (original molecule to be sequenced) is downstream from the read primer, meaning that the 5' adapters will not appear in the sequenced read. But, if the fragment is shorter than the number of bases sequenced, one will sequence into the 3' adapter. To make it clear: In Illumina sequencing, adapter sequences will only occur at the 3' end of the read and only if the DNA insert is shorter than the number of sequencing cycles (see picture below)!
image.png

# first split the 227 num list into 7 (7*33) part

split -l 33 SRR_Acc_List.txt Sample_list_add

# using fastp to trim adapter and get clean data

less ./List/Sample_list.txt |while read list;
do
less ./List/${list} |while read id;
do
fastp -i ./rawdata/${id}_1.fastq.gz -o ./clean_data/${id}_1_clean.fastq \
-I ./rawdata/${id}_2.fastq.gz -O ./clean_data/${id}_2_clean.fastq -w 2 >> ./clean_data/fastp.log 2>&1
done &
done

Mapping

Mapping

In ChIP-seq (Chromatin Immunoprecipitation sequencing), mapping refers to the process of aligning or mapping the sequencing reads obtained from the ChIP-seq experiment to a reference genome. The purpose of mapping is to determine the genomic locations of the DNA fragments that were enriched through the immunoprecipitation step.
image.png

View & Sort

View is to filter those good quality mapping results
Sort is to reorder data by location

Remove Duplication

Duplicated reads refer to sequencing reads that are identical or nearly identical, indicating potential PCR amplification artifacts or technical biases during library preparation.
PCR amplification artifacts: During library preparation, PCR amplification is often employed to generate enough DNA fragments for sequencing. However, PCR amplification can introduce biases, leading to the over-amplification of certain fragments and the creation of duplicate reads. Removing duplicate reads helps mitigate the impact of PCR biases and provides a more accurate representation of the actual DNA fragments in the sample.
less ./List/Sample_list.txt |while read list;
do
less ./List/${list} |while read id;
do
bowtie2 -p 2 \
-x /home/octavia/ref/niben261/Niben261_genome \
-1 ./clean_data/${id}_1_clean.fastq \
-2 ./clean_data/${id}_2_clean.fastq 2> ./mapping_log/${id}_mapping.log | \
samtools view -h -Sb -F 12 -f 3 -q 10 -@ 1 - | \
samtools sort -@ 2 - 2>> ./mapping_log/${id}_mapping.log | \
samtools rmdup - ${id}_rmdup.bam 2>> ./mapping_log/${id}_mapping.log
done &
done

Call peak

image.png
Peak calling is commonly used in ChIP-seq (Chromatin Immunoprecipitation sequencing) and other sequencing-based assays to identify genomic regions where DNA-binding proteins, histone modifications, or other chromatin features are localized.
less ./List/sample_list.txt |while read list;
do
less ./List/${list} |while read id;
do
macs2 callpeak --call-summits -f BAMPE -g 2e9 -p 0.05 \
-t ./bam/${id}_rmdup.bam \
-c ./bam/SRR15694072_rmdup.bam \
--outdir ./peak -n ${id} > ./peak_log/${id}.log 2>&1
done &
done

# get sub peaks
# summit + 75 & summit -75

less SRR_Acc_List.txt |while read id;
do
awk '{OFS="\t"}{$2=$2+$10-75;$3=$2+150;$10=75; print $0}' ./peak/${id}_peaks.narrowPeak > ./sub_peak/${id}_sub_peaks.narrowPeak
done

Assessment of Reproducibility

IDR (irreproducible discovery rate)

Each list of peaks is ranked according to p-value or signal score
################################################################################################
## IDR script (shell)
cat $1 | while read i; do

echo log > ./idr_log/IDR_process.log

# process config file
set IFS=$"\t"; array=($i); unset IFS
sample_name=${array[0]}
sample1=${array[1]}
sample2=${array[2]}
echo "$sample_name $sample1 $sample2" >> ./idr_log/IDR_process.log

idr --samples ./sub_peak/${sample1}_sub_peaks.narrowPeak ./sub_peak/${sample2}_sub_peaks.narrowPeak \
--input-file-type narrowPeak --rank signal.value --idr-threshold 0.01 \
--output-file ./idr/${sample_name}_overlap_peak.txt > ./idr_log/${sample_name}_IDR.log 2>&1
done &

##################################################################################################
bash IDR.sh sample_pair.txt
In this step, we can know which sort of peaks can repreasent the repetition
Then we drag them out
################################################################################################
## merge summit peak script (shell)
cat $1 | while read i; do

echo log > ./merge/merge_process.log

# process config file
set IFS=$"\t"; array=($i); unset IFS
sample_name=${array[0]}
sample1=${array[1]}
sample2=${array[2]}
echo "$sample_name $sample1 $sample2" >> ./idr_log/IDR_process.log
echo $sample_name
awk '{OFS="\t"}{if (NR==FNR) {id=$1$2;a[id]=$4;col8=$8;col9=$9;col10=$10;} \
else {id=$1$13; pos1=int(($2+$3)/2-75); pos2=pos1+150; print $1"\t"pos1"\t"pos2"\t"a[id]"\t"$5"\t"$6"\t"($7)/2"\t"col8"\t"col9"\t"col10}}' \
./sub_peak/${sample1}_sub_peaks.narrowPeak ./idr/${sample_name}_overlap_peak.txt > ./merge/${sample_name}_merged_sp.narrowPeak

done

##################################################################################################
bash merge_peak.sh sample_pair.txt
Want to print your doc?
This is not the way.
Try clicking the ⋯ next to your doc name or using a keyboard shortcut (
CtrlP
) instead.