Chapter 8: Peak Calling & QC
Learning Objectives
- Conceptualize "Peak Calling" as the statistical identification of signal over background noise.
- Quantify the impact of your filtering steps by comparing raw versus clean alignment files.
- Explain the biological necessity of the
--shiftand--extsizeparameters for ATAC-seq data. - Execute the MACS2 algorithm to identify open chromatin regions.
- Interpret the standard output files (
narrowPeak,summits,xls) that contain your biological discoveries.
The Concept: From Maps to Mountains
In the previous chapter, we successfully aligned our reads to the genome. We now have a file (final_clean_peaks.bed) containing coordinates for thousands of DNA fragments. But a list of coordinates is not a biological result. We need to answer the core question of the experiment: Where is the chromatin open?
Imagine looking at a map of a city where every person is represented by a tiny dot. If you wanted to find the stadiums or concert halls, you would look for the places where thousands of dots are piled on top of each other. In bioinformatics, these "piles" of reads are called Peaks.
A "Peak" represents a genomic region where the Tn5 transposase was able to access the DNA frequently, implying that the chromatin there was open and active. Peak Calling is the statistical process of distinguishing these significant mountains of reads from the random, scattered stones (background noise) across the genome.
Phase 1: The Reality Check (Quantifying Your Filter)
Before we call peaks, we must perform a "Sanity Check" on the work we did in the Alignment chapter. We claimed that we filtered out PCR duplicates and low-quality reads, but did it actually work?
We can verify this using a simple command called wc (Word Count). By using the -l flag, we ask the computer to count the number of lines (reads) in our files.
Step 1: Check the "Dirty" File
First, let's look at the alignment file before we ran the filters.
# Count lines in the raw alignment wc -l results/chromap/aligned.bed
Expected Output:
44509 results/chromap/aligned.bed
Step 2: Check the "Clean" File
Now, let's look at the file after we applied the duplicate removal and quality filters.
# Count lines in the filtered alignment wc -l results/chromap/aligned_filtered.bed
Expected Output:
37750 results/chromap/aligned_filtered.bed
The Analysis:
Compare the numbers: 44,509 vs 37,750.
You effectively removed 6,759 reads (about 15% of your data).
- Is this bad? No, this is excellent!
- What were they? These were likely PCR duplicates (artificial clones) or low-confidence mappings that would have created false signals in your analysis. You have successfully "de-noised" your data.
Phase 2: The Tool (MACS2)
To find our peaks, we will use a tool called MACS2 (Model-based Analysis for ChIP-Seq).
- Wait, ChIP-seq? Yes. Although the tool was originally designed for ChIP-seq (detecting protein binding), its mathematics are so robust that it became the "Gold Standard" for ATAC-seq as well.
First, let's install it into our environment.
conda install -c bioconda macs2
Setup:
We need a place to store our "Treasure" (the peak files).
# Navigate to your project folder if you aren't there cd atacseq # Create a directory for peaks mkdir -p results/peaks
Phase 3: The Execution (Calling the Peaks)
This is the single most important command in the analysis. However, standard MACS2 settings are for ChIP-seq. For ATAC-seq, we must modify the parameters to account for the biology of the Tn5 transposase.
The Biological Twist:
The Tn5 transposase binds to DNA as a dimer (two copies attached). When it cuts the DNA, it actually sits on a 9bp region. This means the "cut site" recorded by the sequencer is slightly offset from the actual center of the open chromatin.
- To fix this, we use
--shift -100and--extsize 200. - In plain English: This tells the software, "Take the start of the read, shift it back 100 bases, and then extend it 200 bases to smooth it out." This centers our signal exactly over the open chromatin region.
The Command:
Run this as a single line.
macs2 callpeak -t results/chromap/final_clean_peaks.bed \
-f BED \
-n results/peaks/ENCFF535OGL \
-g mm \
--nomodel \
--shift -100 \
--extsize 200 \
-q 0.01
Parameter Breakdown:
-t: Treatment file (your clean reads).-f BED: Format (we are using the BED file from Chromap).-g mm: Genome size (mm = Mus musculus/Mouse). This is essential for calculating statistics.--nomodel: Tells MACS2 "Don't try to guess the fragment length; we already know it because this is ATAC-seq."-q 0.01: The significance cutoff (False Discovery Rate). We are being strict: we only want peaks that are statistically very likely to be real.
Expected Output:
MACS2 is very chatty. It will print a log of its statistical modeling to your screen.
INFO @ Wed, 10 Dec 2025 01:17:49: #1 read tag files... INFO @ Wed, 10 Dec 2025 01:17:49: #1 read treatment tags... INFO @ Wed, 10 Dec 2025 01:17:49: Detected format is: BED INFO @ Wed, 10 Dec 2025 01:17:49: #1 tag size is determined as 47 bps INFO @ Wed, 10 Dec 2025 01:17:49: #1 total tags in treatment: 37750 ... INFO @ Wed, 10 Dec 2025 01:17:49: #3 Call peaks... INFO @ Wed, 10 Dec 2025 01:17:49: #3 write output to results/peaks/ENCFF535OGL_peaks.narrowPeak ...
If you see "Done!" or the log finishes without an "ERROR" message, you have successfully mined your data.
Phase 4: The Treasure (Interpreting Results)
Go to your results folder and look at what you created.
ls results/peaks/
You will see three key files. These are the standard output formats used by almost all downstream tools (like R, Homer, or Genome Browsers).
1. The "Map": ENCFF535OGL_peaks.narrowPeak
This is the most important file. It is a text file (BED format) listing the exact coordinates of every open chromatin region found.
- Column 1: Chromosome (e.g., chr19)
- Column 2: Start of the peak
- Column 3: End of the peak
- Usage: You will load this into genome browsers to visualize the "mountains" or send it to tools like Homer to ask "Which genes are near these peaks?"
2. The "Summit": ENCFF535OGL_summits.bed
While the narrowPeak gives you the full width of the mountain (from the base on the left to the base on the right), this file gives you a single pixel (1bp) coordinate for the absolute highest point of the peak.
- Biological Meaning: This is likely the exact center of the regulatory element (enhancer or promoter) where transcription factors are binding.
3. The "Spreadsheet": ENCFF535OGL_peaks.xls
Despite the extension, this is a text file you can open in Excel. It contains the same information as the narrowPeak file but includes extra statistics like fold_enrichment (how much higher is this peak compared to background noise?) and -log10(pvalue) (how statistically significant is it?).
Congratulations! You have transformed raw, noisy sequencing data into a concise list of biologically meaningful genomic locations. In the next chapter, we will visualize these files to verify our results with our own eyes.