Chapter 8: Peak Calling & QC

Learning Objectives


The Concept: From Maps to Mountains

In the previous chapter, we successfully aligned our reads to the genome. We now have a file (final_clean_peaks.bed) containing coordinates for thousands of DNA fragments. But a list of coordinates is not a biological result. We need to answer the core question of the experiment: Where is the chromatin open?

Imagine looking at a map of a city where every person is represented by a tiny dot. If you wanted to find the stadiums or concert halls, you would look for the places where thousands of dots are piled on top of each other. In bioinformatics, these "piles" of reads are called Peaks.

A "Peak" represents a genomic region where the Tn5 transposase was able to access the DNA frequently, implying that the chromatin there was open and active. Peak Calling is the statistical process of distinguishing these significant mountains of reads from the random, scattered stones (background noise) across the genome.


Phase 1: The Reality Check (Quantifying Your Filter)

Before we call peaks, we must perform a "Sanity Check" on the work we did in the Alignment chapter. We claimed that we filtered out PCR duplicates and low-quality reads, but did it actually work?

We can verify this using a simple command called wc (Word Count). By using the -l flag, we ask the computer to count the number of lines (reads) in our files.

Step 1: Check the "Dirty" File

First, let's look at the alignment file before we ran the filters.

# Count lines in the raw alignment

wc -l results/chromap/aligned.bed

Expected Output:

44509 results/chromap/aligned.bed

Step 2: Check the "Clean" File

Now, let's look at the file after we applied the duplicate removal and quality filters.

# Count lines in the filtered alignment

wc -l results/chromap/aligned_filtered.bed

Expected Output:

37750 results/chromap/aligned_filtered.bed

The Analysis:

Compare the numbers: 44,509 vs 37,750.

You effectively removed 6,759 reads (about 15% of your data).


Phase 2: The Tool (MACS2)

To find our peaks, we will use a tool called MACS2 (Model-based Analysis for ChIP-Seq).

First, let's install it into our environment.

conda install -c bioconda macs2

Setup:

We need a place to store our "Treasure" (the peak files).

# Navigate to your project folder if you aren't there

cd atacseq

# Create a directory for peaks

mkdir -p results/peaks

Phase 3: The Execution (Calling the Peaks)

This is the single most important command in the analysis. However, standard MACS2 settings are for ChIP-seq. For ATAC-seq, we must modify the parameters to account for the biology of the Tn5 transposase.

The Biological Twist:

The Tn5 transposase binds to DNA as a dimer (two copies attached). When it cuts the DNA, it actually sits on a 9bp region. This means the "cut site" recorded by the sequencer is slightly offset from the actual center of the open chromatin.

The Command:

Run this as a single line.

macs2 callpeak -t results/chromap/final_clean_peaks.bed \

               -f BED \

               -n results/peaks/ENCFF535OGL \

               -g mm \

               --nomodel \

               --shift -100 \

               --extsize 200 \

               -q 0.01

Parameter Breakdown:

Expected Output:

MACS2 is very chatty. It will print a log of its statistical modeling to your screen.

INFO  @ Wed, 10 Dec 2025 01:17:49: #1 read tag files...

INFO  @ Wed, 10 Dec 2025 01:17:49: #1 read treatment tags...

INFO  @ Wed, 10 Dec 2025 01:17:49: Detected format is: BED

INFO  @ Wed, 10 Dec 2025 01:17:49: #1 tag size is determined as 47 bps

INFO  @ Wed, 10 Dec 2025 01:17:49: #1 total tags in treatment: 37750

...

INFO  @ Wed, 10 Dec 2025 01:17:49: #3 Call peaks...

INFO  @ Wed, 10 Dec 2025 01:17:49: #3 write output to results/peaks/ENCFF535OGL_peaks.narrowPeak ...

If you see "Done!" or the log finishes without an "ERROR" message, you have successfully mined your data.


Phase 4: The Treasure (Interpreting Results)

Go to your results folder and look at what you created.

ls results/peaks/

You will see three key files. These are the standard output formats used by almost all downstream tools (like R, Homer, or Genome Browsers).

1. The "Map": ENCFF535OGL_peaks.narrowPeak

This is the most important file. It is a text file (BED format) listing the exact coordinates of every open chromatin region found.

2. The "Summit": ENCFF535OGL_summits.bed

While the narrowPeak gives you the full width of the mountain (from the base on the left to the base on the right), this file gives you a single pixel (1bp) coordinate for the absolute highest point of the peak.

3. The "Spreadsheet": ENCFF535OGL_peaks.xls

Despite the extension, this is a text file you can open in Excel. It contains the same information as the narrowPeak file but includes extra statistics like fold_enrichment (how much higher is this peak compared to background noise?) and -log10(pvalue) (how statistically significant is it?).

Congratulations! You have transformed raw, noisy sequencing data into a concise list of biologically meaningful genomic locations. In the next chapter, we will visualize these files to verify our results with our own eyes.

← Back: Genome Alignment Home