Chapter 7: Genome Alignment – Placing the Puzzle Pieces

Learning Objectives


The Challenge: A Million Piece Puzzle

In the previous chapters, we quality-checked and trimmed our data. We now hold a file containing approximately one million "clean" DNA sequences, each about 48 base pairs long. However, these sequences are currently meaningless strings of letters (A, C, G, T). We do not know where in the genome they came from.

Imagine you have a 1,000,000-piece puzzle, but no box to guide you. Genome alignment is the process of taking each tiny puzzle piece (a read) and finding its exact location on the box picture (the Reference Genome).

For our mouse data, we use the mm10 reference genome (also known as GRCm38). Our goal is to ask the computer: "Does this 48bp sequence match a gene on Chromosome 1? Or a regulatory region on Chromosome 19?" Once we map all the reads, we will look for "piles" of reads stacked in one location—these piles indicate open chromatin.

Tool Selection: Why Chromap?

In bioinformatics, choosing the right tool is half the battle. You may have heard of "classic" aligners like Bowtie2 or BWA, or perhaps STAR for RNA-seq. For this handbook, we have chosen Chromap.

Why deviate from the classics? In a traditional pipeline (like the ones used in major consortiums), alignment is a multi-step, slow process:

  1. Align reads with Bowtie2 (Time: Hours).
  2. Sort the huge output file with Samtools (Time: Hours).
  3. Mark and remove PCR duplicates with Picard (Time: Hours).

Chromap changes the game. It is a modern tool optimized specifically for chromatin biology (ATAC-seq, ChIP-seq). It is incredibly fast and, most importantly, it can perform alignment, duplicate removal, and quality filtering all at once.

Tool Spotlight: Know Your Aligners

Tool Best Application Advantages Disadvantages Final Choice
Chromap ATAC-seq, ChIP-seq, Hi-C Ultra-fast speed. Handles trimming, mapping, and duplicate removal in one command. Optimized for chromatin data. Newer tool, so fewer community tutorials exist compared to Bowtie2. SELECTED: The most efficient choice for modern ATAC-seq pipelines.
Bowtie2 Standard DNA (WGS, ChIP) The "Gold Standard." Highly accurate, robust, and cited in thousands of papers. Slower than Chromap. Requires separate, manual steps for sorting and cleaning data. Reference Only: Good for traditional pipelines, but too slow for our needs.
STAR RNA-seq Handles "spliced" reads (jumping over introns). Essential for RNA analysis. High Memory Usage. Designed to align RNA, not genomic DNA. SKIP: Chemically incorrect for ATAC-seq (which sequences DNA).

Phase 1: Installation and Setup

First, we must install Chromap into our environment.

conda activate bio-tools

conda install -c bioconda -c conda-forge chromap

Next, we need to prepare our files. Following our "Tidy Data" principles, let's create a dedicated folder for our alignment results so they don't get lost.

# Navigate to your project folder

cd atacseq

# Create a specific folder for chromap output

mkdir -p results/chromap

Phase 2: The Reference Genome (Toy vs. Real)

To map our reads, we need a map. In a professional research setting using a High-Performance Computing (HPC) cluster, you would download the entire mouse genome (all 20 chromosomes + scaffolds).

However, indexing a full mammalian genome requires significant RAM (often >32GB), which will crash a standard laptop. Since this handbook is designed to be accessible to everyone, we will use a "Toy Reference": we will align our reads only to Chromosome 19. This single chromosome is small and gene-rich, allowing us to demonstrate the exact principles of alignment in seconds rather than hours.

Step 1: Download the Reference

We will download the sequence for Chromosome 19 from the UCSC Genome Browser database.

# Move to the reference data folder

cd reference_data

# Download only Mouse Chromosome 19 (mm10 build)

wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/chromosomes/chr19.fa.gz

# Unzip the file so we can use it

gunzip chr19.fa.gz

Step 2: Build the Index

Before a computer can search a genome, it must build an Index. Think of this like the index at the back of a massive textbook. Without it, finding a specific sentence would require reading the whole book every time. With an index, the computer can jump instantly to the right location.

# Build the index for Chromosome 19

# -i: Index mode

# -r: Input reference file (fasta)

# -o: Output index file name

chromap -i -r chr19.fa -o chr19.index

Expected Output:

Build index for the reference.

Kmer length: 17, window size: 7

Loaded all sequences successfully...

Collecting minimizers...

Sorting minimizers...

Index built successfully.

Phase 3: The Alignment (Mapping the Reads)

Now we are ready to map. We will take our trimmed reads from the previous chapter and align them to our new chr19 index.

For this specific step, we are going to run a "Raw Alignment" first, just to see how it works. We will not apply filters yet.

Note on Sequencing Type:

In a real experiment, ATAC-seq is almost always Paired-End (Read 1 and Read 2), because knowing the distance between the two ends tells us the size of the DNA fragment. Chromap has a specific flag --preset atac that automatically handles paired-end data.

However, for this "Toy Dataset," we are processing it as Single-End to keep things simple and fast. Therefore, we use the standard flags.

# Navigate back to the project root

cd ..

# Run Alignment

chromap -x reference_data/chr19.index \

        -r reference_data/chr19.fa \

        -1 results/trimmed/ENCFF535OGL_subset_trimmed.fastq.gz \

        -o results/chromap/aligned.bed

Expected Output:

Mapping...

Read file: results/trimmed/ENCFF535OGL_subset_trimmed.fastq.gz

Total reads: 1,000,000

Mapped reads: 85,432  (Note: Low mapping is expected because we are only mapping to 1 chromosome out of 20!)

Don't panic if your mapping rate is low. Since we threw away 95% of the genome (Chr 1-18, X, Y) to save RAM, we expect only the reads that truly belong to Chr 19 to align.


Phase 4: The Clean-Up (Filtering)

We now have an aligned file (aligned.bed), but it is "dirty." It contains three types of villains that ruin ATAC-seq analysis:

  1. Low-Quality Maps: Reads where the aligner wasn't sure if it belonged to gene A or gene B. We filter these by Mapping Quality (MapQ) Score. We typically want a score > 30.
  2. PCR Duplicates: During library prep, PCR can accidentally make millions of copies of the exact same DNA fragment. These are artificial clones, not real data. We must remove them.
  3. Mitochondrial Reads: The Tn5 enzyme loves mitochondrial DNA because it is "open" (no nucleosomes). A huge portion of ATAC-seq data is just mitochondrial noise.

We will now re-run the alignment, but this time we will turn on Chromap's powerful internal filters to handle villains #1 and #2 automatically.

Step 1: Align with Filters

# -q 30: Only keep reads with mapping quality > 30 (high confidence)

# --remove-pcr-duplicates: Automatically detect and delete PCR clones

chromap -x reference_data/chr19.index \

        -r reference_data/chr19.fa \

        -1 results/trimmed/ENCFF535OGL_subset_trimmed.fastq.gz \

        -o results/chromap/aligned_filtered.bed \

        --remove-pcr-duplicates \

        -q 30

Why .bed format?

You may notice we are outputting a .bed file instead of the massive .bam files used in older pipelines. Chromap is optimized to produce BED fragment files. They are significantly smaller (text-based) and faster to generate, fitting our goal of a lightweight, efficient pipeline. They contain all the necessary information (Chromosome, Start, End) for the next step: Peak Calling.

Step 2: Remove Mitochondrial Reads

Chromap handles duplicates and quality, but it leaves the Mitochondria ("chrM") behind. Since our output is a simple text file, we can use the command-line tool grep to physically remove any line that mentions "chrM".

# grep -v means "Select everything EXCEPT..."

# We select every line that does NOT contain "chrM"

grep -v "chrM" results/chromap/aligned_filtered.bed > results/chromap/final_clean_peaks.bed

Conclusion

You now have a file named final_clean_peaks.bed.

This file is the gold standard of your raw data. It contains only unique, high-quality, non-mitochondrial reads that have been successfully mapped to the genome. Each line in this file represents a real, biological instance of open chromatin.

In the next chapter, we will stack these reads on top of each other to find the "mountains" of data—the Peaks.

← Back: Adapter Trimming Home