Chapter 7: Genome Alignment – Placing the Puzzle Pieces
Learning Objectives
- Conceptualize genome alignment as mapping millions of short DNA fragments to a reference coordinate system.
- Justify the selection of Chromap over traditional tools like Bowtie2 or STAR for ATAC-seq analysis.
- Construct a genome index to allow for rapid data querying.
- Execute an alignment workflow that maps reads, removes technical artifacts (PCR duplicates), and filters for quality in a single step.
- Clean your dataset by removing mitochondrial noise to prepare for peak calling.
The Challenge: A Million Piece Puzzle
In the previous chapters, we quality-checked and trimmed our data. We now hold a file containing approximately one million "clean" DNA sequences, each about 48 base pairs long. However, these sequences are currently meaningless strings of letters (A, C, G, T). We do not know where in the genome they came from.
Imagine you have a 1,000,000-piece puzzle, but no box to guide you. Genome alignment is the process of taking each tiny puzzle piece (a read) and finding its exact location on the box picture (the Reference Genome).
For our mouse data, we use the mm10 reference genome (also known as GRCm38). Our goal is to ask the computer: "Does this 48bp sequence match a gene on Chromosome 1? Or a regulatory region on Chromosome 19?" Once we map all the reads, we will look for "piles" of reads stacked in one location—these piles indicate open chromatin.
Tool Selection: Why Chromap?
In bioinformatics, choosing the right tool is half the battle. You may have heard of "classic" aligners like Bowtie2 or BWA, or perhaps STAR for RNA-seq. For this handbook, we have chosen Chromap.
Why deviate from the classics? In a traditional pipeline (like the ones used in major consortiums), alignment is a multi-step, slow process:
- Align reads with Bowtie2 (Time: Hours).
- Sort the huge output file with Samtools (Time: Hours).
- Mark and remove PCR duplicates with Picard (Time: Hours).
Chromap changes the game. It is a modern tool optimized specifically for chromatin biology (ATAC-seq, ChIP-seq). It is incredibly fast and, most importantly, it can perform alignment, duplicate removal, and quality filtering all at once.
Tool Spotlight: Know Your Aligners
| Tool | Best Application | Advantages | Disadvantages | Final Choice |
|---|---|---|---|---|
| Chromap | ATAC-seq, ChIP-seq, Hi-C | Ultra-fast speed. Handles trimming, mapping, and duplicate removal in one command. Optimized for chromatin data. | Newer tool, so fewer community tutorials exist compared to Bowtie2. | SELECTED: The most efficient choice for modern ATAC-seq pipelines. |
| Bowtie2 | Standard DNA (WGS, ChIP) | The "Gold Standard." Highly accurate, robust, and cited in thousands of papers. | Slower than Chromap. Requires separate, manual steps for sorting and cleaning data. | Reference Only: Good for traditional pipelines, but too slow for our needs. |
| STAR | RNA-seq | Handles "spliced" reads (jumping over introns). Essential for RNA analysis. | High Memory Usage. Designed to align RNA, not genomic DNA. | SKIP: Chemically incorrect for ATAC-seq (which sequences DNA). |
Phase 1: Installation and Setup
First, we must install Chromap into our environment.
conda activate bio-tools conda install -c bioconda -c conda-forge chromap
Next, we need to prepare our files. Following our "Tidy Data" principles, let's create a dedicated folder for our alignment results so they don't get lost.
# Navigate to your project folder cd atacseq # Create a specific folder for chromap output mkdir -p results/chromap
Phase 2: The Reference Genome (Toy vs. Real)
To map our reads, we need a map. In a professional research setting using a High-Performance Computing (HPC) cluster, you would download the entire mouse genome (all 20 chromosomes + scaffolds).
However, indexing a full mammalian genome requires significant RAM (often >32GB), which will crash a standard laptop. Since this handbook is designed to be accessible to everyone, we will use a "Toy Reference": we will align our reads only to Chromosome 19. This single chromosome is small and gene-rich, allowing us to demonstrate the exact principles of alignment in seconds rather than hours.
Step 1: Download the Reference
We will download the sequence for Chromosome 19 from the UCSC Genome Browser database.
# Move to the reference data folder cd reference_data # Download only Mouse Chromosome 19 (mm10 build) wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/chromosomes/chr19.fa.gz # Unzip the file so we can use it gunzip chr19.fa.gz
Step 2: Build the Index
Before a computer can search a genome, it must build an Index. Think of this like the index at the back of a massive textbook. Without it, finding a specific sentence would require reading the whole book every time. With an index, the computer can jump instantly to the right location.
# Build the index for Chromosome 19 # -i: Index mode # -r: Input reference file (fasta) # -o: Output index file name chromap -i -r chr19.fa -o chr19.index
Expected Output:
Build index for the reference. Kmer length: 17, window size: 7 Loaded all sequences successfully... Collecting minimizers... Sorting minimizers... Index built successfully.
Phase 3: The Alignment (Mapping the Reads)
Now we are ready to map. We will take our trimmed reads from the previous chapter and align them to our new chr19 index.
For this specific step, we are going to run a "Raw Alignment" first, just to see how it works. We will not apply filters yet.
Note on Sequencing Type:
In a real experiment, ATAC-seq is almost always Paired-End (Read 1 and Read 2), because knowing the distance between the two ends tells us the size of the DNA fragment. Chromap has a specific flag --preset atac that automatically handles paired-end data.
However, for this "Toy Dataset," we are processing it as Single-End to keep things simple and fast. Therefore, we use the standard flags.
# Navigate back to the project root
cd ..
# Run Alignment
chromap -x reference_data/chr19.index \
-r reference_data/chr19.fa \
-1 results/trimmed/ENCFF535OGL_subset_trimmed.fastq.gz \
-o results/chromap/aligned.bed
Expected Output:
Mapping... Read file: results/trimmed/ENCFF535OGL_subset_trimmed.fastq.gz Total reads: 1,000,000 Mapped reads: 85,432 (Note: Low mapping is expected because we are only mapping to 1 chromosome out of 20!)
Don't panic if your mapping rate is low. Since we threw away 95% of the genome (Chr 1-18, X, Y) to save RAM, we expect only the reads that truly belong to Chr 19 to align.
Phase 4: The Clean-Up (Filtering)
We now have an aligned file (aligned.bed), but it is "dirty." It contains three types of villains that ruin ATAC-seq analysis:
- Low-Quality Maps: Reads where the aligner wasn't sure if it belonged to gene A or gene B. We filter these by Mapping Quality (MapQ) Score. We typically want a score > 30.
- PCR Duplicates: During library prep, PCR can accidentally make millions of copies of the exact same DNA fragment. These are artificial clones, not real data. We must remove them.
- Mitochondrial Reads: The Tn5 enzyme loves mitochondrial DNA because it is "open" (no nucleosomes). A huge portion of ATAC-seq data is just mitochondrial noise.
We will now re-run the alignment, but this time we will turn on Chromap's powerful internal filters to handle villains #1 and #2 automatically.
Step 1: Align with Filters
# -q 30: Only keep reads with mapping quality > 30 (high confidence)
# --remove-pcr-duplicates: Automatically detect and delete PCR clones
chromap -x reference_data/chr19.index \
-r reference_data/chr19.fa \
-1 results/trimmed/ENCFF535OGL_subset_trimmed.fastq.gz \
-o results/chromap/aligned_filtered.bed \
--remove-pcr-duplicates \
-q 30
Why .bed format?
You may notice we are outputting a .bed file instead of the massive .bam files used in older pipelines. Chromap is optimized to produce BED fragment files. They are significantly smaller (text-based) and faster to generate, fitting our goal of a lightweight, efficient pipeline. They contain all the necessary information (Chromosome, Start, End) for the next step: Peak Calling.
Step 2: Remove Mitochondrial Reads
Chromap handles duplicates and quality, but it leaves the Mitochondria ("chrM") behind. Since our output is a simple text file, we can use the command-line tool grep to physically remove any line that mentions "chrM".
# grep -v means "Select everything EXCEPT..." # We select every line that does NOT contain "chrM" grep -v "chrM" results/chromap/aligned_filtered.bed > results/chromap/final_clean_peaks.bed
Conclusion
You now have a file named final_clean_peaks.bed.
This file is the gold standard of your raw data. It contains only unique, high-quality, non-mitochondrial reads that have been successfully mapped to the genome. Each line in this file represents a real, biological instance of open chromatin.
In the next chapter, we will stack these reads on top of each other to find the "mountains" of data—the Peaks.