Chapter 1: ATAC-seq — What it is and a beginner-friendly pipeline overview

This chapter explains the biological idea behind ATAC-seq (Assay for Transposase-Accessible Chromatin), why it is usually run with paired-end sequencing, and the main analysis stages, from raw FASTQ files to biological interpretation. Each section explains what happens, why it matters, and what to expect, so you can follow along without deep computational experience.

1. What is ATAC-seq? (short, conceptual)

ATAC-seq stands for Assay for Transposase-Accessible Chromatin using sequencing. It uses a hyperactive transposase (Tn5) loaded with sequencing adapters. Tn5 preferentially inserts adapters into regions of open (accessible) chromatin — locations where DNA is not tightly bound by nucleosomes or other proteins. After Tn5 insertion, those DNA fragments are amplified and sequenced. Regions with many insertions correspond to accessible regulatory elements (promoters, enhancers).

Key conceptual points:

Tn5 insertion equals signal: more insertions → more accessibility.
Fragments report chromatin structure: short fragments (under about 100 bp) usually come from nucleosome-free regions, while longer fragments show periodic lengths (multiples of roughly 150 to 200 bp, i.e. nucleosomal DNA plus linker) that reflect nucleosome spacing.
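Tn5 bias & offset: Tn5 acts as a dimer and its two adapter insertions are staggered by 9 bp, so the start of a read sits a few bases away from the centre of the insertion event. Many analyses therefore apply a small strand-specific shift (the Tn5 offset: +4 bp on the plus strand, −5 bp on the minus strand) so that reads line up with insertion events.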

Historical reference: Buenrostro et al. introduced ATAC-seq in 2013 (Nature Methods), and the method is now widely used in epigenetics research.

2. Single-end vs paired-end sequencing for ATAC-seq — short answer

Always prefer paired-end sequencing for ATAC-seq. Each ATAC fragment has two ends (two Tn5 insertion sites). Paired-end reads let you reconstruct the full fragment length and exactly where Tn5 inserted on both sides — this is essential for fragment-length analysis, nucleosome calling, and correct mapping of insertion sites. Using single-end reads discards half of that information and reduces the sensitivity of downstream analyses.

Typical recommendation for standard experiments: ~10 million read pairs per sample (i.e., 10M clusters passing filter). This is usually enough for routine differential accessibility and motif analyses in many sample types. Applications that need more depth, such as genotyping or TF footprinting, require more reads, so adjust accordingly.

3. High-level pipeline overview — three main stages

ATAC-seq analysis can be thought of in three major stages. Each stage breaks down into concrete steps (below) with the usual tools given as examples.

Stage A — Data processing (raw → cleaned alignments)

Quality control of raw reads (FastQC): check per-base quality, adapter content, overrepresented sequences.
Adapter trimming (Trim Galore / Cutadapt): remove sequencing adapters, especially important if many short fragments exist.
Alignment to the reference genome (Bowtie2 / BWA): map reads to genome; use parameters that preserve paired-end information and allow fragment length filtering.
Filtering (SAMtools / Picard): remove mitochondrial reads (often abundant in ATAC), low-quality mappings, and PCR duplicates (or mark duplicates depending on downstream needs).
Tn5 offset adjustment: shift alignments by +4/−5 bp (strand-specific) to center reads on cut sites for insertion calling.

Stage B — Peak calling & peak set construction

Peak calling per sample (MACS2 or pipeline defaults): find genomic regions with enriched insertions.
Merge peaks across samples to create a union peak set (so all samples are compared on the same feature set).
Count insertions (featureCounts / custom scripts): generate a matrix of insertion counts per peak per sample (this becomes your expression-like table for differential tests).

Stage C — Downstream analysis

Differential accessibility (DESeq2, edgeR, etc.): test which peaks change between conditions.
Motif enrichment (HOMER, MEME): identify TF motifs enriched in sets of peaks.
Peak annotation (ChIPseeker, HOMER): assign peaks to nearest genes, promoters, enhancers.
Visualization & integration (IGV / UCSC Genome Browser, ArchR for single-cell or advanced analyses): inspect tracks, integrate with RNA-seq, ChIP-seq, etc.

Recommended automatic pipelines

PEPATAC: portable, user-friendly, includes many QC metrics (TSS enrichment, fragment distribution). (See PEPATAC docs.)
ENCODE ATAC-seq pipeline: consortium-level, thoroughly tested pipeline used by ENCODE.
nf-core/atacseq: community pipeline with standardized inputs and reproducible containers.

4. Detailed, step-by-step walkthrough (what each step does & what to look for)

Step 1 — Pre-alignment QC (FastQC)

What: run FastQC on raw FASTQ files.

Why: find low quality reads, adapter contamination, unusual base composition.

What to look for: per-base quality (Phred), adapter content, overrepresented sequences. If adapters are present, trimming is required.
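FastQC reports quality as Phred scores. As a rough illustration of what those numbers mean, the sketch below decodes a FASTQ quality string, assuming the Phred+33 encoding used by current Illumina output. The function names are made up for this example and are not part of FastQC.

```python
from typing import List


def phred_scores(quality_line: str, offset: int = 33) -> List[int]:
    """Decode one FASTQ quality line into integer Phred scores.

    Assumes Phred+33 encoding (ASCII code minus 33) unless told otherwise.
    """
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line: str, offset: int = 33) -> float:
    """Average Phred score of a single read."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores)


def error_probability(q: float) -> float:
    """Convert a Phred score Q into a base-call error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    read_quality = "IIIIIIII55"  # eight bases at Q40, two at Q20
    print(phred_scores(read_quality))
    print(round(mean_quality(read_quality), 1))
    print(error_probability(20))  # a Q20 base is wrong about 1 time in 100
```

A drop in these scores toward the 3' end of reads is the pattern FastQC's per-base quality plot makes visible.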

Step 2 — Adapter trimming (Cutadapt / Trim Galore)

What: remove adapter sequences and optionally low-quality bases from read ends.

Why: short fragments (common in ATAC) are often read through into the adapter sequence; untrimmed adapters cause alignment errors and false positives in peak calling.
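A paired-end trimming call can be sketched as follows. This only assembles a Cutadapt command line; it does not run it. The adapter shown is the Nextera transposase sequence commonly trimmed from ATAC libraries, but that choice, the quality and length cut-offs, and the file names are all assumptions for illustration, so check them against your library kit.

```python
import shlex
from typing import List

# Assumption: Nextera/Tn5 adapter read-through sequence; confirm for your kit.
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def cutadapt_command(
    r1: str,
    r2: str,
    out1: str,
    out2: str,
    adapter: str = NEXTERA_ADAPTER,
    min_length: int = 20,   # illustrative: drop pairs with a read shorter than this
    quality: int = 20,      # illustrative: 3' quality-trimming cut-off
) -> List[str]:
    """Build a paired-end Cutadapt command as an argument list."""
    return [
        "cutadapt",
        "-a", adapter,          # adapter to remove from read 1
        "-A", adapter,          # adapter to remove from read 2
        "-q", str(quality),     # trim low-quality 3' ends
        "-m", str(min_length),  # minimum read length after trimming
        "-o", out1,             # trimmed read 1 output
        "-p", out2,             # trimmed read 2 output
        r1,
        r2,
    ]


if __name__ == "__main__":
    cmd = cutadapt_command(
        "sample_R1.fastq.gz", "sample_R2.fastq.gz",
        "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
    )
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```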

Step 3 — Alignment (Bowtie2 / BWA)

What: map paired-end reads to the genome.

Why: correct alignment is needed to localize insertion events; preserve fragment length information produced by paired-end sequencing.

What to look for: mapping rate (ideally >70% for good libraries), fraction of reads mapping to mitochondria (high mitochondrial reads indicate poor nuclei prep). Align using parameters recommended for ATAC (keep properly paired reads, set reasonable max fragment length).
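One way to express "keep properly paired reads, set a reasonable max fragment length" with Bowtie2 is sketched below. The specific values (a 2000 bp maximum fragment length, the `--very-sensitive` preset, 8 threads) are common choices rather than settings prescribed by this chapter, and the index and file names are placeholders.

```python
import shlex
from typing import List


def bowtie2_command(
    index_prefix: str,
    r1: str,
    r2: str,
    out_sam: str,
    max_fragment: int = 2000,  # assumption: allows mono- to multi-nucleosome fragments
    threads: int = 8,          # assumption: adjust to your machine
) -> List[str]:
    """Build a paired-end Bowtie2 command as an argument list."""
    return [
        "bowtie2",
        "--very-sensitive",         # slower but more thorough search preset
        "-X", str(max_fragment),    # maximum fragment length for a valid pair
        "--no-mixed",               # do not align mates individually if the pair fails
        "--no-discordant",          # do not report discordant pair alignments
        "-p", str(threads),
        "-x", index_prefix,         # prefix of the Bowtie2 genome index
        "-1", r1,
        "-2", r2,
        "-S", out_sam,
    ]


if __name__ == "__main__":
    cmd = bowtie2_command(
        "genome_index", "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz", "sample.sam"
    )
    print(shlex.join(cmd))
```

Bowtie2 prints the overall alignment rate to standard error when it finishes, which is the mapping rate referred to above.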

Step 4 — Remove mitochondrial reads & duplicates (SAMtools / Picard)

What: filter out reads aligning to the mitochondrial chromosome (chrM) and mark/remove PCR duplicates.

Why: mitochondrial reads do not report chromatin accessibility and can dominate the library; duplicates can inflate signal if the library was over-amplified by PCR.

What to monitor: percent mitochondrial reads (low is better); duplication rate (high duplication → low complexity library).
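The two numbers to monitor can be computed from simple counts. The sketch below parses text in the four-column, tab-separated layout that `samtools idxstats` prints (reference name, length, mapped reads, unmapped reads) and reports the mitochondrial fraction; the mitochondrial contig names and the helper names are assumptions, so adapt them to your reference.

```python
from typing import Iterable


def mito_fraction(idxstats_text: str, mito_names: Iterable[str] = ("chrM", "MT")) -> float:
    """Fraction of mapped reads that fall on the mitochondrial contig.

    Expects tab-separated lines: name, length, mapped, unmapped.
    """
    mito_set = set(mito_names)
    total_mapped = 0
    mito_mapped = 0
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        total_mapped += int(mapped)
        if name in mito_set:
            mito_mapped += int(mapped)
    return mito_mapped / total_mapped if total_mapped else 0.0


def duplication_rate(total_reads: int, duplicate_reads: int) -> float:
    """Share of reads flagged as duplicates (higher means lower complexity)."""
    return duplicate_reads / total_reads if total_reads else 0.0


if __name__ == "__main__":
    example = "chr1\t248956422\t900\t0\nchrM\t16569\t100\t0\n*\t0\t0\t50\n"
    print(mito_fraction(example))        # made-up counts, for illustration only
    print(duplication_rate(1000, 150))
```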

Step 5 — Tn5 offset / adjust fragment centers

What: shift aligned reads by a few bases in a strand-specific way (+4 bp on the plus strand, −5 bp on the minus strand) so that they mark the Tn5 insertion site.

Why: centers insertion events correctly for accurate peak calling and footprinting analyses.
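For a properly paired fragment, the shift amounts to moving the left end 4 bp to the right and the right end 5 bp to the left. A minimal sketch, assuming BED-style coordinates (0-based start, exclusive end) and a hypothetical function name:

```python
from typing import Tuple

PLUS_STRAND_SHIFT = 4    # applied to the fragment's left (plus-strand) end
MINUS_STRAND_SHIFT = -5  # applied to the fragment's right (minus-strand) end


def shift_fragment(start: int, end: int) -> Tuple[int, int]:
    """Return fragment coordinates adjusted for the Tn5 offset.

    Assumes a 0-based, half-open interval [start, end) for a properly
    paired fragment, with start on the plus strand and end on the minus strand.
    """
    shifted_start = start + PLUS_STRAND_SHIFT
    shifted_end = end + MINUS_STRAND_SHIFT
    if shifted_start >= shifted_end:
        raise ValueError("fragment too short to shift: %d-%d" % (start, end))
    return shifted_start, shifted_end


if __name__ == "__main__":
    print(shift_fragment(1000, 1180))
```

Different tools use slightly different coordinate conventions, so it is worth checking which one your pipeline applies before shifting by hand.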

Step 6 — Peak calling (MACS2 or pipeline default)

What: detect genomic regions with enriched insertions compared with background.

Why: peaks approximate open chromatin regions used for downstream interpretation.
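A MACS2 call for paired-end ATAC data might be assembled like this. It is a sketch under stated assumptions: `-f BAMPE` makes MACS2 pile up whole fragments rather than shifted insertion sites, `--keep-dup all` assumes duplicates were already removed in the filtering step, and the genome size code, q-value cut-off, and file names are placeholders.

```python
import shlex
from typing import List


def macs2_command(
    bam: str,
    name: str,
    outdir: str,
    genome_size: str = "hs",  # assumption: human; MACS2 also accepts e.g. "mm"
    qvalue: float = 0.01,     # illustrative FDR cut-off
) -> List[str]:
    """Build a `macs2 callpeak` command as an argument list."""
    return [
        "macs2", "callpeak",
        "-t", bam,              # filtered, deduplicated alignments
        "-f", "BAMPE",          # treat input as paired-end fragments
        "-g", genome_size,      # effective genome size shortcut
        "-n", name,             # prefix for output files
        "--outdir", outdir,
        "-q", str(qvalue),
        "--keep-dup", "all",    # duplicates assumed to be handled upstream
    ]


if __name__ == "__main__":
    print(shlex.join(macs2_command("sample.filtered.bam", "sample", "peaks")))
```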

Step 7 — Merge peaks & count insertions

What: build a union (consensus) peak set across all samples and count insertions per sample per peak (count matrix).

Why: ensures all samples are compared on the same feature list for differential testing.
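The merge-and-count idea can be shown with plain Python on small in-memory lists. This is a toy sketch, not a replacement for featureCounts or a pipeline's own counting step; intervals are assumed to be 0-based and half-open, and the function names are illustrative.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Peak = Tuple[str, int, int]  # (chromosome, start, end), half-open


def merge_peaks(peak_lists: Iterable[List[Peak]]) -> List[Peak]:
    """Merge overlapping or touching peaks from all samples into a union set."""
    all_peaks = sorted(p for peaks in peak_lists for p in peaks)
    merged: List[Peak] = []
    for chrom, start, end in all_peaks:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def count_insertions(peaks: List[Peak], insertions: Iterable[Tuple[str, int]]) -> List[int]:
    """Count insertion positions per peak; peaks must be sorted and non-overlapping."""
    by_chrom: Dict[str, List[Tuple[int, int, int]]] = defaultdict(list)
    for idx, (chrom, start, end) in enumerate(peaks):
        by_chrom[chrom].append((start, end, idx))
    starts = {chrom: [s for s, _, _ in ivs] for chrom, ivs in by_chrom.items()}

    counts = [0] * len(peaks)
    for chrom, pos in insertions:
        if chrom not in starts:
            continue
        j = bisect_right(starts[chrom], pos) - 1  # last peak starting at or before pos
        if j >= 0:
            _start, end, idx = by_chrom[chrom][j]
            if pos < end:
                counts[idx] += 1
    return counts


if __name__ == "__main__":
    sample_a = [("chr1", 100, 200), ("chr1", 500, 600)]
    sample_b = [("chr1", 150, 250), ("chr2", 10, 50)]
    union = merge_peaks([sample_a, sample_b])
    print(union)
    print(count_insertions(union, [("chr1", 120), ("chr1", 249), ("chr2", 10)]))
```

Running `count_insertions` once per sample against the same union set gives one column of the count matrix per sample.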

Step 8 — Downstream analyses

This includes:

Differential accessibility tests (DESeq2/edgeR)
Motif enrichment and TF inference
Peak annotation and integration with expression data
Visualization in genome browsers
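Before running a formal test, it can help to see what a count matrix looks like once depth is accounted for. The sketch below computes counts per million (CPM) and a log2 fold change with a pseudocount. It is only a descriptive summary, not a statistical test: DESeq2 and edgeR additionally model replicate variability, which this toy example ignores. All names and numbers are illustrative.

```python
import math
from typing import List


def cpm(counts: List[int]) -> List[float]:
    """Scale one sample's per-peak counts to counts per million."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [c * 1e6 / total for c in counts]


def log2_fold_change(
    treated: List[float], control: List[float], pseudocount: float = 1.0
) -> List[float]:
    """Per-peak log2(treated / control), with a pseudocount to avoid division by zero."""
    return [
        math.log2((t + pseudocount) / (c + pseudocount))
        for t, c in zip(treated, control)
    ]


if __name__ == "__main__":
    control_counts = [120, 30, 850]   # made-up counts for three peaks
    treated_counts = [480, 25, 900]
    lfc = log2_fold_change(cpm(treated_counts), cpm(control_counts))
    print([round(x, 2) for x in lfc])
```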

5. Key QC metrics you should always check

Sequencing depth: number of read pairs (aim ≈10M read pairs for routine experiments).
Mapping rate: fraction of reads that align to the genome.
Fraction mitochondrial reads: high % suggests bad nuclei prep; mitochondrial reads are typically removed.
TSS enrichment score: high enrichment at transcription start sites indicates good signal-to-noise (pipelines like PEPATAC report this).
Fragment length distribution: a peak for nucleosome-free fragments (<100 bp) followed by mono-, di-, and tri-nucleosome peaks (multiples of ~150 bp) indicates the expected biology.
Duplicate rate / library complexity: high duplication reduces usable unique reads.
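The fragment length distribution can be summarised numerically as well as plotted. The sketch below bins fragment lengths into size classes and reports the fraction in each. The cut-offs are illustrative assumptions (some published analyses use similar ranges), not thresholds defined in this chapter, so treat the class boundaries as adjustable.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumed, adjustable size classes in bp (inclusive bounds).
NUCLEOSOME_FREE_MAX = 99
MONO_RANGE = (180, 247)
DI_RANGE = (315, 473)


def classify_fragment(length: int) -> str:
    """Assign a fragment length to a coarse size class."""
    if length <= NUCLEOSOME_FREE_MAX:
        return "nucleosome_free"
    if MONO_RANGE[0] <= length <= MONO_RANGE[1]:
        return "mono_nucleosome"
    if DI_RANGE[0] <= length <= DI_RANGE[1]:
        return "di_nucleosome"
    return "other"


def fragment_class_fractions(lengths: Iterable[int]) -> Dict[str, float]:
    """Fraction of fragments falling in each size class."""
    counts = Counter(classify_fragment(n) for n in lengths)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


if __name__ == "__main__":
    lengths = [45, 62, 88, 190, 205, 230, 400, 150]  # made-up fragment sizes
    for name, frac in sorted(fragment_class_fractions(lengths).items()):
        print(name, round(frac, 3))
```

In a real analysis the lengths would come from the template length field of properly paired alignments; a library dominated by the "other" class or lacking a nucleosome-free peak is worth a closer look.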

6. Common problems & practical tips

Too many mitochondrial reads: optimize nuclei isolation (gentler lysis) and remove mtDNA reads after alignment.
Low TSS enrichment: indicates low signal; check sample prep quality and sequencing depth.
Adapter contamination: always trim adapters, especially if fragments are short.
Very low mapping rate: check that you aligned to the correct reference genome, that the FASTQ files are well-formed, and that paired-end data were not aligned as single-end (or vice versa).
High duplication: use more input material, reduce PCR cycles if possible, or accept lower complexity for QC experiments.

7. Tools & pipeline recommendations (easy starters)

Quick commands / small toolset

FastQC — raw read QC
seqtk — subsampling & basic FASTQ ops
Cutadapt / Trim Galore — trimming
Bowtie2 — aligner commonly used for ATAC
SAMtools / Picard — filtering & marking duplicates
MACS2 — peak calling

Full pipelines (recommended for beginners)

PEPATAC — portable, produces many QC metrics including TSS enrichment.
ENCODE ATAC-seq pipeline — consortium best practices.
nf-core/atacseq — community pipeline with standardization and container support.

8. Practical checklist for a single sample (quick)

Confirm paired-end sequencing and target depth (e.g., 10M read pairs).
Run FastQC on raw FASTQ and inspect adapter content & per-base quality.
Trim adapters if needed (Cutadapt / Trim Galore).
Align reads to the correct reference genome (paired-end aware).
Filter out mitochondrial reads and low-quality mappings; mark/remove duplicates.
Apply Tn5 offset correction to represent cut sites.
Call peaks (MACS2 or pipeline default).
Merge peaks across samples and count insertions for differential testing.
Run downstream analyses (differential, motif enrichment, annotation).

9. Short glossary (helpful terms)

Tn5 — transposase used to insert adapters into open chromatin.
Fragment — the DNA segment between paired reads; fragment length carries nucleosome information.
Peak — genomic interval with enriched insertions (open chromatin).
TSS enrichment — signal enrichment at transcription start sites; a QC metric for ATAC-seq.
Fragment length distribution — histogram of fragment sizes; nucleosome periodicity appears as peaks ~150 bp apart.

10. Further reading & links

Buenrostro et al., original ATAC-seq paper (2013) — introduction to the method.
Protocol & review: Nature Protocols / 2022 review (ATAC-seq best practices) — recommended reading (detailed).
PEPATAC documentation — a user-friendly pipeline that reports many QC metrics (TSS enrichment etc.).
ENCODE ATAC-seq pipeline — consortium pipeline & guidelines.
nf-core/atacseq — community, containerized pipeline.
