Chapter 1: ATAC-seq — What it is and a beginner-friendly pipeline overview

This chapter explains the biological idea behind ATAC-seq (Assay for Transposase-Accessible Chromatin), why it is typically run with paired-end sequencing, and the typical analysis stages — from raw FASTQ files to biological interpretation. Each section explains what happens, why it matters, and what to expect so you can follow along without deep computational experience.

1. What is ATAC-seq? (short, conceptual)

ATAC-seq stands for Assay for Transposase-Accessible Chromatin using sequencing. It uses a hyperactive transposase (Tn5) loaded with sequencing adapters. Tn5 preferentially inserts adapters into regions of open (accessible) chromatin — locations where DNA is not tightly bound by nucleosomes or other proteins. After Tn5 insertion, those DNA fragments are amplified and sequenced. Regions with many insertions correspond to accessible regulatory elements (promoters, enhancers).

Key conceptual points:

Historical reference: Jason Buenrostro et al. introduced ATAC-seq in 2013 (Nature Methods), and the method is now widely used in epigenetics research.

2. Single-end vs paired-end sequencing for ATAC-seq — short answer

Always prefer paired-end sequencing for ATAC-seq. Each ATAC fragment has two ends (two Tn5 insertion sites). Paired-end reads let you reconstruct the full fragment length and exactly where Tn5 inserted on both sides — this is essential for fragment-length analysis, nucleosome calling, and correct mapping of insertion sites. Using single-end reads discards half of that information and reduces the sensitivity of downstream analyses.
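To make the paired-end argument concrete, the sketch below reconstructs a fragment from the two mate alignments of one read pair. The function name and the 0-based, half-open coordinate convention are assumptions made for illustration; in practice the positions would be read from a BAM file.

```python
def fragment_from_pair(fwd_start, rev_start, rev_len):
    """Reconstruct a fragment from a forward mate and its reverse mate.

    Coordinates are 0-based; the reverse mate covers
    [rev_start, rev_start + rev_len).
    Returns (fragment_start, fragment_end, fragment_length).
    """
    fragment_start = fwd_start            # leftmost base of the forward mate
    fragment_end = rev_start + rev_len    # one past the rightmost base of the reverse mate
    if fragment_end <= fragment_start:
        raise ValueError("mates are not in forward/reverse orientation")
    return fragment_start, fragment_end, fragment_end - fragment_start


# Forward mate starts at 1000; a 50 bp reverse mate starts at 1130.
print(fragment_from_pair(1000, 1130, 50))  # (1000, 1180, 180)
```

With a single-end read only fwd_start would be known, so neither the second insertion site nor the fragment length could be recovered.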

Typical recommendation for standard experiments: ~10 million read pairs per sample (i.e., 10M clusters passing filter). This is often sufficient for routine differential accessibility and motif analyses in many sample types. Analyses that need more coverage, such as transcription factor footprinting or genotyping, require substantially deeper sequencing.

3. High-level pipeline overview — three main stages

ATAC-seq analysis can be thought of in three major stages. Each stage breaks down into concrete steps (below) with the usual tools given as examples.

Stage A — Data processing (raw → cleaned alignments)

  • Quality control of raw reads (FastQC): check per-base quality, adapter content, overrepresented sequences.
  • Adapter trimming (Trim Galore / Cutadapt): remove sequencing adapters, especially important if many short fragments exist.
  • Alignment to the reference genome (Bowtie2 / BWA): map reads to genome; use parameters that preserve paired-end information and allow fragment length filtering.
  • Filtering (SAMtools / Picard): remove mitochondrial reads (often abundant in ATAC), low-quality mappings, and PCR duplicates (or mark duplicates depending on downstream needs).
  • Tn5 offset adjustment: shift alignments by +4 bp (plus-strand reads) or −5 bp (minus-strand reads) so that read ends are centered on the Tn5 cut sites for insertion calling.

Stage B — Peak calling & peak set construction

  • Peak calling per sample (MACS2 or pipeline defaults): find genomic regions with enriched insertions.
  • Merge peaks across samples to create a union peak set (so all samples are compared on the same feature set).
  • Count insertions (featureCounts / custom scripts): generate a matrix of insertion counts per peak per sample (this becomes your expression-like table for differential tests).

Stage C — Downstream analysis

  • Differential accessibility (DESeq2, edgeR, etc.): test which peaks change between conditions.
  • Motif enrichment (HOMER, MEME): identify TF motifs enriched in sets of peaks.
  • Peak annotation (ChIPseeker, HOMER): assign peaks to nearest genes, promoters, enhancers.
  • Visualization & integration (IGV / UCSC Genome Browser, ArchR for single-cell or advanced analyses): inspect tracks, integrate with RNA-seq, ChIP-seq, etc.
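As a rough illustration of what a differential accessibility comparison starts from, the sketch below normalizes two count vectors to counts per million (CPM) and computes a log2 fold change per peak. The function names, the pseudocount, and the example counts are assumptions for illustration. This is only a normalization and effect-size calculation, not a statistical test; tools such as DESeq2 and edgeR additionally model variability across replicates.

```python
import math


def counts_per_million(counts):
    """Scale raw per-peak counts so that they sum to one million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(b / a) per peak, with a pseudocount to avoid division by zero."""
    return [math.log2((b + pseudocount) / (a + pseudocount))
            for a, b in zip(cpm_a, cpm_b)]


# Hypothetical insertion counts for three peaks in two conditions.
control = [100, 300, 600]
treated = [400, 300, 300]

lfc = log2_fold_change(counts_per_million(control), counts_per_million(treated))
print([round(x, 2) for x in lfc])  # [2.0, 0.0, -1.0]
```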

Recommended automatic pipelines

  • PEPATAC: portable, user-friendly, includes many QC metrics (TSS enrichment, fragment distribution). (See PEPATAC docs.)
  • ENCODE ATAC-seq pipeline: consortium-level, thoroughly tested pipeline used by ENCODE.
  • nf-core/atacseq: community pipeline with standardized inputs and reproducible containers.

4. Detailed, step-by-step walkthrough (what each step does & what to look for)

Step 1 — Pre-alignment QC (FastQC)

What: run FastQC on raw FASTQ files.

Why: find low quality reads, adapter contamination, unusual base composition.

What to look for: per-base quality (Phred), adapter content, overrepresented sequences. If adapters are present, trimming is required.
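The per-base quality plot can be understood with a small sketch: each character in a FASTQ quality line encodes a Phred score, and averaging those scores by read position gives the curve FastQC draws. The function names are hypothetical, and the Phred+33 encoding is an assumption (it is the encoding used by current Illumina instruments). FastQC itself reports far more than this mean.

```python
def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality_per_position(quality_lines):
    """Average Phred score at each read position across all reads."""
    totals, counts = [], []
    for line in quality_lines:
        for i, q in enumerate(phred_scores(line)):
            if i == len(totals):      # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two toy reads: 'I' = Q40, '5' = Q20, '#' = Q2.
print(mean_quality_per_position(["IIII", "II5#"]))  # [40.0, 40.0, 30.0, 21.0]
```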

Step 2 — Adapter trimming (Cutadapt / Trim Galore)

What: remove adapter sequences and optionally low-quality bases from read ends.

Why: short fragments (common in ATAC) are often read through into the adapter; untrimmed adapter sequence causes alignment errors and false positives in peak calling.
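The idea behind 3' adapter trimming can be sketched in a few lines. The example below is a simplified, exact-match illustration and not the algorithm used by Cutadapt or Trim Galore, which also tolerate sequencing errors. The function name and min_overlap parameter are hypothetical; the adapter shown is the Nextera transposase sequence commonly trimmed from ATAC-seq reads.

```python
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def trim_adapter(read, adapter=NEXTERA_ADAPTER, min_overlap=3):
    """Remove a 3' adapter (exact matches only) from a read sequence."""
    # Case 1: the whole adapter occurs somewhere inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adapter.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read  # no adapter found


insert = "ACGTACGTAC"
print(trim_adapter(insert + NEXTERA_ADAPTER))  # ACGTACGTAC
print(trim_adapter(insert + "CTGTC"))          # ACGTACGTAC
```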

Step 3 — Alignment (Bowtie2 / BWA)

What: map paired-end reads to the genome.

Why: correct alignment is needed to localize insertion events; preserve fragment length information produced by paired-end sequencing.

What to look for: mapping rate (ideally >70% for good libraries), fraction of reads mapping to mitochondria (high mitochondrial reads indicate poor nuclei prep). Align using parameters recommended for ATAC (keep properly paired reads, set reasonable max fragment length).
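One way to express "paired-end aware" alignment settings is shown below as a small command builder for Bowtie2. The function name, file names, index prefix, and the 2000 bp maximum fragment length are assumptions for illustration; the flags themselves (--very-sensitive, -X, --no-mixed, --no-discordant, -p, -x, -1, -2) are standard Bowtie2 options. The sketch only assembles and prints the command and does not run the aligner.

```python
import shlex


def bowtie2_command(index_prefix, fastq_r1, fastq_r2, max_fragment=2000, threads=4):
    """Build a Bowtie2 command line for paired-end ATAC-seq reads."""
    return [
        "bowtie2",
        "--very-sensitive",
        "-X", str(max_fragment),   # maximum fragment length for a valid pair
        "--no-mixed",              # do not align mates individually
        "--no-discordant",         # drop pairs that violate pairing constraints
        "-p", str(threads),
        "-x", index_prefix,
        "-1", fastq_r1,
        "-2", fastq_r2,
    ]


# Hypothetical file names.
cmd = bowtie2_command("genome_index", "sample_R1.trimmed.fastq.gz",
                      "sample_R2.trimmed.fastq.gz")
print(shlex.join(cmd))
```

Raising -X above the Bowtie2 default matters for ATAC-seq because di- and tri-nucleosome fragments can be longer than the default limit.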

Step 4 — Remove mitochondrial reads & duplicates (SAMtools / Picard)

What: filter out reads aligning to the mitochondrial chromosome (chrM) and mark/remove PCR duplicates.

Why: mitochondrial reads do not report chromatin accessibility and can dominate the library; duplicates can inflate signal if PCR overamplified.

What to monitor: percent mitochondrial reads (low is better); duplication rate (high duplication → low complexity library).
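Both metrics are simple ratios. The sketch below computes the mitochondrial fraction from text in the tab-separated format printed by samtools idxstats (reference name, length, mapped reads, unmapped reads) and a duplication rate from two counts. The function names, the accepted mitochondrial contig names, and the example numbers are assumptions for illustration.

```python
def mitochondrial_fraction(idxstats_text, mito_names=("chrM", "MT")):
    """Fraction of mapped reads assigned to the mitochondrial contig."""
    total = 0
    mito = 0
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        mapped = int(mapped)
        total += mapped
        if name in mito_names:
            mito += mapped
    return mito / total if total else 0.0


def duplication_rate(duplicate_reads, mapped_reads):
    """Fraction of mapped reads flagged as duplicates."""
    return duplicate_reads / mapped_reads if mapped_reads else 0.0


# Made-up counts in idxstats layout.
example = (
    "chr1\t248956422\t700000\t0\n"
    "chr2\t242193529\t200000\t0\n"
    "chrM\t16569\t100000\t0\n"
    "*\t0\t0\t5000\n"
)
print(mitochondrial_fraction(example))  # 0.1
```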

Step 5 — Tn5 offset / adjust fragment centers

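What: shift the 5' end of each aligned read by +4 bp (plus strand) or −5 bp (minus strand) so that it represents the center of the Tn5 insertion site. The asymmetric offsets account for the 9 bp stagger between the two cuts Tn5 makes.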

Why: centers insertion events correctly for accurate peak calling and footprinting analyses.
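The offset is a one-line calculation per read. In this minimal sketch the function name is hypothetical, and the input is assumed to be the genomic position of the read's 5' end together with its strand.

```python
def tn5_cut_site(five_prime_pos, strand):
    """Shift a read's 5' end to the center of the Tn5 insertion site."""
    if strand == "+":
        return five_prime_pos + 4
    if strand == "-":
        return five_prime_pos - 5
    raise ValueError("strand must be '+' or '-'")


print(tn5_cut_site(1000, "+"))  # 1004
print(tn5_cut_site(1180, "-"))  # 1175
```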

Step 6 — Peak calling (MACS2 or pipeline default)

What: detect genomic regions with enriched insertions compared with background.

Why: peaks approximate open chromatin regions used for downstream interpretation.
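The notion of "enriched compared with background" can be illustrated with a toy example: count insertions in fixed-size bins, estimate a background rate, and flag bins whose count is improbably high under a Poisson model. This is a deliberately simplified sketch and not how MACS2 works internally; real peak callers model the background more carefully. The function names, the genome-wide mean as background, and the threshold are assumptions.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)


def enriched_bins(bin_counts, p_threshold=1e-3):
    """Indices of bins with more insertions than the background explains."""
    background = sum(bin_counts) / len(bin_counts)  # mean insertions per bin
    return [i for i, c in enumerate(bin_counts)
            if poisson_sf(c, background) < p_threshold]


# Insertion counts in eight consecutive bins; bin 4 stands out.
counts = [2, 3, 1, 2, 30, 2, 3, 1]
print(enriched_bins(counts))  # [4]
```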

Step 7 — Merge peaks & count insertions

What: build a union (consensus) peak set across all samples and count insertions per sample per peak (count matrix).

Why: ensures all samples are compared on the same feature list for differential testing.
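A minimal sketch of both operations follows: overlapping peaks from different samples are merged into one union set, and insertions are then counted against that shared set. The function names, the 0-based half-open intervals, and the example coordinates are assumptions for illustration; real pipelines would use tools such as featureCounts on BAM files.

```python
def merge_peaks(peaks):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)   # extend the previous peak
        else:
            merged.append([chrom, start, end])
    return [tuple(p) for p in merged]


def count_insertions(peaks, insertions):
    """Count (chrom, position) insertions falling inside each peak."""
    return [
        sum(1 for c, pos in insertions if c == chrom and start <= pos < end)
        for chrom, start, end in peaks
    ]


sample_a = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_b = [("chr1", 150, 250), ("chr2", 10, 50)]
union = merge_peaks(sample_a + sample_b)
print(union)  # [('chr1', 100, 250), ('chr1', 500, 600), ('chr2', 10, 50)]

insertions = [("chr1", 120), ("chr1", 240), ("chr1", 550), ("chr2", 60)]
print(count_insertions(union, insertions))  # [2, 1, 0]
```

Running count_insertions once per sample against the same union list yields one column of the count matrix per sample.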

Step 8 — Downstream analyses

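This includes differential accessibility testing, motif enrichment, peak annotation, and visualization and integration with other data types (see Stage C in section 3).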

5. Key QC metrics you should always check

6. Common problems & practical tips

7. Quick commands / small toolset

  • FastQC — raw read QC
  • seqtk — subsampling & basic FASTQ ops
  • Cutadapt / Trim Galore — trimming
  • Bowtie2 — aligner commonly used for ATAC
  • SAMtools / Picard — filtering & marking duplicates
  • MACS2 — peak calling

Full pipelines (recommended for beginners)

  • PEPATAC
  • ENCODE ATAC-seq pipeline
  • nf-core/atacseq

8. Practical checklist for a single sample (quick)

  1. Confirm paired-end sequencing and target depth (e.g., 10M read pairs).
  2. Run FastQC on raw FASTQ and inspect adapter content & per-base quality.
  3. Trim adapters if needed (Cutadapt / Trim Galore).
  4. Align reads to the correct reference genome (paired-end aware).
  5. Filter out mitochondrial reads and low-quality mappings; mark/remove duplicates.
  6. Apply Tn5 offset correction to represent cut sites.
  7. Call peaks (MACS2 or pipeline default).
  8. Merge peaks across samples and count insertions for differential testing.
  9. Run downstream analyses (differential, motif enrichment, annotation).

9. Short glossary (helpful terms)

Tn5 — transposase used to insert adapters into open chromatin.
Fragment — the DNA segment between paired reads; fragment length carries nucleosome information.
Peak — genomic interval with enriched insertions (open chromatin).
TSS enrichment — signal enrichment at transcription start sites; a QC metric for ATAC-seq.
Fragment length distribution — histogram of fragment sizes; nucleosome periodicity appears as peaks ~150 bp apart.

10. Further reading & links
