Chapter 6: Data Cleaning – Trimming and Adapting

Learning Objectives

By the end of this chapter, you will be able to:

  1. Explain why the Tn5 insertion bias at the start of ATAC-seq reads is a biological signature rather than a sequencing error
  2. Describe how short fragments cause adapter read-through and why untrimmed adapters ruin alignment
  3. Trim adapters with Trim Galore! or fastp, choosing the tool that matches your hardware
  4. Organize raw and processed files so that raw_data remains untouched
  5. Verify trimming with FastQC and interpret the resulting change in sequence length distribution

The Biology of the Signal: Interpreting the Noise

Welcome to the data cleaning phase. Before we can align our sequencing reads to the genome, we must ensure that the data we are feeding into the alignment software is an accurate representation of the biology we are studying.

In the previous chapter, we generated a Quality Control (QC) report using FastQC. You likely noticed some "Warnings" or "Failures" in that report. For a beginner, red flags in a report can be alarming, but in bioinformatics, context is everything. To understand what needs to be fixed and what needs to be ignored, we have to return to the wet lab and look closely at the chemistry of the assay itself.

The most confusing part of an ATAC-seq QC report is often the "Per Base Sequence Content" module. In a standard DNA sequencing run, we expect the four bases—Adenine (A), Cytosine (C), Guanine (G), and Thymine (T)—to appear randomly and in roughly equal proportions (25% each) across the length of the read.

However, your ATAC-seq report likely shows wild fluctuations—wavy lines that spike and dip dramatically—at the very beginning of the read (the first 9 to 12 base pairs). A general QC tool like FastQC flags this as a "Failure," interpreting it as a technical error in the sequencer. But to a biologist who knows the mechanism of ATAC-seq, this is not an error; it is a signature.

Recall how ATAC-seq libraries are constructed. We use an enzyme called Tn5 transposase. This enzyme has a dual job: it simultaneously fragments the open chromatin and attaches sequencing adapters to the ends of those fragments (a process called tagmentation).

While we often describe Tn5 as cutting "randomly" in open chromatin, enzymes are rarely truly random. The Tn5 transposase has a specific binding preference—a specific motif of DNA sequences that it latches onto most efficiently. Because the sequencing machine starts reading exactly where the Tn5 enzyme made its cut, the first few bases of every single read will inevitably reflect this enzyme's binding preference rather than the random sequence of the genome.

What does this mean for your analysis?

It means that the "error" you see is actually biological confirmation that your enzymatic reaction worked. Crucially, you must not trim these initial bases. If you were to trim the first 10 bases to make the report look "clean," you would be deleting the exact nucleotides where the chromatin was accessible. This would ruin your ability to perform downstream tasks like "footprinting," where we look for the exact binding sites of transcription factors. Therefore, we safely ignore this specific warning.
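If you would like to see this enzymatic signature for yourself, a short shell pipeline can tally which base appears at a given read position. The FASTQ file built here is a two-read mock so the snippet runs anywhere; on your real data, substitute the output of zcat ENCFF535OGL_subset.fastq.gz for the mock file.

```shell
# Build a tiny mock FASTQ (two 12 bp reads) purely for illustration;
# point the awk pipeline at your real file instead.
printf '@read1\nTAAGACGTGTAC\n+\nIIIIIIIIIIII\n@read2\nTAAGTTCCAGGA\n+\nIIIIIIIIIIII\n' > mock.fastq

# Tally the base seen at position 1 of every read (sequence lines are
# every 4th line, starting at line 2). A skewed tally at early positions
# is the Tn5 signature; repeat with substr($0, 20, 1) to watch it flatten.
awk 'NR % 4 == 2 { print substr($0, 1, 1) }' mock.fastq | sort | uniq -c
```

In this mock, both reads begin with T, so the tally is maximally skewed; on real ATAC-seq data you would see a milder but consistent preference across the first ~10 positions.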

The Intruder: Sequencing Adapters

The second issue often seen in QC reports is Adapter Content, and unlike the Tn5 bias, this one requires our intervention.

When we prepare a sequencing library, we attach artificial pieces of DNA called "adapters" to both ends of our biological DNA fragments. These adapters act like handles, allowing the DNA to attach to the flow cell of the sequencing machine (the Illumina lane).

Ideally, the sequencing machine reads only the biological DNA in the middle. However, ATAC-seq often produces fragments of varying lengths. If a DNA fragment is shorter than the number of cycles the machine is set to read (for example, a 30bp DNA fragment sequenced on a 50bp run), the machine will read all the way through the biological DNA and continue reading right into the adapter on the other side.

This is a problem for the computer. When we later try to align that read to the mouse genome, the software will see a sequence that is half-mouse and half-artificial adapter. Since the adapter sequence doesn't exist in nature, the aligner will fail to find a match, and that perfectly good DNA read will be thrown in the trash. To prevent this, we must computationally "scissor" these adapters off the ends of our reads before alignment.
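You can estimate how widespread read-through is before trimming by searching raw reads for the Nextera adapter sequence (CTGTCTCTTATA). The two-read FASTQ below is a mock built for illustration; on your real gzipped file, pipe the output of zcat into the same grep.

```shell
# Mock FASTQ: read1 runs through into the Nextera adapter (CTGTCTCTTATA),
# read2 does not. On real data: zcat file.fastq.gz | awk 'NR % 4 == 2' | grep -c ...
printf '@read1\nACGTACGTCTGTCTCTTATACACA\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n@read2\nACGTACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n' > mock_adapters.fastq

# Count sequence lines containing the adapter; here 1 of 2 reads is affected.
awk 'NR % 4 == 2' mock_adapters.fastq | grep -c CTGTCTCTTATA
```

A high count relative to your total read number is a sign that your library contained many short fragments, and that trimming will rescue a large fraction of reads from failed alignment.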

The Standard Solution: Trim Galore!

In the bioinformatics community, the standard tool for this task is Trim Galore! It is a "wrapper" script, which means it wraps around more complex tools (specifically one called Cutadapt) to make them user-friendly. It is designed to automatically detect which adapter was used (usually the standard Illumina Nextera sequence for ATAC-seq) and trim it off without requiring you to manually type out the DNA sequence of the adapter.

If you are working on a standard Linux server or an older computer, this is the tool you would use. The workflow is straightforward. First, you activate your environment:

conda activate bio-tools

Then, you install the tool directly from the bioconda channel:

conda install trim-galore

Once installed, you would navigate to your data folder (cd atacseq/raw_data) and run the command on your subset file. The command is simple because Trim Galore handles the complexity for you:

trim_galore ENCFF535OGL_subset.fastq.gz

Expected Output:

When running Trim Galore, your terminal will display the progress of cutadapt working in the background:

1.18.0
Summing up ...
Trimming mode: single-end
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Writing to ENCFF535OGL_subset_trimmed.fastq.gz

>>> Now performing quality control on the trimmed file...

This single command would produce a new, clean file named ENCFF535OGL_subset_trimmed.fastq.gz, ready for alignment.

The Hardware Reality Check: A Pivot to fastp

However, bioinformatics is rarely one-size-fits-all, and hardware limitations often force us to adapt. You are likely following this guide on a modern MacBook (M1, M2, or M3 chips). These computers use a different processor architecture (osx-arm64) compared to the older Intel chips (osx-64) that many bioinformatics tools were originally built for.

Tools like Trim Galore! rely on a chain of dependencies (Cutadapt, Python, etc.) that can sometimes cause "dependency conflicts" on these newer Macs. You might see errors refusing to install the software or complaining about "unsatisfiable" environments. Rather than fighting the architecture of your computer for hours, we can switch to a more modern, efficient tool that is better supported on new hardware.

That tool is called fastp.

fastp is an ultra-fast, all-in-one preprocessor. Unlike Trim Galore, which is a wrapper script written in Python, fastp is written in C++, making it incredibly fast. It performs quality control, adapter detection, and trimming all in a single pass.

Executing the Workflow with fastp

Let's implement this solution. First, we need to install the tool into our environment.

conda install fastp

Once installed, we need to move to the directory where our raw data lives. Never run commands from the wrong folder; the computer needs to see the file to work on it.

cd atacseq/raw_data

Now we will run the trimming command. We need to specify the input file (flag -i), the name of the new output file we want to create (flag -o), and the names of the reports we want it to generate (flags -h for HTML and -j for JSON).

Run the following block as a single command:

fastp -i ENCFF535OGL_subset.fastq.gz \
      -o ENCFF535OGL_subset_trimmed.fastq.gz \
      -h fastp_report.html \
      -j fastp_report.json

Expected Output:

When the command finishes, fastp will print a short statistical summary to your terminal screen, but more importantly, it will generate three physical files in your current directory:

  1. ENCFF535OGL_subset_trimmed.fastq.gz – the cleaned, adapter-free reads
  2. fastp_report.html – a human-readable QC report you can open in a browser
  3. fastp_report.json – the same statistics in machine-readable form

With these files created, we have successfully scrubbed our data of technical artifacts.

Note: The terminal summary confirms that fastp detected the adapters and removed them ("reads with adapter trimmed"). It also shows that the vast majority of your reads "passed filter" and are ready for analysis.
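If you want to pull those numbers out of the report programmatically (for a lab notebook or a pipeline log), the JSON file is the place to look. The snippet below runs against a mock of the report; the field names match the layout recent fastp versions emit, but do check them against your own fastp_report.json (jq, if installed, makes this inspection nicer).

```shell
# Mock of the relevant corner of fastp's JSON report, so this snippet runs
# as-is; the field names are an assumption based on recent fastp versions --
# verify against your own fastp_report.json.
cat > mock_fastp.json <<'EOF'
{
  "filtering_result": { "passed_filter_reads": 980000 },
  "adapter_cutting": { "adapter_trimmed_reads": 150000 }
}
EOF

# Extract the count of reads that survived filtering.
grep -o '"passed_filter_reads": *[0-9]*' mock_fastp.json | grep -o '[0-9]*$'
```

Recording these numbers per sample lets you spot a failed library (an unusually low pass rate) at a glance across a whole experiment.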

Housekeeping: The "Tidy Data" Principle

We now have a cluttered directory. Our raw_data folder, which should be a sanctuary for original, untouched files, now contains a processed trimmed file and two report files. A key habit of a professional bioinformatician is organization. We must separate our raw inputs from our processed outputs to prevent accidental deletion or confusion.

First, let's move the report files. We will create a specific folder for them inside our results directory and move them there.

mkdir -p ../results/fastp
mv fastp_report.html fastp_report.json ../results/fastp/

Next, we need to move the trimmed data. Even though this file is now the input for our next step, it is technically a "result" of the trimming process. We will create a dedicated folder for trimmed reads and move the file out of raw_data.

cd ../
mkdir -p results/trimmed
mv raw_data/ENCFF535OGL_subset_trimmed.fastq.gz results/trimmed/

By doing this, we ensure that raw_data remains a "read-only" zone, protecting your original experiment.
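If you want to enforce that rule at the filesystem level, you can strip write permission from the folder. This is an optional habit suggested here, not part of the standard workflow; the demo below uses a scratch directory so it is safe to run anywhere, but the same chmod applies to raw_data/.

```shell
# Make raw files literally read-only so a stray `rm` or redirect cannot
# clobber them (demonstrated on a scratch copy; run it on raw_data/).
mkdir -p demo_raw
printf 'sacred raw bytes\n' > demo_raw/sample.fastq
chmod -R a-w demo_raw

# Writes now fail; restore with `chmod -R u+w demo_raw` if you ever need to.
ls -l demo_raw/sample.fastq
```

Aligners and QC tools only ever read these files, so removing write permission costs nothing downstream.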

The Sanity Check: Understanding Sequence Length Changes

We never trust a computational tool blindly. To confirm that the adapters are truly gone, we run our trusted diagnostic tool, FastQC, on the newly generated trimmed file.

fastqc results/trimmed/ENCFF535OGL_subset_trimmed.fastq.gz -o results/fastqc/

Open this new report. You should immediately see that the "Adapter Content" graph is now a completely flat line at 0%, confirming the cleaning was successful. However, you will likely notice a new "Warning" flag for Sequence Length Distribution, showing a sharp peak around 46-48 base pairs.

Why did the read lengths change?

This change is expected and actually proves the trimming worked.

  1. Before Trimming: Every read coming off the sequencer was exactly the same length (e.g., 51 bp). This is because the machine runs for a set number of cycles (51 cycles = 51 bases). If the biological DNA fragment was short (e.g., 48 bp), the machine filled the remaining 3 bases with adapter sequence to reach 51.
  2. After Trimming: fastp identified those 3 bases of "junk" adapter sequence and cut them off. That specific read is now only 48 bp long.
  3. The Peak at 800,000: The graph showing ~800,000 reads at 46-48 bp simply tells us that the majority of our library consisted of high-quality DNA that required very little trimming (maybe just 1-3 bases removed). This "peak" represents the bulk of your clean data.
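You can reproduce FastQC's length histogram with a one-line awk command. The mock file below contains three reads (two of 48 bp, one of 45 bp) so the snippet is runnable as-is; pipe zcat results/trimmed/ENCFF535OGL_subset_trimmed.fastq.gz into the same awk to profile your real data.

```shell
# Build a mock trimmed FASTQ with reads of 48, 48, and 45 bp
# (the printf 'A%.0s' $(seq N) idiom prints the letter A N times).
printf '@r1\n%s\n+\n%s\n'  "$(printf 'A%.0s' $(seq 48))" "$(printf 'I%.0s' $(seq 48))" >  mock_trimmed.fastq
printf '@r2\n%s\n+\n%s\n'  "$(printf 'A%.0s' $(seq 48))" "$(printf 'I%.0s' $(seq 48))" >> mock_trimmed.fastq
printf '@r3\n%s\n+\n%s\n'  "$(printf 'A%.0s' $(seq 45))" "$(printf 'I%.0s' $(seq 45))" >> mock_trimmed.fastq

# Histogram of read lengths: count of reads per length, mirroring
# FastQC's Sequence Length Distribution plot.
awk 'NR % 4 == 2 { print length($0) }' mock_trimmed.fastq | sort -n | uniq -c
```

On your real trimmed file, expect the same shape as the FastQC plot: a large peak just below the original read length, with a tail of shorter reads that lost more adapter sequence.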

Far from being an error, this distribution confirms that we have successfully removed the artificial adapter sequences while preserving the maximum amount of biological data. With clean, adapter-free data, we are now ready to map our reads to the genome.
