Chapter 9: Visualization – Seeing is Believing

Learning Objectives

By the end of this chapter, you will be able to:


The Concept: From Spreadsheets to Landscapes

At this stage in our journey, your data exists in two primary formats: final_clean_peaks.bed (your filtered reads) and ENCFF535OGL_peaks.narrowPeak (your statistical peaks). While these files contain all your biological discoveries, they are essentially massive spreadsheets containing millions of lines of coordinates.

If you tried to load millions of text lines directly into a genome browser, it would overwhelm your computer's memory. To visualize the data effectively, we need to change its shape. We need to calculate Coverage—essentially counting how many reads stack up at every single base pair of the genome to form "mountains" of data.

We save this coverage data in a format called BigWig (.bw). Unlike the text-heavy BED format ("Read A is at position 100"), a BigWig file is binary and compressed ("Position 100 has a pile height of 50"). This makes it ultra-fast for browsing. Our goal in this chapter is to convert our alignment into a BigWig file and then load it into a web browser to verify our results with our own eyes.


Phase 1: Tool Installation

To perform this conversion, we need to add a few more tools to our bioinformatics toolkit. We need samtools to read the reference genome index, bedtools to calculate the coverage (the "Swiss Army Knife" of genomics), and ucsc-bedgraphtobigwig to perform the final compression.

Run the following command to install them into your environment:

conda activate bio-tools

conda install samtools

conda install -c bioconda bedtools ucsc-bedgraphtobigwig

Expected Output:

The terminal will display the "Package Plan" listing the dependencies to be installed. Once you type y and hit Enter, it will download and install the software, returning you to the command prompt when finished.


Phase 2: Defining the Map Boundaries (Chromosome Sizes)

Before we can build a map of our reads, the computer needs to know the boundaries of the territory. Since we are using our "Toy" reference genome (Chromosome 19), we need a simple text file that tells the software exactly how long Chromosome 19 is. Without this, the software won't know when to stop drawing the map.

Step 1: Calculate the Size

We do not need to look up this information online; we can derive it directly from the reference file (chr19.fa) we downloaded in Chapter 6. We use samtools faidx (Fasta Index) to scan the file and count the bases.

# Navigate to the reference data folder

cd reference_data

# Index the fasta file

samtools faidx chr19.fa

# View the contents of the new index file

cat chr19.fa.fai

Expected Output:

chr19	61431566	7	50	51

The second number, 61431566, is the critical piece of information: it is the exact length of Chromosome 19 in base pairs.

Step 2: Create the Sizes File (Avoiding the "Hidden Character" Trap)

Now we need to save this name and number into a file named chr19.chrom.sizes.

Important Formatting Note:

Bioinformatics tools are extremely strict about formatting. The columns in this file must be separated by a Tab character, not a Space. If you use a space, bedtools will fail to read the file (as you saw in your logs with the "no valid entries" error). To ensure we create a perfect file, we use the printf command with \t, which forces the computer to insert a real Tab character.

# Return to project root

cd ..

# Create the file using printf to force a TAB (\t) separator

printf "chr19\t61431566\n" > reference_data/chr19.chrom.sizes

Expected Output:

This command is silent. It produces no text on the screen, but if you run cat reference_data/chr19.chrom.sizes, you will see your properly formatted file.


Phase 3: Creating the Coverage Track (The BigWig)

Now that we have our map boundaries, we can convert our raw reads into a visual track. This is a standard three-step pipeline: Count -> Sort -> Compress.

Step 1: Calculate Coverage (BED to BedGraph)

We use bedtools genomecov to count the reads piling up at each base. We are taking our clean aligned reads (final_clean_peaks.bed) and asking the tool to output a "BedGraph" format text file.

# Create a folder for visualization files

mkdir -p results/visualization

# Run coverage calculation

# -i: Input (Your clean, aligned reads)

# -g: Genome sizes file (The map boundaries)

# -bg: Output format (BedGraph text file)

bedtools genomecov -i results/chromap/final_clean_peaks.bed \

                   -g reference_data/chr19.chrom.sizes \

                   -bg > results/visualization/temp.bedgraph

Expected Output:

This command runs silently. If successful, it creates a new file results/visualization/temp.bedgraph and returns you to the command prompt.

Step 2: Sort the Data

The compression tool we use in the next step is picky; it requires the data to be sorted by chromosome position (from the beginning of the chromosome to the end). We use the sort command to arrange the lines numerically.

# Sort by chromosome (column 1) then by numeric position (column 2)

sort -k1,1 -k2,2n results/visualization/temp.bedgraph > results/visualization/sorted.bedgraph

Expected Output:

This command also runs silently. It reads your temp file and saves the sorted version as results/visualization/sorted.bedgraph.

Step 3: Compress to BigWig (BedGraph to BigWig)

Finally, we convert the text file into the binary BigWig format using the bedGraphToBigWig tool. This dramatically reduces the file size and makes it readable by web browsers.

bedGraphToBigWig results/visualization/sorted.bedgraph \

                 reference_data/chr19.chrom.sizes \

                 results/visualization/ENCFF535OGL.bw

Expected Output:

This command runs silently. It creates your final result file: results/visualization/ENCFF535OGL.bw.

Clean Up:

We no longer need the intermediate text files (temp.bedgraph and sorted.bedgraph). It is good practice to delete them to keep your folders tidy.

rm results/visualization/*.bedgraph
← Back: Peak Calling & QC Home