Chapter 9: Visualization – Seeing is Believing
Learning Objectives
By the end of this chapter, you will be able to:
- Differentiate between analysis file formats (
.bed) and visualization file formats (.bw/ BigWig). - Derive accurate chromosome sizes directly from a reference genome using
samtools. - Execute a file conversion pipeline using
bedtoolsto transform raw reads into a coverage track. - Visualize your genomic data using the cloud-based IGV Web App (no installation required).
- Perform a visual "Sanity Check" by overlaying your raw signal (BigWig) with your statistical calls (Peaks).
The Concept: From Spreadsheets to Landscapes
At this stage in our journey, your data exists in two primary formats: final_clean_peaks.bed (your filtered reads) and ENCFF535OGL_peaks.narrowPeak (your statistical peaks). While these files contain all your biological discoveries, they are essentially massive spreadsheets containing millions of lines of coordinates.
If you tried to load millions of text lines directly into a genome browser, it would overwhelm your computer's memory. To visualize the data effectively, we need to change its shape. We need to calculate Coverage—essentially counting how many reads stack up at every single base pair of the genome to form "mountains" of data.
We save this coverage data in a format called BigWig (.bw). Unlike the text-heavy BED format ("Read A is at position 100"), a BigWig file is binary and compressed ("Position 100 has a pile height of 50"). This makes it ultra-fast for browsing. Our goal in this chapter is to convert our alignment into a BigWig file and then load it into a web browser to verify our results with our own eyes.
Phase 1: Tool Installation
To perform this conversion, we need to add a few more tools to our bioinformatics toolkit. We need samtools to read the reference genome index, bedtools to calculate the coverage (the "Swiss Army Knife" of genomics), and ucsc-bedgraphtobigwig to perform the final compression.
Run the following command to install them into your environment:
conda activate bio-tools conda install samtools conda install -c bioconda bedtools ucsc-bedgraphtobigwig
Expected Output:
The terminal will display the "Package Plan" listing the dependencies to be installed. Once you type y and hit Enter, it will download and install the software, returning you to the command prompt when finished.
Phase 2: Defining the Map Boundaries (Chromosome Sizes)
Before we can build a map of our reads, the computer needs to know the boundaries of the territory. Since we are using our "Toy" reference genome (Chromosome 19), we need a simple text file that tells the software exactly how long Chromosome 19 is. Without this, the software won't know when to stop drawing the map.
Step 1: Calculate the Size
We do not need to look up this information online; we can derive it directly from the reference file (chr19.fa) we downloaded in Chapter 6. We use samtools faidx (Fasta Index) to scan the file and count the bases.
# Navigate to the reference data folder cd reference_data # Index the fasta file samtools faidx chr19.fa # View the contents of the new index file cat chr19.fa.fai
Expected Output:
chr19 61431566 7 50 51
The second number, 61431566, is the critical piece of information: it is the exact length of Chromosome 19 in base pairs.
Step 2: Create the Sizes File (Avoiding the "Hidden Character" Trap)
Now we need to save this name and number into a file named chr19.chrom.sizes.
Important Formatting Note:
Bioinformatics tools are extremely strict about formatting. The columns in this file must be separated by a Tab character, not a Space. If you use a space, bedtools will fail to read the file (as you saw in your logs with the "no valid entries" error). To ensure we create a perfect file, we use the printf command with \t, which forces the computer to insert a real Tab character.
# Return to project root cd .. # Create the file using printf to force a TAB (\t) separator printf "chr19\t61431566\n" > reference_data/chr19.chrom.sizes
Expected Output:
This command is silent. It produces no text on the screen, but if you run cat reference_data/chr19.chrom.sizes, you will see your properly formatted file.
Phase 3: Creating the Coverage Track (The BigWig)
Now that we have our map boundaries, we can convert our raw reads into a visual track. This is a standard three-step pipeline: Count -> Sort -> Compress.
Step 1: Calculate Coverage (BED to BedGraph)
We use bedtools genomecov to count the reads piling up at each base. We are taking our clean aligned reads (final_clean_peaks.bed) and asking the tool to output a "BedGraph" format text file.
# Create a folder for visualization files
mkdir -p results/visualization
# Run coverage calculation
# -i: Input (Your clean, aligned reads)
# -g: Genome sizes file (The map boundaries)
# -bg: Output format (BedGraph text file)
bedtools genomecov -i results/chromap/final_clean_peaks.bed \
-g reference_data/chr19.chrom.sizes \
-bg > results/visualization/temp.bedgraph
Expected Output:
This command runs silently. If successful, it creates a new file results/visualization/temp.bedgraph and returns you to the command prompt.
Step 2: Sort the Data
The compression tool we use in the next step is picky; it requires the data to be sorted by chromosome position (from the beginning of the chromosome to the end). We use the sort command to arrange the lines numerically.
# Sort by chromosome (column 1) then by numeric position (column 2) sort -k1,1 -k2,2n results/visualization/temp.bedgraph > results/visualization/sorted.bedgraph
Expected Output:
This command also runs silently. It reads your temp file and saves the sorted version as results/visualization/sorted.bedgraph.
Step 3: Compress to BigWig (BedGraph to BigWig)
Finally, we convert the text file into the binary BigWig format using the bedGraphToBigWig tool. This dramatically reduces the file size and makes it readable by web browsers.
bedGraphToBigWig results/visualization/sorted.bedgraph \
reference_data/chr19.chrom.sizes \
results/visualization/ENCFF535OGL.bw
Expected Output:
This command runs silently. It creates your final result file: results/visualization/ENCFF535OGL.bw.
Clean Up:
We no longer need the intermediate text files (temp.bedgraph and sorted.bedgraph). It is good practice to delete them to keep your folders tidy.
rm results/visualization/*.bedgraph