Chapter 2: Project Organization: How to Stop Losing Your Data
Welcome to the command line. If you are coming from a wet lab background, looking at a blank terminal window with a blinking cursor can feel like walking into a lab where the lights are off, the benches are empty, and you don't know where the pipettes are stored. This chapter is about turning that empty black box into a fully organized, functional workspace.
In the wet lab, you wouldn't start an expensive ATAC-seq library preparation on a cluttered desk covered in old coffee cups and loose papers. You would wipe down the bench, arrange your racks, label your tubes, and set out your reagents. We are going to do the exact same thing here, but instead of plastic tubes and buffers, we are organizing directories (folders) and files.
Part 1: The Philosophy of Structure
Why does the folder layout matter?
Imagine you finish your ATAC-seq experiment, get beautiful peaks, and publish a paper. Two years later, a grad student wants to re-analyze your data to look for a specific transcription factor footprint. They open your project folder.
If your folder is a mess of files named analysis_final_v2.txt and test_run_new.bam all dumped in one place, that data is effectively dead. They won't know which file was the raw data, which was the filtered data, or which reference genome you used.
We organize our project to ensure Reproducibility.
- The "Raw Data" is sacred: Just like a clinical tissue sample, if you lose or corrupt the raw sequencing data, you can't get it back without paying thousands of dollars. We need a safe place for it.
- The "Protocol" must be clear: In the wet lab, you have a lab notebook. In the dry lab, your "scripts" are your notebook. They need their own dedicated shelf.
- The "Trash" is separate: We generate a lot of intermediate garbage files during analysis. We need to know what can be thrown away and what must be kept.
Part 2: Laying the Foundation
Let’s build this structure together, step by step. Open your terminal.
Step 1: Finding Your Footing (pwd and cd)
When you first open a terminal, you are usually standing in your "Home" directory. This is your personal room in the massive building that is the computer's hard drive.
To verify where you are, type:
pwd
What this does: pwd stands for Print Working Directory. It asks the computer: "Where am I right now?"
Expected Output: You should see something like /Users/yourname (macOS) or /home/yourname (Linux).
Now, let's make sure we start fresh by moving to this home base (just in case you were somewhere else):
cd
What this does: cd stands for Change Directory. Typing it with no arguments is the equivalent of "clicking the Home button." It teleports you back to your user folder.
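One quick way to convince yourself that cd really took you home is to compare the output of pwd with the shell's HOME variable. This is a minimal sketch: HOME is a standard variable on macOS and Linux, but the path it holds will be different on your machine.

```shell
cd             # no arguments: jump back to the home directory
pwd            # print the directory we are standing in now
echo "$HOME"   # the shell's own record of the home directory; should match the line above
```

If the two lines printed are the same, you are at home base.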
Step 2: Creating the Project Root (mkdir)
We need a main box to hold our entire experiment. We will call it atacseq.
mkdir atacseq
What this does: mkdir stands for Make Directory. You have just created a new empty folder.
Now, we must physically walk into that folder to start working.
cd atacseq
If you type pwd again, you will see your location has updated. You are now "inside" the project.
Part 3: Building the Departments
Now that we are inside the atacseq folder, we need to create specific sub-rooms for different types of materials. In a wet lab, you wouldn't store your 4°C enzymes in the 37°C incubator. Similarly, we don't mix raw data with results.
Run this command to create the primary structure:
mkdir raw_data reference_data scripts logs meta
1. raw_data/ (The -80°C Freezer)
- Biological Context: This is where your FASTQ files go immediately after you download them from the sequencing core.
- The Golden Rule: This folder is Read-Only. You put files in, but you never edit them, open them in an editor, or save over them. If you mess up your analysis, you can always go back to raw_data and start over. If you mess up raw_data, the experiment is over.
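The Read-Only rule can also be enforced by the computer rather than by willpower. The sketch below is one possible approach, not a required step: it uses chmod to remove write permission from a file. It runs in a throwaway directory made with mktemp so nothing real is touched, and the file name sample1_R1.fastq is a hypothetical placeholder.

```shell
# Practice in a throwaway directory so the real project is not affected.
scratch=$(mktemp -d)
mkdir "$scratch/raw_data"

# Hypothetical file standing in for a real FASTQ file.
echo "placeholder" > "$scratch/raw_data/sample1_R1.fastq"

# Remove write permission for everyone (user, group, and others).
chmod a-w "$scratch/raw_data/sample1_R1.fastq"

# The permission column at the start of the line should now contain no "w".
ls -l "$scratch/raw_data/sample1_R1.fastq"
```

Once real FASTQ files have arrived, the same chmod a-w command could be pointed at them inside raw_data.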
2. reference_data/ (The Library/Textbooks)
- Biological Context: To analyze ATAC-seq, we need to map your reads to a genome (like Human hg38 or Mouse mm10). This folder holds the genome sequence (FASTA files) and the gene annotations (GTF files).
- Why separate it? These files are huge and rarely change. By keeping them here, you avoid cluttering your analysis folders.
3. scripts/ (The Lab Notebook)
- Biological Context: This is where you save the code you write.
- Analogy: If you were doing a Western Blot, you would write down the antibody concentrations and incubation times. In bioinformatics, the "script" is the protocol. It tells the computer exactly how to process the samples.
4. logs/ (The Troubleshooting Journal)
- Biological Context: When you run a bioinformatics tool, it spits out text telling you what it did ("Processed 1 million reads...").
- Why separate it? Sometimes tools crash. The "error message" is your only clue to what went wrong. We redirect those messages into this folder so your screen doesn't get flooded, but the record is kept for debugging.
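To show what "redirecting messages" looks like in practice, here is a small sketch. The function fake_tool is a made-up stand-in for a real bioinformatics program, and the log file names are assumptions chosen for illustration. The example works in a throwaway directory.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/logs"

# Hypothetical stand-in for a real tool: it prints one progress message
# on the normal channel (stdout) and one warning on the error channel (stderr).
fake_tool() {
    echo "Processed 1 million reads..."
    echo "Warning: example message on the error channel" >&2
}

# ">" saves the normal output to a file; "2>" saves the error messages.
fake_tool > "$scratch/logs/fake_tool.out" 2> "$scratch/logs/fake_tool.err"

# Nothing appeared on screen above; the record is in the two log files.
cat "$scratch/logs/fake_tool.out"
cat "$scratch/logs/fake_tool.err"
```

Keeping the two channels in separate files makes it easier to find the error message when a tool crashes.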
5. meta/ (The Sample Sheet)
- Biological Context: This contains the "metadata." Which sample is Control? Which is Treated? Which replicate is which? It’s the spreadsheet that decodes the filenames.
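A sample sheet can be as simple as a tab-separated text file. The sketch below writes one with printf. Every sample name, condition, and file name in it is hypothetical, and the column layout is only one reasonable choice, not a standard you must follow.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/meta"
sheet="$scratch/meta/samples.tsv"

# Header row, then one row per sample. "\t" is a tab, "\n" ends the line.
# All values below are invented for illustration.
printf 'sample_id\tcondition\treplicate\tfastq_file\n'      >  "$sheet"
printf 'ctrl_1\tcontrol\t1\tctrl_1_R1.fastq.gz\n'           >> "$sheet"
printf 'ctrl_2\tcontrol\t2\tctrl_2_R1.fastq.gz\n'           >> "$sheet"
printf 'treat_1\ttreated\t1\ttreat_1_R1.fastq.gz\n'         >> "$sheet"
printf 'treat_2\ttreated\t2\ttreat_2_R1.fastq.gz\n'         >> "$sheet"

# Show the finished sheet.
cat "$sheet"
```

A plain-text sheet like this can be opened in a spreadsheet program and can also be read by scripts later on.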
Part 4: Preparing for Results
We are almost done. We need a place to put our output. However, we anticipate that we will use different tools, and we don't want all the results mixed together.
We will use a special trick with mkdir here:
mkdir -p results/fastqc results/bowtie2
The -p Flag (Parent Mode):
Normally, if you told the computer "Make a folder called fastqc inside results," it would complain: "Error: The folder 'results' doesn't exist yet!"
The -p flag tells the computer: "If the parent folder (results) doesn't exist, please create it for me first, and then create the child folder inside it."
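The difference is easiest to see side by side. This sketch tries the command without -p, then with it, inside a throwaway directory so your real project is not affected. The exact wording of the error message depends on your system, so it is not reproduced here.

```shell
scratch=$(mktemp -d)

# Without -p: fails, because the parent folder "results" does not exist yet.
mkdir "$scratch/results/fastqc" || echo "plain mkdir failed, as expected"

# With -p: creates "results" first, then "fastqc" inside it.
mkdir -p "$scratch/results/fastqc"

# Running it a second time is harmless: no error, nothing deleted.
mkdir -p "$scratch/results/fastqc"

ls "$scratch/results"
```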
What are these specific result folders?
- results/fastqc: This is for Quality Control. Before asking biological questions, we check the technical quality of the run. Did the sequencing reaction work? Are the reads high quality?
- results/bowtie2: This is for Alignment. This is the step where we figure out where each DNA fragment belongs on the chromosomes.
Part 5: Verification
You have typed the commands, but how do you know it worked? Let's take a bird's-eye view of your creation.
Run this command:
ls -R
What this does: ls lists files. The -R flag stands for Recursive. It acts like an X-ray, showing you the current folder, plus everything inside the subfolders, and the sub-subfolders.
You should see this structure (some systems, such as macOS, omit the first line):
.:
logs  meta  raw_data  reference_data  results  scripts

./logs:

./meta:

./raw_data:

./reference_data:

./results:
bowtie2  fastqc

./results/bowtie2:

./results/fastqc:

./scripts:
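If the ls -R output feels hard to read, an alternative is to ask for directories only, one per line. The sketch below uses find with sort, which is a different tool from the one used above. It rebuilds the same layout in a throwaway directory first, so it runs on its own.

```shell
scratch=$(mktemp -d)

# Recreate the chapter's layout inside a practice copy.
mkdir -p "$scratch/atacseq/raw_data" "$scratch/atacseq/reference_data" \
         "$scratch/atacseq/scripts" "$scratch/atacseq/logs" \
         "$scratch/atacseq/meta" \
         "$scratch/atacseq/results/fastqc" "$scratch/atacseq/results/bowtie2"

# "-type d" keeps directories only; sort puts them in a stable order.
find "$scratch/atacseq" -type d | sort
```

From inside your real atacseq folder, the equivalent command would be find . -type d | sort.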
Summary Checklist
If you see the output above, congratulations! You have successfully set up your digital lab bench.
- raw_data: Empty for now. Waiting for FASTQ files.
- reference_data: Empty. Waiting for the genome.
- results: Has two dedicated buckets (fastqc, bowtie2) ready to catch data.
Troubleshooting
- "Permission Denied": You might be trying to create folders in a system directory (like /root or /usr). Make sure you ran cd first to get to your home directory.
- "File exists": You might have run the command twice. That's okay! The folder is already there and nothing was overwritten. mkdir -p is safe to run multiple times; it won't complain, and it won't delete what is already there.
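If you are not sure which folders made it, a short loop can check each one for you. This is a sketch under assumptions: check_layout is a hypothetical helper name invented for this example, and the folder list is copied from the commands in Parts 3 and 4. The demonstration deliberately builds an incomplete practice project so you can see what a missing folder looks like.

```shell
# Hypothetical helper: report which expected folders exist under the
# project directory given as the first argument.
check_layout() {
    missing=0
    for d in raw_data reference_data scripts logs meta results/fastqc results/bowtie2; do
        if [ -d "$1/$d" ]; then
            echo "ok       $d"
        else
            echo "MISSING  $d"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if nothing was missing.
    [ "$missing" -eq 0 ]
}

# Demonstration on a deliberately incomplete practice project.
scratch=$(mktemp -d)
mkdir -p "$scratch/raw_data" "$scratch/scripts" "$scratch/results/fastqc"
check_layout "$scratch" || echo "Some folders are missing; re-run the mkdir commands from Parts 3 and 4."
```

To check your real project, you could run check_layout . from inside the atacseq folder.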
Next Step for You
Now that your "bench" is clean and organized, we need to buy the equipment. You wouldn't start an experiment without pipettes and a centrifuge, and we can't analyze DNA without software. In Chapter 3, I will guide you through installing Conda (your digital lab manager) and setting up the essential tools you will need to run your experiments.