Chapter 2: Project Organization: How to Stop Losing Your Data

Welcome to the command line. If you are coming from a wet lab background, looking at a blank terminal window with a blinking cursor can feel like walking into a lab where the lights are off, the benches are empty, and you don't know where the pipettes are stored. This chapter is about turning that empty black box into a fully organized, functional workspace.

In the wet lab, you wouldn't start an expensive ATAC-seq library preparation on a cluttered desk covered in old coffee cups and loose papers. You would wipe down the bench, arrange your racks, label your tubes, and set out your reagents. We are going to do the exact same thing here, but instead of plastic tubes and buffers, we are organizing directories (folders) and files.

Part 1: The Philosophy of Structure

Why does the folder layout matter?

Imagine you finish your ATAC-seq experiment, get beautiful peaks, and publish a paper. Two years later, a grad student wants to re-analyze your data to look for a specific transcription factor footprint. They open your project folder.

If your folder is a mess of files named analysis_final_v2.txt and test_run_new.bam all dumped in one place, that data is effectively dead. They won't know which file was the raw data, which was the filtered data, or which reference genome you used.

We organize our project to ensure Reproducibility.

The "Raw Data" is sacred: Just like a clinical tissue sample, if you lose or corrupt the raw sequencing data, you can't get it back without paying thousands of dollars. We need a safe place for it.
The "Protocol" must be clear: In the wet lab, you have a lab notebook. In the dry lab, your "scripts" are your notebook. They need their own dedicated shelf.
The "Trash" is separate: We generate a lot of intermediate garbage files during analysis. We need to know what can be thrown away and what must be kept.

Part 2: Laying the Foundation

Let’s build this structure together, step by step. Open your terminal.

Step 1: Finding Your Footing (`pwd` and `cd`)

When you first open a terminal, you are usually standing in your "Home" directory. This is your personal room in the massive building that is the computer's hard drive.

To verify where you are, type:

pwd

What this does: pwd stands for Print Working Directory. It asks the computer: "Where am I right now?"

Expected Output: You should see something like /Users/yourname (macOS) or /home/yourname (Linux).

Now, let's make sure we start fresh by moving to this home base (just in case you were somewhere else):

cd

What this does: cd stands for Change Directory. Typing it with no arguments is the equivalent of "clicking the Home button." It teleports you back to your user folder.

Step 2: Creating the Project Root (`mkdir`)

We need a main box to hold our entire experiment. We will call it atacseq.

mkdir atacseq

What this does: mkdir stands for Make Directory. You have just created a new empty folder.

Now, we must physically walk into that folder to start working.

cd atacseq

If you type pwd again, you will see your location has updated. You are now "inside" the project.
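If you would like to rehearse these two steps somewhere harmless first, here is a small sketch. It assumes your system has the mktemp command (standard on Linux and macOS), which creates a throwaway scratch folder; the variable name scratch is just a label chosen for this example. The && between the two commands means "only do the second step if the first one worked," so you never walk into a folder that failed to be created.

```shell
# Create a throwaway practice area and step into it
scratch=$(mktemp -d)
cd "$scratch"

# Make the project folder, and enter it only if mkdir succeeded
mkdir atacseq && cd atacseq

# Confirm where we are standing now; the path should end in /atacseq
pwd
```

Once you are comfortable, the same mkdir atacseq && cd atacseq line works from your home directory.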

Part 3: Building the Departments

Now that we are inside the atacseq folder, we need to create specific sub-rooms for different types of materials. In a wet lab, you wouldn't store your 4°C enzymes in the 37°C incubator. Similarly, we don't mix raw data with results.

Run this command to create the primary structure:

mkdir raw_data reference_data scripts logs meta

1. `raw_data/` (The -80°C Freezer)

Biological Context: This is where your FASTQ files go immediately after you download them from the sequencing core.
The Golden Rule: Treat this folder as read-only. You put files in, and your tools read from them, but you never edit them, rename them, or save over them. If you mess up your analysis, you can always go back to the raw_data and start over. If you mess up the raw_data, the experiment is over.
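One way to put a lock on the freezer door is to remove write permission from the files once they arrive. The sketch below practices this in a scratch folder; the file name example_sample.fastq.gz is an empty placeholder invented for the demonstration, not a real sequencing file. Treat this as a seatbelt rather than a vault: an administrator account can still overwrite the file, and it does not replace a backup.

```shell
# Practice area with a raw_data folder inside it
project=$(mktemp -d)
mkdir -p "$project/raw_data"

# Hypothetical placeholder standing in for a real FASTQ file
touch "$project/raw_data/example_sample.fastq.gz"

# Remove write permission for everyone (a = all users, -w = take away write)
chmod a-w "$project/raw_data/example_sample.fastq.gz"

# The permission column should now show only r (read), with no w (write)
ls -l "$project/raw_data"
```

Your analysis tools can still read the file as usual; only changes to it are blocked.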

2. `reference_data/` (The Library/Textbooks)

Biological Context: To analyze ATAC-seq, we need to map your reads to a genome (like Human hg38 or Mouse mm10). This folder holds the genome sequence (FASTA files) and the gene annotations (GTF files).
Why separate it? These files are huge and rarely change. By keeping them here, you avoid cluttering your analysis folders.

3. `scripts/` (The Lab Notebook)

Biological Context: This is where you save the code you write.
Analogy: If you were doing a Western Blot, you would write down the antibody concentrations and incubation times. In bioinformatics, the "script" is the protocol. It tells the computer exactly how to process the samples.

4. `logs/` (The Troubleshooting Journal)

Biological Context: When you run a bioinformatics tool, it spits out text telling you what it did ("Processed 1 million reads...").
Why separate it? Sometimes tools crash. The "error message" is your only clue to what went wrong. We redirect those messages into this folder so your screen doesn't get flooded, but the record is kept for debugging.
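To make "redirecting messages" concrete, here is a minimal sketch. Because we have not installed any real tools yet, it uses a pretend tool called fake_tool (a name made up for this example) that prints one normal message and one warning. The > symbol sends the normal output into a file, and 2>&1 sends the error messages to the same place.

```shell
# Practice area with a logs folder inside it
project=$(mktemp -d)
mkdir -p "$project/logs"

# Hypothetical stand-in for a real bioinformatics tool:
# one line of normal output, one line on the error channel
fake_tool() {
    echo "Processed 1 million reads"
    echo "warning: this is an example message" >&2
}

# Capture both normal output and error messages in one log file
fake_tool > "$project/logs/fake_tool.log" 2>&1

# Nothing appeared on screen during the run; the record is in the file
cat "$project/logs/fake_tool.log"
```

The same pattern applies to any real command later on: put the command where fake_tool is, and pick a log file name that tells you which step produced it.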

5. `meta/` (The Sample Sheet)

Biological Context: This contains the "metadata." Which sample is Control? Which is Treated? Which replicate is which? It’s the spreadsheet that decodes the filenames.
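As an illustration of what such a spreadsheet might look like, the sketch below writes a tiny comma-separated sample sheet. Every sample name, condition, and file name in it is a made-up placeholder, and the column names are only one reasonable choice; your own sheet should mirror your real experimental design.

```shell
# Practice area with a meta folder inside it
project=$(mktemp -d)
mkdir -p "$project/meta"

# Write a small sample sheet. All values below are hypothetical placeholders.
cat > "$project/meta/samples.csv" <<'EOF'
sample_id,condition,replicate,fastq_file
ctrl_rep1,control,1,ctrl_rep1.fastq.gz
ctrl_rep2,control,2,ctrl_rep2.fastq.gz
treat_rep1,treated,1,treat_rep1.fastq.gz
treat_rep2,treated,2,treat_rep2.fastq.gz
EOF

# Look at the result
cat "$project/meta/samples.csv"
```

A plain-text sheet like this opens in any spreadsheet program and can also be read by scripts, which is why it is a common choice for metadata.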

Part 4: Preparing for Results

We are almost done. We need a place to put our output. However, we anticipate that we will use different tools, and we don't want all the results mixed together.

We will use a special trick with mkdir here:

mkdir -p results/fastqc results/bowtie2

The -p Flag (Parent Mode):

Normally, if you told the computer "Make a folder called fastqc inside results," it would refuse with an error along the lines of "No such file or directory," because the parent folder results doesn't exist yet.

The -p flag tells the computer: "If the parent folder (results) doesn't exist, please create it for me first, and then create the child folder inside it."
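You can watch the difference yourself in a scratch folder. This sketch first tries a plain mkdir (which fails because results is missing), then repeats the request with -p. The scratch variable and the echoed messages are just labels for this demonstration.

```shell
# Work in a throwaway practice area
scratch=$(mktemp -d)
cd "$scratch"

# Without -p: the parent folder "results" does not exist, so this fails
if mkdir results/fastqc 2>/dev/null; then
    echo "plain mkdir worked"
else
    echo "plain mkdir failed: the parent folder is missing"
fi

# With -p: the parent is created first, then the children
mkdir -p results/fastqc results/bowtie2

# Running it again is harmless: no error, nothing deleted
mkdir -p results/fastqc results/bowtie2

ls results
```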

What are these specific result folders?

results/fastqc: This is for Quality Control. Before asking biological questions, we check the technical side. Did the sequencing reaction work? Are the reads high quality?
results/bowtie2: This is for Alignment. This is the step where we figure out where each DNA fragment belongs on the chromosomes.

Part 5: Verification

You have typed the commands, but how do you know it worked? Let's take a bird's-eye view of your creation.

Run this command:

ls -R

What this does: ls lists files. The -R flag stands for Recursive. It acts like an X-ray, showing you the current folder, plus everything inside the subfolders, and the sub-subfolders.

You should see this structure (on macOS, the first line, .:, is left out):

.:
logs  meta  raw_data  reference_data  results  scripts

./logs:

./meta:

./raw_data:

./reference_data:

./results:
bowtie2  fastqc

./results/bowtie2:

./results/fastqc:

./scripts:
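If the ls -R output feels spread out, another way to double-check your work is to ask for folders only, one per line. The sketch below rebuilds the layout in a scratch folder and lists it with find, a standard command that searches the directory tree; -type d means "directories only." Sorting with LC_ALL=C is an assumption made here so that the order is the same on every machine.

```shell
# Rebuild the chapter's layout in a throwaway practice area
project=$(mktemp -d)
cd "$project"
mkdir -p raw_data reference_data scripts logs meta results/fastqc results/bowtie2

# List every folder, one per line, in a predictable order
listing=$(find . -type d | LC_ALL=C sort)
echo "$listing"

# Expected output:
# .
# ./logs
# ./meta
# ./raw_data
# ./reference_data
# ./results
# ./results/bowtie2
# ./results/fastqc
# ./scripts
```

In your real project, run only the find line from inside the atacseq folder.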

Summary checklist

If you see the output above, congratulations! You have successfully set up your digital lab bench.

raw_data: Empty for now. Waiting for FASTQ files.
reference_data: Empty. Waiting for the genome.
results: Has two dedicated buckets (fastqc, bowtie2) ready to catch data.
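For future projects, the whole chapter can be condensed into one small reusable recipe. This is a sketch, not an official tool: setup_project is a function name invented here, and the demonstration builds the layout inside a scratch folder so that nothing in your home directory is touched.

```shell
# Hypothetical helper: build the chapter's folder layout under a given path
setup_project() {
    root=$1
    mkdir -p "$root/raw_data" "$root/reference_data" "$root/scripts" \
             "$root/logs" "$root/meta" \
             "$root/results/fastqc" "$root/results/bowtie2"
}

# Demonstration in a throwaway practice area
scratch=$(mktemp -d)
setup_project "$scratch/atacseq"

# A second run is harmless because every folder is made with mkdir -p
setup_project "$scratch/atacseq"

ls "$scratch/atacseq"
```

To use it for real, you would call setup_project "$HOME/atacseq", or swap in a different project name.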

Troubleshooting

"Permission Denied": You might be trying to create folders in a system directory (like /root or /usr). Make sure you ran cd first to get to your home directory.
"File exists": You probably ran a plain mkdir twice (for example, mkdir atacseq). That's okay! Nothing was deleted or overwritten; the folder is already there, so just move on. Commands that use mkdir -p never show this error for folders that already exist, which makes them safe to run multiple times.

Next Step for You

Now that your "bench" is clean and organized, we need to buy the equipment. You wouldn't start an experiment without pipettes and a centrifuge, and we can't analyze DNA without software. In Chapter 3, I will guide you through installing Conda (your digital lab manager) and setting up the essential tools you will need to run your experiments.
