Chapter 2: Project Organization: How to Stop Losing Your Data

Welcome to the command line. If you are coming from a wet lab background, looking at a blank terminal window with a blinking cursor can feel like walking into a lab where the lights are off, the benches are empty, and you don't know where the pipettes are stored. This chapter is about turning that empty black box into a fully organized, functional workspace.

In the wet lab, you wouldn't start an expensive ATAC-seq library preparation on a cluttered desk covered in old coffee cups and loose papers. You would wipe down the bench, arrange your racks, label your tubes, and set out your reagents. We are going to do the exact same thing here, but instead of plastic tubes and buffers, we are organizing directories (folders) and files.

Part 1: The Philosophy of Structure

Why does the folder layout matter?

Imagine you finish your ATAC-seq experiment, get beautiful peaks, and publish a paper. Two years later, a grad student wants to re-analyze your data to look for a specific transcription factor footprint. They open your project folder.

If your folder is a mess of files named analysis_final_v2.txt and test_run_new.bam all dumped in one place, that data is effectively dead. They won't know which file was the raw data, which was the filtered data, or which reference genome you used.

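We organize our project to ensure reproducibility: anyone who opens the folder later, including your future self, should be able to tell which files are raw data, which are processed, and which reference genome was used.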

Part 2: Laying the Foundation

Let’s build this structure together, step by step. Open your terminal.

Step 1: Finding Your Footing (pwd and cd)

When you first open a terminal, you are usually standing in your "Home" directory. This is your personal room in the massive building that is the computer's hard drive.

To verify where you are, type:

pwd

What this does: pwd stands for Print Working Directory. It asks the computer: "Where am I right now?"

Expected Output: You should see something like /Users/yourname (macOS) or /home/yourname (Linux).

Now, let's make sure we start fresh by moving to this home base (just in case you were somewhere else):

cd

What this does: cd stands for Change Directory. Typing it with no arguments is the equivalent of "clicking the Home button." It teleports you back to your user folder.
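If you want to double-check that cd really brought you home, here is a minimal sketch. It assumes your shell sets the standard HOME variable (not mentioned above), which stores the path of your home directory. The two printed lines should match.

```shell
# Jump to the home directory from wherever you are.
cd

# Print the current location.
pwd

# Print the path stored in HOME (assumption: your shell sets this variable).
echo "$HOME"
```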

Step 2: Creating the Project Root (mkdir)

We need a main box to hold our entire experiment. We will call it atacseq.

mkdir atacseq

What this does: mkdir stands for Make Directory. You have just created a new empty folder.

Now, we must physically walk into that folder to start working.

cd atacseq

If you type pwd again, you will see your location has updated. You are now "inside" the project.
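If you would rather rehearse these steps before touching your real home directory, the sketch below runs them inside a throwaway folder. The variable name sandbox is hypothetical and used only for this example; mktemp -d is a standard command that creates an empty temporary directory.

```shell
# Create a throwaway practice folder and step into it.
# "sandbox" is a hypothetical variable name used only for this sketch.
sandbox="$(mktemp -d)"
cd "$sandbox"

mkdir atacseq    # create the project root
cd atacseq       # walk into it
pwd              # the printed path should now end in /atacseq
```

Nothing here affects your real project, so you can repeat it as often as you like.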

Part 3: Building the Departments

Now that we are inside the atacseq folder, we need to create specific sub-rooms for different types of materials. In a wet lab, you wouldn't store your 4°C enzymes in the 37°C incubator. Similarly, we don't mix raw data with results.

Run this command to create the primary structure:

mkdir raw_data reference_data scripts logs meta

1. raw_data/ (The -80°C Freezer)

2. reference_data/ (The Library/Textbooks)

3. scripts/ (The Lab Notebook)

4. logs/ (The Troubleshooting Journal)

5. meta/ (The Sample Sheet)
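The mkdir command above accepts several names at once and makes one folder per name. A minimal sketch, run in a throwaway folder so it is safe to try anywhere (sandbox is a hypothetical variable name for this example):

```shell
# Work in a throwaway practice folder.
sandbox="$(mktemp -d)"
cd "$sandbox"

# One mkdir call, five folders: each name becomes its own directory.
mkdir raw_data reference_data scripts logs meta

# List what was created, one name per line.
ls -1
```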

Part 4: Preparing for Results

We are almost done. We need a place to put our output. However, we anticipate that we will use different tools, and we don't want all the results mixed together.

We will use a special trick with mkdir here:

mkdir -p results/fastqc results/bowtie2

The -p Flag (Parent Mode):

Normally, if you told the computer "Make a folder called fastqc inside results," it would refuse with a "No such file or directory" error, because the results folder doesn't exist yet.

The -p flag tells the computer: "If the parent folder (results) doesn't exist, please create it for me first, and then create the child folder inside it."
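To see the difference for yourself, here is a small sketch that tries the command both ways inside a throwaway folder (sandbox is a hypothetical variable name). It also shows a handy side effect of -p: running the same command a second time is harmless, because -p does not complain about folders that already exist.

```shell
# Work in a throwaway practice folder.
sandbox="$(mktemp -d)"
cd "$sandbox"

# Without -p, mkdir refuses because the parent "results" does not exist yet.
if mkdir results/fastqc 2>/dev/null; then
  echo "unexpected: created without -p"
else
  echo "plain mkdir failed, as expected"
fi

# With -p, the parent and both children are created in one go.
mkdir -p results/fastqc results/bowtie2

# Running it again does nothing and reports no error.
mkdir -p results/fastqc results/bowtie2

ls results
```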

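What are these specific result folders? Each one is named after the tool whose output it will hold: results/fastqc for the quality-control reports from FastQC, and results/bowtie2 for the read alignments from Bowtie2.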

Part 5: Verification

You have typed the commands, but how do you know it worked? Let's take a bird's-eye view of your creation.

Run this command:

ls -R

What this does: ls lists files. The -R flag stands for Recursive. It acts like an X-ray, showing you the current folder, plus everything inside the subfolders, and the sub-subfolders.

You should see this structure:

.:
logs  meta  raw_data  reference_data  results  scripts

./logs:

./meta:

./raw_data:

./reference_data:

./results:
bowtie2  fastqc

./results/bowtie2:

./results/fastqc:

./scripts:
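If the ls -R listing feels hard to read, one alternative (a suggestion, not a required step) is the find command, which prints one folder path per line. The sketch below rebuilds the same layout inside a throwaway folder so it is safe to run anywhere; sandbox is a hypothetical variable name.

```shell
# Rebuild the chapter's layout in a throwaway practice folder.
sandbox="$(mktemp -d)"
cd "$sandbox"
mkdir raw_data reference_data scripts logs meta
mkdir -p results/fastqc results/bowtie2

# Print every directory under the current one, one path per line, sorted.
find . -type d | sort
```

In your real project, only the last line is needed: run it from inside atacseq.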

Summary Checklist

If you see the output above, congratulations! You have successfully set up your digital lab bench.

  1. raw_data: Empty for now. Waiting for FASTQ files.
  2. reference_data: Empty. Waiting for the genome.
  3. results: Has two dedicated buckets (fastqc, bowtie2) ready to catch data.
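If you would rather let the computer tick off the checklist, the sketch below tests for each folder created in this chapter. The function name check_layout is hypothetical, chosen only for this example. It takes the project folder as its argument and reports every expected subfolder as ok or MISSING.

```shell
# check_layout is a hypothetical helper name used only for this sketch.
# Argument 1: path to the project folder (use . for the current folder).
check_layout() {
  project="$1"
  missing=0
  for dir in raw_data reference_data scripts logs meta results/fastqc results/bowtie2; do
    if [ -d "$project/$dir" ]; then
      echo "ok       $dir"
    else
      echo "MISSING  $dir"
      missing=$((missing + 1))
    fi
  done
  echo "$missing folder(s) missing"
  # Succeed only when nothing is missing.
  [ "$missing" -eq 0 ]
}

# Check the folder you are currently standing in.
check_layout . || echo "Create the folders marked MISSING, then run the check again."
```

Run it from inside atacseq. Any folder marked MISSING can be created with mkdir -p and the check repeated.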

Troubleshooting


Next Step for You

Now that your "bench" is clean and organized, we need to buy the equipment. You wouldn't start an experiment without pipettes and a centrifuge, and we can't analyze DNA without software. In Chapter 3, I will guide you through installing Conda (your digital lab manager) and setting up the essential tools you will need to run your experiments.
