Variables and for loops
Things covered here:
- Variables
- For loops
Welcome to the wonderful world of loops!
Loops are extremely powerful in all programming languages. They are what let us write out a command or operation once, and have it run on all of our samples or files or whatever we want to act on. Not only is this powerful, but it also helps keep our code more concise and readable, and it helps eliminate some more of our mortal enemy (human error). There are multiple types of loops, but here we are going to cover what is probably the most common type: the for loop. First, we need to quickly cover a tiny bit about variables.
Let’s change back into our starting unix_intro directory:
cd ~/unix_intro
Variables
We can think of a variable as a placeholder for a value that will change with every iteration of our loop. To set a variable at the command line, we need to provide the variable name we want, an equals sign, and then the value we want the variable to hold (with no spaces in between any of that). Let’s try it:
my_var=Europa
Nothing prints out when a variable is set, but the value “Europa” has been stored in the variable “my_var”.
To use what’s stored in a variable, the variable name needs to be preceded by a $ so the shell knows to evaluate what follows, rather than just treat it as generic characters.
To see this in action, we’ll use the echo command. echo is a command that prints out whatever is provided to it (it turns out this is actually really useful in some places – like in a program, for example, to report information to the user). Here we’ll use it to check what’s being stored in our variable:
echo $my_var
Note that if we don’t put the $ in front, echo just prints out the text we gave it:
echo my_var
Recall that spaces are special characters on the command line. If we wanted to set a variable that contained spaces, we could surround the value in quotation marks to tell Unix it should be considered as one thing:
my_new_var="Europa is awesome."
echo $my_new_var
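One detail that becomes handy once we are inside loops is gluing extra text onto a variable's value. Here is a minimal sketch of that, where the variable name "sample" and the "_reads.txt" suffix are made up for illustration. Curly braces mark where the variable name ends, and double quotes keep a value with spaces together as one thing:

```shell
# "sample" and the "_reads.txt" suffix are made up for illustration
sample=Europa

# Without braces the shell would look for a variable named "sample_reads",
# which is not set, so curly braces mark where the name ends
file_name="${sample}_reads.txt"
echo "$file_name"

# Double quotes keep a value containing spaces together as one thing
my_new_var="Europa is awesome."
echo "$my_new_var"
```

Quoting a variable when using it (as in echo "$my_new_var") is optional in the simple cases in this lesson, but it is a safe habit whenever a value might contain spaces.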
Great, that’s really all we need to know about variables for now. Let’s get to the good stuff 🙂
For loops
Let’s make a new directory to work in:
mkdir for_loops
cd for_loops/
The 4 magic words
There are 4 special words in the syntax of a for loop in Unix shells: for, in, do, and done.
Magic word | Purpose |
---|---|
for | set the loop variable name |
in | specify whatever it is we are looping over |
do | specify what we want to do with each item |
done | tell the computer we are done telling it what to do with each item |
Let’s see what this looks like in practice. Here we are going to: name the variable “item” (we can name this whatever we want); loop over 3 words (car, truck, and ukulele); and just echo each item, which will print each word to the terminal.
for item in car truck ukulele
do
  echo $item
done
Note: Notice the prompt is different while we are within the loop syntax. This is to tell us we are not at the typical prompt. If we get stuck with that alternate prompt and we want to get rid of it, we can press ctrl + c to cancel it.
Just to note, we don’t need to put these on separate lines, and we don’t need to indent over the “body” of the loop like we did above (the echo $item part), but both can help with readability so we will continue doing that moving forward. As an example though, we could also enter it like this on one line, separating the major blocks with semicolons:
for item in car truck ukulele; do echo $item; done
We can also do multiple things within the body of the loop (the lines between the special words do and done). Here we’ll add another line that also writes the words into a file we’ll call “words.txt”:
for item in car truck ukulele
do
  echo $item
  echo $item >> words.txt
done
Now we have created a new file that holds these words:
ls
head words.txt
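If we had used > instead of >> inside the loop, the file would be overwritten on every pass through the loop, so only the last item would remain:
for item in car truck ukulele
do
  echo $item
  echo $item > test.txt
done
head test.txt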
Usually we won’t want to type out the items we’re looping over; that was just to demonstrate what’s happening. Often we will want to loop through items in a file, like a list of samples or genomes.
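When the items we want to act on are files themselves, we may not need to type them out or store them in a list at all. As a rough sketch (the file names here are made up, and everything runs in a throwaway directory made with mktemp, which is not covered in this lesson), a wildcard such as * can supply the items:

```shell
# Work in a throwaway directory so nothing real is touched
# (all file names here are made up for illustration)
work_dir=$(mktemp -d)
cd "$work_dir"

# Make three small example files
echo "one" > sample_A.txt
echo "two" > sample_B.txt
echo "three" > sample_C.txt

# The shell expands *.txt into the matching file names before the loop starts
for file in *.txt
do
  echo "Working on: $file"
done
```

The loop body here only echoes each file name, but any command that takes a file could go in its place.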
Looping through lines of a file
Instead of typing out the elements we want to loop over, we can execute a command in such a way that the output of that command becomes the list of things we are looping over.
We’re going to use the cat command to help us do this (which comes from concatenate). cat is kind of like head, except that instead of just printing the first lines in a file, it prints the whole thing:
cat words.txt
Here we’ll use cat to pull the items we want to loop over from the file, instead of us needing to type them out like we did above. The syntax of how to do this may seem a little odd at first, but let’s look at it and then break it down. Here is an example with our “words.txt” file we just made:
for item in $(cat words.txt)
do
  echo $item
done
Here, where we say $(cat words.txt), the command line is performing that operation first (it’s evaluating what’s inside the parentheses, similar to what the dollar sign does when put in front of our variable name, “item”), and then puts the output in its place. We can use echo to see this has the same result as when we typed the items out:
echo $(cat words.txt)
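One thing worth knowing about this approach is that the shell splits the output of cat on any whitespace, not only on line breaks. That is fine for one-word lines like ours, but a line containing a space would become two items. The sketch below uses a made-up file to show the difference, along with one common alternative, a while loop with read. The while loop, read, and IFS= are standard shell features but are not covered in this lesson, so treat that part as an illustration rather than something we need right now:

```shell
# Throwaway directory and a made-up file for illustration
work_dir=$(mktemp -d)
cd "$work_dir"
echo "car" > vehicles.txt
echo "pickup truck" >> vehicles.txt

# $(cat ...) splits on any whitespace, so "pickup truck" becomes two items
for item in $(cat vehicles.txt)
do
  echo "for loop item: $item"
done

# A while loop with read takes one whole line at a time instead
while IFS= read -r line
do
  echo "while loop line: $line"
done < vehicles.txt
```

The for loop runs three times and the while loop runs twice. Since gene IDs and sample names usually contain no spaces, the $(cat ...) form works well for the examples that follow.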
For a more practical example, let’s pull multiple specific sequences we want from a file!
BONUS ROUND: interleaving files with paste
A pretty neat use of paste is to interleave two files. What paste is doing is sticking two files together, line-by-line, with some delimiter (separating character) in between them. This delimiter by default is a tab character, but we can set it to other things too, including a newline character. To demonstrate this, let’s make a fasta-formatted sequence file from our genes in the previous lesson.
NOTE: “Fasta” is a common format for holding sequence information. In it, each sequence entry takes up two lines: the first is the name of the sequence and needs to be preceded by a > character; and the second line is the sequence. It looks like this:
>Seq_1
ATGCGACC
>Seq_2
TCCGACTT
To start, let’s copy over our table that holds the gene IDs, lengths, and sequences (remember the . says to copy it to our current location and keep the same name):
cp ~/unix_intro/six_commands/genes_and_seqs.tsv .
This file holds the gene IDs in the first column and the sequences in the third:
head -n 1 genes_and_seqs.tsv
Let’s get them into their own files. Note the use of -n +2 in the tail command here. This takes everything in the file except the first line, which we don’t want here because it is the header of the table:
cut -f 1 genes_and_seqs.tsv | tail -n +2 > ids.tmp
cut -f 3 genes_and_seqs.tsv | tail -n +2 > seqs.tmp
head ids.tmp
head seqs.tmp
We also need to add the > character in front of our IDs though because that is part of the fasta format. We can do that with sed and using a special character that represents the start of every line (^):
sed 's/^/>/' ids.tmp > fasta_ids.tmp
head fasta_ids.tmp
This sed command is searching for the start of every line (^), and then adding in the > character.
Now for the interleaving. To think about what’s happening here, remember that paste is normally just sticking things together with a tab in between them:
paste fasta_ids.tmp seqs.tmp | head -n 2
But we can tell paste to use a different delimiter by providing it to the -d argument. Here is what it looks like if we want the delimiter to be a dash:
paste -d "-" fasta_ids.tmp seqs.tmp | head -n 2
And we can also tell it to combine them with a newline character in between (which is represented by \n):
paste -d "\n" fasta_ids.tmp seqs.tmp | head -n 4
And that’s our fasta-formatted file! So let’s write it to a new file and get rid of the temporary files we made along the way:
paste -d "\n" fasta_ids.tmp seqs.tmp > genes.faa
head genes.faa
ls *.tmp
rm *.tmp
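To see all of those steps in one place, here is a small self-contained sketch that runs the same recipe on a made-up two-gene table in a throwaway directory. The table contents and file names are invented for illustration, and printf (not covered in this lesson) is only used to write the tab-separated toy table:

```shell
# Throwaway directory so nothing real is touched
work_dir=$(mktemp -d)
cd "$work_dir"

# A made-up two-entry table: header line, then ID, length, and sequence
printf 'gene_ID\tlength\tsequence\n' > toy_table.tsv
printf '1\t8\tATGCGACC\n' >> toy_table.tsv
printf '2\t8\tTCCGACTT\n' >> toy_table.tsv

# Drop the header, keep the IDs, and add the leading ">"
cut -f 1 toy_table.tsv | tail -n +2 | sed 's/^/>/' > toy_ids.tmp
# Drop the header and keep the sequences
cut -f 3 toy_table.tsv | tail -n +2 > toy_seqs.tmp

# Interleave: one ID line, then its sequence line
paste -d "\n" toy_ids.tmp toy_seqs.tmp > toy.faa
rm toy_ids.tmp toy_seqs.tmp
cat toy.faa
```

Here the sed step is piped directly after tail rather than written to its own temporary file, which is just a slightly more compact way of doing the same thing.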
Retrieving specific sequences with a loop
Now imagine we want to pull out all of the sequences that were annotated with that function we looked at before, epoxyqueuosine reductase, which we figured out had the KO identifier “K18979”. We can get the gene IDs using grep like we did previously and then using cut to just keep the first column (note that we are providing the relative path to this file, starting from our current location):
grep "K18979" ../six_commands/gene_annotations.tsv | cut -f 1
And let’s write them to a file:
grep "K18979" ../six_commands/gene_annotations.tsv | cut -f 1 > target_gene_ids.txt
ls
head target_gene_ids.txt
For pulling a few sequences out of a fasta file, grep can be very convenient. But remember the format of fasta is each entry takes two lines, and if we use grep with default settings to find a gene ID, we will only get the line with the gene ID:
grep "99" genes.faa
Fortunately, grep has a handy parameter that lets us pull out lines following the matched text too (in addition to the line with the matched text): the -A parameter. So we can tell grep to pull out the line that matches and the following line like so:
grep -A 1 "99" genes.faa
Cool! There’s one more nuance we need to address though, and that is whether grep is looking for exact matches only or not. For example, trying to grab gene “9” does not do what we want:
grep -A 1 "9" genes.faa
It grabs everything that has a “9” in it. But we can tell grep to only take exact matches, meaning it needs to be the full word, if we provide the -w flag (for word). Here, the “word” (string we’re looking for) must not be directly preceded or followed by a letter, digit, or underscore (so it can sit at the start or end of a line, or next to whitespace or punctuation). We then also just need to add the leading > character in front of the sequence ID we want:
grep -w -A 1 ">9" genes.faa
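To make the effect of -w concrete, here is a small sketch on a made-up fasta file (the IDs and sequences are invented, and printf is only used to write the toy file). It runs the same search with and without the flag:

```shell
# Throwaway directory and a made-up fasta file whose IDs all contain a 9
work_dir=$(mktemp -d)
cd "$work_dir"
printf '>9\nATGCGACC\n>99\nTCCGACTT\n>19\nGGGTTTAA\n' > toy.faa

# Without -w, ">9" also matches the start of ">99"
grep -A 1 ">9" toy.faa

# With -w, only the entry whose full ID is 9 is returned
grep -w -A 1 ">9" toy.faa
```

The first search returns the entries for both 9 and 99, while the second returns only the entry for 9. Including the > in the search text is also what keeps gene 19 out of both results.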
Great! Back to our target genes, we usually won’t want to do that for every individual sequence we want (even though we only have 2 in our example here). So let’s loop through our “target_gene_ids.txt” file! Here is a first version with just echo, like we did above, to see how we can add the > in front of the variable:
for gene in $(cat target_gene_ids.txt)
do
  echo $gene
  echo ">$gene"
done
Note that since the > character is special at the command line (it redirects output), we need to put what we want to give to echo in quotes so that the whole thing goes to it (here “>$gene”).
Now let’s put the grep command we made above in the loop, and change what we’re looking for to be the > character followed by the $ and variable name (just like it was provided to echo), and write the output to a new file:
for gene in $(cat target_gene_ids.txt)
do
  grep -w -A 1 ">$gene" genes.faa
done > target_genes.faa
ls
head target_genes.faa
And now we’ve made a new fasta file holding the sequences of just the genes we wanted!
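After a loop like this, it can be reassuring to check that we got one sequence back for every ID we asked for. Here is a hedged sketch of one way to do that, run on made-up toy files in a throwaway directory so it stands on its own. It counts the header lines in the output with grep -c and compares that to the number of lines in the ID list (using wc -l, assumed here to be familiar from earlier):

```shell
# Throwaway directory with a made-up fasta file and list of wanted IDs
work_dir=$(mktemp -d)
cd "$work_dir"
printf '>9\nATGCGACC\n>99\nTCCGACTT\n>19\nGGGTTTAA\n' > toy.faa
printf '9\n19\n' > toy_target_ids.txt

# Pull each wanted entry (header line plus sequence line) into one new file
for gene in $(cat toy_target_ids.txt)
do
  grep -w -A 1 ">$gene" toy.faa
done > toy_targets.faa

# Sanity check: the number of headers should equal the number of wanted IDs
grep -c ">" toy_targets.faa
wc -l < toy_target_ids.txt
```

If the two numbers differ, an ID in the list was probably missing from the fasta file or matched more than one entry.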
Summary
Even though loops can get much more complicated as needed, practicing these foundational skills a bit is all that’s needed to start harnessing their awesome power 🙂