python

why do we need to prepare our data?

let’s say that we want to write a program that mimics what trna does in real life. we want our program to go through a dna sequence and read the sequence three letters by three letters, codon-based, and then translate each to its corresponding amino acid.

before even starting to write the program, we need to tell our programming language which codon corresponds to which amino acid. for that we would go to standard genetic code page and see the codon-amino acid pars.

we see this table but how can our programming language access it?

codon-amino acid

data preparation here means that you make these types of information readable files for your programming language. for example, this data in html on a website needs to be converted into a .csv (comma-separated) or .xlsx (excel) file to make it accessible and compatible for your program.

if you learn how to do this right, the rest of the programming would be easy, especially with all the ai help we have.

how to prepare your data

i use VS Code as my code editor and i access wsl remotely inside it.

so let’s use the codon - amino acid pair for creating the table.

1. copy data

we copy from the website and paste it into a file with .csv extension in VS Code. (you need to create a new file and call it something like data_preparation.csv to get a file in csv format.)

codon amino acid

2. clean the data

we actually don’t need all the information here. just the pairs would be enough for us.
i would like to keep the single-letter amino acid codes and codons and clean everything else.

before cleaning the data

Always Render Whitespace To See It Visually And Be Able To Count Its Number.

go to your VS Code settings, search for rendering whitespace, and set it to all. after activation, it should look like this:

codon amino acid

now we can delete all the information that we don’t want manually, or we can use a tool called regular expression or regex for short.

do whichever you feel most comfortable with for your first data preparations, but regex is super easy and fast, so i am using it here.

also you don’t need to memorize any regex patterns. you can always search & ask ai.

codon amino acid

(\w{3}) (\w) (\w{3})
$1,$2

codon amino acid

codon amino acid

also, pay attention to the tiniest details, like how many spaces are where. they could cause big trouble later. make it as clean and smooth as possible.

codon amino acid

(\w{3}),(\w)
$1,$2\n

codon amino acid

^

codon amino acid

codon amino acid

^\s*$\n

codon amino acid

save your file (ctrl + S) and enjoy working with it.

and the answer to our introductory question will be discussed on the dictionaries page.