python

text manipulation

text manipulation in python means making changes to a text. this text can be any type of string; in our case, we are working with dna strings.

i will pose a question. take the time to understand what you are being asked, then think about the steps you need to take to solve the puzzle. once you have a plan, you can untoggle the answers.

give yourself a time limit on the do. part. you are learning how to use this tool, and to do that you need to see a variety of solutions other people have come up with before finding your own way. train your brain with as many question-answer pairs as you can. i didn’t do this when i started, and now i think i should have spent less time thinking deeply about a small set of questions and more time spreading my sources widely. it’s okay, i did that and now i know better. next time, a different approach.

let’s start.

ONE: write a program that will print the GC content of this sequence as a percentage.

dna_seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'

think.
do.

dna_seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
G_content = dna_seq.count('G')
C_content = dna_seq.count('C')
    
content_GC = G_content + C_content

ratio = content_GC / len(dna_seq)
print(ratio * 100)
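
the same calculation can be wrapped in a function so it works on any sequence. this is only a sketch: the name gc_content is my own choice, and i am assuming that lowercase input should count too and that an empty string should give 0 instead of a division error.

```python
def gc_content(seq):
    """Return the GC content of seq as a percentage (0 to 100)."""
    seq = seq.upper()  # assumption: treat 'g' and 'c' the same as 'G' and 'C'
    if not seq:
        return 0.0  # assumption: an empty sequence has 0% GC (avoids dividing by zero)
    gc_count = seq.count('G') + seq.count('C')
    return gc_count / len(seq) * 100


dna_seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
print(round(gc_content(dna_seq), 2))
```

rounding only happens at print time, so the function itself still returns the full value.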
  

TWO: write a program that will print the complement of this sequence.

dna2_seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'

think.
do.

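# replace with lowercase letters first: if 'A' became 'T' directly,
# the next step would turn those new 'T's straight back into 'A's.
replace_A = dna2_seq.replace('A', 't')
replace_T = replace_A.replace('T', 'a')
replace_G = replace_T.replace('G', 'c')
replace_C = replace_G.replace('C', 'g')

# convert everything back to uppercase at the end.
print(replace_C.upper())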
  
improve.

dna2_seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
equivalence_dict = {
    'A':'T',
    'T':'A',
    'C':'G',
    'G':'C',
}
    
complementary_dna = []

for base in dna2_seq:
    replace = equivalence_dict[base]
    # appending each complement to a new list means no base is ever replaced twice.
    complementary_dna.append(replace)

# join the list of single characters back into one string.
print(''.join(complementary_dna))
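
the dictionary loop can also be handed over to two built-in string methods, str.maketrans and str.translate. a minimal sketch, where complement_table is just a name i picked:

```python
dna2_seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'

# str.maketrans pairs up the characters of the two strings: A->T, T->A, C->G, G->C.
complement_table = str.maketrans('ATCG', 'TAGC')

# str.translate looks up every character once in the original string,
# so a base that was already swapped is never swapped again.
complementary_dna = dna2_seq.translate(complement_table)
print(complementary_dna)
```

characters that are not in the table (an N, for example) pass through unchanged, while the dictionary version would raise a KeyError on them.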
  

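THREE: the motif G*AATTC is the recognition site for the EcoRI restriction enzyme; the * marks the position where the enzyme cuts. write a program that will calculate the size of the two fragments produced when this sequence is digested with EcoRI.

dna3_seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'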

think.
do.

dna3_seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
    
cut_index = dna3_seq.find('GAATTC')

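# find looks for the exact substring & returns the index of its first occurrence (the G of GAATTC).
# the enzyme cuts right after that G, so the cut position is one higher than this index.
print(cut_index)

# find returned 21, so the cut is at 22.
fragment_1 = dna3_seq[:22]
fragment_2 = dna3_seq[22:]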

print(fragment_1)
print(len(fragment_1))
print(fragment_2)
print(len(fragment_2))
  
improve.

dna3_seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
    
cut = dna3_seq.find('GAATTC')

frag_1 = dna3_seq[:cut+1]
frag_2 = dna3_seq[cut+1:]

print(len(frag_1))
print(len(frag_2))
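
the solutions here only handle the first site that find returns. a longer sequence could contain the motif more than once, so here is a sketch that keeps searching. the function name digest and its offset parameter are my own inventions, and i am assuming the enzyme cuts at every site it finds.

```python
def digest(seq, site='GAATTC', offset=1):
    """Cut seq at every occurrence of site.

    offset is how many bases into the site the cut falls
    (1 for EcoRI, because G*AATTC is cut right after the G).
    """
    fragments = []
    previous_cut = 0
    position = seq.find(site)
    while position != -1:  # find returns -1 when there is no further match
        cut = position + offset
        fragments.append(seq[previous_cut:cut])
        previous_cut = cut
        # search again, starting one base past the current match
        position = seq.find(site, position + 1)
    fragments.append(seq[previous_cut:])  # whatever is left after the last cut
    return fragments


dna3_seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
fragments = digest(dna3_seq)
for fragment in fragments:
    print(len(fragment), fragment)
```

if the motif is not in the sequence at all, the function returns the whole sequence as a single fragment.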
  

FOUR: print this sequence as a list of codons.

dna_pf_seq = "ATGACCATCGAAAAGGTCGTTCGTGTTCTGCTTCTGATGGTGCTGGGCGCTGGCCGTACCGTTCGCCGATCTGCTGGTCTTCGTTGCTGAACAGCCTGGCCGCTGGCTTTGAGCTGTTCATGGTGATGACCTGAACGTTCGCTGCTGCTGGCTACTGCTGCTGATGTGCTGAATAA"

think.
do.

for index in range(0, len(dna_pf_seq), 3):
    codon = dna_pf_seq[index:index+3]
    print(codon)
  
improve.

for index in range(0, len(dna_pf_seq), 3):
    codon = dna_pf_seq[index:index+3]
    if len(codon) == 3:
        print(codon)
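
the question asks for a list of codons, and both loops print the codons one by one instead. a sketch that builds an actual list with a list comprehension (the variable name codons is mine):

```python
dna_pf_seq = "ATGACCATCGAAAAGGTCGTTCGTGTTCTGCTTCTGATGGTGCTGGGCGCTGGCCGTACCGTTCGCCGATCTGCTGGTCTTCGTTGCTGAACAGCCTGGCCGCTGGCTTTGAGCTGTTCATGGTGATGACCTGAACGTTCGCTGCTGCTGGCTACTGCTGCTGATGTGCTGAATAA"

# stopping the range at len - 2 means the last start index always has
# two more bases after it, so an incomplete codon at the end is dropped.
codons = [dna_pf_seq[index:index + 3] for index in range(0, len(dna_pf_seq) - 2, 3)]

print(codons)
print(len(codons))
```

this does the same job as the len(codon) == 3 check, only the check is moved into the range.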
  

FIVE: calculate what part of this Prosthecobacter fusiformis dna sequence is coding.

dna_pf_seq = "ATGACCATCGAAAAGGTCGTTCGTGTTCTGCTTCTGATGGTGCTGGGCGCTGGCCGTACCGTTCGCCGATCTGCTGGTCTTCGTTGCTGAACAGCCTGGCCGCTGGCTTTGAGCTGTTCATGGTGATGACCTGAACGTTCGCTGCTGCTGGCTACTGCTGCTGATGTGCTGAATAA"

find regions between start and end codons.

think.
do.

dna_pf_seq = "ATGACCATCGAAAAGGTCGTTCGTGTTCTGCTTCTGATGGTGCTGGGCGCTGGCCGTACCGTTCGCCGATCTGCTGGTCTTCGTTGCTGAACAGCCTGGCCGCTGGCTTTGAGCTGTTCATGGTGATGACCTGAACGTTCGCTGCTGCTGGCTACTGCTGCTGATGTGCTGAATAA"
start = 'ATG'
end = ['TAG','TGA','TAA']
    
for index in range(0, len(dna_pf_seq), 3):
    codon = dna_pf_seq[index:index+3]
    if codon == start:
        print(f'there is a start codon at position {index}')
        
for index in range(0, len(dna_pf_seq), 3):
    codon = dna_pf_seq[index:index+3]
    if codon in end:
        print(f'there is a stop codon at position {index}')
# the second ATG is just a methionine inside the protein. the first one is our start codon.

# the first stop codon starts at index 87, so this slice still includes it.
print(dna_pf_seq[0:90])
# there is no tRNA for stop codons, so the translated part runs from index 0 to 86 (the end of a slice is exclusive, hence 87).
print(dna_pf_seq[0:87])
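
the slices above use positions read off the printed output by hand. here is a sketch that finds them in code and then answers the original question as a percentage. find_orf is a name i made up, and i am assuming that "coding" means everything from the first ATG up to, but not including, the first stop codon in the same reading frame.

```python
STOP_CODONS = ('TAG', 'TGA', 'TAA')


def find_orf(seq):
    """Return (start, stop) indices of the first ATG and the first in-frame stop codon.

    Returns None if there is no ATG or no in-frame stop codon after it.
    """
    start = seq.find('ATG')
    if start == -1:
        return None
    # step through the sequence codon by codon, staying in the frame set by the ATG
    for index in range(start, len(seq) - 2, 3):
        if seq[index:index + 3] in STOP_CODONS:
            return start, index
    return None


dna_pf_seq = "ATGACCATCGAAAAGGTCGTTCGTGTTCTGCTTCTGATGGTGCTGGGCGCTGGCCGTACCGTTCGCCGATCTGCTGGTCTTCGTTGCTGAACAGCCTGGCCGCTGGCTTTGAGCTGTTCATGGTGATGACCTGAACGTTCGCTGCTGCTGGCTACTGCTGCTGATGTGCTGAATAA"

orf = find_orf(dna_pf_seq)
if orf is not None:
    start, stop = orf
    coding = dna_pf_seq[start:stop]  # the stop codon itself is left out
    print(coding)
    print(len(coding) / len(dna_pf_seq) * 100)
```

if you would rather count the stop codon as part of the coding region, slice to stop + 3 instead.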
  

SIX: there are two exons and one intron in this dna sequence. exon 1 runs from the first character to the sixty-third character, and exon 2 from the ninety-first character to the end.

think.
do.

dna = 'ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT'

# two exons, one intron

# exon 1: first to the sixty third character
# exon 2: 91 to the end

#### write a program that will calculate what percentage of the dna is coding.
#### print the genomic dna sequence with coding bases in upper and non-coding bases in lowercase.

exon_1 = dna[0:63]
exon_2 = dna[90:]

#### print just the coding regions
print(exon_1)
print(exon_2)

#### write a program that will calculate what percentage of the dna is coding.
print((len(exon_1)+len(exon_2)) / len(dna) * 100)
# don't forget the parentheses around the addition (otherwise division and multiplication are done first).

#### print the genomic dna sequence with coding bases in upper and non-coding bases in lowercase.
# the intron stops where exon 2 starts (index 90); the end of a slice is exclusive.
intron = dna[63:90]

print(exon_1.upper() + intron.lower() + exon_2.upper())
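
with more than two exons, writing every slice by hand gets error-prone. a sketch that keeps the exon positions in one list and derives everything else from it. the exons list and the variable names are my own, and the positions are written as python slice indices (counting from 0, end excluded), so the sixty-third character ends at 63 and the ninety-first character starts at 90.

```python
dna = 'ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT'

# (start, end) pairs as slice indices, in order along the sequence
exons = [(0, 63), (90, len(dna))]

# percentage of the dna that is coding
coding_length = sum(end - start for start, end in exons)
print(coding_length / len(dna) * 100)

# rebuild the sequence piece by piece: lowercase between exons, uppercase inside them
pieces = []
previous_end = 0
for start, end in exons:
    pieces.append(dna[previous_end:start].lower())  # non-coding stretch before this exon
    pieces.append(dna[start:end].upper())           # the exon itself
    previous_end = end
pieces.append(dna[previous_end:].lower())           # anything after the last exon

annotated = ''.join(pieces)
print(annotated)
```

because every piece is cut from the previous end to the next start, no base can be printed twice or skipped.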