Sequences#
import os
import errno
import urllib.request
import numpy as np
At the most basic level, sequences are simply a string and could be manipulated as such. However, the Biopython library provides a sophisticated set of tools for reading, annotating and manipulating biological sequence data. In this notebook we will look at a few example use cases of the Biopython libary to demonstrate how sequences can be handled as data. These examples are adapted from the Biopython Tutorial and Cookbook (Biopython version 1.78).
Some code cells will be marked with
##########################
######## To Do ###########
##########################
This indicates that you are being asked to write a piece of code to complete the notebook.
To get started, we will use pip to install the Biopython package if it’s not already available.
!pip install biopython
Reading Sequence Files#
We are going to look at some data from a recent paper published by Bryan Branson’s group at MIT. We’ll looked at the paper in more detail before Professor Branson gives a guest lecture later in the course.
Hie, B., Zhong, E. D., Berger, B., & Bryson, B. (2021). Learning the language of viral evolution and escape. Science, 371(6526), 284-288.
!wget http://cb.csail.mit.edu/cb/viral-mutation/data.tar.gz
!tar xvf data.tar.gz
# Define file path constants
root_dir = os.getcwd()
data_dir = os.path.join(root_dir,'data')
fname = 'data/influenza/ird_influenzaA_HA_allspecies.fa'
meta_fname = 'data/influenza/ird_influenzaA_HA_allspecies_meta.tsv'
We are going to start by looking at a fasta file which is a standard text-based format for biological sequence data. We first print out the first 500 characters of the file to see how it is structured.
with open(fname,'r') as f:
print(f.read(500))
>gb:K00429|ncbiId:AAR96248.1|UniProtKB:Q6LEJ4|Organism:Influenza A virus (A/seal/Mass/1/1980(H7N7))|Strain Name:A/seal/Mass/1/1980|Protein Name:HA Hemagglutinin|Gene Symbol:HA|Segment:4|Subtype:H7N7|Host:Sea Mammal
MNTQILVFIACVLIEAKGDKICLGHHAVANGTKVNTLTERGIEVVNATETVETANIGKICTQGKRPTDLG
QCGLLGTLIGPPQCDQFLEFESNLIIERREGNDVCYPGKFTNEESLRQILRGSGGVDKESMGFTYSGIRT
NGTTSACRRSGSSFYAEMKWLLSNSDNAAFPQMTKSYRNPRNKPALIVWGIHHSGSTTEQTRLYGSGNKL
ITVGSSKYQQSFTPSPGARPQVNGQSGRIDFHWLLLDPNDTVTFTFNGAFIAPNRASFFRGESLGVQSDV
P
Each record in the fasta file begins with a header line preceded by >
. Now, we can open the file using the Biopython SeqIO
parser. Bio.SeqIO
provides suppport for many different sequence file formats. Check out the docs to see a comprehensive list.
from Bio import SeqIO
for i,seq_record in enumerate(SeqIO.parse(fname, 'fasta')):
print(seq_record.id)
print(repr(seq_record.seq))
print(len(seq_record))
if i > 10:
break
gb:K00429|ncbiId:AAR96248.1|UniProtKB:Q6LEJ4|Organism:Influenza
Seq('MNTQILVFIACVLIEAKGDKICLGHHAVANGTKVNTLTERGIEVVNATETVETA...ICI')
560
gb:J02176|ncbiId:AAA43209.1|UniProtKB:P03454|Organism:Influenza
Seq('MKAKLLVLLYAFVATDADTICIGYHANNSTDTVDTIFEKNVAVTHSVNLLEDRH...ICI')
565
gb:CY021709|ncbiId:ABP49327.1|UniProtKB:A4U6V2|Organism:Influenza
Seq('MKARLLVLLCALAATDADTICIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDSH...ICI')
566
gb:CY020285|ncbiId:ABO38054.1|UniProtKB:A4GBW6|Organism:Influenza
Seq('MKARLLVLLCALAATDADTICIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDSH...ICI')
566
gb:CY083910|ncbiId:ADX98960.1|UniProtKB:F0TT51|Organism:Influenza
Seq('MKAILVVLLYTFVTANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKH...ICI')
566
gb:CY063606|ncbiId:ADH01958.1|UniProtKB:D7ECP4|Organism:Influenza
Seq('MKAILVVLLYTFATANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKH...ICI')
566
gb:CY083776|ncbiId:ADX98792.1|UniProtKB:F0TSN3|Organism:Influenza
Seq('MKAILVVLLYTFATANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKH...ICI')
566
gb:CY073725|ncbiId:ADN05226.1|UniProtKB:E0V7J5|Organism:Influenza
Seq('MKAILVVLLYTFATANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKH...ICI')
566
gb:CY062691|ncbiId:ADG42153.1|UniProtKB:D5XJG9|Organism:Influenza
Seq('MKAILVVLLYTFATANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKH...ICI')
566
gb:CY062699|ncbiId:ADG42163.1|UniProtKB:D5XJH9|Organism:Influenza
Seq('MKAILVVLLYTFATANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKH...ICI')
566
gb:CY066935|ncbiId:ADK33671.1|UniProtKB:D8KV23|Organism:Influenza
Seq('MKAILVVLLYTFATANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKH...ICI')
566
gb:CY066943|ncbiId:ADK33681.1|UniProtKB:D8KV33|Organism:Influenza
Seq('MKAILVVLLYTFATANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKH...ICI')
566
Sequence Manipulations#
Biopython provides some useful tools for manipulating sequences some of which are biology specific while others act as if on basic strings. So far we’ve looked at amino acid sequences, but Biopython also supports DNA sequence.
from Bio.Seq import Seq
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
# Generate the complement or reverse complement
print('Complement:',dna.complement())
print('Reverse complement:', dna.reverse_complement())
Complement: TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC
Reverse complement: CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT
# Transcribe coding DNA to RNA
rna = dna.transcribe()
print('RNA:', rna)
RNA: AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
Notice that the transcribe function replaces T with U.
# Translate dna or rna to protein
print('DNA to protein:',dna.translate())
print('RNA to protein:',rna.translate())
DNA to protein: MAIVMGR*KGAR*
RNA to protein: MAIVMGR*KGAR*
One-Hot Encoding#
For most machine learning applications, DNA or RNA sequence is represented using one-hot encoding. In this schema, each nucleotide is treated as its own class such that DNA sequence would be represented as four classes. Each class/nucleotide has a binary value of 0 (absent) or 1 (present).
For example, the sequence ATGTC
would be encoded as
$$
$$
Exercise#
Write a function that takes in a sequence of $n$ bases as a string and produces an $n\times 4$ array where the sequence is represented using one-hot encoding. You may also need to write code to turn the sequence from a string into an array.
##########################
######## To Do ###########
##########################
def one_hot_dna(seq):
# Encode the sequence with one hot encoding and return a numpy array
Show code cell content
def one_hot_dna(seq):
# Convert seq from string to array
raw = np.array(list(seq))
out = np.zeros((raw.shape[0],4))
out[raw=='A',0] = 1
out[raw=='T',1] = 1
out[raw=='G',2] = 1
out[raw=='C',3] = 1
return out
one_hot_dna(dna)
array([[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[0., 1., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 1., 0.],
[0., 0., 0., 1.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 1., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 1., 0.]])
Processing Branson Sequence Data#
We’ll start by loading all of the sequences from the fasta file.
seqs, lengths = [], []
for i,seq_record in enumerate(SeqIO.parse(fname, 'fasta')):
seqs.append(seq_record.seq)
lengths.append(len(seq_record))
len(seqs)
94560
And then taking the alphabetic sequence and turning it into a numeric code.
AAs = [
'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H',
'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W',
'Y', 'V', 'X', 'Z', 'J', 'U', 'B', 'Z'
]
vocabulary = { aa: idx + 1 for idx, aa in enumerate(sorted(AAs)) }
seq_num = []
for s in seqs:
seq_num.append(
[vocabulary[a] for a in s]
)
Next we check the length of the sequences.
np.unique(lengths)
array([114, 357, 450, 459, 510, 547, 549, 550, 552, 553, 556, 558, 559,
560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572,
573, 574, 575, 576])
Most models require that all inputs are the same size, so we need a method to standardize the size of these sequences before using them for training. Intuitively there are two strategies for turning a variable set of sequence lengths into a standard length. First, we can trim all sequences down to the minimum sequence length, but this risks discarding valueable data. Alternatively, we can pad short sequences to the length of the longest sequence.
Keras provides a preprocessing function that handles this task.
from tensorflow.keras.preprocessing.sequence import pad_sequences
pad_sequences?
X = pad_sequences(seq_num, maxlen=max(lengths))
len(X[0])
576
%load_ext watermark
%watermark -u -d -vm --iversions
Last updated: 2021-04-28
Python implementation: CPython
Python version : 3.7.10
IPython version : 5.5.0
Compiler : GCC 7.5.0
OS : Linux
Release : 4.19.112+
Machine : x86_64
Processor : x86_64
CPU cores : 2
Architecture: 64bit
numpy : 1.19.5
Bio : 1.78
IPython: 5.5.0