Multiple Sequence Alignment with Custom ClustalW Implementation
University Projects #Data Science#Python#Bioinformatics

ClustalW Algorithm#

Builds a dynamic programming table based on 2 given sequences and fills it based on the calculated sum of pairs scores. Optimal alignments are identified by tracing back through the table.

In its initial implementation there was a significant bug that caused an infinite loop when tracing back through the AB CD alignment. To resolve this issue, I implemented the following bug fixes:

  • Updated scores in all conditional branches so progress through the table happens in all branches
  • Added a termination check for reaching the end of the table
  • Created an optimal alignment array to track the optimal alignments
  • Converted while True loop to while row>=0 asnd col>=0 loop
  • Ensured that fillTable and traceback use scores in the same way

You can view the code here.

ClustalW Implementation#

Takes 2 sequences and processes each alignment, tracks progress, reports intermediate and final results. The process is as follows:

  • Build the dynamic programming table based on sequence lengths and initialize the first row and column
  • Fill the table by calculating the maximum score for each cell using normalized sum of pairs
  • Track alignments by tracing back through the dynamic programming table and updating a global dictionary of alignments
  • Traceback only returns 1 optimal alignment

Results#

The following sequences were used to test this algorithm:

  • A: GTTAT
  • B: CGTTT
  • C: GCAAT
  • D: CCGAT

Aligning A with B

Aligning C with D

Aligning AB with CD

← Back to Projects