Multiple Sequence Alignment with Custom ClustalW Implementation
University Projects #Data Science#Python#Bioinformatics
ClustalW Algorithm
Builds a dynamic programming table based on 2 given sequences and fills it based on the calculated sum of pairs scores. Optimal alignments are identified by tracing back through the table.
In its initial implementation there was a significant bug that caused an infinite loop when tracing back through the AB CD alignment. To resolve this issue, I implemented the following bug fixes:
- Updated scores in all conditional branches so progress through the table happens in all branches
- Added a termination check for reaching the end of the table
- Created an optimal alignment array to track the optimal alignments
- Converted
while Trueloop towhile row>=0 asnd col>=0loop - Ensured that
fillTableandtracebackuse scores in the same way
You can view the code here.
ClustalW Implementation
Takes 2 sequences and processes each alignment, tracks progress, reports intermediate and final results. The process is as follows:
- Build the dynamic programming table based on sequence lengths and initialize the first row and column
- Fill the table by calculating the maximum score for each cell using normalized sum of pairs
- Track alignments by tracing back through the dynamic programming table and updating a global dictionary of alignments
- Traceback only returns 1 optimal alignment
Results
The following sequences were used to test this algorithm:
- A: GTTAT
- B: CGTTT
- C: GCAAT
- D: CCGAT
Aligning A with B

Aligning C with D

Aligning AB with CD
