Multiple Sequence Alignment with Custom ClustalW Implementation
Overview
A custom implementation of the ClustalW multiple sequence alignment algorithm using dynamic programming. The system progressively aligns biological sequences by building scoring matrices and tracing back optimal alignments. You can view the code here.
Key Achievements
- Successfully aligned 4 sequences (GTTAT, CGTTT, GCAAT, CCGAT) through progressive alignment
- Identified and resolved a complex infinite loop bug through systematic debugging
- Produced correct alignments matching expected ClustalW output
Results
Aligning A (GTTAT) with B (CGTTT)

Aligning C (GCAAT) with D (CCGAT)

Aligning AB with CD

Implementation
Algorithm Process
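- Build the dynamic programming table and initialize its first row and column
- Fill each remaining cell with the maximum score among its candidate moves, scored with Sum-of-Pairs
- Recover the optimal alignment by tracing back through the table from the last cell to the first
- Progressively align the resulting alignments with each other (A with B, C with D, then AB with CD)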
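The first three steps can be sketched for a single pair of sequences as follows. This is a minimal illustration, not the project's actual code: the scoring values `MATCH`, `MISMATCH`, and `GAP` and the function names `score_pair`, `fill_table`, and `traceback` are assumptions chosen for the example, and a linear gap penalty is assumed.

```python
from typing import List, Tuple

# Hypothetical scoring values; the project's real parameters are not stated.
MATCH, MISMATCH, GAP = 1, -1, -2


def score_pair(a: str, b: str) -> int:
    """Score one aligned pair of residues."""
    return MATCH if a == b else MISMATCH


def fill_table(s: str, t: str) -> List[List[int]]:
    """Build the DP table; first row and column hold cumulative gap penalties."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * GAP
    for j in range(1, cols):
        table[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1]),  # diagonal
                table[i - 1][j] + GAP,  # gap in t
                table[i][j - 1] + GAP,  # gap in s
            )
    return table


def traceback(s: str, t: str, table: List[List[int]]) -> Tuple[str, str]:
    """Walk from the last cell back to (0, 0), rebuilding one optimal alignment."""
    i, j = len(s), len(t)
    out_s: List[str] = []
    out_t: List[str] = []
    while i > 0 or j > 0:
        diagonal = (
            i > 0 and j > 0
            and table[i][j] == table[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1])
        )
        if diagonal:
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + GAP:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return "".join(reversed(out_s)), "".join(reversed(out_t))


table = fill_table("GTTAT", "CGTTT")
aligned_a, aligned_b = traceback("GTTAT", "CGTTT", table)
print(aligned_a)
print(aligned_b)
```

The alignment printed depends on the assumed scoring values, so it may differ from the project's results.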
Debugging
The initial implementation had a critical bug that caused an infinite loop during traceback in the AB-CD alignment phase. I systematically identified and fixed the following issues:
| Bug | Fix |
|---|---|
| Infinite loop in traceback | Change `while True` to `while row>=0 and col>=0` |
| No progress made in some branches | Update scores in every conditional branch |
| Missing exit check | Add a boundary check for the end of the table |
| Alignment tracking issue | Track the optimal alignment in an array |
| Scoring inconsistency | Ensure `fillTable` and traceback compute scores the same way |
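The sketch below shows how these fixes can fit together when aligning two existing alignments, as in the AB-CD step. It is an illustration under assumptions rather than the project's code: the names `pair_score`, `sp_score`, `columns`, and `align_profiles` and the scoring values are hypothetical, and the loop guard `while i > 0 or j > 0` is a variant of the bounded condition listed in the table. The table fill and the traceback call the same `sp_score` function, and every branch decrements at least one index, so the loop always terminates.

```python
from typing import List

# Hypothetical scoring values; the project's real parameters are not stated.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score two symbols, where either may be a gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sp_score(col_x: str, col_y: str) -> int:
    """Sum-of-Pairs score over all cross pairs of two alignment columns."""
    return sum(pair_score(a, b) for a in col_x for b in col_y)


def columns(alignment: List[str]) -> List[str]:
    """Turn equal-length rows into a list of column strings."""
    return ["".join(col) for col in zip(*alignment)]


def align_profiles(x: List[str], y: List[str]) -> List[str]:
    """Align two alignments column by column and return the merged rows."""
    xc, yc = columns(x), columns(y)
    gap_x, gap_y = "-" * len(x), "-" * len(y)  # all-gap columns
    n, m = len(xc), len(yc)

    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = table[i - 1][0] + sp_score(xc[i - 1], gap_y)
    for j in range(1, m + 1):
        table[0][j] = table[0][j - 1] + sp_score(gap_x, yc[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + sp_score(xc[i - 1], yc[j - 1]),
                table[i - 1][j] + sp_score(xc[i - 1], gap_y),
                table[i][j - 1] + sp_score(gap_x, yc[j - 1]),
            )

    # Bounded traceback: stops at cell (0, 0), and every branch moves.
    i, j = n, m
    merged: List[str] = []
    while i > 0 or j > 0:
        diagonal = (
            i > 0 and j > 0
            and table[i][j] == table[i - 1][j - 1] + sp_score(xc[i - 1], yc[j - 1])
        )
        if diagonal:
            merged.append(xc[i - 1] + yc[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + sp_score(xc[i - 1], gap_y):
            merged.append(xc[i - 1] + gap_y)
            i -= 1
        else:
            merged.append(gap_x + yc[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(col[k] for col in merged) for k in range(len(x) + len(y))]


# Inputs here are the raw sequences treated as two-row alignments.
for row in align_profiles(["GTTAT", "CGTTT"], ["GCAAT", "CCGAT"]):
    print(row)
```

Because traceback compares each cell against values recomputed with the same scoring function used to fill the table, at least one branch always matches, which avoids the kind of stall described above.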
Technologies
Python, Dynamic Programming, Sum-of-Pairs Scoring, Bioinformatics, BioPython