Multiple Sequence Alignment with Custom ClustalW Implementation
University Projects #Data Science #Python #Bioinformatics
Overview
A custom implementation of the ClustalW multiple sequence alignment algorithm using dynamic programming. The system progressively aligns biological sequences by building scoring matrices and tracing back optimal alignments.
Key Achievements
- Successfully aligned 4 sequences (GTTAT, CGTTT, GCAAT, CCGAT) through progressive alignment
- Identified and resolved a complex infinite loop bug through systematic debugging
- Produced correct alignments matching expected ClustalW output
Results
Aligning A (GTTAT) with B (CGTTT)

Aligning C (GCAAT) with D (CCGAT)

Aligning AB with CD

Implementation
Algorithm Process
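- Build the dynamic programming table and initialize its first row and column
- Fill each remaining cell with the maximum score of its possible moves, scored with Sum-of-Pairs
- Recover the optimal alignment by tracing back through the table from the last cell to the first
- Progressively align the resulting alignments with each other (A with B, C with D, then AB with CD)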
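The first two steps can be sketched as follows. This is a minimal illustration and not the project's actual code: the scoring values (`MATCH`, `MISMATCH`, `GAP`), the choice to score a gap paired with a gap as 0, and the function names `pair_score`, `column_score`, `columns`, and `fill_table` are all assumptions made for the example. Each input is an alignment given as a list of equal-length rows, so a single sequence is a one-row alignment.

```python
from itertools import product

# Assumed scoring values; the original project's values are not stated.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(x, y):
    """Score one pair of symbols; gap vs. gap scores 0 (an assumption)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def column_score(col_a, col_b):
    """Sum-of-Pairs score over every cross pair of two alignment columns."""
    return sum(pair_score(x, y) for x, y in product(col_a, col_b))


def columns(alignment):
    """Turn a list of equal-length rows into a list of columns."""
    return [tuple(col) for col in zip(*alignment)]


def fill_table(aln_a, aln_b):
    """Fill the DP table for aligning two alignments column by column."""
    cols_a, cols_b = columns(aln_a), columns(aln_b)
    gap_a = ("-",) * len(aln_a)  # all-gap column standing in for aln_a
    gap_b = ("-",) * len(aln_b)  # all-gap column standing in for aln_b
    rows, cols = len(cols_a) + 1, len(cols_b) + 1
    table = [[0] * cols for _ in range(rows)]

    # First column and first row: everything so far is aligned against gaps.
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + column_score(gap_a, cols_b[j - 1])

    # Remaining cells: best of diagonal (align), up (gap in B), left (gap in A).
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                table[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                table[i][j - 1] + column_score(gap_a, cols_b[j - 1]),
            )
    return table


table = fill_table(["GTTAT"], ["CGTTT"])
print(table[-1][-1])  # best score under the assumed scoring values
```

Because `fill_table` works on columns, the same function covers both the pairwise steps (one row per side) and the later AB with CD step (two rows per side).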
Debugging
The initial implementation had a critical bug that caused an infinite loop during traceback in the AB-CD alignment phase. I traced it to several related issues and fixed each one:
| Bug | Fix |
|---|---|
| Infinite loop in traceback | Changed `while True` to `while row>=0 and col>=0` |
| Progress not made in all branches | Update scores in every conditional branch so each iteration makes progress |
| Exit check missing | Add a boundary check for the end of the table |
| Alignment tracking issue | Track the optimal alignment in an array |
| Scoring inconsistency | Ensure `fillTable` and traceback apply the same scoring |
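The fixes in the table can be illustrated with a small pairwise traceback. This is a hedged sketch rather than the project's code: it aligns two plain sequences instead of two alignments, the scoring values and the names `score`, `fill_table`, and `traceback` are assumptions, and the loop condition `row > 0 or col > 0` is a common variant of a bounded loop, not necessarily the exact condition used in the project.

```python
# Assumed scoring values, shared by the fill and traceback steps.
MATCH, MISMATCH, GAP = 1, -1, -2


def score(x, y):
    """Single scoring function used by both fill_table and traceback."""
    return MATCH if x == y else MISMATCH


def fill_table(a, b):
    """Standard global-alignment table with a linear gap penalty."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        table[i][0] = i * GAP
    for j in range(1, len(b) + 1):
        table[0][j] = j * GAP
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                table[i - 1][j] + GAP,
                table[i][j - 1] + GAP,
            )
    return table


def traceback(a, b, table):
    """Walk from the last cell back to the first, collecting the alignment."""
    row, col = len(a), len(b)
    top, bottom = [], []  # aligned symbols, collected in reverse order

    # Bounded loop: it ends once the top-left cell is reached.
    while row > 0 or col > 0:
        here = table[row][col]
        if row > 0 and col > 0 and here == table[row - 1][col - 1] + score(a[row - 1], b[col - 1]):
            top.append(a[row - 1])
            bottom.append(b[col - 1])
            row -= 1
            col -= 1
        elif row > 0 and here == table[row - 1][col] + GAP:
            top.append(a[row - 1])
            bottom.append("-")
            row -= 1
        else:
            # Only the left move remains, so col > 0 holds here.
            top.append("-")
            bottom.append(b[col - 1])
            col -= 1

    return "".join(reversed(top)), "".join(reversed(bottom))


table = fill_table("GTTAT", "CGTTT")
aligned_a, aligned_b = traceback("GTTAT", "CGTTT", table)
print(aligned_a)
print(aligned_b)
```

Every branch decreases `row`, `col`, or both, so the loop always terminates. Comparing each cell against the same `score` and `GAP` values used to fill the table keeps the traceback from reaching a cell where no branch matches.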
Technologies
Python, Dynamic Programming, Sum-of-Pairs Scoring, Bioinformatics, BioPython