Multiple Sequence Alignment with Custom ClustalW Implementation
University Projects #Data Science #Python #Bioinformatics

Overview#

A custom implementation of the ClustalW multiple sequence alignment algorithm using dynamic programming. The system progressively aligns biological sequences by building scoring matrices and tracing back optimal alignments. You can view the code here.

Key Achievements#

  • Aligned four sequences (GTTAT, CGTTT, GCAAT, CCGAT) through progressive alignment
  • Identified and resolved an infinite loop bug in the traceback through systematic debugging
  • Produced alignments matching the expected ClustalW output

Results#

Aligning A (GTTAT) with B (CGTTT)#

Aligning C (GCAAT) with D (CCGAT)#

Aligning AB with CD#

Implementation#

Algorithm Process#

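  1. Build the dynamic programming table and initialize its first row and column
  2. Fill each remaining cell with the maximum score over its possible moves, scored with Sum-of-Pairs
  3. Recover the optimal alignment by tracing back through the table from the last cell to the first
  4. Repeat the process progressively on the resulting alignments (A with B, C with D, then AB with CD)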
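
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (pair_score, column_score, align_profiles) and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example. Each alignment is treated as a list of equal-length strings, so a single sequence is simply a one-row alignment.

```python
from itertools import product

# Assumed scoring values for illustration; the project's real values may differ.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a, b):
    """Score one pair of symbols; a gap paired with a gap costs nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def column_score(col_x, col_y):
    """Sum-of-Pairs score across two alignment columns (cross pairs only)."""
    return sum(pair_score(a, b) for a, b in product(col_x, col_y))


def align_profiles(x, y):
    """Globally align two alignments, each a list of equal-length strings."""
    n, m = len(x[0]), len(y[0])
    cols_x = [[s[i] for s in x] for i in range(n)]
    cols_y = [[s[j] for s in y] for j in range(m)]
    gap_x, gap_y = ["-"] * len(x), ["-"] * len(y)

    # Step 1: build the table and set up the first row and column.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_x[i - 1], gap_y)
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + column_score(gap_x, cols_y[j - 1])

    # Step 2: fill each cell with the best of the three moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_x[i - 1], cols_y[j - 1]),
                score[i - 1][j] + column_score(cols_x[i - 1], gap_y),
                score[i][j - 1] + column_score(gap_x, cols_y[j - 1]),
            )

    # Step 3: trace back from the last cell to the first.
    out_x, out_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j] ==
                score[i - 1][j - 1] + column_score(cols_x[i - 1], cols_y[j - 1])):
            out_x.append(cols_x[i - 1])
            out_y.append(cols_y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + column_score(cols_x[i - 1], gap_y):
            out_x.append(cols_x[i - 1])
            out_y.append(gap_y)
            i -= 1
        else:
            out_x.append(gap_x)
            out_y.append(cols_y[j - 1])
            j -= 1
    out_x.reverse()
    out_y.reverse()

    # Turn the lists of columns back into rows, one string per sequence.
    rows_x = ["".join(col[k] for col in out_x) for k in range(len(x))]
    rows_y = ["".join(col[k] for col in out_y) for k in range(len(y))]
    return rows_x + rows_y, score[n][m]


# Step 4: progressive alignment in the order used in this project.
ab, _ = align_profiles(["GTTAT"], ["CGTTT"])
cd, _ = align_profiles(["GCAAT"], ["CCGAT"])
final, total = align_profiles(ab, cd)
for row in final:
    print(row)
```

The exact alignment printed depends on the assumed scoring values and on how ties between moves are broken, so it may differ from the results shown on this page.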

Debugging#

The initial implementation had a critical bug that caused an infinite loop during traceback in the AB-CD alignment phase. I systematically identified and fixed the following issues:

Bug | Fix
Infinite loop in traceback | Change while True to while row >= 0 and col >= 0
Progress not made in all branches | Update scores in all conditional branches
Exit check missing | Add an end-of-table boundary check
Alignment tracking issue | Track the optimal alignment in an array
Scoring inconsistency | Ensure fillTable and traceback use the same scores
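
To show how these fixes fit together, here is a simplified pairwise sketch. It is an illustration rather than the project's code: the helper names (candidates, fill_table) and the scoring values are hypothetical. Both the fill and the traceback read their scores from the same helper, the loop condition is bounded, there is an explicit exit check at the start of the table, and every branch moves at least one index toward that cell.

```python
GAP = -2  # assumed gap penalty, for illustration only


def score_pair(a, b):
    """Assumed match/mismatch scores (+1 / -1)."""
    return 1 if a == b else -1


def candidates(table, s, t, row, col):
    """Score of every legal move into (row, col); used by fill AND traceback."""
    moves = {}
    if row > 0 and col > 0:
        moves["diag"] = table[row - 1][col - 1] + score_pair(s[row - 1], t[col - 1])
    if row > 0:
        moves["up"] = table[row - 1][col] + GAP
    if col > 0:
        moves["left"] = table[row][col - 1] + GAP
    return moves


def fill_table(s, t):
    """Fill the table row by row, including the first row and column."""
    table = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for row in range(len(s) + 1):
        for col in range(len(t) + 1):
            if row or col:  # cell (0, 0) stays 0
                table[row][col] = max(candidates(table, s, t, row, col).values())
    return table


def traceback(table, s, t):
    """Walk from the last cell to (0, 0), collecting the alignment."""
    row, col = len(s), len(t)
    aligned_s, aligned_t = [], []  # arrays that track the optimal alignment
    while row >= 0 and col >= 0:  # bounded condition instead of `while True`
        if row == 0 and col == 0:  # explicit exit at the start of the table
            break
        moves = candidates(table, s, t, row, col)
        # The cell was filled from these same scores, so one move always matches.
        move = next(name for name, value in moves.items() if value == table[row][col])
        if move == "diag":
            aligned_s.append(s[row - 1])
            aligned_t.append(t[col - 1])
            row, col = row - 1, col - 1
        elif move == "up":
            aligned_s.append(s[row - 1])
            aligned_t.append("-")
            row -= 1
        else:  # "left"
            aligned_s.append("-")
            aligned_t.append(t[col - 1])
            col -= 1
    return "".join(reversed(aligned_s)), "".join(reversed(aligned_t))


table = fill_table("GTTAT", "CGTTT")
top, bottom = traceback(table, "GTTAT", "CGTTT")
print(top)
print(bottom)
```

Because each iteration decrements row, col, or both, the loop runs at most len(s) + len(t) times, which rules out the kind of stall described above.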

Technologies#

Python, Dynamic Programming, Sum-of-Pairs Scoring, Bioinformatics, BioPython
