Comparing Pattern Matching Algorithms on the Sorangium Cellulosum Genome

University Projects #Data Science#Python#Bioinformatics

Overview#

This pattern matching implementation compares the efficiency of the brute force and Knuth-Morris-Pratt pattern matching algorithms on the Sorangium Cellulosum genome using Python and BioPython.In addition to comparing the algorithms, I identified a pattern that runs significantly faster with KMP and found a pattern that codes for amino acids.

View the full project here.

Key Achievements#

The pattern CCCC runs 32% faster with the Knuth-Morris-Pratt algorithm (20.45s vs 30.11s), likely because the genome has a 71% GC content and the pattern is highly repetitive.
The pattern AACGTT returns an even 400 matches, unique from other patterns tested, indicating that the pattern is a coding sequence for a pair of amino acids.

Brute Force Pattern Matching#

The Brute Force implementation compares the given sequence to a given pattern at all possible starting positions and returns both the total number of matches and the time taken. Performs in O(n*m) time in the worst case, meaning a mismatch on the last spot of the pattern. Best for short sequences with low repetition.

KMP Pattern Matching#

The KMP implementation uses a dynamic programming table and pointers for the indexes of the sequence and pattern. At each step, check if there is partial match between the sequence[seqIndex+patternIndex] and pattern[patternIndex]. If there is, increment patternIndex. If there isn’t a partial match, there is either a full match or a mismatch, and the text pointer is moved. If the characters at both pointer positions match, both pointers move forward. Otherwise, the pattern pointer is reset to the last value in the table. KMP returns the total number of matches in the forward strand, the total number of matches in the complement strand, and the time taken for the search. Performs in O(n+m) time in the worst case. Best for long sequences and repetitive patterns, as KMP never rechecks previously matched characters.