Comparing Pattern Matching Algorithms on the Sorangium Cellulosum Genome
University Projects #Data Science#Python#Bioinformatics

Brute Force Pattern Matching#

The Brute Force implementation compares the given sequence to a given pattern at all possible starting positions and returns both the total number of matches and the time taken.

KMP Pattern Matching#

The KMP implementation uses a dynamic programming table and pointers for the indexes of the sequence and pattern. At each step, check if there is partial match between the sequence[seqIndex+patternIndex] and pattern[patternIndex]. If there is, increment patternIndex. If there isn’t a partial match, there is either a full match or a mismatch, and the text pointer is moved. If the characters at both pointer positions match, both pointers move forward. Otherwise, the pattern pointer is reset to the last value in the table. KMP returns the total number of matches in the forward strand, the total number of matches in the complement strand, and the time taken for the search.

Results#

In the worst case, KMP can match the pattern in O(n) time (as KMP never rechecks previously matched characters), whereas the brute force algorithm can match the pattern in O(n^2) time (if there is a mismatch in the last spot of the pattern).

The pattern CCCC runs significantly faster with the Knuth-Morris-Pratt algorithm (20.45s vs 30.11s), likely because the genome has a 71% GC content and the pattern is highly repetitive.

The pattern AACGTT returns an even 400 matches, unique from other patterns tested, indicating that the pattern is a coding sequence for a pair of amino acids.

View the full project here.

← Back to Projects