Assignment #9


For this assignment you will write your own little program to find subsequences using a standard longest common subsequence (LCS) algorithm. The usage of your program will look like this:

findMatches sequence|file file percentMatch

The first argument can be either a file name or a sequence of characters you want to match. If a file with that name exists, your program treats the argument as a file name: it opens the file, reads it in, and uses the contents as a single string for the sequence. The second argument is the name of a single file that you will be searching in. The last argument is the percentage of the characters that have to match for you to accept something. You will find the smallest sequences that are at least as long as the search sequence and that match that percentage.
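One possible way to handle the first argument is sketched here in Python. The function name load_sequence is made up for illustration, and stripping whitespace from the file contents is an assumption (the assignment only says the contents are used as a single string).

```python
import os


def load_sequence(arg):
    """Return the search sequence named by the first command-line argument.

    If a file called `arg` exists, its contents are used as the sequence;
    otherwise `arg` itself is taken to be the sequence.
    """
    if os.path.isfile(arg):
        with open(arg) as handle:
            # Assumption: newlines and other whitespace are not part of
            # the sequence, so they are dropped when joining the contents.
            return "".join(handle.read().split())
    return arg
```

With this, load_sequence("ACCGTTAGC") returns the argument unchanged as long as no file of that name exists in the current directory.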

To make this clearer, let's look at an example. You run the following command:

findMatches ACCGTTAGC tomato.gen 80

Now tomato.gen might have an extremely long sequence of characters in it. We will assume for now that it only has the following line:

AAGTCGACCTGATTGAGTACTGACCGTTAGCGATTTACGTACAGGGTCCAAACTGATAACG

The search sequence has 9 characters in it and you have asked for matches to 80% of the characters. So you will only report back things where the longest common subsequence is 8 or 9 characters long, since 7/9 (about 0.78) is less than 0.8. You can take many approaches to this. One would be to just pick substrings from the long string whose length is the length of the search sequence and run LCS on each one to see if it matches. There are probably more efficient methods that employ divide and conquer as well. I don't care what you do for this. Your output should give each of the matching sequences and tell what fraction of the characters were a match.
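The substring approach described above could be sketched roughly as follows in Python. This is only one way to do it, not a required design: the names lcs_length and find_matches are hypothetical, and the sketch assumes every candidate substring has exactly the length of the search sequence.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b.

    Standard dynamic-programming table, kept one row at a time.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def find_matches(query, text, percent):
    """Yield (start, substring, fraction) for each qualifying substring.

    Assumption: candidates are the substrings of `text` whose length
    equals len(query); `percent` is a number such as 80.
    """
    n = len(query)
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        score = lcs_length(query, window)
        # Compare score / n against percent / 100 without dividing.
        if score * 100 >= percent * n:
            yield start, window, score / n


if __name__ == "__main__":
    line = "AAGTCGACCTGATTGAGTACTGACCGTTAGCGATTTACGTACAGGGTCCAAACTGATAACG"
    for start, window, fraction in find_matches("ACCGTTAGC", line, 80):
        print(start, window, fraction)
```

For the example line, the exact copy of ACCGTTAGC starts at offset 22 (counting from 0) and is reported with fraction 1.0. Neighboring substrings that overlap it can also reach the 8-out-of-9 threshold, so you may want to decide how your program should report overlapping matches.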