Project 2

For the second project of the semester, you will select one of the following problems and solve it using Perl. When the project is due, you must turn in your code as well as a paper of at least 5 pages (12-point font, 1.5-spaced, normal margins) describing what you did and what you found. As described in the syllabus, the projects should be done in groups. Each member of the group will also submit an evaluation describing what he/she did as part of the project as well as the work done by the others.

Psychology Option - ???

Physics Option - Search through the Sloan Digital Sky Survey.

Biology - For this project you will be playing with BLAST. You will be working both on the internet and on a local machine (Keck). The first part of the project is to look up the genes NM_001036818, NM_001003219, and J04077 using the web interface. For each one, tell me what gene it is and which species it comes from. You will also need the sequence of each gene so that you can run searches with it.

Using the nt database on the Keck machine, I want you to write a Perl script that does the following. The script will be given an input file name and the number of top hits you are interested in (n). First, it should do a blastn search of the specified input file against the database. It should then take the top n hits and search each of them back against the database to see where the original gene comes up in that search. You should print a table that shows the matching genes along with their score and E value in the original search. Each row should also show the rank, score, and E value of the original gene in the reverse search.

To get the FASTA information for a particular gene in the standalone install, use the fastacmd command in the bin directory. The easiest way to do this is with the -s option. For that, use one of the gene's identifiers other than the LOC identifier. For example, the one we did in class had the identifier XM_701884.1. You can also use the identifier 68396607 for that gene.
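The forward search followed by one reverse search per hit could be organized roughly as in the sketch below. It is only one possible layout, and it rests on several assumptions you should check on Keck: that blastall and fastacmd are on your PATH (otherwise give the full path into the bin directory), that blastall's -m 8 option gives the 12-column tab-separated output, and that hit identifiers in nt look like gi|68396607|ref|XM_701884.1|. The subroutine names and the extra query identifier argument (needed to recognize the original gene in the reverse search) are made up for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Parse tabular BLAST output (assumed column order: query, subject,
# %identity, length, mismatches, gap opens, q.start, q.end, s.start,
# s.end, E value, bit score). Returns one hash per distinct subject,
# in order of first appearance, which is taken as the rank.
sub parse_hits {
    my ($text) = @_;
    my (@hits, %seen);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(#|$)/;     # skip comments and blank lines
        my @f = split /\t/, $line;
        next if @f < 12;
        s/\s+$// for @f;                  # drop stray trailing whitespace
        next if $seen{ $f[1] }++;         # keep only the first line per subject
        push @hits, { id => $f[1], evalue => $f[10], score => $f[11],
                      rank => scalar(@hits) + 1 };
    }
    return @hits;
}

# Find the hit whose '|'-separated identifier contains $id as a whole field.
sub find_hit {
    my ($id, @hits) = @_;
    for my $h (@hits) {
        return $h if grep { $_ eq $id } split /\|/, $h->{id};
    }
    return;
}

# Run an external command without a shell and return everything it prints.
sub run_cmd {
    my @cmd = @_;
    open my $fh, '-|', @cmd or die "cannot run $cmd[0]: $!";
    local $/;                             # slurp mode
    my $out = <$fh>;
    close $fh or die "$cmd[0] failed";
    return $out // '';
}

# Forward search, then one reverse search per top hit; prints the table.
sub reciprocal_table {
    my ($query_file, $query_id, $n, $db) = @_;
    my @blast = (qw(blastall -p blastn -m 8 -d), $db, '-i');   # assumed flags
    my @fwd   = parse_hits(run_cmd(@blast, $query_file));
    my $last  = $n < @fwd ? $n - 1 : $#fwd;
    my $fmt   = "%-32s %8s %10s | %5s %8s %10s\n";
    printf $fmt, qw(hit score evalue rank score evalue);
    for my $hit (@fwd[0 .. $last]) {
        # Use the gi number for fastacmd -s when the id has the assumed shape.
        my @tok = split /\|/, $hit->{id};
        my $key = ($tok[0] eq 'gi' && @tok > 1) ? $tok[1] : $hit->{id};
        my ($fh, $tmp) = tempfile(SUFFIX => '.fa', UNLINK => 1);
        print {$fh} run_cmd('fastacmd', '-d', $db, '-s', $key);
        close $fh;
        my $back = find_hit($query_id, parse_hits(run_cmd(@blast, $tmp)));
        printf $fmt, @{$hit}{qw(id score evalue)},
            $back ? @{$back}{qw(rank score evalue)} : ('-') x 3;
    }
}
```

A call such as reciprocal_table('query.fa', 'XM_701884.1', 5, 'nt') would print one row per top hit, with dashes in the last three columns when the original gene does not appear in the reverse search. Keeping the parsing separate from the code that runs BLAST makes it possible to test the table logic on saved output before running the slow searches.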

If you want, you can also let your program accept a gene identifier in place of an input file. You might have it use command-line options such as -i for the input file, -g for the gene identifier, and -n for the number of matches to put in the table.
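One way to handle these options is Perl's standard Getopt::Long module. The sketch below is only a suggestion: the subroutine name parse_args, the default of 10 hits, and the rule that exactly one of -i and -g must be given are choices made for this illustration, not requirements of the project.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse -i <file>, -g <gene identifier>, and -n <count> from a list of
# arguments. Returns a hash with keys i, g, and n; dies on bad usage.
sub parse_args {
    my @args = @_;
    my %opt  = (n => 10);   # default hit count is an arbitrary choice here
    GetOptionsFromArray(
        \@args,
        'i=s' => \$opt{i},  # input FASTA file
        'g=s' => \$opt{g},  # gene identifier, e.g. XM_701884.1
        'n=i' => \$opt{n},  # number of top hits for the table
    ) or die "usage: $0 (-i file | -g gene) [-n count]\n";
    die "give exactly one of -i or -g\n"
        unless (defined($opt{i}) xor defined($opt{g}));
    die "-n must be a positive integer\n" unless $opt{n} > 0;
    return %opt;
}

# In a script you would call it on the real command line:
#   my %opt = parse_args(@ARGV);
```

When -g is used, the script could first fetch the sequence with fastacmd and the -s option, write it to a temporary file, and then proceed exactly as it would for -i.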

Chemistry - ???

Engineering - ???

Geoscience - ???