Tragic Mishap has been a long-time and faithful member of the UD community, so I’m highlighting a project he is thinking about:
I’ve had an idea rolling around in my head for a while, based on Douglas Axe’s research on a 150-residue section of beta-lactamase. He used experiments to estimate the probability of finding a functional fold by mutating it. It was approximately 10^-77. (I’m not sure of the exact number, but it doesn’t matter for the illustration.) Ever since becoming aware of that research, I have wondered whether it would be possible to predict such results theoretically, simply by restricting the types of amino acid substitutions that are allowed.
For instance, a protein with 150 residues would have 20^150, or about 10^195, possible sequences. The experimental number Axe found implies that a very large number of those sequences retain function, but nowhere close to all 10^195 of them. How many would that be? Very simply:
x/10^195 = 10^-77
x = 10^118
In other words, Axe’s work suggests that about 10^118 of the 10^195 theoretically possible sequences for the 150-residue region result in a functional protein fold. There ought to be some way to get a handle on which mutations work and which don’t.
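As a quick check on the arithmetic above, here is a minimal Python sketch that redoes the calculation in log10 space, so the huge numbers never have to be written out. The variable names are illustrative only, and the 10^-77 figure is taken as given from the text.

```python
import math

RESIDUES = 150          # length of the region, from the text
ALPHABET = 20           # number of standard amino acids
LOG10_FRACTION = -77    # approximate functional fraction quoted in the text

# Work with exponents: log10(20^150) = 150 * log10(20)
log10_total = RESIDUES * math.log10(ALPHABET)
# x / 20^150 = 10^-77  =>  log10(x) = log10(20^150) - 77
log10_functional = log10_total + LOG10_FRACTION

print(f"total sequences      ~ 10^{log10_total:.1f}")
print(f"functional sequences ~ 10^{log10_functional:.1f}")
```

The exponents come out a little above 195 and 118, which matches the rounded figures used in the text.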
To do this, I looked at a few similar cytochrome P450 and Na+K+-ATPase sequences and catalogued only single-residue polymorphisms, to come up with a list of single substitutions that appear not to affect protein structure and function. I ignored longer stretches of mutations, since those types of mutations compensate for each other. I came up with the following code:
The amino acids within a group are assumed to substitute freely for one another. The secondary and tertiary groups can substitute for each other, or for their primary or secondary groups as well. The purpose of this exercise would be to write a simple program that calculates the total number of possible sequences under whatever proteomic code you like. Here is a simple example of how the program would work:
MUTATION LIMIT: 1
3+2+3 = 8
TOTAL(1): 8 + 1 = 9
MUTATION LIMIT: 2
3*2+3*3+2*3 = 21
TOTAL(2): 21 + 9 = 30
MUTATION LIMIT: 3
3*2*3 = 18
TOTAL(3): 18 + 30 = 48
TOTAL: 4*3*4 = 48
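MUTATION LIMIT: 4
3*2*3*3 = 54
TOTAL(4): 54 + 138 = 192
TOTAL: 4*3*4*4 = 192
(Here a fourth residue with three allowed substitutions has been added; 138 is the count for up to three mutations on that four-residue sequence.)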
So if up to four mutations are allowed on this four-residue sequence, then according to the given substitution code there are 192 possible sequences out of 20^4 = 160,000 total possibilities, giving a probability of 1.2e-3 of finding a working sequence at random.
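The worked example above can be sketched in a few lines of Python. This is only one possible way to do the counting, not a finished design: the function name `count_sequences` and the representation of the code as a list of "allowed alternatives per position" are assumptions made for illustration. The numbers 3, 2, 3 and 3 are the per-position substitution counts implied by the example.

```python
def count_sequences(alternatives, mutation_limit):
    """Count sequences reachable with at most `mutation_limit` substitutions.

    alternatives[i] is the number of residues allowed to replace the
    original residue at position i (the original itself is not counted).
    """
    n = len(alternatives)
    # ways[k] = number of sequences with exactly k substituted positions
    ways = [1] + [0] * n
    for alt in alternatives:
        # Walk k downward so each position contributes at most once.
        for k in range(n, 0, -1):
            ways[k] += ways[k - 1] * alt
    # Slicing past the end is harmless, so limits above n are fine.
    return sum(ways[: mutation_limit + 1])


three_residues = [3, 2, 3]
for limit in (1, 2, 3):
    print(limit, count_sequences(three_residues, limit))  # 9, then 30, then 48

four_residues = [3, 2, 3, 3]
total = count_sequences(four_residues, 4)
print(total, total / 20 ** 4)  # 192 working sequences out of 160,000
```

Building up the counts one position at a time avoids listing every combination of mutated positions, so the same function should stay fast for a 150-residue sequence.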
The goal would be to construct a code that gives results which agree with experiments like Axe’s. The program could also be written using probabilities calculated according to the genetic code, which would be more complicated, but hopefully I’ve explained how it would work. This sort of approach might be useful for protein structure prediction.
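To get a feel for what "agreeing with Axe" would demand of a code, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every position is independent, that there is no mutation limit, and that every position tolerates the same number of residues g (original included). Under those assumptions (g/20)^150 = 10^-77, which can be solved for g.

```python
RESIDUES = 150
TARGET_LOG10_FRACTION = -77   # approximate figure quoted in the text

# (g / 20)^150 = 10^-77  =>  g = 20 * 10^(-77 / 150)
# Assumption: identical, independent positions and no mutation limit.
avg_allowed = 20 * 10 ** (TARGET_LOG10_FRACTION / RESIDUES)

print(f"average allowed residues per position ~ {avg_allowed:.2f}")
```

Under these simplifying assumptions the answer is a little over six residues per position. A realistic code would have different group sizes at different positions, so this is only a sanity check on the scale involved.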
So, is there anybody who could help me write such a program? I’m not a programmer. Presumably we would start simple and build complexities into the code, such as secondary and tertiary substitution groups, distance between point mutations, mutation limits (as shown) and perhaps even build in allowable block substitutions or different codes at different positions based on some criteria.
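As a possible starting point for the simplest version, here is a small Python sketch that turns a set of substitution groups into per-position counts for a given sequence. The groups below are HYPOTHETICAL placeholders chosen only so the example runs; they are not the substitution code described in this post. The names `PLACEHOLDER_GROUPS` and `alternatives_per_position` are likewise made up for illustration.

```python
import math

# HYPOTHETICAL groups for illustration only, NOT the code from the post.
# Each string lists one-letter amino acid codes assumed interchangeable.
PLACEHOLDER_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG", "C", "P"]


def alternatives_per_position(sequence, groups):
    """For each residue, count the other members of its group."""
    lookup = {aa: len(group) - 1 for group in groups for aa in group}
    return [lookup[aa] for aa in sequence]


alts = alternatives_per_position("MKDC", PLACEHOLDER_GROUPS)
print(alts)  # [3, 2, 1, 0]

# With no mutation limit, the count is the product of the group sizes.
print(math.prod(a + 1 for a in alts))  # 4*3*2*1 = 24
```

Secondary and tertiary groups, mutation limits, and position-specific codes could then be layered on top by changing how the per-position counts are produced.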
People can respond here at UD and at the following thread on the CEU Bulletin Board where the comments and some source code can be deposited.
Eventually if the project becomes larger, SourceForge would be a good location. Anyone with background on similar projects is especially welcome to comment.