Tragic Mishap has been a longtime and faithful member of the UD community, and so I’m highlighting a project he is thinking about:
I’ve had an idea rolling around in my head for a while based on Douglas Axe’s research on a 150-residue section of beta-lactamase. He used experiments to come up with a number for the probability of finding a functional fold by mutating it. It was approximately 10^-77. (I’m not sure of the exact number but it doesn’t matter for the illustration.) Ever since becoming aware of that research, I have wondered if it would be possible to theoretically predict such results by simply restricting the type of amino acid substitutions which are allowed.
For instance, a protein with 150 residues would have 20^150, or about 10^195, possible sequences. Axe’s experimental figure implies that only a tiny fraction of these retain function: a very large number in absolute terms, but not anywhere close to 10^195. How many would that be? Very simply:
x/10^195 = 10^-77
x = 10^118
In other words, Axe’s work suggests that there are 10^118 possible mutations in the 150-residue amino acid sequence that can result in a functional protein fold out of 10^195 theoretically possible mutations. There ought to be some way to get a handle on which mutations work and which don’t.
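The arithmetic above can be checked in a few lines of Python. This is only a sketch that works in base-10 logarithms and uses the approximate 10^-77 figure quoted above; the variable names are made up for illustration.

```python
import math

RESIDUES = 150             # length of the section discussed above
ALPHABET = 20              # number of standard amino acids
LOG10_P_FUNCTIONAL = -77   # approximate figure quoted above (an assumption)

# log10 of the total number of 150-residue sequences, i.e. log10(20^150)
log10_total = RESIDUES * math.log10(ALPHABET)

# x / 10^195 = 10^-77  =>  log10(x) = 195 - 77
log10_functional = round(log10_total) + LOG10_P_FUNCTIONAL

print(f"log10(20^150) = {log10_total:.2f}")
print(f"functional sequences: about 10^{log10_functional}")
```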
To do this, I looked at a few similar cytochrome P450 and Na+/K+-ATPase sequences and catalogued only single residue polymorphisms to come up with a list of single substitutions that appear to not affect protein structure and function. I ignored longer stretches of mutations since those types of mutations compensate for each other. I came up with the following code:
prim.    sec.    tert.
EDHN     GSC     AVIT
FYLM             AVIT
KRW
QP

The amino acids within a group are assumed to substitute freely for one another. The secondary and tertiary groups can substitute for each other or for their primary or secondary groups as well. The purpose of this exercise would be to write a simple program which could calculate the total number of possible sequences based on any particular proteomic code you like. Here is a simple example of how the program would work:
PROGRAM INPUT:
SEQUENCE: EGA
CODE:
EDHN
GSC
AVIT
FYLM
KRW
QP
MUTATION LIMIT: 1
DGA
HGA
NGA

ESA
ECA

EGV
EGI
EGT

3+2+3=8
TOTAL(1): 8+1=9
MUTATION LIMIT: 2
DSA
DCA
HSA
HCA
NSA
NCA

DGV
DGI
DGT
HGV
HGI
HGT
NGV
NGI
NGT

ESV
ESI
EST
ECV
ECI
ECT

3*2+3*3+2*3 = 21
TOTAL(2): 21+ 9 = 30
MUTATION LIMIT: 3
DSV
DSI
DST
DCV
DCI
DCT

HSV
HSI
HST
HCV
HCI
HCT

NSV
NSI
NST
NCV
NCI
NCT

3*2*3 = 18
TOTAL(3): 18 + 30 = 48
TOTAL: 4*3*4 = 48
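The EGA example above can be reproduced with a short brute-force enumeration. This is a minimal Python sketch, not a finished design: it assumes each line of the CODE is a flat group of freely interchangeable residues (the secondary and tertiary relationships are ignored, as in the worked example), and the names `CODE`, `alternatives` and `variants` are invented here for illustration.

```python
from itertools import combinations, product

# Substitution groups exactly as listed under CODE in the example.
CODE = ["EDHN", "GSC", "AVIT", "FYLM", "KRW", "QP"]

def alternatives(residue, code=CODE):
    """Residues that may replace `residue` (same group, excluding itself)."""
    for group in code:
        if residue in group:
            return [r for r in group if r != residue]
    return []  # residue not covered by the code: no substitutions allowed

def variants(sequence, n_mutations, code=CODE):
    """All sequences that differ from `sequence` at exactly n_mutations positions."""
    found = []
    for positions in combinations(range(len(sequence)), n_mutations):
        choices = [alternatives(sequence[i], code) for i in positions]
        for replacement in product(*choices):
            mutated = list(sequence)
            for i, r in zip(positions, replacement):
                mutated[i] = r
            found.append("".join(mutated))
    return found

running_total = 0
for limit in range(0, 4):
    exact = variants("EGA", limit)
    running_total += len(exact)
    print(f"exactly {limit} mutation(s): {len(exact):2d}   TOTAL({limit}): {running_total}")
```

Listing every sequence is only practical for short inputs like this one; the number of variants grows far too quickly for a 150-residue sequence.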
PROGRAM INPUT:
SEQUENCE: EGAF
CODE:
EDHN
GSC
AVIT
FYLM
KRW
QP
MUTATION LIMIT: 4
PROGRAM OUTPUT:
1: 3+2+3+3+1=12
2: 3*2+3*3+3*3+2*3+2*3+3*3+12=57
3: 3*2*3+3*2*3+3*3*3+2*3*3+57=138
4: 3*2*3*3+138 = 192
TOTAL: 4*3*4*4 = 192

So if up to four mutations are allowed on this four-residue sequence, then according to the given substitution code there are 192 possible sequences out of 20^4 total possibilities, giving a 1.2e-3 probability of finding a working sequence at random.
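The totals in this output depend only on how many alternatives each position has, so they can be computed without listing any sequences. The Python sketch below does this with a small running tally over positions; it assumes the same flat groups as the example, and `cumulative_counts` and the other names are hypothetical. Because Python integers are unbounded, the same function would also run on a 150-residue sequence.

```python
# Substitution groups exactly as listed under CODE in the example.
CODE = ["EDHN", "GSC", "AVIT", "FYLM", "KRW", "QP"]

def n_alternatives(residue, code=CODE):
    """Number of residues that may replace `residue` under the code."""
    for group in code:
        if residue in group:
            return len(group) - 1
    return 0

def cumulative_counts(sequence, limit, code=CODE):
    """totals[k] = number of sequences with at most k substituted positions."""
    # exact[k] = number of sequences with exactly k substituted positions
    exact = [1] + [0] * limit
    for residue in sequence:
        alt = n_alternatives(residue, code)
        # Go from high k to low k so each position is counted at most once.
        for k in range(limit, 0, -1):
            exact[k] += exact[k - 1] * alt
    totals, running = [], 0
    for count in exact:
        running += count
        totals.append(running)
    return totals

totals = cumulative_counts("EGAF", 4)
for k in range(1, 5):
    print(f"{k}: {totals[k]}")

probability = totals[4] / 20 ** len("EGAF")
print(f"probability: {probability:.1e}")
```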
The goal would be to construct a code that gives results which agree with experiments like Axe’s. This program could also be written using probabilities calculated according to the genetic code which would be more complicated, but hopefully I’ve explained how it would work. This sort of approach might be useful to protein structure prediction.
So, is there anybody who could help me write such a program? I’m not a programmer. Presumably we would start simple and build complexities into the code, such as secondary and tertiary substitution groups, distance between point mutations, mutation limits (as shown) and perhaps even build in allowable block substitutions or different codes at different positions based on some criteria.
People can respond here at UD and at the following thread on the CEU Bulletin Board where the comments and some source code can be deposited.
Eventually if the project becomes larger, SourceForge would be a good location. Anyone with background on similar projects is especially welcome to comment.