# Tragic Mishap’s Proteomic Coding Project


Tragic Mishap has been a long and faithful member of the UD community, and so I’m highlighting a project he is thinking about:

I’ve had an idea rolling around in my head for a while based on Douglas Axe’s research on a 150-residue section of beta-lactamase. He used experiments to estimate the probability of finding a functional fold by mutating it. It was approximately 10^-77. (I’m not sure of the exact number, but it doesn’t matter for the illustration.) Ever since becoming aware of that research, I have wondered whether it would be possible to predict such results theoretically by simply restricting which amino acid substitutions are allowed.

For instance, a protein with 150 residues would have 20^150, or roughly 10^195, possible mutations. But the experimental number Axe found was much lower, suggesting that a very large number of mutations can occur and still retain function, but nowhere close to 10^195. How many would that be? Very simply:

x/10^195 = 10^-77

x = 10^118

In other words, Axe’s work suggests that there are 10^118 possible mutations in the 150-residue amino acid sequence that can result in a functional protein fold out of 10^195 theoretically possible mutations. There ought to be some way to get a handle on which mutations work and which don’t.
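The arithmetic above is easiest to check in logarithms, since 20^150 is too large to reason about directly. A minimal sketch, assuming the 10^-77 figure quoted in the text; the variable names are illustrative only:

```python
import math

RESIDUES = 150         # length of the region studied, per the text
ALPHABET = 20          # number of standard amino acids
LOG10_FRACTION = -77   # approximate functional fraction quoted above

# log10(20^150) = 150 * log10(20)
log10_total = RESIDUES * math.log10(ALPHABET)

# x / 20^150 = 10^-77  =>  log10(x) = log10(20^150) - 77
log10_functional = log10_total + LOG10_FRACTION

print(f"20^150 is about 10^{log10_total:.1f}")
print(f"x is about 10^{log10_functional:.1f}")
```

Rounded to whole exponents, this gives the 10^195 and 10^118 used in the text.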

To do this, I looked at a few similar cytochrome p450 and Na+K+-ATPase sequences and catalogued only single residue polymorphisms to come up with a list of single substitutions that appear to not affect protein structure and function. I ignored longer stretches of mutations since those types of mutations compensate for each other. I came up with the following code:

| Primary | Secondary | Tertiary |
|---|---|---|
| EDHN | GSC | AVIT |
| FYLM | | AVIT |
| KRW | | |
| QP | | |

The amino acids within a group are assumed to substitute freely for one another. The secondary and tertiary groups can substitute for each other or for their primary or secondary groups as well. The purpose of this exercise would be to write a simple program which could calculate the total number of possible sequences based on any particular proteomic code you like. Here is a simple example of how the program would work:
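One possible way to represent such a code in a program is a lookup table from each residue to the residues allowed to replace it. The sketch below assumes the flat six-group form of the code used in the worked example. It does not model the cross-group (secondary and tertiary) substitutions, because their exact rules are not pinned down here. The function name `build_substitutions` is hypothetical.

```python
def build_substitutions(groups):
    """Map each residue to the set of residues allowed to replace it.

    Assumes every residue belongs to exactly one group and that members
    of a group substitute freely for one another.
    """
    subs = {}
    for group in groups:
        for residue in group:
            if residue in subs:
                raise ValueError(f"{residue} appears in more than one group")
            subs[residue] = set(group) - {residue}
    return subs


# The code is passed in as data, so it can be changed without editing the program.
CODE = ["EDHN", "GSC", "AVIT", "FYLM", "KRW", "QP"]
SUBS = build_substitutions(CODE)

print(sorted(SUBS["E"]))  # ['D', 'H', 'N']
print(len(SUBS))          # 20, one entry per amino acid
```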

PROGRAM INPUT:

SEQUENCE: EGA

CODE:

EDHN
GSC
AVIT
FYLM
KRW
QP

MUTATION LIMIT: 1

DGA
HGA
NGA

ESA
ECA

EGV
EGI
EGT

3+2+3=8

TOTAL(1): 8+1=9

MUTATION LIMIT: 2

DSA
DCA
HSA
HCA
NSA
NCA

DGV
DGI
DGT
HGV
HGI
HGT
NGV
NGI
NGT

ESV
ESI
EST
ECV
ECI
ECT

3*2+3*3+2*3 = 21

TOTAL(2): 21+ 9 = 30

MUTATION LIMIT: 3

DSV
DSI
DST
DCV
DCI
DCT

HSV
HSI
HST
HCV
HCI
HCT

NSV
NSI
NST
NCV
NCI
NCT

3*2*3 = 18

TOTAL(3): 18 + 30 = 48

TOTAL: 4*3*4 = 48
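The listing above can be reproduced by brute-force enumeration. This is only a sketch under the same assumptions as the example (flat groups, free substitution within a group); the function names are made up for illustration, and enumeration like this is only practical for short sequences.

```python
from itertools import combinations, product

CODE = ["EDHN", "GSC", "AVIT", "FYLM", "KRW", "QP"]
# For each residue, the other members of its group, in group order.
SUBS = {r: [s for s in g if s != r] for g in CODE for r in g}


def variants_with_exactly(seq, k):
    """Yield every sequence that differs from seq at exactly k positions."""
    for positions in combinations(range(len(seq)), k):
        choices = [SUBS[seq[i]] for i in positions]
        for replacement in product(*choices):
            mutated = list(seq)
            for i, residue in zip(positions, replacement):
                mutated[i] = residue
            yield "".join(mutated)


def cumulative_counts(seq, limit):
    """Totals for mutation limits 1..limit, counting the original sequence."""
    totals, running = [], 1  # start at 1 for the unmutated sequence
    for k in range(1, limit + 1):
        running += sum(1 for _ in variants_with_exactly(seq, k))
        totals.append(running)
    return totals


print(list(variants_with_exactly("EGA", 1)))
# ['DGA', 'HGA', 'NGA', 'ESA', 'ECA', 'EGV', 'EGI', 'EGT']
print(cumulative_counts("EGA", 3))  # [9, 30, 48]
```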

PROGRAM INPUT:

SEQUENCE: EGAF

CODE:

EDHN
GSC
AVIT
FYLM
KRW
QP

MUTATION LIMIT: 4

PROGRAM OUTPUT:

1: 3+2+3+3+1=12
2: 3*2+3*3+3*3+2*3+2*3+3*3+12=57
3: 3*2*3+3*2*3+3*3*3+2*3*3+57=138
4: 3*2*3*3+138 = 192
TOTAL: 4*3*4*4 = 192

So if up to four mutations are allowed on this four-residue sequence, then according to the given substitution code there are 192 possible sequences out of 20^4 total possibilities, giving a 1.2e-3 probability of finding a working sequence at random.
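For a 150-residue sequence, listing every variant is out of the question, but the counts can be obtained without listing anything. If position i has n_i allowed replacements, the number of sequences with exactly k substitutions is the coefficient of x^k in the product of (1 + n_i x) over all positions. The sketch below uses that identity; it assumes the same flat six-group code, and the function names are illustrative.

```python
from itertools import accumulate

CODE = ["EDHN", "GSC", "AVIT", "FYLM", "KRW", "QP"]
GROUP_SIZE = {r: len(g) for g in CODE for r in g}


def counts_by_mutations(seq):
    """coeffs[k] = number of sequences exactly k substitutions away from seq."""
    coeffs = [1]  # the empty product: one sequence with zero substitutions
    for residue in seq:
        n = GROUP_SIZE[residue] - 1  # allowed replacements at this position
        nxt = coeffs + [0]
        for k, c in enumerate(coeffs):
            nxt[k + 1] += c * n      # multiply the polynomial by (1 + n*x)
        coeffs = nxt
    return coeffs


def cumulative(seq):
    """Running totals for mutation limits 0, 1, 2, ..., len(seq)."""
    return list(accumulate(counts_by_mutations(seq)))


totals = cumulative("EGAF")
print(totals)                   # [1, 12, 57, 138, 192]
print(totals[-1] / 20 ** 4)     # 0.0012
```

Python integers are arbitrary precision, so the same function works unchanged on a 150-residue input.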

The goal would be to construct a code that gives results which agree with experiments like Axe’s. This program could also be written using probabilities calculated according to the genetic code which would be more complicated, but hopefully I’ve explained how it would work. This sort of approach might be useful to protein structure prediction.

So, is there anybody who could help me write such a program? I’m not a programmer. Presumably we would start simple and build complexities into the code, such as secondary and tertiary substitution groups, distance between point mutations, mutation limits (as shown) and perhaps even build in allowable block substitutions or different codes at different positions based on some criteria.

People can respond here at UD and at the following thread on the CEU Bulletin Board where the comments and some source code can be deposited.

Proteomic Coding at CEU Forum

Eventually if the project becomes larger, SourceForge would be a good location. Anyone with background on similar projects is especially welcome to comment.

## 11 Replies to “Tragic Mishap’s Proteomic Coding Project”

1. 1
JGuy says:

Try contacting Gil Dodgen if you can’t find anyone. This looks like it would be easy for him.

2. 2
ciphertext says:

Sounds like it would be a fun diversion. What isn’t apparent from the requirements is what the “rules” for substitution are. In your “pseudo-code” you have

PROGRAM INPUT:
SEQUENCE:…
CODE:

What is expected as program input?

Should the program do any processing at all during the sequence portion? If so, what are the rules for sequencing? Or, are you saying that sequence: EGA is the program input?

What functionality are you wanting to represent for code in your pseudo-code? Is that the section for which you are attempting to solicit assistance in developing?

Are the values that follow the code section (e.g. EDHN, GSC, AVIT, etc…) the results of processing performed in the code section, or something other? It seems that those are the “mutations” that you are trying to derive and then use in calculation later (e.g. 3+2+3 = 8 + 1 = 9 total mutations).

I think a clearer “spec” would need to be developed before your project would be capable of automation through programming. In addition, you would need to provide a mechanism to identify how mutation occurs (character substitution?) that could be replicated in computer code. You would need to delineate the operating rules (e.g. “A” could bind to “G” but not ever to “E”), such that when the mutation (i.e. string concatenation of characters) occurred, you could ensure that no incorrect bindings were generated (e.g. “AG” but not “AE”). Sounds a bit like grammar, doesn’t it (e.g. “i” before “e” except after “c”)?

3. 3
4. 4
bornagain77 says:

JGuy, you may find these related notes helpful to that video:

Do centrioles generate a polar ejection force? – Wells J. – 2005
Excerpt: Centrioles consist of nine microtubule triplets arranged like the blades of a tiny turbine. Instead of viewing centrioles through the spectacles of molecular reductionism and neo-Darwinism, this hypothesis assumes that they are holistically designed to be turbines. Orthogonally oriented centriolar turbines could generate oscillations in spindle microtubules that resemble the motion produced by a laboratory vortexer. The result would be a microtubule-mediated ejection force tending to move chromosomes away from the spindle axis and the poles.
http://www.ncbi.nlm.nih.gov/pubmed/15889341

The Spindle Assembly Checkpoint Mechanism and the Consequences of its Dysfunction – J. Mclatchie – Dec. 10, 2013
The spindle assembly checkpoint pathway is an elegantly engineered surveillance system for protecting the cell from the adverse consequences of improper kinetochore-microtubule attachment. Proper attachment of kinetochores to microtubules is monitored by tension-sensing and by detection of attachment of the ends of the microtubules to the kinetochores. Even a single unattached kinetochore is sufficient to trigger the wait anaphase signal, which inhibits activation of the APC that drives entry into anaphase. Impairment of the spindle assembly checkpoint pathway can result in aneuploidy, a contributor to cancer and developmental abnormalities such as Down’s syndrome.
http://jmclatchie.blogspot.co......point.html

DNA – Replication, Wrapping & Mitosis – video
https://vimeo.com/33882804

5. 5
tragic mishap says:

Thanks for highlighting this, scordova. I have been busy with work and the REASONS conference this weekend. I have another 12-hour shift ahead of me so more detailed comments will have to wait till tomorrow. Looks like I need to pay more attention to the forum!

ciphertext, I am not in the least a programmer. The “PROGRAM INPUT” was intended to convey three separate inputs: a code for which amino-acid substitutions are allowed (which was “compiled” manually from real point mutations in protein sequences), the sequence to be mutated and the maximum number of allowed mutations. The code itself is intended to be variable, that is not hard-coded into the program so it can be adjusted to achieve the result corresponding to experimental results like Axe’s.

6. 6
ciphertext says:

Maybe I can help you refine your requirements into a specification that can be used as a basis for a prototype. My area of expertise is in cloud application development. At the least, I could help you articulate a design/technical document that we could prototype. I think that a programmer with experience in proteomics or bioinformatics would be able to develop a better program, but I don’t mind providing you with some assistance early on in your prototyping.

7. 7
ciphertext says:

Regarding your program design, it sounds like you wish to provide the program with a “library” of substitutions as input. Would these substitutions likely be in the form of a rule or rules (e.g. “A to G, but not to C”)?

The remaining inputs would be the protein sequence that should be mutated (a string type), and then the max number of iterations the mutation process should occur.

Regarding the “max” mutation iteration, is the goal simply to mutate the input sequence during the first iteration; and then mutate the result of the procedure each subsequent iteration?

As an example:

The input for the rules indicate that “A” can never immediately precede a “G”, and a “C” can never immediately precede an “A”.

The input sequence to mutate is “EACRG”.

The max iterations is 2.

So the processing would simply reorder the protein symbols (e.g. E, A, C, R) a maximum of 2 times. Each iteration would use the results of the previous iteration as the sequence input.

Iteration 1: Result = EGCRA

Iteration 2: Result = ECRGA

Is that what you are thinking?

Or are you thinking that the rules would be that you could substitute an “R” for a “G” but not for a “C”? In that case you would simply perform an actual substitution for each letter according to the rules and use the resultant mutation as input for the next sequence.

8. 8
tragic mishap says:

Would these substitutions likely be in the form of a rule or rules (e.g. “A to G, but not to C”)?

Yes.

Regarding the “max” mutation iteration, is the goal simply to mutate the input sequence during the first iteration; and then mutate the result of the procedure each subsequent iteration?

That’s a good point that I had not thought too much about, but with the simple codes I’m suggesting it should not make a difference whether the substitutions occur based on the first or subsequent iterations. I’m interested in the raw probability, and I think we would just assume that the mutations occur at different locations. If they occurred at the same location, it would be the same as reducing the mutation limit by one for each same-location substitution, which is already included in the calculation.
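This claim can be checked on the small example. The sketch below compares two approaches: applying one substitution at a time to the previous results, and enumerating directly from the original sequence. It assumes the flat six-group code, where every group member can replace every other; the function names are hypothetical.

```python
from itertools import product

CODE = ["EDHN", "GSC", "AVIT", "FYLM", "KRW", "QP"]
SUBS = {r: [s for s in g if s != r] for g in CODE for r in g}


def single_step(seq):
    """All sequences reachable from seq by one allowed substitution."""
    return {seq[:i] + r + seq[i + 1:]
            for i in range(len(seq)) for r in SUBS[seq[i]]}


def iterated(seq, limit):
    """Mutate the results of the previous round, up to `limit` rounds."""
    seen, frontier = {seq}, {seq}
    for _ in range(limit):
        frontier = {m for s in frontier for m in single_step(s)} - seen
        seen |= frontier
    return seen


def static(seq, limit):
    """All in-group variants differing from seq at no more than `limit` positions."""
    pools = [[r] + SUBS[r] for r in seq]
    return {"".join(c) for c in product(*pools)
            if sum(a != b for a, b in zip(c, seq)) <= limit}


for limit in range(5):
    assert iterated("EGAF", limit) == static("EGAF", limit)
print(len(iterated("EGAF", 2)))  # 57
```

The two sets agree here because a second substitution at the same position either reverts it or lands on another member of the same group. With one-directional or cross-group rules that equivalence is not guaranteed.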

9. 9
tragic mishap says:

It would be easier to continue this on the forum, but for a more complicated and realistic type of code it might make a difference in the calculation.

10. 10
ciphertext says:

So, it seems that there are two different processes we could consider for the substitution. I’ll call one the “static” method, and the other the “recursive” method.

In the static method, you would be generating all possible mutations of an input string up to the maximum (or ceiling) allowed number of mutations. Each iteration of mutation would use the same input string and apply any substitution rules that were already applied. Essentially, it would function a lot like a code “cracking” machine. It would attempt to “guess” the mutations possible without repeating results.

Your code would need as input, the rules for substitution (e.g. “A” cannot immediately precede “G”, “E” can substitute for “R”), the starting sequence, and the maximum number of mutations. This wouldn’t reveal all possible combinations (if there were ones beyond the max input value), just what those combinations were. You would essentially have a list of allowed combinations. You could use that list as a “check” for the iterator. Such that if the resultant mutation already existed in the dictionary (possibly from a subsequent execution), then the iteration wouldn’t count against the max value.

The second method I indicated was a “dynamic” process in that the input string to be mutated was always the resultant from the previous mutation. With the obvious exception being that the system would use the starting sequence provided as input on the first iteration. The processing method wouldn’t need to worry about which rules had already been applied.

Depending upon your approach to rules and “meta rules” (which govern the programmatic execution of the rules) you could engage a variety of processing behaviors. For instance, you could instruct the application to be either “static” or “dynamic” in its mutation algorithm. You could also instruct the code to restrict its mutation to cycle left to right per codon or maybe by group of codons. You could indicate that you wish the program to mutate the first position only while executing the mutation in dynamic mode. Which means that all the rules (e.g. “E” can substitute for “G”, “A” cannot immediately precede “G”) would be stepped through for the first position only and the resultant mutation would be the input for the next mutation iteration.
The “checks” could have access to the whole string to ensure that the rules are applied in the event that there is a “no immediately preceding” scenario, but the substitution would be only for the first codon. In this case, you would see a listing of mutations that only mutated the first codon.

Essentially, we are just developing a parser that parses a string and performs a substitution based upon some set of pre-defined rules.

11. 11
ciphertext says:

Well…rats…in the above, when you see “dynamic” simply think of my “recursive” method. I couldn’t decide how best to describe it.