
On FSCO/I vs. Needles and Haystacks (as well as elephants in rooms)


Sometimes, the very dismissiveness of hyperskeptical objections is their undoing, as in this case from TSZ:

Pesky EleP(T|H)ant

Over at Uncommon Descent KirosFocus repeats the same old bignum arguments as always. He seems to enjoy the ‘needle in a haystack’ metaphor, but I’d like to counter by asking how does he know he’s not searching for a needle in a needle stack? . . .

What had happened is that on June 24th I had posted a discussion here at UD on what Functionally Specific Complex Organisation and associated Information (FSCO/I) is about, including this summary infographic:

[Infographic: csi_defn — summary definition of CSI/FSCO/I]

Instead of addressing what this actually does, RTH of TSZ sought to strawmannise and rhetorically dismiss it by an allusion to Dembski’s 2005 expression for Complex Specified Information, CSI:

χ = – log2[10^120 ·ϕS(T)·P(T|H)].

–> χ is “chi” and ϕ is “phi” (where CSI exists if χ > ~1)

. . . failing to understand — as did the sock-puppet Mathgrrrl (not to be confused with the Calculus prof who uses that improperly appropriated handle) — that by simply moving forward to the extraction of the information and threshold terms involved, this expression reduces as follows:

To simplify and build a more “practical” mathematical model, we note that information theory researchers Shannon and Hartley showed us how to measure information by changing probability into a log measure that allows pieces of information to add up naturally:

Ip = – log p, in bits if the log base is 2. That is where the now familiar unit, the bit, comes from. We may observe this as a standard result from, for instance, Principles of Communication Systems, 2nd edn, Taub and Schilling (McGraw-Hill, 1986), p. 512, Sect. 13.2:

Let us consider a communication system in which the allowable messages are m1, m2, . . ., with probabilities of occurrence p1, p2, . . . . Of course p1 + p2 + . . . = 1. Let the transmitter select message mk of probability pk; let us further assume that the receiver has correctly identified the message [–> My nb: i.e. the a posteriori probability in my online discussion here is 1]. Then we shall say, by way of definition of the term information, that the system has communicated an amount of information Ik given by

I_k = log_2 (1/p_k), by definition.   (13.2-1)
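As an aside, the log measure is easy to check numerically; here is a minimal Python sketch (mine, not part of the Taub and Schilling text) showing how the measure turns products of probabilities into sums of bits:

```python
from math import log2

def info_bits(p: float) -> float:
    """Shannon/Hartley information measure: I = -log2(p), in bits."""
    return -log2(p)

# A fair coin toss carries 1 bit; a fair die roll about 2.585 bits; and
# because logs turn products of probabilities into sums, information from
# independent events simply adds.
print(info_bits(1/2))        # 1.0
print(info_bits(1/6))        # ~2.585
print(info_bits(1/2 * 1/6))  # ~3.585 = 1.0 + 2.585
```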

xxi: So, since 10^120 ~ 2^398 (as log2 of 10^120 = 120 × log2 10 ≈ 398.6), we may “boil down” the Dembski metric using some algebra — i.e. substituting and simplifying the three terms in order — using log(p*q*r) = log(p) + log(q) + log(r) and log(1/p) = – log(p):

Chi = – log2(2^398 * D2 * p), in bits, and where also D2 = ϕS(T)
Chi = Ip – (398 + K2), where now: log2(D2) = K2
That is, chi is a metric of bits from a zone of interest, beyond a threshold of “sufficient complexity to not plausibly be the result of chance,”  (398 + K2).  So,
(a) since (398 + K2) tends to at most 500 bits on the gamut of our solar system [our practical universe, for chemical interactions! (. . . if you want, 1,000 bits would be a limit for the observable cosmos)] and
(b) as we can define and introduce a dummy variable for specificity, S, where
(c) S = 1 or 0 according as the observed configuration, E, is on objective analysis specific to a narrow and independently describable zone of interest, T:

Chi_500 = Ip*S – 500, in bits beyond a “complex enough” threshold

  • NB: If S = 0, this locks us at Chi_500 = – 500; and, if Ip is less than 500 bits, Chi_500 will be negative even if S = 1.
  • E.g.: a string of 501 coins tossed at random will have S = 0, but if the coins are arranged to spell out a message in English using the ASCII code [notice the independent specification of a narrow zone of possible configurations, T], Chi_500 will — unsurprisingly — be positive.
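To make the arithmetic in the two bullets above concrete, here is a minimal Python sketch (my illustration, with each fair coin counted at 1 bit of capacity and each ASCII character at 7 bits):

```python
# Minimal sketch of the Chi_500 bookkeeping in the bullets above.

def chi_500(ip_bits: float, s: int) -> float:
    """Bits beyond the 500-bit solar-system threshold: Chi_500 = Ip*S - 500."""
    return ip_bits * s - 500

# Case 1: 501 coins tossed at random -- plenty of information capacity, but
# no independent functional specification, so S = 0 and Chi_500 locks at -500.
print(chi_500(ip_bits=501, s=0))      # -500

# Case 2: the same tray arranged as a ~72-character English message in 7-bit
# ASCII -- independently specified, so S = 1 and Chi_500 goes positive.
print(chi_500(ip_bits=72 * 7, s=1))   # 4 (bits beyond the threshold)
```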

[Image: explan_filter — the explanatory filter]

  • S goes to 1 when we have objective grounds — to be explained case by case — to assign that value.
  • That is, we need to justify why we think the observed cases E come from a narrow zone of interest, T, that is independently describable, not just a list of members E1, E2, E3 . . . ; in short, we must have a reasonable criterion that allows us to build or recognise cases Ei from T, without resorting to an arbitrary list.
  • A string at random is a list with one member, but if we pick it as a password, it is now a zone with one member. (Where also a lottery is a sort of inverse password game where we pay for the privilege; and where the complexity has to be carefully managed to make it winnable.)
  • An obvious example of such a zone T is code symbol strings of a given length that work in a programme or communicate meaningful statements in a language based on its grammar, vocabulary etc. This paragraph is a case in point, which can be contrasted with typical random strings ( . . . 68gsdesnmyw . . . ) or repetitive ones ( . . . ftftftft . . . ); where we can also see how a functional string can enfold random and repetitive sub-strings.
  • Arguably — and of course this is hotly disputed — DNA’s protein-coding and regulatory codes are another. Design theorists argue that the only observed adequate cause for such is a process of intelligently directed configuration, i.e. of design, so we are justified in taking such a case as a reliable sign of such a cause having been at work. (Thus, the sign then counts as evidence pointing to a perhaps otherwise unknown designer having been at work.)
  • So also, to overthrow the design inference, a valid counter example would be needed, a case where blind mechanical necessity and/or blind chance produces such functionally specific, complex information. (Points xiv – xvi above outline why that will be hard indeed to come up with. There are literally billions of cases where FSCI is observed to come from design.)

xxii: So, we have some reason to suggest that if something, E, is based on specific information describable in a way that does not just quote E and requires at least 500 specific bits to store the specific information, then the most reasonable explanation for the cause of E is that it was designed. The metric may be directly applied to biological cases:

Using Durston’s Fits values — functionally specific bits — from his Table 1 to quantify Ip, and accepting functionality on specific sequences as showing specificity (so S = 1), we may apply the simplified Chi_500 metric of bits beyond the threshold:
RecA: 242 AA, 832 fits, Chi: 332 bits beyond
SecY: 342 AA, 688 fits, Chi: 188 bits beyond
Corona S2: 445 AA, 1285 fits, Chi: 785 bits beyond
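The subtraction can be checked mechanically; a small Python sketch (mine, using the Fits values quoted above, with S = 1 taken as given since functionality on the specific sequence is the specification) follows:

```python
# Sketch: apply the simplified metric, Chi_500 = Fits - 500, to the Durston
# values quoted above.

durston_fits = {
    "RecA":      832,   # 242 AA
    "SecY":      688,   # 342 AA
    "Corona S2": 1285,  # 445 AA
}

for protein, fits in durston_fits.items():
    print(f"{protein}: {fits} fits -> {fits - 500} bits beyond the threshold")

# RecA: 832 fits -> 332 bits beyond the threshold
# SecY: 688 fits -> 188 bits beyond the threshold
# Corona S2: 1285 fits -> 785 bits beyond the threshold
```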

Where, of course, there are many well known ways to obtain the information content of an entity, which automatically addresses the “how do you evaluate p(T|H)” issue. (As has been repeatedly pointed out, just insistently ignored in the rhetorical intent to seize upon a dismissive talking point.)

There is no elephant in the room.

Apart from . . . the usual one design objectors generally refuse to address, selective hyperskepticism.

But also, RTH imagines there is a whole field of needles, refusing to accept that many relevant complex entities are critically dependent on having the right parts, correctly arranged, coupled and organised in order to function.

That is, there are indeed empirically and analytically well founded narrow zones of functional configs in the space of possible configs. By far and away most of the ways in which the parts of a watch may be arranged — even leaving off the ever so many more ways they can be scattered across a planet or solar system — will not work.

The reality of narrow and recognisable zones T in large spaces W beyond the blind sampling capacity — that’s yet another concern — of a solar system of 10^57 atoms or an observed cosmos of 10^80 or so atoms and 10^17 s or so duration, is patent. (And if RTH wishes to dismiss this, let him show us observed cases of life spontaneously organising itself out of reasonable components, say soup cans. Or, of watches created by shaking parts in drums, or of recognisable English text strings of at least 72 characters being created through random text generation . . . which last is a simple case that holds without loss of generality (WLOG), as the infographic points out; for, 3D functional arrangements can be reduced to code strings, per AutoCAD etc.)
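To illustrate the point that 3D functional arrangements reduce to code strings, here is a toy Python sketch (the parts list and the 7-bits-per-character figure are my own illustrative assumptions, not from the post):

```python
# Toy illustration: a 3D functional arrangement of parts, reduced to a
# structured description string (much as a CAD file describes a mechanism),
# so that its description length can be counted in bits like any code string.

watch_parts = [
    # (part, x_mm, y_mm, z_mm, rotation_deg) -- illustrative values only
    ("mainspring",    0.0, 0.0, 1.0,   0),
    ("centre_wheel",  0.0, 0.0, 2.0,  15),
    ("escape_wheel",  6.5, 2.0, 2.0, 120),
    ("balance_wheel", 9.0, 4.5, 3.0,   0),
]

description = ";".join(
    f"{name}@({x},{y},{z})rot{rot}" for name, x, y, z, rot in watch_parts
)

# Crude capacity estimate: 7 bits per ASCII character of the description.
print(description)
print(f"{len(description)} characters ~ {7 * len(description)} bits of description")
```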

Finally, when the material issue is sampling, we do not need to generate grand probability calculations.

[Image: the proverbial needle in the haystack]

For, once we are reasonably confident that we are looking at deeply isolated zones in a field of possibilities, it is simple to show that — unless a “search” is so “biased” as to be decidedly not random and decidedly not blind — only a blind sample of a scope sufficient to make it reasonably likely to catch zones T in the field W would be a plausible blind chance plus mechanical necessity causal account.

But, 500 – 1,000 bits (a rather conservative threshold relative to what we see in just the genomes of life forms) of FSCO/I is (as the infographic shows) far more than enough to demolish that hope. For 500 bits, one can see that to give every one of the 10^57 atoms of our solar system a tray of 500 H/T coins, tossed and inspected every 10^-14 s — a fast ionic reaction rate — would, over the time available, amount to a sample equivalent to one straw drawn from a cubical haystack 1,000 light years across, about as thick as our galaxy’s central bulge. If such a haystack were superposed on our galactic neighbourhood and we were to take a blind, reasonably random, one-straw-sized sample, it would with maximum likelihood be straw.
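As a rough numerical cross-check of the straw-to-haystack comparison, here is a Python sketch under the figures stated above (the one-cubic-centimetre straw volume is my own nominal guess):

```python
from math import log10

# Blind observations available on the solar-system "gamut", per the figures
# above: 10^57 atoms, each inspecting a tray of 500 coins every 10^-14 s,
# for ~10^17 s.
observations = 1e57 * 1e14 * 1e17           # ~1e88 observations

# Configuration space of 500 two-sided coins.
config_space = 2 ** 500                      # ~3.27e150 possibilities

print(f"fraction sampled ~ 10^{log10(observations) - log10(config_space):.0f}")
# fraction sampled ~ 10^-63

# Compare: one straw (assume ~1 cm^3) taken from a cubical haystack
# 1,000 light years on a side.
cm_per_ly = 9.46e17
haystack_cm3 = (1000 * cm_per_ly) ** 3       # ~8.5e62 cm^3
print(f"one straw / haystack ~ 10^{-log10(haystack_cm3):.0f}")
# one straw / haystack ~ 10^-63
```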

As in, empirically impossible, or if you insist, all but impossible.

 

It seems that objectors to design inferences on FSCO/I have been reduced to clutching at straws. END

Comments
gpuccio, Given our shared interest in the evolution of 'novel' protein folds, I thought you might find this recent paper interesting: Large-scale determination of previously unsolved protein structures using evolutionary information This work demonstrates that there are many proteins that lack any homology that is detectable at the simple protein-sequence-alignment level, but in fact have similar 3D structures. Also, once you start thinking about protein evolution in the context of co-evolution of distant residues that make contacts, the multi-dimensional search space is not so sparse as one might think and it is, in fact, rather well interconnected. Enjoy.DNA_Jock
November 21, 2015, 02:29 PM PDT
DNA_Jock: The discussion continues to be very good. I would very much like to leave to you the last word, but at least a few simple comments are due. For the "Texas Sharp-Shooter problem", I think that I have discussed that aspect in my posts 146 and 149 here. If you like, you could refer to them The discussion about Elizabeth's post, if I remember well, was "parallel": I posted here and my interlocutors posted at TSZ. I have nothing against posting at TSZ (I have done that, or at least in similar places. more than once time ago). However, I decided some time ago to limit my activity to UD: it is already too exacting this way. However, my criticism to Lizzie's argument is very simple: it is an example of intelligent selection applied to random variation. It is of the same type of the Weasel and of Szostak's ATP binding protein. You see, I am already well convinced that RV + IS can generate dFSCI. It is the bottom up strategy to engineer things. So, I have no proble with Lizzie's example, example for its title: "Creating CSI with NS" That is simply wrong. "Creating CSI with design by IS" would be perfectly fine. Your field seems to willfully ignore the difference between NS and IS. It is a huge difference. IS requires a conscious intelligent agent who recognizes some function as desirable, sets the context to develop it, can measure it at any desired level, and can intervene in the system to expand any result which shows any degree of the desired function. IOWs, both the definition of the function, the way to measure it, and the interventions to facilitate its emergence are carefully engineered. It's design all the way. On the contrary, NS assumes that some new complex function arises in a system which is not aware of its meaning and possibilities, only because some intermediary steps represent a step to it, and through the selection of the intermediary steps because of one property alone: higher reproductive success. So, I ask a simple question: what reproductive success is present in Lizzie's example? None at all. It's the designer who selects what he wants to obtain. The property selected has no capability at all to be selected "on its own merits". Therefore, Lizzie's example has nothing to do with NS. I am certain of Lizzie's good faith. I have great esteem for her. I am equally certain that she is confused about these themes. I don't remember ever having discussed PDZ. However, I understand your discussion about neighbours, and if that's what you mean by multidimensional I have no problem with that. My point remains that the only way to judge if a sequence is a neighbour of another one, in the absence of any other observable, is sequence similarity. I can accept strucure similarity as a marker of "neighbouroodness" even in absence of detectable sequence similarity. But in the absence of both similarities, I maintain that two sequences should be considered as unrelated. Which does not mean that one cannot be derived from the other. But it means that it is distant from the other, and therefore the probability of reaching it in a random walk is in the range of the lower probability states. I am well aware that the target space is not a single sequence, but a set of sequences. That's what we call "the functional set", whose probability is approximately measured by the target space/search space ration. That's why I usually refer to Durston's results as a measure of the functional complexity. 
In the case of ATP synthase, my 1600 bits derive from a consideration only of the AA positions which have been conserved throughout natural history. If I had used Durston's method, I would have got a higher value of complexity. Again, I am using a lower threshold of functional complexity for that molecule. About optimization, do you agree that ATP synthase seems to have been highly optimized already in LUCA? About TFs, I respect your idea that "the control of transcription in eukaryotes shows all the hallmarks of being a rather inefficient series of kludges cobbled together", but I strongly disagree. I think it shows all the hallmarks of being an extremely efficient combinatorial system of which we still understand almost nothing. Different scientific epistemologies often entail different interpretations. We will see, in time, who is right.gpuccio
November 3, 2014, 07:12 AM PDT
KF, because I think closing comments is an act of censorship, is cowardly, and is more reflective of "wanting to keep a clean copy" than of wanting a substantive exchange, I clicked through to your post but did not read it. I recommend others do the same until KF stops trying to bully the message and reopens the debate. 'Bydand' indeed.Rich
November 3, 2014, 06:48 AM PDT
Kf, Great to know that you have not forgotten this thread. You have already admitted that you assume independence and that this assumption is incorrect (“in info contexts reduce info capacity”), but you have asserted that this error is “not material”. How big is the error? How do you know? Please be as precise and concise as you can.DNA_Jock
November 3, 2014, 06:00 AM PDT
F/N: I have responded on record to recent comments on islands of function and the like, as well as abusive tactics, here. KFkairosfocus
November 3, 2014, 05:07 AM PDT
What evidence demonstrates that blind and undirected chemical processes can produce transcription factors on a world that didn't have any? (hint- there isn't any)Joe
November 3, 2014, 03:54 AM PDT
What evidence demonstrates that blind and undirected chemical processes can produce DNA on a world that didn't have DNA? (hint- there isn't any) What evidence demonstrates that blind and undirected chemical processes can produce transcription and translation on a world that didn't have it? (hint- there isn't any)Joe
November 3, 2014, 03:53 AM PDT
Correction: PDZ has only 2.456 million 2-step neighbors.DNA_Jock
November 3, 2014, 03:48 AM PDT
Gpuccio, I have enjoyed our chat. As I stated at the outset, I think that the origins of large protein folds is one of the most interesting challenges facing the MES. Unfortunately, I have yet to read an ID-proponent who avoids the sin of over-stating and over-simplifying their case. This detracts from the force of their arguments. Random Minor Points 1) While admitting that it is “not really relevant to my discussion”, you bring up “Functional Specification”. I have yet to see a “Functional Specification” that avoids the Texas Sharp-Shooter problem; Hazen came closer than any IDist. 2)
I know very well that neo darwinism is “short on supporting facts”. Very short, I would say. That’s exactly its problem. And if it were good at making “testable predictions” (possibly not self-referential), then it would be no more “short”. How can a prediction be “tested”, if not by observed facts?
I think we agree that common descent is EXTREMELY well supported. The question is whether intervention is required; here, parsimony applies. MES makes lots of testable predictions; it is subject to disconfirmation daily. 3) Irreducible complexity, per Behe, assumes that the final step in construction of a system is the addition of an element. ‘Nuff said. 4)
I don’t know why you link that post by Elizabeth. I have discussed it in detail time ago. I can do that again, if you like. It’s a good example of bad reasoning.
Let’s continue that discussion there, then. Did you use a different “handle” at TSZ? 5)
DNAJ wrote: All outcomes, all sequences, are the results of stochastic processes. GP: Are you saying that the result of a computer computation which gives a specific mathematical outcome to a specific mathematical problem (even a simple one, like 2+2) is not different from the results of coin tossing? 4 as the result of the computation of 2+2 is “stochastic” as the sequence 0010111010 (let’s say derived from a fair coin tossing)? Is that your point? Please, clarify.
I was talking in the context of DNA or protein sequences. As I said earlier in the quoted post, “For many, many processes the stochastic element is virtually undetectable, and can safely be ignored. Thus we can safely ignore the stochastic effects of QM for almost all biological systems.” If you add 2+2 on a computer, the result will be 4 almost all of the time. If you use a parity check, then the risk of error goes down even further. To steal from Orwell: “All processes are stochastic, but some are more stochastic than others.” 6)
At least one of the best supporters of ID here is a YEC. I respect him deeply, and he is perfectly correct in all his ID arguments, without mixing them with other types of arguments.
I am curious, to whom do you refer? The meat of the matter Where we disagree, AFAICT, is your insistence that RV and NS must be considered separately AND that no NS can act until there is a selective advantage that is a “fact”, meaning it has been demonstrated to be operative (and, you seem to imply, historically accurate?) by evidence that you personally find clear and convincing. I, OTOH, am willing to posit small selective advantages for simpler, poorly optimized polymers, and try to investigate what these rudimentary functionalities might look like. And the experimental data on protein evolution supports me here: in particular, Phylos Inc demonstrated that using libraries of sizes of ~ 10^13 (e.g. USP 6,261,804), you could evolve peptides that bound to pretty much ANYTHING. Unfortunately, I can’t get much more specific, but here’s a “statement against interest”: the libraries produced better binders if the random peptide was anchored by an invariant ‘scaffold’. They used fibronectin, but I suspect that a bit of beta sheet at each end of the random peptide would have done the trick. They also had a technical problem in optimizing catalysis, but that limitation would not apply in actual living systems. Dimensionality I explained to you, with supporting data, why “Take-home : there are lots of “unrelated” sequences that have probabilities above 1/n.” You replied, in part:
RV in the DNA sequence is not “multidimensional”, as far as I can understand.
Your understanding is lacking. How many different ways are there for a random walk to go from AAAAA to GTTCC? See how there are five dimensions? Of course, no sane person looks at distantly related-or-unrelated proteins and compares the cDNA sequences to determine whether they are related. Everyone, including yourself, looks at the protein sequence. There are 20 possible attribute levels at each position, and the multi-dimensional nature of the space becomes more obvious. Thus when you discussed the amazingly unlikely, 20^-83, nature of PDZ in light of McLaughlin, you did not appear to notice that about 90% of PDZ’s 1,577 nearest neighbors retain partial function. So there are 1,419 different final steps could reach PDZ. To a first approximation, we might estimate that 80% of the sequences that are two steps away would have partial function: 80 % of 2,484,564 is 1,987,651 sequences that are only two steps away. Three steps away… The point being, in a multi-dimensional space, you have a lot of neighbors. Modulators of Gene Expression To recap our discussion of transcription factors:
GP: Third, individual short modules, very often, while functional in a greater context (the long proteins), would not be naturally selectable by themselves (or, again, we have no evidence for that). DNAJ : “Well, it could itself bind to DNA, and thus sterically exclude RNA polymerase from binding. Kd for the monomer is much higher, but it still exists. And any dimerization will produce a protein dimer that recognizes a palindrome with far lower Kd, purely as a consequence of C2 symmetry. Call me biased, but I think modulation of transcription is pretty useful.” GP: Useful is not the same as naturally selectable. And again, where are evidences? Facts?
And in parallel
GP: Second…But just think of transcription factors, and all the part of their sequence (usually the longest part) which is not a DBD. DNAJ “In my youth, I spent a lot of time thinking about transcription factors. You are assuming that they need to be that long in order to carry out their function. Furthermore, you are assuming that their progenitor had to be that long and complicated to perform the progenitor’s function. If you want to talk about facts, they are against you here.” GP: Why? You know as well as I do that the function of a TF is not linked only to the conserved DBDs. TFs act combinatorially at the enhanceosome, interacting one with many others after having been bound to DNA. There can be no doubt that other parts of the molecule are extremenly relevant to the regulatory activity, to its combinatorial complexity, and therefor to the final result. And you certainly know that the epigenetic details of that regulatory combinatorial activity may be very different in different species and contexts, even for the same TF. These are facts. I don’t understand your statement
Oh dear. Some TFs interact combinatorially; some genes are subject to epigenetic modulation. I am very familiar with examples of both (although I did get the mechanism of epigenetic regulation almost completely wrong, oops). But we are talking about what is the minimal functionality that could be selectable. A helix-turn-helix motif is enough. Add dimerization to square its effectiveness. Add a short amphipathic helix or a short negatively charged random peptide and you’ve turned a repressor into an activator. Your fallacy is to look at an optimized system, and say “it must have been this good to be selectable”. Some modulators of gene expression are very simple. IMO, the control of transcription in eukaryotes shows all the hallmarks of being a rather inefficient series of kludges cobbled together. If you want to see a system that is REALLY optimized, check out bacteriophage. THEY are the pinnacles of evolution (or of creation, if you prefer). Why should that be the case?DNA_Jock
November 2, 2014, 12:13 PM PDT
DNA_Jock: 1) I can call the process random or stochastic, but my definition remains the same. For all practical purposes, I think we can maintain the difference between systems where the evolution of the system can be described by necessity laws, and systems which are best described by probability distributions. Of course, many systems are mixed, and yet in modeling them we distinguish the two approaches. I have not really used "random" as applied to the sequences in my reasoning. For me, the only difference between sequences is if they can implements some defined function or not. Please, refer to this OP of mine for my definition of functional information: https://uncommondescent.com/intelligent-design/functional-information-defined/ Another matter is the presence of some order in a sequence. That os not really relevant to my discussion. However, to sum up, I think we have at least 3 types of strings: a) Completely undetermined (what you could call "random"): no special order, and no special complex function which can be defined for the string (simple functions can always be defined). That would describe almost all the strings of a certain length. b) Functionally specified: a complex function can be defined for the string, which requires a high number of specific bits of information to be implemented. That is the case for language, software and protein genes, in most cases. These kind of sequences can be almost identical, in form, to the first type, but they do implement a complex function. Only the functional specification really distinguishes them from the first type. c) Ordered strings. They can be specified because of their order, but they can often be explained by some necessity mechanism in the system. Most biological strings are not of this type. 2) You say: "P0m then replicates, creating “P1” whose distribution of mutations IS dependent on function." I don't understand what you mean. Are you assuming a selection effect? 3) You say: "No my statement is accurate and relevant. Your statement that “by variations in the sequence, which are independent from the final sequence” is incorrect. The variations in the sequence are stochastic, but they are NOT independent of the final sequence. They are correlated with the final sequence. As I said, it is a subtle point: I will try to explain: imagine a stochastic process A that leads to stochastic outcomes B. We can say “B is affected by A”, “A is not influenced by B”, but we CANNOT say “A is independent of B”, because the two things are in fact correlated. The direction of the causation arrow does not matter, as far as the correlation is concerned." I am not sure I understand what you mean, but I will try to explain better what I mean. RV can be modeled, but we cannot include a specific function in that modeling. IOWs, the probability distribution of the random events which can take place is not biased towards any specific function. Do you agree on that? And if you agree, then what is your point? Remember, I am not including NS at this level of the reasoning. That statement is about RV alone. 4) You say: "This is a root cause of our disagreement; I think we disagree about inference to best explanation and burdens of proof. I will note that your reference to faith in magic fairies strikes me as the sort of taunt a “football fan” would make, and it disappoints me." Sorry to disappoint you, but maybe I am a "football fan" of my philosophy of science and epistemology. One thing is respecting your views and being open to discussion about them. 
Another thing is agreeing with your epistemology. I do believe that the current scientific thought is ideologically biased. I cannot pretend that I think differently. 5) You say: "The early returns show that the time-to-no-discernable-similarity is one quarter of the time-to-no-correlation. Thus for every sequence with discernable similarity, there will be three “invisibly related” sequences. I think that this ratio will only get larger as dimensionality increases. Take-home: there are lots of “unrelated” sequences that have probabilities above 1/n." I am not sure that I follow all the aspects of your argument, but for the moment I will accept your conclusions. But let me understand better. RV in the DNA sequence is not "multidimensional", as far as I can understand. So, if we reason only at the sequence level, and if we look at sequences as they are, they are either related (at some threshold we can decide) or unrelated. That is the only empirical assessment we can make. Your reasoning only tries to show that it is possible that some sequences are nearer in the phase space than they appear. But if we cannot identify them, how does that help your reasoning? The important point is that the variation at sequence level has nothing to do with function, unless you factor selection. The problem is not if a sequence could derive from another (in principle, any sequence can derive from another). The problem is much simpler. It is: if n sequences can derive from A through a random walk, and if the number of sequences which implement function X is only 1:10^500 of n (just to make an example with about 1600 bits of functional information, the value I assume for those famous two sequences of ATP synthase), and if function X emerges at point t1 and was not there before, how can you explain that one of the sequences of the tiny functional space emerges? The "function which is not yet there" cannot in any way "favor" some variations instead of others. You see, I am not interested in showing that protein B is not derived from sequence A. I believe that it derives from A. But I believe that the derivation is designed, guided by the conscious representation of the function X. The simple point is that no random walk will ever generate ATP synthase, generating the 1600 bits which are necessary for it to work. Even if you could demonstrate some "invisible derivation" from an undefined precursor that nobody knows (and I really cannot understand how you can hope to support such a theory empirically), still you have done nothing to explain the specific functional configuration of 1600 bits that accomplishes the function, and which did not exist before, and which has been conserved by negative selection for billions of years. 6) You say: "In my youth, I spent a lot of time thinking about transcription factors. You are assuming that they need to be that long in order to carry out their function. Furthermore, you are assuming that their progenitor had to be that long and complicated to perform the progenitor’s function. If you want to talk about facts, they are against you here." Why? You know as well as I do that the function of a TF is not linked only to the conserved DBDs. TFs act combinatorially at the enhanceosome, interacting one with many others after having been bound to DNA. There can be no doubt that other parts of the molecule are extremenly relevant to the regulatory activity, to its combinatorial complexity, and therefor to the final result. 
And you certainly know that the epigenetic details of that regulatory combinatorial activity may be very different in different species and contexts, even for the same TF. These are facts. I don't understand your statement. 7) You say: "You have a strange view of Science. Working hypotheses may be short on supporting facts today. The key thing is that they make testable predictions." You have a strange view of science. Hypotheses are born from observed facts. That's what they try to explain. I have no reason to understate the importance of predictions.But theories are born from existing facts, and when possible confirmed by predictions about new facts which may be observed after. However, any fact, after having been observed, becomes an observed fact. Therefore, in the end, theories are supported only by observed facts. I know very well that neo darwinism is "short on supporting facts". Very short, I would say. That's exactly its problem. And if it were good at making "testable predictions" (possibly not self-referential), then it would be no more "short". How can a prediction be "tested", if not by observed facts? 8) You say: "As I thought, this is an argument by analogy that has been demolished previously, including in Kitzmiller." I don't agree. It's an argument by analogy, like many of the best arguments in science and in human cognition. But I don't think it has ever been "demolished". Maybe if you give the details of the demolition argument, we can discuss it. 9) You say: "I agree that it is “telling” that people at UD cannot even understand a simple toy example." I don't know why you link that post by Elizabeth. I have discussed it in detail time ago. I can do that again, if you like. It's a good example of bad reasoning. 10) You say: "As you noted above, a lot of people on UD read the word “random” and think that it means “uniform probability distribution”. I would recommend that you use the word “stochastic”, which does not suffer from this risk of misinterpretation." The misinterpretation derives only by a non correct understanding of probability theory. There is no reason to change terms for that. 11) You say: "Your definition of “random” leads to a problem when used to describe an outcome, however : it is completely without meaning. All outcomes, all sequences, are the results of stochastic processes." What do you mean? Are you saying that the result of a computer computation which gives a specific mathematical outcome to a specific mathematical problem (even a simple one, like 2+2) is not different from the results of coin tossing? 4 as the result of the computation of 2+2 is "stochastic" as the sequence 0010111010 (let's say derived from a fair coin tossing)? Is that your point? Please, clarify. 12) You say: "I too answer for myself, and I recognize that that is all anyone can do. However, a key aspect of scientific debate is the willingness to attack the arguments of people with whom you concur with, if anything, greater aggression than you attack the arguments of those you disagree with. My perception may be biased, but at UD I observe a reluctance to ‘rock the boat’ by addressing differences between ID advocates. Different posters will make comments that are mutually contradictory, yet there is a strange, unscientific, reluctance to address these discrepancies. In that sense, I am afraid that UD does look like a political party, trying to keep both believers in common descent and YECs under the same big tent. 
I would hold you in even higher esteem if you were more willing to correct other regulars here when they say things that you know make no sense , such as kf’s probability calculations. But I understand the reluctance." I will be very clear. I am here to defend a scientific theory (ID) which is for me very important and very true. I enter debate with people, like you, who think differently because that is my purpose here. When I think it is necessary, I clarify important differences in what I think in respect to what others in the ID field think. I have many times defended Common Descent against many friends here, for the simple reason that I consider it a very good scientific explanation of facts. In the same way, I have many times expressed differences in respect to some ideas of Dembski (a thinker that I deeply respect and love), especially his "latest" definition of specification. I respect very very much Behe and practically all that he says, but still I disagree with the final part of TEOE, where he apparently makes a "TE" argument. And so on. But, certainly, I don't consider my duty to attack my fellow IDists whenever I don't agree with some statement they make. That is not my role, and not the reason why I come here. In particular, I have always been clear that I consider any reference to religion and specific religious beliefs extremely out of context in a scientific discussion. That's why I strictly avoid those aspects in my reasonings. However, it is perfectly fine to discuss those things in more general posts (however, I usually avoid doing that too). I am not a YEC, and I don't approve Creation Science, but I respects both positions as faith motivated, because I have deep respect for the faith choices of everyone, including atheists and darwinists :) . However, I don't consider those positions which are explicitly motivated by faith as scientifically acceptable positions. At least one of the best supporters of ID here is a YEC. I respect him deeply, and he is perfectly correct in all his ID arguments, without mixing them with other types of arguments. That's all I can say.gpuccio
November 1, 2014, 03:38 PM PDT
Gpuccio, I like this discussion too; you make rational arguments.
First of all, I give you my explicit definition of “random”. I use that term in one sense only. Random is any system which, while evolving according to necessity laws (we are not discussing quantum systems here) can best be described by some appropriate probability distribution. IOWs, a random system is a necessity system which we cannot describe with precision by its necessity laws, usually because too many variables are involved. Let’s go on.
I would call your “random process” a “stochastic process”. I would call your “random sequence” an “undetermined sequence”. Of course, pretty much any sequence would be “random” under your usage, which does reduce the usefulness of the term somewhat. And it is my belief (albeit a rather personal, idiosyncratic one) that ALL processes are stochastic, just to a greater or lesser degree. For many, many processes the stochastic element is virtually undetectable, and can safely be ignored. Thus we can safely ignore the stochastic effects of QM for almost all biological systems.
[snip] That’s why I refer “random” to the random walk if not modified by selection. I only use the term random in the “RV” part of the algorithm. It mean that the variation is random, because we cannot describe it by a strict necessity law, but we can use a probabilistic approach. Some people (many, unfortunately) erroneously think that random means some system showing an uniform probability distribution, but that is not your case, I think. So, please always refer to my initial definition of a random system. To be more clear, I will explicitly define RV: In a biological system, RV is any variation oh which we cannot describe the results with precision, but which can be well modeled by some probability distribution. The neo darwinian model is a probabilistic model which includes a necessity step, NS. The model is also sequential, because it hypothesizes that NS acts on the results of RV, modifying the probabilistic scenario by expanding naturally selectable outcomes. But RV and NS always act in sequence, and that’s why we can separate them in our reasoning and modeling. DNAJ wrote “The variations that are introduced during any iteration are “random” with respect to function. However, the output from any iteration (and therefore the input to all subsequent iterations) is strongly dependent on function.” I don’t agree. See my definition of “random system”. If you have a different definition, please give it explicitly. I find your use of “random” rather confusing here. First of all, what do you mean by “iteration”?
“Iteration” refers to a single act of replication. Maybe easier to think of as a single generation, although this would underestimate the number of iterations. A population “P0” suffers mutation(s) that are random-with-respect-to-function, creating “P0m”. P0m then replicates, creating “P1” whose distribution of mutations IS dependent on function.
My way to describe RV is: “The variation events that happen at any moment are random because their outcome can best be described by a probabilistic approach.” For example, SNP are random because we cannot anticipate which outcome will take place. Even if the probabilities of each event is not the same, the system is random just the same. In general, we can assume an uniform distribution of the individual transitions for practical purpose. And of course other variation events, like deletion and so on, have not the same probability as a general SNP. But the system of all the possible variations remains random (unless the variations are designed ). Function has nothing to do with this definition.
I would use the word stochastic; I agree that modeling the individual transitions as uniform p is okay for practical purposes, although you might want to distinguish transitions from transversions. Modeling indels and recombination is tougher, of course.
DNAJ wrote “It’s a subtle point, but technically, your statement “by variations in the sequence, which are independent from the final sequence” is also incorrect: “independent” has a specific meaning, and the variations introduced (random as they are wrt function) are in fact correlated with the final sequence that is reached. This confusion may be caused by your Texas Sharpshooter Fallacy, as seen in the phrase “which will be reached” – there is not a unitary final sequence, rather there is a set of possible sequences, one of which we observe.” No, you are confused here.
No my statement is accurate and relevant. Your statement that “by variations in the sequence, which are independent from the final sequence” is incorrect. The variations in the sequence are stochastic, but they are NOT independent of the final sequence. They are correlated with the final sequence. As I said, it is a subtle point: I will try to explain: imagine a stochastic process A that leads to stochastic outcomes B. We can say “B is affected by A”, “A is not influenced by B”, but we CANNOT say “A is independent of B”, because the two things are in fact correlated. The direction of the causation arrow does not matter, as far as the correlation is concerned.
I will try to be more clear. What I mean is that the outcome of random variation can best be explained by a probabilistic approach, and the probabilistic description of that outcome has nothing to do with any functional consideration. Can you agree with that simple statement?
For a single iteration I agree. For two or more iterations, I disagree.
DNAJ wrote: “One can model drift as a random walk, agreed, but any application of selection, however slight, wrecks the RW model.” I have never said anything different. But RV and selection act sequentially. So I can apply the random walk model to any step which does not include selection. And I will accept NS only if it is supported by facts, not as a magic fairy accepted by default because of a pre commitment based on personal (or collective) faith.
This is a root cause of our disagreement; I think we disagree about inference to best explanation and burdens of proof. I will note that your reference to faith in magic fairies strikes me as the sort of taunt a “football fan” would make, and it disappoints me.
DNAJ wrote “Furthermore, the probability for a random walk occupying position x only reaches 1/n after the string has been completely randomized, which will take a very long time. By way of illustration, the probability that a random walk will occupy positions closer to the starting point remains higher than 1/n even after every ‘monomer’ has been mutated, on average, five times in every member of the population. (After 5 an average of mutations per monomer, there is a 0.6% chance that any individual monomer is still unmutated; so for a 100 amino acid domain, that’s a half chance of still retaining an original amino acid in the sequence; it ain’t scrambled yet, but there has not been any discernable similarity for a while…)” That’s why I say that 1/n is the highest probability we can assume for an unrelated state. Obviously, related states are more likely. What are you assuming here, that all proteins are related although we have no way of detecting that? Faith again? The fact remains that a lot of protein sequences are unrelated, as far as we can objectively judge. Therefore, we must assume 1/n as the high threshold of probability in a random walk from an unrelated precursor to the. Please, let me understand how a single precursor, related to all existing proteins, could have generated the many unrelated existing proteins by individual random walks, each with probability higher than 1/n. Is that what you are really suggesting? Seriously? DNJA wrote
Here’s where a second equivocation causes problems. You appear to be using the word “related” to mean both “having discernable similarity” and “sharing a common ancestor”.
No. I only mean “having discernable similarity”. I am an empirical guy. “Having discernable similarity” is an observable, a fact. “Sharing a common ancestor” is an hypothesis, a theory. If we have not the fact, we cannot support the hypothesis (unless we can observe or infer the process in other ways). IOWs, if we have no discernable similarity, there is no reason to believe that two sequences share a common ancestor (unless you have other independent evidence, based on facts. Observables. My categories are very simple, and must remain simple. DNAJ wrote “I had great difficulty trying to understand this sentence. Firstly, we are actually considering two or more extant sequences, that share no apparent similarity. Absent a time machine, we cannot directly access the ancestral sequences. You appear to be saying that, because they share no apparent similarity today, the only reasonable hypothesis is that they emerged from “unrelated sequences” 2 billion years ago, meaning sequences that had no discernable similarity 2 billion years ago. I think that a reasonable hypothesis might be that the intervening 2 billion years has buried any detectable similarity signal.” No. If you have read the previous point, maybe you can understand better what I mean. “Reasonable” here mean “scientifically reasonable”, IOWs justified by facts. So, the meaning is very simple: we observe two unrelated sequences, and we have no facts that support the existence of a common ancestor. Therefore, the reasonable scientific hypothesis os that they share no common ancestor. You can certainly hypothesize that they shared a common ancestor and that “the intervening 2 billion years has buried any detectable similarity signal”, but that is not science, unless you have independent empirical support. It is not scientifically reasonable to hypothesize at the same time that something is true but that we cannot have any evidence of it.
Aha! I was wrong about the nature of your equivocation. My apologies. The problem is rather with the implications you make from the word “unrelated”. Given your strict use of the word “related” to mean “having discernable similarity”, then “unrelated” refers to all sequences that lack a discernable similarity, however near or far they may be from the test sequence. Under your usage, many “unrelated” sequences have probabilities that are far HIGHER than 1/n. I had originally assumed that when you said 1/n was the “upper bound” for “unrelated sequences”, that was a typographical error, and you actually meant “lower bound”. My bad. Think about it this way: there are three categories of sequence : 1) “related”, meaning having discernable similarity 2) “totally unrelated”, meaning having a probability of less than 1/n 3) “invisibly related”, meaning having some proximity in the phase space, but this proximity is too low to be a reliable indicator of any (ancestral) relationship My paragraph that you quoted above (“at an average of 5 mutations per monomer, there is still a 50% chance that a 100 amino acid domain retains an original amino acid”) was my quick-and-dirty way of trying to point out the importance of this third category, just using the Poisson distribution. I now realize that the existence of this third category is probably a bone of contention, so I did a little modeling. Last year, as part of my job, I wrote a Gibbs sampling routine to allow me to optimize a set of ten parameters; today it was a simple matter to turn OFF the ‘oracle’, so that the code performs a truly random walk in the ten-dimensional space. I then asked how many generations does it take for a random walk to wander outside of the “discernable similarity” space (I used p=0.05 as my criterion here), and how many generations does it take for the same walk to reach a point where its position is fully uncorrelated with its starting position, i.e. the point when the walk first crosses the dividing line separating the half of the phase space that is closer to the starting point and first explores the “more distant half”. The early returns show that the time-to-no-discernable-similarity is one quarter of the time-to-no-correlation. Thus for every sequence with discernable similarity, there will be three “invisibly related” sequences. I think that this ratio will only get larger as dimensionality increases. Take-home: there are lots of “unrelated” sequences that have probabilities above 1/n.
DNAJ wrote “You do realize that you can string simple short proteins together to produce longer, more complicated proteins? I’m glad that you don’t infer design from them, I guess, but this concession alone pretty much torpedoes your argument.” Some proteins use simpler modules. That does not torpedo anything. First of all, many of those basic modules are complex enough. Second, proteins have many long parts which are certainly functional and are not explained as the sum of simple modules. ATP synthase subunits are again a good example. But just think of transcription factors, and all the part of their sequence (usually the longest part) which is not a DBD.
In my youth, I spent a lot of time thinking about transcription factors. You are assuming that they need to be that long in order to carry out their function. Furthermore, you are assuming that their progenitor had to be that long and complicated to perform the progenitor’s function. If you want to talk about facts, they are against you here.
Third, individual short modules, very often, while functional in a greater context (the long proteins), would not be naturally selectable by themselves (or, again, we have no evidence for that). DNAJ wrote “Well, it could itself bind to DNA, and thus sterically exclude RNA polymerase from binding. Kd for the monomer is much higher, but it still exists. And any dimerization will produce a protein dimer that recognizes a palindrome with far lower Kd, purely as a consequence of C2 symmetry. Call me biased, but I think modulation of transcription is pretty useful.” Useful is not the same as naturally selectable. And again, where are evidences? Facts?
This makes no sense to me whatsoever.
DNAJ wrote “Yup, pretty much, but without the unnecessary value judgments. I am imagining some pre-LUCA’s who synthesized short alpha helices, beta sheets, helix-turn-helix motifs, etc., etc. and concatenations thereof, selected in name of some minimal selective advantage, such as DNA binding motifs which inhibit transcription, until they were out-competed by their fractionally less hopeless cousins. But, unfortunately, all the selection and extinction that has occurred since that time has badly fogged our view of this era. I doubt that we will ever know the historical truth about how early proteins did emerge, but – thanks to in vitro protein evolution studies – we do know a fair amount about what is feasible.” Again, refer to my discussion about what science is. Science is not about what is possible, but about what is supported by facts.
You have a strange view of Science. Working hypotheses may be short on supporting facts today. The key thing is that they make testable predictions.
DNAJ wrote “Please provide support for this assertion. Please be very precise.” It’s simple. The assertion is: “Complex functional structures cannot be deconstructed iinto simple naturally selectable steps. It’s as simple as that”. We have tons of examples of complex functional structures. In machines, in software, in language, in proteins and other biological machines. In all examples of complex functional information, the global function is the result of an intelligent aggregation of bits of information which, in themselves, cannot give any incremental function. No complex function cab be deconstructed into a simple sequence of transformations, each of them simple enough to be generated by random variation, each of them conferring higher functionality in a linear sequence. That is as true of software as it is true of proteins. There is no rule of logic which says that I can build complex functions by aggregating simple steps which are likely enough to be generated randomly.
As I thought, this is an argument by analogy that has been demolished previously, including in Kitzmiller.
Please, show any deconstruction of that kind for one single complex protein. Then show that there are reasons to believe that such a (non existent) case can be the general case. Then we can discuss NS as a scientific theory supported by facts. Examples exist that model how error-prone replication combined iteratively with selection can achieve results that ID-proponents claim cannot be achieved within the lifetime of the universe. Some rudimentary ones have been discussed on UD, but the discussions here shed more heat than light, sadly. To what are you referring? And I notice the “rudimentary” in your statement. Telling, isn’t it?
Well the “rudimentary” one was Weasel. I agree that it is “telling” that people at UD cannot even understand a simple toy example. I found the discussion http://theskepticalzone.com/wp/?p=576 more interesting…
[snip] I have given my definitions. In a discussion with me, please stick to them (or criticize them).
As you noted above, a lot of people on UD read the word “random” and think that it means “uniform probability distribution”. I would recommend that you use the word “stochastic”, which does not suffer from this risk of misinterpretation. Your definition of “random” leads to a problem when used to describe an outcome, however : it is completely without meaning. All outcomes, all sequences, are the results of stochastic processes.
DNAJ wrote “And get your colleagues to do the same.” I answer for myself. Like many other interlocutors, you have a strange idea of scientific debate. I have no colleagues. So should you. We are intelligent (I hope ) people, sharing some ideas and not others. And we discuss. This is not a political party, or a fight between football fans.
I too answer for myself, and I recognize that that is all anyone can do. However, a key aspect of scientific debate is the willingness to attack the arguments of people with whom you concur with, if anything, greater aggression than you attack the arguments of those you disagree with. My perception may be biased, but at UD I observe a reluctance to ‘rock the boat’ by addressing differences between ID advocates. Different posters will make comments that are mutually contradictory, yet there is a strange, unscientific, reluctance to address these discrepancies. In that sense, I am afraid that UD does look like a political party, trying to keep both believers in common descent and YECs under the same big tent. I would hold you in even higher esteem if you were more willing to correct other regulars here when they say things that you know make no sense , such as kf's probability calculations. But I understand the reluctance.DNA_Jock
November 1, 2014, 01:37 PM PDT
Alan: It is not for me, and never has been. ID is about ideas, scientific ideas. The rest is not important.gpuccio
November 1, 2014, 11:06 AM PDT
This is not a political party, or a fight between football fans. Well, it shouldn't be.Alan Fox
November 1, 2014, 03:36 AM PDT
DNA_Jock: Again, thank you for the comments. I like this discussion, because your arguments are good and clear and pertinent. So, let's go on. First of all, I give you my explicit definition of "random". I use that term in one sense only. Random is any system which, while evolving according to necessity laws (we are not discussing quantum systems here) can best be described by some appropriate probability distribution. IOWs, a random system is a necessity system which we cannot describe with precision by its necessity laws, usually because too many variables are involved. Let's go on.
And on this point I agree with you. However, your choice of superfamilies made a number of your statements regarding ‘relatedness’ factually incorrect. Let’s aim for precision in thought and language.
Well, I am happy we have clarified. I like precision, but please admit that this is a general blog where I was trying to make a general discussion available to all. Sometimes you have to make some tradeoff between precision and simplicity of language.
Actually, that is the point. The sequence itself is NOT “random”, in the mathematical sense.
That's why I apply "random" to the random walk if it is not modified by selection. I only use the term random in the "RV" part of the algorithm. It means that the variation is random, because we cannot describe it by a strict necessity law, but we can use a probabilistic approach. Some people (many, unfortunately) erroneously think that random means some system showing a uniform probability distribution, but that is not your case, I think. So, please always refer to my initial definition of a random system. To be more clear, I will explicitly define RV: in a biological system, RV is any variation whose results we cannot describe with precision, but which can be well modeled by some probability distribution. The neo darwinian model is a probabilistic model which includes a necessity step, NS. The model is also sequential, because it hypothesizes that NS acts on the results of RV, modifying the probabilistic scenario by expanding naturally selectable outcomes. But RV and NS always act in sequence, and that's why we can separate them in our reasoning and modeling.
The variations that are introduced during any iteration are “random” with respect to function. However, the output from any iteration (and therefore the input to all subsequent iterations) is strongly dependent on function.
I don't agree. See my definition of "random system". If you have a different definition, please give it explicitly. I find your use of "random" rather confusing here. First of all, what do you mean by "iteration"? My way to describe RV is: "The variation events that happen at any moment are random because their outcome can best be described by a probabilistic approach." For example, SNPs are random because we cannot anticipate which outcome will take place. Even if the probabilities of the individual events are not the same, the system is random just the same. In general, we can assume a uniform distribution of the individual transitions for practical purposes. And of course other variation events, like deletions and so on, do not have the same probability as a general SNP. But the system of all the possible variations remains random (unless the variations are designed :) ). Function has nothing to do with this definition.
It’s a subtle point, but technically, your statement “by variations in the sequence, which are independent from the final sequence” is also incorrect: “independent” has a specific meaning, and the variations introduced (random as they are wrt function) are in fact correlated with the final sequence that is reached. This confusion may be caused by your Texas Sharpshooter Fallacy, as seen in the phrase “which will be reached” – there is not a unitary final sequence, rather there is a set of possible sequences, one of which we observe.
No, you are confused here. I will try to be more clear. What I mean is that the outcome of random variation can best be explained by a probabilistic approach, and the probabilistic description of that outcome has nothing to do with any functional consideration. Can you agree with that simple statement?
One can model drift as a random walk, agreed, but any application of selection, however slight, wrecks the RW model.
I have never said anything different. But RV and selection act sequentially. So I can apply the random walk model to any step which does not include selection. And I will accept NS only if it is supported by facts, not as a magic fairy accepted by default because of a pre-commitment based on personal (or collective) faith.
Furthermore, the probability for a random walk occupying position x only reaches 1/n after the string has been completely randomized, which will take a very long time. By way of illustration, the probability that a random walk will occupy positions closer to the starting point remains higher than 1/n even after every ‘monomer’ has been mutated, on average, five times in every member of the population. (After an average of 5 mutations per monomer, there is a 0.6% chance that any individual monomer is still unmutated; so for a 100 amino acid domain, that’s a half chance of still retaining an original amino acid in the sequence; it ain’t scrambled yet, but there has not been any discernable similarity for a while…)
That's why I say that 1/n is the highest probability we can assume for an unrelated state. Obviously, related states are more likely. What are you assuming here, that all proteins are related although we have no way of detecting that? Faith again? The fact remains that a lot of protein sequences are unrelated, as far as we can objectively judge. Therefore, we must assume 1/n as the upper threshold of probability in a random walk from an unrelated precursor to the functional state. Please, let me understand how a single precursor, related to all existing proteins, could have generated the many unrelated existing proteins by individual random walks, each with probability higher than 1/n. Is that what you are really suggesting? Seriously?
Agreed.
Re-agreed! :)
Here’s where a second equivocation causes problems. You appear to be using the word “related” to mean both “having discernable similarity” and “sharing a common ancestor”.
No. I only mean “having discernable similarity”. I am an empirical guy. “Having discernable similarity” is an observable, a fact. “Sharing a common ancestor” is a hypothesis, a theory. If we do not have the fact, we cannot support the hypothesis (unless we can observe or infer the process in other ways). IOWs, if we have no discernable similarity, there is no reason to believe that two sequences share a common ancestor (unless you have other independent evidence, based on facts. Observables.) My categories are very simple, and must remain simple.
I had great difficulty trying to understand this sentence. Firstly, we are actually considering two or more extant sequences, that share no apparent similarity. Absent a time machine, we cannot directly access the ancestral sequences. You appear to be saying that, because they share no apparent similarity today, the only reasonable hypothesis is that they emerged from “unrelated sequences” 2 billion years ago, meaning sequences that had no discernable similarity 2 billion years ago. I think that a reasonable hypothesis might be that the intervening 2 billion years has buried any detectable similarity signal.
No. If you have read the previous point, maybe you can understand better what I mean. "Reasonable" here means "scientifically reasonable", IOWs justified by facts. So, the meaning is very simple: we observe two unrelated sequences, and we have no facts that support the existence of a common ancestor. Therefore, the reasonable scientific hypothesis is that they share no common ancestor. You can certainly hypothesize that they shared a common ancestor and that "the intervening 2 billion years has buried any detectable similarity signal", but that is not science, unless you have independent empirical support. It is not scientifically reasonable to hypothesize at the same time that something is true but that we cannot have any evidence of it.
You do realize that you can string simple short proteins together to produce longer, more complicated proteins? I’m glad that you don’t infer design from them, I guess, but this concession alone pretty much torpedoes your argument.
Some proteins use simpler modules. That does not torpedo anything. First of all, many of those basic modules are complex enough. Second, proteins have many long parts which are certainly functional and are not explained as the sum of simple modules. ATP synthase subunits are again a good example. But just think of transcription factors, and the part of their sequence (usually the longest part) which is not a DBD. Third, individual short modules, very often, while functional in a greater context (the long proteins), would not be naturally selectable by themselves (or, again, we have no evidence for that).
Well, it could itself bind to DNA, and thus sterically exclude RNA polymerase from binding. Kd for the monomer is much higher, but it still exists. And any dimerization will produce a protein dimer that recognizes a palindrome with far lower Kd, purely as a consequence of C2 symmetry. Call me biased, but I think modulation of transcription is pretty useful.
Useful is not the same as naturally selectable. And again, where is the evidence? The facts?
Yup, pretty much, but without the unnecessary value judgments. I am imagining some pre-LUCA’s who synthesized short alpha helices, beta sheets, helix-turn-helix motifs, etc., etc. and concatenations thereof, selected in name of some minimal selective advantage, such as DNA binding motifs which inhibit transcription, until they were out-competed by their fractionally less hopeless cousins. But, unfortunately, all the selection and extinction that has occurred since that time has badly fogged our view of this era. I doubt that we will ever know the historical truth about how early proteins did emerge, but – thanks to in vitro protein evolution studies – we do know a fair amount about what is feasible.
Again, refer to my discussion about what science is. Science is not about what is possible, but about what is supported by facts.
Please provide support for this assertion. Please be very precise.
It's simple. The assertion is: "Complex functional structures cannot be deconstructed into simple naturally selectable steps. It’s as simple as that". We have tons of examples of complex functional structures: in machines, in software, in language, in proteins and other biological machines. In all examples of complex functional information, the global function is the result of an intelligent aggregation of bits of information which, in themselves, cannot give any incremental function. No complex function can be deconstructed into a simple sequence of transformations, each of them simple enough to be generated by random variation, each of them conferring higher functionality in a linear sequence. That is as true of software as it is true of proteins. There is no rule of logic which says that I can build complex functions by aggregating simple steps which are likely enough to be generated randomly. Please, show any deconstruction of that kind for one single complex protein. Then show that there are reasons to believe that such a (non existent) case can be the general case. Then we can discuss NS as a scientific theory supported by facts.
Examples exist that model how error-prone replication combined iteratively with selection can achieve results that ID-proponents claim cannot be achieved within the lifetime of the universe. Some rudimentary ones have been discussed on UD, but the discussions here shed more heat than light, sadly.
To what are you referring? And I notice the "rudimentary" in your statement. Telling, isn't it?
Some of your colleagues have yet to get the memo. Contrast the behavior of those who assert “such pathways cannot exist” with those who say “let’s investigate what would be required?”
I have always been very clear about my position. The second one.
I would humbly suggest that if you want to clarify what is the role of “random” in the ID reasoning, you make clear, on every occasion that the word is used, which of the various meanings you are ascribing to it:
“Unguided” = your “random walk”
“Unselected” = your “random” source DNA.
“Mathematically random” (i.e. unbiased, independent sampling) = kairosfocus’s erroneous assumption about strings.
I have given my definitions. In a discussion with me, please stick to them (or criticize them).
And get your colleagues to do the same.
I answer for myself. Like many other interlocutors, you have a strange idea of scientific debate. I have no colleagues. Nor should you. We are intelligent (I hope :) ) people, sharing some ideas and not others. And we discuss. This is not a political party, or a fight between football fans.gpuccio
November 1, 2014 at 1:13 AM PDT
DNA_Jock: I read only now your post. Thank you for the comments. Some brief answers: I don’t understand your argument about my choice of superfamilies. I have always said clearly that in SCOP there are at least 3 major groupings which can be used in a discussion: about 1000 basic foldings, about 2000 superfamilies, and about 4000 families. In the papers in the literature which deal with the problem of functional complexity or the appearance of new structures in natural history, any of them has been used. Some reason in terms of folds, others in terms of superfamilies, others in terms of families. I believe that superfamilies are the grouping which offers the best tradeoff in terms of detecting truly isolated groups. The family grouping probably includes some possibly similar sequences which are not the best example to discuss the problems of new functional complexity. The folding grouping is IMO too “high”, and exaggerates the aggregation of completely different structures. However, I am proposing a general reasoning here: the point is not whether we are debating a grouping of 1000, or 2000, or 4000. The point is that there are a lot of functional structures which appear isolated in all reasonable dimensions.
And on this point I agree with you. However, your choice of superfamilies made a number of your statements regarding ‘relatedness’ factually incorrect. Let’s aim for precision in thought and language.
You say: “I have not forgotten non-coding DNA. It still isn’t “random” in the mathematical sense. Consider the prevalence of di-, tri- and tetra-nucleotide repeats, or CpG bias. But thank you for bringing up transposons.” I love transposons. They are the future of ID. I think you miss the point. What is random is not the sequence itself. A non-functional sequence and a functional protein can have a similar quasi-random appearance, or share some regularities which have biochemical reasons and are not related to the function. That’s not the point. [emphasis added]
Actually, that is the point. The sequence itself is NOT “random”, in the mathematical sense.
The point is that the only way to go from one sequence to another one which are sequence unrelated, by means of random variation alone (bear with me about that for a moment, I will answer your other point), that is by variations in the sequence which are independent from the final sequence which will be reached (the functional sequence) can only be modeled as a random walk, where distant unrelated results have a similar probability of being reached, whose higher threshold is 1/n. So, it’s the walk which is random, and it’s the probability of a functional state of being reached through that random walk that we are computing.
I think that your error here may be caused by an equivocation of the word “random”. I am confident that this has been explained to you before, so I do not hold out much hope of succeeding where others have failed. But I’m an optimist, so…

The variations that are introduced during any iteration are “random” with respect to function. However, the output from any iteration (and therefore the input to all subsequent iterations) is strongly dependent on function. It’s a subtle point, but technically, your statement “by variations in the sequence, which are independent from the final sequence” is also incorrect: “independent” has a specific meaning, and the variations introduced (random as they are wrt function) are in fact correlated with the final sequence that is reached. This confusion may be caused by your Texas Sharpshooter Fallacy, as seen in the phrase “which will be reached” – there is not a unitary final sequence, rather there is a set of possible sequences, one of which we observe.

One can model drift as a random walk, agreed, but any application of selection, however slight, wrecks the RW model. Furthermore, the probability for a random walk occupying position x only reaches 1/n after the string has been completely randomized, which will take a very long time. By way of illustration, the probability that a random walk will occupy positions closer to the starting point remains higher than 1/n even after every ‘monomer’ has been mutated, on average, five times in every member of the population. (After an average of 5 mutations per monomer, there is a 0.6% chance that any individual monomer is still unmutated; so for a 100 amino acid domain, that’s a half chance of still retaining an original amino acid in the sequence; it ain’t scrambled yet, but there has not been any discernable similarity for a while…)
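The numbers in that last parenthesis can be checked with a short Poisson calculation; treating mutational hits per site as independent events with mean 5 is an assumption adopted here purely for illustration.

    from math import exp

    mean_hits = 5      # average mutations per monomer, as in the comment above
    L = 100            # length of the domain, in amino acids

    p_untouched = exp(-mean_hits)                 # ~0.0067: a given site still unmutated
    p_some_original = 1 - (1 - p_untouched) ** L  # ~0.49: at least one of the 100 sites untouched

    print(f"P(single site unmutated)       = {p_untouched:.4f}")
    print(f"P(at least one site unmutated) = {p_some_original:.2f}")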
You say: ““A primordial ATP binding protein”. I, too, am unsatisfied with this answer.” So am I. And remember, ATP synthase is just one example of many. Of many many.
Agreed.
You say: “As I understand this sentence, you are saying that if we cannot determine the source DNA for an event that took place 2 billion years ago, the only reasonable hypothesis is that the source DNA is completely unrelated to the DNA that is ancestral to any other extant DNA sequences. Weird.” Not weird at all. The point is, let’s say that we have A and B today, and that they are completely sequence unrelated, and structure unrelated, and function unrelated. Are you saying that it is reasonable to explain them both with a derivation from some unknown sequence, let’s call it X, which had sequence similarities to both, or structure similarities to both, or function similarities to both? Is that your proposal? Weird, really.
Here’s where a second equivocation causes problems. You appear to be using the word “related” to mean both “having discernable similarity” and “sharing a common ancestor”. The sentence I was replying to:
The fact remains, if a specific functional sequence emerges, say 2 billion years ago (for example, all the many new superfamilies and proteins in eukaryotes) and there is no trace of those sequences, foldings and functions before, the only reasonable hypothesis is that those sequences emerged from unrelated sequences at some time.
I had great difficulty trying to understand this sentence. Firstly, we are actually considering two or more extant sequences, that share no apparent similarity. Absent a time machine, we cannot directly access the ancestral sequences. You appear to be saying that, because they share no apparent similarity today, the only reasonable hypothesis is that they emerged from “unrelated sequences” 2 billion years ago, meaning sequences that had no discernable similarity 2 billion years ago. I think that a reasonable hypothesis might be that the intervening 2 billion years has buried any detectable similarity signal.
Do you really think that it is a credible scientific explanation for two completely different things to postulate an unknown source for both? Without any evidence at all that it exists?
Well, that’s a whole other conversation. Motes and beams, y’know. ;)
You say: ““Forms a stable alpha helix” is a useful function, from the protein’s perspective. ” But that is exactly the point. We cannot consider “the protein’s perspective” in the RV+NS algorithm. The only perspective there is the reproductive fitness of the replicator. So, unless you can show how the formation of a generic stable alpha helix can confer a reproductive advantage to some real biological being, the fact remains that a generic alpha helix cannot be naturally selected.
Protein X’s function is to be a convenient way of storing amino acids for future use, without raising osmolality prohibitively. The stable alpha helix reduces the risk of aggregation.
“Anyway, I also included the helix-turn-helix motif.” That is different. That is a 3d motif, which is functional. It is a simple one, about 20 AAs, so I would not choose it as an example of dFSCI. But what is your point? Of course there are simple functional motifs, and simple short proteins too which can be functional. That’s why I don’t use them as examples of dFSCI and I don’t infer design for them. [emphasis added]
You do realize that you can string simple short proteins together to produce longer, more complicated proteins? I’m glad that you don’t infer design from them, I guess, but this concession alone pretty much torpedoes your argument.
Moreover, that motif is a DNA binding motif (one of many) which is essential to functions of proteins interacting with DNA. It is difficult to think of an independent function for the motif itself, which can confer a reproductive advantage.
Well, it could itself bind to DNA, and thus sterically exclude RNA polymerase from binding. Kd for the monomer is much higher, but it still exists. And any dimerization will produce a protein dimer that recognizes a palindrome with far lower Kd, purely as a consequence of C2 symmetry. Call me biased, but I think modulation of transcription is pretty useful.
I don’t understand what is your point. Are you saying that you can explain the emergence of long and complex proteins or domains or superfamilies (whatever you prefer) as the result of reasonable recombinations of an existing pool of short motifs, each of which was expanded because in itself it conferred a reproductive advantage? Are you imagining some pre-LUCA whose main activity was to synthesize short alpha helices or beta sheets, selected in name of I don’t know what perspective, or short DNA binding motifs which bind DNA without having any other effect, while waiting that ATP synthase and all the rest came out from their random recombination?
Yup, pretty much, but without the unnecessary value judgments. I am imagining some pre-LUCA’s who synthesized short alpha helices, beta sheets, helix-turn-helix motifs, etc., etc. and concatenations thereof, selected in name of some minimal selective advantage, such as DNA binding motifs which inhibit transcription, until they were out-competed by their fractionally less hopeless cousins. But, unfortunately, all the selection and extinction that has occurred since that time has badly fogged our view of this era. I doubt that we will ever know the historical truth about how early proteins did emerge, but - thanks to in vitro protein evolution studies - we do know a fair amount about what is feasible.
I believe that you are forgetting here the huge limits of the necessity part of the algorithm: NS. It can never help to explain the origin of complex functional structures, because complex functional structures cannot be deconstructed into simple naturally selectable steps. It’s as simple as that.
Please provide support for this assertion. Please be very precise.
You say: “No. If you are trying to calculate the probability of hitting a target, given the “chance” hypothesis (which includes RM + NS), then you ARE going to have to consider the effect of selection. And recombination…” No. What I do is to use probability for the parts of the proposed explanation which are attributed to random variation, and include NS in the model if and when it is really explained how it worked in that case. IOWs, I don’t accept NS as a magic fairy, never truly observed, which prevents me from computing real probability barriers to a mechanism which is supposed to rely on probabilities. I believe that that is the only credible scientific approach.
Examples exist that model how error-prone replication combined iteratively with selection can achieve results that ID-proponents claim cannot be achieved within the lifetime of the universe. Some rudimentary ones have been discussed on UD, but the discussions here shed more heat than light, sadly.
Let’s say that you show me a true, existing step between A and B which is naturally selectable and is a sequence intermediate between A and B. Let’s call it A1. OK, then I divide the transition into two steps: A to A1 and A1 to B. That helps your case, but I can still compute the probabilities of each of the two transitions, and of the whole process, including the expansion of A1 by some perfect NS process. I have done those computations, some time ago, here. They show that selectable intermediates do reduce the probabilistic barriers, but they don’t eliminate them. In general, you need a lot of selectable intermediates, each of them completely expanded to the whole population, to bring a complex protein into the range of small probabilistic transitions, which can be accomplished by the probabilistic resources of a real biological system. And that should be the general case. What a pity that we have not even one of those paths available. But that’s what neo darwinism is: a scientific explanation which relies on uncomputed probabilities and never observed necessity paths. That’s not my idea of science. [emphasis added]
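A purely illustrative toy version of the two-step calculation described above might look like this; the per-substitution probability and the number of required substitutions are invented numbers, not gpuccio's actual figures, and NS is granted perfect, instant fixation of the intermediate.

    # Toy model: a transition that needs k specific substitutions, each found
    # with probability p in any given trial. All numbers are assumptions.
    p = 1e-4           # assumed probability of hitting one required substitution per trial
    k = 10             # substitutions separating A from B

    # direct A -> B: all k substitutions must appear in one unselected walk
    waiting_direct = (1 / p) ** k

    # A -> A1 -> B: a selectable intermediate sits halfway, and NS is assumed to
    # fix A1 across the population for free once it appears
    waiting_with_A1 = (1 / p) ** (k // 2) + (1 / p) ** (k - k // 2)

    print(f"expected trials, direct A->B         : {waiting_direct:.1e}")   # 1.0e+40
    print(f"expected trials, A->A1->B (ideal NS) : {waiting_with_A1:.1e}")  # 2.0e+20

On these made-up numbers the intermediate cuts the barrier from about 10^40 to about 2 x 10^20 expected trials: a huge reduction, yet possibly still beyond a population's probabilistic resources, which is the shape of the claim being argued over in this exchange.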
Some of your colleagues have yet to get the memo. Contrast the behavior of those who assert “such pathways cannot exist” with those who say “let’s investigate what would be required?”
“But in all cases, and especially if the source is duplicated or inactivated genes, then the source DNA is not “random”. It may be random with respect to function, but my point to kairosfocus was this: Any bit-counting method of calculating p(T|H), or the information content of a string, rests on the assumption that the values at each position of the string are INDEPENDENT. Whatever “random” (meaning unselected) bits of DNA a cell might cobble together to make a new gene, the resulting string is NOT “random” in the independent sampling sense of the word. That was my point.” Well, I am not really interested in what your point to KF was or is. I have tried to clarify what is the role of “random” in the ID reasoning: a random walk to some unrelated functional sequence. I have tried to be explicit and clear. The only thing I am interested in is any points about that, if you have them.
I would humbly suggest that if you want to clarify what is the role of “random” in the ID reasoning, you make clear, on every occasion that the word is used, which of the various meanings you are ascribing to it:
“Unguided” = your “random walk”
“Unselected” = your “random” source DNA.
“Mathematically random” (i.e. unbiased, independent sampling) = kairosfocus’s erroneous assumption about strings.
And get your colleagues to do the same.DNA_Jock
October 31, 2014 at 12:09 PM PDT
DNA_Jock: I read only now your post. Thank you for the comments. Some brief answers: I don't understand your argument about my choice of superfamilies. I have always said clearly that in SCOP there are at least 3 major groupings which can be used in a discussion: about 1000 basic foldings, about 2000 superfamilies, and about 4000 families. In the papers which in the literature deal with the problem of functional complexity or the appearance of new strucutres in natural history, any of them has been used. Some reason in terms of folds, others in terms of superfamilies, others in terms of families. I believe that superfamilies are the grouping which offers the best tradeoff in terms of detecting truly isolated groups. The family grouping probably includes some possibly similar sequences which are not the best example to discuss the problems of new functional complexity. The folding grouping is IMO too "high", and exaggerates the aggregation of completely different structures. However, I am proposing a general reasoning here: the point is not if we are debating a grouping of 1000, or 2000, or 4000. The point is that there are a lot of functional structures which appear isolated in all reasonable dimensions. You say: "I have not forgotten non-coding DNA. It still isn’t “random” in the mathematical sense. Consider the prevalence of di-, tri- and tetra-nucleotide repeats, or CpG bias. But thank you from bringing up transposons." I love transposons. They are the future of ID. I think you miss the point. What is random is not the sequence itself. A non functional sequence and a functional proteins can have similar quasi ranodm appearance, or share some regularities which have biochemical reasons and are not related to the function. That's not the point. The point is that the only way to go from one sequence to another one which are sequence unrelated, by means of random variation alone (bear with me about that for a moment, I will answer your other point), that is by variations in the sequence which are independent from the final sequence which will be reached (the functional sequence) can only be modeled as a random walk, where distant unrelated results have a similar probability of being reached, whose higher threshold is 1/n. So, it's the walk which is random, and it's the probability of a functional state of being reached through that random walk that we are computing. You say: "“A primordial ATP binding protein”. I, too, am unsatisfied with this answer." So am I. And remember, ATP synthase is just one example of many. Of many many. You say: "As I understand this sentence, you are saying that if we cannot determine the source DNA for an event that took place 2 billion years ago, the only reasonable hypothesis is that the source DNA is completely unrelated to the DNA that is ancestral to any other extant DNA sequences. Weird." Not weird at all. The point is, let's say that we have A and B today, and that they are completely sequence unrelated, and structure unrelated, and function unrelated. Are you saying that it is reasonable to explain them both with a derivation from some unknown sequence, let's call it X, which had sequence similarities to both, or structure similarities to both, or function similarities to both? Is that your proposal? Weird, really. Do you really think that it is a credible scientific explanation for two completely different things to postulate an unknown source for both? Without any evidence at all that it exists? 
You say: "“Forms a stable alpha helix” is a useful function, from the protein’s perspective. " But that is exactly the point. We cannot consider "the protein's perspective" in the RV+NS algorithm. The only perspective there is the reproductive fitness of the replicator. So, unless you can show how the formation of a generic stable alpha helix can confer a reproductive advantage to some real biological being, the fact remains that a generic alpha helix cannot be naturally selected. "Anyway, I also included the helix-turn-helix motif." That is different. That is a 3d motif, which is functional. It is a simple one, about 20 AAs, so I would not choose it as an example of dFSCI. But what is your point? Of course there are simple functional motifs, and simple short proteins too which can be functional. That's why I don't use them as examples of dFSCI and I don't infer design for them. Moreover, that motif is a DNA binding motif (one of many) which is essential to functions of proteins interacting with DNA. It is difficult to think of an independent function for the motif itself, which can confer a reproductive advantage. I don't understand what is your point. Are you saying that you can explain the emergence of long and complex proteins or domains or superfamilies (whatever you prefer) as the result of reasonable recombinations of an existing pool of short motifs, each of which was expanded because in itself it conferred a reproductive advantage? Are you imagining some pre-LUCA whose main activity was to synthesize short alpha helices or beta sheets, selected in name of I don't know what perspective, or short DNA binding motifs which bind DNA without having any other effect, while waiting that ATP synthase and all the rest came out from their random recombination? I believe that you are forgetting here the huge limits of the necessity part of the algorithm: NS. It can never help to explain the origin of complex functional structures, because complex functional structures cannot be deconstructed into simple naturally selectable steps. It's as simple as that. You say: "No. If you are trying to calculate the probability of hitting a target, given the “chance” hypothesis (which includes RM + NS), then you ARE going to have to consider the effect of selection. And recombination…" No. What I do is to use probability for the parts of the proposed explanation which are attributed to random variation, and include NS in the model if and when it is really explained how it worked in that case. IOWs, I don't accept NS as a magic fairy, never truly observed, which prevents me from computing real probability barriers to a mechanism which is supposed to rely on probabilities. I believe that that is the only credible scientific approach. Let's say that you show me a true, existing step between A and B which is naturally selectable and is a sequence intermediate between A and B. Let's call it A1. OK, then I divide the transition into two steps: A to A1 and A1 to B. That helps your case, but I can still compute the probabilites of each of the two transitions, and of the whole process, including the expansion of A1 by some perfect NS process. I have done those computations, some time ago, here. They show that selectable intermediates do reduce the probabilistic barriers, but they don't eliminate them. 
In general, you need a lot of selectable intermediates, each of them completely expanded to the whole population, to bring a complex proteins into the range of small probabilistic transitions, which can be accomplished by the probabilistic resource of a real biological system. And that should be the general case. What a pity that we have not even one of those paths available. But that's what neo darwinism is: a scientific explanation which relies on uncomputed probabilities and never observed necessity paths. That's not my idea of science. "But in all cases, and especially if the source is duplicated or inactivated genes, then the source DNA is not “random”. It may be random with respect to function, but my point to kairosfocus was this: Any bit-counting method of calculating p(T|H), or the information content of a string, rests on the assumption that the values at each position of the string are INDEPENDENT. Whatever “random” (meaning unselected) bits of DNA a cell might cobble together to make a new gene, the resulting string is NOT “random” in the independent sampling sense of the word. That was my point." Well, I am not really interested in what your point to KF was or is. I have tried to clarify what is the role of "random" in the ID reasoning: a random walk to some unrelated functional sequence. I have tried to be explicit and clear. The only thing I am interested in is eventual points about that, if you have them.gpuccio
October 31, 2014 at 12:45 AM PDT
D-J: Have you really read what I did say; as in, particularly the significance of SUM (pi log pi) and the bridge from Shannon's info work to entropy on an informational view?

Next, the point of the real world Monte Carlo is not that blind chance and mechanical necessity led to life -- that's a big begging of the question on a priori evo mat [your probabilities as triumphalistically announced collapse], but that we see the scope of search relative to the space of possibilities, which turns out to be such that the only reasonable expectation is to capture the bulk, non-functional gibberish. Precisely because of the tight specificity of configs that confer function, which locks out most of the space of possibilities.

As for the claimed contradiction, the chaining chemistry of DNA and AAs does not impose any particular sequencing in strings; that would be self defeating. Indeed, if there were such an imposition it would render DNA incapable of carrying information, and it would undermine the ability of the proteins to have the varied functional forms they do. In that context, a priori it is quite legitimate to use carrying capacity as an info measure, as is common in PC work. However, real codes do not show flat random distributions, mostly because of functional rules or requisites or conventions. DNA chains are coding for proteins, which in turn must fold and function, which imposes subtle, specific limits. Thus protein families as studied by Durston et al have the difference between null (flat random), ground and functional states, which can then lead on to more refined metrics such as he used. But in no wise does this delegitimise the string of choices approach. Are you going to say that file sizes as reported for DOC files etc are not reporting information metrics? If so, then I think you will be highly idiosyncratic. A fairer result would be to accept that there are diverse info metrics taken up for different purposes, a point highlighted by Shannon in 1948.

And of course, whether we use DNA and AA chains or go to more refined metrics -- try Yockey's work of several decades ago -- the material point remains the same. We are well beyond any threshold reasonably attainable by blind chance and mechanical necessity with 10^57 monte carlo sim runs of 10^14 tries per sec for 10^17 s, or if you want 10^80 runs for the observed cosmos.

I have to run, having been summoned for an urgent meeting on very short notice. I cannot take up more time from what is now on my plate, tonight. G'night, KFkairosfocus
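For reference, the scope-of-search arithmetic invoked here can be laid out explicitly. The sketch below simply takes the quoted figures at face value and compares them with the 500-bit threshold from the original post; it does not adjudicate whether that comparison is the right one to make.

    from math import log10, log2

    runs, rate, seconds = 1e57, 1e14, 1e17       # figures quoted above, taken at face value
    total_tries = runs * rate * seconds          # 1e88 trials in the "real-life Monte Carlo"

    print(f"total trials      ~ 10^{log10(total_tries):.0f} ~ 2^{log2(total_tries):.0f}")
    print(f"500-bit space     ~ 10^{500 * log10(2):.0f} configurations")
    print(f"fraction searched ~ 10^{log10(total_tries) - 500 * log10(2):.1f}")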
October 28, 2014 at 4:21 PM PDT
kairosfocus said
...the history of life can be taken as exploring the field of realistic possibilities in a real life Monte Carlo run, on the usual timeline across c 4 BY.
I agree. Of course n=1, and the posterior probability for the outcome we observe is 1. What can we infer about the prior probabilities? Not a lot, given the [cough] small sample size.
...the history of life can be taken as exploring the field of realistic possibilities in a real life Monte Carlo run, on the usual timeline across c 4 BY.
Not if you believe in Intelligent Design, it can't.
...the history of life can be taken as exploring the field of realistic possibilities in a real life Monte Carlo run, on the usual timeline across c 4 BY.
That's quite the GSW to the foot.DNA_Jock
October 28, 2014 at 2:44 PM PDT
*grabs popcorn*rich
October 27, 2014 at 3:03 PM PDT
Gpuccio, I agree with you that the origin of novel protein domains (specifically "folds") is perhaps the greatest challenge to the MES. OTOH I disagree with almost everything that you say about selection; in my reply below, however, I am going to attempt to restrict myself to those disagreements that bear directly on the point at issue.
DNA_Jock: Good questions. Here are my answers. 1) “The ancestral sequence is not directly available to us; we have X to A and X to B. Some researchers have, with some success, inferred what X must have looked like. To assume that the path of interest is from A to B is to fall victim to “Axe’s mistake”.” OK, but we are dealing here with superfamilies which have no detectable sequence similarities, and no similar folding, and no similar function.
No. The 2020 superfamilies in SCOP are grouped into 1202 "folds", based on similarities in folding. Furthermore, Apolipoprotein A-I and Apolipoprotein A-II belong to different folds, and therefore different superfamilies, yet they have similarity in “function”.
You must remember that the walk happens at sequence level, through random mutations in the nucleotide sequence, which have nothing to do with the function. I will discuss the role of selection later. So, whatever the possible ancestor X, the fact remains that what we observe are unrelated sequences (2000 of them) with unrelated folding [incorrect] and unrelated function [incorrect]. Believing that they derived from some common ancestor, which was itself functional, is like believing in a myth: you can do that, but there is no empirical support. When an ancestor is reconstructed (tentatively) that is done because of similarities in the descendants. You cannot do that for unrelated sequences.

2) “It is NOT a ‘random’ walk. It is two or more stochastic, somewhat constrained walks. Selection may play a role.” Again, I will discuss selection later. Regarding the two walks, no, it is a random walk from the ancestor A, whatever it was, to B. The simple fact is, the ancestor had to be different at sequence level and folding level and function level, if the superfamily as such only appears at some moment in natural history, and there is no trace at all of any ancestor, at sequence, folding [incorrect] or function [incorrect] level, in the existing proteome. Again, you can believe what you like, and myths are a legitimate hobby, but in science we have to look at facts.

You say: “Your definition of superfamilies differs from mine. My understanding is that a superfamily is the highest taxonomic level at which ancestry can be inferred. It’s not that different families are completely unrelated, it’s that the relationship is unclear.” The currently defined superfamilies in SCOP are completely sequence unrelated, and they have different folding [incorrect] and different functions [incorrect]. What more do you want?

You say: “This makes no sense. You appear to be claiming that novel superfamilies arise from TOTALLY random sequences. Where on earth would an organism find such totally random sequences?” You are forgetting non coding DNA. We have now many examples of new genes emerging from non coding sequences, usually through transposon activity. Can you explain why a non coding sequence should not be unrelated to a protein coding sequence which emerges in time from it? Especially if that happens through unguided transposon activity?
I have not forgotten non-coding DNA. It still isn’t “random” in the mathematical sense. Consider the prevalence of di-, tri- and tetra-nucleotide repeats, or CpG bias. But thank you for bringing up transposons.
Let’s go back to the alpha and beta subunits of ATP synthase. They are unique sequences, emerging very early (in LUCA) and retaining most of the sequence for 4 billion years.
Not sure what “unique” means here: ncbi reports 42 distinct versions of alpha, and 51 distinct versions of beta. "unique" means they differ from each other, I guess.
What is your hypothesis? From what “ancestor” sequence did they emerge?
“A primordial ATP binding protein”. I, too, am unsatisfied with this answer.
The fact remains, if a specific functional sequence emerges, say 2 billion years ago (for example, all the many new superfamilies and proteins in eukaryotes) and there is no trace of those sequences, foldings and functions before, the only reasonable hypothesis is that those sequences emerged from unrelated sequences at some time.
As I understand this sentence, you are saying that if we cannot determine the source DNA for an event that took place 2 billion years ago, the only reasonable hypothesis is that the source DNA is completely unrelated to the DNA that is ancestral to any other extant DNA sequences. Weird.
You say: “Cells are, however, chock-a-block full of sequences that form stable alpha-helices, others that form stable beta-sheets, some that form stable helix-turn-helix motifs. These motifs are too short and permissive to allow the accurate inference of common ancestry millions of years later*, but they completely destroy the ‘bit-counting’ probability calculations. ” No. Absolutely not. Secondary structures are common to all proteins, but in themselves they build no specific function. The function is defined at many levels, but the two most important ones are: a) The general folding and more specific tertiary structure of the molecule. That is highly sequence dependent, even if different sequences can fold in similar ways, but almost always they retain at least some homology [but the homology may be undetectable, see ‘folds’, above]. So, in the same superfamily or family you can have 90% homology or 30% homology. In rare cases you can have almost no homology, but that is rather an exception. [as you get further apart, it becomes a fairly frequent exception] b) The active site. That determines more often the specific affinity to different possible substrates, and is often what changes inside a family to bear new, more or less different, functions.
c) Allosteric sites are important too.
But you cannot certainly consider secondary structures, like alpha helix and beta sheets, as independent functional units. Instead, domains which are really functional, even if simple, are usually retained throughout evolution, and show clear homologies even in very distant proteins.
Why can I not consider them? “Forms a stable alpha helix” is a useful function, from the protein’s perspective. Anyway, I also included the helix-turn-helix motif. Sequence-specific DNA binding seems “really functional”. There are a number of 2ndary structure motifs that make useful building blocks for new proteins, but they are too short and/or too permissive of substitutions for the relationship to be unambiguously detected billions of years later, and you have to be wary of false positives caused by convergent evolution. Hence the problem with inferring ancestral relationships above the level of superfamily.
You say: “As does any selection.” OK, but as I have always said the role of selection must be considered separately. It has nothing to do with the computation of probabilistic resources and barriers.
No. If you are trying to calculate the probability of hitting a target, given the “chance” hypothesis (which includes RM + NS), then you ARE going to have to consider the effect of selection. And recombination…
[snipped section on selection, most of which I disagree with, but we can save that debate for another time…] That’s why most biologists rely more on scenarios where the ancestor is not functional: duplicated, inactivated genes, or simply non coding sequences.
Or fragments of the aforementioned categories, brought into novel combinations by illegitimate recombination, potentially catalyzed by transposons. But in all cases, and especially if the source is duplicated or inactivated genes, then the source DNA is not “random”. It may be random with respect to function, but my point to kairosfocus was this: Any bit-counting method of calculating p(T|H), or the information content of a string, rests on the assumption that the values at each position of the string are INDEPENDENT. Whatever “random” (meaning unselected) bits of DNA a cell might cobble together to make a new gene, the resulting string is NOT “random” in the independent sampling sense of the word. That was my point.DNA_Jock
October 27, 2014 at 3:03 PM PDT
kf, As I already said, in the comment you are replying to,
"Accounting for the different prevalences of the different monomers is easily done, I agree. "
Which is what Morse did.
"But that is not where your problem lies."
You have already admitted that you assume independence and that this assumption is incorrect. How big is the error? How do you know?DNA_Jock
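One way to see how large the error can be, in a deliberately artificial case: compare the 2 bits per base implied by the independent-and-uniform assumption with the entropy rate of a correlated source. The 4-state Markov chain below is invented for illustration and is not an estimate from any real genome.

    from math import log2

    # invented first-order Markov chain over A, C, G, T with a strong tendency
    # to repeat the previous base (each row sums to 1)
    P = {
        "A": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        "C": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
        "G": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
        "T": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    }
    stationary = {b: 0.25 for b in "ACGT"}        # uniform by symmetry

    naive_bits = log2(4)                          # 2.00 bits/base if positions were independent
    entropy_rate = -sum(stationary[i] * P[i][j] * log2(P[i][j])
                        for i in "ACGT" for j in "ACGT")   # ~1.36 bits/base

    print(f"independent-and-uniform assumption : {naive_bits:.2f} bits per base")
    print(f"correlated (Markov) source         : {entropy_rate:.2f} bits per base")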
October 27, 2014 at 2:52 PM PDT
D-J et al: I have to be busy elsewhere, but I note simply that H is a metric of avg info per element, which can be derived from the on-the-ground statistics, which is what Durston et al did. Where, the history of life can be taken as exploring the field of realistic possibilities in a real life Monte Carlo run, on the usual timeline across c 4 BY. Similarly, it is a pretty standard result that the chain of y/n possibilities defining a state is a valid info metric.

So, it's not that we are stuck for probability values, but that info can be arrived at in its own terms and in fact then worked backwards to probabilities if you want. For example, Morse simply went to printers to learn general English text statistical frequencies and used that to set up his code for efficiency. No prizes for guessing why E is a single dot.

So, the pretence that there are no valid info metrics on the table that indicate FSCO/I collapsed long since; not to mention the pretence that FSCO/I does not describe a real phenomenon -- something as commonplace as text in posts in this thread and the complex functional organisation of the PCs etc people use to look at the thread. Also, the AA space and islands in it, GP is clearly addressing. KFkairosfocus
October 26, 2014 at 11:54 AM PDT
DNA_Jock: Good questions. Here are my answers. 1) "The ancestral sequence is not directly available to us; we have X to A and X to B. Some researchers have, with some success, inferred what X must have looked like. To assume that the path of interest is from A to B is to fall victim to “Axe’s mistake”." OK, but we are dealing here with superfamilies which have no detectable sequence similarities, and no similar folding, and no similar function. Yoi must remember that the walk happens at sequence level, through random mutations in the nucleotide sequence, which have nothing to do with the function. I will discuss the role of selection later. So, whatever the possible ancestor X, the fact remains that what we observe are unrelated sequences (2000 of them) with unrelated folding and unrelated function. Believing that they derived from some common ancestor, which was itself functional, is like believing in a myth: you can do that, but there is no empirical support. When an ancestor is reconstructed (tentatively) that is done because of similarities in the descendants. You cannot do that for unrelated sequences. 2) "It is NOT a ‘random’ walk. It is two or more stochastic, somewhat constrained walks. Selection may play a role." Again, I will discuss selection later. Regarding the two walks, no, it is a random walk form the ancestor A, whatever it was, to B. The simple fact is, the ancestor had to be different at sequence level and folding level and function level, if the superfamily as such only appears at some moment in natural history, and there is no trace at all of any ancestor, at sequence, folding or function level, in the existing proteome. Again, you can believe what you like, and myths are a legitimate hobby, but in science we have to look at facts. You say: "Your definition of superfamilies differs from mine. My understanding is that a superfamily is the highest taxonomic level at which ancestry can be inferred. It’s not that different families are completely unrelated, it’s that the relationship is unclear." The currently defined superfamilies in SCOP are completely sequence unrelated, and they have different folding and different functions. What do you want more? You say: "This makes no sense. You appear to be claiming that novel superfamilies arise from TOTALLY random sequences. Where on earth would an organism find such totally random sequences?" You are forgetting non coding DNA. We have now many examples of new genes emerging from non coding sequences, usually through transposon activity. Can you explain why a non coding sequence should not be unrelated to a protein coding sequence which emerges in time from it? Especially if that happens through unguided transposon activity? Let's go back to the alpha and beta subunits of ATP synthase. They are unique sequences, emerging very early (in LUCA) and retaining most of the sequence for 4 billion years. What is your hypothesis? From what "ancestor" sequence did they emerge? The fact remains, if a specific function sequence emerges, say 2 billion years ago (for example, all the many new superfamilies and proteins in eukatyotes) and there is no trace of those sequences, foldings and functions before, the only reasonable hypothesis is that those sequences emerged from unrelated sequences at some time. You say: "Cells are, however, chock-a-block full of sequences that form stable alpha-helices, others that form stable beta-sheets, some that form stable helix-turn-helix motifs. 
These motifs are too short and permissive to allow the accurate inference of common ancestry millions of years later*, but they completely destroy the ‘bit-counting’ probability calculations. " No. Absolutely not. Secondary structures are common to all proteins, but in themselves they build no specific function. The function is defined at many levels, but the two most important ones are: a) The general folding and more specific tertiary structure of the molecule. That is highly sequence dependent, even if different sequences can fold in similar ways, but almost always they retain at least some homology. So, in the same superfamily or family you can have 90% homology or 30% homology. In rare cases you can have almost no homology, but that is rather an exception. b) The active site. That determines more often the specific affinity to different possible substrates, and is often what changes inside a family to bear new, more or less different, functions. But you cannot certainly consider secondary structures, like alpha helix and beta sheets, as independent functional units. Instead, domains which are really functional, even if simple, are usually retained throughout evolution, and show clear homologies even in very distant proteins. You say: "As does any selection." OK, but as I have always said the role of selection must be considered separately. It has nothing to do with the computation of probabilistic resources and barriers. Now, the topic is very big, and I have dealed with it in detail many times. I cannot cover it in all its aspects in this post. Briefly, the main points are. 1) Negative selection is responsible for the sequence conservation of what is already functional. That is a very powerful mechanism. It is the reason why ATP synthase chains are almost the same after 4 billion years: almost all mutations are functionally detrimental, so the sequence is conserved. 2) Neutral mutations and genetic drift are responsible for the modifications in the permissive, not strictly constrained, parts of the sequence. That's why similar proteins, with the same function, and structure, can sometimes diverge very much in different species. That is the "big bang theory of protein evolution": new proteins appear functional, and then traverse their functional space through neutral variation. 3) Positive selection should act by expanding any naturally selectable positive variation and therefore "fixing" it in the general population. This is the only part that can lower the probabilistic barriers. Unfortunately, it is also the part for which we have no evidence. Apart from the few known, trivial microevolutionary scenarios. The fact is, positive selection is extremely rare and trivial. It selects for small variations (one or two AAs) which usually lower the functionality of some existing structure, but confer an advantage under specific extreme selective scenarios (see antibiotic resistance, for example). Now, for positive selection to be really useful in generating new unrelated functional sequences, two thing should be true: a) Complex functional sequences should be, as a rule, deconstructable into many intermediate, each of them functional and naturally selectable, each of them similar at sequence level to its ancestor except for a couple of AAs. IOWs, the transition from A to B should always be deconstructable into a composite transition: A to A1 to A2 to A3... 
to B-3 to B-2 to B-1 to B where each transition is in the range of the probabilistic resources of the system (from one to a few AAs) and each intermediate step is fully naturally selectable and naturally selected at some time (expanded to the general population). That is simply not true, both logically and empirically. We have no examples of such a deconstruction, and there is no logical reason why that should be possible, not only in specific cases, but in the general case. So, we have both logical and empirical reasons to reject that assumption. b) Anyway, if a) were true, we should observe strong and obvious traces of those intermediates in the proteome. We don't. Or we should be able to find those paths in our labs. We don't. So, again, myths and nothing else. A final consideration: negative selection, which is a strong and well documented mechanism, can only act as a force against the transition as long as the ancestor is a functional sequence. That has been shown very clearly in the ragged landscape paper. Each relative peak of functionality in the sequence space tends to block the process: those peaks are, indeed, holes. That's why most biologists rely more on scenarios where the ancestor is not functional: duplicated, inactivated genes, or simply non coding sequences. But that means relying essentially on neutral variation. And neutral variation is completely irrelevant in lowering the probabilistic barriers: it simply cannot do that. Finally, as far as I know new superfamilies are documented up to the primates. I am not sure in humans, probably not. But, as you know, there are many possible new genes in humans which have not yet been characterized (a few thousands).gpuccio
October 26, 2014 at 10:24 AM PDT
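An editorial aside: the contrast gpuccio draws in point a) above, between one unguided jump and a chain of small selectable steps, can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions (a hypothetical transition touching 100 amino acid positions, uniform probabilities over the 20 AAs, and a purely hypothetical split into 50 two-residue steps); it is not gpuccio's own calculation.

```python
import math

# Minimal arithmetic sketch (editorial illustration, not gpuccio's calculation).
# Assumption: a hypothetical transition changes 100 amino acid positions, and each
# position is drawn uniformly from the 20 standard amino acids (real prevalences
# differ, as noted elsewhere in the thread).

bits_per_aa = math.log2(20)          # ~4.32 bits of raw capacity per position
positions_changed = 100

# One unguided jump from A to an unrelated target B:
single_jump_bits = positions_changed * bits_per_aa
print(f"single jump: ~{single_jump_bits:.0f} bits")       # ~432 bits

# The same change split into 50 selectable intermediates of 2 positions each:
steps = 50
per_step_bits = (positions_changed / steps) * bits_per_aa
print(f"per selectable step: ~{per_step_bits:.1f} bits")  # ~8.6 bits

# The stepwise path only helps if every intermediate is functional and actually
# selected; gpuccio's argument is that such deconstructions are not observed.
```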
kf,

"But prevalence of A/G/C/T or U and the 20 proteins in general and for specific families has long been studied as can be found in Yockey's studies."

Accounting for the different prevalences of the different monomers is easily done, I agree. But that is not where your problem lies.

"The Durston metrics of H reflect this, which comes out in SUM pi log pi metrics."

H is a measure of functional uncertainty, not a probability. You have to feed pi into this equation... I am asking you how you calculate a single pi value. You have already admitted that you assume independence and that this assumption is incorrect, viz: "info contexts reduce info capacity", but you have asserted that your error is not "material". When I ask you what basis you have for this "not material" claim, your response is "because the numbers we get are soooo big". So you have yet to calculate p(T|H) with any known level of precision.

DNA_Jock
October 25, 2014 at 12:26 PM PDT
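For readers wondering what the per-site pi values and H = – SUM pi log2 pi look like in practice, here is a minimal sketch of a Durston-style per-column calculation. The alignment is invented toy data, the null state is taken as the uniform 20-AA distribution, and refinements such as pseudocounts or sample-size corrections are omitted; it illustrates the shape of the calculation DNA_Jock is probing, not Durston's published pipeline.

```python
import math
from collections import Counter

def column_H(column):
    """Shannon uncertainty, in bits, of one alignment column: H = -sum(pi * log2 pi)."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented toy alignment of one protein family (rows = sequences, columns = sites).
alignment = [
    "MKVLA",
    "MKVIA",
    "MRVLA",
    "MKVLG",
]

H_ground = math.log2(20)  # null state: any of the 20 AAs equally likely at a site

fits = 0.0
for i in range(len(alignment[0])):
    column = [seq[i] for seq in alignment]
    # Durston-style "functional bits" for the site: reduction from the null state.
    fits += H_ground - column_H(column)

print(f"toy functional information: ~{fits:.1f} fits")
```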
gpuccio, Thank you for the thoughtful reply. When you say

"A to B through a random walk"

I have two quibbles:

1) The ancestral sequence is not directly available to us; we have X to A and X to B. Some researchers have, with some success, inferred what X must have looked like. To assume that the path of interest is from A to B is to fall victim to "Axe's mistake".

2) It is NOT a 'random' walk. It is two or more stochastic, somewhat constrained walks. Selection may play a role.

"Where A and B are two states which are unrelated at sequence level. Why? Because protein superfamilies are by definition unrelated at sequence level. And there are 2000 of them"

Your definition of superfamilies differs from mine. My understanding is that a superfamily is the highest taxonomic level at which ancestry can be inferred. It's not that different families are completely unrelated, it's that the relationship is unclear.

"So, B is sequence unrelated to A, and therefore we can only assume that it has the same probability as any other unrelated state."

This makes no sense. You appear to be claiming that novel superfamilies arise from TOTALLY random sequences. Where on earth would an organism find such totally random sequences? Cells are, however, chock-a-block full of sequences that form stable alpha-helices, others that form stable beta-sheets, some that form stable helix-turn-helix motifs. These motifs are too short and permissive to allow the accurate inference of common ancestry millions of years later*, but they completely destroy the 'bit-counting' probability calculations. As does any selection. So using 1/n for p(T|H) is hopelessly inaccurate.

* Do you know how recent the most recent superfamily is?

Thanks, DNA_J

DNA_Jock
October 25, 2014 at 11:56 AM PDT
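DNA_Jock's first quibble, that the real comparison runs X to A and X to B rather than A to B, can be illustrated with a toy simulation. Everything here is assumed for illustration (sequence length 100, 60 unconstrained substitutions per lineage, no selection or structural constraint modelled), so it shows the geometry of the point rather than any biological estimate.

```python
import random

random.seed(1)
AAS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, n_subs):
    """Apply n_subs random substitutions to a sequence (no selection modelled)."""
    seq = list(seq)
    for _ in range(n_subs):
        pos = random.randrange(len(seq))
        seq[pos] = random.choice(AAS)
    return "".join(seq)

def identity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical ancestral sequence X, diverging independently into A and B.
X = "".join(random.choice(AAS) for _ in range(100))
A = mutate(X, 60)
B = mutate(X, 60)

print(f"X vs A identity: {identity(X, A):.0%}")
print(f"X vs B identity: {identity(X, B):.0%}")
print(f"A vs B identity: {identity(A, B):.0%}")  # typically lower than either X comparison
```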
LoL!@ thorton the moron. I NEVER calculated the CSI of an aardvark by counting the letters in its dictionary definition. You are either a liar or a loser. I used the definition of an aardvark as an example of specified information and then used Shannon methodology to determine if it was also complex. It was a simple example, but obviously too complicated for a simpleton like yourself. Being proud of your ignorance and dishonesty can't be a good thing.

Joe
October 25, 2014 at 11:50 AM PDT
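For what it is worth, the kind of calculation Joe describes (take a verbal specification, then ask how many bits it occupies under a Shannon-style measure) can be sketched as follows. The definition text and the crude 27-symbol equiprobable model are placeholders, not Joe's actual example or numbers.

```python
import math

# Placeholder definition text (illustrative only, not the original example).
definition = ("a nocturnal burrowing mammal of southern Africa "
              "that feeds on ants and termites")

# Crude per-character capacity model: 26 letters plus space, equally likely.
bits_per_char = math.log2(27)
total_bits = len(definition) * bits_per_char

print(f"{len(definition)} characters, ~{total_bits:.0f} bits of raw capacity")
# Whether this exceeds a chosen complexity threshold (e.g. 500 bits) is then a
# separate question from whether the string is specified.
```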
Joke,

"And that is after being shown how to calculate CSI"

Yeah Joke, you told us. Like when you calculated the CSI of an aardvark by counting the letters in its dictionary definition. ALL SCIENCE SO FAR!

Thorton
October 25, 2014 at 11:29 AM PDT
timmy: "Thar ain't no ID CSI calculations cuz I don't know what I am talking about!" And that is after being shown how to calculate CSI and a peer-reviewed paper that follows the methodology.Joe
October 25, 2014 at 10:51 AM PDT
DNA_Jock: If you want, I can give you the scenario for protein coding genes.

If we assume that a new protein superfamily appears at some time in natural history (which is exactly what we observe), and that the new gene derives from some existing sequence (another protein coding gene, or a pseudogene, or some non-coding DNA sequence), then the scenario is: A to B through a random walk, where A and B are two states which are unrelated at the sequence level. Why? Because protein superfamilies are by definition unrelated at the sequence level, and there are 2000 of them.

So, B is sequence unrelated to A, and therefore we can only assume that it has the same probability as any other unrelated state. What is that probability? If we assume that all unrelated states have approximately the same probability, and that there are n possible states, it is obvious that the minority of states which are sequence related to A have a higher probability of being found than any unrelated state, especially in a random walk where most variation is of a single amino acid. Therefore, we can safely take 1/n as an upper bound for the probability of each unrelated state.

Is that clear?

gpuccio
October 25, 2014 at 09:51 AM PDT
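gpuccio's 1/n bound can be made concrete in a few lines; the 150-AA length is an arbitrary illustrative choice, and treating unrelated states as at most equiprobable is the assumption he states above, not an independent result.

```python
import math

# Illustrative only: a hypothetical 150-AA target unrelated to the starting sequence.
length = 150
n_states = 20 ** length          # number of possible sequences of that length

# If each unrelated state has probability at most 1/n, its information measure is:
p_upper = 1 / n_states
bits = -math.log(p_upper, 2)     # equivalently length * log2(20)

print(f"-log2(1/n) = {bits:.0f} bits")   # ~648 bits, beyond a 500-bit threshold
```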
Thorton, you are now speaking with disregard to truth, hoping that what you have said or suggested will be taken as true. Sad. You need to take the time to actually address what Durston et al have done, which stands as a direct factual counter to your declaration. And, of course, there is much more that you have willfully despised and pretended away. Please, do better. KF

PS: Onlookers, remember that, just to start, ordinary computer files are FSCO/I; this is a familiar thing. And through node-arc 3-d patterns that are functionally specific, organisation converts to FSCO/I metrics, as AutoCAD etc. routinely do. mRNA coding for proteins is a string data structure at essentially 2 bits per base, 6 bits per protein AA. Again, if you have been unduly disturbed, cf: https://uncommondescent.com/atheism/fyiftr-making-basic-sense-of-fscoi-functionally-specific-complex-organisation-and-associated-information/

kairosfocus
October 25, 2014 at 08:43 AM PDT
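The capacity figures KF cites can be checked in a couple of lines; the 300-AA protein is a hypothetical example, and 2 bits per base is the raw storage capacity of a 4-letter alphabet, not a measured information content.

```python
import math

bits_per_base = math.log2(4)                  # 2 bits: raw capacity of a 4-symbol alphabet
bases_per_aa = 3                              # one codon per amino acid
bits_per_aa = bits_per_base * bases_per_aa    # 6 bits of raw capacity per coded AA

protein_length = 300                          # hypothetical protein, for illustration
print(f"{protein_length * bases_per_aa} bases, "
      f"~{protein_length * bits_per_aa:.0f} bits raw capacity")  # 900 bases, 1800 bits
```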