Uncommon Descent Serving The Intelligent Design Community

Bioinformatics tools used in my OPs: some basic information.


EugeneS made this simple request in the thread about Random Variation:

I also have a couple of very concrete and probably very simple questions regarding the bioinformatics algorithms and software you are using. Could you write a post on the bioinformatics basics, the metrics and a little more detail about how you produced those graphs, for the benefit of the general audience?

That’s a very reasonable request, and so I am trying here to address it. So, this OP is mainly intended as a reference, and not necessarily for discussion. However, I will be happy, of course, to answer any further requests for clarifications or details, or any criticism or debate.

My first clarification is that I work on protein sequences, and I use essentially two important tools, both freely available on the web.

The first basic site is Uniprot.

The mission of the site is clearly stated in the home page:

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

I would say: how beautiful to work with a site which, in its own mission, incorporates the concept of functional information! And believe me, it’s not an ID site! 🙂

Uniprot is a database of proteins. Here is a screenshot of the search page.

 

 

Here I searched for “ATP synthase beta”, and I found easily the human form of the beta chain:

Now, while the “Entry name”, “ATPB_HUMAN”, is a brief identifier of the protein in Uniprot, the really important ID is in the column “Entry”: “P06576”. Indeed, this is the ID that can be used as an accession number in the BLAST software, which we will discuss later.

The “Reviewed” icon in the third column is important too, because in general it’s better to use only reviewed sequences.

By clicking on the ID in the “Entry” column, we can open the page dedicated to that protein.

 

 

Here, we can find a lot of important information, first of all the “Function” section, which sums up what is known (or not known) about the protein function.

Another important section is the “Family and Domains” section, which gives information about domains in the protein. In this case, it just states:

Belongs to the ATPase alpha/beta chains family.

Then, the “Sequence” section gives the reference sequence for the protein:

It is often useful to have the sequence in FASTA format, which is probably the most commonly used format for sequences. To do that, we can simply click on the FASTA button (above the Sequence section). This is the result:

This sequence is made of two parts: a comment line, which is a summary description of the sequence, and then the sequence itself. A sequence in this form can easily be pasted into BLAST, or other bioinformatics tools, either including the comment line, or just using the mere sequence.
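The two-part structure described above (a comment line starting with ">", then the sequence) is easy to handle in code. Here is a minimal Python sketch of a FASTA reader; the function name `parse_fasta` and the toy record are my own illustrative assumptions, not a real Uniprot entry.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (comment_line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence lines are wrapped in the file; join them back together.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record: the header mimics the Uniprot style, the residues are made up.
example = """>sp|X00000|TOY_HUMAN Hypothetical toy protein
MLGAVRRAAT
GLLQPAAS
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Note that this sketch ignores sequence lines that come before any comment line, so a bare sequence without a header would need to be handled separately.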

Now, let’s go to the second important site: BLAST (Basic Local Alignment Search Tool).  It’s a service of NCBI (National Center for Biotechnology Information).

We want to go to the Protein Blast page.

Now, let’s see how we can verify my repeated statement that the beta chain of ATP synthase is extremely conserved, from bacteria to humans. So, we paste the ID from Uniprot (P06576) in the field “Accession number”, and we select Escherichia coli in the field “Organism” (important: the organism name must be selected from the drop-down menu, and must include the taxid number). IOWs, we are blasting the human protein against all E. coli proteins. Here’s how it looks:

 

Now, we can click on the “BLAST” blue button (bottom left), and the query starts. It takes a little time. Here is the result:

 

In the upper part, we can see a line representing the 529 AAs which make the protein sequence, and the recognized domains in the sequence (only one in this case).

The red lines are the hits (red, because each of them is higher than 200 bits). When you see red lines, something is there.

Going down, we see a summary of the first 100 hits, in order of decreasing homology. We can see that the first hit is with a protein named “F0F1 ATP synthase subunit beta [Escherichia coli]”, and has a bitscore of 663 bits. However, there are more than 50 hits with a bitscore above 600 bits, all of them in E. coli (that was how our query was defined), and all of them with variants of the same protein. Such redundancy is common, especially with bacteria, and especially with E. coli, because there are a lot of sequences available, often of practically the same protein.

Now, if we click on the first hit, or just go down, we can find the corresponding alignment:

 

The “query” here is the human protein. You can see that the alignment involves AAs 59 – 523, corresponding to AAs 2 – 460 of the “Subject”, that is the E. coli protein.

The middle line represents the identities (amino acid letter) and positives (+). The title reminds us that the hit is with a protein of E. coli whose name is “F0F1 ATP synthase subunit beta”, which is 460 AAs long (rather shorter than the human protein). It also gives us an ID/accession number for the protein, which is a different ID from Uniprot’s IDs, but can be used just the same for BLAST queries.

The important components of the result are:

  1. Score: this is the bitscore, the number I use to measure functional information, provided that the homology is conserved for a long evolutionary time (usually, at least 200 – 400 million years). The bitscore is already adjusted for the properties of the scoring system.
  2. The Expect number: it is simply the number of such homologies that we would expect to find for unrelated sequences (IOWs random homologies) in a similar search. This is not exactly a p value, but, as stated in the BLAST reference page: when E < 0.01, P-values and E-value are nearly identical.
  3. Identities: just the number and percent of identical AAs in the two sequences, in the alignment. The percent is relative to the aligned part of the sequence, not to the total of its length.
  4. Positives: the number and percent of identical + positive AAs in the alignment. Here is a clear explanation of what “positives” are:

     Similarity (aka Positives)

     When one amino acid is mutated to a similar residue such that the physiochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 positive charge. This is far more likely to be acceptable since the two residues are similar in property and won’t compromise the translated protein.

     Thus, percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements are dependent on the criteria of how two amino acid residues are to each other.

     (From: binf.snipcademy.com)

     IOWs, we could consider “positives” as “half identities”.
  5. Gaps: this is the number of gaps used in the alignment. Our alignments are gapped, IOWs spaces are introduced to improve the alignment. The lower the number of gaps, the better the alignment. However, the bitscore already takes gaps into consideration, so we can usually not worry too much about them.
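To make the relationship between these numbers more concrete, here is a small Python sketch. It assumes the standard relations from the BLAST statistics documentation, E = m · n · 2^(−S′) for a bitscore S′, and P = 1 − e^(−E). The function names and the database size are my own illustrative assumptions; the real program uses "effective" query and database lengths, so the E-values it reports will not match this simplified calculation exactly.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bitscore: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def percent_identity(identities, alignment_len):
    """Percent identity relative to the aligned region, not to the whole protein."""
    return 100.0 * identities / alignment_len


# Hypothetical search space: a 529 AA query against a database of 1e9 residues.
e = expect_value(663, query_len=529, db_len=1_000_000_000)
print(e, p_value(e))

# For small E the two measures nearly coincide, as the BLAST reference page says.
print(p_value(0.001))
```

The last line illustrates the statement quoted in point 2: when E is well below 0.01, P and E are practically the same number.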

A very useful tool is the “Taxonomy report” (at the top of the page), which shows the hits in the various groups of organisms.

While in our example we looked only at E. coli, usually our search will include a wider range of organisms. If no organism is specified, BLAST will look for homologies in the whole protein database.

It is often useful to make queries in more groups of organisms, if necessary using the “exclude” option. For example, if I am interested in the transition to vertebrates for the SATB2 protein (ID = Q9UPW6, a protein that I have discussed in a previous OP), I can make a search in the whole metazoa group, excluding only vertebrates, as follows:

 

As you can see, there is very low homology before vertebrates:

 

And this is the taxonomy report:

The best hit is 158 bits in a spider.

Then, to see the difference in the first vertebrates, we can run a query of the same human protein on cartilaginous fish. Here is the result:

 

As you can see, now the best hit is 1197 bits. Quite a difference from the 158 bits best hit in pre-vertebrates.

Well, that’s what I call an information jump!

Now, my further step has been to gather the results of similar BLAST queries made for all human proteins. It is practically impossible to do that online, so I downloaded the BLAST executables and databases. That can be done from the BLAST site, and allows one to make queries locally on one’s own computer. The use of the BLAST executables is a little more complex, because it is done through command-line instructions, but it is not extremely difficult.
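As an illustration of what such a command-line query can look like, here is a Python sketch that assembles a `blastp` call for the BLAST+ executables. The file and database names are hypothetical placeholders, and the choice of output columns is an assumption made for this example, not necessarily the settings used for the dataset described here.

```python
import shlex

# Columns requested in the tabular output (BLAST+ "-outfmt 6" field names).
OUTFMT_FIELDS = ["qseqid", "sseqid", "nident", "positive", "gaps",
                 "length", "slen", "evalue", "bitscore"]


def blastp_command(query_fasta, database, out_path, threads=4):
    """Build the argument list for a local blastp run with tabular output."""
    outfmt = "6 " + " ".join(OUTFMT_FIELDS)
    return [
        "blastp",
        "-query", query_fasta,        # FASTA file with the query proteins
        "-db", database,              # name of a local protein BLAST database
        "-outfmt", outfmt,            # tab-separated output, one hit per line
        "-max_target_seqs", "5",      # keep a few hits; the best is chosen later
        "-num_threads", str(threads),
        "-out", out_path,
    ]


# Hypothetical file and database names, for illustration only.
cmd = blastp_command("human_reviewed.fasta", "cartilaginous_fish",
                     "human_vs_cartilaginous_fish.tsv")
print(shlex.join(cmd))

# To actually run it (requires BLAST+ installed and the database built):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list, instead of a single string, avoids quoting problems with the multi-word `-outfmt` value.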

To perform my queries, I downloaded from Uniprot a list of all reviewed human proteins: at the time I did that, the total number was 20171. Today, it is 20239. The number varies slightly because the database is constantly modified.

So, using the local BLAST executables and the BLAST databases, I performed multiple queries of all the human proteome against different groups of organisms, as detailed in my OP here:

This kind of query takes some time, from a few hours to a few days.

I then imported the results into Excel, generating a dataset where, for each human protein (20171 rows), I have the protein ID, name and length, and the best hit for each group of organisms, including protein and organism name, protein length, bitscore, expect value, number and percent of identities and positives, and gaps. IOWs, all the results that we get when we perform a single query on the website, limited to the best hit.
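The "best hit only" step can also be done in code before anything reaches a spreadsheet. The following Python sketch assumes tab-separated BLAST output with a known column order; the column list, the function name `best_hits` and the sample rows are made up for illustration.

```python
import csv
import io

# Assumed column order of the tab-separated results (illustrative).
FIELDS = ["qseqid", "sseqid", "evalue", "bitscore"]


def best_hits(tsv_text, fields):
    """Return, for each query ID, the hit with the highest bitscore."""
    best = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(fields, row))
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up rows: two hits for query P1, one hit for query P2.
sample = (
    "P1\tsubjA\t1e-50\t180.5\n"
    "P1\tsubjB\t1e-90\t310.0\n"
    "P2\tsubjC\t2e-10\t55.2\n"
)

best = best_hits(sample, FIELDS)
for query_id, hit in best.items():
    print(query_id, hit["sseqid"], hit["bitscore"])
```

Running this once per group of organisms would give one best hit per human protein and per group, which is the shape of the dataset described above.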

In the Excel dataset I have then computed some derived variables:

  • the bitscore per AA site (baa: total bitscore / human protein length)
  • the information jump in bits for specific groups of organisms, in particular between cartilaginous fish and pre-vertebrates (bitscore for cartilaginous fish – bitscore for non vertebrate deuterostomia)
  • the information jump in bits per amino acid site for specific groups of organisms, in particular between cartilaginous fish and pre-vertebrates (baa for cartilaginous fish – baa for non vertebrate deuterostomia)
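The three derived variables are simple arithmetic. Here is a Python sketch using the SATB2 figures quoted in this post (733 AAs, 158 bits as best pre-vertebrate hit, 1197 bits in cartilaginous fish). Note one assumption: the 158-bit hit comes from the whole non-vertebrate metazoa search, so it is used here only as a stand-in for the non vertebrate deuterostomia value. The function names are illustrative.

```python
def bits_per_aa(bitscore, human_length):
    """baa: total bitscore divided by the length of the human protein."""
    return bitscore / human_length


def information_jump(later_value, earlier_value):
    """Difference between the best hits of two groups (in bits or in baa)."""
    return later_value - earlier_value


# SATB2 values quoted in this post.
satb2_length = 733
prevertebrate_bits = 158          # stand-in for non vertebrate deuterostomia
cartilaginous_fish_bits = 1197

jump_bits = information_jump(cartilaginous_fish_bits, prevertebrate_bits)
jump_baa = information_jump(
    bits_per_aa(cartilaginous_fish_bits, satb2_length),
    bits_per_aa(prevertebrate_bits, satb2_length),
)
print(jump_bits, round(jump_baa, 2))
```

Dividing by the human protein length is what makes proteins of very different sizes comparable on the same scale.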

The Excel data are then imported into R. R is a wonderful open source statistical software and programming language that is constantly developed and expanded by statisticians all over the world. It also makes it possible to create very good graphs, like the following:

I have written the code for this kind of graph, which, using the above-mentioned dataset, can easily plot the evolutionary history, from cnidaria to mammals, of any human protein, or group of proteins, using their IDs. This graph uses the bits per amino acid values, and therefore sequences of different length can easily be compared. A reference line is always plotted, with the mean baa value in each group of organisms for all human proteins. That already makes it possible to see how the pre-vertebrate-vertebrate transition exhibits the greatest informational jump, in terms of human conserved information.
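The reference line is just a mean per group of organisms. The graphs themselves are made in R; the following Python sketch only shows the computation, with a toy table whose group names and baa values are invented for illustration and are not taken from the real dataset.

```python
from statistics import mean

# Groups in evolutionary order (a shortened, illustrative list).
GROUPS = ["cnidaria", "cartilaginous fish", "mammals"]

# Toy table: baa of the best hit for each protein in each group (made-up values).
baa_table = {
    "toy_protein_A": {"cnidaria": 1.8, "cartilaginous fish": 1.9, "mammals": 2.0},
    "toy_protein_B": {"cnidaria": 0.2, "cartilaginous fish": 1.6, "mammals": 1.9},
}


def reference_line(table, groups):
    """Mean baa over all proteins, one value per group, in the given order."""
    return [mean(row[group] for row in table.values()) for group in groups]


line = reference_line(baa_table, GROUPS)
for group, value in zip(GROUPS, line):
    print(group, round(value, 2))
```

Plotting the line for a single protein then amounts to reading its row in the same group order and drawing it over this reference.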

However, the plots regarding individual proteins are much more interesting, and they reveal huge differences in their individual histories. In the above graph, for example, I have plotted the histories of two completely different proteins:

  1. The green line is protein Cdc42, “a small GTPase of the Rho family, which regulates signaling pathways that control diverse cellular functions including cell morphology, cell migration, endocytosis and cell cycle progression” (Wikipedia). It is 191 AAs long, and, as can be seen in the graph, it is extremely conserved in all metazoa, presenting almost maximal homology with the human form already in Cnidaria.
  2. The brown line is our well known SATB2 (see above), a 733 AAs protein which is, among other things, “required for the initiation of the upper-layer neurons (UL1) specific genetic program and for the inactivation of deep-layer neurons (DL) and UL2 specific genes, probably by modulating BCL11B expression. Repressor of Ctip2 and regulatory determinant of corticocortical connections in the developing cerebral cortex.” (Uniprot) In the graph we can admire its really astounding information jump in vertebrates.

This kind of graph is very good to visualize the behaviour of proteins and of groups of proteins. For example, the following graph is of proteins involved in neuronal adhesion:

It shows, as expected, a big information jump, mainly in cartilaginous fish.

The following, instead, is of odorant receptors:

Here, for example, the jump is later, in bony fish, and it goes on in amphibians and reptiles, and up to mammals.

Well, I think that I have given at least a general idea of the main issues and procedures. If anyone has specific requests, I am ready to answer.

 

Comments
Origenes: Yes, that's the idea. You are correct. However, it is not strictly necessary that low homology in pre-vertebrates should be found. We can also have a jump "from scratch", IOWs a completely new functional protein in vertebrates, practically absent in pre-vertebrates. It's rarer, but there are a few examples. See for example the Activity-dependent neuroprotector homeobox protein (Q9H2P0), that I have mentioned in post #19, in answer to Mung. A greater number of new proteins can certainly be found in the transition to eukaryotes.
The point is: a big information jump points to design, whether it rests on already existing homology in previously existing proteins, or is "from scratch". There is no difference, from an ID point of view: it is simply the amount of functional complexity linked to the transition that counts. In the same way, it is not important if we write a 2 Megabyte module to add it to an existing program, or just as a standalone program. It is, always, a new original 2 Megabyte module of functional information. A similar amount of design is necessary to write it, whether it is as an addition to an existing program or as a standalone piece of software.
gpuccio
November 23, 2017, 04:41 PM PDT
Origenes, perhaps other readers, including myself, could learn from the interesting friendly discussion you have with gpuccio. Keep it going. Thanks.
Dionisio
November 23, 2017, 04:39 PM PDT
GPuccio,
So, information which originated 400 million years ago and is conserved up till humans is without a reasonable doubt truly functional information. Not only is it functional information, we also know that there is no alternative functional information in the proximity of the sequence in search space. IOWs the functionality depends heavily on the information of the conserved sequence.
Which means that, in order to show an information jump that is inexplicable by unguided evolution, we need low homology in pre-vertebrates, but not too low. It must be a functional protein in pre-vertebrates, but there must also be a 500+ bits information jump at the vertebrate transition. Good examples can be found in your article about SAT-A and SAT-B. Is that correct?
Origenes
November 23, 2017, 04:00 PM PDT
Origenes: No, the problem is not the functionality of the proteins. Of course, there is no reason to believe that each protein is not fully functional, in its own organism.
The problem is rather: how much does the conserved information correspond to specific functional information? IOWs, how much does the functionality depend on the conserved information?
In that sense, the longer the evolutionary time that the protein has been exposed to variation, the more we can assume that conservation is a good measure of functional information. Of course, that is absolutely sure for very long evolutionary conservation, for example what we see in the alpha and beta chains of ATP synthase, which can flaunt billions of years of conservation. The 400+ million years are a safe guarantee, too. 200 million years, 100 million years... I believe those are valid time frames too, but in those cases some part of the conservation could be less related to functionality. Less than 100 million years: conservation retains functional value, of course, but passive conservation is probably an important factor too. I hope this is clear.
gpuccio
November 23, 2017, 02:34 PM PDT
GPuccio, Thanks again for your time. I feel a bit guilty about taking advantage of your expertise. Hopefully there are others, besides me, who profit from my attempts and questions. I take it that when you write ...
"While 100 million years are certainly a lot of time for neutral variation to occur, still it is likely that part of the homology we observe can be attributed to passive conservation."
... your concern is that the functionality of the sequence is not thoroughly established. If so, what if there is an independent way to establish that the protein sequence is functional? What if functionality is established by e.g. disabling the protein in the organisms? Would that restore the claim that thousands of bits of functional information popped into existence?
Origenes
November 23, 2017, 11:49 AM PDT
Origenes: If you look at this article: https://www.ncbi.nlm.nih.gov/books/NBK21946/ in particular at Fig. 26.17 and 26.18, you will see that 100 million years could be just enough to ensure about 1 mutation per synonymous nucleotide, which is certainly a lot. With 400 million years, we are almost at 3 mutations per site, which ensures complete degradation of any passive homology!
gpuccio
November 23, 2017, 10:53 AM PDT
Origenes: "I take it that you reject the Cape golden mole (8255). Why is that?"
No, I don't reject it. But my methodology is to take the best hit in each class of organisms. The best hit represents the best conservation in that class. In each class, especially if numerous, you have a lot of lower hits, which can be explained in many different ways:
a) Different functional constraints in different species
b) Loss of functionality in specific species (a very common issue, as explicitly admitted by neo-darwinists)
c) Hits with different but similar proteins, or with different isoforms of the same protein
The best hit is what represents the class in all my reasonings. In Afrotheria, the best hit is 9264 bits. That means that, at the split between Afrotheria and Boreoeutheria, that level of information had to be present in the common precursor.
gpuccio
November 23, 2017, 10:49 AM PDT
GPuccio @84
"While 100 million years are certainly a lot of time for neutral variation to occur, still it is likely that part of the homology we observe can be attributed to passive conservation."
Oh, that's good to know. I didn't realize that.
"The first big hit can be found in marsupialia: monodelphis domestica, 4627 bits."
I missed that one. 4627 bits is an astounding information jump, but, as you explained, a jump during mammal transition is not so convincing as a jump during vertebrate transition.
One technical question:
GPuccio: "And, of course, in afrotheria we have a further jump: 9264 bits in Trichechus manatus latirostris."
I take it that you reject the Cape golden mole (8255). Why is that?
Origenes
November 23, 2017, 10:35 AM PDT
EugeneS: You are welcome! You were the real inspiration for this OP and discussion. :)
gpuccio
November 23, 2017, 10:00 AM PDT
Origenes: No mistakes. That is simply one of the many proteins that were engineered later, in particular in mammals. The first big hit can be found in marsupialia: monodelphis domestica, 4627 bits. And, of course, in afrotheria we have a further jump: 9264 bits in Trichechus manatus latirostris.
A lot of human proteins are engineered in the transition to mammals. In particular, sperm and testis related proteins are often late-engineered. So, you are perfectly right. The transition to mammals certainly deserves to be analyzed. I have thought many times of writing something about it.
The reason why I usually stick to the vertebrate transition is very simple: it is much older. There, we have 400+ million years. With mammals, much less. Maybe 100 - 130 million years. Which is not a short time, certainly. But 400 is better. 400 million years guarantees complete and full exposure to neutral variation. That can be easily seen when Ka/Ks ratios are computed. Ks reaches what is called "saturation" after 400 million years: IOWs, any initial homology between synonymous sites is completely undetectable after that time. That means that what is conserved after that time is certainly conserved because of functional constraint.
While 100 million years are certainly a lot of time for neutral variation to occur, still it is likely that part of the homology we observe can be attributed to passive conservation. IOWs, let's say that we have 95% identity between humans and mouse, for some protein. Maybe some of that homology is simply due to the fact that the split was 80 million years ago: IOWs, some AA positions could be neutral, but still be the same only because there was not enough time to change them. Of course, the bulk of conserved information will still be functionally constrained, but probably not all of it.
However, that does not prevent us from analyzing more recent transitions. 100 million years are still an interesting split time for that.
Of course, there is little point in analyzing primates that way, for example, because there most of the similarity will be passive, especially in species very near to humans, like chimp and gorilla. Another way to say that is that we can assume that conserved information accurately measures functional information, provided that a long enough evolutionary time separates the branches we are analyzing.
gpuccio
November 23, 2017, 09:58 AM PDT
GPuccio, Thank you very much! It is absolutely essential to have a reference like this for a non-expert reader.
EugeneS
November 23, 2017, 08:46 AM PDT
Here is a large human protein, 'Fibrous sheath-interacting protein 2' (FSIP2), that meets at least a few of GPuccio's standards: (1) It is reviewed. (2) It is non-modular. Information can be found at Uniprot. Function:
"This gene encodes a protein associated with the sperm fibrous sheath. Genes encoding most of the fibrous-sheath associated proteins genes are transcribed only during the postmeiotic period of spermatogenesis. The protein encoded by this gene is specific to spermatogenic cells. Copy number variation in this gene may be associated with testicular germ cell tumors. Pseudogenes associated with this gene are reported on chromosomes 2 and X. [provided by RefSeq, Aug 2016]" [source]
There are two isoforms. I blasted isoform 1; Length: 6,907. The results are astonishing if I am correct. But it seems very likely that I made some mistake (again). Anyway, here is a summary of the results I got:
homo sapiens: 14163
bonobo: 13951
gorilla: 13840
....
whales & dolphins: 9536
...
domestic cat: 9237
bats: 9120
....
the Cape golden mole: 8255
Here is where things are getting weird.
Rhincodon typus: 226
Surely, I have made a few mistakes.
Origenes
November 23, 2017, 06:54 AM PDT
gpuccio @78, Interesting point. Thanks.
Dionisio
November 23, 2017, 05:51 AM PDT
Origenes: Of course, the partial sequences cannot include all the spectrin domains. For example, take isoform 11, which is very short. If you blast the whole protein (Q8NF91) against the FASTA sequence of isoform 11 (from Uniprot), using the "Align two or more sequences" tool, you will see that isoform 11 is identical to the 7659-8797 part of the whole protein. Therefore, it includes only the last 8 spectrin domains.
The extremely low partial homologies with the rest of the protein (the black alignments under 50 bits and the one green alignment of 52 bits) are all expressions of the low homology existing between different spectrin domains in the molecule, as I have previously mentioned.
gpuccio
November 23, 2017, 05:22 AM PDT
Origenes: Isoform 1 is the reference sequence for the protein. As you can see in Uniprot, the other isoforms are mostly partial sequences, with missing parts, and some changes in other parts. I really don't know what their biological value is. One should look at the relative publications, but probably little is understood in most cases. For most practical purposes, what is known is about the reference sequence.
gpuccio
November 23, 2017, 05:10 AM PDT
Dionisio: It is interesting for some aspects, but really vague. The idea that LUCA, or any other step of OOL, could have been a complex system of interacting organisms is interesting. I would probably support that point. The problem remains, of course: how did such a complex system of interactions arise? :)
gpuccio
November 23, 2017, 05:03 AM PDT
GPuccio: "One caution. This protein is a good example of a specific problem we have to consider, especially with these huge molecules: they often include many repetitions of some domain. For example, SYNE1 includes 74 instances of the Spectrin domain! They make up the greatest part of the protein."
Thank you for taking a look at it. I see that I can find this information (74 instances of the Spectrin domain) at Uniprot under the heading 'Family & Domains'. Twelve different isoforms are offered. Is there any way to tell which isoform contains many instances of the Spectrin domain? Or does each isoform contain 74 instances of the Spectrin domain?
Mung is really good at this. :)
Origenes
November 23, 2017, 04:42 AM PDT
Dionisio: or, of course or! :)
gpuccio
November 23, 2017, 04:14 AM PDT
gpuccio, What do you think of this paper? https://www.frontiersin.org/articles/10.3389/fmicb.2015.01144/pdf Thanks.
Dionisio
November 23, 2017, 03:49 AM PDT
gpuccio @72:
"Of course, that can in some way amplify sequence similarities of differences."
"Of course, that can in some way amplify sequence similarities or differences."
"of" or "or"?
Dionisio
November 23, 2017, 03:02 AM PDT
Mung: "I bet he’s a handsome guy though, with a nice smile, and an easygoing manner. Mothers, protect your daughters."
If only all my critics were like you! :) :)
gpuccio
November 22, 2017, 11:37 PM PDT
Origenes: Yes, that is indeed a huge protein, and it certainly shows a very big jump in vertebrates. The best pre-vertebrate hit is 1732 bits, while in cartilaginous fish we have 10730 bits in Callorhinchus milii: a jump of 8998 bits, or 1.02 baa.
One caution. This protein is a good example of a specific problem we have to consider, especially with these huge molecules: they often include many repetitions of some domain. For example, SYNE1 includes 74 instances of the Spectrin domain! They make up the greatest part of the protein. IOWs, they are highly modular proteins. Of course, that can in some way amplify sequence similarities or differences.
Of course, the single modules are not really repetitive. For example, I have blasted the last module in SYNE1 (the C terminal one) against the protein itself, and homologies with all the other modules in the protein are rather small (20-30% identities). That means that each module is in some way differently engineered (if you consider that the conserved homology of the whole protein in Callorhinchus milii is 62% identities).
However, in principle these highly modular proteins, which are often proteins involved in the cytoskeleton, should be considered with caution, because they certainly have great inner informational redundancy, if compared with most other proteins where such domain repetitions are not present. That can also probably explain the rather big difference in human homology that we observe between the two sequenced sharks:
Callorhinchus milii: 10730
Rhincodon typus: 6791
For most proteins, those two organisms are rather concordant. Again, the high modularity of this protein can amplify differences (or similarities).
gpuccio
November 22, 2017, 11:35 PM PDT
So there you have it. Expose gpuccio to serious criticism and he folds like a worn out accordion. No wonder it sounds like he plays the same tune over and over. Does he have a doctorate in BLASTing? I think not. Are his publications on bioinformatics examples of leading edge in the field? Laughable. I bet he's a handsome guy though, with a nice smile, and an easygoing manner. Mothers, protect your daughters.
Mung
November 22, 2017, 07:54 PM PDT
Yes, let's blast a protein! SYNE1 (Nesprin-1) is a rather large human protein. Function:
"Multi-isomeric modular protein which forms a linking network between organelles and the actin cytoskeleton to maintain the subcellular spatial organization. As a component of the LINC (LInker of Nucleoskeleton and Cytoskeleton) complex involved in the connection between the nuclear lamina and the cytoskeleton. The nucleocytoplasmic interactions established by the LINC complex play an important role in the transmission of mechanical forces across the nuclear envelope and in nuclear movement and positioning. May be involved in nucleus-centrosome attachment and nuclear migration in neural progenitors implicating LINC complex association with SUN1/2 and probably association with cytoplasmic dynein-dynactin motor complexes; SYNE1 and SYNE2 may act redundantly. Required for centrosome migration to the apical cell surface during early ciliogenesis. May be involved in nuclear remodeling during sperm head formation in spermatogenesis; a probable SUN3:SYNE1/KASH1 LINC complex may tether spermatid nuclei to posterior cytoskeletal structures such as the manchette." [uniprot]
I blasted isoform 1, also known as Nesprin-1 Giant or Enaptin. Length: 8,797 AAs. What we end up seeing is an enormous information jump during the pre-vertebrate-vertebrate transition.
Extreme conservation in primates: homo sapiens 17954, gorilla 17801, rhesus macaque 17589.
Also very well conserved in: dog 16355, horse 16293, elephant 15969, gray short-tailed opossum 14830, crocodile 13575, Blue-crowned manakin (bird) 13010, Central bearded dragon (lizard) 12872, frog 11872, cichlid fish (bony fish) 9965.
And now, as promised, things get really jumpy: Whale shark 6791; West Indian Ocean coelacanth (fish) 6069; acorn worm 1732 (!); ant 853.
Origenes
November 22, 2017, 03:50 PM PDT
The only experts at TSZ are those expert in obfuscation, equivocation and denial.
ET
November 20, 2017, 05:43 PM PDT
Origenes: Yes, protein sequence conservation in spite of neutral variation is the best measure of functional information, which is what we are really interested in. I have worked a little with nucleotides when I made some evaluations of the Ka/Ks ratio, and it was really hard work. In general, it is not necessary to do that for our purposes.
gpuccio
November 20, 2017, 02:34 PM PDT
GPuccio, Thank you very much for your time. I must have made several mistakes, including choosing the wrong fasta sequence. Good that you point out the many problems with blasting nucleotides. It is clearer now. For all those reasons it is best to stick with protein sequences, as you say.
Origenes
November 20, 2017, 02:18 PM PDT
Origenes: Blasting nucleotides is rather tricky. That's why I usually stick to proteins.
Now, the gene you blasted is 12289 nucleotides long, and it is a genomic sequence. Most of it is introns, and there are at least 3 exons, and many different transcripts. In chimp, with megablast, you have 99% identity and 22957 bits. In gorilla I get 2821 (max) and 4181 (total). You see homology mainly in the exons, while the introns are already different. In mouse I get about 2000 bits, and it is still possible to distinguish 3 zones of homology (probably the 3 main exons, and some other near regions). In cartilaginous fish, you barely distinguish the exons. Introns seem not to be conserved, in large part, except for very near species (like chimp).
One thing that we must consider is that, at nucleotide level, there is the strong effect of neutral variation in exons (synonymous mutations), which does not affect the protein sequence, but does change the nucleotide sequence, proportionally to evolutionary distance. Remember that 200 - 400 million years can greatly alter any homology in synonymous sites. In general, also, non coding DNA is not much conserved, as we have seen, even if there are some important exceptions (like the ultra-conserved segments).
In blasting nucleotides, we also have to pay attention to the type of sequence we are blasting: DNA, cDNA, mRNA, and so on. There is a lot of heterogeneity, and that makes the task more difficult. Again, I do not have great experience in blasting nucleotides. For my purposes, blasting protein sequences is more useful.
gpuccio
November 20, 2017, 01:37 PM PDT
GPuccio, Would you be so kind as to tell me how to interpret the following? I blasted the human gene “sonic hedgehog (SHH)”. It codes for a protein with (of course) the same name, which, for one thing, plays a key role in the organization of the brain. Rather a long gene: 10034 bits. More info here. Also, on that page, you can find the fasta sequence, under the heading “Genomic regions, transcripts, and products.”
Okay, so I blasted it. The top scores: Homo sapiens 18530; Chlorocebus sabaeus (green monkey) 6514; Macaca fascicularis (crab-eating macaque) 3722; Gorilla 2710.
Those were some unexpected jumps, but here is another: most animals (e.g. rodents, bats, whales & dolphins) score about 1000. Excluding vertebrates resulted in 2 hits: a cricket 231 and a millipede 326.
Origenes
November 20, 2017, 11:21 AM PDT
daveS: "I wonder if your interlocutors actually feel so certain about their beliefs."
Well, I am not in their minds and in their hearts. They act as though they were, however.
"I don’t feel very confident in many of my beliefs,"
Me too, but I have a good trick, and a rather simple one too. When I don't feel very confident, I simply try to not consider those things as beliefs, but rather as tentative hypotheses. And I rarely speak of them. On the other hand, when I feel confident... :)
gpuccio
November 20, 2017, 10:00 AM PDT