Cube is a server for comparative analysis of protein sequences. A group of sequences can be compared to find regions they have in common ('conservation analysis'). Alternatively, if it is known that the sequences can be divided into several groups or classes, such as two paralogous families of proteins, or proteins from two different kingdoms, they can be contrasted to find the regions in which they differ and which confer the specific function on each group ('specialization analysis'). Follow this link to read more about the why and how of protein sequence comparison.

The purpose of such an exercise is, typically, to help in the rational design of single-site mutational experiments, to estimate the impact of SNPs, or to guide computational docking. Specialization analysis, in particular, may help in the search for separation-of-function mutants.

Below is the contents list for this help page. If you are new to the subject, consider starting with the worked examples, even if you do not have the slightest intention of actually working through them.



INPUT
 

The only required input on the conservation page is a set of sequences in fasta format. You can read more about the FASTA format on its Wikipedia page.

If you do not have sequences homologous to your protein, you can read more about finding them by clicking here.

Optionally, the sequences can be pre-aligned (the server accepts fasta and msf formats). If your sequences are already aligned, please tick the "my sequences are aligned" checkbox; otherwise the server will ignore your original alignment.

The reference sequence can be specified. All scores will be shown mapped onto this sequence. (The positions that do not exist in this sequence will be skipped. This is to avoid one extraneously long sequence in the input making the output unreadable.) If no reference sequence is given, the first sequence in the uploaded file will be used.
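
The mapping onto the reference sequence can be pictured with the following minimal Python sketch. It is an illustration only, not Cube's actual code: the function name map_to_reference and the choice of '-' and '.' as gap characters are assumptions made for this example.

```python
def map_to_reference(aligned_ref, column_scores, gap_chars="-."):
    """Keep only the alignment columns in which the reference has a residue.

    Returns (position, residue, score) tuples, with positions numbered
    from 1 along the ungapped reference sequence.
    """
    if len(aligned_ref) != len(column_scores):
        raise ValueError("need exactly one score per alignment column")
    mapped = []
    position = 0
    for residue, score in zip(aligned_ref, column_scores):
        if residue in gap_chars:
            continue  # this column does not exist in the reference: skip it
        position += 1
        mapped.append((position, residue, score))
    return mapped


# Hypothetical aligned reference with one gap, and made-up column scores.
rows = map_to_reference("MK-LV", [0.1, 0.5, 0.9, 0.3, 0.2])
```

Here the third alignment column is dropped because the reference has a gap there, so the output stays as long as the reference sequence no matter how long the other sequences are.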

In addition, the structure can be provided. Check here if you are not sure whether your protein has a known structure.

The default scoring method can be changed. The references for the implemented scoring methods can be found below.

Users are invited to provide any information they may already have about the protein residues (such as transmembrane regions, post-translational modification sites, catalytic residues, and similar), numbered according to any sequence in the alignment. This information is added to the downloadable table, alongside the conservation score, residue type, and surface accessibility information. Note: this information is, of course, uploaded to the computer that Cube runs on. If you believe you have something sensitive, you can add it to the spreadsheet after you download it.

On the specialization page, the input consists of sequences already divided into meaningful groups. The groups can be arbitrary, but typically they are expected to represent paralogous families of proteins in comparable taxonomical samples, or protein orthologues divided into clearly distinct taxonomical groups.

Each group of sequences should be in fasta format if the sequences are not aligned, or in fasta or msf format if they are.

Annotation, numbered according to any of the sequences in the alignment, can also be provided. It will appear in the xls spreadsheet, alongside the information about conservation/specialization and surface accessibility (if the structure was provided).


SEQUENCE NAMES
 

This is arguably a subtopic of the input section, but it is so important and potentially confusing that we believe it deserves a section of its own. In FASTA format, in particular, a sequence header might look something like this:

           >gi|114647465|ref|XP_528667.2| PREDICTED: 4-hydroxyphenylpyruvate dioxygenase isoform 3 [Pan troglodytes]. 
               

Note the '>' character: it is neither arbitrary nor optional. In FASTA format it marks the beginning of the header line. (You can read more about the FASTA format on its Wikipedia page.) In this case, everything before the first space (i.e. 'gi|114647465|ref|XP_528667.2|') will be taken as the input identifier for the sequence. However, this identifier is a clunker, loaded with special characters, and completely uninformative regarding the sequence it labels. Cube will therefore try to shorten the name to something like PAN_TRO_114647465.
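
The identifier rule (everything between '>' and the first space) can be sketched in Python as follows. This is only an illustration of the rule as stated here, not Cube's own parser; the function name header_identifier is made up for the example, and the shortening step that Cube applies afterwards is not reproduced.

```python
def header_identifier(header_line):
    """Return the part of a FASTA header between '>' and the first space."""
    if not header_line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    identifier = header_line[1:].split(" ", 1)[0]
    if not identifier:
        raise ValueError("the FASTA header has no identifier")
    return identifier


header = (">gi|114647465|ref|XP_528667.2| PREDICTED: "
          "4-hydroxyphenylpyruvate dioxygenase isoform 3 [Pan troglodytes].")
identifier = header_identifier(header)
```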

When adding the annotation, you can use 'gi|114647465|ref|XP_528667.2|' as the sequence name. Alternatively, you can use any unique part of it, such as '114647465'. The server will try to find the sequence you have in mind by pattern matching. This will fail, and result in an input processing error, if the selected name is not truly unique. Therefore, try to give unique names to your sequences. Also consider the possibility that sequences with identical names actually are identical sequences; in that case you can get rid of one of them.
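
Why a non-unique name fails can be seen in a small Python sketch. It assumes that "pattern matching" means plain substring matching, which may differ from what the server actually does; the function find_sequence and the second identifier in the list are made up for illustration.

```python
def find_sequence(query, names):
    """Return the single sequence name that contains `query`."""
    hits = [name for name in names if query in name]
    if len(hits) != 1:
        raise ValueError(
            f"'{query}' matches {len(hits)} sequence names; "
            "exactly one match is needed"
        )
    return hits[0]


names = [
    "gi|114647465|ref|XP_528667.2|",
    "gi|000000001|ref|XX_000001.1|",  # made-up identifier
]
match = find_sequence("114647465", names)
```

A query such as 'gi|' occurs in both names, so the lookup is ambiguous and raises an error instead of guessing.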


OUTPUT
 

Conservation. The server produces a 1D conservation map (the conservation score color coded and mapped on the sequence) in png format, the tabulated information (in xls format), and the conservation mapped onto the structure (as a PyMOL session). A consistent color coding is used in all three forms of the output.

Specialization. In the output, the specialization scores are shown side-by-side with the conservation values (Shannon entropy) for each residue, both in the tabulated output (xls spreadsheet) and mapped on the structure (PyMOL session). The scores are also immediately shown in the browser, mapped on the sequence, and in an html version of the output table.

Read more about the PyMOL session produced by Cube.

Read more about the xls spreadsheet produced by Cube.

If you are really interested in the inner workings of Cube, or if you would simply like to download everything with a single click, Cube offers the whole work directory for download. Read more.

Exon boundaries. If the exon boundaries are known, you can indicate them in your reference sequence by inserting a 'Z' character in their place. Also, if you use the ExoLocator server to create the input sequence alignment, it will provide that piece of annotation for you. In the conservation map, the exon boundaries are indicated as blue bars.


SERVER ERRORS
 

We are doing our best to keep the server as robust as possible against all sorts of formatting that crop up in bio databases around the world. Nevertheless, the server sometimes fails. If that happens, please send the error message you received to our contact address.

In the meantime, you can try to decipher the error message yourself (for example, "Please provide the input sequences"), and perhaps see if simplifying the input format helps.


WORKED EXAMPLES
 

Conservation in HPPD Enzyme. Learn how to map conservation onto protein sequence. Create Excel spreadsheets with conservation data, and conservation maps in png format.

Conservation in HPPD Enzyme Redux. Learn how to map conservation onto protein structure. Add annotation to the spreadsheet.

Specialization in lactalbumin. After gene duplication 300 to 400 million years ago, the enzyme lysozyme C gave rise to a gene that currently codes for α-lactalbumin, a regulatory protein expressed only in the lactating mammary gland. Learn how to map the specialization info onto protein sequence and structure. Create Excel spreadsheets with specialization data, and specialization maps in png format.


MORE HELP
 
finding sequences
why find sequences on your own
does my protein have paralogues?
which species did my sequence come from?
making an alignment and aligning alignments
viewing an alignment
finding structure
viewing structure
finding annotation
generic analysis checklist
what's this orthologue/paralogue business?
 

REFERENCES
 
Conservation:
  • rvet: real-valued evolutionary trace Mihalek et al. (J Mol Biol. 2004, 336:1265, PubMed)
  • entr: column entropy score based on Shannon's entropy; proposed independently by
                       Sander and Schneider (1991, Proteins 9:56-68, PubMed) and
                       Shenkin et al. (1991, Proteins 11:297-313, PubMed)
  • majority fraction: Wu and Kabat (J Exp. Med, 1970 132(2):211-50, PubMed)
  • valdar: Valdar & Thornton (2001, Proteins 42:108-124, PubMed)
  • integer-valued evolutionary trace: Lichtarge (J. Mol. Biol. 1996, 257:342, PubMed)
All methods with the name ending in "_s" have the amino acid alphabet size reduced from 20 to 9 by grouping residues into similarity groups. The similarity groups used are [V,I,L,A,M,G], [S,T], [D,E], [K,R,H], [F,Y], and [Q,N]. The remaining residue types are each in the group of their own.
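
The effect of the reduced alphabet on an entropy-type score can be illustrated with a short Python sketch. The similarity groups are the ones listed above; everything else (the function names, the natural logarithm, and the decision to ignore gaps) is an assumption made for the example and may differ from Cube's own implementation of the "entr" methods.

```python
import math
from collections import Counter

# Similarity groups as listed above. Residues that are not listed
# (C, P and W) each stay in a group of their own.
SIMILARITY_GROUPS = ["VILAMG", "ST", "DE", "KRH", "FY", "QN"]
GROUP_OF = {aa: group[0] for group in SIMILARITY_GROUPS for aa in group}


def reduce_residue(residue):
    """Replace a residue by the representative of its similarity group."""
    return GROUP_OF.get(residue, residue)


def column_entropy(column, reduced=False):
    """Shannon entropy of the residue frequencies in one alignment column.

    Assumptions: natural logarithm, gap characters are ignored.
    """
    residues = [r for r in column.upper() if r.isalpha()]
    if reduced:
        residues = [reduce_residue(r) for r in residues]
    counts = Counter(residues)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


standard_amino_acids = "ACDEFGHIKLMNPQRSTVWY"
reduced_alphabet = {reduce_residue(aa) for aa in standard_amino_acids}
full_score = column_entropy("VILA")
reduced_score = column_entropy("VILA", reduced=True)
```

A column containing V, I, L and A looks maximally variable in the 20-letter alphabet, but all four residues fall into the same similarity group, so in the reduced alphabet the column counts as fully conserved.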
        
Specialization:

CITING CUBE
  
If you have found Cube useful in your work, please cite
Zhang, Zong Hong, Aik Aun Khoo, and Ivana Mihalek. "Cube-An Online Tool for Comparison and Contrasting of Protein Sequences." PLoS ONE 8.11 (2013): e79480.