# Pseudo amino acid composition

﻿

Pseudo amino acid composition, or PseAA composition, was introduced by [http://gordonlifescience.org/members/kcchou/index.html Professor Kuo-Chen Chou] [1] in 2001 to represent protein samples for statistical prediction. The conventional amino acid (AA) composition contains 20 components, each reflecting the occurrence frequency of one of the 20 native amino acids in a protein. The PseAA composition instead contains more than 20 discrete factors: the first 20 are the components of the conventional AA composition, and the additional factors incorporate some sequence-order information via various modes. Typically, these additional factors are a series of rank-different correlation factors along a protein chain, but they can be any combination of other factors, so long as they reflect some sequence-order effect. The essence of PseAA composition is thus that it covers the AA composition while also carrying information beyond it, and hence can better reflect the features of a protein sequence through a discrete model. Since its introduction, PseAA composition has been widely used to predict various protein attributes, such as protein subcellular localization, membrane protein type, enzyme functional class, GPCR type, protease type, protein structural class, and protein secondary structural content, among many others (see the references cited in [2] and [3]). Meanwhile, various modes of formulating the PseAA composition have also been developed [2].

## Background

Methods for predicting the subcellular localization of proteins and their other attributes have generally used two kinds of models to represent protein samples: (1) the sequential model, and (2) the non-sequential or discrete model. The most typical sequential representation of a protein sample is its entire amino acid sequence, which contains the most complete information about it; this is the obvious advantage of the sequential model. Prediction with this model usually relies on sequence-similarity-search-based tools. However, this approach fails when a query protein has no significant homology to proteins whose attributes are known. Various discrete models were therefore proposed.

The simplest discrete model uses the AA (amino acid) composition to represent protein samples, formulated as follows. Given a protein sequence $\mathbf{P}$ with $L$ amino acid residues, i.e.,

$\mathbf{P} = \mathrm{R}_1 \mathrm{R}_2 \mathrm{R}_3 \mathrm{R}_4 \mathrm{R}_5 \mathrm{R}_6 \mathrm{R}_7 \cdots \mathrm{R}_L \qquad (1)$

where $\mathrm{R}_1$ represents the 1st residue of the protein $\mathbf{P}$, $\mathrm{R}_2$ the 2nd residue, and so forth, the AA composition model expresses the protein $\mathbf{P}$ of Eq.1 as

$\mathbf{P} = \left[ f_1, \, f_2, \, \cdots, \, f_{20} \right]^{\mathbf{T}} \qquad (2)$

where $f_u \, (u=1, 2, \cdots, 20)$ are the normalized occurrence frequencies of the 20 native amino acids in $\mathbf{P}$, and $\mathbf{T}$ is the transposing operator. Owing to its simplicity, the AA composition model was widely used in many earlier statistical methods for predicting protein attributes. Its main shortcoming is that all sequence-order information is lost when a protein is represented by its AA composition alone. To avoid losing the sequence-order information completely, the concept of PseAA (pseudo amino acid) composition was proposed by Professor Kuo-Chen Chou [1]. According to the PseAA composition model, the protein $\mathbf{P}$ of Eq.1 can be formulated as

$\mathbf{P} = \left[ p_1, \, p_2, \, \cdots, \, p_{20}, \, p_{20+1}, \, \cdots, \, p_{20+\lambda} \right]^{\mathbf{T}}, \quad (\lambda < L) \qquad (3)$

where the $20+\lambda$ components are given by

$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (1 \le u \le 20) \\ \dfrac{w \tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (20+1 \le u \le 20+\lambda) \end{cases} \qquad (4)$

where $w$ is the weight factor, and $\tau_k$ the $k$-th tier correlation factor that reflects the sequence-order correlation between all the $k$-th most contiguous residues (Fig.1), as formulated by

$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} \mathrm{J}_{i, i+k}, \quad (k < L) \qquad (5)$

with

$\mathrm{J}_{i, i+k} = \frac{1}{\Gamma} \sum_{\xi=1}^{\Gamma} \left[ \Phi_{\xi}\left(\mathrm{R}_{i+k}\right) - \Phi_{\xi}\left(\mathrm{R}_{i}\right) \right]^2 \qquad (6)$

where $\Phi_{\xi}\left(\mathrm{R}_{i}\right)$ is the $\xi$-th function of the amino acid $\mathrm{R}_i$, and $\Gamma$ the total number of functions considered. For example, in the original paper by [http://gordonlifescience.org/members/kcchou/index.html Professor Kuo-Chen Chou] [1], $\Phi_{1}\left(\mathrm{R}_{i}\right)$, $\Phi_{2}\left(\mathrm{R}_{i}\right)$ and $\Phi_{3}\left(\mathrm{R}_{i}\right)$ are respectively the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid $\mathrm{R}_i$, while $\Phi_{1}\left(\mathrm{R}_{i+1}\right)$, $\Phi_{2}\left(\mathrm{R}_{i+1}\right)$ and $\Phi_{3}\left(\mathrm{R}_{i+1}\right)$ are the corresponding values for the amino acid $\mathrm{R}_{i+1}$. Therefore, the total number of functions considered there is $\Gamma = 3$. It can be seen from Eq.3 that the first 20 components, i.e. $p_1, \, p_2, \, \cdots, \, p_{20}$, are associated with the conventional AA composition of the protein $\mathbf{P}$, while the remaining components $p_{20+1}, \, \cdots, \, p_{20+\lambda}$ are the correlation factors that reflect the 1st-tier, 2nd-tier, …, and $\lambda$-th tier sequence-order correlation patterns (Fig.1). It is through these additional $\lambda$ factors that some important sequence-order effects are incorporated.
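The formulation above can be sketched in a few lines of Python. This is a minimal illustration rather than the PseAAC server's implementation. The function names, the `props` argument (one residue-to-value table per function $\Phi_{\xi}$) and the `toy` property table are assumptions made for the example, and the toy values are placeholders, not a published hydrophobicity or mass scale. The normalization divides each frequency $f_u$ and each weighted factor $w\tau_k$ by $\sum_i f_i + w\sum_k \tau_k$, following the usual statement of the model in [1]. Any rescaling of the property values is left to the caller.

```python
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 native amino acids


def aa_composition(seq: str) -> List[float]:
    """Normalized occurrence frequencies f_1..f_20 of the 20 amino acids."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def coupling(a: str, b: str, props: Sequence[Dict[str, float]]) -> float:
    """Coupling factor J (Eq.6): mean squared difference of the property
    values of residues a and b over all Gamma = len(props) functions."""
    return sum((p[b] - p[a]) ** 2 for p in props) / len(props)


def correlation_factor(seq: str, k: int, props: Sequence[Dict[str, float]]) -> float:
    """k-th tier correlation factor tau_k (Eq.5): average coupling between
    residues that are k positions apart along the chain."""
    n = len(seq) - k
    return sum(coupling(seq[i], seq[i + k], props) for i in range(n)) / n


def pseaa_composition(seq: str, props: Sequence[Dict[str, float]],
                      lam: int, w: float) -> List[float]:
    """Return the (20 + lam)-dimensional PseAA composition of seq."""
    if not 0 <= lam < len(seq):
        raise ValueError("lam must satisfy 0 <= lam < len(seq)")
    f = aa_composition(seq)
    tau = [correlation_factor(seq, k, props) for k in range(1, lam + 1)]
    denom = sum(f) + w * sum(tau)  # shared denominator of all components
    return [x / denom for x in f] + [w * t / denom for t in tau]


# Hypothetical single-property table (Gamma = 1); placeholder values only.
toy = {"A": 0.0, "C": 1.0, "D": 2.0}
vec = pseaa_composition("ACDA", [toy], lam=2, w=0.5)
print(len(vec))  # 22 components: 20 + lam
```

In this sketch the weight `w` is passed explicitly rather than defaulted, since the text does not fix its value. To use three properties as in [1], `props` would be a list of three tables, each covering every residue that occurs in the sequence.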

## Web server

Note that $\lambda$ in Eq.3 is an integer parameter, and that choosing a different integer for $\lambda$ leads to a PseAA composition of a different dimension [2]. Also note that Eq.6 is just one of the modes for deriving the correlation factors or PseAA components. Others, such as the physicochemical distance mode [4] and the amphiphilic pattern mode [5], can also be used to derive different types of PseAA composition. In 2008, a free web server called PseAAC [6] was made available at http://chou.med.harvard.edu/bioinf/PseAAC/. With it, users can generate the PseAA composition of any given protein sequence in the mode of their choice.

Figure 1. A schematic drawing showing (a) the 1st-tier, (b) the 2nd-tier, and (c) the 3rd-tier sequence-order-correlation mode along a protein sequence, where $\mathrm{R}_1$ represents the amino acid residue at sequence position 1, $\mathrm{R}_2$ the residue at position 2, and so forth, and the coupling factors $\mathrm{J}_{i,j}$ are given by Eq.6. Panel (a) reflects the correlation mode between all the most contiguous residues, panel (b) that between all the 2nd most contiguous residues, and panel (c) that between all the 3rd most contiguous residues. Adapted from [1] with permission.

## References

[1] Kuo-Chen Chou, [http://gordonlifescience.org/members/kcchou/paper/Proteins_43.pdf Prediction of protein cellular attributes using pseudo amino acid composition], PROTEINS: Structure, Function, and Genetics 43 (2001) 246-255 (Erratum: ibid. 44 (2001) 60).
[2] Kuo-Chen Chou, Hong-Bin Shen, [http://gordonlifescience.org/members/kcchou/paper/AB_review_2007.pdf Review: Recent progresses in protein subcellular location prediction], Analytical Biochemistry 370 (2007) 1-16.
[3] Kuo-Chen Chou, Hong-Bin Shen, Cell-PLoc: [http://chou.med.harvard.edu/bioinf/Cell-PLoc/ A package of web-servers for predicting subcellular localization of proteins in various organisms], [http://gordonlifescience.org/members/kcchou/paper/Nature-Protocols_2008.pdf Nature Protocols 3 (2008) 153-162].
[4] Kuo-Chen Chou, [http://gordonlifescience.org/members/kcchou/paper/BBRC_quasi.pdf Prediction of protein subcellular locations by incorporating quasi-sequence-order effect], Biochemical & Biophysical Research Communications 278 (2000) 477-483.
[5] Kuo-Chen Chou, [http://gordonlifescience.org/members/kcchou/paper/Bioinf_21.pdf Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes], Bioinformatics 21 (2005) 10-19.
[6] Hong-Bin Shen, Kuo-Chen Chou, PseAAC: [http://gordonlifescience.org/members/kcchou/paper/AB_2008_PseAAC.pdf a flexible web-server for generating various kinds of protein pseudo amino acid composition], Analytical Biochemistry 373 (2008) 386-388.

Wikimedia Foundation. 2010.
