Functional Site Based Protein Structure Analysis With Self-Organizing Maps

The exponential growth of proteome databases has increased the demand for methodologies that can reveal the structural relationships between proteins. In general, large protein families need to be approached on several different levels in order to be fully understood, because their key characteristics and relationships are hidden within sophisticated structures. While similarities in the primary sequences of two proteins give basic clues about their relationship, three-dimensional structural information provides crucial details needed for determining protein functionality. Powerful and efficient computational methods for structural analysis are therefore becoming increasingly essential. The functions of proteins, and particularly of specific functional sites, are determined primarily by structural features, so structural similarities often point to functional similarities as well. This paper proposes a functional site based way of constructing a structural comparison model using the Self-Organizing Map (SOM), an unsupervised machine learning algorithm. The experiment was performed with two well-studied protein families. Structural alignment of the protein structures was performed prior to the analysis in order to minimize errors arising from differences in their coordinate systems. The SOM technique was then applied to the aligned structures. The results obtained with the SOM algorithm highlight the similarity and dissimilarity of the proteins. Finally, by analyzing clusters in the SOM grid, the structure-function relationship between proteins could be identified.

structures (the alpha-helices and the beta-sheets) of these proteins is not sufficient for uncovering the finer structural characteristics. These structural characteristics, including the adoption of a particular fold or conformation, can lead to a deeper understanding of the functional relationship between proteins [2]. Thus, since protein function is significantly related to its specific three-dimensional structure, a structure-based approach is crucial for identifying the relationship between proteins. The most common method for 3D protein structure comparison is the global Root Mean Square Deviation (RMSD), the square root of the mean squared distance between equivalent atoms, taken over all atom pairs in the global structure [25]. By focusing on the functional core rather than comparing global structure, the approach in this paper aims to reveal meaningful structure-function relationships.
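As a minimal sketch of the RMSD measure mentioned above, the following Python function computes it for two lists of already-paired atom coordinates. The function name and the toy coordinates are illustrative assumptions, and the sketch assumes the structures have already been superimposed.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared Euclidean distance between one pair of equivalent atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Made-up coordinates for two paired atoms in each structure.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(rmsd(a, b))
```

Restricting the two coordinate lists to the atoms of a functional site, instead of the whole chain, gives the local counterpart of the global measure.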
A protein family is a group of proteins with common sequence features and similar biological functions. A large protein family often has a hierarchical relationship and can be arranged in a tree representing their evolutionary origin and their subfamilies (e.g. the Ras superfamily is divided into five major subfamilies) [3].
Proteins generally interact with their substrates at a particular site called the active site. The functional site is considered a decisive factor for discerning which kinds of molecules they will interact with. Ultimately, we expect that a structural comparison of the functional sites will allow us to classify the protein family based on the structure-function relationship of the proteins.
One of the best-known proteomic structural databases is the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) [4]. PDB can be found on the web site http://www.rcsb.org, which contains information about the 3D structures of large biological molecules. PDB provides information on over 100,000 protein structures and continues to expand. With such a rapidly growing proteomic database, more effective analytical methods are becoming increasingly necessary for identifying the relationships between proteins.
The primary advantage of using SOM in this research is the ability to represent the similarities between the protein structures. There was an initial experiment on functional center-based analysis of protein structure using Self-Organizing Maps (SOM) [5]. This novel method recognized a functionally important local structure, the functional center, and extracted the surrounding structure within a certain radius. After performing structural alignment on the selected functional local structures, SOM was finally applied to these aligned structures. However, the primitive local structural alignment techniques, which had been performed manually with DS Viewer [6], were major hurdles to performing a fast and accurate analysis. Converting three-dimensional structural coordinates into linear vectors in order to construct feature vectors for SOM was also very difficult. In this paper, a new local structural alignment tool was used to improve the effectiveness of the research. In addition, the straightforward feature vector construction for SOM introduced here made the complex steps remarkably simple.
SOM is an artificial neural network algorithm based on unsupervised learning. Unsupervised learning learns from data without pre-defined categories, whereas supervised learning relies on specified classes. The SOM technique is often used as an analysis algorithm because it has many capabilities that other structural classification tools such as SCOP [7] and CATH [8] do not have. A major advantage of using SOM is its ability to reduce dimensionality. In addition, SOM can process multiple objects at the same time and offers graphical representations that are easy to interpret. With Popsom [9], a new SOM package, a map can be constructed and its reliability evaluated by computing the convergence rate of the map. The map can be trained until it has converged well, and this converged map can later serve as a criterion for selecting models that enhance the accuracy of the analysis in this research.
The objective of this research is to elucidate structural-functional relationships by using unsupervised machine learning to classify proteins from families into subfamilies according to their structural features, given 3D coordinate information on the proteins.
In this paper, a unique structure-based approach is suggested, focusing on the structure of the functional site via the SOM algorithm with automated structural alignment techniques.

Background

Self-Organizing Maps
The Self-Organizing Map (SOM) [10], introduced by Kohonen, is one of the most prominent artificial neural network algorithms based on unsupervised learning. The main goal of unsupervised learning is to discover hidden patterns underlying data without an explicit target definition. SOM is used in a wide variety of fields such as market analysis, image processing, and bioinformatics, fields that typically require finding clusters that group data by similarity. The main idea of the SOM technique is to project multi-dimensional data onto a low-dimensional map, where the map represents the similarity or dissimilarity of the input. For each observation, a corresponding neuron is calculated in the SOM, and a simple topological map provides a low-dimensional representation of the input data. By competitive learning, the SOM algorithm finds the best matching neuron and updates this winning node and its neighboring neurons.
Training a map is similar to a regression process. Let $x$ be an $n$-dimensional input vector. At each iteration, $x$ is compared with all the reference models $m_i$, which have the same dimensionality as the input vector and are randomly initialized at the beginning. The best matching unit, or winning node, is then found using the Euclidean distance between $x$ and each reference model $m_i$:

$$c = \arg\min_i \lVert x - m_i \rVert,$$

where $c$ is the index of the winning reference model, that is, the reference model with the shortest distance to the input vector $x$.
Next, the weights of all the reference models $m_i$ are adjusted according to

$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x - m_i(t)],$$

where $t = 0, 1, 2, \ldots$ is the step index. Here $h_{ci}(t)$ is the neighborhood function, defined as

$$h_{ci}(t) = \begin{cases} \alpha & \text{if } |c - i| \le \beta, \\ 0 & \text{otherwise,} \end{cases}$$

where $\alpha$ is the learning rate, $\beta$ is the neighborhood radius, and $|c - i|$ denotes the distance between nodes $c$ and $i$ on the map grid. The neighborhood function selects the reference models that need to be updated, namely only the nodes that lie within the neighborhood radius $\beta$ of the winner. The neighborhood function shrinks over time (that is, both $\alpha$ and $\beta$ are functions of the step index $t$), and the adjustment step is repeated for the specified number of iterations.
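The training procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Popsom implementation: the function names, the linear decay of the learning rate and radius, and the toy data are all assumptions made for the example.

```python
import math
import random

def best_matching_unit(models, x):
    """Index c of the reference model closest to x (Euclidean distance)."""
    return min(range(len(models)), key=lambda i: math.dist(models[i], x))

def som_step(models, nodes, x, alpha, beta):
    """One update: move the winner and its grid neighbours within radius beta toward x."""
    c = best_matching_unit(models, x)
    for i, m in enumerate(models):
        if math.dist(nodes[i], nodes[c]) <= beta:  # h_ci = alpha inside the neighbourhood
            for k in range(len(m)):
                m[k] += alpha * (x[k] - m[k])
    return c

def train_som(data, rows, cols, steps, alpha0=0.5, seed=0):
    """Train a rows x cols map; alpha and beta shrink linearly (an assumed schedule)."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    models = [[rng.random() for _ in range(dim)] for _ in nodes]  # random initialisation
    beta0 = max(rows, cols) / 2.0
    for t in range(steps):
        frac = 1.0 - t / steps
        som_step(models, nodes, rng.choice(data), alpha0 * frac, beta0 * frac)
    return nodes, models

# Toy data: two groups of 3-dimensional points scaled to [0, 1].
data = [[0.1, 0.1, 0.1], [0.15, 0.1, 0.05], [0.9, 0.9, 0.9], [0.85, 0.95, 0.9]]
nodes, models = train_som(data, rows=3, cols=3, steps=500)
for x in data:
    print(x, "->", nodes[best_matching_unit(models, x)])
```

After training, each observation is reported with the grid position of its best matching unit, which is the information a SOM plot displays.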
A further major advantage of SOM is data visualization. The low-dimensional SOM result can be interpreted intuitively. In addition, SOM achieves dimension reduction by projecting high-dimensional input data onto a two-dimensional grid that represents the essential clusters underlying the input data with minimal loss of information. The color gradient of the grid units in the map shows the relative distances between reference vectors. Lighter colors represent greater similarity or closeness, while darker colors represent greater dissimilarity or distance.
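One common way to obtain such distance-based shading is a unified distance matrix (U-matrix), in which each grid unit is assigned the mean distance between its reference vector and those of its direct grid neighbors. The sketch below assumes this construction and a row-major list of reference vectors; whether the maps in this paper were shaded in exactly this way is an assumption.

```python
import math

def u_matrix(models, rows, cols):
    """Mean Euclidean distance from each node's reference vector to its grid neighbours.

    `models` is a row-major list of reference vectors for a rows x cols grid.
    """
    umat = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(models[r * cols + c], models[nr * cols + nc]))
            umat[r][c] = sum(dists) / len(dists) if dists else 0.0
    return umat

# Made-up reference vectors for a 1 x 2 grid.
print(u_matrix([[0.0, 0.0], [3.0, 4.0]], rows=1, cols=2))
```

Small values would then be drawn in lighter colors and large values in darker colors.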

Structural-Functional Relationship of Protein Family
A protein family is typically defined by similarities in the sequences of amino acids or similarities in their biological functions. Members of the same protein family are evolutionarily related, sharing a common ancestor, and can thus often be arranged in a hierarchical system. For the most part, protein families can be divided into subfamilies and sometimes into even smaller families. For instance, the Ras superfamily is divided into 5 major subfamilies: Rho, Ras, Rab, Ran, and Arf. These divisions are made according to structural and functional similarities, with each subfamily involved in a specific function [11].
Some computational methods for protein family classification are sequence-based: they find the relationship among proteins based on similarity in amino acid sequence profiles [12]. However, it is well known that similarities in sequence do not necessarily indicate structural similarity [13]. Therefore, searching for sequence similarity alone is insufficient for determining other important functional properties that are more closely related to the three-dimensional structure.
The classification of protein families based on structural similarity is a major issue in computational biology. Comparing the 3D structure of proteins requires more intensive computation than sequence comparison. In general, the 3D structure of functional sites in a protein is highly conserved during evolution and is more closely related to the function of the protein. Comparing the structure of specific functional sites, such as the active site or the binding pocket, helps to identify functional properties, since most proteins interact with other molecules and function by binding at these sites. As the name suggests, a binding site is shaped so that other molecules or proteins can recognize it.
Thus, the structural similarity of proteins is a good measure for the classification of proteins. Furthermore, we believe that it is highly useful for predicting the functionality and classification of newly discovered protein structures.

Protein Data Bank
Structure files from PDB can be viewed using visualization tools such as Jmol. These files are downloadable from the server in a variety of formats. Jmol [14] is an interactive 3D viewer for molecular structures and can read over 60 file formats including PDB, CIF, SDF, MOL, and PyMOL. Jmol provides a variety of options for presenting protein structure. A typical PDB formatted file consists of several sections. The title section has a summary of the protein, the summary section goes over primary and secondary structure, the connectivity section describes the bonds and links between sheets and helices, and the coordinate section lists the atoms of the protein along with their 3D coordinates.
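To illustrate how the coordinate section can be read, the sketch below extracts the α-carbon coordinates of one chain from fixed-column ATOM records. The function names are hypothetical, the demo records are made up, and `atom_line` exists only to build correctly aligned sample lines; a real analysis would read the lines from a downloaded .pdb file.

```python
def read_ca_coordinates(pdb_lines, chain_id="A"):
    """Return (residue number, residue name, (x, y, z)) for each alpha-carbon in a chain."""
    result = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skip header, HETATM, and connectivity records
        if line[12:16].strip() != "CA" or line[21] != chain_id:
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        result.append((int(line[22:26]), line[17:20].strip(), xyz))
    return result

def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper that formats one fixed-column ATOM record for the demo."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

# Made-up records: one backbone nitrogen and two alpha-carbons on different chains.
demo = [
    atom_line(1, " N", "GLY", "A", 10, 1.0, 2.0, 3.0),
    atom_line(2, " CA", "GLY", "A", 10, 11.104, 13.207, 2.1),
    atom_line(3, " CA", "ALA", "B", 11, 0.0, 0.0, 0.0),
]
print(read_ca_coordinates(demo, chain_id="A"))
```

Slicing by column position rather than splitting on whitespace matters because adjacent fields in PDB records can run together without a separating space.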

Functional Site Based Analysis
A protein is a large and complex molecule composed of amino acid sequences that fold up into a unique three-dimensional structure. It is believed that this unique three-dimensional structure determines its biological properties and thus that protein function can be identified by detecting local structural similarities [15]. For instance, because proteins bind to other molecules in order to work as molecular switches, the binding sites that interact with those molecules are deeply related to protein functionality. An approach that classifies protein kinases based on their binding pockets is a good example [16]. Thus, recognizing a functionally important local structure, such as the binding sites and functional motifs of a protein, is essential in the structure-based analysis of proteins, and this aspect of protein behavior forms the core of the approach in this paper.

Functional Site of Ras Superfamily
The Ras superfamily of small GTPases is a large and diverse group of proteins that act as molecular switches for regulating cellular functions [11]. This superfamily is divided into five major families based on their structural and functional similarities: Rho, Ras, Rab, Ran, and Arf. Rho, Ras, and Rab are the most closely related among the five [17]. The protein members of the Ras superfamily share a high primary sequence identity of 40%-85%, while each subfamily has individual functions and different targets [18]. All members of the Ras superfamily have highly conserved common structural cores and function as GDP/GTP-regulated molecular switches. For example, a GTP-binding protein binds to either guanosine diphosphate (GDP) or guanosine triphosphate (GTP), so that the protein becomes either inactive or active, respectively [19].
There is a particular motif in the proteins of the Ras superfamily that determines the features of each subfamily. Each subfamily either acts as a molecular switch for a unique target or intervenes in a cell process, such as cell proliferation.
Members of this superfamily conserve five G domains, G1-G5, which are fundamental subunits [11]. G domains are highly conserved regions related to nucleotide binding, a process that is involved in the GDP/GTP cycle. The G1 domain contains the phosphate binding loop (p-loop), a common motif in GTP-binding proteins with the consensus sequence GXXXXGK[S/T], where X denotes any amino acid and [S/T] means S or T. A comparative analysis based on functional sites begins with finding the p-loop motif and comparing its three-dimensional shape. Table 1 shows the hierarchical relationship of the Ras superfamily and the list of PDB IDs chosen for analysis in this research project.

have the relative location of the atoms in the whole structure. Structural alignment is performed in two ways: locally and globally. By performing structural alignment both locally and globally, it is possible to discover any differences that may come up in the results. Alignment is performed based on the backbone structure of the protein: the skeletal structure composed of the α-carbon of each residue.
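Returning to the p-loop consensus GXXXXGK[S/T] introduced above, locating the motif in an amino acid sequence can be sketched with a regular expression. The function name is hypothetical and the sequence is a short illustrative fragment resembling the N-terminus of a Ras protein; note that the reported position is the 1-based index within the given string, which need not equal the residue numbering in a PDB file.

```python
import re

# G, then any four residues, then G, K, and finally S or T.
P_LOOP = re.compile(r"G.{4}GK[ST]")

def find_p_loop(sequence):
    """Return (1-based start position, matched residues) of the first p-loop, or None."""
    match = P_LOOP.search(sequence)
    if match is None:
        return None
    return match.start() + 1, match.group()

# Illustrative one-letter sequence fragment (an assumption for this example).
fragment = "MTEYKLVVVGAGGVGKSALTIQLIQ"
print(find_p_loop(fragment))
```

The eight matched residues identify which α-carbons to extract for the subsequent structural comparison.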

Local Structural Alignment
The main purpose of local structural alignment is to minimize error by aligning smaller, selected regions, without taking the rest of the structure into consideration, before the proteins are compared. In this paper, the local structure is the functional site under observation, and local structural alignment is performed in a pairwise manner against one reference structure chosen from the proteins selected for the analysis.
Protein Local Alignment Tool (PLAT) [21] is a newly developed, web-based local structure alignment tool that performs pairwise alignment. PLAT provides simple but convenient ways to align local structures and makes the process of selecting specific residues to be aligned much easier. The protein data, more specifically the PDB ID and the chain type, is pulled straight from the PDB, and the local region to be aligned is selected as well. Aligned structures can then be viewed in Jmol and saved as a .pdb file. Figure 1 includes a screenshot of PLAT and an example of an aligned structure viewed using Jmol. The regions shaded in yellow indicate the local structures chosen to be aligned. In order to perform an alignment, the same number of residues must be chosen in both structures.
After performing a local alignment, PLAT shows the new origin of the coordinate system and the rotation matrix. In this case, the p-loop structure of every protein is aligned based on the structure of 121P.

structural alignment, respectively. The six corners and two end points, for a total of eight points, represent the α-carbons of the p-loop structure. Jmol is used to visualize the aligned xyz coordinates. The numbers indicate the distance in Å between the two corresponding α-carbons, and we can note in (b) that the local alignment technique tends to align the structure more precisely.
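The internals of PLAT are not described here, but a pairwise superposition that yields a translation and a rotation matrix can be sketched with the standard Kabsch algorithm. The sketch below is an assumption about how such an alignment could be computed, not a description of PLAT itself; it uses NumPy, and the function name and toy coordinates are made up for illustration.

```python
import numpy as np

def kabsch_align(mobile, reference):
    """Superimpose `mobile` onto `reference` (both N x 3, paired row by row).

    Returns the aligned copy of `mobile` and the 3 x 3 rotation matrix.
    """
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mob_center = mobile.mean(axis=0)
    ref_center = reference.mean(axis=0)
    P = mobile - mob_center          # centred mobile coordinates
    Q = reference - ref_center       # centred reference coordinates
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against an improper rotation
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    aligned = (R @ P.T).T + ref_center
    return aligned, R

# Made-up reference points and a rotated, shifted copy standing in for a second p-loop.
reference = np.array([[0, 0, 0], [1.5, 0, 0], [1.5, 2, 0], [0, 2, 1],
                      [3, 1, 2], [2, 3, 1], [4, 0, 1], [1, 1, 3]], dtype=float)
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
mobile = reference @ rot_z.T + np.array([5.0, -3.0, 2.0])
aligned, R = kabsch_align(mobile, reference)
print(np.abs(aligned - reference).max())
```

Passing only the eight p-loop α-carbons corresponds to a local alignment, while passing all paired α-carbons of the two chains corresponds to a global one.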

Preprocessing the Protein Structure Information and Feature Vector Construction
An innovative method is needed to describe the 3D structure of proteins, especially when the structural data is complex. The major steps for preprocessing protein data are summarized in Figure 4.

The primary advantage of using the Self-Organizing Map (SOM) is the ability to train models for which the categories are not defined. SOM groups similar data together and creates grid maps representing these similarities. Specifically, the geometric similarity of two proteins can be described as the distance between their corresponding atoms [23]. Because of the relationship between structure and function, proteins are classified into families by structural similarities in their functional sites.
The Ras superfamily is a large superfamily consisting of structurally distinguishable families. One way to examine the structural-functional relationship of such proteins is to observe the clustering of the Ras superfamily through the SOM algorithm. All pairwise structural alignments using local and global techniques are performed based on the structure of 121P.
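As a sketch of the feature vector construction for this analysis, the coordinates of the eight p-loop α-carbons can be unfolded into a single linear vector that SOM compares by Euclidean distance. The function name and the coordinates below are hypothetical and serve only to show the shape of the data.

```python
import math

def feature_vector(ca_coords):
    """Unfold [(x1, y1, z1), (x2, y2, z2), ...] into [x1, y1, z1, x2, y2, z2, ...]."""
    return [float(v) for point in ca_coords for v in point]

# Hypothetical, made-up coordinates for eight alpha-carbons of two aligned p-loops.
loop_a = [(float(i), 0.0, 0.0) for i in range(8)]
loop_b = [(float(i), 0.5, 0.0) for i in range(8)]
vec_a, vec_b = feature_vector(loop_a), feature_vector(loop_b)
print(len(vec_a))               # 8 residues x 3 coordinates
print(math.dist(vec_a, vec_b))  # distance between the two unfolded structures
```

Because the structures are aligned before unfolding, the same vector position always refers to the same coordinate of the same residue, which is what makes the distance meaningful.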

SOM with Local Structural Alignment
Local structural alignment focuses on more specific regions without taking into consideration any peripheral structures. 121P is selected as the base structure, and an alignment with each protein based on the p-loop structure is performed in a pairwise manner. As a result, the aligned structure of each protein is preserved, while the coordinate data of the p-loop structure is taken out of the aligned structure. The feature vector is composed of the three-dimensional coordinates of the eight α-carbons in the eight residues of the p-loop motif. Figure 6 shows thing to note is that the Rab family tended to disperse more than the other families.
It is also important to note that the Rho, Ras, and Rab families tended to be closer to, or more mixed with, each other because they are the most closely related of the five subfamilies.

Although the SOM following global alignment looks a bit more organized than the one following local alignment, it is not completely obvious which map, and thus which alignment technique, represents the clustering better. Because of this, another clustering method is adopted to make the difference clearer.

SOM result of Protein Kinase Family
In order to validate the assumptions of functional site based analysis, another protein group is adopted. STE group is one of the protein kinase families, and it

Both SOM and hierarchical clustering trees allow for easy visualization and interpretation. Dendrograms are relatively easy to read and interpret until the clustering tree grows much bigger and more complicated. SOM maps, on the other hand, although harder to understand at first, are more useful than dendrograms when the number of observations becomes much larger.

Conclusions
We have developed a unique method for comparing proteins and for discovering similarities and differences between functional sites via an unsupervised machine learning technique using SOM. SOM is well suited to recognizing patterns in data. It maps structural patterns in protein families onto low-dimensional grid maps by placing proteins with similar structural patterns closer together. It is difficult to understand the relationships embedded in high-dimensional data simply by inspection. SOM helps to identify such relationships, especially among complex protein structures, through visualizations that minimize the loss of information.
The nature of protein conformation indicates that structure and function are deeply related. The function of a protein is determined primarily by its tertiary structure, and then, although to a lesser extent, by its primary sequence. In this way, the functional core of the protein plays a critical role in classifying proteins into their respective subfamilies. The study of structural analysis based on functional sites of proteins began by merely identifying functionally important local structures. SOM expanded this study by investigating and comparing the three-dimensional shape of these functionally important local structures. Prior to the construction of SOM models, structural alignments were used solely to minimize errors existing in the coordinate system between protein structures. PLAT, a newly developed web-based protein local alignment tool, now allows users to select specific residues and align structures focused on these residues. Unlike local alignment, global alignment demonstrates that other domain structures affect the alignment of functional sites. It is therefore notable that the small distortions in the functional sites extracted from globally aligned structures contributed to better clustering results than the locally aligned structures did.
The convergence rate of the SOM was used to confirm the reliability of the SOM results. In a functional site-based analysis, similarities between proteins are found by using relatively small local structures and excluding all other unrelated structures. SOM successfully identified the clusters of subfamilies of two protein groups, the Ras superfamily and the STE kinase family, supporting the structure-function relationship of proteins and the effectiveness of the functional site based approach.
The most notable improvements over the preliminary research are the automated local structure extraction technique and the structural alignment technique (e.g., for the backbone of the p-loop motif). This paper also introduced a simple but effective feature vector construction for SOM, obtained by unfolding the coordinate data of the protein structure.

Future Work
Although the analysis was conducted on two large protein families, only a limited number of proteins from each family were chosen. In order to consolidate the conclusions reached in this research project, more protein structures or protein groups should be added. Alternatively, more domain structures can be added (e.g., the whole G-domain structure of the Ras superfamily can be used) so that the analysis is not restricted to just one functional site. Study of these strongly predictive structural features will provide guidance in the classification of newly discovered protein structures.
Both of these extensions would enable a broader understanding of the classification of protein structures.