Tools for Anti-Microbial Peptide Editable Database

The goal of this thesis is to create tools for the Anti-Microbial Peptide editable database (AMPed), developed by Professor Lenore Martin's microbial peptide research group at the University of Rhode Island. The primary users of the database are researchers and biologists, who need to calculate specific results and visualize their data, so a functioning and efficient database should offer tools relevant to those needs. The tools developed as part of this thesis are the MIC (Minimal Inhibitory Concentration) and MLC (Minimal Lethal Concentration) calculator and the mass spectrometry fragmentation prediction tool. With these tools, researchers can upload their lab data, perform the calculations, and see the results as graphical visualizations. In addition, the database is normalized: because this is an editable database, we may need to change the schema (adding and deleting attributes) and re-normalize it. We use the agile software development process to support these iterative changes.

peptides may also have the ability to enhance immunity by functioning as immunomodulators [1]".
AMPed (Anti-Microbial Peptide editable database) [2], developed by Professor Lenore Martin's microbial peptide research group at the University of Rhode Island, was designed to provide a user-friendly resource for researchers seeking information about antimicrobial peptides (AMPs). AMPed will serve as a catalogue of all known AMPs and other peptides, populated with information gathered from journals, research papers, and other databases. Existing protein databases, with their broader subject matter, are often difficult and time-consuming to search effectively for AMPs. AMPed will organize and manage the available AMP information. To collect and disseminate pertinent information most effectively, AMPed will allow researchers to upload relevant research data directly to the database. Our database, still in the data curation phase, seeks to collect a variety of information on each peptide.
For example, the database holds information on peptides, including their name, sequence, and molecular weight, and on microbes, including species name, type, and wall classification. Additionally, the database will draw data from other sources to strengthen its capabilities. Structural information will be gathered from the Protein Data Bank, a repository with over 130,000 entries [3]. Through manual curation of entries into our database and the pooling of data from established data banks, AMPed will prove a valuable research tool for a variety of applications.
The goals of this thesis are as follows:
• Create MIC (minimal inhibitory concentration) and MLC (minimal lethal concentration) calculator tools.
• Create a mass spectrometry prediction tool.
• Normalize the database.
• Add important enhancements to the website, such as the mission and vision, privacy policy, and terms of use.
The MIC and MLC of an antibiotic or therapeutic are the lowest concentrations of the drug that will, respectively, inhibit the visible growth of an organism after overnight incubation or kill the organism [4]. These values can guide the choice of antimicrobials used in therapy. Additionally, regular surveillance of MICs is required because susceptibility to commonly used antibiotics continues to decrease in critically ill patients [5]. A tool able to calculate the MIC of a given peptide against a given organism would be valuable in this context. Moreover, within the laboratory, MICs can serve as a research tool to determine the in vitro activity of new antimicrobials. Our MIC calculator tool will therefore serve a dual purpose: guiding the use of antimicrobials in therapy and characterizing new antimicrobials in the research laboratory.
Our other tool is built around mass spectrometry, an analytical technique used in a variety of applications throughout many branches of science. For example, in chemistry, mass spectrometry can be used to determine the identity of an unknown compound. In biotechnology specifically, it can be used to determine the sequence of amino acids in a peptide. With respect to AMPs, mass spectrometry is best applied in two major areas. First, it can be used to determine the amino acid sequence of an unknown AMP, information that is crucial in the discovery of novel antimicrobial agents. Second, when AMPs are synthesized for research and drug characterization, mass spectrometry can be used to validate that the purified peptide has the intended sequence. This matters because the sequence and structure of a peptide often determine its efficacy as an AMP. The mass spectrometry tool would serve both of these areas: it would assist a researcher in identifying novel AMPs and in validating the isolation and purification of peptides of interest.

Goal
Understand the current environment of AMPed and integrate changes into it.

AMPed Current Environment
The AMPed database has been expanded for a couple of years and is under the  [4].

Hosting Server
As mentioned before, my predecessor on this project, Abraham Herrera, helped to move the AMPed website to a new virtualized server running GNU/Linux.
The technical specifications of the hosting server are as follows:
• Processor: Pentium Xeon 3.3 GHz, 4 cores.

GUI front
Any further addition to the current GUI should follow the tone and structure of the existing environment. Every new page should follow the same color scheme and use the same icons, so that it is uniform with the chosen AMPed schema. When the user accesses the database, it should not look as if pages were added at random with a different tone; that would make the whole setup look out of sync, which is not user friendly.  accessed by the user without creating an account, but for some features, such as the tools or the contribute section, creating an account is necessary. This requirement is for safety and security: it controls who contributes data to the database and who uses the tools, so that there will be no contamination or integrity problems with the data.

Portability
The AMPed graphical user interface has a mobile-portable design [3], which also extends to the new content (tools) developed for this project. This means that users can upload a properly formatted data file from their mobile device to the user interface, where server-side scripting processes it, and can download the result to the mobile device as well.

Privacy Policy
A privacy policy is an important legal document on any website that collects the user's personal information. It provides information about how a website visitor's personal information will be used.
Privacy policies differ according to how information is collected and used [5] [6]. Our privacy policy can be found in the footer of the website; the full privacy policy is in Appendix D. "In the US, the California Online Privacy Protection Act (CalOPPA) [7] dictates that if you collect any personal information from any California-based users, such as email addresses, GPS location, phone numbers, or mailing addresses, you are required to have a legal statement available for users to review that discloses the privacy practices of your business" [8]. We adhere to the California standard.

Terms of Use
Terms of Use are a set of rules that a user has to abide by in order to use the website. "Terms of service can also be merely a disclaimer, especially regarding the use of websites" [9].
Information on a website is sometimes derived from different sources, and the website owner cannot be held liable for that information. Users should not assume that the information is always correct. Terms of use are important for making users aware of this. Nowadays, when creating an account or accessing some functions and resources on a website, users are often presented with "Agree" or "Do not Agree" prompts. This kind of feature is not implemented in our case, but we do have complete "Terms of Use" in place if users want to review them.
Generally, "Terms of use" can be found in the footer section of the AMPed website.
Our terms of use are given in Figure 5.

Goal
One of the main goals of this thesis is to develop minimal inhibitory concentration (MIC) and minimal lethal concentration (MLC) calculators, so that biologists and researchers can upload their raw data in the specified template format and get the required calculations together with a plot to visualize their data. As researchers upload data to our platform to use the software tool, new data from around the world also begins to accumulate in the AMPed database. One main goal of the broader AMPed effort is to facilitate collaboration between researchers worldwide.

Sequence Diagram
A sequence diagram shows the interactions between software objects in the order in which they occur. It helps developers, related staff, and business stakeholders to understand how the software functions and how data flows, and it also serves to communicate the requirements for system implementation. Figure 4 shows the general sequence diagram of user interaction with the AWI 3.0 version of AMPed.

Figure 6: sequence diagram, the order of typical events is shown from top to bottom
In the above diagram, arrowheads show the direction and sequence of the functions and how users interact with the database and algorithms. To use the tools, the user must be logged in, so an "Authentication Request" is sent to the database, which checks whether the account exists and sends back an "Authentication Response". The user can also send a "Search" request to locate information in the database; the "Search Result" returns the appropriate response. Another function is "Contribute": authorized users can contribute experimental data from their research groups to the database. The user data is parsed by an algorithm to determine whether it is acceptable and in the right format, and an appropriate response is then generated.
Similarly, when using the tools, the user uploads data to the server, where the chosen algorithm generates the calculation results; the user interface shows the plots and allows the user to save a copy of the results.

Data
The bacterial growth data used to calculate the MIC/MLC values is the output of the bacterial plate scanners used in the lab. An example of scanner software is SoftMax Pro [1]. In Dr. Martin's lab [2], a 96-well plate is set up for the experiment.
The wells are set up according to a specific pattern. Plates can be set up with test peptide dilutions in ascending or descending order. To standardize data points, we have set up our plate templates to contain peptide dilutions in descending order, so the peptide concentration decreases from left to right. Figure 5 shows how the plates need to be arranged to upload growth data seamlessly to the AMPed database. The MIC and MLC calculators determine the minimum concentration of a given AMP that will be inhibitory or lethal, respectively, against a specified species of bacteria.
Peptides are tested in triplicate, so the first 3 rows all have the same peptide concentrations, shown as Pep1 in Figure 5. Rows 4-6 contain dilutions of a second test peptide, say Pep2, set up analogously to the first 3 rows with Pep1.
Each test is repeated 3 times for each peptide (here Pep1 and Pep2) to see what changes the dilution brings about for each peptide, and the two peptides are compared with the standard antibiotic in row 7.
Only one block is shown in the above figure. There can be many such blocks, depending on the experiment. Each block represents a single time point in the growth curve of the bacteria. Row 7 holds a known antibiotic at the same dilutions as the test peptides.
only, no bacteria (contamination control). Columns 2-11 (Dil1-Dil10) hold varying dilutions of the test peptides. Column 12, or M, holds bacteria and broth only, as the negative control (no peptide/antibiotic added).

Data for upload
The file exported by SoftMax Pro is saved in .txt format, but this non-binary format does not list the time point of each block. We require time points to calculate the MIC/MLC for a minimum of 12 hours of growth and to monitor the entire growth curve of each bacterial culture. The output file has the dilutions and all settings from the lab experiment, as described above. That data is exported to Microsoft Excel and saved in .csv format, and a time point is then added for each block.
If we have 50 blocks, there will be 50 time points in the data, one for each block. Figure 6 below shows a data template in .csv format, which can be uploaded to the web to get the required calculations and visualizations. While generating the results, the algorithm uses the same "Timepoint" for the whole block. Uploaded data must follow this format for the tool to output an accurate MIC/MLC and growth curve plot. We realize that this format may differ from the format in which the data was originally collected; however, reorganizing the data prior to uploading is essential for a successful result. For a sample of a whole dataset, see Appendix A.

Algorithm Steps
General outline of the algorithm for MIC/MLC (explained in detail with code after this):
Step 1: Any user who wants to use our tools must have an account with us. The tools reside in the footer area of the AWI; a user who is not logged in is redirected to the login page or, if needed, to the page for creating a new account.
Step 2: A few questions are asked about the data to be uploaded, to gather information about the test peptides used in the experiments, such as which bacteria the test peptide fights against, the assay used, and so on.
Step 3: A sample template is provided, if needed, with more information. The file must be in the proper format to get the right output.
Step 4: The user selects and uploads the data file.
Step 5: The PHP script runs in the background, creates a unique name for every uploaded file, and saves the file on the server. The system then executes an R script to generate the result.
Step 6: The algorithm calculates the average of the first 3 rows (Pep1) for every column and puts the results in a new table (pep1) for that time point. The same is done for rows 4-6 (Pep2) of every column, and the result is put into a new table (pep2) for that time point.
Step 7: Graphs are plotted for Pep1 and Pep2, and a result table containing the calculations is published on the server.
Step 8: The results are also displayed on the web interface and can be downloaded to the user's device.
The average calculations for Pep1 and Pep2 are displayed, and by looking at them a scientist can easily find the MIC/MLC of those peptides.
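The averaging in Step 6 can be sketched as follows. This is a minimal illustration in Python, not the R script that the tool executes; the function names and the in-memory layout (each block as a list of plate rows, with the first three rows holding Pep1 and the next three holding Pep2) are assumptions made for the example.

```python
from statistics import mean


def average_triplicates(block):
    """Average the triplicate rows of one block, column by column.

    `block` is a list of plate rows, each a list of growth readings.
    Assumption for this sketch: rows 0-2 hold Pep1 and rows 3-5 hold Pep2.
    """
    pep1 = [mean(column) for column in zip(*block[0:3])]
    pep2 = [mean(column) for column in zip(*block[3:6])]
    return pep1, pep2


def average_all_blocks(blocks):
    """Build one table per peptide, keyed by the time point of each block.

    `blocks` is a list of (timepoint, block) pairs.
    """
    pep1_table, pep2_table = {}, {}
    for timepoint, block in blocks:
        pep1_table[timepoint], pep2_table[timepoint] = average_triplicates(block)
    return pep1_table, pep2_table
```

Each table then holds one row of column averages per time point, which is the shape needed to plot a growth curve per dilution.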

AMPed Web Interface (AWI) for MIC/MLC

Tools
The tools can be found in the footer of the web interface, as shown in Figure 7. The footer was updated as part of AWI version 3.0, and the "Tools" page was created to hold the tools developed for this project. Figure 8 shows the footer of the web interface, where the title "AMPed Tools" has been added alongside the other footer titles.

MIC/MLC Calculator Page
To gather information related to the data, a few questions (see Figure 11) are asked via the GUI, which helps the system to display the results according to the needs of the researcher. The questions are:
• What bacteria does your peptide fight against?
• What kind of Assay you are using?
• What is your medium?
• What would you like to name Test Peptide 1?
• What would you like to name Test Peptide 2?
• What control antibiotic did you use?
• What concentration of the control antibiotic did you use?
• How many time points does your data have?
Figure 13: MIC/MLC Calculator page
Also, in case the user needs more information to understand a question, a "title" attribute is added to each text field. When the cursor hovers over the text field, more information is shown, as in Figure 11. For example, "What species/strain of bacteria are you trying to inhibit? E.g., S. aureus, E. coli, etc." pops up when the cursor hovers over the text box associated with "What bacteria does your peptide fight against?"

Upload file
As shown in the above figure, the "Choose File" control is displayed by using HTML attributes (keywords written in the code that correspond to what we want to see in the user interface) in the PHP page, as described below: type="file" inside the input tag makes it a file chooser, and another attribute, accept=".csv", restricts the chooser to the .csv file format.

Snippet of the code
<input style="height: 35px;" name="upload_file" type="file" accept=".csv" id="upload_file" value=""/>
When writing client-side code, the attribute enctype="multipart/form-data" [4] should be used if the form includes any <input type="file"> elements, as ours does. The "enctype" attribute of the form element specifies how the form data should be encoded when it is submitted to the server. The enctype attribute can be used only if method="post" [5]. POST is used to collect form data from the client browser and send it to the web server. It supports multipart binary input, which allows the client (browser) to send multiple chunks of binary input (files) to the server.

Snippet of code
<form action="mic_upload_process.php" method="post" enctype="multipart/form-data" name="form1" id="form1" onSubmit="return valForm(this);">
To create the submit button, we use the "type" attribute inside the input tag: type="Submit". Similarly, Figure 13 (6) shows how the .r file is created and how the dynamic as well as the constant text of the R script is written to it. The full PHP code is in Appendix B.

Snippet of code
This code shows how a user-uploaded file is saved with a unique time stamp.
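A unique, time-stamped file name of the kind described here could be generated as in the sketch below. It is written in Python for illustration and is not the PHP code that runs on the server; the function name, the "mic_" prefix, and the short random suffix (added so that two uploads within the same second do not collide) are assumptions.

```python
import os
import time
import uuid


def unique_upload_name(original_name, now=None):
    """Return a server-side file name built from a UTC time stamp.

    Hypothetical helper: the prefix and suffix are illustrative choices.
    Only the extension of the user's file name is kept, never its path.
    """
    now = time.time() if now is None else now
    stamp = time.strftime("%Y%m%d_%H%M%S", time.gmtime(now))
    _, extension = os.path.splitext(os.path.basename(original_name))
    suffix = uuid.uuid4().hex[:8]
    return f"mic_{stamp}_{suffix}{extension.lower()}"
```

Discarding the user-supplied base name keeps unexpected characters out of the stored path, while the time stamp preserves the upload order.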

Snippet of code for calculation
This code shows how the different rows are selected from the Excel file, the steps taken to separate the rows and store them in different variables, and how the results are saved as data frames after the calculation. Below is part of the R script for the calculation.
In the above plots of bacterial growth under treatment with Pep1 and Pep2, each average of 3 measurements in a given area of the plate is plotted against the time point of its block, according to the peptide concentration added to that area of the plate. If we study the plots together with the calculations, we find that the minimum inhibitory concentration is at the X16 dilution for both peptides tested on a single 96-well plate. The plots make the bacterial growth data for Pep1 and Pep2 easier to understand and the MIC/MLC results easier to conclude.
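Reading the MIC from the averaged values can also be expressed programmatically. The sketch below is an illustration only and is not part of the tool, which leaves this reading to the researcher; the growth threshold of 0.1 and the input ordering (readings listed from the highest to the lowest peptide concentration, as in our plate template) are assumptions made for the example.

```python
def mic_index(final_readings, threshold=0.1):
    """Return the index of the lowest concentration that still inhibits growth.

    `final_readings` holds one averaged growth reading per dilution, ordered
    from the highest to the lowest peptide concentration. A well counts as
    inhibited while its reading stays below `threshold` (a hypothetical
    cut-off). Returns None if even the highest concentration shows growth.
    """
    last_inhibited = None
    for index, reading in enumerate(final_readings):
        if reading >= threshold:
            break  # first well with visible growth; stop scanning
        last_inhibited = index
    return last_inhibited
```

With hypothetical two-fold dilution labels X1, X2, X4, and so on, the returned index can be mapped back to a label such as X16.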

Summary
The goals were met: a tool was created that accepts raw lab data in a specified .csv format through the web interface and returns the required calculations and plot visualizations, which can be downloaded for reference.

Introduction
Mass spectrometry is an analytical technique used to characterize the structure of a molecule, by using gas-phase momentum to directly measure the mass/charge ratios of individual molecules. MS can be used to determine or confirm the amino acid sequence of an unknown or known AMP.
MS is the technique most commonly used to check whether the amino acid composition of a synthetic peptide is correct [1]. The test records the mass-to-charge ratio of each peptide and outputs a high-resolution plot of the relative abundance of each isotopic mass peak. Because each atom in a peptide molecule moving through the spectrometer has a certain probability of being a less abundant, heavier isotope, some molecules are observed at one or two atomic mass units above the lowest mass, the monoisotopic mass, of that peptide. The MS output therefore gives a distribution of mass peaks whose intensities depend on the isotopic composition of each peptide [2]. This information is crucial in the discovery of novel antimicrobial drugs. Also, when synthesizing AMPs for research and drug characterization, mass spectrometry can be used as a gold-standard characterization technique to validate that a synthesized peptide has its asserted amino acid composition.

Data
The input for this tool will be a peptide sequence, that is, a string of amino acids of known atomic composition. The program gives the isotope distribution with all the possible masses for the input peptide sequence.
Example: for a peptide with atomic composition S, the user will enter the composition in the form CvHwNxOySz (the usual elements that compose peptides), as well as a parameter k for the number of peaks to be shown. The program will then output a k-by-2 matrix with the possible molecular masses in the left column and their corresponding abundances in the right column.

Programming Language
The Perl programming language will be used, integrated with PHP. R will be used for graphical functions. The weight of each amino acid will be defined in the program, i.e. looked up in a table, and then the weight of the peptide chain can be calculated.
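The table lookup described above can be sketched as follows. The example is in Python for illustration rather than the Perl planned for the tool, and the names are hypothetical. It uses standard monoisotopic residue masses (each amino acid minus one water) and adds one water for the free termini of the chain.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added back for the N- and C-termini


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified linear peptide (one-letter codes)."""
    residues = sequence.strip().upper()
    unknown = sorted(set(residues) - set(RESIDUE_MASS))
    if unknown:
        raise ValueError(f"unknown residue code(s): {', '.join(unknown)}")
    return sum(RESIDUE_MASS[code] for code in residues) + WATER
```

For the dipeptide GG this gives about 132.0535 Da, which matches the monoisotopic mass of glycylglycine (C4H8N2O3).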

Algorithm
• The user will input numeric values for v, w, x, y, z, and k to specify the peptide and the number of peaks. The program will then calculate the mass of the first peak, assuming that all of the atoms are low-mass isotopes, and store it in the first entry of the mass vector.
• Next, it will add 1 to that value and store the result in the next entry of the mass vector, for the case where there is exactly one heavier isotopic atom in the molecule.
• This process proceeds until the mass vector contains k entries.
• Now the probabilities of the masses of each peptide isotope need to be stored, so the vector prb is initialized with zeros in each entry. The probability that a random peptide molecule consists only of the lowest-mass isotopes is $P_0 = (p_{^{12}\mathrm{C}})^{v}\,(p_{^{1}\mathrm{H}})^{w}\,(p_{^{14}\mathrm{N}})^{x}\,(p_{^{16}\mathrm{O}})^{y}\,(p_{^{32}\mathrm{S}})^{z}$ [1].
• Here, $p_{^{12}\mathrm{C}}$ is the probability of a random carbon atom being the minimum-mass isotope, ¹²C, and the other probabilities are defined similarly.
• The algorithm will then store this value in the first entry of prb.
• Now we consider the case where only one element has heavier isotopes: one ¹²C in the initial probability is replaced with one ¹³C, then two ¹³C, up to v ¹³C, which corresponds to the case where all carbons in the molecule are heavier isotopes and all of the other atoms are low-mass isotopes. The algorithm will store these values in the 2nd, 3rd, ..., (v+1)th entries of the prb vector.
• After that we consider the cases where two elements, such as C and H, have heavier isotopes. This gives the probability of the peptide with one ¹³C and one ²H, followed by one ¹³C and two ²H, up to one ¹³C and w ²H.
• The same process is then repeated for the instances of ²H with two ¹³C, and continues until there are v ¹³C and w ²H.
• The algorithm goes on to three elements with heavier isotopes, then four, and finally the cases where all five elements have heavier isotopes, adding each probability to the entry in the prb vector that corresponds to the mass of that isotope combination.
• Each of the probabilities in the prb vector is then divided by the largest entry in the prb vector and stored in a new vector of intensities, which lists the abundance of each mass relative to the most abundant mass (set to 100%).
• Lastly, the mass vector and the intensity vector are combined into a matrix with the entries of mass in the left column and the corresponding intensities in the right column.
This table will be the output of the program and resembles an MS data set.
The algorithm has some limitations that need to be addressed as it is developed into a bioinformatics tool. The computational load increases exponentially with the size of the molecule. Peptides can be up to one hundred amino acids long, and the tool should be able to handle a peptide of any size; this is a very large project, and currently the algorithm can only handle peptides up to fifteen amino acids long.
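The calculation can be illustrated with the following Python sketch. It is not the Perl implementation of this thesis and it swaps in a different technique: instead of enumerating every isotope combination, it computes a binomial distribution of heavy atoms for each element and combines the elements by convolution, which yields the same peak probabilities under the stated simplifications without the exponential enumeration. The assumptions are that each element has one heavy isotope that adds 1 to the mass (so ¹⁸O and ³⁴S are ignored), that the natural abundances are the approximate values listed, and that all function and variable names are hypothetical.

```python
from math import comb

# Monoisotopic masses (Da) and approximate natural abundances of the
# +1 heavy isotope (13C, 2H, 15N, 17O, 33S). Illustrative values.
MONO_MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074,
             "O": 15.994915, "S": 31.972071}
HEAVY_ABUNDANCE = {"C": 0.0107, "H": 0.000115, "N": 0.00364,
                   "O": 0.00038, "S": 0.0075}


def element_distribution(n, p_heavy, k):
    """Probability that exactly j of n atoms are heavy, for j = 0..k-1."""
    return [comb(n, j) * p_heavy**j * (1 - p_heavy)**(n - j) if j <= n else 0.0
            for j in range(k)]


def convolve(a, b, k):
    """Combine two distributions, keeping only the first k mass shifts."""
    out = [0.0] * k
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            if i + j < k:
                out[i + j] += a_i * b_j
    return out


def isotope_pattern(counts, k):
    """Return k (mass, relative intensity in %) rows for an atom-count dict."""
    prb = [1.0] + [0.0] * (k - 1)
    for element, n in counts.items():
        prb = convolve(prb, element_distribution(n, HEAVY_ABUNDANCE[element], k), k)
    mono = sum(MONO_MASS[element] * n for element, n in counts.items())
    largest = max(prb)
    return [(mono + shift, 100.0 * p / largest) for shift, p in enumerate(prb)]
```

Calling `isotope_pattern({"C": v, "H": w, "N": x, "O": y, "S": z}, k)` returns the k-by-2 table described above, with masses in the first position and intensities relative to the most abundant peak in the second.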

Output
Figure 18 shows what the predicted spectrum would look like.

An entity is an object or concept about which you want to store information. Database entities are converted into tables that hold attributes in columns.
Entity among entities.
Associative entities are transformed into tables that hold the foreign keys (a means of reference among tables) of the entities they connect. An associative entity is represented in an ER diagram by a diamond shape within a rectangle.
Attributes, represented by ovals, are characteristics/properties of an entity. A key attribute, represented with an underline, uniquely identifies an entity.
A multivalued attribute can have more than one value and is represented as shown below.
Solid lines connecting attributes and entities show the relationships between them, i.e. which attributes are associated with which entities.

[Figure: ER diagram notation legend showing the Relationship, Attribute, and Associative Entity symbols]

There are 3 types of relationships among entities in the ER diagram: one-to-one, one-to-many, and many-to-many. For example, in the AMPed ER diagram, one of the relationships is between the Peptide entity and the Gene entity. This is a one-to-many relationship, represented by 1-N as shown in Figure 20. This means that one peptide can be related to many genes, or equivalently that many genes can be related to one peptide.
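A one-to-many relationship of this kind is typically implemented with a foreign key on the "many" side. The sketch below illustrates the idea using Python's built-in sqlite3 module rather than the MySQL database behind AMPed; the column names and sample rows are hypothetical and do not reproduce the AMPed schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a 1-N Peptide-Gene relationship."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript("""
        CREATE TABLE Peptide (
            peptide_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE Gene (
            gene_id    INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            -- each gene row points at exactly one peptide (the "1" side)
            peptide_id INTEGER NOT NULL REFERENCES Peptide(peptide_id)
        );
    """)
    conn.execute("INSERT INTO Peptide (peptide_id, name) VALUES (1, 'PepA')")
    conn.executemany(
        "INSERT INTO Gene (gene_id, name, peptide_id) VALUES (?, ?, ?)",
        [(1, "geneA1", 1), (2, "geneA2", 1)],
    )
    conn.commit()
    return conn


def genes_for_peptide(conn, peptide_id):
    """List the names of all genes related to one peptide."""
    rows = conn.execute(
        "SELECT name FROM Gene WHERE peptide_id = ? ORDER BY gene_id",
        (peptide_id,),
    )
    return [name for (name,) in rows]
```

Because the key lives in the Gene table, any number of genes can reference the same peptide, while a gene that references a nonexistent peptide is rejected by the foreign-key constraint.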

Normalization
Normalization is the process of organizing data in a database. It includes creating tables and relationships between those tables, and setting rules that define those relationships, in order to protect the data and to make the database more flexible, correct, and consistent by eliminating redundant and inconsistent data. One of the goals of this  Amino_Acid_Address table and the 3D_Structure table. If needed, we can still join the 3D_Structure table and the Amino_Acid_Address table through the Peptide table. After these changes were made, we checked the database schema for compliance with the normal forms. It is currently in first, second, and third normal form.

Schema
The schema defines the database tables. The fields (attributes) in each table show the datatypes of the associated data. The schema also shows the relationships between the tables. The keys (unique identifiers of the rows of the table) and the foreign keys (references from one table to another used to combine information among tables) are also identified in the schema. The schema imposes integrity constraints that ensure that the changes made by authorized users do not result in loss of data and that information derived from multiple tables is correct and consistent.
See Appendix E for the updated schema of the AMPed database.

CONCLUSION AND FUTURE WORK
The field of biology is ever growing and accumulates large volumes of data. Sorting that data to uncover research results is a huge task. One of the contributions of these tools is to manipulate the data for better visualization and to save time. These are the qualities that researchers look for in any tool.
The collection of data is not enough; data should also be easy to access when needed.
Our database has a huge collection of peptides and related information, so the interface to access that data should be user friendly as well. No one wants to spend more time than necessary just to access the data, so the GUI should be easy to navigate.
One of the goals was to add new features to the existing interface in such a way that they do not look as if they were added at random. The flow of the interface should be maintained. All of these points were considered while adding features. Our GUI is now better, more user friendly, and easier to navigate. Complex-looking interfaces confuse the user, which is not a desirable quality.
Identifying correct relationships between data is a crucial part of developing a database. While creating and updating our database, careful observations were made regarding the relationships between entities to avoid any redundancies and inconsistencies that may occur while inserting, deleting, and updating the data in the database. Changes in the database design were made to prevent these problems.
The original goals of this work were achieved. We have furthered our efforts to make our database a hub for sharing resources and data between researchers in the same field. These goals have been achieved with the guidance of Dr. Martin. There are several other features that presently reside in AMPed, developed by students who worked on this project before. The work of this thesis helps to make the database more useful to researchers interested in anti-microbial peptides. The ultimate goal is a fully functioning database with all the user and administrative features required to support a comprehensive Enzyme information system [1].
In the future, a new tool also needs to be developed to automatically download content from a variety of sources and parse it into the AMPed site. This would continue populating the site with the ever-expanding amount of newly discovered and validated content.

GLOSSARY

AMPed
The Antimicrobial Peptide Editable Database developed at the University of Rhode Island under the direction of Professor Lenore Martin.

Antimicrobial
An antimicrobial is an agent that kills microorganisms or inhibits their growth, which includes antibacterial, antiviral, antifungal and antiprotozoal agents.

Amino acid
Organic compound that serves as the building blocks of proteins.

MySQL
MySQL is an open-source relational database management system (RDBMS) created by a Swedish company, MySQL AB.

Mass spectrometry
An analytical technique used to characterize the structure of a molecule.
During the process of mass spectrometry, the molecule of interest is fragmented and the mass/charge ratios of these fragments are determined via instrumentation.

NaOH
Sodium hydroxide, a basic chemical used as a control in the experiment. The NaOH will be lethal to the bacteria.

NCBI
National Center for Biotechnology Information (NCBI) is a US government resource that develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities.

PDB
Protein Data Bank (PDB) is a repository of curated 3D biological structures, usually proteins, populated by direct input from an international community of scientists. PDB utilizes different file formats (mmCIF/XML) to display protein structure information.

Peptide
A small protein, usually less than 100 amino acids in length.

Perl
Perl (short for "Practical Extraction and Report Language") is a highly capable, feature-rich programming language with over 29 years of development. Perl runs on over 100 platforms from portables to mainframes and is suitable for both rapid prototyping and large scale development projects.

Pseudomonas aeruginosa (PA)
A Gram-negative, rod-shaped bacterium that may be used to assess the properties of the AMP.

R
R is a language and environment for statistical computing and graphics. It is available as free software and compiles and runs on a wide variety of UNIX platforms. One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed.

RStudio
RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. It is available as open source.

Staphylococcus aureus (SA)
A Gram-positive, round-shaped bacterium that may be used to assess the inhibitory/lethal properties of the AMP.