Transcriptome of the Eastern Oyster Crassostrea Virginica in Response to Bacterial Challenge

The Eastern oyster Crassostrea virginica, an ecologically and economically important estuarine organism, suffers mortalities as high as 90-100% in affected areas due to Roseovarius Oyster Disease (ROD), caused by the bacterial pathogen, Roseovarius crassostreae. Advanced genotypic breeding techniques necessitate information regarding markers and genes associated with disease resistance. As yet, the host-pathogen interaction between C. virginica and R. crassostreae is poorly understood at the molecular level. The identification of potential genes and pathways responsible for an effective host defense response in the Eastern oyster to R. crassostreae is important not only to provide a basis for enhanced breeding techniques, but also to enhance understanding of innate immunity in a broader, evolutionary sense. The present study proposed to uncover not only genes and general processes potentially involved in disease resistance to ROD in the Eastern oyster, but also diversified gene families. To that end the present study entailed a disease challenge exposing RODresistant and ROD-susceptible families of oysters to R. crassostreae, highthroughput cDNA sequencing of samples from several timepoints throughout the disease challenge, assembly of sequence data into a reference transcriptome, analysis of the transcriptome through differential gene expression and gene family similarity clustering, and single nucleotide polymorphism (SNP) detection to identify candidate gene markers. Oyster resistance to R. crassostreae was found to involve extracellular matrix remodeling, cell adhesion, inflammation, metabolism, and other processes. Several gene families identified as putatively diversified and important in the oyster host defense response were enumerated and described, including serine proteases, serine protease inhibitors, c-type lectins, C1q domain-containing proteins, fibrinogen domain-containing proteins, scavenger receptors with class B SRCR domains, interferon-induced protein 44 (IFI44) family proteins, and GTPase of the immunity associated protein (GIMAP) family proteins. Further, similarity clustering of proteins and translated transcripts from diverse invertebrates suggested that GIMAP proteins are expanded in molluscs and IFI44 proteins are expanded in bivalves.


INTRODUCTION
The Eastern oyster Crassostrea virginica is an ecologically and economically important estuarine organism cultured from Louisiana, USA to New Brunswick, Canada (Kennedy et al. 1996). Oyster production is an important sector of United States agriculture and the Eastern oyster was estimated in 2010 to have a farm gate value of $90.4 million in the United States (NMFS 2010). Several oyster diseases, both protozoan and bacterial, have expanded in range and increased in severity during the latter half of the twentieth century, often causing staggering losses (Cook et al. 1998;Ford and Chintala 1996;Ford and Smolowitz 2007). Roseovarius Oyster Disease (ROD), caused by the bacterium Roseovarius crassostreae, went unreported before 1988 and presently affects oysters from the Long Island Sound north to Maine (Bricelj et al. 1992;Ford and Borrero 2001;Boettcher et al. 2005; Markey and Gomez-Chiarri 2009). ROD can cause mortalities as high as 90-100% in affected areas (Bricelj et al. 1992;Ford and Borrero 2001;Boettcher et al. 2005). ROD can cause mortality events that last a few weeks, often coinciding with or following closely upon peak summer temperatures. Gross clinical signs include uneven shell margins, soft tissue emaciation, and conchiolin depositions on the inner shell surfaces (Bricelj et al. 1992;Barber et al. 1998;Boettcher et al. 1999;Ford and Borrero 2001).
The host-pathogen interaction between C. virginica and R. crassostreae is poorly understood. It is known that R. crassostreae colonizes the oyster's ,! inner shell surface before lesions develop in the epithelial mantle (Boardman et al. 2008). Shell colonization may enable R. crassostreae to evade cellmediated killing by hemocytes (Boardman et al. 2008), which are the active cells of the immune system present in the hemolymph and involved in triggering and sustaining both cell-mediated and humoral host defense, wound and shell repair, and other processes (Ford and Tripp 1996). Colonization likely stimulates the oyster to deposit conchiolin and it has been suggested that smaller oysters succumb to ROD because they lack adequate metabolic resources to fuel deposition, leading to emaciation (Bricelj et al. 1992;Ford and Borrero 2001;Boardman et al. 2008). R. crassostreae may produce a toxin with ciliostatic activity (Boettcher et al. 2000) and whereby environmental effects can be minimized, which have been shown to be of great effect in oyster culture and may seriously confound main effects in traditional family-based selection (Langdon et al. 2003;Guo et al. 2008;Massault et al. 2008;Cancela et al. 2010).!
The identification of potential genes and pathways responsible for an effective host defense response in the Eastern oyster to R. crassostreae is important not only to provide a basis for enhanced breeding techniques, but also to enhance understanding of innate immunity in a broader, evolutionary sense. Deep sequencing of the Eastern oyster transcriptome in response to bacterial challenge is a valuable and interesting contribution because the Eastern oyster is a member of Lophotrochozoa, a superphylum that has been poorly represented among genomic and transcriptomic datasets until the recent release and publication of the Pacific oyster, Crassostrea gigas, genome (Zhang et al. 2012). Research conducted on the innate immunity of model invertebrates with sequenced genomes has focused on deuterostomes like the purple sea urchin Strongylocentrotus purpuratus, the tunicate Ciona intestinalis, and the amphioxus Branchiostoma floridae, while research into the )! innate immunity of protostomes has disproportionately focused on members of Pancrustacea and, more specifically, Arthropoda, including Drosophila spp.
Invertebrate hosts lack the classical adaptive immune mechanism of receptor diversification through RAG-1-and RAG-2-mediated somatic recombination and gene conversion, yet they successfully combat widely varied types of microbes and parasites, which have comparatively high rates of mutation and rapid generation times (Flajnik and Du Pasquier 2004;Du Pasquier 2005). To mount effective and flexible defense responses to diverse pathogens, invertebrate hosts have developed diversified repertoires of receptors, regulators, and/or effectors (Messier-Solek et al. 2010). Though the precise role, relative importance, and presence/absence of specific genes and gene families relevant to host defense differs among the invertebrates in the taxa referred to above, several general strategies of diversification have been described including allelic diversity/SNPs/indels, alternative splicing, and gene family expansion (Ghosh et al. 2010;Messier-Solek et al. 2010). As an example of the latter, the genome of sea urchin S. purpuratus has undergone extensive expansions of Toll-like receptors (TLRs), NACHT-and leucine-rich repeat-containing (NLR) proteins and multi-domain scavenger receptors cysteine-rich (SRs), all of which directly or indirectly bind to pathogen associated molecular patterns (PAMPs) (Hibino et al. 2006). The Dscam gene, as studied in several arthropods, illustrates the strategy of diversification through alternative splicing. The Dscam intron-exon architecture enables a possible production of tens of thousands of protein isoforms that function in homophilic binding and, likely, heterophilic, bacterial binding (Watson et al. 2005). Fibrinogen-related proteins (FREPs) identified in arthropods and molluscs have been shown to participate in agglutination and phagocytosis and have attained diversity through the multiple strategies of gene family expansion, alternative splicing, allelic diversity, and even somatic recombination (Leonard et al. 2001;Zhang et al. 2004;Ghosh et al. 2010).
While high-throughput, digital transcriptomic studies of host defense have been performed in molluscs other than the Eastern oyster, including the Pacific oyster C. gigas, the mussel Mytilus edulis, the Manila clam Ruditapes phillipinarum, and others (de Lorgeril et al. 2011;Brulle et al. 2012;Philipp et al. 2012), disease challenge transcriptomic studies specific to Eastern oyster host defense have used medium-throughput approaches including expressed sequences tag (EST) analysis and microarrays (e.g. Jenny et al. 2002;. The present study proposed to uncover genes, gene families, and general processes potentially responsible for disease resistance to ROD in the Eastern oyster. Sequences of cDNA from ROD-resistant and susceptible families of oysters exposed to R. crassostreae were assembled into a reference transcriptome. Differential gene expression and gene family analyses were used to identify genes, gene families, and general processes involved in oyster immunity, and single nucleotide polymorphism (SNP) were &! identified in candidate genes. The present study found that genes involved in extracellular matrix (ECM) restructuring, cell adhesion, inflammation, metabolism, catecholamine signaling, and several other processes distinguished resistance from susceptibility to Roseovarius Oyster Disease.
The present study also identified several large gene families important in Eastern oyster host defense including two families poorly characterized in invertebrates, the GTPase of the immunity associated protein (GIMAP) family and the interferon-induced protein 44 (IFI44) family, which appear to be expanded in molluscs and bivalves, respectively. .!

Bacterial Challenge of Eastern Oysters
Juvenile Eastern oysters from 2 families with known differential susceptibility to ROD as determined in a previous study were provided by the Rutgers University shellfish hatchery (Guo and Gomez-Chiarri, in preparation   at a final tank concentration of 7.5 ! 10 6 colony forming units (CFU) ml "1 (day 0 of challenge), while oysters in control tanks were not exposed. Oysters were fed Instant Algae (Reed Mariculture) every other day and water was partially changed (50%) weekly. Oysters were monitored weekly for 93 days for mortalities and for the presence of clinical signs of ROD (uneven valves and conchiolin deposits in shells of dead oysters). Infection by R. crassostreae was confirmed by PCR .

Sample collection, cDNA preparation, and sequencing
Oyster whole body tissue was collected from 5 randomly sampled

Read processing and de novo assembly
Raw sequencing reads of 108 bp were pooled then processed and filtered for contamination of mitochondrial and ribosomal sequences by mapping to all Crassostrea spp. rRNA and mtDNA in NCBI Genbank database, and were filtered for vector sequences by mapping to Univec using bowtie2 (Langmead and Salzberg 2012; ftp://ftp.ncbi.nih.gov/pub/univec). The 5'-end of reads was trimmed to reduce GC-bias (Hansen et al. 2010). Using the btrim software package, Illumina adapters were trimmed, low-complexity artifacts were removed, and reads were further trimmed using adaptive quality trimming (Kong 2011). Reads less than 20 bp in length at this stage were discarded. Processed transcriptome reads were assembled using Trinity (release 20111126) with default options (Grabherr et al. 2011). Only those assembled contigs # 200 bp were retained.

Contamination Removal
Transcriptome contigs were compared to the RefSeq protein database (Sayers et al. 2012 (Finn et al. 2011). DNA transposons were identified and discarded using RepeatMasker (Smit et al. 2010).

Differential Gene Expression
Reads from individual treatments were aligned to the reference transcriptome using bowtie with parameters, "-v 3 --a --best --strata," such that 3 mismatches were allowed per read to account for the high rate of polymorphism in oysters, while only reporting the alignment with the least number of mismatches for each "stratum" (Langmead et al. 2009 (Grabherr et al. 2011). Only those DEGs that were significant at the level of "gene isoforms" and belonged to +,! components that were significant at the level of "gene models" were considered differentially expressed. While the relationship of Trinity contig to Trinity component works on the level of graph space and is not precisely biological (Grabherr et al. 2011), the relationship as it was applied only added a stringent filter to the pool of DEGs and did not contribute new DEGs.

Heatmap Analysis
For all genes differentially expressed in any one of the treatments GX-d1, GX-d5, F3L-d1, F3L-d5, F3L-d15, and F3L-d30, the log 2 -transformed RPKM for each gene in each of eight treatment-days (including control, CGX-d15 and CGX-d30) were Z-score centered by subtracting then dividing by the mean log 2 -transformed RPKM by gene. Genes were hierarchically clustered using Euclidean distance and complete linkage of the Z-score-transformed gene expression. Treatment-days were clustered using the complete linkage Euclidean distance of the Spearman correlation of the Z-score-transformed gene expression. Clustering and visualization were performed using the fastcluster and gplots packages, respectively, in the R programming environment (Bolker et al. 2010).

Annotation and Functional Enrichment
Transcriptome contigs were compared to the NCBI protein nonredundant database and Uniref100 using BLASTX (Altschul et al. 1997 (Schlicker et al. 2006), REViGO online visualization tool (Supek et al. 2011), and custom R scripts.

Protein Family Identification and Test for Enrichment
Transcriptome contigs were compared to Pfam_A using PfamScan (version 1.3) and HMMER (version 3.0) and only hits with an e-value $ 1e-5 were retained (Eddy 1998;Finn et al. 2011;Pfam_A downloaded July 7, 2012 Table S1). ESTs were translated to peptide using the prot4EST pipeline. Full-length Gastropoda Uniref100 proteins were used as the training set for determination of the codon usage bias for the three gastropod species, while full-length Bivalvia Uniref100 proteins were used as the training set for C. gigas ESTs.
Protein sequences and translated ESTs from all organisms were concatenated, formatted, and filtered using the first two steps of the OrthoMCL pipeline, orthomclAdjustFasta and orthomclFilterFasta Fischer et al. 2011). Proteins were compared in a parallel run of all-versus-all BLASTP, retaining hits with e-value $ 1e-10 (Altschul et al. 1997). The orthomcl utility orthomclBlastParser was used to calculate the percent identity of each hit. The list of significant hits, or edges, was then made non-redundant and with a maximum negative log10-transformed e-value of 99 using custom python scripts. The resultant network file was filtered using MCL set on the default inflation index of 1.5 (Enright et al. 2002).

Enrichment of DEGs among TribeMCL clusters
Each C. virginica-only TribeMCL cluster was independently tested for enrichment of the 3 DEG treatment comparison groups (GX_early, F3L_early, (frequency equal to one). Multiple names were used for the annotation if a number of non-identical hits of comparable frequency composed a majority of the cluster (e.g., "hemicentin/rhamnospondin/thrombospondin*"). All other TribeMCL clusters were named "unknown."

Selection of Putative Diversified Families
Gene diversification implies a gene family expansion comparative to other lineages (e.g. by gene duplication) and/or gene diversity significantly greater than that which is present in the genomic sequence alone (e.g. by alternative splicing). To identify "diversified" families with host defense +(! relevance, working definitions were adopted for the terms: gene family, expansion, and host defense relevance. A gene family was defined as a group of transcripts with similarity between one another (membership in the same TribeMCL cluster) and/or similarity to the same gene family based on BLAST matches (e.g. two transcripts with BLAST similarity to C1qDC were not included in further analyses, due to the fact that the presence of multiple repeats frustrated similarity clustering. When a diversified family was found that met several of the conditions stated above, a term search was conducted to find relevant hits among the BLAST results (e.g. terms like "serine protease," "trypsin," etc. were used to find serine proteases). The contig list in each TribeMCL cluster was expanded to include all contigs in a TribeMCL cluster if and only if more than or equal to half of the contigs in that cluster had BLASTX best hits to the group of interest and the remainder of contigs had no significant or conflicting hits, whereas the contig list was contracted if less than half of the contigs in that cluster had BLASTX best hits to the group of interest and the remainder of contigs had hits to dissimilar proteins, in which case, all contigs in that cluster were excluded from that group of interest. The contig list was neither contracted nor expanded but remained unchanged in the cases of a TribeMCL cluster in which more than or equal to half of the contigs in that cluster had BLASTX best hits to the group of interest but the remainder of contigs had significant hits to dissimilar proteins or if less than half of the contigs in that cluster had BLASTX best hits to the group of interest and the remainder of contigs had no significant hits. Those contigs (with hits to the gene family of interest) that were not contained in the TribeMCL graph were also retained.
Translated transcripts for each of the select diversified families of interest were reduced to non-redundant sets using CD-HIT, on settings "-G 0 - Each multi-sequence file in categories of interest like DEG groups (GX_early, F3L_early, F3L_late) and diversified gene families, was run through the codeml program in the PAML package (version 4.1b) to obtain the rate of nonsynonymous mutations (dN), the rate of synonymous mutations (dS) and the ratio of these two rates (dN/dS) (Yang 2007). The M0 model was assumed of a single!1 for all lineages, initial % and 1 values of 1.0 and 0.5, respectively, and the F3x4 codon frequency model (Goldman and Yang 1994). The final % and 1!values were estimated by codeml (Yang 2007).

Comparison of putative diversified families between diverse taxa
The diversified gene families analyzed in depth using the C. virginicaonly TribeMCL graph were tracked and enumerated in the multi-species ,'! diversified family of interest were subsequently reduced to non-redundant sets, independently for each family, using CD-HIT, on settings "-G 0 -aS 0.50 -c 0.95," which reduced the sets of proteins by 95% redundancy for a alignment length at least 50% of the smaller protein (Li and Godzik 2006).
The number of non-redundant proteins was enumerated for each species for each diversified gene family. In a few select cases, proteins from different species were so similar as to cluster together, in which case, proteins were counted for both species. For example, two serine proteases from D.
melanogaster that reduced to one sequence by CD-HIT, would have been counted as one serine protease, while one serine protease each from D.
melanogaster and D. pulex that reduced to one sequence by CD-HIT would have been counted as one serine protease for each species. In addition to an estimation of the number of non-redundant proteins for each species for each diversified family, the number of gene models was estimated for each species for each diversified family, according to the formula: Estimations of the total number of gene models for each species (NP, above) were taken from various sources including the published literature and websites of genome sequencing centers (Appendix A, Table S2). Estimates for total number of non-redundant proteins and number of gene models for L.
stagnalis are not reproduced here because the estimates were unreasonably low.

Phylogeny of diversified families
For a select few diversified groups of interest, the non-redundant protein sequences from the multi-species TribeMCL graph were aligned using the E-INS-I option of MAFFT (version 6, Katoh and Toh, 2008). Multiple sequence alignments were viewed in JalView (version 2.6.1) and manually curated to extract blocks of well-aligned sequence (Waterhouse et al. 2009).
The models of sequence divergence that best fit the multiple sequence alignments were found using ProtTest (version 3, Abascal et al. 2005).
FastTree2 was used to generate a phylogenetic tree (Price et al. 2010 ,-!

Oyster survival in response to bacterial challenge
Oysters from the F3L family experienced a consistent and high rate of mortality after challenge with the bacterial pathogen R. crassostreae, reaching over 90% cumulative mortality by the end of the 93-day period (Fig. 1). The survival curve of the challenged F3L oysters was significantly different from all other groups (log-rank survival, p < 0.01). Pearson's chi-squared test with Bonferroni corrections was used to compare oyster survival in the 3 groups used for sequencing, CGX, GX, and F3L, at day 28 (the closest time point at which mortality was tallied before collection of samples for RNA isolation at day 30), and at day 93 (the final time point of the bacterial challenge). F3L had a significantly higher cumulative mortality than GX at day 28 (p < 0.01), and a significantly higher cumulative mortality than GX and CGX at day 93 (p < 0.01). No significant differences in mortality were observed between unchallenged control oysters (CF3L and CGX) and oysters from the resistant family challenged with R. crassostreae at days 28 and 93 after challenge.
Oysters from the control ROD-resistant family (CGX) suffered a mortality event between days 1 and 7 (20% cumulative percent mortality by day 7), with an additional 10% mortality for the 86 days following that event (Fig. 1).

Differential gene expression in oysters in response to bacterial challenge
Differential gene expression analyses of samples GX-1d, GX-5d, F3L-1d, F3L-5d, F3L-15d, and/or F3L-30d yielded a total 6,296 differentially expressed genes (DEGs), or 1.8% of the total de novo-assembled transcriptome. When samples were clustered by gene expression patterns, two major clusters separating F3L and GX/CGX treatments were evident.
Furthermore, within each of these clusters, the following subclusters were detected: F3L 1 and 5d (F3L_early); F3L 15 and 30d (F3L_late); GX 1 and 5d (GX_early); and CGX 15 and 30d (control) (Fig. 2). This pattern of treatmentday clustering (GX_early, F3L_early, and F3L_late) was used in further analyses. The focus of the analysis was placed on GX_early DEGs according to the rationale that genes involved in disease resistance would likely be expressed in resistant GX oysters at early time points after exposure to the pathogen. ,.!
The more dramatic response to bacterial challenge of F3L compared to GX in terms of cumulative mortality was reflected in a similar differential response in terms of gene expression (Fig. 3). Of the 356,237 total transcripts tested for differential expression, 6,097 transcripts were differentially expressed in F3L_early and/or F3L_late, compared to only 552 transcripts differentially expressed in GX_early. F3L DEGs were described at the gene level where overlap was found with GX_early DEGs and were further described at the scale of Gene Ontology functional enrichment, Pfam enrichment, and TribeMCL-DEG enrichment (Tables 2 -5). A greater share of GX_early DEGs was shared with F3L_early and/or F3L_late DEGs (64%) than was unique to GX_early (36%) (Fig. 3).
DEGs shared by GX_early and F3L treatments should include (among others) genes associated with host defense and supporting functions (Tables   4 & 5). DEGs unique to GX_early should include genes contributing to disease resistance in the GX family (Tables 2 & 3). The most highly differentially expressed, annotated, up-regulated DEGs unique to GX_early (potentially involved in disease resistance) included, among others, transcripts that annotated to 2 scavenger receptor cysteine-rich proteins, 2 fibropellin ia proteins, 2 inhibitor of apoptosis (IAP) proteins, cytochrome p450, interleukin 17d, a fibrinolytic enzyme, and a disintegrin and metalloprotease with thrombospondin motifs 8 (ADAMTS8) ( Table 2). The most highly differentially expressed, annotated, down-regulated DEGs unique to GX_early included ,(! transcripts that annotated to the development-related protein rapunzel, 2 collagen proteins, tenascin xb, 2 cubilin proteins, inhibitor of apoptosis (IAP), a c-type lectin, a melatonin receptor (Table 3). The most highly differentially expressed, annotated, up-regulated DEGs shared between GX_early and F3L_early and/or F3L_late (potentially involved in responses to bacterial infection) included transcripts that annotated to 2 serine protease inhibitors, 2 dopamine beta-hydroxylases, 2 fatty acid synthases, 2 sulfatases, cytochrome p450, a C1q domain-containing (C1qDC) protein, and heat shock protein 60 (HSP60)( Table 4). The most highly differentially expressed, annotated, downregulated DEGs shared between GX_early and F3L_early and/or F3L_late included transcripts that annotated to a C1qDC protein, two monocarboxylate transporters, multiple epidermal growth factor 11 (MEGF 11), deleted in malignant brain tumors 1 (DMBT1), and sushi-repeat-containing x-linked 2 (Table 4). Because different contigs sometimes shared the similar annotations (suggesting that they could potentially be members of a gene family, alternatively spliced forms, or parts of the same transcript that were not assembled together), annotations were manually compared to find truly unique transcripts among GX_early unique DEGs. Sixteen non-redundant transcript annotations were found in the GX_early sample, including arginase I, arginase II, rho gtpase, and cubilin (down-regulated at day 5); and unc-5, furin, and interleukin 17d (up-regulated at day 1) (Appendix A. Table S3).

,/!
The relative ratio of up-regulated and down-regulated DEGs differed between GX and F3L, with a significantly greater number of up-regulated transcripts in GX_early DEGs (Pearson's chi-squared analysis, p < 0.01) and a significantly greater number of down-regulated transcripts in F3L DEGs (p < 0.01).

Gene Ontology categories enriched among oyster DEGs upon bacterial challenge
As there were fewer DEGs in GX_early compared to the F3L groups, so there were comparatively fewer enriched GO terms (Fig. 4). The most significantly enriched biological process GO term among GX_early upregulated DEGs was "protein folding", corresponding to 3 DEGs annotated to HSP60 that were up-regulated only at day 1. "Protein folding" was also enriched among F3L_early DEGs, also corresponding to HSP60 transcripts, some of which were down-regulated and some up-regulated. The enrichment in "defense response" among GX_early DEGs corresponded to transcripts that annotated to interleukin 17 (IL17), which were strongly up-regulated at day 1.
Terms closely allied to defense response were found enriched among F3L_early DEGs including "defense response to bacterium" and "response to molecule of bacterial origin", corresponding to transcripts that annotated to angiotensin-converting enzyme, defensin, and immune-responsive gene 1, the latter of which was also up-regulated in GX at day 1. The other biological '0! process terms enriched among up-regulated GX_early DEGs were the related terms "programmed cell death" and "apoptotic process", corresponding to transcripts that annotated to IAP transcripts. Though related cell death terms were not found to be enriched in F3L treatments, 3 IAP transcripts were differentially expressed in F3L, 2 of which were down-regulated at all time points. The most significantly enriched F3L early biological process term, found among up-regulated DEGs, was "cholesterol transport", corresponding to epididymal secretory protein E1. Other transport terms enriched among F3L_early DEGs include "hexose transport" (up-regulated) and "amino acid transport"-related and "carboxylic acid transport" (down-regulated). Various development-related terms, including the closely related terms "blood vessel morphogenesis," "angiogenesis," and "vascular development" as well as "fin development" were enriched among F3L_early down-regulated DEGs while "fin development" and neuron-related development terms were enriched among F3L_late down-regulated DEGs. Several terms related to carboxylic acid metabolism were uniquely enriched among F3L_late up-regulated DEGs, corresponding to various decarboxylases. Also uniquely enriched among F3L_late up-regulated DEGs was "oxidation-reduction process", corresponding to several cytochrome p450 transcripts.

Characteristics of selected putative diversified gene families in oysters in terms size, polymorphism, and response to bacterial challenge
C1qDC was among the largest gene families differentially expressed in C. virginica in response to bacterial challenge, with 391 non-redundant annotated transcripts, 323 of which mapped to 149 genomic loci (nonoverlapping mappings ±200bp) in the C. gigas genome ( In terms of differential expression within the diversified families of interest, nearly all differentially expressed SPs were up-regulated at GX day 5 or F3L day 5; nearly all SPIs were up-regulated at GX day 1 and 5 or F3L day 1 and 5; all FREDs/FREPs were down-regulated in F3L at one or more time points and no FREDs/FREPs were differentially expressed in GX; nearly all GIMAP DEGs were down-regulated in F3L at one or more time points and a few were down-regulated in GX day 1, while one transcript was up-regulated in GX day 5 (Fig. 6); nearly all IFI44 DEGs were down-regulated in F3L at one or more time points, while one transcript was up-regulated at F3L day 5, and one was up-regulated in GX at day 5 (Fig. 7). For the remainder of the diversified families, C1qDC, CTLDC, and DMBT1/SRCR type 12, a consistent pattern of differential expression could not be observed (data not shown). For '(! the relatively few transcripts in the select diversified families that were differentially expressed in both GX_early and F3L_early (19 transcripts total), the direction of differential expression, that is, up-or down-regulation relative to control, tended to match between GX_early and F3L_early, except for 2 CTLDC transcripts and 1 DMBT1/SR type 12 transcript (data not shown).

Comparison of size and phylogeny of selected diversified gene families in oysters and other organisms
Certain patterns could be seen in the size of selected diversified gene families across taxa and across the diversified groups ( purpuratus, showed tens of proteins/gene models in the bivalve species considered here. The warm colors signifying a high column z-score for IFI44 for bivalve species and for GIMAP proteins for molluscan species indicates that IFI44 is greatly expanded in bivalves and GIMAP is greatly expanded in molluscs. No IFI44 transcripts were found in the gastropod species considered '/! here and few were found in other non-bivalve invertebrate species. Phylogenetic analysis of GIMAP sequences showed 3 major clusters ( Fig. 8 and 9). One major cluster of sequences contained, in addition to many C. Phylogenetic analysis of IFI44 sequences showed two main clusters, one of which was composed solely of bivalve sequences (159 sequences) and the other of which was composed of 18 sequences from 8 diverse organisms including C. virginica ( Fig. 10 and 11). The main bivalve-only group could be further subdivided into subgroups that had variable numbers of sequences from both C. virginica and C. gigas. )0!

DISCUSSION
The present study provides a rich view of the processes and genes that constitute the oyster host defense responses to bacterial challenge.
Extracellulat matrix (ECM) restructuring, cell adhesion, inflammation, metabolism, catecholamine signaling, and several other processes are key in distinguishing resistance from susceptibility to Roseovarius Oyster Disease.
The present study also identified 8 putative diversified gene families important in the host defenses of Eastern oysters, including two families, GTPase of the immunity associated protein (GIMAP) and interferon-induced protein 44 (IFI44) families, which appear to be expanded in molluscs and bivalves, respectively.
Oysters from the resistant family did not show clinical signs of infection and suffered mortalities comparable to non-challenged oysters, suggesting that these oysters were able to eliminate the pathogen rapidly. Prior to a discussion of differential gene expression and gene family analysis results, it should be noted that only 20% of the transcriptome could be annotated by BLAST and only 30% of the transcriptome could be described by the combination of BLAST similarity, Pfam, and TribeMCL clustering. Conclusions should be regarded with caution and viewed as foundation for future study.
With that in mind, comparison of the patterns of gene expression between resistant and susceptible oysters suggest that resistance to R. crassostreae may involve a targeted hemocytic response followed by tight control of inflammatory processes and detoxification. )+!

ROD-resistant juvenile oysters responded to the bacterial pathogen R.
crassostreae mainly by up/down-regulating the expression of transcripts coding for proteins that modify the extracellular matrix, proteins that bind self or non-self ligands, stress proteins, and receptors and/or proteins involved in signaling. The interaction between R. crassostreae and its oyster host occurs primarily in the extracellular matrix based on DEG annotations and enriched GO terms (Fig. 4). The unique up-regulation in resistant oysters of several subtilisin-like pro-protein convertases (PPC) ( Table 2) (Table 2), and the down-regulation of tenascin-xb (Table 3), a glycoprotein involved in wound healing and matrix maturation with anti-adhesive properties (Egging et al. 2007). The multiple transcripts that annotate as tenascin in oysters further dispel the notion that tenascins are unique to chordates (Tucker et al. 2006), ),! as tenascin-like transcripts have also been found in the transcriptome of the Antarctic bivalve Laternula elliptica (Clark et al. 2010). Other transcripts involved in ECM restructuring and cell adhesion identified here as distinguishing resistance and susceptibility include those coding for hemicentin and fibropellin-ia. Hemicentin increases cell adhesion and re-shapes areas of cell contact in C. elegans (Vogel and Hedgecock 2001). Fibropellin-ia increases cell adhesion in sea urchin embryos (Burke et al. 1998). By sequence similarity, fibropellin transcripts separated into a cluster enriched for susceptible DEGs and another cluster enriched for resistant DEGs (Table 7).
Cell adhesion has long been known to be important in invertebrate innate immunity in general (Johansson 1999) and in oyster immunity (Gueguen et al. 2003). I hypothesize that the differential response of resistant oysters in respect to ECM restructuring and cell adhesion molecules enabled a more effective hemocytic response in these oysters, facilitating cell migration to the extrapallial cavity (the space between the oyster mantle tissue and the shell, where ROD is known to have its primary effects, Paillard et al. 1996;Boardman et al. 2008), likely followed by aggregation, phagocytosis, and apoptosis (Terahara et al. 2005;Anderson et al. 2011;de Lorgeril et al. 2011 The early resistant response also involved the pro-inflammatory mediator, interleukin 17 (IL17), and the nitric oxide modulator, arginase.
Previous research shows that injection of heat-killed gram-positive and gramnegative bacteria into C. gigas oysters produces a rapid and transient upregulation in IL17 transcript abundance in hemocytes, suggesting that IL17 is an important mediator of the pro-inflammatory response in oysters (Roberts et al. 2008). My results support an important role for IL17 in the immune response of oysters against bacterial infection and a potential role in disease resistance to ROD. Here, while IL17 was uniquely up-regulated in resistant oysters, arginase was uniquely down-regulated (Table 3). Arginases have been shown in macrophages to modulate the production of nitric oxide (Chang et al. 1998), which is an immune effector in the Eastern oyster (Villamil et al. 2007). Using microarray technology, a transcript annotating as arginase was shown to increase rapidly after 6 h of heat stress in C. gigas ). The down-regulation of arginase in resistant oysters on day 5 may signalize a down-regulation of the inflammation and stress response following a successful defense response.
Genes and processes activated in susceptible oysters in response to bacterial challenge and absent or present to a much lesser degree in resistant ))! oysters provide potential information on the molecular basis for disease susceptibility or are signs of an unsuccessful defense response. Many transcripts involved in metabolic processes (e.g. carbohydrate metabolism) were differentially expressed in susceptible oysters at both early and late time points following bacterial challenge, but not in resistant oysters as illustrated by enriched molecular function GO terms (Fig. 4). A decrease in energy metabolic enzyme activity and a down-regulation of genes related to energy metabolism have been shown to coincide with mortality events in the Eastern oyster (Genard et al. 2011;. The up-regulation of multiple dopamine beta-hydroxylase (DBH) transcripts early in both resistant and susceptible oysters ( The host defense response to bacterial challenge in resistant and susceptible oysters shared some commonalities, including the involvement of catecholamine signaling (discussed above), detoxification, and apoptosis.

)&!
Transcripts that were highly up-regulated in both resistant and susceptible oysters included transcripts annotating as glutathione s-transferase, cytochrome p450, and heat shock protein 60 (HSP60), which are involved in preventing oxidative damage (Table 4) oysters as related detoxification terms like "monooxygenase activity" and "oxidoreductase activity" were functionally enriched among the susceptible but not the resistant DEGs (Fig. 4). Detoxification-related transcripts are highly upregulated in C. virginica prior to mass mortality events (Genard et al. 2012).
While an early response of detoxification/stress transcripts may contribute to resistance, a persistent and more generalized response (that is, a greater number of up-regulated stress transcripts) may signalize imminent mortality.
Another process involved in both the resistant and susceptible response was apoptosis. Inhibitor or apoptosis (IAP) transcripts were found to be both up- gigas, naturally resistant to P. marinus, has significantly greater inhibitory activity than C. virginica, which has been interpreted as suggestive of the role of protease inhibitors in host defense and resistance (Faisal et al. 1998;Jenny et al. 2002). The list of known oyster SPs and SPIs has grown with each EST analysis (Gueguen et al. 2003;Roberts et al. 2008) and several SPIs have been biochemically characterized (Xue et al. 2006;Xue et al. 2009;La Peyre et al. 2010). Polymorphism in the promoter of an Eastern oyster SPI has been associated with disease resistance to P. marinus (Yu et al. 2011). Proteases from Perkinsus sp. inhibit phagocytosis of Vibrio tapetis in clams (Ordas et al. 1999) and the virulent effects of V. tapetis on clam hemocytes are consistent with the effects of bacterial proteases (Borrego et al. 1996;Allam and Ford 2006  Results of the present study suggest that the Eastern oyster expresses a great number of C1qDC proteins and that they play a role in host defense against gram-negative bacteria. While > 100 non-redundant translated transcripts/proteins were found by network similarity clustering in all three oyster species examined, only a handful of transcripts were found in gastropod species (Fig. 5). These observations are consistent with the hypothesis that  (Fig. 9). In vertebrates, GIMAP proteins have been best characterized in their role as regulators of apoptosis ), though it has been shown that GIMAP family members show differential expression patterns across tissue types and may serve varying functions at different times and in different tissues (Wang and Li 2009). Exposure of human monocytes to LPS induces the down-regulation of 28 genes by >4 fold, four of which are GIMAP proteins (Dower et al. 2008). It has been suggested that the downregulation of GIMAP proteins in humans may serve to promote the survival of monocytes by negatively regulating apoptosis (Dower et al. 2008). It may be that GIMAP proteins fulfill a parallel role in oyster hemocytes though further work will be needed to define this role. I show here that similarity clustering does offer the means of transferring annotations. The majority of clusters that were herein annotated included one or more transcripts that could not be annotated by BLAST alone, yet whose identity could be inferred from its neighbors. Moreover, the very process of reducing a set of transcripts to a smaller set of connected components has the promise to focus efforts and resources in the effort to characterize genes and gene families that presently cannot be annotated by similarity search to the public databases. While these clusters could not be used to describe the oyster host defense response here, TribeMCL or an analogous clustering technique could be used to facilitate annotation in future studies. By considering transcripts at the level of TribeMCL clusters, sequence similarity motifs may be extracted to aid eventual characterization.
While much work remains to be done in characterizing the present Eastern oyster transcriptome and describing the oyster host defense response to bacterial challenge, the present study has made great advances to these ends. When ROD-resistant and ROD-susceptible oysters were exposed R.
crassostreae and gene expression was compared throughout the challenge by high-throughput cDNA sequencing, several processes emerged as key to resistance including ECM remodeling, cell adhesion, and inflammation. The present study has generated a pool of candidate disease resistance genes for advanced genotypic selection regimes. Additionally, several gene families were identified as putatively diversified and of immune relevance in the Eastern oyster, two of which, IFI44 and GIMAP families, are of especial interest as expansions were found to be specific to bivalves and molluscs, respectively. Transcript translation and similarity clustering followed by gene family analysis should prove useful in describing the transcriptomes of other invertebrates in response to immune and/or stress challenge as an unbiased means of identifying putatively diversified groups of host receptors, regulators, and effectors.

!
The top fifty most differentially expressed and up-regulated genes unique to GX days 1 and 5 (genes not differentially expressed in any F3L treatment) are shown, ranked by false discovery rate-adjusted p-value. Magnitude of differential expression is expressed as log-10 fold change over control pool (logFC) and abundance is expressed as log-10 counts per million (logCPM). Regulation, or "Reg.", (up-or down-regulated) is shown for each contig.
Hyphen ( -) indicates that the contig is not differentially expressed at that timepoint (Reg. columns) or that the contig does not have BLASTX hit to the NCBI non-redundant protein database with associated e-value $ 1e-6 (annotation columns). Where contigs are differentially expressed in both timepoints, logFC, logCPM, and p-value correspond to the timepoint in which the contig was most highly differentially expressed and that timepoint is indicated in the regulation columns in bold.

!
Differentially expressed contigs unique to GX days 1 and 5 (genes not differentially expressed in any F3L treatment) that were down-regulated are shown, ranked by false discovery rate-adjusted p-value. Magnitude of differential expression is expressed as log-10 fold change over control pool (logFC) and abundance is expressed as log-10 counts per million (logCPM). Regulation, or "Reg.", (up-or down-regulated) is shown for each contig. ( -) indicates that the contig is not differentially expressed at that timepoint (Reg. columns) or that the contig does not have BLASTX hit to the NCBI nonredundant protein database with associated e-value $ 1e-6. Where contigs are differentially expressed in both timepoints, logFC, logCPM, and p-value correspond to the timepoint in which the contig was most highly differentially expressed and that timepoint is indicated in the regulation columns in bold.

!
The top fifty most differentially expressed and up-regulated genes shared between GX days 1 and 5 and F3L days 1, 5, 15, and/or 30 are shown, ranked by false discovery rate-adjusted p-value. Magnitude of differential expression is expressed as log-10 fold change over control pool (logFC) and abundance is expressed as log-10 counts per million (logCPM). Regulation, or "Reg.", (up-or down-regulated) is shown for each contig. Hyphen ( -) indicates that the contig is not differentially expressed at that timepoint (Reg. columns) or that the contig does not have BLASTX hit to the NCBI non-redundant protein database with associated e-value $ 1e-6 (annotation columns). Where contigs are differentially expressed in both timepoints, logFC, logCPM, and p-value correspond to the timepoint in which the contig was most highly differentially expressed and that timepoint is indicated in the regulation columns in bold.

!
The top fifty most differentially expressed and down-regulated genes shared between GX days 1 and 5 and F3L days 1, 5, 15, and/or 30 are shown, ranked by false discovery rate-adjusted p-value. Magnitude of differential expression is expressed as log-10 fold change over control pool (logFC) and abundance is expressed as log-10 counts per million (logCPM). Regulation, or "Reg.", (up-or down-regulated) is shown for each contig. Hyphen ( -) indicates that the contig is not differentially expressed at that timepoint (Reg. columns) or that the contig does not have BLASTX hit to the NCBI non-redundant protein database with associated e-value $ 1e-6 (annotation columns). Where contigs are differentially expressed in both timepoints, logFC, logCPM, and p-value correspond to the timepoint in which the contig was most highly differentially expressed and that timepoint is indicated in the regulation columns in bold.
Pearson's chi-squared test was performed to test for the enrichment of each unique HMM accession. If the expected count for any cell in the 2x2 contingency table was 5 or fewer, Fisher's exact test was performed instead. Significance p-values listed above were adjusted for the number of independent enrichment tests performed for each group of DEGs by the False Discovery Rate correction method. Pfam families are ordered above by (1) rank order of significance among F3L_early DEGs, (2) rank order of significance among F3L_late DEGs, then (3) rank order of significance among GX_early DEGs.
.)!   (1) rank order of significance of enrichment for F3L_early DEGs, (2) rank order of significance of enrichment for F3L_late DEGs, then (3) rank order of significance of enrichment for GX_early DEGs. Enriched TribeMCL clusters were annotated with a protein name if more than half of the contigs had identical or nearly identical BLASTX best hits and the remainder of contigs had no significant hits, or if # 80% of the contigs had identical or nearly identical BLASTX best hits and the remainder of contigs had dissimilar best BLASTX hits (which may be the case in the sharing of domains, e.g., neurotrypsin and trypsin). In the cases in which contigs that composed a TribeMCL cluster had no best BLASTX hits or had a number of dissimilar hits that if counted conjointly did not represent a majority of the contigs, then that TribeMCL cluster was not annotated, that is, was named "unknown." * Enriched TribeMCL clusters were reservedly annotated if less than half of the contigs had identical or nearly identical BLASTX best hits and composed a plurality while the remainder of contigs had no significant hits or a small number of dissimilar but non-repeating hits (frequency equal to one), or if a number of dissimilar hits of similar frequency that if counted conjointly composed # 50% of the contigs in the cluster (e.g., "hemicentin/rhamnospondin/thrombospondin*"). (-! Figure 1. Cumulative percent mortality in two families of oysters challenged with the bacterial pathogen Roseovarius crassostreae (F3L and GX) and in unchallenged controls (CF3L and CGX).

FIGURES
The cumulative mortality is shown over the course of the 93-day bacterial challenge for both families (F3L and GX) and both challenge and control oysters, the latter of which is indicated by the prefix "C" for "control." Time 0 is the moment of exposure by bath to Roseovarius crassotreae. A Pearson's chi-squared test of significance performed for each pairwise comparison between groups F3L, CGX, and GX at day 28, two days before the final RNA sample collection, and at day 93, the final timepoint of the bacterial challenge. Significance values (p-values) were adjusted independently at each timepoint by the Bonferroni method to account for the multiple comparisons. The four arrows indicate days 1, 5, 15, and 30 at which timepoints RNA was isolated from CGX, F3L, and GX, for cDNA synthesis and sequencing. For all genes differentially expressed in any one of the treatments GX-d1, GX-d5, F3L-d1, F3L-d5, F3L-d15, and F3L-d30, the Z-score centered log 2transformed RPKM for each gene in each of eight treatment-days (including control, CGX-d15 and CGX-d30) is shown. Genes were hierarchically clustered using Euclidean distance and complete linkage of the Z-scoretransformed gene expression. Sample groups were clustered using the complete linkage Euclidean distance of the Spearman correlation of the Zscore-transformed gene expression. Clustering and visualization were performed using the fastcluster and gplots packages, respectively, in the R programming environment. Functionally enriched Gene Ontology (GO) terms for each category of differentially expressed genes, GX_early, F3L_early, F3L_late, and for the three highest-level categories of the Gene Ontology hierarchy are displayed in SimRel semantic mapping space (Schlicker et al. 2006) by modifying the output of the REViGO server (Supek et al. 2011) in the R programming environment and plotting using ggplot2. Semantic mapping space is (/! equivalent within the same broad GO term category, e.g. a x,y-coordinate in GX_early biological process space would have the same identity as the same x,y-coordinate in F3L_early biological process space. The color of nodes from cool (green) to warm (red) signifies increasing significance of enrichment as indicated in the legend. The size of nodes reflects whether the GO term is enriched among upregulated DEGs (large) or downregulated DEGs (small), while a GO term enriched among both upregulated and downregulated DEGs is represented by a medium size node and can be further identified by its unique, square shape. To enhance readability, overlapping nodes were sometimes labeled conjointly. This was done in a manual manner by selecting a term name which properly described the conjointly labeled terms. These cases were noted by the addition of the suffix "-related." Figure 5. Numbers of non-redundant transcripts/proteins and gene models in selected putative diversified gene families in multiple organisms from diverse taxa In each cell is shown side-by-side: 1) the numbers of non-redundant transcripts/proteins, as determined by similarity clustering and 2) numbers of gene models (see Methods). A species tree is reproduced beside the matrix to emphasize evolutionarily relationships between the featured organisms. The number of gene models (the second number in each cell) was Z-score centered by gene family and the magnitude of this Z-score was assigned a color according to the color gradient indicated in the key. Figure 6. C. virginica GIMAP translated transcripts clustered by similarity with color and size reflective of differential expression C. virginica GIMAP translated transcripts represented in Cytoscape 2.8 and yFiles organic layout. An edge corresponds to a significant similarity (e-value $ 1e-05, hit identity # 20%), and the wider the edge, the higher the hit identity percentage.
/,! Figure 7. C. virginica IFI44 translated transcripts clustered by similarity with color and size reflective of differential expression C. virginica IFI44 translated transcripts represented in Cytoscape 2.8 and yFiles organic layout. An edge corresponds to a significant similarity (e-value $ 1e-05, hit identity # 20%), and the wider the edge, the higher the hit identity percentage.
/'! Figure 8. Maximum likelihood phylogenetic tree of non-redundant GTPase of the immunity associated protein (GIMAP) transcripts/proteins from multiple organisms with leaves colored by organism A manually curated multiple alignment of GTPase of immunity associated protein (GIMAP) translated transcripts/proteins from multiple organisms in diverse taxa was used generate the above maximum likelihood phylogenetic tree using FastTree2 assuming a WAG model and hybrid CAT/gamma approximations. The tree is represented as a circular cladogram and bootstrap support from 1,000 replicates is indicated as a percentage next to each tree node. Leaves of the tree are colored according to the species to which the sequence belongs, as specified in the legend.  2)02 //0/ AB7C%)))//*A,!B7C AB7C%)))/2*A,!B7C AB7C%))(..   . Maximum likelihood phylogenetic tree of non-redundant GTPase of the immunity associated protein (GIMAP) transcripts/proteins from multiple organisms with major groupings highlighted and summarized A manually curated multiple alignment of GTPase of immunity associated protein (GIMAP) translated transcripts/proteins from multiple organisms in diverse taxa was used generate the above maximum likelihood phylogenetic tree using FastTree2 assuming a WAG model and hybrid CAT/gamma approximations (as in Fig. 8). The tree is represented as a circular cladogram. Bootstrap support from 1,000 replicates is indicated for the only node used to define the three major groupings indicated by the solid or dotted arcs that circumscribe the tree. (The group indicated by the dotted arc is dotted because it was defined negatively by exclusion.) The numbers of sequences for each species in each of the major groupings are listed in parentheses beside the species abbreviation. Some details are provided on the C. gigas genomic loci to which the C. virginica sequences mapped.
/-! Figure 10. Maximum likelihood phylogenetic tree of non-redundant interferoninduced protein 44 (IFI44) transcripts/proteins with leaves colored by organism A manually curated multiple alignment of interferon-induced protein 44 (IFI44) translated transcripts/proteins from multiple organisms in diverse taxa was used generate the above maximum likelihood phylogenetic tree using FastTree2 assuming a WAG model and hybrid CAT/gamma approximations. The tree is represented as a circular cladogram and bootstrap support from 1,000 replicates is indicated as a percentage next to each tree node. Leaves of the tree are colored according to the species to which the sequence belongs, as specified in the legend.
/&! Figure 11. Maximum likelihood phylogenetic tree of non-redundant interferoninduced protein 44 (IFI44) transcripts/proteins with major groupings highlighted and summarized A manually curated multiple alignment of interferon-induced protein 44 (IFI44) translated transcripts/proteins from multiple organisms in diverse taxa was used generate the above maximum likelihood phylogenetic tree using FastTree2 assuming a WAG model and hybrid CAT/gamma approximations (as in Fig. 10). The tree is represented as a circular cladogram. Bootstrap support from 1,000 replicates is indicated for only those nodes used to define the major groupings indicated by the solid arcs that circumscribe the tree. The major groupings of the tree that contained bivalve-only sequences were indicated by red coloration in the inter-branch space and the major grouping that contained sequences from diverse organisms was indicated by light blue coloration in the inter-branch space. The numbers of sequences for each species in each of the major groupings are listed in parentheses beside the species abbreviation. Some details are provided on the C. gigas genomic loci to which the C. virginica sequences mapped. Annotations unique to DEGs in GX days 1 and 5 are shown. Significance of differential is expressed as FDR-adjusted p-value, magnitude of differential expression is expressed as log-10 fold change over control pool (logFC) and abundance is expressed as log-10 counts per million (logCPM). Regulation, or "Reg.", (up-or down-regulated) is shown for each contig.