INTRODUCTION
The mulberry family (Moraceae) is a large and widely distributed group of angiosperms that inhabit tropical to subtropical regions. Currently, over 37 genera comprising 1,050 reported species are recognized within this family (Berg, 2005). Plants in this family have several economic uses, such as silk production (Morus and Maclura), paper making (Broussonetia, Maclura, and Morus), edible fruit (Artocarpus, Ficus, and Morus), and timber (Artocarpus and Broussonetia) (Zhekun and Gilbert, 2003). Members of the mulberry family vary widely in morphological traits, inflorescence architecture, and breeding systems (Clement and Weiblen, 2009). Several studies have evaluated the diversity of this family. Most have relied on morphological traits such as pollen (Teleb and Salah-El-din, 2014), trichomes (Schnetzler et al., 2017), and morphological and anatomical characteristics (Erarslan et al., 2021), while others have used molecular evidence such as ndhF and the 26S nuclear ribosomal subunit (Zerega et al., 2005) or internal transcribed spacer and trnL-F sequences (Weiguo et al., 2005).
However, classification based solely on morphological features is not entirely reliable, because these traits can be influenced by factors such as environmental conditions and developmental stage. DNA barcodes have been proposed to overcome the limitations of morphological methods. DNA barcoding has its own drawbacks, however: it relies on sequencing a limited number of genome regions, which may not provide sufficient discriminatory power when sequences are similar between species (Galimberti et al., 2014). Additionally, DNA barcodes are specific to individual species and may not be reliable when applied to higher taxonomic levels. Currently, the highest level of discrimination achieved through DNA barcoding is only around 70%, and this accuracy may be further reduced in plants with complex genomes (Besse et al., 2021).
Variation in chloroplast (cp) genomes has been used extensively in studies of population genetics, evolutionary relationships, and genetic relatedness, supporting the conservation of endangered plant species and the development of molecular markers for more efficient breeding. Next-generation sequencing (NGS) has become a common method for DNA sequencing, replacing the Sanger method in applications that require sequencing many target DNA or RNA molecules simultaneously or determining complete genomes. Numerous studies have demonstrated that NGS can address the remaining challenges of DNA barcoding, particularly in determining the origin of plants, detecting low-quality ingredients in products, and establishing the traceability of plant-derived materials (Galimberti et al., 2014). Although several phylogenetic studies based on complete cp sequences have been published, they have focused on individual genera such as Ficus (Zhang et al., 2022) or Morus (Zeng et al., 2022), or on a limited number of genera analyzed together (He et al., 2020; De Souza et al., 2021). In this study, the complete cp genome sequences of 32 species spanning 12 genera within the mulberry family were retrieved from GenBank and analyzed. Comparison of the sequences identified clustering patterns and variable DNA regions within individual species and across the family. These findings have the potential to improve the accuracy of species classification within the mulberry family.
MATERIALS AND METHODS
Sequence annotation and comparison of chloroplast genomes
A total of 32 complete cp genome sequences, representing 32 species from 12 genera within the mulberry family, were obtained from NCBI GenBank. For genera with more than five available sequences (Artocarpus, Broussonetia, Ficus, and Morus), five cp sequences were randomly selected for further analysis (Table 1). The GeSeq program (https://chlorobox.mpimp-golm.mpg.de/geseq.html) was used to annotate and locate genes in the cp genomes (Tillich et al., 2017). Chloroplot software (https://irscope.shinyapps.io/Chloroplot/) was used to determine the number of protein-coding genes, rRNA genes, and tRNA genes and the GC content of each cp genome (Zheng et al., 2020). All 32 cp sequences were then compared using the VISTA program (https://genome.lbl.gov/vista/index.shtml) in Shuffle-LAGAN mode (Brudno et al., 2003).
Repeat element analysis
The 32 genome sequences were aligned with the MAFFT program (https://mafft.cbrc.jp/alignment/server/) using default settings. The IRscope tool (https://irscope.shinyapps.io/irapp/) was then used to visualize the large single-copy region (LSC)/inverted repeat B (IRB)/small single-copy region (SSC)/inverted repeat A (IRA) junctions among these sequences, based on the annotations of the cp genomes available in GenBank (Amiryousefi et al., 2018). Simple sequence repeat (SSR) motifs were detected with the MISA tool (http://pgrc.ipk-gatersleben.de/misa/misa.html) using the following minimum repeat numbers: ten for mononucleotides, six for dinucleotides, five for trinucleotides, four for tetranucleotides, and three each for penta- and hexanucleotides (Beier et al., 2017).
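The SSR thresholds above can be illustrated with a minimal Python sketch. It is not the MISA implementation: it reports only perfect, non-compound repeats, and the function name find_ssrs and the test sequence are hypothetical, introduced here for illustration only.

```python
import re

# Minimum repeat counts per motif length, taken from the thresholds stated above.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 4, 5: 3, 6: 3}


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Illustrative only: compound SSRs, which MISA can also report,
    are not handled here.
    """
    seq = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # ([ACGT]{k}) captures one repeat unit; \1{n-1,} then requires
        # at least n copies of that unit in total.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (for example "AA" or "ATAT"), so each SSR is counted once.
            if any(
                motif == motif[:d] * (unit_len // d)
                for d in range(1, unit_len)
                if unit_len % d == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


# Hypothetical test sequence containing one mono-, one di-, and one
# trinucleotide repeat that each just meet or exceed the thresholds.
example = "GGC" + "A" * 12 + "CGC" + "AT" * 6 + "GCG" + "TTC" * 5 + "G"
ssrs = find_ssrs(example)
```

Applied to a cp genome sequence string, such a function would give per-genome SSR counts by motif; counts from MISA itself may differ because of how it handles compound and overlapping repeats.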
Phylogenetic analysis
The TimeTree tool (http://www.timetree.org) was used to obtain the divergence times and initial phylogenetic relatedness of all species in the mulberry family based on the molecular sequences available in NCBI (Kumar et al., 2017). The 32 cp genomes were then aligned using MAFFT (https://mafft.cbrc.jp/alignment/server), and phylogenetic trees were constructed with the Maximum Likelihood method, a discrete character method (Kang et al., 2017), in MEGA11 (Tamura et al., 2021) with 500 bootstrap replicates. Cp genomes from four other families were included as outgroups (MK361034: Rosa banksiae; NC_058887: Elaeagnus pungens; NC_040984: Barbeya oleoides; and NC_026562: Cannabis sativa), following the outgroup selection strategy described by Luo et al. (2010). The resolution of the cp genomes was assessed from the resulting phylogenetic tree using the following criterion: if all species belonging to a genus were grouped in a single monophyletic clade, the genus was considered clearly resolved; if species from a genus were scattered across different clades, the genus was considered unresolved (Sikdar et al., 2018).
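The resolution criterion above can be expressed as a short Python sketch. It assumes a rooted tree written as nested tuples whose leaves are "Genus species" strings; the helper names and the toy topology are hypothetical and do not reproduce the tree obtained in this study.

```python
def leaves(node):
    """Return the list of leaf labels under a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    labels = []
    for child in node:
        labels.extend(leaves(child))
    return labels


def clades(node):
    """Yield every subtree (clade) of the tree, including single leaves."""
    yield node
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, genus):
    """True if some clade contains all species of `genus` and nothing else."""
    target = {leaf for leaf in leaves(tree) if leaf.split()[0] == genus}
    if not target:
        raise ValueError("genus not present in tree: " + genus)
    return any(set(leaves(clade)) == target for clade in clades(tree))


# Hypothetical toy topology, for illustration only.
toy_tree = (
    (("Morus alba", "Morus indica"), ("Ficus carica", "Ficus religiosa")),
    (("Broussonetia papyrifera", "Malaisia scandens"), "Broussonetia kazinoki"),
)

resolved = {g: is_monophyletic(toy_tree, g) for g in ("Morus", "Ficus", "Broussonetia")}
```

In this toy tree, Morus and Ficus each form their own clade and count as resolved, whereas the smallest clade holding both Broussonetia leaves also contains Malaisia scandens, so Broussonetia counts as unresolved.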
RESULTS AND DISCUSSION
Sequence annotation and comparison of chloroplast genomes
A search of the NCBI database yielded complete cp genome sequences of 32 species from 12 genera within the mulberry family for analysis. As of February 2, 2024, the genus Ficus had the highest number of available sequences, with 34 cp genomes. It was followed by Morus, Artocarpus, Broussonetia, Maclura, Milicia, Trophis, Afromorus, Antiaris, Malaisia, Pseudostreblus, and Streblus, with 16, 12, 6, 3, 2, 2, 1, 1, 1, 1, and 1 sequences, respectively. To keep representation balanced, up to five sequences per genus were selected for further analysis, resulting in a total of 32 cp sequences (Table 1). The GeSeq program was used to obtain the structural characteristics and gene contents of the cp genomes, as shown in Fig. 1. Like other cp genomes, all cp genomes in this study exhibited a four-part structure comprising an LSC, an SSC, and two IRs.
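The selection rule can be checked with a small Python sketch that caps each genus at five randomly drawn records. The per-genus counts are those listed above; the integer record indices and the fixed seed are placeholders standing in for GenBank accessions and are assumptions of this sketch, not the selection actually used for Table 1.

```python
import random

# Number of complete cp genomes available per genus, as listed above.
AVAILABLE = {
    "Ficus": 34, "Morus": 16, "Artocarpus": 12, "Broussonetia": 6,
    "Maclura": 3, "Milicia": 2, "Trophis": 2, "Afromorus": 1,
    "Antiaris": 1, "Malaisia": 1, "Pseudostreblus": 1, "Streblus": 1,
}
MAX_PER_GENUS = 5


def pick_per_genus(available, cap=MAX_PER_GENUS, seed=0):
    """Randomly pick up to `cap` record indices for each genus."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    picks = {}
    for genus, n_records in available.items():
        # Placeholder indices; a real run would hold accession numbers here.
        records = list(range(n_records))
        picks[genus] = sorted(rng.sample(records, min(n_records, cap)))
    return picks


picks = pick_per_genus(AVAILABLE)
total = sum(len(chosen) for chosen in picks.values())
```

With these counts, four genera are capped at five sequences and the remaining eight contribute all of their sequences, giving 4 × 5 + 3 + 2 + 2 + 5 × 1 = 32.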
The size of the 32 cp genomes ranged from 158,459 bp in Morus mongolica to 162,594 bp in Broussonetia luzonica, with an average of 160,432 bp per cp genome. The cp genomes of the mulberry family are slightly smaller than those of other land plants, as described by De Las Rivas et al. (2002). The number of protein-coding genes varies from 80 in Ficus religiosa to 92 in Morus mongolica. No consistent pattern governs the number of these genes among species within the same genus or across genera, as the numbers vary considerably even within a single genus. Previous studies have also reported substantial variation in the number of protein-coding genes within genera such as Prunus (Xue et al., 2019), Cycas (Chang et al., 2020), and Rhus (Xu et al., 2022). As more complete cp genomes become available for each species, it should become feasible to determine species-specific gene counts and to look for patterns in gene number within a genus. Most of the cp genomes contain 8 rRNA genes and 37 tRNA genes, except for Ficus racemosa, which has only 4 rRNA and 27 tRNA genes. Artocarpus petelotii is the only species with 36 tRNA genes. The average GC content of the 32 cp genomes is approximately 36%.
When the cp genome of Artocarpus hypargyreus (NC_057287.1) served as the reference for aligning the cp genomes with mVISTA, noticeable distinctions were observed between the cp sequences of Artocarpus species (NC_057287.1: A. hypargyreus; NC_059002.1: A. altilis; NC_054247.1: A. camansi; NC_056286.1: A. petelotii; and NC_080592.1: A. gomezianus) and the other cp sequences (Fig. 2). A large gap in the NC_080592.1 sequence warrants further investigation to determine whether this fragment was altered in the cp genome by mutation or reflects incomplete genome assembly.
Repeat element analysis
Using the MISA program with the thresholds given in the Materials and Methods, tandem repeat sequences consisting of 1–6 nucleotide repeat units were analyzed to determine the relative abundance of SSRs (Fig. 3). A total of 2,140 SSRs were detected across the 32 cp genomes, ranging from 51 SSRs in Artocarpus petelotii to 97 SSRs in Maclura cochinchinensis, with an average of approximately 66 SSRs per cp genome. Among the detected SSRs, 11 different motifs were identified, namely A, T, C, G, AT, TA, TAA, TTC, TTA, AAT, and AAAT. The dominant mononucleotide types were A and T, accounting for 35.4% (758) and 56.3% (1,206) of the total SSRs, respectively. In contrast, the mononucleotide types C and G were rarely detected, with only 22 and 16 instances, respectively. Dinucleotide (AT and TA), trinucleotide (TAA, TTC, TTA, and AAT), and tetranucleotide (AAAT) motifs were identified at lower frequencies. The variation in SSR motifs among species within this family can provide valuable information for species identification, population genetics, and phylogenetic studies (Androsiuk et al., 2020).
While the IR regions of cp genomes are typically highly conserved, several instances of gene loss have been recorded in diverse plant species, such as barley, bamboo, cassava, and chickpea (Dobrogojski et al., 2020). Aligning the LSC/IRB/SSC/IRA borders and adjacent genes in the 32 cp genomes revealed significant variations (Fig. 4). Notably, ndhF displayed the highest variability, ranging from translocation to the right side of the SSC in Artocarpus gomezianus, to truncation in Antiaris toxicaria, Streblus indicus, and Pseudostreblus indicus, to complete loss in Ficus concinna and F. religiosa. The loss of rpl22 was observed in all species of the genus Broussonetia, as well as in Malaisia scandens and Trophis scandens. These findings are consistent with a previous study by Mohanta et al. (2020), which examined 2,511 cp genomes and identified ndhF and rpl22 as among the most frequently deleted genes in cp genomes. Variations among cp genomes are commonly observed, and several explanations have been proposed, including gene loss, expansion or contraction of IR regions, and intron loss (Mower and Vickrey, 2018).
Phylogenetic analysis
Using the TimeTree tool, information was collected from the NCBI GenBank database for 390 species across 37 genera within the mulberry family. The divergence times of these 37 genera are presented in Fig. 5. The data indicate that speciation events within the mulberry family took place over a wide range of time, from approximately 59 million years ago to around 6 million years ago, spanning several geologic periods. However, three genera, namely Afromorus, Malaisia, and Pseudostreblus, were not included in this phylogenetic tree, indicating a discrepancy between the genera listed in the TimeTree database and the genera represented among the cp genomes analyzed here.
Furthermore, a phylogenetic analysis was conducted using the 32 available cp genomes from 12 genera, and the resulting tree, which has high bootstrap support, is presented in Fig. 6. Among these genera, Artocarpus, Milicia, Morus, Maclura, and Ficus were the most conserved, each forming a distinct monophyletic clade. Previously, the Royal Botanic Gardens Kew considered Streblus indicus and Pseudostreblus indicus as two separate species (https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:856410-1). However, our analysis shows that they cluster together, suggesting a close relationship and the possibility that they are a single species. In contrast, although the five cp sequences of the genus Broussonetia were grouped within a single clade, the insertion of the Malaisia scandens and Trophis scandens sequences fragmented their clustering. The inconsistency with earlier phylogenetic results may be attributed to differences in the number of markers used in different studies. Phylogenetic studies based on morphological traits or specific genes or markers can yield erroneous outcomes because these markers evolve at different rates (Wu and Ge, 2011). In contrast, the complete cp genome provides far more information and higher resolution, making it valuable for classifying organisms below the species level and for phylogenetic studies (Long et al., 2023).
CONCLUSIONS
As the sequencing of cp genomes becomes increasingly accessible, there is a notable surge in the number of sequenced genomes. Consequently, it becomes essential to conduct in silico studies to comprehensively assess the cp genomes generated from various independent studies. This study aims to elucidate the common structure and content of cp genomes within the mulberry family by analyzing 32 species. By examining the similarities and differences among these cp genomes, we can enhance our understanding of the genetic structure of this plant family.