FishSNP: a high quality cross-species SNP database of fishes


The progress of aquaculture relies heavily on efficiently tapping into diverse genetic resources to boost production efficiency and increase profitability. Single nucleotide polymorphisms (SNPs) play a crucial role in aquaculture genomics, genetics, and breeding research because they are widespread molecular markers found throughout the genome. With the rapid evolution of sequencing technologies, there has been a surge of fish genomics data. However, this essential SNP data is scattered across various databases and literature sources, which makes it difficult for researchers to access and use. Current databases have limited coverage of SNP data and fish species, especially economically significant species. Furthermore, the lack of a comprehensive database providing quality information for SNP markers prevents researchers from judging how reliable a given marker is. Hence, establishing an accurate and dependable SNP database is of utmost importance for advancing fish research.

We primarily sourced the fish SNP data from the following three channels: (1) Fish SNP marker data in public databases. We consolidated existing fish SNP data from public databases and clearly documented their origins in the dataset to ensure data traceability and transparency. (2) SNP data reported in published literature. We systematically gathered relevant SNP data from published research articles focused on fish SNP studies. (3) SNP markers obtained by processing original literature data through a unified approach. In cases where a publication provided raw sequencing data but no SNP information, we applied a unified process to conduct SNP calling on the raw data, resulting in a more comprehensive and accurate dataset.

While collecting fish SNP data, we encountered several challenges. One major issue was the inconsistency in data formats, as SNP data across various articles differed in format and did not always align with the standard VCF format. To tackle this, we had to extract key information and standardize the data by converting it into VCF format. Another obstacle was retrieving missing data from publications that provided neither SNP information nor raw sequencing data. To address this, we contacted the authors via email to request the necessary data. Moreover, inconsistencies in reference genome versions posed a challenge, as different studies might use different versions of the reference genome. To ensure data uniformity and accessibility, we compared and standardized the data against both the original genome version and the most recent version available on NCBI.
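The format standardization described above can be illustrated with a minimal sketch. It assumes the key fields (chromosome, position, reference allele, alternate allele) have already been extracted from an article's table. The `SnpRecord` type and the `to_vcf_lines` helper are hypothetical names chosen for this example, not part of the FishSNP pipeline, and the output is a bare sites-only VCF with no contig or INFO definitions.

```python
from typing import Iterable, List, NamedTuple


class SnpRecord(NamedTuple):
    """One SNP extracted from a publication (hypothetical structure)."""
    chrom: str
    pos: int            # 1-based position, as VCF expects
    ref: str
    alt: str
    snp_id: str = "."   # "." means "no identifier" in VCF


VALID_BASES = {"A", "C", "G", "T"}


def to_vcf_lines(records: Iterable[SnpRecord], source: str) -> List[str]:
    """Convert extracted SNP records into minimal sites-only VCF lines."""
    lines = [
        "##fileformat=VCFv4.2",
        f"##source={source}",  # records where the data came from
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for rec in sorted(records, key=lambda r: (r.chrom, r.pos)):
        ref, alt = rec.ref.upper(), rec.alt.upper()
        # Keep only true single-base substitutions.
        if ref not in VALID_BASES or alt not in VALID_BASES or ref == alt:
            continue
        # QUAL, FILTER and INFO are unknown here, so they are left as ".".
        lines.append("\t".join(
            [rec.chrom, str(rec.pos), rec.snp_id, ref, alt, ".", ".", "."]
        ))
    return lines


if __name__ == "__main__":
    # Invented example records, for illustration only.
    example = [
        SnpRecord("chr2", 5, "a", "g"),
        SnpRecord("chr1", 10, "C", "T", "snp_0001"),
        SnpRecord("chr1", 3, "A", "A"),  # not a variant, dropped
    ]
    print("\n".join(to_vcf_lines(example, "example_article")))
```

Keeping the source in the header mirrors the traceability goal mentioned earlier. A real conversion would also need to check each REF allele against the reference genome version in use.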

In developing the workflows for processing raw sequencing data, we explored various widely used mutation detection tools for SNP discovery, including SAMtools/BCFtools, CLC Genomics Workbench, FreeBayes, GATK, 16GT, LoFreq, SNVer, VarDict, and VarScan. After thorough evaluation, we chose GATK for our data analysis. While GATK is best known for its accuracy and high citation rate in human studies, it is also extensively employed in non-human species. Nevertheless, there is an ongoing debate about the relative accuracy of GATK and SAMtools/BCFtools. Several studies suggest that GATK demonstrates superior accuracy in mutation detection for non-human species. Examples include work on the model plant Arabidopsis thaliana and a comparison of mutation detection tools across multiple species, including tomato. Studies on rainbow trout, a fish species, also demonstrated fewer false-positive SNPs when using GATK. Research supporting SAMtools/BCFtools is based on testing single-species datasets such as fruit fly, chicken, and wheat. Overall, the research indicates that although SAMtools/BCFtools can detect more SNP sites, GATK yields fewer false-positive SNPs for many species, including rainbow trout. As our database is positioned as a high-quality SNP database, we chose GATK as the mutation detection tool.
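The trade-off between detecting more sites and producing fewer false positives is usually expressed as recall versus precision against a set of validated SNPs. The sketch below shows one way to compute both. The function name `compare_to_truth` and all call sets are invented for illustration; they are not results from GATK, SAMtools/BCFtools, or any FishSNP evaluation.

```python
from typing import Dict, Set, Tuple

# A site is identified by (chromosome, 1-based position, ref, alt).
Site = Tuple[str, int, str, str]


def compare_to_truth(called: Set[Site], truth: Set[Site]) -> Dict[str, float]:
    """Score a caller's SNP set against a validated truth set."""
    true_pos = len(called & truth)    # called and validated
    false_pos = len(called - truth)   # called but not validated
    false_neg = len(truth - called)   # validated but missed
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    # Made-up truth set and two made-up call sets.
    truth = {("chr1", p, "A", "G") for p in (10, 20, 30, 40)}
    sensitive = truth | {("chr1", p, "C", "T") for p in (50, 60)}  # finds all, adds 2 extras
    conservative = {("chr1", p, "A", "G") for p in (10, 20, 30)}   # misses 1, no extras

    for name, calls in (("sensitive", sensitive), ("conservative", conservative)):
        print(name, compare_to_truth(calls, truth))
```

In this toy case the sensitive call set has higher recall and the conservative one has higher precision, which is the pattern a database prioritizing marker quality would weigh in favor of the conservative caller.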

In the validation of SNPs, researchers commonly rely on a combination of sequencing details and population genetic characteristics. Whether using SAMtools/BCFtools, GATK, or DeepVariant, the filtering process is based on the results of sequence alignment. However, factors such as errors in library preparation and biases in sequencing can affect the accuracy of identified SNPs. Population genetic attributes therefore play a pivotal role as well. These attributes are identified by analyzing the genetic composition of samples from diverse populations. Tests such as Mendelian segregation ratios and Hardy-Weinberg equilibrium assessments offer a perspective distinct from sequencing details: they evaluate SNPs from a genetic standpoint, providing an independent check on their accuracy. In practice, human SNP databases combine both methodologies to bolster the accuracy of identified SNPs.
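As an illustration of the population genetic checks mentioned above, here is a minimal sketch of a Hardy-Weinberg equilibrium test for a single biallelic SNP, using the standard chi-square goodness-of-fit test with one degree of freedom. The function name and the genotype counts are assumptions made for this example. The text does not state which test or threshold FishSNP applies, and an exact test is generally preferred when genotype counts are small.

```python
import math
from typing import Tuple


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic SNP.

    Takes observed counts of the three genotypes (AA, AB, BB) and
    returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped sample is required")

    # Allele frequencies estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    # Expected genotype counts under HWE: p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)

    # Skip classes with zero expectation (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Invented counts: the first fits HWE, the second has no heterozygotes.
    print(hwe_chi_square(25, 50, 25))
    print(hwe_chi_square(50, 0, 50))
```

A SNP whose genotype counts deviate strongly from expectation, such as one with no heterozygotes at intermediate allele frequency, may reflect a genotyping or alignment artifact rather than real biology, which is why such tests complement alignment-based filters.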

FishSNP allows users to easily search and navigate data by specifying criteria such as species and genes. Users can also download all raw data directly in VCF format for comprehensive analysis. We have also integrated annotation tools within the website to further assist users. These tools streamline the process of annotating SNPs, helping users interpret and analyze the data effectively.

Our future objective is to expand the database by incorporating a broader range of fish species and SNP data, thereby enhancing its depth and scope. Furthermore, we are committed to enhancing the quality and reliability of the included SNPs through an optimized SNP calling process. Our forthcoming efforts will involve evaluating various SNP calling software specifically for fish species. We plan to rigorously assess their performance and, based on the outcomes, provide researchers with a valuable reference for selecting mutation detection tools suited to fish species. This optimization initiative will include exploring additional SNP calling tools and refining our SNP calling pipeline to improve accuracy and efficiency. Through these refinements, our aim is to offer researchers a robust resource for conducting fish genomics studies, facilitating deeper insights into the genetic diversity and evolutionary dynamics of fish populations.
