18 new PubMed citations were retrieved for your search.
Click on the search hyperlink below to display the complete search results:
"BMC Bioinformatics"[jour]
These PubMed results were generated on 2020/03/14
PubMed comprises millions of citations for biomedical literature from MEDLINE, life science journals, and online books.
Citations may include links to full-text content from PubMed Central and publisher web sites.
1.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):80. doi: 10.1186/s12859-020-3347-7.
visGReMLIN: graph mining-based detection and visualization of conserved motifs at 3D protein-ligand interface at the atomic level.
Ribeiro VS1, Santana CA2, Fassio AV2, Cerqueira FR3, da Silveira CH4, Romanelli JPR4, Patarroyo-Vargas A5, Oliveira MGA5,6, Gonçalves-Almeida V2, Izidoro SC4, de Melo-Minardi RC7, Silveira SA8,9.
Abstract
BACKGROUND:
Interactions between proteins and non-proteic small molecule ligands play important roles in the biological processes of living systems. Thus, the development of computational methods to support our understanding of the ligand-receptor recognition process is of fundamental importance since these methods are a major step towards ligand prediction, target identification, lead discovery, and more. This article presents visGReMLIN, a web server that couples a graph mining-based strategy to detect motifs at the protein-ligand interface with an interactive platform to visually explore and interpret these motifs in the context of protein-ligand interfaces.
RESULTS:
To illustrate the potential of visGReMLIN, we conducted two cases in which our strategy was compared with previous experimentally and computationally determined results. visGReMLIN allowed us to detect patterns previously documented in the literature in a totally visual manner. In addition, we found some motifs that we believe are relevant to protein-ligand interactions in the analyzed datasets.
CONCLUSIONS:
We aimed to build a visual analytics-oriented web server to detect and visualize common motifs at the protein-ligand interface. visGReMLIN motifs can support users in gaining insights on the key atoms/residues responsible for protein-ligand interactions in a dataset of complexes.
2.
BMC Bioinformatics. 2020 Mar 12;21(1):101. doi: 10.1186/s12859-020-3444-7.
Network hub-node prioritization of gene regulation with intra-network association.
Abstract
BACKGROUND:
To identify and prioritize the influential hub genes in a gene-set or biological pathway, most analyses rely on calculation of marginal effects or tests of statistical significance. These procedures may be inappropriate since hub nodes are common connection points and therefore may interact with other nodes more often than non-hub nodes do. Such dependence among gene nodes can be conjectured based on the topology of the pathway network or the correlation between them.
RESULTS:
Here we develop a pathway activity score incorporating the marginal (local) effects of gene nodes as well as intra-network affinity measures. This score summarizes the expression levels in a gene-set/pathway for each sample, with weights on local and network information, respectively. The score is next used to examine the impact of each node through a leave-one-out evaluation. To illustrate the procedure, two cancer studies, one involving RNA-Seq from breast cancer patients with high-grade ductal carcinoma in situ and one microarray expression data from ovarian cancer patients, are used to assess the performance of the procedure, and to compare with existing methods, both ones that do and do not take into consideration correlation and network information. The hub nodes identified by the proposed procedure in the two cancer studies are known influential genes; some have been included in standard treatments and some are currently considered in clinical trials for target therapy. The results from simulation studies show that when marginal effects are mild or weak, the proposed procedure can still identify causal nodes, whereas methods relying only on marginal effect size cannot.
CONCLUSIONS:
The NetworkHub procedure proposed in this research can effectively utilize the network information in combination with local effects derived from marker values, and provide a useful and complementary list of recommendations for prioritizing causal hubs.
KEYWORDS:
Direction regularization; Intra-network; Neighbor correlation; Pathway activity score; Topology measure
3.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):81. doi: 10.1186/s12859-020-3348-6.
BLAMM: BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs.
Abstract
BACKGROUND:
The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources for which a number of efficient yet complex algorithms have been proposed.
RESULTS:
We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10^-4 using a 36-core machine. On a dual GPU system, the same task can be performed in under 5 min.
CONCLUSIONS:
BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.
KEYWORDS:
Basic linear algebra subprograms (BLAS); Graphics processing units (GPUs); High performance computing (HPC); Position weight matrix (PWM)
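Note on item 3: the matrix-product formulation described in the BLAMM abstract can be illustrated with a minimal sketch. It assumes that all PWMs have the same width and that each sequence window is one-hot encoded into a row vector, so that one product scores every window against every PWM. The function names and the plain-Python product are illustrative assumptions; they are not taken from the BLAMM source code, which relies on optimized BLAS routines.

```python
# Illustrative sketch only: PWM scanning expressed as a matrix-matrix product.
# The names below are hypothetical and do not come from BLAMM.

BASES = "ACGT"

def one_hot_windows(seq, width):
    """One row per window; each row is the flattened one-hot encoding (width * 4 values)."""
    rows = []
    for start in range(len(seq) - width + 1):
        row = []
        for base in seq[start:start + width]:
            row.extend(1.0 if base == b else 0.0 for b in BASES)
        rows.append(row)
    return rows

def matmul(a, b):
    """Naive dense product; an optimized BLAS GEMM call would replace this in practice."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scan(seq, pwms, threshold):
    """Score all windows against all PWMs at once; return (position, pwm_index, score) hits.

    Each PWM is given flattened, position by position, in A, C, G, T order.
    Assumption: every PWM has the same width.
    """
    width = len(pwms[0]) // 4
    windows = one_hot_windows(seq, width)
    # Transpose so that each column of the right-hand matrix is one flattened PWM.
    pwm_columns = [list(col) for col in zip(*pwms)]
    scores = matmul(windows, pwm_columns)
    return [(i, j, s) for i, row in enumerate(scores)
            for j, s in enumerate(row) if s >= threshold]
```

Because the product computes every score regardless of the threshold, the cost of the scoring step does not depend on the chosen cutoff, which is consistent with the runtime property stated in the abstract.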
4.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):87. doi: 10.1186/s12859-020-3354-8.
Conversion from electrocardiosignals to equivalent electrical sources on heart surface.
Zhikhareva GV1, Kramm MN2, Bodin ON3, Seepold R4,5, Madrid NM5,6, Chernikov AI1, Kupriyanova YA1, Zhuravleva NA1.
Abstract
BACKGROUND:
The current task of electrocardiographic examinations is to increase the reliability of diagnosing the condition of the heart. Within this task, an important direction is solving the inverse problem of electrocardiography, based on processing the electrocardiographic signals of multichannel cardio leads at known electrode coordinates in these leads (Titomir et al. Noninvasive electrocardiotopography, 2003), (Macfarlane et al. Comprehensive Electrocardiology, 2nd ed. (Chapter 9), 2011).
RESULTS:
To obtain more detailed information about the electrical activity of the heart, we reconstruct the distribution of equivalent electrical sources on the heart surface. We perform this reconstruction of the equivalent sources during the cardiac cycle at relatively low hardware cost. ECG maps of electrical potentials on the surface of the torso (TSPM) and of electrical sources on the surface of the heart (HSSM) were studied for different times of the cardiac cycle. We carried out a visual and quantitative comparison of these maps in the presence of pathological regions of different localization. For this purpose we used a model of the heart's electrical activity based on cellular automata.
CONCLUSIONS:
The cellular automata model allows us to consider the processes of heart excitation in the presence of pathological regions of various sizes and localization. It is shown that changes in the distribution of electrical sources on the surface of the epicardium in the presence of pathological areas with disturbances in the conduction of heart excitation are much more noticeable than changes in ECG maps on the torso surface.
KEYWORDS:
Cellular automata; Electric potential; Electrocardiographic leads; Equivalent electric sources; Heart; Maps of distributions; Multichannel; Reconstruction; Torso
5.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):85. doi: 10.1186/s12859-020-3352-x.
GSP4PDB: a web tool to visualize, search and explore protein-ligand structural patterns.
Abstract
BACKGROUND:
In the field of protein engineering and biotechnology, the discovery and characterization of structural patterns is highly relevant as these patterns can give fundamental insights into protein-ligand interaction and protein function. This paper presents GSP4PDB, a bioinformatics web tool that enables the user to visualize, search and explore protein-ligand structural patterns within the entire Protein Data Bank.
RESULTS:
We introduce the notion of graph-based structural pattern (GSP) as an abstract model for representing protein-ligand interactions. A GSP is a graph where the nodes represent entities of the protein-ligand complex (amino acids and ligands) and the edges represent structural relationships (e.g. distances ligand - amino acid). The novel feature of GSP4PDB is a simple and intuitive graphical interface where the user can "draw" a GSP and execute its search in a relational database containing the structural data of each PDB entry. The results of the search are displayed using the same graph-based representation of the pattern. The user can further explore and analyse the results using a wide range of filters, or download their related information for external post-processing and analysis.
CONCLUSIONS:
GSP4PDB is a user-friendly and efficient application to search and discover new patterns of protein-ligand interaction.
KEYWORDS:
Big data; PDB; Protein-ligand interaction; Structural patterns
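Note on item 5: the graph-based structural pattern (GSP) described in the abstract, with nodes for amino acids and ligands and edges for structural relationships such as distances, can be pictured with a small data structure. The sketch below is an assumption for illustration only: the class names, fields, and the distance-range edge are hypothetical and do not reflect GSP4PDB's actual schema or query engine.

```python
# Illustrative sketch only: a tiny graph pattern with distance constraints.
# All names here are hypothetical; GSP4PDB's real data model is not described in the abstract.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    name: str   # label inside the pattern, e.g. "lig" or "aa1"
    kind: str   # "ligand" or "amino_acid"
    code: str   # identifier such as "ZN" or "HIS"

@dataclass(frozen=True)
class DistanceEdge:
    source: str
    target: str
    min_dist: float  # angstroms
    max_dist: float  # angstroms

@dataclass
class Pattern:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def matches(self, distances):
        """Check a candidate whose observed distances are keyed by (source, target)."""
        return all(
            (e.source, e.target) in distances
            and e.min_dist <= distances[(e.source, e.target)] <= e.max_dist
            for e in self.edges
        )

# Example pattern: a zinc ion within 1.5 to 3.0 angstroms of a histidine.
zinc_his = Pattern(
    nodes=[Node("lig", "ligand", "ZN"), Node("aa1", "amino_acid", "HIS")],
    edges=[DistanceEdge("lig", "aa1", 1.5, 3.0)],
)
```

A candidate complex satisfies the pattern when every edge constraint holds, for example `zinc_his.matches({("lig", "aa1"): 2.1})`.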
6.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):83. doi: 10.1186/s12859-020-3350-z.
Amazing symmetrical clustering in chloroplast genomes.
Abstract
BACKGROUND:
A seven-cluster pattern, claimed to be universal in bacterial genomes, has been reported previously. Keeping in mind the most popular theory of chloroplast origin, we checked whether a similar pattern is observed in chloroplast genomes.
RESULTS:
Surprisingly, an eight-cluster structure was found for chloroplasts. The pattern observed for chloroplasts differs rather significantly from the bacterial one and from that observed for cyanobacteria. The structure is obtained by clustering fragments of equal length isolated within a genome, with each fragment converted into a triplet frequency dictionary of non-overlapping triplets tiled in frame with no gaps. The points in 63-dimensional space were clustered using the elastic map technique. The eighth cluster found in chloroplasts comprises the fragments of a genome bearing tRNA genes and exhibiting excessively high GC-content in comparison to the entire genome.
CONCLUSION:
Chloroplasts exhibit a very specific type of symmetry in the distribution of coding and non-coding fragments of a genome in the space of triplet frequencies: mirror symmetry. Cyanobacteria may have both mirror symmetry and the rotational symmetry typical for other bacteria.
KEYWORDS:
Clustering; Order; Triplet
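Note on item 6: the conversion of equal-length fragments into triplet frequency dictionaries can be sketched as follows. There are 64 possible triplets; since the frequencies sum to one, the points presumably lie in a 63-dimensional space, which would match the dimension quoted in the abstract. The fragment length and step are free parameters here, and the function names are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only: non-overlapping, gap-free triplet frequencies per fragment.
from itertools import product

TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 triplets

def triplet_frequencies(fragment):
    """Frequencies of triplets read consecutively from the start of the fragment."""
    counts = dict.fromkeys(TRIPLETS, 0)
    usable = len(fragment) - len(fragment) % 3  # drop an incomplete trailing triplet
    for i in range(0, usable, 3):
        triplet = fragment[i:i + 3]
        if triplet in counts:  # skip triplets with ambiguous symbols such as N
            counts[triplet] += 1
    total = sum(counts.values())
    return {t: (c / total if total else 0.0) for t, c in counts.items()}

def fragments(genome, length, step):
    """Equal-length fragments taken along the genome (length and step are assumptions)."""
    return [genome[i:i + length] for i in range(0, len(genome) - length + 1, step)]
```

Each fragment then becomes one point, for example `[triplet_frequencies(f) for f in fragments(genome, 603, 11)]`, which could be passed to any clustering method; the elastic map technique itself is not shown.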
7.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):90. doi: 10.1186/s12859-020-3357-5.
Applications of machine learning for simulations of red blood cells in microfluidic devices.
Abstract
BACKGROUND:
For optimization of microfluidic devices for the analysis of blood samples, it is useful to simulate blood cells as elastic objects in flow of blood plasma. In such numerical models, we primarily need to take into consideration the movement and behavior of the dominant component of the blood, the red blood cells. This can be done quite precisely in small channels and within a short timeframe. However, larger volumes or timescales require different approaches. Instead of simplifying the simulation, we use a neural network to predict the movement of the red blood cells.
RESULTS:
The neural network uses data from the numerical simulation for learning; however, the simulation needs to be run only once. Alternatively, the data could come from video processing of a recording of a biological experiment. Afterwards, the network is able to predict the movement of the red blood cells because it is a system of bases that gives an approximate cell velocity at each point of the simulation channel as a linear combination of bases. In a simple box geometry, the neural network gives results comparable to predictions using fluid streamlines; however, in a channel with obstacles forming slits, the neural network is about five times more accurate. The network can also be used as a discriminator between different situations. We observe about a two-fold increase in mean relative error when a network trained on one geometry is used to predict trajectories in a modified geometry. An even larger increase was observed when it was used to predict trajectories of cells with different elastic properties.
CONCLUSIONS:
While for uncomplicated box channels there is no advantage in using a system of bases instead of a simple prediction using fluid streamlines, in a more complicated geometry, the neural network is significantly more accurate. Another application of this system of bases is using it as a comparison tool for different modeled situations. This has a significant future potential when applied to processing data from videos of microfluidic flows.
KEYWORDS:
Cell trajectories; Microfluidic device; Neural network; Red blood cell; Simulation of fluid
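Note on item 7: the idea of approximating the cell velocity at each point of the channel as a linear combination of bases can be sketched with Gaussian radial basis functions. The choice of Gaussian bases, the shared width, and all names below are assumptions made for illustration; the abstract does not state which bases or training procedure the authors use.

```python
# Illustrative sketch only: velocity at a point as a weighted sum of basis functions.
# Gaussian bases are an assumption; the abstract does not specify the basis type.
import math

def gaussian_basis(point, center, width):
    """Activation of one basis function at the given point."""
    dist_sq = sum((p - c) ** 2 for p, c in zip(point, center))
    return math.exp(-dist_sq / (2.0 * width ** 2))

def predict_velocity(point, centers, weights, width):
    """weights[i] is the (vx, vy) contribution of the basis centered at centers[i]."""
    activations = [gaussian_basis(point, c, width) for c in centers]
    vx = sum(a * w[0] for a, w in zip(activations, weights))
    vy = sum(a * w[1] for a, w in zip(activations, weights))
    return vx, vy
```

In such a scheme the weights would be fitted to velocities observed in one simulation run or extracted from video, after which predicting a trajectory only requires evaluating the sum at successive positions.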
8.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):89. doi: 10.1186/s12859-020-3356-6.
Robust pathway sampling in phenotype prediction. Application to triple negative breast cancer.
Cernea A1, Fernández-Martínez JL2, deAndrés-Galiana EJ1,3, Fernández-Ovies FJ1, Alvarez-Machancoses O1, Fernández-Muñiz Z1, Saligan LN4, Sonis ST5,6.
Abstract
BACKGROUND:
Phenotype prediction problems are usually considered ill-posed, as the number of samples is very limited with respect to the scrutinized genetic probes. This fact complicates the sampling of the defective genetic pathways due to the high number of possible discriminatory genetic networks involved. In this research, we outline three novel sampling algorithms utilized to identify, classify and characterize the defective pathways in phenotype prediction problems, namely the Fisher's ratio sampler, the Holdout sampler and the Random sampler, and apply each one to the analysis of genetic pathways involved in tumor behavior and outcomes of triple negative breast cancers (TNBC). Altered biological pathways are identified using the most frequently sampled genes and are compared to those obtained via Bayesian Networks (BNs).
RESULTS:
Random, Fisher's ratio and Holdout samplers were more accurate and robust than BNs, while providing comparable insights about disease genomics.
CONCLUSIONS:
The three samplers tested are good alternatives to Bayesian Networks since they are less computationally demanding algorithms. Importantly, this analysis confirms the concept of "biological invariance" since the altered pathways should be independent of the sampling methodology and the classifier used for their inference. Nevertheless, some modifications to the Bayesian networks are still needed to sample the uncertainty space correctly in phenotype prediction problems, since the probabilistic parameterization of the uncertainty space is not unique and the use of the optimum network might falsify the pathways analysis.
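Note on item 8: the Fisher's ratio that gives one of the samplers its name is commonly defined, for a single gene and two classes, as the squared difference of the class means divided by the sum of the class variances. The sketch below ranks genes by that ratio. It is only an assumption about the underlying score; the sampling scheme built on top of it is not described in the abstract and is not reproduced here. Population variance is used for simplicity.

```python
# Illustrative sketch only: per-gene Fisher's ratio between two phenotype classes.
# The sampler from the paper is not implemented; only the ranking score is shown.
from statistics import mean, pvariance

def fisher_ratio(class_a, class_b):
    """(mean_a - mean_b)^2 / (var_a + var_b) for one gene's expression values."""
    spread = pvariance(class_a) + pvariance(class_b)
    gap = (mean(class_a) - mean(class_b)) ** 2
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread

def rank_genes(expression_a, expression_b):
    """Both arguments map gene name -> list of expression values for that class."""
    ratios = {g: fisher_ratio(expression_a[g], expression_b[g]) for g in expression_a}
    return sorted(ratios, key=ratios.get, reverse=True)
```

Genes at the top of the ranking separate the two classes most clearly relative to their within-class spread.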
9.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):77. doi: 10.1186/s12859-020-3344-x.
Dynamic incorporation of prior knowledge from multiple domains in biomarker discovery.
Abstract
BACKGROUND:
In biomarker discovery, applying domain knowledge is an effective approach to eliminating false positive features, prioritizing functionally impactful markers and facilitating the interpretation of predictive signatures. Several computational methods have been developed that formulate the knowledge-based biomarker discovery as a feature selection problem guided by prior information. These methods often require that prior information is encoded as a single score and the algorithms are optimized for biological knowledge of a specific type. However, in practice, domain knowledge from diverse resources can provide complementary information. But no current methods can integrate heterogeneous prior information for biomarker discovery. To address this problem, we developed the Know-GRRF (know-guided regularized random forest) method that enables dynamic incorporation of domain knowledge from multiple disciplines to guide feature selection.
RESULTS:
Know-GRRF embeds domain knowledge in a regularized random forest framework. It combines prior information from multiple domains in a linear model to derive a composite score, which, together with other tuning parameters, controls the regularization of the random forests model. Know-GRRF concurrently optimizes the weight given to each type of domain knowledge and other tuning parameters to minimize the AIC of out-of-bag predictions. The objective is to select a compact feature subset that has a high discriminative power and strong functional relevance to the biological phenotype. Via rigorous simulations, we show that Know-GRRF guided by multiple-domain prior information outperforms feature selection methods guided by single-domain prior information or no prior information. We then applied Know-GRRF to a real-world study to identify prognostic biomarkers of prostate cancers. We evaluated the combination of cancer-related gene annotations, evolutionary conservation and pre-computed statistical scores as the prior knowledge to assemble a panel of biomarkers. We discovered a compact set of biomarkers with significant improvements on prediction accuracies.
CONCLUSIONS:
Know-GRRF is a powerful novel method to incorporate knowledge from multiple domains for feature selection. It has a broad range of applications in biomarker discoveries. We implemented this method and released a KnowGRRF package in the R/CRAN archive.
KEYWORDS:
Biomarker discovery; Domain knowledge; Feature selection; Regularized random forest
10.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):92. doi: 10.1186/s12859-020-3359-3.
Visually guided classification trees for analyzing chronic patients.
Soguero-Ruiz C1, Mora-Jiménez I2, Mohedano-Munoz MA3, Rubio-Sanchez M3, Miguel-Bohoyo P4, Sanchez A3,5.
Abstract
BACKGROUND:
Chronic diseases are becoming more widespread each year in developed countries, mainly due to increasing life expectancy. Among them, diabetes mellitus (DM) and essential hypertension (EH) are two of the most prevalent ones. Furthermore, they can be the onset of other chronic conditions such as kidney or obstructive pulmonary diseases. The need to comprehend the factors related to such complex diseases motivates the development of interpretative and visual analysis methods, such as classification trees, which not only provide predictive models for diagnosing patients, but can also help to discover new clinical insights.
RESULTS:
In this paper, we analyzed healthy and chronic (diabetic, hypertensive) patients associated with the University Hospital of Fuenlabrada in Spain. Each patient was classified into a single health status according to clinical risk groups (CRGs). The CRGs characterize a patient through features such as age, gender, diagnosis codes, and drug codes. Based on these features and the CRGs, we have designed classification trees to determine the most discriminative decision features among different health statuses. In particular, we propose to make use of statistical data visualizations to guide the selection of features in each node when constructing a tree. We created several classification trees to distinguish among patients with different health statuses. We analyzed their performance in terms of classification accuracy, and drew clinical conclusions regarding the decision features considered in each tree. As expected, healthy patients and patients with a single chronic condition were better classified than patients with comorbidities. The constructed classification trees also show that the use of antipsychotics and the diagnosis of chronic airway obstruction are relevant for classifying patients with more than one chronic condition, in conjunction with the usual DM and/or EH diagnoses.
CONCLUSIONS:
We propose a methodology for constructing classification trees in a visually guided manner. The approach allows clinicians to progressively select the decision features at each of the tree nodes. The process is guided by exploratory data analysis visualizations, which may provide new insights and unexpected clinical information.
KEYWORDS:
Chronic health status; Classification trees; Diabetes; Diagnoses; Drugs; Hypertension; Multivariate visualization
11.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):91. doi: 10.1186/s12859-020-3358-4.
A machine learning approach on multiscale texture analysis for breast microcalcification diagnosis.
Fanizzi A1, Basile TMA2,3, Losurdo L4, Bellotti R2,3, Bottigli U5, Dentamaro R1, Didonna V1, Fausto A6, Massafra R1, Moschetta M7, Popescu O1, Tamborra P1, Tangaro S3, La Forgia D1.
Abstract
BACKGROUND:
Screening programs use mammography as the primary diagnostic tool for detecting breast cancer at an early stage. The diagnosis of some lesions, such as microcalcifications, is still difficult today for radiologists. In this paper, we proposed an automatic binary model for discriminating tissue in digital mammograms, as a support tool for radiologists. In particular, we compared the contribution of different methods on the feature selection process in terms of the learning performances and selected features.
RESULTS:
For each ROI, we extracted textural features on Haar wavelet decompositions and also interest points and corners detected by using Speeded Up Robust Feature (SURF) and Minimum Eigenvalue Algorithm (MinEigenAlg). Then a Random Forest binary classifier is trained on a subset of features selected by two different kinds of feature selection techniques, namely filter and embedded methods. We tested the proposed model on 260 ROIs extracted from digital mammograms of the BCDR public database. The best prediction performance for the normal/abnormal and benign/malignant problems reaches a median AUC value of 98.16% and 92.08%, and an accuracy of 97.31% and 88.46%, respectively. The experimental result was comparable with related work performance.
CONCLUSIONS:
The best performing result obtained with the embedded method is more parsimonious than that of the filter method. The SURF and MinEigen algorithms provide a strong informative content useful for the characterization of microcalcification clusters.
KEYWORDS:
Computer-aided diagnosis; Digital mammograms; Feature selection; Haar wavelet transform; Microcalcifications; Minimum eigenvalue algorithm; Random forest; SURF
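Note on item 11: the Haar wavelet decomposition mentioned in the abstract splits a signal into scaled pairwise sums (approximation) and pairwise differences (detail). The sketch below shows one decomposition level for a one-dimensional signal; for an image the same step would be applied along rows and then along columns. It is a generic textbook illustration under that assumption, not the authors' feature extraction pipeline.

```python
# Illustrative sketch only: one level of the orthonormal 1D Haar transform.
import math

def haar_step(signal):
    """Return (approximation, detail) coefficients for an even-length signal."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail
```

Repeating the step on the approximation coefficients gives coarser scales, and statistics computed on each set of coefficients can serve as multiscale texture features.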
12.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):84. doi: 10.1186/s12859-020-3351-y.
DePicT Melanoma Deep-CLASS: a deep convolutional neural networks approach to classify skin lesion images.
Abstract
BACKGROUND:
Melanoma has caused the vast majority of skin cancer deaths in recent decades, even though it accounts for only one percent of all skin cancer cases. The survival rate of melanoma from early to terminal stages is more than fifty percent. Therefore, having the right information at the right time through early detection and monitoring of skin lesions to find potential problems is essential to surviving this type of cancer.
RESULTS:
An approach to classify skin lesions using deep learning for early detection of melanoma in a case-based reasoning (CBR) system is proposed. This approach has been employed for retrieving new input images from the case base of the proposed system DePicT Melanoma Deep-CLASS to support users with more accurate recommendations relevant to their requested problem (e.g., image of affected area). The efficiency of our system has been verified by utilizing the ISIC Archive dataset in the classification of skin lesions as benign or malignant melanoma. The kernel of DePicT Melanoma Deep-CLASS is built upon a convolutional neural network (CNN) composed of sixteen layers (excluding input and output layers), which can be recursively trained and learned. Our approach shows improved performance and accuracy in testing on the ISIC Archive dataset.
CONCLUSIONS:
Our methodology, derived from a deep CNN, generates case representations for our case base to use in the retrieval process. Integrating this approach into DePicT Melanoma CLASS significantly improves the efficiency of its image classification and the quality of the recommendation part of the system. The proposed method has been tested and validated on 1796 dermoscopy images. Analyzed results indicate that it is efficient on malignancy detection.
KEYWORDS:
Case-based reasoning; Classification; Deep learning; Early detection; Information retrieval; Melanoma; Skin cancer
13.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):88. doi: 10.1186/s12859-020-3355-7.
FLIR vs SEEK thermal cameras in biomedicine: comparative diagnosis through infrared thermography.
Abstract
BACKGROUND:
In biomedicine, infrared thermography is the most promising technique among conventional methods for revealing differences in skin temperature that result from irregular temperature dispersion, a significant sign of diseases and disorders in the human body. Building on the detection of thermal radiation emitted by the human body through infrared imaging, we present in this study the current utility of two thermal camera models, FLIR and SEEK, in biomedical applications, as an extension of our previous article.
RESULTS:
The most significant result is the difference in image quality between the thermograms captured by the two thermal camera models. The image quality of the thermal images from the FLIR One is higher than that of the SEEK Compact PRO. However, the thermal images of the FLIR One are noisier than those of the SEEK Compact PRO, since the thermal resolution of the FLIR One is 160 × 120 while that of the SEEK Compact PRO is 320 × 240.
CONCLUSION:
Detecting and revealing the inhomogeneous temperature distribution on the injured toe of the subject, we, in this paper, analyzed the imaging results of two different smartphone-based thermal camera models by making comparison among various thermograms. Utilizing the feasibility of the proposed method for faster and comparative diagnosis in biomedical problems is the main contribution of this study.
KEYWORDS:
Biomedicine; Comparative diagnosis; FLIR; Infrared camera; Infrared thermography; SEEK
14.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):82. doi: 10.1186/s12859-020-3349-5.
Accurately estimating the length distributions of genomic micro-satellites by tumor purity deconvolution.
Abstract
BACKGROUND:
Genomic micro-satellites are the genomic regions that consist of short and repetitive DNA motifs. Estimating the length distribution and state of a micro-satellite region is an important computational step in cancer sequencing data pipelines, which is suggested to facilitate downstream analysis and clinical decision support. Although several state-of-the-art approaches have been proposed to identify micro-satellite instability (MSI) events, they are limited in dealing with regions longer than one read length. Moreover, to the best of our knowledge, all of these approaches implicitly assume that the tumor purity of the sequenced samples is sufficiently high, which is inconsistent with reality; as a result, the inferred length distribution dilutes the data signal and introduces false positive errors.
RESULTS:
In this article, we proposed a computational approach, named ELMSI, which detected MSI events based on the next generation sequencing technology. ELMSI can estimate the specific length distributions and states of micro-satellite regions from a mixed tumor sample paired with a control one. It first estimated the purity of the tumor sample based on the read counts of the filtered SNV loci. Then, the algorithm identified the length distributions and the states of short micro-satellites by adding the Maximum Likelihood Estimation (MLE) step to the existing algorithm. After that, ELMSI continued to infer the length distributions of long micro-satellites by incorporating a simplified Expectation Maximization (EM) algorithm with central limit theorem, and then used statistical tests to output the states of these micro-satellites. Based on our experimental results, ELMSI was able to handle micro-satellites with lengths ranging from shorter than one read length to 10 kbp.
CONCLUSIONS:
To verify the reliability of our algorithm, we first compared the ability of classifying the shorter micro-satellites from the mixed samples with the existing algorithm MSIsensor. Meanwhile, we varied the number of micro-satellite regions, the read length and the sequencing coverage to separately test the performance of ELMSI on estimating the longer ones from the mixed samples. ELMSI performed well on mixed samples, and is thus of great value for improving the recognition of micro-satellite regions and supporting clinical decision-making. The source codes have been uploaded and maintained at https://github.com/YixuanWang1120/ELMSI for academic use only.
KEYWORDS:
Cancer genomics; Computational pipeline; Genomic micro-satellite; Length distribution estimation; Sequencing data analysis; Tumor purity
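Note on item 14: the effect of tumor purity on an observed length distribution can be illustrated with a simple linear mixture model, in which the mixed sample is a purity-weighted blend of the tumor and normal distributions. The sketch below inverts that blend for a known purity. This is a generic unmixing step under stated assumptions, not the MLE or EM procedure that ELMSI uses, and all names are hypothetical.

```python
# Illustrative sketch only: linear unmixing of a length distribution at known purity.
# Model assumption: observed = purity * tumor + (1 - purity) * normal.

def deconvolve(observed, normal, purity):
    """Estimate the tumor length distribution; inputs map length -> probability."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    lengths = sorted(set(observed) | set(normal))
    raw = {}
    for length in lengths:
        value = (observed.get(length, 0.0) - (1.0 - purity) * normal.get(length, 0.0)) / purity
        raw[length] = max(0.0, value)  # clip small negatives caused by noise
    total = sum(raw.values())
    return {length: v / total for length, v in raw.items()} if total else raw
```

At low purity the observed distribution stays close to the normal one, which illustrates why ignoring purity can dilute the tumor signal as the abstract describes.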
15.
BMC Bioinformatics. 2020 Mar 12;21(1):102. doi: 10.1186/s12859-020-3429-6.
A big data approach to metagenomics for all-food-sequencing.
Kobus R1, Abuín JM2,3, Müller A1, Hellmann SL4, Pichel JC3, Pena TF3, Hildebrandt A1, Hankeln T4, Schmidt B5.
Abstract
BACKGROUND:
All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches.
RESULTS:
We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark).
CONCLUSIONS:
We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).
KEYWORDS:
Big data; Eukaryotic genomes; Locality sensitive hashing; Metagenomics; Next-generation sequencing; Species identification
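Note on item 15: an alignment-free, k-mer based classification with hashing-based sketches can be outlined as follows. Each reference is reduced to the smallest hash values of its k-mers, and a read is assigned to the reference sharing the most sketch values. This is a generic min-hash sketch under simplifying assumptions (one sketch per whole reference, no windowing, no database partitioning); the function names are hypothetical and do not correspond to the AFS-MetaCache or MetaCacheSpark interfaces.

```python
# Illustrative sketch only: min-hash style k-mer sketches for read classification.
# Not the MetaCache implementation; names and parameters are assumptions.
import hashlib

def kmer_hashes(seq, k):
    """Deterministic 64-bit hashes of all k-mers in the sequence."""
    return {
        int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big")
        for i in range(len(seq) - k + 1)
    }

def sketch(seq, k, size):
    """Keep only the smallest hash values as a compact signature."""
    return set(sorted(kmer_hashes(seq, k))[:size])

def build_index(references, k, size):
    """Map each sketch value to the reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for h in sketch(seq, k, size):
            index.setdefault(h, set()).add(name)
    return index

def classify(read, index, k, size):
    """Return the reference sharing the most sketch values, or None if nothing matches."""
    votes = {}
    for h in sketch(read, k, size):
        for name in index.get(h, ()):
            votes[name] = votes.get(name, 0) + 1
    if not votes:
        return None
    return max(sorted(votes), key=votes.get)
```

Because only a fixed number of hash values is stored per reference, the index stays far smaller than the full k-mer set, which is the property that makes sketch-based lookups attractive for large reference collections.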
16.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):79. doi: 10.1186/s12859-020-3346-8.
Ensemble disease gene prediction by clinical sample-based networks.
Abstract
BACKGROUND:
Disease gene prediction is a critical and challenging task. Many computational methods have been developed to predict disease genes, which can reduce the money and time spent on experimental validation. Since proteins (products of genes) usually work together to achieve a specific function, biomolecular networks, such as the protein-protein interaction (PPI) network and gene co-expression networks, are widely used to predict disease genes by analyzing the relationships between known disease genes and other genes in the networks. However, existing methods commonly use a universal static PPI network, which ignores the fact that PPIs are dynamic and that PPIs should also differ between patients.
RESULTS:
To address these issues, we develop an ensemble algorithm to predict disease genes from clinical sample-based networks (EdgCSN). The algorithm first constructs single sample-based networks for each case sample of the disease under study. Then, these single sample-based networks are merged into several fused networks based on the clustering results of the samples. After that, logistic models are trained with centrality features extracted from the fused networks, and an ensemble strategy is used to predict the final probability of each gene being disease-associated. EdgCSN is evaluated on breast cancer (BC), thyroid cancer (TC) and Alzheimer's disease (AD) and obtains AUC values of 0.970, 0.971 and 0.966, respectively, which are much better than those of the competing algorithms. Subsequent de novo validations also demonstrate the ability of EdgCSN to predict new disease genes.
CONCLUSIONS:
In this study, we propose EdgCSN, an ensemble learning algorithm for predicting disease genes with models trained on centrality features extracted from clinical sample-based networks. Results of the leave-one-out cross validation show that EdgCSN performs much better than the competing algorithms in predicting BC-associated, TC-associated and AD-associated genes. De novo validations also show that EdgCSN is valuable for identifying new disease genes.
KEYWORDS:
Disease gene prediction; Ensemble learning; Network centrality; Protein-protein interaction network; Sample-based networks
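The last two steps described in the abstract above (logistic models trained on centrality features of each fused network, then an ensemble over the models) can be sketched as follows. This is a simplified illustration, not the EdgCSN implementation: it assumes the fused networks are already given as edge lists, uses degree centrality as the only feature, fits each model by plain gradient descent, and takes the unweighted mean of the per-network probabilities as the ensemble rule. All function names, hyperparameters, and the toy networks are hypothetical.

```python
import math

def degree_centrality(edges, genes):
    """Degree of each gene divided by (n - 1), for an undirected edge list."""
    deg = {g: 0 for g in genes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n = len(genes)
    return {g: deg[g] / (n - 1) for g in genes}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, steps=500):
    """Fit p = sigmoid(w * x + b) by gradient descent on the log loss."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errs, xs)) / n
        b -= lr * sum(errs) / n
    return w, b

def ensemble_scores(fused_networks, genes, known_disease_genes):
    """Average each gene's predicted probability over one model per fused network."""
    totals = {g: 0.0 for g in genes}
    for edges in fused_networks:
        cent = degree_centrality(edges, genes)
        mean = sum(cent.values()) / len(genes)
        xs = [cent[g] - mean for g in genes]  # centred centrality feature
        ys = [1.0 if g in known_disease_genes else 0.0 for g in genes]
        w, b = train_logistic(xs, ys)
        for g, x in zip(genes, xs):
            totals[g] += sigmoid(w * x + b)
    return {g: t / len(fused_networks) for g, t in totals.items()}

# Two made-up fused networks over six placeholder genes (hypothetical data).
genes = ["A", "B", "C", "D", "E", "F"]
net1 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "E"), ("E", "F")]
net2 = [("A", "B"), ("A", "E"), ("A", "F"), ("B", "D"), ("B", "F"), ("C", "D")]
scores = ensemble_scores([net1, net2], genes, {"A", "B"})
```

In this toy setup the known disease genes A and B are the best-connected nodes in both networks, so they receive the highest averaged scores. Note that the sketch scores the same genes it trains on; a leave-one-out evaluation such as the one the abstract reports would retrain with each known gene held out in turn.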
17.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):86. doi: 10.1186/s12859-020-3353-9.
Sarcopenia negatively affects hip structure analysis variables in a group of Lebanese postmenopausal women.
Saddik H1,2, Nasr R1,3, Pinti A4,5, Watelain E6, Fayad I1, Baddoura R7, Berro AJ1,8, Al Rassy N9,10, Lespessailles E2, Toumi H2, El Hage R1.
Abstract
BACKGROUND:
The current study's purpose is to compare hip structural analysis variables in a group of postmenopausal women with sarcopenia and another group of postmenopausal women with normal skeletal muscle mass index. To do so, the current study included 8 postmenopausal women (whose ages ranged between 65 and 84 years) with sarcopenia and 60 age-matched controls (with normal skeletal muscle mass index (SMI)). Body composition and bone parameters were evaluated by dual-energy X-ray absorptiometry (DXA).
RESULTS:
Weight, lean mass, body mass index, femoral neck cross-sectional area (FN CSA), FN section modulus (Z), FN cross-sectional moment of inertia (CSMI), intertrochanteric (IT) CSA, IT Z, IT CSMI, IT cortical thickness (CT), femoral shaft (FS) CSA, FS Z and FS CSMI were significantly greater (p < 0.05) in women with normal SMI than in women with sarcopenia. In the whole population, SMI was positively associated with IT CSA, IT Z, IT CSMI, IT CT, FS CSA, FS Z, FS CSMI and FS CT but negatively correlated with IT buckling ratio (BR) and FS BR.
CONCLUSION:
The current study suggests that sarcopenia has a negative effect on hip bone strength indices in postmenopausal women.
KEYWORDS:
Bone strength indices; DXA imaging; Fracture risk; Menopause; Sarcopenia
18.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):78. doi: 10.1186/s12859-020-3345-9.
Prediction of tumor location in prostate cancer tissue using a machine learning system on gene expression data.
Abstract
BACKGROUND:
Finding the tumor location in the prostate is an essential pathological step for prostate cancer diagnosis and treatment. The location of the tumor, its laterality, can be unilateral (the tumor affects one side of the prostate) or bilateral (both sides). Nevertheless, the tumor can be overestimated or underestimated by standard screening methods. In this work, a combination of efficient machine learning methods for feature selection and classification is proposed to analyze gene activity and select genes as relevant biomarkers for samples of different laterality.
RESULTS:
A data set that consists of 450 samples was used in this study. The samples were divided into three laterality classes (left, right, bilateral). The aim of this work is to understand the genomic activity in each class and find relevant genes as indicators for each class with nearly 99% accuracy. The system identified groups of differentially expressed genes (RTN1, HLA-DMB, MRI1) that are able to differentiate samples among the three classes.
CONCLUSION:
The proposed method was able to detect sets of genes that can identify different laterality classes. The resulting genes were found to be strongly correlated with disease progression. HLA-DMB and EIF4G2, which are among the genes that can detect left laterality, were previously reported to be in the same pathway, the Allograft rejection SuperPath.
KEYWORDS:
Biomarkers; Classification; Machine learning; Prostate cancer laterality