Research Projects

BIOINFORMATICS SOFTWARE

categoryCompare
categoryCompare is a methodology for cross-platform and cross-sample comparison of high-throughput data at the annotation level (such as GO ontologies, KEGG pathways, and gene sets (GSEA)). This approach allows for the comparison of datasets from heterogeneous platforms. categoryCompare provides a visualization built on Cytoscape that lets users quickly view the features shared between annotations. categoryCompare is available as an R Bioconductor package. A web version of categoryCompare that employs cytoscape.js is currently under construction.

Flight RM, Harrison BJ, Mohammad F, Bunge MB, Moon LDF, Petruska JC, Rouchka EC: categoryCompare, an analytical tool based on feature annotations. Frontiers in Genetics 2014, 5:98. doi: 10.3389/fgene.2014.00098

Available as an R Bioconductor package

absolute ID convert
With the availability of gene- and protein-centric databases (NCBI, Ensembl, UCSC, and others), as well as the wide variety of available platforms for measuring gene expression (Affymetrix, Agilent, custom arrays, and RNA-Seq), biological researchers need reliable methods for converting various identifiers from one type to another. AbsIDConvert is based on the unique idea that genomic identifiers can be converted to genomic intervals, and therefore conversion between identifiers requires simply finding overlapping intervals.
Mohammad F, Flight RM, Harrison BJ, Petruska JC, Rouchka EC: AbsIDconvert: An absolute approach for converting genetic identifiers at different granularities. BMC Bioinformatics 2012, 13:229. doi:10.1186/1471-2105-13-229.

Available as a web interface and virtual machine
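
The interval idea behind AbsIDConvert can be illustrated with a minimal sketch. This is not the AbsIDConvert implementation: the identifiers, coordinates, and function names below are hypothetical, strand is ignored, and a linear scan stands in for whatever indexing the real tool uses.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (an assumption of this sketch)
    end: int    # inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals are on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def convert(query_id: str,
            source: Dict[str, Interval],
            target: Dict[str, Interval]) -> List[str]:
    """Map one source identifier to every target identifier it overlaps."""
    region = source[query_id]
    return sorted(tid for tid, iv in target.items() if overlaps(region, iv))


# Hypothetical identifiers and coordinates, for illustration only.
probes = {"probe_1": Interval("chr1", 1050, 1074),
          "probe_2": Interval("chr2", 500, 524)}
genes = {"gene_X": Interval("chr1", 1000, 2000),
         "gene_Y": Interval("chr1", 5000, 6000)}

print(convert("probe_1", probes, genes))
print(convert("probe_2", probes, genes))
```

An identifier with no overlapping target interval simply maps to an empty list, which makes unconvertible identifiers explicit.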

MPrime
MPrime is an interface that allows the efficient high-throughput detection of multiple primers or oligonucleotides for genic regions in the human, mouse, rat, zebrafish, or fruit fly genomes. To choose the regions of interest for primer or oligo design, you must choose the organism you are interested in, as well as the genic regions of interest. Genic regions can be identified by gene name, GenBank or RefSeq accession, or keyword. Additionally, MPrime 1.3 now allows you to enter FASTA-formatted sequences. Before primers are designed, you will be sent to a page that allows you to select the genic regions you wish to use.
Rouchka EC, Khalyfa A, Cooper NGF. (2005) MPrime: efficient large scale multiple primer and oligonucleotide design for customized gene arrays. BMC Bioinformatics 6:175. (doi:10.1186/1471-2105-6-175).

Available as a web interface

zebrafish repeats
This database contains a total of 116,915 exact tandem repeats, with a base length of at least three and a copy number of at least ten, detected in the Zv8 assembly of the zebrafish genome. This web interface can be used to browse for repeats and for primers that amplify these repeats within a certain genomic region. In addition, the repeats can be downloaded as tracks for either the UCSC Genome Browser (December 2008 build) or the FishMap Zv8 browser.
Rouchka, EC. (2010) Database of exact tandem repeats in the Zebrafish genome. BMC Genomics 11:347. doi:10.1186/1471-2164-11-347

Available as a web interface and genome browser tracks
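
To make the repeat criteria concrete, here is a minimal sketch of an exact tandem repeat scan. It is not the method used to build the database. It assumes that "base length" means the length of the repeated unit, the upper bound on unit length is an arbitrary choice, and the function name and test sequence are hypothetical.

```python
from typing import List, Tuple


def find_exact_tandem_repeats(seq: str,
                              min_unit: int = 3,
                              max_unit: int = 6,
                              min_copies: int = 10) -> List[Tuple[int, str, int]]:
    """Return (0-based start, unit, copy number) for each exact tandem repeat.

    Caveat: units that are themselves repeats of a shorter string (for
    example "AAA" or "ACGACG") are not filtered out in this sketch.
    """
    hits = []
    n = len(seq)
    for k in range(min_unit, max_unit + 1):
        i = 0
        while i + k <= n:
            unit = seq[i:i + k]
            copies = 1
            # Count consecutive exact copies of the unit starting at i.
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies:
                hits.append((i, unit, copies))
                i += copies * k  # jump past the reported repeat
            else:
                i += 1
    return hits


# Synthetic sequence: two flanking bases, twelve copies of ACG, two more bases.
example = "TT" + "ACG" * 12 + "GG"
print(find_exact_tandem_repeats(example))
```

The scan is quadratic in the worst case, which is fine for illustration but would need a faster strategy at genome scale.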

rMotifGen
rMotifGen is a tool whose sole purpose is to generate random DNA or amino acid sequences containing short sequence motifs. Each motif consensus can be either user-defined or randomly generated. Insertions and mutations within these motifs are created according to user-defined parameters. The resulting sequences can be helpful in mutational simulations and in testing the limits of motif detection algorithms.
Rouchka EC, Hardin CT (2007). rMotifGen: random motif generator for DNA and protein sequences. BMC Bioinformatics 8:292. (doi:10.1186/1471-2105-8-292).

Available as a web interface and source code
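
The kind of output rMotifGen produces can be approximated with a short sketch. This is not the rMotifGen code: it covers DNA only, models substitutions but not insertions, uses a uniform background, and all function and parameter names are hypothetical.

```python
import random
from typing import List, Tuple

DNA = "ACGT"


def mutate(motif: str, rate: float, rng: random.Random) -> str:
    """Substitute each motif position independently with probability `rate`."""
    out = []
    for base in motif:
        if rng.random() < rate:
            out.append(rng.choice([b for b in DNA if b != base]))
        else:
            out.append(base)
    return "".join(out)


def generate(n_seqs: int, length: int, motif: str,
             rate: float, seed: int = 0) -> List[Tuple[str, int]]:
    """Return (sequence, motif start) pairs with one planted motif each."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_seqs):
        background = [rng.choice(DNA) for _ in range(length)]
        pos = rng.randrange(length - len(motif) + 1)
        # Overwrite the background with a (possibly mutated) motif copy.
        background[pos:pos + len(motif)] = mutate(motif, rate, rng)
        results.append(("".join(background), pos))
    return results


for seq, pos in generate(n_seqs=3, length=30, motif="TATAAT", rate=0.1):
    print(pos, seq)
```

Returning the planted position alongside each sequence gives a ground truth against which a motif detection algorithm can be scored.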

RBF-TSS
RBF-TSS is a novel method for identifying transcription start sites that improves upon published TSS detection models. RBF-TSS incorporates a metric feature based on oligonucleotide positional frequencies, taking into account the nature of promoters. A radial basis function network for identifying transcription start sites is created using non-overlapping chunks (windows) of size 50 and 500 on the human genome.
Mahdi RN, Rouchka EC. (2009) RBF-TSS: Identification of transcription start site in human using radial basis functions network and oligonucleotide positional frequencies. PLoS One, 4(3):e4878. (doi: 10.1371/journal.pone.0004878)

Available as source code

DiffSplice
DiffSplice is a novel tool for discovering and quantitating alternative splicing variants present in an RNA-seq dataset, without relying on an annotated transcriptome or predetermined splice patterns. For two groups of samples, DiffSplice further utilizes a non-parametric permutation test to identify significant differences in expression at both the gene level and the transcription level. DiffSplice takes as input the SAM files that supply the alignment of the RNA-seq reads on the reference genome, obtained from an RNA-seq aligner such as MapSplice. The results of DiffSplice are summarized as a decomposition of the genome and can be visualized using the UCSC Genome Browser.
Hu Y, Huang Y, Du Y, Orellana CF, Singh D, Johnson AR, Monroy A, Kuan PF, Hammond SM, Makowski L, Randell SH, Chiang DY, Hayes DN, Jones C, Liu Y, Prins JF, Liu J: DiffSplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res. 2013 Jan;41(2):e39. (doi:10.1093/nar/gks1026)

Available as source code
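
For readers unfamiliar with permutation testing, the following sketch shows a generic two-group test. It is not DiffSplice's procedure: the test statistic used by DiffSplice is not described here, so an absolute difference of group means is assumed, and the expression values and function name are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_test(group_a: Sequence[float],
                     group_b: Sequence[float],
                     n_perm: int = 2000,
                     seed: int = 0) -> float:
    """Monte Carlo p-value for the absolute difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed labelling itself, so p is never zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Hypothetical expression estimates for one feature in two sample groups.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [10.0, 11.0, 12.0, 13.0, 14.0]
print(permutation_test(control, treated))
```

Because the null distribution comes from relabelling the samples themselves, no distributional assumption about expression values is needed, which is the appeal of a non-parametric test in this setting.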

MapSplice
MapSplice is software for mapping RNA-seq reads to a reference genome for splice junction discovery. It depends only on the reference genome, and not on any further annotations. It supports both paired-end and single-end reads, and takes advantage of paired-end reads for better mapping accuracy. It supports variable-length reads, and it produces unspliced and spliced alignments simultaneously. MapSplice can be used to detect novel canonical, semi-canonical, and non-canonical splice junctions; novel insertions and deletions; and novel gene fusion events.
Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, Liu J: MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Research 2010; (doi: 10.1093/nar/gkq622)

Available as source code

MapPER
MapPER is a probabilistic framework to predict the alignment to the genome of all RNA-seq paired-end read (PER) transcript fragments in a PER dataset. Starting from possible exonic and spliced alignments of all end reads, MapPER constructs potential splicing paths connecting paired ends. An expectation maximization method assigns likelihood values to all splice junctions and assigns the most probable alignment for each transcript fragment.
Hu Y, Wang K, He X, Chiang DY, Prins JF, Liu Z: A probabilistic framework for aligning paired-end RNA-seq data. Bioinformatics 2010, 26(16):1950-1957. (doi: 10.1093/bioinformatics/btq336).

Available as source code

FDM
FDM, or Flow Difference Metric, identifies regions of differential RNA-transcript expression between pairs of splice graphs, without need for an underlying gene model or catalog of transcripts. This novel non-parametric statistical test is applied between splice graphs to assess the significance of differential transcription, and is extended to group-wise comparison incorporating sample replicates.

Singh D, Orellana CF, Hu Y, Jones CD, Liu Y, Chiang DY, Liu J, Prins JF: FDM: A graph-based statistical method to detect differential transcription using RNA-seq data. Bioinformatics 2011, 27(19):2633-2640. (doi: 10.1093/bioinformatics/btr458).

Available as source code

MultiSplice
MultiSplice implements a general linear framework for accurate transcript quantification using a set of new structural features: MultiSplices. Our software has several desirable features: 1. It utilizes all the information implied in the read alignment and alleviates the identifiability issues. 2. By solving the linear system using LASSO, it can identify the most accurate set of dominantly expressed transcripts. 3. It is very efficient. For example, the analysis of the human transcriptome can be finished in less than one hour.
Huang Y, Hu Y, Jones CD, MacLeod JN, Chiang DY, Liu Y, Prins JF, Liu J: A robust method for transcript quantification with RNA-seq data. J Comput Biol 2013, 20(3):167-187. (doi: 10.1089/cmb.2012.0230)

Available as source code

Asteroid
Asteroid is a novel algorithm to simultaneously reconstruct transcripts and estimate their abundance.
Huang Y, Hu Y, Liu J: Piecing the Puzzle Together: a Revisit to Transcript Reconstruction Problem in RNA-seq. RECOMB-seq: fourth annual RECOMB satellite workshop on massively parallel sequencing 2012.

Available as source code

PYNAC
PYNAC is an algorithm and software system capable of handling high volumes of stable isotope-resolved metabolomics data, while including quality control methods for maintaining data quality. We validate this new algorithm against a previous single isotope correction algorithm in a two-step cross-validation. Next, we demonstrate the algorithm and correct for the effects of natural abundance for both 13C and 15N isotopes on a set of raw isotopologue intensities of UDP-N-acetyl-D-glucosamine derived from a 13C/15N-tracing experiment. Finally, we demonstrate the algorithm on a full omics-level dataset.
Carreer WJ, Flight RM, Moseley HNB: A Computational Framework for High-Throughput Isotopic Natural Abundance Correction of Omics-Level Ultra-High Resolution FT-MS Datasets. Metabolites 2013, 3(4):853-866. (doi: 10.3390/metabo3040853).

Available as source code

GENOMER
Genomer is command-line glue for genome projects. It simplifies the small but tedious tasks required when finishing a genome. Genomer makes it easy to reorganise contigs in a genome, map annotations onto the genome, and generate the files required to submit a genome. Furthermore, Genomer aims to make genome projects more reproducible and robust. Genomer is designed to work well with build tools such as GNU Make and revision control tools such as git. This makes genome projects easy to share and reproduce.
Barton MD, Barton HA: Genomer: A swiss army knife for genome scaffolding. PLoS One 2013, 8(6):e66922. (doi: 10.1371/journal.pone.0066922).

Available as source code

P-NONMEM
P-NONMEM combines the global search strategy of particle swarm optimization (PSO) with the local estimation strategy of NONMEM. In the proposed algorithm, initial values (particles) are generated randomly by PSO, and NONMEM is run for each particle to find a local optimum for fixed effects and variance parameters. P-NONMEM guarantees the global optimization for fixed effects and variance parameters. Under certain regularity conditions, it also leads to global optimization for random effects. Because P-NONMEM does not run a PSO search for random effect estimation, it avoids a tremendous computational burden. In the simulation studies, we have shown that P-NONMEM has much better convergence performance than NONMEM. Even when the initial values were far away from the global optima, P-NONMEM converged nicely for all fixed effects, random effects, and variance components.
Kim S, Li L: A novel global search algorithm for nonlinear mixed-effects models using particle swarm optimization. J Pharmacokinet Pharmacodyn 2011, 38(4): 471-495. (doi: 10.1007/s10928-011-9204-6).

Available as source code upon request
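
The benefit of pairing a global search with a local estimator can be seen in a toy sketch. This is not P-NONMEM: randomly placed starting points stand in for PSO particles (no velocity updates are modelled), a simple step-halving search stands in for NONMEM, and the one-dimensional objective is a made-up function with two local minima.

```python
import random
from typing import Callable, Tuple


def objective(x: float) -> float:
    """Toy function: local minimum near x = 1.97, global minimum near x = -2.03."""
    return (x * x - 4.0) ** 2 + x


def local_search(f: Callable[[float], float], x: float,
                 step: float = 0.5, tol: float = 1e-6) -> float:
    """Derivative-free descent: move while it helps, otherwise halve the step."""
    while step > tol:
        if f(x + step) < f(x):
            x += step
        elif f(x - step) < f(x):
            x -= step
        else:
            step /= 2.0
    return x


def multi_start(f: Callable[[float], float], n_particles: int,
                low: float, high: float, seed: int = 0) -> Tuple[float, float]:
    """Refine each random starting point locally and keep the best result."""
    rng = random.Random(seed)
    candidates = (local_search(f, rng.uniform(low, high))
                  for _ in range(n_particles))
    best_x = min(candidates, key=f)
    return best_x, f(best_x)


print(local_search(objective, 1.0))           # a single start stays near x = 2
print(multi_start(objective, 20, -3.0, 3.0))  # many starts find the minimum near x = -2
```

A single local search started at x = 1 settles in the shallower minimum, while the multi-start run reaches the deeper one, which mirrors the convergence behaviour described above when initial values are far from the global optimum.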