Data Analysis Fusion Code

(Updated on 2013/3/26)

ParaHaplo: program package for haplotype-based whole-genome association studies using parallel computing.
ID:D-1
Principal developer

Tatsuhiko TSUNODA, Center for Genomic Medicine, RIKEN

General description

ParaHaplo is a parallel computing tool for haplotype-based genome-wide association studies (GWAS), designed to keep pace with the ever-increasing data sizes of such projects.

Computational model

Exact probability calculation of type I error using haplotype frequencies.

Computational method

Markov-chain Monte Carlo (MCMC) algorithm

Parallelization

Hybrid parallelization (Threads and MPI)

Required language and library

C and MPI

Status of code for public release

Source code is available from the ISLIM download site.

Maximum computing size in present experiences
  • Analysis of regional differences in genetic variation using data from 90 individuals
  • 8,192-core parallel computation on the RIKEN RICC cluster
  • Required memory size/disk storage size: 10 GB/10 GB
Expected computing size in K computer
  • Analysis of almost 50 diseases using data from about 200,000 individuals
  • Parallel computation with 640 thousand cores
  • Required memory size/storage size: 2 TB/2 TB

Reference: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2774321/
[ParaHaplo image]

Figure 1. Plot of −log(p-value), where each p-value from the Armitage trend test measures the association between the disease and a SNP.
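As a concrete illustration of the test behind such plots, here is a minimal, self-contained sketch of the Cochran-Armitage trend test on genotype counts. This is not ParaHaplo's implementation (which computes exact type I error probabilities from haplotype frequencies, in parallel); the genotype counts in the example are made up.

```python
import math

def armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on genotype counts (0/1/2 risk alleles).

    Returns (z, p): the trend statistic and its two-sided normal p-value.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    N = sum(totals)
    R = sum(cases)  # total number of cases
    # U contrasts observed case counts with their expectation per genotype
    U = sum(w * (r - n * R / N) for w, r, n in zip(weights, cases, totals))
    p_case = R / N
    var = p_case * (1 - p_case) * (
        sum(w * w * n for w, n in zip(weights, totals))
        - sum(w * n for w, n in zip(weights, totals)) ** 2 / N)
    z = U / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p

# Example: more risk alleles among cases -> strong positive trend
z, p = armitage_trend(cases=[10, 20, 30], controls=[30, 20, 10])
# z ≈ 4.47, p ≈ 8e-6
```

A GWAS repeats this kind of test at every SNP, which is why the multiple-testing (type I error) correction that ParaHaplo computes exactly matters so much.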

What does the code enable?
  • Precise and comprehensive identification of disease-associated genomic information across many personal whole genomes (about 3 gigabases per person).
NGS analyzer: analysis program for next-generation genome sequencers
ID:D-2
Principal developer

Tatsuhiko TSUNODA, Center for Genomic Medicine, RIKEN

General description

The NGS analyzer rapidly processes the output data generated by a next-generation genome sequencer, identifying genetic differences among individuals, as well as mutations in cancer cells, with high precision.

Computational model

Mapping of reads to the human genome sequence and detection of variants based on probability calculations
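To illustrate what probability-based variant detection means in the simplest case, the following toy genotype caller compares binomial likelihoods of the three diploid genotypes given the read counts at one site. This is a hedged sketch of the general idea only, not the NGS analyzer's actual model; the error rate and the uniform prior are assumptions.

```python
from math import comb

def call_genotype(n_alt, n_total, err=0.01):
    """Pick the most likely genotype for one site from read counts.

    n_alt: reads supporting the alternative allele, out of n_total reads.
    Each genotype implies a probability that a read shows the alt allele:
    RR (homozygous ref) -> err, RA (heterozygous) -> 0.5, AA -> 1 - err.
    """
    models = {"RR": err, "RA": 0.5, "AA": 1 - err}
    like = {g: comb(n_total, n_alt) * p ** n_alt * (1 - p) ** (n_total - n_alt)
            for g, p in models.items()}
    best = max(like, key=like.get)
    # posterior under a uniform prior over the three genotypes
    return best, like[best] / sum(like.values())

# Example: 15 alt reads out of 30 strongly supports a heterozygote
genotype, posterior = call_genotype(15, 30)  # -> "RA"
```

Real variant callers additionally model per-base quality scores and mapping uncertainty, but the likelihood comparison above is the core of "detection based on probability calculations."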

Computational method

Diagonalization of a dense matrix with a direct method

Parallelization

Domain decomposition

Required language and library

Perl, C

Status of code for public release

Source code is available from the ISLIM download site.

Maximum computing size in present experiences
  • Analysis of the first Japanese genome sequence, and identification of mutations in a cancer genome, using a 2,000-core x86 cluster system
  • Required memory/disk: 4 TB/100 TB
Expected computing size in K computer
  • Identification of mutations in cancer genomes from 500 individuals using 640 thousand cores
  • Required memory/disk storage: 2 PB/50 PB
[NGS analyzer image]

Figure 1. Comparison among seven persons' whole genomes. The number of SNPs/Mbp is shown for (a) chromosome 1, (b) chromosome 6, and (c) chromosome X (Nature Genetics 42, 931–936).

What does the code enable?
  • Comprehensive and precise detection of differences among personal genomes (a whole genome contains about 3 billion bases per person)
  • Faster detection of all mutations in a cancer, and searching for target molecules for drug discovery.
ExRAT: genome-wide association study that takes SNP-SNP interactions into account.
ID:D-3
Principal developer

Tatsuhiko TSUNODA, Center for Genomic Medicine, RIKEN

General description
  • ExRAT exhaustively searches the genome for combinations of disease-related genes/SNPs to identify significant disease associations.
  • The algorithm implements two methods: a massively parallel method that examines all SNP pairs, and a more precise method that takes the linkage disequilibrium between SNPs into account. The former picks up candidates, whereas the latter calculates empirical p-values.
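The exhaustive pair search can be pictured with the toy sketch below: each SNP pair defines a combined genotype class, and a simple chi-square score against case/control status flags candidate pairs. This illustrates only the all-pairs scan; ExRAT's actual RAT statistic and importance-sampling p-values are more sophisticated, and the example data here are invented.

```python
import itertools

def chi2_case_control(labels, classes, n_classes):
    # chi-square of an (n_classes x 2) contingency table of class vs case status
    N = len(labels)
    n_case = sum(labels)
    counts = [[0, 0] for _ in range(n_classes)]
    for y, c in zip(labels, classes):
        counts[c][y] += 1
    chi2 = 0.0
    for c in range(n_classes):
        row = counts[c][0] + counts[c][1]
        if row == 0:
            continue
        for y, col in ((0, N - n_case), (1, n_case)):
            e = row * col / N  # expected count under independence
            if e > 0:
                chi2 += (counts[c][y] - e) ** 2 / e
    return chi2

def pair_scan(genotypes, labels):
    """Score every SNP pair. genotypes: dict snp -> list of 0/1/2 per person."""
    scores = {}
    for a, b in itertools.combinations(genotypes, 2):
        # 9 combined genotype classes for the pair (3 x 3)
        classes = [3 * ga + gb for ga, gb in zip(genotypes[a], genotypes[b])]
        scores[(a, b)] = chi2_case_control(labels, classes, 9)
    return scores

# Tiny made-up example: 3 SNPs, 4 cases and 4 controls
genotypes = {"s1": [0, 1, 2, 2, 0, 0, 1, 0],
             "s2": [2, 2, 1, 2, 0, 1, 0, 0],
             "s3": [0, 0, 1, 1, 2, 2, 0, 1]}
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = pair_scan(genotypes, labels)
```

Since the number of pairs grows quadratically (250 billion pairs for the human genome), this loop is what gets distributed across cores in the massively parallel candidate-picking stage.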
Computational model

RAT (Rapid Association Test)

Computational method

Importance sampling

Parallelization

Data decomposition

Required language and library

C++, MPI, OpenMP

Status of code for public release

Source code is available from the ISLIM download site.

Maximum computing size in present experiences
  • Genotype data of 100,000 SNPs × 4,000 individuals
  • Five billion SNP-SNP combinations × 4,000 individuals
  • RICC system with 8,192 cores
  • Required memory/disk storage: 1.1 GB/2 GB
Expected computing size in K computer
  • Analysis of tens of thousands of individuals (per disease)
    x 250 billion combinations of two SNPs in the human genome
    x 50 diseases
  • Required memory/disk storage: 20 GB/2 TB
[ExRAT image]

Figure 1. Searching for disease-related genes that increase disease risk through epistatic interactions.

Purpose of the algorithm
  • ExRAT can identify gene-gene interactions that increase disease risk and reveal novel disease-related genes.
SiGN: large-scale gene network estimation software series.
ID:D-4, 5, 6
Principal developer

Satoru MIYANO, Professor, Institute of Medical Science, University of Tokyo

General description

Massively parallel software series for modeling and estimation of a gene expression control system (genome network) in a cell

Computational model

Nonparametric regression Bayesian networks, state space models, graphical Gaussian models, vector autoregressive models

Computational method

Heuristic structure estimation algorithms with the bootstrap method, the neighbor-node sampling & repeat algorithm, a parallel optimal structure estimation algorithm, the EM method, and the L1 regularization method
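The bootstrap component can be sketched independently of the network model: re-estimate the network on resampled data many times and report how often each edge appears. The sketch below substitutes a deliberately crude stand-in estimator (a correlation threshold) for SiGN's nonparametric-regression Bayesian networks; the estimator, threshold, and data are all illustrative assumptions.

```python
import random

def corr(xs, ys):
    # Pearson correlation, guarding against zero variance in a resample
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def estimate_edges(rows, thresh=0.6):
    # stand-in estimator: draw edge (i, j) when |correlation| exceeds thresh
    p = len(rows[0])
    cols = list(zip(*rows))
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(corr(cols[i], cols[j])) > thresh}

def bootstrap_confidence(rows, B=100, thresh=0.6):
    """Edge confidence = fraction of B bootstrap resamples containing the edge."""
    counts = {}
    n = len(rows)
    for _ in range(B):
        sample = [random.choice(rows) for _ in range(n)]  # resample rows
        for e in estimate_edges(sample, thresh):
            counts[e] = counts.get(e, 0) + 1
    return {e: c / B for e, c in counts.items()}

# Synthetic data: gene 1 tracks gene 0; gene 2 is independent noise
random.seed(1)
rows = []
for _ in range(40):
    x0 = random.gauss(0, 1)
    rows.append([x0, x0 + random.gauss(0, 0.3), random.gauss(0, 1)])
conf = bootstrap_confidence(rows)  # edge (0, 1) should get high confidence
```

In SiGN the same resampling wraps a heuristic structure search, and the per-edge confidences are what make the estimated networks interpretable despite the heuristics.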

Parallelization

MPI, OpenMP, EP

Required language and library

Fortran90, C, R

Status of code for public release

Code is available from the ISLIM download site.

Maximum computing size in present experiences
  • A gene network of 20 thousand genes
  • 8,192 cores of RICC and the supercomputer at the Human Genome Center
  • Required memory/disk storage: 12 TB/500 MB (network required)
Expected computing size in K computer
  • Gene network estimation for all transcripts (>100 thousand)
  • 640 thousand cores
  • Required memory/disk storage: 1 PB/10 GB (network required)
[SiGN image]

Figure 1. In silico search for target genes for drug discovery.

What does the code enable?
  • A large-scale in silico search for drug discovery target genes, using gene network estimation that covers all transcripts of human genes
  • Using the estimated gene network, identification of influenced genes, estimation of active sites, prediction and avoidance of side effects, and searches for drug discovery targets and for pathways associated with toxicity can be done on a large scale
  • A variety of gene networks can be estimated in a short time.
  • Reference: http://sign.hgc.jp/index.html
SBiP: data analysis fusion platform.
ID:D-7
Principal developer

Satoru MIYANO, Professor, Institute of Medical Science, University of Tokyo

General description
  • The code provides a data analysis fusion platform with a highly functional, easy-to-use GUI, through which the developed data analysis codes, in particular SiGN and LiSDAS, can be used more easily.
  • Support for the K computer's job scheduling system allows part of an analysis pipeline to run on the K computer, and users can visualize and store the results with SBiP's visualization components.
Computational model

Depending on the components in analysis pipelines

Computational method

Depending on the components in analysis pipelines

Parallelization

Depending on the components in analysis pipelines

Required language and library

JAVA, R

Status of code for public release

Code is available from the ISLIM download site.

Maximum computing size in present experiences

A pipeline created on RICC can be run as a batch job.

Expected computing size in K computer

Depending on the components in an analysis pipeline.

[SBiP image]

Figure 1. Designing a pipeline for gene network estimation.

What does the code enable?
  • By combining the variety of analysis components already available, users can run their own customized analysis flows.
  • For example, an analysis flow from gene network estimation based on gene expression data, with SiGN-SSM, SiGN-L1, or SiGN-BN running on the K computer, to visualization after data reduction will be available.
LiSDAS: life science data assimilation systems.
ID:D-8
Principal developer

Tomoyuki HIGUCHI, Professor, Institute of Statistical Mathematics

General description

The software provides basic functions to explore kinetic parameters in a biological pathway model and to reconstruct the graphical structure of an initial input simulator, so that the reproducibility and predictive power of the refined simulation models improve for given experimental data on the endogenous variables.
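As a minimal picture of how a particle filter tunes a kinetic parameter against data, the following single-layer sketch estimates the decay rate k of a toy model x(t) = exp(-k t) from noisy observations. It is far simpler than LiSDAS's two-layer hierarchical particle filter; the model, noise level, and resampling jitter are all made-up assumptions.

```python
import math, random

random.seed(0)

# Toy "pathway" model: x(t) = exp(-k t), observed with Gaussian noise
TRUE_K, SIGMA = 0.5, 0.05
times = [0.2 * i for i in range(1, 21)]
obs = [math.exp(-TRUE_K * t) + random.gauss(0, SIGMA) for t in times]

# Particles are candidate values of the unknown kinetic parameter k
N = 2000
particles = [random.uniform(0.0, 2.0) for _ in range(N)]

for t, y in zip(times, obs):
    # Weight each particle by the likelihood of the current observation
    w = [math.exp(-((y - math.exp(-k * t)) ** 2) / (2 * SIGMA ** 2))
         for k in particles]
    s = sum(w)
    if s == 0:  # numerical safeguard: skip degenerate steps
        continue
    # Resample in proportion to the weights, with a little jitter so the
    # particle cloud keeps exploring around the surviving values
    particles = [max(0.0, k + random.gauss(0, 0.01))
                 for k in random.choices(particles, weights=w, k=N)]

k_hat = sum(particles) / N  # posterior mean; should land close to TRUE_K
```

LiSDAS applies the same weight-resample cycle with billions of particles over both parameters and model structure, which is why the upper layer is distributed with MPI and the lower layer threaded with OpenMP.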

Computational model

Finite difference method

Computational method

Two-layer hierarchical particle filter

Parallelization

MPI for the upper layer, and OpenMP for the lower layer

Required language and library

Fortran90, C, C++, MPI, OpenMP

Status of code for public release

Source code is available from the ISLIM download site.

Maximum computing size in present experiences
  • Estimation of kinetic parameters in a transcription control model for the mammalian circadian rhythm
  • Particles: 10 billion, unknown parameters: 44
  • RICC's 8,192 cores
  • Required memory/disk storage: 3.5 TB/16 TB
Expected computing size in K computer
  • Parameter estimations in a large scale pathway model
  • Particles: 50 billion, unknown parameters: 250
  • Use 320,000 cores
  • Required memory/disk storage: 100 TB/1 PB.
[LiSDAS image]

Figure 1. Fine-tuning of a computational model.

What does the code enable?
  • LiSDAS provides basic functions to explore kinetic parameter values in a biological pathway model and to retrieve new hypothetical models (reconstructions of unreliable models) from a huge space of potential models, such that the simulation trajectories reproduce experimentally observed profiles of the model variables on diverse scales.
MEGADOCK: exhaustive protein-protein interaction network prediction tool based on 3D conformation data.
ID:D-9
Principal developer

Yutaka AKIYAMA, Professor, Tokyo Institute of Technology

General description

The code implements a simplified novel evaluation model for protein-protein interactions (PPI) based on surface shape complementarity and electrostatic interactions derived from the proteins' 3D conformation data. Adopting the FFT greatly reduces computing time, allowing the code to run fast on large-scale parallel computers.

Computational model

Complex-valued convolutions on 3-D voxel models evaluate the binding between two protein conformations

Computational method

Surface shape complementarity and electrostatic interaction are scored together as a single complex number per voxel. The convolution of the two conformations is computed in Fourier space, and the inverse Fourier transform yields the evaluation scores.
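The FFT trick can be shown in one dimension: encode shape in the real part and charge in the imaginary part of each voxel, then evaluate every relative shift at once through the convolution theorem. This is a pure-Python toy sketch (MEGADOCK works on 3-D grids with FFTW, and its actual scoring function differs); the grid values below are invented.

```python
import cmath

def fft(a, invert=False):
    # Radix-2 Cooley-Tukey; len(a) must be a power of two
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def ifft(a):
    return [x / len(a) for x in fft(a, invert=True)]

# Toy 1-D "voxel" grids: real part encodes surface shape, imaginary part charge
receptor = [1 + 0.5j, 1 - 0.2j, 0j, 0j, 0j, 0.5 + 0j, 0j, 1 + 0j]
ligand = [0.5 + 0.1j, 1 + 0j, 0j, 0j, 1 - 0.5j, 0j, 0j, 0j]

# Cross-correlation over all shifts via the convolution theorem:
#   corr[s] = sum_i conj(receptor[i]) * ligand[(i + s) mod n]
fa, fb = fft(receptor), fft(ligand)
corr_fft = ifft([x.conjugate() * y for x, y in zip(fa, fb)])

# Direct summation for comparison (the O(n^2) work the FFT avoids)
n = len(receptor)
corr_direct = [sum(receptor[i].conjugate() * ligand[(i + s) % n]
                   for i in range(n)) for s in range(n)]

# The real part plays the role of the score at each relative shift
scores = [c.real for c in corr_fft]
best_shift = max(range(n), key=scores.__getitem__)
```

Because one pair of forward transforms plus one inverse transform scores every translation at once, the per-pair cost drops from O(n^2) to O(n log n) per rotation, and the rotation angles are what get parallelized across nodes.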

Parallelization

OpenMP for intra-node and MPI for inter-node parallelization are applied over the rotation angles.

Required language and library

C++, MPI, OpenMP, FFTW3.2.2

Status of code for public release

Source code is available from the ISLIM download site.

Maximum computing size in present experiences
  • Interaction prediction of human EGFR proteins (250,000 pairs)
  • A 512-core x86 cluster (20,000,000 seconds)
  • Required memory/disk storage: 1.5 TB/250 TB
Expected computing size in K computer
  • Interaction prediction for signal transduction pathway proteins related to human lung cancer (1,000 × 1,000 pairs for various conditions; about 10 × 10 ensemble calculations per pair, corresponding to conformational change)
  • 320 thousand cores (100,000 sec/case)
  • Required memory/disk storage: 640 TB/1 PB
[MEGADOCK image]

Figure 1. The flow of MEGADOCK's PPI network prediction. First, the target protein conformations are input. Second, docking calculations are performed for all combinations of the proteins. Third, the output score distributions are analyzed to obtain the protein pairs contributing to PPI. Fourth, the PPI network is predicted from the protein pair data.

[MEGADOCK image]

Figure 2. MEGADOCK Scalability

What does the code enable?
  • The code enables, for example, an understanding of how the several million proteins in a human cell mutually interact and regulate one another. Such knowledge can point the way for drug discovery and for resolving the causes of disease.