(Updated on 2013/3/26)
ID:D-1
Principal developer
Tatsuhiko TSUNODA, Center for Genomic Medicine, RIKEN
General description
ParaHaplo is a parallel computing tool for haplotype-based genome-wide association studies (GWAS), built to keep pace with the ever-growing data sizes of such projects.
Computational model
Exact probability calculation of type I error using haplotype frequencies.
Computational method
Markov-chain Monte Carlo (MCMC) algorithm
Parallelization
Hybrid parallelization (Threads and MPI)
Required language and library
C and MPI
Status of code for public release
Source code is available through ISLIM download site.
Maximum computing size in present experiences
- Analysis of regional differences in genetic variation using data from 90 individuals
- 8,192-core parallel computation on the RIKEN RICC cluster
- Required memory size/disk storage size: 10 GB/10 GB
Expected computing size in K computer
- Analysis of about 50 diseases using data from about 200,000 individuals
- Parallel computation with 640,000 cores
- Required memory size/storage size: 2 TB/2 TB
Reference: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2774321/
Figure 1. The figure shows -log(p-value), where the p-value of the Armitage test represents disease/SNP associations.
What does the code enable?
- Precise and comprehensive identification of disease-associated genomic information across many personal whole genomes (about 3 gigabases per person)
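As background for the Armitage test mentioned under Figure 1, the additive-weight Cochran-Armitage trend statistic can be sketched in a few lines. This is a generic textbook illustration, not ParaHaplo's actual implementation (which computes exact type I error probabilities from haplotype frequencies):

```python
import math

def armitage_trend_test(case_counts, control_counts):
    """Cochran-Armitage trend test with additive weights (0, 1, 2).

    case_counts / control_counts: genotype counts for carriers of
    0, 1, and 2 copies of the risk allele.
    Returns (chi_square, p_value) for the 1-df trend statistic.
    """
    w = (0, 1, 2)
    R = sum(case_counts)              # total cases
    S = sum(control_counts)           # total controls
    N = R + S
    n = [c + d for c, d in zip(case_counts, control_counts)]
    # T = sum_i w_i * (r_i * S - s_i * R)
    T = sum(wi * (r * S - s * R)
            for wi, r, s in zip(w, case_counts, control_counts))
    var_T = (R * S / N) * (N * sum(wi ** 2 * ni for wi, ni in zip(w, n))
                           - sum(wi * ni for wi, ni in zip(w, n)) ** 2)
    chi2 = T * T / var_T
    p = math.erfc(math.sqrt(chi2 / 2.0))  # 1-df chi-square tail probability
    return chi2, p
```

A GWAS scan applies such a test to each SNP and reports -log(p-value), as in Figure 1.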
ID:D-2
Principal developer
Tatsuhiko TSUNODA, Center for Genomic Medicine, RIKEN
General description
The NGS analyzer rapidly processes the output data generated by a next-generation genome sequencer. The code identifies genetic differences among individuals and mutations in cancer cells with high precision.
Computational model
Mapping of reads onto the human genome sequence and detection of genetic variants based on probability calculations
Computational method
Diagonalization of a dense matrix with a direct method
Parallelization
Domain decomposition
Required language and library
Perl, C
Status of code for public release
Source code is available through ISLIM download site.
Maximum computing size in present experiences
- Analysis of the first Japanese individual genome sequence, and identification of mutations in a cancer genome, using a 2,000-core x86 cluster system
- Required memory/disk: 4 TB/100 TB
Expected computing size in K computer
- Identification of mutations in a cancer genome analysis of 500 individuals using 640,000 cores
- Required memory/disk storage: 2 PB/50 PB
Figure 1. The comparisons among seven persons' whole genomes. The number of SNPs/Mbp is shown for (a) chromosome 1, (b) chromosome 6, and (c) chromosome X (Nature Genetics 42, 931–936).
What does the code enable?
- Comprehensive and precise detection of differences among personal genomes (a whole genome contains about 3 billion bases per person)
- Faster discovery of all of a cancer's mutations, and the search for target molecules for drug discovery
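The probability-based variant detection can be illustrated with a toy Bayesian genotype caller for a single genomic site. The error model and prior values below are illustrative assumptions, not the NGS analyzer's actual model:

```python
def call_genotype(bases, ref, alt, err=0.01, het_prior=1e-3, hom_alt_prior=1e-4):
    """Toy Bayesian genotype call at one site from a read pileup.

    bases: the observed base calls covering the site.
    Per-base likelihoods given the genotype:
      ref/ref: P(ref) = 1 - err, P(alt) = err
      ref/alt: P(ref) = P(alt) = 0.5
      alt/alt: P(ref) = err,    P(alt) = 1 - err
    Returns (best_genotype, posterior_probability).
    """
    priors = {"ref/ref": 1.0 - het_prior - hom_alt_prior,
              "ref/alt": het_prior,
              "alt/alt": hom_alt_prior}
    p_base = {"ref/ref": {ref: 1.0 - err, alt: err},
              "ref/alt": {ref: 0.5, alt: 0.5},
              "alt/alt": {ref: err, alt: 1.0 - err}}
    post = {}
    for g, prior in priors.items():
        like = prior
        for b in bases:
            like *= p_base[g].get(b, err)  # any other base treated as an error
        post[g] = like
    best = max(post, key=post.get)
    return best, post[best] / sum(post.values())
```

Even with a strong prior against heterozygotes, a pileup split evenly between the reference and alternative allele yields a confident heterozygous call; real callers extend this idea with base qualities and mapping uncertainties.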
ID:D-3
Principal developer
Tatsuhiko TSUNODA, Center for Genomic Medicine, RIKEN
General description
- ExRAT exhaustively searches the genome for combinations of disease related genes/SNPs to identify significant disease association.
- The algorithm implements two methods: a massively parallel method that examines all SNP pairs, and a more precise method that takes the linkage disequilibrium between SNPs into account. The former picks up candidates, whereas the latter calculates empirical p-values.
Computational model
RAT (Rapid Association Test)
Computational method
Importance sampling
Parallelization
Data decomposition
Required language and library
C++, MPI, OpenMP
Status of code for public release
Source code is available through ISLIM download site.
Maximum computing size in present experiences
- Genotype data of 100,000 SNPs × 4,000 individuals
- Five billion SNP-SNP combinations × 4,000 individuals
- RICC system with 8,192 cores
- Required memory/disk storage: 1.1 GB/2 GB
Expected computing size in K computer
- Analysis of tens of thousands of individuals (per disease) × 250 billion combinations of two SNPs in the human genome × 50 diseases
- Required memory/disk storage: 20 GB/2 TB
Figure 1. Searching for disease-related genes that increase disease risk through epistatic interactions.
Purpose of the algorithm
- ExRAT can identify gene-gene interactions that increase disease risk and can discover novel disease-related genes.
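The importance-sampling idea behind ExRAT's empirical p-values can be illustrated with a textbook example: estimating a tiny Gaussian tail probability by drawing from a shifted proposal distribution and reweighting. This is a generic sketch under assumed distributions, not ExRAT's actual sampler:

```python
import math
import random

def tail_prob_importance(t, n_samples=200_000, seed=1):
    """Estimate the tiny tail probability P(Z > t), Z ~ N(0, 1),
    by importance sampling from the shifted proposal N(t, 1).

    Likelihood ratio: phi(x) / phi(x - t) = exp(t^2 / 2 - t * x).
    Naive Monte Carlo would need roughly 1 / P(Z > t) draws to see
    even one sample in the tail; the shifted proposal lands there
    half the time.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(t, 1.0)
        if x > t:
            total += math.exp(0.5 * t * t - t * x)
    return total / n_samples
```

For t = 4 the true probability is about 3.2 × 10⁻⁵, yet 200,000 weighted samples estimate it to within about one percent; the same principle makes the very small empirical p-values of SNP-pair tests tractable.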
ID:D-4, 5, 6
Principal developer
Satoru MIYANO, Professor, Institute of Medical Science, University of Tokyo
General description
Massively parallel software series for modeling and estimation of a gene expression control system (genome network) in a cell
Computational model
Nonparametric regression Bayesian networks, state space models, graphical Gaussian models, vector autoregressive models
Computational method
Heuristic structure estimation algorithms + the bootstrap method, the neighbor-node sampling & repeat algorithm, a parallel optimal structure estimation algorithm, the EM method, the L1 regularization method
Parallelization
MPI, OpenMP, EP
Required language and library
Fortran90, C, R
Status of code for public release
Code is available through ISLIM download site.
Maximum computing size in present experiences
- A gene network of 20,000 genes
- RICC's 8,192 cores and the supercomputer at the Human Genome Center
- Required memory/disk storage: 12 TB/500 MB (network required)
Expected computing size in K computer
- Gene network estimation for all transcripts (>100,000)
- Use of 640,000 cores
- Required memory/disk storage: 1 PB/10 GB (network required)
Figure 1. In silico search of target genes for drug discovery.
What does the code enable?
- A large-scale in silico search of target genes for drug discovery, using gene network estimation that covers all human gene transcripts
- Using the estimated gene network, identification of influenced genes, estimation of active sites, prediction and avoidance of side effects, and the search for drug-discovery targets and toxicity-associated pathways can be carried out on a large scale
- A variety of gene networks can be solved in the shortest time
Reference: http://sign.hgc.jp/index.html
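The L1 regularization method listed under Computational method can be sketched as a coordinate-descent lasso, here used to select which candidate regulators predict a target gene's expression. This is a minimal generic sketch, not the SiGN codes' actual implementation:

```python
def soft_threshold(z, g):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=200):
    """L1-regularized least squares by coordinate descent.

    Minimizes (1/2n) * ||y - X b||^2 + lam * ||b||_1.
    X: list of n rows with p features each (candidate regulators);
    y: list of n responses (target gene expression).
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    resid = list(y)  # residuals for the all-zero start
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of feature j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * b[j]
            new_bj = soft_threshold(rho / n, lam) / (col_sq[j] / n)
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b
```

The L1 penalty drives the coefficients of irrelevant regulators to exactly zero, which is what makes the estimated networks sparse and interpretable.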
ID:D-7
Principal developer
Satoru MIYANO, Professor, Institute of Medical Science, University of Tokyo
General description
- The code provides a data analysis fusion platform that lets users run the developed analysis codes, in particular SiGN and LiSDAS, more easily, combined with a highly functional and easy-to-use GUI.
- The K computer's job scheduling system supports running parts of an analysis pipeline on the K computer, and users can visualize and store the results with SBiP's series of visualization components.
Computational model
Depending on the components in analysis pipelines
Computational method
Depending on the components in analysis pipelines
Parallelization
Depending on the components in analysis pipelines
Required language and library
JAVA, R
Status of code for public release
Code is available through ISLIM download site.
Maximum computing size in present experiences
Pipelines created on RICC can be run as batch jobs.
Expected computing size in K computer
Depending on the components in an analysis pipeline.
Figure 1. The design of the pipeline for a gene network estimation.
What does the code enable?
- By combining the variety of analysis components already available, users can run their own customized analysis flows.
- For example, an analysis flow from gene network estimation based on gene expression data with SiGN-SSM, SiGN-L1, or SiGN-BN running on the K computer, through data reduction, to visualization will be available.
ID:D-8
Principal developer
Tomoyuki HIGUCHI, Professor, Institute of Statistical Mathematics
General description
The software provides basic functions to explore kinetic parameters in a biological pathway model and to reconstruct the graphical structure of an initial input simulator, so that the reproducibility and predictive power of the refined simulation models improve for given experimental data on the endogenous variables.
Computational model
Finite difference method
Computational method
Two-layer hierarchical particle filter
Parallelization
MPI for the upper layer, and OpenMP for the lower layer
Required language and library
Fortran90, C, C++, MPI, OpenMP
Status of code for public release
Source code is available through ISLIM download site.
Maximum computing size in present experiences
- Estimation of kinetic parameters in a transcription control model for the mammalian circadian rhythm
- Particles: 10 billion, unknown parameters: 44
- RICC's 8,192 cores
- Required memory/disk storage: 3.5 TB/16 TB
Expected computing size in K computer
- Parameter estimation in a large-scale pathway model
- Particles: 50 billion; unknown parameters: 250
- Use of 320,000 cores
- Required memory/disk storage: 100 TB/1 PB
Figure 1. The fine tuning of a computational model.
What does the code enable?
- LiSDAS provides basic functions to explore kinetic parameter values in a biological pathway model and to retrieve new hypothetical models (reconstructing unreliable parts of a model) from a very large space of potential model sets, so that the simulation trajectories reproduce experimentally observed profiles of the model variables on diverse scales.
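The particle-filter machinery can be illustrated with a single-layer bootstrap filter on a toy linear-Gaussian state-space model. LiSDAS's actual two-layer hierarchical filter (MPI over the upper layer, OpenMP over the lower) is far more elaborate; the model and parameters here are assumptions for illustration only:

```python
import math
import random

def bootstrap_particle_filter(ys, n_particles=1000, a=0.8, q=0.5, r=1.0, seed=0):
    """Bootstrap particle filter for the toy linear-Gaussian model
        x_t = a * x_{t-1} + N(0, q^2),   y_t = x_t + N(0, r^2).
    Returns the filtered means E[x_t | y_1..y_t].
    """
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    filtered_means = []
    for y in ys:
        # 1. propagate each particle through the state equation
        particles = [a * x + rng.gauss(0.0, q) for x in particles]
        # 2. weight particles by the observation likelihood
        weights = [math.exp(-0.5 * ((y - x) / r) ** 2) for x in particles]
        total = sum(weights)
        filtered_means.append(
            sum(w * x for w, x in zip(weights, particles)) / total)
        # 3. multinomial resampling to combat weight degeneracy
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return filtered_means
```

The filtered means track the hidden state more closely than the raw observations do; LiSDAS applies the same propagate-weight-resample cycle to estimate kinetic parameters of pathway models.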
ID:D-9
Principal developer
Yutaka AKIYAMA, Professor, Tokyo Institute of Technology
General description
The code implements a simplified novel evaluation model for protein-protein interactions (PPI) based on surface shape complementarity and electrostatic interactions derived from the proteins' 3D conformation data. Adopting the FFT greatly reduces the computing time, so the code runs fast on large-scale parallel computers.
Computational model
The complex convolutions on 3-D voxel models evaluate the binding between two protein conformations
Computational method
The surface shape complementarity and the electrostatic interaction are scored together as a single complex number. The convolution of the two conformations is computed in Fourier space, and the inverse Fourier transform gives the evaluated scores.
Parallelization
OpenMP within a node and MPI across nodes are applied to the parallelization over rotation angles.
Required language and library
C++, MPI, OpenMP, FFTW3.2.2
Status of code for public release
Source code is available through ISLIM download site.
Maximum computing size in present experiences
- Interaction prediction of human EGFR proteins (250,000 pairs)
- A 512-core x86 cluster (20,000,000 seconds)
- Required memory/disk storage: 1.5 TB/250 TB
Expected computing size in K computer
- Interaction prediction for signal transduction pathway proteins related to human lung cancer (1,000 × 1,000 pairs under variable conditions; about 10 × 10 ensemble calculations per pair, corresponding to conformational change)
- Use of 320,000 cores (100,000 sec/case)
- Required memory/disk storage: 640 TB/1 PB
Figure 1. The flow of MEGADOCK's PPI network prediction. First, enter the target protein conformations. Second, the docking calculations are done for all combinations of the proteins. Third, the output score distributions are analyzed to get the protein pairs contributing to PPI. Fourth, the PPI network is predicted from the protein pair data.
Figure 2. MEGADOCK Scalability
What does the code enable?
- The code can reveal, for example, how the several million proteins in a human cell interact with and regulate one another. Such knowledge gives important direction for drug discovery and for uncovering the causes of disease.
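The Fourier-space trick can be demonstrated in one dimension: all translational overlap scores between two complex-valued score grids come from a single elementwise product of their transforms. A naive O(N²) DFT stands in for FFTW, and this 1-D toy is only an illustration of MEGADOCK's 3-D voxel scheme:

```python
import cmath

def dft(a, inverse=False):
    """Naive O(N^2) discrete Fourier transform (toy stand-in for FFTW)."""
    n = len(a)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(a[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out

def correlation_scores(receptor, ligand):
    """All translational overlap scores in one pass:
        S[t] = sum_v conj(R[v]) * L[(v + t) % n],
    computed via the cross-correlation theorem as
        IDFT( conj(DFT(R)) * DFT(L) ).
    """
    R = dft(receptor)
    L = dft(ligand)
    prod = [rc.conjugate() * lc for rc, lc in zip(R, L)]
    return dft(prod, inverse=True)
```

Instead of N translations each costing O(N), one forward transform per grid, an elementwise product, and one inverse transform deliver every translational score at once; in 3-D this is what turns an intractable docking search into an FFT-bound computation that parallelizes over rotation angles.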