Graduate School of Information Science and Technology,
The University of Tokyo
(Data Analysis Fusion WG)
Human cells are estimated to have about 20,000 to 30,000 genes. The human body is composed largely of proteins, and a gene is a blueprint for a protein made in the cell. Not only the kind of protein but also the timing and quantity of its production are regulated by particular genes, and those regulatory genes (or rather, the proteins they encode) are in turn regulated by other genes. In short, genes form a complicated regulatory network, most of which is still poorly understood. Even within the same human body, cells in different organs have different networks. Drugs modify gene networks, and in cancer cells the networks are broken.
Gene network estimation is an approach that infers such gene regulatory networks (gene networks) from measurable data using mathematical, statistical, and information-science methods. Although current technologies cannot measure all proteins produced in the cell, the amount of mRNA, which is synthesized before a protein is produced, can be measured for every gene. Data measured in this way are called gene expression data. The data from a single measurement are a snapshot of the cell's state, and regulation between genes cannot be inferred from one snapshot alone; a large amount of data is needed. We therefore collect the necessary data by applying various stimuli to cells, by collecting cells from patients with a particular disease, or by taking measurements at regular time intervals. Gene network estimation clarifies regulation between genes by exhaustive computation, replacing the conventional, time-consuming approach of examining genes one by one through repeated experiments. This approach is expected to enable more efficient development of new drugs, identification of cancer-specific genes, and understanding of the functions of such genes.
SiGN is software for estimating gene networks from gene expression data on a supercomputer. Various gene network models have been proposed, but each has both strengths and weaknesses, and none is clearly the best. Once a model is chosen, a method for estimating its parameters from the data must also be chosen, and these methods likewise have strengths and weaknesses. SiGN therefore implements multiple gene network models and estimation algorithms, all of which require so much computation time that they are designed to run on a supercomputer. Specifically, SiGN consists of three sub-programs: SiGN-BN, which uses static and dynamic Bayesian networks; SiGN-SSM, which uses a state space model (SSM); and SiGN-L1, which implements parameter estimation by L1 regularization. SiGN-BN implements a new algorithm called NNSR. Gene network estimation with Bayesian networks was conventionally limited to about 1,000 genes; with NNSR it can be applied to the whole genome (all genes). Given time-series data, SiGN-SSM estimates dynamic gene networks that can be simulated. It yields not the network structure itself but the strength of the relationships among all genes as numerical values. Thanks to supercomputers, network structures that used to be difficult to compute can now be computed with a degree of confidence. L1 regularization has always been applicable to large-scale gene networks, but conventional methods are too slow to estimate networks that focus on individual differences in gene expression. With the K computer, such networks can be computed within a realistic time frame.
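To make the idea of parameter estimation by L1 regularization more concrete, the following is a minimal sketch in plain Python. It is not the SiGN-L1 algorithm, which is not described here. It assumes a simple linear model in which one target gene's expression is a weighted sum of the expression of candidate regulator genes, and it fits the weights with a standard lasso solved by cyclic coordinate descent. The function names, the toy data, and the penalty value are all hypothetical choices for illustration.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; exact zeros arise when |value| <= penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=100):
    """Minimize (1/2n) * ||y - Xw||^2 + penalty * ||w||_1 by cyclic coordinate descent.

    X is a list of n samples, each a list of p regulator expression values;
    y is a list of n expression values of the target gene.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero column carries no information
            # Correlation of column j with the residual that excludes gene j's own term.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Toy data (an assumption for illustration): 5 candidate regulators, 200 samples.
# Only regulators 0 and 3 actually influence the target gene.
random.seed(0)
n_samples, n_genes = 200, 5
X = [[random.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
y = [2.0 * row[0] - 1.5 * row[3] + random.gauss(0.0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, penalty=0.2)
print([round(v, 3) for v in weights])
```

In this sketch the L1 penalty shrinks every weight and sets weak ones exactly to zero, so the nonzero weights can be read as candidate regulatory edges into the target gene. Repeating the fit with each gene as the target would give a whole network, which hints at why the computation grows quickly with the number of genes.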
The development of SiGN mainly targets the K computer and Shirokane, a supercomputer at the Human Genome Center. Several sub-programs are already installed on Shirokane and are available to users. For more information, please visit the SiGN website at http://sign.hgc.jp.