HPCI Strategic Program Field 1 Supercomputational Life Science
Graduate School of Information Science and Engineering, Tokyo Institute of Technology
(From the left) Yutaka Akiyama, Takashi Ishida, Masanori Kakuta, Shuji Suzuki (Field 1- Program 4)
The Next-Generation DNA Sequencer allows DNA sequencing of about 1 trillion bases with a single run by a dramatic technological innovation. Although the "number of bases sequenced per dollar" doubled with a marvelous extension rate in about 19 months with what we call a firstgeneration sequencer, it doubled in only about 5 months after the appearance of Next-Generation Sequencers which are coming into wide use everywhere. The largest bottleneck now is the calculation of semantic interpretation for read sequences. It is common knowledge that supercomputers are required for integrated advanced calculation such as gene network prediction. But now, large-scale calculations are also required for elementary processes for DNA fragments from sequencers.
A high-speed algorithm which is suitable to explore an exact match with a known sequence or highly similar character string can be used for the study of subtle variations of human genes, and large-scale personal computer clusters have managed to address it. However, in the study of a comprehensive analysis of enteric microbe's genome fragments of humans or the study of significant sequence mutations, fragment mapping procedure (finding correspondence with a portion of known genome sequence) is not possible without detection of sequence similarity among distant or largely mutated sequences. For these studies, incomparably large-scale calculation is required, and personal computer clusters are not effective.
Human health is maintained by the properly balanced work of many symbiotic microbes in the body. In addition, the balance is considered to have several locally stable points. Comprehensive sequencing of genome fragments of microorganism gastro-intestinal which exist in the body's internal environment including oral, digestive, skin and urogenital system has been implemented for about 500 people in the Human Microbiome Project (HMP) in the USA and for about 300 people in the MetaHIT Project in Europe. However, when analyzing these valuable data, the rate of non-analyzable sequence fragments to be excluded will be significantly increased only with insensitive methods hypothesizing highly similar character strings.
We firstly established the super parallel execution environment of the existing BLASTX software with the cooperation of Prof. Ken Kurokawa (Tokyo Institute of Technology), who is an expert in genomic analysis. A scalability up to 16,000 cores was confirmed on TSUBAME 2.0 of the Tokyo Institute of Technology (Fig. 1). Binary-tree transfer was used for applying the bandwidth of networks at the initial delivery of the database to each calculation node. After translating DNA fragment sequences into amino acid sequences in six possible frames, we flexibly compared them with existing amino acid sequences. Event though the target DNA sequence fragments mutated at the DNA sequence level, they are often consistent with residues of existing sequences at the amino acid sequence level, or are changed into residues with similar physicochemical properties. Therefore, our flexible method which compares using a distance matrix between characters improves sensitivity for sequence search to explore distant genes with the same functions.
We then created a software, GHOSTX, as a substitute for BLASTX, with drastic speedup on the algorithm [1, 2]. GHOSTX can find similar short strings between database and query sequences immediately by multito-multi comparison by describing a partitioned database stored in memory using Suffix Array, translating a query sequence to Suffix Array, and comparing theme ffectively (Fig. 2). While BLASTX compares only character strings with a fixed-length, charac ter strings for comparison canbe extended with variable-length until they satisfy a similarity threshold score in GHOSTX, whic h improves the sensitivity per calculation cost. GHOSTX achieved about 20 times processing speed even when an equivalent sensitivity an equivalent sensitivity with BLASTX was required. When the sensitivity is decreased to a degree of no difficulty, GHOSTX attained over 100 times the speed. Its sensitivity is much better and it is 2 to 3 times faster than the BLAT algorithm which had high speed but was not able to be used because of poor sensitivity. We established a super parallel analysis system named "GHOST-MP" by implementing the GHOSTX algorithm on the ten quadrillion speed computer "K computer ", and integrating thread parallelization using OpenMP and data parallelism between nodes using MPI.
For the principal parts of homologous search, good scalability up to 1,536 nodes (12,288 cores) was measured with "K computer". Although the execution time for each sequence fragment varies widely, a scalability of up to several tens of thousands of nodes is expected because of the load balancing mechanism. In GHOST-MP, I/O operation is not completely intensive because its calculations are proportional to the product (multiplication) of length of the database and length of query sequence sets. Even though, careful I/O design is required. In the case of "K computer", I/O transfer to and from I/O nodes takes place in the z axis direction. We aim to achieve highly- efficient implementation up to several tens of thousands of nodes by adjusting data input using representative nodes and a broadcast mechanism, and by improving dispersion and reduction of I/O load.
 Suzuki, Ishida, Akiyama (2010) IPSJ Technical report, 2010-BIO-23(21):1-6
 Suzuki, Ishida, Akiyama (2011) IPSJ Technical report, 2011-BIO-25(32):1-8
Fig 1 : Parallel performance in
Fig 2 : Processing flowchart in GHOSTX algorithm