BioSupercomputing Newsletter Vol.6

Yousuke Ohno, Makoto Taiji, Gen Masumoto
(from the left, front row),
Hiroshi Koyama and Aki Hasegawa
(from the left, back row)

Group Director of Computational Molecular Design Group,
Quantitative Biology Center, RIKEN
Makoto Taiji
Senior Researcher of High-performance Computing Team,
Integrated Simulation of Living Matter Group,
Computational Science Research Program, RIKEN
Yousuke Ohno
Senior Researcher of High Performance Computing Development Team,
High Performance Computing Development Group,
RIKEN HPCI Program for Computational Life Sciences
Hiroshi Koyama
Researcher of High-performance Computing Team,
Integrated Simulation of Living Matter Group,
Computational Science Research Program, RIKEN
Gen Masumoto
Research Associate of High-performance Computing Team,
Integrated Simulation of Living Matter Group,
Computational Science Research Program, RIKEN
Aki Hasegawa

●The K computer proved to be more stable than expected.

━　As the development of the supercomputer K continues, we are taking up the great challenge of running application programs on it, and performing tuning and tentative computation although on a trial basis. The “MD core program for large-scale parallel computers (cppmd)” you developed has already recorded 1.3 PFLOPS*, and development of other application programs has continued with a view to achieving more than 1 PFLOPS. Today, members of the High-performance Computing Team who are providing various support for researchers talk about their actual impression of the K computer along with other problems found in their support work.

TAIJI　(Honorifics omitted): Speaking of the hardware aspects, I have the impression that the system was quite stable from the word "go". I have hardly ever heard of any sudden unexpected shutdown, and the system managed to operate on a rather large scale right from the beginning. At the outset, we expected much trouble due to malfunctions, but the system became very stable by last spring when we got involved in it.

OHNO:　Admittedly, it was really surprising that we had so little hardwarerelated trouble. Rather, it was software-related problems that we had difficulty in coping with. In particular, since the compiler was under development, C or C++ based compilers did not keep up with the development of the Fortran compiler. We had compilation failure and erroneous results although the program as such is not defective. In any case, with regard to software-related issues, we are fully prepared to cooperate in finding bugs, because we encourage and benefit from earlier delivery…(laugh).

●Fortran user achieved higher performance

━ Does it mean that you obtained better results from Fortran?

OHNO:　In the beginning, Fortran was also faster in terms of performance. In addition, I tried to write such codes that allow optimization using the compiler under development. In the meantime, we are getting improvements from other compilers these days.

KOYAMA:　This is not because I am a Fortran user, and I have no intention to value specific manufacturers over others, but automatic parallelization of the Fortran compiler is very complete. Regardless of the relatively limited number of users, it will be easy to use by those who have already been accustomed to Fortran for many years. On the other hand, we have to pay attention to the problem of memory bandwidth. I first thought that Fortran users who have used vector computers such as earth simulators can readily adapt to porting, but there are two distinct cases, i.e., successfully exploiting performance within the range of available memory bandwidth, or failing to do so. The demarcation is very clear. If that’s not possible, you will have to make changes to the algorithm itself to exploit higher performance. Reluctantly, we have to accept the constraints as they are, although that’s of course challenging…

━ Certainly, when compared with the earth simulators, we need to accept the memory bandwidth on an as-is basis.

KOYAMA:　I'd agree, but the memory bandwidth issue affects the real consequences, I mean, some people succeed, and others do not succeed. That’s hard.

━ In that context, it seems that ZZ-EFSI proved that it works successfully.

OHNO:　Certainly, loop statements written in Fortran leads to favorable performance in automatic parallelization.

KOYAMA:　It’s of course a result of our continued development efforts, but in a sense we were very lucky (laugh). However, many K computer users want sufficient compiler optimization for programs using C and C++. I want manufacturers to respond their needs.

MASUMOTO:　People having experience in fluid dynamics ought to have experience in vector supercomputers. On the other hand, quite a few people in biotechnology have not used supercomputers. They should have made necessary preparations taking the many C users and C++ users into consideration.

KOYAMA:　At any rate, that applies to the current stage. The situation will be quite different when the system is in full-scale operation.

OHNO:　Being sophisticated in terms of language specification makes it more difficult to construct a compiler. It’s possible that construction of the compiler may be intentionally postponed. Also, there must have been some pressure to realize high performance from the very beginning.

MASUMOTO:　So getting started with relatively easy-to-handle Fortran, that’s not unreasonable.

TAIJI:　The K computer incorporates a parallel computation model called VISIMPACT for effective massively parallel computing. It facilitates high-speed core synchronization. That function facilitates vectorization-oriented parallelization. For that reason, usually, parallelization between cores is done at an upstream location, but advantageously, parallelization in the K computer takes place at a downstream location and then the loop is decomposed. Probably, it depends on whether or not automatic parallelization can take place as to whether automatic parallelization as a whole is successful.

KOYAMA:　People using many loop structures and a lot of data in them are getting good and faster performance. Although there is the memory bandwidth issue, I think that, in terms of optimization, people accustomed to vector types are can readily adapt to automatic parallelization.

MASUMOTO:　The application programs that I was involved in are all written using C and C++, and I had trouble with compilation at the beginning. Some application programs have to demonstrate performance immediately, and Mr. Ohno cooperated with me to do as what can be done in the current situation, so recently I have had much better results. With regard to other application programs, I am wondering whether we had better wait for maturity of the compiler or modify codes in accordance with the currently available compiler. If time allows, we should refrain from doing something tricky and wait for completion of the compiler…

━ So you have to solve the problem of how to deal with computers under development.

MASUMOTO:　Sure, but that’s a problem specific to C++. C is much better than that. As regards Mr. Koyama’s comment, I am very much surprised by the fact that Fortran is much, much better… (laugh)

KOYAMA:　It was certainly easy, and I have a very good impression of Fortran (laugh).

●Real performance to be assessed in future

HASEGAWA:　It was not so complicated or time-consuming to port the data analysis application programs that I am in charge of. However, there is much room for expanding capability through hybrid parallelization. I’m going to direct my efforts toward that end. In addition, many I/O components are still pending because the system itself has not entered the full-scale operation phase, so tuning will be done somewhat later.

MASUMOTO:　When compared with other application programs, I feel that I/O-related components create a bottleneck.

HASEGAWA:　It may sound misleading to say that I don't have much trouble with it (laugh). I do have a lot of trouble. The point is that we don't have many options we can choose at this stage. Further development of the I/O-related components will offer us more options in due course.

KOYAMA:　For example, in the computation regarding fluids, the computation results do not differ so much from computations using dummy data, and they are meaningfully predictable, so the I/O configuration can be defined afterward. In contrast, in biotechnology, data should be given first. This means that the computation could lack significance in the absence of a predefined benchmark affecting the hit ratio. An operation using dummy data would give you no hits, and you have to use more or less fully-fledged data to tune the system and thereby get meaningful results. Accordingly, in some cases, I/O features have to be available at an early stage. That is where some of our efforts are directed.

━ The K computer incorporates many advanced techniques that have drawn much attention as early as in its development phase, including the 6D Mesh/Torus Interconnect (Tofu Network), and the sector cache feature in which specific data can be maintained in the cache memory. What is your impression of these after actually using them?

OHNO:　In the MD computation in the cppmd, the amount of computation is relatively large compared with the amount of data, and the computation does not cause much memory bottleneck. Since the cache capacity is, actually, sufficient in most cases, we were successful in delivering performance without using the sector cache functionality. In the application programs in life science files that I now handle, I have not had to use the sector cache thus far. With regard to networks, in the case of MD, the communications ratio is relatively low, and favorable performance is achieved even when tuning is left to be done to the maximum extent possible. Either of them is not used full-scale, and it is a long way to go until it proves its true performance.

●Challenges specific to development phase

━ When the K computer was under development, Mr. Taiji told me that one can’t tell what will happen until a computer this large is actually used. Is there something you did not expect or predict, unexpected findings, or something surprising?

TAIJI:　I still hold to the idea “you can't tell unless you try,” but what was the most surprising was that, as I said previously, we didn't have much trouble in the hardware aspects and the system had more stability than expected (laugh).

━ So it was an unexpected benefit, wasn’t it? And…

TAIJI:　Yes. The downside to this, as we were discussing, the compiler is still Fortran- based. We of course understand that it is in the development phase, but I want a C++ compiler that really works at this stage.

KOYAMA:　What is most challenging is that some features are not implemented in the development phase, and we have to decide on the extent to which the application should be modified to avoid bugs. For example, version upgrades could lead to improved performance even when major modifications are not made at this point so that the performance is improved. We need to select the most promising out of the available approaches…

MASUMOTO:　We have oftentimes had similar situations up to now (laugh). However, such efforts might not be meaningful to the application programs, but they help us build our skills, and it is somewhat useful for us to understand the characteristics of the K computer.

KOYAMA:　I understand that gaining experience is very important. However,it would be too much burden on those involved in development of individual application programs to yield their own research results, for that increases their workload.

MASUMOTO:　Even in that case, it will never happen that you get splendid results from the original codes without making modifications. You need to try many things. Of course, having tried many things often ends up going back to the original state (laugh).

━ What are the questions frequently asked by users who develop application programs?

MASUMOTO:　I had many inquiries before, saying, “I cannot compile successfully” (laugh).

KOYAMA:　In some cases we had trouble in cross-compiling.

━ Cross-compiling?

OHNO:　Since the K computer is a machine in which the front-end CPU and the computation node CPU are different, they may fail to run in different environments. The operation is successful now, but at the beginning I had much trouble.

KOYAMA:　As the K computer is under development and there are many functional limitations, some people report that system development is affected by such functional limitations specific to development phases, although they suffer from their own bugs, of course (laugh).

MASUMOTO:　Since there are still many beginners, they often ask me elementary questions. Meanwhile, the manuals are very complete. They seem more complete than expected in the first version.

OHNO:　Rather, the machines have to be upgraded to be in line with the integrity of the manuals (laugh).

●Efforts by application developers are required.

━ I think many people would like to use the K computer. Please give them some advice from a professional viewpoint.

KOYAMA:　There are many specific topics, but it is difficult to offer general advice.

OHNO:　When performing large-scale computation using the K computer, it will be necessary to provide hybrid parallelization where two parallelization schemes are used. I have observed some cases where people find difficulty in this hybrid parallelization.

MASUMOTO:　The Headquarters also recommend hybrid parallelization.

KOYAMA:　Not only in the K computer but also in other large-scale machines, the number of cores per node tends to increase, so now we will have to rely on hybrid parallelization to exploit the capability of CPUs, given the current trends of high-performance computing (HPC).

OHNO:　As we go back to basics, we need much effort by those who write the application programs. In fact, their efforts are the key to harnessing HPC potentials.

━ Nevertheless, researchers developing application programs use computers to output their research results, so it would be rather demanding for them.

OHNO:　In that sense, a High-performance Computing Team like us can provide support for them. Complete division of labor is an extreme example, where there are two distinct groups, i.e., members dedicated to code tuning and members dedicated to research using the tuned codes. In that case, revising the algorithm itself will have to ensure that researchers’ computation targets are left intact. Even when you do not have a clear idea of computation targets, you have to tune codes with computation methods, the presence of computation targets, and specific objects of computation taken into account. Complete division of labor could cause communication problems. I think that people who can handle code tuning while being involved to some extent in the research field as such are required.

━ The K computer achieved first place in the TOP500 on the LINPACK benchmarks in two consecutive years, and research on silicon nanowire materials using the K computer won the Gordon Bell Prize, which further raised expectations for the research results that will follow these two achievements. Your endeavor will become most important for the K computer to meet expectations. I am looking forward to you getting superb achievements. Thank you for making time in your busy schedule.

※Results obtained by special use for 2011 Gordon Bell Prize