Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Implementation and performance of FDPS: a framework for developing parallel particle simulation codes

Implementation and performance of FDPS: a framework for developing parallel particle simulation... We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application- development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle- based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each com- puting node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes– Hut tree algorithm or the Fast Multipole Method should be used in the case of long- range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are neces- sary for efficient parallel execution of particle-based simulations as “templates,” which are independent of the actual data structure of particles and the functional form of the particle–particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N ) calculation cost, and yet the program, once compiled with FDPS, will run effi- ciently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs The Author 2016. Published by Oxford University Press on behalf of the Astronomical Society of Japan. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-2 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum 7 9 calculation time per timestep is in the range of 30 ms (N = 10 ) to 300 ms (N = 10 ). These are currently limited by the time for the calculation of the domain decomposition and communication necessary for the interaction calculation. We discuss how we can overcome these bottlenecks. Key words: dark matter — Galaxy: evolution — methods: numerical — planets and satellites: formation 1 Introduction In the field of computational astronomy, simulations based on its particles. In addition, we need to make use of mul- on particle methods have been widely used. In such sim- tiple CPU cores in one processor chip and SIMD (single ulations, a system is either physically a collection of par- instruction multiple data) execution units in one CPU core, ticles, as in the case of star clusters, galaxies, and dark- or in some cases GPGPUs (general-purpose computing on matter halos, or modeled by a collection of particles, as graphics processing units) or other accelerators. in SPH (smoothed particle hydrodynamics) simulation of In the case of gravitational N-body problems, there are a astrophysical fluids. Since particles move automatically as number of works in which the efficient parallelization is dis- the result of integration of the equation of motion of the par- cussed (Salmon & Warren 1994; Dubinski 1996; Makino ticle, particle-based simulations have an advantage for sys- 2004; Ishiyama et al. 2009, 2012). The use of SIMD units tems experiencing strong deformation or systems with high is discussed in Nitadori, Makino, and Hut (2006)and density contrast. This is one of the reasons why particle- Tanikawa et al. (2012, 2013), and GPGPUs in Gaburov, based simulations are widely used in astronomy. Examples Harfst, and Portegies Zwart (2009), Hamada et al. (2009a, of particle-based simulations include cosmological simu- 2009b), Hamada and Nitadori (2010), Bedorf, ´ Gaburov, lations or planet-formation simulations with gravitational and Portegies Zwart (2012), and Bedorf ´ et al. (2014). N-body code, simulations of star and galaxy formation with In the field of molecular dynamics, several groups the SPH code or other particle-based codes, and simulations have been working on parallel implementations. Exam- of planetesimal formation with the DEM (discrete element ples of such efforts include Amber (Case et al. 2015), method) code. CHARMM (Brooks et al. 2009), Desmond (Shaw et al. We need to use a large number of particles to improve the 2014), GROMACS (Abrahama et al. 2014), LAMMPS resolution and accuracy of particle-based simulations, and (Plimpton 1995), and NAMD (Phillips et al. 2005). In the in order to do so we need to increase the calculation speed field of CFD (computational fluid dynamics), many com- and need to use distributed-memory parallel machines effi- mercial and non-commercial packages now support SPH or 1 2 ciently. In other words, we need to implement efficient algo- other particle-based methods (PAM-CRASH, LS-DYNA, rithms such as the Barnes–Hut tree algorithm (Barnes & and Adventure/LexADV ). Hut 1986), the TreePM algorithm (Xu 1995) or the Fast Currently, parallel application codes are being developed Multipole Method (Dehnen 2000) to distributed-memory for each specific application of particle methods. Each of parallel computers. In order to achieve high efficiency, we these codes requires the multi-year effort of a multi-person need to divide a computational domain into subdomains team. We believe this situation is problematic because of in such a way that minimizes the need for communication the following reasons. between processors to maintain the division and to per- First, it has become difficult for researchers to try a new form interaction calculations. To be more specific, parallel method or just a new experiment which requires even a implementations of particle-based simulations contain the small modification of existing large codes. If one wants to following three procedures to achieve the high efficiency: test a new numerical scheme, the first thing he or she would (a) domain decomposition, in which the subdomains to be do is write a small program and do simple tests. This can be assigned to computing nodes are determined so that the cal- easily done, as far as that program runs on one processor. culation times are balanced, (b) particle exchange, in which However, if he or she then wants to try a production-level particles are moved to computing nodes corresponding to the subdomains to which they belong, and (c) interaction https://www.esi-group.com/pam-crash. information exchange, in which each computing node col- http://www.lstc.com/products/ls-dyna. lects the information necessary to calculate the interactions http://adventure.sys.t.u-tokyo.ac.jp/lexadv. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-3 large calculation using the new method, the parallelization been widely used for numerical relativity, and BoxLib is for distributed-memory machines is necessary, and other designed for AMR (adaptive mesh refinement). For particle- optimizations are also necessary. However, to develop such based simulations, such frameworks have not been widely a program in a reasonable time is impossible for a single used yet, though there were early efforts, as in Warren person, or even for a team, unless they already have expe- and Salmon (1995), which is limited to long-range, 1/r riences of developing such a code. potential. More recently, LexADV_EMPS is currently being Secondly, even for a team of people developing a parallel developed (Yamada et al. 2015). As its name suggests, it is code for one specific problem, it has become difficult to take rather specialized to the EMPS (Explicit Moving Particle care of all the optimizations necessary to achieve reason- Simulation) method (Murotani et al. 2014). able efficiency on recent processors. In fact, the efficiency In section 2, we describe the basic design concept of of many simulation codes mentioned above on today’s latest FDPS. In section 3, we describe the implementation of microprocessors is rather poor, simply because the develop- parallel algorithms in FDPS. In section 4, we present the ment team does not have enough time or expertise to imple- measured performance for three astrophysical applications ment necessary optimizations (in some cases they require a developed using FDPS. In section 5, we construct a perfor- change of data structure, control structure, and algorithms). mance model and predict the performance of FDPS on near- In our opinion, these difficulties have significantly future supercomputers. Finally, we summarize this study in slowed down research into numerical methods and also section 6. research using large-scale simulations. We have developed FDPS (Framework for Developing 2 How FDPS works Particle Simulator: Iwasawa et al. 2015) in order to solve these difficulties. The goal of FDPS is to let researchers con- In this section, we describe the design concept of FDPS. In centrate on the implementation of numerical schemes and subsection 2.1, we present the design concept of FDPS. In physics, without spending too much time on parallelization subsection 2.2, we show an N-body simulation code written and code optimization. To achieve this goal, we separate a using FDPS, and describe how FDPS is used to perform par- code into domain decomposition, particle exchange, inter- allelization algorithms. Part of the contents of this section action information exchange, and fast interaction calcula- have been published in Iwasawa et al. (2015). tion using a Barnes–Hut tree algorithm and/or neighbor search and the rest of the code. We implement these func- 2.1 Design concept tions as “template” libraries in C++ language. The reason we use the template libraries is to allow the researchers to In this subsection, we present the basic design concept of define their own data structure for particles and their own FDPS. We first present the abstract view of calculation codes functions for particle–particle interactions, and to provide for particle-based simulations on distributed-memory par- them with highly-optimized libraries with small software allel computers, and then describe how such abstraction is overhead. A user of FDPS needs to define the data structure realized in FDPS. and the function to evaluate particle–particle interaction. Using them as template arguments, FDPS effectively gener- 2.1.1 Abstract view of particle-based simulation codes ates the highly-optimized library functions which perform In a particle-based simulation code that uses the space the complex operations described above. decomposition on distributed-memory parallel computers, From the users’ point of view, what is necessary is to the calculation proceeds in the following steps. write the program in C++, using FDPS library functions (1) The computational domain is divided into subdomains, and to compile it with a standard C++ compiler. Using and each subdomain is assigned to one MPI process. FDPS, users can thus write their particle-based simula- This step is usually called “domain decomposition.” tion codes for gravitational N-body problems, SPH, MD Here, minimization of inter-process communication (molecular dynamics), DEM, or many other particle-based and a good load balance among processes should be methods, without spending their time on parallelization achieved. and complex optimization. The compiled code will run effi- (2) Particles are exchanged among processes, so that each ciently on large-scale parallel machines. process owns particles in its subdomain. In this paper For grid-based or FEM (Finite Element Method) appli- we call this step “particle exchange.” cations, there are many frameworks for developing parallel applications. For example, Cactus (Goodale et al. 2003)has 4 5 https://github.com/FDPS/FDPS. https://ccse.lbl.gov/BoxLib/. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-4 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 (3) Each process collects the information necessary to cal- culate the interactions on its particles. We call this step “interaction-information exchange.” (4) Each process calculates interactions between particles in its subdomain. We call this step “interaction calcu- lation.” (5) The data for each particle are updated using the cal- culated interactions. This part is done without inter- process communication. Steps 1, 2, and 3 involve parallelization and inter-process communications. FDPS provides library functions to per- form these parts. Therefore, users of FDPS do not have to Fig. 1. Basic concept of FDPS. The user program gives the definitions of write the parallelization and/or inter-process communica- particle and interaction to FDPS, and calls FDPS APIs. (Color online) tion part of their own code at all. Step 4 does not involve inter-process communication. force in the simulation of organic molecules can be done in However, this step is necessary to perform the actual calcu- the user code, with small additional computing cost. lation of interactions between two particles. Users of FDPS should write a simple interaction kernel. The actual interac- 2.1.2 Design concept of FDPS tion calculation using the tree algorithm or neighbor search In this sub-subsection, we describe how the abstract view is done in FDPS. presented in the previous section is actually expressed in the Step 5 involves neither inter-process communication nor FDPS API (application programming interface). The API of interaction calculation. Users of FDPS can and should write FDPS is defined as a set of template library functions in their own program for this part. C++ language. FDPS can be used to implement particle-based simula- Figure 1 shows how a user program and FDPS library tion codes for initial value problems which can be expressed functions interact. The user program gives the definition of as the following ordinary differential equation: a particle u and particle–particle interaction f to FDPS at the time of compilation. They are written in the standard N C++ language. Thus, the user program [at least the main() du = g f (u , u ), u . (1) i j i function] currently should be written in C++. dt The user program first does the initialization of FDPS library. When the user program is compiled for the MPI Here, N is the number of particles in the system, u is a environment, the initialization of MPI communication is vector which represents the physical quantities of particle i, done in the FDPS initialization function. The setup of the f is a function which describes the contribution of particle j initial condition is done in the user program. It is also to the time derivative of physical quantities of particle i,and possible to use the file input function defined in FDPS. In g is a function which converts the sum of the contributions the main integration loop, domain decomposition, particle exchange, interaction information exchange, and force cal- to the actual time derivative. In the case of gravitational culation are all done through library calls to FDPS. The N-body simulation, u contains position, velocity, mass, time integration of the physical quantities of particles using and other parameters of particle i, f is the gravitational the calculated interaction is done within the user program. force from particle j to particle i,and g gives velocity as Note that it is possible to implement multi-stage integra- the time derivative of position and acceleration calculated tion schemes such as the Runge–Kutta schemes using FDPS. as the time derivative of velocity. FDPS can evaluate the right-hand side of equation (1) for Hereafter, we call a particle that receives the interaction agiven setof u . Therefore, the derivative calculation for an “i particle,” and a particle exerting that interaction an “j particle.” The actual contents of vector u and the func- intermediate steps can be done by u containing appropriate tional forms of f and g depend on the physical system and values. numerical scheme used. The parallelization using MPI is completely taken care of In equation (1) we included only the pairwise interac- by FDPS, and the use of OpenMP is also managed by FDPS tions, because usually the calculation cost of the pairwise interaction is the highest even when many-body interaction 6 We will investigate a possible way to use APIs of FDPS from programs written in is important. For example, the angle and torsion of bonding Fortran. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-5 modified to ⎡ ⎤ N N J,i S,i du ⎣ ⎦ = g f (u , u ) + f (u , u ), u , (2) i j i  i dt j j where N and N are, the number of j particles and J, i S, i superparticles for which we apply multipole-like expansion, respectively, the vector u is the physical quantity vector of a superparticle, and the function f indicates the interac- tion exerted on particle i by the superparticle j .Insimu- lations with a large number of particles N, N ,and N J, i S, i Fig. 2. Long-range interaction (left) and short-range interaction (right). are many orders of magnitude smaller than N. A user needs Gray, red, and blue points are i particles, j particles, and superparticles, respectively. (Color online) to provide functions to construct superparticles from par- ticles and to calculate the interaction from superparticles. Since the most common use of the long-range interaction is for the interaction calculation. In order to achieve high per- for 1/r potential, FDPS includes standard implementation formance, the interaction calculation should make efficient of these functions for 1/r potential up to the quadrupole use of the cache memory and SIMD units. In FDPS, this moment. is done by requiring an interaction calculation function to calculate the interactions between multiple i and j particles. In this way, the amount of memory access is significantly 2.2 An example — gravitational N-body problem reduced, since a single j particle is used to calculate the In this subsection, we present a complete user code for the interaction of multiple i particles (i particles are also in the gravitational N-body problem with the open boundary con- cache memory). To make efficient use of the SIMD execu- dition, to illustrate how a user could write an application tion units, currently the user should write the interaction calculation loop so that the compiler can generate SIMD program using FDPS. The gravitational interaction is han- instructions. Of course, the use of libraries optimized for dled as a “long-range” type in FDPS. Therefore, we need to specific architectures (Nitadori et al. 2006; Tanikawa et al. provide the data type and interaction calculation functions for superparticles. In order to keep the sample code short, 2012, 2013) would guarantee very high performance. we use the center-of-mass approximation and use the same It is also possible to use GPUs and other accelerators for data class and interaction function for real particles and the interaction calculation. In order to reduce the communi- superparticles. cation overhead, a so-called “multiwalk” method (Hamada For the gravitational N-body problem, the physical et al. 2009b) is implemented. Thus, interaction calculation quantity vector u and interaction functions f , f ,and g kernels for accelerators should take multiple sets of pairs of are given by i-and j particles. The performance of this version will be discussed elsewhere. u = (r , v , m ), (3) i i i i As stated earlier, FDPS performs the neighbor search if the interaction is of a short-range nature. If the long-range Gm r − r j j i f (u , u ) = , (4) i j 3/2 interaction is used, currently FDPS uses the Barnes–Hut |r − r | + j i tree algorithm. In other words, within FDPS, the distinction between the long-range and short-range interactions is not Gm r − r j i f (u , u ) = , (5) i j a physical one but an operational one. If we want to apply 3/2 2 2 |r − r | + j i the treecode, it is a long-range interaction. Otherwise, it is a short-range interaction. Thus, we can use the simple tree g(F , u ) = (v , F , 0), (6) i i algorithm for pure 1/r gravity and the TreePM scheme (Xu 1995; Bode et al. 2000; Bagla 2002; Dubinski et al. 2004; F = f (u , u ) + f (u , u ), u , (7) i j i j i Springel 2005; Yoshikawa & Fukushige 2005; Ishiyama j j et al. 2009, 2012) for the periodic boundary. Figure 2 illustrates the long-range and short-range inter- where m , r , v ,and  are the mass, position, velocity, and i i i i actions and how they are calculated in FDPS. gravitational softening of particle i, respectively; m and r j j For long-range interactions, a Barnes–Hut algorithm is are the mass and position of a superparticle j, respectively; used. Thus, the interactions from a group of distant particles and G is the gravitational constant. Note that the shapes of are replaced by that of a superparticle, and equation (1) is the functions f and f are the same. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-6 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 Listing 1 shows the complete code which can be com- 49 for(S32 j=0; j<nj;j++) { 50 F64vec xj = jp[j].pos; piled and run, not only on a single-core machine but also on massively-parallel, distributed-memory machines such 51 F64vec dr = xi - xj; as the K computer. The total number of lines is 117. 52 F64 mj = jp[j].mass; 1 #include <particle_simulator.hpp> 53 F64 dr2 = dr * dr + ep2; 2 using namespace PS; 54 F64 dri = 1.0 / sqrt(dr2); 3 55 ai -= (dri * dri * dri 4 class Nbody{ 56 * mj) * dr; 5 public: 57 } 6 F64 mass, eps; 58 force[i].acc += ai; 7 F64vec pos, vel, acc; 59 } 8 F64vec getPos() const {return pos;} 60 } 9 F64 getCharge() const {return mass;} 61 }; 10 void copyFromFP(const Nbody &in) { 62 11 mass = in.mass; 63 template <class Tpsys> 12 pos = in.pos; 64 void predict(Tpsys &p, 13 eps = in.eps; 65 const F64 dt) { 14 } 66 S32 n = p.getNumberOfParticleLocal(); 15 void copyFromForce (const Nbody &out) { 67 for(S32 i = 0; i < n; i++) 16 acc = out.acc; 68 p[i].predict(dt); 17 } 69 } 18 void clear() { 70 71 template <class Tpsys> 19 acc = 0.0; 20 } 72 void correct(Tpsys &p, 21 void readAscii(FILE *fp) { 73 const F64 dt) { 22 fscanf(fp, 74 S32 n = p.getNumberOfParticleLocal(); 23 "%lf%lf%lf%lf%lf%lf%lf%lf" 75 for(S32 i = 0; i < n; i++) 24 &mass, &eps, 76 p[i].correct(dt); 25 &pos.x, &pos.y, &pos.z, 77 } 26 &vel.x, &vel.y, &vel.z); 78 27 } 79 template <class TDI, class TPS, class TTFF> 28 void predict(F64 dt) { 80 void calcGravAllAndWriteBack (TDI &dinfo, 29 vel += (0.5 * dt) * acc; 81 TPS &ptcl, 30 pos += dt * vel; 82 TTFF &tree) { 31 } 83 dinfo.decomposeDomainAll(ptcl); 32 void correct(F64 dt) { 84 ptcl.exchangeParticle(dinfo); 33 vel += (0.5 * dt) * acc; 85 tree.calcForceAllAndWriteBack 34 } 86 (CalcGrav<Nbody>(), 35 }; 87 CalcGrav<SPJMonopole>(), 36 88 ptcl, dinfo); 37 template <class TPJ> 89 } 38 struct CalcGrav{ 90 39 void operator () (const Nbody * ip, 91 int main(int argc, char *argv[]) { 40 const S32 ni, 92 PS :: Initialize(argc, argv); 41 const TPJ * jp, 93 F32 time = 0.0; 42 const S32 nj, 94 const F32 tend = 10.0; 43 Nbody * force) { 95 const F32 dtime = 1.0 / 128.0; 44 for(S32 i=0; i<ni; i++) { 96 PS :: DomainInfo dinfo; 45 F64vec xi = ip[i].pos; 97 dinfo.initialize(); 46 F64 ep2 = ip[i].eps 98 PS :: ParticleSystem<Nbody> ptcl; 47 * ip[i].eps; 99 ptcl.initialize(); 48 F64vec ai = 0.0; 100 PS :: TreeForForceLong<Nbody, Nbody, Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-7 101 Nbody> :: Monopole grav; A particle class for FDPS must provide public member func- 102 grav.initialize(0); tions getPos, copyFromFP, copyFromForce,and clear,so 103 ptcl.readParticleAscii(argv[1]); that the internal functions of FDPS can access the data 104 calcGravAllAndWriteBack(dinfo, within the particle class. For the name of the particle class 105 ptcl, itself and the names of the member variables, a user can use 106 grav); whatever names allowed by the C++ syntax. The member 107 while(time < tend) { functions predict and correct are used to integrate the 108 predict(ptcl, dtime); orbits of particles. These are not related to FDPS. Since the 109 calcGravAllAndWriteBack(dinfo, interaction is pure 1/r type, the construction method for 110 ptcl, superparticles provided by FDPS can be used and they are 111 grav); not shown here. 112 correct(ptcl, dtime); In the third part, the interaction functions f and f are 113 time += dtime; defined. In this example, actually they are the same, except 114 } for the class definition for j particles. Therefore, this argu- 115 PS :: Finalize(); ment is given as an argument with the template data type 116 return 0; TPJ, so that a single source code can be used to generate two 117 } functions. The interaction function used in FDPS should have the following five arguments. The first argument, ip, In the following we describe how this sample code is the pointer to the array of i particles which receive the works. It consists of four parts: the declaration to use FDPS interaction. The second argument, ni, is the number of i (lines 1 and 2), the definition of the particle (the vector particles. The third argument, jp, is the pointer to the array u ) (lines 4 to 35), the definition of the gravitational force of j particles or superparticles which exert the interaction. (the functions f and f ) (lines 37 to 61), and the actual The fourth argument, nj, is the number of j particles or user program. The actual user program consists of a main super-particles. The fifth argument, force, is the pointer routine and functions which call library functions of FDPS to the array of variables of a user-defined class in which (lines 63 to line 117). In the following, we discuss each of the calculated interactions on i particles can be stored. In them. this example we used the particle class itself, but it can be In order to use FDPS, the user program should include another class or a simple array. the header file “particle_simulator.hpp.” All functionalities In this example, the interaction function is a function of the standard FDPS library are implemented as the header object declared as a struct, with the only member func- source library, since they are template libraries which need tion operator (). FDPS can also accept a function pointer to receive particle class and interaction functions. FDPS for the interaction function, which would look a bit more data types and functions are in the namespace “PS.” In this familiar to most readers. In this example, the interaction sample program, we declare “PS” as the default namespace is calculated through a simple double loop. In order to to simplify the code. In this sample, however, we did not achieve high efficiency, this part should be replaced by a omit “PS” for FDPS functions and class templates, to show code optimized for specific architectures. In other words, a that they come from FDPS. user needs to optimize just this single function to achieve FDPS defines several data types. F32/F64 are data types very high efficiency. of 32-bit and 64-bit floating points. S32 is the data type In the fourth part, the main routine and user-defined of a 32-bit signed integer. F64vec is the class of a vector functions are defined. In the following, we describe the main consisting of three 64-bit floating points. This class provides routine in detail, and briefly discuss other functions. The several operators, such as the addition, subtraction, and main routine consists of the following seven steps. inner product indicated by “∗.” It is not necessary to use (1) Initialize FDPS (line 92). these data types in the user program, but in some of the (2) Set simulation time and timestep (lines 93 to 95). functions the user should provide these data types for the (3) Create and initialize objects of FDPS classes (lines 96 return value. to 102). In the second part, the particle data type (i.e., the (4) Read-in particle data from a file (line 103). vector u ) is defined as class Nbody. It has the following (5) Calculate the gravitational forces on all the particles at member variables: mass (m ), eps ( ), pos (r ), vel (v ), i i i i the initial time (lines 104 to 106). and acc (dv /dt). Although the member variable acc does (6) Integrate the orbits of all the particles with the Leap- not appear in equations (3)–(7), we need this variable Frog method (lines 107 to 114). to store the result of the gravitational force calculation. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-8 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 (7) Finish the use of FDPS (line 115). TreeForForceLong. This function takes the user-defined function object CalcGrav as the first and second arguments, In the following, we describe steps 1, 3, 4, 5, and 7, and and calculates particle–particle and particle-superparticle skip steps 2 and 6. In step 2, we do not call FDPS libraries. interactions using them. Although we call FDPS libraries in step 6, the usage is the In step 7, the FDPS function Finalize is called. It calls same as in step 5. the MPI_Finalize function. In step 1, the FDPS function Initialize is called. In In this section, we have described in detail what a user this function, MPI and OpenMP libraries are initialized. If program written using FDPS looks like. As we stated earlier, neither of them are used, this function does nothing. All this program can be compiled with or without paralleliza- functions of FDPS must be called between this function and tion using MPI and/or OpenMP, without any change in the function Finalize. the user program. The executable parallelized with MPI is In step 3, we create and initialize three objects of the generated by using an appropriate compiler with MPI sup- FDPS classes: port and a compile-time flag. Thus, a user need not worry dinfo: An object of class DomainInfo. It is used for about complicated bookkeeping necessary for paralleliza- domain decomposition. tion using MPI. ptcl: An object of class template ParticleSystem. In the next section, we describe how FDPS provides a It takes the user-defined particle class (in this example, generic framework which takes care of parallelization and Nbody) as the template argument. From the user program, bookkeeping for particle-based simulations. this object looks as an array of i particles. grav: An object of data type Monopole defined in class template TreeForForceLong. This object is used for the 3 Implementation calculation of long-range interaction using the tree algo- In this section, we describe how the operations discussed rithm. It receives three user-defined classes as template in the previous section are implemented in FDPS. In arguments: the class to store the calculated interaction, subsection 3.1 we describe the domain decomposition and the class for i particles and the class for j particles. In this particle exchange, and in subsection 3.2, the calculation of example, all these three classes are the same as the original interactions. Part of the contents of this section have been class of particles. It is possible to define classes with min- published in Iwasawa et al. (2015). imal data for these purposes and use them here, in order to optimize the cache usage. The data type Monopole indi- cates that the center-of-mass approximation is used for 3.1 Domain decomposition and particle exchange superparticles. In this subsection, we describe how the domain decomposi- In step 4, the data of particles are read from a file into the tion and the exchange of particles are implemented in FDPS. object ptcl, using FDPS function readParticleAscii.In We used the multisection method (Makino 2004)withthe this function, a member function of class Nbody, readAscii, so-called sampling method (Blackston & Suel 1997). The is called. Note that the user can write and use his/her own multisection method is a generalization of ORB (Orthog- I/O functions. In this case, readParticleAscii is unneces- onal Recursive Bisection). In ORB, as its name suggests, sary. bisection is applied to each coordinate axis recursively. In In step 5, the forces on all particles are calculated the multisection method, division in one coordinate is not through the function calcGravAllAndWriteBack,which to two domains but to an arbitrary number of domains. is defined in lines 79 to 89. In this function, steps 1 Since one dimension can be divided to more than two sec- to 4 in sub-subsection 2.1.1 are performed. In other tions, it is not necessary to apply divisions many times. words, all of the actual work of FDPS libraries to Thus we apply divisions only once to each coordinate axis. calculate interaction between particles takes place here. A practical advantage of this method is that the number of For step 1, decomposeDomainAll, a member function processors is not limited to powers of two. of class DomainInfo is called. This function takes the Figure 3 illustrates an example of the multisection object ptcl as an argument to use the positions of par- method with (n , n , n ) = (7, 6, 1). We can see that the size x y z ticles to determine the domain decomposition. Step 2 and shape of subdomains show large variation. By allowing is performed in exchangeParticle, a member func- this variation, FDPS achieves quite good load balance and tion of class ParticleSystem. This function takes the high scalability. Note that n = n n n is the number of MPI x y z object dinfo as an argument and redistributes particles processes. By default, values of n , n ,and n are chosen so x y z 1/3 among MPI processes. Steps 3 and 4 are performed in that they are integers close to n . For figure 3, we force the calcForceAllAndWriteBack, a member function of class numbers used to make a two-dimensional decomposition. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-9 A naive implementation of the above algorithm requires “global” sorting of sample particles over all (i, 0, 0) pro- cesses. In order to simplify this part, before each pro- cess sends the sample particles to (i, 0, 0) processes, they exchange their samples with other processes with the same location in y-and z-process coordinates, so that they have sample particles in the current domain decomposition in the x-direction. As a result, particles sent to (i, 0, 0) processes are already sorted at the level of domain decomposition in the x-direction, and we need only sort within each of the (i, 0, 0) processes to obtain the globally sorted particles. Thus, our implementation of parallelized domain decomposition is as follows. (1) Each process samples particles randomly from its own particles. In order to achieve an optimal load balance, the sampling rate of particles is changed so that it is proportional to the CPU time per particle spent on that process (Ishiyama et al. 2009). FDPS provides several Fig. 3. Example of the domain decomposition. The division is 7 × 6in options including this optimal balance. two dimensions. (Color online) (2) Each process exchanges the sample particles according to the current domain boundary in the x-direction with In the sampling method, first each process performs the process with the same y and z indices, so that they random sampling of particles under it, and sends them to have sample particles in the current domain decompo- the process with rank 0 (“root” process hereafter). Then sition in the x-direction. the root process calculates the division so that sample (3) Each process with index (i, y, z) sends the sample par- particles are equally divided over all processes, and broad- ticles to the process with index (i, 0, 0), in other words, casts the geometry of domains to all other processes. In the root processes in each y–z plane collects subsam- order to achieve good load balance, sampling frequency ples. should be changed according to the calculation cost per (4) Each root process sorts the sample particles in the particle (Ishiyama et al. 2009). x-direction. Now, the sample particles are sorted glob- The sampling method works fine, if the number of par- ally in the x-direction. ticles per process is significantly larger than the number of (5) Each root process sends the number of the sample par- process. This is, however, not the case for runs with a large ticles to the other root processes and determines the number of nodes. When the number of particles per process global rank of the sample particles. is not much larger than the number of processes, the total (6) The x coordinate of new domains is determined by number of sample particles which the root process needs to dividing all sample particles into n subsets with equal handle exceeds the number of particles per process itself, numbers of sample particles. and thus calculation time of domain decomposition in the (7) Each root process exchanges sample particles with root process becomes visible. other root processes, so that they have the sample par- In order to reduce the calculation time, we also paral- ticles in new domain in the x-direction. lelized the domain decomposition, currently in the direction (8) Each root process determines the y and z coordinates of the x-axis only. The basic idea is that each node sends of new domains. the sample particles not to the root process of the all MPI (9) Each root process broadcasts the geometries of new processes but to the processes with index (i, 0, 0). Then pro- domains to all other processes. cesses (i, 0, 0) sort the sample particles and exchange the number of sample particles they received. Using these two It is also possible to parallelize the determination of sub- pieces of information, each of (i, 0, 0) processes can deter- domains in step 8, but even for the full-node runs on the K mine all domain boundaries inside its current domain in the computer we found the current parallelization is sufficient. x-direction. Thus, they can determine which sample parti- For particle exchange and also for interaction infor- cles should be sent to where. After the exchange of sample mation exchange, we use MPI_Alltoall to exchange the particles, each of the (i, 0, 0) processes can determine the length of the data and MPI_Isend and MPI_Irecv to actually decompositions in the y-and z-directions. exchange the data. At least on the K computer, we found Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-10 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 that the performance of vendor-provided MPI_Alltoall is computer of the RIKEN Advanced Institute for Computa- not optimal for short messages. We implemented a hand- tional Science (AICS), and the other is the Cray XC30 of the crafted version in which the messages sent to the same relay Center for Computational Astrophysics (CfCA), National points are combined in order to reduce the total number of Astronomical Observatory of Japan. The K computer con- messages. sists of 82944 Fujitsu SPARC64 VIIIfx processors, each After the domain decomposition is done and the result is with eight cores. The theoretical peak performance of one broadcasted to all processes, they exchange particles so that core is 16 Gflops, for both single- and double-precision each of them has particles in its domain. Since each process operations. Cray XC30 of the CfCA consists of 1060 has the complete information of the domain decomposi- nodes, or 2120 Intel Xeon E5-2690v3 processors (12 cores, tion, this part is pretty straightforward to implement. Each 2.6 GHz). The theoretical peak performance of one core process looks at each of its particles, and determines if that is 83.2 and 41.6 Gflops for single- and double-precision particle is still in its domain. If not, the process determines operations, respectively. In all runs on the K computer, to which process that particle should be sent. After the des- we use the hybrid MPI-OpenMP mode of FDPS, in which tinations of all particles are determined, each process sends one MPI process is assigned to one node. On the other them out, using MPI_Isend and MPI_Irecv functions. hand, for XC30, we use the flat MPI mode of FDPS. The source code is the same except for that used for the interac- tion calculation functions. The interaction calculation part 3.2 Interaction calculation was written to take full advantage of the SIMD instruc- In this subsection, we describe the implementations of inter- tion set of the target architecture, and thus written specif- action information exchange and actual interaction calcu- ically for SPARC64 VIIIfx (HPC-ACE instruction set) and lation. In the interaction information exchange step, each Xeon E5 v3 (AVX2 instruction set). process sends the data required by other nodes. In the interaction calculation step, actual interaction calculation 4.1 Disk galaxy simulation is done using the received data. For both steps, the Barnes– Hut octree structure is used, for both long- and short-range In this subsection, we discuss the performance and scal- interactions. ability of a gravitational N-body simulation code imple- First, each process constructs the tree of its local parti- mented using FDPS. Some results in this section have been cles. This tree is then used to determine the data to be sent published in Iwasawa et al. (2015). The code is essentially to other processes. For the long-range interaction, the pro- the same as the sample code described in subsection 2.2, cedure is the same as that for usual tree traversal (Barnes & except for the following two differences in the user code for Hut 1986; Barnes 1990). The tree traversal is used also for the calculation of the interaction. First, to improve the accu- short-range interactions. FDPS can currently handle four racy, we used the expansion up to the quadrupole moment, different types of cutoff length for the short-range interac- instead of the monopole-only one used in the sample code. tions: fixed, j-dependent, i-dependent, and symmetric. For Secondly, we used the highly optimized kernel developed i-dependent and symmetric cutoffs, the tree traversal should using SIMD built-in functions, instead of the simple one in be done twice. the sample code. Using the received data, each process performs the force We apply this code for the simulation of a Milky Way- calculation. To do so, it first constructs the tree of all data like galaxy, which consists of a bulge, a disk, and a dark received and local particles, and then uses it to calculate the matter halo. For examples of recent large-scale simulations, interaction of local particles. see Fujii et al. (2011)and Bedorf ´ et al. (2014). The initial condition is the Milky Way model, which is the same as that in Bedorf ´ et al. (2014). The mass of 4 Performance of applications developed 9 the bulge is 4.6 × 10 M , and it has a spherically sym- using FDPS metric density profile of the Hernquist model (Hernquist In this section, we present the performance of three astro- 1990) with a half-mass radius of 0.5 kpc. The disk is an physical applications developed using FDPS. One is the pure axisymmetric exponential disk with a scale radius of 3 kpc, gravity code with open boundary applied to disk galaxy scale height of 200 pc and mass 5.0 × 10 M . The dark simulation. The second one is again a pure gravity appli- halo has a Navarro–Frenk–White (NFW) density profile cation but with periodic boundary applied to cosmological (Navarro et al. 1996) with the half-mass radius of 40 kpc simulation. The third one is a gravity + SPH calculation and the mass of 6.0 × 10 M . In order to realize the Milky applied to the giant impact (GI) simulation. For the perfor- Way model, we used GalacticICS (Widrow & Dubinski mance measurement, we used two systems. One is the K 2005). For all simulations in this subsection, we adopt Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-11 XC30 10 50% of TPP (K) 35% of TPP (XC30) -1 -1 -2 total -3 Fig. 4. Face-on surface density maps of the bulge and disk. (Color online) domain decomposition exchange particle grav -4 1 2 3 4 5 6 θ = 0.4 for the opening angle for the tree algorithm. We 10 10 10 10 10 10 # of cores set the average number of particles sampled for the domain decomposition to 500. Fig. 5. Weak-scaling performance of the gravitational N-body code. Figure 4 illustrates the time evolution of the bulge and The speed of the floating-point operation (top) and wall-clock time per disk in the run with 512 nodes on the K computer. The disk timestep (bottom) are plotted as functions of the number of cores. Open and filled symbols indicate the performances of the K computer and is initially axisymmetric. We can see that a spiral structure cray XC30, respectively. In the top panel, the solid line indicates 50% develops (0.5 and 1 Gyr) and a central bar follows the spiral of the theoretical peak performance of the K computer and the dotted (1 Gyr and later). As the bar grows, the two-arm structure line indicates 35% of the theoretical peak performance of XC30. In the becomes more visible (3 Gyr). bottom panel, time spent for the interaction calculation (diamond), the domain decomposition (square) the exchange particles (triangle) are Figure 5 shows the measured weak-scaling performance. also shown. (Color online) We fixed the number of particles per core to 266367 and measured the performance for the number of cores in the Bedorf ´ et al. (2014) reported the wall-clock time of 4 s range of from 4096 to 663552 on the K computer, and in for their 27-billion particle simulation on the Titan system the range of from 32 to 2048 on XC30. We can see that with 2048 NVIDIA Tesla K20X, with the theoretical peak the measured efficiency and scalability are both very good. performance of 8PF (single precision, since the single pre- The efficiency is more than 50% for the entire range of cision was used for the interaction calculation). This corre- cores on the K computer. The efficiency of XC30 is a bit sponds to 0.8 billion particles per second per petaflops. Our worse than that of the K computer. This difference comes code on the K computer requires 15 s on 16384 nodes (2PF from the difference of two processors. The Fujitsu pro- theoretical peak), resulting in 1 billion particles per second cessor showed higher efficiency, while the Intel processor per petaflops. Therefore, we can conclude that our FDPS has 5.2 times higher peak performance per core. We can code achieved a performance that is slightly better than see that the time for domain decomposition increases as one of the best codes specialized to gravitational N-body we increase the number of cores. The slope is around 2/3, problems. as can be expected from our current algorithm discussed in subsection 3.1. 4.2 Cosmological simulation Figure 6 shows the measured strong-scaling perfor- mance. We fixed the total number of particles to 550 million In this subsection, we discuss the performance of a cos- and measured the performance for 512 to 32768 cores on mological simulation code implemented using FDPS. We the K computer and 256 to 2048 cores on XC30. We can implemented TreePM (Tree Particle-Mesh) method and also see that the measured efficiency and scalability are both measured the performance on XC30. Our TreePM code very good, for the strong-scaling performance. is based on the code developed by K. Yoshikawa. The wallclock time per timestep[s] performance[TFLOPS] Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-12 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 XC30 50% of TPP (K) 35% of TPP (XC30) 10 XC30 7% of TPP (XC30) total domain decomposition exchange particle 1 grav -1 -1 total domain decomposition -2 -2 exchange particle grav PP -3 grav PM -3 2 3 4 5 2 3 4 10 10 10 10 10 10 10 # of cores # of cores Fig. 6. Same figure as figure 5 but for the strong-scaling performance Fig. 7. Weak-scaling performance of the TreePM code. The speed of the for 550 million particles. (Color online) floating-point operation (top) and wall-clock time per timestep (bottom) are plotted as functions of the number of cores. In the top panel, the solid line indicates 7% of the theoretical peak performance of XC30. In the bottom panel, time spent for the particle–particle interaction calcu- Particle-Mesh part of the code was developed by Ishiyama, lation (diamond), the particle–mesh interaction (pentagon), the domain Nitadori, and Makino (2012) and this code is included in decomposition (square) and the exchange particles (triangle) are also the FDPS package as an external module. shown. (Color online) We initially place particles uniformly in a cube and gave them zero velocity. For the calculation of the tree force, we for large number of cores, because we did not use paral- used a monopole-only kernel with cutoff. The cutoff length lelized domain decomposition here. The efficiency is 7% of of the force is three times larger than the width of the mesh. the theoretical peak performance. It is rather low compared We set θ to 0.5. For the calculation of the mesh force, the to that for the disk galaxy simulations in subsection 4.1. The mass density is assigned to each of the grid points, using the main reason is that we use a lookup table for the force cal- triangular-shaped cloud scheme, and the density profile we culation. If we evaluate the force without the lookup table, used is the S2 density profile (Hockney & Eastwood 1988). the nominal efficiency would be much better, but the total Figures 7 and 8 show the weak and strong scaling per- time would be longer. formances, respectively. For the weak-scaling measurement, we fixed the number of particles per process to 5.73 million 4.3 Giant impact simulation and measured the performance for the number of cores in the range of 192 to 12000 on XC30. For the strong-scaling In this section, we discuss the performance of an SPH sim- measurements, we fixed the total number of particles to ulation code with self-gravity implemented using FDPS. 2048 and measured the performance for the number of Some results in this section have been published in Iwasawa cores in the range of 1536 to 12000 on XC30. We can see et al. (2015). The test problem used is the simulation of GI. that the time for the calculation of the tree force is dominant The GI hypothesis (Hartmann & Davis 1975; Cameron and both of the weak and strong scalings are good except for & Ward 1976) is one of the most popular scenarios for the very large number of cores (12000) for the strong scaling the formation of the Moon. The hypothesis is as follows. measurement. One reason is that the scalability of the calcu- About five billion years ago, a Mars-sized object (hereafter, lation of the mesh force is not very good. Another reason is the impactor) collided with the proto-Earth (hereafter, the that the time for the domain decomposition grows linearly target). A large amount of debris was scattered, which first wallclock time per timestep[s] performance[TFLOPS] wallclock time per timestep[s] performance[TFLOPS] Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-13 XC30 7% of TPP (XC30) total domain decomposition exchange particle grav PP 10 grav PM Fig. 9. Temperature maps of the target and impactor in the run with 9.9 million particles at four different epochs. (Color online) -1 disrupted (t = 2847 s) and debris are ejected. A part of the 3 4 10 10 debris falls back to the target, while the rest will eventually # of cores form the disk and the Moon. So far, the resolution used in Fig. 8. Same as figure 7 but for the strong-scaling performance. In this the published papers has been much lower. We plan to use case, the number of particles is 2048 . (Color online) this code to improve the accuracy of the GI simulations. Figures 10 and 11 show the measured weak and strong scaling performance. For the weak-scaling measurement, formed the debris disk and eventually the Moon. Many we fixed the number of particles per core to 20000 and mea- researchers have performed simulations of GI, using the sured the performance for the number of cores in the range SPH method (Benz et al. 1986; Canup et al. 2013; Asphaug of 256 to 131072 on the K computer. On the other hand, for & Reufer 2014). the strong-scaling measurement, we fixed the total number For the gravity, we used a monopole-only kernel with of particles to 39 million and measured the performance for θ = 0.5. We adopt the standard SPH scheme (Monaghan the number of cores in the range of 512 to 16384 on the 1992; Rosswog 2009; Springel 2010) for the hydro part. K computer. We can see that the performance is good even Artificial viscosity is used to handle shocks (Monaghan for a very large number of cores. The efficiency is about 1997), and the standard Balsara switch is used to reduce 40% of the theoretical peak performance. The hydro part the shear viscosity (Balsara 1995). The kernel function we 6 consumes more time than the gravity part does, mainly used is the Wendland C and the cutoff radius is 4.2 times because the particle–particle interaction is more compli- larger than the local mean inter-particle distance. In other cated. words, each particle interacts with about 300 other parti- cles. This neighbor is appropriate for this kernel to avoid the pairing instability (Dehnen & Aly 2012). 5 Performance model Assuming that the target and impactor consist of granite, In this section, we present the performance model of appli- we adopt the equation of state of granite (Benz et al. cations implemented using FDPS. As described in section 2, 1986) for the particles. For the initial condition, we the calculation of a typical application written using FDPS assume a parabolic orbit with the initial angular momentum proceeds in the following steps. 1.21 times that of the current Earth–Moon system. Figure 9 shows the time evolution of the target and (1) Update the domain decomposition and exchange par- impactor for a run with 9.9 million particles. We can see ticles accordingly (not in every timestep). that the shocks are formed just after the moment of impact (2) Construct the local tree structure and exchange par- in both the target and impactor (t = 2050 s). The shock ticles and superparticles necessary for an interaction propagates in the target, while the impactor is completely calculation. wallclock time per timestep[s] performance[TFLOPS] Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-14 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 3 3 10 10 40% of TPP 2 2 10 10 1 1 10 10 40% of TPP 0 0 10 10 total domain decomposition 10 1 exchange particle hydr grav -1 -1 -2 total domain decomposition -2 -3 10 exchange particle hydr grav -4 -3 10 10 2 3 4 5 6 2 3 4 5 10 10 10 10 10 10 10 10 10 # of cores # of cores Fig. 10. Weak-scaling performance of the SPH code. The speed of the Fig. 11. Same as figure 10 but for the strong-scaling performance for floating-point operation (top) and wall-clock time per timestep (bottom) 39 million particles. (Color online) are plotted as functions of the number of cores. In the top panel, the solid line indicates 40% of the theoretical peak performance of the K term of the right hand side of equation 8, and finally we computer. In the bottom panel, time spent for the hydrodynamics cal- compare the model with the actual measurement presented culation (cross), the gravity calculation (diamond), the domain decom- position (square), and the exchange particles (triangle) are also shown. in subsection 5.6. (Color online) 5.1 Communication model (3) Construct the “global” tree. (4) Perform the interaction calculation. What ultimately determines the efficiency of a calculation (5) Update the physical quantities of particles using the performed on a large-scale parallel machine is the commu- calculated interactions. nication overhead. Thus, it is very important to understand what types of communication would take what amount In the case of complex applications which require more of time on actual hardware. In this section, we summa- than one interaction calculation, each of the above steps, rize the characteristics of the communication performance except for the domain decomposition, may be executed of the K computer. more than one time per timestep. In FDPS, almost all communications are through the For a simple application, thus, the total wall-clock time use of collective communications, such as MPI_Allreduce, per timestep should be expressed as MPI_Alltoall,and MPI_Alltoallv. However, measure- ment of the performance of these routines for uniform T = T /n + T + T + T + T , (8) step dc dc lt exch icalc misc message length is not enough, since the amount of data where T , T , T , T ,and T are the times for domain to be transferred between processes generally depends on dc lt exch icalc misc composition and particle exchange, local tree construc- the physical distance between domains assigned to those tion, exchange of particles and superparticles for interaction processes. Therefore, we first present the timing results for calculation, interaction calculation, and other calculations simple point-to-point communication, and then for collec- such as particle update, respectively. The term n is the tive communications. dc interval at which the domain decomposition is performed. Figure 12 shows the elapsed time as the function of the In the following, we first construct the model for the message length, for point-to-point communication between communication time. Then we construct models for each “neighboring” processes. In the case of the K computer, we wallclock time per timestep[s] performance[TFLOPS] wallclock time per timestep[s] performance[TFLOPS] Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-15 Table 1. Time coefficients in equation (10). n = 32 n = 256 n = 2048 p p p -1 T [ms] 0.103 0.460 2.87 alltoallv,startup −1 −6 −5 −3 T [ms byte ]8.25 × 10 9.13 × 10 1.32 × 10 alltoallv,word -2 2 10 10 8byte 32kbyte -3 0 1 2 3 4 5 6 10 10 10 10 10 10 10 size[byte] -1 Fig. 12. Elapsed time for point-to-point communication as a function of size of message measured on the K computer. -2 1 2 3 n =32 2 10 10 10 n =256 n =2048 p Fig. 14. Elapsed time of MPI_Alltoallv to send message of 8 bytes (cir- cles) and 32 kilobytes (squares) as a function of the number of processes measured on the K computer. Solid and dashed curves indicate the results for the message sizes of 8 bytes and 32 kilobytes, respectively. (Color online) where T is the startup time and T is the alltoallv,startup alltoallv,word time to transfer one byte of message. We list these values in -1 table 1. 1 2 3 4 5 10 10 10 10 10 The coefficients themselves in equation (10) depend on size[byte] the number of MPI processes, n , as shown in figure 14. They are modeled as Fig. 13. Elapsed time of MPI_Alltoallv as a function of message size measured on the K computer. (Color online) T = τ n , (11) alltoallv,startup alltoallv,startup p used a three-dimensional node allocation, so that “neigh- 4/3 T = τ n . (12) alltoallv,word alltoallv,word boring” processes are actually close to each other in the torus network. Here we assume that the speed to transfer a message using We can see that the elapsed time can be fitted reasonably MPI_Alltoallv is limited to the bisection bandwidth of the well as system. Under this assumption, T should be pro- alltoallv,word 4/3 portional to n . To estimate τ and τ , alltoallv,startup alltoallv,word T = T + n T , (9) p2p p2p,startup word p2p,word we use measurements for message sizes of 8 bytes and 32 kilobytes. In the K computer, we found that τ is alltoallv,startup where T is the startup time which is independent p2p,startup −7 0.00166 ms and τ is 1.11 × 10 ms per byte. If alltoallv,word of the message length, and T is the time to transfer p2p,word MPI_Alltoallv is limited to the bisection bandwidth in the one byte of the message. Here, n is the length of the word −8 K computer, τ would be 5 × 10 ms per byte. We alltoallv,word message in units of bytes. On the K computer, T p2p,startup can see that the actual performance of MPI_Alltoallv on −7 is 0.0101 ms and T is 2.11 × 10 ms per byte. For p2p,word the K computer is quite good. a short message, there is a rather big discrepancy between the analytic model and measured points, because for short 5.2 Domain decomposition messages the K computer used several different algorithms. Figure 13 shows the elapsed times for MPI_Alltoallv. For the hierarchical domain decomposition method The number of processes, n , is 32 to 2048. They are again described in subsection 3.1, the calculation time is modeled by the simple form expressed as T = T + n T , (10) T = T + T + T + T , (13) alltoallv alltoallv,startup word alltoallv,word dc dc,gather dc,sort dc,exch dc,misc T [ms] T [ms] alltoallv p2p T [ms] alltoallv Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-16 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 where T is the time for the (i, 0, 0) process to collect dc,gather sample particles, T is the time to sort sample particles dc,sort on the (i, 0, 0) process, T is the time to exchange dc,exch -2 particles after the new domains are determined, and T 10 dc,misc is the time for remaining procedures such as initial exchange of samples in the x-direction, exchange of sample particles and domain boundaries in the x-direction, and broadcasting of the domain boundaries in the y–z plane. On the machines we tested, T and T are much dc,gather dc,misc smaller than T and T . Therefore we consider these -3 dc,sort dc,exch 2 3 two terms only. 10 10 First, we consider the time to sort sample particles. Since we use the quick sort, the term T is expressed as dc,sort Fig. 15. Measured T and its analytical model as a function of n ,in dc,sort p T = τ 2n n n log(n n n ) dc,sort sort smp y z smp y z thecaseof n = 500 and n ∼ 5.3 × 10 . smp 2/3 + n n n log(n n ) ∼ τ n n , (14) y smp z smp z dc,sort smp 0.01 where n is the average number of sample particles per smp process, and n , n and n are the numbers of processes x y z in the x, y and z directions, respectively. Here, τ ∼ dc,sort 3 5/3 log(n n )τ . The first term expresses the time to sort sort smp p samples in the y–z plane with respect to the x and y direc- tions. The second term expresses that time to sort samples with respect to the z-direction. In order to model T , we need to model the number dc,exch of particles which moves from one domain to another. This 0.002 number would depend on various factors, in particular the 2 3 10 10 nature of the system we consider. For example, if we are calculating the early phase of the cosmological structure formation, particles do not move much in a single timestep, Fig. 16. Measured T and its analytical model as a function of n ,in dc,exch and thus the number of particles moved between domains thecaseof n = 500 and n ∼ 5.3 × 10 . smp is small. On the other hand, if we are calculating a single virialized self-gravitating system, particles move relatively large distances (comparable to the average interparticle dis- tance) in a single timestep. In this case, if one process con- tains n particles, half of the particles in the “surface” of the domain might migrate in and out of the domain. Thus, -2 2/3 O(n ) particles could be exchanged in this case. 10 Figures 15, 16,and 17 show the elapsed time for sorting samples, exchanging samples, and domain decomposition for the case of disk galaxy simulations with n = 500 and smp n ∼ 5.3 × 10 . We also plot the analytical models given 2 3 10 10 by T ∼ T + T dc dc,sort dc,exch Fig. 17. Measured T and its analytical model as a function of n ,inthe 2/3 2/3 dc p = τ n n + τ σt/r n b , (15) dc,sort smp dc,exch p case of n = 500 and n ∼ 5.3 × 10 . smp where τ and τ are the execution time for sorting dc,sort dc,exch one particle and for exchanging one particle, respectively, In figure 16, for small n , the analytical model gives a σ is the typical velocity of particles, t is the timestep, and value about two times smaller than the measured point. r is the average interparticle distance. For simplicity we This is because the measured values include not only the ignore the weak log term in T . On the K computer, time to exchange particles but also the time to determine dc,sort −7 −7 τ = 2.67 × 10 sand τ = 1.42 × 10 s per byte. appropriate processes to send particles, while the analytical dc,sort dc,exch Note that τ ∼ 672 T . model includes only the time to exchange particles. For dc,exch p2p,word T [s] T [s] T [s] dc dc,exch dc,sort Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-17 -1 small n , the time to determine the appropriate process is long not negligible, and therefore the analytic model gives an short underestimation. The analysis above, however, indicates that T is, dc,exch even when it is relatively large, still much smaller than T , exch which is the time to exchange particles and superparticles for interaction calculations (see subsection 5.4). 5.3 Tree construction -2 Theoretically, the cost of tree construction is O(nlogn), and 5 is of the same order as the interaction calculation itself. exch,list However, in our current implementation, the interaction calculation is much more expensive, independent of target Fig. 18. Time for the construction of the exchange list plotted against the average length of the list, for the case of n = 2048 and n ∼ 2.7 × architecture and the type of the interaction. Thus, we ignore 5 5 6 6 10 ,5.3 × 10 ,1.1 × 10 ,and2.1 × 10 . Circles and squares indicate the time for the tree constructions. the results for long-range and short-range force, respectively. Solid and dashed curves are analytic models [equation (17)]. (Color online) 5.4 Exchange of particles and superparticles long short For the exchange of particles and superparticles, in the cur- rent implementation of FDPS, first each node constructs the -1 list of particles and superparticles (hereafter the exchange 10 list) to be sent to all other nodes, and then data are exchanged through a single call to MPI_Alltoallv.Theway the exchange list is constructed depends on the force cal- culation mode. In the case of long-range forces, usual tree -2 traversal with a fixed opening angle θ is performed. For the short-range forces, the procedure used depends on the sub- types of the interaction. In the case of fixed or j-dependent n /n exch,list p cutoff, the exchange list for a node can be constructed by a single traversal of the local tree. On the other hand, for Fig. 19. Time for the communication of the exchange list against the i-dependent or symmetric cutoff, first each node constructs average length of the list per process, for the case of n = 2048 and the j-dependent exchange lists and sends them to all other 5 5 6 6 n ∼ 2.7 × 10 ,5.3 × 10 ,1.1 × 10 ,and 2.1 × 10 . Circles and squares indicate the results for long-range and short-range force, respectively. nodes. Each node then constructs the i-dependent exchange The curve is predicted from equation (17). (Color online) lists and sends them again. The time for the construction and exchange of exchange list is thus given by T (n ) = T n /n b , (18) exch,comm msg alltoallv exch,list p p where n is the average length of the exchange list exch,list T = k (T + T ). (16) exch type exch,const exch,comm and τ is the execution time for constructing one exch,const exchange list. Figures 18 and 19 show the execution time Here, k is a coefficient that is unity for fixed and type for constructing and exchanging the exchange list against j-dependent cutoffs and two for other cutoffs. Strictly the average length of the list. Here, b = 48 bytes for both speaking, the communication cost does not double for i- short- and long-range interactions. From figure 18,wecan dependent or symmetric cutoffs, since we send only par- see that the elapsed time can be fitted well by equation (17). ticles which were not sent in the first step. However, for −7 Here τ is 1.12 × 10 s for long-range interaction exch,const simplicity we use k = 2 for both calculation and communi- −7 and 2.31 × 10 s for short-range interaction. cation. From figure 19,wecan seealargediscrepancy The two terms in equation (16) are then approximated between measured points and the curves predicted from as equation (18). In the measurement of the performance of MPI_Alltoallv in subsection 5.1, we used a uniform mes- T = τ n , (17) exch,const exch,const exch,list sage length across all processes. In actual use in exchange T [s] T [s] exch,comm exch,const Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-18 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 long long short short -1 7 8 10 10 nn /n icalc,list grp Fig. 21. Time for the construction of the interaction list for long-range force (circles) and short-range force (squares), for the case of n ∼ 5.3 Fig. 20. Average length of the exchange lists for long-range interaction × 10 and θ = 0.4. Solid and dashed curves are the analytical models (circles) and for short-range interaction (squares) as a function of n,in for long-rage and short range forces, respectively [equation (22)]. (Color the case of θ = 0.4 and n = 2048. Solid and dashed curves are predicted online) from equations (19) and (20), respectively. (Color online) 5.5 Tree traverse and interaction calculation particles, the length of the message is not uniform. Neigh- The time for the force calculation is given by boring processes generally exchange large messages, while distant processes exchange short messages. For such cases, T = T + T , (21) icalc icalc,force icalc,const theoretically, communication speed measured in terms of average message length should be faster. In practice, how- where T and T are the time for the force cal- icalc,force icalc,const ever, we observed a serious degradation of performance. culations for all particles and the tree traverses for all inter- This degradation seems to imply that the current imple- action lists, respectively. mentation of MPI_Alltoallv is suboptimal for non-uniform T and T are expressed as icalc,force icalc,const message size. T = τ nn /n , (22) icalc,const icalc,const icalc,list grp In the following, we estimate n . If we consider a exch,list rather idealized case, in which all domains are cubes con- T = τ nn , (23) icalc,force icalc,force icalc,list taining n particles, the total length of the exchange lists for where n is the average length of the interaction list, icalc,list one domain can approximately be given by n is the number of i-particle groups for modified tree grp algorithms from Barnes (1990), and τ and τ 2/3 1/3 icalc,force icalc,const 14n 21πn n ∼ + exch,list are the time for one force calculation and for constructing θ θ one interaction list. In figure 21, we plot the time for the 28π θ 1/3 1/3 + log nn − n (19) construction of the interaction list as a function of n .On 2 p grp 3θ 2.8 −8 the K computer, τ is 3.72 × 10 s for the long-range icalc,const −8 force and 6.59 × 10 s for the short-range force. for the case of long-range interactions, and The length of the interaction list is given by 2/3 1/3 14n 21πn grp grp cut n ∼ n + + 1/3 icalc,list grp n ∼ n − 1 + 2 − n (20) exch,list θ θ 28π θ 1/3 1/3 + log (nn ) − n (24) 2 p grp 3θ 2.8 for the case of short-range interactions, where r is the cut for the case of long-range interactions and average cutoff length and and r is the average interparticle distance. In this subsection we set r so that the number cut cut 1/3 n ∼ n − 1 + 2 (25) of particles in the neighbor sphere is 100. In other words, icalc,list grp r ∼ 3r. cut In figure 20, we plot the list length for short and long for the case of short-range interactions. interactions against the average number of particles. The In figure 22, we plot the length of the interaction lists for rough estimate of equations (19) and (20) agree very well the long-range and short-range forces. We can see that the with the measurements. length of the interaction lists can be fitted reasonably well. exch,list T [s] icalc,const Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-19 Table 2. Time coefficients in equation (27) long for the K computer. short −6 τ [s] 1.66 × 10 alltoallv,startup −1 −10 τ [s byte]1.11 × 10 alltoallv,word −7 τ [s] 2.67 × 10 dc,sort −7 τ [s] 1.12 × 10 10 exch,const −8 τ [s] 3.72 × 10 icalc,const −10 τ [s] 3.05 × 10 icalc,force τ is the value for gravity. icalc,force 1 2 10 10 grp Fig. 22. Average length of the interaction list for long-range force (cir- cles) and short-range force (squares), for the case of n ∼ 5.3 × 10 and θ = 0.4. Solid and dashed curves are analytical models for long-range (equation 24) and short-range (equation 25) forces. (Color online) n =8 grp n =64 grp n =512 grp -9 Fig. 24. Breakdown of the total time of the calculation per timestep against n ,for thecaseof N = 5.5 × 10 , n = 500, θ = 0.4, and p smp n = 130. (Color online) grp 1 2 3 4 10 10 10 10 2/3 ∼ τ n n /n dc,sort smp dc icalc,list + k τ n + τ n type exch,const exch,list alltoallv,startup p Fig. 23. Time for the evaluation of one gravity force against n for icalc,list various n . (Color online) grp 1/3 + τ n b n alltoallv,word exch,list p In the following, we discuss the time for the force cal- + τ nn icalc,force icalc,list culation. The time for the force calculation for one particle + τ nn /n . (27) icalc,const icalc,list grp pair τ has different values for different kinds of inter- icalc,force actions. We plot τ against n for various n in icalc,force icalc,list grp The time coefficients in equation (27) for the K computer figure 23. We can see that for larger n , τ becomes grp icalc,force are summarized in table 2.Inthissectionweuse n = 1. dc smaller. However, from equation (24), large n leads to grp To see if the time predicted by equation (27) is reason- large n and the number of interactions becomes larger. icalc,list able, we compare the predicted time and the time obtained Thus, there is an optimal n . In our disk galaxy simulations grp from the disk galaxy simulation with the total number of using the K computer, the optimal n is a few hundreds, grp particles (N) of 550 million and θ = 0.4. In our simulations, and dependence on n is weak. grp we use up to the quadrupole moment. On the other hand, we assume the monopole moment only in equation (27). Thus, we have to correct the time for the force calculation 5.6 Total time in equation (27). In our simulations, the cost of the force Now we can predict the total time of the calculation using calculation of the quadrupole moment is two times higher the above discussions. The total time per timestep is given than that of the monopole moment and about 75% of par- by ticles in the interactions list are superparticles. Thus the cost of the force calculation in the simulation is 75% higher than T ∼ T /n + k (T + T ) step dc,sort dc type exch,const exch,comm the prediction. We apply this correction to equation (27). In + T + T (26) figure 24, we plot the breakdown of the predicted time with icalc,force icalc,const τ [s] icalc,list icalc,force Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-20 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 Fig. 26. Same as figure 25, but for the floating-point operation perfor- Fig. 25. Breakdown of the total time of the calculation per timestep mance 10 times faster than the K computer. (Color online) 9 7 against n , for the case of N = 10 (top panel) and =10 (bottom panel), n = 500, θ = 0.4, and n = 300. (Color online) smp grp computer, but τ , τ , τ ,and τ are dc,sort exch,const icalc,const icalc,force 10 times smaller than those of the K computer. We can the correction and the obtained time from the disk galaxy see that the optimal n is shifted to smaller n for both p p simulations. We can see that our predicted times agree with cases of N of 1 billion and 10 million, because T exch,comm the measurements very well. is unchanged. However, the shortest time per timestep is In the following, we analyze the performance of the improved by about a factor of five. If the network perfor- gravitational many-body simulations for various hypothet- mance is also improved by a factor of 10, we would get ical computers. In figure 25, we plot the breakdown of the same factor of 10 improvement for the shortest time the calculation time predicted using equation (27) for the per timestep. In other words, by reducing the network per- cases of 1 billion and 10 million particles against n . For formance by a factor of 10, we suffer only a factor of two the case of 1 billion particles, we can see that the slope of degradation of the shortest time. T becomes shallower for n  10000 and increases for In figure 27, we plot the predicted T for three hypo- step p step n  30000, because T dominates. Note that T thetical computers and the K computer. Two of these p dc,sort exch,comm also has the minimum value. The reason is as follows. computers, the K computer and X10, are the same com- For small n , T is dominant in T and puter models we used above. Another is a computer with p alltoallv,word exch,comm it decrease as n increases, because the length of n a floating-point operation performance that is 100 times p exch,list becomes smaller. For large n , T becomes domi- faster than the K computer (hereafter X100). The last one p alltoallv,startup nant and it increases linearly. We can see the same tendency is a computer which can perform the force calculation for the case of 10 million particles. However, the optimal 10 times faster than the K computer (hereafter ACL). In n ,atwhich T is the minimum, is smaller than that for other words, only τ is 10 times smaller than that p step icalc,force 1 billion particles, because T is independent of N. of the K computer. This computer is roughly mimicking a dc,sort In figure 26, we plot the breakdown of the predicted computer with an accelerator, such as GPU (Hamada et al. calculation time for a hypothetical computer which has a 2009b), GRAPE (Sugimoto et al. 1990; Makino et al. 2003), floating-point operation performance that is 10 times faster and PEZY-SC. Here we use the optimal n ,atwhich T grp step than that of the K computer (hereafter X10). In other words, is at the minimum, for each computer. For the case of τ and τ are the same as those of the K N = 10 , the optimal n ∼ 300 for the K computer and alltoallv,startup alltoallv,word grp Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-21 From figures 25 and 26, we can see that for large n , performance will be limited by T and T . There- dc,sort exch,comm fore, if we can reduce them further, we can improve the efficiency of the calculation with large n . It is possible to reduce the time for sorting by applying the algorithm used in the x-direction to y-direction as well, or setting n to dc more than unity. It is more difficult to reduce T , exch,comm since we are using the system-provided MPI_Alltoallv. 6 Conclusion In this paper, we present the basic idea, implementation, measured performance, and performance model of FDPS, a framework for developing efficient, parallel particle-based simulation codes. FDPS provides all of these necessary func- tions for the parallel execution of particle-based simula- tions. By using FDPS, researchers can easily develop their programs which run on large-scale parallel supercomputers. For example, a simple gravitational N-body program can be written in around 120 lines. We implemented three astrophysical applications using FDPS and measured their performances. All applications showed good performance and scalability. In the case of the disk galaxy simulation, the achieved efficiency is around Fig. 27. Predicted total calculation time for three hypothetical computers 50% of the theoretical peak, 7% for the cosmological sim- and the K computer as a function of n ,for thecaseof n = 500 and p smp ulation, and 40% for the giant impact simulation. θ = 0.4. Top and bottom panels indicate the results of the case for 12 9 N = 10 and N = 10 , respectively. (Color online) We constructed the performance model of FDPS and analyzed the performance of applications using FDPS. We found that the performance for a small number of parti- X10, ∼400 for X100, and ∼1600 for ACL. For the case of cles would be limited by the time for the calculation nec- N = 10 , the optimal n for K, X10, and X100 is the same essary for the domain decomposition and communication grp as those for N = 10 , but ∼1800 for ACL. The optimal necessary for the interaction calculation. value of n for ACL is larger than those of any other grp computers, because large n reduces the cost of the con- grp Acknowledgement struction of the interaction list. From figure 27, we can see that for small n , X10, and We thank M. Fujii for providing initial conditions of spiral sim- ulations, T. Ishiyama for providing his Particle Mesh code, K. X100 are 10 and 100 times faster than the K computer, Yoshikawa for providing his TreePM code and Y. Maruyama for respectively. However, for the case of N = 10 , T of step being the first user of FDPS. We are grateful to M. Tsubouchi for her the values of X10 and X100 increase for n  15000 and help in managing the FDPS development team. This research used 7000, because the T becomes the bottleneck. ACL exch,comm computational resources of the K computer provided by the RIKEN shows a similar performance to that of X10 up to optimal Advanced Institute for Computational Science through the HPCI System Research project (Project ID:ra000008). Part of the research n , because the force calculation is dominant in the total covered in this paper research was funded by MEXT’s program for calculation time. On the other hand, for large n , the perfor- the Development and Improvement for the Next Generation Ultra mance of ACL is almost the same as that of the K computer, High-Speed Computer System, under its Subsidies for Operating the because ACL has the same bottleneck as the K computer Specific Advanced Large Research Facilities. Numerical computa- has, which is the communication of the exchange list. On tions were in part carried out on Cray XC30 at Center for Compu- the other hand, for the case of N = 10 , T is scaled up tational Astrophysics, National Astronomical Observatory of Japan. step to n ∼ 10 for all computers. This is because for larger N simulations, the costs of the force calculation and the con- References struction of the interaction list are relatively high compared to those of the communication of the exchange list. Thus Abrahama, M. J., Murtolad, T., Schulz, R., Palla, S., Smith, J., Hessa, B., & Lindahl, E. 2015, SoftwareX, 1, 19 the optimal n is larger value if we use larger N. p Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-22 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 Asphaug, E., & Reufer, A. 2014, Nature Geoscience, 7, 564 Hamada, T., Nitadori, K., Benkrid, K., Ohno, Y., Morimoto, G., Bagla, J. S. 2002, J. Astrophys. Astron., 23, 185 Masada, T., Shibata, Y., Oguri, K., & Taiji, M. 2009b, Comp. Balsara, D. S. 1995, J. Comput. Phys., 121, 357 Sci.-Res. Development, 24, 21 Barnes, J. 1990, J. Comput. Phys., 87, 161 Hartmann, W. K., & Davis, D. R. 1975, Icarus, 24, 504 Barnes, J., & Hut, P. 1986, Nature, 324, 446 Hernquist, L. 1990, ApJ, 356, 359 Bedorf, ´ J., Gaburov, E., Fujii, M., Nitadori, K., Ishiyama, T., & Hockney, R. W., & Eastwood, J. W. 1988, Computer Simulation Portegies Zwart, S. 2014, Proc. Int. Conf. for High Perfor- Using Particles (Boca Raton, FL: CRC Press) mance Computing, Networking, Storage and Analysis, ed. Ishiyama, T., Fukushige, T., & Makino, J. 2009, PASJ, 61, 1319 T. Damkroger & J. Dongarra (Piscataway, NJ: IEEE Press), Ishiyama, T., Nitadori, K., & Makino, J. 2012, Proc. International 54 Conference on High Performance Computing, Networking, Bedorf, ´ J., Gaburov, E., & Portegies Zwart, S. 2012, J. Comput. Storage and Analysis, ed. J. K. Hollingsworth (New York: Phys., 231, 2825 Association for Computing Machinery), 1161 Benz, W., Slattery, W. L., & Cameron, A. G. W. 1986, Icarus, 66, Iwasawa, M., Tanikawa, A., Hosono, N., Nitadori, K., 515 Muranushi, T., & Makino, J. 2015, WOLFHPC ’15 (New York: Blackston, D., & Suel, T. 1997, Proc. 1997 ACM/IEEE Conf. ACM), 1 on Supercomputing (New York: Association for Computing Navarro, J. F., Frenk, C. S., & White, S. D. M. 1996, ApJ, 462, 563 Machinery), doi:10.1145/509593.509597 Nitadori, K., Makino, J., & Hut, P. 2006, New Astron., 12, 169 Bode, P., Ostriker, J. P., & Xu, G. 2000, ApJS, 128, 561 Makino, J. 2004, PASJ, 56, 521 Brooks, B., et al. 2009, J. Comp. Chem., 30, 1545 Makino, J., Fukushige, T., Koga, M., & Namura, K. 2003, PASJ, Cameron, A. G. W., & Ward, W. R. 1976, Proc. 7th Lunar Science 55, 1163 Conf., Geochim. Cosmochim. Acta Suppl., 120 Monaghan, J. J. 1992, ARA&A, 30, 543 Canup, R. M., Barr, A. C., & Crawford, D. A. 2013, Icarus, 222, Monaghan, J. J. 1997, J. Comp. Phys., 136, 298 200 Murotani, K., et al. 2014, J. Adv. Simulation Sci. Eng., 1, 16 Case, D. A., et al. 2015, AMBER 2015 (San Francisco: University of Phillips, J., et al. 2005 J. Comp. Chem., 26, 1781 California) Plimpton, S. 1995, J. Comp. Phys., 117, 1 Dehnen, W. 2000, ApJ, 536, L39 Rosswog, S. 2009, New Astron. Rev., 53, 78 Dehnen, W., & Aly, H. 2012, MNRAS, 425, 1068 Salmon, J. K., & Warren, M. S. 1994, 111, 136 Dubinski, J. 1996, New Astron., 1, 133 Shaw, D. E., et al. 2014, Proc. Int. Conf. for High Performance Dubinski, J., Kim, J., Park, C., & Humble, R. 2004, New Astron., Computing, Networking, Storage and Analysis, ed. T. 9, 111 Damkroger & J. Dongarra (Piscataway, NJ: IEEE Press), 41 Fujii, M. S., Baba, J., Saitoh, T. R., Makino, J., Kokubo, E., & Springel, V. 2010, ARA&A, 48, 391 Wada, K. 2011, ApJ, 730, 109 Springel, V., et al. 2005, Nature, 435, 629 Gaburov, E., Harfst, S., & Portegies Zwart, S. 2009, New Astron., Sugimoto, D., et al. 1990, Nature, 345, 33 14, 630 Tanikawa, A., Yoshikawa, K., Nitadori, K., & Okamoto, T. 2013, Goodale, T., Allen, G., Lanfermann, G., Masso, ´ J., Radke, T., New Astron., 19, 74 Seidel, E., & Shalf, J. 2003, High Performance Computing for Tanikawa, A., Yoshikawa, K., Okamoto, T., & Nitadori, K. 2012, Computational Science — VECPAR 2002, ed. J. M. L. M. Palma New Astron., 17, 82 et al. (Berlin: Springer-Verlag), 197 Warren, M. S., & Salmon, J. K. 1995, Comp. Phys. Communications, Hamada, T., Narumi, T., Yokota, R., Yasuoka, K., Nitadori, K., 87, 266 & Taiji, M. 2009a, in Proc. 2009 Workshop on Component- Widrow, L. M., & Dubinski, J. 2005, ApJ, 631, 838 Based High Performance Computing, ed. R. M. Badia & N. Xu, G. 1995, ApJS, 98, 355 Wang (New York: Association for Computing Machinery), Yamada, T., Mitsume, N., Yoshimura, S., & Murotani, K. 2015, 778 Coupled Problems in Science and Engineering VI, ed. B. Schrefler Hamada, T., & Nitadori, K. 2010, Proc. ACM/IEEE Int. Conf. for et al. (Barcelona: International Center for Numerical Methods High Performance Computing, Networking, Storage and Anal- in Engineering) ysis (New York: Association for Computing Machinery) Yoshikawa, K., & Fukushige, T. 2005, PASJ, 57, 849 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Publications of the Astronomical Society of Japan Oxford University Press

Implementation and performance of FDPS: a framework for developing parallel particle simulation codes

Loading next page...
 
/lp/oxford-university-press/implementation-and-performance-of-fdps-a-framework-for-developing-VLwGXhuAN9

References (64)

Publisher
Oxford University Press
Copyright
© The Author 2016. Published by Oxford University Press on behalf of the Astronomical Society of Japan.
ISSN
0004-6264
eISSN
2053-051X
DOI
10.1093/pasj/psw053
Publisher site
See Article on Publisher Site

Abstract

We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application- development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle- based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each com- puting node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes– Hut tree algorithm or the Fast Multipole Method should be used in the case of long- range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are neces- sary for efficient parallel execution of particle-based simulations as “templates,” which are independent of the actual data structure of particles and the functional form of the particle–particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N ) calculation cost, and yet the program, once compiled with FDPS, will run effi- ciently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs The Author 2016. Published by Oxford University Press on behalf of the Astronomical Society of Japan. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-2 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum 7 9 calculation time per timestep is in the range of 30 ms (N = 10 ) to 300 ms (N = 10 ). These are currently limited by the time for the calculation of the domain decomposition and communication necessary for the interaction calculation. We discuss how we can overcome these bottlenecks. Key words: dark matter — Galaxy: evolution — methods: numerical — planets and satellites: formation 1 Introduction In the field of computational astronomy, simulations based on its particles. In addition, we need to make use of mul- on particle methods have been widely used. In such sim- tiple CPU cores in one processor chip and SIMD (single ulations, a system is either physically a collection of par- instruction multiple data) execution units in one CPU core, ticles, as in the case of star clusters, galaxies, and dark- or in some cases GPGPUs (general-purpose computing on matter halos, or modeled by a collection of particles, as graphics processing units) or other accelerators. in SPH (smoothed particle hydrodynamics) simulation of In the case of gravitational N-body problems, there are a astrophysical fluids. Since particles move automatically as number of works in which the efficient parallelization is dis- the result of integration of the equation of motion of the par- cussed (Salmon & Warren 1994; Dubinski 1996; Makino ticle, particle-based simulations have an advantage for sys- 2004; Ishiyama et al. 2009, 2012). The use of SIMD units tems experiencing strong deformation or systems with high is discussed in Nitadori, Makino, and Hut (2006)and density contrast. This is one of the reasons why particle- Tanikawa et al. (2012, 2013), and GPGPUs in Gaburov, based simulations are widely used in astronomy. Examples Harfst, and Portegies Zwart (2009), Hamada et al. (2009a, of particle-based simulations include cosmological simu- 2009b), Hamada and Nitadori (2010), Bedorf, ´ Gaburov, lations or planet-formation simulations with gravitational and Portegies Zwart (2012), and Bedorf ´ et al. (2014). N-body code, simulations of star and galaxy formation with In the field of molecular dynamics, several groups the SPH code or other particle-based codes, and simulations have been working on parallel implementations. Exam- of planetesimal formation with the DEM (discrete element ples of such efforts include Amber (Case et al. 2015), method) code. CHARMM (Brooks et al. 2009), Desmond (Shaw et al. We need to use a large number of particles to improve the 2014), GROMACS (Abrahama et al. 2014), LAMMPS resolution and accuracy of particle-based simulations, and (Plimpton 1995), and NAMD (Phillips et al. 2005). In the in order to do so we need to increase the calculation speed field of CFD (computational fluid dynamics), many com- and need to use distributed-memory parallel machines effi- mercial and non-commercial packages now support SPH or 1 2 ciently. In other words, we need to implement efficient algo- other particle-based methods (PAM-CRASH, LS-DYNA, rithms such as the Barnes–Hut tree algorithm (Barnes & and Adventure/LexADV ). Hut 1986), the TreePM algorithm (Xu 1995) or the Fast Currently, parallel application codes are being developed Multipole Method (Dehnen 2000) to distributed-memory for each specific application of particle methods. Each of parallel computers. In order to achieve high efficiency, we these codes requires the multi-year effort of a multi-person need to divide a computational domain into subdomains team. We believe this situation is problematic because of in such a way that minimizes the need for communication the following reasons. between processors to maintain the division and to per- First, it has become difficult for researchers to try a new form interaction calculations. To be more specific, parallel method or just a new experiment which requires even a implementations of particle-based simulations contain the small modification of existing large codes. If one wants to following three procedures to achieve the high efficiency: test a new numerical scheme, the first thing he or she would (a) domain decomposition, in which the subdomains to be do is write a small program and do simple tests. This can be assigned to computing nodes are determined so that the cal- easily done, as far as that program runs on one processor. culation times are balanced, (b) particle exchange, in which However, if he or she then wants to try a production-level particles are moved to computing nodes corresponding to the subdomains to which they belong, and (c) interaction https://www.esi-group.com/pam-crash. information exchange, in which each computing node col- http://www.lstc.com/products/ls-dyna. lects the information necessary to calculate the interactions http://adventure.sys.t.u-tokyo.ac.jp/lexadv. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-3 large calculation using the new method, the parallelization been widely used for numerical relativity, and BoxLib is for distributed-memory machines is necessary, and other designed for AMR (adaptive mesh refinement). For particle- optimizations are also necessary. However, to develop such based simulations, such frameworks have not been widely a program in a reasonable time is impossible for a single used yet, though there were early efforts, as in Warren person, or even for a team, unless they already have expe- and Salmon (1995), which is limited to long-range, 1/r riences of developing such a code. potential. More recently, LexADV_EMPS is currently being Secondly, even for a team of people developing a parallel developed (Yamada et al. 2015). As its name suggests, it is code for one specific problem, it has become difficult to take rather specialized to the EMPS (Explicit Moving Particle care of all the optimizations necessary to achieve reason- Simulation) method (Murotani et al. 2014). able efficiency on recent processors. In fact, the efficiency In section 2, we describe the basic design concept of of many simulation codes mentioned above on today’s latest FDPS. In section 3, we describe the implementation of microprocessors is rather poor, simply because the develop- parallel algorithms in FDPS. In section 4, we present the ment team does not have enough time or expertise to imple- measured performance for three astrophysical applications ment necessary optimizations (in some cases they require a developed using FDPS. In section 5, we construct a perfor- change of data structure, control structure, and algorithms). mance model and predict the performance of FDPS on near- In our opinion, these difficulties have significantly future supercomputers. Finally, we summarize this study in slowed down research into numerical methods and also section 6. research using large-scale simulations. We have developed FDPS (Framework for Developing 2 How FDPS works Particle Simulator: Iwasawa et al. 2015) in order to solve these difficulties. The goal of FDPS is to let researchers con- In this section, we describe the design concept of FDPS. In centrate on the implementation of numerical schemes and subsection 2.1, we present the design concept of FDPS. In physics, without spending too much time on parallelization subsection 2.2, we show an N-body simulation code written and code optimization. To achieve this goal, we separate a using FDPS, and describe how FDPS is used to perform par- code into domain decomposition, particle exchange, inter- allelization algorithms. Part of the contents of this section action information exchange, and fast interaction calcula- have been published in Iwasawa et al. (2015). tion using a Barnes–Hut tree algorithm and/or neighbor search and the rest of the code. We implement these func- 2.1 Design concept tions as “template” libraries in C++ language. The reason we use the template libraries is to allow the researchers to In this subsection, we present the basic design concept of define their own data structure for particles and their own FDPS. We first present the abstract view of calculation codes functions for particle–particle interactions, and to provide for particle-based simulations on distributed-memory par- them with highly-optimized libraries with small software allel computers, and then describe how such abstraction is overhead. A user of FDPS needs to define the data structure realized in FDPS. and the function to evaluate particle–particle interaction. Using them as template arguments, FDPS effectively gener- 2.1.1 Abstract view of particle-based simulation codes ates the highly-optimized library functions which perform In a particle-based simulation code that uses the space the complex operations described above. decomposition on distributed-memory parallel computers, From the users’ point of view, what is necessary is to the calculation proceeds in the following steps. write the program in C++, using FDPS library functions (1) The computational domain is divided into subdomains, and to compile it with a standard C++ compiler. Using and each subdomain is assigned to one MPI process. FDPS, users can thus write their particle-based simula- This step is usually called “domain decomposition.” tion codes for gravitational N-body problems, SPH, MD Here, minimization of inter-process communication (molecular dynamics), DEM, or many other particle-based and a good load balance among processes should be methods, without spending their time on parallelization achieved. and complex optimization. The compiled code will run effi- (2) Particles are exchanged among processes, so that each ciently on large-scale parallel machines. process owns particles in its subdomain. In this paper For grid-based or FEM (Finite Element Method) appli- we call this step “particle exchange.” cations, there are many frameworks for developing parallel applications. For example, Cactus (Goodale et al. 2003)has 4 5 https://github.com/FDPS/FDPS. https://ccse.lbl.gov/BoxLib/. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-4 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 (3) Each process collects the information necessary to cal- culate the interactions on its particles. We call this step “interaction-information exchange.” (4) Each process calculates interactions between particles in its subdomain. We call this step “interaction calcu- lation.” (5) The data for each particle are updated using the cal- culated interactions. This part is done without inter- process communication. Steps 1, 2, and 3 involve parallelization and inter-process communications. FDPS provides library functions to per- form these parts. Therefore, users of FDPS do not have to Fig. 1. Basic concept of FDPS. The user program gives the definitions of write the parallelization and/or inter-process communica- particle and interaction to FDPS, and calls FDPS APIs. (Color online) tion part of their own code at all. Step 4 does not involve inter-process communication. force in the simulation of organic molecules can be done in However, this step is necessary to perform the actual calcu- the user code, with small additional computing cost. lation of interactions between two particles. Users of FDPS should write a simple interaction kernel. The actual interac- 2.1.2 Design concept of FDPS tion calculation using the tree algorithm or neighbor search In this sub-subsection, we describe how the abstract view is done in FDPS. presented in the previous section is actually expressed in the Step 5 involves neither inter-process communication nor FDPS API (application programming interface). The API of interaction calculation. Users of FDPS can and should write FDPS is defined as a set of template library functions in their own program for this part. C++ language. FDPS can be used to implement particle-based simula- Figure 1 shows how a user program and FDPS library tion codes for initial value problems which can be expressed functions interact. The user program gives the definition of as the following ordinary differential equation: a particle u and particle–particle interaction f to FDPS at the time of compilation. They are written in the standard N C++ language. Thus, the user program [at least the main() du = g f (u , u ), u . (1) i j i function] currently should be written in C++. dt The user program first does the initialization of FDPS library. When the user program is compiled for the MPI Here, N is the number of particles in the system, u is a environment, the initialization of MPI communication is vector which represents the physical quantities of particle i, done in the FDPS initialization function. The setup of the f is a function which describes the contribution of particle j initial condition is done in the user program. It is also to the time derivative of physical quantities of particle i,and possible to use the file input function defined in FDPS. In g is a function which converts the sum of the contributions the main integration loop, domain decomposition, particle exchange, interaction information exchange, and force cal- to the actual time derivative. In the case of gravitational culation are all done through library calls to FDPS. The N-body simulation, u contains position, velocity, mass, time integration of the physical quantities of particles using and other parameters of particle i, f is the gravitational the calculated interaction is done within the user program. force from particle j to particle i,and g gives velocity as Note that it is possible to implement multi-stage integra- the time derivative of position and acceleration calculated tion schemes such as the Runge–Kutta schemes using FDPS. as the time derivative of velocity. FDPS can evaluate the right-hand side of equation (1) for Hereafter, we call a particle that receives the interaction agiven setof u . Therefore, the derivative calculation for an “i particle,” and a particle exerting that interaction an “j particle.” The actual contents of vector u and the func- intermediate steps can be done by u containing appropriate tional forms of f and g depend on the physical system and values. numerical scheme used. The parallelization using MPI is completely taken care of In equation (1) we included only the pairwise interac- by FDPS, and the use of OpenMP is also managed by FDPS tions, because usually the calculation cost of the pairwise interaction is the highest even when many-body interaction 6 We will investigate a possible way to use APIs of FDPS from programs written in is important. For example, the angle and torsion of bonding Fortran. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-5 modified to ⎡ ⎤ N N J,i S,i du ⎣ ⎦ = g f (u , u ) + f (u , u ), u , (2) i j i  i dt j j where N and N are, the number of j particles and J, i S, i superparticles for which we apply multipole-like expansion, respectively, the vector u is the physical quantity vector of a superparticle, and the function f indicates the interac- tion exerted on particle i by the superparticle j .Insimu- lations with a large number of particles N, N ,and N J, i S, i Fig. 2. Long-range interaction (left) and short-range interaction (right). are many orders of magnitude smaller than N. A user needs Gray, red, and blue points are i particles, j particles, and superparticles, respectively. (Color online) to provide functions to construct superparticles from par- ticles and to calculate the interaction from superparticles. Since the most common use of the long-range interaction is for the interaction calculation. In order to achieve high per- for 1/r potential, FDPS includes standard implementation formance, the interaction calculation should make efficient of these functions for 1/r potential up to the quadrupole use of the cache memory and SIMD units. In FDPS, this moment. is done by requiring an interaction calculation function to calculate the interactions between multiple i and j particles. In this way, the amount of memory access is significantly 2.2 An example — gravitational N-body problem reduced, since a single j particle is used to calculate the In this subsection, we present a complete user code for the interaction of multiple i particles (i particles are also in the gravitational N-body problem with the open boundary con- cache memory). To make efficient use of the SIMD execu- dition, to illustrate how a user could write an application tion units, currently the user should write the interaction calculation loop so that the compiler can generate SIMD program using FDPS. The gravitational interaction is han- instructions. Of course, the use of libraries optimized for dled as a “long-range” type in FDPS. Therefore, we need to specific architectures (Nitadori et al. 2006; Tanikawa et al. provide the data type and interaction calculation functions for superparticles. In order to keep the sample code short, 2012, 2013) would guarantee very high performance. we use the center-of-mass approximation and use the same It is also possible to use GPUs and other accelerators for data class and interaction function for real particles and the interaction calculation. In order to reduce the communi- superparticles. cation overhead, a so-called “multiwalk” method (Hamada For the gravitational N-body problem, the physical et al. 2009b) is implemented. Thus, interaction calculation quantity vector u and interaction functions f , f ,and g kernels for accelerators should take multiple sets of pairs of are given by i-and j particles. The performance of this version will be discussed elsewhere. u = (r , v , m ), (3) i i i i As stated earlier, FDPS performs the neighbor search if the interaction is of a short-range nature. If the long-range Gm r − r j j i f (u , u ) = , (4) i j 3/2 interaction is used, currently FDPS uses the Barnes–Hut |r − r | + j i tree algorithm. In other words, within FDPS, the distinction between the long-range and short-range interactions is not Gm r − r j i f (u , u ) = , (5) i j a physical one but an operational one. If we want to apply 3/2 2 2 |r − r | + j i the treecode, it is a long-range interaction. Otherwise, it is a short-range interaction. Thus, we can use the simple tree g(F , u ) = (v , F , 0), (6) i i algorithm for pure 1/r gravity and the TreePM scheme (Xu 1995; Bode et al. 2000; Bagla 2002; Dubinski et al. 2004; F = f (u , u ) + f (u , u ), u , (7) i j i j i Springel 2005; Yoshikawa & Fukushige 2005; Ishiyama j j et al. 2009, 2012) for the periodic boundary. Figure 2 illustrates the long-range and short-range inter- where m , r , v ,and  are the mass, position, velocity, and i i i i actions and how they are calculated in FDPS. gravitational softening of particle i, respectively; m and r j j For long-range interactions, a Barnes–Hut algorithm is are the mass and position of a superparticle j, respectively; used. Thus, the interactions from a group of distant particles and G is the gravitational constant. Note that the shapes of are replaced by that of a superparticle, and equation (1) is the functions f and f are the same. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-6 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 Listing 1 shows the complete code which can be com- 49 for(S32 j=0; j<nj;j++) { 50 F64vec xj = jp[j].pos; piled and run, not only on a single-core machine but also on massively-parallel, distributed-memory machines such 51 F64vec dr = xi - xj; as the K computer. The total number of lines is 117. 52 F64 mj = jp[j].mass; 1 #include <particle_simulator.hpp> 53 F64 dr2 = dr * dr + ep2; 2 using namespace PS; 54 F64 dri = 1.0 / sqrt(dr2); 3 55 ai -= (dri * dri * dri 4 class Nbody{ 56 * mj) * dr; 5 public: 57 } 6 F64 mass, eps; 58 force[i].acc += ai; 7 F64vec pos, vel, acc; 59 } 8 F64vec getPos() const {return pos;} 60 } 9 F64 getCharge() const {return mass;} 61 }; 10 void copyFromFP(const Nbody &in) { 62 11 mass = in.mass; 63 template <class Tpsys> 12 pos = in.pos; 64 void predict(Tpsys &p, 13 eps = in.eps; 65 const F64 dt) { 14 } 66 S32 n = p.getNumberOfParticleLocal(); 15 void copyFromForce (const Nbody &out) { 67 for(S32 i = 0; i < n; i++) 16 acc = out.acc; 68 p[i].predict(dt); 17 } 69 } 18 void clear() { 70 71 template <class Tpsys> 19 acc = 0.0; 20 } 72 void correct(Tpsys &p, 21 void readAscii(FILE *fp) { 73 const F64 dt) { 22 fscanf(fp, 74 S32 n = p.getNumberOfParticleLocal(); 23 "%lf%lf%lf%lf%lf%lf%lf%lf" 75 for(S32 i = 0; i < n; i++) 24 &mass, &eps, 76 p[i].correct(dt); 25 &pos.x, &pos.y, &pos.z, 77 } 26 &vel.x, &vel.y, &vel.z); 78 27 } 79 template <class TDI, class TPS, class TTFF> 28 void predict(F64 dt) { 80 void calcGravAllAndWriteBack (TDI &dinfo, 29 vel += (0.5 * dt) * acc; 81 TPS &ptcl, 30 pos += dt * vel; 82 TTFF &tree) { 31 } 83 dinfo.decomposeDomainAll(ptcl); 32 void correct(F64 dt) { 84 ptcl.exchangeParticle(dinfo); 33 vel += (0.5 * dt) * acc; 85 tree.calcForceAllAndWriteBack 34 } 86 (CalcGrav<Nbody>(), 35 }; 87 CalcGrav<SPJMonopole>(), 36 88 ptcl, dinfo); 37 template <class TPJ> 89 } 38 struct CalcGrav{ 90 39 void operator () (const Nbody * ip, 91 int main(int argc, char *argv[]) { 40 const S32 ni, 92 PS :: Initialize(argc, argv); 41 const TPJ * jp, 93 F32 time = 0.0; 42 const S32 nj, 94 const F32 tend = 10.0; 43 Nbody * force) { 95 const F32 dtime = 1.0 / 128.0; 44 for(S32 i=0; i<ni; i++) { 96 PS :: DomainInfo dinfo; 45 F64vec xi = ip[i].pos; 97 dinfo.initialize(); 46 F64 ep2 = ip[i].eps 98 PS :: ParticleSystem<Nbody> ptcl; 47 * ip[i].eps; 99 ptcl.initialize(); 48 F64vec ai = 0.0; 100 PS :: TreeForForceLong<Nbody, Nbody, Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-7 101 Nbody> :: Monopole grav; A particle class for FDPS must provide public member func- 102 grav.initialize(0); tions getPos, copyFromFP, copyFromForce,and clear,so 103 ptcl.readParticleAscii(argv[1]); that the internal functions of FDPS can access the data 104 calcGravAllAndWriteBack(dinfo, within the particle class. For the name of the particle class 105 ptcl, itself and the names of the member variables, a user can use 106 grav); whatever names allowed by the C++ syntax. The member 107 while(time < tend) { functions predict and correct are used to integrate the 108 predict(ptcl, dtime); orbits of particles. These are not related to FDPS. Since the 109 calcGravAllAndWriteBack(dinfo, interaction is pure 1/r type, the construction method for 110 ptcl, superparticles provided by FDPS can be used and they are 111 grav); not shown here. 112 correct(ptcl, dtime); In the third part, the interaction functions f and f are 113 time += dtime; defined. In this example, actually they are the same, except 114 } for the class definition for j particles. Therefore, this argu- 115 PS :: Finalize(); ment is given as an argument with the template data type 116 return 0; TPJ, so that a single source code can be used to generate two 117 } functions. The interaction function used in FDPS should have the following five arguments. The first argument, ip, In the following we describe how this sample code is the pointer to the array of i particles which receive the works. It consists of four parts: the declaration to use FDPS interaction. The second argument, ni, is the number of i (lines 1 and 2), the definition of the particle (the vector particles. The third argument, jp, is the pointer to the array u ) (lines 4 to 35), the definition of the gravitational force of j particles or superparticles which exert the interaction. (the functions f and f ) (lines 37 to 61), and the actual The fourth argument, nj, is the number of j particles or user program. The actual user program consists of a main super-particles. The fifth argument, force, is the pointer routine and functions which call library functions of FDPS to the array of variables of a user-defined class in which (lines 63 to line 117). In the following, we discuss each of the calculated interactions on i particles can be stored. In them. this example we used the particle class itself, but it can be In order to use FDPS, the user program should include another class or a simple array. the header file “particle_simulator.hpp.” All functionalities In this example, the interaction function is a function of the standard FDPS library are implemented as the header object declared as a struct, with the only member func- source library, since they are template libraries which need tion operator (). FDPS can also accept a function pointer to receive particle class and interaction functions. FDPS for the interaction function, which would look a bit more data types and functions are in the namespace “PS.” In this familiar to most readers. In this example, the interaction sample program, we declare “PS” as the default namespace is calculated through a simple double loop. In order to to simplify the code. In this sample, however, we did not achieve high efficiency, this part should be replaced by a omit “PS” for FDPS functions and class templates, to show code optimized for specific architectures. In other words, a that they come from FDPS. user needs to optimize just this single function to achieve FDPS defines several data types. F32/F64 are data types very high efficiency. of 32-bit and 64-bit floating points. S32 is the data type In the fourth part, the main routine and user-defined of a 32-bit signed integer. F64vec is the class of a vector functions are defined. In the following, we describe the main consisting of three 64-bit floating points. This class provides routine in detail, and briefly discuss other functions. The several operators, such as the addition, subtraction, and main routine consists of the following seven steps. inner product indicated by “∗.” It is not necessary to use (1) Initialize FDPS (line 92). these data types in the user program, but in some of the (2) Set simulation time and timestep (lines 93 to 95). functions the user should provide these data types for the (3) Create and initialize objects of FDPS classes (lines 96 return value. to 102). In the second part, the particle data type (i.e., the (4) Read-in particle data from a file (line 103). vector u ) is defined as class Nbody. It has the following (5) Calculate the gravitational forces on all the particles at member variables: mass (m ), eps ( ), pos (r ), vel (v ), i i i i the initial time (lines 104 to 106). and acc (dv /dt). Although the member variable acc does (6) Integrate the orbits of all the particles with the Leap- not appear in equations (3)–(7), we need this variable Frog method (lines 107 to 114). to store the result of the gravitational force calculation. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-8 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 (7) Finish the use of FDPS (line 115). TreeForForceLong. This function takes the user-defined function object CalcGrav as the first and second arguments, In the following, we describe steps 1, 3, 4, 5, and 7, and and calculates particle–particle and particle-superparticle skip steps 2 and 6. In step 2, we do not call FDPS libraries. interactions using them. Although we call FDPS libraries in step 6, the usage is the In step 7, the FDPS function Finalize is called. It calls same as in step 5. the MPI_Finalize function. In step 1, the FDPS function Initialize is called. In In this section, we have described in detail what a user this function, MPI and OpenMP libraries are initialized. If program written using FDPS looks like. As we stated earlier, neither of them are used, this function does nothing. All this program can be compiled with or without paralleliza- functions of FDPS must be called between this function and tion using MPI and/or OpenMP, without any change in the function Finalize. the user program. The executable parallelized with MPI is In step 3, we create and initialize three objects of the generated by using an appropriate compiler with MPI sup- FDPS classes: port and a compile-time flag. Thus, a user need not worry dinfo: An object of class DomainInfo. It is used for about complicated bookkeeping necessary for paralleliza- domain decomposition. tion using MPI. ptcl: An object of class template ParticleSystem. In the next section, we describe how FDPS provides a It takes the user-defined particle class (in this example, generic framework which takes care of parallelization and Nbody) as the template argument. From the user program, bookkeeping for particle-based simulations. this object looks as an array of i particles. grav: An object of data type Monopole defined in class template TreeForForceLong. This object is used for the 3 Implementation calculation of long-range interaction using the tree algo- In this section, we describe how the operations discussed rithm. It receives three user-defined classes as template in the previous section are implemented in FDPS. In arguments: the class to store the calculated interaction, subsection 3.1 we describe the domain decomposition and the class for i particles and the class for j particles. In this particle exchange, and in subsection 3.2, the calculation of example, all these three classes are the same as the original interactions. Part of the contents of this section have been class of particles. It is possible to define classes with min- published in Iwasawa et al. (2015). imal data for these purposes and use them here, in order to optimize the cache usage. The data type Monopole indi- cates that the center-of-mass approximation is used for 3.1 Domain decomposition and particle exchange superparticles. In this subsection, we describe how the domain decomposi- In step 4, the data of particles are read from a file into the tion and the exchange of particles are implemented in FDPS. object ptcl, using FDPS function readParticleAscii.In We used the multisection method (Makino 2004)withthe this function, a member function of class Nbody, readAscii, so-called sampling method (Blackston & Suel 1997). The is called. Note that the user can write and use his/her own multisection method is a generalization of ORB (Orthog- I/O functions. In this case, readParticleAscii is unneces- onal Recursive Bisection). In ORB, as its name suggests, sary. bisection is applied to each coordinate axis recursively. In In step 5, the forces on all particles are calculated the multisection method, division in one coordinate is not through the function calcGravAllAndWriteBack,which to two domains but to an arbitrary number of domains. is defined in lines 79 to 89. In this function, steps 1 Since one dimension can be divided to more than two sec- to 4 in sub-subsection 2.1.1 are performed. In other tions, it is not necessary to apply divisions many times. words, all of the actual work of FDPS libraries to Thus we apply divisions only once to each coordinate axis. calculate interaction between particles takes place here. A practical advantage of this method is that the number of For step 1, decomposeDomainAll, a member function processors is not limited to powers of two. of class DomainInfo is called. This function takes the Figure 3 illustrates an example of the multisection object ptcl as an argument to use the positions of par- method with (n , n , n ) = (7, 6, 1). We can see that the size x y z ticles to determine the domain decomposition. Step 2 and shape of subdomains show large variation. By allowing is performed in exchangeParticle, a member func- this variation, FDPS achieves quite good load balance and tion of class ParticleSystem. This function takes the high scalability. Note that n = n n n is the number of MPI x y z object dinfo as an argument and redistributes particles processes. By default, values of n , n ,and n are chosen so x y z 1/3 among MPI processes. Steps 3 and 4 are performed in that they are integers close to n . For figure 3, we force the calcForceAllAndWriteBack, a member function of class numbers used to make a two-dimensional decomposition. Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-9 A naive implementation of the above algorithm requires “global” sorting of sample particles over all (i, 0, 0) pro- cesses. In order to simplify this part, before each pro- cess sends the sample particles to (i, 0, 0) processes, they exchange their samples with other processes with the same location in y-and z-process coordinates, so that they have sample particles in the current domain decomposition in the x-direction. As a result, particles sent to (i, 0, 0) processes are already sorted at the level of domain decomposition in the x-direction, and we need only sort within each of the (i, 0, 0) processes to obtain the globally sorted particles. Thus, our implementation of parallelized domain decomposition is as follows. (1) Each process samples particles randomly from its own particles. In order to achieve an optimal load balance, the sampling rate of particles is changed so that it is proportional to the CPU time per particle spent on that process (Ishiyama et al. 2009). FDPS provides several Fig. 3. Example of the domain decomposition. The division is 7 × 6in options including this optimal balance. two dimensions. (Color online) (2) Each process exchanges the sample particles according to the current domain boundary in the x-direction with In the sampling method, first each process performs the process with the same y and z indices, so that they random sampling of particles under it, and sends them to have sample particles in the current domain decompo- the process with rank 0 (“root” process hereafter). Then sition in the x-direction. the root process calculates the division so that sample (3) Each process with index (i, y, z) sends the sample par- particles are equally divided over all processes, and broad- ticles to the process with index (i, 0, 0), in other words, casts the geometry of domains to all other processes. In the root processes in each y–z plane collects subsam- order to achieve good load balance, sampling frequency ples. should be changed according to the calculation cost per (4) Each root process sorts the sample particles in the particle (Ishiyama et al. 2009). x-direction. Now, the sample particles are sorted glob- The sampling method works fine, if the number of par- ally in the x-direction. ticles per process is significantly larger than the number of (5) Each root process sends the number of the sample par- process. This is, however, not the case for runs with a large ticles to the other root processes and determines the number of nodes. When the number of particles per process global rank of the sample particles. is not much larger than the number of processes, the total (6) The x coordinate of new domains is determined by number of sample particles which the root process needs to dividing all sample particles into n subsets with equal handle exceeds the number of particles per process itself, numbers of sample particles. and thus calculation time of domain decomposition in the (7) Each root process exchanges sample particles with root process becomes visible. other root processes, so that they have the sample par- In order to reduce the calculation time, we also paral- ticles in new domain in the x-direction. lelized the domain decomposition, currently in the direction (8) Each root process determines the y and z coordinates of the x-axis only. The basic idea is that each node sends of new domains. the sample particles not to the root process of the all MPI (9) Each root process broadcasts the geometries of new processes but to the processes with index (i, 0, 0). Then pro- domains to all other processes. cesses (i, 0, 0) sort the sample particles and exchange the number of sample particles they received. Using these two It is also possible to parallelize the determination of sub- pieces of information, each of (i, 0, 0) processes can deter- domains in step 8, but even for the full-node runs on the K mine all domain boundaries inside its current domain in the computer we found the current parallelization is sufficient. x-direction. Thus, they can determine which sample parti- For particle exchange and also for interaction infor- cles should be sent to where. After the exchange of sample mation exchange, we use MPI_Alltoall to exchange the particles, each of the (i, 0, 0) processes can determine the length of the data and MPI_Isend and MPI_Irecv to actually decompositions in the y-and z-directions. exchange the data. At least on the K computer, we found Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-10 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 that the performance of vendor-provided MPI_Alltoall is computer of the RIKEN Advanced Institute for Computa- not optimal for short messages. We implemented a hand- tional Science (AICS), and the other is the Cray XC30 of the crafted version in which the messages sent to the same relay Center for Computational Astrophysics (CfCA), National points are combined in order to reduce the total number of Astronomical Observatory of Japan. The K computer con- messages. sists of 82944 Fujitsu SPARC64 VIIIfx processors, each After the domain decomposition is done and the result is with eight cores. The theoretical peak performance of one broadcasted to all processes, they exchange particles so that core is 16 Gflops, for both single- and double-precision each of them has particles in its domain. Since each process operations. Cray XC30 of the CfCA consists of 1060 has the complete information of the domain decomposi- nodes, or 2120 Intel Xeon E5-2690v3 processors (12 cores, tion, this part is pretty straightforward to implement. Each 2.6 GHz). The theoretical peak performance of one core process looks at each of its particles, and determines if that is 83.2 and 41.6 Gflops for single- and double-precision particle is still in its domain. If not, the process determines operations, respectively. In all runs on the K computer, to which process that particle should be sent. After the des- we use the hybrid MPI-OpenMP mode of FDPS, in which tinations of all particles are determined, each process sends one MPI process is assigned to one node. On the other them out, using MPI_Isend and MPI_Irecv functions. hand, for XC30, we use the flat MPI mode of FDPS. The source code is the same except for that used for the interac- tion calculation functions. The interaction calculation part 3.2 Interaction calculation was written to take full advantage of the SIMD instruc- In this subsection, we describe the implementations of inter- tion set of the target architecture, and thus written specif- action information exchange and actual interaction calcu- ically for SPARC64 VIIIfx (HPC-ACE instruction set) and lation. In the interaction information exchange step, each Xeon E5 v3 (AVX2 instruction set). process sends the data required by other nodes. In the interaction calculation step, actual interaction calculation 4.1 Disk galaxy simulation is done using the received data. For both steps, the Barnes– Hut octree structure is used, for both long- and short-range In this subsection, we discuss the performance and scal- interactions. ability of a gravitational N-body simulation code imple- First, each process constructs the tree of its local parti- mented using FDPS. Some results in this section have been cles. This tree is then used to determine the data to be sent published in Iwasawa et al. (2015). The code is essentially to other processes. For the long-range interaction, the pro- the same as the sample code described in subsection 2.2, cedure is the same as that for usual tree traversal (Barnes & except for the following two differences in the user code for Hut 1986; Barnes 1990). The tree traversal is used also for the calculation of the interaction. First, to improve the accu- short-range interactions. FDPS can currently handle four racy, we used the expansion up to the quadrupole moment, different types of cutoff length for the short-range interac- instead of the monopole-only one used in the sample code. tions: fixed, j-dependent, i-dependent, and symmetric. For Secondly, we used the highly optimized kernel developed i-dependent and symmetric cutoffs, the tree traversal should using SIMD built-in functions, instead of the simple one in be done twice. the sample code. Using the received data, each process performs the force We apply this code for the simulation of a Milky Way- calculation. To do so, it first constructs the tree of all data like galaxy, which consists of a bulge, a disk, and a dark received and local particles, and then uses it to calculate the matter halo. For examples of recent large-scale simulations, interaction of local particles. see Fujii et al. (2011)and Bedorf ´ et al. (2014). The initial condition is the Milky Way model, which is the same as that in Bedorf ´ et al. (2014). The mass of 4 Performance of applications developed 9 the bulge is 4.6 × 10 M , and it has a spherically sym- using FDPS metric density profile of the Hernquist model (Hernquist In this section, we present the performance of three astro- 1990) with a half-mass radius of 0.5 kpc. The disk is an physical applications developed using FDPS. One is the pure axisymmetric exponential disk with a scale radius of 3 kpc, gravity code with open boundary applied to disk galaxy scale height of 200 pc and mass 5.0 × 10 M . The dark simulation. The second one is again a pure gravity appli- halo has a Navarro–Frenk–White (NFW) density profile cation but with periodic boundary applied to cosmological (Navarro et al. 1996) with the half-mass radius of 40 kpc simulation. The third one is a gravity + SPH calculation and the mass of 6.0 × 10 M . In order to realize the Milky applied to the giant impact (GI) simulation. For the perfor- Way model, we used GalacticICS (Widrow & Dubinski mance measurement, we used two systems. One is the K 2005). For all simulations in this subsection, we adopt Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-11 XC30 10 50% of TPP (K) 35% of TPP (XC30) -1 -1 -2 total -3 Fig. 4. Face-on surface density maps of the bulge and disk. (Color online) domain decomposition exchange particle grav -4 1 2 3 4 5 6 θ = 0.4 for the opening angle for the tree algorithm. We 10 10 10 10 10 10 # of cores set the average number of particles sampled for the domain decomposition to 500. Fig. 5. Weak-scaling performance of the gravitational N-body code. Figure 4 illustrates the time evolution of the bulge and The speed of the floating-point operation (top) and wall-clock time per disk in the run with 512 nodes on the K computer. The disk timestep (bottom) are plotted as functions of the number of cores. Open and filled symbols indicate the performances of the K computer and is initially axisymmetric. We can see that a spiral structure cray XC30, respectively. In the top panel, the solid line indicates 50% develops (0.5 and 1 Gyr) and a central bar follows the spiral of the theoretical peak performance of the K computer and the dotted (1 Gyr and later). As the bar grows, the two-arm structure line indicates 35% of the theoretical peak performance of XC30. In the becomes more visible (3 Gyr). bottom panel, time spent for the interaction calculation (diamond), the domain decomposition (square) the exchange particles (triangle) are Figure 5 shows the measured weak-scaling performance. also shown. (Color online) We fixed the number of particles per core to 266367 and measured the performance for the number of cores in the Bedorf ´ et al. (2014) reported the wall-clock time of 4 s range of from 4096 to 663552 on the K computer, and in for their 27-billion particle simulation on the Titan system the range of from 32 to 2048 on XC30. We can see that with 2048 NVIDIA Tesla K20X, with the theoretical peak the measured efficiency and scalability are both very good. performance of 8PF (single precision, since the single pre- The efficiency is more than 50% for the entire range of cision was used for the interaction calculation). This corre- cores on the K computer. The efficiency of XC30 is a bit sponds to 0.8 billion particles per second per petaflops. Our worse than that of the K computer. This difference comes code on the K computer requires 15 s on 16384 nodes (2PF from the difference of two processors. The Fujitsu pro- theoretical peak), resulting in 1 billion particles per second cessor showed higher efficiency, while the Intel processor per petaflops. Therefore, we can conclude that our FDPS has 5.2 times higher peak performance per core. We can code achieved a performance that is slightly better than see that the time for domain decomposition increases as one of the best codes specialized to gravitational N-body we increase the number of cores. The slope is around 2/3, problems. as can be expected from our current algorithm discussed in subsection 3.1. 4.2 Cosmological simulation Figure 6 shows the measured strong-scaling perfor- mance. We fixed the total number of particles to 550 million In this subsection, we discuss the performance of a cos- and measured the performance for 512 to 32768 cores on mological simulation code implemented using FDPS. We the K computer and 256 to 2048 cores on XC30. We can implemented TreePM (Tree Particle-Mesh) method and also see that the measured efficiency and scalability are both measured the performance on XC30. Our TreePM code very good, for the strong-scaling performance. is based on the code developed by K. Yoshikawa. The wallclock time per timestep[s] performance[TFLOPS] Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-12 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 XC30 50% of TPP (K) 35% of TPP (XC30) 10 XC30 7% of TPP (XC30) total domain decomposition exchange particle 1 grav -1 -1 total domain decomposition -2 -2 exchange particle grav PP -3 grav PM -3 2 3 4 5 2 3 4 10 10 10 10 10 10 10 # of cores # of cores Fig. 6. Same figure as figure 5 but for the strong-scaling performance Fig. 7. Weak-scaling performance of the TreePM code. The speed of the for 550 million particles. (Color online) floating-point operation (top) and wall-clock time per timestep (bottom) are plotted as functions of the number of cores. In the top panel, the solid line indicates 7% of the theoretical peak performance of XC30. In the bottom panel, time spent for the particle–particle interaction calcu- Particle-Mesh part of the code was developed by Ishiyama, lation (diamond), the particle–mesh interaction (pentagon), the domain Nitadori, and Makino (2012) and this code is included in decomposition (square) and the exchange particles (triangle) are also the FDPS package as an external module. shown. (Color online) We initially place particles uniformly in a cube and gave them zero velocity. For the calculation of the tree force, we for large number of cores, because we did not use paral- used a monopole-only kernel with cutoff. The cutoff length lelized domain decomposition here. The efficiency is 7% of of the force is three times larger than the width of the mesh. the theoretical peak performance. It is rather low compared We set θ to 0.5. For the calculation of the mesh force, the to that for the disk galaxy simulations in subsection 4.1. The mass density is assigned to each of the grid points, using the main reason is that we use a lookup table for the force cal- triangular-shaped cloud scheme, and the density profile we culation. If we evaluate the force without the lookup table, used is the S2 density profile (Hockney & Eastwood 1988). the nominal efficiency would be much better, but the total Figures 7 and 8 show the weak and strong scaling per- time would be longer. formances, respectively. For the weak-scaling measurement, we fixed the number of particles per process to 5.73 million 4.3 Giant impact simulation and measured the performance for the number of cores in the range of 192 to 12000 on XC30. For the strong-scaling In this section, we discuss the performance of an SPH sim- measurements, we fixed the total number of particles to ulation code with self-gravity implemented using FDPS. 2048 and measured the performance for the number of Some results in this section have been published in Iwasawa cores in the range of 1536 to 12000 on XC30. We can see et al. (2015). The test problem used is the simulation of GI. that the time for the calculation of the tree force is dominant The GI hypothesis (Hartmann & Davis 1975; Cameron and both of the weak and strong scalings are good except for & Ward 1976) is one of the most popular scenarios for the very large number of cores (12000) for the strong scaling the formation of the Moon. The hypothesis is as follows. measurement. One reason is that the scalability of the calcu- About five billion years ago, a Mars-sized object (hereafter, lation of the mesh force is not very good. Another reason is the impactor) collided with the proto-Earth (hereafter, the that the time for the domain decomposition grows linearly target). A large amount of debris was scattered, which first wallclock time per timestep[s] performance[TFLOPS] wallclock time per timestep[s] performance[TFLOPS] Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-13 XC30 7% of TPP (XC30) total domain decomposition exchange particle grav PP 10 grav PM Fig. 9. Temperature maps of the target and impactor in the run with 9.9 million particles at four different epochs. (Color online) -1 disrupted (t = 2847 s) and debris are ejected. A part of the 3 4 10 10 debris falls back to the target, while the rest will eventually # of cores form the disk and the Moon. So far, the resolution used in Fig. 8. Same as figure 7 but for the strong-scaling performance. In this the published papers has been much lower. We plan to use case, the number of particles is 2048 . (Color online) this code to improve the accuracy of the GI simulations. Figures 10 and 11 show the measured weak and strong scaling performance. For the weak-scaling measurement, formed the debris disk and eventually the Moon. Many we fixed the number of particles per core to 20000 and mea- researchers have performed simulations of GI, using the sured the performance for the number of cores in the range SPH method (Benz et al. 1986; Canup et al. 2013; Asphaug of 256 to 131072 on the K computer. On the other hand, for & Reufer 2014). the strong-scaling measurement, we fixed the total number For the gravity, we used a monopole-only kernel with of particles to 39 million and measured the performance for θ = 0.5. We adopt the standard SPH scheme (Monaghan the number of cores in the range of 512 to 16384 on the 1992; Rosswog 2009; Springel 2010) for the hydro part. K computer. We can see that the performance is good even Artificial viscosity is used to handle shocks (Monaghan for a very large number of cores. The efficiency is about 1997), and the standard Balsara switch is used to reduce 40% of the theoretical peak performance. The hydro part the shear viscosity (Balsara 1995). The kernel function we 6 consumes more time than the gravity part does, mainly used is the Wendland C and the cutoff radius is 4.2 times because the particle–particle interaction is more compli- larger than the local mean inter-particle distance. In other cated. words, each particle interacts with about 300 other parti- cles. This neighbor is appropriate for this kernel to avoid the pairing instability (Dehnen & Aly 2012). 5 Performance model Assuming that the target and impactor consist of granite, In this section, we present the performance model of appli- we adopt the equation of state of granite (Benz et al. cations implemented using FDPS. As described in section 2, 1986) for the particles. For the initial condition, we the calculation of a typical application written using FDPS assume a parabolic orbit with the initial angular momentum proceeds in the following steps. 1.21 times that of the current Earth–Moon system. Figure 9 shows the time evolution of the target and (1) Update the domain decomposition and exchange par- impactor for a run with 9.9 million particles. We can see ticles accordingly (not in every timestep). that the shocks are formed just after the moment of impact (2) Construct the local tree structure and exchange par- in both the target and impactor (t = 2050 s). The shock ticles and superparticles necessary for an interaction propagates in the target, while the impactor is completely calculation. wallclock time per timestep[s] performance[TFLOPS] Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-14 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 3 3 10 10 40% of TPP 2 2 10 10 1 1 10 10 40% of TPP 0 0 10 10 total domain decomposition 10 1 exchange particle hydr grav -1 -1 -2 total domain decomposition -2 -3 10 exchange particle hydr grav -4 -3 10 10 2 3 4 5 6 2 3 4 5 10 10 10 10 10 10 10 10 10 # of cores # of cores Fig. 10. Weak-scaling performance of the SPH code. The speed of the Fig. 11. Same as figure 10 but for the strong-scaling performance for floating-point operation (top) and wall-clock time per timestep (bottom) 39 million particles. (Color online) are plotted as functions of the number of cores. In the top panel, the solid line indicates 40% of the theoretical peak performance of the K term of the right hand side of equation 8, and finally we computer. In the bottom panel, time spent for the hydrodynamics cal- compare the model with the actual measurement presented culation (cross), the gravity calculation (diamond), the domain decom- position (square), and the exchange particles (triangle) are also shown. in subsection 5.6. (Color online) 5.1 Communication model (3) Construct the “global” tree. (4) Perform the interaction calculation. What ultimately determines the efficiency of a calculation (5) Update the physical quantities of particles using the performed on a large-scale parallel machine is the commu- calculated interactions. nication overhead. Thus, it is very important to understand what types of communication would take what amount In the case of complex applications which require more of time on actual hardware. In this section, we summa- than one interaction calculation, each of the above steps, rize the characteristics of the communication performance except for the domain decomposition, may be executed of the K computer. more than one time per timestep. In FDPS, almost all communications are through the For a simple application, thus, the total wall-clock time use of collective communications, such as MPI_Allreduce, per timestep should be expressed as MPI_Alltoall,and MPI_Alltoallv. However, measure- ment of the performance of these routines for uniform T = T /n + T + T + T + T , (8) step dc dc lt exch icalc misc message length is not enough, since the amount of data where T , T , T , T ,and T are the times for domain to be transferred between processes generally depends on dc lt exch icalc misc composition and particle exchange, local tree construc- the physical distance between domains assigned to those tion, exchange of particles and superparticles for interaction processes. Therefore, we first present the timing results for calculation, interaction calculation, and other calculations simple point-to-point communication, and then for collec- such as particle update, respectively. The term n is the tive communications. dc interval at which the domain decomposition is performed. Figure 12 shows the elapsed time as the function of the In the following, we first construct the model for the message length, for point-to-point communication between communication time. Then we construct models for each “neighboring” processes. In the case of the K computer, we wallclock time per timestep[s] performance[TFLOPS] wallclock time per timestep[s] performance[TFLOPS] Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-15 Table 1. Time coefficients in equation (10). n = 32 n = 256 n = 2048 p p p -1 T [ms] 0.103 0.460 2.87 alltoallv,startup −1 −6 −5 −3 T [ms byte ]8.25 × 10 9.13 × 10 1.32 × 10 alltoallv,word -2 2 10 10 8byte 32kbyte -3 0 1 2 3 4 5 6 10 10 10 10 10 10 10 size[byte] -1 Fig. 12. Elapsed time for point-to-point communication as a function of size of message measured on the K computer. -2 1 2 3 n =32 2 10 10 10 n =256 n =2048 p Fig. 14. Elapsed time of MPI_Alltoallv to send message of 8 bytes (cir- cles) and 32 kilobytes (squares) as a function of the number of processes measured on the K computer. Solid and dashed curves indicate the results for the message sizes of 8 bytes and 32 kilobytes, respectively. (Color online) where T is the startup time and T is the alltoallv,startup alltoallv,word time to transfer one byte of message. We list these values in -1 table 1. 1 2 3 4 5 10 10 10 10 10 The coefficients themselves in equation (10) depend on size[byte] the number of MPI processes, n , as shown in figure 14. They are modeled as Fig. 13. Elapsed time of MPI_Alltoallv as a function of message size measured on the K computer. (Color online) T = τ n , (11) alltoallv,startup alltoallv,startup p used a three-dimensional node allocation, so that “neigh- 4/3 T = τ n . (12) alltoallv,word alltoallv,word boring” processes are actually close to each other in the torus network. Here we assume that the speed to transfer a message using We can see that the elapsed time can be fitted reasonably MPI_Alltoallv is limited to the bisection bandwidth of the well as system. Under this assumption, T should be pro- alltoallv,word 4/3 portional to n . To estimate τ and τ , alltoallv,startup alltoallv,word T = T + n T , (9) p2p p2p,startup word p2p,word we use measurements for message sizes of 8 bytes and 32 kilobytes. In the K computer, we found that τ is alltoallv,startup where T is the startup time which is independent p2p,startup −7 0.00166 ms and τ is 1.11 × 10 ms per byte. If alltoallv,word of the message length, and T is the time to transfer p2p,word MPI_Alltoallv is limited to the bisection bandwidth in the one byte of the message. Here, n is the length of the word −8 K computer, τ would be 5 × 10 ms per byte. We alltoallv,word message in units of bytes. On the K computer, T p2p,startup can see that the actual performance of MPI_Alltoallv on −7 is 0.0101 ms and T is 2.11 × 10 ms per byte. For p2p,word the K computer is quite good. a short message, there is a rather big discrepancy between the analytic model and measured points, because for short 5.2 Domain decomposition messages the K computer used several different algorithms. Figure 13 shows the elapsed times for MPI_Alltoallv. For the hierarchical domain decomposition method The number of processes, n , is 32 to 2048. They are again described in subsection 3.1, the calculation time is modeled by the simple form expressed as T = T + n T , (10) T = T + T + T + T , (13) alltoallv alltoallv,startup word alltoallv,word dc dc,gather dc,sort dc,exch dc,misc T [ms] T [ms] alltoallv p2p T [ms] alltoallv Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-16 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 where T is the time for the (i, 0, 0) process to collect dc,gather sample particles, T is the time to sort sample particles dc,sort on the (i, 0, 0) process, T is the time to exchange dc,exch -2 particles after the new domains are determined, and T 10 dc,misc is the time for remaining procedures such as initial exchange of samples in the x-direction, exchange of sample particles and domain boundaries in the x-direction, and broadcasting of the domain boundaries in the y–z plane. On the machines we tested, T and T are much dc,gather dc,misc smaller than T and T . Therefore we consider these -3 dc,sort dc,exch 2 3 two terms only. 10 10 First, we consider the time to sort sample particles. Since we use the quick sort, the term T is expressed as dc,sort Fig. 15. Measured T and its analytical model as a function of n ,in dc,sort p T = τ 2n n n log(n n n ) dc,sort sort smp y z smp y z thecaseof n = 500 and n ∼ 5.3 × 10 . smp 2/3 + n n n log(n n ) ∼ τ n n , (14) y smp z smp z dc,sort smp 0.01 where n is the average number of sample particles per smp process, and n , n and n are the numbers of processes x y z in the x, y and z directions, respectively. Here, τ ∼ dc,sort 3 5/3 log(n n )τ . The first term expresses the time to sort sort smp p samples in the y–z plane with respect to the x and y direc- tions. The second term expresses that time to sort samples with respect to the z-direction. In order to model T , we need to model the number dc,exch of particles which moves from one domain to another. This 0.002 number would depend on various factors, in particular the 2 3 10 10 nature of the system we consider. For example, if we are calculating the early phase of the cosmological structure formation, particles do not move much in a single timestep, Fig. 16. Measured T and its analytical model as a function of n ,in dc,exch and thus the number of particles moved between domains thecaseof n = 500 and n ∼ 5.3 × 10 . smp is small. On the other hand, if we are calculating a single virialized self-gravitating system, particles move relatively large distances (comparable to the average interparticle dis- tance) in a single timestep. In this case, if one process con- tains n particles, half of the particles in the “surface” of the domain might migrate in and out of the domain. Thus, -2 2/3 O(n ) particles could be exchanged in this case. 10 Figures 15, 16,and 17 show the elapsed time for sorting samples, exchanging samples, and domain decomposition for the case of disk galaxy simulations with n = 500 and smp n ∼ 5.3 × 10 . We also plot the analytical models given 2 3 10 10 by T ∼ T + T dc dc,sort dc,exch Fig. 17. Measured T and its analytical model as a function of n ,inthe 2/3 2/3 dc p = τ n n + τ σt/r n b , (15) dc,sort smp dc,exch p case of n = 500 and n ∼ 5.3 × 10 . smp where τ and τ are the execution time for sorting dc,sort dc,exch one particle and for exchanging one particle, respectively, In figure 16, for small n , the analytical model gives a σ is the typical velocity of particles, t is the timestep, and value about two times smaller than the measured point. r is the average interparticle distance. For simplicity we This is because the measured values include not only the ignore the weak log term in T . On the K computer, time to exchange particles but also the time to determine dc,sort −7 −7 τ = 2.67 × 10 sand τ = 1.42 × 10 s per byte. appropriate processes to send particles, while the analytical dc,sort dc,exch Note that τ ∼ 672 T . model includes only the time to exchange particles. For dc,exch p2p,word T [s] T [s] T [s] dc dc,exch dc,sort Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-17 -1 small n , the time to determine the appropriate process is long not negligible, and therefore the analytic model gives an short underestimation. The analysis above, however, indicates that T is, dc,exch even when it is relatively large, still much smaller than T , exch which is the time to exchange particles and superparticles for interaction calculations (see subsection 5.4). 5.3 Tree construction -2 Theoretically, the cost of tree construction is O(nlogn), and 5 is of the same order as the interaction calculation itself. exch,list However, in our current implementation, the interaction calculation is much more expensive, independent of target Fig. 18. Time for the construction of the exchange list plotted against the average length of the list, for the case of n = 2048 and n ∼ 2.7 × architecture and the type of the interaction. Thus, we ignore 5 5 6 6 10 ,5.3 × 10 ,1.1 × 10 ,and2.1 × 10 . Circles and squares indicate the time for the tree constructions. the results for long-range and short-range force, respectively. Solid and dashed curves are analytic models [equation (17)]. (Color online) 5.4 Exchange of particles and superparticles long short For the exchange of particles and superparticles, in the cur- rent implementation of FDPS, first each node constructs the -1 list of particles and superparticles (hereafter the exchange 10 list) to be sent to all other nodes, and then data are exchanged through a single call to MPI_Alltoallv.Theway the exchange list is constructed depends on the force cal- culation mode. In the case of long-range forces, usual tree -2 traversal with a fixed opening angle θ is performed. For the short-range forces, the procedure used depends on the sub- types of the interaction. In the case of fixed or j-dependent n /n exch,list p cutoff, the exchange list for a node can be constructed by a single traversal of the local tree. On the other hand, for Fig. 19. Time for the communication of the exchange list against the i-dependent or symmetric cutoff, first each node constructs average length of the list per process, for the case of n = 2048 and the j-dependent exchange lists and sends them to all other 5 5 6 6 n ∼ 2.7 × 10 ,5.3 × 10 ,1.1 × 10 ,and 2.1 × 10 . Circles and squares indicate the results for long-range and short-range force, respectively. nodes. Each node then constructs the i-dependent exchange The curve is predicted from equation (17). (Color online) lists and sends them again. The time for the construction and exchange of exchange list is thus given by T (n ) = T n /n b , (18) exch,comm msg alltoallv exch,list p p where n is the average length of the exchange list exch,list T = k (T + T ). (16) exch type exch,const exch,comm and τ is the execution time for constructing one exch,const exchange list. Figures 18 and 19 show the execution time Here, k is a coefficient that is unity for fixed and type for constructing and exchanging the exchange list against j-dependent cutoffs and two for other cutoffs. Strictly the average length of the list. Here, b = 48 bytes for both speaking, the communication cost does not double for i- short- and long-range interactions. From figure 18,wecan dependent or symmetric cutoffs, since we send only par- see that the elapsed time can be fitted well by equation (17). ticles which were not sent in the first step. However, for −7 Here τ is 1.12 × 10 s for long-range interaction exch,const simplicity we use k = 2 for both calculation and communi- −7 and 2.31 × 10 s for short-range interaction. cation. From figure 19,wecan seealargediscrepancy The two terms in equation (16) are then approximated between measured points and the curves predicted from as equation (18). In the measurement of the performance of MPI_Alltoallv in subsection 5.1, we used a uniform mes- T = τ n , (17) exch,const exch,const exch,list sage length across all processes. In actual use in exchange T [s] T [s] exch,comm exch,const Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-18 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 long long short short -1 7 8 10 10 nn /n icalc,list grp Fig. 21. Time for the construction of the interaction list for long-range force (circles) and short-range force (squares), for the case of n ∼ 5.3 Fig. 20. Average length of the exchange lists for long-range interaction × 10 and θ = 0.4. Solid and dashed curves are the analytical models (circles) and for short-range interaction (squares) as a function of n,in for long-rage and short range forces, respectively [equation (22)]. (Color the case of θ = 0.4 and n = 2048. Solid and dashed curves are predicted online) from equations (19) and (20), respectively. (Color online) 5.5 Tree traverse and interaction calculation particles, the length of the message is not uniform. Neigh- The time for the force calculation is given by boring processes generally exchange large messages, while distant processes exchange short messages. For such cases, T = T + T , (21) icalc icalc,force icalc,const theoretically, communication speed measured in terms of average message length should be faster. In practice, how- where T and T are the time for the force cal- icalc,force icalc,const ever, we observed a serious degradation of performance. culations for all particles and the tree traverses for all inter- This degradation seems to imply that the current imple- action lists, respectively. mentation of MPI_Alltoallv is suboptimal for non-uniform T and T are expressed as icalc,force icalc,const message size. T = τ nn /n , (22) icalc,const icalc,const icalc,list grp In the following, we estimate n . If we consider a exch,list rather idealized case, in which all domains are cubes con- T = τ nn , (23) icalc,force icalc,force icalc,list taining n particles, the total length of the exchange lists for where n is the average length of the interaction list, icalc,list one domain can approximately be given by n is the number of i-particle groups for modified tree grp algorithms from Barnes (1990), and τ and τ 2/3 1/3 icalc,force icalc,const 14n 21πn n ∼ + exch,list are the time for one force calculation and for constructing θ θ one interaction list. In figure 21, we plot the time for the 28π θ 1/3 1/3 + log nn − n (19) construction of the interaction list as a function of n .On 2 p grp 3θ 2.8 −8 the K computer, τ is 3.72 × 10 s for the long-range icalc,const −8 force and 6.59 × 10 s for the short-range force. for the case of long-range interactions, and The length of the interaction list is given by 2/3 1/3 14n 21πn grp grp cut n ∼ n + + 1/3 icalc,list grp n ∼ n − 1 + 2 − n (20) exch,list θ θ 28π θ 1/3 1/3 + log (nn ) − n (24) 2 p grp 3θ 2.8 for the case of short-range interactions, where r is the cut for the case of long-range interactions and average cutoff length and and r is the average interparticle distance. In this subsection we set r so that the number cut cut 1/3 n ∼ n − 1 + 2 (25) of particles in the neighbor sphere is 100. In other words, icalc,list grp r ∼ 3r. cut In figure 20, we plot the list length for short and long for the case of short-range interactions. interactions against the average number of particles. The In figure 22, we plot the length of the interaction lists for rough estimate of equations (19) and (20) agree very well the long-range and short-range forces. We can see that the with the measurements. length of the interaction lists can be fitted reasonably well. exch,list T [s] icalc,const Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-19 Table 2. Time coefficients in equation (27) long for the K computer. short −6 τ [s] 1.66 × 10 alltoallv,startup −1 −10 τ [s byte]1.11 × 10 alltoallv,word −7 τ [s] 2.67 × 10 dc,sort −7 τ [s] 1.12 × 10 10 exch,const −8 τ [s] 3.72 × 10 icalc,const −10 τ [s] 3.05 × 10 icalc,force τ is the value for gravity. icalc,force 1 2 10 10 grp Fig. 22. Average length of the interaction list for long-range force (cir- cles) and short-range force (squares), for the case of n ∼ 5.3 × 10 and θ = 0.4. Solid and dashed curves are analytical models for long-range (equation 24) and short-range (equation 25) forces. (Color online) n =8 grp n =64 grp n =512 grp -9 Fig. 24. Breakdown of the total time of the calculation per timestep against n ,for thecaseof N = 5.5 × 10 , n = 500, θ = 0.4, and p smp n = 130. (Color online) grp 1 2 3 4 10 10 10 10 2/3 ∼ τ n n /n dc,sort smp dc icalc,list + k τ n + τ n type exch,const exch,list alltoallv,startup p Fig. 23. Time for the evaluation of one gravity force against n for icalc,list various n . (Color online) grp 1/3 + τ n b n alltoallv,word exch,list p In the following, we discuss the time for the force cal- + τ nn icalc,force icalc,list culation. The time for the force calculation for one particle + τ nn /n . (27) icalc,const icalc,list grp pair τ has different values for different kinds of inter- icalc,force actions. We plot τ against n for various n in icalc,force icalc,list grp The time coefficients in equation (27) for the K computer figure 23. We can see that for larger n , τ becomes grp icalc,force are summarized in table 2.Inthissectionweuse n = 1. dc smaller. However, from equation (24), large n leads to grp To see if the time predicted by equation (27) is reason- large n and the number of interactions becomes larger. icalc,list able, we compare the predicted time and the time obtained Thus, there is an optimal n . In our disk galaxy simulations grp from the disk galaxy simulation with the total number of using the K computer, the optimal n is a few hundreds, grp particles (N) of 550 million and θ = 0.4. In our simulations, and dependence on n is weak. grp we use up to the quadrupole moment. On the other hand, we assume the monopole moment only in equation (27). Thus, we have to correct the time for the force calculation 5.6 Total time in equation (27). In our simulations, the cost of the force Now we can predict the total time of the calculation using calculation of the quadrupole moment is two times higher the above discussions. The total time per timestep is given than that of the monopole moment and about 75% of par- by ticles in the interactions list are superparticles. Thus the cost of the force calculation in the simulation is 75% higher than T ∼ T /n + k (T + T ) step dc,sort dc type exch,const exch,comm the prediction. We apply this correction to equation (27). In + T + T (26) figure 24, we plot the breakdown of the predicted time with icalc,force icalc,const τ [s] icalc,list icalc,force Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-20 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 Fig. 26. Same as figure 25, but for the floating-point operation perfor- Fig. 25. Breakdown of the total time of the calculation per timestep mance 10 times faster than the K computer. (Color online) 9 7 against n , for the case of N = 10 (top panel) and =10 (bottom panel), n = 500, θ = 0.4, and n = 300. (Color online) smp grp computer, but τ , τ , τ ,and τ are dc,sort exch,const icalc,const icalc,force 10 times smaller than those of the K computer. We can the correction and the obtained time from the disk galaxy see that the optimal n is shifted to smaller n for both p p simulations. We can see that our predicted times agree with cases of N of 1 billion and 10 million, because T exch,comm the measurements very well. is unchanged. However, the shortest time per timestep is In the following, we analyze the performance of the improved by about a factor of five. If the network perfor- gravitational many-body simulations for various hypothet- mance is also improved by a factor of 10, we would get ical computers. In figure 25, we plot the breakdown of the same factor of 10 improvement for the shortest time the calculation time predicted using equation (27) for the per timestep. In other words, by reducing the network per- cases of 1 billion and 10 million particles against n . For formance by a factor of 10, we suffer only a factor of two the case of 1 billion particles, we can see that the slope of degradation of the shortest time. T becomes shallower for n  10000 and increases for In figure 27, we plot the predicted T for three hypo- step p step n  30000, because T dominates. Note that T thetical computers and the K computer. Two of these p dc,sort exch,comm also has the minimum value. The reason is as follows. computers, the K computer and X10, are the same com- For small n , T is dominant in T and puter models we used above. Another is a computer with p alltoallv,word exch,comm it decrease as n increases, because the length of n a floating-point operation performance that is 100 times p exch,list becomes smaller. For large n , T becomes domi- faster than the K computer (hereafter X100). The last one p alltoallv,startup nant and it increases linearly. We can see the same tendency is a computer which can perform the force calculation for the case of 10 million particles. However, the optimal 10 times faster than the K computer (hereafter ACL). In n ,atwhich T is the minimum, is smaller than that for other words, only τ is 10 times smaller than that p step icalc,force 1 billion particles, because T is independent of N. of the K computer. This computer is roughly mimicking a dc,sort In figure 26, we plot the breakdown of the predicted computer with an accelerator, such as GPU (Hamada et al. calculation time for a hypothetical computer which has a 2009b), GRAPE (Sugimoto et al. 1990; Makino et al. 2003), floating-point operation performance that is 10 times faster and PEZY-SC. Here we use the optimal n ,atwhich T grp step than that of the K computer (hereafter X10). In other words, is at the minimum, for each computer. For the case of τ and τ are the same as those of the K N = 10 , the optimal n ∼ 300 for the K computer and alltoallv,startup alltoallv,word grp Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 54-21 From figures 25 and 26, we can see that for large n , performance will be limited by T and T . There- dc,sort exch,comm fore, if we can reduce them further, we can improve the efficiency of the calculation with large n . It is possible to reduce the time for sorting by applying the algorithm used in the x-direction to y-direction as well, or setting n to dc more than unity. It is more difficult to reduce T , exch,comm since we are using the system-provided MPI_Alltoallv. 6 Conclusion In this paper, we present the basic idea, implementation, measured performance, and performance model of FDPS, a framework for developing efficient, parallel particle-based simulation codes. FDPS provides all of these necessary func- tions for the parallel execution of particle-based simula- tions. By using FDPS, researchers can easily develop their programs which run on large-scale parallel supercomputers. For example, a simple gravitational N-body program can be written in around 120 lines. We implemented three astrophysical applications using FDPS and measured their performances. All applications showed good performance and scalability. In the case of the disk galaxy simulation, the achieved efficiency is around Fig. 27. Predicted total calculation time for three hypothetical computers 50% of the theoretical peak, 7% for the cosmological sim- and the K computer as a function of n ,for thecaseof n = 500 and p smp ulation, and 40% for the giant impact simulation. θ = 0.4. Top and bottom panels indicate the results of the case for 12 9 N = 10 and N = 10 , respectively. (Color online) We constructed the performance model of FDPS and analyzed the performance of applications using FDPS. We found that the performance for a small number of parti- X10, ∼400 for X100, and ∼1600 for ACL. For the case of cles would be limited by the time for the calculation nec- N = 10 , the optimal n for K, X10, and X100 is the same essary for the domain decomposition and communication grp as those for N = 10 , but ∼1800 for ACL. The optimal necessary for the interaction calculation. value of n for ACL is larger than those of any other grp computers, because large n reduces the cost of the con- grp Acknowledgement struction of the interaction list. From figure 27, we can see that for small n , X10, and We thank M. Fujii for providing initial conditions of spiral sim- ulations, T. Ishiyama for providing his Particle Mesh code, K. X100 are 10 and 100 times faster than the K computer, Yoshikawa for providing his TreePM code and Y. Maruyama for respectively. However, for the case of N = 10 , T of step being the first user of FDPS. We are grateful to M. Tsubouchi for her the values of X10 and X100 increase for n  15000 and help in managing the FDPS development team. This research used 7000, because the T becomes the bottleneck. ACL exch,comm computational resources of the K computer provided by the RIKEN shows a similar performance to that of X10 up to optimal Advanced Institute for Computational Science through the HPCI System Research project (Project ID:ra000008). Part of the research n , because the force calculation is dominant in the total covered in this paper research was funded by MEXT’s program for calculation time. On the other hand, for large n , the perfor- the Development and Improvement for the Next Generation Ultra mance of ACL is almost the same as that of the K computer, High-Speed Computer System, under its Subsidies for Operating the because ACL has the same bottleneck as the K computer Specific Advanced Large Research Facilities. Numerical computa- has, which is the communication of the exchange list. On tions were in part carried out on Cray XC30 at Center for Compu- the other hand, for the case of N = 10 , T is scaled up tational Astrophysics, National Astronomical Observatory of Japan. step to n ∼ 10 for all computers. This is because for larger N simulations, the costs of the force calculation and the con- References struction of the interaction list are relatively high compared to those of the communication of the exchange list. Thus Abrahama, M. J., Murtolad, T., Schulz, R., Palla, S., Smith, J., Hessa, B., & Lindahl, E. 2015, SoftwareX, 1, 19 the optimal n is larger value if we use larger N. p Downloaded from https://academic.oup.com/pasj/article/68/4/54/2223184 by DeepDyve user on 15 July 2022 54-22 Publications of the Astronomical Society of Japan (2016), Vol. 68, No. 4 Asphaug, E., & Reufer, A. 2014, Nature Geoscience, 7, 564 Hamada, T., Nitadori, K., Benkrid, K., Ohno, Y., Morimoto, G., Bagla, J. S. 2002, J. Astrophys. Astron., 23, 185 Masada, T., Shibata, Y., Oguri, K., & Taiji, M. 2009b, Comp. Balsara, D. S. 1995, J. Comput. Phys., 121, 357 Sci.-Res. Development, 24, 21 Barnes, J. 1990, J. Comput. Phys., 87, 161 Hartmann, W. K., & Davis, D. R. 1975, Icarus, 24, 504 Barnes, J., & Hut, P. 1986, Nature, 324, 446 Hernquist, L. 1990, ApJ, 356, 359 Bedorf, ´ J., Gaburov, E., Fujii, M., Nitadori, K., Ishiyama, T., & Hockney, R. W., & Eastwood, J. W. 1988, Computer Simulation Portegies Zwart, S. 2014, Proc. Int. Conf. for High Perfor- Using Particles (Boca Raton, FL: CRC Press) mance Computing, Networking, Storage and Analysis, ed. Ishiyama, T., Fukushige, T., & Makino, J. 2009, PASJ, 61, 1319 T. Damkroger & J. Dongarra (Piscataway, NJ: IEEE Press), Ishiyama, T., Nitadori, K., & Makino, J. 2012, Proc. International 54 Conference on High Performance Computing, Networking, Bedorf, ´ J., Gaburov, E., & Portegies Zwart, S. 2012, J. Comput. Storage and Analysis, ed. J. K. Hollingsworth (New York: Phys., 231, 2825 Association for Computing Machinery), 1161 Benz, W., Slattery, W. L., & Cameron, A. G. W. 1986, Icarus, 66, Iwasawa, M., Tanikawa, A., Hosono, N., Nitadori, K., 515 Muranushi, T., & Makino, J. 2015, WOLFHPC ’15 (New York: Blackston, D., & Suel, T. 1997, Proc. 1997 ACM/IEEE Conf. ACM), 1 on Supercomputing (New York: Association for Computing Navarro, J. F., Frenk, C. S., & White, S. D. M. 1996, ApJ, 462, 563 Machinery), doi:10.1145/509593.509597 Nitadori, K., Makino, J., & Hut, P. 2006, New Astron., 12, 169 Bode, P., Ostriker, J. P., & Xu, G. 2000, ApJS, 128, 561 Makino, J. 2004, PASJ, 56, 521 Brooks, B., et al. 2009, J. Comp. Chem., 30, 1545 Makino, J., Fukushige, T., Koga, M., & Namura, K. 2003, PASJ, Cameron, A. G. W., & Ward, W. R. 1976, Proc. 7th Lunar Science 55, 1163 Conf., Geochim. Cosmochim. Acta Suppl., 120 Monaghan, J. J. 1992, ARA&A, 30, 543 Canup, R. M., Barr, A. C., & Crawford, D. A. 2013, Icarus, 222, Monaghan, J. J. 1997, J. Comp. Phys., 136, 298 200 Murotani, K., et al. 2014, J. Adv. Simulation Sci. Eng., 1, 16 Case, D. A., et al. 2015, AMBER 2015 (San Francisco: University of Phillips, J., et al. 2005 J. Comp. Chem., 26, 1781 California) Plimpton, S. 1995, J. Comp. Phys., 117, 1 Dehnen, W. 2000, ApJ, 536, L39 Rosswog, S. 2009, New Astron. Rev., 53, 78 Dehnen, W., & Aly, H. 2012, MNRAS, 425, 1068 Salmon, J. K., & Warren, M. S. 1994, 111, 136 Dubinski, J. 1996, New Astron., 1, 133 Shaw, D. E., et al. 2014, Proc. Int. Conf. for High Performance Dubinski, J., Kim, J., Park, C., & Humble, R. 2004, New Astron., Computing, Networking, Storage and Analysis, ed. T. 9, 111 Damkroger & J. Dongarra (Piscataway, NJ: IEEE Press), 41 Fujii, M. S., Baba, J., Saitoh, T. R., Makino, J., Kokubo, E., & Springel, V. 2010, ARA&A, 48, 391 Wada, K. 2011, ApJ, 730, 109 Springel, V., et al. 2005, Nature, 435, 629 Gaburov, E., Harfst, S., & Portegies Zwart, S. 2009, New Astron., Sugimoto, D., et al. 1990, Nature, 345, 33 14, 630 Tanikawa, A., Yoshikawa, K., Nitadori, K., & Okamoto, T. 2013, Goodale, T., Allen, G., Lanfermann, G., Masso, ´ J., Radke, T., New Astron., 19, 74 Seidel, E., & Shalf, J. 2003, High Performance Computing for Tanikawa, A., Yoshikawa, K., Okamoto, T., & Nitadori, K. 2012, Computational Science — VECPAR 2002, ed. J. M. L. M. Palma New Astron., 17, 82 et al. (Berlin: Springer-Verlag), 197 Warren, M. S., & Salmon, J. K. 1995, Comp. Phys. Communications, Hamada, T., Narumi, T., Yokota, R., Yasuoka, K., Nitadori, K., 87, 266 & Taiji, M. 2009a, in Proc. 2009 Workshop on Component- Widrow, L. M., & Dubinski, J. 2005, ApJ, 631, 838 Based High Performance Computing, ed. R. M. Badia & N. Xu, G. 1995, ApJS, 98, 355 Wang (New York: Association for Computing Machinery), Yamada, T., Mitsume, N., Yoshimura, S., & Murotani, K. 2015, 778 Coupled Problems in Science and Engineering VI, ed. B. Schrefler Hamada, T., & Nitadori, K. 2010, Proc. ACM/IEEE Int. Conf. for et al. (Barcelona: International Center for Numerical Methods High Performance Computing, Networking, Storage and Anal- in Engineering) ysis (New York: Association for Computing Machinery) Yoshikawa, K., & Fukushige, T. 2005, PASJ, 57, 849

Journal

Publications of the Astronomical Society of JapanOxford University Press

Published: Aug 1, 2016

Keywords: dark matter; Galaxy: evolution; methods: numerical; planets and satellites: formation

There are no references for this article.