espressomd-users

Re: Discussion: Switching Espresso to shared memory parallelization


From: Christoph Junghans
Subject: Re: Discussion: Switching Espresso to shared memory parallelization
Date: Wed, 14 Jul 2021 16:30:22 -0600



On Tue, Jul 13, 2021 at 21:33 Ulf D Schiller <uschill@clemson.edu> wrote:
Hi Rudolf and All,

I second Steffen's comments on weighing different aspects. My group is
currently not using ESPResSo for any large scale applications. Having
followed the development almost since its inception, I hope you
nevertheless allow me to add some considerations and perhaps clear up
some conflation in the arguments.

* When comparing paradigms, one has to be careful not to compare apples
with oranges. You correctly describe shared-memory and
distributed-memory parallelization as different paradigms, but also mix
it with a consideration of overheads/delays. It is certainly true that
MPI-based parallelization involves accessing and copying data between
processes. However, modern MPI implementations can handle intra-node
communication fairly efficiently -- both MPICH and OpenMPI can use
shared memory for message transfer and since MPI-3 there is also
explicit shared memory programming. Even for inter-node communication,
latency can typically be hidden by overlapping communication and
computation. I am not sure that one would find a substantial performance
difference between a semi-decent MPI code on a single node and its
OpenMP counterpart. An optimized version of the latter can in
principle be faster, but that would best be demonstrated with
benchmark data.
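
For illustration, here is a minimal sketch of an MPI-3 shared-memory
window (communicator name, sizes, and values are arbitrary):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        // Communicator containing only the ranks on this node.
        MPI_Comm node;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        int rank, size;
        MPI_Comm_rank(node, &rank);
        MPI_Comm_size(node, &size);

        // Each rank contributes one double to a node-local shared window.
        double *mine;
        MPI_Win win;
        MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                                MPI_INFO_NULL, node, &mine, &win);
        *mine = 100.0 + rank;
        MPI_Win_fence(0, win);  // make the stores visible to all ranks

        // Rank 0 reads rank 1's segment through a plain pointer --
        // no message passing involved.
        if (rank == 0 && size > 1) {
            MPI_Aint bytes;
            int disp;
            double *theirs;
            MPI_Win_shared_query(win, 1, &bytes, &disp, &theirs);
            std::printf("rank 1 wrote %g\n", *theirs);
        }

        MPI_Win_free(&win);
        MPI_Comm_free(&node);
        MPI_Finalize();
    }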

Apart from comparing the paradigms per-se, I would give some
consideration to programming models and tool stacks. MPI is arguably the
de-facto standard for distributed-memory parallelization, which brings
the advantage of composability and portability (and to a good extent
backwards compatibility). For shared-memory parallelization, the
landscape is more diverse with different toolchains depending on
architecture/vendor, and there is some uncertainty as to how that
landscape will evolve, say in the next decade.
One could use a performance portability layer like Kokkos, RAJA, or Cabana to get more shared-memory parallelization options with one code base, but that would add yet another dependency.

There are toy MD codes using Kokkos and Cabana here if you are interested:
Kokkos:
https://github.com/ECP-copa/ExaMiniMD
Cabana:
https://github.com/ECP-copa/CabanaMD
Both run on multiple backends, including OpenMP and CUDA.
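
For a flavor of the programming model, here is a minimal Kokkos sketch
(not taken from those codes; view and kernel names are made up). The
same parallel_for compiles to OpenMP or CUDA depending on the
configured backend:

    #include <Kokkos_Core.hpp>

    int main(int argc, char *argv[]) {
        Kokkos::initialize(argc, argv);
        {
            const int n = 1000;
            // A View lives in the default execution space's memory
            // (host for OpenMP, device for CUDA).
            Kokkos::View<double *> f("forces", n);
            Kokkos::parallel_for(
                "scale_forces", n,
                KOKKOS_LAMBDA(const int i) { f(i) = 0.5 * i; });
            Kokkos::fence();  // wait for the kernel before shutdown
        }
        Kokkos::finalize();
    }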


* "Adding new features to Espresso will be easier, because a lot of
non-trivial communication code does not have to be written."
"Writing and validating MPI-parallel code is difficult."

I beg to differ on these points. MPI parallelization does not have to be
that difficult to implement and test. In my experience, debugging
shared-memory/thread-level parallelism can quickly become far more
involved and cumbersome. Generally, I don't think the learning curve
for MPI is steeper than that for OpenMP when going beyond trivial
loop parallelization.

To give one example of extensibility, the original halo communication
scheme for the grid-based kernels in ESPResSo (P3M, MEMD, LB) was
designed to accommodate varying halo extents and data content, so adding
a new property was as simple as adding a field to the underlying
MPI_Datatype. I can see how things are a bit more involved for
particles, perhaps because the ghost particle scheme began to evolve at
the time of MPI-1 when derived datatypes weren't around. It might be
worth assessing whether this could be addressed by refactoring the ghost
communication and leveraging more modern MPI capabilities.
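
As a rough illustration of that pattern (the struct layout below is
hypothetical, not ESPResSo's actual particle type):

    #include <mpi.h>
    #include <cstddef>

    struct Particle {  // hypothetical layout
        double pos[3];
        double vel[3];
        int type;
    };

    // Describe the struct to MPI; adding a new per-particle property
    // means adding one entry to each of the three tables below.
    MPI_Datatype make_particle_type() {
        int lengths[3] = {3, 3, 1};
        MPI_Aint disps[3] = {offsetof(Particle, pos),
                             offsetof(Particle, vel),
                             offsetof(Particle, type)};
        MPI_Datatype types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};

        MPI_Datatype tmp, particle;
        MPI_Type_create_struct(3, lengths, disps, types, &tmp);
        // Fix the extent so arrays of Particle account for padding.
        MPI_Type_create_resized(tmp, 0, sizeof(Particle), &particle);
        MPI_Type_commit(&particle);
        MPI_Type_free(&tmp);
        return particle;
    }
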
Another option here would be to use something like HPX instead of MPI, which makes things quite a bit easier on the user.



* "The MPI and Boost::MPI dependencies complicate Espresso's
installation and make it virtually impossible to run Espresso on public
Python platforms such as Azure Notebooks or Google Colab, as well as
to build it natively on Windows."

Hmm... I'm not sure I fully understand. I am regularly building other
software packages with MPI and Boost dependencies on various systems; I
am not a fan of cmake superbuilds but they seem to work reasonably well
in case of conflicting dependencies with system-wide installed
libraries. At any rate, this type of issue can occur with any other
library and is not an issue of MPI itself.
FYI, ESPResSo is part of the Spack package manager, which I use these days to build it and all its dependencies on my HPC clusters.



As for Azure and Colab, my personal take is that those are not really
platforms I desire to run a scientific computing application on. I would
rather look to federated HPC clouds where academic stakeholders might
have a say in future developments.

* "Assuming that one million time steps per day is acceptable"

For this assumption to be meaningful, one has to know what one million
time steps mean physically. And not just how many (pico, nano, micro,
...)-seconds, but what does it mean in terms of the characteristic
relaxation time? With a typical coarse-grained time step of 0.01
reduced time units, a million steps cover only 10^4 time units, while
relaxation times in soft matter, e.g. of entangled polymer melts, can
be orders of magnitude longer. Given that range, I'm afraid it's
really tough to find a general notion of what is acceptable. I think
that one can find many examples that are out of the question to
address on a single node.

As Steffen already pointed out, if ESPResSo becomes a single-node code,
the availability of HPC resources will be severely limited if not
completely axed. I would add a consideration regarding potential funding
sources: The major funding agencies including ERC are committed to
exascale computing and substantial amounts of money are pumped into
hardware/software co-design to demonstrate capability at scale. This is not
necessarily science driven, whether you like it or not, but a code that
cannot run across nodes is unlikely to be considered meritorious. I also
think ESPResSo would be less likely to attract users, as people tend to
select packages with the "bigger" set of features regardless of
whether they actually need them or not.

So, eventually it boils down to the question of who your target users are
(mostly ICP or broader community) and whether you want ESPResSo to play
a role in the HPC ecosystem. I hope that my comments do not come across
as too opinionated; I recognize that there are many more factors to
consider and hopefully my outside perspective will help you navigate the
decision-making process.
Yeah, I feel there are already many specialized shared-memory MD codes out there, and ESPResSo would go under in that space.

Christoph 



Best regards,
Ulf

On 7/13/21 2:23 AM, Rudolf Weeber wrote:
> Hi Steffen,
>
> thank you for the detailed feedback and the points you raise.
>
>> 1) You asked about scenarios that need MPI parallelism. At an IPVS+ITV (U. Stuttgart) collaboration, we perform large simulations
>> that subject particles to a background flow field. Within that flow field, we want to include multiple scales of turbulence.
>> These simulations need MPI parallelism as they have millions of particles and millions of bonds. This data simply does not fit into the RAM of a single node.
> This project was, to my knowledge, by a very wide margin the biggest system ever run with ESPResSo. The systems for which ESPResSo is typically used at the ICP are, to my understanding, all below 100k particles.
>
> Those of you running big systems, please speak up, so we are aware of it.
> Also, if you would like to run bigger systems but don't due to performance issues, that would be of interest.
>
>
>> 2) You talk about "HPC nodes" having about 20-64 cores. This is certainly true. I just want to make the remark that with a shared-memory parallelization there will be no more HPC nodes for ESPResSo users. When applying for runtime at an HPC center, you have to give details about the parallelization and the scalability of your code. If you run on one node only, they will most likely turn you down, and you are left with your local workstation.
> I should have written "node on a cluster". Espresso simulations would typically go to tier-3 systems (i.e., university or regional clusters). But you are right that removing the ability to run bigger systems, and therefore no longer being seen as part of the HPC community, may well be an issue.
>
>> 3) While I see your point that the current MPI parallelization might not be the easiest to understand and roll out, I want to make it clear that devising a well-performing shared-memory parallelization is not a trivial matter either. "Sprinkling in" a couple of "#pragma omp parallel for" directives will certainly not be enough. As with the distributed-memory parallelization, you will have to devise a spatial domain decomposition and come up with a workload distribution between the threads. You will have to know which threads import data from others and devise locking mechanisms to guard these accesses. Reasoning about this code and debugging it might turn out to be as hard as for the MPI-based code. If you want to go down this path, I strongly suggest not reinventing the wheel and taking a look at, e.g., the AutoPas [1] project.
> This is a valid point. Before we make any decision, we will definitely have some sort of technical preview/prototype to see what can be achieved at acceptable levels of complexity and performance.
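>
> To make Steffen's point concrete, here is a minimal sketch (toy
> one-dimensional force, hypothetical names) of why "sprinkling in" the
> pragma alone is not enough: two bonds sharing a particle make
> different threads write to the same force entry, so without the
> atomics below, the loop is a data race.
>
>     #include <cstddef>
>     #include <cstdio>
>     #include <vector>
>
>     struct Bond { int i, j; };  // hypothetical minimal bond record
>
>     void bond_forces(const std::vector<Bond> &bonds,
>                      const std::vector<double> &x,
>                      std::vector<double> &f) {
>         #pragma omp parallel for
>         for (std::size_t b = 0; b < bonds.size(); ++b) {
>             // toy harmonic force, spring constant 1
>             const double fij = x[bonds[b].j] - x[bonds[b].i];
>             #pragma omp atomic
>             f[bonds[b].i] += fij;  // atomics make the concurrent
>             #pragma omp atomic
>             f[bonds[b].j] -= fij;  // updates safe, but they cost
>         }
>     }
>
>     int main() {
>         std::vector<Bond> bonds = {{0, 1}, {1, 2}};  // particle 1 shared
>         std::vector<double> x = {0.0, 1.5, 3.0}, f(3, 0.0);
>         bond_forces(bonds, x, f);
>         std::printf("f = %g %g %g\n", f[0], f[1], f[2]);
>     }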
>
> In my personal opinion, with Espresso, the aim is for ease of extensibility rather than for best performance.
> How do other people in the community see this?
>
>
> Some areas where I would hope for simplifications in a purely shared memory code:
>
> * Relation between Python and C++ objects: In a shared-memory code, the Python object could directly own the core object. In an MPI simulation, an intermediate layer creates and manages mirror objects on the remote processes.
> * Due to the intermediate layer and the mirror objects, checkpointing and restoring a simulation is very difficult. We have currently disabled it for certain features.
> * We may not need the custom 3D FFT in the electrostatic and dipolar P3M (currently about 1000 lines of code), or could replace it with PFFT. To my understanding, there is a thread-parallel drop-in replacement for FFTW (see the sketch after this list).
> * Bonds and virtual sites with a range much larger than the Lennard-Jones cutoff would not force a larger cell size (thereby slowing the short-range calculation)
> * We would probably not need most of the ghost communication code (about 600 lines) as cells across boundaries can be linked directly.
> * We would probably not need most of the parallel callback and particle setup code (about 1500 lines of code, plus some 90 callbacks scattered throughout the codebase).
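>
> For reference, FFTW's own threaded mode is one such drop-in option; a
> sketch (grid size arbitrary; link against -lfftw3_threads -lfftw3):
>
>     #include <fftw3.h>
>
>     int main() {
>         fftw_init_threads();          // once, before creating plans
>         fftw_plan_with_nthreads(8);   // subsequent plans use 8 threads
>
>         const int n = 128;
>         fftw_complex *data = fftw_alloc_complex(n * n * n);
>         fftw_plan plan = fftw_plan_dft_3d(n, n, n, data, data,
>                                           FFTW_FORWARD, FFTW_ESTIMATE);
>         fftw_execute(plan);           // in-place 3D transform
>
>         fftw_destroy_plan(plan);
>         fftw_free(data);
>         fftw_cleanup_threads();
>     }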
>
> Of course, all of this would need to be investigated in more detail before a decision is made.
>

>> One particular problem that I encountered in the past and that I want to briefly mention here is bonds: they are only stored on one of the two (or more) involved particles. This is one of the reasons why ESPResSo currently needs to communicate the forces back after calculating them, and you will certainly need measures that deal with this circumstance in a shared-memory parallel code. Such details will increase the complexity of a shared-memory parallel code, and it might end up not being easy for newcomers to understand or to extend with new features.
> I agree. Although, in a shared-memory code, there is the option to run some things serially and still get the benefit of the parallelized short-range and bond loops and electrostatics.
>
> You are right. The bond storage will almost certainly have to be changed. Otherwise the bond loop cannot be executed in parallel without requiring all accesses to particle forces to be atomic.
> If we stay with MPI, as you point out, this would eliminate one ghost communication per time step.
> By now, the bond storage has been abstracted somewhat, so this change is probably doable now.
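>
> One way around the atomics would be thread-private force buffers that
> are summed once at the end; a sketch with hypothetical names, using
> an OpenMP 4.5 array reduction:
>
>     #include <cstddef>
>     #include <vector>
>
>     struct Bond { int i, j; };  // hypothetical minimal bond record
>
>     void bond_forces(const std::vector<Bond> &bonds,
>                      const std::vector<double> &x,
>                      double *f, std::size_t n) {
>         // Each thread accumulates into its own private copy of
>         // f[0..n); the copies are reduced after the loop, so no
>         // per-write atomics are needed.
>         #pragma omp parallel for reduction(+ : f[0:n])
>         for (std::size_t b = 0; b < bonds.size(); ++b) {
>             const double fij = x[bonds[b].j] - x[bonds[b].i];  // toy force
>             f[bonds[b].i] += fij;
>             f[bonds[b].j] -= fij;
>         }
>     }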
>
> Thank you for pointing out AutoPas. Changing particle storage to struct-of-arrays would be extremely beneficial for performance.
>
>
> Thank you again for sharing your thoughts! Hearing different points of view is very important for us to make good decisions on Espresso development.
>
> There is certainly the need for further discussion and experimentation before we decide on the future parallelization paradigm.
> The purpose of my post was to get an idea of how many use cases there actually are for very big systems, in the hope that it might help us to direct our (limited) resources to where they are most needed.
>
>
> Regards, Rudolf

--
ULF D. SCHILLER
ASSISTANT PROFESSOR, MATERIALS SCIENCE AND ENGINEERING
College of Engineering, Computing and Applied Sciences
https://www.clemson.edu/cecas/
Clemson University

299C Sirrine Hall
Clemson, SC 29634
o 864-656-2669
uschill@clemson.edu
https://www.clemson.edu/
https://cecas.clemson.edu/compmat/
--
Christoph Junghans
Web: http://www.compphys.de
