I am writing a program to run electrical simulations. Each simulation runs serially (no internal threading), so that threads can instead be used to run thousands of simulations in parallel.
I am testing the program on 3 machines:
- Windows machine (Windows 10, Intel i7, 4 GHz, 32 GB RAM) -> (MSVC + Intel MKL)
- Linux machine (Ubuntu 20.04, Ryzen 5, 3.4 GHz, 24 GB RAM) -> (GCC + Intel MKL)
- macOS machine (MacBook Air, M1, 8 GB RAM) -> (Clang + Apple Accelerate)
When compiling, the only difference between the platforms is that I need to link against different OS-specific implementations of BLAS and LAPACK.
I have declared the following parallel-for function template:
#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

template <typename Index, typename Callable>
static void ParallelFor(Index start, Index end, Callable func) {
    // Estimate number of threads in the pool
    const static unsigned nb_threads_hint = std::thread::hardware_concurrency();
    const static unsigned nb_threads = (nb_threads_hint == 0u ? 8u : nb_threads_hint);
    // Size of a slice for the range functions; the range is half-open: [start, end)
    Index n = end - start;
    Index slice = (Index) std::round(n / static_cast<double>(nb_threads));
    slice = std::max(slice, Index(1));
    // [Helper] Inner loop: calls func(k) for every k in [k1, k2)
    auto launchRange = [&func] (Index k1, Index k2) {
        for (Index k = k1; k < k2; k++) {
            func(k);
        }
    };
    // Create pool and launch jobs
    std::vector<std::thread> pool;
    pool.reserve(nb_threads);
    Index i1 = start;
    Index i2 = std::min(start + slice, end);
    for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
        pool.emplace_back(launchRange, i1, i2);
        i1 = i2;
        i2 = std::min(i2 + slice, end);
    }
    // The last thread takes whatever is left of the range
    if (i1 < end) {
        pool.emplace_back(launchRange, i1, end);
    }
    // Wait for jobs to finish
    for (std::thread &t : pool) {
        if (t.joinable()) {
            t.join();
        }
    }
}
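To see how the function above actually distributes the work, here is a minimal sketch that reproduces only its slicing arithmetic and returns the ranges instead of launching threads. `slice_bounds` is a hypothetical helper that is not part of my program, and it assumes a half-open range [start, end), so that n = end - start.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical helper (illustration only): returns the half-open [i1, i2)
// ranges that the slicing logic of ParallelFor would hand to each thread.
std::vector<std::pair<int, int>> slice_bounds(int start, int end, unsigned nb_threads) {
    assert(start <= end && nb_threads > 0);
    std::vector<std::pair<int, int>> out;
    const int n = end - start;  // assumption: the range is [start, end)
    int slice = static_cast<int>(std::round(n / static_cast<double>(nb_threads)));
    slice = std::max(slice, 1);
    int i1 = start;
    int i2 = std::min(start + slice, end);
    // All threads but the last get a slice of fixed size
    for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
        out.emplace_back(i1, i2);
        i1 = i2;
        i2 = std::min(i2 + slice, end);
    }
    // The last thread takes whatever remains
    if (i1 < end) {
        out.emplace_back(i1, end);
    }
    return out;
}
```

With 11 chunks and 8 threads, this sketch yields seven one-index ranges followed by [7, 11), so the last thread gets four times the work of the others. With 12 chunks, the slice rounds up to 2 and only six threads are started. The rounding can therefore leave the split uneven on any platform, depending on how many chunks `prepareParallelChunks` returns.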
And I use it like this:
// prepare the execution plan
std::vector<std::pair<uword, uword>> bounds = prepareParallelChunks((uword) islands.size());
// run in parallel
ParallelFor(0, (int) bounds.size(), [&] (uword k) {
    // run chunk
    run_simulation(islands, res, *_options, bounds[k].first, bounds[k].second);
});
This last snippet is just to show the usage. In principle, if I have 8 logical cores, there will be 8 threads.
When executing the program for 2000 simulations, the results are:
- Windows: logical cores at about 20%, simulation time: 145 s.
- Linux: logical cores at about 100%, simulation time: 35 s.
- macOS: logical cores at about 100%, simulation time: 35 s.
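One way to narrow this down would be to time a pure compute kernel with the same one-thread-per-slice pattern, with no heap allocation and no BLAS/LAPACK calls. The following is a minimal sketch under that assumption; `busy_work`, `ScalingResult` and `run_busy_threads` are hypothetical names that are not part of my program.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Pure compute kernel: no heap allocation, no BLAS/LAPACK, no I/O.
std::uint64_t busy_work(std::uint64_t seed, std::uint64_t iters) {
    std::uint64_t x = seed;
    for (std::uint64_t i = 0; i < iters; ++i) {
        // One step of a 64-bit linear congruential generator
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    }
    return x;
}

struct ScalingResult {
    double seconds;          // wall-clock time for all threads to finish
    std::uint64_t checksum;  // sum of per-thread results, keeps the work observable
};

// Runs busy_work on nb_threads threads at once and measures the wall time.
ScalingResult run_busy_threads(unsigned nb_threads, std::uint64_t iters) {
    assert(nb_threads > 0);
    std::vector<std::uint64_t> results(nb_threads, 0);
    std::vector<std::thread> pool;
    pool.reserve(nb_threads);
    const auto t0 = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < nb_threads; ++t) {
        pool.emplace_back([&results, t, iters] {
            results[t] = busy_work(t + 1, iters);
        });
    }
    for (std::thread &th : pool) {
        th.join();
    }
    const auto t1 = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (std::uint64_t r : results) {
        sum += r;
    }
    return {std::chrono::duration<double>(t1 - t0).count(), sum};
}
```

If the wall time on Windows stays roughly flat as `nb_threads` grows from 1 up to the number of cores, that would suggest the threading itself scales and that the low utilization comes from something inside the simulation rather than from the parallel-for.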
PS: I have also tried the C++17 standard parallel algorithms, and the same happens. I have tried OpenMP, and the same happens. I am using the referenced parallel-for implementation because it is C++11 compliant and runs on all 3 systems with no modification.
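For reference, the OpenMP variant is along the lines of the sketch below. `run_chunk` is a hypothetical stand-in for the call to `run_simulation` on one chunk, and `run_all_openmp` is likewise an illustrative name. Without the compiler's OpenMP flag the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for running one chunk of simulations.
inline long run_chunk(int k) {
    return static_cast<long>(k) * k;
}

// Runs all chunks with an OpenMP parallel loop and collects the results.
std::vector<long> run_all_openmp(int nb_chunks) {
    assert(nb_chunks >= 0);
    std::vector<long> out(nb_chunks);
    // Each iteration writes a distinct element, so no synchronization is needed.
    // schedule(dynamic) hands out chunks one at a time to whichever thread is free.
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < nb_chunks; ++k) {
        out[k] = run_chunk(k);
    }
    return out;
}
```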
