c++ - No speedup with OpenMP
I am working with OpenMP in order to obtain an algorithm with near-linear speedup. Unfortunately, I noticed that I do not get the desired speedup.
So, in order to understand the error in my code, I wrote another code, an easy one, just to double-check that the speedup is in principle obtainable on my hardware.
This is the toy example I wrote:
```cpp
#include <omp.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"

int main () {
    int number_of_threads = 1;
    int n = 600;
    int m = 50;
    int N = n/number_of_threads;   // outer iterations per thread
    int time_limit = 600;          // seconds
    double total_clock = omp_get_wtime();
    int time_flag = 0;             // shared: set once the time limit is reached

    #pragma omp parallel num_threads(number_of_threads)
    {
        int thread_id = omp_get_thread_num();
        int iteration_number_local = 0;
        // Each thread works on its own private arrays.
        double *C = new double[n]; std::fill(C, C+n, 3.0);
        double *D = new double[n]; std::fill(D, D+n, 3.0);
        double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

        while (time_flag == 0){
            for (int i = 0; i < N; i++)
                for(int z = 0; z < m; z++)
                    for(int x = 0; x < n; x++)
                        for(int c = 0; c < n; c++){
                            CD[c] = C[z]*D[x];
                            C[z] = CD[c] + D[x];
                        }
            iteration_number_local++;
            if ((omp_get_wtime() - total_clock) >= time_limit)
                time_flag = 1;
        }
        #pragma omp critical
        std::cout<<"I am "<<thread_id<<" and I got "<<iteration_number_local<<" iterations."<<std::endl;
    }
}
```
I want to highlight again that this code is only a toy example to try to see the speedup: the first for-cycle becomes shorter when the number of parallel threads increases (since its bound, n/number_of_threads, decreases).
However, when I go from 1 to 2-4 threads, the number of iterations doubles as expected; this is not the case when I use 8-10-20 threads: the number of iterations does not increase linearly with the number of threads.
Could you please help me with this? Is the code correct? Should I expect a near-linear speedup?
Results
Running the code above, I got the following results.
1 thread: 23 iterations.
20 threads: 397-401 iterations per thread (instead of 420-460).
Your measurement methodology is wrong, especially for a small number of iterations.
1 thread: 3 iterations.
3 reported iterations means that 2 iterations finished in less than 120 s. The third one took longer. The time of one iteration is between 40 and 60 s.
2 threads: 5 iterations per thread (instead of 6).
4 iterations finished in less than 120 s. The time of one iteration is between 24 and 30 s.
20 threads: 40-44 iterations per thread (instead of 60).
40 iterations finished in less than 120 s. The time of one iteration is between 2.9 and 3 s.
As you can see, your results do not contradict linear speedup.
It would be simpler and more accurate to execute and time one single outer loop; you should then see close to perfect linear speedup.
Some reasons (non-exhaustive) why you may not see linear speedup are:
- Memory-bound performance. This is not the case in your toy example with `n = 1000`. More generally speaking: contention for a shared resource (main memory, caches, I/O).
- Synchronization between threads (e.g. critical sections). This is not the case in your toy example.
- Load imbalance between threads. This is not the case in your toy example.
- Turbo mode will use lower frequencies when all cores are utilized. This can happen in your toy example.
From your toy example, I would say that your approach to OpenMP can be improved by better using the high-level abstractions, e.g. `for`.
More general advice would be too broad for this format and would require more specific information about the non-toy example.