vectorization - Why doesn't this C vector loop auto-vectorise?
I am trying to optimise some code with the use of AVX intrinsics. A simple test case compiles, but gcc tells me that the loop was not vectorised, for a number of reasons that I don't understand.
This is the full program, simple.c:
    #include <math.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <immintrin.h>

    int main(void)
    {
      __m256 * x = (__m256 *) calloc(1024,sizeof(__m256));

      for (int j=0;j<32;j++)
        x[j] = _mm256_set1_ps(1.);

      return(0);
    }
This is the command line:
    gcc simple.c -O1 -fopenmp -ffast-math -lm -mavx2 -ftree-vectorize -fopt-info-vec-missed
This is the output:
- simple.c:11:3: note: not vectorized: unsupported data-type
- simple.c:11:3: note: can't determine vectorization factor.
- simple.c:6:5: note: not vectorized: not enough data-refs in basic block.
- simple.c:11:3: note: not vectorized: not enough data-refs in basic block.
- simple.c:6:5: note: not vectorized: not enough data-refs in basic block.
- simple.c:6:5: note: not vectorized: not enough data-refs in basic block.
I have gcc version 5.4.
Can someone help me interpret these messages and understand what is going on?
You're manually vectorizing with intrinsics, so there's nothing left for gcc to auto-vectorize. This leads to uninteresting warnings, which I assume come from trying to auto-vectorize the intrinsic itself or the loop-counter increments.
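For contrast, here is a minimal sketch of a scalar version of the same fill, which does leave work for the auto-vectorizer. The function name `set_to_1_scalar` and the length parameter `n` are hypothetical and not part of the original program. Compiling it with something like `gcc -O3 -mavx2 -fopt-info-vec` should show whether gcc reports the loop as vectorized; the exact message depends on the gcc version.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar counterpart of the loop in the question.
   There are no intrinsics here, so the loop body is plain float
   stores that the auto-vectorizer is free to turn into vector stores. */
void set_to_1_scalar(float *x, size_t n)
{
    assert(x != NULL);
    for (size_t j = 0; j < n; j++)
        x[j] = 1.0f;
}
```

Calling `set_to_1_scalar(buf, 256)` on an array of 256 floats writes the same 1024 bytes as the 32 `__m256` stores in the question.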
I get this asm from gcc 5.3 (on the Godbolt compiler explorer) if I don't do something silly like write a function that will optimize away, or try to compile with -O1.
    #include <immintrin.h>

    void set_to_1(__m256 * x) {
      for (int j=0;j<32;j++)
        x[j] = _mm256_set1_ps(1.);
    }

        push    rbp
        lea     rax, [rdi+1024]
        vmovaps ymm0, YMMWORD PTR .LC0[rip]
        mov     rbp, rsp
        push    r10             # gcc is weird with r10 in functions with ymm vectors
    .L2:                        # the vector loop
        vmovaps YMMWORD PTR [rdi], ymm0
        add     rdi, 32
        cmp     rdi, rax
        jne     .L2
        vzeroupper
        pop     r10
        pop     rbp
        ret

    .LC0:
        .long   1065353216
        ... repeated several times because gcc failed to use a vbroadcastss load or generate the constant on the fly
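The `.long 1065353216` constant in the listing is the bit pattern of the single-precision value 1.0 (0x3F800000), repeated once per vector lane. A small sketch to check that, assuming a platform with 32-bit IEEE-754 `float`; the helper name `float_bits` is hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a float.
   Assumes float is a 32-bit IEEE-754 type. */
uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);   /* well-defined way to reinterpret the bytes */
    return u;
}
```

Under that assumption, `float_bits(1.0f)` returns 1065353216, matching the value in `.LC0`.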
I get the same asm with -O1, but using -O1 to keep things from being optimized away isn't the way to see what gcc will really do.