OpenCL-CTS/test_conformance
Chip Davis b73c3149ad Image streams optimization (#1616)
* Don't recalculate image parameters repeatedly in `test_read_image()`

We've already done this in the loop. There's no need to recalculate
those parameters over and over again in `sample_image_pixel*()` and
`read_image_pixel*()`. This should save some work during the image
streams test.

This only affects the 3D tests for now, but my time profiles indicate
this is where we spend the most time anyway.
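
As a rough illustration of the idea (not the actual CTS code), the sketch below derives the layout parameters once and passes them into the per-pixel helper instead of recomputing them on every call. `ImageLayout`, `make_layout()`, `pixel_at()` and `sum_first_channel()` are hypothetical names invented for this example; the real code works with the CTS's own image descriptor types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout descriptor for a tightly packed 3D float image.
// This is NOT the CTS's image descriptor; it only illustrates the hoisting.
struct ImageLayout {
    std::size_t width, height, depth;
    std::size_t channels;   // floats per pixel
    std::size_t rowPitch;   // floats per row
    std::size_t slicePitch; // floats per 2D slice
};

// Derive the pitches once from the image dimensions.
inline ImageLayout make_layout(std::size_t w, std::size_t h, std::size_t d,
                               std::size_t channels) {
    ImageLayout l{w, h, d, channels, 0, 0};
    l.rowPitch = w * channels;
    l.slicePitch = l.rowPitch * h;
    return l;
}

// Per-pixel access that reuses the precomputed layout instead of
// recalculating pitches on every call.
inline const float* pixel_at(const float* data, const ImageLayout& l,
                             std::size_t x, std::size_t y, std::size_t z) {
    assert(x < l.width && y < l.height && z < l.depth);
    return data + z * l.slicePitch + y * l.rowPitch + x * l.channels;
}

// Sum the first channel of every pixel; the layout is computed once,
// outside the three nested loops.
inline float sum_first_channel(const std::vector<float>& data, std::size_t w,
                               std::size_t h, std::size_t d,
                               std::size_t channels) {
    const ImageLayout l = make_layout(w, h, d, channels);
    float sum = 0.0f;
    for (std::size_t z = 0; z < d; ++z)
        for (std::size_t y = 0; y < h; ++y)
            for (std::size_t x = 0; x < w; ++x)
                sum += pixel_at(data.data(), l, x, y, z)[0];
    return sum;
}
```

The saving per call is small, but the helper runs once per sampled pixel, so it adds up over a large 3D image.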

* Vectorize read_image_pixel_float() and sample_image_pixel_float() for SSE/AVX

This shortens the image streams test time from 45 minutes to 37
minutes. Unfortunately, most of the time is now spent waiting for
memory, particularly in the 3D tests, because the 3D image doesn't
neatly fit in the cache, especially in the linear sampling case, where
pixels from two 2D slices must be sampled. Software prefetching won't
help; it only helps when execution time is dominated by computation, but
here it is dominated by memory access. Randomized offsets are likely a
factor, because they throw off the hardware prefetcher.
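
To give a flavor of what vectorizing a four-channel float pixel operation looks like, here is a minimal sketch that blends two RGBA float pixels (for example, one from each of the two 2D slices) with a single weight. `lerp_rgba()` and the `SKETCH_HAVE_SSE2` macro are hypothetical names for this example, not functions from the CTS, and the scalar branch is only there so the sketch also builds on non-x86 targets.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2; also pulls in the SSE float intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Blend two RGBA float pixels: out = a * (1 - t) + b * t.
// All four channels are processed in one 128-bit register when SSE2 is
// available; otherwise a plain scalar loop is used.
inline void lerp_rgba(const float* a, const float* b, float t, float* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
#ifdef SKETCH_HAVE_SSE2
    const __m128 va = _mm_loadu_ps(a);       // unaligned load of 4 floats
    const __m128 vb = _mm_loadu_ps(b);
    const __m128 vt = _mm_set1_ps(t);        // broadcast t to all lanes
    const __m128 vs = _mm_set1_ps(1.0f - t); // broadcast (1 - t)
    _mm_storeu_ps(out, _mm_add_ps(_mm_mul_ps(va, vs), _mm_mul_ps(vb, vt)));
#else
    for (int c = 0; c < 4; ++c) out[c] = a[c] * (1.0f - t) + b[c] * t;
#endif
}
```

This only shortens the arithmetic; the loads of `a` and `b` still have to wait on memory.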

One possible further optimization is, in the linear sampling case, to
load two sampled pixels at once. This is easy to do using AVX, which
extends SSE with 256-bit vectors.
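
A sketch of what that could look like, assuming the two pixels are RGBA floats stored contiguously (for example, neighbors along x): one 256-bit load fetches both, each half is scaled by its own weight, and the halves are then added. `blend_pixel_pair()` is a hypothetical helper, not part of this change. The AVX branch is compiled only when the compiler defines `__AVX__` (for example with `-mavx`); otherwise a scalar fallback is used.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>
#endif

// Weighted sum of two adjacent RGBA float pixels:
//   out[c] = pair[c] * w0 + pair[4 + c] * w1   for c in 0..3
// `pair` must point at 8 contiguous floats.
inline void blend_pixel_pair(const float* pair, float w0, float w1,
                             float* out) {
    assert(pair != nullptr && out != nullptr);
#if defined(__AVX__)
    // Load both pixels with one 256-bit unaligned load.
    const __m256 px = _mm256_loadu_ps(pair);
    // Build [w0 w0 w0 w0 | w1 w1 w1 w1].
    const __m256 w = _mm256_insertf128_ps(
        _mm256_castps128_ps256(_mm_set1_ps(w0)), _mm_set1_ps(w1), 1);
    const __m256 prod = _mm256_mul_ps(px, w);
    // Add the low half (first pixel) to the high half (second pixel).
    const __m128 lo = _mm256_castps256_ps128(prod);
    const __m128 hi = _mm256_extractf128_ps(prod, 1);
    _mm_storeu_ps(out, _mm_add_ps(lo, hi));
#else
    for (int c = 0; c < 4; ++c) out[c] = pair[c] * w0 + pair[4 + c] * w1;
#endif
}
```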

Obviously, this only applies to x86 CPUs with SSE2. The greatest
performance gains, however, are seen with SSE4.1. Most modern x86 CPUs
have SSE4. Work is needed to support other CPUs' vector units--ARM
Advanced SIMD/NEON is probably the most important one. Another
possibility is arranging the code so that the compiler's
autovectorization will kick in and do what I did here manually.
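
One way such a tiered code path might be structured is sketched below: an SSE4.1 branch, an SSE2 branch, and a scalar fallback for everything else. It uses conversion of a four-channel 8-bit normalized pixel to floats as the example operation; that choice, and the name `unorm8x4_to_float()`, are assumptions made for illustration and do not describe what this change actually vectorizes. The scalar loop is also the kind of code a compiler's autovectorizer may handle on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE4_1__)
#include <smmintrin.h> // SSE4.1
#elif defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2
#endif

// Convert 4 unsigned 8-bit normalized channels to floats in [0, 1].
inline void unorm8x4_to_float(const std::uint8_t* src, float* out) {
    assert(src != nullptr && out != nullptr);
#if defined(__SSE4_1__) || defined(__SSE2__) || defined(_M_X64)
    std::int32_t packed;
    std::memcpy(&packed, src, sizeof(packed)); // safe unaligned 4-byte load
    const __m128i bytes = _mm_cvtsi32_si128(packed);
#if defined(__SSE4_1__)
    // SSE4.1: zero-extend 4 bytes to 4 x 32-bit in one instruction.
    const __m128i wide = _mm_cvtepu8_epi32(bytes);
#else
    // SSE2: widen in two steps by interleaving with zero.
    const __m128i zero = _mm_setzero_si128();
    const __m128i wide =
        _mm_unpacklo_epi16(_mm_unpacklo_epi8(bytes, zero), zero);
#endif
    const __m128 f = _mm_cvtepi32_ps(wide);
    _mm_storeu_ps(out, _mm_div_ps(f, _mm_set1_ps(255.0f)));
#else
    // Portable fallback for CPUs without SSE (for example, ARM).
    for (int c = 0; c < 4; ++c) out[c] = static_cast<float>(src[c]) / 255.0f;
#endif
}
```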
2023-02-07 08:46:15 -08:00