In the half-precision `fdim` test, the original code always used
`CL_HALF_RTE` to convert the float result back to half. When the hardware
rounds half values with RTZ, the expected value then differs from the
computed result in the last bit. Some examples:
```
fdim(0x365f, 0xdc63) = fdim( 0.398193f, -280.75f) = 281.148193f (RTE=0x5c65, RTZ=0x5c64)
fdim(0xa4a3, 0xf0e9) = fdim(-0.018112f, -10056.0f) = 10055.981445f (RTE=0x70e9, RTZ=0x70e8)
fdim(0x1904, 0x9ab7) = fdim( 0.002449f, -0.003279f) = 0.005728f (RTE=0x1dde, RTZ=0x1ddd)
```
Fix this by using the hardware's default half rounding mode when
converting the float result back to half.
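
The RTE and RTZ values listed above can be reproduced with a small standalone sketch. This is not the test suite's conversion code: the helper names (`half_bits_to_float`, `float_to_half_bits`, `fdim_reference`) are hypothetical, and the float-to-half step assumes a finite result in the normal half range, which is enough for these three cases.

```python
import struct


def half_bits_to_float(h: int) -> float:
    """Decode IEEE 754 binary16 bits into a Python float (exact)."""
    return struct.unpack("<e", struct.pack("<H", h))[0]


def to_float32(x: float) -> float:
    """Round a Python double to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float_to_half_bits(x: float, rtz: bool) -> int:
    """Convert a binary32 value to binary16 bits with RTZ or RTE.

    Sketch only: handles zero and results in the normal half range;
    NaN, infinity, overflow and subnormal results are not handled.
    """
    if x == 0.0:
        return 0
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = (bits >> 16) & 0x8000
    exp = ((bits >> 23) & 0xFF) - 127 + 15   # rebias float exponent for half
    mant = bits & 0x7FFFFF
    half = sign | (exp << 10) | (mant >> 13)  # truncation is exactly RTZ
    dropped = mant & 0x1FFF                   # the 13 bits that do not fit
    if not rtz:
        # Round to nearest, ties to even; a carry into the exponent is fine.
        if dropped > 0x1000 or (dropped == 0x1000 and (half & 1)):
            half += 1
    return half


def fdim_reference(x_bits: int, y_bits: int, rtz: bool) -> int:
    """Compute fdim on half inputs in float, then convert back to half."""
    x = half_bits_to_float(x_bits)
    y = half_bits_to_float(y_bits)
    diff = to_float32(x - y) if x > y else 0.0
    return float_to_half_bits(diff, rtz)


if __name__ == "__main__":
    for x_bits, y_bits in [(0x365F, 0xDC63), (0xA4A3, 0xF0E9), (0x1904, 0x9AB7)]:
        rte = fdim_reference(x_bits, y_bits, rtz=False)
        rtz = fdim_reference(x_bits, y_bits, rtz=True)
        print(f"fdim(0x{x_bits:04x}, 0x{y_bits:04x}): RTE=0x{rte:04x}, RTZ=0x{rtz:04x}")
```

In all three cases the two modes differ only in the last bit of the half result, which is why a reference fixed to RTE fails on an RTZ device.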