Performance test of OpenCV with OpenCL/GPU

Posted on 2020-11-2 11:31:09

The test code is as follows:

import cv2
import timeit

print('OpenCL available:', cv2.ocl.haveOpenCL())

# A simple image pipeline that runs on both Mat and UMat
def img_cal(img, mode='none'):
    if mode == 'UMat':
        # Wrap the image in a UMat so OpenCV may dispatch the calls to OpenCL
        img = cv2.UMat(img)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img = cv2.GaussianBlur(img, (7, 7), 1.5)
    img = cv2.Canny(img, 0, 50)
    if isinstance(img, cv2.UMat):
        # Copy the result back into a NumPy array
        img = img.get()
    return img

# Timing function: returns the average time per call in milliseconds
def run(processor, function, n_threads, N):
    cv2.setNumThreads(n_threads)
    t = timeit.timeit(function, globals=globals(), number=N) / N * 1000
    # Use the n_threads argument here, not the global loop variable n
    print('%s avg. with %d threads: %0.2f ms' % (processor, n_threads, t))
    return t

img = cv2.imread('a.tif')
N = 100
threads = [1, 6]

processor = {'GPU': "img_cal(img,'UMat')",
             'CPU': "img_cal(img)"}
results = {}
for n in threads:
    for pro in processor.keys():
        results[pro, n] = run(processor=pro,
                              function=processor[pro],
                              n_threads=n, N=N)

# Each figure below is a time ratio in percent: 100 means equal speed,
# below 100 means the configuration named first is slower.
print('\nGPU speed increase over 1 CPU thread [%%]: %0.2f' %
      (results[('CPU', 1)] / results[('GPU', 1)] * 100))
print('CPU speed increase on 6 threads versus 1 thread [%%]: %0.2f' %
      (results[('CPU', 1)] / results[('CPU', 6)] * 100))
print('GPU speed increase versus 6 threads [%%]: %0.2f' %
      (results[('GPU', 1)] / results[('GPU', 6)] * 100))

The test results are:

toybrick@debian10:~/temp/pytest$ /usr/bin/python3 /home/toybrick/temp/pytest/main.py
OpenCL available: True
GPU avg. with 1 threads: 25.12 ms
CPU avg. with 1 threads: 3.34 ms
GPU avg. with 6 threads: 16.39 ms
CPU avg. with 6 threads: 3.14 ms

GPU speed increase over 1 CPU thread [%]: 13.31
CPU speed increase on 6 threads versus 1 thread [%]: 106.54
GPU speed increase versus 6 threads [%]: 153.28



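To read the three percentages: the script prints time ratios scaled by 100, so 100 means equal speed and a value below 100 means the first configuration is slower. A small sketch that recomputes them from the rounded timings in the log (the helper name `ratio_percent` is made up for illustration; the last digits differ slightly from the log because the script worked with unrounded times):

```python
# Average times in ms, copied from the log (rounded to 2 decimals there).
times_ms = {
    ('GPU', 1): 25.12,
    ('CPU', 1): 3.34,
    ('GPU', 6): 16.39,
    ('CPU', 6): 3.14,
}

def ratio_percent(baseline_ms, candidate_ms):
    """Return baseline/candidate as a percentage (100 = same speed)."""
    return baseline_ms / candidate_ms * 100

gpu_vs_cpu = ratio_percent(times_ms[('CPU', 1)], times_ms[('GPU', 1)])
cpu_6_vs_1 = ratio_percent(times_ms[('CPU', 1)], times_ms[('CPU', 6)])
gpu_6_vs_1 = ratio_percent(times_ms[('GPU', 1)], times_ms[('GPU', 6)])
# How many times longer the UMat path takes than one CPU thread
slowdown = times_ms[('GPU', 1)] / times_ms[('CPU', 1)]

print('UMat path relative to 1 CPU thread [%%]: %0.1f' % gpu_vs_cpu)
print('CPU, 6 threads relative to 1 thread [%%]: %0.1f' % cpu_6_vs_1)
print('UMat path, 6 threads relative to 1 thread [%%]: %0.1f' % gpu_6_vs_1)
print('UMat path takes %.1fx as long as 1 CPU thread' % slowdown)
```

So in this run the UMat path takes roughly 7.5 times as long per call as the single-threaded CPU path, and the CPU path barely changes between 1 and 6 threads.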

With OpenCL/GPU enabled, performance actually got worse. How strange?!
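One thing worth ruling out before drawing conclusions: in the benchmark as posted, every timed UMat call also includes the `cv2.UMat(img)` upload and the `.get()` download, and the first call may additionally pay for OpenCL kernel compilation. A minimal sketch of a timing helper that runs untimed warm-up calls first and reports the minimum and median instead of the mean. `bench` and the stand-in `workload` are hypothetical names, not part of OpenCV, and the helper uses only the standard library:

```python
import statistics
import timeit

def bench(fn, warmup=3, repeat=20):
    """Time fn() after untimed warm-up calls; return (min_ms, median_ms)."""
    for _ in range(warmup):
        fn()  # not timed: absorbs one-off costs such as lazy initialisation
    # One call per sample, so min and median describe single-call latency
    samples = timeit.repeat(fn, number=1, repeat=repeat)
    samples_ms = [s * 1000 for s in samples]
    return min(samples_ms), statistics.median(samples_ms)

def workload():
    # Stand-in for the real pipeline, e.g. lambda: img_cal(img, 'UMat')
    return sum(i * i for i in range(10_000))

best, median = bench(workload)
print('min %.3f ms, median %.3f ms' % (best, median))
```

With the code from this post it would be called as `bench(lambda: img_cal(img, 'UMat'))`. Creating the `cv2.UMat` once outside the lambda would also keep the upload out of the timed region, which shows how much of the 25 ms is transfer rather than computation.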

Posted on 2020-11-2 11:32:58

toybrick@debian10:~$ clinfo
Number of platforms                               1
  Platform Name                                   ARM Platform
  Platform Vendor                                 ARM
  Platform Version                                OpenCL 1.2 v1.r18p0-01rel0.5630b190419266e7fe8b09ec0007fb39
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory
  Platform Extensions function suffix             ARM

  Platform Name                                   ARM Platform
Number of devices                                 1
  Device Name                                     Mali-T860
  Device Vendor                                   ARM
  Device Vendor ID                                0x8602000
  Device Version                                  OpenCL 1.2 v1.r18p0-01rel0.5630b190419266e7fe8b09ec0007fb39
  Driver Version                                  1.2
  Device OpenCL C Version                         OpenCL C 1.2 v1.r18p0-01rel0.5630b190419266e7fe8b09ec0007fb39
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               4
  Max clock frequency                             5MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple              4
  Preferred / native vector sizes
    char                                                16 / 16
    short                                                8 / 8
    int                                                  4 / 4
    long                                                 2 / 2
    half                                                 8 / 8        (cl_khr_fp16)
    float                                                4 / 4
    double                                               2 / 2        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              4029292544 (3.753GiB)
  Error Correction support                        No
  Max memory allocation                           1007323136 (960.7MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        262144 (256KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   32 bytes
    Pitch alignment for 2D image buffers          16 pixels
    Max 2D image size                             65536x65536 pixels
    Max 3D image size                             65536x65536x65536 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     8
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     1024
  Queue properties
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [ARM]
  clCreateContext(NULL, ...) [default]            Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-T860
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-T860
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-T860
toybrick@debian10:~$

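Posted on 2020-11-2 15:22:51

I don't understand why you assume the GPU has to run faster than the CPU.
The GPU here only clocks at 200-800 MHz, while the CPU runs at 1.2-1.8 GHz, so for ordinary algorithms the CPU is clearly faster.
The GPU only has a parallel-computing advantage when doing matrix operations.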

Posted on 2020-11-2 15:57:02

This test code is doing exactly that: matrix operations. On other platforms the GPU is much faster than the CPU.

Posted on 2020-11-2 15:58:55

Compared with a CPU, a GPU has more compute units and can run them in parallel. So even though the GPU clock is lower, it is faster than the CPU.
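The two positions in this exchange (clock speed versus number of parallel units) can be put side by side in a toy model: time per call is a fixed overhead plus the work divided evenly over the available lanes. This is only an illustration, not a measurement. The lane count and clocks are taken from the thread (4 compute units from clinfo, GPU up to 800 MHz, CPU up to 1.8 GHz), while the 2 ms per-call overhead, the "one work unit per lane per cycle" rule, and the function name `call_time_ms` are assumptions made up for the sketch:

```python
def call_time_ms(work_units, lanes, clock_ghz, overhead_ms=0.0):
    """Toy model: fixed per-call overhead plus work spread evenly over lanes."""
    cycles_per_ms = clock_ghz * 1e6  # 1 GHz = 1e6 cycles per millisecond
    # Assumption: each lane retires one work unit per cycle
    return overhead_ms + work_units / (lanes * cycles_per_ms)

# Lanes and clocks come from the thread; the 2.0 ms overhead is a placeholder.
CPU = dict(lanes=1, clock_ghz=1.8, overhead_ms=0.0)
GPU = dict(lanes=4, clock_ghz=0.8, overhead_ms=2.0)

for work in (1e6, 1e7, 1e8):
    cpu_ms = call_time_ms(work, **CPU)
    gpu_ms = call_time_ms(work, **GPU)
    winner = 'GPU' if gpu_ms < cpu_ms else 'CPU'
    print('work=%.0e  cpu=%.2f ms  gpu=%.2f ms  -> %s'
          % (work, cpu_ms, gpu_ms, winner))
```

Under these assumptions the slower-clocked parallel device only wins once the per-call workload is large enough to amortise its fixed overhead, so both posters can be right depending on image size and transfer cost.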

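Posted on 2020-11-2 16:00:38

You could try raising the GPU frequency to its maximum.
If it still isn't as fast as the CPU, then it simply isn't going to be as fast as the CPU.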

Posted on 2020-11-2 18:08:57

How do I set the GPU frequency?
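A hedged sketch of one common approach on Linux boards whose GPU driver registers with the kernel's devfreq framework: the standard devfreq sysfs files (`available_frequencies`, `cur_freq`, `governor`, `min_freq`, `max_freq`) live under `/sys/class/devfreq/<device>/`. Whether this board exposes the Mali GPU there, and under which device name, is an assumption that has not been checked against this system. The helper names below are made up for illustration, and writing to sysfs needs root:

```python
from pathlib import Path

DEVFREQ_ROOT = Path('/sys/class/devfreq')  # standard Linux devfreq location

def find_gpu_devices(root=DEVFREQ_ROOT):
    """Return devfreq entries whose name contains 'gpu' (naming is an assumption)."""
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if 'gpu' in p.name.lower())

def read_status(dev):
    """Read the governor, current frequency and advertised frequency list."""
    def read(name):
        f = dev / name
        return f.read_text().strip() if f.exists() else None
    freqs = read('available_frequencies')
    return {
        'governor': read('governor'),
        'cur_freq': read('cur_freq'),
        'available': [int(x) for x in freqs.split()] if freqs else [],
    }

def pin_to_max(dev):
    """Pin the device to its highest advertised frequency (needs root)."""
    available = read_status(dev)['available']
    if not available:
        raise RuntimeError('no available_frequencies for %s' % dev)
    top = str(max(available))
    # Raise the ceiling first, then the floor, so min never exceeds max
    (dev / 'max_freq').write_text(top)
    (dev / 'min_freq').write_text(top)
    return int(top)

# Listing only; nothing is written unless pin_to_max() is called explicitly
print([d.name for d in find_gpu_devices()])
```

Calling `read_status(dev)` before and after `pin_to_max(dev)` shows whether the change took effect. If the kernel offers a `performance` governor for the device, writing that name to the `governor` file is an alternative to pinning `min_freq`.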