3Dlut滤镜性能优化

起因

在上一篇【opencv滤镜-3Dlut滤镜】介绍了基于opencv实现的3dlut滤镜,本篇介绍在此实现上做性能优化。先看一下之前介绍中的朴素实现

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
void Lut(const cv::Mat& src, const float intensity, cv::Mat& result)
{
    auto width = src.cols;
    auto height = src.rows ;
    result = src_img.clone();
    for(int i = 0; i < height; i++)
    {
        for(int j = 0; j < width; j++)
        {
            int r =  src.at<cv::Vec3b>(i, j)[2];
            int g =  src.at<cv::Vec3b>(i, j)[1];
            int b =  src.at<cv::Vec3b>(i, j)[0];
            //计算从8x8里面的第几块方块取色
            int block1 = std::floor(64 * static_cast<float>(b) / 256.0);
            int block2 = std::ceil(64 * static_cast<float>(b) / 256.0);

            int block_x1 = block1 % 8;
            int block_x2 = block2 % 8;
            int block_y1 = block1 / 8;
            int block_y2 = block2 / 8;

            //计算每块里面的具体index
            int lut_x1 = block_x1 * 64 + 64 * (r / 256.0);
            int lut_x2 = block_x2 * 64 + 64 * (r / 256.0);
            int lut_y1 = block_y1 * 64 + 64 * (g / 256.0);
            int lut_y2 = block_y2 * 64 + 64 * (g / 256.0);

            auto color1 = lut_img.at<cv::Vec3b>(cv::Point(lut_x1, lut_y1));
            auto color2 = lut_img.at<cv::Vec3b>(cv::Point(lut_x2, lut_y2));
            result.at<cv::Vec3b>(i, j)[0] = std::clamp((int)std::round((color1[0] + color2[0]) / 2.0), 0, 255);
            result.at<cv::Vec3b>(i, j)[1] = std::clamp((int)std::round((color1[1] + color2[1]) / 2.0), 0, 255);
            result.at<cv::Vec3b>(i, j)[2] = std::clamp((int)std::round((color1[2] + color2[2]) / 2.0), 0, 255);
        }
    }
    //依照强度进行混合
    cv::addWeighted(src_img,1.0 - intensity, result, intensity, 0.0, result);
}

之前实现非常朴素和简单,遍历每个像素,对像素进行一一映射后替换。这种数据、逻辑结构的优化也非常容易基于以上实现,主要关注for循环的代码块,可以盘点出来的优化点有

  • 外层for循环多线程并行
  • 内层运算逻辑简化、c++语法技巧
  • 内层计算simd优化

那下面就一个一个来,在开始之前需要做好2个前期准备:

  1. 将工程转换release, debug模式下面进行优化耗时比较没什么参考价值, 使用cmake组织工程,在CMakeLists中添加
1
set(CMAKE_BUILD_TYPE Release)

重新cmake,即可。

  1. 准备测量耗时和对比优化前后结果,耗时这个不用多说,效果在优化前后无差异也是必要的,或者在优化后性能带来的收益能够抵消效果的降低损失也是可以接受的,但都需要有数据量化做参考做决策。对比图像差异,逐像素相减最直观
1
2
3
4
5
void Subtract(const cv::Mat& a, const cv::Mat& b, cv::Mat& result)
{
    //使用addWeihted实现图像相减
    cv::addWeighted(a, 1, b, -1, 0, result);
}

耗时效果对比

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
    cv::Mat result, result2;
    float intensity = slider_intensity / 100.0;
    auto t0 = std::chrono::system_clock::now();
    Lut(src_img,  intensity, result);
    auto t1 = std::chrono::system_clock::now();
    Lut(src_img,  intensity, result2);
    auto t2 = std::chrono::system_clock::now();

    std::cout << "origin cost: "<< std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms."<<std::endl;
    std::cout << "optimize cost: "<< std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms."<<std::endl;
    
    cv::Mat diff;
    Subtract(result, result2, diff);
    cv::imshow("lut", diff);

下面开始进行优化

并行

图像是矩阵,for循环遍历像素的场景天然适合使用多线程并行。多线程并行的方法很多,可以用tbb、openmp以及特定平台的库等,甚至可以使用c++默认的线程支持手写一个。这里使用成本最低最方便的openmp, 首先引入openmp依赖

1
2
3
4
5
6
# ...
#OpenMP
find_package(OpenMP REQUIRED)

add_executable(${PROJECT_NAME} "main.cpp")
target_link_libraries(${PROJECT_NAME} PRIVATE  ${OpenCV_LIBS} OpenMP::OpenMP_CXX)

优化:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
#include "omp.h"

void LutOpimize(const cv::Mat& src, const float intensity, cv::Mat& result)
{
    auto width = src.cols;
    auto height = src.rows ;
    result = src_img.clone();

#pragma omp parallel for
    for(int i = 0; i < height; i++)
    {
        for(int j = 0; j < width; j++)
        {
            //...遍历计算
        }
    }
    //依照强度进行混合
    cv::addWeighted(src_img,1.0 - intensity, result, intensity, 0.0, result);
}

优化结果:

  1. 效果完全无差异

diff

  1. 耗时优化幅度
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
origin cost: 27 ms.
optimize cost: 7 ms.
origin cost: 27 ms.
optimize cost: 6 ms.
origin cost: 27 ms.
optimize cost: 7 ms.
origin cost: 27 ms.
optimize cost: 7 ms.
origin cost: 27 ms.
optimize cost: 7 ms.
origin cost: 27 ms.
optimize cost: 7 ms.

效率提升还是非常明显的,这里注意并行优化for循环不能放在外层(行遍历的时候),否则优化效果会大打折扣(甚至会反向优化)。这是因为Mat的数据是以行优先的,基于cpu缓存命中的特性,连续访问数据的时候,数据在相邻位置时可以提高缓存的命中率。比如将并行的位置调一下,此时再次测试,效果虽然不变,但耗时反而劣化了

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
    //...
    for(int i = 0; i < height; i++)
    {
        #pragma omp parallel for
        for(int j = 0; j < width; j++)
        {
            //...遍历计算
        }
    }
    //...

耗时

1
2
3
4
5
6
7
8
origin cost: 27 ms.
optimize cost: 23 ms.
origin cost: 27 ms.
optimize cost: 60 ms.
origin cost: 27 ms.
optimize cost: 40 ms.
origin cost: 27 ms.
optimize cost: 62 ms.

计算逻辑&语法

尝试后,发现在逻辑和语法上可以操作的空间非常小,按照下面的修改,测试下来没什么提升。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#pragma omp parallel for
    for(int i = 0; i < height; i++)
    {
        const uchar* src_row_ptr = src.ptr(i);
        for(int j = 0; j < width; j++)
        {
            auto r = src_row_ptr[3 * j + 2];
            auto g = src_row_ptr[3 * j + 1];
            auto b = src_row_ptr[3 * j];
            //计算从8x8里面的第几块方块取色

            int block1 = std::floor(b / 4.0);
            int block2 = std::ceil(b / 4.0);

            int block_x1 = block1 % 8;
            int block_x2 = block2 % 8;
            int block_y1 = block1 / 8;
            int block_y2 = block2 / 8;

            //计算每块里面的具体index
            int lut_x1 = block_x1 * 64 + r / 4.0;
            int lut_x2 = block_x2 * 64 + r / 4.0;
            int lut_y1 = block_y1 * 64 + g / 4.0;
            int lut_y2 = block_y2 * 64 + g / 4.0;

            auto color1 = lut_img.at<cv::Vec3b>(lut_y1, lut_x1);
            auto color2 = lut_img.at<cv::Vec3b>(lut_y2, lut_x2);

            result.at<cv::Vec3b>(i, j)[0] = std::clamp((int)std::round((color1[0] + color2[0]) / 2.0), 0, 255);
            result.at<cv::Vec3b>(i, j)[1] = std::clamp((int)std::round((color1[1] + color2[1]) / 2.0), 0, 255);
            result.at<cv::Vec3b>(i, j)[2] = std::clamp((int)std::round((color1[2] + color2[2]) / 2.0), 0, 255);
        }
    }

simd优化

尝试了一下,发现lut查找存在查找索引不连续的情况。此时对于simd优化的实现是非常困难的,暂时没法很好地解决这个问题,所以这一步没法继续做优化

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
void LutOpimize(const cv::Mat& src, const float intensity, cv::Mat& result)
{
    auto width = src.cols;
    auto height = src.rows ;
    result = src_img.clone();
    int simd_step = cv::v_float32::nlanes;
#pragma omp parallel for
    for(int i = 0; i < height; i++)
    {
        int j = 0;
        const uchar* src_row_ptr = src_img.ptr(i);
        for(; j < width - simd_step; j += simd_step)
        {
            cv::v_uint8 vb, vr, vg;
            cv::v_load_deinterleave(src_row_ptr + j, vb, vg, vr);

            cv::v_uint16 vb16_0, vb16_1, vg16_0, vg16_1, vr16_0, vr16_1;
            cv::v_expand(vb, vb16_0, vb16_1);
            cv::v_expand(vb, vg16_0, vg16_1);
            cv::v_expand(vb, vr16_0, vr16_1);

            std::vector<cv::v_uint32> vb32(4);
            std::vector<cv::v_uint32> vg32(4);
            std::vector<cv::v_uint32> vr32(4);

            cv::v_expand(vb16_0, vb32[0], vb32[1]);
            cv::v_expand(vb16_1, vb32[2], vb32[3]);
            cv::v_expand(vg16_0, vg32[0], vg32[1]);
            cv::v_expand(vg16_1, vg32[2], vg32[3]);
            cv::v_expand(vr16_0, vr32[0], vr32[1]);
            cv::v_expand(vr16_1, vr32[2], vr32[3]);

            std::vector<cv::v_int32> v_block1(4);
            std::vector<cv::v_int32> v_block2(4);
            std::vector<cv::v_int32> v_block_x1(4);
            std::vector<cv::v_int32> lut_x1(4);
            std::vector<cv::v_int32> lut_x2(4);
            std::vector<cv::v_int32> lut_y1(4);
            std::vector<cv::v_int32> lut_y2(4);

            cv::v_float32 v_8 = cv::v_setall_f32(8.0f);
            cv::v_float32 v_4 = cv::v_setall_f32(4.0f);

            for(int v = 0; v < 4; v++)
            {
                v_block1[v] = cv::v_floor(cv::v_reinterpret_as_f32(vb32[v]) / v_4);
                v_block2[v] = cv::v_ceil(cv::v_reinterpret_as_f32(vb32[v]) / v_4);

                auto v_block_x1 = cv::v_reinterpret_as_f32(v_block1[v]) - cv::v_reinterpret_as_f32(v_block1[v]) / v_8;
                auto v_block_x2 = cv::v_reinterpret_as_f32(v_block2[v]) - cv::v_reinterpret_as_f32(v_block2[v]) / v_8;
                auto v_block_y1 = cv::v_reinterpret_as_f32(v_block1[v]) / v_8;
                auto v_block_y2 = cv::v_reinterpret_as_f32(v_block2[v]) / v_8;
                auto v_lut_x1 = v_block_x1 * cv::v_setall_f32(64.0f) + cv::v_reinterpret_as_f32(vr32[v]) / v_4;
                auto v_lut_x2 = v_block_x2 * cv::v_setall_f32(64.0f) + cv::v_reinterpret_as_f32(vr32[v]) / v_4;

                auto v_lut_y1 = v_block_y1 * cv::v_setall_f32(64.0f) + cv::v_reinterpret_as_f32(vg32[v]) / v_4;
                auto v_lut_y2 = v_block_y2 * cv::v_setall_f32(64.0f) + cv::v_reinterpret_as_f32(vg32[v]) / v_4;
                
                //lut图的x、y所以存在可能是不连续的情况,此时旧无法正常通过连续的simd机制去访问了
            }

        }

        for(; j < width; j++)
        {
            auto r = src_row_ptr[3 * j + 2];
            auto g = src_row_ptr[3 * j + 1];
            auto b = src_row_ptr[3 * j + 0];

            //计算从8x8里面的第几块方块取色

            int block1 = std::floor(b / 4.0);
            int block2 = std::ceil(b / 4.0);

            int block_x1 = block1 % 8;
            int block_x2 = block2 % 8;
            int block_y1 = block1 / 8;
            int block_y2 = block2 / 8;

            //计算每块里面的具体index
            int lut_x1 = block_x1 * 64 + r / 4.0;
            int lut_x2 = block_x2 * 64 + r / 4.0;
            int lut_y1 = block_y1 * 64 + g / 4.0;
            int lut_y2 = block_y2 * 64 + g / 4.0;

            auto color1 = lut_img.at<cv::Vec3b>(lut_y1, lut_x1);
            auto color2 = lut_img.at<cv::Vec3b>(lut_y2, lut_x2);

            result.at<cv::Vec3b>(i, j)[0] = std::clamp((int)std::round((color1[0] + color2[0]) / 2.0), 0, 255);
            result.at<cv::Vec3b>(i, j)[1] = std::clamp((int)std::round((color1[1] + color2[1]) / 2.0), 0, 255);
            result.at<cv::Vec3b>(i, j)[2] = std::clamp((int)std::round((color1[2] + color2[2]) / 2.0), 0, 255);
        }
    }
    //依照强度进行混合
    cv::addWeighted(src_img,1.0 - intensity, result, intensity, 0.0, result);
}

微信公众号