OpenCV Filters - Diffusion Effect - Performance Optimization

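Analysis

A long time ago I used OpenCV to implement Photoshop's diffusion effect filter (see "OpenCV Filters - PS Diffusion Effect"). Back then I only wrote the most naive implementation. Even though it is C++, its performance is fairly poor and leaves a lot of room for optimization. This post walks through the optimization process and the resulting implementations. First, a look at the original naive implementation: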

void DiffusionFilterOrigin (const cv::Mat& img, cv::Mat& result)
{
    // random engine
    std::default_random_engine generator;
    std::uniform_int_distribution<int> dis(1, 8);

    // int indices: with size_t, an empty image would turn `img.rows - 1` into a huge unsigned bound
    for (int i = 1; i < img.rows - 1; i++) {
        for (int j = 1; j < img.cols - 1; j++) {
            // draw a random neighbour index
            int r = dis(generator);
            // 1 2 3
            // 4 p 5
            // 6 7 8
            switch (r) {
                case 1:
                    for (size_t k = 0; k < 3; k++) {
                        result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i - 1, j - 1)[k];
                    }
                    break;
                case 2:
                    for (size_t k = 0; k < 3; k++) {
                        result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i - 1, j)[k];
                    }
                    break;
                case 3:
                    for (size_t k = 0; k < 3; k++) {
                        result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i - 1, j + 1)[k];
                    }
                    break;
                case 4:
                    for (size_t k = 0; k < 3; k++) {
                        result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i, j - 1)[k];
                    }
                    break;
                case 5:
                    for (size_t k = 0; k < 3; k++) {
                        result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i, j + 1)[k];
                    }
                    break;
                case 6:
                    for (size_t k = 0; k < 3; k++) {
                        result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i + 1, j - 1)[k];
                    }
                    break;
                case 7:
                    for (size_t k = 0; k < 3; k++) {
                        result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i + 1, j)[k];
                    }
                    break;
                case 8:
                    for (size_t k = 0; k < 3; k++) {
                        result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i + 1, j + 1)[k];
                    }
                    break;
                default:
                    assert(false);
                    break;
            }
        }
    }
}

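Taking stock, the possible optimizations are:

  1. Logic optimization. The original logic is crude: it generates a random number on every loop iteration and branches through a switch, both of which are unfriendly to performance and need reworking.
  2. Multithreading. The loop visits pixels independently, so it is fairly easy to parallelize.
  3. SIMD. The processing seemingly has no interleaved or jumping access and no complex nonlinear arithmetic, so it looks like a good fit for SIMD.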

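Optimization

Time to optimize. As always, two things need to be settled first:

  1. The output should be unchanged by the optimization. Bit-exact equality cannot be achieved in this case because of the random numbers: the versions draw their random indices in different orders and ranges, so their outputs differ pixel by pixel (and with a non-fixed seed, even repeated runs of the same implementation would differ). The check is therefore done by eye: does the output look as expected?
  2. Test in Release mode and compare the timings.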
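
Visual inspection can be complemented by a simple property check: whatever the random numbers were, every interior output pixel must equal one of the 8 neighbours of that position in the source. The sketch below is not from the original code. The helper name `IsValidDiffusion` is hypothetical, and it assumes both images are tightly packed 8-bit buffers of `rows * cols * channels` bytes (for a continuous `cv::Mat`, `img.data` has this layout).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true if every interior pixel of `dst` equals one of
// the 8 neighbours of the same position in `src`. This is a necessary condition for
// a correct diffusion result, not a proof that the neighbours were chosen uniformly.
// Assumption: both buffers are tightly packed, rows * cols * channels bytes each.
bool IsValidDiffusion(const std::vector<std::uint8_t>& src,
                      const std::vector<std::uint8_t>& dst,
                      int rows, int cols, int channels)
{
    // Compares pixel (r0, c0) of dst with pixel (r1, c1) of src, channel by channel.
    auto pixel_equal = [&](int r0, int c0, int r1, int c1) {
        for (int k = 0; k < channels; k++) {
            const std::size_t d = (static_cast<std::size_t>(r0) * cols + c0) * channels + k;
            const std::size_t s = (static_cast<std::size_t>(r1) * cols + c1) * channels + k;
            if (dst[d] != src[s]) return false;
        }
        return true;
    };

    for (int i = 1; i < rows - 1; i++) {
        for (int j = 1; j < cols - 1; j++) {
            bool found = false;
            for (int di = -1; di <= 1 && !found; di++) {
                for (int dj = -1; dj <= 1 && !found; dj++) {
                    if (di == 0 && dj == 0) continue;  // the centre is not a valid source
                    found = pixel_equal(i, j, i + di, j + dj);
                }
            }
            if (!found) return false;
        }
    }
    return true;
}
```

Running such a check on the output of each optimized version catches indexing mistakes that are hard to spot by eye, since a wrongly indexed result still looks like random noise.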

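Logic optimization

As analyzed above, the loop contains both random number generation and a switch. Both can be moved out of the loop: the random numbers are generated up front in one pass, and the switch is replaced by a lookup table (LUT). A further benefit of this strategy is that it makes a later SIMD optimization easier to attempt.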

void DiffusionFilterOriginOpt (const cv::Mat& img, cv::Mat& result)
{
    // random engine
    std::default_random_engine generator;
    std::uniform_int_distribution<int> dis(0, 7);

    // generator random
    std::vector<int> random_index(img.rows * img.cols, 0);
    for(auto& index: random_index)
    {
        index = dis(generator);
    }

    // 1 2 3
    // 4 p 5
    // 6 7 8
    std::vector<std::pair<int, int>> lut =
    {
        {-1, -1}, {-1, 0},{-1, +1},
        {0, -1},          {0, +1},
        {+1, -1}, {+1, 0},{+1, +1},
    };

    for (int i = 1; i < img.rows - 1; i++)
    {
        for (int j = 1; j < img.cols - 1; j++)
        {
            auto pt = lut[random_index[i * img.cols + j]];
            for(int k = 0; k < 3; k++)
            {
                result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i + pt.first, j + pt.second)[k];
            }
        }
    }
}

After this change the output looks as expected, and the performance gain is very noticeable:

origin cost: 6.674 ms.
optimize cost: 2.644 ms.

That is roughly a 2.5x speedup (6.674 ms down to 2.644 ms), which shows how badly the original naive logic was written 😂
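
The post does not show how the timings were collected. A minimal harness along the following lines would do, assuming an average over several runs is wanted; `AverageMs` is a hypothetical helper name, not part of the original code.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: runs `fn` once as an untimed warm-up, then `repeat` more
// times, and returns the average wall-clock time per timed run in milliseconds.
double AverageMs(const std::function<void()>& fn, int repeat)
{
    using clock = std::chrono::steady_clock;  // monotonic, suited to interval timing
    fn();                                     // warm-up: first-touch page faults, caches
    const auto start = clock::now();
    for (int n = 0; n < repeat; n++) {
        fn();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / repeat;
}
```

It could be called as `AverageMs([&] { DiffusionFilterOriginOpt(img, result); }, 100)`. Averaging matters here because single runs in the millisecond range are easily disturbed by scheduling noise.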

Multithreaded parallelism

Next, OpenMP multithreading can be added:

//...
#pragma  omp parallel for 
    for (int i = 1; i < img.rows - 1; i++)
    {
        for (int j = 1; j < img.cols - 1; j++)
        {
            auto pt = lut[random_index[i * img.cols + j]];
            for(int k = 0; k < 3; k++)
            {
                result.at<cv::Vec3b>(i, j)[k] = img.at<cv::Vec3b>(i + pt.first, j + pt.second)[k];
            }
        }
    }
//...

The output looks as expected, but performance got worse instead of better:

origin cost: 6.61 ms.
optimize cost: 3.025 ms.

My guess is that the added thread-scheduling overhead outweighs what the parallelism gains.
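
If that guess is right, one option is OpenMP's `if` clause, which keeps the loop serial below a size threshold. The sketch below is an illustration rather than the author's code: it works on tightly packed byte buffers instead of `cv::Mat`, `DiffuseRows` is a hypothetical name, and the default for `min_parallel_pixels` is an unmeasured placeholder that would need tuning on the target machine.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies, for each interior pixel, the neighbour selected by a pre-generated index in 0..7.
// Assumptions: `src` and `dst` are tightly packed (rows * cols * channels bytes) and
// `random_index` holds rows * cols entries. `min_parallel_pixels` is a hypothetical knob.
void DiffuseRows(const std::vector<std::uint8_t>& src, std::vector<std::uint8_t>& dst,
                 const std::vector<int>& random_index, int rows, int cols, int channels,
                 long min_parallel_pixels = 1L << 20)
{
    // {row delta, column delta} for the 8 neighbours, in the same order as the post's LUT.
    static const int lut[8][2] = {{-1, -1}, {-1, 0}, {-1, 1}, {0, -1},
                                  {0, 1},   {1, -1}, {1, 0},  {1, 1}};
    const long pixels = static_cast<long>(rows) * cols;

    // The `if` clause runs the loop serially for small images, where starting and
    // synchronizing threads may cost more than the work itself.
    // When compiled without OpenMP support the pragma is simply ignored.
#pragma omp parallel for if(pixels >= min_parallel_pixels)
    for (int i = 1; i < rows - 1; i++) {
        for (int j = 1; j < cols - 1; j++) {
            const int* d = lut[random_index[static_cast<std::size_t>(i) * cols + j]];
            const std::size_t s =
                (static_cast<std::size_t>(i + d[0]) * cols + (j + d[1])) * channels;
            const std::size_t t = (static_cast<std::size_t>(i) * cols + j) * channels;
            for (int k = 0; k < channels; k++) {
                dst[t + k] = src[s + k];
            }
        }
    }
}
```

Because the random indices are generated before the loop, the iterations only read shared data and write disjoint output rows, so the loop stays free of data races.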

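SIMD optimization

With parallelism leading nowhere, there is one more trick to try: SIMD. Before starting on SIMD, the processing is first changed to use raw pointer access: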

void DiffusionFilterOriginOpt (const cv::Mat& img, cv::Mat& result)
{
    // random engine
    std::default_random_engine generator;
    std::uniform_int_distribution<int> dis(0, 7);

    // generator random
    int channel = img.channels();
    std::vector<int> random_index(img.rows * img.cols, 0);
    for(auto& index: random_index)
    {
        index = dis(generator);
    }

    // 1 2 3
    // 4 p 5
    // 6 7 8
    std::vector<std::pair<int, int>> lut =
    {
        {0, -channel}, {0, 0},{0, channel},
        {1, -channel},                {1, channel},
        {2, -channel}, {2, 0},{2, channel},
    };

    int height = img.rows;
    int width = img.cols * channel;

    for (int i = 1; i < height - 1; i++)
    {
        std::vector<const uchar*> src_row_ptr = {img.ptr(i - 1), img.ptr(i), img.ptr(i + 1 )};
        uchar* dst_ptr = result.ptr(i);
        for (int j = channel; j < width - channel; j++)
        {
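            // random_index holds one entry per pixel, so index it by pixel position
            // (i * img.cols + column); `width` counts bytes and would run past the end
            auto pt = lut[random_index[i * img.cols + (j / channel)]];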
            dst_ptr[j] = src_row_ptr[pt.first][j + pt.second];
        }
    }
}

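Pointer access removes the per-channel for loop, which makes batched SIMD loads easier. But after this change the LUT still stands in the way of batched loading, so the two-dimensional LUT has to be split into two one-dimensional tables: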

void DiffusionFilterOriginOpt (const cv::Mat& img, cv::Mat& result)
{
    // random engine
    std::default_random_engine generator;
    std::uniform_int_distribution<int> dis(0, 7);

    // generate one random neighbour index per pixel
    std::int8_t channel = static_cast<std::int8_t>(img.channels());
    std::int8_t _channel = -channel;
    std::vector<int> random_index(img.rows * img.cols, 0);
    for(auto& index: random_index)
    {
        index = dis(generator);
    }

    // row and column halves of the former 2-D LUT; as before, the centre pixel is
    // left out so that only the 8 neighbours can be chosen
    std::vector<std::int8_t> lut_row = { 0, 0, 0, 1, 1, 2, 2, 2};
    std::vector<std::int8_t> lut_col = {_channel, 0, channel, _channel, channel, _channel, 0, channel};

    // 1 2 3
    // 4 p 5
    // 6 7 8
    int height = img.rows;
    int width = img.cols * channel;

    for (int i = 1; i < height - 1; i++)
    {
        std::vector<const uchar*> src_row_ptr = {img.ptr(i - 1), img.ptr(i), img.ptr(i + 1 )};
        uchar* dst_ptr = result.ptr(i);
        for (int j = channel; j < width - channel; j++)
        {
            // one random entry per pixel: index by pixel position, not by byte position
            auto index = i * img.cols + (j / channel);
            auto row_index = lut_row[random_index[index]];
            auto col_index = lut_col[random_index[index]];
            dst_ptr[j] = src_row_ptr[row_index][j + col_index];
        }
    }
}

But the split revealed a new problem: with one more table lookup per byte, the runtime went up:

origin cost: 6.646 ms.
optimize cost: 6.294 ms.
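
One possible way to avoid the second lookup, shown here only as an untimed sketch, is to fold the row and column parts into a single byte offset and to look it up once per pixel instead of once per byte. The assumptions are that source and destination share the same row stride in bytes (`img.step` in OpenCV terms) and that the random indices are stored as one byte per pixel. `DiffuseFlatOffset` and its parameters are hypothetical names, and whether this beats the pair-based LUT would have to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: a single LUT of signed byte offsets replaces the separate row/column tables.
// Assumptions: `src` and `dst` point to rows * stride bytes each, with the same stride;
// `random_index` holds rows * cols entries in 0..7.
void DiffuseFlatOffset(const std::uint8_t* src, std::uint8_t* dst,
                       const std::vector<std::uint8_t>& random_index,
                       int rows, int cols, int channels, std::ptrdiff_t stride)
{
    const std::ptrdiff_t c = channels;
    // Offset of each neighbour relative to the current pixel:
    // row delta * stride + column delta * channels.
    const std::ptrdiff_t lut[8] = {
        -stride - c, -stride, -stride + c,
        -c,                   +c,
        +stride - c, +stride, +stride + c,
    };

    for (int i = 1; i < rows - 1; i++) {
        const std::uint8_t* src_row = src + i * stride;
        std::uint8_t* dst_row = dst + i * stride;
        const std::uint8_t* idx_row = random_index.data() + static_cast<std::size_t>(i) * cols;
        for (int j = 1; j < cols - 1; j++) {
            // One lookup per pixel, then copy all channels of the chosen neighbour.
            const std::uint8_t* from = src_row + j * c + lut[idx_row[j]];
            std::memcpy(dst_row + j * c, from, static_cast<std::size_t>(channels));
        }
    }
}
```

This still performs a data-dependent load for every pixel, so it does not make the loop any easier to vectorize. It only reduces the number of table lookups.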

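Sensing trouble, I pressed on with the SIMD implementation anyway, but it turned out to be a dead end: as long as the LUT is involved, the memory accesses cannot be made contiguous. The unfinished attempt looks like this: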

void DiffusionFilterOriginOpt (const cv::Mat& img, cv::Mat& result)
{
    // random engine
    std::default_random_engine generator;
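    // std::uniform_int_distribution is not specified for 8-bit types,
    // so draw an int and narrow it when storing
    std::uniform_int_distribution<int> dis(0, 7);

    // generate one random neighbour index per pixel
    std::int8_t channel = static_cast<std::int8_t>(img.channels());
    std::int8_t _channel = -channel;
    std::vector<std::int8_t> random_index(img.rows * img.cols, 0);
    for(auto& index: random_index)
    {
        index = static_cast<std::int8_t>(dis(generator));
    }

    // row and column tables for the 8 neighbours (centre pixel left out)
    std::vector<std::int8_t> lut_row = { 0, 0, 0, 1, 1, 2, 2, 2};
    std::vector<std::int8_t> lut_col = {_channel, 0, channel, _channel, channel, _channel, 0, channel};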

    // 1 2 3
    // 4 p 5
    // 6 7 8
    int height = img.rows;
    int width = img.cols * channel;

    int index = 0;
    int step = 16;
    for (int i = 1; i < height - 1; i++)
    {
        std::vector<const uchar*> src_row_ptr = {img.ptr(i - 1), img.ptr(i), img.ptr(i + 1 )};
        uchar* dst_ptr = result.ptr(i);
        int j = channel;
        for(; j < width - channel; j+= step, index += step)
        {
            auto v_index = cv::vx_load(random_index.data() + index);
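            // Stuck here: every lane of v_index selects a different source row and
            // column offset, so the source bytes can no longer be loaded contiguously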
        }

        for (; j < width - channel; j++, index++)
        {
            auto row_index = lut_row[random_index[index]];
            auto col_index = lut_col[random_index[index]];
            dst_ptr[j] = src_row_ptr[row_index][j + col_index];
        }
    }
}

So in the end, the pointer version below is the best result of this optimization effort:

void DiffusionFilterOriginOpt (const cv::Mat& img, cv::Mat& result)
{
    // random engine
    std::default_random_engine generator;
    std::uniform_int_distribution<int> dis(0, 7);

    // generator random
    int channel = img.channels();
    std::vector<int> random_index(img.rows * img.cols, 0);
    for(auto& index: random_index)
    {
        index = dis(generator);
    }

    // 1 2 3
    // 4 p 5
    // 6 7 8
    std::vector<std::pair<int, int>> lut =
            {
                    {0, -channel}, {0, 0},{0, channel},
                    {1, -channel},                {1, channel},
                    {2, -channel}, {2, 0},{2, channel},
            };

    int height = img.rows;
    int width = img.cols * channel;

    for (int i = 1; i < height - 1; i++)
    {
        std::vector<const uchar*> src_row_ptr = {img.ptr(i - 1), img.ptr(i), img.ptr(i + 1 )};
        uchar* dst_ptr = result.ptr(i);
        for (int j = channel; j < width - channel; j++)
        {
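            // random_index holds one entry per pixel, so index it by pixel position
            // (i * img.cols + column); `width` counts bytes and would run past the end
            auto pt = lut[random_index[i * img.cols + (j / channel)]];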
            dst_ptr[j] = src_row_ptr[pt.first][j + pt.second];
        }
    }
}
