Toybrick

Title: [Solved!] Wrong inference results after converting TensorFlow's official DeepLabv3 with RKNN

Author: protossw512    Time: 2019-3-5 05:11
Title: [Solved!] Wrong inference results after converting TensorFlow's official DeepLabv3 with RKNN
This post was last edited by protossw512 on 2019-3-6 07:32

The problem has been solved: the intermediate layers' outputs are channel-first, so you have to be careful when reshaping.
Author: raul    Time: 2019-3-5 16:17
Thanks for the feedback; I'll try to reproduce it here.
Author: protossw512    Time: 2019-3-5 16:28
raul posted on 2019-3-5 16:17
Thanks for the feedback; I'll try to reproduce it here.

By the way, I dumped the result of the first conv layer (MobilenetV2/Conv/Conv2D) here and found it is already wrong.
I then tried to avoid rknn's built-in image normalization by setting it to (0. 0. 0. 1.0), converting the image to np.float32 myself and doing the normalization on my own, but inference then crashed with a segmentation fault for reasons I don't understand; np.float16 didn't work either.
Quantization has been disabled the whole time.
Is there any other way to dump the data after rknn's normalization, right before it is fed into the network?
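For reference, the bypass described above looks roughly like this with the RKNN-Toolkit Python API (a sketch, assuming a 1.x-era toolkit; the exact config arguments vary between versions, and the MobileNetV2-style scaling and file name are illustrative). This is the setup that reportedly segfaults with float inputs:

    import cv2
    import numpy as np
    from rknn.api import RKNN

    rknn = RKNN()
    # mean '0 0 0' and scale 1: rknn itself performs no normalization
    rknn.config(channel_mean_value='0 0 0 1', reorder_channel='0 1 2')
    # ... load model, build, init_runtime as usual ...

    img = cv2.imread('test.jpg')        # hypothetical input image
    img = img.astype(np.float32)
    img = img / 127.5 - 1.0             # manual normalization to [-1, 1]
    outputs = rknn.inference(inputs=[img])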
Author: protossw512    Time: 2019-3-6 03:26
raul posted on 2019-3-5 16:17
Thanks for the feedback; I'll try to reproduce it here.

Hi, the problem has been solved. The cause is that the intermediate layers' outputs are channel-first, so the final output should be reshaped to (1, 21, 65, 65), not (1, 65, 65, 21), and then np.transpose'd to (1, 65, 65, 21) for further processing.
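In numpy terms the fix amounts to the following (a minimal sketch; the (1, 21, 65, 65) shape is the 21-class, 65x65 logits from this thread, and outputs is assumed to be the list returned by rknn.inference):

    import numpy as np

    out = np.array(outputs[0])              # raw buffer is channel-first (NCHW)
    out = out.reshape(1, 21, 65, 65)        # NOT (1, 65, 65, 21)
    out = np.transpose(out, (0, 2, 3, 1))   # to NHWC (1, 65, 65, 21) for processing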

Thanks a lot for everyone's help!
Author: raul    Time: 2019-3-6 09:31
protossw512 posted on 2019-3-6 03:26
Hi, the problem has been solved. The cause is that the intermediate layers' outputs are channel-fir ...

OK. My verification here also shows the results are correct.
Author: chuyee    Time: 2019-3-8 07:56
What inference time do you get for DeepLabv3? Is it possible to get under 1 s?
Author: protossw512    Time: 2019-3-8 14:25
chuyee posted on 2019-3-8 07:56
What inference time do you get for DeepLabv3? Is it possible to get under 1 s?

DeepLabv3+ is actually not as computationally intensive as you might think. Depending on your network architecture, you can run mobilenetv2_dm0.5 at up to 15 fps with an input size of 513x513.
Author: chuyee    Time: 2019-3-11 10:07
protossw512 posted on 2019-3-8 14:25
DeepLabv3+ is actually not as computationally intensive as you might think. Depending on your network arc ...

That's amazing!
Author: chuyee    Time: 2019-3-11 11:50
protossw512 posted on 2019-3-8 14:25
DeepLabv3+ is actually not as computationally intensive as you might think. Depending on your network arc ...

What does "mobilenetv2_dm0.5" stand for? I got only ~1.2 s on a GTX 1080 Ti with the demo code https://github.com/tensorflow/mo ... /deeplab_demo.ipynb, which uses the model deeplabv3_mnv2_pascal_train_aug_2018_01_29.tar.gz (513x513, mobilenet_v2, COCO dataset). I haven't successfully ported it to rknn yet, but do you think I can achieve 15 FPS after porting?
Author: chuyee    Time: 2019-3-12 02:28
chuyee posted on 2019-3-11 10:07
That's amazing!

You set the input layer to MobilenetV2/Conv/Conv2D, right? Are all the layers before it then handled on the CPU? mobilenet_v2 by itself can easily reach 40 fps on rknn. The question is: with pre- and post-processing included, how many seconds does it take you to process one 513x513 image on the 3399pro?
Author: chuyee    Time: 2019-3-12 16:55
chuyee posted on 2019-3-11 11:50
What does "mobilenetv2_dm0.5" stand for? I got only ~1.2 s on a GTX 1080 Ti with the demo code https ...

To answer my own question: dm stands for depth multiplier. 0.5 means halving the number of channels used in each layer, which cuts the number of computations by roughly a factor of 4 and the number of learnable parameters by roughly a factor of 3. It is therefore much faster than the full model, but also less accurate. On my GTX 1080 Ti the first frame always takes ~1.2 s, but the following ones drop to ~0.015 s. That's about 70 FPS! So 15 FPS on the 3399pro sounds plausible.
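As a quick sanity check on the factor of 4: the multiply-accumulate count of a standard KxK convolution is proportional to C_in * C_out, so halving both channel counts quarters it (a sketch with made-up layer sizes; layers whose width is fixed, like the RGB stem and the classifier, and MobileNetV2's depthwise kernels scale only linearly in channels, which is roughly why the parameter saving is closer to 3x than 4x):

    def conv_macs(c_in, c_out, k, h, w):
        # multiply-accumulates of a standard KxK convolution
        return c_in * c_out * k * k * h * w

    full = conv_macs(32, 64, 3, 257, 257)   # illustrative layer at dm = 1.0
    half = conv_macs(16, 32, 3, 257, 257)   # the same layer at dm = 0.5
    print(full / half)                      # -> 4.0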
Author: protossw512    Time: 2019-3-13 06:08
chuyee posted on 2019-3-12 16:55
To answer my own question: dm stands for depth multiplier. 0.5 means halving the number of channels ...

Yep, the first frame is not representative of the steady-state runtime of the TensorFlow framework. Even if you use dm=1.0 and add the ASPP and decoder modules, you can still run at 10 fps with an input size of 513x513, which is pretty amazing.
Author: chuyee    Time: 2019-3-13 15:29
Here is my DeepLabv3 result on the 3399pro. I only get 3 FPS. I'm using the rknn model converted from deeplabv3_mnv2_dm05_pascal.pb, input size 513x513. There still seems to be quite a gap to 15 FPS. Did I miss anything?

inference time: 0.29389023780822754
inference time: 0.3319542407989502
inference time: 0.29692697525024414
inference time: 0.2921435832977295
inference time: 0.28955864906311035
inference time: 0.28310418128967285
inference time: 0.28433656692504883
inference time: 0.28536295890808105
inference time: 0.28377389907836914
inference time: 0.28389525413513184
--> Begin evaluate model performance
========================================================================
                               Performance                              
========================================================================
Total Time(us): 260892
FPS: 3.83
========================================================================
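For context, numbers like the above typically come from a loop of this shape around the RKNN-Toolkit Python API (a sketch; model loading and image preprocessing are elided, and the name of the perf-evaluation call may differ between toolkit versions):

    import time

    for _ in range(10):
        start = time.time()
        outputs = rknn.inference(inputs=[img])
        print('inference time:', time.time() - start)

    rknn.eval_perf()    # prints the Performance / Total Time / FPS table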

Author: protossw512    Time: 2019-3-15 04:46
chuyee posted on 2019-3-13 15:29
Here is my DeepLabv3 result on the 3399pro. I only get 3 FPS. I'm using the rknn model converted from dee ...

It depends on which input/output nodes you are using, and on whether you quantize your model.
Author: chuyee    Time: 2019-3-15 06:49
protossw512 posted on 2019-3-15 04:46
It depends on which input/output nodes you are using, and on whether you quantize your model.

Quantization is turned on. The input is "MobilenetV2/Conv/Conv2D" and the output is "ArgMax". I assume that's the best that can be achieved; otherwise more work has to move from the NPU to the CPU, which would make the fps even worse. Are you sure your 15 fps was achieved with a 513x513 input size?
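For reference, those input/output nodes are pinned at conversion time; with RKNN-Toolkit 1.x the conversion looks roughly like this (a sketch; the mean/scale values and the dataset file are illustrative, and argument names may differ between toolkit versions):

    from rknn.api import RKNN

    rknn = RKNN()
    # (x - 127.5) / 127.5 normalization, applied by rknn on the way in
    rknn.config(channel_mean_value='127.5 127.5 127.5 127.5', reorder_channel='0 1 2')
    rknn.load_tensorflow(tf_pb='deeplabv3_mnv2_dm05_pascal.pb',
                         inputs=['MobilenetV2/Conv/Conv2D'],
                         outputs=['ArgMax'],
                         input_size_list=[[513, 513, 3]])
    rknn.build(do_quantization=True, dataset='./dataset.txt')   # quantized build
    rknn.export_rknn('./deeplabv3_mnv2_dm05_pascal.rknn')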
Author: kitedream    Time: 2019-3-20 21:02
This post was last edited by kitedream on 2019-5-16 10:32

Impressive! I'm also working on the conversion, but for me the CPU time spent on pre- and post-processing is still quite large.
Author: chuyee    Time: 2019-3-26 03:56
chuyee posted on 2019-3-15 06:49
Quantization is turned on. The input is "MobilenetV2/Conv/Conv2D" and the output is "ArgMax". I assume that ...

@protossw512, see my bug report for rknn.perf_eval() at http://t.rock-chips.com/forum.ph ... &extra=page%3D1 . Could it be the reason behind your claimed 15 FPS?
Author: protossw512    Time: 2019-3-26 04:18
chuyee posted on 2019-3-26 03:56
@protossw512, see my bug report for rknn.perf_eval() at http://t.rock-chips.com/forum.php?mod=view ...

I am pretty sure, because after testing the official mobilenet DeepLabv3 in Python I switched to C++ and evaluated the performance of my own DeepLabv3 with native C++ code.
On top of the official mobilenet version I added the decoder and ASPP modules, which bring additional operations, with an input size of 400x400. I am able to run it at 9.x FPS.
I also found that the ArgMax node is pretty slow, so I used BiasAdd as the output and wrote my own C++ implementation to compute the segmentation result.
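In numpy terms, that replacement post-processing is just an argmax over the channel axis (a sketch of the idea; the poster implemented it in native C++, and the (1, 21, 65, 65) logits shape is the one discussed earlier in this thread rather than his 400x400 model's):

    import numpy as np

    # channel-first logits from the 'logits/semantic/BiasAdd' output
    logits = np.array(outputs[0]).reshape(1, 21, 65, 65)
    seg = np.argmax(logits, axis=1)[0].astype(np.uint8)   # (65, 65) class map
    # resize seg (nearest-neighbor) back to the original image size if needed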
Author: chuyee    Time: 2019-3-26 07:45
protossw512 posted on 2019-3-26 04:18
I am pretty sure, because after testing the official mobilenet DeepLabv3 in Python I switched to C++ ...

Good point. If I replace 'ArgMax' with 'logits/semantic/BiasAdd', I can also get 12~15 FPS (without postprocessing).

inference time: 10.533683061599731
inference time: 0.11503362655639648
inference time: 0.08590936660766602
inference time: 0.08321857452392578
inference time: 0.08301472663879395
inference time: 0.0832514762878418
inference time: 0.08804035186767578
inference time: 0.08095145225524902
inference time: 0.08228850364685059
inference time: 0.08656930923461914
done
--> Begin evaluate model performance
========================================================================
                               Performance                              
========================================================================
Total Time(us): 64915
FPS: 15.40
========================================================================

A question for the RK folks: could you please check your ArgMax implementation? Why does it take so long, (263127 - 64915) ~= 200000 us?

Welcome to Toybrick (https://t.rock-chips.com/) Powered by Discuz! X3.3