|
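When rknn-toolkit converts a large model with quantization enabled, the board may run out of memory, and the build fails with an OOM (out of memory) error.
The error log looks like this: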
- --> Loading model
- done
- --> Building model
- Killed
- Serial console output:
- [toybrick@localhost work]$ [746756.249341] python3 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
- [746756.250074] python3 cpuset=/ mems_allowed=0
- [746756.250506] CPU: 0 PID: 21169 Comm: python3 Not tainted 4.4.167 #17
- [746756.251071] Hardware name: rockchip,rk3399pro-toybrick-prod-linux (DT)
- [746756.251660] Call trace:
- [746756.251905] [<ffffff8008088948>] dump_backtrace+0x0/0x220
- [746756.252394] [<ffffff8008088b8c>] show_stack+0x24/0x30
- [746756.252851] [<ffffff80083a85ac>] dump_stack+0x94/0xbc
- [746756.253307] [<ffffff80081a5224>] dump_header.isra.5+0x50/0x15c
- [746756.253831] [<ffffff8008166bd4>] oom_kill_process+0x94/0x3d4
- [746756.254341] [<ffffff8008167188>] out_of_memory+0x1d8/0x2a0
- [746756.254840] [<ffffff800816b670>] __alloc_pages_nodemask+0x6b0/0x724
- [746756.255407] [<ffffff8008165a6c>] filemap_fault+0x24c/0x35c
- [746756.255907] [<ffffff8008226e40>] ext4_filemap_fault+0x40/0x60
- [746756.256430] [<ffffff8008185dec>] __do_fault+0x78/0xdc
- [746756.256883] [<ffffff800818906c>] handle_mm_fault+0x538/0xca4
- [746756.257394] [<ffffff80080944e0>] do_page_fault+0x214/0x36c
- [746756.257892] [<ffffff800809468c>] do_translation_fault+0x54/0xc8
- [746756.258425] [<ffffff8008080b08>] do_mem_abort+0x54/0xac
Or:
- --> Loading model
- done
- --> Building model
- E Catch exception when building RKNN model!
- T Traceback (most recent call last):
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
- T return fn(*args)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
- T options, feed_dict, fetch_list, target_list, run_metadata)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
- T run_metadata)
- T tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[7,608,608,32] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
- T [[Node: convolution_1_2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](convolution_1_2/Pad, convolution_1/weight/read)]]
- T Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
- T During handling of the above exception, another exception occurred:
- T Traceback (most recent call last):
- T File "rknn/api/rknn_base.py", line 470, in rknn.api.rknn_base.RKNNBase.build
- T File "rknn/api/rknn_base.py", line 888, in rknn.api.rknn_base.RKNNBase._quantize
- T File "rknn/base/rknnlib/app/tensorzone/quantization.py", line 248, in rknn.base.rknnlib.app.tensorzone.quantization.Quantization.run
- T File "rknn/base/rknnlib/app/tensorzone/quantization.py", line 141, in rknn.base.rknnlib.app.tensorzone.quantization.Quantization._run_quantization
- T File "rknn/base/rknnlib/app/tensorzone/graph.py", line 98, in rknn.base.rknnlib.app.tensorzone.graph.Graph.run
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
- T run_metadata_ptr)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
- T feed_dict_tensor, options, run_metadata)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
- T run_metadata)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
- T raise type(e)(node_def, op, message)
- T tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[7,608,608,32] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
- T [[Node: convolution_1_2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](convolution_1_2/Pad, convolution_1/weight/read)]]
- T Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
- T Caused by op 'convolution_1_2/Conv2D', defined at:
- T File "rknn_transform.py", line 29, in <module>
- T rknn.build(do_quantization=True, dataset='./dataset.txt')
- T File "/usr/local/lib64/python3.6/site-packages/rknn/api/rknn.py", line 162, in build
- T ret = self.rknn_base.build(do_quantization=do_quantization, dataset=dataset, pack_vdata=pre_compile)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d
- T data_format=data_format, dilations=dilations, name=name)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
- T op_def=op_def)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
- T return func(*args, **kwargs)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
- T op_def=op_def)
- T File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
- T self._traceback = tf_stack.extract_stack()
- T ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[7,608,608,32] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
- T [[Node: convolution_1_2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](convolution_1_2/Pad, convolution_1/weight/read)]]
- T Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
- done
Either of the error logs above means the system has run out of memory. Creating a swap file is one way to work around the shortage.
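As a rough sanity check on the second log, the size of the tensor that failed to allocate can be worked out from the shape it reports. The sketch below assumes the float type in the log is 32-bit (4 bytes per element); tensor_bytes is a hypothetical helper name used only for illustration, not part of rknn-toolkit.

```shell
#!/bin/sh
# Hypothetical helper: bytes needed for a 4-D tensor.
# $1..$4 are the dimensions, $5 is the size of one element in bytes.
tensor_bytes() {
    echo $(( $1 * $2 * $3 * $4 * $5 ))
}

# Shape taken from the log: shape[7,608,608,32], type float (assumed 4 bytes).
bytes=$(tensor_bytes 7 608 608 32 4)
echo "$bytes bytes, about $(( bytes / 1024 / 1024 )) MiB"
```

A single buffer of this size is already a noticeable share of the 3.8G of RAM shown on this board, and quantization may well keep several intermediate tensors alive at once, which is consistent with the allocation failing.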
Use free -mh to check the current swap size. By default there should be no swap:
- $ free -mh
- total used free shared buff/cache available
- Mem: 3.8G 149M 3.2G 365M 439M 3.2G
- Swap: 0B 0B 0B
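The same check can be scripted, for example before starting a long conversion. This is a minimal sketch that reads the SwapTotal field from /proc/meminfo; the function names swap_total_kb and has_swap are hypothetical, and the optional file argument exists only so the functions can be tried on a sample file.

```shell
#!/bin/sh
# Print SwapTotal in kB from a meminfo-style file (default: /proc/meminfo).
swap_total_kb() {
    awk '/^SwapTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

# Succeed (exit status 0) only if some swap is configured.
has_swap() {
    total=$(swap_total_kb "$1")
    [ "${total:-0}" -gt 0 ]
}

if has_swap; then
    echo "swap enabled: $(swap_total_kb) kB"
else
    echo "no swap configured"
fi
```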
Create a swap file as follows. This example creates a 500 MB swap file at /mnt/swap. Adjust the size by changing count; 2048 (2 GB) is recommended:
- cd /mnt
- sudo dd if=/dev/zero of=swap bs=1M count=500
- sudo chmod 600 swap
- sudo mkswap swap
- sudo swapon swap
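The steps above can be wrapped in a small script so that the size is a parameter. This is only a sketch: SWAP_FILE, SWAP_MB and DRY_RUN are hypothetical variable names, and by default the script just prints the commands it would run (DRY_RUN=1) so it can be reviewed before anything is changed on the board.

```shell
#!/bin/sh
# Sketch: create and enable a swap file of a chosen size.
SWAP_FILE="${SWAP_FILE:-/mnt/swap}"   # where the swap file is created
SWAP_MB="${SWAP_MB:-2048}"            # size in MB (2048 as suggested above)
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Same four steps as above; stop at the first one that fails.
make_swap() {
    run sudo dd if=/dev/zero of="$SWAP_FILE" bs=1M count="$SWAP_MB" || return 1
    run sudo chmod 600 "$SWAP_FILE" || return 1
    run sudo mkswap "$SWAP_FILE" || return 1
    run sudo swapon "$SWAP_FILE"
}

make_swap
```

Running it with DRY_RUN=0 set in the environment executes the commands instead of printing them.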
Once it is created, check the swap information:
- $ swapon -s
- Filename Type Size Used Priority
- /mnt/swap file 511996 459592 -1
Running free -mh again shows that swap has grown to about 500 MB:
- $ free -mh
- total used free shared buff/cache available
- Mem: 3.8G 82M 25M 371M 3.7G 3.3G
- Swap: 499M 449M 50M
To enable the swap file automatically at every boot, edit /etc/fstab:
- sudo vi /etc/fstab
- Add this line: /mnt/swap swap swap defaults 0 0
The meaning of each fstab field can be looked up online or in the fstab manual page (man fstab).
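If the fstab entry is added from a script rather than by hand, it helps to make the step idempotent so that repeated runs do not add duplicate lines. The sketch below is one possible way to do that; add_swap_entry is a hypothetical helper name, and the demonstration writes to a temporary file rather than the real /etc/fstab.

```shell
#!/bin/sh
# The fstab line from the step above.
ENTRY="/mnt/swap swap swap defaults 0 0"

# Append the entry to an fstab-style file unless the exact line is already there.
# $1: path of the file, $2: line to add
add_swap_entry() {
    if grep -qxF "$2" "$1" 2>/dev/null; then
        echo "entry already present in $1"
    else
        printf '%s\n' "$2" >> "$1"
    fi
}

# Demonstration on a temporary file; the second call changes nothing.
demo=$(mktemp)
add_swap_entry "$demo" "$ENTRY"
add_swap_entry "$demo" "$ENTRY"
cat "$demo"
rm -f "$demo"
```

To apply it to the real file, the function would need to be called with /etc/fstab as the first argument from a shell running as root, since a normal user cannot write to that file.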
If the swap file is no longer needed, disable swap and then delete the file:
- sudo swapoff /mnt/swap
- sudo rm /mnt/swap
|
|