【ATU Book-i.MX8 Series - TFLite Advanced】Model Quantization (Part 1)

I. Overview

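Edge computing relies on two key techniques. One is lightweight network architecture, which simplifies complex model structures to reduce the parameter count and speed up computation. The other is the model optimization capability built into each neural network framework (TensorFlow, PyTorch, etc.). This series covers TensorFlow Lite's post-training quantization and quantization-aware training in two parts; this first part focuses on the former. Quantization means running model inference at reduced numerical precision, so that a model can be deployed cost-effectively on a wide range of edge devices, as shown in the figure below. Incidentally, the NPU (Neural Processing Unit) in the NXP i.MX8M Plus is an integer-only AI accelerator and delivers its best performance only with 8-bit integer operations! Later chapters in this series will also use the NPU to accelerate algorithms.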

Conceptual diagram of TensorFlow model applications

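A model produced with TensorFlow Lite quantization is a trained lightweight model whose inference speed has been further improved through quantization and optimization! As the workflow diagram below shows, once the model has been saved, it goes through freezing, optimization and finally conversion into a lightweight model (.tflite), so that it runs as efficiently as possible on mobile devices.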

※ MobileNet is a lightweight model architecture; this article focuses on how to convert a model into a lightweight model (.tflite) through quantization.

Conceptual diagram of the TensorFlow model workflow

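New readers who want to learn more about artificial intelligence, machine learning and deep learning can refer to the post below:
大大通 featured post: 【ATU Book-i.MX8 Series】Blog Post Index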

TensorFlow Lite Advanced series: article structure diagram

II. Quantization Theory

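What is quantization? In this article it refers to quantizing numeric values, i.e. representing them within a limited range. Its purpose is to reduce the amount of numeric data and the model size, which speeds up transfer and execution (inference)! Post-training quantization is an optimization that quantizes a model after training has finished. Its main feature is that it only needs the saved model (SavedModel / .h5 / .ckpt) and does not require the original training dataset.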

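For example, as shown in the figure below, floating-point values (float32) spanning roughly -3.4×10³⁸ to +3.4×10³⁸ are quantized to integers (int, 32-bit) spanning -2³¹ to 2³¹−1. The maximum and minimum of the original data are used to determine the range that actually needs to be represented, which can greatly reduce the amount of data.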
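To put the ranges quoted above in concrete terms, NumPy can report the representable limits of each type. This is only a quick check of the numbers, not part of the TensorFlow Lite workflow:

```python
import numpy as np

# Largest finite float32 value (about 3.4e38); the most negative one is its negation
f32 = np.finfo(np.float32)
print("float32:", -f32.max, "to", f32.max)

# Signed 32-bit and 8-bit integer ranges
for dtype in (np.int32, np.int8):
    info = np.iinfo(dtype)
    print(f"{np.dtype(dtype).name}: {info.min} to {info.max}")
```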

Data quantization diagram

How is quantization done? The following walks through the actual formulas.

Assume the original values are floating-point numbers in the range [-2.0, 6.0], to be quantized into the 8-bit integer range [-128, 127].

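Step 1: find the scale, i.e. the smallest real-valued step that one integer level represents after quantization.

Step 2: find the corresponding zero point, i.e. the integer that the real value 0.0 maps to.

Step 3: find the quantized integer that corresponds to a given floating-point value.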
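The original formulas for these steps did not survive. The derivation below is a reconstruction assuming the standard asymmetric (affine) scheme, where r is a real value, q its quantized integer, S the scale and Z the zero point. With the zero point rounded before it is added, this scheme reproduces the value 32 quoted for r = 3.0.

$$
\begin{aligned}
S &= \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}} = \frac{6.0 - (-2.0)}{127 - (-128)} = \frac{8}{255} \approx 0.03137 \\
Z &= \operatorname{round}\!\left(q_{\min} - \frac{r_{\min}}{S}\right) = \operatorname{round}(-128 + 63.75) = \operatorname{round}(-64.25) = -64 \\
q &= \operatorname{round}\!\left(\frac{r}{S}\right) + Z = \operatorname{round}\!\left(3.0 \times \frac{255}{8}\right) - 64 = \operatorname{round}(95.625) - 64 = 96 - 64 = 32
\end{aligned}
$$

Dequantizing goes the other way, r ≈ S(q − Z) = (8/255)(32 + 64) ≈ 3.012, so the round-trip error stays below one step S.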

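For the floating-point value 3.0, this yields the quantized value 32. Applying the same mapping to the whole floating-point range gives the integer range shown in the table below.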
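The three steps can be checked numerically. A minimal NumPy sketch, assuming the asymmetric (affine) scheme q = round(r / S) + Z with the zero point Z rounded before it is added; the helper names are illustrative and are not TensorFlow Lite APIs:

```python
import numpy as np

def quant_params(r_min, r_max, q_min=-128, q_max=127):
    """Steps 1 and 2: scale S and zero point Z mapping [r_min, r_max] onto [q_min, q_max]."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=-128, q_max=127):
    """Step 3: q = round(r / S) + Z, clipped to the int8 range."""
    q = np.round(np.asarray(r, dtype=np.float64) / scale) + zero_point
    return np.clip(q, q_min, q_max).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Approximate inverse: r = S * (q - Z)."""
    return scale * (np.asarray(q, dtype=np.float64) - zero_point)

scale, zero_point = quant_params(-2.0, 6.0)
print("scale =", scale, "zero_point =", zero_point)
print(quantize([-2.0, 0.0, 3.0, 6.0], scale, zero_point))
print(dequantize(quantize(3.0, scale, zero_point), scale, zero_point))
```

With these parameters the endpoints -2.0 and 6.0 land exactly on -128 and 127, 0.0 maps to the zero point -64, and 3.0 maps to 32.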

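What are the advantages and disadvantages of quantization for TensorFlow Lite models?

Advantages:

- Smaller models: up to 75% smaller
- Faster inference: integer arithmetic greatly improves speed
- Better hardware support: enables inference on processors that only handle 8-bit integers
- Faster transfer: smaller models are quicker to transfer and deploy

Disadvantages:

- Accuracy loss: because the range and resolution of representable values shrink, model accuracy can drop, sometimes noticeably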
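The "up to 75%" figure follows directly from storage width: a float32 weight takes 4 bytes, float16 takes 2 and int8 takes 1. A quick check for a hypothetical model with one million weights (a real .tflite file also stores the graph and other metadata, so actual savings will differ somewhat):

```python
import numpy as np

NUM_PARAMS = 1_000_000  # hypothetical number of weights

baseline = NUM_PARAMS * np.dtype(np.float32).itemsize  # float32 storage in bytes
savings = {}
for dtype in (np.float32, np.float16, np.int8):
    name = np.dtype(dtype).name
    nbytes = NUM_PARAMS * np.dtype(dtype).itemsize
    savings[name] = 1 - nbytes / baseline  # fraction saved relative to float32
    print(f"{name:8s} {nbytes:>9,d} bytes  ({savings[name]:.0%} smaller)")
```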

III. Post-training Quantization

The TensorFlow Lite converter offers roughly three post-training quantization methods, as shown in the figure below:

1. Dynamic Range Quantization

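The floating-point weights (parameters) are converted to 8-bit integers at conversion time, while activations remain in floating point. At inference time, operators that support it quantize their activations on the fly based on each input's observed range, compute with 8-bit values and return floating-point results. Because the activation range is determined dynamically at run time, this is called "dynamic range quantization".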

※ Effect: model size up to 75% smaller and inference 2-3x faster; suited to CPU execution.

# Dynamic Range (Hybrid) Quantization Python Code
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("SavedModel_Path")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with tf.io.gfile.GFile("model.tflite", 'wb') as f:
    f.write(tflite_model)
print("Quantization complete! - model.tflite")
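To make the idea concrete, here is a NumPy sketch of a dynamic-range fully connected layer under simplifying assumptions (per-tensor symmetric int8 quantization; TensorFlow Lite's real kernels may differ in details, for example per-channel weight scales). The weights are quantized once, ahead of time, while the activation scale is recomputed from each input at run time, which is where the name "dynamic range" comes from. The function names are hypothetical:

```python
import numpy as np

def quantize_symmetric(x):
    """Per-tensor symmetric int8 quantization: returns (int8 values, scale)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dynamic_range_dense(x, w_q, w_scale):
    """Float in, float out; the matrix multiply itself runs on 8-bit integers."""
    x_q, x_scale = quantize_symmetric(x)                 # activation range measured per call
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)    # integer accumulation in int32
    return acc.astype(np.float32) * (x_scale * w_scale)  # rescale back to float

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 10)).astype(np.float32)
w_q, w_scale = quantize_symmetric(w)  # done once, like the converter does for weights

x = rng.standard_normal((1, 64)).astype(np.float32)
y_float = x @ w
y_dyn = dynamic_range_dense(x, w_q, w_scale)
print("max abs error:", float(np.max(np.abs(y_float - y_dyn))))
```

The output stays in floating point, so the rest of the graph is unaffected; only the weight storage and the inner product change.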

2. Float16 Quantization (Half-Float)

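The floating-point weights (parameters) are converted to half-precision floating point (float16), which keeps reasonable precision while shrinking the model; hence the name "half-float quantization".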

※ Effect: model size up to 50% smaller and inference roughly twice as fast; suited to GPU execution.

# Float16 Quantization Python Code
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("SavedModel_Path")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()  # run the conversion
with tf.io.gfile.GFile("model.tflite", 'wb') as f:
    f.write(tflite_model)
print("Quantization complete! - model.tflite")
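A quick NumPy sketch of what float16 quantization does to the weights themselves: each value is rounded to half precision (10 stored mantissa bits), halving storage while keeping the relative rounding error at most 2⁻¹¹ (about 4.9×10⁻⁴) for values in float16's normal range. The weights here are hypothetical random values. This only illustrates the storage format; per the TensorFlow Lite documentation, on CPU the float16 weights are dequantized back to float32 before computation.

```python
import numpy as np

rng = np.random.default_rng(0)
w32 = (0.05 * rng.standard_normal(100_000)).astype(np.float32)  # hypothetical weights
w16 = w32.astype(np.float16)  # what float16 quantization stores

print("bytes:", w32.nbytes, "->", w16.nbytes)

# Relative rounding error, restricted to float16's normal range (|w| >= 2**-14)
normal = np.abs(w32) >= 2.0 ** -14
rel_err = np.abs(w16[normal].astype(np.float32) - w32[normal]) / np.abs(w32[normal])
print("max relative error:", float(rel_err.max()))
```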

3. Full Integer Quantization

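Both the floating-point weights (parameters) and the activations are converted to integers. A small representative dataset is additionally supplied so the converter can calibrate the range of each activation and preserve accuracy; hence the name "full integer quantization".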

※ Effect: model size up to 75% smaller and inference more than 3x faster; suited to CPU, Edge TPU and NPU execution.

Quick conversion - code:

# Full Integer Quantization Python Code
import tensorflow as tf
import numpy as np

H, W, C = 224, 224, 3  # model input height, width, channels (example values; use your model's input size)

# Build a synthetic representative dataset (randomly generated data will reduce accuracy)
def representative_dataset_gen():
    for _ in range(250):
        yield [np.random.uniform(0.0, 1.0, size=(1, H, W, C)).astype(np.float32)]

# Quantization conversion
converter = tf.lite.TFLiteConverter.from_saved_model("SavedModel_Path")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.representative_dataset = representative_dataset_gen
tflite_model = converter.convert()
with open("model.tflite", 'wb') as f:
    f.write(tflite_model)
print("transfer done!!")
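Because the code above sets inference_input_type and inference_output_type to tf.uint8, the converted model expects quantized inputs and returns quantized outputs. A minimal sketch of the conversion in both directions, assuming the `detail` dict comes from `tf.lite.Interpreter`'s `get_input_details()` / `get_output_details()`, whose `'quantization'` entry is a `(scale, zero_point)` pair; `to_model_input` and `from_model_output` are hypothetical helper names:

```python
import numpy as np

def to_model_input(x, detail):
    """Quantize float data for an integer input tensor: q = round(x / scale) + zero_point."""
    scale, zero_point = detail["quantization"]
    info = np.iinfo(detail["dtype"])
    q = np.round(np.asarray(x, dtype=np.float32) / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(detail["dtype"])

def from_model_output(q, detail):
    """Dequantize an integer output tensor: x = scale * (q - zero_point)."""
    scale, zero_point = detail["quantization"]
    return scale * (np.asarray(q, dtype=np.float32) - zero_point)

# Typical use with the converted model (requires TensorFlow; `image` must match the input shape):
# interpreter = tf.lite.Interpreter(model_path="model.tflite")
# interpreter.allocate_tensors()
# inp = interpreter.get_input_details()[0]
# out = interpreter.get_output_details()[0]
# interpreter.set_tensor(inp["index"], to_model_input(image, inp))
# interpreter.invoke()
# result = from_model_output(interpreter.get_tensor(out["index"]), out)
```

Values outside the representable range are clipped rather than wrapped, which mirrors how an integer tensor saturates.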

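Notes:

(1) Model formats supported by TFLiteConverter (Keras / SavedModel / Function / Session / Frozen Graph): see the official TensorFlow Lite converter documentation.
(2) Official introduction to TFLiteConverter: see the tf.lite.TFLiteConverter API documentation.
(3) If the model has two or more inputs, change the output of representative_dataset_gen to something like yield [data1, data2].
(4) To change the input size, try converting with from_concrete_functions.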
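For note (3), a minimal sketch of a representative-dataset generator for a model with two inputs. The input shapes are hypothetical placeholders, and the random data stands in for real samples (see the discussion on dataset quality). Each yielded list contains one float32 array per model input, in the same order as the model's inputs, each with a batch dimension of 1.

```python
import numpy as np

# Hypothetical input shapes for a two-input model (batch size 1)
IMAGE_SHAPE = (1, 224, 224, 3)
AUX_SHAPE = (1, 10)

def representative_dataset_gen(num_samples=100):
    rng = np.random.default_rng(0)
    for _ in range(num_samples):
        data1 = rng.uniform(0.0, 1.0, size=IMAGE_SHAPE).astype(np.float32)  # replace with real data
        data2 = rng.uniform(0.0, 1.0, size=AUX_SHAPE).astype(np.float32)    # replace with real data
        yield [data1, data2]  # one array per model input

# converter.representative_dataset = representative_dataset_gen
```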

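The importance of the dataset used during TensorFlow Lite conversion must be emphasized here! In the code above, the dataset is defined by representative_dataset_gen(). That code is quick and simple because it ignores the original source data and produces a lightweight model (.tflite) from randomly generated data, but this can significantly hurt accuracy. The reason is that during conversion, TensorFlow runs the representative samples through the model to measure the range (minimum and maximum) of each activation tensor, and the quantization scale and zero point are computed from those ranges; the weights are not retrained. If the samples do not resemble real inputs, the measured ranges will be wrong. Therefore, for models with higher accuracy requirements, it is strongly recommended to convert using the original training or test data. Keeping this in mind makes a noticeable difference when converting TensorFlow Lite models to integers! See the following code: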

Accurate conversion - code:

# Full Integer Quantization Python Code (representative data from a real dataset)
import tensorflow as tf

# Build a Keras dataset from an image folder ("your data" is a placeholder path)
train_ds = tf.keras.preprocessing.image_dataset_from_directory("your data", batch_size=1, color_mode='rgb', image_size=(192, 192))

# Collect a list of images taken from the Keras dataset
num = 0
images = []
for image_batch, labels_batch in train_ds:
    num = num + 1
    images.append(image_batch[0])
    if num == 250:
        print(num)
        break

# Build the representative dataset from real data (recommended; improves accuracy)
# Apply the same preprocessing (e.g. normalization) that the model used during training
def representative_dataset_gen():
    for data in tf.data.Dataset.from_tensor_slices(images).batch(1).take(100):
        yield [tf.dtypes.cast(data, tf.float32)]

# Quantization conversion
converter = tf.lite.TFLiteConverter.from_saved_model("SavedModel_Path")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.representative_dataset = representative_dataset_gen
tflite_model = converter.convert()
with open("model.tflite", 'wb') as f:
    f.write(tflite_model)
print("transfer done!!")
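Why the representative data matters can be sketched in a few lines of NumPy. This is a simplified model of calibration, not TensorFlow Lite's implementation: it records the minimum and maximum of one activation tensor over the representative samples and derives an int8 scale and zero point from that range. The toy layer (a ReLU after a fixed linear map) and the assumed "real" input distribution are hypothetical; the point is only that random inputs can yield a very different range than real ones.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8)).astype(np.float32)  # hypothetical fixed layer weights

def activation(x):
    """Toy layer standing in for one activation tensor: ReLU(x @ W)."""
    return np.maximum(x @ W, 0.0)

def calibrate(samples):
    """Record the activation's min/max over all samples, then derive int8 scale and zero point."""
    lo, hi = np.inf, -np.inf
    for x in samples:
        a = activation(x)
        lo, hi = min(lo, float(a.min())), max(hi, float(a.max()))
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain 0.0
    scale = max(hi - lo, 1e-8) / 255.0
    zero_point = int(round(-128 - lo / scale))
    return (lo, hi), scale, zero_point

# Hypothetical "real" inputs: values clustered around 0.1
real = [rng.normal(0.1, 0.02, size=(1, 16)).astype(np.float32) for _ in range(100)]
# Random inputs like the quick-conversion example: uniform in [0, 1)
rand = [rng.uniform(0.0, 1.0, size=(1, 16)).astype(np.float32) for _ in range(100)]

for name, data in (("real", real), ("random", rand)):
    (lo, hi), scale, zero_point = calibrate(data)
    print(f"{name:6s} range=[{lo:.3f}, {hi:.3f}] scale={scale:.5f} zero_point={zero_point}")
```

Running it should show a noticeably wider range for the random inputs. Mapped onto that range, the real activations would use only part of the 256 integer levels, i.e. a coarser effective resolution, which is the accuracy loss described above.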

4. Quantization Tips

TensorFlow Lite quantization can be divided into three methods: dynamic range quantization, full integer quantization and float16 quantization.

To support different model formats (such as .h5 and .pb) and both TensorFlow 1.x and 2.x, the official tools provide several conversion methods:

※ Tested with TensorFlow 2.4.0 and TensorFlow 1.15.0. For more on model conversion, please refer to the next chapter.

Conversion formats supported in TensorFlow 1.x

# (1) Saved Model

# TensorFlow Lite Converter ( savedmodel to .tflite )
import tensorflow as tf
converter = tf.compat.v1.lite.TFLiteConverter.from_saved_model("savedmodel_path")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with tf.io.gfile.GFile("model.tflite", 'wb') as f:
    f.write(tflite_model)
print("transfer done!!")

# (2) Keras ( .h5 file)

# TensorFlow Lite Converter ( .h5 to .tflite )
import tensorflow as tf
converter = tf.compat.v1.lite.TFLiteConverter.from_keras_model_file("model.h5")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with tf.io.gfile.GFile("model.tflite", 'wb') as f:
    f.write(tflite_model)
print("transfer done!!")

# (3) ckpt (.ckpt file)

# TensorFlow Lite Converter ( .ckpt to .tflite )
import tensorflow as tf
from src import transform  # project-specific module that defines the model architecture
g = tf.compat.v1.Graph()
soft_config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
soft_config.gpu_options.allow_growth = True
with g.as_default(), tf.compat.v1.Session(config=soft_config) as sess:
    img_placeholder = tf.compat.v1.placeholder(tf.float32, shape=[1, 474, 712, 3], name='img_placeholder')  # input node (check with Netron)
    preds = transform.net(img_placeholder)  # output node (check with Netron)
    saver = tf.compat.v1.train.Saver()
    saver.restore(sess, "model.ckpt")  # ckpt file
    converter = tf.compat.v1.lite.TFLiteConverter.from_session(sess, [img_placeholder], [preds])
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()
    with tf.io.gfile.GFile("model.tflite", 'wb') as f:
        f.write(tflite_model)
    print("transfer done!!")

# (4) Frozen Graph (.pb file)

# TensorFlow Lite Converter ( .pb to .tflite )
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
# Weight Quantization - Input/Output=float32
input_arrays = ["normalized_input_image_tensor"]
output_arrays = ['TFLite_Detection_PostProcess', 'TFLite_Detection_PostProcess:1',
                 'TFLite_Detection_PostProcess:2', 'TFLite_Detection_PostProcess:3']
input_shapes = {"normalized_input_image_tensor": [1, 300, 300, 3]}
converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph("model.pb", input_arrays, output_arrays, input_shapes)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.allow_custom_ops = True  # TFLite_Detection_PostProcess is a custom op
tflite_quant_model = converter.convert()
with open('model.tflite', 'wb') as w:
    w.write(tflite_quant_model)
print("transfer done!!")

Conversion formats supported in TensorFlow 2.x

# (1) Saved Model

# TensorFlow Lite Converter ( savedmodel to .tflite )
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("SavedModel_Path")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with tf.io.gfile.GFile("model.tflite", 'wb') as f:
    f.write(tflite_model)
print("Quantization complete! - model.tflite")

# (2) Keras

# TensorFlow Lite Converter ( .h5 to .tflite )
import tensorflow as tf
model = tf.keras.models.load_model("model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with tf.io.gfile.GFile("model.tflite", 'wb') as f:
    f.write(tflite_model)
print("Quantization complete! - model.tflite")

# (3) function

# TensorFlow Lite Converter ( func to .tflite )
import tensorflow as tf
model = tf.saved_model.load("savedmodel")
concrete_func = model.signatures["serving_default"]
concrete_func.inputs[0].set_shape([1, 256, 256, 3])  # fix the input size before conversion
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with tf.io.gfile.GFile("model.tflite", 'wb') as f:
    f.write(tflite_model)
print("Quantization complete! - model.tflite")

IV. Conclusion

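This article introduced TensorFlow post-training quantization and showed how to quantize various TensorFlow models into lightweight models (.tflite). Three methods were covered: dynamic range quantization, float16 quantization and full integer quantization. They can quantize a model quickly even without supplying reference data, and when higher accuracy is needed, the original dataset can be used to obtain a more accurate model! For today's popular edge computing environments and integer-only AI accelerators, this is practically an essential optimization step! Next, we will introduce another officially recommended method, quantization-aware training. Stay tuned!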


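If you have any questions about advanced TensorFlow Lite techniques, feel free to leave a comment below the post!!
More advanced TensorFlow Lite articles are on the way!! Stay tuned for 【ATU Book-i.MX8 Series - TFLite Advanced】!!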
