MNIST inference for digit images .. TensorFlow Lite CPU vs TPU.

In my previous TensorFlow MNIST demo we benchmarked digit image recognition inference using a standard TensorFlow model vs. a TensorFlow Lite model, and I was quite impressed with the almost 200x improvement. With TensorFlow ..


vs TensorFlow Lite …


And our test images were like the following ..


Today I wanted to test whether we can further reduce the inference time “at the edge”, as we are running Debian on an industrial IMX8 gateway board. So we are going to compare the same digit image classification between TensorFlow Lite running on the CPU (i.e. the IMX8 SoC) and with ML ‘acceleration’ using a Google Coral M.2 Edge TPU card (4 TOPS using 2 W of power).

We did have a bit of an issue getting the software to integrate the Edge TPU card, due to MSI-X not being enabled in the kernel of our Linux 5.15, but it was sorted out by our board BSP team at Compulab.

Google Coral M.2 Edge TPU accelerator board, 4 TOPS at 2 W of power


According to the TensorFlow Edge TPU webpage, there are just two modifications required for our TensorFlow Lite script to run inferences on the Edge TPU accelerator ..

  1. Instead of using import tensorflow as tf, load the tflite_runtime package like this: import tflite_runtime.interpreter as tflite
  2. Add the Edge TPU delegate when constructing the Interpreter. For example, your TensorFlow Lite code will ordinarily have a line like this: interpreter = tflite.Interpreter(model_path). So change it to this: interpreter = tflite.Interpreter(model_path, experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')]) (a minimal sketch putting both changes together follows this list).
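
Putting the two changes together, here is a minimal sketch of how our MNIST inference script could look. The model file names, the USE_EDGETPU flag and the 28x28 grayscale input are my own placeholders, and the accelerated run additionally assumes the model has been compiled with the Edge TPU compiler:

import numpy as np
import tflite_runtime.interpreter as tflite

USE_EDGETPU = True  # set to False for the CPU-only run

# Hypothetical model file names; the Edge TPU needs a model compiled for it.
MODEL_PATH = 'mnist_quant_edgetpu.tflite' if USE_EDGETPU else 'mnist_quant.tflite'

# The only change versus the CPU-only script is the extra delegate argument.
delegates = [tflite.load_delegate('libedgetpu.so.1')] if USE_EDGETPU else []
interpreter = tflite.Interpreter(model_path=MODEL_PATH,
                                 experimental_delegates=delegates)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def classify_digit(image_28x28):
    """Run one inference on a 28x28 grayscale image and return the predicted digit."""
    # Cast and reshape the image to match the model's expected input tensor.
    data = np.asarray(image_28x28, dtype=input_details[0]['dtype'])
    data = data.reshape(input_details[0]['shape'])
    interpreter.set_tensor(input_details[0]['index'], data)
    interpreter.invoke()
    scores = interpreter.get_tensor(output_details[0]['index'])
    return int(np.argmax(scores))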

Then we first ran TensorFlow Lite on the CPU (IMX8 SoC, 4 x Cortex-A53)

MNIST digit inference on the IMX8 CPU only


.. and with the Edge TPU accelerator

MNIST digit inference with the 4 TOPS Edge TPU accelerator


Shocker!

Inference with the TPU accelerator board was actually SLOWER than just running on the SoC. My suspicion is that, given the tiny test images used for our MNIST digit recognition experiment, the time taken to move data back and forth between the CPU and the TPU likely outweighs the faster inference part of the entire cycle. I need to ping the crew at the PyCoral GitHub and check whether that is the case.
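
Before pinging them, one rough way to check that suspicion would be to time the set_tensor / invoke / get_tensor steps separately, so the per-image overhead around the actual inference becomes visible. This is just a sketch reusing the interpreter, input_details and output_details from the snippet above; timed_inference is a hypothetical helper of my own:

import time

def timed_inference(image_28x28):
    """Return the predicted digit plus a breakdown of where the time goes."""
    data = np.asarray(image_28x28, dtype=input_details[0]['dtype'])
    data = data.reshape(input_details[0]['shape'])

    t0 = time.perf_counter()
    interpreter.set_tensor(input_details[0]['index'], data)   # copy data in
    t1 = time.perf_counter()
    interpreter.invoke()                                       # actual inference
    t2 = time.perf_counter()
    scores = interpreter.get_tensor(output_details[0]['index'])  # copy data out
    t3 = time.perf_counter()

    return {
        'digit': int(np.argmax(scores)),
        'set_tensor_ms': (t1 - t0) * 1000,
        'invoke_ms': (t2 - t1) * 1000,
        'get_tensor_ms': (t3 - t2) * 1000,
    }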

Next, I want to run inference on more data and larger test objects, perhaps an OpenCV object detector, and see how much faster adding an accelerator really is in practice.

Author: stangeek71

51-year-old Electrical/Electronics Engineer with a background in industrial servo controls, but I have been developing and pushing industrial gateways for the past 15 years. Nowadays, I am working hard to catch up on machine learning knowledge to develop, test and deploy useful new features for customers of our next-gen industrial gateways.
