# TFLite Inference Example
The cube:evk makes real-time AI at the edge simple and efficient. Powered by the NXP i.MX 8M Plus SoC with an integrated 2.3 TOPS NPU, it can accelerate neural-network inference directly on the device — without cloud round-trips, high latency, or heavy CPU load.
To help you get started quickly, we provide a complete, open-source example that demonstrates object detection with LiteRT (TensorFlow Lite) using hardware acceleration via the VX Delegate.
## Example Repository
➡️ [tflite-inference-example](https://github.com/cubesys-GmbH/tflite-inference-example)
This example shows:
- Loading and executing a TFLite model
- Using the NPU (VX Delegate) for acceleration
- Falling back to CPU-only inference
- Drawing bounding boxes on detected objects
- End-to-end inference pipeline for still images
It is designed as a minimal, readable starting point for your own perception workloads.
## Getting Started
### 1. Clone the Repository
```bash
git clone https://github.com/cubesys-GmbH/tflite-inference-example.git
cd tflite-inference-example
```

### 2. Create a Python Virtual Environment
This keeps all dependencies self-contained.
```bash
python3 -m venv venv
source venv/bin/activate
```

### 3. Install Dependencies
```bash
pip install --upgrade pip
pip install -r requirements.txt
```

This installs:
- tflite-runtime (LiteRT)
- OpenCV
- Pillow
- Utility libraries used by the example
## Hardware Acceleration: VX Delegate
The i.MX 8M Plus NPU is accessed through the VX Delegate (`libvx_delegate.so`). Verify that the delegate library is present on the cube:
```bash
ls /usr/lib/libvx_delegate.so
```

If not found, the example will automatically fall back to CPU inference.
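You can also perform the same check programmatically before constructing the interpreter. A minimal sketch, assuming the standard delegate path shown above:

```python
import os

VX_DELEGATE_PATH = "/usr/lib/libvx_delegate.so"

def delegate_available(path: str = VX_DELEGATE_PATH) -> bool:
    # The delegate can only be loaded if the shared library
    # exists on the target filesystem.
    return os.path.exists(path)

if delegate_available():
    print("VX delegate found: NPU acceleration possible")
else:
    print("VX delegate not found: falling back to CPU")
```

This mirrors what the example's `try/except` around `load_delegate` achieves, but lets you branch earlier, e.g. to choose a different model variant for CPU-only runs.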
## Run the Example
Run inference with NPU acceleration:
```bash
python image_detection.py --input input/example.jpg --output output/result.jpg
```

If the VX delegate loads successfully, you'll see:
```
VX delegate loaded (NPU acceleration enabled)
```

Force CPU-only inference:
```bash
python image_detection.py --input input/example.jpg --output output/result.jpg --no-delegate
```

The output image with bounding boxes is saved under:
```
output/result.jpg
```

## Understanding the Model
The example uses:
```
models/ssd_mobilenet_v1_1/ssd_mobilenet_v1_1.tflite
```

This is a Single Shot Detector (SSD) with MobileNet v1 as its backbone.
### Why this model?
- ✔ Fast – real-time capable even on embedded hardware
- ✔ Lightweight – designed for edge devices
- ✔ NPU friendly – fully quantized INT8 version runs efficiently
- ✔ Widely used – ideal for demos, prototyping, education
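The "NPU friendly" point comes down to quantization: a fully quantized model stores tensors as 8-bit integers related to real values by the TFLite affine scheme, `real = scale * (q - zero_point)`. A minimal sketch of that arithmetic (the `scale` and `zero_point` values below are illustrative, not read from the actual model):

```python
import numpy as np

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # TFLite affine quantization: real_value = scale * (quantized - zero_point)
    return scale * (q.astype(np.int32) - zero_point)

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Round to the nearest representable step, then clamp to uint8 range
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

# Illustrative parameters only (check the tensor's quantization metadata)
scale, zero_point = 0.007874, 128
x = np.array([0.0, 0.25, -0.25])
q = quantize(x, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```

Because every tensor is uint8, the NPU can execute the whole graph without falling back to float emulation, which is what makes this model run efficiently on the 2.3 TOPS accelerator.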
### SSD Model Outputs
The TFLite version provides three tensors:
- Bounding boxes: normalized coordinates `(ymin, xmin, ymax, xmax)`
- Class IDs
- Confidence scores
The example filters detections with confidence > 0.6 and draws boxes accordingly.
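The class IDs are indices into the label map the model was trained on (COCO for the stock SSD MobileNet). The example prints raw IDs; mapping them to readable names is a small addition. The map below is an illustrative subset with assumed indices, so verify it against the label file that ships with your model:

```python
# Illustrative subset of a COCO-style label map (assumed, not taken
# from the example repository; verify against the model's label file)
COCO_LABELS = {0: "person", 1: "bicycle", 2: "car", 3: "motorcycle"}

def label_for(class_id: int) -> str:
    # Fall back to the numeric ID for classes missing from the map
    return COCO_LABELS.get(class_id, f"class_{class_id}")

print(label_for(2))   # known class
print(label_for(41))  # unknown class falls back to its ID
```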
## Example Output
After running the script, the result includes bounding boxes around detected objects, saved as:
```
output/result.jpg
```

## Code Overview – End-to-End Inference Pipeline
This section walks through the main building blocks of image_detection.py so you can quickly adapt it for your own models.
### 1. Loading the Interpreter (with optional VX Delegate)
```python
from multiprocessing import cpu_count

from tflite_runtime.interpreter import Interpreter, load_delegate

def load_interpreter(model_path: str, use_delegate: bool) -> Interpreter:
    delegates = []

    if use_delegate:
        # Try to load the VX delegate (NPU)
        try:
            vx_delegate = load_delegate('/usr/lib/libvx_delegate.so')
            delegates.append(vx_delegate)
            print("VX delegate loaded (NPU acceleration enabled)")
        except Exception as e:
            print(e)
            print("Running on CPU fallback")
    else:
        print("Running inference on CPU (delegate disabled)")

    interpreter = Interpreter(
        model_path=model_path,
        experimental_delegates=delegates,
        num_threads=cpu_count(),  # use all CPU cores when running on CPU
    )
    return interpreter
```

- If `use_delegate=True` and `libvx_delegate.so` is available, inference is offloaded to the NPU.
- If the delegate cannot be loaded, it automatically falls back to CPU.
### 2. Preprocessing the Input Image
```python
import cv2
import numpy as np
from PIL import Image

def resize_image(cv_image: np.ndarray, height: int, width: int) -> np.ndarray:
    # Convert OpenCV BGR → RGB, then use PIL for resizing
    color_converted = cv2.cvtColor(cv_image, cv2.COLOR_BGR2RGB)
    pil_image = Image.fromarray(color_converted)
    image_resized = pil_image.resize((width, height))
    return np.asarray(image_resized)  # back to ndarray, matching the annotation
```

Usage inside the main script:
```python
# Load image with OpenCV
cv_image = cv2.imread(INPUT_IMAGE)

# Get input tensor shape from the model
input_details = interpreter.get_input_details()
input_height = input_details[0]['shape'][1]
input_width = input_details[0]['shape'][2]

# Resize to model input size
image_resized = resize_image(cv_image=cv_image, height=input_height, width=input_width)

# Add batch dimension: (H, W, C) -> (1, H, W, C)
image_batch = np.expand_dims(image_resized, axis=0)

# Normalize if the model expects float input
if input_details[0]['dtype'] == np.float32:
    input_data = image_batch / 255.0
else:
    input_data = image_batch
```

### 3. Running Inference
```python
import time

# Allocate model tensors once
interpreter.allocate_tensors()

# Warmup (optional, but nice for timing)
warmup_start = time.time()
interpreter.invoke()
warmup_end = time.time()
print(f"Interpreter warmup time: {warmup_end - warmup_start:.2f} sec")

# Set input tensor and run inference
inference_start = time.time()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
inference_end = time.time()
print(f"Inference time: {inference_end - inference_start:.3f} sec")
```

### 4. Postprocessing SSD Outputs
SSD models usually expose:
- `boxes`: bounding boxes `(ymin, xmin, ymax, xmax)`
- `classes`: class indices
- `scores`: confidence values
```python
output_details = interpreter.get_output_details()

boxes = np.squeeze(interpreter.get_tensor(output_details[0]['index']))
classes = np.squeeze(interpreter.get_tensor(output_details[1]['index'])).astype(int)
scores = np.squeeze(interpreter.get_tensor(output_details[2]['index']))

detections = []
for idx, class_id in enumerate(classes):
    if scores[idx] > 0.6:
        detections.append((class_id, scores[idx], boxes[idx]))

print("Detections (class_id, score, box):")
for det in detections:
    print(det)
```

### 5. Drawing Bounding Boxes
```python
def draw_bounding_box(cv_image: np.ndarray, detections: list,
                      color=(0, 255, 0), thickness: int = 2) -> np.ndarray:
    image = cv_image.copy()
    frame_height, frame_width, _ = cv_image.shape

    for class_id, score, box in detections:
        y1, x1, y2, x2 = box  # normalized [0, 1]

        # Scale back to image coordinates
        x1 = int(x1 * frame_width)
        x2 = int(x2 * frame_width)
        y1 = int(y1 * frame_height)
        y2 = int(y2 * frame_height)

        # Clamp to the image bounds
        top = max(0, np.floor(y1 + 0.5))
        left = max(0, np.floor(x1 + 0.5))
        bottom = min(frame_height, np.floor(y2 + 0.5))
        right = min(frame_width, np.floor(x2 + 0.5))

        cv2.rectangle(
            image,
            (int(left), int(top)),
            (int(right), int(bottom)),
            color,
            thickness,
        )

        label = f"{class_id}:{score:.2f}"
        cv2.putText(
            image,
            label,
            (int(left), int(top) - 5),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            color,
            1,
            cv2.LINE_AA,
        )

    return image
```

Usage:
```python
import os

result_image = draw_bounding_box(cv_image, detections)
os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)
cv2.imwrite(OUTPUT_PATH, result_image)
print(f"Output saved at {OUTPUT_PATH}")
```

### 6. Command-Line Interface
The script is controlled via simple CLI flags:
```python
import argparse

parser = argparse.ArgumentParser(description="TFLite inference on cube:evk")
parser.add_argument("--input", type=str, default="input/example.jpg",
                    help="Path to input image")
parser.add_argument("--output", type=str, default="output/result.jpg",
                    help="Path to save output image")
parser.add_argument("--no-delegate", action="store_true",
                    help="Run inference without VX delegate (CPU only)")
args = parser.parse_args()

MODEL_PATH = "models/ssd_mobilenet_v1_1/ssd_mobilenet_v1_1.tflite"
INPUT_IMAGE = args.input
OUTPUT_PATH = args.output

interpreter = load_interpreter(MODEL_PATH, use_delegate=not args.no_delegate)
```

You can now just run:
```bash
python image_detection.py --input input/example.jpg --output output/result.jpg
# or
python image_detection.py --input my.jpg --output my-result.jpg --no-delegate
```

## Going Further
This example is intentionally small and easy to adapt. You can extend it to:
- Process video streams from USB cameras
- Run inference through ROS 2, incorporating results into cube:its
- Perform multi-modal fusion (e.g., GNSS + AI detection → CPM messages)
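For the USB-camera case, the still-image pipeline generalizes to a frame loop. The sketch below keeps the frame source and the inference call pluggable; `get_frame` and `run_inference` are placeholder callables for your own capture and detection code (with OpenCV, `get_frame` could be `lambda: cap.read()[1]` on a `cv2.VideoCapture`):

```python
def process_stream(get_frame, run_inference, max_frames=None):
    """Pull frames, run detection on each, yield (frame, detections)."""
    count = 0
    while max_frames is None or count < max_frames:
        frame = get_frame()
        if frame is None:  # end of stream or camera read failure
            break
        detections = run_inference(frame)
        # Annotation (e.g. draw_bounding_box) and display/publishing go here
        yield frame, detections
        count += 1

# Untested sketch with a real USB camera:
# cap = cv2.VideoCapture(0)
# for frame, dets in process_stream(lambda: cap.read()[1], my_inference):
#     ...
```

Keeping the loop generic like this also makes it easy to swap the source for a ROS 2 image subscription later without touching the inference code.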

