Running Inference at the Edge: A Step-by-Step Tutorial Using Raspberry Pi 5 and AI HAT+ 2
Step-by-step tutorial: package, quantize, and deploy a tiny recommendation model to Raspberry Pi 5 with AI HAT+ 2.
Deploying fast, cost-effective AI at the edge is no longer a gamble — it's a strategy.
If you’re a developer or IT admin battling long release cycles, expensive cloud inference, or brittle integrations, this hands-on tutorial shows exactly how to package, optimize, and deploy a small recommendation micro app to a Raspberry Pi 5 outfitted with the new AI HAT+ 2. Follow along for step-by-step commands, benchmarks, and a production-ready deployment flow that embraces 2026 trends in edge AI: aggressive quantization, vendor NPU acceleration, and micro app-first UX.
Why this matters in 2026
Edge AI is mainstream. By late 2025 and into 2026, the ecosystem matured: NPUs on accessories and boards (like AI HAT+ 2) became common, 3–4-bit quantization and runtime plugins unlocked orders-of-magnitude cost savings, and regulatory pressure and privacy requirements pushed inference on-device. Meanwhile, the rise of micro apps — single-purpose applications built quickly for a specific group — means we need repeatable, low-latency inference patterns you can ship in days, not months.
What you’ll build
- A tiny recommendation model (a compact MLP + embedding) exported to ONNX and TFLite.
- Model optimization using dynamic and static quantization (8-bit, and an experimental 4-bit path).
- Deployment on Raspberry Pi 5 with AI HAT+ 2 using ONNX Runtime + vendor accelerator plugin (and a TFLite alternative).
- A micro app that serves recommendations via a lightweight FastAPI endpoint and a systemd/Docker deployment pattern for reliability.
Prerequisites and hardware
- Raspberry Pi 5 (64-bit Raspberry Pi OS recommended)
- AI HAT+ 2 attached and vendor SDK installed (see vendor site for the latest 2026 runtime)
- Host laptop with Python 3.10+ (local training/export can be done on a workstation or cloud GPU)
- Basic familiarity with Python, PyTorch or TensorFlow, and Docker
Overview — development flow
- Train a compact recommendation model locally.
- Export to ONNX and TFLite.
- Optimize: static/dynamic quantization + optional 4-bit experimentation.
- Benchmark on-device using ONNX Runtime and the AI HAT+ 2 plugin.
- Wrap the optimized model in a micro app (FastAPI), containerize, and deploy.
1) Build a tiny recommendation model (fast)
This walkthrough uses a minimal matrix-factorization / MLP hybrid: few parameters, fast inference. Train quickly on sample data so you can iterate before pushing to the Pi.
# minimal PyTorch model (train locally, toy data)
import torch
import torch.nn as nn

class TinyRec(nn.Module):
    def __init__(self, n_users, n_items, emb_dim=32, hidden=64):
        super().__init__()
        # user/item embeddings feed a small MLP scorer
        self.u_emb = nn.Embedding(n_users, emb_dim)
        self.i_emb = nn.Embedding(n_items, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1)
        )

    def forward(self, u, i):
        # concatenate embeddings and score each (user, item) pair
        x = torch.cat([self.u_emb(u), self.i_emb(i)], dim=-1)
        return self.mlp(x).squeeze(-1)

# toy train loop omitted for brevity
Export to ONNX once you have a small checkpoint:
import torch

model.eval()
dummy_u = torch.randint(0, n_users, (1,))
dummy_i = torch.randint(0, n_items, (1,))
torch.onnx.export(
    model, (dummy_u, dummy_i), "tinyrec.onnx",
    input_names=["user", "item"], output_names=["score"],
    opset_version=14,
)
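Before optimizing, it is worth a quick parity check. Here is a minimal sketch that loads the exported graph with ONNX Runtime on your workstation and compares it against the PyTorch output (it assumes the model, dummy_u, and dummy_i from the snippets above are still in scope):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("tinyrec.onnx", providers=["CPUExecutionProvider"])
u = dummy_u.numpy().astype(np.int64)
i = dummy_i.numpy().astype(np.int64)
onnx_score = sess.run(None, {"user": u, "item": i})[0]
torch_score = model(dummy_u, dummy_i).detach().numpy()
# scores should agree to float32 precision before any quantization is applied
assert np.allclose(onnx_score, torch_score, atol=1e-5)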
2) Optimize: quantization and pruning
2026 best practice: try both ONNX Runtime quantization and a TFLite path. Start with 8-bit dynamic quantization (low friction), then test static quantization with a small calibration dataset for better accuracy. If your vendor runtime supports 4-bit quantization or GPTQ-style post-training quant, test that as an experimental path for lower memory use and better NPU throughput.
ONNX dynamic quantization (fast)
pip install onnxruntime

# dynamic (weight-only) int8 quantization via the onnxruntime.quantization API
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("tinyrec.onnx", "tinyrec.q8.onnx", weight_type=QuantType.QInt8)
ONNX static quantization (better accuracy)
Collect a small representative dataset (1k samples) and run static calibration.
python calibrate_onnx.py --model tinyrec.onnx --calib-data calib.npz --output tinyrec.q8static.onnx
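The repo's calibrate_onnx.py is a thin wrapper around ONNX Runtime's static quantization API. A minimal sketch of what that wrapper could look like, assuming calib.npz holds int64 "user" and "item" arrays:

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class NpzReader(CalibrationDataReader):
    """Feeds calibration samples to the quantizer one at a time."""
    def __init__(self, path):
        data = np.load(path)
        self.samples = [
            {"user": data["user"][i:i + 1], "item": data["item"][i:i + 1]}
            for i in range(len(data["user"]))
        ]
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.samples):
            return None
        sample = self.samples[self.pos]
        self.pos += 1
        return sample

quantize_static(
    "tinyrec.onnx",
    "tinyrec.q8static.onnx",
    NpzReader("calib.npz"),
    weight_type=QuantType.QInt8,
)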
Experimental 4-bit / ultra-low-bit
By late 2025, several open-source toolchains allowed simulated 4-bit quantization (GPTQ-derived) for dense layers. This can yield huge memory wins on NPUs that support packed int4 formats. Treat this as experimental — measure accuracy carefully.
3) Convert to TFLite (optional path)
Some vendor HATs expose the best acceleration via TFLite delegates. Convert your PyTorch model by going PyTorch -> ONNX -> TensorFlow SavedModel (via onnx-tf), then running the TFLite converter on the SavedModel.
# Example: use tf and tflite-converter on a saved TF model
# Export PyTorch -> ONNX -> TensorFlow (via onnx-tf) -> TFLite
pip install onnx-tf
onnx-tf convert -i tinyrec.onnx -o tinyrec_tf
# then use TFLite converter in TF
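The last step looks roughly like the sketch below. It assumes tinyrec_tf is the SavedModel directory produced above and that a small generator yields representative (user, item) pairs for calibration; the input order and dtypes depend on how onnx-tf names the SavedModel signature, so check them against your converted model.

import numpy as np
import tensorflow as tf

def rep_data():
    # yield a few representative (user, item) pairs for post-training quantization
    for _ in range(100):
        yield [
            np.random.randint(0, 1000, size=(1,), dtype=np.int64),
            np.random.randint(0, 1000, size=(1,), dtype=np.int64),
        ]

converter = tf.lite.TFLiteConverter.from_saved_model("tinyrec_tf")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
with open("tinyrec_int8.tflite", "wb") as f:
    f.write(converter.convert())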
4) Prepare the Raspberry Pi 5 + AI HAT+ 2
On the Pi:
- Flash the latest 64-bit Raspberry Pi OS (2026 build).
- Run system update and install essentials:
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv docker.io git
sudo reboot
Install the AI HAT+ 2 vendor runtime and ONNX Runtime for ARM64. The vendor provides a runtime plugin that exposes the NPU to ONNX or TFLite; get it from the vendor site (2026 releases added improved 4-bit drivers and a GPU fallback).
Tip: keep your vendor SDK and ONNX Runtime versions in lockstep. Mismatched ABI versions are the most common cause of runtime failures.
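A quick sanity check that the two see each other; this assumes only that the vendor SDK registers its execution provider with ONNX Runtime:

# run on the Pi after installing both packages
import onnxruntime as ort

print("onnxruntime", ort.__version__)
# the AI HAT+ 2 execution provider should appear alongside CPUExecutionProvider
print(ort.get_available_providers())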
5) On-device benchmarking
Transfer the optimized model to the Pi (scp or git-lfs). Use a simple Python harness to benchmark with the vendor plugin enabled.
pip install onnxruntime
# Example harness (ONNX Runtime with the vendor execution provider)
import numpy as np
import onnxruntime as ort

sess_opts = ort.SessionOptions()
# vendor-specific custom ops, if the SDK ships a plugin library:
# sess_opts.register_custom_ops_library('/usr/lib/libaihat_plugin.so')
# list the vendor execution provider first (name per the SDK docs), then the CPU fallback
sess = ort.InferenceSession('tinyrec.q8.onnx', sess_options=sess_opts,
                            providers=['CPUExecutionProvider'])
feed = {"user": np.array([1], dtype=np.int64), "item": np.array([42], dtype=np.int64)}
print(sess.run(None, feed))  # warm-up call; see the timing sketch below
Measure p50/p95 latency and throughput. Log CPU usage and temperature, since on-device throttling can change behavior under sustained load. For field benchmarking and cache/IO considerations see the ByteCache edge appliance field review.
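A minimal timing loop for those numbers, reusing the sess and feed from the harness above:

import time
import numpy as np

latencies = []
for _ in range(2000):
    t0 = time.perf_counter()
    sess.run(None, feed)
    latencies.append(time.perf_counter() - t0)

lat_ms = np.array(latencies) * 1e3
print(f"p50={np.percentile(lat_ms, 50):.2f} ms  "
      f"p95={np.percentile(lat_ms, 95):.2f} ms  "
      f"throughput={1000 / lat_ms.mean():.0f} req/s")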
6) Micro app: FastAPI service
Wrap your model in a lightweight web service that serves instant recommendations. For micro apps, keep the API minimal: one POST endpoint that accepts user_id and context and returns top-3 items.
from fastapi import FastAPI
import numpy as np
import onnxruntime as ort

app = FastAPI()
sess = ort.InferenceSession('tinyrec.q8.onnx')

@app.post('/recommend')
def recommend(payload: dict):
    user = payload['user_id']
    items = payload.get('candidate_items', list(range(100)))
    # score each candidate item; inputs must be int64 arrays for the exported graph
    # (scored one at a time for simplicity; batch if you export with dynamic axes)
    scores = [
        float(sess.run(None, {"user": np.array([user], dtype=np.int64),
                              "item": np.array([i], dtype=np.int64)})[0])
        for i in items
    ]
    # return the three highest-scoring items
    top_idx = sorted(range(len(scores)), key=lambda i: -scores[i])[:3]
    return {"recommendations": [items[i] for i in top_idx]}
7) Containerize and deploy
Containerization avoids dependency hell. Use a small base image (python:3.10-slim pulled for linux/arm64, as in the Dockerfile below, or a Raspberry Pi OS image). Example Dockerfile:
FROM --platform=linux/arm64 python:3.10-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
Push the image to your registry and pull it on the Pi, or build on-device if you prefer. To run reliably across reboots, use a systemd unit (a template is sketched below) or Docker restart policies; this deployment pattern aligns with common edge container approaches.
docker run -d --restart unless-stopped --name tinyrec -p 8080:8080 my-registry/tinyrec:latest
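For the systemd route, a minimal unit template; the service name, container name, and image tag are placeholders you would swap for your own:

[Unit]
Description=TinyRec edge micro app
After=docker.service
Requires=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f tinyrec
ExecStart=/usr/bin/docker run --rm --name tinyrec -p 8080:8080 my-registry/tinyrec:latest
ExecStop=/usr/bin/docker stop tinyrec

[Install]
WantedBy=multi-user.target

Install it as /etc/systemd/system/tinyrec.service, then run sudo systemctl enable --now tinyrec. If systemd manages restarts, drop the Docker --restart policy so the two supervisors don't fight over the container.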
8) Production considerations and monitoring
- Security: Run the service under a non-root user, enable firewall rules, and limit API exposure. Follow zero-trust practices such as those in the Zero‑Trust client approvals playbook when exposing inference endpoints.
- Model updates: Implement a rollout mechanism that fetches new models from a central bucket and validates them with unit tests and an on-device shadow test before switching the runtime model file (a minimal swap sketch follows this list).
- Observability: Ship lightweight telemetry (latency, success rates, model size) to a central store. For constrained devices, batch metrics and use a gateway to aggregate — see edge auditability patterns for decision logging.
- Failover: If the NPU plugin crashes, ensure a CPU fallback path (a smaller TFLite/ONNX CPU quantized model) to keep the app available.
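A sketch of the validate-then-swap step, assuming the new model has already been downloaded to a staging path; the file names, probe ranges, and drift threshold here are illustrative:

import os
import numpy as np
import onnxruntime as ort

STAGED = "models/tinyrec.staged.onnx"
LIVE = "models/tinyrec.live.onnx"

def shadow_ok(staged_path, live_path, n_checks=200, max_drift=0.05):
    """Compare staged vs. live scores on random probes before promoting."""
    staged = ort.InferenceSession(staged_path, providers=["CPUExecutionProvider"])
    live = ort.InferenceSession(live_path, providers=["CPUExecutionProvider"])
    drift = []
    for _ in range(n_checks):
        feed = {"user": np.random.randint(0, 1000, (1,), dtype=np.int64),
                "item": np.random.randint(0, 1000, (1,), dtype=np.int64)}
        drift.append(abs(staged.run(None, feed)[0] - live.run(None, feed)[0]))
    return float(np.mean(drift)) <= max_drift

if shadow_ok(STAGED, LIVE):
    os.replace(STAGED, LIVE)  # atomic on the same filesystem; then reload the serving session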
9) Measuring tradeoffs — accuracy vs. latency vs. cost
Benchmark both accuracy and latency for each optimized variant. Example matrix to measure:
- FP32 ONNX (baseline)
- Q8 dynamic ONNX
- Q8 static ONNX
- Int4 experimental
- TFLite int8 + vendor delegate
Capture:
- Top-3 precision or MAP@3 (a precision@3 helper is sketched after this list)
- P50/P90/P95 latency
- Memory footprint and binary size
- Power draw (if possible)
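For the accuracy column, a small helper for precision@3; it assumes you have held-out (user, relevant_items) pairs and a recommendation list like the one the micro app returns:

def precision_at_3(recommendations, relevant_items):
    """Fraction of the top-3 recommended items the user actually interacted with."""
    top3 = recommendations[:3]
    hits = sum(1 for item in top3 if item in relevant_items)
    return hits / 3

# example: 2 of the 3 suggestions were relevant -> 0.67
print(precision_at_3([12, 7, 99], {7, 12, 4}))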
10) Advanced tips from 2026 trends
- Leverage hybrid orchestration: run model selection in cloud and serve final model on-device for privacy-sensitive requests.
- Use model ensembles sparingly — an ultrafast on-device model that calls a larger cloud model only for edge cases is often best.
- Adopt continuous model distillation: nightly distillations shrink cloud models into tiny on-device models that capture the latest behavior.
- Experiment with structured pruning + quantization pipelines — automated frameworks in 2025–26 can produce models with small accuracy loss and large speed gains. These practices are covered in broader edge-first developer experience discussions.
Troubleshooting (common gotchas)
- Plugin load errors: ensure the vendor runtime and ONNX Runtime ABI are compatible; check vendor release notes for 2026 fixes. See edge auditability notes for managing runtime compatibility and logs.
- Inconsistent results between FP32 and quantized models: calibrate with representative data; try per-channel quantization.
- High tail latency: inspect thermal throttling and background processes; reduce batch sizes and consider warm-up invocations.
- Model larger than available RAM: use memory-mapped model loading or split the model between NPU and CPU-friendly parts.
Actionable checklist — ship your edge micro app in a weekend
- Train a tiny model locally and export to ONNX.
- Run dynamic quantization, benchmark locally.
- Test static quantization with a 1k calibration set.
- Install vendor SDK on Raspberry Pi 5 and verify plugin loads.
- Deploy quantized ONNX to Pi, measure latency and accuracy.
- Wrap model in a FastAPI micro app and containerize.
- Set restart policies and lightweight telemetry.
Real-world example: Where a micro app wins
Micro apps — fast, focused, and personal — are a growing pattern in 2026. A small team I worked with used the exact flow above to ship a “team lunch recommender” on Pi 5 kiosks around an office. The edge model served thousands of low-latency suggestions per week, preserved employee preferences locally, and reduced calls to cloud inference (and cloud cost) by 87% compared with a cloud-only baseline.
“We went from idea to kiosk in 3 days — because the model was tiny and the Pi+AI HAT+2 combo handled 95% of our requests locally.”
Key takeaways
- Edge-first decreases latency and cost. Small models + quantization are the pragmatic path in 2026.
- Vendor NPUs matter. The AI HAT+ 2 unlocks significant performance gains — but you must align runtimes and drivers.
- Micro apps are an efficient product pattern. They let non-developers and small teams iterate quickly while keeping complexity low.
- Measure everything. Accuracy, latency, memory, and thermal behavior are all critical on-device.
Where to go next
Watch the companion video tutorial where I walk through every command and benchmark live on a Raspberry Pi 5 with AI HAT+ 2. The repo includes: training scripts, ONNX/TFLite conversion helpers, quantization pipelines, an example FastAPI micro app, Dockerfiles, and systemd templates for productionization. For deeper reading on low-latency container and architecture patterns, see our edge containers & low-latency architectures guide.
Final thoughts and call-to-action
Edge inference is now accessible, cost-effective, and practical for production micro apps. By combining compact models, aggressive quantization, and vendor NPU acceleration, you can ship useful, private, low-latency apps that scale across devices and use cases.
Ready to try it? Clone the starter repo, follow the step-by-step video, and deploy your first micro recommendation app to a Pi 5 + AI HAT+ 2 this weekend. If you want a production-grade template and CI/CD for edge deployments, sign up for a walkthrough with our team at appstudio.cloud — we’ll help you shrink your time-to-market and operate edge AI reliably.
Related Reading
- Edge Containers & Low-Latency Architectures for Cloud Testbeds
- Edge‑First Developer Experience in 2026
- ByteCache Edge Cache Appliance — Field Review
- Edge Auditability & Decision Planes — Operational Playbook