Edge Micro Apps on Raspberry Pi 5: Build an Offline Recommendation App Like Rebecca Yu’s Dining Tool

Run offline recommendation engines on Raspberry Pi 5 with AI HAT+ 2. A step-by-step guide to build, deploy, and personalize a low-latency micro app.

Cut decision fatigue, latency, and cloud bills: run a personal recommendation micro app on Raspberry Pi 5 with AI HAT+ 2

If long dev cycles, costly cloud inference, and sluggish personalization are holding your team back, running edge micro apps on a Raspberry Pi 5 with the new AI HAT+ 2 changes the calculus. In this guide (inspired by Rebecca Yu’s Where2Eat), you’ll learn how to build an offline recommendation micro app that serves low-latency, private, and personalized suggestions — all hosted on-device. We cover hardware setup, model hosting, inference, containers, CI/CD, and practical personalization strategies you can use today in 2026.

Why this matters in 2026

Two key trends make edge micro apps essential this year:

  • Local-first AI adoption — late 2025 and early 2026 saw a growing movement to run inference on-device for privacy, resiliency, and latency. Devices with NPUs are now common enough that personal micro apps are realistic for devs and admins; see our note on privacy-first personalization.
  • Micro apps and vibe-coding — a wave of rapid, personal apps (Rebecca Yu’s Where2Eat is a high-profile example covered by TechCrunch) shows non-enterprise projects benefit hugely from quick, private compute at the edge.

ZDNET observed that the AI HAT+ 2 unlocks generative and inference workloads for Raspberry Pi 5 owners. That hardware step, combined with mature edge runtimes (ONNX Runtime, TFLite, PyTorch Mobile delegates), lets us host recommendation models offline and serve suggestions in tens to low hundreds of milliseconds, depending on model size.

What the AI HAT+ 2 brings to the table

The AI HAT+ 2 upgrades Raspberry Pi 5 into a capable inference host. At a high level the HAT provides:

  • Dedicated NPU acceleration for on-device model execution (delegates available in common runtimes).
  • Hardware-accelerated quantized model support (TFLite/ONNX quantized models typically run faster and use less memory).
  • Standardized SDKs and drivers for ARM64 Linux distributions, enabling toolchains to compile models and load device delegates.
  • Low power and low-latency inference for micro apps — critical for user interactions and personalization.

Reference architecture for a dining recommendation micro app

Keep the micro app architecture intentionally small and reproducible. The essential components:

  • Model binary (ONNX or TFLite) stored under /opt/models on the Pi.
  • Inference service — lightweight FastAPI or Flask microservice that loads the model with a hardware delegate and exposes a /recommend endpoint.
  • Local storage — SQLite for user events and a small key-value store for cached embeddings.
  • Frontend — simple SPA served from the Pi or a mobile/web client that hits the local API.
  • Sync/CI — optional: synchronize anonymized updates or model deltas to a central server for aggregate retraining (see approaches to sync and failover).

Model selection and pattern

For a Where2Eat-style tool, use a two-stage approach that is robust on-device:

  1. Candidate retrieval — lightweight embedding-based nearest neighbor (e.g., a 64–128-d embedding, stored in an approximate nearest neighbor index such as Annoy or nmslib).
  2. Re-ranking — small MLP (a few layers) that takes the user embedding + candidate features and produces a score. Convert that MLP to ONNX and run it on the HAT+ 2 NPU.

This pattern keeps memory and compute small while delivering personalized results.
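
For the retrieval stage, a minimal sketch using Annoy could look like the following; the index path, embedding dimension, and tree count are illustrative assumptions, not tuned values.

# candidate_index.py — build and query an Annoy index of item embeddings (sketch)
from annoy import AnnoyIndex

EMB_DIM = 64  # must match the embedding size the re-ranker expects

def build_index(item_embeddings, path='/opt/models/items.ann', n_trees=20):
    # item_embeddings: dict mapping item_id -> 64-d vector; returns the id order used
    index = AnnoyIndex(EMB_DIM, 'angular')  # angular distance approximates cosine
    id_order = list(item_embeddings.keys())
    for i, item_id in enumerate(id_order):
        index.add_item(i, item_embeddings[item_id])
    index.build(n_trees)
    index.save(path)
    return id_order

def retrieve(user_emb, id_order, path='/opt/models/items.ann', k=50):
    # return the top-k candidate item_ids for a user embedding
    index = AnnoyIndex(EMB_DIM, 'angular')
    index.load(path)  # memory-maps the index file, which is cheap on a Pi
    return [id_order[i] for i in index.get_nns_by_vector(user_emb, k)]

Rebuild the index offline whenever the restaurant catalog changes; queries stay fast because the index is memory-mapped rather than loaded into RAM.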

Convert a tiny PyTorch re-ranker to ONNX (example)

# PyTorch model conversion (run on a workstation or Pi if you have build tools)
import torch

class ReRanker(torch.nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(emb_dim*2 + 10, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 1)
        )
    def forward(self, x):
        return self.net(x)

model = ReRanker()
model.eval()
dummy = torch.randn(1, 64*2 + 10)
torch.onnx.export(model, dummy, 're_ranker.onnx', opset_version=14)

Then quantize if needed (post-training static or dynamic quantization) and validate accuracy on-device. For developer productivity, combine model conversion with automated boilerplate generation (see tools that turn prompts into starter apps, e.g. quick-generation guides).
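
As an example, dynamic post-training quantization with ONNX Runtime's quantization tooling is a one-call sketch (the file names are placeholders); re-check ranking quality on-device before adopting the quantized model.

# quantize.py — dynamic post-training quantization of the exported re-ranker (sketch)
from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights become int8 while activations stay float, so no calibration dataset is needed.
quantize_dynamic(
    model_input='re_ranker.onnx',
    model_output='re_ranker.int8.onnx',
    weight_type=QuantType.QInt8,
)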

Serving the model on Raspberry Pi 5: runtime and code

Use ONNX Runtime with the HAT+ 2 delegate when available. The pattern below loads ONNX and attempts to use a hardware execution provider, falling back to CPU.

# app/inference.py
import onnxruntime as ort
import numpy as np

def make_session(model_path):
    providers = ort.get_available_providers()
    # Prefer the NPU delegate if the SDK exposes one; name may vary (example: 'AIHATExecutionProvider')
    preferred = [p for p in providers if 'AIHAT' in p or 'NPU' in p]
    if preferred:
        print('Using hardware provider', preferred)
        return ort.InferenceSession(model_path, providers=preferred + ['CPUExecutionProvider'])
    print('Hardware provider not found; using CPU')
    return ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])

session = make_session('/opt/models/re_ranker.onnx')

def predict(features: np.ndarray):
    inputs = {session.get_inputs()[0].name: features.astype(np.float32)}
    return session.run(None, inputs)[0]

Wrap this inference layer in a lightweight FastAPI app with endpoints for /recommend and /feedback. Store feedback to SQLite for incremental personalization.

FastAPI skeleton: endpoints and local DB

# app/main.py
from fastapi import FastAPI
import sqlite3
import numpy as np
from inference import predict

app = FastAPI()
conn = sqlite3.connect('events.db', check_same_thread=False)
conn.execute('CREATE TABLE IF NOT EXISTS events (user_id TEXT, item_id TEXT, event TEXT, ts INTEGER)')

@app.post('/recommend')
def recommend(payload: dict):
    # payload: user embedding plus candidates, each with an embedding and 10 features
    # (matches the re-ranker input layout: user_emb | candidate_emb | features)
    user_emb = np.asarray(payload['user_emb'], dtype=np.float32)
    candidates = payload['candidates']
    feats = np.stack([np.concatenate([user_emb,
                                      np.asarray(c['emb'], dtype=np.float32),
                                      np.asarray(c['features'], dtype=np.float32)])
                      for c in candidates])
    scores = predict(feats).reshape(-1)
    ranked = sorted(zip(candidates, scores), key=lambda pair: float(pair[1]), reverse=True)
    return {'items': [{'item_id': c['item_id'], 'score': float(s)} for c, s in ranked]}

@app.post('/feedback')
def feedback(evt: dict):
    conn.execute('INSERT INTO events VALUES (?,?,?,?)', (evt['user'], evt['item'], evt['event'], evt['ts']))
    conn.commit()
    return {'ok': True}

Containerization and deployment

Containers make micro app deployment repeatable. For Raspberry Pi 5 (ARM64) use multi-arch builds and small base images.

# Dockerfile (simplified)
FROM --platform=linux/arm64 python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

Use docker buildx for multi-arch images and push to a registry. Example build command:

docker buildx build --platform linux/arm64,linux/amd64 -t your-registry/where2eat:latest --push .

For fleet management and secure updates at scale, choose tools that suit micro deployments:

  • Local single-device: systemd unit that pulls and runs a local container.
  • Small fleets: balena or Mender for OTA container updates.
  • Larger edge fleets: lightweight orchestration with k3s or KubeEdge where appropriate, but micro apps often benefit from simpler patterns.

CI/CD: build, test, and release

Example GitHub Actions job (high-level):

name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to registry
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        run: docker buildx build --platform linux/arm64 -t ghcr.io/owner/where2eat:latest --push .

Personalization and offline learning

Edge personalization strategies avoid large on-device retraining while still adapting to individual users:

  • Embedding updates: Maintain per-user lightweight embeddings that update with each event (click, like); the re-ranker uses these embeddings to personalize scores (a minimal update sketch follows this list).
  • Local fine-tuning: For tiny models, fine-tune the last layer on device using a small buffer of recent events (10–100 datapoints). Ensure quantized models are fine-tune-friendly.
  • Hybrid retraining: Periodically (when on trusted Wi‑Fi and user opts in) upload anonymized deltas to a cloud service for batch retraining and then push compact updated models back to devices; this pattern benefits from robust cloud platform workflows for model distribution.
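
The embedding-update strategy above can be as small as an exponential moving average toward the embeddings of items the user interacts with; the event weights and decay rate below are illustrative assumptions, not tuned values.

# personalize.py — update a per-user embedding from a feedback event (sketch)
import numpy as np

EVENT_WEIGHTS = {'click': 0.5, 'like': 1.0, 'dismiss': -0.3}  # assumed signal strengths

def update_user_embedding(user_emb, item_emb, event, alpha=0.1):
    # nudge the user embedding toward (or away from) the item embedding
    weight = EVENT_WEIGHTS.get(event, 0.0)
    updated = (1 - alpha) * np.asarray(user_emb) + alpha * weight * np.asarray(item_emb)
    norm = np.linalg.norm(updated)
    return updated / norm if norm > 0 else updated  # keep embeddings unit-length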

Use a lightweight approximate nearest neighbor library (Annoy, HNSW via nmslib) to store item embeddings. Those indexes are fast to query and modest in size.

Testing & performance: what to measure

Before shipping, profile three things:

  • Request latency (95th percentile). Measure cold start, warm start, and concurrent requests.
  • Power & thermal — sustained NPU loads can throttle; add thermal profiling and graceful degradation to CPU-only mode.
  • Memory footprint — keep the container and model small to avoid swapping.

Expectation: a compact ONNX re-ranker + NPU delegate should produce single-request latencies that are suitable for interactive micro apps (benchmark on your model to know exact numbers — device and model size matter). For observability and profiling best practices, see notes on modern observability.
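
One rough way to get the p95 number is to time repeated requests against the local endpoint; the port, payload shape, and request count below are assumptions to adapt to your deployment.

# bench.py — rough p50/p95 latency for the local /recommend endpoint (sketch)
import time
import statistics
import requests

URL = 'http://127.0.0.1:8080/recommend'  # assumed local port from the Dockerfile
payload = {
    'user_emb': [0.0] * 64,
    'candidates': [{'item_id': f'r{i}', 'emb': [0.0] * 64, 'features': [0.0] * 10}
                   for i in range(50)],
}

latencies = []
for _ in range(200):
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=5)
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

latencies.sort()
print(f"p50={statistics.median(latencies):.1f} ms  p95={latencies[int(len(latencies) * 0.95)]:.1f} ms")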

Security, privacy, and compliance

  • Model and data at rest: Encrypt sensitive local data (use file-system encryption or a key store).
  • Local API: Restrict to loopback or require a token — mobile clients should authenticate to the Pi before pulling recommendations.
  • OTA & updates: Sign your container artifacts and models; verify signatures on-device before applying updates. Adopt zero-trust signing and verification patterns where possible.
  • User consent: If you ever upload deltas, get explicit consent and provide opt-out controls.
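
As a sketch of on-device verification, the check can be as small as validating a detached Ed25519 signature before loading a model; this uses the cryptography package, and the raw-key and signature file paths are assumptions about how your release pipeline ships artifacts.

# verify_model.py — check a detached Ed25519 signature before loading a model (sketch)
from pathlib import Path
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_model(model_path, sig_path, pubkey_path):
    # returns True only if the model file matches its detached signature
    public_key = Ed25519PublicKey.from_public_bytes(Path(pubkey_path).read_bytes())
    try:
        public_key.verify(Path(sig_path).read_bytes(), Path(model_path).read_bytes())
        return True
    except InvalidSignature:
        return False

if not verify_model('/opt/models/re_ranker.onnx',
                    '/opt/models/re_ranker.onnx.sig',
                    '/opt/keys/release_pub.key'):
    raise SystemExit('Model signature check failed; refusing to serve.')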

Advanced strategies & 2026 predictions

Looking ahead, here are trends to plan for:

  • Federated personalization will be more turnkey by mid-2026, letting you aggregate model improvements without centralizing raw data; see discussions on privacy-first personalization.
  • Model compiler improvements (TVM, ONNX optimizers) will produce faster, smaller kernels for NPUs like the AI HAT+ 2; plan to recompile models for each hardware revision.
  • Edge marketplaces for models and micro app templates will reduce time-to-first-app. Expect pre-compiled recommendation kernels for small devices.

“Once vibe-coding apps emerged, I started hearing about people with no tech backgrounds successfully building their own apps.” — Rebecca Yu, on building Where2Eat (TechCrunch)

Actionable checklist: get a working offline recommender on your Pi 5

  1. Buy a Raspberry Pi 5 + AI HAT+ 2 and flash the latest Raspberry Pi OS (a 2026 build is recommended). Install drivers per the HAT vendor docs.
  2. Prepare an ONNX re-ranker and quantize it. Put the model under /opt/models.
  3. Install ONNX Runtime and your HAT delegate on the Pi. Validate delegate availability with ort.get_available_providers().
  4. Deploy the FastAPI inference container (ARM64) locally. Expose only the required ports and secure endpoints with tokens.
  5. Implement local embedding store using Annoy or nmslib. Use SQLite for events and simple analytics.
  6. Measure latency and power. Add a fallback to CPU inference for thermal or hardware errors.
  7. Set up CI to build multi-arch images and test on-device before rollout. Use signed releases for production fleets.

Key takeaways

  • Edge micro apps on Raspberry Pi 5 with AI HAT+ 2 enable private, low-latency personalization without continuous cloud inference.
  • Design for small models, embedding-based retrieval, and tiny re-rankers; convert to ONNX and use a hardware delegate for best latency.
  • Use containers and simple fleet tooling for reproducible deployments; secure model/data at rest and in transit.
  • Adopt hybrid personalization: local adaptation plus periodic anonymized cloud retraining for robust models.

Next steps — try the repo and a sample build

Clone the sample micro app, flash your Pi 5 with the recommended OS image, attach the AI HAT+ 2, and run the pre-built container. Benchmark your model, and iterate: swap in different re-rankers, try 8-bit quantization, or add federated sync when you’re ready.

If you want a jump-start, grab the sample code from the companion repository (contains Dockerfile, FastAPI server, ONNX conversion scripts, and CI examples). Run the included scripts/deploy_pi.sh to boot a device and deploy a signed image.

Edge micro apps are not just for hobbyists anymore. In 2026, they’re practical options for teams that need speed, privacy, and low operating cost — and the combination of Raspberry Pi 5 plus AI HAT+ 2 is a proven platform to build on. Start small, measure, and iterate: your next micro app could be live in days, not months.

Call to action: Clone the sample repo, flash your Pi 5, and deploy the Where2Eat-style micro app. Share benchmarks and join the discussion in the edge-dev community to compare optimizations and model trade-offs.
