A fast-moving mix this week: AI tooling, ARM readiness, Docker sandboxes, and real-world lessons from agents. Practical insights across .NET, DevOps, and local-first workflows.
Our Favorite Agent Setups (Agentic DevOps) - A nice discussion that walks through many AI harnesses, agents, and models, and what they are playing with right now: OpenClaw, OpenCode, Claude Code, Copilot, and more.
Running an AI model as a one-shot script is useful, but it forces you to restart the model every time you need a result. Setting it up as a service lets any application send requests to it continuously, without reloading the model. This guide shows how to serve Reka Edge using vLLM and an open-source plugin, then connect a web app to it for image description and object detection.
You need a machine with a GPU running Linux, macOS, or Windows (with WSL). I use uv, a fast Python package and project manager; pip + venv works too if you prefer.
Clone the vLLM Reka Plugin
Reka models require a dedicated plugin to run under vLLM. Not every model needs this extra step, but Reka's architecture does. Clone the plugin repository and enter the directory:
git clone https://github.com/reka-ai/vllm-reka
cd vllm-reka
The repository contains the plugin code and a serve.sh script you will use to start the service.
Download the Reka Edge Model
Before starting the service, you need the model weights locally. Install the Hugging Face Hub CLI and use it to pull the reka-edge-2603 model into your project directory.
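Assuming the model is published as RekaAI/reka-edge-2603 on Hugging Face (the same repository cloned in the companion tutorial below), the download looks roughly like this:
uv tool install "huggingface_hub[cli]"
huggingface-cli download RekaAI/reka-edge-2603 --local-dir ./models/reka-edge-2603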
This is a large model, so make sure you have enough disk space and a stable connection.
Start the Service
Once the model is downloaded, start the vLLM service using the serve.sh script included in the plugin:
uv run bash serve.sh ./models/reka-edge-2603
The script accepts environment variables to configure which model to load and how much GPU memory to allocate. If your GPU cannot fit the model at default settings, open serve.sh and adjust the variables at the top. The repository README lists the available options. The service takes a few seconds to load the model weights, then starts listening for HTTP requests.
As an example, on an NVIDIA GeForce RTX 5070 I adjusted those variables to make the model fit.
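The exact variable names are defined in serve.sh, so treat this as a hypothetical sketch rather than my literal settings; a memory-constrained invocation might look something like:
# Hypothetical variable names; check serve.sh and the README for the real ones
MAX_MODEL_LEN=8192 GPU_MEMORY_UTILIZATION=0.90 uv run bash serve.sh ./models/reka-edge-2603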
Run the Media Library App
With the backend running, it's time to start the Media Library app. Clone the repository, jump into the directory, and run it with Docker:
git clone https://github.com/fboucher/media-library
cd media-library
docker compose up --build -d
Open http://localhost:8080 in your browser, then add a new connection with these settings:
Name: local (or any label you want)
IP address: your machine's local network IP (e.g. 192.168.x.x)
API key: leave blank or enter anything — no key is required for a local connection
Model: reka-edge-2603
Click Test to confirm the connection, then save it.
Try It: Image Description and Object Detection
Select an image in the app and choose your local connection, then click Fill with AI. The app sends the image to your vLLM service, and the model returns a natural language description. You can watch the request hit your backend in the terminal where the service is running.
Reka Edge also supports object detection. Type a prompt asking the model to locate a specific feature (e.g. "face") and the model returns bounding-box coordinates. The app renders these as red boxes overlaid on the image. This works for any region you can describe in a prompt.
Switch to the Reka Cloud API
If your local GPU is too slow for production use, you can point the app at the Reka APIs instead. Add a new connection in the app and set the base URL to the Reka API endpoint. Get your API key from platform.reka.ai. OpenRouter is another option if you prefer a unified API across providers.
The model name stays the same (reka-edge-2603), so switching between local and cloud is just a matter of selecting a different connection in the app. The cloud API is noticeably faster because Reka's servers are more powerful than a local GPU (at least mine :) ). During development, use the local service to avoid burning credits; switch to the API for speed when you need it.
What You Can Build
The service you just set up accepts any image or video via HTTP: point a script at a folder and you have a batch pipeline for descriptions, tags, or bounding boxes. Swap the prompt and you change what it extracts. The workflow is the same whether you are running locally or through the API, as the sketch below shows.
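As a minimal sketch of that batch idea, assuming the serve.sh service exposes vLLM's OpenAI-compatible API on its default port 8000 (check serve.sh for the actual port) and that you have the openai Python package installed, a description loop could look like this:

import base64
import pathlib

from openai import OpenAI

# Assumption: vLLM's OpenAI-compatible endpoint on the default port;
# no API key is required for a local connection.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

for path in sorted(pathlib.Path("./photos").glob("*.jpg")):  # hypothetical folder
    b64 = base64.b64encode(path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model="reka-edge-2603",
        messages=[{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ]}],
    )
    print(f"{path.name}: {resp.choices[0].message.content}")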
Reka just released Reka Edge, a compact but powerful vision-language model that runs entirely on your own machine. No API keys, no cloud, no data leaving your computer. I work at Reka and putting together this tutorial was genuinely fun; I hope you enjoy running it as much as I did.
In three steps, you'll go from zero to asking an AI what's in any image or video.
What You'll Need
A machine with enough RAM to run a 7B parameter model (~16 GB recommended)
Git
uv, a fast Python package manager. Install it with:
curl -LsSf https://astral.sh/uv/install.sh | sh
This works on macOS, Linux, and Windows (WSL). If you're on Windows without WSL, grab the Windows installer instead.
Step 1: Get the Model and Inference Code
Clone the Reka Edge repository from Hugging Face. This includes both the model weights and the inference code:
git clone https://huggingface.co/RekaAI/reka-edge-2603
cd reka-edge-2603
Step 2: Fetch the Large Files
Hugging Face stores large files (model weights and images) using Git LFS. After cloning, these files exist on disk only as small pointer files, not the actual content.
First, make sure Git LFS is installed. The command varies by platform:
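macOS (Homebrew): brew install git-lfs
Debian/Ubuntu: sudo apt install git-lfs
Windows: Git for Windows already bundles it
Once installed, enable it for your user account:
git lfs install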
Then pull all large files, including model weights and media samples:
git lfs pull
Grab a coffee while it downloads; the model weights are several GB.
Step 3: Ask the Model About an Image or Video
To analyze an image, use the sample included in the media/ folder:
uv run example.py \
--image ./media/hamburger.jpg \
--prompt "What is in this image?"
Or pass a video with --video:
uv run example.py \
--video ./media/many_penguins.mp4 \
--prompt "What is in this?"
The model will load, process your input, and print a description, all locally, all private.
Try different prompts to unlock more:
"Describe this scene in detail."
"What text is visible in this image?"
"Is there anything unusual or unexpected here?"
What's Actually Happening?
You don't need this to use the model, but if you're anything like me and can't help wondering what's going on under the hood, here's the magic behind example.py:
1. It picks the best hardware available.
The script checks whether your machine has a GPU (CUDA for NVIDIA, or MPS on Apple Silicon via Metal) and uses it automatically. If neither is available, it falls back to the CPU. This affects speed, not quality.
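In PyTorch terms (the framework the transformers snippets below imply), that check is just a few lines; a minimal sketch:

import torch

# Prefer CUDA (NVIDIA), then MPS (Apple Silicon), then fall back to CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"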
2. It loads the model into memory.
The 7-billion-parameter model is read from the folder you cloned. These are the "weights": billions of numbers that encode everything the model has learned. Loading takes ~30 seconds, depending on your hardware.
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained(args.model, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(args.model, ...).eval()
3. It packages your input into a structured message.
Your image (or video) and your text prompt are wrapped together into a conversation-style format, the same way a chat message works, except one part is visual instead of text.
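The message follows the multimodal chat format used across Hugging Face models; roughly (the exact keys come from the Reka processor, so treat this as a sketch):

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "./media/hamburger.jpg"},
        {"type": "text", "text": "What is in this image?"},
    ],
}]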
4. It converts everything into numbers.
The processor translates your image into a grid of numerical patches and your prompt into tokens (small chunks of text, each mapped to a number). The model only understands numbers, so this step bridges the gap.
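In code, the processor handles that conversion in one call; a sketch, with argument names as in recent transformers releases:

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)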
5. The model generates a response, token by token.
Starting from your input, the model predicts the most likely next word, then the next, up to 256 tokens. It stops when it hits a natural end-of-response marker.
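That whole loop hides behind a single call, roughly:

output_ids = model.generate(**inputs, max_new_tokens=256)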
6. It converts the numbers back into text and prints it.
The token IDs are decoded back into human-readable words and printed to your terminal. No internet involved at any point.
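The decode step, again as a sketch:

# Drop the echoed prompt tokens, then decode only the newly generated ones
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])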
If you prefer watching and reading, here is the video version:
That's Pretty Cool, Right?
A single script. No API key. No cloud. You just ran a 7 billion parameter vision-language model entirely on your own machine, and it works whether you're on a Mac, Linux, or Windows with WSL, which is what I was using when I wrote this.
This works great as a one-off script: drop in a file, ask a question, get an answer. But what if you wanted to build something on top of it? A web app, a tool that watches a folder, or anything that needs to talk to the model repeatedly?
That's exactly what the next post is about. I'll show you how to wrap Edge as a local API, so instead of running a script, you have a service running on your machine that any app can plug into. Same model, same privacy, but now it's a proper building block.