
Private Vision AI: Run Reka Edge Entirely on Your Machine

Reka just released Reka Edge, a compact but powerful vision-language model that runs entirely on your own machine. No API keys, no cloud, no data leaving your computer. I work at Reka and putting together this tutorial was genuinely fun; I hope you enjoy running it as much as I did.

[Originally published at dev.to/reka]

In three steps, you'll go from zero to asking an AI what's in any image or video.

What You'll Need

  • A machine with enough RAM to run a 7B parameter model (~16 GB recommended)
  • Git
  • uv, a fast Python package manager. Install it with:
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    This works on macOS, Linux, and Windows (WSL). If you're on Windows without WSL, grab the Windows installer instead.
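    If you'd rather stay in the terminal on Windows, the PowerShell one-liner documented by Astral at the time of writing is:

    powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"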

Step 1: Get the Model and Inference Code

Clone the Reka Edge repository from Hugging Face. This includes both the model weights and the inference code:

git clone https://huggingface.co/RekaAI/reka-edge-2603
cd reka-edge-2603

Step 2: Fetch the Large Files

Hugging Face stores large files (model weights and images) using Git LFS. After cloning, these files exist on disk, but they are only small pointer files, not the actual content.

First, make sure Git LFS is installed. The command varies by platform:

# macOS
brew install git-lfs

# Linux / WSL (Ubuntu/Debian)
sudo apt install git-lfs

Then initialize it:

git lfs install

Next, pull all the large files, including model weights and media samples:

git lfs pull

Grab a coffee while it downloads; the model weights are several GB.
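
If you want to confirm what was fetched, git lfs ls-files lists every file managed by LFS in the repo:

git lfs ls-files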


Step 3: Ask the Model About an Image or Video

To analyze an image, use the sample included in the media/ folder:

uv run example.py \
  --image ./media/hamburger.jpg \
  --prompt "What is in this image?"
[Screenshot: the prompt and the burger image]

Or pass a video with --video:

uv run example.py \
  --video ./media/many_penguins.mp4 \
  --prompt "What is in this?"

The model will load, process your input, and print a description, all locally, all private.

Try different prompts to unlock more:

  • "Describe this scene in detail."
  • "What text is visible in this image?"
  • "Is there anything unusual or unexpected here?"

What's Actually Happening? 

You don't need this to use the model, but if you're anything like me and can't help wondering what's going on under the hood, here's the magic behind example.py:

1. It picks the best hardware available. The script checks whether your machine has a GPU (CUDA for Nvidia, Metal for Apple Silicon) and uses it automatically. If neither is available, it falls back to the CPU. This affects speed, not quality.

import torch

# Roughly what the script's mps_ok check does: is Apple's Metal backend usable?
mps_ok = torch.backends.mps.is_available()

if torch.cuda.is_available():
    device = torch.device("cuda")
elif mps_ok:
    device = torch.device("mps")
else:
    device = torch.device("cpu")

2. It loads the model into memory. The 7 billion parameter model is read from the folder you cloned. This is the "weights": billions of numbers that encode everything the model has learned. Loading takes ~30 seconds depending on your hardware.

from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained(args.model, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(args.model, ...).eval()

3. It packages your input into a structured message. Your image (or video) and your text prompt are wrapped together into a conversation-style format, the same way a chat message works, except one part is visual instead of text.

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": args.image},
        {"type": "text", "text": args.prompt},
    ],
}]
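
When you pass --video instead, the same structure presumably carries a video entry in place of the image one; here's a sketch following the usual transformers chat-template convention:

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": args.video},  # assumption: mirrors the image case
        {"type": "text", "text": args.prompt},
    ],
}]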

4. It converts everything into numbers. The processor translates your image into a grid of numerical patches and your prompt into tokens (small chunks of text, each mapped to a number). The model only understands numbers, so this step bridges the gap.

inputs = processor.apply_chat_template(
    messages, tokenize=True, return_tensors="pt", return_dict=True
)
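
If you're curious, you can peek at what comes out; a quick illustrative check (the exact key names depend on the processor, so treat these as assumptions):

print(inputs["input_ids"].shape)     # prompt token IDs, e.g. a [1, N] tensor
print(inputs["pixel_values"].shape)  # image patches; key name is an assumption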

5. The model generates a response, token by token. Starting from your input, the model predicts the most likely next word, then the next, up to 256 tokens. It stops when it hits a natural end-of-response marker.

output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
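
With do_sample=False this is plain greedy decoding. Conceptually, each step looks like this sketch (illustrative only, not the actual generate() internals):

# Score every vocabulary token, keep the single most likely one, append, repeat
logits = model(input_ids).logits
next_id = logits[:, -1, :].argmax(dim=-1)
input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)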

6. It converts the numbers back into text and prints it. The token IDs are decoded back into human-readable words and printed to your terminal. No internet involved at any point.

# Slice off the prompt so only the newly generated tokens are decoded
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
output_text = processor.tokenizer.decode(new_tokens, skip_special_tokens=True)
print(output_text)


Here's the Video

If you prefer watching to reading, here is the video version:

That's Pretty Cool, Right?

A single script. No API key. No cloud. You just ran a 7 billion parameter vision-language model entirely on your own machine, and it works whether you're on a Mac, Linux, or Windows with WSL, which is what I was using when I wrote this.

This works great as a one-off script: drop in a file, ask a question, get an answer. But what if you wanted to build something on top of it? A web app, a tool that watches a folder, or anything that needs to talk to the model repeatedly?

That's exactly what the next post is about. I'll show you how to wrap Edge as a local API, so instead of running a script, you have a service running on your machine that any app can plug into. Same model, same privacy, but now it's a proper building block.


~frank 

AI Vision: Turning Your Videos into Comedy Gold (or Cringe)

I've spent most of my career building software in C# and .NET, and I'd only used Python in IoT projects. When I wanted to build a fun project, an app that uses AI to roast videos, I knew it was the perfect opportunity to finally dig into Python web development.

The question was: where do I start? I hopped into a brainstorming session with Reka's AI chat and asked about options for building web apps in Python. It mentioned Flask, and I remembered friends talking about it being lightweight and perfect for getting started. That sounded right.

In this post, I share how I built "Roast My Life," a Flask app using the Reka Vision API.

The Vision (Pun Intended)

The app needed three core things:

  1. List videos: Show me what videos are in my collection
  2. Upload videos: Let me add new ones via URL
  3. Roast a video: Send a selected video to an AI and get back some hilarious commentary

See it in action

Part 1: Getting Started with Environment Setup

The first hurdle was always going to be environment setup. I'm serious about keeping my Python projects isolated, so I did the standard dance:

Before even touching dependencies, I scaffolded a super bare-bones Flask app. One thing I enjoy about C# is that all dependencies are installed in one shot, so I like doing the same with my Python projects: a requirements.txt instead of installing things ad hoc (pip install flask, then freezing later).

Dropping that file in first means the setup snippet below is deterministic. When you run pip install -r requirements.txt, you get the exact versions I tested with, and you won't accidentally grab a breaking major update.
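
For reference, the file is just one package per line. A minimal sketch for this app (versions illustrative, not necessarily the exact ones I pinned):

flask==3.0.0
requests==2.31.0
python-dotenv==1.0.0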

Here's the shell dance that creates and activates the virtual environment, then installs everything:

cd roast_my_life/workshop
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Then came the configuration. We'll need an API key, and I don't want it hardcoded, so I created a .env file to store my API credentials:

API_KEY=YOUR_REKA_API_KEY
BASE_URL=https://vision-agent.api.reka.ai

To get that API key, I visited the Reka Platform and grabbed a free one. Seriously, a free key for playing with AI vision APIs? I was in.

With python app.py, I fired up the Flask development server and opened http://127.0.0.1:5000 in my browser. The UI was there, but... it was dead. Nothing worked.

Perfect. Time to build.

The Backend: Flask Routing and API Integration

Coming from ASP.NET Core's controller-based routing and Blazor, Flask's decorator-based approach felt right at home. All the code goes in the app.py file, and each route is defined with a simple decorator. But first things first: loading configuration from the .env file using python-dotenv:

from flask import Flask, request, jsonify
import requests
import os
from dotenv import load_dotenv

app = Flask(__name__)

# Load environment variables (like appsettings.json)
load_dotenv()
api_key = os.environ.get('API_KEY')
base_url = os.environ.get('BASE_URL')

All the imported packages are the same ones that need to be in requirements.txt. And we retrieve the API key and base URL from environment variables, just like in .NET Core.

Now, to get roasted, we first need to upload a video to the Reka Vision API. Here's the code; I'll go over some details after.


@app.route('/api/upload_video', methods=['POST'])
def upload_video():
    """Upload a video to Reka Vision API"""
    data = request.get_json() or {}
    video_name = data.get('video_name', '').strip()
    video_url = data.get('video_url', '').strip()
    
    if not video_name or not video_url:
        return jsonify({"error": "Both video_name and video_url are required"}), 400
    
    if not api_key:
        return jsonify({"error": "API key not configured"}), 500
    
    try:
        response = requests.post(
            f"{base_url.rstrip('/')}/videos/upload",
            headers={"X-Api-Key": api_key},
            data={
                'video_name': video_name,
                'index': 'true',  # Required: tells Reka to process the video
                'video_url': video_url
            },
            timeout=30
        )
        
        response_data = response.json() if response.ok else {}
        
        if response.ok:
            video_id = response_data.get('video_id', 'unknown')
            return jsonify({
                "success": True,
                "video_id": video_id,
                "message": "Video uploaded successfully"
            })
        else:
            error_msg = response_data.get('error', f"HTTP {response.status_code}")
            return jsonify({"success": False, "error": error_msg}), response.status_code
            
    except requests.Timeout:
        return jsonify({"success": False, "error": "Request timed out"}), 504
    except Exception as e:
        return jsonify({"success": False, "error": f"Upload failed: {str(e)}"}), 500

Once the information from the frontend is validated, we make a POST request to the Reka Vision API's /videos/upload endpoint. The parameters are sent as form data, and we include the API key in the headers for authentication. Here I was using URLs to upload videos, but you can also upload local files by adjusting the request accordingly (see the sketch below). As you can see, it's pretty straightforward, and the documentation from Reka made it easy to understand what was needed.
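
If you want to try the local-file variant, here's a hedged sketch using requests' multipart upload (the exact field name Reka expects is an assumption; check the API docs):

# Sketch: upload a local file instead of a URL; the 'file' field name is an assumption
with open('my_clip.mp4', 'rb') as f:
    response = requests.post(
        f"{base_url.rstrip('/')}/videos/upload",
        headers={"X-Api-Key": api_key},
        data={'video_name': 'my_clip', 'index': 'true'},
        files={'file': f},
        timeout=120
    )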

The Magic: Sending Roast Requests to Reka Vision API

Here's where things get interesting. Once a video is uploaded, we can ask the AI to analyze it and generate content. The Reka Vision API supports conversational queries about video content:

from typing import Any, Dict

def call_reka_vision_qa(video_id: str) -> Dict[str, Any]:
    """Call the Reka Video QA API to generate a roast"""
    
    headers = {'X-Api-Key': api_key} if api_key else {}
    
    payload = {
        "video_id": video_id,
        "messages": [
            {
                "role": "user",
                "content": "Write a funny and gentle roast about the person, or the voice in this video. Reply in markdown format."
            }
        ]
    }
    
    try:
        resp = requests.post(
            f"{base_url}/qa/chat",
            headers=headers,
            json=payload,
            timeout=30
        )
        
        data = resp.json() if resp.ok else {"error": f"HTTP {resp.status_code}"}
        
        if not resp.ok and 'error' not in data:
            data['error'] = f"HTTP {resp.status_code} calling chat endpoint"
        
        return data
        
    except requests.Timeout:
        return {"error": "Request to chat API timed out"}
    except Exception as e:
        return {"error": f"Chat API call failed: {e}"}

Here we pass the video ID and a prompt asking for a "funny and gentle roast." The API responds with AI-generated content, which we can then send back to the frontend for display. I try to give more "freedom" to the AI by asking it to reply in markdown format, which makes the output more engaging.
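
Wiring that into the frontend only takes a thin Flask route; here's a sketch (the route path is my choice, not necessarily what's in the repo):

@app.route('/api/roast/<video_id>', methods=['POST'])
def roast_video(video_id):
    """Generate a roast for an already-uploaded video (illustrative route)."""
    result = call_reka_vision_qa(video_id)
    status = 500 if 'error' in result else 200
    return jsonify(result), status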

Try It Yourself!

The complete project is available on GitHub: reka-ai/api-examples-python

What Makes the Reka Vision API So Nice to Use

What really stood out to me was how approachable the Reka Vision API is. You don't need any special SDK—just the requests library making standard HTTP calls. And honestly, it doesn't matter what language you're used to; an HTTP call is pretty much always simple to do. Whether you're coming from .NET, Python, JavaScript, or anything else, you're just sending JSON and getting JSON back.

Authentication is refreshingly straightforward: just pop your API key in the header and you're good to go. No complex SDKs, no multi-step authentication flows, no wrestling with binary data streams. The conversational interface lets you ask questions in natural language, and you get back structured JSON responses with clear fields.
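
To see how little ceremony there is, here's the same upload call from a terminal, using the endpoint and header straight from the Python code above (the video URL is a placeholder):

curl -X POST "https://vision-agent.api.reka.ai/videos/upload" \
  -H "X-Api-Key: $API_KEY" \
  -d "video_name=my_clip" \
  -d "index=true" \
  -d "video_url=https://example.com/clip.mp4"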

One thing worth noting: in this example, the videos are pre-uploaded and indexed, which means the responses come back fast. But here's the impressive part—the AI actually looks at the video content. It's not just reading a transcript or metadata; it's genuinely analyzing the visual elements. That's what makes the roasts so spot-on and contextual.

Final Thoughts

The Reka Vision API itself deserves credit for making video AI accessible. No complicated SDKs, no multi-GB model downloads, no GPU requirements. Just simple HTTP requests and powerful AI capabilities. I'm not saying I'm switching to Python full-time, but expect to see me sharing more Python projects in the future!


Reading Notes #415


Every Monday, I share my "reading notes". These are the articles, blog posts, podcast episodes, and books that caught my interest during the week. It's a mix of the week's news and whatever I happened to consume.



Cloud

Programming

Miscellaneous

  • What are Azure CLI Extensions? (Michael Crump) - An interesting first article in a series. This one introduces us to the extension... Hmmm. I think I have an idea.
~


Reading Notes #382

Cloud


Programming

~

Reading Notes #331


Cloud


Programming



Books



Miscellaneous



Reading Notes #321


Suggestion of the week



Cloud



Programming



Databases



Miscellaneous


Books



When: The Scientific Secrets of Perfect Timing (Daniel H. Pink)

A really amazing book packed with very interesting advice. Things that you kind of already knew, or at least had a feeling you knew, are clearly explained.

After reading (or listening to) this book, you will know why, and you can decide to fight it or change the when... improving your performance and freeing your time and energy for something else.

ISBN: 0525589333






Reading Notes #319

Cloud


Programming


Databases


Miscellaneous



Reading Notes #317

Suggestion of the week


Cloud


Programming


Miscellaneous