Vision language model (VLM) servers such as Ollama, LM Studio, or any OpenAI‑style endpoint accept image data alongside text prompts. This tutorial shows you how to capture an image on a Linux box, encode it for HTTP transmission, and invoke the VLM API using both curl and Python. The examples are deliberately simple so they can be adapted to shell scripts, CI pipelines, or edge devices.
Why Send Images to a VLM?
- Multimodal assistants – combine visual context with natural language queries
- Document analysis – extract text from scanned PDFs or screenshots
- Rapid prototyping – test model responses without writing full client libraries
All of these use cases require you to send binary image data (usually JPEG or PNG), either as part of a multipart/form‑data request or as a base64‑encoded JSON payload, depending on the server's expectations.
Prerequisites
- A Linux system with ffmpeg or imagemagick installed for image capture
- curl version 7.55+ (most modern distros have this)
- Python 3.8+ and the requests library (pip install requests)
You also need an active VLM server endpoint:
| Service | Example Endpoint |
|---|---|
| Ollama | http://localhost:11434/api/generate |
| LM Studio | http://127.0.0.1:1234/v1/chat/completions |
Both accept JSON bodies, but the image field differs: Ollama takes base64 strings in an "images" array, while LM Studio follows the OpenAI chat format with "image_url" content blocks (see Step 3).
Step 1 – Capture an Image
If you have a webcam attached, you can use ffmpeg to snap a picture:
ffmpeg -f v4l2 -video_size 640x480 -i /dev/video0 -vframes 1 captured.jpg
Alternatively, with imagemagick you can capture the screen:
import -window root screenshot.png
Make sure the file exists and is readable:
ls -l captured.jpg
# or
file screenshot.png
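If you prefer to run the sanity check from Python, the short sketch below reports the file's size before you commit to encoding it. The 5 MB threshold is an arbitrary choice, not a server limit:
import os

path = 'captured.jpg'
size = os.path.getsize(path)  # raises OSError if the capture failed
print(f'{path}: {size / 1024:.1f} KiB')
if size > 5 * 1024 * 1024:  # arbitrary threshold; large images slow down inference
    print('Warning: consider resizing before sending (see Performance Tips).')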
Step 2 – Encode the Image for HTTP
Option A: Multipart Form Data (curl)
If your server accepts multipart uploads, the simplest approach is to let curl handle the encoding. This example assumes the server expects a form field named image:
curl -X POST http://localhost:11434/api/generate \
-F "prompt=What do you see in this picture?" \
-F "image=@captured.jpg;type=image/jpeg"
Explanation of flags:
- -X POST – explicit HTTP method
- -F – creates a form field; the @ syntax tells curl to read the file contents
- type= – sets the MIME type, useful for servers that validate it
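For comparison, the same multipart request can be sent from Python with the requests library. This is a sketch under the same assumption that the server accepts a form field named image; Ollama itself expects JSON, as covered in Step 3:
import requests

with open('captured.jpg', 'rb') as f:
    resp = requests.post(
        'http://localhost:11434/api/generate',
        data={'prompt': 'What do you see in this picture?'},
        files={'image': ('captured.jpg', f, 'image/jpeg')},  # requests builds the multipart body
    )
print(resp.status_code)
print(resp.text)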
If your endpoint expects JSON instead, you need to base64‑encode the image.
Option B: Base64 JSON Payload (Python)
import base64
import json
import requests
# Load and encode image
with open('captured.jpg', 'rb') as f:
    img_bytes = f.read()
b64_image = base64.b64encode(img_bytes).decode('utf-8')
payload = {
    "model": "llava",  # use a vision model you have actually pulled
    "prompt": "Describe the scene in detail.",
    "images": [b64_image],  # Ollama expects a list; some servers use a single "image" field instead
    "stream": False  # return one JSON object rather than a token stream
}
headers = {'Content-Type': 'application/json'}
response = requests.post(
'http://localhost:11434/api/generate',
headers=headers,
data=json.dumps(payload)
)
print(response.status_code)
print(response.json())
Key points:
- The image travels as a base‑64 string. Ollama expects a list under the "images" key; other servers may take a single "image" field, so adjust the key name to match your endpoint.
- The "model" key selects the VLM variant; set it to a model you have actually pulled or loaded.
Step 3 – Handling Different API Schemas
Ollama’s /api/generate Endpoint
Ollama expects a JSON body with base64 strings in an "images" array. Example using curl:
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model":"llava",
"prompt":"Explain the objects in this picture.",
"images":["'$(base64 -w 0 captured.jpg)'"]
}'
The -w 0 flag tells base64 not to insert line breaks. You can also add "stream": false to the body if you want a single JSON response instead of a stream of partial results.
LM Studio’s Chat Completion Endpoint
LM Studio follows the OpenAI chat schema. Images are passed as separate "content" blocks with a "type": "image_url" entry; replace the model name below with whichever vision‑capable model you have loaded:
{
"model": "gpt-4v",
"messages": [
{"role":"user","content":[
{"type":"text","text":"What is happening here?"},
{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,{{BASE64}}"}}
]}
]
}
In Python:
import base64, json, requests
with open('screenshot.png', 'rb') as img:
    b64 = base64.b64encode(img.read()).decode()
payload = {
"model": "gpt-4v",
"messages": [
{"role": "user", "content": [
{"type": "text", "text": "What is happening in this picture?"},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
]}
]
}
resp = requests.post('http://127.0.0.1:1234/v1/chat/completions', json=payload)
print(resp.json())
Notice the data: URL scheme – the server extracts the base64 payload automatically.
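To get just the assistant's text out of the response, read the first choice. Continuing from the snippet above, and assuming the server returns the standard OpenAI choices structure:
data = resp.json()
reply = data['choices'][0]['message']['content']  # first (and usually only) choice
print(reply)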
Step 4 – Automate with a Shell Script
You can wrap everything into a single script that captures, encodes, and calls the API. Save as vlm_send.sh:
#!/usr/bin/env bash
# Capture image (adjust device path if needed)
ffmpeg -y -f v4l2 -video_size 640x480 -i /dev/video0 -vframes 1 /tmp/vlm_input.jpg >/dev/null 2>&1
# Base64 encode without newlines
IMG_B64=$(base64 -w 0 /tmp/vlm_input.jpg)
# Build JSON payload (adjust model name as needed)
read -r -d '' PAYLOAD <<EOF
{
"model": "llava",
"prompt": "Provide a concise description of the scene.",
"images": ["$IMG_B64"]
}
EOF
# Send request with curl
curl -s -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d "$PAYLOAD"
Make it executable:
chmod +x vlm_send.sh
./vlm_send.sh
The script prints the JSON response from the VLM.
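If you would rather keep the whole pipeline in Python, the sketch below does the same thing: it shells out to ffmpeg, encodes the frame, and posts it to Ollama. It assumes a webcam at /dev/video0 and a pulled llava model:
#!/usr/bin/env python3
import base64
import subprocess
import requests

CAPTURE_PATH = '/tmp/vlm_input.jpg'

# Grab a single frame from the default webcam (adjust the device path if needed)
subprocess.run(
    ['ffmpeg', '-y', '-f', 'v4l2', '-video_size', '640x480',
     '-i', '/dev/video0', '-vframes', '1', CAPTURE_PATH],
    check=True, capture_output=True,
)

# Base64-encode the captured frame
with open(CAPTURE_PATH, 'rb') as f:
    b64 = base64.b64encode(f.read()).decode()

# Build the Ollama payload and send it
payload = {
    'model': 'llava',
    'prompt': 'Provide a concise description of the scene.',
    'images': [b64],
    'stream': False,  # single JSON object instead of a token stream
}
resp = requests.post('http://localhost:11434/api/generate', json=payload, timeout=120)
resp.raise_for_status()
print(resp.json().get('response', ''))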
Step 5 – Error Handling and Debugging
- HTTP 400 – Likely malformed JSON or missing required fields. Use curl -v to inspect the request headers.
- HTTP 415 Unsupported Media Type – The server did not recognize the MIME type; ensure you set Content-Type: application/json when sending JSON, and use the correct image MIME type (image/jpeg) in multipart requests.
- Timeouts – VLM inference can be slow for large images. Increase the curl timeout with --max-time 120. In Python, pass timeout=120 to requests.post (see the sketch below).
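Here is a minimal Python sketch that combines the timeout and status checks; the placeholder payload should be built as in Steps 2 and 3:
import requests

payload = {'model': 'llava', 'prompt': 'Describe the image.', 'images': []}  # fill "images" as in Step 2

try:
    resp = requests.post(
        'http://localhost:11434/api/generate',
        json=payload,
        timeout=120,  # generous budget for slow inference
    )
    resp.raise_for_status()  # turns 4xx/5xx responses into an exception
    print(resp.json())
except requests.Timeout:
    print('Request timed out; try a smaller image or a longer timeout.')
except requests.HTTPError as err:
    print('Server error:', err.response.status_code, err.response.text[:200])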
Performance Tips
- Resize before sending – Large images increase payload size and processing time. Use ImageMagick: convert captured.jpg -resize 800x600 resized.jpg (a Pillow alternative is sketched after this list).
- Cache base64 strings – If you send the same image repeatedly, cache the encoded string to avoid re‑encoding overhead.
- Batch multiple images – Some APIs accept an array of images ("images": ["b64_1","b64_2"]), which reduces round trips when analyzing a set of photos together.
Conclusion
Sending images to vision language model APIs is straightforward once you understand the required payload format. Whether you prefer raw multipart form data with curl or structured JSON with base64 encoding in Python, the steps above cover both approaches for popular servers like Ollama and LM Studio. By automating image capture, resizing, and request handling, you can embed multimodal AI capabilities into scripts, IoT gateways, or web back‑ends with minimal effort.