Vision language model (VLM) servers such as Ollama, LM Studio, or any OpenAI‑style endpoint accept image data alongside text prompts. This tutorial shows you how to capture an image on a Linux box, encode it for HTTP transmission, and invoke the VLM API using both curl and Python. The examples are deliberately simple so they can be adapted to shell scripts, CI pipelines, or edge devices.
Why Send Images to a VLM?
- Multimodal assistants – combine visual context with natural language queries
- Document analysis – extract text from scanned PDFs or screenshots
- Rapid prototyping – test model responses without writing full client libraries
All of these use cases require you to send binary image data (usually JPEG or PNG), either as part of a multipart/form‑data request or as a base64‑encoded JSON payload, depending on the server's expectations.
Prerequisites
- A Linux system with ffmpeg or imagemagick installed for image capture
- curl version 7.55+ (most modern distros have this)
- Python 3.8+ and the requests library (pip install requests)
You also need an active VLM server endpoint:
| Service | Example Endpoint |
|---|---|
| Ollama | http://localhost:11434/api/generate |
| LM Studio | http://127.0.0.1:1234/v1/chat/completions |
Both accept JSON bodies, but the image field differs: Ollama takes base64 strings in an "images" array, while LM Studio follows the OpenAI chat format with "image_url" content blocks (see Step 3).
Step 1 – Capture an Image
If you have a webcam attached, you can use ffmpeg to snap a picture:
ffmpeg -f v4l2 -video_size 640x480 -i /dev/video0 -vframes 1 captured.jpg
Alternatively, with imagemagick you can capture the screen:
import -window root screenshot.png
Make sure the file exists and is readable:
ls -l captured.jpg
# or
file screenshot.png
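If you prefer to run the sanity check from Python, the short sketch below reports the file's size before you commit to encoding it. The 5 MB threshold is an arbitrary choice, not a server limit:
import os

path = 'captured.jpg'
size = os.path.getsize(path)  # raises OSError if the capture failed
print(f'{path}: {size / 1024:.1f} KiB')
if size > 5 * 1024 * 1024:  # arbitrary threshold; large images slow down inference
    print('Warning: consider resizing before sending (see Performance Tips).')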
Step 2 – Encode the Image for HTTP
Option A: Multipart Form Data (curl)
If your server accepts multipart uploads, the simplest approach is to let curl handle the encoding. This example assumes the server expects a form field named image:
curl -X POST http://localhost:11434/api/generate \
-F "prompt=What do you see in this picture?" \
-F "image=@captured.jpg;type=image/jpeg"
Explanation of flags:
- -X POST – explicit HTTP method
- -F – creates a form field; the @ syntax tells curl to read the file contents
- type= – sets the MIME type, useful for servers that validate it
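For comparison, the same multipart request can be sent from Python with the requests library. This is a sketch under the same assumption that the server accepts a form field named image; Ollama itself expects JSON, as covered in Step 3:
import requests

with open('captured.jpg', 'rb') as f:
    resp = requests.post(
        'http://localhost:11434/api/generate',
        data={'prompt': 'What do you see in this picture?'},
        files={'image': ('captured.jpg', f, 'image/jpeg')},  # requests builds the multipart body
    )
print(resp.status_code)
print(resp.text)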
If your endpoint expects JSON instead, you need to base64‑encode the image.
Option B: Base64 JSON Payload (Python)
import base64
import json
import requests
# Load and encode image
with open('captured.jpg', 'rb') as f:
    img_bytes = f.read()
b64_image = base64.b64encode(img_bytes).decode('utf-8')
payload = {
    "model": "llava",  # use a vision model you have actually pulled
    "prompt": "Describe the scene in detail.",
    "images": [b64_image],  # Ollama expects a list; some servers use a single "image" field instead
    "stream": False  # return one JSON object rather than a token stream
}
headers = {'Content-Type': 'application/json'}
response = requests.post(
'http://localhost:11434/api/generate',
headers=headers,
data=json.dumps(payload)
)
print(response.status_code)
print(response.json())
Key points:
- The image travels as a base‑64 string. Ollama expects a list under the "images" key; other servers may take a single "image" field, so adjust the key name to match your endpoint.
- The "model" key selects the VLM variant; set it to a model you have actually pulled or loaded.
Step 3 – Handling Different API Schemas
Ollama’s /api/generate Endpoint
Ollama expects a JSON body with base64 strings in an "images" array. Example using curl:
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model":"llava",
"prompt":"Explain the objects in this picture.",
"images":["'$(base64 -w 0 captured.jpg)'"]
}'
The -w 0 flag tells base64 not to insert line breaks. You can also add "stream": false to the body if you want a single JSON response instead of a stream of partial results.
LM Studio’s Chat Completion Endpoint
LM Studio follows the OpenAI chat schema. Images are passed as separate "content" blocks with a "type": "image_url" entry; replace the model name below with whichever vision‑capable model you have loaded:
{
"model": "gpt-4v",
"messages": [
{"role":"user","content":[
{"type":"text","text":"What is happening here?"},
{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,{{BASE64}}"}}
]}
]
}
In Python:
import base64, json, requests
with open('screenshot.png', 'rb') as img:
    b64 = base64.b64encode(img.read()).decode()
payload = {
"model": "gpt-4v",
"messages": [
{"role": "user", "content": [
{"type": "text", "text": "What is happening in this picture?"},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
]}
]
}
resp = requests.post('http://127.0.0.1:1234/v1/chat/completions', json=payload)
print(resp.json())
Notice the data: URL scheme – the server extracts the base64 payload automatically.
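To get just the assistant's text out of the response, read the first choice. Continuing from the snippet above, and assuming the server returns the standard OpenAI choices structure:
data = resp.json()
reply = data['choices'][0]['message']['content']  # first (and usually only) choice
print(reply)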
Step 4 – Automate with a Shell Script
You can wrap everything into a single script that captures, encodes, and calls the API. Save as vlm_send.sh:
#!/usr/bin/env bash
# Capture image (adjust device path if needed)
ffmpeg -y -f v4l2 -video_size 640x480 -i /dev/video0 -vframes 1 /tmp/vlm_input.jpg >/dev/null 2>&1
# Base64 encode without newlines
IMG_B64=$(base64 -w 0 /tmp/vlm_input.jpg)
# Build JSON payload (adjust model name as needed)
read -r -d '' PAYLOAD <<EOF
{
"model": "llava",
"prompt": "Provide a concise description of the scene.",
"images": ["$IMG_B64"]
}
EOF
# Send request with curl
curl -s -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d "$PAYLOAD"
Make it executable:
chmod +x vlm_send.sh
./vlm_send.sh
The script prints the JSON response from the VLM.
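If you would rather keep the whole pipeline in Python, the sketch below does the same thing: it shells out to ffmpeg, encodes the frame, and posts it to Ollama. It assumes a webcam at /dev/video0 and a pulled llava model:
#!/usr/bin/env python3
import base64
import subprocess
import requests

CAPTURE_PATH = '/tmp/vlm_input.jpg'

# Grab a single frame from the default webcam (adjust the device path if needed)
subprocess.run(
    ['ffmpeg', '-y', '-f', 'v4l2', '-video_size', '640x480',
     '-i', '/dev/video0', '-vframes', '1', CAPTURE_PATH],
    check=True, capture_output=True,
)

# Base64-encode the captured frame
with open(CAPTURE_PATH, 'rb') as f:
    b64 = base64.b64encode(f.read()).decode()

# Build the Ollama payload and send it
payload = {
    'model': 'llava',
    'prompt': 'Provide a concise description of the scene.',
    'images': [b64],
    'stream': False,  # single JSON object instead of a token stream
}
resp = requests.post('http://localhost:11434/api/generate', json=payload, timeout=120)
resp.raise_for_status()
print(resp.json().get('response', ''))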
Step 5 – Error Handling and Debugging
- HTTP 400 – Likely malformed JSON or missing required fields. Use curl -v to inspect the request headers.
- HTTP 415 Unsupported Media Type – The server did not recognize the MIME type; ensure you set Content-Type: application/json when sending JSON, and use the correct image MIME type (image/jpeg) in multipart requests.
- Timeouts – VLM inference can be slow for large images. Increase the curl timeout with --max-time 120. In Python, pass timeout=120 to requests.post (see the sketch below).
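Here is a minimal Python sketch that combines the timeout and status checks; the placeholder payload should be built as in Steps 2 and 3:
import requests

payload = {'model': 'llava', 'prompt': 'Describe the image.', 'images': []}  # fill "images" as in Step 2

try:
    resp = requests.post(
        'http://localhost:11434/api/generate',
        json=payload,
        timeout=120,  # generous budget for slow inference
    )
    resp.raise_for_status()  # turns 4xx/5xx responses into an exception
    print(resp.json())
except requests.Timeout:
    print('Request timed out; try a smaller image or a longer timeout.')
except requests.HTTPError as err:
    print('Server error:', err.response.status_code, err.response.text[:200])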
Performance Tips
- Resize before sending – Large images increase payload size and processing time. Use ImageMagick: convert captured.jpg -resize 800x600 resized.jpg (a Pillow alternative is sketched after this list).
- Cache base64 strings – If you send the same image repeatedly, cache the encoded string to avoid re‑encoding overhead.
- Batch multiple images – Some APIs accept an array of images ("images": ["b64_1","b64_2"]), which reduces round trips when analyzing a set of photos together.
Conclusion
Sending images to vision language model APIs is straightforward once you understand the required payload format. Whether you prefer raw multipart form data with curl or structured JSON with base64 encoding in Python, the steps above cover both approaches for popular servers like Ollama and LM Studio. By automating image capture, resizing, and request handling, you can embed multimodal AI capabilities into scripts, IoT gateways, or web back‑ends with minimal effort.