4Cs Cloud-Native & Linux

4Cs – Cloud, Cluster, Container, and Code & Linux Tips and Tricks

  • Send Images to a Vision Language Model (VLM) API

    Vision language models (VLMs) served through Ollama, LM Studio, or any OpenAI-style endpoint accept image data alongside text prompts. This tutorial shows you how to capture an image on a Linux box, encode it for HTTP transmission, and invoke the VLM API using both curl and Python. The examples are deliberately simple so they can be adapted to shell scripts, CI pipelines, or edge devices.

    Why Send Images to a VLM?

    • Multimodal assistants – combine visual context with natural language queries
    • Document analysis – extract text from scanned PDFs or screenshots
    • Rapid prototyping – test model responses without writing full client libraries

    All of these use cases require you to send binary image data (usually JPEG or PNG) either as part of a multipart/form-data request or as a base64-encoded JSON payload, depending on the server's expectations.
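
    One practical detail in the JSON route is labelling the image with the right MIME type. The helper below is my own illustration rather than something taken from any particular server's documentation: it guesses the type from the file extension and builds the data: URL form that appears later in this post.

    import base64
    import mimetypes

    def to_data_url(path: str) -> str:
        """Encode a local image as a data: URL, e.g. data:image/jpeg;base64,..."""
        mime, _ = mimetypes.guess_type(path)       # infers image/jpeg, image/png, ...
        mime = mime or "application/octet-stream"  # fall back if the extension is unknown
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        return f"data:{mime};base64,{b64}"

    print(to_data_url("captured.jpg")[:60])        # print only the prefix for inspection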

    Prerequisites

    • A Linux system with ffmpeg or imagemagick installed for image capture
    • curl version 7.55+ (most modern distros have this)
    • Python 3.8+ and the requests library (pip install requests)

    You also need an active VLM server endpoint:

    Service      Example Endpoint
    Ollama       http://localhost:11434/api/generate
    LM Studio    http://127.0.0.1:1234/v1/chat/completions

    Both accept JSON bodies, but the field that carries the image data differs between them (see Step 3).
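
    Before wiring up image payloads, it helps to confirm the endpoints are reachable at all. The snippet below is a minimal sketch that assumes the default ports from the table above: Ollama answers a short status string on its root path, and OpenAI-style servers such as LM Studio list their loaded models under /v1/models. If either call fails, start or configure the corresponding server first.

    import requests

    # Ollama answers plain text ("Ollama is running") on its root path
    print(requests.get("http://localhost:11434/", timeout=5).text)

    # OpenAI-style servers such as LM Studio expose the loaded models here
    print(requests.get("http://127.0.0.1:1234/v1/models", timeout=5).json())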

    Step 1 – Capture an Image

    If you have a webcam attached, you can use ffmpeg to snap a picture:

    ffmpeg -f v4l2 -video_size 640x480 -i /dev/video0 -vframes 1 captured.jpg
    

    Alternatively, with imagemagick you can capture the screen:

    import -window root screenshot.png
    

    Make sure the file exists and is readable:

    ls -l captured.jpg
    # or
    file screenshot.png
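
    If you prefer to validate the capture programmatically, a small sketch with Pillow (an extra dependency, not listed in the prerequisites above) can confirm the file really is a readable image and report its dimensions:

    from PIL import Image  # pip install pillow

    with Image.open("captured.jpg") as img:
        img.verify()                  # raises if the file is truncated or not an image
    print("captured.jpg looks like a valid image")

    # verify() leaves the image object unusable, so reopen it before inspecting further
    with Image.open("captured.jpg") as img:
        print(img.format, img.size)   # e.g. JPEG (640, 480)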
    

    Step 2 – Encode the Image for HTTP

    Option A: Multipart Form Data (curl)

    If your server accepts multipart uploads, the simplest approach is to let curl handle the encoding; such servers typically expect a form field named image. Note that Ollama's native API does not take multipart input (it expects the JSON format shown in Option B and Step 3), so substitute your server's actual upload URL in the example below:

    curl -X POST http://your-vlm-server/your-upload-endpoint \
         -F "prompt=What do you see in this picture?" \
         -F "image=@captured.jpg;type=image/jpeg"
    

    Explanation of flags:

    • -X POST – explicit HTTP method
    • -F – creates a form field; the @ syntax tells curl to read file contents
    • type= – sets the MIME type, useful for servers that validate it
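
    For completeness, the same multipart upload can be issued from Python with requests. This is a sketch against the placeholder endpoint above, so substitute your server's real URL and field names.

    import requests

    # Multipart/form-data upload: requests builds the encoding from the data/files dicts
    with open("captured.jpg", "rb") as f:
        resp = requests.post(
            "http://your-vlm-server/your-upload-endpoint",   # placeholder, replace with your URL
            data={"prompt": "What do you see in this picture?"},
            files={"image": ("captured.jpg", f, "image/jpeg")},
            timeout=120,
        )
    print(resp.status_code, resp.text)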

    If your endpoint expects JSON instead (as Ollama's API does), you need to base64-encode the image.

    Option B: Base64 JSON Payload (Python)

    import base64
    import json
    import requests

    # Load and encode image
    with open('captured.jpg', 'rb') as f:
        img_bytes = f.read()
    b64_image = base64.b64encode(img_bytes).decode('utf-8')

    payload = {
        "model": "llava",                  # an Ollama vision model you have pulled
        "prompt": "Describe the scene in detail.",
        "images": [b64_image],             # Ollama expects a list of base64 strings
        "stream": False                    # return one JSON object, not a token stream
    }

    headers = {'Content-Type': 'application/json'}

    response = requests.post(
        'http://localhost:11434/api/generate',
        headers=headers,
        data=json.dumps(payload),
        timeout=120
    )

    print(response.status_code)
    print(response.json())
    

    Key points:

    • For Ollama, the image is sent as a base-64 string inside the "images" list; other servers may expect a single "image" field instead.
    • The "model" key selects the VLM variant; set it to a model you have pulled locally (for example llava). A quick way to list the available names follows below.
    • "stream": False asks Ollama for one complete JSON response instead of a stream of partial results.
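
    If you are unsure which model names are available locally, Ollama exposes a listing endpoint. The sketch below assumes the default port and simply prints the installed model names:

    import requests

    # GET /api/tags returns the models currently available to this Ollama instance
    tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
    for m in tags.get("models", []):
        print(m["name"])        # e.g. llava:latest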

    Step 3 – Handling Different API Schemas

    Ollama’s /api/generate Endpoint

    Ollama expects a JSON body with the base64-encoded image(s) supplied as an array under "images". Example using curl:

    curl -X POST http://localhost:11434/api/generate \
         -H "Content-Type: application/json" \
         -d '{
               "model":"llava",
               "prompt":"Explain the objects in this picture.",
               "images":["'$(base64 -w 0 captured.jpg)'"],
               "stream":false
             }'
    

    The -w 0 flag tells base64 not to insert line breaks, and "stream": false asks Ollama to return one complete JSON object instead of a stream of partial results.
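
    If you leave streaming enabled (Ollama's default), the endpoint returns one JSON object per line as tokens are generated. A minimal Python sketch for consuming that stream, assuming the same llava model, might look like this:

    import base64
    import json
    import requests

    with open("captured.jpg", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    payload = {"model": "llava", "prompt": "Explain the objects in this picture.",
               "images": [b64]}            # no "stream": false, so Ollama streams

    with requests.post("http://localhost:11434/api/generate",
                       json=payload, stream=True, timeout=300) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)       # each line is a standalone JSON object
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                print()
                break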

    LM Studio’s Chat Completion Endpoint

    LM Studio follows the OpenAI chat schema. Images are passed as separate "content" blocks with a "type": "image_url" entry, and "model" should match the identifier of the vision model you have loaded in LM Studio:

    {
      "model": "gpt-4v",
      "messages": [
        {"role":"user","content":[
            {"type":"text","text":"What is happening here?"},
            {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,{{BASE64}}"}}
        ]}
      ]
    }
    

    In Python:

    import base64, json, requests
    
    with open('screenshot.png', 'rb') as img:
        b64 = base64.b64encode(img.read()).decode()
    
    payload = {
        "model": "gpt-4v",
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": "What is happening in this picture?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
            ]}
        ]
    }
    
    resp = requests.post('http://127.0.0.1:1234/v1/chat/completions', json=payload, timeout=120)
    print(resp.json())
    

    Notice the data: URL scheme – the server extracts the base64 payload automatically.
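
    The response follows the OpenAI chat-completion shape, so the model's answer sits under choices[0].message.content. A short continuation of the snippet above pulls it out:

    # resp is the Response object returned by the requests.post call above
    result = resp.json()
    answer = result["choices"][0]["message"]["content"]
    print(answer)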

    Step 4 – Automate with a Shell Script

    You can wrap everything into a single script that captures, encodes, and calls the API. Save as vlm_send.sh:

    #!/usr/bin/env bash
    
    # Capture image (adjust device path if needed)
    ffmpeg -y -f v4l2 -video_size 640x480 -i /dev/video0 -vframes 1 /tmp/vlm_input.jpg >/dev/null 2>&1
    
    # Base64 encode without newlines
    IMG_B64=$(base64 -w 0 /tmp/vlm_input.jpg)
    
    # Build JSON payload (adjust model name as needed)
    read -r -d '' PAYLOAD <<EOF
    {
      "model": "llava",
      "prompt": "Provide a concise description of the scene.",
      "images": ["$IMG_B64"],
      "stream": false
    }
    EOF
    
    # Send request with curl
    curl -s -X POST http://localhost:11434/api/generate \
         -H "Content-Type: application/json" \
         -d "$PAYLOAD"
    

    Make it executable:

    chmod +x vlm_send.sh
    ./vlm_send.sh
    

    The script prints the JSON response from the VLM.
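
    Because the script writes plain JSON to stdout, it can also be consumed from other tooling. The sketch below assumes vlm_send.sh sits in the current directory, runs it, and prints only the generated text from Ollama's response field:

    import json
    import subprocess

    # Run the capture-and-send script and parse its JSON output
    result = subprocess.run(["./vlm_send.sh"], capture_output=True, text=True, check=True)
    reply = json.loads(result.stdout)
    print(reply.get("response", reply))   # Ollama puts the generated text under "response"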

    Step 5 – Error Handling and Debugging

    • HTTP 400 – Likely malformed JSON or missing required fields. Use curl -v to view request headers.
    • HTTP 415 Unsupported Media Type – The server did not recognize the MIME type; ensure you set Content-Type: application/json when sending JSON, and use the correct image MIME type (image/jpeg) in multipart requests.
    • Timeouts – VLM inference can be slow for large images. Increase the curl timeout with --max-time 120; in Python, pass timeout=120 to requests.post (see the sketch below).
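
    Here is a minimal sketch of those two Python-side safeguards, a request timeout and a status check, reusing the payload dict built in Option B above:

    import requests

    try:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json=payload,          # the payload dict built in Option B above
            timeout=120,           # give slow VLM inference up to two minutes
        )
        resp.raise_for_status()    # turn HTTP 4xx/5xx answers into exceptions
    except requests.exceptions.Timeout:
        print("The VLM took too long; try a smaller image or a larger timeout.")
    except requests.exceptions.HTTPError as err:
        print(f"Server rejected the request: {err}")
    else:
        print(resp.json())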

    Performance Tips

    1. Resize before sending – Large images increase payload size and processing time. Use ImageMagick: convert captured.jpg -resize 800x600 resized.jpg
    2. Cache base64 strings – If you send the same image repeatedly, cache the encoded string to avoid re‑encoding overhead.
    3. Batch multiple images – Some APIs accept an array of images ("images": ["b64_1","b64_2"]). This reduces round trips when analyzing a set of photos together; a sketch follows this list.
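
    As an illustration of that batching pattern, the sketch below encodes every JPEG in the current directory and sends them in a single Ollama request. Whether the model genuinely reasons across all images at once depends on the model you load, so treat this as a pattern to verify rather than a guarantee.

    import base64
    import glob
    import requests

    # Encode every JPEG in the current directory
    images_b64 = []
    for path in sorted(glob.glob("*.jpg")):
        with open(path, "rb") as f:
            images_b64.append(base64.b64encode(f.read()).decode())

    payload = {
        "model": "llava",
        "prompt": "Compare these photos and describe what changes between them.",
        "images": images_b64,    # Ollama accepts a list of base64-encoded images
        "stream": False,
    }

    resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
    print(resp.json().get("response"))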

    Conclusion

    Sending images to vision language model APIs is straightforward once you understand the required payload format. Whether you prefer raw multipart form data with curl or structured JSON with base64 encoding in Python, the steps above cover both approaches for popular servers like Ollama and LM Studio. By automating image capture, resizing, and request handling, you can embed multimodal AI capabilities into scripts, IoT gateways, or web back‑ends with minimal effort.