Serverless Endpoints

Build auto-scaling GPU inference endpoints.

Serverless endpoints provide auto-scaling GPU inference that scales to zero when idle. You only pay for the compute time you use.

How It Works

  1. Create a template with your model and inference code
  2. Deploy an endpoint referencing that template
  3. Send requests: workers auto-scale based on queue depth
  4. Scale to zero: with workersMin set to 0, no workers run and nothing is billed while there are no requests

Creating an Endpoint

curl -X POST https://api.gpuworker.com/endpoints \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-inference",
    "templateId": "your-template-id",
    "gpuIds": "NVIDIA GeForce RTX 4090",
    "workersMin": 0,
    "workersMax": 5
  }'
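
The same request can be built from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_create_endpoint_request` and `send` are illustrative and not part of any SDK, and the response format is not documented here, so the body is returned unparsed.

```python
import json
import urllib.request

API_BASE = "https://api.gpuworker.com"  # base URL taken from the curl example


def build_create_endpoint_request(token: str, name: str, template_id: str,
                                  gpu_ids: str, workers_min: int = 0,
                                  workers_max: int = 5) -> urllib.request.Request:
    """Build (but do not send) the POST /endpoints request."""
    payload = {
        "name": name,
        "templateId": template_id,
        "gpuIds": gpu_ids,
        "workersMin": workers_min,
        "workersMax": workers_max,
    }
    return urllib.request.Request(
        url=f"{API_BASE}/endpoints",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(req: urllib.request.Request) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Keeping request construction separate from sending makes it possible to inspect the payload without a valid token; call `send(req)` once the token and template ID are real.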

Worker Configuration

Parameter     Description
workersMin    Minimum number of always-warm workers (0 for scale-to-zero)
workersMax    Maximum number of workers to scale up to under load
gpuIds        GPU type(s) to use for workers
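
To see how workersMin and workersMax bound scaling, the sketch below computes a worker count from queue depth. The platform's actual scaling policy is not described here; the one-worker-per-queued-job rule and the `jobs_per_worker` parameter are assumptions made purely for illustration.

```python
import math


def desired_workers(queue_depth: int, workers_min: int, workers_max: int,
                    jobs_per_worker: int = 1) -> int:
    """Return a worker count for the queue depth, clamped to [workers_min, workers_max].

    jobs_per_worker is a hypothetical tuning knob, not a documented setting.
    """
    if workers_min < 0 or workers_max < workers_min:
        raise ValueError("require 0 <= workersMin <= workersMax")
    if jobs_per_worker < 1:
        raise ValueError("jobs_per_worker must be >= 1")
    # Workers needed to drain the queue under the assumed per-worker capacity.
    needed = math.ceil(max(queue_depth, 0) / jobs_per_worker)
    # Never drop below the warm floor or exceed the configured ceiling.
    return max(workers_min, min(workers_max, needed))
```

With `workers_min=0` an empty queue yields zero workers (scale to zero), while `workers_min=1` keeps one worker warm regardless of load.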

Sending Requests

Once deployed, send inference requests to your endpoint:

curl -X POST https://api.gpuworker.com/v2/{endpoint_id}/run \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Hello, world!"
    }
  }'

Synchronous vs Asynchronous

  • /run: Returns immediately with a job ID; poll for results
  • /runsync: Waits for the result (up to 30s timeout)
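
For the asynchronous route, the client needs a polling loop. The sketch below shows one way to structure it. The status-check call is not documented in this section, so it is passed in as a function (`fetch_status`), and the status names `COMPLETED` and `FAILED` are hypothetical placeholders to replace with the values your endpoint actually returns.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(fetch_status: Callable[[], Dict[str, Any]],
                    interval_s: float = 1.0,
                    timeout_s: float = 300.0,
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Call fetch_status() until the job reaches a terminal state or times out."""
    terminal = {"COMPLETED", "FAILED"}  # hypothetical status names
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in terminal:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        # Wait before asking again so the API is not hammered.
        sleep(interval_s)
```

Passing the status call in as a function keeps the loop independent of any particular URL scheme and makes it easy to exercise with a stub.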

Pricing

Serverless endpoints are billed per second of active compute time. Cold start time is not billed. Idle workers (when workersMin > 0) are billed at the standard GPU rate.
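
The billing rules above reduce to simple arithmetic. The following sketch estimates a bill from active and idle seconds; both rates are hypothetical inputs, since no actual prices are given here, and cold start time is simply left out because it is not billed.

```python
def estimate_cost(active_seconds: float, idle_worker_seconds: float,
                  active_rate: float, idle_rate: float) -> float:
    """Estimate a bill in currency units.

    active_seconds:      seconds workers spent processing requests
    idle_worker_seconds: seconds warm workers (workersMin > 0) sat idle
    active_rate:         hypothetical per-second price for active compute
    idle_rate:           hypothetical per-second standard GPU price
    """
    values = (active_seconds, idle_worker_seconds, active_rate, idle_rate)
    if any(v < 0 for v in values):
        raise ValueError("all inputs must be non-negative")
    return active_seconds * active_rate + idle_worker_seconds * idle_rate


# One warm worker for a day (86,400 s), busy for one hour of it.
# The rates are made-up numbers for illustration only.
daily = estimate_cost(active_seconds=3600, idle_worker_seconds=82800,
                      active_rate=0.0004, idle_rate=0.0002)
```

Setting `idle_worker_seconds` to 0 models a scale-to-zero endpoint, where only active seconds contribute to the total.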

Best Practices

  1. Set workersMin: 0 for development endpoints to avoid idle charges
  2. Set workersMin: 1 for production to eliminate cold starts
  3. Use an appropriate GPU: the RTX 4090 is cost-effective for most inference workloads
  4. Optimize your handler: minimize model load time for faster cold starts