Serverless Endpoints

Serverless endpoints provide auto-scaling GPU inference that scales to zero when idle. You only pay for the compute time you use.

How It Works

1. Create a template with your model and inference code
2. Deploy an endpoint referencing that template
3. Send requests: workers auto-scale based on queue depth
4. Scale to zero: no charges when there are no requests

Creating an Endpoint

curl -X POST https://api.gpuworker.com/endpoints \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-inference",
    "templateId": "your-template-id",
    "gpuIds": "NVIDIA GeForce RTX 4090",
    "workersMin": 0,
    "workersMax": 5
  }'
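
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names are illustrative, and treating the response body as JSON is an assumption, since the response format is not described here.

```python
import json
import urllib.request

API_BASE = "https://api.gpuworker.com"  # base URL taken from the curl example


def build_create_endpoint_request(token, name, template_id, gpu_ids,
                                  workers_min=0, workers_max=5):
    """Build (but do not send) the POST /endpoints request."""
    payload = {
        "name": name,
        "templateId": template_id,
        "gpuIds": gpu_ids,
        "workersMin": workers_min,
        "workersMax": workers_max,
    }
    return urllib.request.Request(
        f"{API_BASE}/endpoints",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send a prepared request; assumes the API answers with a JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made, for example `send(build_create_endpoint_request("YOUR_TOKEN", "llama-inference", "your-template-id", "NVIDIA GeForce RTX 4090"))`.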

Worker Configuration

Parameter	Description
`workersMin`	Minimum number of always-warm workers (0 for scale-to-zero)
`workersMax`	Maximum workers to scale up to under load
`gpuIds`	GPU type(s) to use for workers
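
To illustrate how `workersMin` and `workersMax` bound scaling, here is a toy model in Python. It is not the platform's actual scaling policy, which is not documented here. The one-worker-per-queued-request ratio and the `requests_per_worker` parameter are assumptions made purely for illustration.

```python
import math


def target_workers(queue_depth, workers_min, workers_max, requests_per_worker=1):
    """Toy model: worker count implied by the queue, clamped to the configured bounds."""
    if workers_min < 0 or workers_max < workers_min:
        raise ValueError("require 0 <= workersMin <= workersMax")
    if requests_per_worker < 1:
        raise ValueError("requests_per_worker must be at least 1")
    # Workers needed to cover the queue under the assumed ratio.
    wanted = math.ceil(queue_depth / requests_per_worker)
    # Never drop below workersMin, never exceed workersMax.
    return max(workers_min, min(workers_max, wanted))
```

With `workersMin` set to 0 an empty queue yields zero workers (scale to zero). With `workersMin` set to 1 the result never drops below one warm worker.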

Sending Requests

Once deployed, send inference requests to your endpoint:

curl -X POST https://api.gpuworker.com/v2/{endpoint_id}/run \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Hello, world!"
    }
  }'
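
A Python equivalent of the request above might look like the following sketch. The function names are illustrative, and decoding the response as JSON is an assumption about the API rather than documented behavior.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.gpuworker.com"  # base URL taken from the curl example


def build_run_request(token, endpoint_id, prompt):
    """Build (but do not send) a POST to /v2/{endpoint_id}/run."""
    # Quote the ID so unexpected characters cannot alter the URL path.
    safe_id = urllib.parse.quote(endpoint_id, safe="")
    body = {"input": {"prompt": prompt}}
    return urllib.request.Request(
        f"{API_BASE}/v2/{safe_id}/run",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_run_request(request):
    """Send the request; assumes the API answers with a JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```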

Synchronous vs Asynchronous

/runsync: Waits for the result (up to 30s timeout)
/run: Returns immediately with a job ID; poll for results
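
For the asynchronous route, the client has to poll until the job finishes. The status route and its response format are not described here, so the sketch below takes a caller-supplied `fetch_status` function. The `"status"` field and the state names `COMPLETED`, `FAILED`, and `CANCELLED` are assumptions, not documented API values.

```python
import time

# Hypothetical terminal states; adjust to whatever the API actually returns.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(fetch_status, job_id, interval_s=1.0, timeout_s=60.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(job_id) until the job reaches a terminal state.

    fetch_status must return a dict with a "status" key (assumed shape).
    Raises TimeoutError if no terminal state is seen within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays. In production the defaults can be left as they are.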

Pricing

Serverless endpoints are billed per second of active compute time. Cold start time is not billed. Idle workers (kept warm when workersMin > 0) are billed at the standard GPU rate.
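
As a rough worked example of these rules, the following sketch estimates a bill from active and idle seconds. The per-second rates are placeholders rather than published prices, and keeping separate `active_rate` and `idle_rate` parameters is an assumption, in case the two rates differ.

```python
def estimate_cost(active_seconds, idle_seconds, active_rate, idle_rate,
                  cold_start_seconds=0.0):
    """Estimate cost in the currency of the given per-second rates.

    cold_start_seconds is accepted only to make explicit that it is
    excluded from the total, since cold start time is not billed.
    """
    values = (active_seconds, idle_seconds, active_rate, idle_rate, cold_start_seconds)
    if min(values) < 0:
        raise ValueError("durations and rates must be non-negative")
    return active_seconds * active_rate + idle_seconds * idle_rate


# Hypothetical figures: 120 s of inference, one hour of a warm idle worker,
# and a 15 s cold start that does not add to the bill.
example = estimate_cost(120, 3600, active_rate=0.0004, idle_rate=0.0002,
                        cold_start_seconds=15)
```

With workersMin set to 0, `idle_seconds` is zero and only the active term remains.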

Best Practices

Set workersMin: 0 for development endpoints to avoid idle charges
Set workersMin: 1 for production so one worker stays warm and baseline traffic does not hit a cold start
Use an appropriate GPU: the RTX 4090 is cost-effective for most inference workloads
Optimize your handler: minimize model load time for faster cold starts
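
One common way to keep model loading out of the request path is to load the model once per worker and reuse it across requests. The sketch below shows that pattern. The `handler(job)` signature, the `{"output": ...}` return shape, and the stand-in `load_model` are all hypothetical, since the handler interface is not described here.

```python
import functools


def load_model():
    """Hypothetical stand-in for an expensive model load (weights, tokenizer, ...)."""
    return lambda prompt: prompt.upper()  # placeholder "model"


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model on first use and cache it for the worker's lifetime."""
    return load_model()


def handler(job):
    """Hypothetical handler: reads the prompt from the request's "input" object."""
    prompt = job["input"]["prompt"]
    model = get_model()  # cached after the first call
    return {"output": model(prompt)}
```

Only the first request on a fresh worker pays the load cost. Later requests reuse the cached model.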