Serverless Endpoints
Build auto-scaling GPU inference endpoints.
Serverless endpoints provide auto-scaling GPU inference that scales to zero when idle. You only pay for the compute time you use.
How It Works
- Create a template with your model and inference code
- Deploy an endpoint referencing that template
- Send requests — workers auto-scale based on queue depth
- Scale to zero — no charges when there are no requests
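The scaling step above can be pictured with a small sketch. The platform's real scaling policy is not documented here, so the function below is only an illustration: the name `desired_workers` and the `requests_per_worker` parameter are hypothetical, and the rule (one worker per batch of queued requests, clamped to the configured bounds) is an assumption.

```python
import math


def desired_workers(queue_depth: int, workers_min: int, workers_max: int,
                    requests_per_worker: int = 1) -> int:
    """Return a target worker count for a given queue depth (illustrative only)."""
    if workers_min < 0 or workers_max < workers_min:
        raise ValueError("expected 0 <= workers_min <= workers_max")
    if requests_per_worker < 1:
        raise ValueError("requests_per_worker must be >= 1")
    # Assumed rule: one worker per `requests_per_worker` queued requests, rounded up.
    needed = math.ceil(max(queue_depth, 0) / requests_per_worker)
    # Clamp to the configured bounds; workers_min = 0 allows scale-to-zero.
    return max(workers_min, min(workers_max, needed))
```

With `workers_min=0` and an empty queue the target is zero workers, which is the scale-to-zero case; with `workers_min=1` one worker stays warm even when nothing is queued.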
Creating an Endpoint
    curl -X POST https://api.gpuworker.com/endpoints \
      -H "Authorization: Bearer YOUR_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "llama-inference",
        "templateId": "your-template-id",
        "gpuIds": "NVIDIA GeForce RTX 4090",
        "workersMin": 0,
        "workersMax": 5
      }'
Worker Configuration
| Parameter | Description |
|---|---|
| `workersMin` | Minimum number of always-warm workers (0 for scale-to-zero) |
| `workersMax` | Maximum number of workers to scale up to under load |
| `gpuIds` | GPU type(s) to use for workers |
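A client can check these parameters before sending them. The sketch below is one possible way to do that: the `EndpointConfig` class is hypothetical, and the validation rules (`workersMin >= 0`, `workersMax >= 1`, `workersMax >= workersMin`) are assumptions inferred from the descriptions in the table, not documented server-side constraints.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointConfig:
    """Client-side view of the endpoint creation payload (hypothetical helper)."""
    name: str
    template_id: str
    gpu_ids: str
    workers_min: int = 0
    workers_max: int = 5

    def __post_init__(self) -> None:
        # Assumed constraints; the API may enforce different ones.
        if self.workers_min < 0:
            raise ValueError("workersMin must be >= 0")
        if self.workers_max < max(self.workers_min, 1):
            raise ValueError("workersMax must be >= 1 and >= workersMin")

    def to_payload(self) -> str:
        """Serialize using the JSON field names from the creation request."""
        return json.dumps({
            "name": self.name,
            "templateId": self.template_id,
            "gpuIds": self.gpu_ids,
            "workersMin": self.workers_min,
            "workersMax": self.workers_max,
        })
```

The string returned by `to_payload()` can be passed as the `-d` body of the creation request.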
Sending Requests
Once deployed, send inference requests to your endpoint:
    curl -X POST https://api.gpuworker.com/v2/{endpoint_id}/run \
      -H "Authorization: Bearer YOUR_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "input": {
          "prompt": "Hello, world!"
        }
      }'
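For callers who prefer Python to curl, the same request can be assembled with the standard library. This is a minimal sketch: `build_run_request` is a hypothetical helper name, and the function only builds the request object so it can be inspected without a live token. The response format is not covered in this section, so no parsing is shown.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.gpuworker.com/v2"


def build_run_request(token: str, endpoint_id: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    # The payload wraps user data in an "input" object, as in the curl body.
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    # Percent-encode the endpoint ID so it is safe to place in the URL path.
    url = f"{API_BASE}/{urllib.parse.quote(endpoint_id, safe='')}/run"
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually send it, pass the result to `urllib.request.urlopen(...)` with a real token and endpoint ID.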
Synchronous vs Asynchronous
- `/run`: Waits for the result (up to 30s timeout)
- `/runsync`: Returns immediately with a job ID; poll for results
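Polling for a job result can be sketched as a small loop. The status route and response schema are not described in this section, so everything specific here is an assumption: `fetch_status` is a caller-supplied function that performs the actual HTTP call, and the `status` field with the values `COMPLETED` and `FAILED` is hypothetical.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical terminal states; adjust to whatever the API actually returns.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def poll_until_done(fetch_status: Callable[[], Dict[str, Any]],
                    interval_s: float = 1.0,
                    timeout_s: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> Dict[str, Any]:
    """Call fetch_status() until the job reaches a terminal state or time runs out."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        # Wait between polls to avoid hammering the API.
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop easy to test without real delays.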
Pricing
Serverless endpoints are billed per second of active compute time. Cold start time is not billed. Idle workers (when `workersMin` > 0) are billed at the standard GPU rate.
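The billing rules above translate into simple arithmetic. The sketch below is an estimate only: the function name and the per-second rates are placeholders, not published prices, and treating the idle rate as equal to the active rate by default is an assumption.

```python
from typing import Optional


def estimate_cost(active_seconds: float,
                  idle_warm_seconds: float,
                  active_rate_per_s: float,
                  idle_rate_per_s: Optional[float] = None,
                  cold_start_seconds: float = 0.0) -> float:
    """Estimate a bill from usage in seconds (rates are caller-supplied placeholders)."""
    if idle_rate_per_s is None:
        # Assumption: the "standard GPU rate" for idle workers equals the active rate.
        idle_rate_per_s = active_rate_per_s
    # cold_start_seconds is accepted only to make explicit that it is not billed.
    _ = cold_start_seconds
    return active_seconds * active_rate_per_s + idle_warm_seconds * idle_rate_per_s
```

For example, 120 seconds of active compute at a placeholder rate of 0.0005 per second comes to 0.06, regardless of how long the cold start took.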
Best Practices
- Set `workersMin: 0` for development endpoints to avoid idle charges
- Set `workersMin: 1` for production to eliminate cold starts
- Use an appropriate GPU: the RTX 4090 is cost-effective for most inference workloads
- Optimize your handler: minimize model load time for faster cold starts
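One common way to keep model loading off the request path is to load the model once per worker process and reuse it for every request. The sketch below assumes a handler that receives the request body as a dict. The `handler(event)` signature is hypothetical, and the "model" is a stand-in function rather than a real model load.

```python
import functools
from typing import Callable, Dict


@functools.lru_cache(maxsize=1)
def get_model() -> Callable[[str], str]:
    """Load the model on first use; later calls return the cached object."""
    # Placeholder for an expensive load (for example, reading weights from disk).
    return lambda prompt: prompt.upper()


def handler(event: Dict) -> Dict:
    """Hypothetical handler: reads the prompt from event["input"] and runs the model."""
    model = get_model()  # cheap after the first call in this process
    prompt = event["input"]["prompt"]
    return {"output": model(prompt)}
```

With this layout only the first request on a fresh worker pays the load cost. Calling `get_model()` at import time instead would move that cost into worker startup.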