In the last blog, we discussed how to achieve low-latency inference with inference engines. In practice, however, we also need to handle concurrent requests and improve the throughput of ML services.
In this blog, we walk you through deploying a pre-trained or optimized model on a high-throughput inference server.
Table of contents
- Table of contents
- Use Ray Serve to deploy a PyTorch model
- Use Ray Serve to deploy TorchScript/ONNX/TensorRT models
- Use Ray Serve and NVIDIA Triton Together
- Conclusion
- References
Use Ray Serve to deploy a PyTorch model
Use Ray Serve to deploy TorchScript/ONNX/TensorRT models
Use Ray Serve and NVIDIA Triton Together
Conclusion
In this blog, we have learned …
We recommend …