In the last blog, we discussed how to achieve low-latency inference with inference engines. In practice, however, we also need to handle concurrent requests and improve the throughput of ML services.

In this blog, we guide you through deploying a pre-trained or optimized model on a high-throughput inference server.
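As a quick preview, the sketch below shows roughly what such a deployment can look like with Ray Serve. It is a minimal illustration, not the exact code from the sections that follow: the torchvision resnet18 placeholder model, the JSON request format, and the `demo:app` module path are assumptions made here for the example.

```python
import torch
import torchvision
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2)  # scale out replicas to handle concurrent requests
class ImageClassifier:
    def __init__(self):
        # Placeholder: torchvision's resnet18 stands in for your own pre-trained model.
        self.model = torchvision.models.resnet18(weights="DEFAULT")
        self.model.eval()

    async def __call__(self, request: Request) -> dict:
        # Assumed request format: JSON body {"inputs": <nested floats, shape (N, 3, 224, 224)>}
        payload = await request.json()
        batch = torch.tensor(payload["inputs"], dtype=torch.float32)
        with torch.no_grad():
            logits = self.model(batch)
        return {"predicted_class": logits.argmax(dim=1).tolist()}


# Build the deployment graph; deploy with the Serve CLI, e.g.:
#   serve run demo:app        (assuming this file is saved as demo.py)
app = ImageClassifier.bind()
```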

Table of contents

Use Ray Serve to deploy a PyTorch model

Use Ray Serve to deploy TorchScript/ONNX/TensorRT models

Use Ray Serve and NVIDIA Triton together

Conclusion

In this blog, we have learned …

We recommend …

