In the last blog, we discussed how to achieve low-latency inference with inference engines. In practice, however, we also need to handle concurrent requests and improve the throughput of ML services.

In this blog, we guide you through deploying a pre-trained or optimized model on a high-throughput inference server.
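As a quick preview, the sketch below shows roughly what such a deployment can look like with Ray Serve. It is a minimal illustration, not the exact code from the sections that follow: the torchvision resnet18 placeholder model, the JSON request format, and the `demo:app` module path are assumptions made here for the example.

```python
import torch
import torchvision
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2)  # scale out replicas to handle concurrent requests
class ImageClassifier:
    def __init__(self):
        # Placeholder: torchvision's resnet18 stands in for your own pre-trained model.
        self.model = torchvision.models.resnet18(weights="DEFAULT")
        self.model.eval()

    async def __call__(self, request: Request) -> dict:
        # Assumed request format: JSON body {"inputs": <nested floats, shape (N, 3, 224, 224)>}
        payload = await request.json()
        batch = torch.tensor(payload["inputs"], dtype=torch.float32)
        with torch.no_grad():
            logits = self.model(batch)
        return {"predicted_class": logits.argmax(dim=1).tolist()}


# Build the deployment graph; deploy with the Serve CLI, e.g.:
#   serve run demo:app        (assuming this file is saved as demo.py)
app = ImageClassifier.bind()
```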

Table of contents

Use Ray Serve to deploy a PyTorch model

Use Ray Serve to deploy TorchScript/ONNX/TensorRT models

Use Ray Serve and NVIDIA Triton together

Conclusion

In this blog, we have learned …

We recommend …

