To effectively communicate with Triton Inference Server, especially when deploying on Windows, utilizing the Triton project’s client libraries is highly recommended. These libraries streamline the interaction process and offer robust tools for various inference tasks. This guide provides a comprehensive overview of how to leverage Triton client libraries on Windows, ensuring optimal performance and seamless integration. For any questions or issues, please refer to the main Triton issues page.
Triton offers client libraries in multiple languages:
- C++ Client Library: For high-performance applications requiring low latency.
- Python Client Library: For ease of use and rapid development, ideal for scripting and integration with Python-based ML workflows.
- Java Client Library: For Java-based applications and enterprise environments.
Numerous example applications are also available, demonstrating the practical usage of these libraries. Many examples utilize models from the example model repository, which is a great resource for getting started.
Obtaining Client Libraries and Examples for Windows
There are several methods to acquire the Triton client libraries for your Windows environment:
- Using pip (Python Package Installer): The simplest method to install the Python client library.
- Downloading from GitHub: Access pre-built client libraries directly from Triton’s GitHub releases.
- Downloading Docker Image from NGC: Obtain a Docker image from NVIDIA GPU Cloud (NGC) that includes client libraries.
- Building with CMake: Compile the client libraries from source using CMake, offering customization and flexibility, especially for Windows users who may need specific configurations.
Installation via Python Package Installer (pip) on Windows
For Windows users, pip provides a straightforward way to install the Python client libraries. Ensure you have a recent version of pip installed.
pip install tritonclient[all]
The `all` option installs both HTTP/REST and GRPC client libraries. You can also install support for a single protocol using the `grpc` or `http` options. For example, to install only the HTTP/REST client library:
pip install tritonclient[http]
To use the cuda_shared_memory utilities, include the `cuda` package. Note that `all` includes `cuda` by default.
pip install tritonclient[http,cuda]
The installed packages contain the following components:
- `http`: HTTP client library.
- `grpc`: GRPC client library, including `service_pb2`, `service_pb2_grpc`, and `model_config_pb2`.
- `utils`: Utility modules, with `shared_memory` and `cuda_shared_memory` for Linux distributions; the relevant shared memory functionality is also available for Windows.
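As a quick sanity check after installation, a minimal sketch that connects to a local server and checks liveness (assuming Triton is already running on the default HTTP port 8000):

import tritonclient.http as httpclient

# Connect to a locally running Triton server and verify it is reachable
client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_server_live())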
Downloading Pre-built Libraries from GitHub for Windows
Pre-built client libraries are available on the Triton GitHub release page. Locate the release version you need and find the “Assets” section. Client libraries are packaged in a tar file named according to the release version and OS, e.g., v2.3.0_ubuntu2004.clients.tar.gz. While the naming convention might suggest Ubuntu, these pre-built libraries can be adapted for Windows environments or used within a Windows-based Docker container.
mkdir clients
cd clients
wget <tarfile_path>
tar xzf <tarfile_name>
After extraction, you’ll find libraries in `lib/`, headers in `include/`, Python wheel files in `python/`, and Java JAR files in `java/`. The `bin/` and `python/` directories contain example applications that can be run on Windows or within Windows containers.
Utilizing Docker Image from NGC on Windows
A Docker image containing client libraries and examples is hosted on NVIDIA GPU Cloud (NGC). Ensure you have NGC access before proceeding. Refer to the NGC Getting Started Guide for setup instructions.
Use `docker pull` to retrieve the client libraries and examples container from NGC.
docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
Replace `<xx.yy>` with the desired version. Inside the container, client libraries are located at `/workspace/install/lib`, headers at `/workspace/install/include`, and Python wheel files at `/workspace/install/python`. The image also includes pre-built client examples, which can be very helpful for Windows users looking to deploy Triton in containerized environments.
Important Note for Windows Docker Users: When using Docker containers on Windows and employing CUDA shared memory, the `--pid host` flag is crucial during container launch. CUDA IPC APIs require distinct PIDs for the source and destination of exported pointers; Docker’s PID namespace can make those PIDs equal if not configured correctly, leading to errors when containers operate in non-interactive mode.
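For example, launching the SDK container shown above with the flag applied:
docker run -it --rm --pid host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk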
Building Client Libraries with CMake on Windows
Building client libraries using CMake offers customization, particularly beneficial for Windows environments.
1. Prerequisites: Ensure you have an appropriate C++ compiler and the necessary dependencies installed for Windows. The easiest approach is to use the `win10-py3-min` Docker image and build within a container launched from it.
docker run -it --rm win10-py3-min powershell
Alternatively, you can set up a Windows host system with the required dependencies.
2. CMake Configuration: Configure the build using CMake. If not using the `win10-py3-min` container, adjust the `CMAKE_TOOLCHAIN_FILE` path accordingly.
mkdir build
cd build
cmake -DVCPKG_TARGET_TRIPLET=x64-windows -DCMAKE_TOOLCHAIN_FILE='/vcpkg/scripts/buildsystems/vcpkg.cmake' -DCMAKE_INSTALL_PREFIX=install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_PYTHON_GRPC=ON -DTRITON_ENABLE_GPU=OFF -DTRITON_ENABLE_EXAMPLES=ON -DTRITON_ENABLE_TESTS=ON ..
For release branches (or development branches based on releases), include additional CMake arguments to specify release branch tags for dependent repositories. For instance, for the r21.10 client branch:
-DTRITON_COMMON_REPO_TAG=r21.10 -DTRITON_THIRD_PARTY_REPO_TAG=r21.10 -DTRITON_CORE_REPO_TAG=r21.10
3. Build Process: Use `msbuild.exe` to build the client libraries.
msbuild.exe cc-clients.vcxproj -p:Configuration=Release -clp:ErrorsOnly
msbuild.exe python-clients.vcxproj -p:Configuration=Release -clp:ErrorsOnly
Upon completion, libraries and examples are located in the `install` directory. This method is particularly useful for Windows users who need to compile specifically for their environment or want to customize build options.
Client Library APIs for Windows Development
The client libraries offer APIs in C++, Python, and Java, all compatible with Windows.
- C++ Client API: Features a class-based interface, detailed in `grpc_client.h`, `http_client.h`, and `common.h`.
- Python Client API: Mirrors the capabilities of the C++ API, with interfaces in the `grpc` and `http` modules.
- Java Client API: Provides similar functionality to the Python API. More details can be found in the Java client directory.
HTTP Options on Windows
SSL/TLS on Windows
Secure communication over HTTPS is supported. Ensure your Triton server on Windows is configured behind an https:// proxy such as nginx.
- C++ Client: `HttpSslOptions` struct in `http_client.h`.
- Python Client: Options in `http/__init__.py`: `ssl`, `ssl_options`, `ssl_context_factory`, `insecure`.
Examples in C++ and Python demonstrate SSL/TLS usage; a sketch follows.
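A minimal sketch of the Python options, assuming a proxy terminating TLS on port 443 (the URL is illustrative):

import tritonclient.http as httpclient

# Connect over HTTPS through a TLS-terminating proxy
client = httpclient.InferenceServerClient(
    url="localhost:443",
    ssl=True,
    insecure=False,  # verify the server certificate
)
print(client.is_server_live())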
Compression on Windows
HTTP compression is supported to improve performance, especially in Windows environments where network bandwidth might be a concern.
- C++ Client: `request_compression_algorithm` and `response_compression_algorithm` parameters in the `Infer` and `AsyncInfer` functions in `http_client.h`.
- Python Client: Corresponding parameters in the `infer` and `async_infer` functions in `http/__init__.py`.
C++ and Python examples illustrate compression options; a short sketch follows.
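A hedged sketch of a compressed request with the Python HTTP client; the client, model name, and prepared inputs are assumed from earlier examples:

result = client.infer(
    model_name="simple",
    inputs=inputs,
    request_compression_algorithm="gzip",
    response_compression_algorithm="gzip",
)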
Python AsyncIO Support (Beta) on Windows
Asynchronous operations are crucial for efficient Windows applications.
- The Python client supports `async` and `await` syntax for advanced users; see the infer example.
- SSL/TLS with AsyncIO: `ssl` and `ssl_context` options in `http/aio/__init__.py`.
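A minimal AsyncIO sketch, assuming a server on the default HTTP port; model-level calls are omitted:

import asyncio
import tritonclient.http.aio as aiohttpclient

async def main():
    # Awaitable variants of the synchronous client methods
    client = aiohttpclient.InferenceServerClient(url="localhost:8000")
    print(await client.is_server_live())
    await client.close()

asyncio.run(main())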
Python Client Plugin API (Beta) on Windows
Custom plugins can be registered to modify request headers, useful for integrating with gateways requiring extra headers, such as HTTP Authorization in Windows-based enterprise setups.
from tritonclient.http import InferenceServerClient

class MyPlugin:
    def __call__(self, request):
        # Add a custom header to every outgoing request
        request.headers['my-header-key'] = 'my-header-value'

client = InferenceServerClient(...)
my_plugin = MyPlugin()
client.register_plugin(my_plugin)
client.infer(...)
Unregister plugins using `client.unregister_plugin()`.
Basic Auth on Windows
Basic Authentication plugin is available.
from tritonclient.grpc.auth import BasicAuth
from tritonclient.grpc import InferenceServerClient
basic_auth = BasicAuth('username', 'password')
client = InferenceServerClient('...')
client.register_plugin(basic_auth)
GRPC Options on Windows
SSL/TLS on Windows
Secure GRPC communication is essential for production deployments on Windows.
- C++ Client: `SslOptions` struct in `grpc_client.h`.
- Python Client: Options in `grpc/__init__.py`: `ssl`, `root_certificates`, `private_key`, `certificate_chain`.
Examples are available in C++ and Python, and a sketch follows; server-side parameters are covered in the server documentation.
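A hedged sketch of the Python options, with hypothetical certificate file paths:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(
    url="localhost:8001",
    ssl=True,
    root_certificates="ca.crt",      # hypothetical PEM file paths
    private_key="client.key",
    certificate_chain="client.crt",
)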
Compression on Windows
GRPC compression can significantly improve performance, especially in Windows environments.
- C++ Client: `compression_algorithm` parameter in `Infer`, `AsyncInfer`, and `StartStream` in `grpc_client.h`.
- Python Client: `compression_algorithm` in `infer`, `async_infer`, and `start_stream` in `grpc/__init__.py`.
Examples are available in C++ and Python; server-side details are in the server documentation. A sketch of the Python parameter follows.
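Assuming a `grpcclient.InferenceServerClient` and prepared inputs as in the earlier examples (model name is hypothetical):

result = client.infer(
    model_name="simple",
    inputs=inputs,
    compression_algorithm="gzip",
)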
GRPC KeepAlive on Windows
KeepAlive parameters ensure connection stability, important for long-running Windows applications.
Examples are available in C++ and Python; server-side parameters are described in the server documentation. A sketch of the Python options follows.
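A sketch of KeepAlive configuration via the Python GRPC client; the values are illustrative, not recommendations:

import tritonclient.grpc as grpcclient

keepalive = grpcclient.KeepAliveOptions(
    keepalive_time_ms=10000,             # ping the server every 10 s
    keepalive_timeout_ms=5000,           # wait up to 5 s for a ping ack
    keepalive_permit_without_calls=True,
    http2_max_pings_without_data=2,
)
client = grpcclient.InferenceServerClient(
    url="localhost:8001",
    keepalive_options=keepalive,
)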
Custom GRPC Channel Arguments on Windows
For advanced Windows users, custom channel arguments are supported.
Examples are available in C++ and Python; a comprehensive list of channel arguments is available in the gRPC documentation. A sketch follows.
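For example, a hedged sketch that raises the maximum receive message size (the argument name comes from gRPC core; the value is illustrative):

import tritonclient.grpc as grpcclient

channel_args = [
    ("grpc.max_receive_message_length", 64 * 1024 * 1024),
]
client = grpcclient.InferenceServerClient(
    url="localhost:8001",
    channel_args=channel_args,
)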
Python AsyncIO Support (Beta) on Windows
AsyncIO support extends to GRPC for efficient Windows applications.
Request Cancellation on Windows
Request cancellation provides control over inflight requests, crucial for responsive Windows applications.
ctx = client.async_infer(...)
ctx.cancel()
For streaming:
client.start_stream()
for _ in range(10):
    client.async_stream_infer(...)
client.stop_stream(cancel_requests=True)
Details are in `grpc/_client.py`.
For GRPC AsyncIO:
infer_task = asyncio.create_task(aio_client.infer(...))
infer_task.cancel()
For AsyncIO streaming:
responses_iterator = aio_client.stream_infer(...)
responses_iterator.cancel()
Details are in `grpc/aio/__init__.py`. Server-side handling is described under request_cancellation in the server documentation; see the gRPC cancellation guide for the underlying mechanism.
GRPC Status Codes on Windows
Enhanced error reporting with gRPC error codes in streaming mode is available from release 24.08. Enable it by adding the header `triton_grpc_error: true`.
from functools import partial
import tritonclient.grpc as grpcclient

triton_client = grpcclient.InferenceServerClient(triton_server_url)
metadata = {"triton_grpc_error": "true"}
triton_client.start_stream(
    callback=partial(callback, user_data),  # callback and user_data defined elsewhere
    headers=metadata,
)
Server-side handling is described under gRPC error codes in the server documentation; see also gRPC’s status codes guide.
Simple Example Applications for Windows Testing
Several example applications illustrate key features and can be tested on Windows.
Bytes/String Datatype
Supports variable-length binary data tensors (BYTES datatype). The Python client uses NumPy arrays with the `np.object_` dtype for BYTES tensors.
Examples: C++ (`simple_http_string_infer_client.cc`, `simple_grpc_string_infer_client.cc`), Python (`simple_http_string_infer_client.py`, `simple_grpc_string_infer_client.py`). A short sketch follows.
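A hedged sketch of building a BYTES input tensor in Python (the input name is hypothetical):

import numpy as np
import tritonclient.http as httpclient

# Variable-length byte strings go into an object-dtype NumPy array
data = np.array([b"hello", b"world"], dtype=np.object_)
inp = httpclient.InferInput("INPUT0", list(data.shape), "BYTES")
inp.set_data_from_numpy(data)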
System Shared Memory on Windows
Improves performance by using system shared memory for tensor communication.
Examples: C++ (`simple_http_shm_client.cc`, `simple_grpc_shm_client.cc`), Python (`simple_http_shm_client.py`, `simple_grpc_shm_client.py`). Supporting utilities live in the Python system shared memory module; a sketch follows.
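A hedged sketch of the typical flow with the Python system shared memory module, assuming a POSIX shared-memory-capable environment such as the SDK container; region names and sizes are illustrative:

import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient(url="localhost:8000")
byte_size = 64

# Create a region, copy input data into it, then register it with the server
handle = shm.create_shared_memory_region("input_data", "/input_simple", byte_size)
shm.set_shared_memory_region(handle, [input0_data])  # input0_data: a NumPy array prepared elsewhere
client.register_system_shared_memory("input_data", "/input_simple", byte_size)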
CUDA Shared Memory on Windows
Further performance gains using CUDA shared memory. Requires CUDA-enabled Windows environment.
Examples: C++ (`simple_http_cudashm_client.cc`, `simple_grpc_cudashm_client.cc`), Python (`simple_http_cudashm_client.py`, `simple_grpc_cudashm_client.py`). The Python CUDA shared memory module supports NumPy arrays and DLPack tensors; a sketch follows.
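A hedged sketch of the CUDA shared memory flow in Python, assuming a CUDA-capable environment; names and sizes are illustrative:

import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")
byte_size = 64

# Create a CUDA region on device 0, copy data in, and register it
handle = cudashm.create_shared_memory_region("input_data", byte_size, 0)
cudashm.set_shared_memory_region(handle, [input0_data])  # input0_data: a NumPy array prepared elsewhere
client.register_cuda_shared_memory(
    "input_data", cudashm.get_raw_handle(handle), 0, byte_size
)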
Client API for Stateful Models
For stateful models, clients manage sequence IDs and start/end flags.
Examples: C++ (`simple_grpc_sequence_stream_infer_client.cc`), Python (`simple_grpc_sequence_stream_infer_client.py`). A sketch of the sequence flags follows.
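A hedged sketch of sequence handling with the Python GRPC streaming API; the model name, input helper, and sequence ID are hypothetical:

values = [1, 2, 3]
client.start_stream(callback=my_callback)  # my_callback defined elsewhere
for i, value in enumerate(values):
    client.async_stream_infer(
        model_name="simple_sequence",
        inputs=make_inputs(value),          # hypothetical helper building InferInput objects
        sequence_id=42,
        sequence_start=(i == 0),
        sequence_end=(i == len(values) - 1),
    )
client.stop_stream()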
Image Classification Example on Windows
The image classification example demonstrates practical usage and can be run on Windows. C++ client: `src/c++/examples/image_client.cc`; Python client: `src/python/examples/image_client.py`.
Requires a running Triton server with image classification models. See QuickStart for model repository setup.
Example usage:
image_client -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Python version usage:
python image_client.py -m inception_graphdef -s INCEPTION qa/images/mug.jpg
The `-i` flag selects the protocol (HTTP/REST by default; use `-i grpc` for GRPC), and `-u` sets the server endpoint:
image_client -i grpc -u localhost:8001 -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Use `-c` to set the number of classification results returned:
image_client -m inception_graphdef -s INCEPTION -c 3 qa/images/mug.jpg
Use `-b` to send batched requests:
image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images/mug.jpg
Directory processing:
image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images
A GRPC version is available as `grpc_image_client.py`.
Ensemble Image Classification Example Application
This example utilizes an ensemble model with DALI backend and TensorFlow Inception model, processing raw images directly. Refer to DALI ensemble example instructions for setup and usage details on Windows.
By following this guide, Windows users can effectively set up and utilize Triton Inference Server client libraries to build and deploy high-performance inference applications.