Getting Started with Triton Inference Server Client Libraries on Windows

To effectively communicate with Triton Inference Server, especially when deploying on Windows, utilizing the Triton project’s client libraries is highly recommended. These libraries streamline the interaction process and offer robust tools for various inference tasks. This guide provides a comprehensive overview of how to leverage Triton client libraries on Windows, ensuring optimal performance and seamless integration. For any questions or issues, please refer to the main Triton issues page.

Triton offers client libraries in multiple languages:

  • C++ Client Library: For high-performance applications requiring low latency.
  • Python Client Library: For ease of use and rapid development, ideal for scripting and integration with Python-based ML workflows.
  • Java Client Library: For Java-based applications and enterprise environments.

Numerous example applications are also available, demonstrating the practical usage of these libraries. Many examples utilize models from the example model repository, which is a great resource for getting started.

Obtaining Client Libraries and Examples for Windows

There are several methods to acquire the Triton client libraries for your Windows environment:

  • Using pip (Python Package Installer): The simplest method to install the Python client library.
  • Downloading from GitHub: Access pre-built client libraries directly from Triton’s GitHub releases.
  • Downloading Docker Image from NGC: Obtain a Docker image from NVIDIA GPU Cloud (NGC) that includes client libraries.
  • Building with CMake: Compile the client libraries from source using CMake, offering customization and flexibility, especially for Windows users who may need specific configurations.

Installation via Python Package Installer (pip) on Windows

For Windows users, pip provides a straightforward way to install the Python client libraries. Ensure you have a recent version of pip installed.

pip install tritonclient[all]

The all extra installs both the HTTP/REST and GRPC client libraries. You can also install support for a single protocol using the grpc or http extras. For example, to install only the HTTP/REST client library:

pip install tritonclient[http]

To use the cuda_shared_memory utilities, also include the cuda extra. Note that all already includes cuda.

pip install tritonclient[http,cuda]

The installed packages contain the following components:

  • http: HTTP client library.
  • grpc: GRPC client library, including service_pb2, service_pb2_grpc, and model_config_pb2.
  • utils: Utility modules, including shared_memory and cuda_shared_memory. These modules are documented for Linux, so verify that the shared memory features you need work in your specific Windows environment before relying on them.
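
To confirm the installation, a quick smoke test like the following can be run against a running server (the localhost:8000 URL is an assumption; substitute your server's address):

import tritonclient.http as httpclient

# Connect to a Triton server over HTTP/REST (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Both calls return True when the server is up and ready to serve models.
print("live:", client.is_server_live())
print("ready:", client.is_server_ready())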

Downloading Pre-built Libraries from GitHub for Windows

Pre-built client libraries are available on the Triton GitHub release page. Locate the release version you need and find the “Assets” section. Client libraries are packaged in a tar file named according to the release version and OS, e.g., v2.3.0_ubuntu2004.clients.tar.gz. Note the OS in the file name: an Ubuntu-named tar file contains Linux binaries, so from Windows it is best used inside a Linux container; for native Windows binaries, build from source with CMake as described below.

mkdir clients
cd clients
wget <tarfile_path>
tar xzf <tarfile_name>

After extraction, you’ll find libraries in lib/, headers in include/, Python wheel files in python/, and Java JAR files in java/. The bin/ and python/ directories contain example applications that can be run on Windows or within Windows containers.

Utilizing Docker Image from NGC on Windows

A Docker image containing client libraries and examples is hosted on NVIDIA GPU Cloud (NGC). Ensure you have NGC access before proceeding. Refer to the NGC Getting Started Guide for setup instructions.

Use docker pull to retrieve the client libraries and examples container from NGC.

docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk

Replace <xx.yy> with the desired version. Inside the container, client libraries are located at /workspace/install/lib, headers at /workspace/install/include, and Python wheel files at /workspace/install/python. The image also includes pre-built client examples, which can be very helpful for Windows users looking to deploy Triton in containerized environments.

Important Note for Windows Docker Users: When using Docker containers on Windows and employing CUDA shared memory, the --pid host flag is crucial during container launch. This is because CUDA IPC APIs require distinct PIDs for the source and destination of exported pointers. Docker’s PID namespace can cause PID equality if not configured correctly, leading to errors when containers operate in non-interactive mode.

Building Client Libraries with CMake on Windows

Building client libraries using CMake offers customization, particularly beneficial for Windows environments.

  1. Prerequisites: Ensure you have an appropriate C++ compiler and necessary dependencies installed for Windows. The easiest approach is to build inside a container launched from the win10-py3-min Docker image.

    docker run -it --rm win10-py3-min powershell

    Alternatively, you can set up a Windows host system with the required dependencies.

  2. CMake Configuration: Configure the build using CMake. If not using the win10-py3-min container, adjust CMAKE_TOOLCHAIN_FILE path accordingly.

    mkdir build
    cd build
    cmake -DVCPKG_TARGET_TRIPLET=x64-windows -DCMAKE_TOOLCHAIN_FILE='/vcpkg/scripts/buildsystems/vcpkg.cmake' -DCMAKE_INSTALL_PREFIX=install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_PYTHON_GRPC=ON -DTRITON_ENABLE_GPU=OFF -DTRITON_ENABLE_EXAMPLES=ON -DTRITON_ENABLE_TESTS=ON ..

    For release branches (or development branches based on releases), include additional CMake arguments to specify release branch tags for dependent repositories. For instance, for the r21.10 client branch:

    -DTRITON_COMMON_REPO_TAG=r21.10 -DTRITON_THIRD_PARTY_REPO_TAG=r21.10 -DTRITON_CORE_REPO_TAG=r21.10
  3. Build Process: Use msbuild.exe to build the client libraries.

    msbuild.exe cc-clients.vcxproj -p:Configuration=Release -clp:ErrorsOnly
    msbuild.exe python-clients.vcxproj -p:Configuration=Release -clp:ErrorsOnly

    Upon completion, libraries and examples are located in the install directory. This method is particularly useful for Windows users needing to compile specifically for their Windows environment or wanting to customize build options.

Client Library APIs for Windows Development

The client libraries offer APIs in C++, Python, and Java, all compatible with Windows.

  • C++ Client API: Features a class-based interface, detailed in grpc_client.h, http_client.h, and common.h.

  • Python Client API: Mirrors the capabilities of the C++ API, with interfaces in the grpc and http modules; a minimal usage sketch follows this list.

  • Java Client API: Provides similar functionalities to the Python API. More details can be found in the Java client directory.
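
As a quick illustration of the Python API's shape, here is a minimal HTTP inference sketch. The model name my_model, its INPUT0/OUTPUT0 tensors of shape [1, 16], and the localhost URL are placeholders to adapt to your deployment:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the input tensor and attach data from a NumPy array.
input0 = httpclient.InferInput("INPUT0", [1, 16], "INT32")
input0.set_data_from_numpy(np.arange(16, dtype=np.int32).reshape(1, 16))

# Request a specific output tensor by name.
output0 = httpclient.InferRequestedOutput("OUTPUT0")

result = client.infer(model_name="my_model", inputs=[input0], outputs=[output0])
print(result.as_numpy("OUTPUT0"))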

HTTP Options on Windows

SSL/TLS on Windows

Secure communication over HTTPS is supported. Ensure your Triton server on Windows is configured behind an https:// proxy like nginx.

  • C++ Client: HttpSslOptions struct in http_client.h.
  • Python Client: The ssl, ssl_options, ssl_context_factory, and insecure options in http/__init__.py.

The C++ and Python examples demonstrate SSL/TLS usage.
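
A minimal Python sketch follows; the URL and the use of gevent's default HTTPS context factory are assumptions to adapt to your proxy setup:

import gevent.ssl
import tritonclient.http as httpclient

# ssl=True switches the client to HTTPS; the factory builds the SSL context.
client = httpclient.InferenceServerClient(
    url="localhost:443",
    ssl=True,
    ssl_context_factory=gevent.ssl._create_default_https_context,
)
print(client.is_server_live())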

Compression on Windows

HTTP compression is supported to improve performance, especially in Windows environments where network bandwidth might be a concern.

  • C++ Client: request_compression_algorithm and response_compression_algorithm parameters of the Infer and AsyncInfer functions in http_client.h.
  • Python Client: Corresponding parameters of the infer and async_infer functions in http/__init__.py.

The C++ and Python examples illustrate the compression options.
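
For instance, with the Python HTTP client (model name, tensor, and URL are placeholders; gzip is one of the supported algorithms):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
input0 = httpclient.InferInput("INPUT0", [1, 16], "INT32")
input0.set_data_from_numpy(np.arange(16, dtype=np.int32).reshape(1, 16))

# Compress the request body and ask the server to compress the response.
result = client.infer(
    model_name="my_model",
    inputs=[input0],
    request_compression_algorithm="gzip",
    response_compression_algorithm="gzip",
)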

Python AsyncIO Support (Beta) on Windows

Asynchronous operations are crucial for efficient Windows applications.

  • The Python client supports async and await syntax for advanced users; see the infer example and the sketch after this list.
  • SSL/TLS with AsyncIO: the ssl and ssl_context options in http/aio/__init__.py.
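
A minimal sketch, assuming a server at localhost:8000; note that the AsyncIO client lives in the tritonclient.http.aio module and should be closed explicitly:

import asyncio
import tritonclient.http.aio as aiohttpclient

async def main():
    client = aiohttpclient.InferenceServerClient(url="localhost:8000")
    try:
        # Awaitable counterpart of the synchronous is_server_live().
        print("live:", await client.is_server_live())
    finally:
        await client.close()

asyncio.run(main())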

Python Client Plugin API (Beta) on Windows

Custom plugins can be registered to modify request headers, useful for integrating with gateways requiring extra headers, such as HTTP Authorization in Windows-based enterprise setups.

from tritonclient.http import InferenceServerClient

class MyPlugin:
    def __call__(self, request):
        # Called for every outgoing request; mutate headers as needed.
        request.headers['my-header-key'] = 'my-header-value'

client = InferenceServerClient(...)
my_plugin = MyPlugin()
client.register_plugin(my_plugin)
# Every subsequent request now carries the extra header.
client.infer(...)

Unregister plugins using client.unregister_plugin().

Basic Auth on Windows

Basic Authentication plugin is available.

from tritonclient.grpc.auth import BasicAuth
from tritonclient.grpc import InferenceServerClient

# BasicAuth is a pre-built plugin that adds an HTTP Basic Authorization header.
basic_auth = BasicAuth('username', 'password')
client = InferenceServerClient('...')
client.register_plugin(basic_auth)

GRPC Options on Windows

SSL/TLS on Windows

Secure GRPC communication is essential for production deployments on Windows.

  • C++ Client: SslOptions struct in grpc_client.h.
  • Python Client: The ssl, root_certificates, private_key, and certificate_chain options in grpc/__init__.py.

Examples are available in C++ and Python. Server-side parameters are covered in the server documentation.
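
A minimal Python sketch; the URL and the certificate file paths are placeholders for your own PEM-encoded files:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(
    url="localhost:8001",
    ssl=True,
    root_certificates="ca.crt",      # CA certificate (placeholder path)
    private_key="client.key",        # client private key (placeholder path)
    certificate_chain="client.crt",  # client certificate (placeholder path)
)
print(client.is_server_live())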

Compression on Windows

GRPC compression can significantly improve performance, especially in Windows environments.

  • C++ Client: compression_algorithm parameter of Infer, AsyncInfer, and StartStream in grpc_client.h.
  • Python Client: compression_algorithm parameter of infer, async_infer, and start_stream in grpc/__init__.py.

Examples are available in C++ and Python. Server-side details are in the server documentation.
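
For instance, with the Python GRPC client (model name, tensor, and URL are placeholders):

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
input0 = grpcclient.InferInput("INPUT0", [1, 16], "INT32")
input0.set_data_from_numpy(np.arange(16, dtype=np.int32).reshape(1, 16))

# Compress the request; supported values include "gzip" and "deflate".
result = client.infer(
    model_name="my_model",
    inputs=[input0],
    compression_algorithm="gzip",
)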

GRPC KeepAlive on Windows

KeepAlive parameters ensure connection stability, important for long-running Windows applications.

  • A KeepAliveOptions struct/class is provided in both the C++ and Python client libraries.

Examples are available in C++ and Python. Server-side parameters are covered in the server documentation.
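
A minimal Python sketch; the field values below are assumptions to tune for your network:

import tritonclient.grpc as grpcclient

# Control how often the client pings the server to keep the channel alive.
keepalive = grpcclient.KeepAliveOptions(
    keepalive_time_ms=2**31 - 1,  # effectively disables client pings
    keepalive_timeout_ms=20000,
    keepalive_permit_without_calls=False,
    http2_max_pings_without_data=2,
)
client = grpcclient.InferenceServerClient(
    url="localhost:8001", keepalive_options=keepalive
)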

Custom GRPC Channel Arguments on Windows

For advanced Windows users, custom channel arguments are supported.

Examples are available in C++ and Python. A comprehensive list of channel arguments is available in the gRPC documentation.
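
A sketch of the idea in Python, assuming the channel_args keyword accepted by recent tritonclient releases (the specific argument and value here are illustrative only):

import tritonclient.grpc as grpcclient

# Each entry is a (key, value) pair passed straight to the gRPC channel.
client = grpcclient.InferenceServerClient(
    url="localhost:8001",
    channel_args=[("grpc.max_receive_message_length", 16 * 1024 * 1024)],
)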

Python AsyncIO Support (Beta) on Windows

AsyncIO support extends to GRPC for efficient Windows applications.

Examples: infer and stream.

Request Cancellation on Windows

Request cancellation provides control over inflight requests, crucial for responsive Windows applications.

ctx = client.async_infer(...)
ctx.cancel()

For streaming:

client.start_stream()
for _ in range(10):
    client.async_stream_infer(...)
client.stop_stream(cancel_requests=True)

Details are in grpc/_client.py.

For GRPC AsyncIO:

infer_task = asyncio.create_task(aio_client.infer(...))
infer_task.cancel()

For AsyncIO streaming:

responses_iterator = aio_client.stream_infer(...)
responses_iterator.cancel()

Details are in grpc/aio/__init__.py. Server-side handling is described under request_cancellation, and the gRPC cancellation guide provides background.

GRPC Status Codes on Windows

Streaming mode supports enhanced error reporting via gRPC error codes, available from release 24.08. Enable it by adding the header triton_grpc_error: true.

from functools import partial
import tritonclient.grpc as grpcclient

triton_client = grpcclient.InferenceServerClient(triton_server_url)

# Opt in to per-request gRPC status codes on the stream.
metadata = {"triton_grpc_error": "true"}
triton_client.start_stream(
    callback=partial(callback, user_data),
    headers=metadata
)

Server-side handling is described under grpc error codes; see also the gRPC status-codes guide.

Simple Example Applications for Windows Testing

Several example applications illustrate key features and can be tested on Windows.

Bytes/String Datatype

Supports variable-length binary data tensors (BYTES datatype). Python client uses NumPy with np.object_ dtype for BYTES tensors.

Examples: C++ (simple_http_string_infer_client.cc, simple_grpc_string_infer_client.cc), Python (simple_http_string_infer_client.py, simple_grpc_string_infer_client.py).
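A minimal sketch of building a BYTES input in Python (the tensor name and shape are placeholders):

import numpy as np
import tritonclient.http as httpclient

# BYTES tensors are passed as NumPy arrays of dtype np.object_.
data = np.array([b"hello", b"triton"], dtype=np.object_).reshape(1, 2)

input0 = httpclient.InferInput("INPUT0", list(data.shape), "BYTES")
input0.set_data_from_numpy(data)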

System Shared Memory on Windows

Improves performance by using system shared memory for tensor communication.

Examples: C++ (simple_http_shm_client.cc, simple_grpc_shm_client.cc), Python (simple_http_shm_client.py, simple_grpc_shm_client.py). Python shared memory module: system shared memory module.
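The flow is roughly: create a region, copy tensors into it, register it with the server, then reference it from an input. A sketch with the Python module (region names, key, and sizes are placeholders; note the earlier caveat about shared memory platform support on Windows):

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient(url="localhost:8000")

input_data = np.arange(16, dtype=np.int32).reshape(1, 16)
byte_size = input_data.size * input_data.itemsize

# Create a system shared memory region and copy the tensor into it.
shm_handle = shm.create_shared_memory_region("input_data", "/input_simple", byte_size)
shm.set_shared_memory_region(shm_handle, [input_data])

# Register the region with the server so requests can reference it by name.
client.register_system_shared_memory("input_data", "/input_simple", byte_size)

# The input now points at the shared memory region instead of carrying data.
input0 = httpclient.InferInput("INPUT0", [1, 16], "INT32")
input0.set_shared_memory("input_data", byte_size)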

CUDA Shared Memory on Windows

Further performance gains using CUDA shared memory. Requires CUDA-enabled Windows environment.

Examples: C++ (simple_http_cudashm_client.cc, simple_grpc_cudashm_client.cc), Python (simple_http_cudashm_client.py, simple_grpc_cudashm_client.py). Python CUDA shared memory module: CUDA shared memory module. Supports NumPy arrays (example usage) and DLPack tensors (example usage).
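A sketch of receiving an output through CUDA shared memory in Python (region name, size, and GPU device 0 are placeholders):

import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")

byte_size = 64  # placeholder size in bytes

# Allocate a CUDA shared memory region on GPU 0.
handle = cudashm.create_shared_memory_region("output_data", byte_size, 0)

# Register the region with the server using its raw CUDA IPC handle.
client.register_cuda_shared_memory(
    "output_data", cudashm.get_raw_handle(handle), 0, byte_size
)

# Ask the server to write the output directly into the region.
output0 = httpclient.InferRequestedOutput("OUTPUT0")
output0.set_shared_memory("output_data", byte_size)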

Client API for Stateful Models

For stateful models, clients manage sequence IDs and start/end flags.

Examples: C++ (simple_grpc_sequence_stream_infer_client.cc), Python (simple_grpc_sequence_stream_infer_client.py).
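Each request in a sequence carries the same sequence_id, with flags marking the first and last requests. A sketch with the Python GRPC streaming client (the model name, tensor, and sequence ID are placeholders):

import numpy as np
import tritonclient.grpc as grpcclient

def my_callback(result, error):
    # Invoked once per response on the stream.
    print(error or result.get_response())

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=my_callback)

input0 = grpcclient.InferInput("INPUT", [1, 1], "INT32")
input0.set_data_from_numpy(np.array([[0]], dtype=np.int32))

# First request in the sequence: sequence_start=True opens sequence 42.
client.async_stream_infer(
    model_name="my_sequence_model",
    inputs=[input0],
    sequence_id=42,
    sequence_start=True,
    sequence_end=False,
)

# Final request: sequence_end=True releases the sequence slot.
client.async_stream_infer(
    model_name="my_sequence_model",
    inputs=[input0],
    sequence_id=42,
    sequence_start=False,
    sequence_end=True,
)

client.stop_stream()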

Image Classification Example on Windows

The image classification example demonstrates practical usage and can be run on Windows. C++ client: src/c++/examples/image_client.cc, Python client: src/python/examples/image_client.py.

Requires a running Triton server with image classification models. See QuickStart for model repository setup.

Example usage:

image_client -m inception_graphdef -s INCEPTION qa/images/mug.jpg

The command prints the top classification result for the mug image.

Python version usage:

python image_client.py -m inception_graphdef -s INCEPTION qa/images/mug.jpg

The -i flag selects the protocol (HTTP/REST by default; pass -i grpc for GRPC), and -u sets the endpoint URL, for example:

image_client -i grpc -u localhost:8001 -m inception_graphdef -s INCEPTION qa/images/mug.jpg

Use -c to request multiple classification results:

image_client -m inception_graphdef -s INCEPTION -c 3 qa/images/mug.jpg

Use -b to send a batched request (the single image is replicated to fill the batch):

image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images/mug.jpg

Pass a directory instead of a single image to classify every image in it:

image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images

A GRPC-specific version is available as grpc_image_client.py.

Ensemble Image Classification Example Application

This example utilizes an ensemble model with DALI backend and TensorFlow Inception model, processing raw images directly. Refer to DALI ensemble example instructions for setup and usage details on Windows.

By following this guide, Windows users can effectively set up and utilize Triton Inference Server client libraries to build and deploy high-performance inference applications.
