- Nvidia NIMs are inference microservices that package up popular AI models along with all the APIs needed to run them at scale.
- They bundle inference engines like TensorRT-LLM together with tooling for authentication, health checks, monitoring, and other operational concerns.
- The model and its APIs are packaged as a container that can be orchestrated with Kubernetes, allowing deployment in the cloud, on-prem, or even on a local PC.
- NIMs reduce development time and make it easier to scale AI in any environment, deploying tools that augment human capabilities rather than replace them.
- Each NIM exposes an OpenAI-compatible API, so it can be called with the standard OpenAI SDK (see the Python sketch below).
- They can also be pulled as Docker containers and run locally, or deployed in the cloud and scaled to any workload (see the local-endpoint sketch below).
via How to self-host and hyperscale AI with Nvidia NIM
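
A minimal sketch of calling a NIM through the OpenAI Python SDK, assuming a locally hosted NIM at `http://localhost:8000/v1` serving `meta/llama3-8b-instruct`; the base URL, model name, and API key are illustrative and should be replaced with your deployment's values (or NVIDIA's hosted endpoint and a real API key).

```python
# Sketch: querying a NIM endpoint with the OpenAI Python SDK.
# NIMs expose an OpenAI-compatible API, so only base_url and model change.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
    api_key="not-needed-locally",         # hosted endpoints require a real key
)

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # assumed model served by this NIM
    messages=[
        {"role": "user", "content": "Summarize what a NIM is in one sentence."}
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```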
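For the self-hosted case, a second sketch probes a locally running NIM container before routing traffic to it. The health-check path here is an assumption based on common NIM container conventions (it is the kind of endpoint a Kubernetes readiness probe would hit); check the specific NIM's documentation for its exact routes.

```python
# Sketch: checking a local NIM container before sending inference traffic.
# Port and health-check path are assumptions; adjust to your container's docs.
import requests

BASE = "http://localhost:8000"

# Readiness probe (assumed path) -- the same endpoint an orchestrator like
# Kubernetes would poll to know the model has finished loading.
ready = requests.get(f"{BASE}/v1/health/ready", timeout=5)
print("ready:", ready.status_code == 200)

# List the models the microservice is serving (OpenAI-compatible route).
models = requests.get(f"{BASE}/v1/models", timeout=5)
print([m["id"] for m in models.json().get("data", [])])
```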