The Inference Server - Llama.cpp - CUDA - NVIDIA Container - Ubuntu 22
By: NI SP - High-End Remote Desktop and HPC
Latest Version: Inference-2023.12.26-NVIDIA-535.104-CUDA12.2.2-LLAMA.CPP-Ubu22
Product Overview
The Inference Server provides the full infrastructure to run fast LLM inference on GPUs.
It includes llama.cpp for inference, a recent CUDA toolkit, and the NVIDIA Container Toolkit for Docker.
Leverage the multitude of freely available models and run inference with 8-bit or lower quantized models, which makes inference possible on GPUs with e.g. 16 GB or 24 GB of memory.
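As a rough guide to what fits, the weight footprint of a quantized model is approximately its parameter count times the bits per weight. The back-of-the-envelope sketch below is illustrative only; the KV cache and runtime overhead add to these figures:

```python
def weight_memory_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# Illustrative model sizes and quantization levels.
for params, bits in [(7, 8), (13, 8), (13, 4), (70, 4)]:
    print(f"{params}B model at {bits}-bit: ~{weight_memory_gib(params, bits):.1f} GiB")

# 7B at 8-bit (~6.5 GiB) and 13B at 4-bit (~6.1 GiB) fit easily on a 16 GB GPU;
# 70B at 4-bit (~32.6 GiB) needs partial CPU offload or a larger card.
```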
Llama.cpp offers efficient inference of quantized models in interactive and server mode. It features:
- Plain C/C++ implementation without dependencies
- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support
- Running inference on GPU and CPU simultaneously, which allows larger models to run when GPU memory alone is insufficient (see the sketch after this list)
- AVX, AVX2 and AVX512 support for x86 architectures
- Supported models: LLaMA, LLaMA 2, Falcon, Alpaca, GPT4All, Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2, Vigogne (French), Vicuna, Koala, OpenBuddy (Multilingual), Pygmalion 7B / Metharme 7B, WizardLM, Baichuan-7B and its derivations (such as baichuan-7b-sft), Aquila-7B / AquilaChat-7B, Starcoder models, Mistral AI v0.1, Refact
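For illustration, here is a minimal sketch of mixed GPU/CPU inference using the llama-cpp-python bindings (also included on the server, see below); the model path and layer count are placeholders to adapt to your model and GPU:

```python
# Minimal sketch, assuming llama-cpp-python is installed (pip install llama-cpp-python)
# and a quantized GGUF model is available at the (hypothetical) path below.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=35,  # offload 35 layers to the GPU; the remaining layers run on the CPU
    n_ctx=4096,       # context window size
)

out = llm("Q: Explain quantization in one sentence. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```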
Here is our guide: How to use the NI SP Inference Server.
In addition, the Inference Server supports:
- llama-cpp-python: an OpenAI API compatible llama.cpp inference server (see the request sketch after this list)
- Open Interpreter: lets language models run code on your computer. An open-source, locally running implementation of OpenAI's Code Interpreter.
- Tabby coding assistant: a self-hosted AI coding assistant, offering an open-source alternative to GitHub Copilot
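As an example of the OpenAI-compatible API, the sketch below sends a chat request to a locally running llama-cpp-python server (started e.g. with python -m llama_cpp.server --model <model.gguf>); host, port, and payload are assumptions to adapt to your setup:

```python
# Minimal sketch: query the llama-cpp-python server's OpenAI-compatible
# chat endpoint. Assumes the server listens on its default port 8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```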
Includes remote desktop access via NICE DCV high-end remote desktops or via SSH (PuTTY, ...).
Version: Inference-2023.12.26-NVIDIA-535.104-CUDA12.2.2-LLAMA.CPP-Ubu22
Operating System: Linux/Unix, Ubuntu 22