The Inference Server - Llama.cpp - CUDA - NVIDIA Container - Ubuntu 22
By: NI SP - High-End Remote Desktop and HPC
Latest Version: Inference-2023.12.26-NVIDIA-535.104-CUDA12.2.2-LLAMA.CPP-Ubu22
Product Overview
The Inference Server provides the full infrastructure to run fast LLM inference on GPUs.
It includes llama.cpp for inference, a recent CUDA toolkit, and the NVIDIA Container Toolkit for Docker.
Leverage the multitude of freely available models and run inference with 8-bit or lower quantized models, which makes inference possible on GPUs with e.g. 16 GB or 24 GB of memory.
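As a rough guide to what fits, the weight footprint of a quantized model is approximately its parameter count times the bits per weight. The back-of-the-envelope sketch below is illustrative only; the KV cache and runtime overhead add to these figures:

```python
def weight_memory_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# Illustrative model sizes and quantization levels.
for params, bits in [(7, 8), (13, 8), (13, 4), (70, 4)]:
    print(f"{params}B model at {bits}-bit: ~{weight_memory_gib(params, bits):.1f} GiB")

# 7B at 8-bit (~6.5 GiB) and 13B at 4-bit (~6.1 GiB) fit easily on a 16 GB GPU;
# 70B at 4-bit (~32.6 GiB) needs partial CPU offload or a larger card.
```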
Llama.cpp offers efficient inference of quantized models in interactive and server mode. It features:
- Plain C/C++ implementation without dependencies
- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support
- Running inference on GPU and CPU simultaneously, which allows larger models to run when GPU memory alone is insufficient (see the sketch after this list)
- AVX, AVX2 and AVX512 support for x86 architectures
- Supported models: LLaMA, LLaMA 2, Falcon, Alpaca, GPT4All, Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2, Vigogne (French), Vicuna, Koala, OpenBuddy (Multilingual), Pygmalion 7B / Metharme 7B, WizardLM, Baichuan-7B and its derivations (such as baichuan-7b-sft), Aquila-7B / AquilaChat-7B, Starcoder models, Mistral AI v0.1, Refact
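For illustration, here is a minimal sketch of mixed GPU/CPU inference using the llama-cpp-python bindings (also included on the server, see below); the model path and layer count are placeholders to adapt to your model and GPU:

```python
# Minimal sketch, assuming llama-cpp-python is installed (pip install llama-cpp-python)
# and a quantized GGUF model is available at the (hypothetical) path below.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=35,  # offload 35 layers to the GPU; the remaining layers run on the CPU
    n_ctx=4096,       # context window size
)

out = llm("Q: Explain quantization in one sentence. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```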
Here is our guide: How to use the NI SP Inference Server.
In addition, the Inference Server supports:
- llama-cpp-python: an OpenAI API compatible llama.cpp inference server (see the request sketch after this list)
- Open Interpreter: lets language models run code on your computer. An open-source, locally running implementation of OpenAI's Code Interpreter.
- Tabby coding assistant: a self-hosted AI coding assistant, offering an open-source alternative to GitHub Copilot
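As an example of the OpenAI-compatible API, the sketch below sends a chat request to a locally running llama-cpp-python server (started e.g. with python -m llama_cpp.server --model <model.gguf>); host, port, and payload are assumptions to adapt to your setup:

```python
# Minimal sketch: query the llama-cpp-python server's OpenAI-compatible
# chat endpoint. Assumes the server listens on its default port 8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```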
Includes remote desktop access via NICE DCV high-end remote desktops or via SSH (PuTTY, ...).
Version: Inference-2023.12.26-NVIDIA-535.104-CUDA12.2.2-LLAMA.CPP-Ubu22
Operating System: Linux/Unix, Ubuntu 22