Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face
to power Hugging Chat, the Inference API and Inference Endpoints.

Table of contents

  • Get Started
    • Docker
    • API documentation
    • Using a private or gated model
    • A note on Shared Memory (shm)
    • Distributed Tracing
    • Architecture
    • Local install
    • Local install (Nix)
  • Optimized architectures
  • Run locally
    • Run
    • Quantization
  • Develop
  • Testing

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:

  • Simple launcher to serve most popular LLMs
  • Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
  • Tensor Parallelism for faster inference on multiple GPUs
  • Token streaming using Server-Sent Events (SSE)
  • Continuous batching of incoming requests for increased total throughput
  • Messages API compatible with the OpenAI Chat Completion API
  • Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
  • Quantization with:
    • bitsandbytes
    • GPT-Q
    • EETQ
    • AWQ
    • Marlin
    • fp8
  • Safetensors weight loading
  • Watermarking with A Watermark for Large Language Models
  • Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details see transformers.LogitsProcessor), as shown in the request sketch after this list
  • Stop sequences
  • Log probabilities
  • Speculation (~2x latency improvement)
  • Guidance/JSON. Specify the output format to speed up inference and make sure the output is valid according to a given specification.
  • Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
  • Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
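
As an illustration of the sampling-related features above, the sketch below shows a non-streaming request to the /generate route from Python. It is a minimal example that assumes the requests package is installed, a server is already running on 127.0.0.1:8080 (see the Docker quick start below), and uses parameter names from TGI's /generate schema.

# Minimal sketch of a /generate request exercising sampling parameters,
# stop sequences and log probabilities (assumes `requests` is installed and
# a TGI server is listening on 127.0.0.1:8080).
import requests

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 64,
        "do_sample": True,
        "temperature": 0.7,        # temperature scaling
        "top_k": 50,               # top-k filtering
        "top_p": 0.95,             # nucleus (top-p) sampling
        "repetition_penalty": 1.1,
        "stop": ["\n\n"],          # stop sequences
        "details": True,           # include per-token log probabilities
    },
}

response = requests.post("http://127.0.0.1:8080/generate", json=payload)
response.raise_for_status()
result = response.json()
print(result["generated_text"])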

Hardware support

  • Nvidia
  • AMD (-rocm)
  • Inferentia
  • Intel GPU
  • Gaudi
  • Google TPU

Get Started

Docker

For a detailed starting guide, please see the Quick Tour. The easiest way of getting started is using the official Docker container:

model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.4 --model-id $model

And then you can make requests like

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
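
The same streaming endpoint can be consumed from Python. Below is a minimal sketch, assuming the requests package is installed and the server is listening on 127.0.0.1:8080 as in the Docker command above; each Server-Sent Event carries a token object as in TGI's streaming response.

# Minimal streaming sketch for /generate_stream (assumes `requests` is
# installed and a TGI server is listening on 127.0.0.1:8080).
import json
import requests

payload = {"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}

with requests.post(
    "http://127.0.0.1:8080/generate_stream", json=payload, stream=True
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # Each Server-Sent Event line is prefixed with "data:".
        if line and line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="", flush=True)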

You can also use TGI's Messages API to obtain OpenAI Chat Completion API compatible responses.

curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
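
Because the Messages API follows the OpenAI Chat Completion schema, you can also point an OpenAI-style client at the server. Below is a minimal sketch assuming the openai Python package (v1 or later) is installed; the API key passed here is only a placeholder required by the client.

# Minimal sketch using the `openai` client against a local TGI server
# (assumes `pip install openai`; the API key is a placeholder).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

stream = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    stream=True,
    max_tokens=20,
)

for chunk in stream:
    # Each streamed chunk carries a delta with the newly generated text.
    print(chunk.choices[0].delta.content or "", end="", flush=True)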

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the --gpus all flag and add --disable-custom-kernels. Please note that CPU is not the intended platform for this project, so performance might be subpar.

Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, please use docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.3.4-rocm --model-id $model instead of the command above.

To see all options to serve your models (in the code or in the cli):

text-generation-launcher --help

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route.
The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

You can use the HF_TOKEN environment variable to configure the token used by
text-generation-inference. This allows you to gain access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

  1. Go to https://huggingface.co/settings/tokens
  2. Copy your CLI READ token
  3. Export HF_TOKEN=<your CLI READ token>

or with Docker:

model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.4 --model-id $model

A note on Shared Memory (shm)

NCCL is a communication framework used by
PyTorch to do distributed training/inference. text-generation-inference makes
use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of a NCCL group, NCCL might fall back to using the host memory if
peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g on the above command.

If you are running text-generation-inference inside Kubernetes, you can also add Shared Memory to the container by
creating a volume with:

- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi

and mounting it to /dev/shm.

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that
this will impact performance.

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the --otlp-endpoint argument. The default service name can be
overridden with the --otlp-service-name argument.

Architecture

Detailed blogpost by Adyen on TGI inner workings: LLM inference at scale with TGI (Martin Iglesias Goyanes – Adyen, 2024)

Local install

You can also opt to install text-generation-inference locally.

First clone the repository and change directory into it:

git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference

Then install Rust and create a Python virtual environment with at least
Python 3.9, e.g. using conda or python venv:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

#using conda
conda create -n text-generation-inference python=3.11
conda activate text-generation-inference

#using python venv
python3 -m venv .venv
source .venv/bin/activate

You may also need to install Protoc.

On Linux:

PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

On MacOS, using Homebrew:

brew install protobuf

Then run:

BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

sudo apt-get install libssl-dev gcc -y

Local install (Nix)

Another option is to install text-generation-inference locally using Nix. Currently,
we only support Nix on x86_64 Linux with CUDA GPUs. When using Nix, all dependencies can
be pulled from a binary cache, removing the need to build them locally.

First follow the instructions to install Cachix and enable the Hugging Face cache.
Setting up the cache is important, otherwise Nix will build many of the dependencies
locally, which can take hours.

After that you can run TGI with nix run:

cd text-generation-inference
nix run --extra-experimental-features nix-command --extra-experimental-features flakes . -- --model-id meta-llama/Llama-3.1-8B-Instruct

Note: when you are using Nix on a non-NixOS system, you have to make some symlinks
to make the CUDA driver libraries visible to Nix packages.

For TGI development, you can use the impure dev shell:

nix develop .#impure

# Only needed the first time the devshell is started or after updating the protobuf.
(
cd server
mkdir text_generation_server/pb || true
python -m grpc_tools.protoc -I../proto/v3 --python_out=text_generation_server/pb \
       --grpc_python_out=text_generation_server/pb --mypy_out=text_generation_server/pb ../proto/v3/generate.proto
find text_generation_server/pb/ -type f -name "*.py" -print0 -exec sed -i -e 's/^\(import.*pb2\)/from . \1/g' {} \;
touch text_generation_server/pb/__init__.py
)

All development dependencies (cargo, Python, Torch, etc.) are available in this
dev shell.

Optimized architectures

TGI works out of the box to serve optimized models for all modern architectures; they can be found in this list.

Other architectures are supported on a best-effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

or

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
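
For reference, the sketch below shows this best-effort path end to end. It is a minimal example that assumes the transformers and accelerate packages are installed and uses gpt2 purely as a placeholder model id.

# Minimal sketch of the best-effort fallback path (assumes `transformers`
# and `accelerate` are installed; replace the placeholder model id).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model id for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What is Deep Learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))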

Run locally

Run

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2

Quantization

You can also run pre-quantized weights (AWQ, GPTQ, Marlin) or quantize weights on the fly with bitsandbytes, EETQ, or fp8 to reduce the VRAM requirement:

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize

4bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command line argument to text-generation-launcher.

Read more about quantization in the Quantization documentation.

Develop

make server-dev
make router-dev

Testing

# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
