
Releases: vllm-project/vllm

v0.5.0

10 Jun 22:57
114332b
Pre-release

What's Changed

  • [CI/Build] CMakeLists: build all extensions' cmake targets at the same time by @dtrifiro in #5034
  • [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU by @tlrmchlsmth in #5137
  • [Kernel] Update Cutlass fp8 configs by @varun-sundar-rabindranath in #5144
  • [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py by @dashanji in #5151
  • [Bugfix] Fix call to init_logger in openai server by @NadavShmayo in #4765
  • [Feature][Kernel] Support bitsandbytes quantization and QLoRA by @chenqianfzh in #4776
  • [Bugfix] Remove deprecated @abstractproperty by @zhuohan123 in #5174
  • [Bugfix]: Fix issues related to prefix caching example (#5177) by @Delviet in #5180
  • [BugFix] Prevent LLM.encode for non-generation Models by @robertgshaw2-neuralmagic in #5184
  • Update test_ignore_eos by @simon-mo in #4898
  • [Frontend][OpenAI] Support for returning max_model_len on /v1/models response by @Avinash-Raj in #4643
  • [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer by @divakar-amd in #4927
  • [Misc] Simplify code and fix type annotations in conftest.py by @DarkLight1337 in #5118
  • [Core] Support image processor by @DarkLight1337 in #4197
  • [Core] Remove unnecessary copies in flash attn backend by @Yard1 in #5138
  • [Kernel] Pass a device pointer into the quantize kernel for the scales by @tlrmchlsmth in #5159
  • [CI/BUILD] enable intel queue for longer CPU tests by @zhouyuan in #4113
  • [Misc]: Implement CPU/GPU swapping in BlockManagerV2 by @Kaiyang-Chen in #3834
  • New CI template on AWS stack by @khluu in #5110
  • [FRONTEND] OpenAI tools support named functions by @br3no in #5032
  • [Bugfix] Support prompt_logprobs==0 by @toslunar in #5217
  • [Bugfix] Add warmup for prefix caching example by @zhuohan123 in #5235
  • [Kernel] Enhance MoE benchmarking & tuning script by @WoosukKwon in #4921
  • [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend by @afeldman-nm in #5210
  • [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor by @zifeitong in #5229
  • [CI/Build] Add inputs tests by @DarkLight1337 in #5215
  • [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend by @DamonFool in #5249
  • [Kernel] Add back batch size 1536 and 3072 to MoE tuning by @WoosukKwon in #5242
  • [CI/Build] Simplify model loading for HfRunner by @DarkLight1337 in #5251
  • [CI/Build] Reducing CPU CI execution time by @bigPYJ1151 in #5241
  • [CI] mark AMD test as softfail to prevent blockage by @simon-mo in #5256
  • [Misc] Add transformers version to collect_env.py by @mgoin in #5259
  • [Misc] update collect env by @youkaichao in #5261
  • [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True by @zifeitong in #5226
  • [Misc] Add CustomOp interface for device portability by @WoosukKwon in #5255
  • [Misc] Fix docstring of get_attn_backend by @WoosukKwon in #5271
  • [Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) by @tomeras91 in #5278
  • [CI] Add nightly benchmarks by @simon-mo in #5260
  • [misc] benchmark_serving.py -- add ITL results and tweak TPOT results by @tlrmchlsmth in #5263
  • [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size by @tlrmchlsmth in #5157
  • [Model] Correct Mixtral FP8 checkpoint loading by @comaniac in #5231
  • [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM by @DriverSong in #5207
  • [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 by @pcmoritz in #5238
  • [Docs] Add Sequoia as sponsors by @simon-mo in #5287
  • [Speculative Decoding] Add ProposerWorkerBase abstract class by @njhill in #5252
  • [BugFix] Fix log message about default max model length by @njhill in #5284
  • [Bugfix] Make EngineArgs use named arguments for config construction by @mgoin in #5285
  • [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. by @wuisawesome in #5290
  • [Misc] Skip for logits_scale == 1.0 by @WoosukKwon in #5291
  • [Docs] Add Ray Summit CFP by @simon-mo in #5295
  • [CI] Disable flash_attn backend for spec decode by @simon-mo in #5286
  • [Frontend][Core] Update Outlines Integration from FSM to Guide by @br3no in #4109
  • [CI/Build] Update vision tests by @DarkLight1337 in #5307
  • Bugfix: fix broken of download models from modelscope by @liuyhwangyh in #5233
  • [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 by @pcmoritz in #5294
  • [Frontend] enable passing multiple LoRA adapters at once to generate() by @mgoldey in #5300
  • [Core] Avoid copying prompt/output tokens if no penalties are used by @Yard1 in #5289
  • [Core] Change LoRA embedding sharding to support loading methods by @Yard1 in #5038
  • [Misc] Missing error message for custom ops import by @DamonFool in #5282
  • [Feature][Frontend]: Add support for stream_options in ChatCompletionRequest by @Etelis in #5135
  • [Misc][Utils] allow get_open_port to be called for multiple times by @youkaichao in #5333
  • [Kernel] Switch fp8 layers to use the CUTLASS kernels by @tlrmchlsmth in #5183
  • Remove Ray health check by @Yard1 in #4693
  • Addition of lacked ignored_seq_groups in _schedule_chunked_prefill by @JamesLim-sy in #5296
  • [Kernel] Dynamic Per-Token Activation Quantization by @dsikka in #5037
  • [Frontend] Add OpenAI Vision API Support by @ywang96 in #5237
  • [Misc] Remove unused cuda_utils.h in CPU backend by @DamonFool in #5345
  • fix DbrxFusedNormAttention missing cache_config by @Calvinnncy97 in #5340
  • [Bug Fix] Fix the support check for FP8 CUTLASS by @cli99 in #5352
  • [Misc] Add args for selecting distributed executor to benchmarks by @BKitor in #5335
  • [ROCm][AMD] Use pytorch sdpa math backend to do naive attention by @hongxiayang in #4965
  • [CI/Test] improve robustness of test by replacing del with context manager (hf_runner) by @youkaichao in #5347
  • [CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) by @youkaichao in #5357
  • [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale by @mgoin in #5353
  • [Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint by @youkaichao in #5074
  • [mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py by @youkaichao in #5361
  • [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops by @bnellnm in #5047
  • [Bugfix] Fix KeyError: 1 When Using LoRA adapters by @BlackBird-Coding in #5164
  • [Misc] Update to comply with the new compressed-tensors config by @dsikka in #5350
  • [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server by @ywang96 in #5374
  • [misc][typo] fix typo ...
Read more

v0.4.3

01 Jun 00:25
1197e02

Highlights

Model Support

LLM

  • Added support for Falcon (#5069)
  • Added support for IBM Granite Code models (#4636)
  • Added blocksparse flash attention kernel and Phi-3-Small model (#4799)
  • Added Snowflake arctic model implementation (#4652, #4889, #4690)
  • Supported Dynamic RoPE scaling (#4638)
  • Supported long context LoRA (#4787)

Embedding Models

  • Initial support for Embedding API with e5-mistral-7b-instruct (#3734); see the sketch below
  • Cross-attention KV caching and memory-management towards encoder-decoder model support (#4837)
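
A minimal sketch of the new Embedding API in the offline entrypoint; the model choice and prompt are illustrative, not a prescribed setup:

```python
# Rough sketch of the Embedding API added in #3734; model and prompt are illustrative.
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct")
outputs = llm.encode(["Hello, how are you today?"])
embedding = outputs[0].outputs.embedding  # a list of floats, one per hidden dimension
print(len(embedding))
```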

Vision Language Model

  • Add base class for vision-language models (#4809)
  • Consolidate prompt arguments to LLM engines (#4328)
  • LLaVA model refactor (#4910)

Hardware Support

AMD

  • Add fused_moe Triton configs (#4951)
  • Add support for Punica kernels (#3140)
  • Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)

Production Engine

Batch API

  • Support OpenAI batch file format (#4794)
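
A rough sketch of the OpenAI batch file format consumed by the new batch entrypoint; the file name, model name, and request body below are illustrative:

```python
# Build a JSONL batch file in the OpenAI batch format (one request object per line).
import json

requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [{"role": "user", "content": "Hello!"}],
        },
    },
]
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Then run the batch entrypoint from #4794 (command shown as an assumption):
#   python -m vllm.entrypoints.openai.run_batch \
#       -i batch_input.jsonl -o batch_output.jsonl \
#       --model meta-llama/Meta-Llama-3-8B-Instruct
```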

Making Ray Optional

  • Add MultiprocessingGPUExecutor (#4539)
  • Eliminate parallel worker per-step task scheduling overhead (#4894)

Automatic Prefix Caching

  • Accelerating the hashing function by avoiding deep copies (#4696)

Speculative Decoding

  • CUDA graph support (#4295)
  • Enable TP>1 speculative decoding (#4840)
  • Improve n-gram efficiency (#4724)
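
A hedged sketch of running speculative decoding offline with a small draft model; the model pair and token count are illustrative, and use_v2_block_manager is assumed to be required for spec decode in this release line:

```python
# Rough sketch: speculative decoding with a small draft model (names are illustrative).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",  # draft model proposing tokens for the target model
    num_speculative_tokens=5,               # proposals per scheduling step
    use_v2_block_manager=True,              # assumed requirement for spec decode in this release line
)
outputs = llm.generate(["The future of AI is"], SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```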

Performance Optimization

Quantization

  • Add GPTQ Marlin 2:4 sparse structured support (#4790)
  • Initial Activation Quantization Support (#4525)
  • Marlin prefill performance improvement (#4983)
  • Automatically Detect SparseML models (#5119)

Better Attention Kernel

  • Use flash-attn for decoding (#3648)

FP8

  • Improve FP8 linear layer performance (#4691)
  • Add w8a8 CUTLASS kernels (#4749)
  • Support for CUTLASS kernels in CUDA graphs (#4954)
  • Load FP8 kv-cache scaling factors from checkpoints (#4893)
  • Make static FP8 scaling more robust (#4570)
  • Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
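
A minimal sketch combining the FP8 linear layers and FP8 KV cache listed above, assuming an FP8-capable GPU such as H100; the model name is illustrative:

```python
# Rough sketch: dynamic FP8 weight quantization plus a float8_e4m3 KV cache.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",    # quantize a fp16/bf16 checkpoint to FP8 at load time
    kv_cache_dtype="fp8",  # store the KV cache in float8_e4m3
)
print(llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```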

Optimize Distributed Communication

  • change python dict to pytorch tensor (#4607)
  • change python dict to pytorch tensor for blocks to swap (#4659)
  • improve p2p access check (#4992)
  • remove vllm-nccl (#5091)
  • support both cpu and device tensor in broadcast tensor dict (#4660)

Extensible Architecture

Pipeline Parallelism

  • refactor custom allreduce to support multiple tp groups (#4754)
  • refactor pynccl to hold multiple communicators (#4591)
  • Support PP PyNCCL Groups (#4988)

What's Changed

Read more

v0.4.2

05 May 04:31
c7f2cf2

Highlights

Features

Models and Enhancements

Dependency Upgrade

  • Upgrade to torch==2.3.0 (#4454)
  • Upgrade to tensorizer==2.9.0 (#4467)
  • Expansion of AMD test suite (#4267)

Progress and Dev Experience

What's Changed

Read more

v0.4.1

24 Apr 02:28
468d761

Highlights

Features

  • Support and enhance CommandR+ (#3829), minicpm (#3893), Meta Llama 3 (#4175, #4182), Mixtral 8x22b (#4073, #4002)
  • Support private model registration and update our support policy (#3871, #3948)
  • Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
  • Add option for using LM Format Enforcer for guided decoding (#3868); see the sketch after this list
  • Add option to optionally initialize the tokenizer and detokenizer (#3748)
  • Add option to load models using tensorizer (#3476)
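
A hedged sketch of guided decoding against a running vLLM OpenAI-compatible server; the URL, model name, schema, and the per-request backend field are illustrative assumptions rather than a fixed recipe:

```python
# Rough sketch: request JSON-constrained output and select the LM Format Enforcer backend.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Describe a fictional person as JSON."}],
    extra_body={
        "guided_json": schema,                            # vLLM extension field
        "guided_decoding_backend": "lm-format-enforcer",  # assumed per-request selector (#3868)
    },
)
print(resp.choices[0].message.content)
```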

Enhancements

Hardware

  • Intel CPU inference backend is added (#3993, #3634)
  • AMD backend is enhanced with Triton kernel and e4m3fn KV cache (#3643, #3290)

What's Changed

  • [Kernel] Layernorm performance optimization by @mawong-amd in #3662
  • [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
  • [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
  • [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
  • [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
  • [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
  • [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
  • [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
  • [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
  • [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
  • [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
  • [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
  • [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
  • [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
  • Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
  • [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in #3798
  • [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
  • [Speculative decoding] Adding configuration object for speculative decoding by @cadedaniel in #3706
  • [BugFix] Use different mechanism to get vllm version in is_cpu() by @njhill in #3804
  • [Doc] Update README.md by @robertgshaw2-neuralmagic in #3806
  • [Doc] Update contribution guidelines for better onboarding by @michaelfeil in #3819
  • [3/N] Refactor scheduler for chunked prefill scheduling by @rkooo567 in #3550
  • Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by @AdrianAbeyta in #3290
  • [Misc] Publish 3rd meetup slides by @WoosukKwon in #3835
  • Fixes the argument for local_tokenizer_group by @sighingnow in #3754
  • [Core] Enable hf_transfer by default if available by @michaelfeil in #3817
  • [Bugfix] Add kv_scale input parameter to CPU backend by @WoosukKwon in #3840
  • [Core] [Frontend] Make detokenization optional by @mgerstgrasser in #3749
  • [Bugfix] Fix args in benchmark_serving by @CatherineSue in #3836
  • [Benchmark] Refactor sample_requests in benchmark_throughput by @gty111 in #3613
  • [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by @youkaichao in #3805
  • [Hardware][CPU] Update cpu torch to match default of 2.2.1 by @mgoin in #3854
  • [Model] Cohere CommandR+ by @saurabhdash2512 in #3829
  • [Core] improve robustness of pynccl by @youkaichao in #3860
  • [Doc]Add asynchronous engine arguments to documentation. by @SeanGallen in #3810
  • [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by @youkaichao in #3859
  • [Misc] Add pytest marker to opt-out of global test cleanup by @cadedaniel in #3863
  • [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by @cadedaniel in #3864
  • [Bugfix] Fixing requirements.txt by @noamgat in #3865
  • [Misc] Define common requirements by @WoosukKwon in #3841
  • Add option to completion API to truncate prompt tokens by @tdoublep in #3144
  • [Chunked Prefill][4/n] Chunked prefill scheduler. by @rkooo567 in #3853
  • [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by @Isotr0py in #3869
  • [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by @youkaichao in #3889
  • [Core] enable out-of-tree model register by @youkaichao in #3871
  • [WIP][Core] latency optimization by @youkaichao in #3890
  • [Bugfix] Fix Llava inference with Tensor Parallelism. by @Isotr0py in #3883
  • [Model] add minicpm by @SUDA-HLT-ywfang in #3893
  • [Bugfix] Added Command-R GPTQ support by @egortolmachev in #3849
  • [Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration by @Ki6an in #3767
  • [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by @mawong-amd in #3782
  • [BugFix][Model] Fix commandr RoPE max_position_embeddings by @esmeetu in #3919
  • [Core] separate distributed_init from worker by @youkaichao in #3904
  • [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by @cadedaniel in #3837
  • [Bugfix] Fix KeyError on loading GPT-NeoX by @jsato8094 in #3925
  • [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by @jpvillam-amd in #3643
  • [Misc] Avoid loading incorrect LoRA config by @jeejeelee in #3777
  • [Benchmark] Add cpu options to bench scripts by @PZD-CHINA in #3915
  • [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by @zhaotyer in #3955
  • [Bugfix] Fix logits processor when prompt_logprobs is not None by @huyiwen in #3899
  • [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by @tjohnson31415 in #3876
  • [Bugfix][ROCm] Add numba to Dockerfile.rocm by @WoosukKwon in #3962
  • [Model][AMD] ROCm support for 256 head dims for Gemma by @jamestwhedbee in #3972
  • [Doc] Add doc to state our model support policy by @youkaichao in #3948
  • [Bugfix] Remove key sorting for guided_json parameter in OpenAi compatible Server by @dmarasco in #3945
  • [Doc] Fix getting stared to use publicly available model by @fpaupier in #3963
  • [Bugfix] handle hf_config with architectures == None by @tjohnson31415 in #3982
  • [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by @youkaichao in #3950
  • [Core][5/N] Fully working chunked prefill e2e by @rkooo567 in #3884
  • [Core][Model] Use torch.compile to accelerate layernorm in commandr by @youkaichao in #3985
  • [Test] Add xformer and flash attn tests by @rkooo567 in #3961
  • [Misc] refactor ops and cache_ops layer by @jikunshang in #3913
  • [Doc][Installation] delete python setup.py develop by @youkaichao in #3989
  • [Ke...
Read more

v0.4.0.post1, restore sm70/75 support

02 Apr 20:01
a3c226e

Highlight

v0.4.0 lacked support for sm70/75. This hotfix restores it.

What's Changed

  • [Kernel] Layernorm performance optimization by @mawong-amd in #3662
  • [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
  • [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
  • [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
  • [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
  • [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
  • [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
  • [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
  • [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
  • [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
  • [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
  • [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
  • [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
  • [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
  • Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
  • [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in #3798
  • [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803

New Contributors

Full Changelog: v0.4.0...v0.4.0.post1

v0.4.0

30 Mar 01:54
51c31bc

Major changes

Models

Production features

  • Automatic prefix caching (#2762, #3703), which lets a long system prompt be cached and reused across requests. Use the flag --enable-prefix-caching to turn it on (see the sketch after this list).
  • Support for json_object in the OpenAI server for arbitrary JSON output, a --use-delay flag to improve time to first token across many requests, and min_tokens for EOS suppression.
  • Progress in chunked prefill scheduler (#3236, #3538), and speculative decoding (#3103).
  • Custom all reduce kernel has been re-enabled after more robustness fixes.
  • Replaced cupy dependency due to its bugs.
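
A minimal sketch of automatic prefix caching and the new min_tokens sampling option in the offline API; the server-side equivalent of the caching switch is --enable-prefix-caching, and the model name and prompts are illustrative:

```python
# Rough sketch: a long shared system prompt cached across requests, plus min_tokens.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enable_prefix_caching=True)
system = "You are a helpful assistant. " * 50  # long shared prefix, hashed and reused
prompts = [system + "What is vLLM?", system + "What is PagedAttention?"]
for out in llm.generate(prompts, SamplingParams(max_tokens=64, min_tokens=8)):
    print(out.outputs[0].text)
```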

Hardware

  • Improved Neuron support for AWS Inferentia.
  • CMake based build system for extensibility.

Ecosystem

  • Extensive serving benchmark refactoring (#3277)
  • Usage statistics collection (#2852)

What's Changed

Read more

v0.3.3

01 Mar 20:58
82091b8

Major changes

  • StarCoder2 support
  • Performance optimization and LoRA support for Gemma
  • 2/3/8-bit GPTQ support
  • Integrate Marlin Kernels for Int4 GPTQ inference (see the sketch after this list)
  • Performance optimization for MoE kernel
  • [Experimental] AWS Inferentia2 support
  • [Experimental] Structured output (JSON, Regex) in OpenAI Server
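
For the GPTQ items above, a small sketch of serving a quantized checkpoint; the quantization method is normally detected from the checkpoint config, and the model name is illustrative:

```python
# Rough sketch: load an Int4 GPTQ checkpoint (model name is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")
print(llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```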

What's Changed

New Contributors

Full Changelog: v0.3.2...v0.3.3

v0.3.2

21 Feb 19:50
8fbd84b

Major Changes

This version adds support for the OLMo and Gemma models, as well as the seed parameter.
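
A brief sketch of the per-request seed parameter for reproducible sampling; the model name is illustrative:

```python
# Rough sketch: the same seed should reproduce the same sample across calls.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2b-it")
params = SamplingParams(temperature=0.8, seed=42, max_tokens=32)
print(llm.generate(["Tell me a short story."], params)[0].outputs[0].text)
```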

What's Changed

New Contributors

Full Changelog: v0.3.1...v0.3.2

v0.3.1

16 Feb 23:06
5f08050

Major Changes

This version fixes the following major issues:

  • Memory leak with distributed execution (solved by using CuPy for collective communication).
  • Support for Python 3.8.

It also includes many smaller bug fixes, listed below.

What's Changed

New Contributors

Full Changelog: v0.3.0...v0.3.1

v0.3.0

31 Jan 08:07
1af090b

Major Changes

  • Experimental multi-LoRA support (see the sketch after this list)
  • Experimental prefix caching support
  • FP8 KV Cache support
  • Optimized MoE performance and Deepseek MoE support
  • CI tested PRs
  • Support batch completion in server
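
As referenced in the first item above, a hedged sketch of the experimental multi-LoRA support; the adapter name, ID, and path are placeholders:

```python
# Rough sketch: attach a LoRA adapter per request (adapter path is a placeholder).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sql_adapter = LoRARequest("sql-adapter", 1, "/path/to/sql-lora")  # name, unique int id, local path
outputs = llm.generate(
    ["Write a SQL query listing all users."],
    SamplingParams(max_tokens=64),
    lora_request=sql_adapter,
)
print(outputs[0].outputs[0].text)
```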

What's Changed

New Contributors

Read more