v0.5.0
Features
- server: add flash-attention-based version of Llama
- server: add flash-attention-based version of Santacoder
- server: support OPT models
- router: make router input validation optional
- docker: improve layer caching
Fixes
- server: improve token streaming decoding
- server: fix escape characters in stop sequences
- router: fix NCCL desync issues
- router: use buckets for metrics histograms