LLM inference, RRTF, ToolLLM and more
Welcome to the eighth edition of our newsletter The Token! In this episode we take a brief look at the various options that can speed up LLM inference, a new technique that could act as a replacement for RLHF, and a new LLM adapted to use more than 16K tools 🛠 . We also discuss a new prompting technique that can parallelise LLM word generation, as well as a new assistant from Dimensions, the most popular tool used by funders and research institutions to analyse their data.
As ever, let us know what you think, and if you find yourself in need of help with an NLP problem, get in touch at hi@mantisnlp.com.
✍️ LLM latency
Given the size of Large Language Models, latency is a concern for some use cases. A new blog post explores popular tools for serving LLMs and their relative speeds: tools such as Text Generation Inference (TGI) from Hugging Face, exllama, ctranslate2 and vLLM, and techniques such as quantisation via bitsandbytes or GPTQ, as well as paged attention.
Even though this is not a rigorous benchmark, TGI and the Hugging Face ecosystem (the transformers library and Inference Endpoints) output around 20 tokens per second on an A6000, with vLLM roughly doubling that at ~40 tokens per second and ctranslate2 roughly tripling it at ~60 tokens per second. This matches our own experience, we have to say.
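If you want to try one of these yourself, here is a minimal sketch of generating text with vLLM, which implements paged attention. The checkpoint name is only an example on our part; swap in whichever model you are benchmarking.

```python
# A minimal sketch of serving a model with vLLM (which implements paged attention);
# the checkpoint is only an example, substitute whatever model you want to benchmark.
from vllm import LLM, SamplingParams

llm = LLM(model="lmsys/vicuna-7b-v1.3")  # example checkpoint, not a recommendation
params = SamplingParams(temperature=0.8, max_tokens=128)

# vLLM batches and schedules prompts itself, which is where much of the speedup comes from
outputs = llm.generate(["Explain paged attention in one sentence."], params)
print(outputs[0].outputs[0].text)
```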
🔗 Read more here https://hamel.dev/notes/llm/inference/03_inference.html
🧪 RRTF
A new alternative to RLHF that bypasses the need for reinforcement learning has been proposed, named RRTF (Rank Responses to align Test & Teacher Feedback). The method uses the ranking information directly to guide the training of the model, instead of the reward signal used in traditional RLHF ✨
The method is tested on top of StarCoder, an LLM for code, and achieves state-of-the-art results among open models, close to those of ChatGPT (GPT-3.5) 🔥 Even though it is not directly tested for aligning a base LLM, it opens the door for further methods in that space that could make it easier to adapt an LLM for chat or other use cases, similar to Direct Preference Optimisation, which we discussed in an earlier post.
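To give a feel for the idea, here is a minimal PyTorch sketch of the kind of pairwise ranking loss used by ranking-based alternatives to RLHF. This is an illustration under our own assumptions, not the exact recipe from the paper.

```python
# A minimal sketch of a pairwise ranking loss of the kind used by ranking-based
# alternatives to RLHF such as RRTF; illustrative only, not the paper's exact loss.
import torch

def ranking_loss(logprobs: torch.Tensor, ranks: torch.Tensor) -> torch.Tensor:
    """logprobs: length-normalised log-probability the model assigns to each candidate response.
    ranks: ranking of each response from the teacher/tests (lower rank = better)."""
    loss = logprobs.new_zeros(())
    n = logprobs.shape[0]
    for i in range(n):
        for j in range(n):
            if ranks[i] < ranks[j]:  # response i is ranked better than response j
                # penalise the model whenever the worse response gets a higher log-probability
                loss = loss + torch.clamp(logprobs[j] - logprobs[i], min=0.0)
    return loss

# toy usage: three candidate responses, the second is ranked best by the teacher
lp = torch.tensor([-1.2, -0.8, -2.5], requires_grad=True)
loss = ranking_loss(lp, torch.tensor([2, 1, 3]))
loss.backward()
```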
🔗 Read more here https://arxiv.org/pdf/2307.14936.pdf
🧪 ToolLLM
A new research paper expands the abilities of Llama to use more than 16K real-world APIs 😮 This is two orders of magnitude more than other approaches. It also demonstrates performance similar to ChatGPT, which is already excellent at using tools and APIs, and it enables an LLM to use multiple tools, not just one 🛠
The catch is that a teacher model, in this case ChatGPT, is used to create the dataset, so the data and models come with the limitations associated with that choice. It also leaves open the question of how to create, and make available, such a dataset without a teacher model.
That said, extending LLMs to use tools can supercharge their abilities 💪 and will be relevant to many industry use cases where an assistant might need to interact with both internal and external tools. You can experiment with ToolLLM on the Hugging Face Hub 🤗
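To make the idea concrete, here is a minimal, hypothetical sketch of the tool-use loop a model like ToolLLM follows: the model picks an API and its arguments, your code executes the call, and the observation is fed back so the model can answer. The query_model function and the weather tool are placeholders of our own, not part of ToolLLM.

```python
# Hypothetical sketch of a single-tool loop; query_model stands in for whatever LLM you use
# and the weather tool is a mock registry entry, not a real API.
import json

TOOLS = {
    "get_weather": lambda city: {"city": city, "forecast": "sunny"},  # mock API for illustration
}

def query_model(prompt: str) -> str:
    """Stand-in for the LLM call; a tool-augmented model would emit a JSON action like this."""
    return json.dumps({"tool": "get_weather", "arguments": {"city": "London"}})

def run(user_request: str) -> str:
    action = json.loads(query_model(user_request))          # 1. model decides which tool to call
    result = TOOLS[action["tool"]](**action["arguments"])   # 2. we execute the API call
    # 3. the observation goes back into the prompt so the model can produce a final answer
    return query_model(f"{user_request}\nObservation: {json.dumps(result)}")
```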
🔗 Read more here https://arxiv.org/pdf/2307.16789v1.pdf
🧪 Skeleton of Thought
A new prompting technique came out that aims to speed up inference by approximately 2x. One bottleneck for Large Language Models is the fact that they need to output words sequentially: generating words cannot be parallelised, since each new word depends on the previous ones.
Skeleton of Thought aims to improve that exact step by breaking word generation into two steps: one in which the skeleton of the answer is produced, and another where the skeleton is filled in with more information. The second step can be parallelised, since you can ask an LLM to expand the answers concurrently using the skeleton. This seems to offer a bit more than a 2x speedup.
Latency can be quite important in some applications, and while there are other ways to speed up an LLM, like quantisation or PagedAttention, this provides an easy alternative that only depends on clever prompting. A minimal sketch of the idea is below.
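Here is a minimal sketch of the Skeleton-of-Thought flow: first ask for a short skeleton of the answer, then expand every skeleton point concurrently. The call_llm function is a stand-in for whatever model or API you use, and the prompts are our own illustrative wording.

```python
# A minimal sketch of Skeleton-of-Thought prompting; call_llm is a stand-in for a real LLM call.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    """Replace with a real LLM call (OpenAI, TGI, vLLM, ...); returns dummy text here."""
    return f"[model output for: {prompt[:40]}...]"

def skeleton_of_thought(question: str) -> str:
    # Step 1: generate a concise skeleton (sequential, but short, so it is fast)
    skeleton = call_llm(f"Give a short, numbered skeleton (3-5 points) answering: {question}")
    points = [p for p in skeleton.splitlines() if p.strip()]

    # Step 2: expand every point in parallel, which is where the ~2x speedup comes from
    with ThreadPoolExecutor() as pool:
        expansions = list(pool.map(
            lambda point: call_llm(
                f"Question: {question}\nExpand this point in 2-3 sentences: {point}"
            ),
            points,
        ))
    return "\n\n".join(expansions)

print(skeleton_of_thought("Why is sequential decoding a latency bottleneck for LLMs?"))
```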
🔗 Read more here https://arxiv.org/pdf/2307.15337.pdf
🖱 Dimensions AI
Dimensions is the most popular tool that funders and research institutions use to access research, grant and policy data. These organisations commonly look for relevant data to inform their funding or research initiatives. Dimensions is rolling out its own assistant to make this discovery process much easier. The assistant uses OpenAI and Dimensions' custom models to search, rerank and summarise their large catalogue of data 🗄
This is yet another assistant powered by Large Language Models to further increase the productivity of a specific sector. As we discussed in the past, we think assistants are very relevant to a lot of companies and will power use cases from customer support to education and search 🔍
🔗 Read more here https://www.dimensions.ai/blog/powering-research-with-dimensions-ai-assistant/