🧪 Scaling transformers' context window to 1M tokens
Most transformer architectures have a context window between 512 and 4K tokens. And while the latest Large Language Models have pushed that up to 32K for GPT-4, there are still applications that need some form of longer memory to be effective, which at the moment is outsourced to vector databases and retrieval approaches.
One way to overcome this limitation is to introduce memory tokens that are recurrently fed from output back to input while the longer text is split into segments (see image). This architecture is called the Recurrent Memory Transformer, and it has been shown to scale up to 2M tokens 💥
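To make the idea concrete, here is a minimal PyTorch sketch of the segment-level recurrence: memory tokens are prepended to each segment, and the updated memory produced for one segment is passed along to the next. The backbone, memory size and segment length here are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class RMTSketch(nn.Module):
    """Illustrative RMT-style segment recurrence (not the authors' code)."""

    def __init__(self, d_model=256, n_mem=16, seg_len=512):
        super().__init__()
        self.n_mem, self.seg_len = n_mem, seg_len
        # Learnable memory tokens, prepended to every segment.
        self.init_memory = nn.Parameter(torch.randn(1, n_mem, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, x):  # x: (batch, total_len, d_model) token embeddings
        memory = self.init_memory.expand(x.size(0), -1, -1)
        outputs = []
        # Process the long input as a sequence of fixed-size segments,
        # carrying the updated memory tokens from one segment to the next.
        for segment in x.split(self.seg_len, dim=1):
            hidden = self.backbone(torch.cat([memory, segment], dim=1))
            memory, seg_out = hidden[:, :self.n_mem], hidden[:, self.n_mem:]
            outputs.append(seg_out)
        return torch.cat(outputs, dim=1), memory

# A 4096-token input is handled as 8 segments of 512: attention is quadratic
# only within a segment, while memory tokens carry information across segments.
out, mem = RMTSketch()(torch.randn(2, 4096, 256))
```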
They test this ability on a synthetic Q&A dataset where facts are placed at random locations within noisy text, and the model is asked the question only at the end of the long sequence, so it can answer correctly only if it has used its memory effectively.
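For a feel of what one such sample looks like, here is a hedged sketch of building a "fact buried in noise" example; the sentences and field names are invented for illustration and are not the paper's actual generation code.

```python
import random

def make_sample(noise_sentences, fact, question, answer, n_noise=200):
    """One synthetic QA example: a fact hidden at a random position inside
    noisy text, with the question appearing only at the very end."""
    body = random.choices(noise_sentences, k=n_noise)
    body.insert(random.randrange(len(body) + 1), fact)
    return {"text": " ".join(body) + " " + question, "answer": answer}

noise = [
    "The sky was a dull shade of grey.",
    "A train passed somewhere in the distance.",
    "Someone mentioned the weather again.",
]
sample = make_sample(noise, fact="Daniel went to the kitchen.",
                     question="Where is Daniel?", answer="kitchen")
```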
Most importantly, inference time scales linearly with the number of tokens instead of quadratically: full attention is applied only within each fixed-size segment, and the segments are processed one after another. The approach can also be used with any BERT-compatible architecture.
🔗 Read more here
🖱 HuggingChat
HuggingFace creates a chat interface for serving open source conversational models. The first model available comes from OpenAssistant: a Llama 30B model that is instruction fine-tuned and further trained with Reinforcement Learning from Human Feedback, similar to ChatGPT, using 160K messages and 400K quality ratings ⭐️
The model seems comparable to the GPT-3.5-powered ChatGPT according to human evaluators. But even if its performance turns out to be inferior in practice, this is an important milestone for open source Large Language Models, as the gap between state-of-the-art and open source models is closing rapidly 🚀
Most importantly, this may be the start of an open source ecosystem built on top of Large Language Models and catalysed by organisations such as HuggingFace 🤗
🕹 Try it here hf.co/chat
🧪 RedPajama
A new research initiative kicks off with the aim of producing state-of-the-art open source models. It is a collaboration among various labs, including Stanford's CRFM, Hazy Research and Mila Quebec.
Their first goal is to reproduce Llama, which stands out as the best semi-open model to date. The first step towards this goal is creating and releasing a 1.2T-token dataset similar to the one used to train Llama. They are currently training Llama clones 🐑, the first of which are expected to land in the coming weeks. The plan is then to instruction fine-tune them using data from OpenChatKit, following the route that Alpaca and friends took.
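If you want to poke at the data on the Hugging Face Hub, something like the snippet below should work with the `datasets` library; note that the dataset id and config name are assumptions based on the public release and may differ.

```python
from datasets import load_dataset

# Stream a slice of the corpus instead of downloading ~1.2T tokens to disk.
# The dataset id and config name are assumptions, not verified here.
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                  split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["text"][:200])
    if i == 2:
        break
```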
Things are accelerating 🚀 in the open source space, and this is yet another indication that the gap between state-of-the-art closed source models and open source ones is closing rapidly; the trend might even invert in favour of open source soon. More importantly, this new wave of models will be the driving force behind a new wave of innovation in the industry.
🔗 Read more here.
🖱 New ways to manage your ChatGPT data
ChatGPT users can now turn their chat history off, which in effect means OpenAI will not use their conversation data to train subsequent iterations of its models. This alleviates a big concern for many users, especially those using ChatGPT for work.
More importantly though, OpenAI announced that they are working on a Business subscription with even tighter data controls and the ability to manage users. This might unblock companies from using OpenAI models, since for many of them sharing sensitive data was not an option and having no way to manage how employees use ChatGPT was unacceptable.
🔗 Read more here
🧪 SparseGPT
There is a new pruning method that manages to eliminate more than 50% of the weights of Large Language Models from the GPT family. For 175B-parameter models that can translate to some 100B parameters removed. All of that with an almost unnoticeable loss in performance 🔥
The method requires no retraining, just the model weights ✨ and it completes in under 5 hours on a single GPU 🏎 Paired with the sparse tensor cores in the latest GPUs, it can dramatically decrease the inference time and memory requirements needed to run Large Language Models.
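SparseGPT itself solves a layer-wise reconstruction problem using approximate second-order information, which is beyond a newsletter snippet. As a much simpler stand-in, the sketch below applies plain magnitude pruning to a weight matrix, including the 2:4 pattern that sparse tensor cores accelerate, just to make the "drop half the weights" idea concrete.

```python
import torch

def magnitude_prune(weight, sparsity=0.5):
    """Zero out the smallest-magnitude weights. This is NOT SparseGPT, which
    instead reconstructs each layer using approximate second-order information."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def prune_2_to_4(weight):
    """2:4 structured sparsity: in every group of 4 weights along a row, keep
    only the 2 largest magnitudes - the pattern sparse tensor cores accelerate."""
    rows, cols = weight.shape
    groups = weight.reshape(rows, cols // 4, 4)
    drop = groups.abs().topk(2, dim=-1, largest=False).indices  # 2 smallest per group
    mask = torch.ones_like(groups).scatter_(-1, drop, 0.0)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
print((magnitude_prune(w) == 0).float().mean())  # ~0.5
print((prune_2_to_4(w) == 0).float().mean())     # exactly 0.5
```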
🔗 Read more here