
Jamba: AI21 Labs Unveils the World’s First Production-Scale Mamba-Based Model

In a groundbreaking development, AI21 Labs has unveiled Jamba, the world’s first production-scale model built on the innovative Mamba architecture. By enhancing Mamba’s structured state-space model (SSM) technology with elements of the traditional Transformer architecture, Jamba overcomes the inherent limitations of pure SSM models, setting a new benchmark in the realm of large language models (LLMs).

Jamba boasts an impressive 256K context window, offering significant improvements in throughput and efficiency – a mere glimpse into the vast potential of this pioneering hybrid architecture. Remarkably, Jamba has demonstrated performance that surpasses or matches other state-of-the-art models of equivalent scale across a wide range of benchmarks.

The weights of Jamba have been released under the Apache 2.0 license, making it accessible to the broader AI community through the ai21labs/Jamba-v0.1 repository on Hugging Face.
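
For readers who want to try the released checkpoint, a minimal loading sketch using the Hugging Face transformers library might look like the following. The exact requirements (library version, optional optimized Mamba kernels, available GPU memory) vary by setup, so treat this as an illustrative starting point rather than official usage; the prompt and generation settings are arbitrary.

```python
# Minimal sketch: loading the released Jamba-v0.1 weights with Hugging Face transformers.
# Assumes a recent transformers release and enough GPU memory; additional flags
# (e.g. trust_remote_code) may be needed depending on your library version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half precision to reduce memory footprint
    device_map="auto",            # spread layers across available GPUs
)

inputs = tokenizer("A hybrid SSM-Transformer language model is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```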

Jamba’s core features include:

  • The first production-scale Mamba model based on the innovative SSM-Transformer hybrid architecture.
  • A threefold increase in throughput for long contexts compared to the Mixtral 8x7B model.
  • Democratizing access to a massive 256K context window.
  • The only model in its size category that fits contexts of up to 140K tokens on a single GPU.
  • Open weights released under the Apache 2.0 license.
  • Currently available on Hugging Face, with upcoming integration into the NVIDIA API catalog.

Jamba: Combining the Best of Both Architectures

The release of Jamba marks two significant milestones in the evolution of large language models (LLMs): the successful integration of Mamba with the Transformer architecture and the advancement of hybrid SSM-Transformer models to production-scale quality and size.

Until now, LLMs have primarily been built upon the traditional Transformer architecture.

While undoubtedly powerful, this architecture suffers from two major drawbacks:

High memory consumption: A Transformer’s memory footprint – in particular its key-value (KV) cache – grows with context length, making it difficult to run long context windows or many parallel batches without substantial hardware resources, which limits opportunities for widespread experimentation and deployment.

Slow inference as context grows: The cost of the Transformer’s attention mechanism scales quadratically with sequence length, since each token attends to the entire preceding sequence. Throughput therefore degrades on long inputs, making long-context use cases impractical in efficient production scenarios.
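
To make the memory point concrete, here is a back-of-the-envelope sketch of how a plain Transformer’s KV cache grows with context length. The layer count, head count, head dimension, and precision below are hypothetical illustration values, not the actual configuration of Jamba or any particular model.

```python
# Back-of-the-envelope KV-cache size for a plain Transformer decoder.
# All model dimensions below are hypothetical, chosen only to illustrate scaling.

def kv_cache_bytes(context_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Keys + values cached for every layer, KV head, and past token."""
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_value
    return context_len * per_token

for ctx in (4_096, 32_768, 131_072, 262_144):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:5.1f} GiB of KV cache per sequence")
```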

The Mamba architecture, proposed by researchers from Carnegie Mellon University and Princeton University, aimed to address these shortcomings, opening new frontiers in language model development. However, without attention over the entire context, this architecture struggled to match the output quality of existing state-of-the-art models, particularly in tasks requiring recall of relevant information.

To capture the best of both the Mamba and Transformer architectures, AI21 Labs developed the Joint Attention and Mamba (Jamba) architecture. Jamba comprises Transformer, Mamba, and Mixture of Experts (MoE) layers, simultaneously optimizing for memory, throughput, and performance.

Jamba’s Mixture of Experts (MoE) layers allow it to utilize only 12B of its available 52B parameters during inference, while its hybrid structure makes these 12B active parameters more efficient than those of an equivalently-sized, purely Transformer-based model.

While some attempts have been made to scale up Mamba, no model has exceeded 3B parameters in size. Jamba is the first hybrid architecture of its kind to reach production-scale proportions.

Building a Scalable Jamba Architecture

To successfully scale Jamba’s hybrid structure, several core architectural innovations were necessary. The first is a blocks-and-layers approach that enables the two architectures to be integrated: each Jamba block contains either an attention layer or a Mamba layer, followed by a multi-layer perceptron (MLP), at an overall ratio of one Transformer (attention) layer for every eight layers.
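
As a rough structural sketch (not AI21’s implementation), the snippet below shows how attention and Mamba layers, each followed by an MLP sub-layer, might be interleaved under the one-in-eight ratio described above; the total layer count is an arbitrary example.

```python
# Structural sketch of Jamba-style layer interleaving (illustrative, not AI21's code).
# One layer in every eight uses attention; the rest use Mamba. Each is followed
# by an MLP sub-layer, some of which could be MoE layers in the real model.

def build_layer_plan(n_layers: int = 32, attention_every: int = 8) -> list[str]:
    plan = []
    for i in range(n_layers):
        mixer = "attention" if i % attention_every == 0 else "mamba"
        plan.append(f"{mixer} + mlp")
    return plan

for idx, layer in enumerate(build_layer_plan()):
    print(f"layer {idx:2d}: {layer}")
```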

The second innovation is the use of Mixture of Experts (MoE) techniques to increase the model’s total parameter count while keeping the number of active parameters used during inference small, increasing model capacity without a corresponding increase in computational demands. To maximize model quality and throughput on a single 80GB GPU, the number of MoE layers and experts was tuned so that sufficient memory remains for common inference workloads.
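
The effect of MoE on active versus total parameters can be illustrated with a toy routing sketch. The expert count, top-k value, and hidden size below are assumptions chosen for illustration and may differ from Jamba’s actual configuration; the point is that each token’s forward pass touches only a small fraction of the stored expert weights.

```python
# Toy top-k MoE routing sketch (illustrative; not AI21's implementation).
# Each token is routed to k experts out of n, so only k/n of the expert
# parameters are active for that token even though all n are stored.
import torch
import torch.nn.functional as F

n_experts, top_k, d_model = 16, 2, 64   # assumed values for illustration only
experts = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
)
router = torch.nn.Linear(d_model, n_experts)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, d_model) -> weighted sum of each token's top-k experts."""
    gate_logits = router(x)                                # (tokens, n_experts)
    weights, chosen = torch.topk(gate_logits, top_k, dim=-1)
    weights = F.softmax(weights, dim=-1)                   # normalize over the k chosen experts
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(n_experts):
            mask = chosen[:, slot] == e                    # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * experts[e](x[mask])
    return out

print(moe_forward(torch.randn(4, d_model)).shape)          # torch.Size([4, 64])
```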

Unprecedented Throughput and Efficiency

Based on preliminary evaluations, Jamba has demonstrated impressive results on key metrics such as throughput and efficiency. These initial benchmarks are already remarkable, and they should continue to improve as the community experiments with and optimizes this new architecture.

Efficiency

Jamba’s throughput for long contexts is three times that of Transformer-based models of similar size, such as the Mixtral 8x7B, making it a more efficient model compared to its counterparts.
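
Throughput comparisons of this kind can be reproduced at a small scale with a simple tokens-per-second measurement. The harness below is generic rather than the benchmark AI21 used; it assumes a `model` and `tokenizer` loaded on a CUDA device as in the earlier example, and the synthetic long prompt is only a crude stand-in for a real long-context document.

```python
# Rough tokens-per-second measurement for a loaded causal LM (generic harness,
# not AI21's benchmark). Assumes `model` and `tokenizer` from the loading
# example above, running on a CUDA device.
import time
import torch

def measure_throughput(model, tokenizer, prompt: str, new_tokens: int = 256) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return new_tokens / elapsed

long_prompt = "word " * 50_000   # crude stand-in for a long-context document
print(f"{measure_throughput(model, tokenizer, long_prompt):.1f} tokens/sec")
```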

Cost-effectiveness

Jamba can fit contexts of up to 140K tokens on a single GPU, offering more accessible opportunities for deployment and experimentation than other open-source models of equivalent scale currently available.

As the AI community embraces and explores the potential of Jamba, the future of large language models promises to be more efficient, scalable, and accessible than ever before.

Unparalleled Performance and Versatility

Beyond its impressive throughput and efficiency metrics, Jamba has demonstrated remarkable performance across a wide range of benchmarks, often outperforming or matching state-of-the-art models of comparable size. This versatility underscores the power of Jamba’s hybrid architecture, which combines the strengths of both the Transformer and Mamba models.

One notable advantage of Jamba is its exceptional capability in handling long-context scenarios. Thanks to its massive 256K context window, Jamba can process and generate coherent responses for extended passages of text, making it well-suited for tasks such as document summarization, question-answering, and dialogue generation.

Moreover, Jamba’s performance on natural language understanding tasks, such as sentiment analysis and topic classification, is on par with leading models, showcasing its robust language comprehension abilities. This versatility extends to natural language generation tasks as well, where Jamba excels in areas like creative writing, code generation, and language translation.

Importantly, Jamba’s strengths are not limited to specific domains or languages. Its hybrid architecture has been designed to be highly adaptable, enabling it to perform well across a diverse range of tasks and languages, making it a valuable asset for researchers, developers, and businesses alike.

Fostering Open Innovation and Collaboration

In line with AI21 Labs’ commitment to open-source innovation, the weights of Jamba have been released under the permissive Apache 2.0 license. This move not only ensures transparency but also encourages collaboration and further development within the AI community.

By making Jamba’s weights openly available through the ai21labs/Jamba-v0.1 repository on Hugging Face, researchers and developers can easily access and experiment with this state-of-the-art model, potentially unlocking new applications and advancements in the field of large language models.

Additionally, Jamba’s upcoming integration into the NVIDIA API catalog will further enhance its accessibility, enabling developers to leverage its capabilities seamlessly within their applications and workflows.

Looking Ahead: The Future of Language Models

The release of Jamba marks a significant milestone in the evolution of large language models, but it is merely the beginning of what is possible with the fusion of Transformer and Mamba architectures. As the AI community continues to explore and refine this hybrid approach, we can expect to witness even more groundbreaking advancements in the field.

One area of particular interest is the potential for further scaling and optimization of Jamba’s architecture. By leveraging more advanced hardware and distributed training techniques, it may be possible to create even larger and more powerful versions of Jamba, pushing the boundaries of what is achievable in natural language processing.

Furthermore, the versatility of Jamba’s hybrid architecture opens up exciting possibilities for domain-specific fine-tuning and adaptation. By tailoring Jamba to specific industries or use cases, such as healthcare, finance, or scientific research, we may see the emergence of highly specialized and accurate language models that can revolutionize various sectors.

As the world of artificial intelligence continues to evolve at a rapid pace, the release of Jamba serves as a compelling reminder of the transformative potential of innovative architectures and open collaboration. With the AI21 Labs team and the broader research community working together, the future of large language models promises to be one of unprecedented capability, efficiency, and accessibility.