Tailoring Intelligence Part 2: Model merging
Takeaways
Model merging is an exciting, fast-growing emerging technique that shows great promise for allowing companies to inject new knowledge into a model in a cost-efficient and scalable way.
Model merging complements fine-tuning, as you can merge fine-tuned models and bring the benefits of each into one model.
Evolutionary model merging is a potential game-changer, as it removes much of the complexity and guesswork from the process.
There is exciting potential for model merging in multi-modality by merging vision and language models.
Despite its potential, model merging remains an emerging field. Proven effectiveness in production settings will be a crucial driver of increased adoption. The field requires technical developments in areas such as merging models with different architectures and sizes. For founders interested in model merging, the most practical advice is to focus on techniques that have already been validated.
Introduction to model merging
In Part 1 of our Tailoring Models series, we discussed the crucial role of fine-tuning for companies, emphasizing the significant benefits it offers in improving performance for specific tasks. However, there are many instances where companies want to embed new knowledge into the model that may not be related to the original training data. Fine-tuning processes like LoRA are less capable of adapting to domains that differ substantially from the original training data. The alternative in such circumstances is to pre-train from scratch, but this is too expensive, requires extensive data, and is too complex for most companies. To address these challenges, an emerging field shows great promise: model merging.
Model merging is the process of combining the weights and layers of different models into a single, unified model without requiring additional training or fine-tuning. By merging models, developers can retain essential knowledge while integrating new information. This technique allows them to leverage the strengths of each individual model, providing a cost-effective approach to developing new models that is often achievable using just a CPU. Interestingly, when you merge two models that excel at separate tasks, you not only end up with a model capable of performing both tasks, but it often outperforms the original models on each individual task.
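As a concrete (if drastically simplified) illustration, here is a sketch of the simplest merge: a weighted average of two models' weights. The "checkpoints" below are plain Python dicts of weight lists, and the model names and numbers are made up for illustration; real toolkits do the equivalent over full transformer state dicts.

```python
def linear_merge(model_a, model_b, alpha=0.5):
    """Weighted average of two models with identical architectures:
    merged = alpha * A + (1 - alpha) * B, applied weight by weight."""
    assert model_a.keys() == model_b.keys(), "architectures must match"
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(model_a[name], model_b[name])]
        for name in model_a
    }

# Toy "checkpoints": two fine-tunes of the same base, as flat weight lists.
code_model = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
chat_model = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}

merged = linear_merge(code_model, chat_model, alpha=0.5)
print(merged)  # {'layer0.weight': [2.0, 3.0], 'layer0.bias': [0.5]}
```

Note that no gradient computation is involved: the merge is pure arithmetic over existing weights, which is why it runs comfortably on a CPU.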
One of the primary advantages of model merging is its ability to mitigate catastrophic forgetting, which occurs when a model loses previously learned information as new data is introduced. Merging allows you to achieve high performance in a specific domain while maintaining the broad capabilities of the base model (for example, by merging a specialized model back into a specific base checkpoint). It does so with a much lower computational budget than full fine-tuning with dataset replay, which can be very expensive, especially for large datasets.
Model merging has been gaining attention thanks to Arcee.ai (a Flybridge Portfolio company) and Charles Goddard, creator of Mergekit (an open-source toolkit for merging pre-trained language models that supports most of the popular merging methods through a simple YAML file). Some of the most popular methods can be found in Appendix B. In a conversation I had with Charles, he mentioned:
"Merging techniques are scalable in a way traditional approaches aren't. You can independently fine-tune on different domains and then combine them for downstream applications without retraining over the entire set." — Charles Goddard
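To give a feel for how simple the interface is, a SLERP merge of two hypothetical fine-tunes of the same base in Mergekit looks roughly like the YAML below. The model IDs are placeholders, and the exact schema may differ between Mergekit versions; consult the Mergekit documentation before using it.

```yaml
# Hypothetical mergekit config: SLERP-merge two fine-tunes of the same base.
slices:
  - sources:
      - model: org/my-code-finetune-7b   # placeholder model IDs
        layer_range: [0, 32]
      - model: org/my-chat-finetune-7b
        layer_range: [0, 32]
merge_method: slerp
base_model: org/my-code-finetune-7b
parameters:
  t: 0.5       # interpolation factor: 0 = first model, 1 = second
dtype: bfloat16
```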
It's important to note that you can merge fine-tuned models, so merging and fine-tuning are complementary toolsets for founders and builders. Another benefit of merging is that it can help reduce the dilution effect when you fine-tune or pre-train a model. For example, you can merge a model trained with your raw data completions into an instructional model. Then, when a new instructional model of the same base comes along that surpasses the previous capabilities, you can merge your completion model with this new instruction model, thereby eliminating the dilution effect.
Adoption challenges and evolutionary model merging
Despite its potential benefits, model merging has not yet been widely adopted in production settings, for several reasons. First, model merging is a newly emerging field, and most companies focus on proven methods rather than experimental techniques. Second, until recently, model merging required a highly manual, experimental process: testing different merges by hand and working out how various merging parameters and hyperparameters affect the final model's performance. This created a barrier to entry, as not everyone possesses the knowledge required to run these experiments effectively. Additionally, many people tried model merging once and, if the result was unsatisfactory, assumed that was the technique's ceiling. In reality, a first attempt is unlikely to yield optimal results; the nature of model merging requires trying different combinations to find the best approach. To this point, Charles shared:
"There’s a reputation that model merging is just for gaming leaderboards, which discourages serious adoption. But with the right hyperparameter tuning, it’s a powerful tool for real-world applications." — Charles Goddard
Earlier this year, Sakana AI released a groundbreaking paper on evolutionary merging, in which they applied an evolutionary algorithm (the Covariance Matrix Adaptation Evolution Strategy, CMA-ES) to optimize the parameter tweaking in the merging process. This automates the combination of model weights and layers by iterating, evaluating, and merging models against defined criteria. The core idea is that if you can measure how well a model performs on a specific task, you can use that measurement to guide the optimization. You provide a list of evaluations to optimize for, and candidate models go through cycles of merging and evaluation until the process converges on the best-performing combination.
In the paper, they shared two ways to merge a model. The first is merging models in the data flow space (layers), in which you find the best combination of the layers of the different models to merge and build a new model. The second is merging models in the parameter space (weights). In this approach, you mix the weights in different proportions of the different models to build a new model (akin to mixing colors). Both data flow space and parameter space can be combined to build a new model, and the actual results showed the highest increase in accuracy by combining both methods. It's important to note that you can leverage the evolutionary process using the different merging methods in Appendix B. For those who want to learn more, I enjoyed this walkthrough of Evolutionary Model Merging by Oxen.
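The shape of the parameter-space search can be sketched as a loop: propose candidate mixing weights, build the merged model, score it on an evaluation set, and keep the best candidates. Sakana uses CMA-ES; the simple mutate-and-select loop below is a hypothetical stand-in to show the structure, and `evaluate` is a placeholder for whatever task metric you want to optimize.

```python
import random

def merge(models, mix):
    """Parameter-space merge: each weight is a mix-weighted sum across models."""
    return {
        name: [sum(w * m[name][i] for w, m in zip(mix, models))
               for i in range(len(models[0][name]))]
        for name in models[0]
    }

def evolve_mix(models, evaluate, generations=20, pop=8, seed=0):
    """Toy evolutionary search over mixing weights (a stand-in for CMA-ES)."""
    rng = random.Random(seed)
    best_mix, best_score = [1.0 / len(models)] * len(models), float("-inf")
    for _ in range(generations):
        for _ in range(pop):
            # Mutate the current best mix, then renormalize to sum to 1.
            cand = [max(1e-6, w + rng.gauss(0, 0.1)) for w in best_mix]
            total = sum(cand)
            cand = [w / total for w in cand]
            score = evaluate(merge(models, cand))
            if score > best_score:
                best_mix, best_score = cand, score
    return best_mix, best_score

# Toy demo: two one-weight "models"; the metric rewards a merged weight near 3.
models = [{"w": [0.0]}, {"w": [10.0]}]
mix, score = evolve_mix(models, lambda m: -(m["w"][0] - 3.0) ** 2)
```

In a realistic setting, `evaluate` would run the merged checkpoint against a held-out benchmark, which is why evolutionary merging costs more compute than a single hand-tuned merge.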
Leveraging the evolutionary merging process, they introduced a series of models such as EvoLLM-JP, an LLM that can solve math problems in Japanese. To generate it, they produced 128 candidate combinations and selected the best performers. Remarkably, the process took just a couple of days on 16 GPUs. They also released one of the first merged vision-language models, combining a Japanese LLM (Shisa Gamma 7B) with a vision model (LLaVA 1.6 Mistral 7B).
It is worth noting that this approach is highly novel and still has a long way to go, but it appears to be heading in the right direction for gaining wider adoption. However, it is important to mention that it does require more computational resources than the regular merging process, and that the merged models inherit some of the same limitations as the base models.
Arcee: a pioneer in the model merging space
Arcee, a Flybridge portfolio company, is a pioneer in leveraging model merging as part of its SLM adaptation system. Arcee's approach to model development emphasizes extending the capabilities of base models like Llama-2-base or Mistral-7B-base through domain-specific fine-tuning and continual pre-training on proprietary client data. Arcee then uses model merging to combine the strengths of multiple models into a single, versatile checkpoint, balancing domain-specific expertise with general-purpose functionality. This strategy centers on fine-tuning smaller language models for specific domains, offering substantial cost savings compared to training LLMs. The approach has been validated in two case studies Arcee shared, for legal and medical use cases, in which the merged model outperformed the base models. For example, in the medical use case, the linear merge demonstrated superior performance on the PubMedQA benchmark, scoring 75.6 versus the base Llama2 7B Chat's 73.4.
Jacob Solawetz, one of the founders of Arcee, emphasizes how they enable customers to own their models and the importance of privacy in their approach:
"With Arcee's end-to-end VPC service, we enable customers to leverage the power of smaller, specialized AI models, while protecting the privacy and ownership of their data."
Challenges, opportunities, and future areas of research
As we mentioned at the beginning, model merging is still in its early stages, and despite the advances with evolutionary model merging, adoption levels remain relatively low. We expect that as more novel research by companies like Arcee and Sakana emerges, it will expand and become a central toolkit of the AI stack.
Current limitations and interesting future areas of research include merging models of different sizes and bases. For example, combining models like a Mistral 3B with a Llama 7B, or training a 1B parameter model and merging it with a 70B parameter model. This approach has been partially explored but is still very experimental and not proven.
Challenges with merging models of different architectures and sizes include, among others:
Handling Cross-Attention Mechanisms: Challenges include managing how different models process and integrate attention across various segments or batches of data. The focus is on ensuring that cross-attention mechanisms, which allow models to share contextual information across different data groups, are compatible and effectively integrated.
Integration of Different Attention Configurations: Integrating models with different configurations of attention heads presents a substantial challenge. These heads, crucial for determining the focus points within the data, vary significantly across models in both architecture and function. Harmonizing these differences is key to successful model merging.
Overcoming these challenges and successfully merging models of different sizes and architectures could lead to more powerful and versatile language models, making this an exciting area for future research and development.
Another exciting promise of model merging, initiated by Sakana Research, is the impact it can have on multi-modality, as it allows for the combination of capabilities from vision and language models. As more companies realize the potential of multi-modal use cases, we expect this to become an increasingly important area of research in the merging space.
Two practical pieces of advice for founders and operators are:
Stick to the practices that have been validated so far in model merging, such as merging models of the same base and size. Leave the experimental aspects of merging to research-focused companies like Arcee and Sakana, as that investment may not be the best allocation of resources.
As with fine-tuning, when merging models, have a clear idea of the goal you want to achieve and how you will measure whether you reached the desired output.
A key factor that will drive wide adoption of model merging is validation of the results merged models can achieve in companies' production settings. The merging process needs to be validated not just on academic and experimental datasets but also on company datasets. That is how it can cross the chasm from early adopters to the mainstream: no longer seen as experimental, but as something that delivers results and ROI, which is what companies care about.
Appendix A: Useful Sources
Maxime Labonne on X + LazyMergekit (a Colab notebook that automates model merging with mergekit; it's very convenient, requires only a CPU, and lets you create MoEs like Beyonder)
Maya Akim's walkthrough of the merging process
Arcee Blog + YouTube Channel (SLM Show) + GitHub
Appendix B: Model merging techniques
Table made by leveraging the following sources, which I encourage readers to explore: Deci's comparison of model merging methods, MergeKit Merge Methods, and Julien Simon's deep dive into model merging.
Method | Description | Pros | Cons |
---|---|---|---|
Linear (Model Soup) | Averages many variants of the same model, trained on the same dataset with different hyperparameters. Includes Uniform Soup (average all models) and Greedy Soup (add models one by one, keeping those that improve accuracy). Effective when the models were tuned for the same task on the same dataset. | Simple and effective for averaging. | May not capture complex interactions between models. |
SLERP | Originally designed for computer graphics, spherical linear interpolation preserves the magnitude of weights and the shape of the embedding space, ensuring a smooth transition between two models. | More effective than linear averaging when one model has very high weights and the other very low ones, where plain averaging distorts the result. | Restricted to two models; may not fully capture the advantages of each. |
Task Arithmetic | Uses task vectors (the tensor deltas applied during fine-tuning) to add or remove capabilities. | Enhances task-specific capabilities, supports multiple models, and offers flexibility with several merging strategies. | Requires a suitable base model; complexity grows with more models; risk of overfitting to a specific task; may not fully capture complex model improvements. |
TIES (Trim, Elect Sign, and Merge) | Merges influential parameters while resolving sign conflicts to avoid interference. Step one trims each task vector, keeping only influential parameters and removing those with very small magnitudes. Step two elects a sign (positive or negative) per parameter and removes updates that disagree with it; the remaining aligned updates are then merged. | Reduces model complexity by focusing on impactful changes while retaining significant ones. | Can overlook subtle but important adjustments; requires choosing which parameters to ignore. |
DARE (Drop and Rescale) | Randomly drops a large fraction of fine-tuning parameter updates (drop rates as high as 99%) and rescales the remainder to reduce redundancy and interference. Can be combined with TIES (DARE-TIES) to pair the magnitude focus of TIES with the randomness of DARE. | Randomness helps prevent overfitting and simplifies the model by reducing the update count. | Randomness can discard critical information; DARE-TIES adds complexity. |
Passthrough | Stacks layers from different models without modification; used for layer-stacking merges. | Useful for layer-stacking merges. | No modification; doesn't combine model features. |
Model Breadcrumbs | Discards both very small and extremely large differences from the base model; can be combined with TIES. | Retains significant differences; flexible merging. | Requires tuning of density and gamma; potential loss of small but important differences. |
Model Stock | Computes interpolation weights for linear merging using geometric properties of fine-tuned models; requires at least three models. | Uses geometric properties for principled weight calculation; good for merging multiple models. | Requires at least three models; weight calculation adds complexity. |
MoE-merge (Moerge) | Merge several models with other techniques (e.g., Linear or TIES), then integrate them as experts within a Mixture of Experts (MoE) framework, where a gating network dynamically selects which experts to activate based on the input. While MoE as an architectural concept focuses on dynamically using specialized sub-models within a single framework, MoE in the merging context means integrating different models as experts within a larger MoE layer, leveraging each model's strengths rather than averaging their weights. | Precise activation of experts for specific tasks; performance increases as more parameters are leveraged; dynamically adapts to input complexity. | Complex routing configuration and management; overhead from dynamic routing decisions; requires careful balancing to avoid overloading specific experts; difficult to fine-tune and demands high VRAM capacity. |
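To make the trim/elect/merge steps from the TIES row concrete, here is a minimal sketch over toy task vectors (the per-weight deltas of each fine-tune from a shared base). It operates on flat Python lists rather than real tensors, and the `keep_top_k` trimming threshold is illustrative, not TIES's actual density parameter.

```python
def ties_merge(task_vectors, keep_top_k=2):
    """Toy TIES: trim small updates, elect a sign per position, merge the rest.

    task_vectors: list of flat delta lists (fine-tuned weights minus base).
    keep_top_k: per vector, keep only the k largest-magnitude entries (trim).
    """
    n = len(task_vectors[0])
    # 1) Trim: zero out all but the top-k magnitudes in each task vector.
    trimmed = []
    for tv in task_vectors:
        top = sorted(range(n), key=lambda i: abs(tv[i]), reverse=True)[:keep_top_k]
        trimmed.append([tv[i] if i in top else 0.0 for i in range(n)])
    merged = []
    for i in range(n):
        vals = [tv[i] for tv in trimmed]
        # 2) Elect sign: the sign with the larger total magnitude wins.
        pos, neg = sum(v for v in vals if v > 0), -sum(v for v in vals if v < 0)
        sign = 1.0 if pos >= neg else -1.0
        # 3) Merge: average only the surviving, sign-agreeing updates.
        agree = [v for v in vals if v * sign > 0]
        merged.append(sum(agree) / len(agree) if agree else 0.0)
    return merged

# Two toy task vectors with a sign conflict at position 0: the larger
# positive update wins there, and the conflicting one is dropped.
delta = ties_merge([[0.9, -0.8, 0.01], [-0.5, -0.7, 0.02]], keep_top_k=2)
```

The merged delta would then be added back onto the base model's weights, which is why TIES (like task arithmetic) requires all models to share a common base.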