Archon: A Machine Learning Framework for Large Language Model Enhancement Using Automated Inference-Time Architecture Search for Improved Task Performance

Artificial intelligence has made remarkable strides with the development of Large Language Models (LLMs), which have significantly affected domains such as natural language processing, reasoning, and coding. As LLMs grow more powerful, they require sophisticated methods to optimize their performance during inference. Inference-time techniques, that is, strategies that improve the quality of a model’s responses at runtime, have become crucial. However, the research community has yet to establish best practices for integrating these techniques into a cohesive system.

A core challenge in improving LLM performance is determining which inference-time techniques yield the best results for different tasks. The problem is compounded by the sheer variety of tasks, such as instruction-following, reasoning, and coding, each of which may benefit from a different combination of inference-time techniques. Moreover, understanding the complex interactions between techniques like ensembling, repeated sampling, ranking, fusion, and verification is crucial for maximizing performance. Researchers need a robust system that can efficiently explore the large design space of possible combinations and optimize these architectures for a given task and compute budget.

Traditional methods for inference-time optimization have focused on applying individual techniques to LLMs. For instance, generation ensembling involves querying multiple models simultaneously and selecting the best response, while repeated sampling involves querying a single model numerous times. These techniques have shown promise, but their standalone application often leads to limited improvements. Frameworks like Mixture-of-Agents (MoA) and LeanStar have attempted to integrate multiple techniques but still face challenges in generalization and performance across various tasks. Thus, there is a growing demand for a modular, automated approach to building optimized LLM systems.
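The difference between the two techniques can be shown with a minimal sketch. The "models" below are hypothetical stand-in functions rather than real LLM calls, and majority voting is only one assumed way of selecting a response; none of these names come from Archon itself.

```python
import random
from collections import Counter
from typing import Callable, List

# Hypothetical stand-ins for LLM calls: each "model" maps a prompt and a
# random generator to an answer string. A real system would call an API here.
Model = Callable[[str, random.Random], str]

def model_a(prompt: str, rng: random.Random) -> str:
    return rng.choice(["4", "4", "5"])  # a noisy model

def model_b(prompt: str, rng: random.Random) -> str:
    return rng.choice(["4", "3"])  # another noisy model

def model_c(prompt: str, rng: random.Random) -> str:
    return "4"  # a deterministic model

def ensemble(models: List[Model], prompt: str, rng: random.Random) -> List[str]:
    """Generation ensembling: query several different models once each."""
    return [m(prompt, rng) for m in models]

def repeated_sampling(model: Model, prompt: str, n: int, rng: random.Random) -> List[str]:
    """Repeated sampling: query a single model n times."""
    return [model(prompt, rng) for _ in range(n)]

def majority_vote(candidates: List[str]) -> str:
    """One simple (assumed) selector: return the most common candidate."""
    return Counter(candidates).most_common(1)[0][0]

rng = random.Random(0)
ensemble_candidates = ensemble([model_a, model_b, model_c], "What is 2 + 2?", rng)
sampled_candidates = repeated_sampling(model_a, "What is 2 + 2?", 5, rng)
```

Either list of candidates can then be passed to a selector such as `majority_vote`; the two techniques differ only in where the candidates come from.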

Researchers from Stanford University and the University of Washington have developed Archon, a modular framework designed to automate LLM architecture search using inference-time techniques. The Archon framework leverages diverse LLMs and inference-time methods, combining them into a cohesive system that outperforms any single model queried on its own. Rather than relying on a single LLM queried once, Archon dynamically selects, combines, and stacks layers of techniques to optimize performance for specific benchmarks. By treating the problem as a hyperparameter optimization task, the framework can identify architectures that balance accuracy, latency, and cost-efficiency for a given compute budget.
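To make the hyperparameter-optimization framing concrete, here is a minimal sketch of a search over candidate architectures under a compute budget. Everything in it is an assumption made for illustration: the `Config` fields, the cost model, and the `toy_score` objective are invented, and plain exhaustive enumeration is used purely for brevity in place of whatever optimizer a real system would use.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

@dataclass(frozen=True)
class Config:
    """Hypothetical architecture description (not Archon's actual schema)."""
    n_models: int         # models in the generation ensemble
    n_samples: int        # samples drawn per model
    n_fusion_layers: int  # stacked fusion layers

def inference_calls(cfg: Config) -> int:
    """Assumed cost model: one call per sample plus one per fusion layer."""
    return cfg.n_models * cfg.n_samples + cfg.n_fusion_layers

def search(
    space: Iterable[Config],
    score: Callable[[Config], float],
    budget: int,
) -> Optional[Tuple[Config, float]]:
    """Return the best-scoring config whose cost fits the budget, if any."""
    best: Optional[Tuple[Config, float]] = None
    for cfg in space:
        if inference_calls(cfg) > budget:
            continue  # over the compute budget, skip
        s = score(cfg)
        if best is None or s > best[1]:
            best = (cfg, s)
    return best

def toy_score(cfg: Config) -> float:
    """Made-up objective with diminishing returns; NOT a measured benchmark."""
    return 1.0 - 0.5 ** (cfg.n_models * cfg.n_samples) + 0.05 * cfg.n_fusion_layers

space = [Config(m, s, f) for m, s, f in product([1, 2, 4], [1, 2, 4], [0, 1, 2])]
best = search(space, toy_score, budget=10)
```

In practice the score would come from running each candidate architecture on a benchmark, which is expensive, so a sample-efficient optimizer is preferable to enumerating the whole space.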

The Archon framework is structured as a multi-layered system where each layer performs a distinct inference-time technique. For example, the first layer might generate multiple candidate responses using an ensemble of LLMs, while subsequent layers apply ranking, fusion, or verification techniques to refine these responses. The framework uses Bayesian optimization algorithms to search potential configurations and select the most effective one for a target benchmark. This modular design allows Archon to outperform top-performing models like GPT-4o and Claude 3.5 Sonnet by an average of 15.1 percentage points across a wide range of tasks.
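The layered structure described above can be sketched as a list of functions, each of which takes the prompt and the current candidate responses and returns a new list of candidates. This is only an illustration under stated assumptions: the layer constructors, the toy word-overlap scorer, and the string-joining fuser are hypothetical, and in a real system each of them would be backed by LLM calls.

```python
import re
from typing import Callable, List

# A layer maps (prompt, current candidates) to a new list of candidates.
Layer = Callable[[str, List[str]], List[str]]

def generator_layer(models: List[Callable[[str], str]]) -> Layer:
    """Add one response from each model in the ensemble."""
    def layer(prompt: str, candidates: List[str]) -> List[str]:
        return candidates + [m(prompt) for m in models]
    return layer

def ranker_layer(score: Callable[[str, str], float], top_k: int) -> Layer:
    """Keep the top_k candidates according to a scoring function."""
    def layer(prompt: str, candidates: List[str]) -> List[str]:
        ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
        return ranked[:top_k]
    return layer

def fuser_layer(fuse: Callable[[str, List[str]], str]) -> Layer:
    """Merge the remaining candidates into a single response."""
    def layer(prompt: str, candidates: List[str]) -> List[str]:
        return [fuse(prompt, candidates)]
    return layer

def run_architecture(layers: List[Layer], prompt: str) -> List[str]:
    candidates: List[str] = []
    for layer in layers:
        candidates = layer(prompt, candidates)
    return candidates

def words(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(prompt: str, candidate: str) -> float:
    """Toy scorer: number of prompt words that reappear in the candidate."""
    return float(len(words(prompt) & words(candidate)))

# Hypothetical stand-in models returning canned strings.
stub_models = [
    lambda p: "Paris.",
    lambda p: "The capital of France is Paris.",
    lambda p: "Paris is the capital.",
]

layers = [
    generator_layer(stub_models),
    ranker_layer(overlap_score, top_k=2),
    fuser_layer(lambda p, cs: " | ".join(cs)),
]
result = run_architecture(layers, "What is the capital of France?")
```

Because every layer shares the same signature, layers can be reordered, repeated, or swapped out, which is what makes the space of architectures searchable.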

Archon was evaluated on several benchmarks: MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. Compared with single-call state-of-the-art models, Archon architectures showed an average accuracy increase of 11.2 percentage points when restricted to open-source models and 15.1 percentage points when using a mix of open-source and closed-source models. On coding tasks, unit test generation and evaluation raised Pass@1 from 17.9% to 29.3%, which is reported as a 56% improvement. That Archon beats single-call models even when constrained to open-source components highlights the efficacy of its layered approach.

The key results show that Archon achieves state-of-the-art performance in various domains by integrating multiple inference-time techniques. For instruction-following tasks, adding multiple layers of generation, ranking, and fusion significantly improved the quality of responses. On reasoning tasks like MixEval and MATH, incorporating verification and unit testing methods led to an average increase of 3.7 to 8.9 percentage points when applying task-specific architectures. For coding challenges, the framework combined extensive sampling with unit test generation to produce accurate and reliable outputs.
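The idea of pairing sampling with unit tests can be sketched as follows: sample several candidate programs, run each against a set of tests, and keep the candidate that passes the most. This is a simplified illustration, not Archon’s implementation. The candidate functions and the tests are hard-coded here, whereas in the setting described above both would be produced by LLMs.

```python
from typing import Callable, List, Tuple

Candidate = Callable[[int], int]
UnitTest = Tuple[int, int]  # (input, expected output)

# Hypothetical sampled programs for the task "return the square of x".
def cand_double(x: int) -> int:
    return x * 2  # wrong, but passes some tests by coincidence

def cand_square(x: int) -> int:
    return x * x  # correct

def cand_crash(x: int) -> int:
    raise ValueError("bad sample")  # a sample that fails at runtime

def passes(candidate: Candidate, test: UnitTest) -> bool:
    """Run one test; any exception counts as a failure."""
    arg, expected = test
    try:
        return candidate(arg) == expected
    except Exception:
        return False

def select_by_tests(
    candidates: List[Candidate], tests: List[UnitTest]
) -> Tuple[Candidate, int]:
    """Return the candidate that passes the most tests, with its pass count."""
    scored = [(sum(passes(c, t) for t in tests), c) for c in candidates]
    best_count, best = max(scored, key=lambda pair: pair[0])
    return best, best_count

# Stand-in for model-generated unit tests.
tests = [(0, 0), (2, 4), (3, 9)]
chosen, n_passed = select_by_tests([cand_double, cand_square, cand_crash], tests)
```

Note that `cand_double` passes two of the three tests, which is why more samples and more tests both help: weak tests can fail to separate a wrong program from a correct one.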

Key Takeaways from the research on Archon:

Performance Boost: Archon achieves an average accuracy increase of 15.1 percentage points across various benchmarks, outperforming state-of-the-art models like GPT-4o and Claude 3.5 Sonnet.

Diverse Applications: The framework excels in instruction-following, reasoning, and coding tasks, showing versatility.

Effective Inference-Time Techniques: Archon provides superior performance in all evaluated scenarios by combining techniques such as ensembling, fusion, ranking, and verification.

Improved Coding Accuracy: Archon achieved a reported 56% boost in Pass@1 on coding tasks by leveraging unit test generation and evaluation methods.

Scalability and Modularity: The framework’s modular design allows it to adapt easily to new tasks and configurations, making it a robust tool for LLM optimization.

In conclusion, Archon addresses the critical need for an automated system that optimizes LLMs at inference time by effectively combining various techniques. This research provides a practical solution to the complexities of inference-time architecture design, making it easier for developers to build high-performing LLM systems tailored to specific tasks. The Archon framework sets a new standard for optimizing LLMs. It offers a systematic and automated approach to inference-time architecture search, demonstrating its ability to achieve top-tier results across diverse benchmarks.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.
