Enhancing open-source AI and improving data governance

Chart illustrating upwards improvement in open-source AI and data governance highlighted by Databricks in an interview ahead of AI & Big Data Expo Europe.


Ahead of AI & Big Data Expo Europe, AI News caught up with Ivo Everts, Senior Solutions Architect at Databricks, to discuss several key developments set to shape the future of open-source AI and data governance.

One of Databricks’ notable achievements is the DBRX model, which set a new standard for open large language models (LLMs).

“Upon release, DBRX outperformed all other leading open models on standard benchmarks and has up to 2x faster inference than models like Llama2-70B,” Everts explains. “It was trained more efficiently due to a variety of technological advances.

“From a quality standpoint, we believe that DBRX is one of the best open-source models out there and when we refer to ‘best’ this means a wide range of industry benchmarks, including language understanding (MMLU), Programming (HumanEval), and Math (GSM8K).”

okex

The open-source AI model aims to “democratise the training of custom LLMs beyond a small handful of model providers and show organisations that they can train world-class LLMs on their data in a cost-effective way.”

In line with their commitment to open ecosystems, Databricks has also open-sourced Unity Catalog.

“Open-sourcing Unity Catalog enhances its adoption across cloud platforms (e.g., AWS, Azure) and on-premise infrastructures,” Everts notes. “This flexibility allows organisations to uniformly apply data governance policies regardless of where the data is stored or processed.”

Unity Catalog addresses the challenges of data sprawl and inconsistent access controls through various features:

Centralised data access management: “Unity Catalog centralises the governance of data assets, allowing organisations to manage access controls in a unified manner,” Everts states.

Role-Based Access Control (RBAC): According to Everts, Unity Catalog “implements Role-Based Access Control (RBAC), allowing organisations to assign roles and permissions based on user profiles.”

Data lineage and auditing: This feature “helps organisations monitor data usage and dependencies, making it easier to identify and eliminate redundant or outdated data,” Everts explains. He adds that it also “logs all data access and changes, providing a detailed audit trail to ensure compliance with data security policies.”

Cross-cloud and hybrid support: Everts points out that Unity Catalog “is designed to manage data governance in multi-cloud and hybrid environments” and “ensures that data is governed uniformly, regardless of where it resides.”

The company has introduced Databricks AI/BI, a new business intelligence product that leverages generative AI to enhance data exploration and visualisation. Everts believes that “a truly intelligent BI solution needs to understand the unique semantics and nuances of a business to effectively answer questions for business users.”

The AI/BI system includes two key components:

Dashboards: Everts describes this as “an AI-powered, low-code interface for creating and distributing fast, interactive dashboards.” These include “standard BI features like visualisations, cross-filtering, and periodic reports without needing additional management services.”

Genie: Everts explains this as “a conversational interface for addressing ad-hoc and follow-up questions through natural language.” He adds that it “learns from underlying data to generate adaptive visualisations and suggestions in response to user queries, improving over time through feedback and offering tools for analysts to refine its outputs.”

Everts states that Databricks AI/BI is designed to provide “a deep understanding of your data’s semantics, enabling self-service data analysis for everyone in an organisation.” He notes it’s powered by “a compound AI system that continuously learns from usage across an organisation’s entire data stack, including ETL pipelines, lineage, and other queries.”

Databricks also unveiled Mosaic AI, which Everts describes as “a comprehensive platform for building, deploying, and managing machine learning and generative AI applications, integrating enterprise data for enhanced performance and governance.”

Mosaic AI offers several key components, which Everts outlines:

Unified tooling: Provides “tools for building, deploying, evaluating, and governing AI and ML solutions, supporting predictive models and generative AI applications.”

Generative AI patterns: “Supports prompt engineering, retrieval augmented generation (RAG), fine-tuning, and pre-training, offering flexibility as business needs evolve.”

Centralised model management: “Model Serving allows for centralised deployment, governance, and querying of AI models, including custom ML models and foundation models.”

Monitoring and governance: “Lakehouse Monitoring and Unity Catalog ensure comprehensive monitoring, governance, and lineage tracking across the AI lifecycle.”

Cost-effective custom LLMs: “Enables training and serving custom large language models at significantly lower costs, tailored to specific organisational domains.”

Everts highlights that Mosaic AI’s approach to fine-tuning and customising foundation models includes unique features like “fast startup times” by “utilising in-cluster base model caching,” “live prompt evaluation” where users can “track how the model’s responses change throughout the training process,” and support for “custom pre-trained checkpoints.”

At the heart of these innovations lies the Data Intelligence Platform, which Everts says “transforms data management by using AI models to gain deep insights into the semantics of enterprise data.” The platform combines features of data lakes and data warehouses, utilises Delta Lake technology for real-time data processing, and incorporates Delta Sharing for secure data exchange across organisational boundaries.

Everts explains that the Data Intelligence Platform plays a crucial role in supporting new AI and data-sharing initiatives by providing:

A unified data and AI platform that “combines the features of data lakes and data warehouses into a single architecture.”

Delta Lake for real-time data processing, ensuring “reliable data governance, ACID transactions, and real-time data processing.”

Collaboration and data sharing via Delta Sharing, enabling “secure and open data sharing across organisational boundaries.”

Integrated support for machine learning and AI model development with popular libraries like MLflow, PyTorch, and TensorFlow.

Scalability and performance through its cloud-native architecture and the Photon engine, “an optimised query execution engine.”

As a key sponsor of AI & Big Data Expo Europe, Databricks plans to showcase their open-source AI and data governance solutions during the event.

“At our stand, we will also showcase how to create and deploy – with Lakehouse apps – a custom GenAI app from scratch using open-source models from Hugging Face and data from Unity Catalog,” says Everts.

“With our GenAI app you can generate your own cartoon picture, all running on the Data Intelligence Platform.”

Databricks will be sharing more of their expertise at this year’s AI & Big Data Expo Europe. Swing by Databricks’ booth at stand #280 to hear more about open AI and improving data governance.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Tags: ai, ai expo, artificial intelligence, data intelligence platform, databricks, dbrx, ivo everts, large language models, llm, mosaic ai, open source, open-source, unity catalog



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *

Pin It on Pinterest