Enterprise GenAI Has a Big Dirty Little Secret
You might have missed the news last week, when Snowflake announced the public beta of its new product designed to augment or replace business intelligence and analytics tools like Salesforce’s Tableau, Sigma, Omni and Looker with generative AI conversations.
That’s a big promise – and a huge market.
Per VentureBeat, Snowflake Cortex Analyst is “a fully managed service that provides businesses with a conversational interface to talk to their data. All users have to do is ask business questions in natural language prompts, and the agentic AI system handles the rest, from converting text prompts into SQL and querying the data; to running checks, and providing answers in the form of metrics, tables and graphs.”
I haven’t seen this LLM-based front end from other cloud or data platform providers yet, though standalone analytics tools like ThoughtSpot deserve a shout-out for leading the way. We’re talking about running useful, accurate natural language queries against large-scale, private and secure enterprise data sets.
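To make the shape of that flow concrete, here is a minimal, self-contained sketch of a natural-language-to-SQL loop of the kind described above. Everything in it is illustrative: the generate_sql() stub stands in for a real LLM call, SQLite stands in for the warehouse, and none of the names correspond to Cortex Analyst’s actual API.

```python
# A minimal sketch of the conversational-analytics loop: question in, SQL out,
# results back. Hypothetical names throughout; generate_sql() stands in for a
# real LLM call and SQLite stands in for the cloud data warehouse.
import sqlite3

SEMANTIC_CONTEXT = """
Table orders(order_id, rev_1, rev_2, region)
-- rev_1: gross revenue in USD; rev_2: net revenue in USD
"""

def generate_sql(question: str, context: str) -> str:
    """Stand-in for the LLM step: a real system would prompt a model with the
    semantic context and the user's question and get SQL back."""
    # Hard-coded for the demo question below.
    return "SELECT region, SUM(rev_2) AS net_revenue FROM orders GROUP BY region"

def ask(question: str, conn: sqlite3.Connection) -> list:
    sql = generate_sql(question, SEMANTIC_CONTEXT)
    return conn.execute(sql).fetchall()  # "metrics, tables and graphs" reduced to rows

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders(order_id, rev_1, rev_2, region)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                     [(1, 120, 100, "EMEA"), (2, 250, 210, "AMER")])
    print(ask("What is net revenue by region?", conn))
```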
Applying generative AI to business planning and operations has the potential to radically transform the way we work, from probabilistic forecasting to real-time decisioning. I can imagine workflows that start with humans asking the questions to identify opportunities… evolving into workflows where models do all of the work, with approval checkpoints or perhaps simply contingency guardrails for when automated results underperform plans and budgets.
But here’s the big dirty little secret when it comes to applying generative AI to your business data: Meaningful generative AI output requires meaningful data. Trustworthy generative AI output requires trustworthy data.
How can you trust your agentic AI data analyst if you don’t trust your data?
Snowflake’s Head of AI, who joined the company through the acquisition of his blockchain technology startup, told the publication last week: “In real-world applications, you have tens of thousands of tables and hundreds of thousands of columns with strange names. For example, ‘Rev 1 and Rev 2’ could be iterations of what might mean revenue. [Snowflake] customers can specify these metrics and their meaning in the semantic descriptions, enabling the system to use them when providing answers.”
Per the story, “to ensure the LLM agents behind Cortex Analyst understand the complete schema of a user’s data structure and provide accurate, context-aware responses, the company requires customers to provide semantic descriptions of their data assets during the setup phase” [emphasis added]. “This fills a major problem associated with raw schemas and enables the models to capture the intent of the question, including the user’s vocabulary and specific jargon.”
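The “Rev 1 and Rev 2” example is the heart of it: semantic descriptions translate cryptic physical column names into business meaning the model can actually use. The structure below is invented for illustration only (Snowflake’s real semantic model is its own specification); it simply shows how column-level descriptions become prompt context so that “revenue” resolves to the right column.

```python
# An invented, minimal stand-in for "semantic descriptions": a mapping from
# cryptic physical columns to business meaning, rendered into text a model
# can read before it writes SQL. Not Snowflake's actual semantic model format.
SEMANTIC_MODEL = {
    "orders.rev_1": "Gross revenue in USD, before discounts and returns",
    "orders.rev_2": "Net revenue in USD, after discounts and returns",
    "orders.cust_seg_cd": "Customer segment code: E=Enterprise, M=Mid-market, S=SMB",
}

def semantic_preamble(model: dict) -> str:
    """Render column-level descriptions into prompt text so the model can
    resolve jargon like 'revenue' to the right physical column."""
    lines = [f"- {col}: {meaning}" for col, meaning in model.items()]
    return "Column meanings:\n" + "\n".join(lines)

print(semantic_preamble(SEMANTIC_MODEL))
```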
Augmenting generative AI models with well-defined, structured private data is complicated and expensive. As the Magnificent Seven invest hundreds of billions of dollars in generative AI models and training, and as AI focus shifts from aging public Internet content to private and recent enterprise data, the elephant in the room is data understanding as manifested in the semantic layer, or the lack thereof. Data scientists train powerful models that know everything from Shakespeare to GAAP accounting rules, and then data engineers get mired in the plumbing of retrieving, shaping and labeling the augmentation data to make it both available and understandable for humans and for language models, large and small.
The barrier isn’t the data transport (the “E” and the “L” in ETL); moving data is relatively easy, and ETL tools are generally very good at both. Gaining meaning and trust in the data, however – building the semantic layer of understanding – is what makes LLMs and SLMs hum, and it’s also what can completely bog the whole system down.
Data catalogs exist to help define, label and map data to semantic concepts and entities – but today these governance tools are typically applied during the “last mile” of data flow, through BI tools or aftermarket data catalogs that read structured data and attempt to recreate lineage (the “path”) and transformational logic (the “math”), and to enforce things like data typing during validation (the “proof”) – all long after the data has been moved, modified and modeled in who-knows-what kinds of ways.
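A rough sketch of the alternative: record the “path”, the “math” and the “proof” at the moment a transformation runs, rather than trying to reconstruct them downstream. The record structure and function names here are hypothetical, not any particular catalog’s or pipeline tool’s API.

```python
# Hypothetical sketch: capture lineage, transform logic and validation as a
# side effect of the transformation itself, instead of inferring them later.
from dataclasses import dataclass, field

@dataclass
class TransformRecord:
    source: str                                  # lineage, the "path"
    target: str
    expression: str                              # transform logic, the "math"
    checks: list = field(default_factory=list)   # validation results, the "proof"

def to_net_revenue(row: dict, log: list) -> dict:
    out = {"net_revenue": row["rev_1"] - row["discounts"]}
    log.append(TransformRecord(
        source="raw.orders",
        target="analytics.orders.net_revenue",
        expression="rev_1 - discounts",
        checks=[("non_negative", out["net_revenue"] >= 0)],
    ))
    return out

log: list = []
print(to_net_revenue({"rev_1": 120, "discounts": 20}, log))
print(log[0])
```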
Some say darker forces are at work here, too, as the prevailing modern data stack and this after-the-fact approach to establishing data meaning and trust seem designed to maximize cloud storage and compute costs, not to mention data engineering expenses. The bigger risk to organizations is that the results of AI-generated analysis are wrong, a foregone conclusion if we start with garbage in.
What’s needed is to tackle the complexity of data meaning and trust as data flows from its points of origin into its staging area for genAI models. For Snowflake, that staging area is Snowflake itself, the cloud data warehouse. Data landed and materialized in Snowflake is natively accessible by Cortex Analyst, ready for plain-English prompts to write your SQL and retrieve your answers and reports.
For Snowflake and other data warehouse-native generative AI tools, the industry needs automated tooling to collect, unify, define, label and map genAI-ready data, landing well-defined, useful data right from the source. These tools must be cloud-, data warehouse- and AI model-agnostic to avoid cloud provider lock-in. They must offer low-code/no-code (LCNC) interfaces, allowing central data teams and business teams to collaborate in-tool on business definitions and transformations. And the semantic layer’s data understanding must be portable and flexible, able to inform any and every downstream consumer of the data, both human and machine.
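As a toy illustration of that portability requirement, the sketch below renders one set of business definitions two ways: a human-readable glossary and machine-readable context for a model prompt. The formats are made up for the example; the point is that the definitions live in one place and every downstream consumer, human or machine, reads from it.

```python
# Hypothetical example: one set of semantic-layer definitions, rendered for
# different downstream consumers (documentation for people, prompt context
# for models). The structure is invented for illustration.
import json

DEFINITIONS = {
    "net_revenue": {
        "expression": "rev_1 - discounts",
        "description": "Net revenue in USD, after discounts and returns",
    },
}

def as_glossary(defs: dict) -> str:
    """Human-readable business glossary."""
    return "\n".join(f"{name}: {d['description']} ({d['expression']})"
                     for name, d in defs.items())

def as_prompt_context(defs: dict) -> str:
    """Machine-readable context to hand to a language model."""
    return json.dumps(defs, indent=2)

print(as_glossary(DEFINITIONS))
print(as_prompt_context(DEFINITIONS))
```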
Start with meaningful, trustworthy data – and everyone in your organization can become a data-driven decision maker, relying on meaningful, trustworthy generative AI results.
It’s the perfect time to adopt Reactor.