New Views on the Future of Data
Six years ago I wrote two pieces on the future of data: one on three directions for data development technology, one on three directions for data products. Both were broad directional bets, made against the technology stack I could see in 2020.
Looking back today, almost all of those directions came true, but none arrived by the path I had imagined. More importantly, all six rested on one unspoken premise: that the consumer of data is a human. That premise has been shaken.
Time to revise.
Looking Back
In the order of the two original pieces.
Three Directions for Data Engineering
Stream-batch unification. Came true, and no longer discussed as a standalone topic. With Lakehouse plus table formats like Iceberg / Hudi / Paimon, the physical boundary between streaming and batch was naturally erased. Storage unification is mature. With the rapid growth of coding agent capabilities, whether stream and batch code are unified no longer matters. It’s not a topic worth singling out anymore.
Code automation. The direction was right, but the path I imagined back then was off. I was looking at the Dataphin route: visual modeling plus config-driven code generation. Today people write SQL and code with coding agents, and auto-optimization is increasingly pushed down into the engines themselves. Low-code didn’t die, but it’s no longer mainstream. Even low-code vendors have added AI assistants, replacing “drag to build a data warehouse” with “describe what you need in natural language.”
The decline of OLAP Cubes. Came true. Lakehouse plus MPP columnar engines became the de facto standard, and precomputed Cubes were essentially retired in most scenarios. As engines like StarRocks / Doris matured, querying detail tables directly became faster than precomputed aggregation in most cases. My earlier concern that “this would be hard on the business side, needing BI tools to evolve” turned out to be overcautious. Agents digested that layer directly.
Three Directions for Data Products
BI / low-code for building data products. Still valid today, but the BI entry point is being absorbed further by Agents. Needs that used to become dashboards are increasingly resolved in conversation. Over the past two years, the headline updates from Tableau and Power BI have been AI Copilots rather than new core BI features, which is a signal in itself. BI isn’t dead, but it has moved from the main entry point to the fallback.
Data products and business products merging into one. Came true, and pushed further than I had imagined. Back then I pictured “embedding diagnosis and SOPs inside the product.” Today Agents fetch the data, draw the conclusion, and call the business tools themselves; the “data product” shell often no longer exists as a separate layer. Claude Code and Cursor, this wave of coding agents, are the earliest examples of the pattern. Engineers no longer open three panels (code search, docs, Slack) to decide what to write. They just ask the AI. The pattern is spreading from coding to every scenario where data informs a decision.
Interactive and conversational analysis. Came true, but by a path I had not imagined. I had pictured three layers: first, a natural language understanding layer to translate spoken questions into structured queries; second, a controlled domain vocabulary that forces user questions onto predefined metrics and entities; third, a semi-structured semantic layer on top. What actually happened is that LLMs absorbed the first two layers outright, and the rigid ontology was replaced by layered context engineering, which is far more flexible.
In short, almost all six directions came true. But there is one thing I completely failed to anticipate: every path I imagined assumed the consumer of data was a human, so the goals were to make data faster to query, clearer to read, and more complete. Once LLMs arrived, data gained a new kind of consumer: the Agent.
New Directions
A fresh look at where things are heading.
Data Engineering
First, Context engineering becomes the top-priority work in building a data foundation. Data work used to be organized around “data warehouse modeling + metadata governance + metric systems,” all in service of letting humans get data quickly and accurately. In the Agent era, the core work is making the business intuition that used to live in people’s heads explicit, as semantic assets that Agents can read. There is no standard methodology for this yet; many teams are still figuring it out. One will emerge over time.
Second, the data interface expands from query languages to capability units. Data teams used to expose SQL, APIs, and dashboards. What Agents call today are Skills and Tools, where a single call carries query capability plus business semantics. Standardization, cross-scenario reuse, and version management of Skills / Tools will become core engineering work for the data platform. Since we opened our data platform tooling as a CLI this year, we have built up a set of high-frequency Skills (data retrieval, audience segmentation, user profiling, behavior sequences, A/B evaluation, and others) that are reused at scale across multiple business lines. This paradigm is already the de facto standard.
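To make “a single call carries query capability plus business semantics” concrete, here is a minimal sketch of a versioned Skill registry. Everything in it is an assumption made for illustration: the `Skill` and `SkillRegistry` names, the `daily_active_users` Skill, the in-memory `EVENTS` table, and the `test_` account convention are hypothetical, not our platform's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical in-memory event log standing in for a real warehouse table.
EVENTS = [
    {"user_id": "u1", "day": "2026-01-01"},
    {"user_id": "u2", "day": "2026-01-01"},
    {"user_id": "test_01", "day": "2026-01-01"},
    {"user_id": "u2", "day": "2026-01-02"},
]


@dataclass(frozen=True)
class Skill:
    """A capability unit: a query function packaged with its business semantics."""
    name: str
    version: str                 # "major.minor.patch"
    description: str             # what an Agent reads to decide whether to call it
    caveats: Tuple[str, ...]     # business rules returned alongside every result
    run: Callable[..., List[dict]]


def _version_key(version: str) -> Tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


class SkillRegistry:
    """Stores every version of every Skill so callers can pin one or take the latest."""

    def __init__(self) -> None:
        self._skills: Dict[Tuple[str, str], Skill] = {}

    def register(self, skill: Skill) -> None:
        key = (skill.name, skill.version)
        if key in self._skills:
            raise ValueError(f"{skill.name}@{skill.version} is already registered")
        self._skills[key] = skill

    def get(self, name: str, version: Optional[str] = None) -> Skill:
        if version is not None:
            return self._skills[(name, version)]
        candidates = [s for (n, _), s in self._skills.items() if n == name]
        if not candidates:
            raise KeyError(name)
        return max(candidates, key=lambda s: _version_key(s.version))

    def call(self, name: str, version: Optional[str] = None, **params) -> dict:
        skill = self.get(name, version)
        # The result carries the semantics, not just the rows.
        return {
            "skill": f"{skill.name}@{skill.version}",
            "rows": skill.run(**params),
            "caveats": list(skill.caveats),
        }


def count_all_users(day: str) -> List[dict]:
    users = {e["user_id"] for e in EVENTS if e["day"] == day}
    return [{"day": day, "active_users": len(users)}]


def count_real_users(day: str) -> List[dict]:
    users = {e["user_id"] for e in EVENTS
             if e["day"] == day and not e["user_id"].startswith("test_")}
    return [{"day": day, "active_users": len(users)}]


registry = SkillRegistry()
registry.register(Skill("daily_active_users", "1.0.0",
                        "Distinct users with any event on a day.",
                        ("Includes internal test accounts.",), count_all_users))
registry.register(Skill("daily_active_users", "1.1.0",
                        "Distinct users with any event on a day.",
                        ("Excludes accounts whose id starts with 'test_'.",),
                        count_real_users))

latest = registry.call("daily_active_users", day="2026-01-01")
pinned = registry.call("daily_active_users", version="1.0.0", day="2026-01-01")
```

The point of the sketch is the return shape: the caller gets rows together with the version that produced them and the caveats that apply, so a definition change shows up as a new version rather than a silent shift in numbers.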
Third, evaluation and feedback loops are promoted from an R&D byproduct to infrastructure. Data quality used to rely on validating deliverables plus monitoring and alerting, with people finding and fixing the problems. In the Agent era, quality is a production-system problem: how to build the eval set, how to correct the Agent when it errs, how to fold that correction back into Context so the error does not repeat, and how to share what was learned across scenarios and users. OpenAI’s Evals framework and Anthropic’s evals tooling are early industrial forms of this path. What data teams used to do as “data quality monitoring” has to become “Agent behavior evaluation,” and the two differ from concept to toolchain.
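The loop described here (build an eval set, catch the error, fold the fix back into Context, confirm it does not recur) can be sketched in a few lines. This is a toy under stated assumptions: `toy_agent` is a keyword matcher standing in for a real Agent, the rule format and the `immersion_dau` example are hypothetical, and none of this uses the OpenAI or Anthropic tooling mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (phrase to look for in the question, metric to use)


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_metric: str


def toy_agent(question: str, context: List[Rule]) -> str:
    """Stand-in for a real Agent: the first matching context rule decides the metric."""
    text = question.lower()
    for phrase, metric in context:
        if phrase in text:
            return metric
    return "dau"  # naive default when no rule matches


def run_eval(cases: List[EvalCase], context: List[Rule]) -> List[EvalCase]:
    """Return the cases the agent gets wrong under the given context."""
    return [c for c in cases if toy_agent(c.question, context) != c.expected_metric]


def fold_back(failures: List[EvalCase], reviewed_fixes: Dict[str, Rule],
              context: List[Rule]) -> List[Rule]:
    """Prepend human-reviewed fixes for failed cases so they outrank older rules."""
    new_rules = [reviewed_fixes[c.question] for c in failures
                 if c.question in reviewed_fixes]
    return new_rules + [r for r in context if r not in new_rules]


EVAL_SET = [
    EvalCase("How many active users did we have yesterday?", "dau"),
    EvalCase("How many deeply engaged active users did we have yesterday?",
             "immersion_dau"),
]
context_v1: List[Rule] = [("active users", "dau")]

# Round 1: the second question is answered with the wrong metric.
failures_v1 = run_eval(EVAL_SET, context_v1)

# A reviewer writes one fix per failing question; only these get folded back.
reviewed = {EVAL_SET[1].question: ("deeply engaged", "immersion_dau")}
context_v2 = fold_back(failures_v1, reviewed, context_v1)

# Round 2: rerun the whole set to check the fix and guard against regressions.
failures_v2 = run_eval(EVAL_SET, context_v2)
```

Rerunning the full set after every fix, not just the failed case, is what turns a one-off correction into a feedback loop: the earlier passing case acts as a regression test for the new rule.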
Data Products
First, Agents become the primary form of data products. BI, dashboards, and reports will continue to exist, but the main entry point will give way to a conversational Agent. “Looking at a number” will become the exception; “asking an Agent” will become the default. The core UX of a data product is no longer chart layout but the Agent’s conversational ability and a memory that stays trustworthy over time. Six months after our conversational data Agent went live, simple queries had dropped from 30 minutes to 2 minutes, and complex analyses from 2-3 days to 30 minutes. More telling than the speed, though, is that users started asking questions they had never asked before, because once the cost of asking drops, the density of “hypotheses” goes up.
Second, the boundary of data products extends into business systems, and the Agent closes its own loop. “Data products and business products merging into one” was the early form of this path. Today Agents already run the loop themselves: they look at the data, draw the conclusion, and call the business system. Agent adoption on the business side (operations, ad delivery, customer support) will accelerate, and the data team’s output will increasingly land directly in business execution rather than stopping at an analysis report. We can already see it: growth team Agents run their own “look at data → produce strategy → call the ad system → check results” loop, and customer support Agents handle most standardized tickets automatically. None of this was mature in 2024; it started going into production in 2026. The change is getting close.
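As a rough illustration of the “look at data → produce strategy → call the ad system → check results” loop, the sketch below runs it against a fake ad system. All of it is assumed for the example: `FakeAdSystem`, its square-root response curve, the 10% step, and the budget ceiling are hypothetical and do not describe any real ad platform or our production Agents.

```python
from dataclasses import dataclass
from typing import List


class FakeAdSystem:
    """Hypothetical business system with a toy diminishing-returns response curve."""

    def __init__(self, daily_budget: float) -> None:
        self.daily_budget = daily_budget

    def set_budget(self, daily_budget: float) -> None:
        self.daily_budget = daily_budget

    def cost_per_conversion(self) -> float:
        # Assumed curve: conversions = 2 * sqrt(spend), so unit cost rises with spend.
        conversions = 2 * self.daily_budget ** 0.5
        return self.daily_budget / conversions


@dataclass(frozen=True)
class Round:
    observed_cost: float   # step 1: look at data
    action: str            # step 2: produce strategy
    new_budget: float      # step 3: call the ad system
    cost_after: float      # step 4: check results


def run_loop(ads: FakeAdSystem, target_cost: float, rounds: int,
             step: float = 0.10, max_budget: float = 1000.0) -> List[Round]:
    history: List[Round] = []
    for _ in range(rounds):
        observed = ads.cost_per_conversion()
        action = "decrease" if observed > target_cost else "increase"
        factor = 1 - step if action == "decrease" else 1 + step
        # Guardrail: the loop may never push spend above a fixed ceiling.
        new_budget = min(ads.daily_budget * factor, max_budget)
        ads.set_budget(new_budget)
        history.append(Round(observed, action, new_budget, ads.cost_per_conversion()))
    return history


ads = FakeAdSystem(daily_budget=400.0)
history = run_loop(ads, target_cost=5.0, rounds=20)
```

Two design choices matter more than the toy strategy: every round is recorded so a person can audit what the loop did, and the write path to the business system is bounded by a guardrail the loop cannot override.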
Third, data becomes a core runtime component of the product, not just something in the back office. The data team’s deliverables used to serve the company internally; what users perceived was the recommendation result behind the product UI. AI entry points let users talk to an Agent directly, and what the Agent invokes is the user memory, content understanding, and behavior narratives built by the data team. The data team I lead today runs a user memory pipeline that processes millions of users per day and consumes hundreds of billions of tokens, all on local inference. Data shifts from “the object of after-the-fact analysis” to “the input of real-time reasoning.” The data team’s deliverables now face the user experience directly.
Core Capabilities
Each of the two original pieces ended with three capabilities (for data development: business understanding, depth in working with data, holistic view of the pipeline; for data products: a sense of the business’s goal-evaluation system, the ability to abstract analytical frameworks and action points, an obsession with iteration efficiency). Looking at it today, this list needs revisiting.
Business understanding, from “knowing” to “writing it down.” Saying a data person “understands the business” used to mean they could answer a question in a group chat, and that the business could make a decision off the report they wrote. In the Agent era it means something else: whether you can write that understanding into semantic assets that Agents can read. A concrete example: it used to be enough for a senior analyst to be able to answer “what’s the difference between immersion DAU and DAU”; today they also have to write that definition into a structured file so the Agent picks the right metric automatically. Business understanding used to live in heads; now it has to land in files.
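A minimal sketch of what “landing in a file” could look like for the immersion DAU example. The definitions are assumptions made up for illustration: the article does not define immersion DAU, so the 600-second threshold, the field names, and the sample events below are all hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Hypothetical definitions: the 600-second threshold is invented for this sketch.
SEMANTIC_ASSET = json.loads("""
{
  "metrics": {
    "dau": {
      "meaning": "Distinct users with any activity on the day.",
      "min_seconds_per_day": 1
    },
    "immersion_dau": {
      "meaning": "Distinct users whose total time on the day reaches the immersion threshold.",
      "min_seconds_per_day": 600
    }
  }
}
""")

# Hypothetical sample events.
EVENTS = [
    {"user_id": "u1", "day": "2026-01-01", "seconds": 40},
    {"user_id": "u2", "day": "2026-01-01", "seconds": 500},
    {"user_id": "u2", "day": "2026-01-01", "seconds": 200},
    {"user_id": "u3", "day": "2026-01-01", "seconds": 90},
    {"user_id": "u1", "day": "2026-01-02", "seconds": 700},
]


def compute_metric(name: str, day: str, events: List[dict]) -> int:
    """Count users for a day using only the threshold written in the asset."""
    spec = SEMANTIC_ASSET["metrics"][name]
    seconds_by_user: Dict[str, int] = defaultdict(int)
    for event in events:
        if event["day"] == day:
            seconds_by_user[event["user_id"]] += event["seconds"]
    return sum(1 for total in seconds_by_user.values()
               if total >= spec["min_seconds_per_day"])


def explain_difference(a: str, b: str) -> str:
    """Answer the analyst's question from the file instead of from memory."""
    metrics = SEMANTIC_ASSET["metrics"]
    return f"{a}: {metrics[a]['meaning']} | {b}: {metrics[b]['meaning']}"
```

With the definition in one file, the same asset drives both the computation and the explanation, so the answer an Agent gives about the difference cannot drift from the number it computes.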
Evaluation ability, from “is what I made correct” to “can the system’s errors be caught, and once fixed, stay fixed.” Data quality used to rely on validating the final deliverable: was the SQL correct, was the metric correct, was the report correct. In the Agent era, quality is a production-system problem. The specific evaluation and feedback mechanisms are covered earlier; here I’m only looking at the capability itself: whether you can shift from checking individual deliverables to building an evaluation loop. A concrete case: when our team first defined an eval set for a data Agent, the hardest part wasn’t designing the questions. It was turning “how an analyst judges whether an analysis report is right” from an implicit standard into a computable metric. No one had done this systematically before, so every team now has to reinvent it. How strong this capability is decides whether your Agent system can go to production and whether the business will trust it over the long term.
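One way to picture “implicit standard into computable metric” is a weighted rubric of small, checkable questions an analyst would otherwise ask in their head. The sketch below is only that: the three checks, their weights, and the report fields are hypothetical examples, not the rubric our team uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Check:
    name: str
    weight: float
    passed: Callable[[dict], bool]


def segments_sum_to_total(report: dict) -> bool:
    return sum(report.get("segments", {}).values()) == report.get("total")


def states_metric_definition(report: dict) -> bool:
    return bool(report.get("metric_definition", "").strip())


def conclusion_cites_a_number(report: dict) -> bool:
    return any(ch.isdigit() for ch in report.get("conclusion", ""))


# Hypothetical rubric: the checks and weights are illustrative, not a standard.
RUBRIC: List[Check] = [
    Check("numbers_reconcile", 0.5, segments_sum_to_total),
    Check("definition_stated", 0.3, states_metric_definition),
    Check("conclusion_grounded", 0.2, conclusion_cites_a_number),
]


def score(report: dict, rubric: List[Check] = RUBRIC) -> Dict[str, object]:
    """Run every check and return the weighted score plus per-check results."""
    results = {c.name: c.passed(report) for c in rubric}
    total = sum(c.weight for c in rubric if results[c.name])
    return {"score": round(total, 2), "checks": results}


good_report = {
    "total": 100,
    "segments": {"new": 30, "returning": 70},
    "metric_definition": "DAU: distinct users with any activity on the day.",
    "conclusion": "Returning users drove 70% of activity.",
}
weak_report = {
    "total": 100,
    "segments": {"new": 30, "returning": 60},
    "metric_definition": "",
    "conclusion": "Activity looks healthy.",
}
```

Keeping the per-check results alongside the score matters in practice: a single number says a report is weak, while the failed check names say what to fix and which correction to fold back into Context.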
Agentic thinking, from “made for humans to read” to “designed for the Agent first, then made usable for humans.” People who built data products used to talk about “user experience,” and the user was assumed to be human. Today you have to keep an Agent permanently in mind: when designing any data asset or service, the default starting point is “what will the Agent take away when it reads this,” and human readability comes next. The same semantic document has to be precise enough for the Agent to call correctly and clear enough for a person to read. For example, a metric document used to follow the template “metric name + business meaning + calculation formula + usage notes.” Written today, it also needs “synonyms / applicable scenarios / non-applicable scenarios / historical definition changes.” Those additions are not there for human readers. They keep the Agent from going wrong when a request is ambiguous.
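The extended template can be sketched as a data structure plus the lookup an Agent would run against it. This is a hedged illustration: the `MetricDoc` fields mirror the template above, but the `resolve` logic, the scenario names, and the sample definitions and dates are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class MetricDoc:
    # The original human-facing template.
    name: str
    business_meaning: str
    formula: str
    usage_notes: str
    # Agent-facing additions.
    synonyms: List[str] = field(default_factory=list)
    applicable: List[str] = field(default_factory=list)
    not_applicable: List[str] = field(default_factory=list)
    history: List[Tuple[date, str]] = field(default_factory=list)  # (effective from, change)


def resolve(term: str, scenario: str, docs: List[MetricDoc]) -> Optional[MetricDoc]:
    """Return the single metric that fits, or None so the Agent asks instead of guessing."""
    wanted = term.lower()
    matches = [d for d in docs
               if wanted == d.name.lower() or wanted in (s.lower() for s in d.synonyms)]
    allowed = [d for d in matches if scenario not in d.not_applicable]
    preferred = [d for d in allowed if scenario in d.applicable]
    pool = preferred or allowed
    return pool[0] if len(pool) == 1 else None


def definition_as_of(doc: MetricDoc, day: date) -> Optional[str]:
    """Latest definition change in effect on `day`, if any is recorded."""
    in_effect = [(start, note) for start, note in doc.history if start <= day]
    return max(in_effect)[1] if in_effect else None


# Hypothetical documents: names, scenarios, and dates are made up for the sketch.
DOCS = [
    MetricDoc(
        name="dau",
        business_meaning="Distinct users with any activity on the day.",
        formula="COUNT(DISTINCT user_id)",
        usage_notes="Sensitive to bot traffic.",
        synonyms=["active users", "daily actives"],
        applicable=["growth_report"],
        not_applicable=["content_quality_review"],
        history=[(date(2024, 1, 1), "Any event counts as activity."),
                 (date(2025, 7, 1), "Test accounts excluded.")],
    ),
    MetricDoc(
        name="immersion_dau",
        business_meaning="Distinct users who reach the immersion threshold on the day.",
        formula="COUNT(DISTINCT user_id) over users meeting the threshold",
        usage_notes="Threshold is defined by the owning team.",
        synonyms=["active users", "engaged users"],
        applicable=["content_quality_review"],
    ),
]
```

Both documents list “active users” as a synonym on purpose: the scenario fields break the tie where they can, and when they cannot, `resolve` returns `None` so the ambiguity surfaces as a clarifying question rather than a silently wrong number.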