April 15, 2026

Public data was a legitimate foundation for a SaaS business — until LLMs made it free. If you want a defensible data asset in 2026, you need data that was never public in the first place.
For a long time, the playbook worked. Aggregate public data. Transform it. Resurface it in a cleaner, faster, more useful interface. Charge for access.
That playbook is done. Anyone with an LLM and an afternoon can build an app that does the same thing with public data. The aggregation is free. The transformation is automated. The interface takes a weekend. What took a team of engineers months to ship in 2022 ships in a day today. The competitive advantage you spent two years building is now a weekend project for someone else.
The deeper problem is that LLMs were trained on public data: much of the open web, cleaned and indexed. If your product's core value is surfacing, organizing, or summarizing information that exists on the public internet, an LLM can approximate what you do without needing your product at all.
Non-public data is a different story. LLMs have not been trained on it. AI agents cannot go collect it. Competitors cannot replicate it quickly. An a16z analysis from January 2025 reached this conclusion directly: as foundation model capabilities commoditize, the scarce input shifts from the model to the data. Proprietary data becomes more valuable in an LLM world, not less.
If public data is no longer defensible, the question is practical: where do you get data that is not public? There are three paths.
The first path is straightforward. Find someone who holds data you want, and negotiate exclusive access.
This is not a new strategy. Financial data providers, healthcare networks, and logistics companies have been licensing proprietary datasets for decades. What is changing is the urgency. Data holders who previously did not know the strategic value of what they had are starting to understand it. Access is getting more expensive and more contested.
The moat here is contractual, not technical. If you can lock in exclusivity, competitors are simply blocked from building what you built, regardless of their engineering resources. The risk is that exclusivity agreements expire, get renegotiated, or get contested. A contract is a moat only as long as both parties honor it.
The due diligence question before going down this path: who else can this data holder sell to? If the answer is "anyone," your exclusivity is always one renegotiation away from evaporating.
The second path produces the most durable moats. Build an application that generates proprietary data as a natural byproduct of users doing their work.
Every user interaction becomes an asset. Templates, benchmarks, usage patterns, workflow choices — all of it accumulates into a dataset that no competitor can replicate from scratch. The longer users are in the product, the deeper the moat gets. This is the mechanism behind the stickiness of companies like Salesforce and ServiceNow: decades of customer interactions make their AI agents more accurate than any generic alternative.
The critical design question is instrumentation. You have to build the data collection in from the start. Retrofitting it later is painful and produces incomplete datasets. Every product decision — what users can do, where they click, what they submit — is also a data architecture decision. The founders who treat it that way early end up with something that becomes genuinely hard to replicate.
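As a rough illustration of what "building the data collection in from the start" can look like, here is a minimal sketch of an append-only event log. Everything in it is an assumption made for illustration: the `Event` and `EventLog` names, the action names, and the in-memory storage are hypothetical, and a real product would write to durable storage and handle consent and privacy.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class Event:
    """One user action, captured at the moment it happens (hypothetical schema)."""
    user_id: str
    action: str
    properties: dict[str, Any] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


class EventLog:
    """Append-only, in-memory log. A stand-in for durable storage."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def record(self, user_id: str, action: str, **properties: Any) -> Event:
        # Called from every product feature, so the dataset has no gaps later.
        event = Event(user_id=user_id, action=action, properties=properties)
        self._events.append(event)
        return event

    def count_by_action(self) -> dict[str, int]:
        # Simple usage-pattern summary: how often each action occurs.
        counts: dict[str, int] = {}
        for event in self._events:
            counts[event.action] = counts.get(event.action, 0) + 1
        return counts

    def to_jsonl(self) -> str:
        # One JSON object per line, a common format for downstream analysis.
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)


# Illustrative calls with made-up action names and values.
log = EventLog()
log.record("u1", "template_created", template="kitchen_remodel")
log.record("u1", "estimate_submitted", amount_usd=1850)
log.record("u2", "estimate_submitted", amount_usd=2100)
print(log.count_by_action())
```

The design choice worth noting is that `record` is cheap to call and takes free-form properties, so adding instrumentation to a new feature costs one line rather than a schema migration.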
The self-reinforcing loop matters here. More users generate more data. More data produces a better product. A better product attracts more users. A competitor entering late does not just face a feature gap; they face a data gap that compounds over time.
The third path is the most underestimated. Find data that exists but has never been aggregated, and be the first one to gather it.
The example that makes this concrete: local contractor pricing. What does a licensed plumber charge for a water heater replacement in Phoenix versus Nashville? That information exists, scattered across invoices, estimates, and job records held by individual businesses. Nobody has it in one place. It has never been public. But if you build a tool that small contractors use to manage their jobs, and those contractors enter that data as part of their workflow, you accumulate something no one has.
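To make the accumulation step concrete, the sketch below turns individual job records into a city-level price benchmark. The records, field names, prices, and the `median_prices` helper are all invented for illustration; none of the figures are real market data.

```python
from collections import defaultdict
from statistics import median

# Hypothetical job records, as contractors might enter them while managing jobs.
# All prices are made-up placeholder values.
jobs = [
    {"city": "Phoenix", "job_type": "water_heater_replacement", "price_usd": 1800},
    {"city": "Phoenix", "job_type": "water_heater_replacement", "price_usd": 2200},
    {"city": "Phoenix", "job_type": "water_heater_replacement", "price_usd": 2000},
    {"city": "Nashville", "job_type": "water_heater_replacement", "price_usd": 1500},
    {"city": "Nashville", "job_type": "water_heater_replacement", "price_usd": 1700},
]


def median_prices(records, min_samples=2):
    """Median price per (city, job_type), skipping groups with too few records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[(record["city"], record["job_type"])].append(record["price_usd"])
    return {
        key: median(prices)
        for key, prices in grouped.items()
        if len(prices) >= min_samples
    }


benchmarks = median_prices(jobs)
for (city, job_type), price in sorted(benchmarks.items()):
    print(f"{city:10s} {job_type:28s} {price:8.2f}")
```

The `min_samples` threshold reflects the point made above: a benchmark only becomes meaningful once enough contractors in a market are using the tool, which is why adoption comes before the dataset.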
The insight Praveen emphasizes here is that this third path is not primarily an engineering problem. The hard part is not building the database or the aggregation pipeline. The hard part is getting enough people to use the product in the first place. You need distribution. You need a reason for the data holders to participate. The data collection is a consequence of adoption, not a precondition of it.
This makes the third path fundamentally a marketing and distribution exercise. You are not trying to acquire data. You are trying to build a product that people use and to capture the data as a byproduct.
The decision about which path to pursue is not just a data strategy decision. It shapes your entire go-to-market.
Exclusive access deals require enterprise sales and legal resources. They work for companies that can negotiate at that level and stomach the contract risk.
App-generated data works for companies that can get a product into users' hands and keep them there. The product has to deliver standalone value before the data moat materializes. You cannot build a mediocre product and justify it by saying "we're accumulating a dataset." The product has to earn its users.
Grassroots aggregation requires the discipline to treat distribution as the core problem. The temptation is to spend six months on data infrastructure before you have proven that anyone will use the product. That is backwards. Get adoption first. The data follows.
The common thread across all three paths: you cannot buy your way around the need to have something unique. In a world where LLMs can replicate features overnight and scrape public data in hours, the only sustainable advantage is data that was never available to replicate in the first place.
If LLMs were trained on public data, can they replicate my public-data product right now?
Effectively, yes, for most aggregation and summarization use cases. An LLM with web search or a retrieval-augmented generation (RAG) setup can surface, organize, and reformat public information on demand. Products whose core value is purely access to or organization of public data are the most exposed. Products with workflow depth, network effects, or user-generated data have more staying power even if their underlying data is partially public.
What is the difference between a data moat and just having a lot of data?
Volume alone is not a moat. A moat requires that the data is hard for a competitor to replicate — because it comes from exclusive contracts, from users doing work inside your product, or from an aggregation effort that took years of distribution work to build. Large, stale, or easily replicated datasets do not create defensibility. Freshness, uniqueness, and integration depth matter more than raw size.
How do you protect a data moat built from user-generated data?
Three ways: contractual (terms of service that govern data ownership), technical (architecture that keeps data inside your platform rather than exportable in bulk), and product-level (building the product so that the value compounds over time, making switching costly). No single layer is sufficient. Companies with durable user-data moats typically use all three in combination.
Is the grassroots aggregation path realistic for a small team?
Yes, but only if you solve distribution first. A small team that can get a product used by a targeted community (say, independent HVAC contractors, or physical therapists in private practice) will accumulate data that no large competitor could replicate without replicating the distribution. The size of the team is less important than the specificity of the niche and the quality of the adoption motion.
What happens when your exclusive data contract expires?
You either renegotiate, often at a higher price, or you lose the moat. This is the structural risk of the first path. The best mitigation is to use the exclusivity window to build a product and a user base deep enough that the data relationship becomes secondary to the network effects. If users are locked in by habit and switching costs, the moat survives even if the underlying data becomes available to competitors.
Can competitors use AI agents to collect the same non-public data you have?
For some data types, eventually. AI agents can simulate users, fill out forms, and extract information from interfaces not designed to share it. But this is slow, expensive, and often legally contested. More importantly, if your data comes from real users voluntarily entering it into your product as part of their work, agent-based replication is not a viable path. The moat is not just the data; it is the relationship with the people who produce it.
Andreessen Horowitz / a16z. "AI, Data, and the Shifting SaaS Moat." Referenced in Rob Saker, "AI is Eating Enterprise SaaS." Medium. https://medium.com/@rsaker/ai-is-eating-enterprise-saas-1259d352f193
Steven Cen. "AI Killed the Feature Moat. Here's What Actually Defends Your SaaS Company in 2026." Medium. https://medium.com/@cenrunzhe/ai-killed-the-feature-moat-heres-what-actually-defends-your-saas-company-in-2026-9a5d3d20973b
Insignia Business Review. "Is Proprietary Data Still a Moat in the AI Race?" https://review.insignia.vc/2025/03/10/ai-moat/
The Startup Story. "What Is a Data Moat? Definition, Examples & Why It Matters in AI." https://thestartupstory.co/data-moat/
Financial Content / MarketMinute. "The SaaS Awakening: Large-Cap Software Reclaims the Throne as AI Disruption Fears Turn into Monetization Reality." https://markets.financialcontent.com/stocks/article/marketminute-2026-4-3-the-saas-awakening-large-cap-software-reclaims-the-throne-as-ai-disruption-fears-turn-into-monetization-reality