The tools startups actually use to ship LLM products — and why you need less custom infrastructure than you think.
LLM-based products are more than a thin interface on top of GPT-4. A complete stack involves six coordinated layers — and knowing which ones you actually need prevents the two most common mistakes: over-engineering before you have users, and under-engineering once you do.
There is a wide variety of use cases that companies are tackling with generative AI and large language models. Relative to traditional applications, LLM-based products involve dealing with relatively unstructured inputs from users and producing relatively unstructured outputs as well.
In the pre-LLM world, building a useful chatbot was genuinely hard because open-ended questions from users are difficult for computers to parse — they do not conform to any particular structure. Generative AI outputs are meant to resemble human output: while you can generate a CSV from an LLM, you are more likely looking for a generated image or block of text that appears similar to something a person wrote.
Large language models handle the understanding and generation, but a lot of glue is needed in between. Since LLMs have limited context windows — a maximum input length — how do you analyze large inputs? How do you search an existing knowledge base using the power of an LLM? And what about non-text modalities: audio, images, video? These needs create the demand for a deeper stack of technologies.
The LLM space is evolving rapidly, so let’s define the key layers:
LLM tech stack: the combination of technologies layered together to power a production application built on a large language model. It typically includes the LLM itself, an orchestration layer, a vector database for semantic search, and supporting components for modalities like speech, images, and document text.
Large Language Models are artificial neural networks trained on vast amounts of text data from the internet. They contain billions to trillions of parameters and work by taking an input text and repeatedly predicting the next token. LLMs handle a variety of natural language tasks: question answering, text generation, summarization, and natural language understanding.
LLM Orchestration Frameworks — such as LangChain, LlamaIndex, and Semantic Kernel — provide a way to manage and control large language models. They can simplify development of LLM-based applications and improve reliability. Your mileage will vary by programming language, since most frameworks prioritize Python first as the language most closely associated with data science.
Vector Databases store data as high-dimensional embeddings — mathematical representations that capture the semantic meaning of a block of text. Comparing embeddings allows efficient lookup of nearest-neighbor results in N-dimensional space, typically powered by k-nearest neighbor indexes. In practice, this means you can find related text based on a user query or document comparison, not just keyword matching.
Image Generation is the task of creating new images from a model. You’ve seen the results: generated movie trailers, world leaders in unusual settings, astronauts on unicorns — all produced by ML algorithms that share underlying technology with LLMs. While these models have historically struggled with rendering text in images, that capability is improving quickly.
OCR (Optical Character Recognition) is the ability to read text from images such as scanned documents or photographs. Modern OCR uses machine learning and computer vision to identify individual characters. Accuracy has greatly improved in the LLM era, and text extracted from images serves as a key input into LLM-based workflows.
Speech-to-Text (and back) converts spoken words to written text, and written text to spoken audio. Converting speech to text provides valuable input for LLM-based workflows. Text-to-speech output is progressing rapidly in terms of realism — advances here now power calling engines that make and take phone calls from real humans.
For a current benchmark comparison, the Chatbot Arena leaderboard on HuggingFace tracks model rankings in real time. GPT-4 has held a consistent lead, while Claude by Anthropic has narrowed the gap significantly. For models that can be run privately on your own infrastructure, LLaMA 2 by Meta is not far behind the closed leaders.
While the options are growing, the best place to start for most teams continues to be OpenAI. It is easy to stand up and get going, and the odds are that your real product challenges lie elsewhere anyway. Once you have traction, you can consider moving in-house or switching providers based on your specific needs — cost, latency, data residency, or model behavior.
Startups focused on solving vertical-specific problems are best served by focusing on those domain problems first, powered by the most accessible model they can access. The list of models available on HuggingFace grows daily, but time spent training or fine-tuning models is better spent after product-market fit, when you are optimizing away the cost of a vendor like OpenAI rather than trying to build a moat with model architecture. For teams exploring what early LLM projects actually look like in practice, this conclusion holds across nearly every project type.
Get a structured project plan with stack recommendations, story-point estimates, and sprint timelines — free, in minutes, no call required.
Scope Your Project for FreeNo call required. Takes a few minutes.
LangChain, LlamaIndex, Semantic Kernel, and other frameworks promise to make LLM integration easy — except that calling the OpenAI API directly is already fairly easy. Similarly, they make integrating with vector databases like Pinecone or Chroma straightforward, but those integrations are also simple to do manually with their own SDKs.
As a result, frameworks see less adoption than you might expect from their mindshare. That said, they do provide value in two specific ways: first, by structuring code so that future developers can quickly understand and update it; second, by making it easier to swap out the LLM implementation or vector database later without a full rewrite.
While future-proofing is rarely the right priority for an early-stage startup, if your team has a specific sensitivity around developer turnover or implementation lock-in, using a framework is a reasonable hedge. Otherwise, start simple.
Some form of embedding storage and search is proving essential for most LLM applications. PineconeDB and Chroma are among the leading startups in this space, but it’s worth noting that all major cloud providers — AWS, GCP, Azure — offer managed vector search solutions as well.
The reason vector databases matter: they are the key to semantic search. Want to search a large volume of internal documents? Want to create client-specific content search? What about searching a website’s contents? These use cases are powered by vector embeddings. You chunk a document into small, overlapping sections, send each chunk to an embedding model, and store the resulting vectors. At query time, you embed the user’s question the same way and retrieve the nearest-matching chunks — which are then passed to the LLM as context.
The embeddings capture the LLM’s understanding of the meaning of the text, enabling more accurate search across documents than keyword matching. Note that very short text fragments result in lower-quality embeddings and degraded search quality, and traditional approaches like Elasticsearch can outperform vector search for simple keyword-heavy use cases. For teams building production-grade retrieval systems, learning from real LLM project experience across different retrieval approaches is the fastest way to avoid the common failure modes.
While the examples above focus on text, vector databases also support similarity search across images, video, and other content types — provided a good model for generating vectors from that content.
Unless you’ve deliberately avoided this topic, you’re familiar with OpenAI’s DALL-E, now in its third substantially improved version. Stable Diffusion (and DreamStudio by Stability AI) is an important competitor that outperforms DALL-E 2 in many instances. MidJourney is another powerful image generation service that excels at artistic imagery — though it is only accessible via Discord, which creates real friction for business integrations.
On the OCR side, Amazon Textract, Google’s Cloud Vision OCR, and Azure’s AI Vision all provide ML-based text extraction as part of their AI services. The state of the art has progressed rapidly, and you should get serviceable results from the major cloud providers — use whichever integrates most easily into your existing infrastructure.
Converting recorded audio streams into text is key for powering LLM-based workflows involving voice, and the reverse is needed to communicate back to users in a near-human voice. Google’s Speech-to-Text API provides industry-leading quality for real-time transcription in over 100 languages. On the synthesis side, startups like ElevenLabs are leading on cloned-voice generation.
Speech-to-text is a well-solved problem at this point — the technology ships on every smartphone and smart speaker. Human speech generation remains a newer frontier: real-time performance still lags in most demos, and extended conversation with a synthesized voice still sounds stilted. That gap is closing quickly.
Successful LLM-based products are more than a thin wrapper on top of GPT-4. At the same time, they can be much less than a custom-trained model built by a team of ML specialists. By chaining together capabilities from an LLM, an image generator, an OCR library, and a speech processor, you can build applications that were only theoretical in 2022.
The right starting point for most teams is the growing ecosystem of API providers. Progress to open-source HuggingFace models when APIs no longer provide what you need — or once you’ve reached a scale where the vendor cost justifies the engineering investment to replace it. You can certainly build from scratch using PyTorch, Keras, TensorFlow, and similar libraries, but that approach is only warranted if you are directly in the ML infrastructure business. For everyone else, it is likely a poor use of engineering time.
For teams considering whether to hire ML specialists early versus bringing in expert help on a flexible basis, the case for fractional LLM and RAG engineers is worth understanding before you make that headcount decision.
Praveen Ghanta is a five-time founder and serial entrepreneur. He is the founder of DevHawk.ai, an AI-powered engineering management platform, and Fraction.work, which connects fast-growing companies with top fractional tech and growth marketing talent. Previously, he founded HiddenLevers, a risk analytics platform for wealth management that he bootstrapped from inception to acquisition by Orion Advisor Solutions in 2021, serving thousands of advisors and $600B in assets. He earlier founded SmartWorkGroups, acquired by Intralinks in 2000.
Connect on LinkedIn →Describe your software or AI project. Get a full scope with story-point pricing, sprint estimates, and a downloadable plan in minutes. No calls, no waiting.
Scope Your Project for FreeWorking on a data strategy? Talk to a Fraction CTO. → Book an intro call