AI Tools

LLM Tech Stacks

The tools startups actually use to ship LLM products — and why you need less custom infrastructure than you think.

Praveen Ghanta, CEO, Hire Fraction · October 3, 2023 ·8 min read

LLMgenerative AIvector databasesAI infrastructure

What you’ll learn

The six distinct layers of an LLM tech stack and what each one actually does in a production application
Why vector databases are essential for any LLM product that needs to search documents — and which providers dominate the market
The specific reason orchestration frameworks like LangChain see lower adoption than expected, and when they do add value
How to choose between GPT-4, Claude, and open-source models like LLaMA 2 based on your team’s actual constraints
Why image generation, OCR, and speech-to-text are table-stakes for modern LLM apps — and which APIs give you serviceable results fastest

LLM-based products are more than a thin interface on top of GPT-4. A complete stack involves six coordinated layers — and knowing which ones you actually need prevents the two most common mistakes: over-engineering before you have users, and under-engineering once you do.

What use cases are companies actually solving with LLMs?

There is a wide variety of use cases that companies are tackling with generative AI and large language models. Relative to traditional applications, LLM-based products involve dealing with relatively unstructured inputs from users and producing relatively unstructured outputs as well.

In the pre-LLM world, building a useful chatbot was genuinely hard because open-ended questions from users are difficult for computers to parse — they do not conform to any particular structure. Generative AI outputs are meant to resemble human output: while you can generate a CSV from an LLM, you are more likely looking for a generated image or block of text that appears similar to something a person wrote.

Large language models handle the understanding and generation, but a lot of glue is needed in between. Since LLMs have limited context windows — a maximum input length — how do you analyze large inputs? How do you search an existing knowledge base using the power of an LLM? And what about non-text modalities: audio, images, video? These needs create the demand for a deeper stack of technologies.

What are the core building blocks of an LLM tech stack?

The LLM space is evolving rapidly, so let’s define the key layers:

Definition

LLM tech stack: the combination of technologies layered together to power a production application built on a large language model. It typically includes the LLM itself, an orchestration layer, a vector database for semantic search, and supporting components for modalities like speech, images, and document text.

Large Language Models are artificial neural networks trained on vast amounts of text data from the internet. They contain billions to trillions of parameters and work by taking an input text and repeatedly predicting the next token. LLMs handle a variety of natural language tasks: question answering, text generation, summarization, and natural language understanding.

LLM Orchestration Frameworks — such as LangChain, LlamaIndex, and Semantic Kernel — provide a way to manage and control large language models. They can simplify development of LLM-based applications and improve reliability. Your mileage will vary by programming language, since most frameworks prioritize Python first as the language most closely associated with data science.

Vector Databases store data as high-dimensional embeddings — mathematical representations that capture the semantic meaning of a block of text. Comparing embeddings allows efficient lookup of nearest-neighbor results in N-dimensional space, typically powered by k-nearest neighbor indexes. In practice, this means you can find related text based on a user query or document comparison, not just keyword matching.

Image Generation is the task of creating new images from a model. You’ve seen the results: generated movie trailers, world leaders in unusual settings, astronauts on unicorns — all produced by ML algorithms that share underlying technology with LLMs. While these models have historically struggled with rendering text in images, that capability is improving quickly.

OCR (Optical Character Recognition) is the ability to read text from images such as scanned documents or photographs. Modern OCR uses machine learning and computer vision to identify individual characters. Accuracy has greatly improved in the LLM era, and text extracted from images serves as a key input into LLM-based workflows.

Speech-to-Text (and back) converts spoken words to written text, and written text to spoken audio. Converting speech to text provides valuable input for LLM-based workflows. Text-to-speech output is progressing rapidly in terms of realism — advances here now power calling engines that make and take phone calls from real humans.

Which LLM should a startup use in 2024 and beyond?

For a current benchmark comparison, the Chatbot Arena leaderboard on HuggingFace tracks model rankings in real time. GPT-4 has held a consistent lead, while Claude by Anthropic has narrowed the gap significantly. For models that can be run privately on your own infrastructure, LLaMA 2 by Meta is not far behind the closed leaders.

While the options are growing, the best place to start for most teams continues to be OpenAI. It is easy to stand up and get going, and the odds are that your real product challenges lie elsewhere anyway. Once you have traction, you can consider moving in-house or switching providers based on your specific needs — cost, latency, data residency, or model behavior.

Startups focused on solving vertical-specific problems are best served by focusing on those domain problems first, powered by the most accessible model they can access. The list of models available on HuggingFace grows daily, but time spent training or fine-tuning models is better spent after product-market fit, when you are optimizing away the cost of a vendor like OpenAI rather than trying to build a moat with model architecture. For teams exploring what early LLM projects actually look like in practice, this conclusion holds across nearly every project type.

Building an LLM product and not sure where to start?

Get a structured project plan with stack recommendations, story-point estimates, and sprint timelines — free, in minutes, no call required.

Scope Your Project for Free

No call required. Takes a few minutes.

Are LLM orchestration frameworks like LangChain actually worth using?

LangChain, LlamaIndex, Semantic Kernel, and other frameworks promise to make LLM integration easy — except that calling the OpenAI API directly is already fairly easy. Similarly, they make integrating with vector databases like Pinecone or Chroma straightforward, but those integrations are also simple to do manually with their own SDKs.

As a result, frameworks see less adoption than you might expect from their mindshare. That said, they do provide value in two specific ways: first, by structuring code so that future developers can quickly understand and update it; second, by making it easier to swap out the LLM implementation or vector database later without a full rewrite.

While future-proofing is rarely the right priority for an early-stage startup, if your team has a specific sensitivity around developer turnover or implementation lock-in, using a framework is a reasonable hedge. Otherwise, start simple.

How do vector databases power semantic search in LLM applications?

Some form of embedding storage and search is proving essential for most LLM applications. PineconeDB and Chroma are among the leading startups in this space, but it’s worth noting that all major cloud providers — AWS, GCP, Azure — offer managed vector search solutions as well.

The reason vector databases matter: they are the key to semantic search. Want to search a large volume of internal documents? Want to create client-specific content search? What about searching a website’s contents? These use cases are powered by vector embeddings. You chunk a document into small, overlapping sections, send each chunk to an embedding model, and store the resulting vectors. At query time, you embed the user’s question the same way and retrieve the nearest-matching chunks — which are then passed to the LLM as context.

The embeddings capture the LLM’s understanding of the meaning of the text, enabling more accurate search across documents than keyword matching. Note that very short text fragments result in lower-quality embeddings and degraded search quality, and traditional approaches like Elasticsearch can outperform vector search for simple keyword-heavy use cases. For teams building production-grade retrieval systems, learning from real LLM project experience across different retrieval approaches is the fastest way to avoid the common failure modes.

While the examples above focus on text, vector databases also support similarity search across images, video, and other content types — provided a good model for generating vectors from that content.

What image generation and speech tools should LLM builders know about?

Image generation and OCR

Unless you’ve deliberately avoided this topic, you’re familiar with OpenAI’s DALL-E, now in its third substantially improved version. Stable Diffusion (and DreamStudio by Stability AI) is an important competitor that outperforms DALL-E 2 in many instances. MidJourney is another powerful image generation service that excels at artistic imagery — though it is only accessible via Discord, which creates real friction for business integrations.

On the OCR side, Amazon Textract, Google’s Cloud Vision OCR, and Azure’s AI Vision all provide ML-based text extraction as part of their AI services. The state of the art has progressed rapidly, and you should get serviceable results from the major cloud providers — use whichever integrates most easily into your existing infrastructure.

Speech-to-text and text-to-speech

Converting recorded audio streams into text is key for powering LLM-based workflows involving voice, and the reverse is needed to communicate back to users in a near-human voice. Google’s Speech-to-Text API provides industry-leading quality for real-time transcription in over 100 languages. On the synthesis side, startups like ElevenLabs are leading on cloned-voice generation.

Speech-to-text is a well-solved problem at this point — the technology ships on every smartphone and smart speaker. Human speech generation remains a newer frontier: real-time performance still lags in most demos, and extended conversation with a synthesized voice still sounds stilted. That gap is closing quickly.

How do you put an LLM tech stack together for a real product?

Successful LLM-based products are more than a thin wrapper on top of GPT-4. At the same time, they can be much less than a custom-trained model built by a team of ML specialists. By chaining together capabilities from an LLM, an image generator, an OCR library, and a speech processor, you can build applications that were only theoretical in 2022.

The right starting point for most teams is the growing ecosystem of API providers. Progress to open-source HuggingFace models when APIs no longer provide what you need — or once you’ve reached a scale where the vendor cost justifies the engineering investment to replace it. You can certainly build from scratch using PyTorch, Keras, TensorFlow, and similar libraries, but that approach is only warranted if you are directly in the ML infrastructure business. For everyone else, it is likely a poor use of engineering time.

For teams considering whether to hire ML specialists early versus bringing in expert help on a flexible basis, the case for fractional LLM and RAG engineers is worth understanding before you make that headcount decision.

Frequently asked questions

What is an LLM tech stack?

An LLM tech stack is the combination of technologies layered together to build a production application powered by a large language model. It typically includes an LLM (such as GPT-4 or Claude), a vector database for semantic search, an orchestration framework to manage multi-step workflows, and supporting components for modalities like speech, images, and OCR. The stack exists because LLMs alone cannot handle the full range of inputs and outputs a real application requires.

Should a startup use LangChain or build LLM integrations directly?

For most early-stage startups, building directly against the OpenAI API is simpler and faster than adopting a framework like LangChain. Frameworks add value by structuring code for long-term maintainability and making it easier to swap out LLM providers or vector databases later — but that flexibility is rarely the bottleneck before product-market fit. Use a framework if your team has a specific sensitivity around developer turnover or vendor lock-in. Otherwise, start simple and add structure when complexity demands it.

Why do LLM applications need a vector database?

LLMs have limited context windows — they can only process a fixed amount of text at once. Vector databases solve this by storing documents as numerical embeddings that capture semantic meaning, enabling fast retrieval of the most relevant chunks before they are passed to the LLM. This pattern, called retrieval-augmented generation (RAG), is how LLM apps search internal documents, knowledge bases, or large datasets without hitting context limits or requiring expensive fine-tuning.

What is the difference between GPT-4, Claude, and open-source LLMs like LLaMA?

GPT-4 and Claude are closed, API-accessed models from OpenAI and Anthropic respectively — they are the easiest to start with and consistently rank at the top of benchmarks. LLaMA 2 and its successors are open-source models from Meta that can be run privately on your own infrastructure, which matters for data-sensitive use cases or cost optimization at scale. For most startups, the right answer is to begin with OpenAI, then revisit model choice once you have working product and enough volume to justify the switch.

When does it make sense to use a speech-to-text layer in an LLM application?

Any time your application needs to accept spoken input or deliver spoken output — customer support bots, call-handling tools, voice-driven interfaces, or meeting summarization pipelines. Speech-to-text converts audio into text that the LLM can process, and text-to-speech converts LLM output back into natural-sounding audio. The underlying technology is mature and reliable; Google’s Speech-to-Text API handles real-time transcription in over 100 languages. Text-to-speech realism is improving rapidly but still struggles with latency in live conversation scenarios.

How much custom LLM infrastructure does a startup actually need?

Less than most vendors will tell you. A working LLM product can be built entirely from API services — a cloud LLM provider, a managed vector database, and standard cloud storage. Custom model training, self-hosted inference, and bespoke orchestration frameworks are optimizations that make sense after product-market fit and at significant volume. Before that point, time spent building infrastructure is time not spent on product. Start from the API ecosystem and add custom infrastructure only when a specific constraint — cost, latency, data residency, or model behavior — genuinely demands it.

Praveen Ghanta

CEO, Hire Fraction

Praveen Ghanta is a five-time founder and serial entrepreneur. He is the founder of DevHawk.ai, an AI-powered engineering management platform, and Fraction.work, which connects fast-growing companies with top fractional tech and growth marketing talent. Previously, he founded HiddenLevers, a risk analytics platform for wealth management that he bootstrapped from inception to acquisition by Orion Advisor Solutions in 2021, serving thousands of advisors and $600B in assets. He earlier founded SmartWorkGroups, acquired by Intralinks in 2000.

Connect on LinkedIn →

Get started

Get an Instant Project Plan + Cost Estimate

Describe your software or AI project. Get a full scope with story-point pricing, sprint estimates, and a downloadable plan in minutes. No calls, no waiting.