In my 10 years working in the tech industry, I have seen waves of trends, but few have moved as fast as Generative AI. The numbers back this up: McKinsey reports that by 2025, 67% of telecom respondents expect to use GenAI for more than 30% of their daily tasks.
But statistics are one thing; production is another.
Today, my colleagues and I are seeing a massive rise in RAG (Retrieval-Augmented Generation) solutions. We are moving beyond simple chatbots to building complex systems that “talk” to your internal documentation. Based on the projects I’ve delivered for our customers, from retail to finance, here is a breakdown of the architectures we actually use, the problems we’ve solved, and my honest advice on when not to use GenAI.
Why We Are Betting on RAG
When I talk to clients, I explain that a standard chatbot is like a brilliant improviser who sometimes makes things up. RAG changes that. It forces the bot to look at your specific data before answering.
We have built RAG solutions for everything from internal “Q&A over documentation” tools to customer-facing retail bots where users ask about their specific order status. In my experience, this approach is the most reliable way to get accuracy, domain adaptability, and data privacy at the same time.
The Three Architectures We Use
Over the last year, I’ve overseen deployments across three different environments. Here is how they compare in the real world:
1. The On-Premise / GPU Approach (Llama & Hugging Face)
We built a chatbot at Levi9 using purely internal infrastructure. This was for a customer who needed absolute control.
Here is how it worked:
- What we did: We ran the solution on our own GPUs. We pulled open-source models (like Llama) from Hugging Face and handled the coding ourselves.
- My take: This gives you maximum control. We used a "Prompt Engineer Assistant" layer where we explicitly told the bot: "You are an assistant that answers only from this context and does not talk about anything outside it." It works beautifully, but you have to manage the hardware.
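To make that concrete, here is a minimal sketch of the on-prem pattern, assuming a small sentence-transformers embedding model and an open Llama chat model from Hugging Face. The model names, sample documents, and simple top-1 retrieval are illustrative placeholders, not our exact production setup.

```python
# Minimal on-prem RAG sketch: local embeddings for retrieval, an open Llama model
# for generation, and a system prompt that pins the bot to the retrieved context.
# Model names and the sample documents are placeholders.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")            # runs on our own hardware
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",                 # any open chat model works
    device_map="auto",
)

documents = [
    "VPN access is requested through the internal IT portal.",
    "Expense reports must be submitted within 30 days.",
]                                                             # pre-chunked internal docs
doc_vectors = embedder.encode(documents, convert_to_tensor=True)

def answer(question: str) -> str:
    # Retrieve the most relevant chunk before the model ever sees the question.
    query_vector = embedder.encode(question, convert_to_tensor=True)
    best = int(util.cos_sim(query_vector, doc_vectors).argmax())
    context = documents[best]

    messages = [
        {"role": "system",
         "content": "You are an assistant that answers only from the provided context. "
                    "If the context does not contain the answer, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    result = generator(messages, max_new_tokens=256)
    return result[0]["generated_text"][-1]["content"]          # last message is the reply
```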
2. The Azure Cloud Approach
For one of our retail customers, we deployed a solution using the Microsoft stack.
- What we did: We hosted the UI on an Azure Web App and used Azure Cognitive Search as the vector store for retrieval. We connected this to the OpenAI API with an API key.
- My take: This is often faster to deploy if you are already in the Microsoft ecosystem. You prompt the model similarly to the on-prem version, but the heavy lifting of indexing is managed by Azure.
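For comparison, here is a rough sketch of the same loop on Azure, assuming an existing Cognitive Search index with a "content" field. The endpoint variables, index name, field name, and model are placeholders.

```python
# Sketch of the Azure variant: Cognitive Search handles indexing and retrieval,
# and an OpenAI chat model answers from whatever the index returns.
# Endpoint variables, index name, field name, and model are placeholders.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import OpenAI

search = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="internal-docs",
    credential=AzureKeyCredential(os.environ["SEARCH_KEY"]),
)
llm = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def answer(question: str) -> str:
    # Azure does the heavy lifting of indexing; we just query it.
    hits = search.search(search_text=question, top=3)
    context = "\n\n".join(hit["content"] for hit in hits)

    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided context. "
                        "If the context does not contain the answer, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```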
3. The AWS Approach (Bedrock & Guardrails)
I have also worked with Amazon Bedrock, and there is one feature here that I find particularly useful for corporate clients: Guardrails.
- What we did: Beyond just hooking up a Foundation Model and a Vector Database, we configured guardrails to strictly control behavior.
- My take: In a corporate setting, you can't have a bot going rogue. With AWS, I could easily configure it to say, "I don't give finance advice" or block it from discussing sensitive topics like race or politics. It adds a layer of safety that is essential for enterprise use.
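To illustrate, this is roughly how a pre-configured guardrail gets attached at inference time through Bedrock's Converse API. The model ID and guardrail identifier are placeholders, and the blocked topics themselves (financial advice, sensitive subjects) are defined in the guardrail configuration, not in this code.

```python
# Sketch of calling a Bedrock foundation model with a pre-configured guardrail.
# The guardrail itself (denied topics, content filters) is set up in AWS beforehand;
# here we only reference it by ID and version. All identifiers are placeholders.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")

def answer(question: str, context: str) -> str:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",       # placeholder model ID
        messages=[{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}],
        }],
        # The guardrail screens both the user input and the model output,
        # so off-limits topics are refused before they ever reach the user.
        guardrailConfig={
            "guardrailIdentifier": "your-guardrail-id",          # placeholder
            "guardrailVersion": "1",
        },
    )
    return response["output"]["message"]["content"][0]["text"]
```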
When I Don't Use GenAI
This might sound counter-intuitive coming from an AI proponent, but sometimes GenAI is the wrong tool.
I am currently working on a project involving sensitive financial data. In this specific case, we made the hard decision to not use Generative AI. Why? Because we cannot risk hallucinations. When dealing with bank details and financial reporting, “mostly accurate” isn’t good enough.
In these scenarios, I always pivot back to classical AI and basic statistics. We need models that are mathematically interpretable and verifiable. If you are in a high-risk industry, my advice is to prioritize interpretability over the “cool factor” of GenAI.
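To show what I mean by interpretable, here is a toy sketch of the kind of model we lean on instead, using scikit-learn with made-up features and data. Every coefficient can be read, audited, and explained to a reviewer, which is exactly what a generative model cannot offer.

```python
# Sketch of the "classical AI" alternative: a logistic regression whose behaviour
# is fully described by its coefficients. Features and data here are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["transaction_amount", "num_prior_incidents", "account_age_days"]

# Toy training data standing in for real, anonymized financial records.
X = np.array([[120.0, 0, 900],
              [5400.0, 3, 30],
              [75.0, 0, 2000],
              [9800.0, 5, 12]])
y = np.array([0, 1, 0, 1])          # 0 = normal, 1 = flagged for review

model = LogisticRegression(max_iter=1000).fit(X, y)

# Every decision is traceable: one weight per feature, no hidden generation step.
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:+.4f}")
print(f"intercept: {model.intercept_[0]:+.4f}")
```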
Challenges & My Recommendations
If you are ready to build, here are a few hurdles I’ve faced and how I suggest you handle them:
- Security is Paramount: For one of our projects, we had to strictly anonymize named entities (names, locations, bank details) before the data ever touched the model; a rough sketch of that step appears after this list. I also insist on data residency: if you are in Europe, make sure your data doesn't leave the EU so you stay GDPR compliant.
- Scalability via Quantization: If you are running your own models, you will run into compute limits. We use quantization, essentially reducing the numerical precision of the model weights, to make massive models run efficiently on smaller GPUs (see the second sketch after this list).
- Prompt Caching: Don't waste money re-processing the same context. We use prompt caching so that when a user asks a follow-up question, the already-processed context is reused instead of being sent through the model, and billed, all over again.
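On the anonymization point, the idea looks roughly like this with spaCy's named-entity recognizer. The entity labels and placeholder scheme are simplified assumptions, and real bank details need additional pattern-based rules on top.

```python
# Sketch of pre-model anonymization: replace named entities with neutral placeholders
# before any text is sent to the LLM. The labels kept here are a simplified assumption;
# real pipelines add regex rules for IBANs, card numbers, and similar identifiers.
import spacy

nlp = spacy.load("en_core_web_sm")                  # small English NER model
SENSITIVE_LABELS = {"PERSON", "GPE", "LOC", "ORG"}

def anonymize(text: str) -> str:
    doc = nlp(text)
    redacted = text
    # Replace entities right-to-left so character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in SENSITIVE_LABELS:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

# Typically prints something like "[PERSON] from [GPE] asked about her [ORG] account."
# (exact tags depend on the NER model used).
print(anonymize("Maria Petrova from Novi Sad asked about her Raiffeisen account."))
```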
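And on quantization, the loading pattern we rely on looks roughly like this with the Hugging Face bitsandbytes integration. The model ID is a placeholder, and 4-bit NF4 with bfloat16 compute is just one common recipe.

```python
# Sketch of loading a large model in 4-bit precision so it fits on a smaller GPU.
# The model ID is a placeholder; NF4 weights with bfloat16 compute is one common recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit...
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # ...but run the math in bfloat16
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # spread layers across available GPUs
)
# The quantized model is a drop-in replacement for the full-precision one.
```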
The Playground is Open: It’s Time to Experiment
If there is one thing my decade in tech has taught me, it’s that you cannot learn this technology from slides alone. We are in a unique moment where the barrier to entry is incredibly low. Whether you are an individual developer or a CTO, you have access to the same powerful tools we use—Amazon Bedrock, Azure OpenAI, and the endless library of models on Hugging Face.
My advice? Don’t wait for the “perfect” use case.
Go explore. Spin up a sandbox environment. Pull a model from Hugging Face and try to break it. See what happens when you feed it your own data. Test the limits of what these architectures can do. The landscape is shifting under our feet, and the only way to stay relevant is to be the one testing the ground.
We are just scratching the surface of what RAG can do. I’ve shared my map of the territory—now it’s time for you to start walking the path.
Data Tech Lead, Levi9