Published on: 2026-02-28
This guide demonstrates how to integrate Ollama with FastAPI and a self-hosted n8n workflow, enabling AI-powered question answering using data scraped from Shopify documentation.
This second version separates the requests: /embed for embedding and /ask for response.
You can read this first version with a single request
π Overview
The workflow leverages n8n for orchestrating tasks, FastAPI for processing requests, and Ollama for local AI model inference.
In this setup, FastAPI handles a single request that performs both embedding generation and chat-based answering, using search results gathered via the Serper API.
This version is similar to the previous one, but it introduces Step 9 for response and uses Step 8 for embedding.
βοΈ Workflow Steps
1 Trigger via Webhook
A webhook initiates the process. On this workflow example:
http://localhost:5678/webhook-test/shopify-dev-scrape?question=implement+hydrogen+and+oxygen+for+theme
2 Format Search Query
JavaScript code formats the query to target searches exclusively on the shopify.dev website.
3 Search Shopify.dev
Executes a search query using Serper API (requires a free API key from serper.dev).
4 Get Top Result
Selects only the top search result and returns it as json.url.
5 Fetch Shopify Page
Retrieves the HTML content from the top search result.
6 HTML Extraction
Extracts specific content from the HTML body, saving it as datahtml.
7 HTML File Conversion
Converts datahtml into a file, which will be attached as a document for the βAsk Ollamaβ workflow.
8 Embed
This step sends the processed HTML document to the FastAPI /embed endpoint to create vector embeddings for later retrieval.
Ensure both FastAPI and Ollama are running through ngrok and that the forwarding URLs are correctly configured.
πΉ FastAPI Setup
Python Environment Setup (Using uv):
uv venv
.venv\Scripts\activate
uv pip install -r requirements.txt
requirements.txt
fastapi
uvicorn
torch
openai
ollama
lxml
python-multipart
numpy
embedollama.py
Embed endpoint as new step workflow
# Utility: Chunk text
def chunk_text(text, chunk_size=800, overlap=100):
chunks = []
start = 0
while start < len(text):
end = start + chunk_size
chunks.append(text[start:end])
start += chunk_size - overlap
return chunks
# Utility: Embed chunks
def embed_chunks(chunks, model_name='mxbai-embed-large'):
embeddings = []
for i, chunk in enumerate(chunks):
print(f" βͺ Embedding chunk {i+1}/{len(chunks)}")
response = ollama.embeddings(model=model_name, prompt=chunk)
print(f" β’ Vector: {response['embedding'][:10]}...")
embeddings.append(response["embedding"])
embeddings_tensor = torch.tensor(embeddings)
return embeddings_tensor, chunks
# FastAPI Endpoint: /embed
@app.post("/embed")
async def embed_uploaded_html(file: UploadFile = File(...)):
print("β Reading uploaded file for embedding...")
contents = await file.read()
print("β File read.")
text = re.sub(r'\s+', ' ', contents.decode("utf-8")).strip()
print("β Splitting into chunks...")
chunks = chunk_text(text)
print(f"β Created {len(chunks)} chunks.")
print("β Generating embeddings for chunks...")
embeddings_tensor, chunks = embed_chunks(chunks)
print("β All embeddings generated.")
# Convert tensor to list for JSON serialization
embeddings_list = embeddings_tensor.tolist()
return {
"chunks": chunks,
"embeddings": embeddings_list
}
9 Ask and Response
Sends the userβs question to the FastAPI /ask endpoint to generate an answer using the previously created embeddings.
Ensure both FastAPI and Ollama are running via ngrok with correctly updated forwarding URLs.
# Get relevant context
def get_relevant_context(query, vault_embeddings, vault_content, top_k=3):
print("β Generating embedding for query....")
input_embedding = ollama.embeddings(model='mxbai-embed-large', prompt=query)["embedding"]
print("β Query embedding done.")
input_tensor = torch.tensor(input_embedding).unsqueeze(0)
cos_scores = torch.cosine_similarity(input_tensor, vault_embeddings)
top_k = min(top_k, len(cos_scores))
top_indices = torch.topk(cos_scores, k=top_k)[1].tolist()
return [vault_content[i].strip() for i in top_indices]
# Main chat wrapper
def ollama_chat(
question,
system_message,
vault_embeddings_tensor,
vault_chunks,
conversation_history,
model_name="dolphin-llama3"
):
print("β Getting relevant context...")
relevant_context = get_relevant_context(question, vault_embeddings_tensor, vault_chunks)
print(f"β Got {len(relevant_context)} relevant chunks.")
context_str = "\n".join(relevant_context)
user_input_with_context = context_str + "\n\n" + question if relevant_context else question
# Append current user message to conversation history
conversation_history.append({
"role": "user",
"content": user_input_with_context
})
# Build complete message list including system prompt
messages = [{"role": "system", "content": system_message}] + conversation_history
print("β Sending request to Ollama model...")
try:
response = client.chat.completions.create(
model=model_name,
messages=messages
)
except OpenAIError as e:
print("β OpenAIError:", e)
raise
except Exception as e:
print("β Unexpected Error:", e)
raise
print("β Received response from Ollama.")
# Append assistant's response to history
assistant_reply = response.choices[0].message.content
conversation_history.append({
"role": "assistant",
"content": assistant_reply
})
return {
"relevant_context": relevant_context,
"response": assistant_reply
}
# FastAPI Endpoint: /ask
@app.post("/ask")
async def ask_with_embeds(request: AskRequest):
print("β Received embeddings and chunks for Q&A.")
vault_embeddings_tensor = torch.tensor(request.embeddings)
print(f"β Loaded {len(request.chunks)} chunks and embeddings.")
print("β Calling ollama_chat()...")
conversation_history = []
result = ollama_chat(
question=request.question,
system_message="You are a helpful assistant that extracts the most useful info from uploaded documents.",
vault_embeddings_tensor=vault_embeddings_tensor,
vault_chunks=request.chunks,
conversation_history=conversation_history,
model_name="dolphin-llama3"
)
print("β ollama_chat() complete.")
print("Final Response:", result["response"])
return {
"question": request.question,
"relevant_context": result["relevant_context"],
"response": result["response"]
}
Run the API:
uvicorn embedollama:app --reload
πΉ Ollama Setup
Pull required models:
ollama pull mxbai-embed-large
ollama pull dolphin-llama3
πΉ NGROK Setup
Ollama requires traffic policy configuration to accept requests from the FastAPI server. ngrok is used to tunnel requests securely during development.
Configure ngrok traffic policy (docs):
on_http_request:
- actions:
- type: add-headers
config:
headers:
host: localhost
10 Respond to User
The webhook returns the AI-generated answer based on the question.
Behind the python running
Embedding: /embed endpoint

Ask/Response API: /ask endpoint

π Downloadable n8n Workflow
The n8n JSON file can be downloaded and customized. Update the configuration to match your environment. (N8N Config file)
π TL;DR
Integrating Ollama with FastAPI and a self-hosted n8n workflow enables private, local AI pipelines for structured knowledge retrieval and question answering.
This setup is ideal for:
- Scraping and embedding documentation (e.g., Shopify docs) for semantic search
- Building a Q&A system powered by vector embeddings
- Automating knowledge workflows using low-code orchestration with n8n
- Creating domain-specific AI assistants without relying on external APIs
By separating /embed and /ask, this architecture clearly distinguishes between data processing and response generation. It combines local AI privacy, vector-based retrieval, and flexible FastAPI endpoints for building scalable, AI-powered tools.