Overview
Prompt playgrounds are too simple
Prompt playgrounds work for single prompts. You edit text, test it, deploy it.
Real applications are not single prompts. A RAG pipeline retrieves documents, ranks them, generates a response. An agent picks tools and chains steps. A classifier runs multiple prompts in parallel and combines results.
These applications have many parameters: retrieval settings, model choices, thresholds, aggregation rules. A prompt playground cannot represent them.
The problem
When subject matter experts want to improve prompts, they need engineers. Every change requires editing code, deploying, testing. The back-and-forth is slow.
Evaluations run on isolated prompts. But a prompt that scores well alone might fail when combined with retrieval or aggregation. You do not know until production.
When something finally works, someone copies the configuration to production code. There is no shared source of truth. Things drift.
Custom workflows
Custom workflows let you build a playground for your actual application.
You wrap your existing code with a thin interface. Agenta calls your code and shows a UI for the configuration. Subject matter experts edit prompts and parameters. They see real outputs. Engineers own the business logic.
This gives you:
- SMEs iterate on the real workflow, not an approximation
- Evaluations test the complete pipeline
- Configuration is versioned
- Production uses the same configuration that was tested
How it works
You run your own server with the Agenta SDK. Agenta calls your server with the current configuration. Your code runs and returns the result.
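A minimal sketch of such a service, assuming the `@ag.route` decorator and `ag.ConfigManager.get_from_route` helper from the Agenta SDK docs (exact names may differ by SDK version), with the gateway call stubbed out:

```python
import agenta as ag
from pydantic import BaseModel, Field

ag.init()  # reads Agenta credentials (API key, host) from the environment


class Config(BaseModel):
    prompt_template: str = Field(default="Summarize: {text}")
    temperature: float = Field(default=0.2, ge=0.0, le=1.0)


def call_llm(prompt: str, temperature: float) -> str:
    """Stand-in for a call to your own LLM gateway."""
    raise NotImplementedError


# Agenta invokes this route with whatever configuration is currently
# set in the playground. The entry point must return a str.
@ag.route("/", config_schema=Config)
def generate(text: str) -> str:
    config = ag.ConfigManager.get_from_route(schema=Config)
    prompt = config.prompt_template.format(text=text)
    return call_llm(prompt, temperature=config.temperature)
```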
In development, you run the service on your infrastructure and expose it (using ngrok or similar) so that subject matter experts can use the playground. Agenta sends requests to your service with the configuration. Your service runs the business logic and calls your LLM gateway.
In production, you fetch the configuration from Agenta and run independently. Agenta is not in your request path. See Integration for details.
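The production side, in sketch form. The `ConfigManager.get_from_registry` helper follows the pattern in the Agenta docs, and the app and environment slugs here are placeholders:

```python
import agenta as ag

ag.init()

# Fetch the deployed configuration, e.g. at startup or per request.
# "my-app" and "production" are placeholder slugs.
config = ag.ConfigManager.get_from_registry(
    app_slug="my-app",
    environment_slug="production",
)

# From here on, Agenta is out of the request path: your code runs the
# pipeline with the fetched parameters and calls your own LLM gateway.
```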
What you can configure
The SDK provides types for common parameters:
- PromptTemplate: Messages, model, temperature, and other LLM settings
- Floats and integers: Thresholds and limits, with range validation
- Booleans: Flags and toggles
- Choices: Dropdowns for model selection or algorithm variants
You define configuration as a Pydantic model. Agenta renders UI controls for each field.
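A sketch of a config model that combines these types. The import path for `PromptTemplate` and the `MCField` choices helper are assumptions taken from the Agenta docs; check your SDK version:

```python
import agenta as ag
from pydantic import BaseModel, Field

# Assumed import path; may differ by SDK version.
from agenta.sdk.types import MCField, PromptTemplate


class RAGConfig(BaseModel):
    # Rendered as a prompt editor (messages, model, temperature, ...)
    prompt: PromptTemplate = PromptTemplate(
        system_prompt="Answer using only the provided context."
    )
    # Floats and integers with bounds become validated numeric inputs
    top_k: int = Field(default=5, ge=1, le=20)
    score_threshold: float = Field(default=0.7, ge=0.0, le=1.0)
    # Booleans become toggles
    rerank: bool = Field(default=True)
    # Choices become dropdowns
    retriever: str = MCField(default="bm25", choices=["bm25", "dense", "hybrid"])
```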
FAQ
What languages are supported?
Python only for now.
Does my function need to return a specific type?
Yes. The entry point function must return a str.
Does Agenta host my workflow?
No. You run your own server. Agenta connects to it via the URL you provide.
How long does setup take?
About an hour with existing code. Less with a template.
Can I use my existing framework?
Yes. LangChain, LlamaIndex, Haystack, or plain Python. Any model provider.
What about my LLM gateway?
Your code calls your gateway directly. Agenta manages configuration. It does not proxy LLM calls.