In today’s complex Kubernetes environments, managing and prioritizing vulnerabilities can quickly become overwhelming. With dozens or even hundreds of containers running across multiple services, how do you decide which vulnerabilities to address first?
AI can help here. In this article, we'll share our experience building HAIstings, an AI-powered vulnerability prioritizer, using LangGraph and LangChain, with security enhanced by CodeGate, an open source AI gateway developed by Stacklok.
Too Many Vulnerabilities, Too Little Time
If you’ve ever run a vulnerability scanner like Trivy against your Kubernetes cluster, you know the feeling: hundreds or thousands of common vulnerabilities and exposures (CVEs) across dozens of images, with limited time and resources to address them. Which ones should you tackle first?
The traditional approach relies on severity scores (critical, high, medium, low), but these scores don't account for your specific infrastructure context. For example, a high-severity vulnerability in an internal, non-critical service might be less urgent than a medium-severity vulnerability in an internet-facing component.
We wanted to see if we could use AI to help solve this prioritization problem. Inspired by Arthur Hastings, the meticulous assistant to Agatha Christie’s detective Hercule Poirot, we built HAIstings to help infrastructure teams prioritize vulnerabilities based on:
- Severity (critical/high/medium/low).
- Infrastructure context (from GitOps repositories).
- User-provided insights about component criticality.
- Evolving understanding through conversation.
Building HAIstings With LangGraph and LangChain
LangGraph, built on top of LangChain, provides an excellent framework for creating conversational AI agents with memory. Here’s how we structured HAIstings:
1. Core Components
The main components of HAIstings include:
- k8sreport: Connects to Kubernetes to gather vulnerability reports from trivy-operator (sketched after this list).
- repo_ingest: Ingests infrastructure repository files to provide context.
- vector_db: Stores and retrieves relevant files using vector embeddings.
- memory: Maintains conversation history across sessions.
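To make the first of these concrete, here's a minimal sketch of how a k8sreport-style collector could pull trivy-operator's VulnerabilityReport custom resources using the official Kubernetes Python client. The function name and the summary fields we extract are our own illustration, not HAIstings' exact code:

```python
# Minimal sketch (not HAIstings' exact code): list trivy-operator's
# VulnerabilityReport custom resources and summarize severities per workload.
from kubernetes import client, config

def gather_vulnerability_reports() -> list[dict]:
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()
    reports = api.list_cluster_custom_object(
        group="aquasecurity.github.io",
        version="v1alpha1",
        plural="vulnerabilityreports",
    )
    summaries = []
    for item in reports.get("items", []):
        summary = item.get("report", {}).get("summary", {})
        summaries.append({
            "name": item["metadata"]["name"],
            "namespace": item["metadata"]["namespace"],
            "critical": summary.get("criticalCount", 0),
            "high": summary.get("highCount", 0),
        })
    return summaries
```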
2. Conversation Flow
HAIstings uses a LangGraph state machine with the following flow:
```python
from langgraph.graph import StateGraph, START, END

graph_builder = StateGraph(State)

# Nodes
graph_builder.add_node("retrieve", retrieve)                  # Get vulnerability data
graph_builder.add_node("generate_initial", generate_initial)  # Create initial report
graph_builder.add_node("extra_userinput", extra_userinput)    # Get more context

# Edges
graph_builder.add_edge(START, "retrieve")
graph_builder.add_edge("retrieve", "generate_initial")
graph_builder.add_edge("generate_initial", "extra_userinput")
graph_builder.add_conditional_edges("extra_userinput", needs_more_info, ["extra_userinput", END])
```
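The graph above references a State type and a needs_more_info router that decides whether to keep looping. As a rough sketch of what these can look like in LangGraph (our illustration; HAIstings' actual logic may differ):

```python
# Rough sketch of the state and the conditional router (illustrative only).
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph import END
from langgraph.graph.message import add_messages

class State(TypedDict):
    # Conversation history; add_messages appends new messages instead of
    # overwriting the list on each state update.
    messages: Annotated[list, add_messages]

def needs_more_info(state: State):
    # Loop back for more user context until the user signals they're done.
    # A real router would likely inspect the conversation more carefully.
    last = state["messages"][-1]
    if getattr(last, "content", "").strip().lower() in ("done", "no"):
        return END
    return "extra_userinput"
```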
This creates a loop where HAIstings:
- Retrieves vulnerability data.
- Generates an initial report.
- Asks for additional context.
- Refines its assessment based on new information.
3. RAG for Relevant Context
One of the challenges was efficiently retrieving only the relevant files from potentially huge GitOps repositories. We implemented a retrieval-augmented generation (RAG) approach:
```python
from typing import Dict, List

def retrieve_relevant_files(repo_url: str, query: str, k: int = 5) -> List[Dict]:
    """Retrieve relevant files from the vector database based on a query."""
    vector_db = VectorDatabase()
    documents = vector_db.similarity_search(query, k=k)

    results = []
    for doc in documents:
        results.append({
            "path": doc.metadata["path"],
            "content": doc.page_content,
            "is_kubernetes": doc.metadata.get("is_kubernetes", False),
        })

    return results
```
This ensures that only the most relevant files for each vulnerable component are included in the context, keeping the prompt size manageable.
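For the ingestion side, here's a hedged sketch of what repo_ingest and vector_db could do under the hood: split repository files into chunks, embed them and persist them for similarity search. We use LangChain's Chroma integration and OpenAI embeddings purely for illustration; HAIstings' own VectorDatabase may be implemented differently:

```python
# Illustrative ingestion sketch; ingest_repo and the metadata fields mirror
# the retrieval snippet above but are our own assumptions.
from pathlib import Path

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_repo(repo_path: str) -> Chroma:
    docs = []
    for path in Path(repo_path).rglob("*.yaml"):
        docs.append(Document(
            page_content=path.read_text(),
            # Naively tag every YAML file as Kubernetes for this sketch.
            metadata={"path": str(path), "is_kubernetes": True},
        ))
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return Chroma.from_documents(splitter.split_documents(docs), OpenAIEmbeddings())
```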
Security Considerations
When working with LLMs and infrastructure data, security is paramount. The vulnerability reports and infrastructure files we’re analyzing could contain sensitive information like:
- Configuration details.
- Authentication mechanisms.
- Potentially leaked credentials in infrastructure files.
This is where the open source project CodeGate becomes essential. CodeGate acts as a protective layer between HAIstings and the LLM provider, offering crucial protections:
1. Secrets Redaction
CodeGate automatically identifies and redacts secrets like API keys, tokens and credentials from your prompts before they reach the large language model (LLM) provider. This prevents accidental leakage of sensitive data to third-party cloud services.
For example, if your Kubernetes manifest or GitOps repo contains:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: database-credentials
type: Opaque
data:
  username: YWRtaW4=         # "admin" in base64
  password: c3VwZXJzZWNyZXQ= # "supersecret" in base64
```
CodeGate redacts these values from prompts before reaching the LLM; then it seamlessly unredacts them in responses.
You may be saying, "Hang on a second. We rely on things like ExternalSecretsOperator to inject Kubernetes secrets, so we're safe… right?"
Well, you might be experimenting with a cluster and have a token stored in a file in your local repository or in your current working directory. An agent might be a little too ambitious and accidentally add it to your context, as we’ve often seen with code editors. This is where CodeGate jumps in and redacts sensitive info before it is unintentionally shared.
2. PII Redaction
Beyond secrets, CodeGate also detects and redacts personally identifiable information (PII) that might be present in your infrastructure files or deployment manifests.
3. Controlled Model Access
CodeGate includes model multiplexing (muxing) capabilities that help ensure infrastructure vulnerability information goes only to approved, trusted models with appropriate security measures.
Model muxing allows you to create rules that route specific file types, projects or code patterns to different AI models. For example, you might want infrastructure code to be handled by a private, locally hosted model, while general application code can be processed by cloud-based models.
Model muxing enables:
- Data sensitivity control: Route sensitive code (like infrastructure, security or authentication modules) to models with stricter privacy guarantees.
- Compliance requirements: Meet regulatory needs by ensuring certain code types never leave your environment.
- Cost optimization: Use expensive, high-powered models only for critical code sections.
- Performance tuning: Match code complexity with the most appropriate model capabilities.
Here's an example model muxing strategy for an infrastructure repository (illustrated in the sketch after this list):
- Rule: `*.tf`, `*.yaml` or `*-infra.*` files can be muxed to a locally hosted Ollama model.
- Benefit: Terraform files and infrastructure YAML never leave your environment, preventing potential leaks of secrets, IP addresses or infrastructure design.
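CodeGate manages these routing rules itself; the toy Python sketch below only illustrates the pattern-to-model matching idea and is not CodeGate's implementation or configuration format:

```python
# Toy illustration of mux-style routing (not CodeGate's implementation):
# the first rule whose pattern matches a file decides which model serves it.
import fnmatch

ROUTING_RULES = [
    (("*.tf", "*.yaml", "*-infra.*"), "ollama/llama3"),  # infra files stay local
]
DEFAULT_MODEL = "gpt-4o"  # everything else may go to a cloud-hosted model

def route_model(filename: str) -> str:
    for patterns, model in ROUTING_RULES:
        if any(fnmatch.fnmatch(filename, p) for p in patterns):
            return model
    return DEFAULT_MODEL

assert route_model("main.tf") == "ollama/llama3"
assert route_model("app.py") == "gpt-4o"
```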
4. Traceable History
CodeGate maintains a central record of all interactions with AI models, creating an audit trail of all vulnerability assessments and recommendations.
Configuring HAIstings With CodeGate
Setting up HAIstings to work with CodeGate is straightforward. Update the LangChain configuration in HAIstings:
```python
from langchain.chat_models import init_chat_model

# HAIstings configuration for using CodeGate
self.llm = init_chat_model(
    # Using CodeGate's muxing feature
    model="gpt-4o",  # This will be routed appropriately by CodeGate
    model_provider="openai",
    # API key not needed, as it's handled by CodeGate
    api_key="fake-api-key",
    # CodeGate muxing API URL
    base_url="http://127.0.0.1:8989/v1/mux",
)
```
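Because CodeGate exposes an OpenAI-compatible endpoint, nothing else in the calling code has to change. For instance (a hypothetical prompt, shown only to illustrate the unchanged calling convention):

```python
# The call site is unchanged; CodeGate proxies, redacts and routes the request.
response = self.llm.invoke("Summarize the most urgent CVEs for example-service.")
print(response.content)
```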
The Results
With HAIstings and CodeGate working together, the resulting system provides intelligent, context-aware vulnerability prioritization while maintaining strict security controls.
A sample report from HAIstings might look like:
```markdown
# HAIsting's Security Report

## Introduction

Good day! Arthur Hastings at your service. I've meticulously examined the
vulnerability reports from your Kubernetes infrastructure and prepared a
prioritized assessment of the security concerns that require your immediate
attention.

## Summary

After careful analysis, I've identified several critical vulnerabilities
that demand prompt remediation:

1. **example-service (internet-facing service)**
   - Critical vulnerabilities: 3
   - High vulnerabilities: 7
   - Most concerning: CVE-2023-1234 (Remote code execution)

   This service is particularly concerning due to its internet-facing
   nature, as mentioned in your notes. I recommend addressing these
   vulnerabilities with the utmost urgency.

2. **Flux (GitOps controller)**
   - Critical vulnerabilities: 2
   - High vulnerabilities: 5
   - Most concerning: CVE-2023-5678 (Git request processing vulnerability)

   As you've noted, Flux is critical to your infrastructure, and this Git
   request processing vulnerability aligns with your specific concerns.

## Conclusion

I say, these vulnerabilities require prompt attention, particularly the
ones affecting your internet-facing services and deployment controllers. I
recommend addressing the critical vulnerabilities in example-service and
Flux as your top priorities.
```
Performance Considerations
LLM interactions are slow on their own, so you shouldn't rely on them for real-time, critical alerting, and proxying LLM traffic through CodeGate adds some latency on top. That's expected: these are computationally expensive operations. That said, we believe the security benefits are worth it. You're trading a few extra seconds of processing time for dramatically better vulnerability prioritization, tailored to your specific infrastructure.
Secure AI for Infrastructure
Building HAIstings with LangGraph and LangChain demonstrates how AI can help solve the vulnerability-prioritization problem in modern infrastructure. Combining it with CodeGate ensures that this AI assistance doesn't come at the cost of security: you get intelligent, context-aware guidance without exposing sensitive data, freeing your team to focus on fixing what matters most.
As infrastructure grows more complex and vulnerabilities more numerous, tools like HAIstings represent the future of infrastructure security management.
You can try HAIstings by using the code in our GitHub repository.
Would you like to see how AI can help prioritize vulnerabilities in your infrastructure? Or do you have other ideas for combining AI with infrastructure management? Jump into Stacklok’s Discord community and continue the conversation.