AI in AppSec: The Paradigm Shift with Brad Geesaman

TLDR;

Thanks Brad for the lovely conversation. Here are a few points which stood out for me:

  • Building and adopting AI systems should start small. Break projects down into smaller, specific tasks and apply AI to those, instead of doing a complete revamp at once.
  • LLMs today are non-deterministic. Define your risk tolerance and optimize around it; demanding perfection will stall both building and adoption.
  • With increased adoption of AI systems, complexity increases while craftsmanship and curiosity take a hit. Use AI systems as an aid, not a replacement.

Transcript

Host: Hi, everyone. This is Purusottam, and thanks for tuning into the ScaleToZero podcast. Today's episode is with Brad Geesaman. Brad has been a pen tester, a sales engineer, a CTF builder, and a cloud and Kubernetes security consultant in his past lives. But right now, he's a principal security engineer at Ghost Security, focused on removing toil in the app and API security space with the help of LLMs.

Thank you so much, Brad, for coming to the podcast.

Brad: Yeah, thanks for having me. Great to be on.

Host: Before we kick off, do you want to add anything about your journey? Maybe how did you get into security? What keeps you still interested in the security space, or anything else you want to add?

Brad: Yeah, I'll keep it brief. I've done a lot of different things, worn a lot of different hats, and AppSec was one of those things that I haven't done a lot of. I've done a lot of the foundational stuff, SOC operations, pen testing and things, but yeah, I'm just really excited about being in a new space for the next five or six years. That's where I want to be, because I think there's a lot of toil and a lot of problems to be solved there.

Host: Yeah. And I mean, one unique thing that you get out of working across multiple areas is a unique perspective, right? Because you see how other parts of security are dealing with security programs, and you can leverage that learning, maybe in AppSec now. So yeah, I'm looking forward to today's discussion on how you connect those worlds and how that helps you as well.

But before we kick start on security questions, one of the things that we ask all of our guests and we get unique answers. So I'm curious, like what does a day in your life look like?

Brad: Every day is very different, but across, we'll say, a sprint, it's probably five things. The percentages change, but it's either product development, so working with our product manager and talking about strategy and what we're doing, or working with engineering to help on implementation or architecture, things like that.

I also talk with a lot of customers, do demos, and get a lot of face-to-face time with the folks that are having the problems, and sometimes do some content creation. But often my time is spent on the R&D side of our business, which is de-risking, exploring, and understanding how new problems that come up can be solved. What are the most efficient ways to solve them, and can LLMs help in some cases? So I wear any number of hats. I'm in meetings quite often, and I try to keep space for content creation and for deep, heads-down R&D work in there as well. So it really is a mixed bag.

Host: Yeah, I can see you're working across so many domains in the company, so hopefully we will be able to touch on some of them. Today's focus is around building solutions in the AppSec space using agentic AI, and using LLMs for real problems that help AppSec folks keep up at scale.

So you have had an extensive career in cloud, cybersecurity, and Kubernetes; you have spent time in all of those areas. Now you're exploring agentic AI. I was at RSA, and there is a lot of focus being given to AppSec using AI, agentic AI, LLMs, and things like that. What inspired you to focus on application security and, more recently, using agentic AI within that domain?

Brad: Yeah, when I joined here at Ghost a couple of years ago, we were focused on API security, and we've moved more broadly into the application security space. And then sort of a confluence of things came together: models got stronger, they had things like tool calling and structured output, and the frameworks started to become more user-friendly, we'll say.

That sort of came together a little while ago, less than a year ago. And then we also started having use cases that fit the strengths of what LLMs can do: a lot of secure code review, a lot of automating of processes and classification and things. Initially I was very hesitant about LLMs for a while, because they weren't doing anything for me specifically. But then I found use cases where they would actually excel.

Then it started to come together. That's where I got excited: okay, if you're very judicious about the use case, you can do some very interesting things and solve some problems. And my life's mission is to rid security of toil and burnout, right? That's just an overarching, broad goal: I want to do things or be in areas where there's pain and toil and try to automate that out, because I get a lot of satisfaction from it.

So that's what inspires me is like, I see an opportunity here to do a lot with the help of some LLMs in the right places.

Host: That's a great mission to help security folks. So when it comes to security, there was the traditional SDLC: you would have CI/CD, DevSecOps, and things like that. And now we are talking about using LLMs and agentic AI for application security. Maybe before we get into how to use agentic AI, can you give our audience a little bit of an idea of what agentic AI is and how you see it being used in application security?

Brad: So, you know, I initially struggled with: what is an agent, and what does that mean? I thought about it, and you know what I did? I tried to explain it to my spouse and my mom. It took a while, but I got to the point of, okay, there's a hybrid explanation here that's a little more technical than that. But fundamentally, when you're interfacing with ChatGPT, you are the agent, right? You're typing the instruction.

You're determining the execution path. You're getting some output, and then your brain is going, yeah, that's a good output, I'm done, I got the result I want. Whereas with agentic AI, you're delegating that to the LLM: to decide what questions to ask, what plan to take, whether the result it got is correct, and then feed that back to you. It's like telling your five-year-old how to make a peanut butter sandwich. You're doing it very, very step by step. But maybe when they're 15, you go, just go make yourself a peanut butter sandwich. They know all the steps, they figure it out, and they execute it on their own. So when it's, what is an agent? You're just delegating aspects of it to the LLM to decide for itself, with pros and cons, of course, that go with that.

But when it comes to AppSec, it's just inserting that in all of the places where there's a human decision: you can opt to have the LLM make part of that decision or all of that decision. So that's where it's a sliding scale, a spectrum of applicability. Sometimes it's really not helping much: a user can do it, or the scale is such that it doesn't matter. But then there are times where there's high scale, a little bit of nuance, and the human could never keep up with it. That might be a great opportunity for an agent to see what problem to solve and then try to solve it for you. And then the user would be able to look at the results and go, thank you for doing all the legwork. Now I can see.

Yeah, I can believe that and trust that.
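To make that delegation concrete, here is a minimal sketch of an agent loop where the model, not the human, picks the next step; call_llm() and the tool functions are hypothetical placeholders rather than any particular framework:

```python
# A toy "agentic" loop: the LLM, not the human, decides which step runs next.
# call_llm() and the tool functions are hypothetical stand-ins for a real model
# client and real AppSec tooling.
import json

def call_llm(messages):
    """Placeholder for a chat call that returns JSON like
    {"action": "probe_host", "args": {...}} or {"action": "done", "summary": "..."}."""
    raise NotImplementedError

TOOLS = {
    "lookup_domain": lambda domain: ["app.example.com", "api.example.com"],
    "probe_host": lambda host: {"host": host, "alive": True, "status": 200},
}

def run_agent(goal, max_steps=10):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        decision = json.loads(call_llm(messages))
        if decision["action"] == "done":
            return decision["summary"]
        # The model picked the tool and its arguments; we just execute and feed back.
        result = TOOLS[decision["action"]](**decision["args"])
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Stopped: step budget exhausted"
```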

Host: So that's a great way to get a better understanding of agentic AI and also validate it. You spoke with your wife and mom. When it comes to product building, there is a similar test that folks do: there is a book called The Mom Test, where you pitch your idea, dumb it down, and test whether folks are able to grasp it.

So I see you trying something similar when it comes to agentic AI.

So one follow-up question to that is, how do you see it differing from traditional AI? One of the things that you mentioned is that maybe you can automate some part of, let's say, decision-making or some part of validation and things like that.

To give an example: we use GitHub. So if I go to GitHub and raise a PR, now I can request a review with Copilot. Copilot does some of the review and gives me some findings, which maybe regular reviewers could catch as well, but still, it's automated; some of that burden is gone from our side, right? So how do you see traditional AppSec differing from agentic AI? And if you can give an example, that would be great.

Brad: Yeah, I think, you know, with traditional AppSec, the tools and the processes... first of all, AppSec in general is a world full of nuance and patterns that aren't quite broadly applicable. There are so many subtle things, because code itself can be written in any number of forms and still have the same intent or the same outcome. That makes for high variability.

So tooling around that, typically traditional tooling, naturally ends up forcing that nuance onto the user. It's like: I think this could be something. This might be something. Do you want to take a look?

And often it requires understanding of the code structure and expertise to be able to make a good review of that. And where I see agentic AI really help is maybe you don't know everything about Java. You're not proficient in Java, but you're responsible for securing or being part of the security processes for this Java based app.

What are you basing that on if you don't have a foundational set of knowledge? So leaning on something like an LLM and asking, what do you think of this code? What is it doing? Tell me in human terms. That can give you a leg up right there: okay, now I at least know what it's doing, what its intent is. And then you can ask some questions about it.

And this could even be an agent doing that, sort of an adversarial agent going, where does this code go wrong, or where is it getting it right? That can do that rubber-ducking for you as part of the process: what about this? What about that? And just bolster your understanding.

I see agents as a way to give you table stakes, or broad coverage of expertise, and fill in the pockets where a human just cannot be an expert in every single situation and every code best practice. It's like having the top five things and a good understanding across every language, and you're going to be that much more effective as an AppSec engineer.

So you can apply it to fill in the gaps, fill in your understanding, and help support you. You can stay at a higher altitude and accomplish the goal, and not get buried in the weeds in a lot of those situations where you don't know exactly what's going on.

Host: You are spot on. When it comes to a new language, you might know the programming constructs, but writing them in a different language could be tricky, right? And I think the other part is legacy code, where you lack the knowledge, right? Then sometimes, as you mentioned, you can use LLMs to explain: hey, explain this code to me. It does a decent job of explaining what exactly someone is trying to achieve in the code. And I have seen that as part of, let's say, the PR process also. You can say, summarize my PR, and it does a decent job of describing what we are trying to achieve in the PR. So that definitely helps human reviewers when they are going through the review process.

So one of the questions that we got from Vaibhav Gupta around this is: how do you see the secure coding paradigm changing in the realm of AI, now that we are talking about agentic AI?

Brad: Yeah, there are a couple of things happening here that AI enables, and that also, not hurts, but exacerbates the challenges, right? It lowers the barrier to entry to be able to write code. You can vibe code; I don't like that term, I prefer supported or augmented code, or copiloted code. But it's like: I'm trying to do something. Help me get started. Help me structure my project.

It lowers that barrier to entry of, what's the way to just get set up, get your Makefile, get your repo and all those things. For early beginners, that can help them get a jumpstart and get that confidence. But also, there are more developers now, so there's more code being shipped, and it's being shipped faster. That means there's more code to validate and secure and understand and make sure it's risk-assessed, right?

So that's where it sort of exacerbates the problem. But back on the pro side, I see, personally, that if I'm using it, I'm in the flow state longer because I'm automating some of the boilerplate around it. Say I write an initial test harness for something, and I'm adding to the code, and I go, add me some test cases around what we just did.

It'll get me 80% or 90% of the way, and I'll be able to tweak and adjust, and I'm back into: what do I want the code to do? It helps me stay on the more creative, more flow-state side, as opposed to getting wrapped up in the stuff that slows you down and takes the fun out of producing or shipping a feature. But yeah, overall it's more code, more developers, more things to secure. That's kind of it.

Host: Yeah, I think recently a report came out showing that even non-programmers can program now, right? The way of programming has changed: instead of learning, let's say, Java or C++, you can just talk to ChatGPT, generate code, and start deploying it, even if you are a non-programmer. So yeah, I'm totally in agreement that the barrier to entry has lowered, and anyone can start writing code.

That also means security folks now have to be extra careful, right? Because they need to secure not only code written by expert developers, but also code from new developers who may have never programmed in the past.

So one of the things when it comes to agentic AI is that these agents need to talk to other tools or other agents and things like that. Recently, I think there has been a lot of focus on MCP, Google released an agent framework, and things like that. Can you explain why some of these standardizations will help in building agentic AI applications for AppSec?

Brad: One of the projects that I worked on recently at Ghost is Reaperbot, which is a team of agents that interfaces with our proxy tool called Reaper. It's sort of an overlay that helps pull the strings on Reaper. And Reaper is just RESTful APIs: discover hosts in a domain, probe them to see if they're live, go test them for this or that, show the results, and those types of things.

When I wrote Reaperbot, it was prior to the MCP standard getting a lot of attention and being adopted by OpenAI; it was Anthropic's to start. That sort of opened the floodgates to, okay, MCP is the standard. I lived the experience of spending 80% of my time writing the integration between the tools and the framework, while most everything else was relatively straightforward.

The downside is that my code, Reaperbot talking to Reaper, has an interface that is all highly custom, not interoperable, and model- and framework-specific. So if I wanted to say, hey, somebody else, use your chosen framework and your own model, a private model or what have you, and call my tools, you can't; it's not a standard. Model Context Protocol decouples that: it puts an abstraction layer between the LLM and the external service or thing you're trying to interact with, to get data or act on something. And you're making that a standard.

And what that does is cleanly separate the consumer and the producer. If I were a vendor, and a lot of vendors are doing this, I could produce an MCP endpoint and say, your LLM can talk to me in a standard way, discover the tools that are here, and use them. I just run the MCP server, and you can consume it from any number of situations and places with minimal friction, and that is huge.
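As a rough illustration of what producing such an endpoint can look like, here is a minimal sketch assuming the FastMCP helper from the official Python MCP SDK; the probe_host tool and its behavior are invented for the example:

```python
# Minimal MCP server sketch: any MCP-capable client or LLM can discover and call
# these tools without custom integration code on the consumer side.
# Assumes the official Python MCP SDK (pip install mcp); probe_host is a made-up tool.
import urllib.request

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("appsec-tools")

@mcp.tool()
def probe_host(url: str) -> dict:
    """Check whether a host responds and return its HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return {"url": url, "alive": True, "status": resp.status}
    except Exception as exc:
        return {"url": url, "alive": False, "error": str(exc)}

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; clients discover probe_host automatically
```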

Host: For sure, MCP has opened up a lot of opportunities for developers and vendors to build tooling around it. Do you see any challenges with MCP or the agent SDKs? And do you see these getting standardized at some point? Because right now we are talking about at least two frameworks.

Brad: That's the hope. I know it's still pretty early; in this whole thing, six months is an eternity, but look how far it has come in that handful of months. The challenge still, I think, and I know it's being worked on because folks are seeing it, is when this is not running locally, when the MCP server is not literally running right next to the code on the same system. Once you separate them, you have security concerns, identity and access management concerns, and authorization-type concerns.

Those are being worked on. The moment you separate them, everybody has the same questions. Like, well, how do I make sure it's accessing the right data or doing the right thing, et cetera?

And I think once those primitives are sort of put on guardrails and are easy to understand and consume, I think that will really let it take off. And I hope that that standard emerges soon so that we're not sort of doing, you know, the 15 standards, let's make a new one. And now we have 16 standards.

I really would like us to all coalesce and get on the same page, because I think that unlocks some very interesting interoperability, marketplaces, service models, and product opportunities.

Host: I am totally in agreement on not creating a 16th standard out of 15 standards, right? Because it will again be the same thing: you have to support so many standards that it doesn't help builders move fast. As you mentioned, you worked on these integrations manually and spent a lot of time building them before MCP was there. That time could have been avoided if there had been a standard. Even for folks who are building today, if they have just one standard to build for, they can move fast and build more things, right? So yeah, absolutely.

So one of the things that Steve Giguere has asked around this is: when it comes to AI and security, there are two sides, right? One is using AI to secure something, and the other is the security of the AI itself. So let's talk about the first one. When people say using AI to secure AI, what does that mean?

Brad: So the way I interpret this, at a higher level of abstraction, is that you have a process, and it either has deterministic inputs and deterministic outputs, or it has variable inputs with variable outputs, and the degree of scale matters. What an LLM will do is amplify that. So I'm taking slightly nuanced inputs and getting slightly nuanced outputs.

Maybe I do or do not control those inputs. How do I assess that the thing that was asked for was a good thing to be asked for, and the thing that was returned was a good thing to be returned? And the whole point of this is to do it at past-human scale. So the only thing that can handle the nuance, or at least attempt to handle the nuance, is another LLM.

And so you'll see a lot of patterns of using LLMs as a judge, or adversarial LLMs. You have one from this model, another from a different model with maybe different biases and different training, either agreeing or disagreeing, maybe odd numbers of them with an LLM as a judge, all sorts of things. But basically what you're asking is, how do I guarantee the safety or the security of what's coming in and out of this at speed? The only way to keep up with that, unless you have millions of people, is to use LLMs to help judge LLMs. Is it perfect? No, but that's the high-level rationale.
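A minimal sketch of that LLM-as-a-judge pattern, assuming the OpenAI Python client; the model names, prompts, and pass/fail convention are illustrative only:

```python
# LLM-as-a-judge sketch: a second model scores whether the first model's output
# is acceptable before anything downstream consumes it.
# Assumes the openai package and an OPENAI_API_KEY; model names are illustrative.
from openai import OpenAI

client = OpenAI()

def generate(task: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

def judge(task: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",  # ideally a different model with different biases
        messages=[{
            "role": "user",
            "content": f"Task: {task}\nAnswer: {answer}\n"
                       "Reply PASS if the answer is safe, on-topic, and correct; otherwise FAIL.",
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

task = "Summarize this HTTP log entry for a SOC analyst: ..."
answer = generate(task)
if judge(task, answer):
    print(answer)
else:
    print("Judge rejected the output; escalate to a human.")
```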

Host: Yeah, so during RSA I participated in a threat modeling workshop run by Adam Shostack and a few other folks. One of the things that was mentioned was that we are now writing code using AI, and we are securing the code written by AI using AI.

We are in a unique world now, because both sides are using AI heavily, and we need to find the right tooling: when you're generating code, use the right tooling, and when you're validating it, again, the same thing. One of the things that was brought up was adversarial LLMs: can you use those to do some of the threat modeling and things like that? So yeah, absolutely, makes sense.

One of the things that you highlighted was deterministic versus non-deterministic outputs, right? This is again a question from Steve, where he's asking: given the non-determinism of AI, how can AI systems be considered reliable?

Brad: I think that's a fantastic question. And it's up to the system builder and designer, the owner of the use case, to determine, because it's always on a spectrum. It's always a percentage of reliability; it's almost never 100%, no matter what. The question is, what's your tolerance for that reliability? And if you know what that is, does the LLM, or the setup, or your operational state land within that tolerance, or does it go outside that window?

A lot of folks are thinking with a deterministic mindset: I'm going to take this input, and I'm always going to get this output. Whereas here, you have to be okay with it not being 100% correct. And when I say you have to be okay with it, your use case has to be such that this is still the best thing that solves the problem.

If you have a deterministic input, chances are you can reliably use deterministic means to get there. But if it's, summarize this text, I need to take 10,000 words and get it down to 100 words, an LLM is probably the best tool for that job. You're going to have to give it some nudging and some correction, and it's eventually going to get there with your feedback, to the quality and reliability that you want.

But if you're applying it to something that needs extremely high precision and accuracy, it's going to be really challenging. You're going to spend a lot of time and money getting to that point, if you get there at all.

I'm not saying that it's the best solution for every problem, but if it is, then you can get it within a workable window in most cases. You can do a lot of performance evaluations and things to make sure it stays in that operating window. But what most folks don't do is define what they're able to tolerate and then shoot for that. If you just want it to be perfect, that's a non-starter. You're never going to get there.
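A minimal sketch of what defining a tolerance and evaluating against it could look like; the labeled examples and the 90% threshold are invented for illustration:

```python
# Reliability-tolerance sketch: run the LLM step over labeled examples and check
# whether its accuracy stays inside the window you decided you can tolerate.
# classify_finding() is a hypothetical wrapper around your LLM call; the data
# and the 0.90 threshold are illustrative.

def classify_finding(snippet: str) -> str:
    """Hypothetical LLM-backed step that labels a finding 'real' or 'noise'."""
    raise NotImplementedError

LABELED_EXAMPLES = [
    ("user input concatenated into SQL string", "real"),
    ("hardcoded test credential in unit test fixture", "noise"),
    # ... hundreds more in a real evaluation set
]

TOLERANCE = 0.90  # the accuracy you decided you can live with, not "perfect"

def evaluate() -> bool:
    correct = sum(
        1 for snippet, expected in LABELED_EXAMPLES
        if classify_finding(snippet) == expected
    )
    accuracy = correct / len(LABELED_EXAMPLES)
    print(f"accuracy={accuracy:.2%} (tolerance={TOLERANCE:.0%})")
    return accuracy >= TOLERANCE  # gate rollouts and model swaps on this check
```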

Host: You might have noticed this more than I have: when you ask a question to ChatGPT or Claude or something like that, there is hallucination, and when you ask the same question again with a slightly different tone, you get a different response. So you have to find the right balance and also the tolerance: what failure rate you are okay with, what percentage you're okay with. So yeah, it makes sense.

Now, slightly changing the topic: we spoke about agentic AI, what it does, how we can use it, and how we can use it for AppSec and things like that.

For organizations who are getting into it, who want to build something around agentic AI or want to use it, they might have existing SDLC processes, existing architectural considerations, and things like that. How do you see organizations adopting this?

Agentic AI systems for AppSec, that is. Do they need to make architectural changes at the foundational level? Or do you think minor changes would get them started on agentic AI? How do you see it working out?

Brad: Yeah, if this is, I'm new to AI, or I'm new to implementing this in my processes, I certainly would not start with agentic AI. I would build into it. We think of two types of uses for AI: we think of workflows, and we think of the self-led, non-deterministic, agentic AI stuff as sort of a superset or an improved version of that defined workflow.

But what's really important is that you define that workflow and look at each one of those steps. You break it down, back to making the peanut butter sandwich: get the bread, get the peanut butter, open the jar, get the knife. You have to think of it in extremely small steps and break your workflow down like that. Then look for the very specific step that is best suited for an LLM.

Then implement it at that spot, and try to be deterministic, or use other traditional methods, to feed the input reliably into that hop, into that part of the decision tree, and then go deterministic for everything else. In other words, use it as sparingly as possible to start. If it's a classification task, does this look like dev, stage, or prod based on a little bit of context? That's a great example of bucketing or easy classification that could really help reduce toil without a lot of negative impact.

But when it comes to agentic AI, you're thinking of like, okay, now I have workflows, but some of these steps take different shapes. Like there's maybe three pathways or five pathways in this one step, and I can't really deterministically make that work every time. That's a great place for an agent.

So you give it instructions. You say, these are the three types of things; here's your goal; you're trying to get to this, but there might be three pathways to get there, or five. And you say, you can use these tools to lay out the plan and get to that endpoint based on the slightly variable input. Therefore it's a great use for it. It's like dropping in a junior or inexperienced person that just has 1,000 examples of:

You're doing this thing, you're trying to get to this goal, iterate over this process until you get there. And then you feed those examples back in to make it more accurate. But you're taking that workflow, and you have to define it and define what your tolerances are for the risk. And right when it's, I just cannot solve for that spot, it just is not easy, that's where you might want to apply an agent. So build into it, essentially.
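A minimal sketch of that "use it sparingly" shape: a mostly deterministic triage step where only the dev/stage/prod bucketing leans on a model; classify_environment() is a hypothetical stand-in for a real LLM call:

```python
# Mostly deterministic workflow with one LLM-backed step: the environment
# bucketing that's hard to do with rules alone. classify_environment() is a
# hypothetical LLM call; everything else stays plain, testable Python.

def classify_environment(context: str) -> str:
    """Hypothetical LLM step: return 'dev', 'stage', or 'prod' from hostname/tags/notes."""
    raise NotImplementedError

def triage_finding(finding: dict) -> dict:
    # Deterministic: parse and normalize the finding.
    severity = finding["severity"].lower()
    host_context = f"{finding['hostname']} tags={finding.get('tags', [])}"

    # The single non-deterministic hop, kept small and easy to audit.
    environment = classify_environment(host_context)

    # Deterministic again: routing rules you can reason about and test.
    needs_human = severity in {"critical", "high"} and environment == "prod"
    return {**finding, "environment": environment, "needs_human": needs_human}
```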

Host: Yeah, thank you for breaking that down. I love your PB&J sandwich example, because it gives you a construct to think about how to incorporate AI into your systems, rather than just doing a big bang. You need to find a specific use case, start from there, and then slowly expand on it.

On this, one of the questions that we got from Rami McCarthy is: what impact do you see AI having on organizational structures for security teams? Right now there are so many security teams focusing on different areas, as you have seen throughout your career as well. So what impact do you see AI having on organizational structures?

Brad: I think initially, not much impact on the organizational structure. We're sort of on this trajectory where most of the use cases, most of the tooling, is centered around augmenting or assisting the individual. So it's sort of like an exoskeleton or an exosuit: it helps you run faster and jump higher, but it's still fundamentally you driving.

And I think that as more folks start building these capabilities, the natural progression is that you're going to want to extend it to your team. You're going to want to put something in the center of the table, as opposed to at your place setting: everybody can share from this. How do I make it such that everybody can equally participate in feedback and use?

And once you do that, it can go in a couple of different directions. I mean, this is just postulating, but I can see a SOC team and an AppSec team and a DevSecOps team, if they all have capabilities naturally extended from the individual to something that services their team, that opens the door for team-to-team integration or interoperability. And that might break down some barriers, might help restructure, or hopefully just help facilitate more interoperability and reduce the friction on getting something reviewed. So if it's:

I have a security finding in my CI/CD pipeline. You might not send it to a human or ticket it to a human; you might send it to an agentic system that does the first set of passes, and maybe it's a really quick "yep" from a human on the AppSec team, and then it goes back into the queue. That's a win. That's breaking down barriers right there. So I'm hopeful. Maybe for now it's not really changing much, but I'm hopeful that this opens the door and gives opportunities for teams to collaborate more closely, and maybe they finally fold in together under security, as opposed to the silos of all those different org structures.

Host: Yeah, makes sense. And there are so many silos, right? Vulnerability management has its own place, then app security, DevSecOps; there are so many splits we see today. One more question, and I love this question because generally we think about how we can leverage AI and what benefit we can get. Rami is asking: what elements of AppSec do you see benefiting the least from AI deployments or tooling?

Brad: A great question, I love that. First off, I think it's important to say that the things that benefit the least are your wallet and your complexity budget. For these workflows, if you're scaling them and you're using LLMs, you're spending tokens, or you're running hardware and power and all that good stuff, to be able to solve this. It's not making that workflow any simpler.

It's actually extending its complexity. And that's the trade-off: I have a workflow that I need to run 10,000 times a day, and it requires human intervention right now. It would be a toil treadmill, as we like to call them; every day you wake up, there's more toil in your inbox.

How do we get ahead of those? How do we stop that? It's going to be challenging, it's going to add complexity, but the trade-off is that you're spending less time as a human burning out doing the same repetitive task over and over.

The other downside I see is more on the human side. Not many people who are proponents of LLMs talk about this, but there's the curiosity and the craftsmanship of building software or building systems. If an LLM gets really, really powerful, really useful, and it solves most of your problems and commoditizes aspects, like if it generates boilerplate architectures, generates write-ups, builds project structures, solves most of the core logic, and writes most of the code for you, it detracts from the art and the craft of it, of being really precise. Because maybe it's technically a correct implementation, but it's not the most clever or precise or cost-efficient.

And it might detract from the human's curiosity to explore better ways. In other words, if I ask the LLM what the best way to do it is, and it gives me this, it's like, well, that's clearly the best way, isn't it? Where's the challenge to go, actually, I think it could be done better? It might suppress that. So I would always encourage folks to use this as an exoskeleton, as opposed to a remote-controlled robot, and use it to empower you and help keep you in the flow state, as opposed to letting it take all the fun out of it.

Host: I love that response, because as builders, as engineers... I love the term you used, flow state, a few times, because ultimately as humans we want to get to a flow state to be creative and to express what we are trying to build, right? And with LLMs, some of that can go away.

Because, as you rightly pointed out, it could be a new language, a new architecture, or a new design that you are trying to think through, and now you can just go to an LLM and it gives you everything. So yeah, you don't have the fun of building software if you use LLMs entirely, instead of using them as an addition to your team rather than as your team.

Brad: Absolutely.

Host: So let's pivot to the other topic that I wanted to talk about, which you touched on earlier: the Reaperbot that you built. Can you briefly explain what it is and how it fits into the whole agentic AI ecosystem?

Brad: Yeah, Reaperbot is a team of agents that is a co-project to Reaper. Reaper is an API-driven proxy: go make a request, go look up a domain, those types of things. It's a host you can set your browser to proxy through, and it'll track all the requests, full payloads, and things going through it.

And our idea was that there are workflows you tend to do going through Reaper: well, first let's type in a domain, let's find the live hosts, let's probe them, let's see what they are, fingerprint them, or what have you. And then there's: how do we enumerate things that are going on in the application and decide whether there's something worth testing?

So it's attempting to be a lightweight proxy that you can interact with, maybe a lighter-weight version of Burp or something similar. What we targeted, and use it for educational purposes on, is our GhostBank website, which has a broken object level authorization (BOLA) vulnerability in the transfer function. And what we thought was: how does somebody who doesn't know what's going on get from "I'm looking for live hosts in the ghostbank.net domain" to "there's BOLA in the transfer function of this fictitious banking app"? We broke it down and put a challenge to ourselves: can we automate this with agents?

So over the holiday break, I was like, I think we could do this. If you have some sub-agents and give them some tooling, you can sort of prove the crawl: look up this domain, and it gets the domain; probe for hosts, and it probes the hosts. And then it's: what's the best way to organize and set these up as a team of agents so that they stay focused, stay on task, and do the thing?

So the way it's architected is that there's an orchestrator agent, and it uses the strongest model; I think it was o3-mini at the time. It's the one that takes the input from the user and goes, I know what you're trying to do, and breaks it down into steps. And it has sub-agents that it delegates to, to handle that. So there's a discoverer agent and there's a tester agent, and the discoverer does the benign things: domain lookups, probing hosts, subfinder-type things.

And then the tester agent is the one that says, tell me exactly what endpoint and what parameters, and I'm going to test it for variations of BOLA. So how do you go from there to there? Well, in agent speak, you give it a goal and it'll break it down. It'll start calling out to the tools, knowing what order, and then taking the results of those tools and going, oh, okay, that worked, and now the next step is to run the next tool, and the next. And then:

Here are all the live hosts. And then it's, tell me if BOLA exists in any of the endpoints in this domain. It's like, okay, let me go look at the request catalog of all the things that interacted with this website. These look like they have potential for BOLA; they're testable, they have POST parameters. Go look at them and go, hey, this is a make-transfer function. It's got an account-from, an account-to, and an amount.

What if I guess a different account-from? That's what it's doing in the BOLA agent: it's very specifically looking at parameters and going, what should I do to test this? When you add that all up, put a nice UI on it, and put some logging on the right side, you get this thing that I found fun: I can give it a goal and then watch it go, and it'll iterate through. Sometimes it'll get it wrong, but it corrects itself because it knows it got an error.

It'll retry with a variation, walk itself through the plan, and then produce a report. It's like: yes, I scanned this domain, I found these live hosts and the requests to those live hosts. Some of them are candidates for being tested for BOLA. I went ahead and tested those candidates for BOLA. Here are some of the request examples that were successful and show that it was BOLA. And here's a nice written output: it's in this make-transfer function, I can guess an account-from and take money from it.

So that was the overarching vision: see if we can walk that entire loop, that entire flow of a web app pen tester going from nothing to something of value. And yeah, that was it. We're trying to package it as an educational package, something small enough that you could realistically read. It's in Python, it's really easy to read, and you can read all the prompts and see what it's doing and how it's delegating.

And I think that hopefully serves as a learning tool, so that it breaks down the barrier and it's more show, not tell. It has problems, it's not perfect, but you can at least see them, play with it, see where the limitations are, and then make adjustments. And of course, if you write Python, you can easily extend it, add more tools and more agents, and make it do something interesting for your own use case.
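A rough sketch of that orchestrator-plus-sub-agents shape, not the actual Reaperbot implementation; plan_steps() and the two agent classes are hypothetical stand-ins:

```python
# Orchestrator / sub-agent sketch in the shape Brad describes (not Reaperbot itself).
# plan_steps() stands in for the strongest model breaking a goal into steps;
# the discoverer and tester agents stand in for the benign vs. testing roles.

def plan_steps(goal: str) -> list[dict]:
    """Hypothetical orchestrator-model call: turn a goal into ordered steps, e.g.
    [{"agent": "discoverer", "task": "find live hosts in ghostbank.net"},
     {"agent": "tester", "task": "test transfer endpoints for BOLA"}]"""
    raise NotImplementedError

class DiscovererAgent:
    def run(self, task: str) -> dict:
        # Benign work: domain lookups, liveness probes, fingerprinting.
        return {"live_hosts": ["https://app.ghostbank.net"]}

class TesterAgent:
    def run(self, task: str, context: dict) -> dict:
        # Focused work: pick candidate endpoints/parameters and try BOLA variations.
        return {"findings": ["BOLA suspected in POST /transfer (account_from)"]}

def orchestrate(goal: str) -> dict:
    agents = {"discoverer": DiscovererAgent(), "tester": TesterAgent()}
    context: dict = {}
    for step in plan_steps(goal):
        if step["agent"] == "discoverer":
            context.update(agents["discoverer"].run(step["task"]))
        else:
            context.update(agents["tester"].run(step["task"], context))
    return context  # feeds the final written report
```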

Host: So I'm guessing it's open source, right? Okay, so what we'll do is, when we publish this episode, we'll add it to the show notes so that our audience can also go play with it and learn from it. So one of the things you touched on is that you provide a prompt, or you provide the objective you want to achieve.

And then there are sub-agents running step by step, and some level of autonomy is given to the agents in terms of decision-making, taking actions, working with other tools, and things like that. One question that comes to my mind is: what are some of the key advantages and disadvantages of manual versus automated? I know with automated you get the speed and you can achieve some of these things, but do you see any disadvantages there? Do you see advantages in doing certain things manually? What are your thoughts on that?

Brad: Absolutely. I think everything is use-case dependent, for sure. To be able to push the boundary of automation, you have to show what it can do, but also show where it can't. The pros of the manual side are that you have absolute control over whatever actions you're taking, in what order, what the results are, retrying, those types of things. But that does require expertise, and especially in my days of being a pen tester, sometimes a web app pen tester...

You're given two days to attack a given web app, or two web apps. That's 16 hours; that's not a lot of time to test everything. So you're looking for every advantage you can to automate the boilerplate or some of the table-stakes items out of the way, but at the cost of maybe steering the output as it goes, or the determinism, et cetera. It's a trade-off for sure.

So if I were to do web app pen testing today, I would be looking for things that surface candidates to me, and then I would take those as a cue, work off of them, and have assistance. I wouldn't necessarily do fully automated pen tests with it today; I would still want that level of control. For discovery and fingerprinting and enumeration, those types of aspects of the attack lifecycle? Yeah, absolutely. But if it's sending payloads, crafting payloads, iterating on that, I want that to be assisted, but not driven, today.

But Reaperbot is an experiment. It's: can we? And the answer is yes. And: should we? Maybe not. Not in all cases, not all situations, not all targets; run it against test environments.

Host: So you mentioned that Reaperbot is experimental. What are some of the things you see in the roadmap for Reaperbot? What are some exciting capabilities that you are thinking of building, or that you want the community to contribute to Reaperbot?

Brad: You know, there's not a high degree of expectation. It was initially meant to show our work and to be something we could talk about and have good conversations around. And it has spawned a lot of good conversations with customers and folks that we know: where's that line, the one we were just talking about? Where do we want to go automated? What are we comfortable with? In what situations would we want that to work or not?

And being able to talk from a shared set of understanding, because everybody's definition is different, everybody uses agents differently, and it's hard to understand what the capabilities are. But if you can go, I can do this, this, this, and this, potentially automated all the way through, where do you want the line to stop and in what situations? That's a much better conversation. It is open source, free to use and learn from and extend and whatnot. From a vision perspective, it's a little bit down the road, but there are things that we want to do at Ghost that bring some of this functionality into our capabilities, as natural validation and testing extensions of some of the things and issues that we're deriving from source code.

So that's part of the SAST tooling and capability that we're talking about: deriving things from source code, but then validating and assessing those at runtime, making the connection between the two.

So as an example of where Reaper goes, what it does, and where Reaperbot goes: being part of the platform. Say I want to test an authorization problem. We talked about BOLA, but what if it's just that an endpoint is not authenticated, and from the source code it's showing it doesn't have the dependencies or middleware or anything like that? What's the state in the real world? That is a bunch of toil, a bunch of busy work that an AppSec engineer would have to do just to answer, should I care about this? Let's automate that away and go probe it: do I get a 200, or nothing? Do I need a header? Do I get a 401? That just eliminates some of the friction of moving that piece down the triage pipeline.

So Reaperbot's sort of an exploration. It's a conversation starter, but it's also part of our R&D to be able to suss out what we're going to do and how we're going to build the next piece.
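A minimal sketch of that kind of runtime probe, checking whether an endpoint that looks unauthenticated in source code actually responds without credentials; the URL and the interpretation rules are illustrative:

```python
# Runtime probe sketch: confirm whether an endpoint that looks unauthenticated in
# source code is actually reachable without credentials. URL and rules are illustrative.
import urllib.error
import urllib.request

def probe_auth_state(url: str) -> str:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code
    except (urllib.error.URLError, TimeoutError):
        return "unreachable: likely not exposed, lower priority"

    if status == 200:
        return "responds without auth: prioritize for triage"
    if status in (401, 403):
        return "auth enforced at runtime: likely not exploitable as-is"
    return f"status {status}: needs a human look"

print(probe_auth_state("https://app.example.internal/api/transfer"))
```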

Host: And I see how it aligns with your own passion area, right? Reducing toil in security, and how you are connecting the two. So yeah, I really like that. So far we have been talking about agentic AI and how we can use it. There is also an operational aspect to it, right? So let me touch on that.

In general, security is a very sensitive area, and trust plays a huge role in it. Now, with these automatically running agentic AI decision-making agents, how do you build that trust? Let's say you want to build or roll out an agentic AI system in your company.

How do you go about it? How do you build that trust with other teams or with leadership? How do you look at it?

Brad: I definitely see it as a progression, and the adage that trust must be earned applies even more now. It's like adding a new person to your team: you're bringing them on and onboarding them. How do you trust their output? Maybe it's a new SOC engineer and they're triaging tier one. What do you give them? Do you give them the advanced persistent threat indicators, or do you give them the very low-level logs and alerts to mark good or bad?

And you audit and review everything that comes out of it. Transparency and audit logging are absolutely critical. So build that process from the start with the idea that it's probably going to get this wrong until it gets it right enough times to build up that trust. It's not going to be perfect out of the gate.

If you jump in and go, I'm going to slap some agentic AI on this problem space, you're going to run into a very difficult set of standards to hit right out of the gate. But if you start going, I can automate this part of the workflow, then I can automate this part of the workflow, and I have a good way of evaluating that I'm doing that well and accurately over time, over thousands of repetitions, that's what builds up the trust. Just like any detection you add to a SOC: you drop it in, you test it out quietly in quiet mode, and then you start putting it on some of your, maybe not your higher-end customers if you're an MSSP or something.

And then you roll it out more broadly over time; you're actually testing your trust in that capability. It's the same thing here. It applies the same way.
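A minimal sketch of that quiet-mode idea: the agent's verdict is logged next to the human's for auditing, but the human decision stays authoritative until the agreement rate earns trust; agent_verdict() and the log format are made up:

```python
# Quiet-mode rollout sketch: the agent's verdict is recorded and compared against
# the human's, but never acted on while it is still earning trust.
# agent_verdict() is a hypothetical agentic triage call; the log format is made up.
import json
import logging

logging.basicConfig(filename="agent_shadow_audit.log", level=logging.INFO)

def agent_verdict(alert: dict) -> str:
    """Hypothetical agentic triage call: returns 'benign' or 'escalate'."""
    raise NotImplementedError

def triage(alert: dict, human_verdict: str) -> str:
    shadow = agent_verdict(alert)
    # Audit trail: every agent decision is recorded next to the human's.
    logging.info(json.dumps({
        "alert_id": alert["id"],
        "agent": shadow,
        "human": human_verdict,
        "agree": shadow == human_verdict,
    }))
    return human_verdict  # the human decision stays authoritative in quiet mode
```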

Host: So it's more of a continuous process. It cannot be a big bang where you just roll something out and immediately have trust in the system. So yeah, makes sense.

Brad: Absolutely. And to that point, these systems are living systems, just like a detection system. As models change underfoot, unless you're using your own specific static model, things will be changing underfoot. Examples will come and go. Your feedback loop will change the prompts; you're going to be adjusting prompts and things. It takes a little bit of care to keep it on track. It is not a drop-it-in, let-it-go, and it-will-be-good-forever thing. It will change over time, so you need to keep it in its performing window by validating and auditing.

Host: Yeah, you're spot on. And to add to that, every few months there is a new version of a model, and there are so many players providing models, from Anthropic to Google to OpenAI. Every other day there is a new model on the way. I think at RSA, Cisco rolled out a security baseline model of sorts. So I don't think anyone would be using a static model for the long term. There is constant change, which means it's a continuous process; it cannot be that you just rolled it out and you are done.

Brad: Absolutely. It's a living, breathing thing, and that's also the benefit. If you're designing your system to take advantage of the strengths of the model, and you're not tightly coupled or locked in to a specific trained model for your use case, then when a new model comes out that's cheaper, better, faster, the promise or the goal is that your system can be tested and validated to work well on that new model, and you can shift operation over to it and get all the benefits: cost, power, strength, accuracy as well. So it's also important to design so that you're not tightly coupled to one specific model or one provider, as best you can, to be portable and able to evaluate on different models. Because, like you said, they come out every other week, really. It's impressive, the speed.

Host: Yeah, absolutely. So on operations, one of the questions we received from Emily Fox is that the use case of improving application development using AI is very well understood, whether that's copilots or IDEs or Claude Code and things like that. But what is there for the operations teams, where the real risk of vulnerable or misconfigured applications is realized? How do you see that?

Brad: Yeah, I mean, there's certainly a lot of attention on the secure coding aspect and being in the IDE and the developer flow. But there are certainly opportunities here for CI/CD pipeline testing and efficiency trade-offs; a lot of tools will take five or ten minutes and slow down builds, and people don't realize that. There are other optimizations there too. But from an operations perspective, at runtime, there's: is it up? Is it running? Is it responding to those types of gentle probes that we can do for validation, to help in the triage motion? What version is running?

Maybe there's an API endpoint that lets you get the version and helps you tie it to what's in source code. All those types of things. But then, I talk about the exoskeleton, it makes you run faster and jump higher: there's a lot of writing. There's a lot of summarizing and synthesizing and write-ups: a write-up for an executive, a write-up for a technical audience.

And you don't want to be writing 15 different variations of the same thing. You would ideally write maybe the technical document or the technical write-up, potentially assisted, and then go, all right, make a consolidated version that is an email to this team, or make a consolidated version that's an email to this executive, or a report, or what have you. It just takes the toil away from those aspects.

So it's not exactly live in the runtime, but it's all the supporting elements and processes around securely operating a runtime environment. I see a lot of opportunity there to just say, help me write this summary. I'm 15 alerts in and I don't have the brain power to write up another attacker use case. It can give you a start, and then you go, all right, adjust, adjust, and then yeah, that's mine, I'll send it. Right? So I feel like there's ample opportunity in logs and alerts: summarizing, synthesizing, what does this log mean generally? What does this grouping of logs mean? It just takes some of that heavy lifting off so that you can stay in the flow state of doing assessment, risk reduction, strategic thinking, and communicating with teams, with less time spent operating tools.
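A tiny sketch of that kind of writing assist: draft the technical write-up once, then have a model produce the per-audience variants; llm() is a hypothetical model wrapper and the audience styles are examples:

```python
# Audience-tailored summaries sketch: write the technical version once, then let
# an LLM produce the per-audience variants. llm() is a hypothetical model call.

def llm(prompt: str) -> str:
    """Hypothetical chat-completion wrapper."""
    raise NotImplementedError

AUDIENCES = {
    "executive": "two sentences, business impact only, no jargon",
    "dev_team": "short bullet list with the affected service, fix, and priority",
    "soc": "timeline of events plus indicators to watch",
}

def summarize_for(technical_writeup: str) -> dict:
    return {
        audience: llm(f"Rewrite for a {audience} audience ({style}):\n{technical_writeup}")
        for audience, style in AUDIENCES.items()
    }
```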

Host: On that, there is another question that Emily asked us. You touched on some of this low-hanging fruit. What special considerations should operations teams be mindful of when choosing AI systems to assist in eliminating some of these smaller things?

Brad: I'm not a proponent of LLMs in every situation. I think it's important that you match them to the problem space that you have, to the scale, to the strength of the models. Because otherwise you're going to get into a world where it's Kubernetes for Kubernetes' sake, right? The old adage: you're doing LLMs for LLMs' sake when you're really just adding to the complexity and not getting enough of the benefit out of it.

If you're looking at tackling low-hanging fruit or security issues, make sure it's something at a scale beyond what humans can do, or something that requires non-determinism to be effective, to handle nuances in use cases where your current tooling is not succeeding, not able to handle all the one-offs, and you're just doing exception handling after exception handling. Those are good candidates for LLMs.

And then, when you're building that system, be very clear and upfront about your acceptable levels of accuracy. Like, I just want it to be better than a human over a thousand iterations. That's not too hard to hit, right? Versus, is it better than a human over 10 iterations? That might be hard, because a human doesn't get tired after 10, but after 100 they're extremely tired, right? So if you're saying, I just want it to do a good job across thousands of things, set that bar.

That will help everybody who's using the system understand the intent and the purpose. And the last thing I will say is that LLMs, or processes like this, amplify weaknesses in your existing process. If I have an existing process that's already a little bit behind, and now I throw an LLM at one of the steps and it speeds that step up a hundredfold, it's going to push the strain or pain onto the next part of your workflow. Maybe it's:

Now there are a thousand tickets. So I would argue that you want to spend at least half of your time implementing process improvements to support the LLM-enhanced part. Don't go, all right, I'm going to slap some LLM on it, we're done here. Think of that as half of your budget. The other half is fixing how we send tickets, how we group tickets, how we consolidate things before we send them to the next step, et cetera, versus treating this as only an LLM thing.

This is a process. You have to look at everything: which part of the process hurts the most, and if we speed this part up, what does that do downstream? That's my generic advice for anything, but I think it applies really specifically to operations folks, because they're doing things thousands of times a day, and that's just going to be exacerbated if you don't think ahead.

Host: Yeah, I love your answer, and the example that you gave: don't do Kubernetes for Kubernetes' sake. It's very similar: don't do LLMs just because there is a lot of hype. Find your use case and focus on that. And the last part, where you mentioned that you cannot just change one area: you cannot just use AI in AppSec and be done, because that also has downstream impact. So plan for that; do some process revisions before you roll out some of these big changes. So yeah.

Brad: Yeah, absolutely. Say you are triaging SAST findings and there are a thousand of them, and you started with, all right, I'm just going to do the criticals. Maybe there are a hundred of them, and from those you do a manual triage and come up with 50 where you're like, yep, dev team, it would be really great if you fixed these. You work with them and make 50 tickets. And now all of a sudden you have this superpower of being able to look at all the mediums and go, which of those mediums are worth fixing? And now you have 300 of those.

That's 6x what you were doing; now you're foisting six times more work onto the dev teams to fix. They're like, hey, what's going on? And you're like, I made this process so much better because I found more risk. And the answer is that downstream of that, you exacerbated a constraint, which is development resources, because you were able to optimize this part of the process. So I would just be really empathetic about the downstream effects of a sped-up system.

Host: Yeah. And I think there is an age-old debate in vulnerability management already, right? Organizations have thousands of vulnerabilities, every day you see new ones, and you cannot address all of them, so you come back to prioritization and things like that. With the introduction of AI systems into your SDLC process, you could open up more such issues, not just vulnerabilities but SAST findings or SCA findings, and now the engineering team would get impacted heavily. So yeah, great point.

And I think that's a great way to end the episode as well.

But before we end, I have one last question for you. Do you have any reading or learning recommendation for our audience? It could be a blog, a book, a podcast, or anything.

Brad: Yeah, I like to give one traditional and one non-traditional item. I say non-traditional because it's not specific to the agentic AI space: I've been recently devouring the High Performance podcast. They do in-depth interviews with high-profile folks in sports and, you know, famous people, and they really get into these deep conversations. And I just love it; a lot of Formula One drivers are on there, and I like Formula One.

A lot of the Formula One drivers talk about the mental side, the mental health, the physical preparation, winning the race, what success means to them. I just love that stuff. And then for the more traditional one, I have to do a little bit of a plug. We recently released a report at Ghost Security. It's at ghostsecurity.com/report.

We talk about some of the research that I had a hand in with some of my amazing colleagues on quantifying the triage toil of SAST findings, and thinking about how we reduce that toil. How do we get that time back so that you can spend it on other things? But also, what if we rethought this problem from the start? That's in there as well, some of the future things, the things we're leading towards; that's in that report. I think it's an interesting read. What I found really interesting in the data that came out of it is just how many things are really not truly a risk when you add a little bit of runtime context. Yes, it might be in the software, it might be a vulnerability, but is it running? Is it in production? Those types of things really narrow the funnel down to a handful of things, and you spend a lot of…

Host: Awesome. Yeah, I think one of the challenges with security is quantification, right? So when we publish the episode, we'll for sure add it to the show notes so that our audience can go and learn from the report. Thank you so much, and thank you again for coming to the podcast and sharing your knowledge.

Brad: We just wanted to quantify that. Yeah, thank you.

Host: Definitely, this was a good learning experience for me, and I hope our audience also gets a lot of value out of it.

Brad: Yeah, absolutely. Thank you so much for having me. It's been a real treat. So appreciate it.

Host: Yeah. Thank you.