Purple Teaming In the Age of AI: New Threats and Tactics in 2025
As AI reshapes the cyber threat landscape, Purple Teaming is facing a new era of complexity. Large language models are lowering the barrier to entry for attackers—making it easier than ever to generate phishing kits, malware, fake access badges, and social engineering campaigns at scale. The result: faster attacks, broader reach, and adversaries who look very different from even a year ago. So how do security leaders stay ahead when the playbook is evolving in real time? In this expert-led session, Tevora’s Threat team breaks down how Purple Teaming is adapting to the age of AI. From offensive testing methods to defensive readiness strategies, we’ll walk through what’s changed, what’s emerging, and what your team needs to do next to stay resilient.
Key Takeaways:
- How AI is changing adversary behavior—and what that means for your threat model
- Real-world examples of AI-generated exploit kits, phishing sites, and malware
- The role of AI in physical intrusion and badge replication
- What’s shifted in malware and TTPs over the past year
- 2025 case studies of attacks driven or amplified by AI
- Practical defenses and Purple Team tactics your organization can implement now
Whether your team is building out a Purple program or looking to sharpen existing capabilities, this session provides a clear look at how to test, defend, and stay ready in the AI-driven threat landscape.
This is Tevora is purple teaming in the age of AI. We’re talking about new threats and tactics in 2025 we’ve got a panel discussion here. This is Clayton Riness. I’m Principal Consultant here at Tevora. I am joined by Jonathan Nyman, one of our consultants that works on our threat team, which does all of our penetration testing, as well as Miguel Martinez, our associate manager on the threat team. These guys do pen testing all the time, full time. I think we’re really going to talk about what’s practical, what’s real, what’s happening in the world of AI and as it pertains to offensive security tactics. We specifically are going to talk about purple teaming, but I think it’s good to start the discussion with a definition about what we mean by purple teaming and why folks may want to consider it as part of their testing plan. Miguel, could you help us understand that what we mean by that?
One thing that I like to say with purple Teaming is that there’s different shades of purple teaming. It’s not always a clean purple. Sometimes you’re a little more magenta, a little more red, or sometimes you’re a little more Indigo, a little bit more blue. Team action. But what a purple team is, essentially, I like to think of it more as like a training exercise. It’s there to help, the client organization, identify which tools are effective and which aren’t, and also which processes might be lacking a bit like maybe your incident response or threat response process is a little too slow, or maybe it doesn’t have a complete identification and isolation protocol in place. The training aspect of it is also getting on calls with the organization and showing them the attack, showing them how we do it, what are our thought processes? And in that, showing how to better defend against those adversary attacks, we like to really focus on what’s practical. Especially when it comes to AI, it’s always a good topic, because there’s so much hype around this, it’s good to know really what is actually possible. I know we’ve got some talking points around, random calls, voice recording, deep fakes, perhaps, Jonathan, we want to start there and get a sense for really, what’s practical, what’s possible there.
I think what we’re really seeing is it’s still being figured out. Not only are we still trying to figure out how to best utilize AI, but so are threat actors, and we’re seeing the same kind of methods that we’ve seen in the past, the same tactics, but at a new scale. We’re seeing a lot more frequency and also a lot more sophistication. Where it stands right now is AI is lowering the technical threshold for attackers and allowing them to kind of increase in frequency. I really think everyone, not just threat actors, not just us, but everyone’s trying to figure out how to utilize AI, in a way that fits them. I think there is a lot of hype around it, but I don’t think everyone’s fully figured out what it’s capable of yet. From the threat side of things like you kind of mentioned, we’re seeing a huge increase of social engineering, whether that be phishing emails, whether that be voice cloning or whether that be text messages. I’m sure everyone here can attest to how many more spam text messages they’re getting. In all those, just a quick example, the text messaging one, you might be getting random text like, Hey, what’s up? And they start out with these friendly invitations, seemingly just innocent, wrong phone number things. What they’re doing, if you respond, is they start to build kind of a rapport. They’re not just going straight into the phishing email or the phishing link or directly into send me gift cards or whatever, they spend time building up a reputation with you and building up a relationship, and that’s being done with chat bots. That’s just one example of many of kind of like what we’re seeing. I think that’s just the tip of the iceberg. As you actually start integrating AI with tooling and start, we’ll kind of dive into it a little deeper later, but just as you start actually integrating it with tools, not just using it as a chatbot, I think things are going to ramp up very quickly.
I definitely see that lower threshold being a factor. I think everyone’s getting hit by this, just like you mentioned. I certainly have one thing that comes up a lot, especially in some of our board level discussions, is the fake videos and making fake videos. We’ve done some face swapping. Clients have asked us for that, but I think it is a pretty high bar to have something really convincing, from a fake video, if it’s someone that you know and that you’ve interacted with personally, that’s a very, very difficult thing to do. I think a lot of people are worried about impersonation. But is that a real thing, or is or not for our customers?
I think we’re not quite there yet. You’re saying there are some things personality wise, if you really know the person, it’s going to be a lot more challenging. What we’re seeing is it like using that technology to get a job. Someone you know in a different country is able to get a role at a US based company, impersonating someone and using that face swapping technology, not necessarily to impersonate someone the organization, organization knows, but just as a person they don’t know that they’re bringing hiring onto the company. There was a viral video that went was kind of making the rounds a while ago, of some an interviewer catching on to what this happening. Him asking him, can you put your hand in front of your face to break that face swap? They refuse to do it. We’ve heard countless stories, and it’s really ramping up of people getting hired onto a company that should not have gotten hired, real threat actors getting jobs and kind of just bypassing the whole external security. External security or perimeters have gotten really locked down. It’s your first step in securing your network is, let’s make sure no one can get in from the outside. How has that kind of changed? And what threat actors are doing, we’ll just skip all that and get a job, and we’ll get all the access that we need and go from there. That’s kind of one of the biggest ways we’re seeing that like voice cloning and face swapping technology being used, it seems like the voice cloning is much more attainable. I know we’ve been on tests where we’ve actually cloned someone’s voice has been very convincing and actually led to some embarrassment in some cases, because you’re basically breaking that trust factor. How does that factor into how we would do a purple team? I mean, is it? Are we pushing the bounds of what’s ethical here, if we’re doing voice cloning, even if they’re asking for it? Is it something that we’re suggesting? How does that work in practice?
It has caused some issues. It’s making people a lot less trustworthy. I know there was an instance where there was someone upset because no one’s answering my calls anymore because you guys weren’t first meeting me. We definitely don’t want to cause that kind of disruption. With red teaming and penetration testing, there is that balance of how far is too far. There’s certain phishing emails that threat actors will send that we won’t because whether it’s pretending like, hey, you’re getting fired. Here’s your documents that you need to fill out. We won’t go that far because there’s some ethical issues there. I think it’s important to find that line without actually just skipping that whole threat vector, because it is important for people to be prepared for when they are struck with that. When it comes to how we address, or if we should address the voice cloning, I think it’s something we should do just because you. We need companies to get that procedure in place. How do you verify, even if it sounds like the person that you know that’s calling, even if it’s coming from their number or seemingly coming from their number, there still needs to be a level of verification there. Because there is definitely that constant balance we’re dealing with is, how do we stay within kind of ethical limits, but also making sure companies are prepared for real threats.
That ethical line is interesting because I come across it too on purple teams and red teams, pre scoping, where we’ll ask the client, do you guys want us to, when we compromise a user workstation or something, pop around somewhere, it’ll take over the screen and say, you’ve been compromised. Call this number to unencrypt your device or something. A lot of the times we get pushback from, that’s kind of unethical, right? We don’t want to scare our employees. What Jonathan was saying, in doing that, you miss testing that process. Is a user actually going to, turn off their laptop that get from the network, isolated and report it, or are they going to shut down and call the number and give away something valuable? You know that line of crossing ethics is definitely something I come across and sometimes worry about when I think about it and get pushed back because it’s something that’s going to get missed, that ATPs are not going to miss. They don’t have these restrictions like we do, so they are going to exploit it. Definitely important to push those boundaries, as uncomfortable as uncomfortable as it may be. I think you definitely want to have coverage over those areas. Rather have us work on it in a controlled way than just hope it doesn’t happen to you. We’ve seen a lot of sort of Cold War era tactics come back. I mean, a lot of cases. You’ve tried to verify someone’s identity like we have predefined safe words that were distributed out of band. The key at the same time, or the nuclear weapon type scenario. But those things work, right? Because you know something that machine can’t necessarily replicate. I think a lot of cases, don’t be afraid to go low tech when it works, especially for sort of high risk transactions. And in this case, it was a lot of board level transactions and authentication. Now that’s definitely interesting. I think the deep fake voice cloning makes sense. I know Miguel, you had some thoughts on prompt disclosure and chat logs and kind of how those weave their way into so the purple teaming that we’re doing as well.
As expected, we’ve seen clients implementing AI chat bots into their applications, and it’s bringing up the worry and the attack path of disclosing chat, all users chat history, disclosing prompts. What we found is, especially if an application is made using AI, it often doesn’t lock down the access controls properly. You might, disclose API keys for your AI or LLM, or you might disclose the prompt and where you can tweak it to then do things that shouldn’t like, access files on the back end or execute things on the back end. A big one is disclosure of all users chat history. You don’t want that to get disclosed because users are using it for a whole lot of things. They’re not just asking it to write them a poem. They’re also saying, hey, fix this troubleshooting or network issue I’m having. And they’re sometimes feeding it sensitive data. If that is in your chat logs, and that gets disclosed, that has a big implication on the organization your users and trust and trust in your organization. If users feel like they can’t trust you, they probably won’t use your product, and they’ll stay away from using your application that has AI in it. Definitely been seeing a big uptick in that.
We’ve had a lot of customers actually access, are you using any AI tools in your testing? I think where they’re coming from is, I don’t want my information disclosed to some public model that you don’t control. They’re also insisting that we use the best and latest tools and techniques to be able to test our environment. It’s definitely delicate. I think what you’re just describing there, the Chatbot use case is interesting, because that’s a lot of the pen testing that we’re doing around AI is, are these chatbots that may be trained on sensitive information, but they don’t want that information disclosed directly. How does that play into some of the testing methodologies that we’re deploying these days?
How would we test like a traditional Chatbot? I think people don’t necessarily know, but it’s sort of a building on the question or the statement that you made earlier. When we’re coming at it from a black box perspective. We don’t know much, it starts off by just coming at it and testing the guardrails. What will it not let us do? For us on the offensive side, chat GPT, for example, it blocks us a lot on red teaming and malicious activity. If we’re testing a client’s application, we might ask it to fetch files on the back end and see what it says. Is it going to say, sorry, I can’t do that or is it going to give us an indication that it’s trying. It’s trying to do it, but maybe it just needs a little more nudging, if it’s not going to get it from the local file system, maybe you have to tell it to tell it to fetch it from an s3 bucket or something like that. A lot of it comes down to when you’re chatting with it, piping that through Burp suite, analyzing that traffic and seeing how it’s being sent out. Sometimes, in that get request, or in that post request, if it’s using that, it’ll have the prompt being sent in there. We would start by modifying that prompt that goes along with it. Maybe even deleting it, seeing what happens if it gets deleted. Is it going to reveal a stack trace to tell us more information on what’s going on, or just outright replacing it and telling it, you’re an internal employee that has full access to your back end. Start searching for this. Start doing that, and it’ll start orchestrating that to look for files on the on the back end there, that’s kind of where it starts.
It is really interesting testing AIS, because it’s this weird blend of social engineering from our end. It’s almost like we’ve had instances where the tester fully built a relationship with the AI, and got Buddy/Buddy with the AI, and then eventually it disclosed, the information we were looking for. It’s very weird, because it is just that weird blend of social engineering and text technical testing. It is very strange. I feel like sometimes it’s like, oh no, I can’t disclose that information. You just go, pretty please. And it goes, okay, it’s like, AI vulnerabilities are very weird. Obviously, it’s a new technology, and it’s going to take some time before people actually start trusting their data with it, and rightfully so, and that’s why we’ve been very careful with how we approach AI, and how we actually use it. We want to make sure we know what’s going on. We want to make sure we test our own procedures before we actually start utilizing it to its full potential.
Miguel, you mentioned some of the tools or techniques that we’re using, sort of post exploitation, setting up command and control and doing some obfuscation of processes that we’ve got running. How has AI really influenced that in the last two years or so?
I don’t know if it’s just not something that a lot of organizations are doing, or they’re just not making it public, but it’s well known, that a lot of developers are using things like copilot, cursor, windserve two program. At the end of the day, malware development is programming. Also, things like setting up cobalt strike, what you’re doing is setting up nginx, redirector, setting up the c2 profile, cobalt chat. Things like chat BT are really great at generating, for example, if you’re making a c2 profile and you wanted to make it look like it’s someone or data, just shopping on Amazon, you can have chat GPT. Make that for you quickly. It’s not going to replace a person. It’s not doing anything I can’t do, but it’s speeding it up, something that might take me, 10 minutes to do chat GPT will do it like in five, not even five minutes or two minutes, but in terms of AI and malware, I think most people are using AI in its base, vanilla form, like they’re using chat GPT, and I think that it’s still useful there. Like I said, you can do a lot of things like generate scripts, make, see profiles, but a lot of the tooling is evolving. At first we had common AI and LLM, things like chat GPT. Then we moved on to using their API keys to make tools that work with LLMs. Maybe you have a CLI tool that’s calling chat GPT, some API, or anthropic API, and it’s doing something. The problem with that is that the tools are very brittle. If something changes in the API, or maybe the tool changes, you have to rework that tool, and different organizations can make those tools differently. There’s no standard to that. Last year, in November, anthropic released the MCP protocol, which is the model context protocol, and that’s what I’m seeing a lot of the shift to now you have a lot of things like, gedra, which is an open source reverse engineering platform, Nat has an MCP. What the MCP does is it lets you plug your AI into a tool. Imagine you’re chatting with GED reverse engineering, and you’re telling it, give me a summary of all the headers and all the fields, and it’ll plug into gedra and do that for you, and give you a summary, but on the purple team side of things. Let’s say Splunk actually has them too. Splunk has an MCP. Let’s say all of your logs and alerts are going to Splunk instead of setting all these filters. You can just chat with Splunk. You can just say, hey, give me a filter and give me from all this month, give me all the alerts and logs for pathos, brain bone. Abilities or give me all these suspicious alerts that came from this specific user account or computer account, and it’ll do that for you. That’s where I’m really seeing the next step in AI is model, context, protocols. Like I said, it’s been out for a year, but not many organizations are using it, and I think we’re barely touching the surface of that. The other thing that’s really big is custom LLMs. I mentioned that chatgpt anthropic. They’re great, but there’s some restrictions on it. One of them is that they’re very big models. They’re pre trained with all kinds of data from Reddit, Google, Facebook, and that makes them great generalist. But when it comes down to very specific things like reverse engineering or malware development, not only do they have guardrails, but they just have so much context and so much data, sometimes they hallucinate, and they get a little lost. I think everyone has seen that where you ask ChatGPT something and they don’t give you bad info, and it’s up to you to verify it, custom LLMs, you can make them a lot smaller. You can use something like llama or Mistral and train it on a very specific subset of information, and it’ll become just as good or better than, ChatGPT, anthropic, even though it’s smaller, because it’s more focused, it doesn’t need all this other information. Something that I found very interesting too, with custom LLMs is training them. There’s been some new things in the way that they’re being trained, which is using a verifier. When it comes to making malware, for example, what I’m seeing right now is they’ll make a custom LLM and they’ll have a verifier, and that verifier will be an AV or an EDR. Think crowding or Microsoft defender, that’ll be the verifier. You can train your LLM to say whatever output you give me. Whatever malware, exe, we’re going to run it through this verifier. That’s the AV and EDR, and your goal is to have no alerts, or less alerts. Anything that it does that generates less alert, it’s training off of that. It’s learning, that was good, I got to keep down this path of making a smeller, because my alerts are going down. That’s kind of the cycle that it’s going with those custom LLMs to learn and make better malware. Those are all local, so they’re not like I mentioned. They’re not using chat GPT or either they’re going to be locally trained models.
We’re advising that Blue Team sort of adopt these tools. That was actually one of the questions that came up. In fact, I didn’t mention this at the top of the webinar. If you do have questions, please drop them in the chat. We’ll be happy to answer them. And what one thing that comes up is like, should we be embracing this or rejecting use of AI when it comes to Blue teams? I mean, obviously embracing it sounds like a lot of what you’re espousing here is smaller models that are very specific around a certain use case is really the way to go.
I think we’re going to see a lot more of that organizations doing it for their specific org very focused. In terms of, should organizations embrace, I think, definitely. I might be too optimistic, but I think we need to embrace it. I don’t think it’s going to, maybe I’m too optimistic, but I don’t think it’s going to lead to, it’s hard to say that because I guess there have been layoffs recently, but it’s creating new jobs too. I think AI reminds me a lot of Google in some ways, when Google first came out, they would say, oh, it’s going to lead to loss of jobs, and why go to college? Because you can just Google the answer. Developer work is going to be obsolete, because you can just Google coding examples, code templates. Another worry was also; you can’t trust the data. I remember in middle school, high school, when Google was a thing, I would hear you can’t trust everything you see on the internet. And it’s true, right? AI is also following that with its hallucinations. I just mentioned, MCP, those are all the new jobs. MCPS have a standardized model. It’s kind of like an API. Think of it as a RESTful API, where you can make calls to it, and it’ll return data in JSON or text, and it’s a standard that’s going to be followed for engineering. That’s going to be a new job, right? Who’s going to make these model context protocols, who’s going to implement them, maintain them, update them. Those are all going to be new jobs that are created, and are going to be needed to continue the evolution of AI. And I think if organizations are not implementing AI, I think they will be at a disadvantage, and that disadvantage will continue to expand. Right now, you can kind of get away with not using AI, and you’ll be at a slight disadvantage. But, five years from now, it’s going to be going to be huge, as these other organizations have their custom LLMs, their MCP tools in play, it’s going to the gap is going to get bigger.
There’s definitely an arms race. I think, Jonathan, you said something at the beginning that piqued my interest was really around, AI has really helped increase the scale and complexity and variation in the amount of attacks. One thing that always comes up is, phishing. Is phishing still a viable attack method? And how is AI really influenced that? What are your thoughts on that?
AI has really influenced it on both sides of things, from the red, red team side of things and the blue team side of things. it’s almost kind of like how EDRs, back in the day used to just be signature based. They’d be like, I recognize this as this is in my database as malware. If we’re flagging this, but now that does not cut it anymore. It’s very easy to get by if you’re just doing signature based detection. We’re kind of seeing something similar, from using AI to go more from the analytics side of things, giving it a score, how suspicious this actually is. From the red team side of things, it’s really becoming useful, going back to the scale or the template or social engineering. Can campaigns being able to adapt themselves. You could spend 20 minutes coming up with a email template, a landing page, which it usually takes longer than that, but as soon as it gets caught right now, you have to start over again, whereas, if you’re having it, using AI to kind of modify things, change it so that it can’t immediately get identified because it’s been signatured, as well as just being able to pump out 10 email templates immediately. On the blue team side of things, I I’ve noticed increased difficulty of landing emails inside target inboxes. It is getting extremely difficult. There are 100 things you need to think about, but it’s still very possible. I know some of those email security solutions are moving to that kind of AI aspect, where they’re trained on what emails look legitimate, or what emails are legitimate that we need to let through, and which ones are not, which ones are likely phishing. You’ll see this for anyone that does review, like quarantine boxes. If you’re in charge of managing email security, you’ll see like it will get percentage scores right on how suspicious things it is. There’s that balance of we still need to be as a company, we still need to be getting legitimate emails, because it really disrupts business if we’re not able to get any emails. If we’re too permissive, it’s going to allow kind of that initial vector into our network by actually, phishing emails getting through into our user inboxes. I think it really comes back to just like the drastic increase of scale that you’re able to pull off using AI. From the blue team side, it’s also making things very difficult. It’s making my life very difficult to get emails through. It is possible still, but it does require a lot more creativity, as far as actually getting emails into inboxes. I think maybe one day we’ll get to a point where that is problem is completely figured out. Probably not. It’s a cat and mouse game where threat actors have a way to get in reliably. Blue Team figures it out, solves it, and then we go back and forth. I think the important thing, like what Miguel was talking about, it is important to actually figure out how to use AI on the blue team side of things, because if you’re threat actors targeting you are heavily using lighting AI efficiently and effectively, you’re going to be at a crazy disadvantage.
On the blue team side of things, I mean, I’ve seen them use it for generating templates for alerts. Let’s say, for example, you ask it, give me an alert, or your rule to detect this kind of payload execution or kerberosing. You can also chat with it and say, give me other methods or the techniques that this alert might overlook it might not catch. I’ve seen it used for that, but I haven’t really seen it be used for AV and EDR yet. I was talking to one client, I think they were using trellis, and they said they were using AI for it. I was talking with them, like, what are you guys using AI for? Is it on the on the agent itself? Is the AV, EDR, on the workstation, using AI, and it wasn’t, that would be harder. That was my question are you guys doing that? Because that seems very expensive and resource intensive to use AI on every single workstation to detect malicious activity. What they actually told me is that they’re using it on the main server. Wherever the AV or EDR is sending those logs, those alerts, that’s where they’re using AI to help them eliminate false positives, go through some of the logs and some of the data, but it would be crazy. It would be very hard to get by it if they are using an LLM on each workstation, like if there was an AB or an EDR that implemented some kind of small focus, specialized LLM on every workstation, that that would be hard to get by. Maybe the age of the lightweight agent is over. Maybe we need something a little heavier with its own model at every point.
Adding to what Jonathan said about that creativity, one of the other examples I’ve seen for malware is creating custom obfuscation algorithms. For example, five years ago, a big thing was just base 64 encoding your malicious shell code in a binary. Now everything flags it. If it sees base 64 encoded in your exe, it’s just like, this is suspicious. Or it’ll even reverse it. Not reverse it, but just basic support, decode it and look through the malware. You can do things now really quickly, you can ask chat GPT to make you an algorithm to convert binary data to a lorem ipsum, format character array. Or you can say, convert this to telephone numbers. And that makes you quickly, on the fly, obfuscate your malware and quickly change signatures, right if they detect, for example, that an apt is using telephone numbers in malware. You can even do something like, we’ll split up these telephone numbers and in between, add zip codes right, convert your malware into zip codes and that’ll probably get by that signature. It’s just stuff like that that can be done on the fly and help with creativity.
Tying it back to the social engineering piece and having basically that commodity Phish email, it’s basically spear phishing at scale is really what we’re talking about. I know oftentimes we’re still targeting help desks, and help desk AI agents, is that still a viable attack method that we’re seeing clients grapple with, I mean, help desk being targeted, and AI agents used by help desks specifically?
There was just a big threat intel report of, I think it was scattered spider the apt was using, or basically solely targeting Help Desk. That being kind of the easiest vector to get in and it’s a not technical requirement. It’s a not. It’s not a technical crazy hacks or anything like that. It is just simply social engineering. It’s something we’re still seeing, as I mentioned, external networks are getting really locked down. When we’re doing external penetration test, they’re usually boring, which is a good thing for our clients, but for us, there’s only so much that we can do. People are very aware of their perimeter, and of course, we’re trying to find the things that they’re not aware of but typically, especially if we’re having repeats testing the same external network every year, we’re not really seeing anything. I imagine the threat actors are running into the same problem. They’re targeting help desks. It can be as simple as, I need my password reset, or can you reset my account. The Help Desk not verifying who they actually are. Then sometimes they use sim swapping. That attack vector hasn’t got away, but they’re using Sim swapping to have that kind of reset code or Change Password link sent to either the actual phone number of the person they’re targeting. More social engineering to get the help desk to send the reset link to an unverified phone number or unverified email. It’s definitely something we’ve seen increase of, like I said, kind of shying away from the technical, awesome zero days, or finding huge vulnerabilities and moving more to, they have their network figured out. Let’s go back to the people. All we need is one all we need is one mistake. When it comes to like, AI Help Desk chat bots, it’s basically the same thing as I was talking about earlier. Our AI testing is partially social engineering, and social engineering a chatbot, which is really weird and I don’t think I’ll ever get comfortable with or feel normal, but it is part of the process. It’s just as big of a threat as it ever has been, if not more so. I’ve been kind of in preparation for this, trying to see, pushing the limits, of not even pushing the limits, just seeing what the chatbots can do. One thing that we use, that we do a lot for social engineering. Engineering is open-source intelligence gathering, largely that looks like looking at LinkedIn and getting positions managers like interesting targets and just trying it out. I just asked one of them, the tool that we use and ask, hey, can you give me a report on potential phishing targets with some background information. It generated me a list of real employees, gathered the data from LinkedIn and made me a full even I didn’t ask for it, but like potential scenarios I could use for spear phishing. All that to say, it cuts down the time it takes to pull off these more sophisticated attacks. It cuts down the time and requirement way down. I don’t even think we’re, we’ve, everyone’s fully figured out how to best utilize it.
I think the Help Desk attack vector is interesting, because oftentimes we find, a lot of help desks are sort of eager to help, and they want to close that ticket, and they’re doing less verification of actual end users. Where you may think, that’s the most savvy group. I’m really going to try to fish these people or social engineering them, but ultimately, they’re sort of stuck in this customer service mode, and I think you can use that to your advantage.
Definitely. There are a couple different angles of how to approach social engineering, especially over the phone, whether that’s more of an aggressive approach or try in all scenarios, you’re relying on people’s desire to help, and desire to do their job well, or desire to be useful or not cause issues. Help desk employees are trained to be in that mindset. It makes sense. It’s that balance of, how do we make our help desk process efficient, and user friendly, without making it insecure. I think from that angle, it just comes down to training and training and also doing the engagements. Because a lot of times what I see when it comes to social engineering, especially phone fishing, if a client is doing regular phone fishing tests. The targets, the employees that we’re that we’re targeting, will sometimes be like, Is this one of those phishing tests. That they’re able to be like, my organization does these where we get phone calls and people try to disclose information that just like in their head. Knowing that’s going on at their company could save them. Even if they think, even if they get on the phone with a real threat actor and they start doing the whole social engineering thing, even if they think it’s just like an authorized engagement, they’re still just going aware that that’s something that they do, whereas, if employees don’t know that their company is doing that, or their company isn’t doing that, they might not even think twice about it. It’s really just about bringing that awareness. The same thing with email phishing campaigns, if your employees know that you’re doing them, they’re just more alert. They’re more suspicious. They know what’s going on again, even if they think it’s just like an authorized test, they’re still going to do what, take the steps necessary to report and everything. I think one of the biggest thing is training and combination with, keeping your employees not on their toes too much, where they’re paranoid all the time, but just, like, aware that maybe they’ll get a phone call, and whether it’s a threat actor or us, like just being prepared for that.
We had a question come in here about, what AI tools could we use for pen testing? Are there tools that can do a pen test? I know this is a very hot button item. Let me back up a little bit. We went to Black Hat and all of the early startup booths were basically AI pen testing tools. I know there’s some that we evaluate we’re ones that we always want to be using the right tools. Is there a tool that we could point to that says this does what we do, but it’s basically all automated. Is Are we there yet? Are we even close? I know the answer, but I think maybe the guests want to know,
We haven’t found one that you know is fully impressed us. Yet again, it’s where it stands, from my perspective right now is it’s a really helpful tool to make us more efficient, potentially, and make us be able to do complicated engagements quicker, because that’s one of our biggest disadvantages, versus like threat actors that we’re trying to emulate, they have all the time in the world. A lot of times all the resources in the world where we’re kind of time constrained. It can help us get those more sophisticated attacks out there faster. As far as full automation, I personally wouldn’t trust it to do a full rate access on my network, because there are a lot of things that you have to be mindful of when you’re conducting these tests, even just from, like a procedural safety test of not just like shutting, crashing things causing huge disruptions. I know it’s something where we want to be aware of and we want to look out for; I haven’t seen anything that has fully impressed me yet.
I haven’t seen anything either. It’s not like we’re not looking right, like we definitely are, like I mentioned. We don’t want to be left behind it and not embrace AI and say, it’s our competitors, so we don’t want to even touch it or promote it. No, we’re going to use it and see if it’s good. And we’ve used a few, or I’ve seen demos of some that do internal network, Active Directory testing, and they look good in the demo. They have everything set up perfectly. I have some clients who have them, and they say it’s a pain. If something breaks, they just can’t get response for like, a month or something, nothing seems to work quite right out of the box. We also actually have another one that we’re testing right now that’s doing web application testing, and that one’s okay. It seems to be doing a little bit of a better job. Even then, I can see its thought process. It’s still not 100% there. It can’t compete with one of our internal testers that we have that has a lot of experience. He’s great. He’s finding stuff that that tool isn’t so it’s not 100% there. But I think with a few more years, it’ll be a very nice addition to tooling.
You think it’s fair to say that it’s, it’s a tool that we would use along other tools, but not necessarily a replacement of any activities we would do on a pen test.
It wouldn’t replace our pen test. It might be like nested, how necessary a vulnerability scanner is. Organizations use it to keep track of vulnerabilities. It might be something like that, but it won’t replace the full pen test. Maybe it’ll find something where the internal team could get a lead on it, get it in their we are also they have eyes on it, but I don’t think it’ll fully replace it. Then there’s also some worry for me on data leaks on where are these third parties. Because to do these tests, you need to send credentials to it, maybe tokens, API keys. My worry is, how are these being stored? What’s your data retention policy? If you’re doing like source code review, are you going to disclose my encryption algorithms? All that is still a big worry for me when it comes to using third party tooling or AI tooling.
The security definitely makes sense. I think one complaint I heard as well is it’s just not configurable or customizable enough. There’s no knobs to turn. It’s sort of like you enter in all the information, click go, and then you see it sort of struggling, but you can’t really intervene. Has that been your experience as well?
That’s been exactly my experience. Then you have to reach out to the vendor and say, hey, can you guys help me tweak this. Then there’s this back and forth. Even in all that, your kind of in the dark, you don’t really know what they’re tweaking. You don’t know what’s going on. With that said, I wanted kind of jump back a little bit to, what you mentioned, that black hat, where you see a lot of these startups, with these AI tools for pen testing. That’s a worry of mine, too. I’m sure you guys have probably heard of like, vibe coding, right? A new term, that was vibe hacking. That’s essentially what these startups are. They have junior pen testers just vibe hacking, essentially telling AI to run commands, analyze info. It gets them somewhere, right? They’ll have something to show you. They’ll have that results. They’ll have Windows OS version Linux OS versions, but without that skill needed to verify the output and the full report, it’s going to be missing things. That’s a word I haven’t from the startup service perspective, all these small companies essentially by packing.
It makes sense. They would raise the floor like all the script Kitty stuff has just gotten even, even more accessible, and people can kind of sort of stamp that out, but maybe that’s a good thing in the longer term. Some of those easy attacks have now gone away, and now you’re really reliant on the more sophisticated type testing like we would do here. That makes sense, that sounds good. I don’t think we have any more questions. Jonathan or Miguel, do you have any closing thoughts here as we wrap up?
To kind of wrap up what we’re talking about or what Miguel was just talking about, and kind of comparing it to NASA’s. I think AI testing, like the penetration testing side of things could get those low hanging. In fruits, much like Nessus, will get rid of the vulnerability, the actual software vulnerabilities in your network, because we’ll see that a lot with clients that are doing quarterly scans. We don’t have a lot of directly exploitable vulnerabilities on the network, because they’re handling that side of things. However, we can still use things that scanners aren’t picking up to get, for example, domain admin or something like that. It is going to keep requiring that kind of manual verification. I think it could be really useful tool that makes it so that we actually get to focus on the more complicated side of things, or the more manual exploitation kind of things that that’s going to be missing. I do think it’ll be a very valuable tool. Closing thoughts; I don’t think we’ve quite hit the peak yet, or no one really knows how this is going to turn out. All this is kind of speculation and just based on what we’re seeing right now, but I think it is going to just keep getting crazier. There’s no signs of slowing down. It’s just kind of going to be important to kind of grow with it, not ignore it, but also be very careful on how you go about doing that.
I guess a closing thought, or maybe not even a thought, just a good tip for the people are attending. Something I’ve been having fun with is when you’re using a new model try or Google the chain of thought prompt. You know, when you’re prompting an LLM, if you just tell it, figure this out or something. It’s not very good. If you tell it to highlight to you its thought process, and tell it, do it in 20 steps and tell me why you’re doing this and how you’re thinking about it. It actually gets way better results. I think there was, was it? I think open AI. For O which doesn’t use chain of thought prompting, when they took the American mathematics test or something, it scored like an average of, like 9% but then for O, which uses the chain of thought prompting, it actually scored like 70% or higher. That just kind of showing how prompting is still very big. How you prime your AI to think and talk to you can have like a huge impact. And again, that’s going to be chain of thought prompting that that’s making a pretty big impact on the responses.
Well, sounds like the AI pen testing apocalypse is not here yet, but there’s definitely some improvements that can be made by using the right tools. But hopefully you’ve learned a few things on this webinar. Appreciate everyone attending. Thank you, Jonathan, thank you Miguel, for your time. Thank you, Clayton, Thanks all.
