March 24, 2023 by Gabriel Bassett

Large Language Models in Security


Large Language Models (or LLMs for short) are all the rage right now. (I’m going to use LLM rather than ‘AI’ because AI is a pretty general term and carries a lot of ambiguity.) From ChatGPT to Bing AI to Bard, everyone’s getting into it.

In this post I’m going to try to skip, or only lightly touch, most of the common concepts that are covered elsewhere. You don’t need me to tell you what LLMs are, and you don’t need me to tell you what they’re not (search engines). You don’t need me to tell you they lie, or that they raise ethical concerns, both in bias and in their use of other people’s works. And hopefully we can agree they are not alive and don’t really want to steal you away from your spouse.

I do want to touch on some of the things they are good at though.

  • LLMs are decently good at drafting content you are already familiar with. Often that first draft is the hardest and LLMs can do an excellent job with that draft. It won’t be right, but it will offer a starting place.
  • LLMs are also good at code switching (adapting the variety of language separately from the content). If you want to convey something, LLMs can say it succinctly or in long form, at an executive or technical level, at a basic level of understanding or at a professional level, in a friendly way or an angry way.
  • LLMs can search contextually. For example, you can describe a word by what it means to you and the LLM will likely be able to help you identify it.

The reason LLMs are good at the last point is their ‘embedding’. Imagine you had most of the words in the English language and a block of jello. Your job is to put all the words into the jello so that each one sits close to similar words and farther from dissimilar ones. The X, Y, Z point where you put a word would be its embedding in the jello. In practice, machine learning (ML) researchers skip the jello and use far more than 3 dimensions, but the idea is the same.
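To make the jello picture concrete, here is a minimal sketch. The 3-D coordinates are hand-picked for illustration only (they are my assumption, not output from any real model), and `nearest` is a hypothetical helper that finds the closest words by straight-line distance.

```python
import math

# Hand-picked, purely illustrative 3-D positions "in the jello".
# Real embeddings are learned by a model and use far more dimensions.
EMBEDDINGS = {
    "dog": (0.90, 0.80, 0.10),
    "puppy": (0.85, 0.90, 0.15),
    "cat": (0.80, 0.60, 0.20),
    "firewall": (0.10, 0.20, 0.90),
    "router": (0.15, 0.10, 0.85),
}


def nearest(word, k=2):
    """Return the k words whose points sit closest to `word`."""
    target = EMBEDDINGS[word]
    ranked = sorted(
        (math.dist(target, vec), other)
        for other, vec in EMBEDDINGS.items()
        if other != word
    )
    return [other for _, other in ranked[:k]]


print(nearest("dog"))            # the animal words cluster together
print(nearest("firewall", k=1))  # the networking words cluster together
```

Similar words end up as neighbors simply because their points are close together, which is all "contextual search" needs.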

That embedding can actually be incredibly useful. This project creates embeddings for the sections of a document and then answers questions by retrieving paper sections similar to the question. There are even databases designed specifically to store embeddings and projects just to do the comparisons. It’s not hard to imagine this being useful any place you need to quickly contextualize large amounts of text, whether it be searching notes, summarizing the state of academic research, or understanding what resumes are most similar to a job description.
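The retrieval idea can be sketched in a few lines. This is not the code of the project mentioned above. The `embed` function here is a crude stand-in (plain word counts) so the example runs without a model, and the section names and text are made up. A real system would swap in a learned embedding, which would also match on meaning rather than only on shared words.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in 'embedding': lowercase word counts (an assumption for the demo)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical document sections.
SECTIONS = {
    "phishing": "Phishing emails trick users into giving up credentials.",
    "patching": "Patching systems quickly closes known software vulnerabilities.",
    "backups": "Offline backups let you recover data after ransomware.",
}


def most_similar_section(question):
    """Return the name of the section most similar to the question."""
    q = embed(question)
    return max(SECTIONS, key=lambda name: cosine(q, embed(SECTIONS[name])))


print(most_similar_section("How do I recover data after ransomware?"))
```

The retrieved section would then be handed to the LLM as context for answering the question.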

As Stephen Wolfram points out in his excellent article, LLMs don’t just embed the text, they then find common paths through that embedding. Just as “dog” is unlikely to follow “The dog” in a sentence, there are clear patterns in English, or any other language. LLMs exploit those patterns to generate real-sounding text. The same way superimposing AI generated profile pictures on top of each other tends to show some common things, such as the locations of the eyes and nose, LLMs generate text that ‘flows’ similarly. In the case of text it’s a benefit, as we like our text to sound like it belongs. However, as the internet gets flooded with such text, novelty may actually become a premium feature.
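A toy way to see those patterns is a bigram model, which only counts which word follows which. To be clear, this is a drastic simplification I am using for illustration, not how an LLM actually works, and the tiny corpus is made up.

```python
import random
from collections import Counter, defaultdict

# Made-up corpus; a real model sees vastly more text.
CORPUS = "the dog chased the cat . the cat chased the mouse . the dog barked ."


def build_bigrams(text):
    """Count, for each word, which words follow it and how often."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def generate(follows, start, length, seed=0):
    """Walk the counts, picking each next word in proportion to how often it was seen."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        options = follows.get(words[-1])
        if not options:
            break  # no known continuation for this word
        choices, weights = zip(*options.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)


follows = build_bigrams(CORPUS)
print(follows["the"])                # 'dog' and 'cat' are common after 'the'
print(follows["dog"]["dog"])         # 0: 'dog' never follows 'dog' here
print(generate(follows, "the", 6))
```

The generated text sounds plausible only because it follows well-worn paths, which is also why it tends toward the familiar rather than the novel.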

LLMs have other potential benefits, for example managing APIs. When we build something like a security program, we can either buy a fully integrated solution or integrate lots of parts. While integrating the parts ourselves may have lots of benefits, the downside is that maintaining the integrations can be hard as the software we use is updated. LLMs may potentially solve this by understanding what changed in the API and translating the previously correct integration into the new correct integration. Already, OpenAI’s ChatGPT plugins are described with an OpenAPI specification, which the LLM itself reads to work out how to call the API.

LLMs are also good at structuring data. There are many tasks, particularly in data science, that become substantially easier once the data is structured. LLMs can translate unstructured data (for example a news article about a breach) into structured data (for example VERIS or Attack Flow). Even if the result isn’t perfect, it gets close enough to then import into an editor for a human to finalize.
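Because the output won’t be perfect, it helps to check it before it reaches the human editor. The sketch below assumes the LLM was asked to reply with a JSON object. The three field names are a simplified illustration of my own, not the actual VERIS or Attack Flow schema, and `parse_llm_record` is a hypothetical helper.

```python
import json

# Simplified, illustrative field names; not the real VERIS schema.
REQUIRED_FIELDS = {"actor", "action", "asset"}


def parse_llm_record(raw):
    """Parse an LLM reply; return (record, problems) for a human to review."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["expected a JSON object"]
    missing = sorted(REQUIRED_FIELDS - set(record))
    return record, [f"missing field: {name}" for name in missing]


# Example reply, written by hand here rather than produced by a model.
reply = '{"actor": "external", "action": "hacking"}'
record, problems = parse_llm_record(reply)
print(record)
print(problems)  # lists what the human still needs to fill in
```

The point is not to reject imperfect output but to flag exactly what still needs a person’s attention.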

I even think there’s an interesting application in having LLMs identify how to compress data. Because an LLM already knows the embedding, which is itself a form of compression, it should be able to write its own algorithms for compressing data to make transferring data to, and communicating with, the LLM more efficient.

Finally, as LLMs proliferate, I think we need to ask ourselves what it looks like for LLMs to work together. If we are aware that they have biases, can we train multiple models to reflect different biases or goals and have them work together to solve problems and answer questions? Can we federate LLMs such that no one person has to maintain a very large, expensive LLM, but a group of them together can equal or even surpass the state of the art?

All in all, AI, ML, and LLMs in particular offer amazing opportunities and amazing disruptions. We truly don’t know what impact they will have, though AI ethicists are putting in a lot of work to help us figure it out. For the time being it behooves us to stay aware of where they are and where they are going.