How to use AI for software development and cybersecurity
August 30, 2023
We’ve seen how technology can evolve at warp speed, and AI has emerged as both a revolutionary force and a tantalizing enigma. Whether you're a seasoned developer seeking to expand your toolkit or a security enthusiast on a quest for clarity in the realm of AI, embarking on the journey to demystify this dynamic field can be both exhilarating and overwhelming.
This blog post is your starting compass through the AI labyrinth, designed to equip developers and security aficionados alike with the tools they need to not only grasp the fundamental concepts but also to ignite meaningful conversations and chart a path for deeper exploration. We’ll take a top-down approach: first identifying the areas of AI we should focus on, then decoding the terminology, and finally highlighting avenues that are well worth further investigation.
Categories of AI in development
Having spoken to many people about AI, I’ve found that a good way to organise thoughts around AI capabilities or technologies is to categorise them into one of three buckets:
AI-Assisted Development - Development using generative AI to help write, review, and document code.
AI-Assisted Applications - AI capabilities that are being added to an application, for example, a chatbot.
AI-Assisted Tooling - Tools that leverage built-in AI to make them (and the processes they're used in) more effective.
In this post, we’ll focus on AI-assisted development. We’ll cover the other two categories in future blog posts, so keep an eye out!
Other terminology you should know about
Let’s define some more terminology!
Foundational AI concepts
Artificial intelligence (AI) - The simulation of human intelligence processes by machines, enabling them to learn from experiences, adapt to new information, and perform tasks typically requiring human intelligence.
Natural language processing (NLP) - The branch of AI focused on enabling computers to understand, interpret, and generate human language.
Machine learning (ML) - A subset of AI that empowers systems to improve their performance on a specific task through exposure to data, without being explicitly programmed.
Neural networks - Computational models inspired by the human brain's structure and functioning, capable of learning patterns and relationships in data.
Language and text understanding
LLM (Large Language Model) - A type of artificial neural network that has been trained on a massive amount of text data, enabling it to generate human-like text and understand language to some extent.
Generative Pre-trained Transformer (GPT) - A family of large language models built on the transformer neural network architecture and pre-trained on extensive text data to generate coherent, human-like text.
Chatbot - AI-powered software that engages in automated conversations with users, leveraging natural language processing to understand and respond to text or speech inputs.
AI-assisted development
Here we’re talking about how AI is making us more productive and effective at writing and packaging up code. We’ll start with writing code, then move on to testing and reviewing.
Writing code
The most obvious and most talked-about use of an AI tool at development time is code suggestion. Some tools, such as ChatGPT, provide code examples based on your specific requests. You ask in conversational text for code that performs a task, for example, Java code that extracts the contents of a zip file into a destination directory.
Let’s take a look at the code it generates.
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ZipExtractor {

        public static void extract(ZipFile zip, String dest) throws IOException {
            // Create the destination directory if it does not exist yet
            File dir = new File(dest);
            if (!dir.exists()) {
                dir.mkdirs();
            }

            Enumeration<? extends ZipEntry> entries = zip.entries();

            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                File destFile = new File(dir, entry.getName());

                if (entry.isDirectory()) {
                    destFile.mkdirs();
                } else {
                    destFile.getParentFile().mkdirs();

                    // Copy the entry's bytes to the destination file
                    try (InputStream in = zip.getInputStream(entry);
                         OutputStream out = new FileOutputStream(destFile)) {
                        byte[] buffer = new byte[1024];
                        int len;
                        while ((len = in.read(buffer)) > 0) {
                            out.write(buffer, 0, len);
                        }
                    }
                }
            }
        }
    }
Other tools, such as GitHub Copilot, Amazon CodeWhisperer, and Tabnine, offer similar functionality directly in the IDE you’re using. This means that while you’re writing code, such as a method definition or a code comment, the IDE plugin will provide code suggestions that you can accept or reject.
Oh, as an early spoiler: the code generated above has a critical-severity vulnerability in it. That’s right, LLMs learn from community code, which is riddled with vulnerabilities, so the code suggested to us will contain vulnerabilities too. It’s imperative that we continue to test with our usual tools to ensure that generated code, as well as the code we write ourselves, is secure. Take a look at the code again and see if you can spot the vulnerability. Continue reading to the next section, Testing code, where we’ll show you the details of the vuln and how to fix it!
Another interesting tool is AI Query, which generates SQL queries from your natural language inputs. This comes in particularly handy if you’re trying to create a divide between your LLM and your data for security reasons. You’re then able to validate that the generated queries are not doing anything risky and are reasonable queries for an end user to make with their level of authorization.
Testing code
Are you here because you found the vulnerability? Or did you not even try and are here just for the answer? Either way, let’s narrow it down a bit :) The vulnerability is an example of a Zip Slip security issue. It’s a cross between a directory traversal and an arbitrary file overwrite, and it can even lead to arbitrary code execution. Here is the line that causes the vulnerability:
File destFile = new File(dir, entry.getName());
Let’s show you the contents of a dangerous archive first and it will likely click as to why the code above is vulnerable. The following file being added into a zip archive is perfectly legal according to the zip file format specification.
     5 Fri Aug 18 11:04:29 BST 2023 good.sh
    20 Fri Aug 18 11:04:42 BST 2023 ../../../../../../../../tmp/evil.sh
As you’ve likely worked out, when we join this entry name onto the destination directory, the ../ segments walk back up the directory tree, so the file will likely be written to the system temp directory rather than our destination directory. This can be a pretty dangerous attack vector, yet code like this is quite typical output from code generation tools.
For completeness, the fix is to get the canonical path after concatenation and validate that it is within the target directory, as follows:
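To make the path arithmetic concrete, here is a minimal Java sketch that is not part of the original example. The class name ZipSlipPathDemo, its method names, and the destination directory /opt/app/uploads are all illustrative assumptions. It resolves an entry name against a destination directory, collapses the ".." segments, and reports whether the result still sits inside that directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: the class name, method names, and paths are assumptions for this sketch.
final class ZipSlipPathDemo {

    // Joins an entry name onto the destination directory and collapses "." and ".." segments.
    static Path resolveEntry(Path destDir, String entryName) {
        return destDir.resolve(entryName).normalize();
    }

    // True when the resolved entry would be written outside the destination directory.
    static boolean escapes(Path destDir, String entryName) {
        return !resolveEntry(destDir, entryName).startsWith(destDir.normalize());
    }

    public static void main(String[] args) {
        Path destDir = Paths.get("/opt/app/uploads");
        for (String name : new String[] {"good.sh", "../../../tmp/evil.sh"}) {
            System.out.println(name + " -> " + resolveEntry(destDir, name)
                    + " (escapes: " + escapes(destDir, name) + ")");
        }
    }
}
```

Note that normalize() only works on the path text and does not follow symbolic links, which is one reason a real check would more likely compare canonical or real paths.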
    String canonicalDestinationDirPath = dir.getCanonicalPath();
    File destFile = new File(dir, entry.getName());
    String canonicalDestinationFile = destFile.getCanonicalPath();
    // Reject any entry whose resolved path falls outside the destination directory
    if (!canonicalDestinationFile.startsWith(canonicalDestinationDirPath + File.separator)) {
        throw new IOException("Entry is outside of the target dir: " + entry.getName());
    }
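For context, here is one way the check could sit inside a complete extraction loop. This is a sketch under assumptions rather than the definitive fix. The class name SafeZipExtractor is hypothetical, and it swaps java.io.File for java.nio.file.Path, whose startsWith compares whole path segments instead of raw string prefixes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Sketch only: SafeZipExtractor is a hypothetical name, not code from the original post.
final class SafeZipExtractor {

    static void extract(ZipFile zip, Path destDir) throws IOException {
        Files.createDirectories(destDir);
        // Resolve symlinks once so every entry is compared against the real target directory.
        Path root = destDir.toRealPath();

        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            ZipEntry entry = entries.nextElement();
            Path target = root.resolve(entry.getName()).normalize();

            // Segment-wise comparison: /data/out-evil does not count as inside /data/out.
            if (!target.startsWith(root)) {
                throw new IOException("Entry is outside of the target dir: " + entry.getName());
            }

            if (entry.isDirectory()) {
                Files.createDirectories(target);
            } else {
                Files.createDirectories(target.getParent());
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    // Usage: java SafeZipExtractor.java <archive.zip> <destination-directory>
    public static void main(String[] args) throws IOException {
        try (ZipFile zip = new ZipFile(args[0])) {
            extract(zip, Paths.get(args[1]));
        }
    }
}
```

One design point to be aware of: this version stops at the first bad entry, so any entries before it will already have been written. Validating every entry name before extracting anything would avoid leaving a partial result.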
I’ll pull this file into the IntelliJ IDE, where I already have Snyk installed. Snyk scans the source code and, using its symbolic AI capabilities, understands the data and code flows through the application. As you can see from the screenshot below, it has identified the issue in the new code (1), provided a full description of the issue (2), and even offered examples of how to fix it, similar to what we showed above (3).
We mentioned symbolic AI, which is one of the types of AI that Snyk uses. There are other types of AI too, and each has various advantages and disadvantages depending on your goals. Read our guide to using different AI models for more information. The wave of AI tools made popular by ChatGPT is built on machine learning and neural networks, but it helps to know the wider set of approaches:
Machine Learning AI - An AI approach that enables systems to learn from data and improve performance on a specific task without being explicitly programmed, with subtypes including supervised, unsupervised, and reinforcement learning.
Neural Network AI - A subset of machine learning where artificial neural networks, inspired by the human brain, are used to recognize patterns, features, and relationships in data.
Symbolic AI (or Symbolic Reasoning) - A type of AI that represents knowledge using symbols, rules, and logic to perform tasks, often involving human-readable expressions and formal reasoning.
Evolutionary AI (or Genetic Algorithms) - An AI approach that mimics the process of natural selection to evolve and optimize solutions to problems, often used for optimization and design tasks.
Expert System AI - A type of AI that emulates human expertise in a specific domain by using a knowledge base of facts and rules to make decisions or provide recommendations.
Another tool to take a look at is Codium, an AI-based tool that can analyse your code, tell you what it does, create a test plan, and even generate the tests which you can copy and use in your unit test suites.
Reviewing code
For looking at code that has already been written, such as a pull request, you might be interested in taking a look at What the Diff. It’s an interesting Git-based tool that is capable of summarising the changes made in the PR and can help you write a summary for the PR, and even help with refactoring of the code in the PR.
So we’ve covered a lot there: writing code, testing code, reviewing code… Ready to get started?
Make sure that your code changes aren’t introducing regressions or new issues. These can range from quality issues to security issues like the one shown above.
Make sure you keep using your usual tools throughout the review and SDLC process. Generative AI shouldn’t replace them. If anything, you’ll rely on them more, since more code and functionality will likely be developed and delivered in the same amount of time.
Check out our cheat sheet for best practices when using AI in the SDLC. Snyk’s integrations into the SDLC, including our Git repo PR integrations, can help you with your code reviews as part of your usual workflow to ensure you’re not introducing first- or third-party vulns, vulnerable containers, or IaC security issues or misconfigurations.