Evaluating ChatGPT Effectiveness in Passing A Secure Programming Challenge

ChatGPT serves as a valuable tool by providing immediate information without the need for manual searches. It seems particularly good at Python, and while it may not always offer the ideal solution, it serves as a helpful starting point for developers to explore and work out solutions on their own.

ChatGPT is immensely popular. Not only can it hold engaging conversations, it can also generate code, reportedly well enough to land a job with a hefty salary at Google.

But it doesn’t end there. I decided to conduct an experiment, wondering if ChatGPT could go beyond coding skills and tackle security vulnerabilities.

The question arose: Could we simply feed it vulnerable code and expect a reliable, foolproof patch?

Armed with the freely available ChatGPT Legacy 3.5, I posed the challenge.
The challenge we’ll be using is XSS.py, which is hosted on SecDim Play.

XSS is a simple vulnerability whereby an attacker passes malicious code as an input parameter and an insecure website ends up running it.
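To make that concrete, here is a minimal sketch (not the challenge’s actual code) of the kind of handler this describes: user input is reflected straight into HTML, so a script tag in the parameter would run in the victim’s browser.

```python
def say_hello(name):
    # User input is reflected into the HTML response with no escaping:
    # this is the classic reflected-XSS sink.
    return f"<h1>Hello, {name}</h1>"

# A benign name behaves as intended...
print(say_hello("John"))  # <h1>Hello, John</h1>
# ...but a malicious one is reflected verbatim, and a browser would execute it.
print(say_hello("<script>alert(1)</script>"))
```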

I entered the following prompt:

Here is some vulnerable python code
{Pasted in the vulnerable python code from the challenge}
It is particularly vulnerable to XSS attacks, can you secure the code with a robust security patch. You are allowed to use the following libraries only:
from django.http import HttpResponse
import os
from django.conf import settings

ChatGPT gave the following:

For the imports:

from django.http import HttpResponse, HttpResponseRedirect
from django.urls import reverse
from django.template import loader, Context
from django.views.decorators.http import require_GET
import os
from django.conf import settings
from django.utils.html import escape

and the say_hello() function:

def say_hello(request):
    name = request.GET.get('name', None)
    if not name:
        return HttpResponseRedirect(reverse('main'))
    context = {'name': escape(name)}
    template = loader.get_template('hello.html')
    response = HttpResponse(template.render(context))
    return response

ChatGPT unexpectedly tangled itself up in a seemingly straightforward solution, breaking the usability tests and rendering the application unusable.

The usability tests were quite simple: running the application and passing the name parameter with “John” should return an HTTP response of “Hello, John”, yet in this case it clearly was not returning that.

I relayed the terminal output to ChatGPT as-is, hoping it would understand and self-correct based on the valuable feedback from the test results.

Yet, an unexpected twist occurred: ChatGPT fixated on a new element, the elusive “hello.html” file, which was something it had introduced itself.

Further interrogation didn’t seem to bear any fruit, so I changed my approach. I tried the following prompt:

The following is python code which is vulnerable to XSS
{base vulnerable code goes here}
Write a fix for this vulnerability, keeping code changes to a minimum to avoid breaking any usability tests. The usability test expects a HttpResponse of “Hello, ”

With the new code implementation at hand, I eagerly inserted it and initiated a test, and thankfully this time the tests passed.

ChatGPT’s capabilities were impressive. While it had a slight hiccup in understanding the usability tests, it was clear the key lay in crafting a better prompt. By providing clear instructions to consider the usability tests and emphasising minimal code changes, ChatGPT could successfully patch the vulnerability.
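As an illustration of what a minimal-change patch looks like, here is a sketch using Python’s standard-library html.escape (the stdlib analogue of django.utils.html.escape, which the challenge permits): the only change to the vulnerable handler is wrapping the user-controlled value in escape() before it reaches the response.

```python
from html import escape  # stdlib analogue of django.utils.html.escape

def say_hello(name):
    # Escape the user-controlled value before reflecting it; nothing else changes.
    return f"Hello, {escape(name)}"

print(say_hello("John"))                       # Hello, John
print(say_hello("<script>alert(1)</script>"))  # markup is neutralised
```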

Curiously, I then decided to test if ChatGPT could handle the same vulnerability in a different language. The aforementioned XSS challenge is available in JavaScript as well, and became the second test.

I gave the following prompt:

Here is the same vulnerability but in Javascript
{Inserted the vulnerable javascript code}
Can you generate a robust security patch for this? Generate the code for it, keeping coding changes to a minimum and being mindful of the usability tests:
{Usability tests go here}

I got the following response, which changes the app.get(“/sayHello”) section:

app.get("/sayHello", (req, res) => {
	if (!req.query.name) {
		res.status(200).end("Tell me who to say hello? e.g. /sayHello/?name=alice");
	}
	const name = req.query.name.replace(/[<>]/g, '');
	res.status(200).end(`<h1>Hello, ${name}</h1>`);
});

This code broke the usability tests. I gave the terminal output to ChatGPT for self-correction; however, it once again started running in circles. The code was more of the same: broken and unusable.

This showed that while ChatGPT was able to tackle the vulnerability in Python, it seemed to struggle when the same problem was presented in JavaScript.

I decided on a more productive use and gave the following prompt:

how do you escape HTML in javascript?

Which gave the following response:

function escapeHtml(str) {
  return str.replace(/[&<>"']/g, function(match) {
    switch(match) {
      case '&':
        return '&amp;';
      case '<':
        return '&lt;';
      case '>':
        return '&gt;';
      case '"':
        return '&quot;';
      case "'":
        return '&#039;';
    }
  });
}

This was quite helpful. I plugged the function into my code and simply passed my input parameter through it.
I ran it through the test and voilà! Passed!

Now this was an interesting observation: ChatGPT was able to handle a vulnerability with a few hiccups in Python but struggled in JavaScript. I decided to conduct one more experiment before drawing my conclusion, trying a more difficult challenge: Snowman.py.

Since it’s a Python challenge, ChatGPT should fare reasonably well.

I gave the following prompt:

The following python code has a security vulnerability where the check for bad characters is insufficient:
{Vulnerable Python Code goes here}
Can you patch this code? Keep coding changes to a minimum and be mindful of the usability tests: {Usability tests go here}

I got the following code response from GPT:

def hasBadChars(username):
    if not username:
        return True
    # Check for potentially malicious input using regular expressions
    pattern = re.compile(r'[^\w@.+_-]')
    return pattern.search(username) is not None

def sanitize_input(input_str):
    # Remove any non-printable characters and normalize the input string
    input_str = unicodedata.normalize('NFKC', ''.join(c for c in input_str if unicodedata.category(c)[0] != 'C'))
    # Remove any character that's not alphanumeric or allowed symbol
    return ''.join([c for c in input_str if c.isalnum() or c in '@.+_-'])

def render(msg):
    clean_msg = sanitize_input(msg)
    if hasBadChars(clean_msg):
        return "[i] Username contains bad characters or is empty and is not safe for registration."
    return f"[i] Username is safe for registration: {clean_msg.replace('<', '&lt;').replace('>', '&gt;')}"

Running a test with the aforementioned changes did not result in usability errors; however, the patch did not address the security vulnerability.

This falls in line with our expectation that ChatGPT doesn’t seem to have a problem adhering to usability tests if you instruct it correctly in Python. However, the security patch was still not enough to pass the security tests.

Giving ChatGPT a second chance, I tried again: I fed the terminal output to it as a prompt. The output patch was not satisfactory, so the terminal output was fed to it again, several times.

Rather than try to understand the problem and address the root cause, it effectively started brute forcing it.

By the end of the third attempt, it was clear ChatGPT was struggling. I then decided to help it out a bit.

Using an available hint that gives information about the vulnerability, at the cost of 5 points, I got additional information and context including specific CWE codes that might help.

I went one step further and also opened the article in the ReadMe. Since Snowman.py is based on a real-world incident in which Spotify was hacked, I opened the article and shared a choice selection from it, notably where the engineering team described the vulnerability and how it was being exploited to attack the system.

With all this, I finally crafted the following prompt:

The following is vulnerable python code:
{Vulnerable Python Code}

It suffers from a security vulnerability:
{Hint Content with CWE and Vulnerability Description}

Notes from the engineering team note that:
{Copy pasted part of the article that discussed the vulnerability and how the Engineering team fixed it/ addressed the vulnerability}

Write a fix for this vulnerability, keeping code changes to a minimum to avoid breaking any usability tests. Also avoid importing or using any libraries outside the Django library or what’s already given. There are 2 usability tests:
{Pasted the Usability Tests over here}

I went ahead and entered it and got a code output!
I chucked it in and tested it: Passed!

While GPT managed to pass the security tests of the challenge, I would later receive feedback from a security expert that the given patch had problems.

ChatGPT did not address the root of the issue; instead, its patch looks like a “brute-force” approach to the security vulnerability. The root cause was a bad choice of normaliser, namely a compatibility-mode normaliser (i.e. NFKC or NFKD). These normalisers convert a character to a similar character and are not a good choice for security, as this can be exploited by an adversary to bypass security validation.

Per the W3C recommendation, the safest normaliser is NFC, which is a canonical-mode normaliser. It does not convert characters to lookalikes and is thus preferable for security.
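The difference is easy to demonstrate with Python’s unicodedata module: NFKC folds “compatibility” lookalikes such as the fullwidth angle brackets into their ASCII counterparts, which lets an attacker smuggle markup past a filter that runs before normalisation, whereas NFC leaves them untouched.

```python
import unicodedata

# Fullwidth '<' (U+FF1C) and '>' (U+FF1E) look harmless to an ASCII filter...
payload = "\uff1cscript\uff1e"

# ...but a compatibility normaliser folds them into real angle brackets.
print(unicodedata.normalize("NFKC", payload))  # <script>

# The canonical normaliser NFC performs no such conversion.
print(unicodedata.normalize("NFC", payload) == payload)  # True
```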

Furthermore, ChatGPT only cherry-picked a few characters when it should have restricted the character set to a safe ASCII range. Instead of enforcing a whitelist of allowed characters, ChatGPT used an insecure blacklisting approach that strips a cherry-picked handful.
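A whitelist version of the check is short; here is a sketch with a hypothetical is_safe_username helper: instead of stripping a handful of bad characters, it rejects anything outside an explicitly allowed safe-ASCII set.

```python
import re

# Allowlist: only these ASCII characters may ever appear in a username.
ALLOWED = re.compile(r"[A-Za-z0-9@._+-]+")

def is_safe_username(username):
    # Reject empty input and any input containing a character outside the allowlist.
    return bool(username) and ALLOWED.fullmatch(username) is not None

print(is_safe_username("alice_01"))            # True
print(is_safe_username("\uff1cscript\uff1e"))  # False: fullwidth brackets are not allowed
```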

For more information see our article on Unicode Vulnerabilities.

For purposes of passing the tests in the challenge, ChatGPT’s solution definitely ‘worked’. However, when considered as a potential solution to the vulnerability, it is an insecure patch that doesn’t address the root cause of the vulnerability.

This serves to better highlight that the security tests for these challenges are relaxed and do not enforce strict conditions when addressing the security vulnerabilities. While that isn’t the case for the higher difficulty challenges, in this case ChatGPT was able to manage by simply brute forcing through the security test requirements.

What was learnt from this experiment is:

  • ChatGPT seems particularly good at handling Python code. This is somewhat understandable given that Python is one of the most popular languages, so ChatGPT would have a massive corpus to learn from compared to other languages.
  • ChatGPT would constantly struggle with usability requirements and, unless strictly guided, would end up making a lot of breaking changes to existing code.
  • With enough context and information, ChatGPT was able to create a ‘good enough’ patch to pass the challenge, but it still was not able to address the root cause.
  • ChatGPT has a propensity to nonsensically try brute-force strategies instead of addressing the root cause of a vulnerability, despite having context.
  • ChatGPT suffers immensely when a similar vulnerability is presented in another language.

I’ll leave it up to you to decide how much of an impact this has on security vulnerabilities and whether or not ChatGPT will end up competing with secure programming anytime soon. However, I’d point out two key lessons I took away from this experiment and believe you should consider as well.

  1. While the experiment showcased the case of Python, given enough context and information along with specific instructions, ChatGPT is able to generate code that minimises collateral damage to your existing code base while possibly providing the intended solution. Keywords like “minimal changes to code”, “being mindful of the following tests” and “using only the following libraries” are helpful in a prompt.

  2. While ChatGPT may break the application when patching it, and may prefer brute force over addressing the root cause, this doesn’t doom its prospects in secure programming. Asking specific, atomic questions such as “How do you escape HTML in JavaScript?” can produce practical, self-contained solutions that can be transplanted into your code while giving you a good general direction to work towards. However, effective use of the solution depends on you, the programmer! You can use ChatGPT to assist in developing a solution, but it’s up to you to fit it into your codebase and ensure it is an effective solution that addresses the problem correctly.

Essentially, ChatGPT doesn’t do all the work for you. But if you understand what the problem is and know the general direction for the implementation, ChatGPT can generate specific code that you can incorporate into your code base while pointing the way forward.
