Evaluating ChatGPT’s Effectiveness in Solving a Secure Programming Challenge

TL;DR:
ChatGPT serves as a valuable tool by providing immediate information without the need for manual searches. It seems particularly good at Python, and while it may not always offer the ideal solution, it serves as a helpful starting point for developers to explore and work out solutions on their own.

ChatGPT is immensely popular. Not only can it hold engaging conversations, but it is also able to generate code! So much so that it could potentially land a job with a hefty salary at Google! (Check out the intriguing story here).

But the intrigue doesn’t end there. I embarked on a captivating experiment, wondering if ChatGPT could go beyond coding skills and tackle security vulnerabilities. The question arose: Could we simply feed it vulnerable code and expect a reliable, foolproof patch?

Armed with the freely available ChatGPT Legacy 3.5, I posed the challenge.
SecDim Play conveniently has a challenge in Python for the XSS security vulnerability.

XSS is a simple vulnerability whereby a malicious hacker passes malicious code (such as a script tag) as an input parameter, and an insecure website reflects it back so that the victim’s browser ends up running it.
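
To illustrate the pattern, here is my own minimal sketch of a reflected XSS in a Django view, not the challenge’s actual code: user input is dropped straight into an HTML response without escaping.

from django.http import HttpResponse

def say_hello(request):
    # Hypothetical example: the 'name' query parameter is reflected
    # into the HTML response without any escaping, so a value like
    # ?name=<script>alert(1)</script> ends up executing in the browser.
    name = request.GET.get('name', '')
    return HttpResponse("<h1>Hello, %s</h1>" % name)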

I entered the following prompt:

Here is some vulnerable python code
{Pasted in the vulnerable python code from the challenge}
It is particularly vulnerable to XSS attacks, can you secure the code with a robust security patch. You are allowed to use the following libraries only:
from django.http import HttpResponse
import os
from django.conf import settings

ChatGPT gave the following:

For the imports

from django.http import HttpResponse, HttpResponseRedirect
from django.urls import reverse
from django.template import loader, Context
from django.views.decorators.http import require_GET
import os
from django.conf import settings
from django.utils.html import escape

and the say_hello() function

def say_hello(request):
    name = request.GET.get('name', None)
    if not name:
        return HttpResponseRedirect(reverse('main'))
    else:
        context = {'name': escape(name)}
        template = loader.get_template('hello.html')
        response = HttpResponse(template.render(context))
        return response

ChatGPT unexpectedly tangled itself up in what should have been a straightforward solution: its patch broke the usability tests and rendered the application unusable.

The usability tests were quite simple: running the application and passing “John” as the name parameter should return an HTTP response of “Hello, John”. In this case, it clearly was not returning that.
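
For reference, such a check might look something like the sketch below; this is my own approximation using Django’s test client and a placeholder URL, not the challenge’s actual test.

from django.test import Client

def test_say_hello_usability():
    # Approximation of the usability test: passing ?name=John should
    # produce an HTTP 200 response containing "Hello, John".
    client = Client()
    response = client.get('/', {'name': 'John'})  # '/' is a placeholder path
    assert response.status_code == 200
    assert "Hello, John" in response.content.decode()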

I relayed the situation to ChatGPT by pasting in the terminal output as-is, hoping it would understand and self-correct based on the feedback from the test results.

Yet an unexpected twist occurred: ChatGPT fixated on a new element, the elusive “hello.html” template, a file it had introduced itself.

Further interrogation didn’t seem to bear any fruit, so I changed my approach. I tried the following prompt:

The following is python code which is vulnerable to XSS
{base vulnerable code goes here}
Write a fix for this vulnerability, keeping code changes to a minimum to avoid breaking any usability tests. The usability test expects a HttpResponse of “Hello, ”

With the new code implementation at hand, I eagerly inserted it and initiated a test, hoping for a positive outcome. To my astonishment, the tests passed!

ChatGPT’s capabilities left me impressed. While it had a slight hiccup in understanding the usability tests, I quickly realized that the key lay in crafting a better prompt. Once I provided clear instructions to consider the usability tests and emphasized minimal code changes, ChatGPT successfully patched the vulnerability.
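
That second patch isn’t reproduced here, but a minimal-change fix along these lines, assuming the original view simply formatted the name straight into an HttpResponse, would look roughly like this (my own sketch, not ChatGPT’s verbatim output):

from django.http import HttpResponse
from django.utils.html import escape

def say_hello(request):
    name = request.GET.get('name', '')
    # escape() neutralises characters such as < > & " ' so the
    # reflected value can no longer be interpreted as HTML.
    return HttpResponse("Hello, %s" % escape(name))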

Curious, I then decided to check whether ChatGPT could handle the same vulnerability in a different language. Fortunately for us, the XSS challenge is also available in JavaScript.

I gave the following prompt:

Here is the same vulnerability but in Javascript
{Inserted the vulnerable javascript code}
Can you generate a robust security patch for this? Generate the code for it, keeping coding changes to a minimum and being mindful of the usability tests:
{Usability tests go here}

I got the following response, which changes the app.get("/sayHello") section:

app.get("/sayHello", (req, res) => {
	if (!req.query.name) {
		res.status(200).end("Tell me who to say hello? e.g. /sayHello/?name=alice");
		return;
	}
	const name = req.query.name.replace(/[<>]/g, '');
	res.status(200).end(`<h1>Hello, ${name}</h1>`);
});

Sadly, this code broke the usability tests. I gave the terminal output to ChatGPT in the hope it would self-correct, but it once again started running in circles. Seeing that this interrogation was also leading nowhere, I decided to backtrack. I tried a number of prompts, but I either got code that broke usability or code that simply didn’t address the vulnerability.

This suggested that GPT, while adept at dealing with Python vulnerabilities, struggles quite a bit when the same vulnerability is presented in JavaScript.

Deciding on a more productive use, I gave the following prompt:

how do you escape HTML in javascript?

To which I got the following handy response:

function escapeHtml(str) {
  return str.replace(/[&<>"']/g, function(match) {
    switch(match) {
      case '&':
        return '&amp;';
      case '<':
        return '&lt;';
      case '>':
        return '&gt;';
      case '"':
        return '&quot;';
      case "'":
        return '&#039;';
    }
  });
}

This was quite helpful. I plugged the function into my code and simply passed my input parameter through it.
I ran it through the tests and voilà! Passed!

Now this was an interesting observation: ChatGPT was able to handle the vulnerability in Python with a few hiccups but struggled in JavaScript. I decided to conduct one more experiment before drawing my conclusion, trying a more difficult Python challenge: Snowman.py

I gave the following prompt:

The following python code has a security vulnerability where the check for bad characters is insufficient:
{Vulnerable Python Code goes here}
Can you patch this code? Keep coding changes to a minimum and be mindful of the usability tests: {Usability tests go here}

I got the following code response from GPT:

def hasBadChars(username):
    if not username:
        return True
    # Check for potentially malicious input using regular expressions
    pattern = re.compile(r'[^\w@.+_-]')
    return pattern.search(username) is not None

def sanitize_input(input_str):
    # Remove any non-printable characters and normalize the input string
    input_str = unicodedata.normalize('NFKC', ''.join(c for c in input_str if unicodedata.category(c)[0] != 'C'))
    # Remove any character that's not alphanumeric or allowed symbol
    return ''.join([c for c in input_str if c.isalnum() or c in '@.+_-'])

def render(msg):
    clean_msg = sanitize_input(msg)
    if hasBadChars(clean_msg):
        return "[i] Username contains bad characters or is empty and is not safe for registration."
    else:
        return f"[i] Username is safe for registration: {clean_msg.replace('<', '&lt;').replace('>', '&gt;')}"

As expected, GPT doesn’t seem to have a problem adhering to usability tests in Python if you instruct it correctly. However, the security patch was not enough.

There was a bad choice of normaliser, namely a compatibility-mode normaliser (i.e. NFKC or NFKD). These normalisers convert a character into a similar-looking character. They are not a good choice for security, as this behaviour can be exploited by an adversary to bypass security validation.
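
To make the problem concrete, here is a small illustration of how a compatibility normaliser can undo a validation check that ran before normalisation (my own example, not code from the challenge):

import unicodedata

# A fullwidth less-than sign (U+FF1C) is not the ASCII character '<',
# so a naive bad-character check on the raw input lets it through.
payload = "\uff1cscript\uff1e"
print("<" in payload)            # False

# NFKC compatibility normalisation then converts it into a real '<',
# re-introducing the characters the check was supposed to block.
normalised = unicodedata.normalize("NFKC", payload)
print(normalised)                # <script>
print("<" in normalised)         # True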

However, I gave GPT the benefit of the doubt and let it try again: I fed it the terminal output and asked it to self-correct. It was still unable to write an effective patch and instead started brute-forcing its way past the tests.

At the end of the third attempt, I realized it was struggling, so I decided to help GPT out a bit.

Using the available hint, which reveals information about the vulnerability at the cost of 5 points, I got additional information and context, including specific CWE codes that might help GPT.

I went one step further and also opened the article linked in the README. Since Snowman.py is based on a real-world incident in which Spotify was hacked, I shared a choice selection from that article, notably the part where the engineering team described the vulnerability and how it was being exploited.

With all this, I finally crafted the following prompt:

The following is vulnerable python code:
{Vulnerable Python code goes here}

It suffers from a security vulnerability:
{Content of the Hint goes here}

Notes from the engineering team note that:
{Copy pasted part of the article that discussed the vulnerability and how the Engineering team fixed it/ addressed the vulnerability}

Write a fix for this vulnerability, keeping code changes to a minimum to avoid breaking any usability tests. Also avoid importing or using any libraries outside the Django library or whats already given. There are 2 usability tests:
{Pasted the Usability Tests over here}

I went ahead and entered it and got a code output!
I chucked it in and tested it: Passed!

While GPT managed to pass the security tests of the challenge, I would later receive feedback from a security expert that the given patch had problems.

ChatGPT did not address the root of the issue; instead, its patch looks like a “brute-force” approach to the security vulnerability. As noted earlier, the root cause was a bad choice of normaliser, that is, a compatibility-mode normaliser (NFKC or NFKD) that converts characters into similar-looking ones, which an adversary can exploit to bypass security validation.

Per the W3C recommendation, the safest normaliser is NFC, which is a canonical mode. This normaliser does not convert characters into look-alikes and is thus preferable for security.

Furthermore, beyond not addressing the root cause, ChatGPT only cherry-picked a few characters to reject when it should have restricted the character set to a safe ASCII range. Instead of enforcing a whitelist of allowed characters, ChatGPT used an insecure blacklisting approach.
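
For comparison, a whitelisting approach along the lines the expert described, using NFC and a restricted ASCII character set, might look something like this (the function name and the exact allowed set are illustrative, not the challenge’s reference solution):

import re
import unicodedata

# Illustrative allowlist: alphanumerics plus a few safe ASCII symbols.
ALLOWED_USERNAME = re.compile(r"[A-Za-z0-9@.+_-]+")

def is_safe_username(username):
    if not username:
        return False
    # NFC is a canonical normaliser: unlike NFKC/NFKD it does not map
    # look-alike characters (e.g. fullwidth forms) onto ASCII ones.
    username = unicodedata.normalize("NFC", username)
    # Enforce the allowlist rather than blacklisting individual characters.
    return ALLOWED_USERNAME.fullmatch(username) is not None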

For more information see our article on Unicode Vulnerabilities.

For purposes of passing the tests in the challenge, ChatGPT’s solution definitely ‘worked’. However, when considered as a potential solution to the vulnerability, it is an insecure patch that doesn’t address the root cause of the vulnerability.

This highlights that the security tests for these challenges are relaxed and do not enforce strict conditions on how the security vulnerability is addressed. While that isn’t the case for the higher-difficulty challenges, here ChatGPT was able to manage by simply brute-forcing its way through the security test requirements.

What was learnt from this experiment is:

  • ChatGPT seems particularly good at handling Python code. This is somewhat understandable: Python is one of the most popular languages, so ChatGPT would have had a massive corpus to learn from compared to other languages.
  • ChatGPT constantly struggled with usability requirements and, unless strictly guided, would end up making a lot of breaking changes to existing code.
  • With enough context and information, ChatGPT was able to create a ‘good enough’ patch to pass the challenge, but it still was not able to address the root cause.
  • ChatGPT has a propensity to fall back on brute-force strategies instead of addressing the root cause of a vulnerability, even when given the relevant context.
  • ChatGPT struggled immensely when the same vulnerability was presented in another language (in this case, JavaScript).

I’ll leave it up to you to decide how much of an impact this has on handling security vulnerabilities and whether or not ChatGPT will end up competing in secure programming anytime soon. However, I’d point out two key lessons I took away from this experiment that I believe you should consider as well:

  1. While the experiment only showcased this in Python, given enough context and information along with specific instructions, ChatGPT is able to generate code that minimizes collateral damage to your existing code base while possibly providing the intended solution. This means phrases like “minimal changes to code”, “being mindful of the following tests”, and “using only the following libraries” are helpful in a prompt.

  2. While ChatGPT may break the application when patching it, and may prefer brute force over addressing the root cause, this doesn’t doom its prospects in secure programming. Asking specific, atomic questions such as “How do you escape HTML in JavaScript?” can produce practical, self-contained solutions that can be transplanted into your code while giving you a good general direction to work towards. However, effective use of the solution depends on you, the programmer! You can use ChatGPT to assist in developing a solution, but it is then up to you to fit it into your codebase yourself while also ensuring it is an effective solution that addresses the problem correctly.

Essentially, ChatGPT doesn’t do all the work for you. But if you understand what the problem is and know the general direction for the implementation, ChatGPT can generate specific code that you can incorporate into your code base while also pointing the way forward.

With that said, perhaps it would have been more applicable for this article to be titled
“Evaluating ChatGPT’s Effectiveness in Passing a Secure Programming Challenge”,
given its propensity to game the tests instead of addressing the root cause.
