In , we piloted a GenAI search assistant with the New Zealand public for the first time. We pushed ourselves to find innovative ways to test this new technology — with exciting results.
Key insights
The pilot taught us a lot. It showed us what worked, what didn’t, and where we can do better. The following are key insights we learned from testing generative artificial intelligence (GenAI) in the public sector.
User research is essential for understanding the value of new technology
To understand the real value of new technology, it’s important to talk to the people who will use it.
Digital transformation is not just about using new technology. It’s about understanding how people think, how they behave, and how open they are to doing things differently. This ensures the services government invests in are useful for New Zealanders.
Find the middle ground with GenAI
Be cautious with the technology but also look for ways to test it safely with the public. Roll it out too quickly and you’ll run into problems. Keep it internal for too long and you’ll fall behind how people are already using technology in their daily lives.
Bias, discrimination, fairness, equity and GenAI
The public sector needs GenAI to behave differently, and we can lead in this space
People expect government information to be accurate and trustworthy, but GenAI does not always meet that standard. In public services, AI must handle complex situations with care, be clear about where its information comes from, and protect people’s privacy. The goal is to help people get the support they need, quickly and clearly.
Responsible AI Guidance for the Public Service: GenAI
Testing large language models (LLMs) with people is different from testing software
We had to make our own methodology to do this. When we started working on our AI assistant, we quickly realised there was no precedent — no clear guide or standard way to test our bespoke solution. Unlike regular software, AI assistants respond to people in lots of different ways. We couldn’t rely on traditional testing methods to evaluate how we wanted the assistant to behave.
Digging deeper
The analogy that framed the pilot
I worked as a secondary teacher for 10 years before I became a user researcher.
During an earlier proof of concept, I found myself using the analogy that the AI assistant was like a 12-year-old student. If you’ve ever spent time around this age group, you’d know they:
- like to tell you everything they know about anything
- may make things up when they don’t know the answer
- have 7 years of education behind them — they can learn new stuff, but they can’t completely forget what they already know.
This analogy helped shaped my thinking for this pilot.
What we wanted to understand about the AI assistant
We piloted an AI search assistant across 21 government websites to understand how GenAI could support people to navigate government services.
We wanted to know if the tool would:
- be easier for people compared with our other channels (for example, phone, email or in-person services)
- be able to provide next steps to people (for example, a link, signup form or contact details)
- understand where people were on their journey with government and respond appropriately — rather than relying on generic, templated answers
- be trusted by the public for everyday use — even though it’s powered by GenAI.
We needed to make sure the tool:
- did not learn from interactions with users — we made this design decision because we wanted to protect people’s private information
- could be trained to say, ‘I don’t know’ and quickly guide users to the right service, rather than trying to solve everything itself
- could be tested by a wide range of people across New Zealand — both face-to-face and remotely.
We also needed to make sure we understood how people were using GenAI, and how much they trusted this new technology.
The method for testing our AI assistant
We created our own testing process based on Human Centred Design (HCD) principles. Instead of focusing on what the technology could or could not do, we looked at how it would work for people in their everyday lives.
Introduction to human-centred design — State Government of Victoria
It was important to us that we showed all our participants manaakitanga. We were the face of the New Zealand government. Taking care of the people who fuel the development of this technology was a key part of how we worked.
Manaakitanga definition — Kōwhiti Whakapae
Three streams of testing, using qualitative and quantitative methods
- Automated testing (4 rounds)
- Moderated user testing — interviews online with 26 people across the motu (island)
- Unmoderated user testing — remotely within Aotearoa New Zealand (95 people)
Mixing methods: a recipe for research success — UK Government
Testing based on real-life conversations
To understand if the tool would be useful for New Zealanders, we made sure our testing was based on real-life conversations people have with the government every day.
One of the key lessons we took from a proof of concept, completed in , was the need for LLMs to be tested with:
- real user queries from websites, emails and call centres for automated testing
- support from wider government Subject Matter Experts (SMEs) across agencies to understand what the ideal response to these questions would be
- user interviews where the public could ask their own questions about government — unique to their situation
- guided remote testing where users would be able to interact with the search assistant in a natural way.
Use case: automated testing
LLMs can sound human, but they still think like machines. We created a framework that brought together both educational methodologies and automated testing.
We took a broad approach to testing. Part of our framework was based on how people understand complex ideas across different areas of government. This helped us assess whether the AI assistant was capable of giving a user a sufficient answer — one that felt relevant and useful — rather than simply aiming for surface-level accuracy.
Automated testing involved 4 rounds of testing with 104 questions from 15 different government agencies.
The 4 key methodologies in our automated testing
- Completeness: how accurately the LLM could answer questions by comparing them to an ideal answer created by content designers and SMEs.
- Voice and tone: how closely the LLM matched our ideal voice and tone created through our content design guidance on digital.govt.nz.
- Plain language: how well the LLM could use plain language to simplify complex government processes for the public.
- Understanding user intent: how easily the LLM could understand what the user was asking.
LLM-as-a-judge to test voice and tone and plain language
To help us work a bit faster we used an LLM-as-a-judge — a technique where one LLM is used to evaluate the output of another.
- We used an Open AI model (GPT4o-mini) to help score voice and tone and plain language.
- The LLM-as-a-judge used 8 true and false statements to mark each answer the LLMs gave.
- This process involved designing an evaluation criteria and extensive prompt engineering. We benchmarked human scoring against the LLM-as-a-judge to help it improve its marking.
LLM-as-a-Judge: A Practical Guide — towards data science
For our use case, it was faster and more efficient for real people to mark completeness and user intent.
In each automated test, we checked for completeness and also looked for ways to improve the assistant’s responses overall.
The AI assistant could not learn directly from user interactions, so we had to guide the assistant’s behaviour through clear, targeted instructions.
Protection and privacy
One of our big concerns was how an AI search assistant would handle aggressive, violent or ‘at risk’ input from users. We also knew from previous research, that users would most likely enter personal data into the search assistant.
To test the AI assistant’s capabilities, we developed testing questions from feedback from govt.nz. We saw distinct patterns in user behaviour including people:
- entering large amounts of personal data including case numbers, passport numbers and other identifying information
- expressing anger or violence towards the New Zealand government
- being in a vulnerable state — for example, needing mental health support.
We used automated testing to help tune the LLM’s responses by gathering user feedback from govt.nz that fit into these 3 categories.
We also got mental health professionals to give us guidance on how the AI search assistant could best respond to ‘at risk’ users where they were vulnerable and in need of wider support.
Use-case: user testing
Our user interviews needed to cover a wide range of topics and tasks — from exploring people’s trust in AI to observing how they interacted with the tool in real time. We needed a broad range of people across the motu.
This included a wide range of:
- ages
- ethnicities
- locations (rural and urban)
- different levels of education.
A New Zealand-based vendor was able to support us to reach these demographic groups.
It was key for us to know if this tool was going to work for everyone.
It was important to us that:
- users were supported from the beginning to end of the research cycle
- users were made to feel as comfortable as possible during our testing sessions
- our vendor was on-hand to contact participants if they’d not responded or needed help with the user testing
- users were promptly given koha (contribution) for their time in a form that they could use easily.
Voice of our participants
The findings of this research have been talked about in a blog post. Here, I want to highlight the voices of our users and their feelings towards the AI assistant and GenAI.
General feedback about the AI assistant
It’s a brilliant idea to make information more accessible and reduce inequalities in information gathering but it’s important to get it right.
Feels a bit wrong – I don’t know why maybe it’s just new, risk of misinformation
Feedback about finding government information online
I have found it hard to find government information in the past. Sometimes it’s hard to know which service I need, which website I need to look at, which organisation is responsible for certain things like transport agency versus AA.
Firstly I always go online 1st to see if I can find the information for myself. If not, then I need to send an email. I feel it’s really hard to find information for IRD or you know or find the contact person. If I want to find some information, it’s really hard to navigate the website.
Conclusion
The statement ‘Will it work for people?’ was our guiding principle throughout this pilot.
This is not a tried-and-true formula to test an LLM, but it gave us a great reference to guide our exploration. We created this methodology in response to the needs of our users and our understanding of the technology at the time.
GenAI is constantly changing, but placing the needs of your users first should be non-negotiable.
If you’re interested in learning more about the methodology behind this work, email diaresearch@dia.govt.nz.
All illustrations in this blog are copyrighted by Tessa Moxey, .
Published