Humanising the machine – testing GenAI with a people-centred approach

By Tessa Moxey on 16 October 2025

In 2025, we piloted a GenAI search assistant with the New Zealand public for the first time. We pushed ourselves to find innovative ways to test this new technology — with exciting results.

Key insights

The pilot taught us a lot. It showed us what worked, what didn’t, and where we can do better. The following are key insights we learned from testing generative artificial intelligence (GenAI) in the public sector.

User research is essential for understanding the value of new technology

To understand the real value of new technology, it’s important to talk to the people who will use it.

Digital transformation is not just about using new technology. It’s about understanding how people think, how they behave, and how open they are to doing things differently. This ensures the services government invests in are useful for New Zealanders.

Flag with the words TALK TO PEOPLE written on it

Find the middle ground with GenAI

Be cautious with the technology but also look for ways to test it safely with the public. Roll it out too quickly and you’ll run into problems. Keep it internal for too long and you’ll fall behind how people are already using technology in their daily lives.

Bias, discrimination, fairness, equity and GenAI

The public sector needs GenAI to behave differently, and we can lead in this space

People expect government information to be accurate and trustworthy, but GenAI does not always meet that standard. In public services, AI must handle complex situations with care, be clear about where its information comes from, and protect people’s privacy. The goal is to help people get the support they need, quickly and clearly.

Responsible AI Guidance for the Public Service: GenAI

Testing large language models (LLMs) with people is different from testing software

We had to make our own methodology to do this. When we started working on our AI assistant, we quickly realised there was no precedent — no clear guide or standard way to test our bespoke solution. Unlike regular software, AI assistants respond to people in lots of different ways. We couldn’t rely on traditional testing methods to evaluate how we wanted the assistant to behave.

Digging deeper

The analogy that framed the pilot

I worked as a secondary teacher for 10 years before I became a user researcher.

During an earlier proof of concept, I found myself using the analogy that the AI assistant was like a 12-year-old student. If you’ve ever spent time around this age group, you’d know they:

like to tell you everything they know about anything
may make things up when they don’t know the answer
have 7 years of education behind them — they can learn new stuff, but they can’t completely forget what they already know.

This analogy helped shaped my thinking for this pilot.

Student at desk with hand raised

What we wanted to understand about the AI assistant

We piloted an AI search assistant across 21 government websites to understand how GenAI could support people to navigate government services.

We wanted to know if the tool would:

be easier for people compared with our other channels (for example, phone, email or in-person services)
be able to provide next steps to people (for example, a link, signup form or contact details)
understand where people were on their journey with government and respond appropriately — rather than relying on generic, templated answers
be trusted by the public for everyday use — even though it’s powered by GenAI.

We needed to make sure the tool:

did not learn from interactions with users — we made this design decision because we wanted to protect people’s private information
could be trained to say, ‘I don’t know’ and quickly guide users to the right service, rather than trying to solve everything itself
could be tested by a wide range of people across New Zealand — both face-to-face and remotely.

We also needed to make sure we understood how people were using GenAI, and how much they trusted this new technology.

The method for testing our AI assistant

We created our own testing process based on Human Centred Design (HCD) principles. Instead of focusing on what the technology could or could not do, we looked at how it would work for people in their everyday lives.

Introduction to human-centred design — State Government of Victoria

It was important to us that we showed all our participants manaakitanga. We were the face of the New Zealand government. Taking care of the people who fuel the development of this technology was a key part of how we worked.

Manaakitanga definition — Kōwhiti Whakapae

Three streams of testing, using qualitative and quantitative methods

Automated testing (4 rounds)
Moderated user testing — interviews online with 26 people across the motu (island)
Unmoderated user testing — remotely within Aotearoa New Zealand (95 people)

Mixing methods: a recipe for research success — UK Government

Testing based on real-life conversations

To understand if the tool would be useful for New Zealanders, we made sure our testing was based on real-life conversations people have with the government every day.

One of the key lessons we took from a proof of concept, completed in 2024, was the need for LLMs to be tested with:

real user queries from websites, emails and call centres for automated testing
support from wider government Subject Matter Experts (SMEs) across agencies to understand what the ideal response to these questions would be
user interviews where the public could ask their own questions about government — unique to their situation
guided remote testing where users would be able to interact with the search assistant in a natural way.

Use case: automated testing

LLMs can sound human, but they still think like machines. We created a framework that brought together both educational methodologies and automated testing.

We took a broad approach to testing. Part of our framework was based on how people understand complex ideas across different areas of government. This helped us assess whether the AI assistant was capable of giving a user a sufficient answer — one that felt relevant and useful — rather than simply aiming for surface-level accuracy.

Automated testing involved 4 rounds of testing with 104 questions from 15 different government agencies.

The 4 key methodologies in our automated testing

Completeness: how accurately the LLM could answer questions by comparing them to an ideal answer created by content designers and SMEs.
Voice and tone: how closely the LLM matched our ideal voice and tone created through our content design guidance on digital.govt.nz.
Plain language: how well the LLM could use plain language to simplify complex government processes for the public.
Understanding user intent: how easily the LLM could understand what the user was asking.

Writing style

Two speech bubbles one with garbled text another with readable text with the text PLAIN LANGUAGE below them

LLM-as-a-judge to test voice and tone and plain language

To help us work a bit faster we used an LLM-as-a-judge — a technique where one LLM is used to evaluate the output of another.

We used an Open AI model (GPT4o-mini) to help score voice and tone and plain language.
The LLM-as-a-judge used 8 true and false statements to mark each answer the LLMs gave.
This process involved designing an evaluation criteria and extensive prompt engineering. We benchmarked human scoring against the LLM-as-a-judge to help it improve its marking.

LLM-as-a-Judge: A Practical Guide — towards data science

For our use case, it was faster and more efficient for real people to mark completeness and user intent.

In each automated test, we checked for completeness and also looked for ways to improve the assistant’s responses overall.

The AI assistant could not learn directly from user interactions, so we had to guide the assistant’s behaviour through clear, targeted instructions.

Protection and privacy

One of our big concerns was how an AI search assistant would handle aggressive, violent or ‘at risk’ input from users. We also knew from previous research, that users would most likely enter personal data into the search assistant.

To test the AI assistant’s capabilities, we developed testing questions from feedback from govt.nz. We saw distinct patterns in user behaviour including people:

entering large amounts of personal data including case numbers, passport numbers and other identifying information
expressing anger or violence towards the New Zealand government
being in a vulnerable state — for example, needing mental health support.

We used automated testing to help tune the LLM’s responses by gathering user feedback from govt.nz that fit into these 3 categories.

We also got mental health professionals to give us guidance on how the AI search assistant could best respond to ‘at risk’ users where they were vulnerable and in need of wider support.

Use-case: user testing

Our user interviews needed to cover a wide range of topics and tasks — from exploring people’s trust in AI to observing how they interacted with the tool in real time. We needed a broad range of people across the motu.

This included a wide range of:

ages
ethnicities
locations (rural and urban)
different levels of education.

A New Zealand-based vendor was able to support us to reach these demographic groups.

It was key for us to know if this tool was going to work for everyone.

It was important to us that:

users were supported from the beginning to end of the research cycle
users were made to feel as comfortable as possible during our testing sessions
our vendor was on-hand to contact participants if they’d not responded or needed help with the user testing
users were promptly given koha (contribution) for their time in a form that they could use easily.

Voice of our participants

The findings of this research have been talked about in a blog post. Here, I want to highlight the voices of our users and their feelings towards the AI assistant and GenAI.

General feedback about the AI assistant

It’s a brilliant idea to make information more accessible and reduce inequalities in information gathering but it’s important to get it right.

Feels a bit wrong – I don’t know why maybe it’s just new, risk of misinformation

Feedback about finding government information online

I have found it hard to find government information in the past. Sometimes it’s hard to know which service I need, which website I need to look at, which organisation is responsible for certain things like transport agency versus AA.

Firstly I always go online 1st to see if I can find the information for myself. If not, then I need to send an email. I feel it’s really hard to find information for IRD or you know or find the contact person. If I want to find some information, it’s really hard to navigate the website.

Conclusion

The statement ‘Will it work for people?’ was our guiding principle throughout this pilot.

This is not a tried-and-true formula to test an LLM, but it gave us a great reference to guide our exploration. We created this methodology in response to the needs of our users and our understanding of the technology at the time.

GenAI is constantly changing, but placing the needs of your users first should be non-negotiable.

If you’re interested in learning more about the methodology behind this work, email diaresearch@dia.govt.nz.

All illustrations in this blog are copyrighted by Tessa Moxey, 2025.

JavaScript is currently turned off in your browser — this means you cannot submit the feedback form. It’s easy to turn on JavaScript — Learn how to turn on JavaScript in your web browser.

If you’re unable to turn on JavaScript — email your feedback to info@digital.govt.nz.

Was this page helpful?

Yes No Partly

Thanks, do you want to tell us more?

Do not enter personal information. All fields are optional.

What did you come here to do?

How can we improve this information?