

CwX 2026: How Caseware Verity closes the gap between AI capability and audit-grade reliability
AI models can now outperform domain experts on graduate-level questions. Benchmarks that would have stumped a PhD a year ago are being cleared with room to spare. By almost any technical measure, the capability of these models has crossed a threshold that few people predicted this quickly.
And yet, for most accounting firms, that capability sits at arm's length, because passing a benchmark and being trustworthy enough to inform a professional conclusion are two entirely different things. The distance between those two points is where most AI in audit and assurance either succeeds or fails, and it was the subject of a candid session at CwX 2026 in Fort Lauderdale, led by the team behind Caseware Verity. Caseware Verity is engagement-aware AI embedded in Caseware Cloud that works inside engagements, drawing on permitted engagement data, firm methodology and audit context.
Reflecting on where AI models stood when Caseware built its AI-powered digital assistant, AiDA, in 2024, Quinn Daneyko, Senior Product Manager at Caseware, said, "The AI models evolved dramatically. But they weren't transformative. They couldn't reason deeply. They didn't have deep technical knowledge. They weren't at the level of a strong domain expert across different verticals."
A powerful AI model is just the starting point
AiDA worked the way most generative AI tools did: a user sends a message, context gets attached, the model responds. Verity is structurally different, Daneyko noted. "You send a message to a model and it doesn't just respond. It reasons over that question. It calls different resources, gets what it needs from a variety of different sources, and continues to work through that information until it can achieve a certain outcome."
In practice, that means Verity can use parallel sub-agents to draw on engagement context such as trial balance data, financial statements, risks, controls, materiality, engagement documents, checklists and firm knowledge bases, as well as authoritative standards where configured and available, before producing a response grounded in the full context of the engagement.
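As a rough illustration of the pattern described above, and not Caseware's implementation, the sketch below fans a question out to several context lookups in parallel and then assembles a prompt that cites each source. Every name here (`fetch_trial_balance`, `fetch_risks`, `fetch_materiality`, `gather_context`, `build_grounded_prompt`) is hypothetical, and the returned facts are placeholders rather than real engagement data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sub-agents: each stands in for one context lookup.
# The facts they return are illustrative placeholders only.
def fetch_trial_balance(engagement_id):
    return {"source": "trial_balance", "facts": ["Inventory is up 11%"]}

def fetch_risks(engagement_id):
    return {"source": "risks", "facts": ["Prior year control environment flagged for review"]}

def fetch_materiality(engagement_id):
    return {"source": "materiality", "facts": ["Materiality recorded for this engagement"]}

SUB_AGENTS = [fetch_trial_balance, fetch_risks, fetch_materiality]

def gather_context(engagement_id):
    """Run every sub-agent in parallel; results keep the order of SUB_AGENTS."""
    with ThreadPoolExecutor(max_workers=len(SUB_AGENTS)) as pool:
        return list(pool.map(lambda agent: agent(engagement_id), SUB_AGENTS))

def build_grounded_prompt(question, context):
    """Attach each fact to the question, tagged with the source it came from."""
    lines = [f"Question: {question}", "Engagement context:"]
    for item in context:
        for fact in item["facts"]:
            lines.append(f"- [{item['source']}] {fact}")
    return "\n".join(lines)

if __name__ == "__main__":
    ctx = gather_context("demo-engagement")
    print(build_grounded_prompt("What risks should we consider?", ctx))
```

Tagging each fact with its source is one simple way to keep references available for a reviewer; a production system would presumably loop, deciding after each round whether more context is needed.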
But even that architecture, Daneyko said, is only the foundation. "With that core platform capability, we can now start adding more capabilities. We can start giving it access to more context, giving it the ability to start driving workflows within the engagement." The platform is the enabler. What gets built on top of it still requires something the platform alone can't provide.
The knowledge a model can’t train on
Jason Bradley, VP of AI and Methodology at Caseware, and a former standard-setter, regulator, and inspector, discussed the gap between what a general AI model knows and what a domain expert knows.
General models trained on publicly available data will contain some knowledge of audit and financial reporting. But what they contain reflects what was written down and accessible, not the interconnected professional judgment that makes standards actually function in practice.
"If you look at things like SAS 145," Bradley explained, "there’s a lot of requirements, but they’re all connected to each other. Very few standards are isolated. They’re hugely interconnected, and that interconnectivity may not be clear in a base model’s training."
His example of what that looks like in practice: give a general model a trial balance and ask for risks. You'll get something. Inventory is up 11%. Margins have compressed. The output is plausible. But it's missing the synthesis that a competent auditor performs almost automatically — connecting that movement to the board minutes, the prior year control environment, the specific risk context of this client, at this point in time. "Without that context," Bradley said, "you're missing the sophisticated part that a human would do, which is to connect the whole holistically."
Verity’s domain intelligence is delivered through structured methodology and instruction sets that encode professional judgment, standards interconnectivity and firm-specific methodology directly into the Caseware platform.
Consistency over brilliance
There's a dimension to trustworthiness that the AI capability debate usually misses.
"A baseline model without any tuning will sometimes do an OK job, sometimes do a great job, sometimes do a terrible job," Bradley explained. "The problem is almost more the inconsistency than the quality of the output."
A profession trained from day one to be sceptical needs something it can calibrate for. Occasional brilliance isn't useful if it comes bundled with unpredictable failure. What firms need are outputs reliable enough to build a review process around. That requires not just good skills files, but a continuous evaluation framework that reruns assessments every time a new model is released, measures what changed, and adjusts accordingly. "In three months there'll be some new model," Bradley said. "We're setting ourselves up so that we can rerun this as a regression test, test it against whatever's emerged, and see how it changes."
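The idea of rerunning assessments as a regression test can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluation framework Bradley described: `EvalCase`, `score` and `compare` are hypothetical names, the "models" are plain functions returning canned text, and the pass criterion (required terms appearing in the answer) is deliberately crude.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    required_terms: tuple  # terms a passing answer must mention

def score(model, cases):
    """Return {case name: bool}; a case passes if every required term appears."""
    results = {}
    for case in cases:
        answer = model(case.prompt).lower()
        results[case.name] = all(t.lower() in answer for t in case.required_terms)
    return results

def compare(baseline, candidate):
    """List cases that flipped between the baseline run and the candidate run."""
    regressions = sorted(n for n in baseline if baseline[n] and not candidate.get(n, False))
    improvements = sorted(n for n in baseline if not baseline[n] and candidate.get(n, False))
    return {"regressions": regressions, "improvements": improvements}

# Hypothetical evaluation set and stand-in models (canned answers, not real output).
CASES = [
    EvalCase("inventory_risk", "Inventory is up 11% and margins have compressed. Risks?",
             ("inventory", "margin")),
    EvalCase("links_controls", "Connect the movement to the prior year control environment.",
             ("control",)),
]

def old_model(prompt):
    return "Inventory valuation risk; margin pressure noted."

def new_model(prompt):
    return "Inventory valuation risk linked to margin pressure and the control environment."

if __name__ == "__main__":
    print(compare(score(old_model, CASES), score(new_model, CASES)))
```

Keeping the case set fixed while swapping the model is what makes the comparison a regression test: any case that passed before and fails now is surfaced by name rather than hidden inside an aggregate score.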
Stacie Simmons, VP of AI at Caseware, connected this to governance. Transparency into the sources and context behind outputs, reviewable suggestions, and firm-level controls over how agents behave are what make consistent quality operationally possible, and what allow professional accountability to stay where it belongs.
The human judgment line
Throughout the session, one principle came up repeatedly: AI-assisted outputs must be reviewed and accepted by a human before they are incorporated into an engagement.
"Human judgment is sacrosanct," Bradley said. "Before anything is written into the engagement, a human will always have to be involved in this process." The agent can surface supporting context, propose risks or next steps, and provide source references where applicable. The reviewer decides.
The goal isn't to replace the judgment call. It's to give the person making it better material to work with: more context synthesised more thoroughly, with supporting context and source references made available where applicable. The difference between a reviewer spending their time on genuine judgment and a reviewer spending their time on remedial correction is where the quality improvement lives.
The gap between an AI model that clears a benchmark and one that a senior auditor would rely on is real. Domain knowledge embedded at the platform level, continuous evaluation, human judgment kept genuinely in the loop, firm-specific context that makes outputs fit for purpose — none of that comes ready-made.
But the direction is clear enough. And for firms still waiting for AI that feels trustworthy rather than just impressive, the session offered an honest account of what closing that gap actually requires.
Learn more about Caseware Verity and Caseware Verity Agentic Suites.









