Onyx AI integration

Onyx is an open-source platform for AI search across your own data. 17,000+ GitHub stars. Netflix, Thales Group, Ramp and UC San Diego run it in production. We deploy and customise Onyx for governments, regulators and enterprises in Australia and the UK.

Why open source

The mainstream products in this space are Glean, Microsoft Copilot and Coveo. They're closed SaaS. Your data leaves your boundary, your roadmap depends on theirs, and your pricing is whatever they decide it is next year. For most companies that's fine. For regulators, healthcare providers and public-sector organisations with data-sovereignty obligations, it isn't.

Onyx is MIT-licensed. You run it where you want, on whichever LLM you choose (OpenAI, Anthropic, or self-hosted via Ollama or vLLM). You can audit which document chunks went into which answer. It ships as a complete product: 40+ connectors, role-based access control, SSO via OIDC, SAML and OAuth2, document-level permissions, an admin UI, and hybrid search with reranking. Initial deployment usually takes a day.

What we add on top

PretaGov has been running Onyx in production for AU and UK clients. The extensions below are things we needed and built. Most are upstream-mergeable.

Tables and structured data

Standard RAG breaks tables. Rows and columns get split into fragments, so questions like "what was total revenue last year" return partial answers or no answer at all. We added a parallel SQL pathway. Onyx still does the semantic search; in parallel, the question is translated into a SQL query that computes the answer across the full table. End users get a real number instead of a list of links.
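
A minimal sketch of the idea, not PretaGov's implementation: it assumes the table has already been extracted into rows, and it hard-codes the SQL that an LLM would generate from the question. The table name, column names and figures are made up for illustration.

```python
import sqlite3

# Hypothetical extracted table: (year, quarter, amount). Figures are invented.
ROWS = [
    ("2023", "Q1", 120.0),
    ("2023", "Q2", 135.5),
    ("2023", "Q3", 142.0),
    ("2023", "Q4", 150.5),
    ("2024", "Q1", 160.0),
]


def load_table(rows):
    """Load the full table into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revenue (year TEXT, quarter TEXT, amount REAL)")
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    return conn


def answer_with_sql(conn, sql, params=()):
    """Run a generated query and return its single scalar result."""
    row = conn.execute(sql, params).fetchone()
    return row[0] if row else None


# Assumption: in a real pipeline this string comes from an LLM given the
# question and the table schema. Here it is fixed.
GENERATED_SQL = "SELECT SUM(amount) FROM revenue WHERE year = ?"

conn = load_table(ROWS)
total_2023 = answer_with_sql(conn, GENERATED_SQL, ("2023",))

# What chunk-based retrieval might see: only the first two rows of the table.
chunk = ROWS[:2]
partial_2023 = sum(amount for year, _, amount in chunk if year == "2023")

print(total_2023)    # sum over the whole table
print(partial_2023)  # sum over one retrieved fragment
```

The SQL result covers all four 2023 rows, while the fragment covers two, which is the kind of partial answer described above.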

PDFs that aren't print-ready

Annual reports, investigation summaries and other regulatory documents have multi-page tables, footnotes and inconsistent layouts. Standard PDF-to-text tools lose that structure. We swapped Onyx's PDF extractor for one based on PyMuPDF layout analysis, with cross-page table merging. For Excel inputs we added named-table parsing and column unpivoting (turning one-column-per-period sheets into one row per value).
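
To make the two table operations concrete, here is a small sketch in plain Python. It assumes each page's table arrives as a list of rows of cell strings, and it uses a simple heuristic of our own choosing for illustration: a table whose header row repeats the previous table's header is treated as a continuation. The function names and sample data are hypothetical, and a production extractor would need to handle continuation pages without repeated headers.

```python
def merge_cross_page_tables(page_tables):
    """Merge tables that continue across pages.

    page_tables: tables in page order, each a list of rows (lists of strings)
    with the header as the first row. A table whose header equals the previous
    merged table's header is appended to it; otherwise a new table starts.
    """
    merged = []
    for table in page_tables:
        if not table:
            continue  # skip pages where no rows were extracted
        header, body = table[0], table[1:]
        if merged and merged[-1][0] == header:
            merged[-1].extend(body)
        else:
            merged.append([header] + body)
    return merged


def unpivot(header, rows, id_cols=1):
    """Turn wide rows into long (id..., column_name, value) records."""
    records = []
    for row in rows:
        ids = row[:id_cols]
        for name, value in zip(header[id_cols:], row[id_cols:]):
            records.append((*ids, name, value))
    return records


# Invented sample: one table split over two pages, then an unrelated table.
page1 = [["Category", "2023", "2024"], ["Upheld", "10", "12"]]
page2 = [["Category", "2023", "2024"], ["Dismissed", "5", "7"]]
page3 = [["Office", "Staff"], ["Sydney", "4"]]

tables = merge_cross_page_tables([page1, page2, page3])
long_rows = unpivot(tables[0][0], tables[0][1:])
print(len(tables), long_rows)
```

The long form gives one record per category and year, which is easier to load into a SQL table than a sheet with one column per year.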

Data sources that aren't documents

Beyond Onyx's 40+ shipped connectors, we've added PowerBI dashboard scraping, table extraction from csv / xlsx / pptx files embedded in web pages, and sitemap improvements that discover linked files (not just linked pages). Our DiscoveryConnector interface makes crawl pruning correct rather than approximate.
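
Discovering linked files rather than only linked pages can be sketched with the Python standard library. This is an illustration under our own assumptions, not the DiscoveryConnector interface itself: the function name, the extension list and the example URLs are hypothetical.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: the file types of interest, taken from the formats named above.
FILE_EXTENSIONS = {".csv", ".xlsx", ".pptx"}


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_linked_files(page_url, html):
    """Return absolute URLs of linked data files, in page order, deduplicated."""
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)
        # Check the extension on the URL path only, ignoring any query string.
        ext = posixpath.splitext(urlparse(absolute).path)[1].lower()
        if ext in FILE_EXTENSIONS and absolute not in found:
            found.append(absolute)
    return found


html = (
    '<a href="/data/complaints.csv">CSV</a> '
    '<a href="about.html">About</a> '
    '<a href="reports/2024.XLSX?download=1">Report</a>'
)
files = discover_linked_files("https://example.org/stats/index.html", html)
print(files)
```

A crawler could feed each URL returned here to a table extractor, while ordinary page links continue through the normal crawl.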

An embeddable chatbot widget

For organisations that want AI search on a public site, we ship a pop-out chat widget that drops into any frontend. Streaming responses, markdown rendering, persistent sessions. Talks to Onyx through the existing token API.

Scale-to-zero hosting

For workloads that aren't 24/7, we run Onyx in a scale-to-zero configuration on Fly.io. The single Onyx worker is split into eight specialised workers, each suspended when idle. Cost between requests is zero.
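
How much a worker actually runs under this model comes down to simple arithmetic. The sketch below assumes a worker wakes when a request arrives and is suspended after a fixed idle timeout; the timeout value and request pattern are hypothetical, and resume latency and any storage charges are ignored.

```python
def running_seconds(requests, idle_timeout):
    """Total seconds a worker is up.

    requests: list of (start, duration) in seconds, sorted by start time.
    The worker wakes at a request's start and is suspended idle_timeout
    seconds after the last activity. Requests arriving before suspension
    extend the same running window.
    """
    total = 0.0
    window_start = window_end = None
    for start, duration in requests:
        end = start + duration
        if window_end is not None and start <= window_end + idle_timeout:
            window_end = max(window_end, end)  # still awake: extend the window
        else:
            if window_end is not None:
                total += (window_end + idle_timeout) - window_start
            window_start, window_end = start, end  # woke from suspension
    if window_end is not None:
        total += (window_end + idle_timeout) - window_start
    return total


# Invented traffic: two requests close together, then one an hour later.
requests = [(0, 10), (30, 10), (3600, 20)]
up = running_seconds(requests, idle_timeout=60)
print(up, up / 86400)  # seconds up, and fraction of a 24-hour day
```

With this pattern the worker is up for a few minutes a day; an always-on deployment would be up for all 86,400 seconds.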

Two examples in production

Regulator complaints and oversight. PowerBI dashboards, annual reports, media releases, investigation summaries. A user asks how many complaints were upheld in 2024 and gets a number with citations, not a list of links.

Multilingual health information. Translated medical resources in 60+ languages. A user asks for COVID vaccine information in Arabic and gets the right document in the right language.

Related work

Multilingual health AI search in 60+ languages

Customised Onyx deployment for a national health body — AI search across translated medical resources, returning the right document in the language the user reads.

1 June 2025

AI search across regulator complaints and oversight data

Customised Onyx deployment for a government oversight agency — natural-language search across investigation summaries, annual reports, media releases, and PowerBI complaints data.