Benchmate: Data Sources

At its core, Benchmate is a retrieval-augmented generation (RAG) application. With RAG, you “ground” a model in reality by providing it context in your prompt. Ideally, you have the answer to the user’s query in your database, and you provide the answer to the LLM along with the question. This is where data sources come in.
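To make that concrete, here is a minimal sketch of what grounding a prompt looks like. The function name and prompt layout are illustrative only, not Benchmate's actual code; the layout simply mirrors the example later in this post.

```python
def build_grounded_prompt(question: str, context: str) -> str:
    """Combine the user's question with retrieved context so the model
    answers from the supplied text rather than from its training data."""
    return f"{question}\n\nContext: {context}"

# Hypothetical usage: the context would come from one of the data sources below.
prompt = build_grounded_prompt(
    "What chunking method do you support?",
    "We support rule-based, regex and machine-learning chunkers.",
)
```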

Fundamentally, a data source, often referred to as a retrieval source, is a collection of documents that produce “chunks” of text. I’ll talk more about chunks in a later post, but these chunks are what we ultimately pass to the language model as context.
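As a rough sketch of what such an abstraction could look like in Python (the class and method names here are hypothetical, not Benchmate's actual interface):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Document:
    source_id: str  # where the text came from (URL, file path, email id, ...)
    text: str       # the raw extracted text

class DataSource(ABC):
    """Anything that can yield documents for ingestion."""

    @abstractmethod
    def documents(self) -> Iterable[Document]:
        """Yield every document this source currently knows about."""
        ...

def ingest(source: DataSource, chunker: Callable[[str], list[str]]) -> list[str]:
    # Turn each document into chunks; chunking itself is covered in a later post.
    return [chunk for doc in source.documents() for chunk in chunker(doc.text)]
```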

When it came to Benchmate, we knew we wanted to ground our models using a wide variety of sources, so our data source abstraction needed to be able to pull data from many different places. For instance, we have a Dropbox data source which, given an authenticated user and a directory, will pull every file from that directory and create documents, and eventually chunks, from those files. The format of the files does not matter: text files, Word docs, or PDFs of scanned paper documents can all be handled. We extract the text using Apache Tika and use it to support Benchmate's responses.
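For a file-based source like Dropbox, the extraction step could look roughly like this. The sketch uses the `tika` Python client for Apache Tika and walks a local directory that has already been synced down, rather than calling the Dropbox API directly; the paths and helper names are assumptions.

```python
from pathlib import Path
from tika import parser  # Python client for Apache Tika (pip install tika)

def extract_documents(directory: str) -> list[dict]:
    """Extract plain text from every file in a directory, whatever its format."""
    docs = []
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        parsed = parser.from_file(str(path))  # Tika detects the file format itself
        text = (parsed.get("content") or "").strip()
        if text:
            docs.append({"source_id": str(path), "text": text})
    return docs

# Hypothetical usage: files previously synced down from a Dropbox folder
documents = extract_documents("./dropbox_sync/lab-protocols")
```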

Currently, we support the following data sources:

  • Sitemap – useful for pulling data from an entire website (a minimal sketch follows this list).
  • Web scraping – useful for pulling data from websites that don’t provide sitemaps.
  • Single web page – useful for pulling in data from a single page, like a Wikipedia entry.
  • Markdown – allows the user to type or paste in raw text (markdown formatted) for ingestion.
  • RSS Feeds – useful for ingesting blogs.
  • Email – this source creates an email address and ingests any email sent to that address.
  • Dropbox – allows the user to ingest all the content from a Dropbox directory.
  • GitHub – pulls in all the issues and pull requests from a GitHub repository (code coming soon).
  • Clinical Trials – ingests every clinical trial registered at https://clinicaltrials.gov.
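To give a feel for how one of these works, here is a minimal sketch of a sitemap-style source: read sitemap.xml, then fetch every page it lists. It uses `requests` plus the standard library and skips the HTML-to-text cleanup; it is an illustration, not Benchmate's implementation.

```python
import xml.etree.ElementTree as ET
import requests

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url: str) -> list[str]:
    """Return every page URL listed in a sitemap.xml file."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).text)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

def fetch_pages(sitemap_url: str) -> list[dict]:
    """Download each page; HTML-to-text cleanup would happen downstream."""
    pages = []
    for url in sitemap_urls(sitemap_url):
        resp = requests.get(url, timeout=30)
        if resp.ok:
            pages.append({"source_id": url, "text": resp.text})
    return pages
```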

Benchmate also supports the concept of a remote data source. This is useful when a data source is so large, or changes so frequently, that keeping it up to date in Benchmate is not feasible. In these scenarios, we can pull chunks of text directly from the source without the ingestion step. A prime example of this is our very own Scientist.com data source, which dynamically ingests information from a customer’s specific marketplace and powers our Procurement CoPilot™ product.
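A remote source skips ingestion entirely and asks for chunks at query time. The shape might look something like this; the endpoint, parameters, and response format are invented purely for illustration.

```python
import requests

class RemoteDataSource:
    """Fetches chunks on demand instead of storing them locally."""

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.api_key = api_key

    def retrieve(self, query: str, limit: int = 5) -> list[str]:
        # The remote service does its own search and returns ready-to-use chunks.
        resp = requests.get(
            f"{self.base_url}/search",  # hypothetical endpoint
            params={"q": query, "limit": limit},
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        return [hit["text"] for hit in resp.json()["results"]]
```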

For our local data sources, we break the text up into chunks so that we can search for answers more efficiently. Once we have the relevant chunks, we add them to the prompt that we send to the model, and in this way we guide the LLM toward the answer we want it to provide.
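That search is typically an embedding similarity lookup over the chunks. Here is a rough sketch assuming an `embed` function that maps text to a vector (any embedding model would do); this is not necessarily how Benchmate ranks chunks.

```python
import numpy as np

def top_chunks(query: str, chunks: list[str], embed, k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity to the query and keep the best k.
    The selected chunks are then dropped into the prompt, as in the
    grounding sketch earlier in this post."""
    q = np.asarray(embed(query), dtype=float)
    q /= np.linalg.norm(q)
    scored = []
    for chunk in chunks:
        c = np.asarray(embed(chunk), dtype=float)
        scored.append((float(q @ c / np.linalg.norm(c)), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```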

For example, we’ve set up a bot to answer Benchmate questions. It uses our FAQ sheet as a data source (in the future it might use this post!). A user might ask it:

What chunking method do you support?

When Benchmate searches through chunks, it will find the answer from our FAQ page, and what it will actually send to the LLM is:

What chunking method do you support?

Context: I understand chunking text is important for the chatbot; what chunking options are supported?

We support a number of different chunking options. There are rule-based chunkers such as the HTML Chunker and Markdown Chunker, which use the structure of a document. There are also the Regex Chunker, which uses regular expressions, and the Recursive Character Text Chunker, which uses a recursive, character-based approach. We also support more advanced chunkers like the Change Point Detection Chunker, the Cosine Similarity Chunker and the Topic Chunker, which use machine learning algorithms to predict where topics change in the text.

From this example, you can see how this prompt makes it much easier for the LLM to answer, while also making it more likely that the answer will be grounded in reality.

The relationship between data sources and chunking is essential to Benchmate’s functionality, and in future posts I’ll explore how documents are chunked and how data sources are used to power our data extraction pipelines.