Beagle – Sniffing Out the Gaps in Your Information Retrieval System

The primary role of an information retrieval system (IR) is to retrieve a set of relevant documents given a query. For a machine-learning-based IR to be effective, it needs to be trained on sufficient data, and its performance needs to be monitored regularly so that “data drift” is detected early.

“Data drift” happens when any of the following scenarios occurs:

  1. Users start to phrase their queries differently.
  2. Users post queries that the IR is not well-equipped to answer.
  3. The knowledge base content underlying the IR is no longer relevant and needs to be updated (e.g. HR policy before and after COVID-19).

Beagle was developed to make monitoring an IR simple and intuitive, and to help collect more training data for it. Before delving into the inner workings of Beagle, we will further illustrate the concept of “data drift” in the context of information retrieval.

Data Drift

Imagine that you have an IR deployed within your organization to answer COVID-19 related questions, and one of your users posts a question about the circuit breaker measures implemented by the Singapore Government. Unfortunately, the IR was not trained to recognize the term “circuit breaker” in the context of COVID-19. Without adequate measures in place to look out for such occurrences, the poor performance of the model might go unnoticed.

User: “What are the current circuit breaker measures?”

What the IR returns:

“A circuit breaker is an automatically operated electrical switch designed to …” 

What the user is actually looking for:

“The 2020 Singapore circuit breaker measures, abbreviated as CB …” 

Beagle in Action

A maintainer, who is usually a staff member of the organization deploying the IR, uses Beagle to collect more training data, then to finetune and deploy a new IR model. End users, on the other hand, interact with the IR by posting their queries to it. In this section, we explain and illustrate the interactions between Beagle and the IR using the architecture diagram below.

1. User sends query and receives responses

End users post their queries to the IR and in return receive a set of relevant responses.

2. User provides feedback (optional)

End users are able to provide the maintainer with feedback on how the IR is doing by selecting the responses that are relevant to them.

3. Maintainer verifies feedback

There might be cases where the end users’ annotations are incorrect. As such, Beagle has a similar interface for the maintainer to verify the feedback given by the users. The data collected will be especially useful for companies that do not have the capability to annotate their own training data.

As the IR returns a fixed number of responses (e.g. 5 responses), it is possible for some ground truth responses not to appear in the list returned to the end user (false negatives). The maintainer can correct such errors by searching for these responses and annotating them as correct. Once the end user’s annotations have been verified and corrected, they are saved and used for downstream model training.
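
The verification step can be pictured with a minimal sketch. Beagle's actual data model is not described here, so the `FeedbackRecord` class, the `verify_feedback` function and the document ids below are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackRecord:
    """One query, the responses shown to the user, and the labels collected for it."""
    query: str
    returned_ids: list                                  # top-k responses shown to the end user
    user_selected: set = field(default_factory=set)     # responses the user marked as relevant
    verified: bool = False


def verify_feedback(record, rejected=(), missed=()):
    """Apply the maintainer's corrections and return the final set of relevant ids.

    rejected: user selections that the maintainer judges to be incorrect.
    missed:   relevant responses that were absent from the top-k list (false negatives).
    """
    relevant = (record.user_selected - set(rejected)) | set(missed)
    record.user_selected = relevant
    record.verified = True
    return relevant


# Hypothetical example: the user wrongly accepted the electrical-switch answer,
# and the correct document never made it into the returned list.
record = FeedbackRecord(
    query="What are the current circuit breaker measures?",
    returned_ids=["doc_electrical", "doc_safe_distancing", "doc_masks"],
    user_selected={"doc_electrical", "doc_safe_distancing"},
)
final = verify_feedback(record, rejected=["doc_electrical"], missed=["doc_cb_measures"])
print(sorted(final))
```

Only records marked as verified would then be passed on to training in this sketch.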

4. Maintainer has the option to upload labeled data

Alternatively, for companies that are able to annotate their own training data, Beagle allows the maintainer to upload the labeled data as a “.csv” file.
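
The expected layout of the “.csv” file is not specified here. As an illustration only, the sketch below assumes one labeled pair per row with two hypothetical columns, `query` and `response_id`, and groups the relevant responses by query.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = {"query", "response_id"}  # hypothetical column names, not Beagle's schema


def load_labeled_csv(handle):
    """Read labeled rows and group the relevant response ids by query."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    labels = defaultdict(set)
    for row in reader:
        query = (row["query"] or "").strip()
        response_id = (row["response_id"] or "").strip()
        if query and response_id:  # skip incomplete rows
            labels[query].add(response_id)
    return dict(labels)


# An in-memory file stands in for the uploaded ".csv".
sample = io.StringIO(
    "query,response_id\n"
    "What are the circuit breaker measures?,doc_cb_measures\n"
    "What are the circuit breaker measures?,doc_safe_distancing\n"
    "Can I work from the office?,doc_wfh_policy\n"
)
labels = load_labeled_csv(sample)
```

Checking the header up front lets a malformed upload be rejected before any of its rows reach the training data.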

5. Maintainer evaluates model against new batches of data

The data collected over time is separated into batches of fixed size (e.g. 100 data points), and each batch is evaluated against the currently deployed model. Each data batch is split into train, validation and test sets.
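
A minimal sketch of the batching and splitting described above. The split ratios are not stated in this article, so the 80/10/10 split, the fixed shuffle seed and the function names are assumptions made for illustration.

```python
import random


def make_batches(records, batch_size=100):
    """Group records into consecutive full batches; leftovers wait for more data."""
    full = len(records) - len(records) % batch_size
    return [records[i:i + batch_size] for i in range(0, full, batch_size)]


def split_batch(batch, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle one batch and split it into train, validation and test sets.

    The 80/10/10 ratios are an assumed default, not a documented Beagle setting.
    """
    shuffled = list(batch)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


records = list(range(250))        # stand-ins for 250 verified data points
batches = make_batches(records)   # two full batches; 50 records are held back
splits = split_batch(batches[0])
print(len(batches), {name: len(part) for name, part in splits.items()})
```

Splitting within each batch, rather than over the pooled data, keeps every batch's test set fixed once it has been created.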

For information retrieval tasks, the Mean Reciprocal Rank (MRR) score is a typical evaluation metric. As an example, the graph below illustrates the performance of the currently deployed model across 9 batches of data. Each data batch has two MRR scores (“Batch score” and “Accumulated test score”) tagged to it. For clarity, we have listed below the data points that were used to calculate each score for batch #8.

Batch # | Used to compute batch score for batch #8? | Used to compute accumulated test score for batch #8?
------- | ----------------------------------------- | -----------------------------------------------------
1       | No                                        | Yes, test set only
2       | No                                        | Yes, test set only
3       | No                                        | Yes, test set only
4       | No                                        | Yes, test set only
5       | No                                        | Yes, test set only
6       | No                                        | Yes, test set only
7       | No                                        | Yes, test set only
8       | Yes                                       | Yes, test set only

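
The two scores can be sketched as follows. MRR is the mean, over queries, of 1 divided by the rank of the first relevant response (0 if none is returned). The table shows which batches feed each score but not which splits of batch #8 enter its batch score, so using every data point of the batch is an assumption here, as are the function names and toy data.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant response, or 0.0 if none was returned."""
    for rank, response_id in enumerate(ranked_ids, start=1):
        if response_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(examples):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    examples = list(examples)
    if not examples:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in examples) / len(examples)


def batch_score(batches, n):
    """MRR over batch n alone (1-indexed); assumes all splits of the batch are used."""
    examples = [ex for split in batches[n - 1].values() for ex in split]
    return mean_reciprocal_rank(examples)


def accumulated_test_score(batches, n):
    """MRR over the test sets of batches 1..n, as in the table above."""
    examples = [ex for batch in batches[:n] for ex in batch["test"]]
    return mean_reciprocal_rank(examples)


# Toy data: each example pairs the model's ranked responses with the relevant ids.
batches = [
    {"train": [(["a", "b"], {"a"})], "validation": [], "test": [(["a", "b"], {"b"})]},
    {"train": [(["a", "b", "c"], {"c"})], "validation": [], "test": [(["a", "b"], {"z"})]},
]
print(batch_score(batches, 2), accumulated_test_score(batches, 2))
```

A batch score that falls while the accumulated test score stays stable would suggest that recent queries differ from earlier ones, which is the pattern data drift tends to produce.
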
6. Maintainer retrains a new model

If required (e.g. when data drift has severely affected the model's accuracy), the maintainer can retrain a new set of model weights for the IR with the data collected so far.
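
When to retrain is left to the maintainer's judgment. Purely as an illustration, the sketch below flags retraining when the latest batch MRR falls well below the average of the first few batches; the `needs_retraining` function, the three-batch baseline and the 0.1 threshold are all hypothetical.

```python
def needs_retraining(batch_scores, baseline_batches=3, max_drop=0.1):
    """Return True when the latest batch MRR is more than max_drop below the baseline mean.

    baseline_batches and max_drop are illustrative defaults, not Beagle settings.
    """
    if len(batch_scores) <= baseline_batches:
        return False  # not enough history to compare against
    baseline = sum(batch_scores[:baseline_batches]) / baseline_batches
    return baseline - batch_scores[-1] > max_drop


stable = [0.80, 0.82, 0.78, 0.79]
drifting = [0.80, 0.82, 0.78, 0.79, 0.60]
print(needs_retraining(stable), needs_retraining(drifting))
```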

7. Maintainer deploys a new model

After retraining a new set of model weights, the maintainer has the option to deploy the new model. The maintainer can also revert to an older model version if the performance of the currently deployed model turns out to be unsatisfactory.
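
Deploying and reverting both amount to choosing which stored version serves queries. The sketch below assumes a simple in-memory registry; the `ModelRegistry` class and the weight paths are hypothetical and say nothing about how Beagle stores its models.

```python
class ModelRegistry:
    """Keeps every trained set of weights so that any version can be deployed again."""

    def __init__(self):
        self._weights = {}      # version number -> path to the saved weights
        self.deployed = None    # version currently serving queries

    def register(self, weights_path):
        """Store a newly trained set of weights and return its version number."""
        version = len(self._weights) + 1
        self._weights[version] = weights_path
        return version

    def deploy(self, version):
        """Make the given version the active model and return its weights path."""
        if version not in self._weights:
            raise KeyError(f"unknown model version: {version}")
        self.deployed = version
        return self._weights[version]


registry = ModelRegistry()
v1 = registry.register("weights/model_v1.bin")   # hypothetical paths
v2 = registry.register("weights/model_v2.bin")
registry.deploy(v2)            # roll out the retrained model
active = registry.deploy(v1)   # revert after the new model underperforms
```

Because older weights are never overwritten in this sketch, reverting is the same operation as deploying.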

8. Repeat

The maintainer continues to monitor the performance of the model as new data is collected.


Our motivations for building Beagle for IRs are aligned with the Continuous Delivery for Machine Learning end-to-end process illustrated below. The data collection and performance monitoring features of Beagle tie in closely with the “Monitoring” and “Model Building” pipelines. We are continuously improving Beagle and are looking to generalize its features to other machine learning use cases. Do drop us an email if you have any questions or would like to request a demo.

Adapted from: Continuous Delivery for Machine Learning


Beagle was jointly developed by Kenneth Wang, Samson Lee, Muhammad Jufri Bin Ramli, Benedict Lee and William Tjhi.