Despite the successful adoption of a range of approved drugs since the early 2000s, malaria remains one of the deadliest infectious diseases worldwide, with over 200 million new infections and more than 600,000 deaths per year, mostly amongst children and pregnant women. Critically, malaria infections are now rising for the first time in 20 years: progress in curbing the number of infections has slowed, and the COVID-19 pandemic has made matters worse. Additionally, the emergence of malarial strains that are resistant to artemisinin, the main component of most anti-malarial drugs, highlights the urgent need to develop new approaches to identify novel drugs and overcome this global health challenge [2, 3, 4].
Innovation is critical to meet the main requirements for a novel and effective anti-malarial drug:
Low cost: the drug must be cheap to manufacture and affordable to distribute across the globe
Accessible: patients should be able to take the drug as a single oral dose
Safe: limited side effects, especially for children and pregnant women
Efficient: the drug should avoid cross-resistance by using a mechanism of action distinct from current drugs
Traditionally, developing a drug that meets all these requirements needs the scaling capabilities of pharmaceutical companies. However, because of the low expected return on investment, anti-malarial drug discovery research is mostly carried out by academic groups, non-profit organisations, and public-private partnerships.
To remedy the lack of progress in new anti-malarial drug development, Prof Matthew Todd founded an open science initiative called Open Source Malaria (OSM), which brings together an interdisciplinary and international team of researchers to work on a fully open and patent-free drug discovery pipeline that includes the design, synthesis, and testing of new anti-malarial candidates. OSM runs its discovery pipeline in different phases, or Series, each linked to a separate chemical family that could act on a different target of the malarial parasite. The most recent Series 4 chemical family, for example, acts by blocking the P-type ATPase PfATP4, a very promising target that has already advanced through the initial stages of the OSM discovery pipeline.
One of the most critical stages in developing a novel drug is the optimisation of lead candidates. This is when a promising candidate compound undergoes many small modifications in order to meet the cost, accessibility, safety, and efficiency requirements that would make it an effective drug against the parasite. A non-commercial organisation such as OSM cannot test all these potential variations of a compound in the lab, and major efforts are in place to use computational models to predict which compound variations could be most effective before taking them to the lab for synthesis and testing.
Developing a computational model that can predict drug activity is still an open challenge, so in 2016 and 2019 OSM launched two competition rounds to improve the lead optimisation phase of their Series 4 compounds using advanced AI methods. The most effective AI models learn from experimental data on compounds of known activity and then make good predictions for compounds of unknown activity. Such models are well established for computer vision tasks but are still a work in progress for computational chemistry tasks. The two main limiting factors in this context are:
Experimental: the availability of datasets containing thousands of experimentally validated compounds to provide the AI with examples to learn from
Computational: the development of algorithms that can capture the complexity of chemical data in a representation that AI models can effectively learn from
The experts at the Ersilia Open Source Initiative, a non-profit organisation that specialises in open research for infectious and neglected diseases, and DeepMirror, a tech-bio company that specialises in Small Data AI to accelerate therapeutics development, joined forces to take up the challenge.
Ersilia has been working with the OSM team for the last year, and as part of this collaboration they have curated a list of experimentally validated compounds from the OSM open repository. This list includes information about the drugs’ activity, accessibility, and physicochemical properties. DeepMirror then used their platform and the resulting 300 highly curated datapoints to build a model that can predict Series 4 compound activity.
Two main innovations were necessary to develop an effective AI model to predict anti-malarial drug efficacy:
The chemical compound data was represented as graphs, where nodes correspond to atoms, and edges correspond to atomic bonds. With this data representation, one could use a so-called message passing network to learn molecular-level information from this data very efficiently.
DeepMirror’s Small Data AI engine used the few hundred highly curated datapoints in combination with thousands of non-validated datapoints. The latter correspond to the many possible small variations of the Series 4 compounds that would be too costly and time-consuming to test and validate in the lab.
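The graph representation in the first point can be illustrated with a minimal sketch. It encodes ethanol (heavy atoms only) as atoms and bonds and runs one round of message passing in which every atom adds up the features of its bonded neighbours. The one-hot atom features, the function names, and the plain summation are assumptions made for illustration; a real message passing network applies learned weights and non-linearities at each step, and this is not DeepMirror's implementation.

```python
# Illustrative only: a hand-built graph for ethanol (CH3-CH2-OH), heavy atoms only.
atoms = ["C", "C", "O"]          # nodes
bonds = [(0, 1), (1, 2)]         # undirected edges between atom indices

ELEMENTS = ["C", "O"]            # hypothetical feature vocabulary for this sketch


def one_hot(symbol):
    """Encode an element symbol as a one-hot feature vector."""
    return [1.0 if symbol == e else 0.0 for e in ELEMENTS]


def neighbours(n_atoms, bonds):
    """Build an adjacency list from the bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def message_passing_step(features, adj):
    """Each atom adds the sum of its neighbours' features to its own.

    No learned weights here: this only shows how information flows along bonds.
    """
    updated = []
    for i, own in enumerate(features):
        message = [0.0] * len(own)
        for j in adj[i]:
            message = [m + f for m, f in zip(message, features[j])]
        updated.append([o + m for o, m in zip(own, message)])
    return updated


def readout(features):
    """Sum atom features into a single molecule-level vector."""
    return [sum(column) for column in zip(*features)]


features = [one_hot(a) for a in atoms]
adj = neighbours(len(atoms), bonds)
h1 = message_passing_step(features, adj)
molecule_vector = readout(h1)
```

After one step, the middle carbon already carries information about the oxygen it is bonded to; repeating the step lets each atom see progressively larger parts of the molecule.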
DeepMirror’s approach was developed with a specific classification task in mind: predicting whether a drug would be effective or not against a specific target. In the first phase of this project, DeepMirror used two large small-molecule datasets that are publicly available as a benchmark for validation: a standard large anti-malarial drug dataset (Malaria in the figures), and a large standard dataset containing activity outcomes of cytochrome panel assays (P450 in the figures), which are normally used to determine how drugs affect patients with different genetic traits.
Having two large standardised and validated datasets allowed DeepMirror to validate their Small Data AI approach for small molecules. To do so, they conducted computational ablation experiments, measuring AI learning performance on datasets of increasing size (Figure 1). Strikingly, in Small Data regimes (< 1,000 validated samples), DeepMirror’s AI led to a large increase in the Area Under Curve (AUC) metric, which represents the power of the model to distinguish between compounds that are effective and those that are not (the higher, the better). This shows the potential to improve current lead optimisation workflows by building predictive models from small validated datasets.
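To make the AUC metric concrete, here is a minimal sketch that computes it directly from its pairwise definition: the fraction of (active, inactive) compound pairs in which the active compound receives the higher score, with ties counted as half. The function name and the six example labels and scores are invented for illustration and are not data from this study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for active compounds, 0 for inactive ones.
    scores: model outputs, higher meaning "more likely active".
    """
    active = [s for l, s in zip(labels, scores) if l == 1]
    inactive = [s for l, s in zip(labels, scores) if l == 0]
    if not active or not inactive:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for a in active:
        for i in inactive:
            if a > i:
                wins += 1.0      # pair ranked correctly
            elif a == i:
                wins += 0.5      # tie counts as half
    return wins / (len(active) * len(inactive))


# Hypothetical example: three actives, three inactives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

In this toy example eight of the nine active/inactive pairs are ranked correctly, so the AUC is 8/9. A model that scores compounds at random would sit near 0.5, and a perfect ranking gives 1.0.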
Figure 1: Benchmarking the DeepMirror platform on drug activity data. (Left) Ablation study on a publicly available malaria drug activity dataset. (Right) Ablation study on P450 activity. Models were trained on 50, 100, 200, and 1200 validated molecules together with 2400 non-validated molecules, using a message passing network with either a conventional algorithm or DeepMirror’s proprietary algorithm. For each training iteration, we benchmarked performance on 300 unseen molecules using the area under the receiver operating characteristic curve (AUC). For each label split we ran 15 experiments to cross-validate our models (shaded regions show the bootstrapped 95% confidence interval). DeepMirror’s platform outperforms conventional AI training in small data regimes (< 1,000 samples).
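The shaded confidence regions described in the caption can be obtained with a percentile bootstrap over the repeated runs. The sketch below is one common way to do this and is not necessarily the procedure used for the figure: the function name, the number of resamples, the seed, and the 15 AUC values are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)            # fixed seed so the sketch is repeatable
    n = len(values)
    resampled_means = sorted(
        mean(rng.choice(values) for _ in range(n))   # resample with replacement
        for _ in range(n_resamples)
    )
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return resampled_means[lo_idx], resampled_means[hi_idx]


# Hypothetical AUC values from 15 repeated experiments (not results from this study).
run_aucs = [0.72, 0.70, 0.71, 0.74, 0.69, 0.75, 0.68, 0.73,
            0.77, 0.66, 0.70, 0.78, 0.71, 0.69, 0.76]
ci_low, ci_high = bootstrap_ci(run_aucs)
```

Plotting `ci_low` and `ci_high` around the mean AUC for each training-set size would give a shaded band of the kind the caption describes.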
After the promising benchmarks on the two publicly available small molecule datasets, DeepMirror applied their Small Data AI platform to the small dataset curated by Ersilia. Those 300 validated compounds could be used in combination with the thousands of Series 4 compounds that were non-validated, to predict activity against the malarial parasite. DeepMirror kept aside a small number of randomly selected validated datapoints (test dataset) to estimate the performance of the neural network, and then repeated the process of AI training and performance evaluation (cross-validation) multiple times to account for small variations in the data, which are typical for small datasets.
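The evaluation loop described above can be sketched as repeated random hold-out splits over the validated compounds. The 20% test fraction, the seed, and the function name are assumptions (the text only says that a small number of datapoints was kept aside), and the 15 repeats mirror the number of experiments reported in the figures. In a setup like this, the non-validated compounds would presumably be added to the training side only, since they have no labels to evaluate against.

```python
import random


def repeated_holdout(n_samples, test_fraction=0.2, n_repeats=15, seed=0):
    """Return a list of (train_indices, test_indices) random splits."""
    rng = random.Random(seed)
    n_test = max(1, int(n_samples * test_fraction))
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                       # new random split each repeat
        test = sorted(indices[:n_test])
        train = sorted(indices[n_test:])
        splits.append((train, test))
    return splits


# 300 validated compounds, as in the curated dataset described above.
splits = repeated_holdout(n_samples=300)
for train_idx, test_idx in splits:
    # Train a model on train_idx (plus any non-validated compounds),
    # then score it on test_idx; the training itself is omitted here.
    pass
```

Averaging the test score over all repeats smooths out the run-to-run variation that a single small test set would otherwise introduce.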
Figure 2: DeepMirror’s Breakthrough Discovery Platform applied to OSM Small Dataset. The DeepMirror platform outperforms conventional approaches to predict drug activity and can be queried to predict new compounds. Conventional vs. DeepMirror: Wilcoxon test (paired samples), P value < 0.001. 15 cross validation experiments were carried out for both the conventional and DeepMirror training.
Compared with conventional AI models, which can only learn from validated data, DeepMirror’s Small Data AI, which can learn from both validated and non-validated datasets, showed a clear increase in predictive performance as measured by the AUC metric (Figure 2). DeepMirror’s AI model can now be applied to predict drug activity for thousands of non-validated candidates, speeding up the optimisation stage that can lead to the synthesis and testing of the most effective anti-malarial drugs in the lab.
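The paired Wilcoxon test mentioned in the Figure 2 caption compares two methods run on the same splits. As a rough sketch of how such a test works, the code below computes the signed-rank statistic and an exact two-sided p-value by enumerating every possible sign pattern. The function name and both lists of AUC values are hypothetical and are not the study's results; in practice a statistics library would normally be used instead.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p) where W is the smaller of the positive and negative rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]   # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis every sign pattern is equally likely.
    total = sum(ranks)
    as_extreme = 0
    for signs in product((0, 1), repeat=n):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= w:
            as_extreme += 1
    return w, as_extreme / 2 ** n


# Hypothetical AUCs from 15 paired experiments (not results from this study).
baseline_auc = [0.61, 0.58, 0.64, 0.60, 0.57, 0.63, 0.59, 0.62,
                0.66, 0.55, 0.60, 0.65, 0.58, 0.61, 0.63]
candidate_auc = [0.72, 0.70, 0.71, 0.74, 0.69, 0.75, 0.68, 0.73,
                 0.77, 0.66, 0.70, 0.78, 0.71, 0.69, 0.76]
w_stat, p_value = wilcoxon_signed_rank(candidate_auc, baseline_auc)
```

When one method wins in all 15 pairs, as in this made-up example, the statistic is 0 and the exact two-sided p-value is 2/2^15, roughly 0.00006. This shows that 15 paired experiments are enough for a p-value below 0.001 to be reachable.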
In conclusion, the collaboration between Ersilia and DeepMirror to support OSM’s mission led to the development of a lead optimisation pipeline that may harness information contained in both small validated datasets and larger non-validated datasets. This novel approach can improve many drug discovery pipelines that tackle the challenges of Global Health such as the fight against malaria.
To reach out for more information please contact:
Open Source Malaria (https://github.com/OpenSourceMalaria) – Open Antimalarial Drug Discovery
Ersilia Open Source Initiative (email@example.com) – Open Source Machine Learning for research in infectious and neglected diseases.
List of Contributors
DeepMirror: Amir Shirian, Ryan Greenhalgh, Dr Max Jakobs, Dr Andrea Dimitracopoulos
Ersilia: Dr Miquel Duran-Frigola, Gemma Turon
Open Source Malaria: the dozens of contributors!
[1] World Health Organization. 2021. “World Malaria Report 2021.” ISBN 978-92-4-004049-6.
[2] Hamilton, William L., Roberto Amato, Rob W. van der Pluijm, Christopher G. Jacob, Huynh Hong Quang, Nguyen Thanh Thuy-Nhien, Tran Tinh Hien, et al. 2019. “Evolution and Expansion of Multidrug-Resistant Malaria in Southeast Asia: A Genomic Epidemiology Study.” The Lancet Infectious Diseases 19 (9): 943–51.
[3] Balikagala, Betty, Naoyuki Fukuda, Mie Ikeda, Osbert T. Katuro, Shin-Ichiro Tachibana, Masato Yamauchi, Walter Opio, et al. 2021. “Evidence of Artemisinin-Resistant Malaria in Africa.” The New England Journal of Medicine 385 (13): 1163–71.
[4] Rosenthal, Philip J. 2021. “Are Artemisinin-Based Combination Therapies For Malaria Beginning To Fail in Africa?” The American Journal of Tropical Medicine and Hygiene 105 (4): 857–58.
[5] Tse, Edwin, Laksh Aithani, Mark Anderson, Jonathan Cardoso-Silva, Giovanni Cincilla, Gareth J. Conduit, Mykola Galushka, et al. 2021. “An Open Drug Discovery Competition: Experimental Validation of Predictive Models in a Series of Novel Antimalarials.” Journal of Medicinal Chemistry 64 (22): 16450–63.
[6] Gamo, F.-J., L. Sanz, J. Vidal, et al. 2010. “Thousands of Chemical Starting Points for Antimalarial Lead Identification.” Nature 465: 305–10. https://doi.org/10.1038/nature09107
[7] National Center for Biotechnology Information. 2022. “PubChem Bioassay Record for AID 1851, Cytochrome Panel Assay with Activity Outcomes.” Source: National Center for Advancing Translational Sciences (NCATS). https://pubchem.ncbi.nlm.nih.gov/bioassay/1851