Machine Learning and NLP
This section showcases selected machine learning and natural language processing projects spanning multi-omics, metabolomics, bibliometrics, anomaly detection, recommendation systems, and scientific career analytics. Each project combines reproducible Python workflows with interpretable visual outputs and, where relevant, publication-oriented reporting.
FNRL Mode-of-Action in Arabidopsis Roots: An Integrated Multi-Omics, Machine Learning and Bioinformatics Study
Objective: Investigate the function of the root-specific FNRL gene in Arabidopsis thaliana by integrating phenotyping with transcriptomics, proteomics, metabolomics, machine learning, and bioinformatics. The study compared wild-type plants, two independent loss-of-function mutants, and a GFP-complemented line to uncover molecular pathways associated with FNRL regulation.
Dataset: Multi-omics and phenotypic dataset comprising root RNA-seq, LC-MS/MS proteomics, GC-MS metabolomics, nitrate content measurements, and root length data across four genotypes and biological replicates.
Methods: Data cleaning and harmonisation, missing-value imputation, log transformation, cross-omics scaling, PCA, differential expression analysis, biomarker discovery based on mutant consistency and GFP rescue rules, hierarchical clustering, k-means clustering, LDA, Elastic Net and Random Forest modelling, plus bioinformatics mining using TAIR, UniProtKB, GO/AmiGO, KEGG, STRING, Cytoscape, PaintOmics and AraCyc.
- Identified 504 high-confidence FNRL-regulated biomarkers across transcriptomics, proteomics, and metabolomics.
- Resolved biomarker behaviour into 4 main expression profiles, separating FNRL-activated and FNRL-repressed targets with strong or partial genetic rescue.
- Built a conceptual model of FNRL regulation based on two main axes: regulatory direction and rescue strength.
- Showed that root growth phenotypes were highly predictable from biomarker-derived molecular profiles, with Elastic Net models achieving strong performance.
- Highlighted, through integrated pathway and interaction analyses, processes linked to nitrogen metabolism, protein processing, ubiquitin-mediated proteolysis, and root-associated regulation.
- Generated an interpretable multi-omics resource to support biological interpretation of FNRL mode of action and downstream manuscript preparation.
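The mutant-consistency and GFP-rescue logic behind the four profiles can be illustrated with a minimal Python sketch. It is not the study's actual rule set: the function name `classify_profile`, the log2 fold-change inputs, and the cut-offs `min_effect` and `strong_cutoff` are hypothetical choices made for illustration only.

```python
def classify_profile(lfc_mut1, lfc_mut2, lfc_gfp, min_effect=1.0, strong_cutoff=0.25):
    """Assign a feature to one of four illustrative profiles, or return None.

    All inputs are log2 fold changes versus wild type (assumed layout).
    min_effect and strong_cutoff are hypothetical thresholds.
    """
    # Mutant consistency: both independent mutants must shift in the same
    # direction, and each shift must reach the minimum effect size.
    if lfc_mut1 <= -min_effect and lfc_mut2 <= -min_effect:
        direction = "FNRL-activated"   # abundance drops when FNRL is lost
    elif lfc_mut1 >= min_effect and lfc_mut2 >= min_effect:
        direction = "FNRL-repressed"   # abundance rises when FNRL is lost
    else:
        return None

    mean_mutant = (lfc_mut1 + lfc_mut2) / 2
    if mean_mutant == 0:
        return None

    # Rescue strength: fraction of the mutant effect still present in the
    # GFP-complemented line (0 means fully back at wild-type level).
    remaining = abs(lfc_gfp) / abs(mean_mutant)
    if remaining >= 1.0:
        return None                    # complemented line is not rescued
    rescue = "strong rescue" if remaining <= strong_cutoff else "partial rescue"
    return direction, rescue


print(classify_profile(-2.0, -1.5, -0.2))
print(classify_profile(1.2, 2.0, 0.9))
```

The two returned values correspond to the two axes of the conceptual model, regulatory direction and rescue strength, which together yield four profiles.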
Technical highlights: Python • multi-omics integration • biomarker discovery • clustering • LDA • Elastic Net • Random Forest • pathway analysis • network biology
Status: Article in preparation
Final report
Python code and Jupyter notebook
Machine Learning Reveals Synergistic Effects of Lactobacillus helveticus in Camel Milk Fermentation Using Metabolomics
Objective: Develop a computational biomarker discovery pipeline to isolate metabolites specifically associated with Lactobacillus helveticus fermentation in camel milk, and determine whether co-culture with L. bulgaricus or S. thermophilus induces synergistic or antagonistic metabolic effects.
Dataset: Untargeted UPLC-MS/MS metabolomics dataset covering 13,400 metabolites, including 3,632 identified compounds, generated from camel and bovine milk fermented under mono- and co-culture conditions.
Methods: Three-way ANOVA, PCA, LDA, HDBSCAN, k-means, spectral clustering, self-organising maps (SOM), Random Forest classification, feature importance ranking, metabolite annotation, superclass analysis, and pathway exploration using MetaboAnalyst and external metabolic databases.
- Designed a 4-level biomarker selection pipeline reducing 13,400 compounds to the most biologically relevant features.
- Selected 2,069 metabolites through full-factor interaction testing, then refined to 1,017 metabolites showing clear synergy/antagonism patterns.
- Applied Random Forest modelling to isolate 508 high-impact biomarkers associated with metabolic interactions driven by L. helveticus.
- Retrieved 133 identified metabolites, of which 87 displayed synergistic profiles and 46 antagonistic profiles.
- Highlighted metabolite classes linked to amino acid metabolism, bioactive lipids, carbohydrates, and fermentation-associated functional compounds.
- Developed a reusable in silico protein digestion tool to simulate enzyme cleavage of camel and bovine caseins, supporting interpretation of dairy proteolysis and related cheese studies.
- Delivered a robust analytical framework supporting downstream biological interpretation and manuscript preparation.
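The idea behind the in silico protein digestion tool can be sketched in a few lines of Python. This is an illustrative stand-in, not the tool itself: it assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P), and the function name `digest`, its parameters, and the toy sequence are hypothetical. No real casein sequence is used.

```python
def digest(sequence, cleave_after="KR", blocked_by="P", min_length=1):
    """Split a protein sequence into peptides using a simple cleavage rule.

    Defaults follow the usual trypsin rule: cut on the C-terminal side of
    K or R, except when the following residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in cleave_after and not is_last:
            if sequence[i + 1] not in blocked_by:
                peptides.append(sequence[start:i + 1])
                start = i + 1
    peptides.append(sequence[start:])  # trailing peptide after the last cut
    return [p for p in peptides if len(p) >= min_length]


toy_sequence = "AKRPGKLR"  # invented example, not a casein fragment
print(digest(toy_sequence))
```

Other enzymes could be approximated by changing `cleave_after` and `blocked_by`, although real cleavage specificities are often more complex than a single-residue rule.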
Technical highlights: Python • metabolomics • feature selection • clustering • Random Forest • biomarker discovery • pathway analysis • Side tool: in silico protein digestion
Status: Article in preparation
Final report
Python code and Jupyter notebook
Related bioinformatics tool
Associated publication
Machine Learning Modelling of Synergy and Antagonism of Phenolic Compounds Using Antioxidant Assays
Objective: Develop a machine learning framework to quantify synergistic and antagonistic interactions between phenolic compounds across multiple antioxidant assays, using observed-versus-predicted absorbance behaviour to reveal non-additive biochemical effects.
Dataset: Experimental antioxidant dataset comprising individual standards and binary mixtures (CB1–CB5) tested across multiple assay systems, with absorbance measurements collected for pure compounds and combinations under controlled concentrations.
Methods: Exploratory data analysis, assay-wise regression modelling, XGBoost prediction, residual-based synergy scoring, threshold optimisation, confusion matrix evaluation, feature importance ranking, and comparative interpretation across antioxidant assay types.
- Built XGBoost regression models to predict expected absorbance values from single-compound behaviour.
- Calculated synergy and antagonism scores from prediction residuals, enabling quantitative classification of interaction strength.
- Demonstrated that interaction behaviour varies substantially across assay systems, revealing assay-dependent biochemical responses.
- Generated interpretable classification outputs distinguishing synergistic, additive, and antagonistic mixtures.
- Established a transferable computational framework for analysing compound interactions in antioxidant chemistry.
- Produced publication-ready visual outputs supporting article submission.
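Residual-based synergy scoring can be sketched as follows. The sketch assumes the expected absorbance has already been predicted by a model (XGBoost in this project) and only shows the scoring step. The function name `score_interaction`, the fixed `threshold`, and the `higher_is_stronger` flag are hypothetical; the project tuned its thresholds, whereas the value here is arbitrary.

```python
def score_interaction(observed, predicted, threshold=0.05, higher_is_stronger=True):
    """Classify a mixture from the gap between observed and predicted absorbance.

    higher_is_stronger handles assays where a lower absorbance means a
    stronger antioxidant effect (set it to False for those assays).
    """
    residual = observed - predicted
    if not higher_is_stronger:
        residual = -residual  # flip sign so that positive always means synergy

    if residual > threshold:
        label = "synergistic"
    elif residual < -threshold:
        label = "antagonistic"
    else:
        label = "additive"
    return residual, label


print(score_interaction(0.80, 0.60)[1])
print(score_interaction(0.58, 0.60)[1])
print(score_interaction(0.40, 0.60, higher_is_stronger=False)[1])
```

Keeping the sign convention explicit per assay is one way to make scores comparable across assay systems that report activity in opposite directions.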
Technical highlights: Python • XGBoost • regression modelling • residual scoring • assay comparison • feature interpretation
Status: Article under review
Final report
Python code and Jupyter notebook
Bread Protein Biomarker Discovery Assisted by Machine Learning
Objective: Identify peptide biomarkers associated with wheat genotype groups and flour-quality protein expression, using statistical learning and machine learning to support biomarker-assisted selection in bread wheat.
Dataset: Large-scale proteomics dataset comprising thousands of peptides derived from flour proteins quantified across a broad panel of wheat genotypes, integrated with genotype metadata and protein annotation.
Methods: Data normalisation, correlation analysis, t-tests, ANOVA, hierarchical clustering, k-means clustering, self-organising maps (SOM), PCA, LDA, Random Forest, SVM, MLP neural network, and stacked ensemble modelling for genotype classification and biomarker ranking.
- Integrated statistical and machine learning outputs into a unified biomarker discovery framework.
- Resolved genotype structure through multiple complementary clustering approaches, including HCA, k-means, and SOM.
- Applied supervised models to classify genotype groups using peptide abundance profiles.
- Built a stacked ensemble model combining Random Forest, SVM, and MLP to improve classification robustness.
- Identified peptide biomarkers linked to key flour-quality protein groups, including glutenins and gliadins.
- Generated biologically interpretable candidate markers relevant for wheat breeding and flour functionality.
Technical highlights: Python • proteomics • biomarker discovery • clustering • stacked machine learning • classification • feature ranking
Status: Article in preparation
Final report
Python code and Jupyter notebook
Career Trajectory Mapping From My Scientific Publications Using NLP, Machine Learning and Analytics
Objective: Develop a reproducible NLP and machine learning framework to analyse the thematic evolution of a scientific career through publication metadata, abstracts, keywords, and semantic content, with the goal of identifying major research phases, transitions, and future directions.
Dataset: Curated corpus of personal scientific publications spanning multiple years, integrating titles, abstracts, keywords, journal metadata, authorship patterns, and publication timelines.
Methods: Text preprocessing, tokenisation, TF-IDF, topic modelling, keyword frequency analysis, semantic clustering, temporal trend analysis, PCA, LDA, and machine learning-assisted interpretation of research trajectory patterns.
- Built a structured NLP corpus from publication records across multiple scientific domains.
- Identified major thematic transitions across career stages using topic modelling and semantic clustering.
- Mapped the evolution from molecular plant science to multi-omics, machine learning, and computational biology.
- Integrated publication chronology with thematic outputs to reconstruct career progression pathways.
- Generated visual analytics highlighting dominant research themes, emerging directions, and interdisciplinary expansion.
- Produced a data-driven framework transferable to researcher profiling, strategic planning, and academic portfolio analysis.
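Temporal trend analysis of keywords can be illustrated with a small standard-library sketch that counts keywords per time window. The function name `keyword_trends`, the five-year window, and the example records are invented for illustration and are not taken from the actual publication corpus.

```python
from collections import Counter, defaultdict


def keyword_trends(records, period=5):
    """Count keywords per time window.

    records: iterable of (year, list_of_keywords) pairs (assumed layout).
    Returns {window_start_year: [(keyword, count), ...]} ordered by year.
    """
    windows = defaultdict(Counter)
    for year, keywords in records:
        window_start = year - year % period
        windows[window_start].update(k.lower() for k in keywords)
    return {start: counts.most_common() for start, counts in sorted(windows.items())}


# Invented records, used only to show the shape of the output
example = [
    (2008, ["Proteomics", "wheat"]),
    (2009, ["proteomics"]),
    (2021, ["machine learning", "metabolomics"]),
    (2023, ["machine learning"]),
]
for window, counts in keyword_trends(example).items():
    print(window, counts)
```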
Technical highlights: Python • NLP • topic modelling • TF-IDF • semantic analysis • temporal analytics • machine learning interpretation
Status: Completed analytical report
Final report
Python code and Jupyter notebook
Aliens, Algorithms & Anomalies: Visualising the NUFORC UFO Sightings
Objective: Develop a machine learning and NLP framework to analyse large-scale UFO sighting reports, identify reporting patterns, classify anomalous events, and prioritise cases with high investigative value.
Dataset: NUFORC database containing more than 150,000 UFO sighting reports, integrating temporal records, geographical metadata, witness descriptions, event characteristics, and free-text narratives.
Methods: Data cleaning, geospatial enrichment, NLP feature extraction from witness reports, exploratory data analysis, temporal and spatial visualisation, classification modelling, anomaly prioritisation, and predictive scoring.
- Cleaned and enriched large-scale historical sighting data with geographical coordinates and regional metadata.
- Extracted structured variables from free-text witness narratives to support NLP-driven analysis.
- Built predictive models to identify high-priority sightings associated with stronger anomaly indicators.
- Developed case prioritisation logic to flag reports with elevated close-encounter characteristics.
- Generated multi-scale visual analytics showing temporal waves, spatial hotspots, and reporting behaviour.
- Combined scientific rigour with unconventional public data to demonstrate a transferable anomaly-analysis methodology.
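Extracting structured variables from free-text narratives can be sketched with regular expressions. This is a simplified illustration rather than the project's feature extractor: the shape vocabulary `SHAPES`, the duration pattern, and the function name `extract_features` are all assumptions.

```python
import re

SHAPES = ("triangle", "disk", "sphere", "cigar", "light", "orb")  # hypothetical vocabulary
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}
DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*(second|minute|hour)s?")


def extract_features(narrative):
    """Turn one witness narrative into a small dictionary of structured variables."""
    text = narrative.lower()
    # Whole-word match, allowing a plural "s" (so "orbs" counts as "orb")
    shapes = [s for s in SHAPES if re.search(rf"\b{s}s?\b", text)]
    # First duration mentioned, converted to seconds
    match = DURATION.search(text)
    duration = float(match.group(1)) * UNIT_SECONDS[match.group(2)] if match else None
    return {"shapes": shapes, "duration_seconds": duration, "n_words": len(text.split())}


print(extract_features("Three bright orbs hovered for 5 minutes then vanished."))
```

Variables of this kind can then be joined to the temporal and geographical fields before modelling.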
Technical highlights: Python • NLP • geospatial analytics • classification • feature engineering • anomaly scoring • Tableau
Status: Completed analytical report
Final report
Python code and Jupyter notebook
Interactive Tableau dashboard
Decoding Love with Data: Matchmaking in a Dating App
Objective: Develop a machine learning and NLP framework to identify compatible matches between dating profiles by combining structured demographic features, personal preferences, and free-text self-descriptions.
Dataset: Large-scale dating profile dataset including demographic variables, lifestyle attributes, preferences, and millions of words extracted from personal essays written by users.
Methods: Text preprocessing, tokenisation, lemmatisation, topic modelling (latent Dirichlet allocation, LDA), feature encoding, clustering, cosine similarity, nearest-neighbour matching, and interactive match retrieval.
- Processed millions of words from user essays to extract meaningful lifestyle and personality themes.
- Applied topic modelling to identify dominant discussion themes across personal profiles.
- Integrated structured profile variables with NLP-derived features into a unified similarity framework.
- Built cluster-based matching logic to improve candidate relevance before similarity scoring.
- Computed profile-to-profile similarity using cosine similarity within behavioural clusters.
- Developed an interactive interface to retrieve top compatible matches for any selected profile.
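Cluster-restricted cosine matching can be sketched in plain Python. The sketch assumes each profile has already been encoded as a numeric vector and assigned a cluster label. The names `cosine` and `top_matches`, the profile IDs, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_matches(query_id, vectors, clusters, k=3):
    """Return the k most similar profiles from the query profile's own cluster."""
    candidates = [
        pid for pid in vectors
        if pid != query_id and clusters[pid] == clusters[query_id]
    ]
    scored = [(pid, cosine(vectors[query_id], vectors[pid])) for pid in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy data: profile "c" sits in a different cluster and is never compared
vectors = {"a": [1, 0, 1], "b": [1, 0, 0.9], "c": [0, 1, 0], "d": [1, 0.1, 1]}
clusters = {"a": 0, "b": 0, "c": 1, "d": 0}
print(top_matches("a", vectors, clusters, k=2))
```

Filtering by cluster first reduces the number of pairwise comparisons and keeps candidates within the same broad group before fine-grained ranking.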
Technical highlights: Python • NLP • topic modelling • clustering • cosine similarity • recommendation logic • Gradio interface • Tableau
Status: Completed machine learning project
Final report
Python code and Jupyter notebook
Interactive Tableau dashboard
Mapping the Landscape of Scientific Publications
Objective: Explore large-scale scientific publication metadata to identify bibliometric trends, publication patterns, dominant research topics, and long-term shifts in disciplinary focus.
Dataset: Scientific article metadata dataset containing approximately 120,000 publications with information on titles, authors, publication dates, journals, publishers, languages, article types, references, and subject descriptors.
Methods: Metadata cleaning and wrangling, exploratory data analysis, bibliometric visualisation, subject text preprocessing, TF-IDF vectorisation, and k-means clustering to group fine-grained subject labels into broader thematic categories.
- Cleaned and structured a large bibliographic dataset to support interpretable publication analytics.
- Analysed long-term trends in publication volume, article types, journals, publishers, languages, citations, and title length.
- Identified prolific authors and high-output journals across multiple scientific fields.
- Explored subject evolution over time to distinguish long-lasting, emerging, and short-lived research areas.
- Grouped 1,265 subject labels into 7 broader thematic clusters using text mining and k-means clustering.
- Produced publication-ready visual analytics and an interactive Tableau dashboard for bibliometric exploration.
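Grouping subject labels with TF-IDF and k-means can be illustrated with a standard-library sketch. It is a simplified stand-in, not the project's pipeline: the functions `tfidf` and `kmeans`, the smoothed IDF weighting, the first-k-points initialisation, and the four example labels are assumptions chosen to keep the example deterministic.

```python
import math
from collections import Counter


def tfidf(labels):
    """Return L2-normalised TF-IDF vectors for short text labels."""
    docs = [label.lower().split() for label in labels]
    vocab = sorted({term for doc in docs for term in doc})
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed inverse document frequency
        vec = [tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def kmeans(vectors, k, n_iter=20):
    """Basic k-means; the first k points serve as initial centroids (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    assignments = []
    for _ in range(n_iter):
        # Assign each vector to its nearest centroid (squared Euclidean distance)
        assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


subjects = ["plant biology", "organic chemistry", "marine biology", "physical chemistry"]
print(kmeans(tfidf(subjects), k=2))
```

In this toy case the shared word in each label pair drives the grouping. With many labels, a library implementation with multiple random initialisations is generally the more robust choice.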
Technical highlights: Python • bibliometrics • metadata analytics • text mining • TF-IDF • k-means clustering • Tableau
Status: Completed analytical report
Final report
Python code and Jupyter notebook
Interactive Tableau dashboard
Raw dataset