Dr Delphine Vincent | Data Science, Machine Learning, Natural Language Processing & Scientific Analytics

Skills

Computer Operating Systems and Microsoft 365

I work primarily in a Microsoft Windows environment and am fully proficient with the Microsoft 365 productivity and collaboration ecosystem. I use these tools to manage analytical projects, prepare technical and scientific documents, communicate with collaborators, organise project information, and deliver reports, presentations, dashboards, and shared resources.

Microsoft Windows

I use Windows-based computers for data science, scientific computing, website development, file management, cloud access, communication, and administrative work. My regular working environment includes VS Code, JupyterLab, Python and Conda environments, GitHub Desktop, web browsers, command-line tools, and specialist analytical software.

Installation, configuration, and routine use of desktop software and analytical applications
Management of local and cloud-synchronised files, folders, permissions, and project structures
Use of Windows Terminal, Command Prompt, and PowerShell when required for development and automation
Troubleshooting of common software, file-path, browser, dependency, and compatibility issues
Organisation of reproducible project environments for data analysis, coding, reporting, and web development

Microsoft Word

I use Word extensively to prepare and edit scientific manuscripts, technical reports, grant and project documents, resumes, cover letters, reviewer responses, standard operating procedures, website guidance, and client-facing documentation.

Long-document formatting using headings, styles, tables, captions, references, and cross-references
Collaborative editing with comments, Track Changes, and version comparison
Preparation of publication-ready scientific and technical documents
Conversion and quality checking of documents for PDF distribution

Microsoft Excel

I use Excel for inspecting, cleaning, structuring, validating, summarising, and presenting quantitative and categorical data. It is particularly useful for rapid quality control, metadata review, tabular reporting, and communication with collaborators who require accessible data outputs.

Data cleaning, filtering, sorting, validation, and conditional formatting
Formulas, lookup functions, pivot tables, summary statistics, and chart creation
Quality-control checks before and after Python, SQL, or statistical analysis
Preparation of analysis-ready tables and supplementary data for scientific publications
Export and exchange of CSV and spreadsheet files across analytical platforms

Microsoft PowerPoint

I use PowerPoint to transform analytical and scientific results into clear visual narratives for technical and non-technical audiences. My presentations include conference talks, lectures, client reports, project summaries, workflow diagrams, and stakeholder briefings.

Design of structured scientific and business presentations
Integration of charts, tables, diagrams, screenshots, and publication-quality figures
Development of clear visual storytelling around complex analytical results
Preparation of reusable slide templates and presentation-ready project summaries

Microsoft Power BI

I use Power BI to build interactive dashboards and visual reports that support exploratory analysis, project monitoring, scientific interpretation, and communication of complex datasets.

Import and transformation of structured datasets
Creation of interactive charts, filters, slicers, and report pages
Development of dashboards for proteomics, metabolomics, and research-project data
Export of dashboards and figures for reports, presentations, and stakeholder review

Microsoft Teams, Outlook and OneNote

I use Teams, Outlook, and OneNote to coordinate projects, communicate with collaborators and clients, document decisions, organise meetings, and maintain accessible records of ongoing work.

Teams: online meetings, screen sharing, messaging, file collaboration, and project communication
Outlook: professional email, calendar management, meeting invitations, folders, and task coordination
OneNote: structured note-taking, meeting records, research notes, checklists, and project documentation

Microsoft SharePoint and OneDrive

I use SharePoint and OneDrive for cloud-based document storage, controlled file sharing, collaborative editing, version management, and access to project resources across devices and organisations.

Organisation of shared project folders and document libraries
Cloud synchronisation and secure sharing of files with collaborators
Collaborative editing of Word, Excel, and PowerPoint documents
Management of document versions and centralised project resources

Microsoft 365 applications used for documentation, data analysis, reporting, communication, collaboration, and cloud file management

Apple macOS

I am also familiar with Apple macOS and can work across Windows and Mac environments when collaborating with research teams or using platform-specific scientific and analytical software.

General navigation, file management, application use, and system settings
Use of browser-based, cloud-based, and cross-platform analytical tools
Exchange of documents, datasets, and code between Windows and macOS users

Technical highlights: Windows • Microsoft 365 • Word • Excel • PowerPoint • Power BI • Outlook • Teams • OneNote • SharePoint • OneDrive • Windows Terminal • PowerShell • macOS

Programming Languages, Development Environments, Cloud Platforms and AI

I use a broad technical stack spanning Python, R, SQL, web development, machine learning, natural language processing, automation, data visualisation, scientific computing, deployment, and AI-assisted workflows. I select tools according to the analytical question, dataset, deployment environment, and intended audience, with an emphasis on reproducibility, transparency, and maintainable code.

Python Programming

Python is my principal programming language for data science, machine learning, bioinformatics, natural language processing, automation, scientific analysis, web backends, and multimedia processing. I develop reproducible workflows in Jupyter Notebook and JupyterLab, and write modular scripts in VS Code.

Data cleaning, restructuring, integration, validation, and exploratory data analysis
Statistical analysis, feature engineering, predictive modelling, clustering, and model evaluation
Natural language processing, document analysis, topic modelling, semantic similarity, and text classification
Bioinformatics workflows including proteogenomics, sequence processing, in silico protein digestion, and annotation analysis
Automated generation of figures, reports, videos, images, and browser-ready scientific resources
Development of APIs, chatbot backends, interactive interfaces, and deployment-oriented applications
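
As a small illustration of the in silico protein digestion listed above, the sketch below applies the commonly used trypsin rule (cleave after K or R unless the next residue is P). It is a minimal example under stated assumptions, not the code used in the projects described here: the function name `tryptic_digest` and its parameters `missed_cleavages` and `min_length` are hypothetical.

```python
import re


def tryptic_digest(sequence, missed_cleavages=0, min_length=1):
    """Return tryptic peptides: cleave after K or R unless followed by P.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    sequence = sequence.strip().upper()
    # Zero-width split points after K/R that are not followed by P;
    # empty strings (e.g. after a C-terminal K/R) are discarded.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional consecutive fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    for pep in tryptic_digest("AKCRDPK", missed_cleavages=1):
        print(pep)
```

Peptides are returned in sequence order, with fully cleaved peptides listed before their missed-cleavage extensions.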

Python Data Science and Scientific Computing Libraries

I use established Python libraries to build end-to-end analytical pipelines, from raw-data processing through statistical modelling, machine learning, visualisation, reporting, and deployment.

Data manipulation: pandas, NumPy, SciPy, JSON, regular expressions, multiprocessing, and tqdm
Statistics: statsmodels and SciPy statistical functions
Machine learning: scikit-learn, XGBoost, LightGBM, TensorFlow, Keras, PyTorch, and MiniSom
Dimensionality reduction and clustering: UMAP, hierarchical clustering, k-means, HDBSCAN, and self-organising maps
Visualisation: Matplotlib, Plotly, Squarify, UpSet plots, PyViz, and WordCloud
Network analysis: NetworkX and graph-based analytical workflows

Natural Language Processing and Document Analytics

I use Python-based NLP tools to structure, analyse, compare, and interpret large collections of free text, scientific publications, website content, profile descriptions, and research documents.

Text processing: NLTK, spaCy, Gensim, regular expressions, and Beautiful Soup
Topic modelling: LDA, BERTopic, pyLDAvis, and clustering of text embeddings
Transformers and embeddings: Hugging Face Transformers and SentenceTransformers
Document extraction: PyMuPDF and GROBID for PDF and scientific-document processing
Similarity and fuzzy matching: cosine similarity and RapidFuzz
Web retrieval and automation: Beautiful Soup and Selenium

Image, Audio and Video Automation

I use Python to automate multimedia processing for scientific communication, website assets, and creative visual-storytelling projects.

Image processing: Pillow, scikit-image, ImageIO, and OpenCV
Audio processing: Librosa, Pydub, MIDI tools, and text-to-speech workflows
Video processing: MoviePy, OpenCV, FFmpeg, and Python subprocess automation
Batch production of image montages, transitions, overlays, compressed video, ambient audio, and narrated media

R Programming and Statistical Analysis

I use R and RStudio for statistical analysis, data transformation, biological data analysis, and publication-quality visualisation, particularly when working with established scientific and biostatistical workflows.

Data wrangling: tidyverse, readr, tidyr, and dplyr
Visualisation: ggplot2 and related graphical packages
Bioinformatics: Bioconductor packages for omics and biological-data analysis
Statistical testing, multivariate analysis, exploratory analysis, and reproducible reporting

SQL and Relational Databases

I use SQL to query, filter, join, aggregate, and restructure relational data. I work with SQLite and SQLiteStudio for local database development, testing, and analysis, and integrate SQL outputs with Python, Excel, dashboards, and reporting workflows.

Creation and querying of relational tables
Filtering, grouping, joins, subqueries, aggregation, and data-quality checks
Preparation of structured datasets for statistical analysis and machine learning
Integration of database outputs with Python, R, Excel, Tableau, and Power BI

Web Development

I develop and maintain websites using HTML, CSS, and JavaScript, and also work with WordPress and Elementor for content-managed websites. My work covers design, responsive layout, accessibility, interactive components, forms, deployment, maintenance, and client handover.

Hand-coded responsive websites using HTML5, CSS3, and vanilla JavaScript
Interactive navigation, tabs, forms, cards, media, and accessible page structures
WordPress and Elementor page editing, service-page creation, menu updates, and post publishing
Deployment through GitHub Pages, cPanel, WebCentral, and hosted web services
Website testing, content maintenance, link checking, and browser troubleshooting

APIs, Chatbots and Interactive Applications

I develop lightweight applications and interfaces that make analytical results and website information easier to access. This includes Python APIs, retrieval-based chatbots, and interactive user interfaces.

FastAPI: development of secure Python endpoints for website chatbot requests
Pydantic: structured input and output validation for API requests
Gradio: development of interactive interfaces for machine learning and recommendation workflows
Retrieval systems: rule-based intent recognition, fuzzy FAQ matching, TF-IDF retrieval, and semantic search
Frontend integration: connection of JavaScript widgets to separately deployed Python backends
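
The TF-IDF retrieval step mentioned above can be sketched with the standard library alone. This is an illustrative example, not the deployed chatbot backend: the function names, the smoothed IDF formula, and the sample FAQ entries are all assumptions made for the sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    """Build an L2-normalised TF-IDF vector, ignoring unknown tokens."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(documents):
    """Return (idf, vectors) for a list of documents."""
    token_lists = [tokenize(d) for d in documents]
    n = len(documents)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [tfidf_vector(tokens, idf) for tokens in token_lists]


def best_match(query, documents, idf, vectors):
    """Return the document with the highest cosine similarity to the query."""
    q = tfidf_vector(tokenize(query), idf)
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    best = max(range(len(documents)), key=scores.__getitem__)
    return documents[best], scores[best]


if __name__ == "__main__":
    # Hypothetical FAQ entries, for illustration only.
    faqs = [
        "What services do you offer?",
        "How can I contact you?",
        "Where are your GitHub repositories?",
    ]
    idf, vectors = build_index(faqs)
    print(best_match("contact details", faqs, idf, vectors))
```

Because the vectors are normalised, the dot product is the cosine similarity. A production system would typically add a minimum-score threshold so that unrelated questions fall through to another handler.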

Development Environments and Reproducible Workflows

I use a combination of notebook, IDE, command-line, and environment-management tools to create, test, document, and maintain reproducible analytical and web-development projects.

Jupyter Notebook and JupyterLab: exploratory analysis, documented workflows, figures, and reports
VS Code: Python, HTML, CSS, JavaScript, JSON, and configuration-file development
Anaconda Navigator and Conda: environment and dependency management
IPython: interactive Python development and debugging
Bash and Git Bash: command-line file handling, automation, Git operations, and project management
Docker: foundational use of containerised environments for portable and reproducible applications

Git, GitHub and Version Control

I use Git and GitHub to manage code versions, document analytical workflows, publish project resources, deploy websites, and make code, reports, figures, and selected datasets openly accessible.

Creation and maintenance of structured repositories
Version tracking, commits, branch-aware workflows, and change documentation
Publication of Jupyter notebooks, Python scripts, reports, figures, and data resources
GitHub Pages deployment for static websites
Coordination of linked website and backend repositories

Hosting, Cloud and Deployment Platforms

I deploy and maintain analytical and web resources across static hosting, managed application hosting, shared web hosting, and cloud-learning environments.

GitHub Pages: deployment and maintenance of static HTML, CSS, and JavaScript websites
Render: deployment and monitoring of Python FastAPI backends
WebCentral and cPanel: website deployment, file management, staging, domains, email, and PHP configuration
AWS: developing practical knowledge of storage, compute, networking, databases, security, serverless services, and cloud operations through AWS Educate and hands-on practice
Amazon SageMaker Studio Lab: cloud-based notebook experimentation for Python and machine learning

Generative AI and AI-Assisted Workflows

I use generative AI tools as structured assistants within coding, scientific, analytical, writing, troubleshooting, and web-development workflows. I retain responsibility for validating code, checking factual accuracy, reviewing outputs, protecting confidential information, and ensuring that final deliverables remain technically sound and fit for purpose.

ChatGPT: code development, debugging, analytical planning, scientific editing, and workflow refinement
Microsoft Copilot: AI-assisted productivity, drafting, and technical support
LM Studio: local execution of language models for privacy-sensitive or confidential tasks
Prompt engineering, iterative refinement, output validation, and human-in-the-loop quality control
Use of AI to accelerate work without replacing domain expertise, statistical judgement, or scientific interpretation

Programming languages, Python and R libraries, development environments, databases, web technologies, hosting platforms, version-control tools, and artificial intelligence applications used in analytical and development projects

Most of my Python, SQL, R, HTML, CSS, and JavaScript projects are documented through GitHub repositories. These include programs and notebooks for proteogenomics, multi-omics integration, metabolomics, biomarker discovery, machine learning, NLP, publication mining, recommendation systems, in silico protein digestion, website development, chatbot retrieval, and multimedia automation.

Explore my GitHub repositories, source code, datasets, notebooks, and analytical reports

Technical highlights: Python • pandas • NumPy • SciPy • scikit-learn • TensorFlow • PyTorch • XGBoost • NLP • Hugging Face • BERTopic • FastAPI • R • Bioconductor • SQL • SQLite • HTML5 • CSS3 • JavaScript • WordPress • Elementor • JupyterLab • VS Code • Conda • Git • GitHub • GitHub Pages • Render • WebCentral • cPanel • AWS • Docker • ChatGPT • Copilot • LM Studio

Statistical Analyses, Machine Learning and Natural Language Processing

I design and implement statistical, machine learning, and natural language processing workflows for biological, experimental, behavioural, textual, and complex multidimensional datasets. My work spans the full analytical lifecycle, from defining the research question and retrieving data through data preparation, modelling, validation, interpretation, visualisation, and reporting.

I combine statistical rigour with domain knowledge and reproducible coding practices to ensure that analytical outputs are accurate, interpretable, and relevant to the scientific or business question being addressed.

Analytical Workflow Design

I begin by translating a research, technical, or business question into a structured analytical plan. This includes identifying the appropriate data, defining variables and outcomes, selecting suitable statistical or machine learning methods, and establishing validation and reporting criteria.

Clarification of objectives, hypotheses, response variables, predictors, and decision criteria
Assessment of dataset structure, sample size, experimental design, and potential sources of bias
Selection of appropriate statistical, machine learning, or NLP methods
Definition of reproducible preprocessing, modelling, validation, and interpretation steps
Alignment of analytical outputs with scientific, operational, publication, or stakeholder needs

Data Retrieval, Integration and Preparation

I retrieve, combine, clean, and restructure data from spreadsheets, databases, public repositories, scientific instruments, websites, APIs, and text documents. I prepare robust analysis-ready datasets while preserving metadata, provenance, and reproducibility.

Import and integration of CSV, Excel, SQL, JSON, tabular, biological, and text-based datasets
Variable harmonisation, identifier matching, reshaping, merging, and metadata reconciliation
Detection and correction of duplicated, inconsistent, malformed, or missing records
Data-type validation, unit standardisation, date parsing, and categorical encoding
Preparation of transparent data dictionaries and reproducible preprocessing workflows

Data Quality Control and Preprocessing

I evaluate data quality before modelling and apply transformations appropriate to the dataset, analytical method, and experimental context.

Missing-value exploration, imputation assessment, and completeness profiling
Outlier detection using statistical, distance-based, density-based, and model-based methods
Normalisation and scaling using log, Z-score, Pareto, quantile, robust, and Yeo–Johnson transformations
Batch-effect detection, drift assessment, variance filtering, and noise reduction
Feature engineering, encoding, dimensionality reduction, and creation of modelling-ready variables
Quality-control checks before and after each major processing step
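
Two of the scaling methods named above differ only in their denominator: Z-score scaling divides the mean-centred values by the standard deviation, whereas Pareto scaling divides by its square root, which shrinks large-variance features less aggressively. The sketch below shows both for a single feature, using the sample standard deviation. The function names are hypothetical, and the log offset of 1 is an assumption rather than a recommendation.

```python
import math
import statistics


def log_transform(values, offset=1.0):
    """Log2-transform values after adding an offset (assumed to be 1)."""
    return [math.log2(v + offset) for v in values]


def z_score(values):
    """Centre on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def pareto_scale(values):
    """Centre on the mean and divide by the square root of the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / math.sqrt(sd) for v in values]


if __name__ == "__main__":
    intensities = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(z_score(intensities))
    print(pareto_scale(intensities))
```

After Z-score scaling the feature has unit standard deviation; after Pareto scaling its standard deviation equals the square root of the original one.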

Descriptive Statistics and Exploratory Data Analysis

I use descriptive statistics and exploratory visualisation to understand data distributions, identify trends and anomalies, assess relationships, and guide subsequent modelling decisions.

Summary statistics including mean, median, variance, quartiles, standard deviation, and standard error
Distribution analysis using histograms, density plots, box plots, violin plots, and QQ plots
Inspection of skewness, heteroscedasticity, multimodality, and extreme values
Correlation matrices, scatterplots, pair plots, heat maps, treemaps, and volcano plots
Group-wise comparison, stratification, temporal analysis, and pattern discovery
Identification of candidate variables, confounders, interactions, and modelling constraints

Statistical Testing and Inference

I apply univariate and multivariate statistical methods to test hypotheses, estimate effects, compare groups, and quantify relationships while accounting for assumptions and multiple testing.

Correlation and association analysis
Linear regression and general trend modelling
Analysis of variance, including one-way, multi-factor, and interaction-based ANOVA
Parametric and non-parametric group comparisons
Effect-size estimation, confidence intervals, and uncertainty-aware interpretation
Multiple-testing correction and false-discovery-rate control
Assessment of assumptions, residuals, variance structure, and model adequacy
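
False-discovery-rate control is often implemented with the Benjamini-Hochberg procedure. The sketch below is a minimal standard-library version for illustration, and the function name is hypothetical. In practice an established implementation, such as the one in statsmodels, would normally be preferred.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each adjusted value is the smallest FDR threshold at which the corresponding test would be declared significant.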

Multivariate Analysis and Structure Discovery

I use multivariate methods to reduce dimensionality, detect hidden structure, compare samples, identify correlated features, and reveal dominant sources of variation.

Principal component analysis
Partial least squares and related supervised projection methods
Linear discriminant analysis
Hierarchical clustering analysis
k-means clustering and self-organising maps
UMAP and t-SNE for nonlinear dimensionality reduction
Cluster validation, stability assessment, and biological or operational interpretation

Supervised Machine Learning

I develop supervised models for classification, regression, ranking, prioritisation, and prediction. Model choice is guided by dataset size, feature structure, interpretability requirements, and the intended use of the output.

Linear and logistic regression
Regularised models including Lasso, Ridge, and Elastic Net
Support vector machines
Decision trees and Random Forest models
Gradient boosting using XGBoost, LightGBM, and related methods
k-nearest neighbours and naïve Bayes classifiers
Multi-layer perceptrons and other neural-network models
Ensemble and stacked models for improved robustness and predictive performance

Unsupervised Learning and Anomaly Detection

I use unsupervised learning to identify naturally occurring groups, latent structure, unusual observations, and complex relationships when a predefined response variable is unavailable.

k-means and hierarchical clustering
Self-organising maps
Density-based clustering using DBSCAN and HDBSCAN
Dimensionality reduction using PCA, UMAP, and t-SNE
Isolation Forest and related anomaly-detection approaches
Similarity-based grouping and nearest-neighbour analysis
Cluster profiling, validation, and interpretation of latent sample groups

Natural Language Processing

I design NLP workflows for scientific documents, publication metadata, website content, profile descriptions, witness reports, free-text survey fields, and other unstructured text collections.

Text cleaning, tokenisation, stop-word removal, stemming, and lemmatisation
Regular-expression extraction and rule-based text parsing
Bag-of-words and TF-IDF feature generation
Keyword analysis, phrase extraction, and corpus profiling
Topic modelling using LDA, NMF, and BERTopic
Sentence embeddings, semantic similarity, and document clustering
Text classification, prioritisation, retrieval, and recommendation workflows
Integration of textual and structured variables within unified machine learning pipelines

Deep Learning, Transformers and Embeddings

I use deep-learning and transformer-based methods when dataset size, complexity, and project objectives justify their use, while maintaining a focus on model validation and interpretability.

Multi-layer perceptron neural networks
Recurrent neural networks including LSTM and GRU architectures
Transformer-based models including BERT and DistilBERT
Hugging Face models and pipelines
Sentence-transformer embeddings for semantic comparison and retrieval
Transfer learning and fine-tuning where appropriate
Comparison of deep-learning outputs with simpler statistical and machine learning baselines

Feature Selection and Biomarker Discovery

A substantial part of my work involves identifying variables, peptides, proteins, metabolites, genes, or text-derived features that contribute most strongly to group separation, prediction, or biological interpretation.

Variance filtering, correlation-based screening, and statistical significance testing
Effect-size ranking and consistency-based selection rules
Model coefficients and regularisation-based feature selection
Random Forest and gradient-boosting feature importance
Recursive feature elimination and cross-validated subset selection
Integration of statistical, machine learning, and domain-based evidence
Prioritisation of interpretable biomarkers for validation and reporting

Model Validation and Performance Evaluation

I evaluate models using methods appropriate to the analytical question and data structure, with particular attention to data leakage, overfitting, class imbalance, grouped samples, and generalisability.

Train, validation, and test splitting
k-fold, stratified, repeated, and grouped cross-validation
Hyperparameter optimisation and threshold tuning
Classification evaluation using accuracy, precision, recall, specificity, F1-score, and AUC
Regression evaluation using R², RMSE, MAE, residuals, and calibration behaviour
Clustering evaluation using silhouette score, stability, and structural coherence
Confusion matrices, residual diagnostics, learning curves, and error analysis
Comparison with baseline and simpler reference models

Interpretability and Scientific Reasoning

I place strong emphasis on understanding why a model produces a result rather than reporting predictive performance alone. I combine model outputs with statistical evidence, domain knowledge, pathway information, and biological or operational context.

Interpretation of coefficients, effect sizes, residuals, and feature importance
Comparison of model outputs across methods and validation folds
Assessment of whether identified relationships are plausible, stable, and actionable
Integration with bioinformatics, pathway, network, and literature evidence
Clear distinction between association, prediction, and causal interpretation
Translation of complex outputs into testable hypotheses and decision-oriented conclusions

Reporting, Visualisation and Delivery

I deliver analytical work in formats suited to technical specialists, researchers, clients, reviewers, and non-technical stakeholders.

Reproducible Python and R notebooks and scripts
Cleaned datasets, feature tables, model outputs, and quality-control summaries
Publication-ready figures, tables, supplementary files, and statistical reports
Interactive Tableau, Power BI, or prototype application outputs
Technical slide decks, executive summaries, and interpretation-focused reports
GitHub repositories containing code, documentation, figures, and selected datasets
Clear recommendations, limitations, and proposed next steps

Software and Analytical Environments

I use both specialised statistical software and open-source programming environments, selecting the most appropriate combination for each project.

Python: pandas, NumPy, SciPy, statsmodels, scikit-learn, XGBoost, LightGBM, TensorFlow, PyTorch, and NLP libraries
R: tidyverse, ggplot2, Bioconductor, and statistical packages
Specialist software: Genedata Expressionist Analyst, SAS, JMP, and Statistica
Development environments: Jupyter Notebook, JupyterLab, VS Code, Anaconda, and Conda environments
Reporting and visualisation: Matplotlib, Plotly, Tableau, Power BI, Excel, and PowerPoint

End-to-end analytical workflow from defining the research question through data retrieval, data handling, statistical analysis, machine learning modelling, analytics, interpretation, and reporting

Technical highlights: exploratory data analysis • statistical inference • ANOVA • regression • PCA • LDA • clustering • Random Forest • XGBoost • support vector machines • neural networks • NLP • TF-IDF • topic modelling • transformers • embeddings • feature selection • biomarker discovery • cross-validation • model interpretation • reproducible reporting

Bioinformatics and Data Mining

I have more than two decades of experience applying bioinformatics, biological databases, sequence analysis, pathway analysis, network biology, proteomics, metabolomics, and genome-browser tools to interpret complex life-science datasets. My work spans plant science, microbiology, dairy science, medicinal cannabis, wheat proteogenomics, multi-omics integration, biomarker discovery, and scientific data mining.

I use these resources to move from raw experimental outputs to biologically meaningful conclusions by combining sequence evidence, functional annotation, structural information, pathways, molecular interactions, genome context, and published biological knowledge.

Sequence Analysis and Genome Annotation

I use sequence-analysis and genome-annotation tools to identify proteins and genes, compare sequences, validate peptide mappings, inspect gene models, and investigate genome structure.

NCBI BLAST: protein and nucleotide similarity searches, sequence identification, and comparative analysis
BioEdit: sequence inspection, editing, alignment review, and preparation of sequence datasets
Ensembl and Ensembl Plants: gene, transcript, protein, chromosome, and comparative-genomics information
Gramene: plant genome, gene, pathway, and comparative-genomics resources
TAIR: Arabidopsis gene annotation, locus information, mutant evidence, expression data, and functional interpretation
PMN: plant metabolic pathways, enzymes, metabolites, and pathway-level interpretation

Genome Browsers and Proteogenomics

I use genome browsers to visualise experimental evidence in genomic context, inspect gene structures, compare annotations, and evaluate peptide support for coding regions and transcript isoforms.

JBrowse and Apollo: visualisation of peptide-to-genome mappings, exon structure, transcript isoforms, and annotation evidence
Integrated Genome Browser: inspection of genomic tracks and aligned biological features
GFF3-based workflows: programmatic reconstruction of peptide coordinates from genes, transcripts, CDS features, and protein annotations
BED track generation: preparation of browser-ready BED6 and BED12 files for genome visualisation
Translation validation: confirmation that projected genomic coordinates reproduce the expected peptide sequence

In my recent wheat proteogenomics work, I developed a genome-guided pipeline to project peptide evidence onto the wheat reference genome, distinguish within-exon from exon-spanning peptides, validate translated genomic sequence, and generate Apollo/JBrowse tracks for annotation review.
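
The distinction between within-exon and exon-spanning peptides can be illustrated with a simplified coordinate projection. The sketch below is not the pipeline described above: it assumes a forward-strand transcript, CDS blocks given as 1-based inclusive (start, end) pairs in GFF3 style, and a first block in phase 0. It omits reverse-strand handling and translation validation, and all function names are hypothetical.

```python
def project_peptide(cds_blocks, aa_start, aa_length):
    """Map a peptide (1-based residue start, length) to genomic blocks.

    cds_blocks: sorted (start, end) pairs, 1-based inclusive, forward
    strand, first block in phase 0 (simplifying assumptions).
    """
    nt_start = (aa_start - 1) * 3        # 0-based offset in the spliced CDS
    nt_end = nt_start + aa_length * 3    # exclusive end offset
    projected = []
    consumed = 0                         # CDS bases covered by earlier blocks
    for start, end in cds_blocks:
        block_len = end - start + 1
        lo = max(nt_start, consumed)
        hi = min(nt_end, consumed + block_len)
        if lo < hi:  # the peptide overlaps this CDS block
            projected.append((start + lo - consumed, start + hi - consumed - 1))
        consumed += block_len
    if consumed < nt_end:
        raise ValueError("peptide extends beyond the annotated CDS")
    return projected


def classify_peptide(projected):
    """Label a projection by the number of genomic blocks it covers."""
    return "within-exon" if len(projected) == 1 else "exon-spanning"


if __name__ == "__main__":
    cds = [(101, 130), (201, 260)]       # hypothetical two-exon CDS
    for aa_start, aa_length in [(3, 5), (9, 4)]:
        blocks = project_peptide(cds, aa_start, aa_length)
        print(aa_start, aa_length, blocks, classify_peptide(blocks))
```

The projected blocks map naturally onto the block structure of a BED12 record, after conversion from 1-based inclusive to BED's 0-based half-open coordinates.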

Protein Identification, Annotation and Structure

I use protein databases and structural resources to identify proteins, retrieve curated annotations, evaluate domain and functional information, and investigate experimentally determined or predicted structures.

UniProtKB: protein names, functions, domains, subcellular localisation, catalytic activity, and cross-references
ExPASy: protein sequence analysis, physicochemical properties, proteomics resources, and functional annotation
PDBe and RCSB PDB: experimentally determined protein structures, ligands, domains, and structural context
AlphaFold Protein Structure Database: predicted protein structures and structural interpretation of poorly characterised proteins
EMBL-EBI resources: integrated sequence, protein, structure, pathway, and functional databases
Mercator4: assignment of plant proteins and genes to functional categories for pathway-level interpretation

Gene Ontology and Functional Enrichment

I use Gene Ontology and enrichment-analysis tools to determine which molecular functions, biological processes, cellular components, and pathways are over-represented among selected genes or proteins.

Gene Ontology and AmiGO: controlled biological terminology, gene annotations, and ontology exploration
ShinyGO: functional enrichment, pathway analysis, gene characteristics, and network visualisation
AgriGO: Gene Ontology enrichment analysis for plant datasets
REVIGO: reduction and visualisation of redundant Gene Ontology terms
Pathway Tools: pathway reconstruction, genome-scale metabolic interpretation, and organism-specific pathway databases

I routinely integrate enrichment results with experimental direction, effect size, clustering patterns, protein interactions, and domain knowledge rather than relying on enrichment scores alone.

Pathway and Multi-Omics Integration

I use pathway databases and multi-omics platforms to connect genes, proteins, and metabolites into coherent biological processes and to compare responses across molecular layers.

KEGG: metabolic pathways, signalling pathways, genes, enzymes, compounds, and pathway maps
Reactome: curated pathways, reactions, molecular events, and pathway enrichment
BioCyc: organism-specific pathways, metabolic networks, reactions, enzymes, and metabolites
PaintOmics: integration and visualisation of transcriptomics, proteomics, and metabolomics data on pathways
MetaboAnalyst: metabolomics preprocessing, statistical analysis, biomarker exploration, enrichment, and pathway analysis

In recent multi-omics projects, I integrated transcriptomic, proteomic, metabolomic, and phenotypic data to identify biomarkers, classify response profiles, assess genetic rescue, and construct interpretable models of biological regulation.

Protein Interaction and Network Biology

I use interaction databases and network-visualisation platforms to identify functional modules, interaction partners, central proteins, enriched biological processes, and candidate mechanisms.

STRING: known and predicted protein-protein interactions, functional associations, and enrichment analysis
Cytoscape: construction, styling, filtering, annotation, and interpretation of biological networks
Integration of interaction evidence with differential abundance, biomarker rankings, pathway membership, and functional annotation
Identification of connected modules, hub proteins, and candidate regulatory relationships
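
As a simple illustration of hub identification, the sketch below ranks proteins by their number of distinct interaction partners (degree). It assumes an undirected edge list; the function name and protein identifiers are hypothetical, and dedicated tools such as Cytoscape offer many richer centrality measures.

```python
from collections import Counter


def rank_by_degree(edges):
    """Rank nodes by number of distinct interaction partners."""
    # Treat interactions as undirected and ignore duplicates and self-loops.
    unique_pairs = {frozenset(pair) for pair in edges if pair[0] != pair[1]}
    degree = Counter()
    for pair in unique_pairs:
        for node in pair:
            degree[node] += 1
    # Highest degree first; ties broken alphabetically for stable output.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical protein identifiers, not real interaction data.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P3", "P2")]
ranking = rank_by_degree(toy_edges)
print(ranking[0])
```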

Proteomics and Peptide-Centric Resources

I use proteomics databases and software to identify proteins and peptides, inspect peptide evidence, validate protein assignments, compare datasets, and support biological interpretation.

PeptideAtlas: peptide observations, proteomics evidence, and peptide-to-protein mapping
Human Protein Atlas: tissue expression, cellular localisation, protein expression, and pathology-related information
Trans-Proteomic Pipeline: processing and statistical validation of mass-spectrometry-based proteomics data
ProteoWizard: conversion, inspection, and processing of mass-spectrometry data formats
Galaxy: reproducible, browser-based analysis workflows and data processing
MassIVE, ProteomeXchange, and PRIDE: discovery, retrieval, deposition, and reuse of public proteomics datasets

My proteomics work includes protein identification, peptide filtering, quantitative analysis, biomarker discovery, proteogenomics, public-data reuse, and preparation of large datasets for reproducible analysis.

Metabolite and Chemical Annotation

I use metabolite and chemical databases to identify compounds, retrieve molecular properties, investigate biochemical roles, and connect metabolomics features with pathways and published evidence.

HMDB: human metabolites, biochemical properties, pathways, spectra, and disease associations
PubChem: chemical structures, identifiers, molecular properties, synonyms, and bioactivity information
KEGG Compound and BioCyc: metabolite participation in biochemical pathways and reactions
MetaboAnalyst: metabolite annotation, statistical analysis, enrichment, and pathway interpretation

Phylogenetics and Evolutionary Analysis

I use evolutionary and phylogenetic tools to compare sequences, examine relatedness, and support interpretation of gene and protein families.

MEGA: multiple-sequence analysis, phylogenetic tree construction, evolutionary comparison, and visualisation
Comparison of orthologues, paralogues, conserved regions, and sequence divergence
Integration of evolutionary evidence with protein function and biological context

Data Mining and Evidence Integration

My bioinformatics approach combines database searching with computational analysis and critical biological interpretation. I cross-check information across multiple resources because annotations, identifiers, pathway assignments, and interaction evidence can differ between databases.

Identifier mapping across genes, transcripts, proteins, peptides, metabolites, and database accessions
Manual and programmatic retrieval of annotations from public biological resources
Integration of experimental results with sequence, structure, pathway, interaction, and literature evidence
Filtering of low-confidence, redundant, obsolete, or unsupported annotations
Development of reproducible Python workflows for large-scale biological data mining
Translation of complex bioinformatics outputs into testable biological hypotheses and clear scientific narratives
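
The cross-checking step described above can be sketched as a comparison of two identifier-to-annotation mappings. This is an illustrative example only: the function name, accessions, and labels are hypothetical, and a real workflow would first map identifiers to a common namespace.

```python
def cross_check(source_a, source_b):
    """Compare identifier -> annotation mappings from two resources."""
    report = {"agree": [], "conflict": [], "only_a": [], "only_b": []}
    for identifier in sorted(set(source_a) | set(source_b)):
        in_a, in_b = identifier in source_a, identifier in source_b
        if in_a and in_b:
            same = source_a[identifier] == source_b[identifier]
            report["agree" if same else "conflict"].append(identifier)
        else:
            report["only_a" if in_a else "only_b"].append(identifier)
    return report


# Hypothetical accessions and labels, used only to show the logic.
database_a = {"ACC001": "kinase", "ACC002": "transporter", "ACC003": "unknown"}
database_b = {"ACC001": "kinase", "ACC002": "hydrolase", "ACC004": "chaperone"}
report = cross_check(database_a, database_b)
print(report["conflict"])
```

Entries listed under "conflict" or found in only one source are the ones that warrant manual review.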

Bioinformatics databases and tools used for sequence analysis, genome browsing, protein annotation, structure prediction, pathway analysis, metabolomics, proteomics, interaction networks, and biological data mining

Technical highlights: BLAST • JBrowse • Apollo • GFF3 • BED tracks • Ensembl Plants • Gramene • TAIR • UniProtKB • ExPASy • PDBe • RCSB PDB • AlphaFold • Gene Ontology • ShinyGO • AgriGO • REVIGO • KEGG • Reactome • BioCyc • PaintOmics • STRING • Cytoscape • MetaboAnalyst • PeptideAtlas • ProteoWizard • Trans-Proteomic Pipeline • HMDB • PubChem • MEGA • multi-omics integration • proteogenomics

Data Visualisation

I create clear, accurate, and publication-ready visualisations to explore data, reveal patterns, compare groups, communicate uncertainty, and translate complex analytical results into accessible insights. My visualisation work spans scientific research, machine learning, bioinformatics, dashboards, technical reports, conference presentations, websites, and stakeholder communication.

I select chart types according to the structure of the data and the message that needs to be communicated, rather than applying a single visual style to every problem. My workflow includes exploratory plotting, iterative refinement, accessibility checks, annotation, and preparation of final outputs for screen, print, publication, or interactive use.

Visualisation Strategy and Chart Selection

I begin by defining the purpose of the visualisation: exploration, comparison, explanation, monitoring, publication, or decision support. I then select the chart type, level of detail, annotation, and format best suited to the audience and analytical objective.

Match chart type to variable type, dimensionality, sample size, and analytical question
Distinguish exploratory graphics from presentation-ready explanatory figures
Prioritise clarity, proportional representation, legibility, and meaningful visual hierarchy
Avoid unnecessary decoration, misleading scales, overcrowding, and unsupported interpretation
Adapt figures for scientific, technical, executive, educational, and public audiences

Distribution and Data Quality Visualisation

I use distribution plots to understand the shape, spread, central tendency, variability, skewness, multimodality, outliers, and quality of numerical data before formal modelling.

Histograms and density plots for frequency and distribution shape
Box-and-whisker plots for medians, quartiles, spread, and outlier inspection
Violin plots for comparing full distributions across groups
QQ plots for checking normality and distributional assumptions
Ridgeline plots for comparing multiple distributions
Strip plots, swarm plots, and jittered points for displaying individual observations
Missing-data maps and quality-control summaries
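
As an illustration of the outlier inspection that a box-and-whisker plot supports, the sketch below computes quartiles and flags points outside Tukey's fences. It is a minimal example under stated assumptions: the function name and toy values are hypothetical, and the 1.5 multiplier is the common convention rather than a fixed rule.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the quartiles and the points outside Tukey's fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1  # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    outliers = [v for v in values if v < low or v > high]
    return (q1, median, q3), outliers


# Toy measurements with one obviously extreme value.
quartiles, flagged = tukey_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(quartiles, flagged)
```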

Relationships, Associations and Correlations

I create visualisations that reveal relationships between variables, highlight trends, identify associations, and expose unusual or influential observations.

Scatterplots with regression lines, confidence intervals, and group overlays
Bubble plots for displaying three or more variables simultaneously
Correlation matrices and clustered heat maps
Pair plots and pair grids for multivariable relationship screening
Hexbin and density plots for large or overlapping point clouds
Volcano plots for displaying effect size and statistical significance
Residual and diagnostic plots for model assessment
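
To illustrate how a volcano plot combines effect size with statistical significance, the sketch below converts one feature into plot coordinates and a colour category. The function name, cut-offs, and labels are assumptions chosen for the example; appropriate thresholds depend on the study.

```python
import math


def volcano_point(log2_fold_change, p_value, fc_cutoff=1.0, alpha=0.05):
    """Return (x, y, label) for one feature on a volcano plot."""
    x = log2_fold_change
    y = -math.log10(p_value)  # larger y means stronger statistical evidence
    significant = p_value < alpha
    if significant and x >= fc_cutoff:
        label = "up"
    elif significant and x <= -fc_cutoff:
        label = "down"
    else:
        # Either not significant or the effect is smaller than the cut-off.
        label = "below threshold"
    return x, y, label


# Hypothetical feature: four-fold increase with p = 0.001.
print(volcano_point(2.0, 0.001))
```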

Group Comparison and Statistical Results

I use comparative plots to show differences between categories, experimental conditions, treatments, genotypes, clusters, or model outputs.

Vertical and horizontal bar charts
Grouped, stacked, and 100% stacked bar charts
Dot plots, lollipop plots, and Cleveland-style comparisons
Box plots, violin plots, and raincloud-style comparisons
Forest plots for effect sizes and confidence intervals
Waterfall charts for sequential gains, losses, or contributions
Slope charts and dumbbell plots for paired or before-and-after comparisons

Composition and Part-to-Whole Relationships

I visualise how individual categories contribute to a total while ensuring that proportions remain easy to compare and interpret.

Stacked and 100% stacked bar charts
Stacked area charts for changing composition over time
Treemaps for hierarchical part-to-whole relationships
Pie and donut charts for limited numbers of clearly distinct categories
Marimekko-style and mosaic plots where appropriate
UpSet plots and Venn diagrams for set overlap and intersection analysis

Time-Series and Longitudinal Visualisation

I use temporal graphics to display trends, cycles, interventions, transitions, and changes over time.

Line charts for continuous temporal trends
Multi-series line charts with clear grouping and annotation
Area and stacked-area charts for cumulative or compositional change
Rolling averages, smoothing, and confidence bands
Event and intervention markers
Gantt charts for project plans, workflows, and timelines
Circular or radial plots for cyclical patterns and genomic layouts
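
The rolling average mentioned above can be sketched as a trailing moving mean. This is an illustrative implementation with a hypothetical function name; it drops the first incomplete windows, and for long floating-point series a library routine would be the more robust choice.

```python
def rolling_mean(values, window):
    """Trailing moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    means = [running / window]
    # Slide the window: add the newest value and drop the oldest one.
    for new, old in zip(values[window:], values):
        running += new - old
        means.append(running / window)
    return means


# Toy series only.
print(rolling_mean([1, 2, 3, 4, 5], 3))
```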

Multivariate and High-Dimensional Data

I visualise complex multidimensional datasets using projection, clustering, matrix, and network-based representations that help reveal latent structure and sample relationships.

PCA, LDA, UMAP, and t-SNE score plots
Clustered heat maps with row and column annotation
Self-organising map visualisations
Parallel-coordinate plots
Scatterplot matrices and pair grids
Chord diagrams and Sankey diagrams for flows and relationships
Network diagrams for biological interactions, pathways, and connectivity

Machine Learning and Model Evaluation Graphics

I create model-focused visualisations to assess predictive performance, compare algorithms, explain errors, and communicate the factors driving predictions.

Confusion matrices
ROC and precision-recall curves
Calibration plots
Observed-versus-predicted plots
Residual distributions and diagnostic plots
Feature-importance and coefficient plots
Learning curves and cross-validation summaries
Cluster-quality, silhouette, and anomaly-score plots
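
As an illustration of what a confusion matrix and a precision-recall point summarise, the sketch below counts the four outcomes for a binary classifier. The function names and toy labels are hypothetical; libraries such as scikit-learn provide equivalent, more general functions.

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def precision_recall(counts):
    """Precision and recall, returning 0.0 when a denominator is empty."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


# Toy labels only, not results from a real model.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 0, 1, 0, 1, 0, 0, 1]
counts = binary_confusion(truth, preds)
print(counts, precision_recall(counts))
```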

Genomics, Proteomics and Bioinformatics Visualisation

I create specialised scientific graphics for omics, genome annotation, biomarker discovery, pathway interpretation, and molecular interaction studies.

Genome-browser tracks and peptide-to-genome evidence views
Circular chromosome and genome-wide coverage plots
Volcano plots, heat maps, and expression-profile plots
Pathway diagrams and enrichment visualisations
Protein-interaction and gene-regulatory networks
Multi-omics integration figures
Biomarker-ranking and feature-selection plots
Publication schematics and analytical workflow diagrams

Text and Natural Language Processing Visualisation

I visualise patterns in textual data to summarise language use, themes, semantic structure, and relationships within document collections.

Word clouds for high-level term summaries
Keyword and phrase-frequency charts
Topic-distribution plots
pyLDAvis interactive topic exploration
Document and sentence-embedding projections
Semantic-cluster maps
Temporal topic-trend visualisation
Co-occurrence and concept networks
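
The counts behind a keyword-frequency chart or word cloud can be sketched as follows. This is a minimal illustration: the function name, the tiny stop-word list, and the example sentences are assumptions, and the simple tokeniser keeps only unaccented ASCII letters.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}  # tiny illustrative list


def top_terms(documents, n=10):
    """Return the n most frequent non-stop-word terms with their counts."""
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    # Most frequent first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Hypothetical sentences used only to show the logic.
docs = ["Protein networks and protein pathways", "Pathways of the wheat proteome"]
print(top_terms(docs, n=2))
```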

Geospatial Visualisation

I use geographic visualisation to examine spatial distributions, regional trends, hotspots, and relationships between location-based variables.

Latitude-and-longitude point maps
Bubble and proportional-symbol maps
Choropleth maps
Density and hotspot maps
Regional aggregation and comparison
Integration of geographic metadata with temporal, categorical, or predictive outputs

Interactive Dashboards and Analytical Interfaces

I develop interactive dashboards that allow users to filter, compare, and explore data without needing to interact directly with the underlying code.

Power BI: interactive reports, slicers, dashboards, and stakeholder-facing summaries
Tableau: exploratory dashboards, geographic views, filters, and public-facing visual analytics
Plotly: interactive Python charts with hover, zoom, and selection functionality
Gradio: prototype interfaces for machine learning and recommendation outputs
Export of dashboard views for reports, presentations, and publication support

Publication-Ready Figures and Scientific Storytelling

I prepare figures for manuscripts, reports, presentations, and supplementary materials with careful attention to consistency, resolution, annotation, and readability.

Multi-panel figure assembly and consistent labelling
Clear legends, axis titles, captions, units, and statistical annotations
Vector and high-resolution raster export
Journal-compliant figure sizing and formatting
Accessible palettes and sufficient contrast
Visual consistency across figures, tables, reports, and slide decks
Integration of analytical results into coherent visual narratives

Tools and Visualisation Environments

I use a combination of coding, dashboard, and office-based tools depending on the complexity, interactivity, publication requirements, and intended audience.

Python: Matplotlib, Seaborn, Plotly, Squarify, PyViz, NetworkX, WordCloud, and specialist plotting libraries
R: ggplot2 and related visualisation packages
Dashboards: Tableau and Power BI
Office tools: Excel and PowerPoint for accessible reporting and presentation outputs
Web delivery: HTML, CSS, JavaScript, and embedded interactive content

Examples of charts and visualisation tools used for distributions, comparisons, relationships, time series, multivariate analysis, geographic data, machine learning, and scientific reporting

Technical highlights: Matplotlib • Seaborn • Plotly • ggplot2 • Tableau • Power BI • Excel • histograms • box plots • violin plots • heat maps • volcano plots • PCA • UMAP • networks • Sankey diagrams • dashboards • geospatial maps • model diagnostics • publication-ready figures • scientific storytelling

Data Representation, Interpretation, and Storytelling

I transform complex scientific, analytical, and technical results into clear, accurate, and engaging narratives for researchers, clients, reviewers, stakeholders, students, and non-technical audiences. My work combines data interpretation, visual design, scientific reasoning, and structured communication to ensure that results are not only correct, but also meaningful, memorable, and useful for decision-making.

As an author of approximately 40 peer-reviewed scientific publications, an experienced conference presenter, journal reviewer, editor, consultant, and data scientist, I have extensive experience communicating complex findings through manuscripts, reports, dashboards, presentations, posters, lectures, websites, and publication-ready figures.

From Data to Meaning

I interpret analytical results in the context of the original research or business question, ensuring that statistical significance, biological relevance, practical importance, uncertainty, and limitations are considered together.

Identify dominant trends, patterns, associations, anomalies, and sources of variability
Distinguish meaningful signals from noise, artefacts, and technically driven effects
Relate model outputs to the experimental design and broader scientific context
Evaluate whether observed patterns are plausible, reproducible, and actionable
Translate quantitative results into clear interpretations and testable hypotheses
Separate association, prediction, inference, and causation in the final narrative

Scientific and Analytical Reasoning

My interpretation process integrates statistical outputs with domain expertise, biological databases, published literature, pathway information, model diagnostics, and experimental context.

Compare results across statistical, machine learning, and bioinformatics methods
Assess consistency across datasets, experimental groups, replicates, and validation procedures
Investigate unexpected findings rather than automatically excluding them
Use complementary evidence to strengthen or challenge an interpretation
Identify limitations, confounding factors, uncertainty, and alternative explanations
Develop coherent mechanistic or operational interpretations grounded in available evidence

Structuring a Clear Narrative

I organise results into a logical sequence that guides the audience from the original question to the evidence, interpretation, conclusion, and recommended next steps.

Define the central message before selecting figures or writing detailed results
Structure content around question, evidence, interpretation, and outcome
Prioritise the most important findings and remove unnecessary detail
Use headings, signposting, summaries, and transitions to maintain narrative flow
Connect individual analyses into a coherent overall story
Conclude with practical implications, limitations, and future directions

Data Representation

I select visual formats that accurately represent the data while supporting rapid understanding. Depending on the audience and objective, I use charts, dashboards, diagrams, workflows, infographics, tables, or schematic models.

Charts for distributions, comparisons, relationships, trends, and model performance
Dashboards for interactive exploration and stakeholder reporting
Heat maps, networks, pathway diagrams, and genome tracks for scientific interpretation
Workflow diagrams for explaining analytical methods and project structure
Conceptual models for summarising biological mechanisms or analytical conclusions
Tables for precise numerical comparison and detailed supporting information
Infographics for communicating complex information to broad audiences

Scientific Figures and Conceptual Models

I create publication-ready scientific graphics that combine experimental results, molecular mechanisms, biological pathways, structural information, and explanatory annotation.

Biological pathway and mode-of-action diagrams
Multi-omics integration schematics
Gene, protein, metabolite, and pathway relationship models
Proteogenomics and genome-annotation workflow diagrams
Plant, microbial, and molecular interaction illustrations
Graphical abstracts and summary figures
Multi-panel figures combining plots, diagrams, annotations, and interpretation

Interactive Dashboards

I design dashboards that allow users to explore large or complex datasets through filters, comparisons, maps, summaries, and interactive visual elements.

Tableau: public-facing dashboards, exploratory analysis, geographic views, filters, and interactive comparisons
Power BI: project dashboards, scientific reporting, stakeholder summaries, and analytical monitoring
Python and Plotly: interactive charts, hover information, zooming, selection, and dynamic outputs
Clear visual hierarchy, accessible labelling, and audience-appropriate levels of detail
Export of dashboard findings into reports, presentations, and publication figures

Writing for Scientific and Technical Audiences

I prepare structured scientific and technical documents that communicate methods, results, interpretations, limitations, and conclusions with precision.

Peer-reviewed manuscripts and technical notes
Results and discussion sections
Abstracts, graphical summaries, and plain-language summaries
Reviewer responses and revision documents
Technical reports and analytical deliverables
Grant, project, and experimental-method documentation
Supplementary tables, figure legends, and reproducibility statements

Communication for Non-Technical Audiences

I adapt technical content for clients, managers, collaborators, students, and general audiences without sacrificing accuracy.

Replace unnecessary jargon with clear, audience-appropriate language
Explain methods through examples, analogies, diagrams, and concise summaries
Focus on implications, risks, opportunities, and recommended actions
Separate essential findings from supporting technical detail
Provide layered explanations so readers can choose the level of detail they need
Communicate uncertainty and limitations transparently

Conference Presentations and Seminars

I have presented research findings at scientific conferences, seminars, stakeholder meetings, and professional events. My presentations combine strong visual structure with a clear spoken narrative and audience-focused interpretation.

Design of scientific conference talks and invited seminars
Preparation of presentation slides, speaker notes, and visual summaries
Communication of proteomics, metabolomics, bioinformatics, plant science, dairy science, and data science results
Adaptation of content to specialist, interdisciplinary, and non-specialist audiences
Presentation of methods, findings, limitations, implications, and future directions

Finding the LMA needle in the wheat haystack proteome
Mining the wheat grain proteome
The power of three for shotgun proteomics
Proteomics tools for medicinal cannabis
Milk top-down proteomics
Proteomics rocks my world
MALDI Biotyper: an alternative for identifying microorganisms
Stagonospora nodorum effector mode of action in wheat
Fungal secretomes: not so secret anymore
Apprendre à faire du vin aux États-Unis (Learning to make wine in the United States)

Scientific Posters

I design scientific posters that present a complete research story within a limited visual space, using concise text, clearly ordered sections, high-quality figures, and prominent conclusions.

Logical reading order from context and objective to results and conclusions
Integration of data plots, photographs, diagrams, tables, and key messages
Concise language suited to rapid conference viewing
Consistent colour, typography, spacing, and visual hierarchy
High-resolution preparation for print and digital distribution

Finding the LMA needle in the wheat haystack proteome
Top-down, middle-down and bottom-up proteomics of medicinal cannabis
Optimisation of protein extraction from medicinal cannabis
Top-down proteomics investigation of age gelation in milk
Analysis of intact major milk proteins using LC-MS
Bottom-up and top-down analysis of milk proteins using LC-MS/MS
A proteomics approach to dissect SnToxA mode of action
The secretome of Laccaria bicolor
High-resolution analysis of fungal secretomes
Water-deficit-responsive proteins in poplar roots

Teaching and Scientific Lectures

I prepare lectures and educational materials that explain scientific principles, analytical methods, and experimental workflows in a structured and visually accessible manner.

Progressive explanation from foundational concepts to advanced applications
Use of diagrams, workflows, examples, and annotated data
Adaptation to different levels of scientific and technical knowledge
Integration of theory, experimental practice, data analysis, and interpretation

Analysis of peptides and proteins using mass-spectrometry-based proteomics
Protéomique quantitative : électrophorèse bidimensionnelle (Quantitative proteomics: two-dimensional electrophoresis)

Tableau Public Dashboards

I use Tableau to create interactive visual stories from scientific, social, geographic, and public datasets. These dashboards allow users to explore trends, filters, categories, maps, and model outputs directly.

Dating-app profiles, NLP, and machine learning
Metadata from approximately 120,000 scientific publications
Wildlife strikes involving aircraft in the United States
Tree census and household income in New York
Worldwide volcanic eruptions
Rotten Tomatoes films, genres, studios, and audience scores

Power BI Scientific Dashboards

I use Power BI to support scientific project analysis, explore large datasets, compare experimental conditions, and communicate major findings through interactive and exportable reports.

Safflower proteomics and metabolomics dashboards
Large-scale wheat proteomics dashboards

Client, Stakeholder and Project Reporting

I prepare reports and presentations that help collaborators and clients understand what was done, what the results mean, how confident the conclusions are, and what should happen next.

Executive summaries and key findings
Methods written at an appropriate technical level
Annotated charts, tables, diagrams, and model outputs
Interpretation of practical and scientific implications
Clear presentation of uncertainty, risks, assumptions, and limitations
Prioritised recommendations and next-step options
Delivery in Word, PDF, PowerPoint, dashboard, website, or notebook formats

Tools and Communication Environments

I use a broad combination of analytical, visual, writing, and presentation tools to produce outputs suited to each audience and delivery format.

Analysis and figures: Python, R, Matplotlib, Seaborn, Plotly, ggplot2, and Excel
Dashboards: Tableau and Power BI
Documents: Word, PowerPoint, PDF, and Jupyter Notebook
Scientific schematics: diagramming, image editing, and custom graphical assembly
Web communication: HTML, CSS, JavaScript, WordPress, and GitHub Pages
Versioned delivery: GitHub repositories, reports, supplementary files, and presentation archives

Examples of dashboards, scientific figures, biological models, publication graphics, and visual storytelling outputs created for data analysis and scientific communication

Technical highlights: data interpretation • scientific reasoning • visual storytelling • Tableau • Power BI • Python • R • dashboards • manuscripts • conference talks • posters • lectures • graphical abstracts • biological schematics • stakeholder reporting • publication-ready figures

Author, Editor and Reviewer

I have extensive experience across the complete scientific-publication lifecycle as an author, guest editor, and peer reviewer. I have co-authored approximately 40 peer-reviewed publications and contributed to manuscripts spanning plant science, proteomics, metabolomics, bioinformatics, dairy science, microbiology, and data analytics.

My publication work includes experimental and analytical design, data interpretation, manuscript preparation, figure and table development, journal selection, responses to reviewers, revision, and final publication. I also contribute to scientific quality assurance through journal reviewing and editorial leadership.

Scientific Authorship

I write and revise scientific manuscripts that translate complex experimental and analytical work into clear, structured, and defensible scientific narratives.

Development of manuscript structure, central argument, and publication strategy
Writing and revision of abstracts, introductions, methods, results, discussions, and conclusions
Integration of statistical, bioinformatics, machine learning, and biological interpretation
Preparation of figures, tables, supplementary files, graphical summaries, and figure legends
Clear reporting of experimental design, analytical workflows, validation, limitations, and reproducibility
Adaptation of manuscripts to journal scope, formatting requirements, and readership
Coordination of co-author input and consolidation of multiple rounds of revision

Manuscript Revision and Reviewer Responses

I have substantial experience revising manuscripts following peer review and preparing detailed, evidence-based responses to editors and reviewers.

Systematic assessment of reviewer comments and editorial requirements
Prioritisation of major scientific, analytical, and reporting revisions
Preparation of point-by-point response documents
Revision of text, analyses, figures, tables, supplementary files, and methods
Clear explanation of changes, retained decisions, and scientific justification
Consistency checking across the manuscript, figures, tables, data, and supplementary material
Final quality control before resubmission

Publication Planning and Journal Submission

I support the strategic and practical aspects of scientific publication, from identifying suitable journals to preparing submission materials.

Assessment of journal scope, audience, reputation, article type, and methodological fit
Preparation of cover letters and statements of significance
Development of concise titles, keywords, highlights, and graphical summaries
Review of authorship, contribution statements, acknowledgements, and data-availability statements
Preparation and checking of supplementary information
Adaptation of references, word counts, figures, and formatting to journal guidelines
Support for preprint, repository, and open-data deposition where appropriate

View my publications and citation record on Google Scholar

Journal highlights: Nature Communications • Journal of Proteome Research • Scientific Reports • GigaScience • Plant Physiology • Molecular Plant Pathology • New Phytologist • Journal of Dairy Science • Food Chemistry • Biomolecules • Proteomes • IJMS • PLOS ONE • Proteomics • Frontiers in Plant Science • Frontiers in Genetics • Plant, Cell & Environment • Functional & Integrative Genomics • Journal of Experimental Botany • Phytochemistry • NFS Journal

Representative scientific journals in which I have published peer-reviewed research

Editorial Leadership

I have acted as a guest editor and Research Topic editor for multiple scientific journals, helping define thematic scope, attract relevant submissions, coordinate peer review, and support publication of coherent collections.

Development of Special Issue and Research Topic concepts
Definition of scientific scope, aims, and target contributors
Invitation and coordination of contributing authors
Assessment of manuscript suitability and scientific relevance
Selection and coordination of expert reviewers
Evaluation of revisions and editorial recommendations
Contribution to introductory editorials and thematic synthesis

Edited Research Topics and Special Issues

Frontiers in Plant Science

How Can Secretomics Help Unravel the Secrets of Plant–Microbe Interactions?
Secretomics: More Secrets to Unravel on Plant–Fungus Interactions, Volume I
Secretomics: More Secrets to Unravel on Plant–Fungus Interactions, Volume II

Proteomes

Proteomics: Technologies and Their Applications

Biomolecules

Plant Adaptation to Their Biotic and Abiotic Environment Through the Lens of Secretomics
Sowing the Seed to Ensure the Future of Plant Proteomics: Commemorative Issue in Honour of Dr Dominique Job

International Journal of Molecular Sciences

State-of-the-Art Molecular Plant Sciences in Australia

Scientific journals for which I have served as guest editor or Research Topic editor

Peer Review

I have reviewed manuscripts for scientific journals since 2006. My reviews assess scientific validity, methodological rigour, analytical appropriateness, interpretation, novelty, reproducibility, and clarity of presentation.

Evaluation of experimental design, sample size, controls, and methodological suitability
Assessment of statistical analyses, bioinformatics, machine learning, and data interpretation
Verification that conclusions are supported by the evidence presented
Identification of missing controls, unclear methods, unsupported claims, and reporting inconsistencies
Assessment of figures, tables, supplementary information, and data availability
Constructive recommendations to improve scientific quality and readability
Clear distinction between essential revisions and optional improvements

Journals Reviewed For

Since 2006: Proteomics; Electrophoresis
Since 2008: New Phytologist; Annals of Forest Science
Since 2009: Journal of Proteome Research; Plant, Cell & Environment
Since 2011: Biotechnology and Molecular Biology Reviews; Journal of Agricultural and Food Chemistry
Since 2014: Frontiers in Plant Science
Since 2015: Animal Production Science; BMC Microbiology; PLOS ONE
Since 2016: Annals of Botany; Talanta
Since 2017: Journal of Microbiology
Since 2018: Bentham Science journals; BMC Plant Biology
Since 2019: Molecules; Proteomes; International Journal of Molecular Sciences
Since 2020: Biomolecules; Scientific Reports; Pathogens; Food Research International; Plants
Since 2021: Cells
Since 2022: Foods

Editorial and Review Principles

I approach authorship, editing, and peer review with a consistent emphasis on scientific integrity, transparency, fairness, and constructive communication.

Evidence-based and methodologically rigorous assessment
Respectful, specific, and actionable feedback
Confidential handling of unpublished research
Recognition of disciplinary and methodological context
Attention to reproducibility, data transparency, and reporting standards
Balanced evaluation of novelty, limitations, and practical contribution
Support for clearer and stronger science rather than criticism alone

Scientific Writing and Publication Tools

Writing and revision: Microsoft Word, Track Changes, comments, and document comparison
References: reference managers, journal databases, DOI records, and publication metadata
Figures and tables: Python, R, Excel, Power BI, Tableau, and PowerPoint
Collaboration: Teams, SharePoint, OneDrive, email, and shared document workflows
Repositories: GitHub, bioRxiv, Zenodo, MassIVE, ProteomeXchange, and other discipline-specific resources
Author profiles: Google Scholar, ORCID, Scopus, Web of Science, and institutional profiles

Technical highlights: scientific authorship • manuscript preparation • journal submission • peer review • guest editing • Special Issues • Research Topics • reviewer responses • publication strategy • scientific integrity • technical editing • figure and table preparation • Google Scholar • ORCID

AI-Assisted Professional Workflows and Prompt Engineering

I use generative artificial intelligence as a structured professional assistant across scientific research, data analysis, machine learning, coding, website development, technical writing, project documentation, and client communication.

My approach combines prompt engineering, domain expertise, iterative refinement, source verification, and human quality control. I use AI to accelerate complex work while retaining responsibility for the accuracy, interpretation, confidentiality, and suitability of every final output.

Prompt Engineering

I design prompts that provide sufficient context, define the task precisely, establish constraints, identify the intended audience, and specify the required format and level of detail.

Set the project context, purpose, audience, and expected outcome
Provide relevant text, code, files, results, screenshots, or background information
Define terminology, factual constraints, exclusions, and assumptions
Use zero-shot, one-shot, and few-shot prompting where appropriate
Break complex assignments into smaller, sequential subtasks
Request structured outputs such as code blocks, tables, reports, checklists, or replacement HTML
Refine prompts iteratively after reviewing each response
Validate final outputs against source material and project requirements
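
As an illustration of the structure described above, a prompt can be assembled from named components so that context, constraints, and output format are never omitted. This is a minimal sketch only: the function name `build_prompt` and its field labels are hypothetical, and the number of example pairs supplied determines whether the prompt is zero-shot, one-shot, or few-shot.

```python
def build_prompt(context, task, constraints, output_format, examples=()):
    """Assemble a structured prompt from its components.

    `examples` is a sequence of (input, output) pairs: an empty sequence
    gives a zero-shot prompt, one pair a one-shot prompt, and several
    pairs a few-shot prompt.
    """
    parts = [f"Context: {context}", f"Task: {task}", "Constraints:"]
    parts += [f"- {constraint}" for constraint in constraints]
    for number, (example_input, example_output) in enumerate(examples, start=1):
        parts.append(f"Example {number} input: {example_input}")
        parts.append(f"Example {number} output: {example_output}")
    parts.append(f"Output format: {output_format}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        context="Plant proteomics manuscript for a specialist journal",
        task="Shorten the abstract to 200 words",
        constraints=["Keep all numbers unchanged", "Do not add references"],
        output_format="Plain text, one paragraph",
    ))
```

Keeping the components separate makes iterative refinement easier, because a single constraint or example can be changed without rewriting the whole prompt.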

Prompt-engineering workflow from project context and source material to artificial intelligence assistance, human review and validated output

Working with ChatGPT

I use ChatGPT as an interactive assistant for complex professional projects, particularly when tasks require a combination of scientific knowledge, analytical reasoning, coding, editing, troubleshooting, and structured communication.

Scientific writing: revise manuscripts, abstracts, methods, results, discussions, figure legends, cover letters, contribution statements, and responses to reviewers
Data analysis: plan analytical workflows, assess assumptions, interpret outputs, compare methods, and translate statistical or machine-learning results into clear conclusions
Python development: write, debug, refactor, document, and validate code for data processing, machine learning, bioinformatics, visualisation, and automation
Web development: revise HTML, CSS, and JavaScript; troubleshoot responsive layouts; improve accessibility; and develop interactive website components
Bioinformatics: clarify computational pipelines, audit processing steps, interpret outputs, and improve descriptions of large-scale proteomics and proteogenomics workflows
Professional communication: draft emails, reports, website content, client recommendations, application documents, and technical guidance
Project documentation: create maintenance procedures, checklists, implementation guides, handover instructions, and reproducible workflows

My Interaction Workflow with ChatGPT

For substantial projects, I use a collaborative and iterative workflow rather than relying on a single prompt.

Explain the broader project and the purpose of the current task
Provide the original material and all relevant evidence
State what must remain unchanged and what needs improvement
Specify the desired tone, length, structure, and output format
Address one clearly bounded problem at a time
Review the response for factual, scientific, technical, and stylistic accuracy
Correct assumptions or clarify requirements when necessary
Request a complete final version after individual decisions are resolved
Test generated code or website changes in the appropriate environment
Retain final responsibility for all professional deliverables

Iterative professional collaboration with ChatGPT involving a project brief, draft response, feedback, revision, testing and final validated result

Using Google Gemini

I also use Google Gemini as an alternative AI assistant for drafting, comparison, brainstorming, summarisation, technical explanation, and independent review of selected outputs.

Compare alternative explanations or proposed approaches
Generate additional wording or structural options
Review whether an initial response has overlooked important considerations
Explore alternative coding, analytical, or communication strategies
Cross-check AI-generated suggestions rather than relying on a single system
Select and refine the output best supported by evidence and project requirements

Local AI with LM Studio

For confidential or privacy-sensitive work, I use LM Studio to run a language model locally on my computer rather than sending project material to a cloud-hosted AI service.

My current local model is Gemma 4 E4B, which can be run through LM Studio as an on-device language model. This provides a private environment for tasks involving unpublished documents, confidential assessments, internal reports, sensitive client material, or other information that should remain on the local computer.

Run prompts and model inference locally
Keep source documents and prompts on the computer
Review confidential scientific or technical documents
Draft preliminary summaries, comments, and structured assessments
Break long documents into manageable sections for local processing
Use carefully designed prompts to request consistent evaluation criteria
Manually verify outputs because smaller local models may have capability limitations
Transfer only approved, non-confidential conclusions into external workflows
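
Breaking a long document into manageable sections can be sketched as follows. This is an illustrative example rather than a fixed procedure: the function name `split_into_sections` and the 4,000-character default are assumptions, and a real limit would depend on the context window of the local model.

```python
def split_into_sections(text, max_chars=4000):
    """Group whole paragraphs into chunks of at most `max_chars` characters.

    Paragraphs are never split; a single paragraph longer than the limit
    is kept intact as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be submitted with the same evaluation prompt, which keeps the assessment criteria consistent across sections.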

Privacy-focused local artificial intelligence workflow using confidential documents, LM Studio and the Gemma 4 E4B language model on a local computer

Prompting for Scientific and Analytical Projects

My scientific prompts combine project context with experimental design, numerical results, analytical assumptions, manuscript requirements, and supporting source material.

Clarify the scientific question and study design
Provide exact counts, model results, validation criteria, and terminology
Distinguish observations from interpretation and speculation
Prevent unsupported claims and invented references
Check consistency between text, figures, tables, and supplementary files
Adapt writing to journal scope, article type, and word limits
Preserve scientific meaning while improving clarity and structure

Prompting for Coding and Technical Work

For coding tasks, I provide the existing script or code block, describe the expected behaviour, identify the error or limitation, and specify the required inputs and outputs.

Provide complete error messages and tracebacks
Preserve existing filenames, paths, column names, and workflow conventions
Request full replacement code when partial edits may create integration errors
Define expected validation checks and output files
Run the revised code and report the resulting behaviour
Compare outputs with known counts or quality-control criteria
Iteratively refine the solution until the workflow completes correctly

Prompting for Website Development

For website tasks, I provide the relevant HTML, CSS, JavaScript, screenshots, and a precise description of the required desktop and mobile behaviour.

Request copy-and-paste-ready replacement code
Preserve existing IDs, classes, image paths, and naming conventions
Describe the current problem and the expected user experience
Test changes on desktop, mobile, and different device orientations
Review navigation, accessibility, responsiveness, and content visibility
Confirm whether changes also require chatbot indexing or backend updates

Human Review and Quality Assurance

I do not treat AI-generated output as automatically correct. Every response is reviewed in relation to the original evidence, professional context, and intended use.

Verify scientific statements against source data and publications
Test generated code before adoption
Check calculations, labels, counts, filenames, and outputs
Review wording for accuracy, tone, and unsupported overstatement
Reject suggestions that conflict with evidence or project constraints
Compare alternative AI responses when additional review is valuable
Maintain version-controlled files and documented decisions

Professional Relevance

These AI-assisted workflows improve efficiency while preserving the scientific judgement, technical oversight, confidentiality, and critical reasoning required for high-quality professional work.

Accelerate drafting, analysis, coding, troubleshooting, and documentation
Structure complex multidisciplinary projects
Improve consistency across repeated workflows
Explore alternative solutions rapidly
Translate technical material for different audiences
Protect confidential information through local AI when required
Maintain human responsibility for all final decisions and outputs

Technical highlights: prompt engineering • generative AI • ChatGPT • Google Gemini • LM Studio • Gemma 4 E4B • local language models • privacy-aware AI • zero-shot prompting • one-shot prompting • few-shot prompting • iterative refinement • task decomposition • scientific editing • data analysis • Python development • machine learning • website development • human-in-the-loop review • quality assurance

Portfolio

Big Data Projects in Wheat Proteomics and Proteogenomics

I developed and applied a series of large-scale, reproducible analytical workflows to characterise the wheat grain proteome, investigate late-maturity alpha-amylase, and map mass-spectrometry-derived peptide evidence onto the wheat reference genome.

These connected projects progressed from laboratory-method optimisation and high-throughput proteome screening to biomarker discovery, genome annotation, genome-browser resource development, and machine learning. They required the integration of thousands of biological samples, hundreds of mass-spectrometry files, millions of peptide observations, genome annotations, statistical analyses, bioinformatics, and custom Python workflows.

Project Overview

Bread wheat, Triticum aestivum, has a large and complex hexaploid genome. Connecting experimentally observed peptides with genes, transcripts, and genomic coordinates provides valuable evidence for protein expression, gene-model validation, cultivar comparison, and annotation refinement.

Optimised protein extraction and digestion for wheat grain proteomics
Applied the workflow to 4,087 wheat grain samples
Processed and interpreted large peptide and protein datasets across diverse cultivars
Investigated biomarkers associated with late-maturity alpha-amylase
Mapped peptide evidence onto the wheat reference genome
Generated reusable data resources, Python workflows, dashboards, and genome-browser tracks
Extended the dataset through unsupervised and supervised machine learning

1. Optimising Wheat Grain Proteomics

Objective: Develop a rapid, accurate, and reproducible workflow for extracting, digesting, identifying, and comparing proteins from large numbers of wheat grain samples.

Challenge: Wheat grain contains highly abundant storage proteins, starch, and other compounds that can interfere with protein extraction, digestion, peptide identification, and quantitative comparison.

Methods: Protein-extraction optimisation, enzymatic digestion, LC-MS/MS proteomics, peptide and protein identification, quality-control assessment, and comparative evaluation of analytical performance.

Established a reproducible workflow suitable for high-throughput wheat-grain analysis
Improved recovery and identification of diverse wheat proteins and peptides
Created a robust methodological foundation for the subsequent large-scale cultivar study
Produced an analytical strategy transferable to other complex plant tissues

Mining the Wheat Grain Proteome

Technical highlights: wheat grain • protein extraction • enzymatic digestion • LC-MS/MS • shotgun proteomics • analytical optimisation • quality control

Workflow developed for the large-scale protein profiling of thousands of wheat grains

2. Large-Scale Screening of 4,087 Wheat Grain Samples

Objective: Apply the optimised proteomics workflow at scale to characterise protein and peptide profiles across thousands of wheat samples and investigate molecular features associated with late-maturity alpha-amylase.

Dataset: Proteomic profiles generated from 4,087 wheat grain samples, representing a large and diverse experimental collection of cultivars and grain material.

Methods: High-throughput data processing, peptide and protein filtering, metadata harmonisation, descriptive statistics, multivariate analysis, clustering, biomarker discovery, functional annotation, pathway analysis, and interactive Power BI visualisation.

Built a large community resource for exploring the wheat grain proteome
Identified proteins and pathways associated with late-maturity alpha-amylase
Revealed coordinated changes in primary metabolism, protein synthesis, folding, and assembly
Detected responses involving phytohormones, defence, chromatin, ribosomes, and microtubules
Observed substantial effects on grain-storage proteins and carbohydrate metabolism
Generated analysis-ready outputs suitable for subsequent proteogenomic and machine-learning studies

Biological interpretation: Late-maturity alpha-amylase was associated with broad molecular remodelling rather than an isolated change in starch degradation. Affected grain showed evidence of altered central metabolism, gene-expression machinery, protein translation and folding, stress and defence responses, cellular organisation, storage-protein composition, and carbohydrate metabolism.

A Community Resource to Mass Explore the Wheat Grain Proteome and Its Application to the Late-Maturity Alpha-Amylase Problem
Power BI Dashboards Supporting the Wheat Study

Technical highlights: 4,087 wheat samples • large-scale proteomics • biomarker discovery • multivariate analysis • pathway interpretation • Power BI • late-maturity alpha-amylase

Large-scale wheat proteomics workflow covering thousands of grain samples, protein identification, bioinformatics, and visualisation for biomarker discovery

3. Sequence-Based Proteogenomic Mapping with tBLASTn

Objective: Map experimentally observed wheat peptides directly onto the reference genome to identify genomic regions supported by mass-spectrometry evidence and contribute to wheat genome-annotation refinement.

Approach: Peptide sequences were aligned against the wheat genome using a tBLASTn-based strategy. This sequence-driven approach searched for genomic regions capable of encoding the experimentally observed peptides, independently of their existing annotation status.

Converted peptide identifications into genome-searchable sequence inputs
Aligned peptides against the wheat reference genome
Generated genomic coordinates for matched peptide evidence
Identified experimentally supported regions relevant to gene-model assessment
Produced browser-compatible outputs for genome-context visualisation
Demonstrated the value of public proteomics data for genome annotation
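
The step from alignment hits to browser-compatible coordinates can be sketched as below. This is a hedged illustration, not the published pipeline: it assumes the default 12-column BLAST tabular output (`-outfmt 6`), in which a tBLASTn hit on the reverse strand is reported with the subject start greater than the subject end. The function name `blast_hit_to_bed6` is hypothetical.

```python
def blast_hit_to_bed6(line):
    """Convert one line of default BLAST tabular output into a BED6 line.

    BLAST coordinates are 1-based and inclusive; BED coordinates are
    0-based and half-open, so only the start is shifted by one.
    """
    fields = line.rstrip("\n").split("\t")
    query_id, subject_id = fields[0], fields[1]
    subject_start, subject_end = int(fields[8]), int(fields[9])
    bit_score = float(fields[11])
    strand = "+" if subject_start <= subject_end else "-"
    start = min(subject_start, subject_end) - 1
    end = max(subject_start, subject_end)
    # BED scores are integers between 0 and 1000.
    score = max(0, min(1000, int(round(bit_score))))
    return "\t".join([subject_id, str(start), str(end), query_id, str(score), strand])
```

For example, a 10-residue peptide matching chromosome positions 130 down to 101 becomes a 30-nucleotide minus-strand BED feature spanning 100 to 130.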

Community Resource: Large-Scale Proteogenomics to Refine Wheat Genome Annotations

Technical highlights: proteogenomics • tBLASTn • peptide alignment • genome coordinates • wheat genome annotation • BED files • genome-browser integration

End-to-end tBLASTn-based wheat proteogenomics workflow covering public peptide retrieval, tBLASTn genomic projection, data analysis, and Apollo/JBrowse deployment

4. Genome-Guided Proteogenomics and Wheat Genome-Annotation Validation

Objective: Develop a scalable, annotation-aware proteogenomics workflow to reconstruct experimentally identified wheat peptides at precise genomic coordinates, validate those projections rigorously, quantify protein-level support for high- and low-confidence gene models, and deploy the resulting evidence as an accessible community resource.

Dataset: Public wheat proteomics datasets comprising 577 raw mass-spectrometry files, approximately 1.0 TB of data, and 32 tissues and developmental stages were retrieved from PRIDE and MassIVE and reprocessed against the IWGSC RefSeq v2.1 high- and low-confidence wheat proteome.

Approach: Raw MS/MS data were searched using FragPipe/MSFragger, after which a custom GFF3-based Python workflow linked identified peptides to proteins, transcripts, coding sequences, genes, chromosomes, and strand orientation. Protein-space peptide coordinates were then converted into exon-resolved genomic coordinates and subjected to translation and coordinate-level validation before export as Apollo/JBrowse-compatible BED tracks.
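
The conversion from protein-space positions to exon-resolved genomic coordinates can be illustrated with a minimal sketch. It is not the project code: the function name `project_peptide` is hypothetical, CDS segments are assumed to be 1-based inclusive (as in GFF3), and the coding sequence is assumed to be complete and to start in phase 0.

```python
def project_peptide(cds_segments, strand, aa_start, aa_end):
    """Map a peptide (1-based inclusive residue positions) onto the genome.

    Returns genomic blocks as 1-based inclusive (start, end) tuples in
    ascending order; more than one block means the peptide spans an
    exon junction.
    """
    # Walk the CDS in transcript order: ascending on "+", descending on "-".
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    # Peptide position within the spliced CDS, 0-based half-open.
    nt_start, nt_end = (aa_start - 1) * 3, aa_end * 3
    blocks, offset = [], 0
    for seg_start, seg_end in ordered:
        seg_len = seg_end - seg_start + 1
        lo, hi = max(nt_start, offset), min(nt_end, offset + seg_len)
        if lo < hi:  # the peptide overlaps this CDS segment
            if strand == "+":
                blocks.append((seg_start + lo - offset, seg_start + hi - offset - 1))
            else:
                blocks.append((seg_end - (hi - offset) + 1, seg_end - (lo - offset)))
        offset += seg_len
    return sorted(blocks)
```

With two CDS segments at 101 to 109 and 201 to 209 on the positive strand, residues 3 to 4 project to two blocks, 107 to 109 and 201 to 203, which is how an exon-spanning peptide would appear.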

Reanalysed 577 raw mass-spectrometry files representing approximately 1.0 TB of public wheat proteomics data from 32 tissues
Generated 2,226,779 non-redundant peptides and 1,648,740 unique protein accessions using FragPipe/MSFragger
Parsed gene, transcript, exon, CDS, and protein relationships from the IWGSC RefSeq v2.1 GFF3 annotations
Projected protein-space peptide positions into transcript and chromosome coordinates while preserving exon structure and positive- or negative-strand orientation
Distinguished peptides contained within a single exon from peptides spanning annotated exon junctions
Produced 8,291,056 peptide-to-genome projection rows, with 100% of annotation-supported peptide–protein–gene rows assigned genomic coordinates before validation
Applied four independent validation procedures covering translated peptide sequence, BED geometry, chromosome and strand concordance, and protein-coordinate consistency
Achieved 99.07% translation validation after accounting for isoleucine/leucine equivalence and a 98.14% final validation rate after all quality-control procedures
Produced a final resource of 3,138,903 non-redundant validated peptide projections, including 2,775,671 within-exon and 363,232 exon-spanning mappings
Supported 267,166 protein isoforms and 238,590 wheat gene models
Provided protein-level evidence for 103,095 high-confidence gene models, representing 96.4% of parsed HC annotations
Supported 135,495 low-confidence gene models, representing 84.8% of parsed LC annotations
Delivered experimental peptide support for 89.4% of all parsed wheat gene models
Generated exon-aware BED6 and BED12 tracks for genome-wide and locus-level visualisation in Apollo/JBrowse
Enabled inspection of high-confidence gene validation, low-confidence gene support, exon-spanning peptides, and isoform-specific peptide evidence
Prepared a reproducible Jupyter Notebook, Python scripts, BED files, figures, supplementary tables, and public genome-browser resources
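
Translation validation with isoleucine/leucine equivalence can be sketched as follows. The example is illustrative and uses hypothetical function names; it assumes the genomic blocks of a projected peptide have already been concatenated into one forward-strand DNA string. Isoleucine and leucine have identical masses and are not distinguished by routine MS/MS, so both are normalised to the same letter before comparison.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G; third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate DNA in frame 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def matches_peptide(spliced_dna, strand, peptide):
    """Check that the projected DNA encodes the peptide, treating I and L as equal."""
    dna = spliced_dna.upper()
    coding = dna if strand == "+" else reverse_complement(dna)
    translated = translate(coding).replace("I", "L")
    return translated == peptide.upper().replace("I", "L")
```

A projection passes when the back-translated sequence equals the observed peptide, so `matches_peptide("ATGCTGAAA", "+", "MIK")` is accepted even though the genome encodes leucine.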

Validation and biological significance: The workflow reconstructed valid genomic coordinates for the overwhelming majority of projected peptide–protein pairs across tissues, exon structures, transcript isoforms, homeologous loci, and strand orientations. Extensive peptide support confirmed most high-confidence gene models and provided strong experimental evidence for many low-confidence loci, indicating that a substantial proportion of LC annotations likely represent genuinely expressed protein-coding genes that merit future annotation review.

Community resource: The validated BED tracks allow researchers to inspect peptide evidence directly in its chromosome, gene, transcript, exon, and isoform context. Apollo/JBrowse views display dense peptide support across high-confidence loci, isoform-specific evidence, and extensively supported low-confidence genes that may be candidates for future reclassification.
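
For readers unfamiliar with the track format, the sketch below shows how an exon-aware BED12 line can be written from a set of genomic blocks. It is a simplified illustration under stated assumptions: the function name `bed12_line` is hypothetical, input blocks are 1-based inclusive, and the thick region, score, and colour are set to neutral defaults.

```python
def bed12_line(chrom, blocks, name, strand, score=0, rgb="0,0,0"):
    """Format 1-based inclusive genomic blocks as a single BED12 line.

    BED uses 0-based half-open coordinates, and block starts are given
    relative to the feature start.
    """
    blocks = sorted(blocks)
    chrom_start = blocks[0][0] - 1
    chrom_end = blocks[-1][1]
    block_sizes = [end - start + 1 for start, end in blocks]
    block_starts = [start - 1 - chrom_start for start, _ in blocks]
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end, rgb, len(blocks),
        ",".join(map(str, block_sizes)),
        ",".join(map(str, block_starts)),
    ]
    return "\t".join(map(str, fields))
```

A peptide contained in one exon produces a single block, whereas an exon-spanning peptide produces two or more blocks that a genome browser draws joined across the intron.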

Read the bioRxiv preprint: Community Resource: A Genome-Based Extension of Large-Scale Wheat Proteogenomics
Explore the validated peptide tracks in Apollo/JBrowse

Technical highlights: FragPipe • MSFragger • Python • JupyterLab • GFF3 • IWGSC RefSeq v2.1 • exon-resolved peptide projection • translation validation • BED6 • BED12 • high-confidence and low-confidence gene models • Apollo • JBrowse • reproducible proteogenomics • genome-annotation refinement

End-to-end GFF3-based wheat proteogenomics workflow covering public LC-MS/MS retrieval, FragPipe peptide identification, exon-resolved genomic projection, validation, data analysis, and Apollo/JBrowse deployment

5. Machine Learning Extension of the Wheat Proteome Resource

Objective: Reanalyse the large wheat peptide-profile dataset using machine learning to identify natural sample structure, classify proteomic patterns, and determine which peptides contribute most strongly to the detected groups.

Methods: Data filtering, feature engineering, dimensionality reduction, unsupervised clustering, classification, model evaluation, and feature-importance analysis.

Reused the large-scale cultivar dataset as a machine-learning resource
Investigated latent sample classes and peptide-profile structure
Compared complementary clustering and classification approaches
Prioritised peptides contributing most strongly to proteomic separation
Connected machine-learning outputs with biological and functional interpretation

Status: Article in preparation

Technical highlights: machine learning • clustering • classification • feature importance • wheat cultivar profiling • peptide biomarkers • Python

Large-scale wheat proteomics workflow covering genotype classification using statistical analyses and machine learning

Scale, Reproducibility and Technical Delivery

These projects required more than analysing a single large table. They involved coordinating heterogeneous files, metadata, sequence identifiers, annotation formats, biological hierarchies, quality-control rules, and publication outputs across several generations of the wheat resource.

Large-scale file and metadata management
Chunked and memory-aware Python processing
Structured quality-control and validation checkpoints
Reproducible notebooks and standalone scripts
Version-controlled code and documented analytical outputs
Integration of proteomics, genome annotations, statistics, bioinformatics, and visualisation
Preparation of manuscripts, figures, tables, supplementary resources, and browser tracks
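
Chunked, memory-aware processing can be illustrated with a short standard-library sketch. The function names, the tab-separated layout, and the column name used in the usage note are assumptions for illustration; the idea is that only one chunk of rows is held in memory at a time while a running summary is updated.

```python
import csv
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most `chunk_size` items from any iterable."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def count_by_column(lines, column, chunk_size=100_000):
    """Count the values of one column in a tab-separated table, chunk by chunk.

    `lines` can be an open file handle, so the table is streamed rather
    than loaded in full.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    totals = Counter()
    for chunk in iter_chunks(reader, chunk_size):
        totals.update(row[column] for row in chunk)
    return totals
```

In practice the function would be called on an open file, for example `count_by_column(open("projections.tsv", newline=""), "chrom")`, where the filename and column name are hypothetical.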

Technical highlights: big data • wheat proteomics • 4,087 grain samples • 2.23 million non-redundant peptides • 8.29 million genome projections • Python • tBLASTn • GFF3 • translation validation • proteogenomics • Apollo/JBrowse • gene-model support • biomarker discovery • machine learning • reproducible science

Machine Learning and NLP

This section showcases selected machine learning and natural language processing projects spanning multi-omics, metabolomics, bibliometrics, anomaly detection, recommendation systems, and scientific career analytics. Each project combines reproducible Python workflows with interpretable visual outputs and, where relevant, publication-oriented reporting.

FNRL Mode-of-Action in Arabidopsis Roots: an Integrated Multi-Omics, Machine Learning and Bioinformatics Study

Objective: Investigate the function of the root-specific FNRL gene in Arabidopsis thaliana by integrating phenotyping with transcriptomics, proteomics, metabolomics, machine learning, and bioinformatics. The study compared wild-type plants, two independent loss-of-function mutants, and a GFP-complemented line to uncover molecular pathways associated with FNRL regulation.

Dataset: Multi-omics and phenotypic dataset comprising root RNA-seq, LC-MS/MS proteomics, GC-MS metabolomics, nitrate content measurements, and root length data across four genotypes and biological replicates.

Methods: Data cleaning and harmonisation, missing-value imputation, log transformation, cross-omics scaling, PCA, differential expression analysis, biomarker discovery based on mutant consistency and GFP rescue rules, hierarchical clustering, k-means clustering, LDA, Elastic Net and Random Forest modelling, plus bioinformatics mining using TAIR, UniProtKB, GO/AmiGO, KEGG, STRING, Cytoscape, PaintOmics and AraCyc.

Identified 504 high-confidence FNRL-regulated biomarkers across transcriptomics, proteomics, and metabolomics.
Resolved biomarker behaviour into 4 main expression profiles, separating FNRL-activated and FNRL-repressed targets with strong or partial genetic rescue.
Built a conceptual model of FNRL regulation based on two main axes: regulatory direction and rescue strength.
Showed that root growth phenotypes were highly predictable from biomarker-derived molecular profiles, with Elastic Net models achieving strong performance.
Used integrated pathway and interaction analyses to highlight processes linked to nitrogen metabolism, protein processing, ubiquitin-mediated proteolysis, and root-associated regulation.
Generated an interpretable multi-omics resource to support biological interpretation of FNRL mode of action and downstream manuscript preparation.
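
The logic of a mutant-consistency and rescue rule can be sketched as below. This is a simplified illustration, not the study's actual criteria: the function name, the thresholds `min_change` and `strong_fraction`, and the use of log-scale abundances are all assumptions. A feature that decreases when FNRL is lost is labelled FNRL-activated, and one that increases is labelled FNRL-repressed.

```python
def classify_biomarker(wild_type, mutant_1, mutant_2, complemented,
                       min_change=1.0, strong_fraction=0.5):
    """Assign a feature to one of four profiles, or None if it fails the rules.

    Inputs are mean log-scale abundances per genotype. Both mutants must
    change in the same direction by at least `min_change`, and the
    complemented line must move back towards the wild type.
    """
    delta_1 = mutant_1 - wild_type
    delta_2 = mutant_2 - wild_type
    consistent = delta_1 * delta_2 > 0
    if not consistent or min(abs(delta_1), abs(delta_2)) < min_change:
        return None
    mutant_mean = (mutant_1 + mutant_2) / 2
    # Fraction of the mutant effect reversed in the complemented line.
    recovered = (mutant_mean - complemented) / (mutant_mean - wild_type)
    if recovered <= 0:
        return None
    direction = "FNRL-activated" if mutant_mean < wild_type else "FNRL-repressed"
    rescue = "strong rescue" if recovered >= strong_fraction else "partial rescue"
    return direction, rescue
```

The two returned labels correspond to the two axes of the conceptual model, regulatory direction and rescue strength, giving four possible profiles.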

Technical highlights: Python • multi-omics integration • biomarker discovery • clustering • LDA • Elastic Net • Random Forest • pathway analysis • network biology

Status: Article in preparation

Final report • Python code and Jupyter notebook • FNRL study overview

Machine Learning Reveals Synergistic Effects of Lactobacillus helveticus in Camel Milk Fermentation Using Metabolomics

Objective: Develop a computational biomarker discovery pipeline to isolate metabolites specifically associated with Lactobacillus helveticus fermentation in camel milk, and determine whether co-culture with L. bulgaricus or S. thermophilus induces synergistic or antagonistic metabolic effects.

Dataset: Untargeted UPLC-MS/MS metabolomics dataset covering 13,400 metabolites, including 3,632 identified compounds, generated from camel and bovine milk fermented under mono- and co-culture conditions.

Methods: Three-way ANOVA, PCA, LDA, HDBSCAN, k-means, spectral clustering, self-organising maps (SOM), Random Forest classification, feature importance ranking, metabolite annotation, superclass analysis, and pathway exploration using MetaboAnalyst and external metabolic databases.

Designed a 4-level biomarker selection pipeline reducing 13,400 compounds to the most biologically relevant features.
Selected 2,069 metabolites through full-factor interaction testing, then refined to 1,017 metabolites showing clear synergy/antagonism patterns.
Applied Random Forest modelling to isolate 508 high-impact biomarkers associated with metabolic interactions driven by L. helveticus.
Retrieved 133 identified metabolites, of which 87 displayed synergistic profiles and 46 antagonistic profiles.
Highlighted metabolite classes linked to amino acid metabolism, bioactive lipids, carbohydrates, and fermentation-associated functional compounds.
Developed a reusable in silico protein digestion tool to simulate enzyme cleavage of camel and bovine caseins, supporting interpretation of dairy proteolysis and related cheese studies.
Delivered a robust analytical framework supporting downstream biological interpretation and manuscript preparation.
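
The principle behind in silico protein digestion can be shown with a minimal sketch. It is not the tool itself: trypsin is used here only as a familiar example enzyme, with the common rule of cleaving after lysine or arginine unless the next residue is proline, and the function name `digest` is hypothetical.

```python
import re


def digest(sequence, missed_cleavages=0, min_length=1):
    """Simulate a tryptic digest of a protein sequence.

    Cleaves C-terminal to K or R, except before P, and optionally joins
    neighbouring fragments to model up to `missed_cleavages` missed sites.
    """
    # Split at zero-width positions after K/R that are not followed by P.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence.upper()) if f]
    peptides = []
    for first in range(len(fragments)):
        last_allowed = min(first + missed_cleavages, len(fragments) - 1)
        for last in range(first, last_allowed + 1):
            peptide = "".join(fragments[first:last + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides
```

For the toy sequence `AAKRPGGRLLK`, the site between R and P is skipped, so the fully cleaved digest contains the three peptides AAK, RPGGR, and LLK. Other enzymes can be modelled by changing the regular expression.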

Technical highlights: Python • metabolomics • feature selection • clustering • Random Forest • biomarker discovery • pathway analysis • Side tool: in silico protein digestion

Status: Article submitted

Final report • Python code and Jupyter notebook • Related bioinformatics tool • Associated publication • Camel milk study overview

Machine Learning Modelling of Synergy and Antagonism of Phenolic Compounds Using Antioxidant Assays

Objective: Develop a machine learning framework to quantify synergistic and antagonistic interactions between phenolic compounds across multiple antioxidant assays, using observed-versus-predicted absorbance behaviour to reveal non-additive biochemical effects.

Dataset: Experimental antioxidant dataset comprising individual standards and binary mixtures (CB1–CB5) tested across multiple assay systems, with absorbance measurements collected for pure compounds and combinations under controlled concentrations.

Methods: Exploratory data analysis, assay-wise regression modelling, XGBoost prediction, residual-based synergy scoring, threshold optimisation, confusion matrix evaluation, feature importance ranking, and comparative interpretation across antioxidant assay types.

Built XGBoost regression models to predict expected absorbance values from single-compound behaviour.
Calculated synergy and antagonism scores from prediction residuals, enabling quantitative classification of interaction strength.
Demonstrated that interaction behaviour varies substantially across assay systems, revealing assay-dependent biochemical responses.
Generated interpretable classification outputs distinguishing synergistic, additive, and antagonistic mixtures.
Established a transferable computational framework for analysing compound interactions in antioxidant chemistry.
Produced publication-ready visual outputs supporting article submission.
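
Residual-based interaction scoring can be illustrated as follows. The sketch assumes that an expected absorbance for each mixture is already available from a fitted model; the function name, the threshold of 0.05 absorbance units, and the `higher_is_more_active` switch are hypothetical. The switch is included because the direction of the absorbance change that indicates stronger antioxidant activity differs between assay types.

```python
def classify_interaction(observed, expected, threshold=0.05,
                         higher_is_more_active=True):
    """Score a mixture from the residual between observed and expected absorbance.

    Returns (residual, label), where the label is 'synergistic',
    'additive', or 'antagonistic'.
    """
    residual = observed - expected
    # Orient the residual so that positive always means more activity.
    effect = residual if higher_is_more_active else -residual
    if effect > threshold:
        label = "synergistic"
    elif effect < -threshold:
        label = "antagonistic"
    else:
        label = "additive"
    return residual, label
```

A mixture whose observed response exceeds the prediction by more than the threshold is labelled synergistic, one that falls short by more than the threshold is labelled antagonistic, and anything in between is treated as additive.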

Technical highlights: Python • XGBoost • regression modelling • residual scoring • assay comparison • feature interpretation

Status: Article accepted for publication in Scientific Reports

Final report • Python code and Jupyter notebook • Antioxidant assay study overview

Bread Protein Biomarker Discovery Assisted by Machine Learning

Objective: Identify peptide biomarkers associated with wheat genotype groups and flour-quality protein expression, using statistical learning and machine learning to support biomarker-assisted selection in bread wheat.

Dataset: Large-scale proteomics dataset comprising thousands of peptides derived from flour proteins quantified across a broad panel of wheat genotypes, integrated with genotype metadata and protein annotation.

Methods: Data normalisation, correlation analysis, t-tests, ANOVA, hierarchical clustering, k-means clustering, self-organising maps (SOM), PCA, LDA, Random Forest, SVM, MLP neural network, and stacked ensemble modelling for genotype classification and biomarker ranking.

Integrated statistical and machine learning outputs into a unified biomarker discovery framework.
Resolved genotype structure through multiple complementary clustering approaches, including HCA, k-means, and SOM.
Applied supervised models to classify genotype groups using peptide abundance profiles.
Built a stacked ensemble model combining Random Forest, SVM, and MLP to improve classification robustness.
Identified peptide biomarkers linked to key flour-quality protein groups, including glutenins and gliadins.
Generated biologically interpretable candidate markers relevant for wheat breeding and flour functionality.

Technical highlights: Python • proteomics • biomarker discovery • clustering • stacked machine learning • classification • feature ranking

Status: Article in preparation

Final report • Python code and Jupyter notebook • Wheat proteomics biomarker charts

Career Trajectory Mapping From My Scientific Publications Using NLP, Machine Learning and Analytics

Objective: Develop a reproducible NLP and machine learning framework to analyse the thematic evolution of a scientific career through publication metadata, abstracts, keywords, and semantic content, with the goal of identifying major research phases, transitions, and future directions.

Dataset: Curated corpus of personal scientific publications spanning multiple years, integrating titles, abstracts, keywords, journal metadata, authorship patterns, and publication timelines.

Methods: Text preprocessing, tokenisation, TF-IDF, topic modelling, keyword frequency analysis, semantic clustering, temporal trend analysis, PCA, LDA, and machine learning-assisted interpretation of research trajectory patterns.

Built a structured NLP corpus from publication records across multiple scientific domains.
Identified major thematic transitions across career stages using topic modelling and semantic clustering.
Mapped the evolution from molecular plant science to multi-omics, machine learning, and computational biology.
Integrated publication chronology with thematic outputs to reconstruct career progression pathways.
Generated visual analytics highlighting dominant research themes, emerging directions, and interdisciplinary expansion.
Produced a data-driven framework transferable to researcher profiling, strategic planning, and academic portfolio analysis.

Technical highlights: Python • NLP • topic modelling • TF-IDF • semantic analysis • temporal analytics • machine learning interpretation

Status: Completed analytical report

Final report • Python code and Jupyter notebook

Career trajectory NLP charts

Aliens, Algorithms & Anomalies: Visualising the NUFORC UFO Sightings

Objective: Develop a machine learning and NLP framework to analyse large-scale UFO sighting reports, identify reporting patterns, classify anomalous events, and prioritise cases with high investigative value.

Dataset: NUFORC database containing more than 150,000 UFO sighting reports, integrating temporal records, geographical metadata, witness descriptions, event characteristics, and free-text narratives.

Methods: Data cleaning, geospatial enrichment, NLP feature extraction from witness reports, exploratory data analysis, temporal and spatial visualisation, classification modelling, anomaly prioritisation, and predictive scoring.

Cleaned and enriched large-scale historical sighting data with geographical coordinates and regional metadata.
Extracted structured variables from free-text witness narratives to support NLP-driven analysis.
Built predictive models to identify high-priority sightings associated with stronger anomaly indicators.
Developed case prioritisation logic to flag reports with elevated close-encounter characteristics.
Generated multi-scale visual analytics showing temporal waves, spatial hotspots, and reporting behaviour.
Combined scientific rigour with unconventional public data to demonstrate a transferable anomaly-analysis methodology.

Technical highlights: Python • NLP • geospatial analytics • classification • feature engineering • anomaly scoring • Tableau

Status: Completed analytical report

Final report • Python code and Jupyter notebook • Interactive Tableau dashboard

UFO anomaly analytics charts

Decoding Love with Data: Matchmaking in a Dating App

Objective: Develop a machine learning and NLP framework to identify compatible matches between dating profiles by combining structured demographic features, personal preferences, and free-text self-descriptions.

Dataset: Large-scale dating profile dataset including demographic variables, lifestyle attributes, preferences, and millions of words extracted from personal essays written by users.

Methods: Text preprocessing, tokenisation, lemmatisation, topic modelling (LDA), feature encoding, clustering, cosine similarity, nearest-neighbour matching, and interactive match retrieval.

Processed millions of words from user essays to extract meaningful lifestyle and personality themes.
Applied topic modelling to identify dominant discussion themes across personal profiles.
Integrated structured profile variables with NLP-derived features into a unified similarity framework.
Built cluster-based matching logic to improve candidate relevance before similarity scoring.
Computed profile-to-profile similarity using cosine similarity within behavioural clusters.
Developed an interactive interface to retrieve top compatible matches for any selected profile.
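
The cluster-then-score idea in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the profile identifiers, feature vectors, cluster labels, and the helper `top_matches` are all made up, and the vectors stand in for whatever combination of topic weights and encoded attributes the real pipeline used.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero profile has no direction to compare
    return dot / (norm_u * norm_v)


def top_matches(query_id, vectors, clusters, k=3):
    """Return the k most similar profiles that share the query's cluster."""
    candidates = [
        pid for pid in vectors
        if pid != query_id and clusters[pid] == clusters[query_id]
    ]
    scored = [
        (pid, cosine_similarity(vectors[query_id], vectors[pid]))
        for pid in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical profiles: p5 is identical to p1 but sits in another cluster.
vectors = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.8, 0.2, 0.1],
    "p3": [0.1, 0.9, 0.3],
    "p4": [0.7, 0.0, 0.2],
    "p5": [0.9, 0.1, 0.0],
}
clusters = {"p1": 0, "p2": 0, "p3": 1, "p4": 0, "p5": 1}
matches = top_matches("p1", vectors, clusters, k=2)
print(matches)
```

Restricting candidates to the query's cluster reduces the number of comparisons, at the cost of never returning a close profile that was assigned to a different cluster (p5 in this toy data).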

Technical highlights: Python • NLP • topic modelling • clustering • cosine similarity • recommendation logic • Gradio interface • Tableau

Status: Completed machine learning project

Final report • Python code and Jupyter notebook • Interactive Tableau dashboard

Dating app NLP matching charts

Mapping the Landscape of Scientific Publications

Objective: Explore large-scale scientific publication metadata to identify bibliometric trends, publication patterns, dominant research topics, and long-term shifts in disciplinary focus.

Dataset: Scientific article metadata dataset containing approximately 120,000 publications with information on titles, authors, publication dates, journals, publishers, languages, article types, references, and subject descriptors.

Methods: Metadata cleaning and wrangling, exploratory data analysis, bibliometric visualisation, subject text preprocessing, TF-IDF vectorisation, and k-means clustering to group fine-grained subject labels into broader thematic categories.

Cleaned and structured a large bibliographic dataset to support interpretable publication analytics.
Analysed long-term trends in publication volume, article types, journals, publishers, languages, citations, and title length.
Identified prolific authors and high-output journals across multiple scientific fields.
Explored subject evolution over time to distinguish long-lasting, emerging, and short-lived research areas.
Consolidated 1,265 fine-grained subject labels into 7 broader thematic clusters using text mining and k-means clustering.
Produced publication-ready visual analytics and an interactive Tableau dashboard for bibliometric exploration.
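
For readers unfamiliar with k-means, the grouping step can be sketched in plain Python. This is an illustrative version of Lloyd's algorithm under simplifying assumptions: the two-dimensional points are made-up stand-ins for TF-IDF vectors of subject labels, and the centroids are seeded from the first k points to keep the result deterministic. It does not reproduce the project's actual vectoriser, k, or initialisation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Made-up vectors forming two obvious groups.
points = [
    [1.0, 0.1], [0.0, 1.0], [0.9, 0.2],
    [0.1, 0.9], [1.0, 0.0], [0.2, 1.0],
]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```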

Technical highlights: Python • bibliometrics • metadata analytics • text mining • TF-IDF • k-means clustering • Tableau

Status: Completed analytical report

Final report • Python code and Jupyter notebook • Interactive Tableau dashboard • Raw dataset

Scientific publication analytics charts

Patents and Applied Innovation

I contributed scientific and technical expertise to two patented inventions arising from applied research in medicinal cannabis and plant-microbiome analysis. These projects translated experimental methods, mass-spectrometry workflows, and analytical findings into intellectual property with potential research and commercial applications.

My contributions included experimental-method development, analytical optimisation, biological interpretation, data generation, and preparation of evidence supporting the patented technologies.

Protein Extraction from Cannabis Plant Material

Innovation: Development of a method for extracting proteins from medicinal cannabis tissues, particularly mature plant material containing compounds that can interfere with protein recovery and downstream proteomic analysis.

Scientific context: Cannabis tissues present substantial analytical challenges because of their complex biochemical composition, including cannabinoids, phenolics, pigments, lipids, and other secondary metabolites. Effective protein extraction is therefore essential for reliable bottom-up, middle-down, and top-down proteomics.

Optimisation of protein extraction from complex medicinal-cannabis tissues
Evaluation of extraction efficiency and protein recovery
Preparation of protein samples suitable for mass-spectrometry-based proteomics
Support for molecular phenotyping, protein identification, and cultivar characterisation
Translation of experimental methodology into protected intellectual property

Patent: WO 2020/124128 A1 / CA 3122758 A1

Method of Protein Extraction from Cannabis Plant Material

Technical highlights: medicinal cannabis • protein extraction • sample preparation • proteomics • LC-MS/MS • analytical optimisation • intellectual property

Plant Microbiome Profiling

Innovation: Development of methods for profiling microorganisms associated with plants, supporting rapid characterisation of plant-associated microbial communities.

Scientific context: Plant microbiomes can influence plant health, growth, productivity, stress tolerance, and disease. Analytical methods capable of profiling plant-associated microorganisms can therefore support crop research, disease surveillance, biological-product development, and microbial ecology.

Characterisation of microorganisms associated with plant material
Application of mass-spectrometry-based microbial profiling
Use of MALDI-based analytical approaches for rapid microbial identification
Comparison and interpretation of microbial profile patterns
Support for development of a reproducible plant-microbiome profiling method

Patent: US 2022/0333151 A1

Plant Microbiome and Methods for Profiling Plant Microbiome

Technical highlights: plant microbiome • microbial profiling • MALDI mass spectrometry • microbial identification • plant health • applied research • intellectual property

From Research to Intellectual Property

These inventions demonstrate my ability to contribute to research that extends beyond academic publication and produces practical, protectable methodologies.

Identification of technical challenges with applied scientific relevance
Design and optimisation of experimental methods
Generation and interpretation of supporting analytical evidence
Collaboration across research, technical, and commercial teams
Translation of scientific findings into patentable processes and applications

Technical highlights: patents • intellectual property • applied innovation • medicinal cannabis • plant microbiome • protein extraction • MALDI mass spectrometry • proteomics • method development • research translation

Visual Automation and Creative Coding

I combine photography, Python programming, image processing, audio design, and video automation to transform my personal photographic archive into short-form visual experiences for social media.

I have been taking digital photographs for approximately three decades, documenting travel, architecture, landscapes, gardens, art, wildlife, patterns, textures, and everyday visual details. In May 2025, I began publishing selected work through @60_seconds_of_calm, a creative project designed to offer viewers a brief, immersive visual pause.

60 Seconds of Calm

Objective: Create short, atmospheric videos from my photographic archive using a repeatable and largely automated production workflow.

Concept: Each video presents a one-minute visual journey built around a coherent theme, such as castles, churches, national parks, cities, villages, gardens, art, nature, architecture, colour, geometry, or whimsical subjects.

Curate photographs from a personal archive spanning approximately three decades
Organise images into thematic collections for batch processing
Generate consistent short-form videos suitable for YouTube and TikTok
Combine images, transitions, environmental sound, music, and animated overlays
Maintain a recognisable visual identity across the series

View 60 Seconds of Calm on YouTube • View 60 Seconds of Calm on TikTok

Automated Photo-Montage Production

I developed a Python workflow that processes folders of photographs and converts them into polished video montages with consistent framing, transitions, compression, audio mixing, and branded opening and closing effects.

Loads and validates JPG or PNG photos, as well as MP4 video files, from structured project folders
Resizes and pads files to fit a horizontal or vertical frame while preserving the aspect ratio
Displays each file for a defined duration within a timed sequence
Applies smooth fades and crossfades between consecutive images
Processes multiple themes in batch to reduce repetitive manual editing
Exports consistent full-HD MP4 files ready for publication
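
The resize-and-pad step in the list above comes down to a small piece of arithmetic, sketched here. The function name `fit_and_pad` and the default 1920×1080 frame are assumptions for illustration; the actual workflow may compute this through its image libraries or FFmpeg filters instead.

```python
def fit_and_pad(src_w, src_h, frame_w=1920, frame_h=1080):
    """Scale an image to fit inside the frame, then centre it with padding.

    Returns the scaled size and the (left, top, right, bottom) padding.
    """
    # The smaller ratio guarantees that neither dimension overflows the frame.
    scale = min(frame_w / src_w, frame_h / src_h)
    new_w = min(frame_w, max(1, round(src_w * scale)))
    new_h = min(frame_h, max(1, round(src_h * scale)))
    pad_left = (frame_w - new_w) // 2
    pad_top = (frame_h - new_h) // 2
    return {
        "size": (new_w, new_h),
        "pad": (
            pad_left,
            pad_top,
            frame_w - new_w - pad_left,
            frame_h - new_h - pad_top,
        ),
    }


# A 4:3 photo in a horizontal frame gets side bars.
landscape = fit_and_pad(4000, 3000)
# A 3:4 photo in a vertical frame gets top and bottom bars.
portrait = fit_and_pad(3000, 4000, frame_w=1080, frame_h=1920)
print(landscape, portrait)
```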

Dynamic Image Transformation with scikit-image

In addition to conventional photo montages, I use scikit-image, NumPy, OpenCV, and Pillow to transform visually striking photographs into dynamic short animations.

This approach is particularly effective for images containing strong geometry, repeating patterns, symmetry, architectural detail, bold colour, unusual texture, or abstract visual structure. Rather than presenting a photograph as a single static frame, I generate a sequence of transformed frames that creates movement, depth, rhythm, and visual immersion.

Crop, rotate, mirror, translate, rescale, and progressively zoom source images
Generate animated reflections, repeated motifs, symmetry, and kaleidoscopic effects
Apply geometric transformations and frame-by-frame image warping
Alter contrast, brightness, saturation, colour balance, and tonal range over time
Create evolving masks, reveals, dissolves, and pattern-based transitions
Animate selected regions of an image while retaining the broader composition
Convert generated image sequences into short immersive MP4 animations

These transformations allow a single photograph to become a distinct visual artwork, while still preserving the composition, pattern, texture, or architectural structure that made the original image compelling.
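
Two of the transformations listed above, mirrored symmetry and progressive zoom, can be illustrated without any imaging library. The sketch below uses nested lists in place of pixel arrays, and the helper names `mirror_tile` and `zoom_boxes` are hypothetical; in an array-based workflow the same reflections are available through NumPy's `fliplr` and `flipud`.

```python
def mirror_tile(tile):
    """Build a four-fold symmetric image from one quadrant (list of rows)."""
    top = [row + row[::-1] for row in tile]  # reflect left to right
    return top + top[::-1]                   # reflect top to bottom


def zoom_boxes(width, height, frames, max_zoom=1.5):
    """Centred crop boxes (left, top, right, bottom) for a progressive zoom.

    Cropping each box and rescaling it to the full frame gives a zoom-in.
    """
    boxes = []
    for i in range(frames):
        t = i / (frames - 1) if frames > 1 else 0.0
        zoom = 1.0 + t * (max_zoom - 1.0)
        crop_w, crop_h = round(width / zoom), round(height / zoom)
        left, top = (width - crop_w) // 2, (height - crop_h) // 2
        boxes.append((left, top, left + crop_w, top + crop_h))
    return boxes


kaleidoscope = mirror_tile([[1, 2], [3, 4]])
boxes = zoom_boxes(1920, 1080, frames=3, max_zoom=2.0)
print(kaleidoscope)
print(boxes)
```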

Audio Design and Atmosphere

The visual sequence is supported by layered sound designed to reinforce the mood of each video without overpowering the photographs.

Selection of ambient soundscapes appropriate to each theme
Integration of nature sounds, environmental audio, and gentle background music
Volume balancing between music and ambient layers
Application of fade-in and fade-out effects
Precise alignment of audio duration with the final visual sequence
Automated mixing and encoding through FFmpeg
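
One plausible shape for the automated mixing step is shown below: Python assembles an FFmpeg command that balances two audio layers, applies fades, and attaches the result to the video. The file names, volume levels, fade length, and the function `build_audio_mix_command` are assumptions for illustration. The filters used (`volume`, `amix`, `afade`) are standard FFmpeg audio filters, but the actual filter graph in the project may differ.

```python
import subprocess


def build_audio_mix_command(video, ambient, music, output,
                            duration=60.0, fade=2.0, music_volume=0.4):
    """Return an FFmpeg argument list that mixes two audio layers over a video."""
    fade_out_start = duration - fade
    filter_graph = (
        "[1:a]volume=1.0[amb];"
        f"[2:a]volume={music_volume}[mus];"
        "[amb][mus]amix=inputs=2:duration=longest,"
        f"afade=t=in:st=0:d={fade},"
        f"afade=t=out:st={fade_out_start}:d={fade}[mix]"
    )
    return [
        "ffmpeg", "-y",
        "-i", video, "-i", ambient, "-i", music,
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[mix]",  # keep the video, use the mixed audio
        "-c:v", "copy", "-c:a", "aac",   # do not re-encode the video stream
        "-t", str(duration),             # trim the output to the target length
        output,
    ]


def run(command):
    """Execute the command; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(command, check=True)


# Hypothetical file names; the command is only built here, not executed.
command = build_audio_mix_command("montage.mp4", "birds.wav", "piano.mp3", "final.mp4")
print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.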

Door-Opening and Door-Closing Animation

I designed a stylised door animation that opens at the beginning of each video and closes at the end. This recurring visual device reinforces the idea of briefly stepping away from daily activity and entering a calm visual space.

Custom opening and closing animation
Overlay compositing with the underlying video
Timed fade, opacity, and transition control
Reusable branding element across the series
Integration into the automated FFmpeg production workflow

End-to-End Automated Workflow

Image retrieval and validation: locate, count, and validate source JPG and PNG files.
Preparation: correct orientation, resize, pad, crop, or transform images while preserving visual quality.
Sequence generation: arrange photographs or transformed frames into a timed visual sequence.
Transition creation: apply fades, crossfades, zooms, reveals, and other animated effects.
Visual transformation: use scikit-image, OpenCV, Pillow, and NumPy to generate immersive frame-by-frame animations where appropriate.
Initial video rendering: assemble frames into a full-HD video sequence.
Compression: encode and compress the montage using FFmpeg and the H.264 codec.
Audio integration: mix ambient sound and background music with controlled volume and fading.
Branding overlay: add the custom door-opening and door-closing animation.
Final export: generate a publication-ready MP4 file for social-media delivery.
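
The transition-creation step relies on a simple idea: a crossfade is a linear blend between two frames whose weight moves from one image to the other. The sketch below shows that arithmetic on tiny frames stored as nested lists of RGB tuples. It is illustrative only; the function names are hypothetical, and a production workflow would perform the same blend on arrays through its image libraries.

```python
def blend(pixel_a, pixel_b, alpha):
    """Linear blend of two RGB pixels; alpha=0 gives A, alpha=1 gives B."""
    return tuple(
        round((1 - alpha) * a + alpha * b) for a, b in zip(pixel_a, pixel_b)
    )


def crossfade(frame_a, frame_b, steps):
    """Yield `steps` intermediate frames fading from frame_a to frame_b."""
    for i in range(1, steps + 1):
        alpha = i / (steps + 1)
        yield [
            [blend(pa, pb, alpha) for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)
        ]


# One-pixel frames keep the example easy to check by hand.
black = [[(0, 0, 0)]]
colour = [[(200, 100, 40)]]
frames = list(crossfade(black, colour, steps=3))
print(frames)
```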

Technologies and Libraries

Python: workflow control, batch processing, file management, and automation
scikit-image: geometric transformations, warping, masking, filtering, and frame generation
OpenCV: image processing, resizing, transition creation, frame handling, and video assembly
Pillow: image loading, compositing, annotation, resizing, and format conversion
NumPy: pixel-array manipulation and mathematical image transformations
ImageIO: image-sequence and frame input/output
MoviePy: video composition, timing, transitions, and audio integration where appropriate
FFmpeg: video encoding, H.264 compression, audio mixing, overlays, and final rendering
subprocess: automated execution of command-line FFmpeg operations from Python

Creative and Technical Outcomes

Transform a large photographic archive into structured and reusable creative assets
Reduce repetitive manual editing through reproducible batch processing
Produce consistent videos across multiple photographic themes
Combine still photography with motion, sound, coding, and visual storytelling
Create both conventional photographic montages and original generated animations
Develop a transferable workflow for social-media content, artistic projects, educational media, and scientific communication

View of the 60 Seconds of Calm YouTube channel showcasing automated photographic montages and immersive short visual animations

Technical highlights: Python • scikit-image • OpenCV • Pillow • NumPy • ImageIO • MoviePy • FFmpeg • subprocess • batch processing • image transformation • geometric animation • video automation • audio mixing • creative coding • photography • visual storytelling • YouTube • TikTok

Website Design, Development and Integrated Chatbot

This section showcases website projects that I have designed, built, maintained, or substantially improved. My work ranges from hand-coded static websites and chatbot integration to WordPress/Elementor administration, content architecture, production deployment, technical troubleshooting, and ongoing client support.

Professional Website and Integrated Chatbot

Objective: Build and maintain a distinctive professional website that presents my scientific career, data science capabilities, portfolio, publications, training, and consulting services in a clear, interactive, and visually recognisable format.

Website: dlf2024.github.io

Methods: Designed and coded the website from the ground up using HTML5, CSS3, and vanilla JavaScript; implemented responsive tabbed sections, reusable content components, custom graphics, structured metadata, accessible navigation, and GitHub Pages deployment. I also developed and connected a Python chatbot hosted separately on Render.

Created a complete one-page professional website with dedicated sections for Biography, Skills, Portfolio, Learning, Certificates, Resume, and Publications.
Developed a cohesive visual identity using a custom colour palette, branded illustrations, responsive cards, tabbed content panels, and mobile-specific layout adjustments.
Implemented interactive JavaScript navigation for nested portfolio, biography, skills, and learning content without requiring a web framework.
Improved the mobile viewing experience by repositioning selected tab content directly beneath its corresponding button and removing nested internal scrolling on smaller screens.
Added a full-screen image viewer with zoom, horizontal and vertical panning, reset and close controls, keyboard access, and responsive behaviour across desktop, mobile, and device orientation changes.
Refined the viewer so enlarged images can be explored across their complete width and height, including previously inaccessible left-side content.
Added search-engine and social-sharing metadata, including descriptive page titles, Open Graph tags, Twitter metadata, and structured content to improve discoverability and presentation.
Integrated downloadable, date-versioned resume and publication resources and established a repeatable maintenance workflow for updating website assets and links.
Built a lightweight Python chatbot that indexes a curated snapshot of the website, applies TF-IDF-based retrieval, and returns concise answers with direct links to relevant website sections and documents.
Separated the chatbot into its own GitHub repository and Render backend, with a documented synchronisation and redeployment checklist whenever the website snapshot, resume, or publications file changes.
Continuously test the website across desktop and mobile layouts, refine readability and navigation, and update content as new projects, publications, qualifications, and consulting activities are completed.
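
To show what TF-IDF retrieval over website sections involves, here is a minimal standard-library sketch. It is not the chatbot's code: the section anchors and texts are invented, the smoothed IDF formula is one common choice, and the real backend also includes components (such as intent rules and FAQ matching) that are not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(sections):
    """Compute a TF-IDF vector (term -> weight) for every section."""
    tokens = {name: tokenize(text) for name, text in sections.items()}
    n_docs = len(tokens)
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        vectors[name] = {t: (c / len(words)) * idf[t] for t, c in counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, vectors, idf):
    """Return the best-matching section name, or None if no term is known."""
    counts = Counter(t for t in tokenize(query) if t in idf)
    if not counts:
        return None
    total = sum(counts.values())
    q_vec = {t: (c / total) * idf[t] for t, c in counts.items()}
    return max(vectors, key=lambda name: cosine(q_vec, vectors[name]))


# Hypothetical snapshot of three website sections.
sections = {
    "#portfolio": "machine learning projects, proteomics, NLP, dashboards",
    "#publications": "peer reviewed articles, journals, publication list",
    "#resume": "downloadable resume, career history, education",
}
vectors, idf = build_index(sections)
print(retrieve("Where can I download the resume?", vectors, idf))
```

Because matching is based on shared terms, a query such as "CV" would not find the resume section in this sketch unless that word also appears in the indexed text.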

Technical highlights: HTML5 • CSS3 • JavaScript • responsive design • mobile tab positioning • full-screen image viewer • image zoom and panning • accessibility • GitHub Pages • Git/GitHub • SEO metadata • Python • FastAPI backend • rule-based intent recognition • fuzzy FAQ matching • TF-IDF retrieval • Render deployment

Status: Live and under continuous development

Visit website • GitHub profile

Overview of Delphine Vincent's professional website and integrated chatbot

Ori Scientific Website Administration and Content Development

Objective: Improve and maintain the public website of Ori Scientific, an Australian food research and development company, while supporting clearer service communication, easier navigation, and a consistent professional presentation.

Website: oriscientific.au

Methods: Conducted a content and usability audit, then implemented updates through WordPress and Elementor. Work includes page editing, template reuse, menu administration, service architecture, post creation, brand-consistent visual updates, and preparation of repeatable maintenance procedures.

Reviewed the existing website and provided prioritised recommendations covering wording, consistency, navigation, content gaps, visual presentation, and future social-media integration.
Revised key homepage, About, Contact, and service content to communicate Ori Scientific's analytical testing, product development, pilot-scale processing, research, and regulatory capabilities more clearly.
Created new service content for Pilot Plant Laboratory and Research Permits, Import Approvals and Accreditations.
Built new Elementor pages by reusing and adapting the established Research and Development template, preserving the site's structure and visual identity.
Added the new services to the main Our Services dropdown menu and to the quick-access navigation on the services page.
Created and published a new News & Updates post, including category assignment, excerpt, featured image, permalink review, and front-end validation.
Applied consistent brand colours, iconography, headings, calls to action, and page hierarchy across newly developed content.
Documented repeatable procedures for creating service pages, updating dropdown menus, and publishing future news posts so that ongoing maintenance remains efficient and reliable.
Provide continuing webmaster support for content updates, quality assurance, website maintenance, and future LinkedIn/Facebook communication.

Technical highlights: WordPress • Elementor • page templates • responsive content editing • information architecture • menu administration • blog publishing • brand consistency • webmaster documentation

Status: Active client project and ongoing maintenance

Visit website

Overview of Ori Scientific website content and service-page updates

Data Biome Website Design, Deployment and Secure Contact Infrastructure

Objective: Design and deploy a production-ready website for Data Biome, a biotechnology startup developing probiotics for ruminants to reduce methane emissions and mitigate greenhouse-gas impact.

Website: databiome.au

Methods: Built the full website locally using HTML5, CSS3, vanilla JavaScript, PHP, and custom media assets, then deployed it to WebCentral hosting through cPanel. The project also included secure contact-form delivery, domain email authentication, responsive testing, and production troubleshooting.

Created the complete project structure, brand palette, one-page information architecture, reusable responsive cards, and sticky accessible navigation.
Integrated custom visual assets and a full-width animated hero background, processed and compressed with FFmpeg for web delivery.
Implemented semantic HTML, keyboard-accessible navigation, visible focus states, form validation, bot protection, and reduced-motion considerations.
Developed the contact workflow using contact.php, PHPMailer, authenticated SMTP, and end-to-end browser-to-inbox testing.
Diagnosed and resolved production-only email failures by configuring the hosting provider's SMTP relay and repairing SPF and DMARC records.
Adapted the CSS to WebCentral's legacy linter by replacing unsupported modern features while retaining responsive behaviour.
Established a password-protected staging area, versioned ZIP deployments, cache-busting conventions, and post-launch quality checks.
Completed production launch, client handover, and an ongoing maintenance plan covering content, links, form health, and future feature development.
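
When troubleshooting email authentication of the kind described above, it helps to inspect the SPF and DMARC TXT records in a structured way. The sketch below parses example records and flags two common SPF problems. The records use the placeholder domain example.com and are not the records configured for this project; the helper names are hypothetical, and the checks cover only a small part of what the SPF and DMARC specifications define.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a tag -> value dictionary."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip()] = value.strip()
    return tags


def check_spf(record):
    """Return a list of problems found in an SPF TXT record (empty if none)."""
    terms = record.split()
    problems = []
    if not terms or terms[0] != "v=spf1":
        problems.append("record must start with v=spf1")
    has_all = any(term.lstrip("+-~?") == "all" for term in terms)
    has_redirect = any(term.startswith("redirect=") for term in terms)
    if not has_all and not has_redirect:
        problems.append("no terminal 'all' mechanism or redirect modifier")
    return problems


# Placeholder records for illustration only.
spf_ok = "v=spf1 include:_spf.example.com ~all"
spf_incomplete = "v=spf1 include:_spf.example.com"
dmarc = parse_dmarc("v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com")

print(check_spf(spf_ok), check_spf(spf_incomplete), dmarc)
```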

Technical highlights: HTML5 • CSS3 • JavaScript • PHP • PHPMailer • SMTP • SPF/DMARC • FFmpeg • WebCentral cPanel • responsive QA

Status: Live production website with ongoing maintenance support

Visit website

Overview of the Data Biome biotechnology website

Learning

AWS Educate and Amazon Web Services

I am expanding my cloud-computing capability through AWS Educate, using a structured, self-directed learning pathway covering cloud fundamentals, the AWS Management Console, storage, compute, networking, databases, cloud operations, security, serverless computing, artificial intelligence, machine learning, and core AWS concepts.

This training complements my experience in data science, machine learning, bioinformatics, web deployment, and reproducible analytical workflows by strengthening my understanding of cloud infrastructure, scalable computing, managed services, secure resource configuration, and cloud-based analytical environments.

AWS Educate Learning Pathway

Objective: Build a practical foundation in Amazon Web Services and understand how cloud-based storage, compute, networking, databases, security, and deployment services can support modern data-science and scientific-computing workflows.

Develop cloud-computing literacy through structured AWS Educate courses
Understand the purpose and configuration of major AWS services
Strengthen awareness of cloud security, costs, scalability, and reliability
Explore cloud environments relevant to machine learning and data analytics
Connect AWS concepts with bioinformatics, omics, web deployment, and reproducible research

AWS Educate training badges covering cloud computing, storage, compute, networking, databases, cloud operations, security, and serverless computing

Introduction to the AWS Management Console

Training focus: Secure browser-based access to AWS services through the AWS Management Console.

This course introduced console navigation, service discovery, dashboard customisation, account monitoring, AWS global infrastructure, payment models, and the factors influencing service costs.

Purpose and navigation of the AWS Management Console
Finding, launching, and managing AWS services
AWS global infrastructure overview
Account monitoring and service customisation
Payment models and cost-awareness concepts

Introduction to Cloud 101

Training focus: Core cloud-computing concepts and foundational AWS services.

The course introduced cloud service and deployment models, AWS global infrastructure, the shared responsibility model, the AWS Well-Architected Framework, and entry-level cloud career pathways.

Benefits and characteristics of cloud computing
Cloud service and deployment models
AWS global infrastructure
AWS shared responsibility model
AWS Well-Architected Framework principles
Hands-on practice with selected AWS core services

Getting Started with Storage

Training focus: AWS storage services, with particular emphasis on Amazon Simple Storage Service, or Amazon S3.

The course covered object-storage concepts, buckets, objects, storage classes, permissions, security, cost optimisation, and use cases including static websites, backup, archival storage, Internet of Things applications, and big-data analytics.

AWS storage types and their use cases
Amazon S3 buckets, objects, and storage classes
Object management and access configuration
Security and cost-optimisation settings
Using Amazon S3 to host a static website

Getting Started with Compute

Training focus: AWS compute services, with particular emphasis on Amazon Elastic Compute Cloud, or Amazon EC2.

The course introduced compute options, instance families, workload requirements, storage choices, security settings, and the practical launch and management of EC2 instances.

AWS compute types and workload considerations
Amazon EC2 concepts and instance families
Selection of instance types according to workload needs
Launching, configuring, and managing EC2 instances
Compute, storage, and security configuration

Getting Started with Networking

Training focus: Cloud-networking fundamentals, with particular emphasis on Amazon Virtual Private Cloud, or Amazon VPC.

The course covered virtual private clouds, public and private subnets, route tables, gateways, IP addressing, security groups, and network access control lists.

Networking fundamentals and Amazon VPC concepts
Public and private subnet configuration
Route tables and gateway configuration
IP addressing within a virtual network
VPC security using security groups and network ACLs
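
The IP-addressing ideas covered in this course can be explored locally with Python's `ipaddress` module. The sketch below carves a VPC-sized address block into subnets; the 10.0.0.0/16 range and the public/private naming are illustrative assumptions, and no AWS resources are created.

```python
import ipaddress

# A private address block of the size commonly used for a VPC.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Split the block into /24 subnets and pick two for illustration.
subnets = list(vpc.subnets(new_prefix=24))
public_subnet, private_subnet = subnets[0], subnets[1]

# AWS reserves five addresses in every subnet (the first four and the last),
# so a /24 offers 251 usable addresses rather than 256.
usable_per_subnet = public_subnet.num_addresses - 5

host = ipaddress.ip_address("10.0.1.25")
print(public_subnet, private_subnet, usable_per_subnet)
print(host in private_subnet)
```

Whether a subnet is public or private is not a property of its address range; it depends on whether its route table sends traffic to an internet gateway.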

Getting Started with Databases

Training focus: Cloud-database fundamentals, with particular emphasis on Amazon Relational Database Service, or Amazon RDS.

The course introduced relational and non-relational database concepts, AWS database services, the key features of Amazon RDS, database configuration, and the use of SQL to read and write data.

Relational and non-relational database concepts
Overview of AWS database services
Amazon RDS features and use cases
Configuration of a managed relational database
Use of SQL commands to read and write data
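
The read and write operations practised in the course follow the same SQL pattern on any relational engine. The sketch below uses an in-memory SQLite database as a local stand-in for a managed database; the table, column names, and values are invented, and connecting to an actual Amazon RDS instance would require the appropriate database driver and credentials.

```python
import sqlite3

# In-memory database: nothing is written to disk or sent to the cloud.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, protein_mg REAL)"
)

# Write: parameterised inserts keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO samples (name, protein_mg) VALUES (?, ?)",
    [("flour_A", 12.5), ("flour_B", 10.8), ("flour_C", 14.1)],
)
conn.commit()

# Read: filter and sort on the server side.
rows = conn.execute(
    "SELECT name, protein_mg FROM samples "
    "WHERE protein_mg > ? ORDER BY protein_mg DESC",
    (11.0,),
).fetchall()

# Update one record and read it back.
conn.execute("UPDATE samples SET protein_mg = ? WHERE name = ?", (11.2, "flour_B"))
updated = conn.execute(
    "SELECT protein_mg FROM samples WHERE name = ?", ("flour_B",)
).fetchone()[0]

print(rows, updated)
```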

Getting Started with Cloud Operations

Training focus: Cloud-operations principles, service monitoring, cost management, and operational best practice.

The course introduced key elements of the AWS Well-Architected Framework, AWS cost-management tools, cloud-operations services, and their configuration.

Cloud-operations fundamentals
AWS Well-Architected Framework
AWS cost-management tools
Cloud monitoring and operational services
Configuration of selected cloud-operations resources

Getting Started with Security

Training focus: Secure cloud operations, with particular emphasis on AWS Identity and Access Management, or IAM.

The course covered IAM concepts and features, users, groups, roles, permissions, security policies, credential review, and multi-factor authentication.

Cloud-security fundamentals
AWS IAM features, benefits, and use cases
Creation and configuration of IAM users, groups, and roles
Application of IAM policies and permissions
Credential review and multi-factor authentication setup
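
An IAM policy is a JSON document made of statements, each granting or denying actions on resources. The sketch below builds a read-only policy for a single S3 bucket as a Python dictionary and serialises it. The bucket name is hypothetical and the helper `allowed_actions` is only a local inspection aid; it does not evaluate policies the way AWS does.

```python
import json

BUCKET = "example-research-data"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListBucket",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",      # the bucket itself
        },
        {
            "Sid": "ReadObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",    # objects in the bucket
        },
    ],
}


def allowed_actions(policy_document):
    """Collect every action named in an Allow statement."""
    actions = set()
    for statement in policy_document["Statement"]:
        if statement["Effect"] == "Allow":
            actions.update(statement["Action"])
    return actions


policy_json = json.dumps(policy, indent=2)
print(policy_json)
print(sorted(allowed_actions(policy)))
```

Granting only the list and read actions, and only on one bucket, is an example of the least-privilege principle: write and delete operations stay unavailable unless a further statement allows them.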

Getting Started with Serverless

Training focus: Serverless computing, event-driven architecture, and AWS Lambda.

The course explained the benefits of serverless services, the role of microservices, event-driven design, architectural decoupling, and the configuration and monitoring of AWS Lambda functions.

Microservices and serverless-computing concepts
Event-driven architectures
Benefits and use cases of AWS Lambda
Creation and configuration of Lambda functions
Monitoring serverless functions
Use of serverless services to decouple application components
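
In an event-driven design, a Lambda function is simply a handler that receives an event and a context object. The sketch below shows a handler that reacts to S3 object-created notifications; the bucket name, object key, and response shape are illustrative assumptions, and the sample event contains only the fields the handler reads.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Entry point invoked by Lambda; summarises S3 object-created records."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        processed.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "processed": processed}


# Trimmed-down, hypothetical event for local testing.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "uploads/photo+1.jpg"},
            }
        }
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

The uploader and the processing code never call each other directly: the storage service emits an event and the function responds to it, which is the decoupling the course describes.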

Connecting AWS Services

The training has helped me understand how individual AWS services can be combined within a broader cloud architecture rather than used in isolation.

Amazon S3: object storage, static websites, backup, and data repositories
Amazon EC2: configurable cloud-based computing capacity
Amazon VPC: isolated networking environments and controlled connectivity
Amazon RDS: managed relational databases
AWS IAM: users, roles, permissions, and access control
AWS Lambda: serverless and event-driven processing
Cloud operations: monitoring, cost awareness, and architectural best practice

AWS service icons representing Amazon S3 storage, Amazon EC2 compute, and Amazon VPC networking

Hands-On Cloud Practice

In parallel with AWS Educate, I use Amazon SageMaker Studio Lab and an AWS Free Tier account as practical learning environments.

Explore cloud-based notebook environments for Python and machine learning
Practise navigating AWS services and configuration interfaces
Consolidate concepts in storage, compute, networking, databases, and security
Develop familiarity with cloud-resource creation and management
Connect theoretical course material with self-directed practical exercises

Current and Planned Progression

I completed the available Getting Started courses first and am progressing through the Artificial Intelligence and Machine Learning series. I plan to continue with the Core Concepts series.

This sequential pathway supports my broader objective of integrating cloud literacy with data science, machine learning, bioinformatics, omics analytics, reproducible workflows, and scientific web-resource development.

Completed foundational AWS Management Console training
Completed the available Getting Started courses
Progressing through the AI and machine learning training
Planning subsequent Core Concepts training
Applying knowledge through SageMaker Studio Lab and AWS Free Tier practice

Relevance to My Professional Work

Cloud storage for large scientific and analytical datasets
Scalable compute for data science, machine learning, and bioinformatics
Managed databases for structured analytical data
Secure access control for cloud resources and collaborative environments
Cloud deployment of websites, APIs, analytical tools, and scientific resources
Serverless processing for event-driven and lightweight automation workflows
Cost-aware and reproducible design of cloud-based projects

Technical highlights: AWS Educate • AWS Management Console • cloud computing • Amazon S3 • Amazon EC2 • Amazon VPC • Amazon RDS • AWS IAM • AWS Lambda • serverless computing • cloud operations • cloud security • SQL • SageMaker Studio Lab • AWS Free Tier • scalable analytics • machine learning • web deployment

CodeCademy Career Path — Data Science and Analytics

I completed the CodeCademy Career Path: Data Scientist — Analytics Specialist, a comprehensive programme combining theory, practical exercises, assessments, and guided projects.

The course strengthened my skills in SQL, Python 3, Tableau, statistical analysis, data visualisation, exploratory data analysis, and analytical storytelling. It also provided a structured framework for moving from raw data to clear, evidence-based conclusions.

View the CodeCademy course syllabus

Learning Outcomes

Training focus: End-to-end data analysis using SQL, Python, Excel, Tableau, statistical methods, exploratory analysis, data visualisation, and reporting.

Query and manipulate structured data using SQL
Clean, transform, analyse, and visualise data using Python
Apply descriptive and inferential statistical methods
Build exploratory and presentation-ready visualisations
Create interactive dashboards using Tableau
Communicate analytical findings through structured reports and presentations

Workflow representing the CodeCademy Data Scientist Analytics Specialist learning pathway

Application to Large-Scale Wheat Proteogenomics

I applied the Python and data-analysis skills developed through this training to the analysis of a large wheat proteogenomics dataset.

This work involved large-scale data handling, peptide and protein processing, exploratory analysis, visualisation, biological interpretation, and development of reproducible analytical workflows.

Community Resource: Large-Scale Proteogenomics to Refine Wheat Genome Annotations

Final Portfolio Project

Project: Analysis of metadata from approximately 120,000 scientific publications.

The final CodeCademy portfolio project was open-ended, allowing me to select a dataset and define my own analytical questions. I chose a Kaggle dataset containing metadata from scientific publications to investigate publication activity, journals, publishers, authors, citation patterns, publication types, languages, and research subjects.

Define analytical hypotheses and research questions
Assess the quality and structure of the source dataset
Clean and transform publication metadata
Perform exploratory data analysis using Python
Create an interactive Tableau dashboard
Prepare a written analytical report and presentation-ready outputs

1. Analytical Questions and Hypotheses

I began by defining a set of questions to guide data cleaning, feature engineering, exploratory analysis, and visualisation.

How did publication rates change over time?
Which publishers and journals were most strongly represented?
Which authors appeared most frequently?
How did citation numbers change over time?
Did article-title length change over time?
Which publication types were most common?
Which languages were used most frequently?
Which research subjects were dominant, persistent, declining, or emerging?

Brainstorming and hypothesis-development icons for the scientific-publication metadata project

2. Data Assessment and Preparation

Tools: Excel, Python, pandas, Jupyter Notebook, and Tableau.

Initial data assessment helped determine which variables were informative, which fields required transformation, and where missing, inconsistent, or poorly formatted values were present.

Reviewed dataset dimensions, column types, completeness, and value distributions
Selected variables relevant to the project questions
Converted and standardised date fields
Renamed columns using clearer and more meaningful labels
Investigated and handled missing values
Cleaned and transformed article titles
Created new numerical and categorical variables for analysis
Prepared cleaned outputs for Python and Tableau
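
The preparation steps above can be sketched as follows. The project itself used pandas; this sketch uses only the Python standard library so that it runs without extra dependencies. The records, the column names, and the two assumed date formats are invented for illustration and are not taken from the Kaggle dataset.

```python
from datetime import datetime

# Made-up records standing in for the publication metadata
raw_records = [
    {"title": "  Deep   learning for wheat  ", "pub_date": "2007-03-15"},
    {"title": "Proteomics: a review", "pub_date": "15/06/1998"},
    {"title": None, "pub_date": "unknown"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats


def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except (TypeError, ValueError):
            continue
    return None


def clean_record(record):
    """Standardise one record and derive year and title-length features."""
    title = " ".join((record.get("title") or "").split())  # collapse whitespace
    date = parse_date(record.get("pub_date"))
    return {
        "title": title or None,  # keep missing titles explicit
        "publication_date": date.isoformat() if date else None,
        "publication_year": date.year if date else None,
        "title_word_count": len(title.split()) if title else None,
    }


cleaned = [clean_record(r) for r in raw_records]
for row in cleaned:
    print(row)
```

Keeping missing values as explicit None entries, rather than dropping rows at this stage, leaves the decision about how to handle them to the later analysis.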

View the GitHub repository, raw and cleaned datasets, and Jupyter notebook

Data-analysis tools and charts used for publication-metadata processing

3. Exploratory Data Analysis

Exploratory analysis was used to identify long-term trends, dominant categories, unusual observations, relationships between variables, and potential dataset biases.

Line charts for publication and citation trends over time
Bar charts for journals, publishers, authors, languages, and publication types
Bubble plots for multivariable comparisons
Pie charts and treemaps for category composition
Scatterplots for relationships between numerical variables
Histograms and box-and-whisker plots for distributions and outliers
Word clouds and text summaries for publication-title content

Exploratory data-analysis charts generated from scientific-publication metadata

4. Tableau Dashboard and Analytical Report

I transformed the cleaned and analysed dataset into an interactive Tableau dashboard that allows users to explore publication trends, categories, publishers, journals, authors, citation behaviour, and research subjects.

I also prepared a structured report describing the analytical process, visualisations, observations, limitations, and proposed future directions.

Explore the Tableau Public dashboard
Read the project report

Screenshots of the Tableau dashboard and final scientific-publication metadata report

5. Main Findings

Annual publication numbers generally increased from the 1960s onward.
The dataset reached approximately 2,500 publications per year by the mid-1990s.
Publication numbers were unusually high in 2007, which may reflect either a genuine surge in output or a dataset-related bias and requires further investigation.
Journal articles represented approximately 94% of the records.
Proceedings articles represented approximately 3%, and book chapters approximately 2%.
The representation and longevity of journals varied substantially.
Large publishers tended to host a greater number of journals.
English was overwhelmingly the dominant publication language.
Both citation counts and title length showed increasing trends over time.
Frequently represented subjects included social sciences, computer science, electrical engineering, general engineering, and chemistry.
Some research subjects persisted over long periods, whereas others appeared to be emerging or short-lived.

Data-interpretation icons representing conclusions from the publication-metadata analysis

6. Limitations and Interpretation

The project provided valuable insight into the structure of the dataset, but the results must be interpreted in the context of its source, coverage, metadata quality, and possible sampling bias.

The dataset may not represent the complete scientific literature.
Uneven coverage across years may affect apparent publication trends.
The 2007 publication spike requires further validation.
Missing or inconsistent metadata may affect author, journal, and subject summaries.
Citation numbers may be influenced by publication age and database coverage.
Subject labels may overlap or vary in specificity.
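
One simple way to start validating an apparent spike is to compare each year's count with the median of its neighbouring years. The sketch below is only an assumption about how such a check could look, not the method used in the project. The publication years, the window of two years, and the ratio of 2.0 are invented for illustration.

```python
from collections import Counter
from statistics import median

# Invented publication years, not taken from the project dataset
years = [2004] * 10 + [2005] * 11 + [2006] * 12 + [2007] * 30 + [2008] * 13

counts = Counter(years)


def flag_spikes(counts_by_year, window=2, ratio=2.0):
    """Flag years whose count exceeds `ratio` times the median of nearby years."""
    flagged = []
    for year, count in sorted(counts_by_year.items()):
        neighbours = [
            counts_by_year[y]
            for y in range(year - window, year + window + 1)
            if y != year and y in counts_by_year
        ]
        if neighbours and count > ratio * median(neighbours):
            flagged.append(year)
    return flagged


print(flag_spikes(counts))
```

A flagged year is only a prompt for closer inspection, for example checking whether one journal or one data source contributes most of the excess records.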

7. Future Directions

Create Sankey or chord diagrams to explore relationships between publishers, journals, subjects, and publication types
Apply more advanced NLP methods to publication titles
Compare alternative topic-modelling approaches
Retrieve richer metadata including co-authors, keywords, affiliations, and geographic information
Build co-authorship and publication-network visualisations
Investigate temporal changes in research topics in greater detail
Assess and correct potential dataset biases where possible

Future-development icons for extending the scientific-publication metadata project

Professional Relevance

This project demonstrated my ability to independently define an analytical problem and deliver a complete workflow from raw data to interpretable outputs.

Formulate analytical questions and hypotheses
Assess and clean a large real-world dataset
Perform reproducible exploratory data analysis in Python
Integrate numerical, categorical, temporal, and textual variables
Create interactive Tableau dashboards
Translate complex findings into a clear analytical narrative
Document code, data, limitations, and conclusions transparently

Technical highlights: CodeCademy • data science • analytics • Python • pandas • Jupyter Notebook • SQL • Excel • Tableau • exploratory data analysis • data cleaning • feature engineering • text analysis • word clouds • publication metadata • dashboard development • analytical reporting • GitHub • data storytelling

CodeCademy Career Path — Data Science and Machine Learning

I completed the CodeCademy Career Path: Data Scientist — Machine Learning Specialist, a comprehensive programme combining mathematical foundations, statistics, Python programming, machine learning theory, practical exercises, assessments, and applied projects.

The course covered both supervised and unsupervised learning for classification, regression, clustering, recommendation, anomaly detection, and pattern discovery. Topics included ensemble methods, support vector machines, recommender systems, naïve Bayes, deep learning, neural networks, model evaluation, and feature engineering.

View the CodeCademy course syllabus

Learning Outcomes

Training focus: End-to-end machine learning using Python, with emphasis on data preparation, modelling, validation, interpretation, and deployment-oriented thinking.

Apply supervised learning to classification and regression problems
Use unsupervised learning for clustering, latent structure, and anomaly detection
Develop recommender and similarity-based systems
Engineer, encode, transform, and weight model features
Optimise model parameters and compare alternative methods
Evaluate performance using suitable validation metrics
Interpret model outputs and communicate findings clearly
Build interactive interfaces for model results

Brainstorming and machine-learning workflow for the OKCupid final portfolio project

Final Portfolio Project

Project: Unsupervised matching of similar users from an OKCupid Date-a-Scientist dataset.

The final portfolio project required analysing a dating-app dataset and applying machine learning to a self-defined problem. Because the dataset did not contain a target variable representing compatibility or closeness, I designed an unsupervised recommendation workflow.

The objective was to structure the user-profile data, reduce noise, identify groups of broadly similar individuals, and calculate pairwise similarity within those groups so that the application could suggest relevant matches.

Inspect and clean mixed numerical, categorical, and textual data
Use NLP to represent user essays and profile descriptions
Identify and remove anomalous records
Cluster users into groups with similar profile characteristics
Rank potential matches using cosine similarity
Display recommendations through an interactive Gradio interface

1. Study Design

Context: The dataset did not contain a predefined outcome representing compatibility, match quality, or interpersonal closeness.

This ruled out a conventional supervised target-prediction approach and required an unsupervised strategy to organise the profiles and identify similarities.

Rationale: I combined exploratory data analysis, natural language processing, anomaly detection, dimensionality reduction, clustering, and pairwise similarity to create a structured recommendation pipeline.

Use exploratory analysis to understand the user population and feature distributions
Process textual profile information using NLP
Reduce noise and exclude highly atypical profiles
Identify latent groups within the user population
Calculate similarity within clusters to generate candidate matches

Software and analytical tools used for the OKCupid machine-learning project

2. Data Inspection and Cleaning

The dataset contained numerical, categorical, ordinal, nominal, and free-text variables, as well as missing values, inconsistent entries, outliers, and varying levels of profile completeness.

Inspected column types, value ranges, missingness, and category distributions
Identified erroneous, inconsistent, and implausible values
Filtered and normalised numerical variables
Encoded ordinal and nominal categorical variables
Processed free-text profile essays using NLP methods
Created a completeness score reflecting missingness across weighted features
Selected sufficiently complete rows for modelling

View the GitHub repository, cleaned and encoded dataset, and Jupyter notebook

3. Exploratory Data Analysis

Exploratory analysis was used to understand the composition of the user population, identify dominant profile characteristics, evaluate missingness, and reveal broad user archetypes.

Histograms for numerical-variable distributions
Pie charts and bar charts for categorical summaries
Treemaps for hierarchical category representation
Word clouds for textual-profile summaries
Heat maps for associations and profile comparison
Identification of common behavioural, demographic, and lifestyle patterns
Interpretation of the dominant archetypes represented in the dataset

Exploratory data-analysis charts generated from the OKCupid user-profile dataset

4. Text Processing and Topic Modelling

The dataset included ten essay-style profile fields containing free text. These fields were combined and analysed using Latent Dirichlet Allocation (LDA).

Cleaned and concatenated profile essays
Converted text into a modelling-ready representation
Optimised LDA parameters
Identified 30 main topics across the profile corpus
Used topic outputs as structured features for downstream modelling
Reduced the dimensional complexity of the raw text fields

5. Feature Weighting and Profile Completeness

Features were weighted according to their perceived relevance to matching, allowing more important profile characteristics to contribute more strongly to similarity calculations.

Assigned relative importance to selected features
Combined numerical, categorical, and topic-derived variables
Calculated a row-level completeness score
Prioritised profiles with sufficient information for reliable comparison
Reduced the influence of sparse and poorly documented profiles
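
A row-level completeness score of this kind can be sketched as the weighted share of features that are present in a profile. The feature names, the weights, and the 0.6 retention threshold below are hypothetical; the values used in the project are not stated here.

```python
# Hypothetical feature weights; the project's actual weights are not shown here
FEATURE_WEIGHTS = {"age": 1.0, "diet": 0.5, "education": 1.0, "essay_topics": 2.0}


def completeness_score(profile, weights=FEATURE_WEIGHTS):
    """Weighted share of features that are present (0.0 = empty, 1.0 = complete)."""
    total = sum(weights.values())
    present = sum(
        weight
        for feature, weight in weights.items()
        if profile.get(feature) not in (None, "")
    )
    return present / total


# Invented profiles for illustration
profiles = [
    {"id": 1, "age": 31, "diet": "vegetarian", "education": "masters", "essay_topics": [0.2, 0.8]},
    {"id": 2, "age": 45, "diet": None, "education": "", "essay_topics": [0.6, 0.4]},
    {"id": 3, "age": None, "diet": None, "education": None, "essay_topics": None},
]

THRESHOLD = 0.6  # assumed cut-off for "sufficiently complete"
retained = [p["id"] for p in profiles if completeness_score(p) >= THRESHOLD]
print(retained)
```

Weighting the score means that a profile missing a low-weight field is penalised less than one missing a field that matters more for matching.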

6. Correlation Analysis and Missing-Value Support

A correlation matrix was used to examine relationships between variables, identify potential redundancy, and support missing-value decisions.

Assessed relationships between numerical and encoded variables
Confirmed the absence of severe collinearity
Identified associations potentially useful for imputation
Evaluated whether strongly related variables carried redundant information
Supported downstream feature-selection and preprocessing decisions
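
As a minimal sketch of a collinearity check, the code below computes pairwise Pearson correlations and lists the variable pairs whose absolute correlation exceeds a threshold. The columns, their values, and the 0.9 threshold are invented for illustration; in practice a library such as pandas would normally compute the full correlation matrix.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented columns for illustration only
columns = {
    "age": [22, 30, 41, 55, 63],
    "height_cm": [170, 165, 180, 175, 168],
    "age_in_months": [264, 360, 492, 660, 756],  # deliberately redundant with age
}

THRESHOLD = 0.9  # assumed cut-off for "severe" collinearity
redundant_pairs = [
    (a, b)
    for a, b in combinations(columns, 2)
    if abs(pearson(columns[a], columns[b])) > THRESHOLD
]
print(redundant_pairs)
```

Pairs flagged in this way are candidates for dropping one variable, while moderately correlated pairs can still be useful when imputing missing values.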

7. Anomaly Detection

I used Isolation Forest with optimised parameters to detect and remove atypical observations that could introduce noise into clustering and matching.

Calculated anomaly scores for user profiles
Optimised Isolation Forest parameters
Identified unusually structured or extreme profiles
Reduced noise before dimensionality reduction and clustering
Improved the coherence of the remaining modelling dataset

8. Dimensionality Reduction and Clustering

I used UMAP to visualise the high-dimensional profile space and HDBSCAN to identify naturally occurring groups of users.

Reduced high-dimensional data into a lower-dimensional representation
Optimised HDBSCAN clustering parameters
Identified five user groups of unequal size
Visualised the cluster structure using UMAP
Used cluster membership to constrain subsequent matching
Reduced the complexity of all-against-all profile comparison

9. Similarity-Based Matching

Potential matches were identified within HDBSCAN clusters using cosine similarity.

Calculated pairwise similarity between profiles within each cluster
Ranked candidate matches from most to least similar
Limited comparison to broadly compatible profile groups
Generated up to 20 closest matches per selected user
Evaluated matching performance using an F1 score
Achieved an F1 score of 90% under the project evaluation framework
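
The within-cluster ranking step can be sketched as follows. This is a dependency-free illustration, not the project code, which would typically rely on a library implementation of cosine similarity. The feature vectors, user identifiers, and cluster labels are invented, and top_matches is a hypothetical helper name.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_matches(user_id, vectors, clusters, k=20):
    """Rank other users in the same cluster by similarity to `user_id`."""
    candidates = [
        other
        for other in vectors
        if other != user_id and clusters[other] == clusters[user_id]
    ]
    scored = [
        (other, cosine_similarity(vectors[user_id], vectors[other]))
        for other in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented feature vectors and cluster labels
vectors = {
    "u1": [1.0, 0.0, 1.0],
    "u2": [1.0, 0.1, 0.9],
    "u3": [0.0, 1.0, 0.0],
    "u4": [1.0, 0.0, 1.0],
}
clusters = {"u1": 0, "u2": 0, "u3": 0, "u4": 1}

for other, score in top_matches("u1", vectors, clusters):
    print(other, round(score, 3))
```

In this toy example, u4 has the same vector as u1 but is never suggested because it belongs to a different cluster, which shows how cluster membership constrains the comparison.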

Machine-learning workflow for topic modelling, anomaly detection, clustering, and similarity-based matching

10. Interactive Gradio Interface

I developed an interactive Gradio interface to make the matching results easier to explore.

Select a user profile by identifier
Retrieve up to 20 closest matches
Display ranked similarity results
Compare selected profile characteristics
Present model outputs without requiring direct interaction with the notebook code

11. Tableau Dashboard and Project Report

I used Tableau and PowerPoint to communicate the structure of the dataset, main user archetypes, modelling workflow, cluster behaviour, matching results, limitations, and future-development options.

Explore the Tableau Public dashboard
Read the project report

Screenshots of the Tableau dashboard and final OKCupid machine-learning report

12. Main Findings

Exploratory analysis revealed several dominant user archetypes within the dataset.
LDA converted unstructured profile essays into 30 interpretable topic features.
Isolation Forest reduced noise by identifying atypical profiles.
HDBSCAN identified five clusters of unequal size within the user population.
Cosine similarity enabled ranked matching between users within clusters.
The matching workflow achieved an F1 score of 90% under the selected evaluation approach.
Heat maps, word clouds, and interactive tables helped interpret individual profiles and candidate matches.
The Gradio interface made model outputs accessible without requiring programming knowledge.

Data-interpretation graphics summarising the OKCupid machine-learning results

13. Limitations

No ground-truth compatibility variable was available.
Feature weighting introduced subjective modelling choices.
Profile completeness influenced which rows were retained.
Outlier removal may have excluded valid but uncommon profiles.
User essays were concatenated before topic modelling, reducing distinction between individual essay types.
Geographic proximity was not incorporated into the matching model.
Model validation was constrained by the absence of real match outcomes.

14. Future Directions

Reduce categorical granularity by combining similar labels
Compare weighted and unweighted feature representations
Test alternative feature-weighting strategies
Retain all rows and compare multiple imputation approaches
Evaluate mean, k-nearest-neighbour, and model-based imputation
Analyse outliers rather than automatically excluding them
Use UMAP projections as additional matching features
Retrieve geographic coordinates for towns and include location proximity
Model the ten essay variables separately rather than concatenating them
Compare LDA with contextual transformer approaches such as BERT
Test the workflow with new fictional or publicly available user profiles
Extend the Gradio interface to search by desired traits rather than profile identifier

Future-development icon for extending the OKCupid recommendation workflow

Professional Relevance

This project demonstrated my ability to design and implement a complete unsupervised machine-learning and recommendation workflow using mixed structured and unstructured data.

Define a modelling strategy when no target variable is available
Integrate EDA, NLP, anomaly detection, clustering, and similarity modelling
Handle numerical, categorical, ordinal, and textual profile features
Optimise several models within one analytical pipeline
Develop an interpretable recommendation system
Create an interactive interface for non-technical users
Communicate methods, limitations, and findings through dashboards and reporting

Technical highlights: CodeCademy • machine learning • Python • pandas • scikit-learn • exploratory data analysis • NLP • LDA • feature engineering • missing-value analysis • Isolation Forest • anomaly detection • UMAP • HDBSCAN • clustering • cosine similarity • recommender systems • F1 score • Gradio • Tableau • GitHub • data storytelling

CodeCademy Skill Path — Build a Website

I completed the CodeCademy Skill Path: Build a Website with HTML, CSS and GitHub Pages, a practical course covering website structure, styling, responsive design, development workflows, and publication through GitHub Pages.

After completing the course, I applied these skills by designing, coding, testing, and publishing my professional website—the website you are currently browsing.

View the CodeCademy course syllabus

Learning Outcomes

Training focus: Build and publish a responsive website using HTML, CSS, JavaScript, GitHub, and GitHub Pages.

Structure web pages using semantic HTML
Style layouts, typography, colour, spacing, and media using CSS
Use Flexbox and responsive rules to support different screen sizes
Add interactive behaviour using JavaScript
Test and debug web pages locally
Use Git and GitHub for version control
Publish a static website through GitHub Pages

Professional Website Project

I used the course as the foundation for creating a complete professional website presenting my biography, technical skills, scientific background, data-science projects, learning activities, publications, resume, and consulting services.

Planned the information architecture and navigation structure
Organised content into Biography, Skills, Portfolio, Learning, and Certificates sections
Created interactive tab-based content to manage long sections
Integrated images, downloadable documents, external profiles, and project links
Developed responsive layouts for desktop, tablet, and mobile use
Published and maintained the website through GitHub Pages

Brand and Visual Identity

I used the website project to develop a personal visual identity based on a simple, high-contrast design with vivid colours and recognisable branding.

Created a business-card-style header displaying my full name
Developed a favicon using my initials
Applied a consistent purple, pink, yellow, and light-violet colour palette
Used recurring card, border, heading, and button styles throughout the site
Maintained visual consistency across the website, downloadable documents, and profile imagery

Personal website brand and visual identity elements

Development Environment

I developed the website using Visual Studio Code with the Live Server extension, allowing me to edit, preview, test, and debug the site locally.

HTML: page structure, semantic sections, links, forms, images, and media
CSS: layout, colours, typography, responsive design, and interactive states
JavaScript: navigation, tab switching, mobile behaviour, and image viewing
Live Server: local browser preview during development
Browser developer tools: inspection, responsive testing, and debugging

Visual Studio Code, GitHub and GitHub Pages workflow used to develop and publish the website

Version Control and Publication

Once changes were tested locally, I used GitHub Desktop and GitHub Pages to manage versions and publish the website.

Maintained the website files in a structured GitHub repository
Tracked changes through commits
Reviewed modifications before pushing updates
Published the static website through GitHub Pages
Maintained linked website and chatbot-backend repositories where required

Navigation System

The website includes a fixed navigation bar that provides direct access to the main sections.

Remains visible at the top of the page while scrolling
Uses partial transparency so underlying content remains visible
Highlights selected or hovered links through colour changes
Collapses into a toggle menu on smaller screens
Provides direct access to the resume and publication list

Responsive fixed navigation bar used on the professional website

Interactive Tabbed Content

Long sections are divided into selectable tabs so that users can focus on one topic at a time instead of facing an excessively long initial page.

Hidden content panels are revealed through clickable buttons
Selected buttons remain visually highlighted
On smaller screens, selected content appears directly beneath its corresponding button
Long desktop sections remain scrollable within their content panels
Internal links can open relevant Biography, Skills, Portfolio, or Learning subsections

Interactive tabbed sections used to display website content

Responsive Design

The website uses responsive CSS rules to maintain readability and visual balance across desktop monitors, tablets, and smartphones.

Flexible section widths and containers
Responsive images and media
Reflowing navigation and social links
Stacked mobile layouts for complex sections
Adaptive typography using relative and clamped font sizes
Orientation-aware behaviour on phones and tablets

Images and Full-Screen Viewing

Images are used throughout the website to improve visual communication and demonstrate my scientific, analytical, technical, and creative work.

Illustrative images accompany major skills and project sections
Images resize responsively within their containers
Users can open selected images in a full-screen viewer
Images can be zoomed and panned horizontally and vertically
The viewer adapts when a phone or tablet changes orientation
Closing the viewer returns users to their original position on the page

Links and Downloadable Resources

The website provides direct access to publications, presentations, dashboards, source-code repositories, professional profiles, and downloadable documents.

External resources open in new browser tabs
Links change colour on hover, focus, or selection
Resume and publication files are accessible from the main navigation
Project cards include links to GitHub, Tableau, articles, reports, and presentations
Internal links connect related information across the website

Accessibility

I incorporated accessibility features to improve navigation and interpretation for users with different needs and input methods.

Meaningful alternative text for informative images
Presentation roles where images are decorative
Keyboard access to expandable images
ARIA labels for image-viewer controls
Visible focus states for interactive elements
High-contrast colours for headings, buttons, and links
Responsive layouts supporting zoom and small screens

Accessibility, responsive design and interactive website features

Ongoing Website Maintenance

The website is an actively maintained professional resource rather than a static course exercise.

Update skills, projects, publications, training, and professional roles
Improve design and mobile usability based on testing
Review and correct links, filenames, and downloadable resources
Synchronise relevant content changes with the website chatbot backend
Test deployment after updates
Use versioned releases and dated files where appropriate

Professional Relevance

This project demonstrated my ability to move from introductory web-development training to the design, deployment, maintenance, and continuous improvement of a complete professional website.

Translate content requirements into an organised website structure
Develop responsive pages using HTML, CSS, and JavaScript
Create interactive and accessible components
Manage version control and deployment through GitHub
Test website behaviour across devices and screen sizes
Maintain linked front-end and backend resources
Use the website as an evolving professional communication platform

Technical highlights: CodeCademy • HTML5 • CSS3 • JavaScript • responsive web design • Flexbox • Visual Studio Code • Live Server • Git • GitHub Desktop • GitHub Pages • semantic HTML • accessibility • ARIA • mobile navigation • interactive tabs • image lightbox • version control • website maintenance

CodeCademy Skill Path — Build a Chatbot in Python

I completed the CodeCademy Skill Path: Build a Chatbot in Python, a practical course combining Python programming, natural language processing, data science, machine learning, and artificial intelligence.

The course introduced several chatbot architectures, from deterministic rule-based systems to retrieval-based approaches and more advanced generative models.

View the CodeCademy course syllabus

Learning Outcomes

Training focus: Design and implement chatbot systems using Python, NLP, machine learning, retrieval methods, and conversational logic.

Understand the differences between rule-based, retrieval-based, and generative chatbots
Build pattern-matching and intent-recognition workflows
Process and compare text using NLP techniques
Retrieve relevant answers from structured knowledge sources
Apply machine learning to conversational and text-based problems
Design chatbot responses and fallback behaviour
Connect a chatbot interface to a Python backend
Consider privacy, deployment, maintenance, and future extensibility

Python, natural language processing, machine learning and chatbot technologies

Chatbot Architectures Covered

Rule-based chatbots: use predefined patterns, keywords, commands, and decision rules
Retrieval-based chatbots: select the most relevant response from a structured collection of answers
Deep-learning and generative chatbots: use trained neural models to generate or predict conversational responses

Course Projects

The course included several guided projects covering conversational systems, language processing, and text-based machine learning.

View the chatbot notebooks on GitHub
View the machine-translation notebooks on GitHub
View the Twitter text-analysis notebooks on GitHub

Capstone Project

Project: Development of a closed-domain chatbot for my professional website.

I used the course as the foundation for designing and deploying a custom chatbot that helps visitors navigate my website and retrieve concise information about my biography, skills, portfolio, learning activities, publications, and professional experience.

Develop a custom Python backend
Integrate a browser-based JavaScript chat widget
Recognise navigation requests and frequently asked questions
Handle spelling errors and informal wording
Retrieve relevant content from the website
Provide direct links to related sections and documents
Deploy and maintain the chatbot independently from the static website

1. Python Backend and FastAPI

I developed a custom backend using Python and FastAPI to receive chatbot requests, validate inputs, process user queries, and return structured responses.

Created API endpoints for chatbot requests
Used structured request and response models
Separated the chatbot logic from the static website frontend
Implemented error handling and fallback responses
Designed the codebase so individual retrieval components could be updated independently
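
As an illustration only, the request validation, structured response, and fallback behaviour listed above can be sketched without any framework. Every name here (ChatRequest, ChatResponse, handle_chat, the 500-character limit, the placeholder answers) is hypothetical and not taken from the actual backend; in a FastAPI application the two models would typically be Pydantic classes attached to a POST route.

```python
from dataclasses import asdict, dataclass, field

MAX_MESSAGE_LENGTH = 500  # assumed limit, for illustration only


@dataclass
class ChatRequest:
    """Incoming payload: a single visitor message."""
    message: str

    def __post_init__(self):
        # Validate the input before any chatbot logic runs.
        if not isinstance(self.message, str):
            raise ValueError("message must be a string")
        self.message = self.message.strip()
        if not self.message:
            raise ValueError("message must not be empty")
        if len(self.message) > MAX_MESSAGE_LENGTH:
            raise ValueError("message is too long")


@dataclass
class ChatResponse:
    """Outgoing payload: answer text, the stage that produced it, and links."""
    answer: str
    source: str
    links: list = field(default_factory=list)


def handle_chat(payload: dict) -> dict:
    """Validate the payload and always return a structured response."""
    try:
        request = ChatRequest(message=payload.get("message", ""))
    except ValueError as exc:
        return asdict(ChatResponse(answer=f"Invalid request: {exc}", source="error"))

    # Stand-in for the real query processing: one hard-coded rule.
    if "resume" in request.message.lower():
        return asdict(ChatResponse(
            answer="Placeholder answer pointing to the resume.",
            source="rule",
            links=["#resume"],
        ))

    # Fallback when nothing matched.
    return asdict(ChatResponse(
        answer="Sorry, I could not find an answer to that question.",
        source="fallback",
    ))
```

Returning the same response shape for successes, errors, and fallbacks keeps the frontend code simple, because the widget only ever has to read one structure.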

2. Render Deployment

The FastAPI backend is deployed on Render, allowing the GitHub Pages website to communicate with a separately hosted Python application.

Connected the backend repository to Render
Configured automatic builds from GitHub commits
Hosted the chatbot API independently from the website
Monitored deployment status following backend updates
Tested the deployed endpoint after each significant change

3. JavaScript Chat Widget

I integrated a custom JavaScript chat widget into the professional website so visitors can interact with the chatbot directly from any section of the page.

Expandable chatbot launcher positioned at the bottom-right corner
Text-input field for visitor questions
Asynchronous communication with the FastAPI backend
Display of user and chatbot messages within the widget
Clickable links to relevant website sections and documents
Graceful handling of slow responses or backend errors

4. Multi-Stage Intent Recognition

The chatbot uses several complementary methods rather than relying on a single matching strategy.

Rule-based patterns: recognise direct commands, navigation requests, greetings, and common question types
Fuzzy string matching: uses RapidFuzz to interpret spelling mistakes, partial matches, and informal phrasing
FAQ retrieval: searches a structured faqs.json file for predefined answers
Website-content retrieval: searches indexed website sections when no direct FAQ answer is available
Fallback handling: provides a useful response when no sufficiently relevant match is found
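
The staged approach above can be sketched as a simple cascade in which each stage either answers or passes the query on. This is a minimal sketch under stated assumptions: the rules, FAQ entries, threshold, and function names are invented for illustration, and the standard-library difflib module stands in for RapidFuzz so that the example runs without extra dependencies.

```python
import re
from difflib import SequenceMatcher

# Hypothetical rules and FAQ entries, for illustration only.
RULES = [
    (re.compile(r"\b(hello|hi|hey)\b"), "Hello! Ask me about the website."),
    (re.compile(r"\b(resume|cv)\b"), "Placeholder: link to the resume."),
]
FAQS = {
    "what services do you offer": "Placeholder: summary of services.",
    "how can i contact you": "Placeholder: pointer to the contact form.",
}
FUZZY_THRESHOLD = 0.8  # assumed cut-off on a 0..1 similarity scale


def normalise(text):
    """Lower-case, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def match_rule(query):
    """Stage 1: direct patterns such as greetings and navigation keywords."""
    for pattern, reply in RULES:
        if pattern.search(query):
            return reply
    return None


def match_faq(query):
    """Stage 2: fuzzy comparison against stored FAQ phrasings."""
    best_question, best_score = None, 0.0
    for question in FAQS:
        score = SequenceMatcher(None, query, question).ratio()
        if score > best_score:
            best_question, best_score = question, score
    return FAQS[best_question] if best_score >= FUZZY_THRESHOLD else None


def search_website(query):
    """Stage 3: placeholder for website-content retrieval (always misses here)."""
    return None


def answer(text):
    """Try each stage in order; fall back if none is confident enough."""
    query = normalise(text)
    stages = (("rule", match_rule), ("faq", match_faq), ("website", search_website))
    for name, stage in stages:
        reply = stage(query)
        if reply is not None:
            return name, reply
    return "fallback", "Sorry, I could not find that. Try asking about skills or projects."
```

For example, answer("What servises do you ofer?") is resolved by the FAQ stage despite the spelling mistakes, while an unrelated string falls through to the fallback message.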

5. Structured FAQ Knowledge Base

Frequently asked questions are stored in a structured faqs.json file so that common queries can be answered consistently.

Questions and answers stored separately from the main application logic
Support for multiple phrasings of similar questions
Direct links to relevant website sections where appropriate
Easy maintenance without rewriting the entire backend
Reusable content for navigation, professional information, and project summaries
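
One possible layout for such a file is sketched below. The schema (id, questions, answer, link) and the entries are assumptions made for illustration; the actual structure of faqs.json is not described here.

```python
import json

# Hypothetical faqs.json content, embedded as a string so the example is
# self-contained. Each entry groups several phrasings under one answer.
FAQS_JSON = """
[
  {
    "id": "resume",
    "questions": ["where can i find your resume", "do you have a cv"],
    "answer": "Placeholder answer about the resume.",
    "link": "#resume"
  },
  {
    "id": "services",
    "questions": ["what services do you offer", "can i hire you"],
    "answer": "Placeholder answer about services.",
    "link": "#work-with-me"
  }
]
"""


def build_lookup(raw_json):
    """Map every known phrasing to its FAQ entry."""
    lookup = {}
    for entry in json.loads(raw_json):
        for phrasing in entry["questions"]:
            lookup[phrasing.lower().strip()] = entry
    return lookup


lookup = build_lookup(FAQS_JSON)
```

Because the answers live in data rather than code, adding a new phrasing or correcting an answer only requires editing the JSON file.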

6. TF-IDF Website Retrieval

I added a lightweight TF-IDF retrieval system that indexes the main website sections and identifies the content most relevant to a visitor’s query.

Indexing of Biography, Skills, Portfolio, Learning, and related website content
Conversion of website text into TF-IDF vectors
Similarity comparison between user queries and indexed content
Selection of the most relevant section or passage
Generation of concise answers based on retrieved website content
Inclusion of direct “Read more” links for further detail
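
The retrieval steps above can be illustrated with a small pure-Python TF-IDF and cosine-similarity sketch. The section texts, the minimum-score threshold, and all function names are invented for the example, and it uses a smoothed IDF formula (the same form scikit-learn applies by default); a library vectoriser would normally replace this hand-written version.

```python
import math
import re
from collections import Counter

# Invented placeholder texts standing in for indexed website sections.
SECTIONS = {
    "biography": "research scientist career in proteomics",
    "skills": "python statistics and data visualisation skills",
    "portfolio": "machine learning projects and dashboards",
}


def tokenise(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorise(tokens, idf):
    """Turn tokens into an L2-normalised TF-IDF vector (a dict)."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(sections):
    """Compute smoothed IDF weights and one vector per section."""
    tokenised = {name: tokenise(text) for name, text in sections.items()}
    n_docs = len(tokenised)
    doc_freq = Counter(t for tokens in tokenised.values() for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {name: vectorise(tokens, idf) for name, tokens in tokenised.items()}
    return idf, vectors


def cosine(a, b):
    """Dot product of two already-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def best_section(query, idf, vectors, min_score=0.1):
    """Return (section, score), or (None, score) if nothing is relevant enough."""
    q = vectorise(tokenise(query), idf)
    name, score = max(((n, cosine(q, v)) for n, v in vectors.items()),
                      key=lambda pair: pair[1])
    return (name, score) if score >= min_score else (None, score)


idf, vectors = build_index(SECTIONS)
```

A query such as "proteomics career" is matched to the biography placeholder, while a query with no known vocabulary returns no section and can be handed to the fallback stage.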

7. Website Snapshot

The backend maintains a local HTML snapshot of the professional website so that the chatbot can index website content without repeatedly loading the live site.

Website content copied into a backend snapshot file
Text extracted from the relevant HTML sections
Navigation, interface controls, and non-content elements excluded where possible
Snapshot updated when substantive website content changes
Purely visual or interaction-only website changes do not require reindexing
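
As a rough sketch of the extraction step, the standard-library HTML parser can collect text per section while skipping non-content elements. The example assumes that content lives in non-nested section elements with id attributes and that nav, script, style, button, and form elements should be ignored; the real page structure may differ.

```python
from html.parser import HTMLParser

# Elements assumed to hold navigation or interface controls rather than content.
SKIP_TAGS = {"nav", "script", "style", "button", "form"}


class SectionTextExtractor(HTMLParser):
    """Collect visible text for each <section id="..."> in a snapshot."""

    def __init__(self):
        super().__init__()
        self.sections = {}
        self._current = None   # id of the section being read, if any
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "section":
            section_id = dict(attrs).get("id")
            if section_id:
                self._current = section_id
                self.sections[section_id] = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "section":
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._current and not self._skip_depth:
            self.sections[self._current].append(text)


def extract_sections(html):
    parser = SectionTextExtractor()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.sections.items()}


# Tiny invented snapshot used to exercise the extractor.
SNAPSHOT = """
<nav><a href="#skills">Skills</a></nav>
<section id="skills"><h2>Skills</h2><p>Python and statistics.</p>
<button>Expand</button><script>var x = 1;</script></section>
<section id="portfolio"><h2>Portfolio</h2><p>Project summaries.</p></section>
"""

extracted = extract_sections(SNAPSHOT)
```

The resulting dictionary of section names and plain text is the kind of input a retrieval index can be built from.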

8. Reindexing and Hot Reloading

I created an administrative /reindex endpoint that allows the chatbot knowledge resources to be refreshed without rebuilding the full application.

Reload the structured FAQ data
Re-extract content from the website snapshot
Rebuild the TF-IDF index
Apply content updates without changing the frontend widget
Reduce maintenance time when website information changes
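
One way to structure such a refresh is to rebuild every resource first and only then swap the new state in, so that a failed reload leaves the previous data in place. The class below is a hypothetical sketch: KnowledgeStore, its loader callables, and the version counter are illustrative names, and in a web application the reindex method would be called from the administrative route.

```python
import threading


class KnowledgeStore:
    """Holds the chatbot's knowledge resources and swaps them in one step."""

    def __init__(self, load_faqs, load_sections, build_index):
        self._load_faqs = load_faqs
        self._load_sections = load_sections
        self._build_index = build_index
        self._lock = threading.Lock()
        self.state = {"faqs": [], "sections": {}, "index": None, "version": 0}

    def reindex(self):
        # Build everything before touching the live state; if any loader
        # raises, the exception propagates and the old state stays in use.
        faqs = self._load_faqs()
        sections = self._load_sections()
        index = self._build_index(sections)
        with self._lock:
            version = self.state["version"] + 1
            self.state = {
                "faqs": faqs,
                "sections": sections,
                "index": index,
                "version": version,
            }
        # Small summary that an endpoint could return to the caller.
        return {"faqs": len(faqs), "sections": len(sections), "version": version}


# Usage with stand-in loaders (real ones would read the FAQ file and snapshot).
store = KnowledgeStore(
    load_faqs=lambda: [{"id": "resume"}],
    load_sections=lambda: {"skills": "placeholder text"},
    build_index=lambda sections: sorted(sections),
)
summary = store.reindex()
```

Request handlers read store.state once per request, so a reindex that finishes mid-request does not change the data a handler is already using.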

9. Navigation and Content-Aware Responses

The chatbot is designed to provide both concise answers and practical navigation support.

Answer questions about my background, skills, and projects
Direct users to relevant Biography, Skills, Portfolio, or Learning subsections
Provide links to my resume and publication list
Help visitors locate project repositories, dashboards, and reports
Recognise common project names and technical terms
Return contextual “Read more” links when additional detail is available

10. Privacy-First Contact Workflow

I designed the chatbot to avoid collecting or reproducing unnecessary personal information.

Mask personal information where relevant
Avoid exposing private contact details through chatbot responses
Direct genuine professional enquiries to approved website contact pathways
Provide access to professional information through the embedded resume link
Keep the chatbot focused on public, website-based content
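
Masking of this kind can be approximated with regular expressions, as in the sketch below. The patterns and replacement labels are assumptions chosen for illustration; they are deliberately rough and would need tuning before being relied on for real data.

```python
import re

# Rough illustrative patterns, not an exhaustive detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, at least eight digits/separators, then a closing digit (10+ characters),
# which avoids matching short numbers such as year ranges.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{8,}\d(?!\w)")


def mask_personal_info(text):
    """Replace email addresses and phone-like numbers with neutral labels."""
    text = EMAIL_RE.sub("[email removed]", text)
    return PHONE_RE.sub("[phone removed]", text)


masked = mask_personal_info(
    "Email me at jane.doe@example.com or call +61 400 000 000."
)
```

Applying the same function to both incoming messages and outgoing answers gives a simple safeguard in each direction.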

11. Automated Deployment Workflow

The chatbot uses a linked GitHub and Render workflow for version control, deployment, and maintenance.

Backend code maintained in a dedicated GitHub repository
Changes committed and pushed using GitHub Desktop
Render automatically rebuilds the backend following relevant commits
Deployment status checked through the Render dashboard
Live chatbot endpoint tested following deployment
Website and backend repositories updated together when content dependencies change

12. Modular Code Structure

The backend was structured so that retrieval, FAQ handling, intent recognition, indexing, and response generation could be maintained as separate components.

Clear separation of configuration, retrieval, and API logic
Reusable functions for matching and ranking responses
Independent FAQ and website-content resources
Maintainable deployment configuration
Architecture that can support future retrieval or generative components

Current Chatbot Capabilities

The website chatbot currently combines deterministic and retrieval-based methods to provide context-aware responses grounded in the website content.

Rule-based navigation and command handling
Fuzzy matching for spelling and phrasing variation
Structured FAQ retrieval
TF-IDF website-content retrieval
Direct links to relevant website sections
Backend reindexing without full redeployment
Independent frontend and backend deployment

Try the Chatbot

The chatbot widget is located in the bottom-right corner of this website. Open the widget and try a question such as:

“What is your OKCupid project about?”
“What machine learning skills do you have?”
“Tell me about your wheat proteogenomics work.”
“Where can I find your resume?”
“What consulting services do you offer?”

Potential Future Development

Add conversational context across multiple messages
Improve topic recognition and query reformulation
Expand the indexed website content dynamically
Evaluate embedding-based retrieval
Develop a retrieval-augmented generation workflow
Integrate an appropriate language model for richer responses
Retain source grounding and direct links to website content
Continue prioritising privacy, transparency, and maintainability

Professional Relevance

This project demonstrates my ability to move from guided chatbot training to the design, deployment, maintenance, and continuous improvement of a complete web-integrated conversational application.

Design a closed-domain chatbot around a defined knowledge source
Combine rules, fuzzy matching, FAQ retrieval, and document retrieval
Develop and deploy a FastAPI backend
Integrate a JavaScript frontend with a remote Python service
Maintain linked website and backend repositories
Implement reindexing and automated deployment workflows
Build a modular foundation for future RAG or LLM development

Technical highlights: Codecademy • Python • FastAPI • JavaScript • chatbot development • natural language processing • rule-based systems • retrieval-based chatbots • RapidFuzz • fuzzy matching • JSON • FAQ retrieval • TF-IDF • cosine similarity • website indexing • Render • GitHub • API deployment • reindexing • privacy-aware design • retrieval-augmented generation

Codecademy Certification — Learn Prompt Engineering

I completed the Codecademy course Learn Prompt Engineering, which introduced practical techniques for communicating effectively with generative artificial intelligence systems.

The course strengthened my ability to provide clear context, define constraints, structure complex requests, refine outputs iteratively, and use generative AI as a productive assistant within analytical, scientific, technical, and creative workflows.

View the Codecademy course syllabus

Learning Outcomes

Training focus: Design prompts that give generative AI systems sufficient context, clear instructions, appropriate examples, and well-defined output requirements.

Set context and define the role of the AI system
State objectives, constraints, assumptions, and expected outputs clearly
Use zero-shot, one-shot, and few-shot prompting
Break complex tasks into smaller, manageable subtasks
Use examples to guide tone, structure, formatting, and reasoning
Iteratively refine prompts based on the quality of previous outputs
Introduce external information through retrieval-augmented generation
Evaluate AI-generated outputs for accuracy, relevance, and completeness

Prompt-engineering concepts for interacting effectively with generative artificial intelligence systems

Setting the Context

Effective prompting begins by giving the AI system enough background to understand the task, audience, subject area, and intended outcome.

Describe the project and its broader purpose
Identify the intended audience
Provide relevant technical or scientific background
Define the desired tone and level of detail
Explain which information should be prioritised or excluded
Specify whether the task involves writing, analysis, coding, editing, or planning

Zero-Shot Prompting

Zero-shot prompting asks the AI system to complete a task using only the instruction and context provided, without supplying a worked example.

Useful for familiar and clearly defined tasks
Requires precise wording and explicit constraints
Suitable for summarisation, rewriting, classification, and straightforward coding tasks
Works best when the expected output format is clearly specified

One-Shot and Few-Shot Prompting

One-shot and few-shot prompting provide one or more examples to demonstrate the desired style, structure, terminology, or decision pattern.

Provide examples of preferred wording or formatting
Show how categories or labels should be applied
Demonstrate the expected level of technical detail
Improve consistency across repeated outputs
Guide the AI when instructions alone may be ambiguous
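
The difference between zero-shot, one-shot, and few-shot prompts can be shown with a small prompt builder. This is a sketch only: the function name, the Text/Label layout, and the example questions are invented, and no particular model or API is assumed.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt from an instruction, worked examples, and a new input.

    With no examples the result is a zero-shot prompt; with one example it is
    one-shot; with several it is few-shot.
    """
    lines = [instruction.strip(), ""]
    for text, label in examples:
        lines += [f"Text: {text}", f"Label: {label}", ""]
    # The final item is left unlabelled for the model to complete.
    lines += [f"Text: {query}", "Label:"]
    return "\n".join(lines)


INSTRUCTION = "Classify each visitor question as navigation or content."
EXAMPLES = [
    ("Where can I find the resume?", "navigation"),
    ("What does the chatbot project involve?", "content"),
]

prompt = build_few_shot_prompt(INSTRUCTION, EXAMPLES, "Take me to the portfolio")
```

Keeping the examples in a list makes it easy to test how many, and which, examples are needed before the outputs become consistent.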

Breaking Complex Tasks into Subtasks

Large or multidisciplinary tasks tend to produce more reliable results when divided into a clear sequence of smaller steps.

Separate planning from implementation
Review data structure before developing code
Address manuscript sections or reviewer comments individually
Test one analytical step before extending the workflow
Separate content editing from formatting and presentation
Validate intermediate results before producing final conclusions

Iterative Prompt Refinement

Prompt engineering is an iterative process. I frequently refine an initial request after reviewing the first response, clarifying requirements, correcting assumptions, or adding missing constraints.

Identify which part of an output is incomplete or inaccurate
Preserve useful material while revising only the necessary section
Add examples when the requested style is not sufficiently clear
Specify terminology that must be retained or avoided
Request full replacement code when partial amendments would be difficult to integrate
Repeat validation after each substantial revision

Retrieval-Augmented Generation

The course introduced retrieval-augmented generation, or RAG, in which relevant information is retrieved from a defined knowledge source and supplied to a generative model before a response is produced.

Ground responses in custom documents or datasets
Retrieve only the information relevant to the current query
Reduce reliance on unsupported model knowledge
Connect generative AI with websites, reports, FAQs, databases, or document collections
Improve traceability by linking responses to source material
Support closed-domain assistants and knowledge-based chatbots
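
The retrieve-then-generate pattern can be sketched as two steps: select the relevant passages, then place them in the prompt with numbered sources. The example below is a simplification under stated assumptions: retrieval uses plain word overlap rather than embeddings or TF-IDF, the documents are invented, and the call to a generative model is left out.

```python
import re

# Invented knowledge source for illustration.
DOCUMENTS = {
    "faq": "Common questions are stored in a JSON file.",
    "retrieval": "Website sections are indexed with TF-IDF vectors.",
    "deployment": "The backend is rebuilt automatically after each commit.",
}


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Return up to k document names ranked by shared words with the query."""
    q = tokens(query)
    scored = [(len(q & tokens(text)), name) for name, text in documents.items()]
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in ranked[:k]]


def build_grounded_prompt(query, documents, k=2):
    """Build a prompt that restricts the model to the retrieved sources."""
    names = retrieve(query, documents, k)
    context = "\n".join(
        f"[{i}] ({name}) {documents[name]}" for i, name in enumerate(names, 1)
    )
    prompt = (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer (cite source numbers):"
    )
    return prompt, names


rag_prompt, used_sources = build_grounded_prompt(
    "Which sections use TF-IDF vectors?", DOCUMENTS
)
```

Returning the list of sources alongside the prompt makes it possible to show the visitor where an answer came from, which supports the traceability goal listed above.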

How I Use Prompt Engineering in Professional Projects

I use structured prompts to collaborate with generative AI, particularly ChatGPT, across scientific, analytical, coding, web-development, and professional-communication projects. I provide the relevant context, source material, constraints, and expected output, then review and refine the result rather than accepting it uncritically.

Scientific manuscripts: revise abstracts, methods, results, discussions, figure legends, contribution statements, cover letters, and responses to reviewers
Data analysis: plan analytical workflows, inspect assumptions, interpret model outputs, compare alternative methods, and translate technical results into clear conclusions
Python development: write, debug, refactor, and document code for data cleaning, statistical analysis, machine learning, bioinformatics, automation, and visualisation
Website development: revise HTML, CSS, and JavaScript; improve responsive behaviour; troubleshoot interactive components; and structure new professional content
Machine learning projects: refine experimental design, feature engineering, model validation, interpretation, reporting, and reproducibility
Bioinformatics and proteogenomics: clarify computational workflows, audit data-processing steps, interpret outputs, and improve scientific descriptions of large-scale analyses
Professional applications: tailor resumes, cover letters, selection responses, profile summaries, and skills statements to specific roles
Client communication: draft emails, website recommendations, technical guidance, article feedback, and concise summaries for collaborators and stakeholders
Project documentation: create checklists, implementation guides, maintenance instructions, standardised workflows, and handover documents
Creative and multimedia work: develop concepts, captions, descriptions, visual-storytelling workflows, and automation plans

My Prompting Workflow

For complex professional tasks, I generally use a structured, multi-step workflow.

Provide the project context and explain the purpose of the task
Supply the relevant text, code, files, results, or screenshots
State factual constraints and information that must not be invented
Define the audience, tone, length, and desired format
Ask for one clearly bounded task at a time
Review the first output against the original evidence
Correct misunderstandings and refine individual sections
Request a complete final version once all decisions are settled
Test code or website changes in the appropriate environment
Retain responsibility for the final scientific, technical, and professional output

Prompting for Code Development

When working on code, I provide the existing script, the intended behaviour, relevant inputs and outputs, and the exact error or limitation that needs to be addressed.

Request full replacement code when integration accuracy is important
Preserve existing filenames, column names, paths, and workflow conventions
Describe expected outputs and validation checks
Provide error messages and traceback information
Test revised code and report the resulting behaviour
Iteratively correct issues until the workflow completes successfully
Compare final outputs with known counts, assumptions, or quality-control criteria

Prompting for Scientific Editing

For scientific writing, I provide the original text, the study context, the underlying results, reviewer comments, and any journal-specific constraints.

Preserve scientific meaning while improving clarity
Avoid unsupported claims and invented references
Maintain consistency with figures, tables, and supplementary files
Distinguish established results from interpretation or speculation
Adapt wording to journal scope and article type
Reduce repetition while retaining necessary methodological detail
Check terminology and numerical consistency across sections

Prompting for Website Development

For website work, I provide the existing HTML or CSS block and describe the visual or functional change required.

Request copy-and-paste-ready replacement blocks
Preserve existing IDs, classes, paths, and naming conventions
Explain expected desktop and mobile behaviour
Use screenshots to demonstrate layout problems
Test changes across screen sizes and device orientations
Review accessibility, navigation, responsive behaviour, and content visibility
Confirm whether changes affect linked backend or chatbot resources

Human Review and Quality Control

I use generative AI as an assistant rather than an autonomous decision-maker. I remain responsible for checking the accuracy, suitability, and integrity of every final output.

Verify scientific statements against source data and publications
Run and test generated code before adopting it
Check calculations, counts, labels, and file outputs
Review language for tone, accuracy, and unintended overstatement
Confirm that confidential information is handled appropriately
Reject suggestions that conflict with evidence or project requirements
Document important decisions and maintain version-controlled files

Privacy and Confidentiality

Prompt design also requires careful consideration of privacy and data sensitivity. I adapt the workflow according to the confidentiality of the material.

Avoid sharing unnecessary personal or confidential information
Remove or mask identifiers where appropriate
Use public or approved source material for external tools
Use local language-model environments for privacy-sensitive tasks where feasible
Separate confidential source documents from public-facing outputs
Review generated text before external distribution

Professional Relevance

Prompt engineering strengthens my ability to use generative AI productively while retaining scientific judgement, technical oversight, and responsibility for final decisions.

Accelerate drafting, coding, troubleshooting, and documentation
Structure complex multidisciplinary tasks
Improve consistency across repeated professional workflows
Translate between technical and non-technical communication
Develop grounded chatbot and retrieval workflows
Support iterative problem-solving without replacing domain expertise
Maintain human review, evidence checking, and quality control

Technical highlights: Codecademy • prompt engineering • generative artificial intelligence • context setting • zero-shot prompting • one-shot prompting • few-shot prompting • task decomposition • iterative refinement • retrieval-augmented generation • ChatGPT • scientific editing • Python development • data analysis • machine learning • website development • human-in-the-loop validation • privacy-aware AI workflows

