TY  - JOUR
TI  - Application of ANN, XGBoost, and Other ML Methods to Forecast Air Quality in Macau
AU  - Lei, Thomas M. T.
AU  - Ng, Stanley C. W.
AU  - Siu, Shirley W. I.
T2  - Sustainability
AB  - Air pollution in Macau has become a serious problem following the Pearl River Delta’s (PRD) rapid industrialization that began in the 1990s. With this in mind, Macau needs an air quality forecast system that accurately predicts pollutant concentration during the occurrence of pollution episodes to warn the public ahead of time. Five different state-of-the-art machine learning (ML) algorithms were applied to create predictive models to forecast PM2.5, PM10, and CO concentrations for the next 24 and 48 h, which included artificial neural networks (ANN), random forest (RF), extreme gradient boosting (XGBoost), support vector machine (SVM), and multiple linear regression (MLR), to determine the best ML algorithms for the respective pollutants and time scale. The diurnal measurements of air quality data in Macau from 2016 to 2021 were obtained for this work. The 2020 and 2021 datasets were used for model testing, while the four-year data before 2020 and 2021 were used to build and train the ML models. Results show that the ANN, RF, XGBoost, SVM, and MLR models were able to provide good performance in building up a 24-h forecast with a higher coefficient of determination (R²) and lower root mean square error (RMSE), mean absolute error (MAE), and biases (BIAS). Meanwhile, all the ML models in the 48-h forecasting performance were satisfactory enough to be accepted as a two-day continuous forecast even if the R² value was lower than the 24-h forecast. The 48-h forecasting model could be further improved by proper feature selection based on the 24-h dataset, using the Shapley Additive Explanations (SHAP) value test and the adjusted R² value of the 48-h forecasting model. In conclusion, the above five ML algorithms were able to successfully forecast the 24 and 48 h of pollutant concentration in Macau, with the RF and SVM models performing the best in the prediction of PM2.5 and PM10, and CO in both 24 and 48-h forecasts.
DA  - 2023/01//
PY  - 2023
DO  - 10.3390/su15065341
DP  - www.mdpi.com
VL  - 15
IS  - 6
SP  - 5341
LA  - en
SN  - 2071-1050
UR  - https://www.mdpi.com/2071-1050/15/6/5341
Y2  - 2023/04/11/09:51:09
KW  - Macau
KW  - air pollution
KW  - air quality
KW  - air quality forecast
KW  - machine learning
ER  - 

TY  - JOUR
TI  - Chemical toxicity prediction based on semi-supervised learning and graph convolutional neural network
AU  - Chen, Jiarui
AU  - Si, Yain-Whar
AU  - Un, Chon-Wai
AU  - Siu, Shirley W. I.
T2  - Journal of Cheminformatics
AB  - As safety is one of the most important properties of drugs, chemical toxicology prediction has received increasing attentions in the drug discovery research. Traditionally, researchers rely on in vitro and in vivo experiments to test the toxicity of chemical compounds. However, not only are these experiments time consuming and costly, but experiments that involve animal testing are increasingly subject to ethical concerns. While traditional machine learning (ML) methods have been used in the field with some success, the limited availability of annotated toxicity data is the major hurdle for further improving model performance. Inspired by the success of semi-supervised learning (SSL) algorithms, we propose a Graph Convolution Neural Network (GCN) to predict chemical toxicity and trained the network by the Mean Teacher (MT) SSL algorithm. Using the Tox21 data, our optimal SSL-GCN models for predicting the twelve toxicological endpoints achieve an average ROC-AUC score of 0.757 in the test set, which is a 6% improvement over GCN models trained by supervised learning and conventional ML methods. Our SSL-GCN models also exhibit superior performance when compared to models constructed using the built-in DeepChem ML methods. This study demonstrates that SSL can increase the prediction power of models by learning from unannotated data. The optimal unannotated to annotated data ratio ranges between 1:1 and 4:1. This study demonstrates the success of SSL in chemical toxicity prediction; the same technique is expected to be beneficial to other chemical property prediction tasks by utilizing existing large chemical databases. Our optimal model SSL-GCN is hosted on an online server accessible through: https://app.cbbio.online/ssl-gcn/home.
DA  - 2021/11/27/
PY  - 2021
DO  - 10.1186/s13321-021-00570-8
DP  - BioMed Central
VL  - 13
IS  - 1
SP  - 93
J2  - Journal of Cheminformatics
SN  - 1758-2946
UR  - https://doi.org/10.1186/s13321-021-00570-8
Y2  - 2022/09/21/05:33:31
KW  - ADMET
KW  - Chemical toxicity
KW  - Deep learning
KW  - Graph convolutional neural network
KW  - Mean teacher
KW  - Semi-supervised learning
KW  - Tox21
ER  - 

TY  - JOUR
TI  - Multi-Branch-CNN: Classification of ion channel interacting peptides using multi-branch convolutional neural network
AU  - Yan, Jielu
AU  - Zhang, Bob
AU  - Zhou, Mingliang
AU  - Kwok, Hang Fai
AU  - Siu, Shirley W. I.
T2  - Computers in Biology and Medicine
AB  - Ligand peptides that have high affinity for ion channels are critical for regulating ion flux across the plasma membrane. These peptides are now being considered as potential drug candidates for many diseases, such as cardiovascular disease and cancers. In this work, we developed Multi-Branch-CNN, a CNN method with multiple input branches for identifying three types of ion channel peptide binders (sodium, potassium, and calcium) from intra- and inter-feature types. As for its real-world applications, prediction models that are able to recognize novel sequences having high or low similarities to training sequences are required. To this end, we tested our models on two test sets: a general test set including sequences spanning different similarity levels to those of the training set, and a novel-test set consisting of only sequences that bear little resemblance to sequences from the training set. Our experiments showed that the Multi-Branch-CNN method performs better than thirteen traditional ML algorithms (TML13), yielding an improvement in accuracy of 3.2%, 1.2%, and 2.3% on the test sets as well as 8.8%, 14.3%, and 14.6% on the novel-test sets for sodium, potassium, and calcium ion channels, respectively. We confirmed the effectiveness of Multi-Branch-CNN by comparing it to the standard CNN method with one input branch (Single-Branch-CNN) and an ensemble method (TML13-Stack). The data sets, script files to reproduce the experiments, and the final predictive models are freely available at https://github.com/jieluyan/Multi-Branch-CNN.
DA  - 2022/08/01/
PY  - 2022
DO  - 10.1016/j.compbiomed.2022.105717
DP  - ScienceDirect
VL  - 147
SP  - 105717
J2  - Computers in Biology and Medicine
LA  - en
SN  - 0010-4825
ST  - Multi-Branch-CNN
UR  - https://www.sciencedirect.com/science/article/pii/S0010482522004954
Y2  - 2022/09/21/05:32:54
KW  - Classification
KW  - Deep learning
KW  - Drug discovery
KW  - Ion channel
KW  - Multi-Branch-CNN
KW  - Peptides
ER  - 

TY  - JOUR
TI  - Using Machine Learning Methods to Forecast Air Quality: A Case Study in Macao
AU  - Lei, Thomas M. T.
AU  - Siu, Shirley W. I.
AU  - Monjardino, Joana
AU  - Mendes, Luisa
AU  - Ferreira, Francisco
T2  - Atmosphere
AB  - Despite the levels of air pollution in Macao continuing to improve over recent years, there are still days with high-pollution episodes that cause great health concerns to the local community. Therefore, it is very important to accurately forecast air quality in Macao. Machine learning methods such as random forest (RF), gradient boosting (GB), support vector regression (SVR), and multiple linear regression (MLR) were applied to predict the levels of particulate matter (PM10 and PM2.5) concentrations in Macao. The forecast models were built and trained using the meteorological and air quality data from 2013 to 2018, and the air quality data from 2019 to 2021 were used for validation. Our results show that there is no significant difference between the performance of the four methods in predicting the air quality data for 2019 (before the COVID-19 pandemic) and 2021 (the new normal period). However, RF performed significantly better than the other methods for 2020 (amid the pandemic) with a higher coefficient of determination (R²) and lower RMSE, MAE, and BIAS. The reduced performance of the statistical MLR and other ML models was presumably due to the unprecedented low levels of PM10 and PM2.5 concentrations in 2020. Therefore, this study suggests that RF is the most reliable prediction method for pollutant concentrations, especially in the event of drastic air quality changes due to unexpected circumstances, such as a lockdown caused by a widespread infectious disease.
DA  - 2022/09//
PY  - 2022
DO  - 10.3390/atmos13091412
DP  - www.mdpi.com
VL  - 13
IS  - 9
SP  - 1412
LA  - en
SN  - 2073-4433
ST  - Using Machine Learning Methods to Forecast Air Quality
UR  - https://www.mdpi.com/2073-4433/13/9/1412
Y2  - 2022/09/21/05:32:43
KW  - COVID-19
KW  - air pollution
KW  - air quality
KW  - air quality forecast
KW  - gradient boosting
KW  - multiple linear regression
KW  - random forest
KW  - support vector regression
ER  - 

TY  - JOUR
TI  - Recent Progress in the Discovery and Design of Antimicrobial Peptides Using Traditional Machine Learning and Deep Learning
AU  - Yan, Jielu
AU  - Cai, Jianxiu
AU  - Zhang, Bob
AU  - Wang, Yapeng
AU  - Wong, Derek F.
AU  - Siu, Shirley W. I.
T2  - Antibiotics
AB  - Antimicrobial resistance has become a critical global health problem due to the abuse of conventional antibiotics and the rise of multi-drug-resistant microbes. Antimicrobial peptides (AMPs) are a group of natural peptides that show promise as next-generation antibiotics due to their low toxicity to the host, broad spectrum of biological activity, including antibacterial, antifungal, antiviral, and anti-parasitic activities, and great therapeutic potential, such as anticancer, anti-inflammatory, etc. Most importantly, AMPs kill bacteria by damaging cell membranes using multiple mechanisms of action rather than targeting a single molecule or pathway, making it difficult for bacterial drug resistance to develop. However, experimental approaches used to discover and design new AMPs are very expensive and time-consuming. In recent years, there has been considerable interest in using in silico methods, including traditional machine learning (ML) and deep learning (DL) approaches, to drug discovery. While there are a few papers summarizing computational AMP prediction methods, none of them focused on DL methods. In this review, we aim to survey the latest AMP prediction methods achieved by DL approaches. First, the biology background of AMP is introduced, then various feature encoding methods used to represent the features of peptide sequences are presented. We explain the most popular DL techniques and highlight the recent works based on them to classify AMPs and design novel peptide sequences. Finally, we discuss the limitations and challenges of AMP prediction.
DA  - 2022/10//
PY  - 2022
DO  - 10.3390/antibiotics11101451
DP  - www.mdpi.com
VL  - 11
IS  - 10
SP  - 1451
LA  - en
SN  - 2079-6382
UR  - https://www.mdpi.com/2079-6382/11/10/1451
Y2  - 2022/11/09/06:00:01
KW  - antimicrobial peptide
KW  - classification
KW  - deep learning
KW  - machine learning
KW  - medicine
KW  - regression
KW  - therapeutic peptide
ER  - 