Materials alternative recommender using machine learning based on COSMO-SAC

Document Type : Research Article

Authors

1 Al-Zahrawi University College, Karbala, Iraq

2 Department of Pharmacy, Al-Noor University College, Nineveh, Iraq

3 College of Dentistry, National University of Science and Technology, Dhi Qar, 64001, Iraq

4 Medical Technical College, Al-Farahidi University, Iraq

5 Physics Department, College of Science, University of Halabja, 46018, Halabja, Iraq

Abstract

Finding alternative materials and solvents in a chemistry laboratory or during process design can be time-consuming. The activity coefficient is one of the most important thermodynamic properties that can be used for this purpose. COSMO-SAC modeling is a reliable method for determining the activity coefficients of mixtures and is used in the present study to find alternatives to organic materials. A dataset of the activity coefficients of 96 organic molecules in mixtures with different solvents (water, ethanol, methanol, toluene, and benzene) was obtained over the full composition range with COSMO-SAC. The created database was merged with the FreeSolv dataset to broaden the diversity of properties and enrich the dataset for machine learning training. Unsupervised machine learning (clustering) methods, including centroid-based and density-based clustering, were applied to identify the best alternatives for the 96 organic materials studied. Appropriate pre-processing was used to determine the optimum parameters of the clustering methods: the elbow method for centroid-based clustering and k-nearest neighbors for density-based clustering. The centroid-based clustering methods recommend a varying set of materials depending on the number of clusters, with the alternatives sorted by closeness of properties. The density-based method, in contrast, works with an optimum distance and number of k-nearest neighbors, which were 0.08 and 7, respectively, for the created dataset. Its results are exclusive and show that clustering can isolate clusters according to chemical family, yielding 5 clusters and 12 outliers. The outliers are important because no alternatives were found for them in the trained dataset, and they should be considered unique materials. With COSMO-SAC data, the density-based clustering results were the more promising for an organic materials alternative recommender.
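The density-based step described in the abstract can be illustrated with a minimal sketch. It assumes a DBSCAN-style algorithm (the abstract does not name the algorithm), features already scaled to the unit range, and synthetic two-dimensional points in place of the COSMO-SAC and FreeSolv descriptors. The function names `k_distances`, `dbscan`, and `alternatives` are hypothetical, and only the values 0.08 and 7 come from the abstract.

```python
import math
import random


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour.

    Plotting this curve and looking for its knee is a common heuristic
    for choosing the neighbourhood distance of a density-based method.
    """
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for outliers."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # expand only through core points
    return labels


def alternatives(index, points, labels):
    """Members of the same cluster, nearest first; empty for an outlier."""
    if labels[index] == -1:
        return []
    same = [j for j in range(len(points))
            if j != index and labels[j] == labels[index]]
    return sorted(same, key=lambda j: math.dist(points[index], points[j]))


# Synthetic stand-in data (assumption): two tight groups and one isolated point.
random.seed(0)


def jitter():
    return random.uniform(-0.02, 0.02)


points = [(0.2 + jitter(), 0.2 + jitter()) for _ in range(10)]
points += [(0.8 + jitter(), 0.8 + jitter()) for _ in range(10)]
points.append((0.5, 0.05))

# Distance 0.08 and 7 neighbours are the values reported in the abstract.
labels = dbscan(points, eps=0.08, min_pts=7)
print(labels)
print(alternatives(0, points, labels))
print(alternatives(20, points, labels))
```

In this sketch the isolated point receives the label -1 and an empty list of alternatives, which mirrors how the abstract treats outliers as unique materials. The neighbour count includes the point itself, so `k_distances(points, 6)` is the matching curve for a neighbour count of 7 under that convention.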

Graphical Abstract

Materials alternative recommender using machine learning based on COSMO-SAC

Keywords

Main Subjects