Research Article: Predictive modeling for odor character of a chemical using machine learning combined with natural language processing

Date Published: June 14, 2018

Publisher: Public Library of Science

Author(s): Yuji Nozaki, Takamichi Nakamoto, Hiroaki Matsunami.


Recent studies on machine learning technology have reported successful performance in some visual and auditory recognition tasks, while little has been reported in the field of olfaction. In this paper we report computational methods to predict the odor impression of a chemical from its physicochemical properties. Our predictive model utilizes nonlinear dimensionality reduction on mass spectra data and performs the clustering of descriptors by natural language processing. Sensory evaluation is widely used to measure human impressions of smell or taste by using verbal descriptors, such as “spicy” and “sweet”. However, as it requires significant amounts of time and human resources, a large-scale sensory evaluation test is difficult to perform. Our model successfully predicts a group of descriptors for a target chemical through a series of computer simulations. Although the training text data used in the language modeling is not specialized for olfaction, the experimental results show that our method is useful for analyzing sensory datasets. This is the first report to combine machine olfaction with natural language processing for odor character prediction.

Partial Text

The source of smells is airborne chemical molecules. Olfactory receptor neurons within the olfactory epithelium are activated when they bind with molecules and provide electrical signals to olfactory nerves. Then the signals are delivered to the olfactory bulb and form a pattern on it. Afterwards, on the basis of the response pattern on the olfactory bulb, comprehensive information processing associated with emotion and memory is performed in the cerebrum [1]. As each type of olfactory receptor has different molecular selectivity, the pattern of stimuli appearing on the olfactory bulb varies from molecule to molecule [2]. That is, the impression of odor also varies from molecule to molecule. From previous studies, one of the key factors contributing to differences in patterns is considered to be the molecular structure [3], [4]. If we can predict the smell impression from the physicochemical properties of a molecule, it will be an important breakthrough for the cosmetic, beverage, and food industries because a large number of experienced panelists are currently required to create the desired odors through trial and error in these industries.

Mass spectra are physicochemical properties representing structural information of molecules and are given as a plot of intensity vs m/z (mass-to-charge ratio). The mass spectrum is uniquely determined for each molecule given the same measurement conditions. Large-scale mass spectrum datasets are available because many mass spectrum measurements can be performed under uniform conditions.
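To make the "intensity vs m/z" representation concrete, the following minimal sketch turns a list of peaks into a fixed-length vector indexed by integer m/z. The function name `spectrum_to_vector`, the `max_mz` cutoff, the base-peak normalization, and the toy peak values are all assumptions made for illustration; they are not taken from the article.

```python
def spectrum_to_vector(peaks, max_mz=300):
    """Convert (m/z, intensity) peaks into a fixed-length intensity vector.

    Index i holds the summed intensity at integer m/z = i + 1.
    Hypothetical helper: the article does not specify this preprocessing.
    """
    vector = [0.0] * max_mz
    for mz, intensity in peaks:
        index = round(mz) - 1  # m/z 1 maps to index 0
        if 0 <= index < max_mz:
            vector[index] += intensity
    base_peak = max(vector)
    if base_peak > 0:
        # Scale so the strongest peak equals 1.0 (an assumed normalization).
        vector = [value / base_peak for value in vector]
    return vector


# Toy peak list (made-up values, not real measurement data).
toy_peaks = [(41.0, 320.0), (69.0, 999.0), (93.0, 150.0)]
features = spectrum_to_vector(toy_peaks, max_mz=100)
```

A vector of this kind has one entry per m/z value, which is the shape a model with one input unit per m/z value would expect.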

We propose a neural network that predicts the presence or absence of a specific descriptor from the mass spectrum of a chemical molecule. The input units of the neural network correspond to the m/z values of the mass spectrum, and the output units correspond to the descriptors of the odor impression.
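The input/output layout described above can be sketched as a small feed-forward pass. This is an illustration only: the layer sizes, the tanh and sigmoid activations, the 0.5 threshold, the random untrained weights, and the three descriptor names are assumptions, and the sketch omits the dimensionality reduction step mentioned in the abstract.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights; untrained."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_descriptors(spectrum, hidden, output, threshold=0.5):
    """Map a spectrum vector to presence (1) or absence (0) per descriptor."""
    h = dense(spectrum, *hidden, activation=math.tanh)
    probs = dense(h, *output, activation=sigmoid)
    labels = [1 if p >= threshold else 0 for p in probs]
    return labels, probs


rng = random.Random(0)
n_mz = 100                                  # one input unit per m/z value
descriptors = ["sweet", "spicy", "rose"]    # one output unit per descriptor
hidden = init_layer(n_mz, 8, rng)
output = init_layer(8, len(descriptors), rng)

spectrum = [0.0] * n_mz
spectrum[68] = 1.0                          # toy spectrum with a single peak
labels, probs = predict_descriptors(spectrum, hidden, output)
```

With untrained weights the outputs are meaningless; the point is only that each input unit is tied to an m/z value and each output unit to a descriptor.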

In this paper, we propose an approach to predicting odor characters of chemicals using coarser-grained clusters that contain similar descriptors. For example, similar descriptors such as “Rose”, “Violet”, and “Lavender” are all grouped in the same cluster. Such a cluster can be interpreted as representing how well the broader character “flower” applies.

When any of the descriptors in a cluster has a value of 1 for a sample, the value of that cluster for the sample is set to 1. For example, when “rose”, “lavender”, and “iris” belong to a certain cluster, a sample having at least one of the three descriptors in the original catalog data is regarded as having the odor character of the cluster (see Fig 3).
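The rule above is a logical OR over the descriptors in each cluster. A minimal sketch follows; the cluster names `floral` and `spice`, the membership of the second cluster, and the sample are hypothetical and chosen only to illustrate the rule.

```python
def cluster_labels(sample_descriptors, clusters):
    """Return 1 for each cluster that shares at least one descriptor with the sample."""
    return {
        name: int(any(d in sample_descriptors for d in members))
        for name, members in clusters.items()
    }


# Hypothetical clusters; only the first mirrors the example in the text.
clusters = {
    "floral": {"rose", "lavender", "iris"},
    "spice": {"spicy", "clove"},
}

# A sample described as "lavender" and "sweet" in the catalog data.
sample = {"lavender", "sweet"}
labels = cluster_labels(sample, clusters)
# labels == {"floral": 1, "spice": 0}
```

The descriptor "sweet" belongs to neither cluster here, so it does not affect either label.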

The number of clusters, Kp3, must be determined carefully because the distribution of samples strongly affects the performance of the predictive model. Fig 5 shows the distribution of samples with respect to Kp3. As shown in Fig 5, the sample distribution of the correlation-based model is unbalanced from cluster to cluster when Kp3 is 6. This means that most of the descriptors belong to one huge cluster, while the other clusters include very few descriptors. In such a case, a predictive model may achieve very high accuracy while its predictions provide little information.
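The last point can be illustrated with a trivial predictor that always outputs the majority label for a cluster. The label counts below are made up for illustration and are not results from the article.

```python
def trivial_accuracy(labels):
    """Accuracy of always predicting the majority class for a binary label list."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return max(positives, negatives) / len(labels)


# Hypothetical huge cluster: 95 of 100 samples carry its label.
skewed = [1] * 95 + [0] * 5
# Hypothetical balanced cluster: half of the samples carry its label.
balanced = [1] * 50 + [0] * 50

skewed_baseline = trivial_accuracy(skewed)      # 0.95
balanced_baseline = trivial_accuracy(balanced)  # 0.5
```

On the skewed cluster, a model that ignores the mass spectrum entirely already scores 0.95, so a high accuracy there says little about whether the model has learned anything. On the balanced cluster the same baseline is 0.5, which leaves room for accuracy to reflect real predictive ability.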

In this paper, we proposed a predictive model incorporating the language modeling method Word2vec to predict odor characters of chemicals, represented by binary values, from mass spectra. In the Sigma-Aldrich catalog data used in this study, each descriptor representing an odor character of a molecule is used exclusively, even if other descriptors represent similar odor characters, so the similarity between descriptors is lost.



