Hwang, S.-W., Park, G., Kim, J., Kang, K.-H., and Lee, W.-H. (2024). “One-dimensional convolutional neural networks with infrared spectroscopy for classifying the origin of printing paper,” BioResources 19(1), 1633-1651.

Abstract

Herein, the challenge of accurately classifying the manufacturing origin of printing paper, including continent, country, and specific product, was addressed. One-dimensional convolutional neural network (1D CNN) models trained on infrared (IR) spectrum data acquired from printing paper samples were used for the task. The preprocessing of the IR spectra through a second-derivative transformation and the restriction of the spectral range to 1800 to 1200 cm−1 improved the classification performance of the model. The outcomes were highly promising. Models trained on second-derivative IR spectra in the 1800 to 1200 cm−1 range exhibited perfect classification for the manufacturing continent and country, with an impressive F1 score of 0.980 for product classification. Notably, the developed 1D CNN model outperformed traditional machine learning classifiers, such as support vector machines and feed-forward neural networks. In addition, the application of data point attribution enhanced the transparency of the decision-making process of the model, offering insights into the spectral patterns that affect classification. This study makes a considerable contribution to printing paper classification, with potential implications for accurate origin identification in various fields.



One-Dimensional Convolutional Neural Networks with Infrared Spectroscopy for Classifying the Origin of Printing Paper

Sung-Wook Hwang,a Geungyong Park,b Jinho Kim,b Kwang-Ho Kang,c,* and Won-Hee Lee b,*

DOI: 10.15376/biores.19.1.1633-1651

Keywords: Classification; Convolutional neural network; Printing paper; Infrared spectroscopy; Data point attribution

Contact information: a: Human Resources Development Center for Big Data-Based Glocal Forest Science 4.0 Professionals, Kyungpook National University, 80 Daehak-ro, Buk-gu, Daegu 41566, Republic of Korea; b: Department of Wood Science and Technology, College of Agriculture and Life Sciences, Kyungpook National University, 80 Daehak-ro, Buk-gu, Daegu 41566, Republic of Korea; c: HP Printing Korea, 26 Yeonnaegaeul-ro, Sujeong-gu, Seongnam-si, Gyeonggi-do 13105, Republic of Korea.

* Corresponding authors: kwangho.kang@hp.com; leewh@knu.ac.kr

INTRODUCTION

The rapid advancement of modern technology has been expected to considerably reduce paper consumption in many fields owing to increasing digitalization. However, in reality, the consumption of paper has been increasing owing to various complex factors such as the need for document backup, security concerns, packaging, technological disparities, and user preferences (Shah et al. 2023). The development of printing and output technologies further highlights the significance of paper. Printing technology has revolutionized document creation, image reproduction, and data storage, considerably affecting business, education, and research domains. The quality and characteristics of printing paper are crucial components of these technologies, and fine analysis and distinction of these properties represent one of the vital challenges in this context.

Identifying the origin of paper products is essential for supporting global efforts to combat the illegal trade in timber products and promote sustainability (Australian Government 2012; Korea Legislation Research Institute 2020; Federal Register 2021; European Commission 2023). Furthermore, ensuring optimal printing quality for a particular printing device may require the identification of paper products that are suitable or unsuitable for the device. Paper origin identification is an advanced technology that offers practical value and innovations in various fields, including the prevention of document forgery and the development of new materials and manufacturing processes.

The combination of spectroscopy and multivariate analysis has proven to be a promising approach in classification problems involving various materials (Soriano-Disla et al. 2014; Chang et al. 2015; Hwang et al. 2016; Horikawa et al. 2019; Hwang et al. 2021). Infrared (IR) spectra record how a material absorbs IR radiation at specific wavelengths; these data can be used to discern the unique characteristics of paper (Stuart 2004; Trafela et al. 2007; Causin et al. 2010). Recent advancements in machine learning have further improved the predictive performance of models, making them more accurate and robust (Coppola et al. 2023; Hwang et al. 2023). Machine learning algorithms have already been used to process spectral data and identify the distinctive signatures of different types of paper through pattern recognition and feature extraction (Meza Ramirez et al. 2021). Recent research has also combined laser-induced breakdown spectroscopy (LIBS) with machine learning for diverse purposes. One study analyzed ink marks with LIBS and machine learning to support handwriting identification in judicial expertise (Feng et al. 2023). Another addressed the misclassification of recyclable waste by using LIBS and machine learning to create an effective online source tracing system (Chen et al. 2023). The system successfully identifies and categorizes smoke from waste paper incineration, demonstrating the possibility of tracing the source of waste paper.

Herein, a one-dimensional convolutional neural network (1D CNN) model using IR spectra was developed to accurately classify the manufacturing origin of printing paper, including the continent, country, and product. This model processes the spectral data of printing paper, learning patterns and features from the data. Furthermore, data point attribution analysis was used to understand how specific absorption bands in the IR spectrum contribute to the classification decisions of the model. This process enhanced the transparency of the decision-making process of the model and improved its interpretability. This article presents the results of machine learning-based classification of the manufacturing origin of printing paper, contributing to the existing body of research in this field.

EXPERIMENTAL

Printing Papers

Herein, 65 commercial products from 24 different manufacturers spanning 11 countries were used for printing paper classification (Table 1). Each product was categorized based on its country of production rather than the manufacturer’s country of registration. The number of products per country varied considerably: the largest share originated from China (28 products), whereas Finland was represented by only one product.

The majority of the selected products were typical office printing papers with a weight range of 70 to 90 grams per square meter (gsm). Some products exceeded 100 gsm and were intended for special documents, promotional materials, business cards, and other similar applications. The 28 products of Chinese origin were A4- or A3-sized printing papers with a weight of 70 to 80 gsm, manufactured by 12 companies. Among the 65 products tested, four were composed of recycled paper, whereas a unique product from the United States was produced from cotton. In addition, three original equipment manufacturer products with unverified manufacturer and country of origin information were included in the study. These items were incorporated into the analysis to predict their respective countries of origin.

Table 1. Number of Papers Analyzed and Country of Manufacture

Dataset

IR spectra

The IR spectra of printing paper samples spanning the wavenumber range of 4000 to 400 cm−1 were acquired using attenuated total reflection infrared (ATR-IR) spectroscopy (ALPHA-P, Bruker Optics, Ettlingen, Germany). The spectral resolution was set to 4 cm−1, and average spectra derived from 16 repeated scans were obtained. ATR-IR spectroscopy can analyze a wide range of samples, including liquids, solids, and powders. It requires minimal sample preparation, allowing for quick and direct analysis. For each printing paper product, the IR spectra from 5 samples were collected, resulting in a dataset comprising 325 spectra for the classification model.

Data preprocessing

The IR spectra were preprocessed using a Savitzky–Golay filter (Savitzky and Golay 1964). The original spectra were transformed into second-derivative spectra using a third-order polynomial with 21-point smoothing. This preprocessing was used to consistently adjust the baseline of the spectra and amplify peaks, thus emphasizing differences between the spectra (Hwang et al. 2016).
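The transformation described above can be sketched with NumPy alone. This is a minimal illustration, not the authors' implementation: the function name `savgol_second_derivative`, the mirror padding at the spectrum edges, and the unit point spacing are assumptions that the text does not specify.

```python
import numpy as np

def savgol_second_derivative(y, window=21, polyorder=3, delta=1.0):
    """Second derivative of an evenly sampled signal via Savitzky-Golay fitting."""
    half = window // 2
    # Local coordinate of each point in the window, centred on zero.
    x = np.arange(-half, half + 1, dtype=float)
    # Vandermonde matrix: columns are x**0, x**1, ..., x**polyorder.
    A = np.vander(x, polyorder + 1, increasing=True)
    # Row k of the pseudo-inverse maps window values to the fitted coefficient c_k.
    coeffs = np.linalg.pinv(A)
    # The second derivative of the fitted polynomial at x = 0 is 2 * c_2.
    kernel = 2.0 * coeffs[2] / delta**2
    # Assumption: mirror the signal at both ends so the output keeps its length.
    padded = np.pad(np.asarray(y, dtype=float), half, mode="reflect")
    # The kernel is symmetric, so correlation and convolution coincide.
    return np.convolve(padded, kernel, mode="valid")

# Sanity check on a parabola, whose second derivative is 2 everywhere.
t = np.arange(100, dtype=float)
d2 = savgol_second_derivative(t**2)
```

Away from the padded edges, `d2` equals 2 because a third-order polynomial fits a parabola exactly. Points within half a window of either end depend on the padding choice and should be interpreted with care.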

The IR spectra in the range of 4000 to 400 cm−1 comprise 2545 input variables, including zero-filled points. The IR spectra may contain information that is either noisy or not useful for sample characterization. Excess input variables are a primary factor increasing the computational cost of the model. Therefore, herein, the IR data from two regions were used for model training: 4000 to 400 cm−1 (the entire range) and 1800 to 1200 cm−1 (the selected range). The selected range is suitable for paper characterization (Kim and Eom 2016), and it corresponds to 425 input variables.
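The range restriction amounts to masking columns by wavenumber. The sketch below assumes an evenly spaced axis from 4000 to 400 cm−1 with 2545 points and uses placeholder zeros in place of measured spectra; the helper `select_range` is hypothetical. The real instrument grid may be offset slightly, so the number of retained points can differ by one from the 425 reported.

```python
import numpy as np

# Assumption: evenly spaced wavenumber axis; the true grid may be offset.
wavenumbers = np.linspace(4000.0, 400.0, 2545)

def select_range(spectra, wavenumbers, high=1800.0, low=1200.0):
    """Keep only the columns whose wavenumber lies in [low, high]."""
    mask = (wavenumbers <= high) & (wavenumbers >= low)
    return spectra[:, mask], wavenumbers[mask]

# Placeholder array standing in for the 325 measured spectra.
spectra = np.zeros((325, wavenumbers.size))
selected, selected_wn = select_range(spectra, wavenumbers)
print(selected.shape)
```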

Through data preprocessing and selection, four datasets were generated from the original IR spectra: the entire range (Dataset A) and selected range (Dataset B) of the original IR spectra, and the entire range (Dataset C) and selected range (Dataset D) of the second-derivative spectra. Each dataset was normalized by Euclidean (L2) norm-based vector normalization using Eq. 1 and then used to develop its own classification model (Fig. 1),

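$$v_{\mathrm{norm}} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{\sum_{i=1}^{n} v_i^{2}}} \tag{1}$$

where $v$ is the vector (IR spectrum) to be normalized, $v_i$ is the $i$th element (data point) of vector $v$, and $n$ is the number of vector elements.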
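As a brief illustration of the L2 normalization, each spectrum (one row) is divided by its Euclidean length. The function name `l2_normalize` and the handling of all-zero rows are assumptions made for this sketch.

```python
import numpy as np

def l2_normalize(spectra):
    """Scale each row (one spectrum) to unit Euclidean length."""
    spectra = np.asarray(spectra, dtype=float)
    norms = np.sqrt(np.sum(spectra**2, axis=1, keepdims=True))
    # Assumption: an all-zero spectrum is left unchanged rather than divided by zero.
    norms[norms == 0.0] = 1.0
    return spectra / norms

normalized = l2_normalize([[3.0, 4.0], [0.0, 0.0]])
print(normalized[0])  # [0.6 0.8]
```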

Fig. 1. Diagram for the classification of printing paper using 1D CNN

Dataset splitting

The datasets were split into training, validation, and test sets in a 3:1:1 ratio to build and evaluate the classification models. This ratio represented the minimum requirement for allocating data to each subset in product-level classification. The data were partitioned using stratified random sampling to maintain the specified split ratio for all classes.
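A stratified 3:1:1 split can be sketched with the standard library as follows. The function name `stratified_split` and the fixed seed are illustrative assumptions; the authors' tooling is not stated.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratio=(3, 1, 1), seed=0):
    """Return train/validation/test index lists, keeping `ratio` within every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    total = sum(ratio)
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_train = n * ratio[0] // total
        n_val = n * ratio[1] // total
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Any remainder from integer division falls into the test set.
        test += indices[n_train + n_val:]
    return train, val, test

# 65 products with 5 spectra each, as in the product-level task.
labels = [product for product in range(65) for _ in range(5)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 195 65 65
```

With five spectra per product, each product contributes exactly three training spectra, one validation spectrum, and one test spectrum, which is why 3:1:1 is the smallest workable split at the product level.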

Principal Component Analysis (PCA)

To analyze the IR spectral data of printing paper, PCA was conducted on each of the four datasets. Through the PCA, the high-dimensional IR data were transformed into a new orthogonal coordinate system with six principal components (PCs). The transformed data were subsequently visualized in a two-dimensional (2D) space to investigate the structure and patterns in the IR data for printing paper.
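The projection onto six PCs can be sketched with a singular value decomposition of the mean-centred data. The random matrix below is only a placeholder with the shape of the selected-range data, and the function name `pca` is an assumption.

```python
import numpy as np

def pca(X, n_components=6):
    """Return scores, loadings, and explained-variance ratios from mean-centred data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are orthonormal principal axes, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]
    scores = Xc @ loadings.T
    explained = (S**2) / np.sum(S**2)
    return scores, loadings, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(325, 425))   # placeholder for 325 spectra with 425 variables
scores, loadings, explained = pca(X)
pc1_pc2 = scores[:, :2]           # coordinates for a 2D score plot
```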

1D CNN Classification Model

1D CNN architecture

CNNs are fundamental to deep learning; they are predominantly used in image processing, where they excel at extracting features from 2D data for image recognition and classification. Similarly, 1D CNNs, which operate on the same principles, extract features from 1D data for predictive purposes.

The architecture of the 1D CNN models tailored for the classification of printing paper in this study is illustrated in Fig. 2. Each network comprises two convolutional layers and two fully connected layers, with each convolutional layer forming a module in conjunction with a max-pooling layer. These modules extract features from the input IR spectrum through abstraction and down-sampling. The rectified linear unit (ReLU) was used as the activation function. The features learned by the convolutional modules are passed to a network composed of one flatten layer, two fully connected layers, and one softmax layer, which is trained to perform the prediction task.

Fig. 2. Architecture of the 1D CNN model for printing paper classification. Numbers in parentheses indicate layer shapes. Notes: Conv, convolution layer; FC, fully connected layer

The details of the hyperparameters tested and their application within the network for establishing the 1D CNN model are shown in Table 2. These hyperparameters were optimized through loop-based testing. Each 1D CNN model for printing paper classification was trained for 700 epochs using categorical crossentropy as the loss function.
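To make the data flow through the architecture concrete, the following NumPy sketch runs a forward pass through two Conv plus Max_pool modules, a flatten step, and two fully connected layers ending in a softmax. All sizes (kernel size 5, 8 and 16 filters, pool size 2, 32 dense units, 11 classes) and the random weights are hypothetical placeholders, not the optimized values. Treating the second fully connected layer as the one that feeds the softmax is also an assumption, and dropout is omitted because it is inactive at prediction time.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1D convolution. x: (length, c_in); w: (kernel, c_in, c_out)."""
    k = w.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    # windows has shape (out_len, kernel, c_in); contract kernel and channel axes.
    return np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b

def relu(x):
    return np.maximum(x, 0.0)

def max_pool1d(x, pool):
    """Non-overlapping max pooling along the length axis."""
    out_len = x.shape[0] // pool
    return x[:out_len * pool].reshape(out_len, pool, -1).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_params(n_features, n_classes, kernel=5, filters=(8, 16), pool=2, units=32, seed=0):
    """Random placeholder weights; every size here is a hypothetical choice."""
    rng = np.random.default_rng(seed)
    conv, c_in, length = [], 1, n_features
    for c_out in filters:
        conv.append((rng.normal(0, 0.1, (kernel, c_in, c_out)), np.zeros(c_out)))
        length = (length - kernel + 1) // pool   # length after conv, then pooling
        c_in = c_out
    flat = length * c_in
    return {
        "conv": conv,
        "pool_size": pool,
        "fc1": (rng.normal(0, 0.1, (flat, units)), np.zeros(units)),
        "fc2": (rng.normal(0, 0.1, (units, n_classes)), np.zeros(n_classes)),
    }

def forward(spectrum, params):
    x = spectrum[:, None]                        # (n_features, 1): one input channel
    for w, b in params["conv"]:                  # two Conv + Max_pool modules
        x = max_pool1d(relu(conv1d(x, w, b)), params["pool_size"])
    x = x.reshape(-1)                            # flatten
    w1, b1 = params["fc1"]
    x = relu(x @ w1 + b1)
    w2, b2 = params["fc2"]
    return softmax(x @ w2 + b2)                  # class probabilities

params = init_params(n_features=425, n_classes=11)
probs = forward(np.random.default_rng(1).normal(size=425), params)
```

With these placeholder sizes, a 425-point input shrinks to 210 points after the first module and 103 after the second, so the flattened feature vector has 103 × 16 = 1648 entries.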

Evaluation metric

Printing paper products are unequally distributed across manufacturing countries; thus, evaluating the classification performance of the models by accuracy may be biased toward overrepresented classes. Consequently, in this study, the weighted F1 score was used for assessing the classification performance of the 1D CNN models. The F1 score, which is the harmonic mean of precision (Eq. 2) and recall (Eq. 3), is a commonly used performance metric in classification problems with class imbalance; it is defined in Eq. 4.

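$$P = \frac{TP}{TP + FP} \tag{2}$$

$$R = \frac{TP}{TP + FN} \tag{3}$$

where $TP$ is the true positives, $FP$ is the false positives, and $FN$ is the false negatives.

$$F1_i = \frac{2 \, P_i \, R_i}{P_i + R_i} \tag{4}$$

where $F1_i$, $P_i$, and $R_i$ are the F1 score, precision, and recall for class $i$, respectively.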

Table 2. Detailed Hyperparameters Used for Building the 1D CNN Model

Notes: kernel_size, the size of the convolutional kernel; filters, the number of filters applied in the convolutional layers; pool_size, the size of the pooling window in max pooling layers; dense_units, the number of nodes in the dense layers; dropout_rate, the rate at which dropout is applied in dropout layers; learning_rate, the learning rate used in the training; optimizer, the optimization algorithm chosen for training the model; SGD, stochastic gradient descent; Adam, adaptive moment estimation; RMSProp, root mean squared propagation; Layer, the specific layer in the network architecture; Layer Shape, the shape or dimensions of the layer; Hyperparameters, the hyperparameters applied to each layer; Conv, convolution layer; Max_pool, maximum pooling layer; Dense, dense layer; n_features, the number of data points comprising the input data. The values in square brackets represent the values of each hyperparameter used in building the model.

The weighted F1 score used for the assessment of 1D CNN model performance takes into account class imbalances by calculating the weights for each class (Eq. 5) and incorporates them into their respective F1 scores (Eq. 6). Through this process, the weighted F1 score assesses individual classes and the overall model performance even for imbalanced datasets.

$$w_i = \frac{N_i}{T} \tag{5}$$

where $w_i$ is the weight of class $i$; $N_i$ is the number of samples in class $i$; and $T$ is the total number of samples.

$$\text{Weighted } F1 = \sum_{i} w_i \, F1_i \tag{6}$$
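The metric in Eqs. 2 to 6 can be sketched in plain Python as follows. The country codes in the toy labels are illustrative only, and setting a score to 0 when its denominator is 0 is an assumption.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)          # number of samples in each class
    total = len(y_true)                # total number of samples
    score = 0.0
    for cls, n_i in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Assumption: F1 is taken as 0 when precision + recall is 0.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n_i / total) * f1    # class weight times class F1
    return score

y_true = ["CN", "CN", "CN", "CN", "FI", "KR"]
y_pred = ["CN", "CN", "CN", "FI", "FI", "KR"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.849
```

In this toy case the per-class F1 scores are 6/7 (CN), 2/3 (FI), and 1 (KR), and the weights 4/6, 1/6, and 1/6 give 0.849. The frequent class dominates the result in proportion to its share of the samples.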

Data point attribution

To assess the effect of individual data points within the given input IR data on the predictions of the 1D CNN models, the gradient-weighted class activation mapping (Grad-CAM) method was used for data point attribution. Data point attribution is a fundamental tool for interpreting model predictions by tracing back the output of the model.

Data point attribution involves computing the gradient of the loss function to understand its sensitivity to each parameter and input data point. The gradient value indicates the extent to which a given data point affects the loss function, with a higher absolute gradient value signifying a substantial effect of that data point on the output of the model. The results of data point attribution were visualized alongside the IR spectra to quantitatively determine the importance of each data point and facilitate model interpretation.
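The idea of ranking data points by the magnitude of the loss gradient can be illustrated with a much simpler stand-in than Grad-CAM. The sketch below uses a linear softmax classifier, for which the gradient with respect to the input has a closed form. The model, its random weights, and the function name `input_gradient` are assumptions for illustration and do not reproduce the attribution procedure applied to the 1D CNN.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_gradient(x, W, b, true_class):
    """Gradient of the cross-entropy loss with respect to each input data point.

    Model (assumed for illustration): probabilities = softmax(W @ x + b).
    """
    p = softmax(W @ x + b)
    one_hot = np.zeros_like(p)
    one_hot[true_class] = 1.0
    # For softmax with cross-entropy, dL/dlogits = p - one_hot; the chain rule
    # through the linear layer gives dL/dx = W.T @ (p - one_hot).
    return W.T @ (p - one_hot)

rng = np.random.default_rng(0)
n_points, n_classes = 425, 11
W = rng.normal(0, 0.1, (n_classes, n_points))   # placeholder weights
b = np.zeros(n_classes)
x = rng.normal(size=n_points)                   # placeholder spectrum

grad = input_gradient(x, W, b, true_class=3)
attribution = np.abs(grad)                      # larger value, larger influence
top5 = np.argsort(attribution)[::-1][:5]        # indices of the most influential points
```

Plotting `attribution` against the wavenumber axis, alongside the spectrum itself, gives the kind of overlay described above.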

Prediction of unknown products

The origins of three products with unknown manufacturing information were predicted using the developed models. PCA was performed on their IR spectra to analyze their relationships with the existing data. Subsequently, the spectra were input into the established 1D CNN models to calculate the prediction probabilities for each class (Fig. 1). Before input, the IR spectra of the unknown products were preprocessed in the same way as those used in model construction.

Model Comparison

The classification performance of the constructed 1D CNN models was compared with those of conventional machine learning classifiers: feed-forward neural network (FNN) and support vector machine (SVM). They were trained on the same four datasets used to establish the 1D CNN models, thus constructing their respective classification models.

FNN

The FNN with a backpropagation algorithm was used as a benchmark method for the 1D CNN. When constructing the models, ReLU was adopted as the activation function and crossentropy was used as the loss function. Stochastic gradient descent and adaptive moment estimation (Adam) were used to optimize the loss function. The initial learning rate ranged from 0.0001 to 0.1, with a maximum of 1000 iterations. The FNN architecture was configured with either one or two hidden layers, each containing 12, 256, or 512 nodes. A grid search was conducted to determine the optimal parameters and network structure for the FNN models.
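The search space described above can be enumerated as follows. Two points are assumptions: that the two hidden layers may have different sizes, and that learning rates were sampled at powers of ten between 0.0001 and 0.1. The dictionary keys are hypothetical names, not those of any particular library.

```python
from itertools import product

node_options = (12, 256, 512)
# One hidden layer, or two hidden layers whose sizes may differ (assumption).
hidden_layers = [(n,) for n in node_options] + list(product(node_options, repeat=2))
optimizers = ("sgd", "adam")
# Assumption: learning rates sampled at powers of ten between 0.0001 and 0.1.
learning_rates = (1e-4, 1e-3, 1e-2, 1e-1)

fnn_grid = [
    {"hidden_layers": h, "optimizer": o, "learning_rate": lr, "max_iter": 1000}
    for h, o, lr in product(hidden_layers, optimizers, learning_rates)
]
print(len(fnn_grid))  # 96
```

Under these assumptions there are 12 architectures, 2 optimizers, and 4 learning rates, so a full grid search evaluates 96 candidate models per dataset.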

SVM

To facilitate performance comparison, SVM models were constructed using the radial basis function kernel (Vert et al. 2004), a technique that projects data into a high-dimensional space to determine hyperplanes. During model construction, the parameter cost, responsible for regulating the misclassification cost of the training data, was set in the range of 10^0 to 10^5. In addition, the parameter Gamma, which governs the Gaussian kernel used for nonlinear classification, was configured within the range of 10^−1 to 10^−6. These parameters were optimized via a grid search.
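The role of Gamma is easiest to see in the kernel itself, K(x, y) = exp(−Gamma · ‖x − y‖²). The sketch below computes this kernel matrix with NumPy and lists candidate parameter values, reading the cost range as 10^0 to 10^5 and the Gamma range as 10^−1 to 10^−6. Stepping both by factors of ten is an assumption, and `rbf_kernel` is a hypothetical helper rather than a library call.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq, 0.0))

# Candidate values; stepping by factors of ten is an assumption.
costs = [10.0**e for e in range(0, 6)]       # 10^0 ... 10^5
gammas = [10.0**-e for e in range(1, 7)]     # 10^-1 ... 10^-6

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 425))                # placeholder spectra
K_wide = rbf_kernel(X, X, gammas[-1])        # small Gamma: all samples look similar
K_narrow = rbf_kernel(X, X, gammas[0])       # large Gamma: similarity decays quickly
```

A smaller Gamma yields a smoother decision boundary, whereas a larger Gamma lets the boundary follow individual training spectra more closely, which is why it is tuned together with the cost parameter.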

RESULTS AND DISCUSSION

IR Spectral Characteristics of Printing Paper

The IR spectrum contains valuable information for capturing and interpreting the characteristics of paper. Figure 3 presents the IR spectra of select samples from the four datasets. In the original IR spectra (Fig. 3a and 3b), distinct peaks at 3600 to 3000 cm−1 were assigned to OH groups (Hofstetter et al. 2006), peaks at 2890 to 2780 cm−1 to CH stretching (Xiao et al. 2015), at 1647 to 1635 cm−1 to adsorbed water (Olsson and Salmén 2004), at 1430 to 1416 cm−1 to the CH2 bending of crystalline cellulose (Schwanninger et al. 2004; Delmotte et al. 2008), and at 1200 to 900 cm−1 to the fingerprint peaks of cellulose (Garside and Wyeth 2003).

Fig. 3. Entire IR spectra of printing paper samples and the selected region: original (a, b) and second-derivative data (c, d)

In the second-derivative spectra (Fig. 3c and 3d), peaks were amplified, accentuating the distinctions between spectra. Moreover, the 1600 to 1200 cm−1 region, where multiple peaks in the original spectra overlap, is distinctly separated through the second-derivative transformation (Fig. 3d). In the second-derivative spectra, several absorption bands, apart from those prominently present in the original spectra, are enhanced. These include bands assigned to carbonyl groups at 1740 cm−1 (Schwanninger et al. 2004), aromatic parts in lignin at 1510 cm−1 (Pandey and Pitman 2003) and 1244 cm−1 (Delmotte et al. 2008), amorphous cellulose at 1466 to 1460 cm−1 (Hajji et al. 2016), and crystalline cellulose at 1315 cm−1 (Colom and Carrillo 2002).

PCA

PCA is a useful technique for extracting and analyzing patterns and structures in high-dimensional data. Figure 4 shows score plots depicting the first two PCs derived from the four IR datasets.

Fig. 4. PC score plots for the first two PCs in the 4000–400 cm−1 (a) and 1800–1200 cm−1 (b) regions of the original IR spectra, and score plots in the 4000–400 cm−1 (c) and 1800–1200 cm−1 (d) regions of second-derivative spectra

In all the score plots, data points from the majority of samples are mixed, forming a unified, large cluster, whereas some samples are grouped into smaller, distinct clusters. Notably, in contrast to other score plots (Fig. 4a, 4b, 4c), the smaller clusters are more prominently separated from the larger cluster in the score plots of second-derivative spectra in the 1800 to 1200 cm−1 region (Fig. 4d). These clusters correspond to products manufactured in Korea, Germany, China, and the United States, with each cluster comprising data points from the same product. Korean and German products have been identified as recycled paper.

The second-derivative spectra of four products isolated from the large cluster in the PC score plot (Fig. 4d) and the loading values of the first two PCs are shown in Fig. 5. The absorbance bands at 1510 and 1595 cm−1 (Fig. 5b), assigned to the aromatic part of lignin, contributed to the positioning of Korean and German recycled products in the high-PC2 region of the score plot. They showed stronger peaks in those regions than in the control IR spectrum. These results suggest that unbleached pulp was possibly used in the manufacturing of the recycled products. The characteristics of the IR spectra and PC loading of the Chinese products, forming a distinct cluster, closely resemble those of the Korean and German recycled products.

The isolated cluster of the United States products with high PC1 values exhibited a conspicuously strong peak at 1416 cm−1 (Fig. 5a), which was assigned to the crystalline cellulose. Furthermore, the substantial negative value in the PC1 loading for this region indicated that it was a distinctive feature of these products. Variations in the crystallinity of cellulose in printing paper samples were attributed to differences in cooking methods and conditions (Gümüşkaya et al. 2003). However, given that the printing paper is kraft pulp based, factors other than cooking methods likely contributed to this effect because the use of recycled pulp in the manufacturing of this product cannot be discounted (Sheikhi et al. 2010).

Fig. 5. Loadings for the first two PCs in the 1800 to 1200 cm−1 region of the second-derivative IR spectra. Numbers in parentheses indicate product number for printing paper samples.

Classification of Printing Papers

1D CNN models

The 1D CNN models trained on the IR spectra were constructed for the classification of printing paper samples. The learning curves of the models for classifying the country of manufacturing (Fig. 6) show a smaller gap between the training and validation losses for the selected range of 1800 to 1200 cm−1 (Fig. 6b) than for the entire IR spectra (Fig. 6a). The gap between the training and validation curves provides insight into model overfitting and generalization performance; the smaller gap in the selected range suggests that models trained on the IR data from this region are more likely to possess predictive capabilities for new data (Anzanello and Fogliatto 2011).

Fig. 6. Learning curves of the 1D CNNs for the training on IR spectral data and model validation for the classification of the country of manufacturing for printing paper samples. Learning curve for the entire IR data from 4000 to 400 cm−1 (a) and learning curve for selected data in the 1800 to 1200 cm−1 range (b)

Figure 7 presents the F1 scores for the 1D CNN models trained on the IR spectral data for the classification of printing paper samples. The hyperparameters applied to each model, determined through loop-based optimization, are detailed in Table 3. In the classification of the continent of manufacturing, all models exhibited F1 scores exceeding 0.967, confirming that printing paper samples share similar characteristics by continent. In the classification of the country of manufacturing, models trained on the IR data from the selected region of 1800 to 1200 cm−1 exhibited higher performance, with all classes perfectly classified regardless of spectral preprocessing.

The classification performance of the 1D CNN models at the product level was lower than that at other classification levels. Models trained on the original and second-derivative IR spectra encompassing the entire spectra (4000 to 400 cm−1) showed F1 scores of 0.788 and 0.818, respectively. These findings were attributed to the inherent challenges associated with product-level classification, which involves a considerable number of classes (65), where each class is represented by only 5 samples. This limitation prevented the models from adequately learning the distinctive features of each individual class. However, the use of spectral data from the selected region (1800 to 1200 cm−1) notably improved the F1 scores, which reached 0.939 and 0.980, respectively. These results suggest that the selected spectral region is well-suited for the characterization of printing paper samples.

Fig. 7. Weighted F1 scores of the 1D CNN models for the classification of printing paper manufacturing continents (a), countries (b), and products (c)

Table 3. Performance of the 1D CNN Models in Printing Paper Classification and Their Optimal Hyperparameter Combinations