Classification Analysis of Copy Papers Using Infrared Spectroscopy and Machine Learning Modeling
Yong-Ju Lee,a Tai-Ju Lee,b,* and Hyoung Jin Kim a,*
The evaluation and classification of chemical properties in different copy-paper products could significantly help address document forgery. This study analyzes the feasibility of utilizing infrared spectroscopy in conjunction with machine learning algorithms for classifying copy-paper products. A dataset comprising 140 infrared spectra of copy-paper samples was collected. The classification models employed in this study include partial least squares-discriminant analysis, support vector machine, and K-nearest neighbors. The key findings indicate that a classification model based on attenuated-total-reflection infrared spectroscopy demonstrated good performance, highlighting its potential as a valuable tool for accurately classifying paper products and assisting in the resolution of criminal cases involving document forgery.
DOI: 10.15376/biores.19.1.160-182
Keywords: Attenuated-total-reflection infrared spectroscopy (ATR-IR); Partial least squares-discriminant analysis (PLS-DA); Support vector machine (SVM); K-nearest neighbor (KNN); Machine learning; Document forgery; Forensic document analysis
Contact information: a: Department of Forest Products and Biotechnology, Kookmin University, 77 Jeongneung-ro, Seongbuk-gu, Seoul 02707 Republic of Korea; b: National Institute of Forest Science, Department of Forest Products and Industry, Division of Forest Industrial Materials, 02455, Seoul, Republic of Korea; *Corresponding authors: leetj@korea.kr; hyjikim@kookmin.ac.kr
INTRODUCTION
Various types of copy papers are produced and sold worldwide, finding extensive usage in institutions such as schools, offices, and printing companies. Each copy paper possesses distinct properties, including basis weight, whiteness, and composition. These attributes are influenced by factors, such as the type and ratio of pulp; additives, such as fillers and sizing agents; and variations in manufacturing processes. Surface sizing agents, such as starch, are commonly employed in copy-paper manufacturing to enhance surface strength and printing suitability (Moutinho et al. 2011).
Despite the advances in information technology that have led to the concept of a paperless society, many tasks are still conducted using paper documents. Paper remains a crucial medium for recording various daily tasks and activities, including taking notes, jotting down memos, and formalizing important contracts (Ganzerla et al. 2009; Lee et al. 2023).
Meanwhile, owing to recent technological advances in office automation devices, document forgery has become more accessible, consequently underscoring the growing significance of paper document examination in fields such as forensic investigation and evidence analysis (Lee et al. 2023). In South Korea, a substantial number of document forgery cases have been documented. In 2020, the country recorded 2,217 cases of document forgery related to official documents and 7,604 cases concerning private documents. Over a three-year period starting in 2018, the statistics indicate that nearly ten thousand criminal cases of document forgery are reported annually (Choi et al. 2018a; 2018b). Hence, the development of robust forensic capabilities for detecting forgery becomes paramount (Kim et al. 2016; Lee et al. 2023).
Various classification methods for paper have been reported, primarily relying on chemical analysis. The X-ray diffraction (XRD) method has been employed to investigate the inorganic substances used as paper fillers (Foner and Adan 1983; Causin et al. 2010). Spence et al. (2000) demonstrated that document paper could be identified by utilizing ICP-MS to trace the elements within the paper. Ganzerla et al. (2009) conducted a comprehensive study on the characteristics of ancient documents produced in Palazzo Ducale, Italy, employing various analytical tools, including scanning electron microscopy-energy dispersive X-ray spectrometry (SEM–EDX), high-performance liquid chromatography mass spectrometry (HPLC-MS/MS), and pyrolysis–gas chromatography and mass spectrometry (Py-GC/MS). Choi et al. (2018a; 2018b) utilized SEM-EDX to analyze fillers in 188 types of copy paper and further determined the mix ratios of fiber components through dissociation tests.
In recent years, the utilization of infrared spectroscopy has gained prominence for identifying paper types (Kher et al. 2001, 2005; Kumar et al. 2017; Kang et al. 2021; Kim et al. 2022). Infrared spectroscopy (IR) serves as a fundamental tool for investigating paper structure and pulp chemistry (Workman 1999; Pan and Nguyen 2007). The paper industry has been utilizing IR for non-destructive process control and rapid determination of specific parameters, including identification of paper (Hodges et al. 2006; Ganzerla et al. 2009; Jang et al. 2020; Seo et al. 2023). Kher et al. (2005) reported a successful attempt to identify six types of paper by analyzing mid-infrared wavelengths (MIR, 2500 to 4000 cm−1) using an FT-IR spectrometer.
Additionally, IR spectroscopy combined with multivariate statistical methods has proven to be an effective approach for distinguishing similar paper products, modeling systematic data variances, and presenting data in a concise manner (Marcelo et al. 2014). Multivariate statistical methods offer several classification models, including principal component analysis (PCA), partial least squares-discriminant analysis (PLS-DA), support vector machine (SVM), and K-nearest neighbor (KNN) (Agarwal et al. 2021). PCA is an unsupervised model that uncovers relationships between samples and analytical data, resulting in distinct groupings of variables and samples. This grouping can be visualized as a reduction from a multidimensional space to a two-dimensional representation. Kim et al. (2016) utilized IR spectroscopy and PCA to effectively distinguish traditional paper products originating from Korea, China, and Japan. Similarly, Kang et al. (2021) and Kim et al. (2022) demonstrated the efficacy of IR and PCA in classifying paper products according to their respective continents or countries of origin.
In contrast, PLS-DA, SVM, and KNN are supervised models that employ labeled datasets to train the models for classifying new samples based on known classes. These models enable the prediction of a sample’s class by leveraging its spectral characteristics and previously labeled data (Singh et al. 2023). Jang et al. (2020) utilized supervised models, including PLS-DA, SVM, and random forest, to predict the types of traditional Korean paper with varying raw materials. Hwang et al. (2023) also reported the feasibility of discriminating manufacturing origins with artificial neural networks (ANN) and infrared spectroscopy. Canals et al. (2008), Ruiz et al. (2011), and Xia et al. (2023) developed algorithms for classifying a wide range of paper types, such as base, coated, printed, recycled, and hygiene papers using infrared spectra data.
Despite the considerable amount of research on the utilization of multivariate statistical methods in conjunction with IR spectroscopy for classifying diverse paper products, a significant research gap exists pertaining to the classification of copy papers within the same grade, employing IR spectroscopy combined with machine learning algorithms, such as PLS-DA, SVM, and KNN.
The main objective of this study was to examine the feasibility of employing IR spectroscopy in conjunction with machine learning algorithms to identify copy-paper products of the same grade, such as printing and document papers. To this end, a dataset comprising 140 IR spectra of copy-paper samples was collected, and PLS-DA, SVM, and KNN were used as the classification models. Moreover, the effectiveness of these three classification models was compared in terms of their efficiency and speed as approaches for analyzing the constituent materials in copy paper.
EXPERIMENTAL
Materials
The samples were conditioned for more than 48 h at a temperature of 23 °C ± 1 °C and a relative humidity of 50% ± 2%, according to ISO 187 (1990). Table 1 provides information regarding paper products, manufacturers, as well as the physical and optical properties examined in this study.
Various non-destructive methods for paper analysis and identification are currently in use. These methods encompass the comparison of physical characteristics such as basis weight, thickness, apparent density, surface roughness, brightness, opacity, and whiteness (Lee et al. 2023). Table 2 provides details about the equipment and standards utilized for the evaluation of the physical and optical properties.
Analysis of the Inorganic Filler Content
Fillers can significantly influence the discrimination procedure for paper samples (Causin et al. 2010; Choi et al. 2018). The content and types of fillers in each paper product vary due to differences in processing parameters, manufacturing conditions, and the formulation of additives (Causin et al. 2010).
The analysis of inorganic filler content, including clay, calcium carbonate (CaCO3), and titanium dioxide (TiO2), was performed using an ash content analyzer (Emtec, Germany). This measurement method is based on a combination of X-ray fluorescence analysis and the X-ray transmission method (Hu et al. 2020).
SEM-EDX
SEM-EDX allowed the production of images at high magnification and the determination of the major elemental components in the samples. A JSM 7401F (JEOL Ltd., Japan) was used for imaging, and energy-dispersive X-ray spectroscopy (EDX, X-Max, Oxford Instruments, UK) was used for elemental analysis. The acceleration voltage of the electron beam was set at 10 kV for imaging and 15 kV for EDX.
Table 1. Physical and Optical Properties of Samples
Table 2. Equipment and Standards for the Analysis of Physical and Optical Properties
ATR-IR Analysis and Data Preprocessing
The ATR-IR spectra of the copy-paper samples were determined using an ATR-IR spectrometer (Bruker Optics, Germany). Every spectrum was recorded in the range of 4,000 to 400 cm−1 at a 4 cm−1 resolution with 32 scans, and the air absorbance was recorded as a reference standard. To eliminate undesired scatter effects, such as baseline shift and nonlinearity, several spectral preprocessing steps were applied, including the selection of a spectral range of 1,800 to 800 cm−1.
The total crystallinity index (TCI) (Nelson and O’Connor 1964) was calculated to derive the peak intensities from the spectral data. In addition, the chemical properties resulting from the raw materials were evaluated along with additives used in the production of copy papers. The TCI was computed by dividing the value at 1,372 cm−1, corresponding to C-H bending, by the peak intensity at 2,900 cm−1, corresponding to C-H and CH2 stretching, as shown in Eq. 1.
TCI = A1372 / A2900        (1)
Additionally, the hydrogen bond intensity (HBI), which represents the bonding index of hydroxyl groups within cellulose, was calculated based on the peak intensity of 1,336 cm−1, corresponding to C–OH, and 3,336 cm−1, indicative of intermolecular hydrogen-bonding, as shown in Eq. 2 (Široký et al. 2010).
HBI = A3336 / A1336        (2)
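As a minimal illustration of Eqs. 1 and 2, the two indices can be computed directly from the peak absorbances. The band positions follow the text above; the absorbance values in this Python sketch are illustrative placeholders, not measured data.

```python
def total_crystallinity_index(a_1372: float, a_2900: float) -> float:
    """TCI = A(1372 cm-1) / A(2900 cm-1) (Nelson and O'Connor 1964)."""
    return a_1372 / a_2900

def hydrogen_bond_intensity(a_3336: float, a_1336: float) -> float:
    """HBI = A(3336 cm-1) / A(1336 cm-1) (Siroky et al. 2010)."""
    return a_3336 / a_1336

# Illustrative absorbance readings for one spectrum
tci = total_crystallinity_index(0.52, 0.40)
hbi = hydrogen_bond_intensity(0.66, 0.33)
print(round(tci, 2), round(hbi, 2))  # 1.3 2.0
```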
Proposed Approach
Machine learning process
The machine learning modeling for the identification of copy paper is visualized in Fig. 1. First, spectral pre-processing, an indispensable part of spectral data analysis, was applied (Lasch 2012). This process involves, among others, outlier rejection, normalization, filtering, detrending, transformation, folding, and feature selection. In this study, a fifth-order polynomial Savitzky–Golay second-derivative filter was employed (Savitzky and Golay 1964). Next, the input dataset was separated into training and test datasets. The training data are employed to train the model, a development set evaluates various versions of the proposed model during development, and the test set confirms the answers to the primary research questions, such as the product names of copy papers. In this study, the input IR spectra dataset was divided into a training set and a test set in a 7:3 ratio. Stratified sampling was applied to split the training and test sets to prevent sample selection bias resulting from the division process (Hens and Tiwari 2012; Ye et al. 2013). Then, the extracted IR spectra datasets were used to train the PLS-DA, SVM, and KNN models.
Fig. 1. The process of machine learning modeling for classification of copy paper
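The pre-processing and splitting steps above can be sketched as follows. The synthetic spectra, class labels, and Savitzky–Golay window length are assumptions for illustration; the paper specifies only the fifth-order polynomial, second-derivative filter and the stratified 7:3 split.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_wavenumbers = 140, 520          # 140 spectra over the selected region
X = rng.normal(size=(n_samples, n_wavenumbers))
y = np.repeat(np.arange(14), 10)             # e.g., 14 products x 10 replicates (assumed)

# Fifth-order polynomial Savitzky-Golay second derivative (window length assumed)
X_d2 = savgol_filter(X, window_length=11, polyorder=5, deriv=2, axis=1)

# Stratified 70/30 train/test split to avoid selection bias
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d2, y, test_size=0.3, stratify=y, random_state=42)
print(X_tr.shape, X_te.shape)  # (98, 520) (42, 520)
```

The `stratify=y` argument ensures each product is represented in the test set in proportion to its share of the full dataset.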
Machine learning model development
Three classification algorithms, namely PLS-DA, SVM, and KNN, were utilized for classification prediction. The complete data processing and classification procedures were performed using the R statistical software (version 4.3.0; R Core Team, Vienna, Austria).
PLS-DA is widely recognized as one of the prominent classification techniques in the field of chemometrics. In addition, extensive research has been conducted and documented on PLS-DA and its properties (Indahl et al. 2007). The PLS-DA model can be expressed as a regression equation: Y = XB, where X represents an n × p matrix representing n samples, with each sample characterized by a vector of p feature values. Matrix Y is an n × k matrix comprising information about the class memberships of the samples, with k denoting the number of classes. The individual element, yi,j, follows the structure described in Eq. 3.
yi,j = 1 if sample i belongs to class j; otherwise, yi,j = 0        (3)
where i and j represent the sample and class numbers, respectively, with i ranging from 1 to n and j ranging from 1 to k. The binary Y matrix exhibits a structured format, in which each row sums up to unity.
After estimating regression matrix B using the PLS2 algorithm, the prediction for a new set of samples can be performed as follows: Ytest = XtestB. However, the predicted values in the Ytest matrix are continuous numbers, requiring a conversion to class memberships. In this study, the class membership of each unknown sample is assigned based on the column index with the largest absolute value in the corresponding row of the Ytest matrix (Chevallier et al. 2006). Figure 2 shows the visualization of PLS-DA model.
The SVM is a nonparametric classifier that constructs a hyperplane to maximize the margin between classes. The hyperplane is built based on the training observations closest to different classes (Samanta et al. 2003). These training observations, known as support vectors, play a crucial role in constructing the separating hyperplane.
Fig. 2. Visualization of PLS-DA model
Using SVM, an optimal hyperplane must be determined that not only separates the classes but also maximizes the margin, i.e., the distance between the hyperplane and the nearest data points of each class (Mancini et al. 2019). By maximizing the margin, SVM improves the generalization and robustness of the classifier when classifying new, unseen data points. SVM identifies the support vectors with the most significant influence on the position of the hyperplane. These support vectors are essential in defining the decision boundary between classes and contribute to the overall performance of the SVM classifier (Chauchard et al. 2008). In this study, a radial basis function (RBF) kernel was used to determine the hyperplane by projecting the data onto a high-dimensional feature space (Vert et al. 2004). The cost and gamma parameters of the RBF-kernel SVM were optimized using grid searches over a logarithmic grid from 2⁻⁷ to 2⁷ for cost and from 10⁻⁵ to 10⁵ for gamma. Cost controls the penalty for misclassification of the data, and gamma controls the width of the Gaussian kernel used for nonlinear classification. Figure 3 shows the nonlinear classification process of the SVM model.
Fig. 3. Visualization of SVM model
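The grid search described above can be sketched as follows. The logarithmic grids match the stated ranges; the synthetic two-class data and the cross-validation fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 10)) + 1.5 * labels[:, None]   # separable toy data

param_grid = {
    "C": 2.0 ** np.arange(-7, 8),        # cost: 2^-7 ... 2^7
    "gamma": 10.0 ** np.arange(-5, 6),   # gamma: 10^-5 ... 10^5
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, labels)
print(round(search.best_score_, 2))
```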
The KNN procedure was employed to classify an unknown sample based on its proximity to previously categorized samples, similar to Mahalanobis' generalized distance technique (Tsuchikawa et al. 2003). Specifically, the predicted class of an unknown sample was determined by considering the classes of its k-nearest neighbors. Analogous to polling, each of the k-closest training-set samples contributes one vote for its respective class, and the unknown sample is assigned to the class receiving the highest number of votes. Therefore, the selection of an appropriate k value, representing the number of neighbors participating in the voting process, is crucial (Guo et al. 2003). The KNN procedure offers a flexible and intuitive approach to classification, relying on the proximity of samples and the majority-vote principle to assign classes to unknown samples. The k value must be chosen carefully to achieve accurate and reliable classification results (Kabir et al. 2003). In this study, the number of nearest neighbors (k) was set to odd numbers in the range of 1 to 11, and the optimal k was determined using a grid search. Figure 4 shows the classification process of the KNN model.
Fig. 4. Visualization of KNN model
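The k selection step above can be sketched as a grid search over the stated odd values of k; the synthetic three-class data and the fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
labels = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 8)) + 1.5 * labels[:, None]   # separable toy data

param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}      # odd k from 1 to 11
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, labels)
print(search.best_params_["n_neighbors"])
```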
Cross-validation
All the employed classification methods utilized cross-validation, a technique used to estimate the generalization error by utilizing holdout data (Xia et al. 2019). Among the available cross-validation techniques, leave-one-out cross-validation (LOOCV) is the most commonly used. LOOCV iteratively excludes a single data point from the training set, while utilizing the remaining data for model training, as depicted in Fig. 5. This process is repeated for each data point in the dataset. LOOCV is particularly advantageous for small datasets, as it minimally affects the available training data.
In this study, the classification models were evaluated using LOOCV to determine the prediction accuracies on the training datasets, which were calculated as the average over all iterations.
Fig. 5. The mechanism of leave one out cross validation (LOOCV)
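The LOOCV mechanism above can be sketched as follows: each sample is held out once while the model trains on the rest, and accuracy is averaged over all runs. The classifier and synthetic data are illustrative assumptions, not the paper's models.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
labels = np.repeat([0, 1], 15)
X = rng.normal(scale=0.3, size=(30, 5)) + labels[:, None]   # well-separated toy data

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, labels,
                         cv=LeaveOneOut())    # one score per held-out sample
print(len(scores), scores.mean())
```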
Confusion matrix
A confusion matrix is widely used for evaluating the performance of classification algorithms (DeVries et al. 2003; Ruuska et al. 2018; Xu et al. 2020). It provides valuable information about the actual and predicted classifications made by the algorithm and is applicable to both two-class (binary) and multiclass classification problems (Xu et al. 2020). Figure 6 illustrates the confusion matrix for a two-class classifier and a multiclass classifier.
In classification tasks, the accuracy of classifying observations into positive and negative categories must be assessed. Observations of the positive class that are correctly classified are referred to as true positives (TP), while correctly classified observations of the negative class are termed true negatives (TN). Furthermore, instances of the positive class incorrectly classified as negative are referred to as false negatives (FN), and instances of the negative class incorrectly classified as positive are termed false positives (FP).
Fig. 6. Visualization of the confusion matrix for two-class and multiclass classifications with N classes
From these values, various performance indicators can be calculated to evaluate the classifier’s ability to detect the target class (Piras et al. 2018). The commonly used indicators include accuracy, sensitivity, and specificity. Accuracy, sensitivity, and specificity are calculated as (TP + TN)/(TP + TN + FP + FN), TP/(TP + FN), and TN/(TN + FP), respectively.
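The three indicators above can be computed directly from the confusion-matrix counts; the counts in this sketch are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)"""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """TP / (TP + FN): the rate of detected positives."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP): the rate of detected negatives."""
    return tn / (tn + fp)

# Illustrative counts for a two-class problem
tp, tn, fp, fn = 40, 45, 5, 10
print(accuracy(tp, tn, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
# 0.85 0.8 0.9
```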
RESULTS AND DISCUSSION
Inorganic Filler Content
Table 3 summarizes the inorganic filler content in the copy-paper samples. Titanium dioxide was not detected in any of the copy papers. From Table 1, an analysis of variance (ANOVA) in the R software with subsequent grouping showed no significant differences in the physical and optical properties of some copy papers at a 95% confidence level. This suggests that the evaluation of physical and optical properties could not provide distinguishing features for the classification and identification of copy-paper products. Such differentiation became even more challenging when the manufacturers were the same.
On the other hand, the results in Table 3 suggest that, compared with the physical and optical methods, the inorganic filler content can serve as a distinguishing feature of each copy paper. For example, in the comparison between copy papers M and N, the clay content detected in M distinguished the two samples.
Nevertheless, the results showed that it remained difficult to distinguish copy papers A and E, which have similar CaCO3 and total ash contents.
Table 3. Summary of Inorganic Filler Content
SEM-EDX
Observing the samples with SEM made it possible to study the fibers and fillers at high magnification. Figures 7 and 8 show the SEM and EDX elemental mapping images, respectively. Copy paper H in Fig. 7(c) differed from the others: it was a machine-finished coated paper, and typical clay pigments were especially visible (Choi et al. 2018). The EDX mapping image also shows that the surface of copy paper H was coated with coating pigments.
Fig. 7. SEM images of copy papers (a: Sample A, b: Sample E, c: Sample H).
Fig. 8. EDX elemental mapping images of copy papers (a: Sample A, b: Sample E, c: Sample H).
However, excluding sample H, the other samples exhibited similar characteristics, making it challenging to demonstrate distinct differences between them. Samples A and E exhibited similar morphology, including fiber shapes and filler types, as shown in Fig. 7. The images showed small crystals of precipitated calcium carbonate (PCC) incorporated into the sizing of the sheets (Ganzerla et al. 2009). Figure 9 shows the EDX spectra of the copy papers. The elemental analysis made with the EDX probe detected calcium carbonate in all samples. The presence of Cl in copy paper A is probably a remnant of the bleaching process, as shown in Fig. 9(a). Chlorine has been used since the late 18th century to produce printing paper (Seo et al. 2023).
The results demonstrated that copy papers with similar filler types and contents were difficult to classify.
Fig. 9. EDX spectra of copy papers (a: Sample A, b: Sample E, c: Sample H)
ATR-IR Data
The spectral data were collected in the range of 4,000 to 400 cm−1. However, as the key peaks associated with cellulose, hemicellulose, lignin, and moisture, the major components of paper, typically fall within the range of 1,800 to 800 cm−1, only this spectral region was extracted and utilized in this study.
By focusing on this specific spectral range, the characteristic peaks relevant to the identification and analysis of cellulose, hemicellulose, lignin, and moisture in paper samples could be evaluated. This selective extraction of the spectral data allowed for a more precise and efficient analysis of paper composition and its constituents (Kim and Eom 2016).
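The range-selection step above can be sketched as a simple mask over the wavenumber axis; the 4 cm−1 spacing mirrors the stated resolution, and the spectrum values are synthetic placeholders.

```python
import numpy as np

# Full 4,000-400 cm-1 axis in descending 4 cm-1 steps
wavenumbers = np.arange(4000, 400 - 1, -4.0)
spectrum = np.random.default_rng(5).normal(size=wavenumbers.size)

# Keep only the 1,800-800 cm-1 fingerprint region
mask = (wavenumbers <= 1800) & (wavenumbers >= 800)
wn_sel, spec_sel = wavenumbers[mask], spectrum[mask]
print(wn_sel.max(), wn_sel.min())  # 1800.0 800.0
```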
Figure 10(a) shows the raw IR spectra, and Fig. 10(b) illustrates the second-derivative spectra. The data preprocessing using the Savitzky-Golay filter serves to make the baseline of spectra consistently adjusted and enhance the peaks, thus emphasizing differences between samples (Hwang et al. 2016). In the raw IR spectra shown in Fig. 10(a), several differences were revealed in the absorption bands at 1647 to 1635 cm-1 (water), 1422 cm-1 (CH2 bending), 1337 cm-1 (amorphous cellulose), and 1200 to 900 cm-1 (cellulose fingerprint) for each copy paper (Garside and Wyeth 2003; Polovka et al. 2006; Ciolacu et al. 2010; Castro et al. 2011). However, in the second derivative spectra presented in Fig. 10(b), additional absorption peaks were observed at 1730, 1700, 1680, 1661, 1644, 1620, 1547, and 1510 to 1505 cm-1.
The 1730 cm-1 peak was attributed to the oxidation of cellulose (Sistach et al. 1998). Differences at 1700 cm-1 may indicate the use of rosin in the sizing of paper products (Ganzerla et al. 2009). The peak at 1661 cm-1 suggests the presence of carbonyl groups along the cellulose chain, potentially keto or aldehyde groups linked to the phenyl ring of lignin (Proniewicz et al. 2001; 2002). The absorption peak at 1510 to 1505 cm-1 is characteristic of lignin. The peaks at 1644 and 1547 cm-1 (Amide I and II) may suggest the use of gelatine as glue (Calvini and Gorassini 2002; Gracia 2001). The band at 1510 cm−1 arises from the stretching vibration of the aromatic C=C ring, and the peak at 1680 cm-1 from the oxidation of keto or aldehyde groups linked to the ring (Ganzerla et al. 2009). The band at 1620 cm-1 is indicative of calcium carbonate and gypsum (Gracia 2001).