A comparison study to detect seam carving forgery in JPEG images with deep learning models

Abstract Aim: Although deep learning has been applied in image forgery detection, to our knowledge, the literature still lacks a comprehensive comparison of popular deep learning models for detecting seam-carved images in multimedia forensics, which is addressed in this study. Methods: To investigate the performance of popular deep learning models used in image forensics on seam-carving-based image forgery, we compared nine different deep learning models in detecting untouched JPEG images, seam-insertion images, and seam-removal images (three-class classification), and in distinguishing modified seam-carving images from untouched JPEG images (binary classification). We also investigated different learning algorithms with EfficientNet-B5, adjusting the learning rate with three popular optimizers in deep learning. Results: Our study shows that EfficientNet performs the best among the nine deep learning frameworks, followed by SRNet and LFNet. Different algorithms for adjusting the learning rate result in different testing accuracy with EfficientNet-B5. In our experiments, AdamW, which decouples the optimal choice of weight decay factor from the setting of the learning rate, is generally superior to Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD). Conclusion: Deep learning is very promising for image forensics tasks that are hardly discernible to human perception, but the performance varies over different learning models and frameworks. In addition to the model, the optimizer has a considerable impact on the final detection performance. We recommend EfficientNet, LFNet, and SRNet for seam-carving detection.


INTRODUCTION
Digital images play an essential role in our daily lives. Digital image editing tools are easily available and have evolved tremendously due to increasing demand. With the popularity of easily accessible open-source image editing tools, individuals without advanced image processing skills can effortlessly generate a fake image. Hence, the authenticity of images in the virtual environment has become a critical problem [1,2].
Seam carving is a popular image manipulation strategy suited to content-aware image resizing. Seam carving identifies a number of seams in an image and automatically removes seams to decrease the image size or inserts seams to extend it; the image shrinks or grows by one pixel in height/width at a time. The idea is that the input image is resized by removing or inserting optimal seams, defined as connected pixel paths running from top to bottom or left to right, while preserving the photorealism of the image. A vertical seam is a connected path of pixels with one pixel per row from top to bottom; a horizontal seam is a connected path of pixels with one pixel per column from left to right. The underlying algorithm is elegant and straightforward: by minimizing the energy cost of a seam [3,4] to re-generate an image, seam carving can efficiently resize an image [5]. Using seam carving, an unwanted object can be removed from an image without any perceivable distortion [6,7].
As mentioned, seam carving has been applied for object removal and image resizing; it can also be applied for deceptive manipulations such as image or object insertion. Seam-carved images can be intentionally manipulated to misrepresent or remove original content in the image; thus, detecting the artifacts of seam carving has become a crucial issue in multimedia forensics [8,9].

Introduction to seam carving
In seam carving, imperceptible pixels on the least essential seams, coordinated with their surroundings, are removed or inserted to perform content-aware scaling. Formally, let I be a p × q image. A vertical seam is defined by

s^x = {s_i^x}_{i=1}^p = {(x(i), i)}_{i=1}^p, s.t. ∀i, |x(i) − x(i−1)| ≤ 1,

where x is a mapping x: [1,...,p] → [1,...,q]. Similarly, a horizontal seam is defined by

s^y = {s_j^y}_{j=1}^q = {(j, y(j))}_{j=1}^q, s.t. ∀j, |y(j) − y(j−1)| ≤ 1,

where y is a mapping y: [1,...,q] → [1,...,p].

The pixel path of seam s is denoted I_s. Given an energy function e, the cost of a seam is calculated by E(s) = E(I_s) = Σ_{i=1}^{p} e(I(s_i)). The optimal seam s* minimizes the seam cost: s* = argmin_s E(s). In reference [4], several image importance measures are examined as the energy function. Although no single energy function performs well across all images, the following two measures, e1 and eHoG, in general work quite well:

e1(I) = |∂I/∂x| + |∂I/∂y|,
eHoG(I) = (|∂I/∂x| + |∂I/∂y|) / max(HoG(I(x, y))),

where HoG(I(x, y)) is a histogram of oriented gradients at each pixel.
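The e1 energy and the optimal vertical-seam search above can be sketched in a few lines of pure Python via dynamic programming. This is an illustrative toy, not the implementation from [4]; the function names (`e1_energy`, `find_vertical_seam`) are ours, and the image is assumed to be a grayscale list of rows.

```python
def e1_energy(img):
    """e1 = |dI/dx| + |dI/dy|, approximated with forward differences
    (clamped at the image border)."""
    p, q = len(img), len(img[0])
    e = [[0] * q for _ in range(p)]
    for i in range(p):
        for j in range(q):
            dx = abs(img[i][min(j + 1, q - 1)] - img[i][j])
            dy = abs(img[min(i + 1, p - 1)][j] - img[i][j])
            e[i][j] = dx + dy
    return e

def find_vertical_seam(img):
    """Return the column index per row of the minimum-cost vertical seam,
    found by dynamic programming over the cumulative seam cost."""
    e = e1_energy(img)
    p, q = len(e), len(e[0])
    cost = [row[:] for row in e]
    for i in range(1, p):
        for j in range(q):
            lo, hi = max(0, j - 1), min(q - 1, j + 1)
            cost[i][j] += min(cost[i - 1][lo:hi + 1])
    # backtrack from the cheapest bottom-row cell, staying 8-connected
    j = min(range(q), key=lambda c: cost[p - 1][c])
    seam = [j]
    for i in range(p - 1, 0, -1):
        lo, hi = max(0, j - 1), min(q - 1, j + 1)
        j = min(range(lo, hi + 1), key=lambda c: cost[i - 1][c])
        seam.append(j)
    return list(reversed(seam))
```

Removing the seam then amounts to deleting `img[i][seam[i]]` from each row, shrinking the width by one pixel; seam insertion duplicates those pixels instead.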

Seam carving and relevant image forgery detection
Sarkar et al. [5] proposed a blind seam carving detection system that models the difference JPEG 2-D array with a Markov random process [27]; in their detection system, the Markov features are utilized to capture the imprint in the block-based frequency domain and distinguish modified images from original ones. Fillion and Sharma [28] proposed a model comprising energy-bias-based features and wavelet moments to detect seam carving. Chang et al. [29] illustrated the artifact caused by seam carving in JPEG images; they discussed the misalignment due to the modification, and showed that seam-carved images and unaltered images may exhibit distinctive DCT blocking artifacts after recompression [30]. Liu et al. [7] proposed rich-model-based features [31] with calibrated neighboring joint density and a hybrid large feature mining approach, achieving state-of-the-art detection accuracy [32]. Deep learning is widely used in various areas [33][34][35][36][37][38][39]; it is also applied to detect image forgery [40][41][42]. For example, Yao et al. [43] developed a reliability-fusion-map-based CNN model to detect image forgery. Du et al. [44] proposed a locality-aware autoencoder to detect deepfake images; the authors used a pixel-wise mask to regularize the local interpretation of the autoencoder to improve its generalization capabilities, evaluated their approach on the Dresden dataset, and claimed that their model achieved high performance. Thakur et al. [45] proposed two algorithms to classify and localize image forgery; the authors evaluated their approach using the CASIA-v1, CASIA-v2, DVMM, and BSDS300 datasets and reported state-of-the-art performance. Majumder et al. [42] developed a CNN architecture to train a model for forgery detection, tested on the CASIA v2 dataset.
They also applied transfer learning to adapt VGG-16, VGG-19, and ResNet models for forgery detection, and compared the performance of their proposed model with the transfer learning models. They reported that their method achieved good performance without multi-level feature extraction, and that transfer learning was unsuitable for this task. Muhammad et al. [46] developed an image forgery detection method based on the steerable pyramid transform and local binary pattern techniques; their method was evaluated on three datasets: CASIA v1, CASIA v2, and Columbia color image. Goel et al. [47] developed a copy-move forgery detection algorithm that uses dual-branch convolutional neural networks to identify forged images. Le-Tien et al. [48] developed a neural-network-based technique to identify forged images, evaluated on the CASIA-v2 dataset.
Diallo et al. [41] developed a framework for deep-learning-based detection of forged images, evaluated on the CMI dataset; they also evaluated a set of transfer learning models adapted to forgery detection, namely ResNet, VGG-19, and DenseNet. Han et al. [49] applied a Maximum Mean Discrepancy (MMD) loss to train a machine learning model for forgery detection, evaluated on the DF-TIMIT, UADFV, Celeb-DF, and FaceForensics++ datasets. Aneja et al. [50] developed a new transfer learning approach for forgery detection, evaluated on the FaceForensics++ dataset. Bourouis et al. [51] developed a framework for the Bayesian learning of finite GID mixture models and evaluated it on forgery detection using the MICC-F220 and MICC-F2000 datasets. Cao et al. [52] developed an anti-compression facial forgery detection technique that extracts compression-insensitive features from compressed and uncompressed forgeries and learns a robust partition. Asghar et al. [53] developed an image forensics technique that uses discriminative robust local binary patterns to encode tamper traces, with an SVM classifier to detect forged images. Zhang et al. [54] applied a cross-layer intersection mechanism to a dense U-Net classifier for image forgery detection, evaluated on several datasets: CASIA, NC2016, and Columbia Uncompressed. Optimization also needs to be performed during detection [55][56][57][58]. Other seam-carving detection approaches include, but are not limited to, [59][60][61][62].
Deep learning model-based approaches are widely applied for image forgery detection. These techniques can improve many aspects of life, and further research on image forgery detection is required. Unfortunately, the literature falls short of a comprehensive comparison of state-of-the-art deep learning models in detecting image forgery, especially seam-carving forgery in JPEG images, which is addressed by our study.

Dataset
The dataset used in this research is a custom balanced dataset containing 12,916 images. Half of the dataset, 6458 images, are untouched JPEG images, and the other 6458 are touched (altered/manipulated) images produced with seam carving at a JPEG quality factor of 75. The untouched images are everyday pictures of dimensions 1234 × 1858 or 1858 × 1234. Touched images are modified by object removal, horizontal and/or vertical resizing, or other forms of image forgery. Figure 1 presents examples of images from the dataset. In the touched-image category, some of the modifications can be easily observed with the human eye; others carry subtle signal modifications that cannot be distinguished by the human eye and are visually imperceptible. Images produced by exclusively resizing the original image are not included in the dataset; in other words, we exclude dimensionally modified images from the untouched image set. Excluding resized images from the dataset eliminates the ambiguity in labeling when we resize the images for uniformity, since our models constrain us to train on smaller resized versions of each image.

Convolutional neural networks
Figure 2 briefly shows the architecture of a simple convolutional neural network (CNN), which starts with an input dataset that mainly consists of images or videos. The convolutional layer is the first stage of the CNN structure; it takes the input images and extracts various features by convolving the input with a filter. The output of the convolutional layer is a feature map passed through an activation function. The resulting feature map is then fed to subsequent layers to capture further features of the input image. In most CNN architectures, the pooling layer follows the convolutional layer.
The pooling layer reduces the number of connections between layers and operates independently on each feature map to reduce computational cost; it serves as a bridge between the convolutional layer and the fully connected layer. After the pooling layer, the output is flattened and fed to the fully connected layer, which is based on weights, biases, and neurons. The fully connected layers connect the neurons between two different layers and are the last layers of the CNN structure. It is worth noting that Figure 2 only shows a simple CNN architecture; different deep learning models come with different structures and layers. CNNs have shown tremendous success in image processing and computer vision. A CNN typically consists of a number of convolutional layers chained together with fully connected layers. Using two-dimensional filters, the CNN computes feature maps of the given image. The layers also include max-pooling layers, which preserve the significant features of the image. In a CNN, the output of the last convolutional layer is passed to the dense layer by flattening it into a vector. A softmax function is then used in the dense layer for multiclass classification, assigning a probability value to each class.
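To make the convolution → pooling → flatten → softmax pipeline concrete, here is a minimal pure-Python sketch of the forward operations; this is illustrative only (none of the compared networks is implemented this way in practice, and real layers operate on many channels at once).

```python
import math

def conv2d(img, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in most
    deep learning libraries) of a single-channel image with one filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def relu(fmap):
    """Elementwise activation applied to the feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keeps the strongest response per window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def softmax(logits):
    """Turn the dense layer's outputs into per-class probabilities."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [v / s for v in exps]
```

In a full network, the pooled feature maps are flattened into one vector, multiplied by the dense layer's weight matrix, and the softmax of the result gives the class probabilities.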

EfficientNet architecture
Tan and Le proposed a model called EfficientNet [59]. In their research, they improved CNN architectures by scaling models: improving CNN performance requires scaling in resolution, depth, and width. The most common way to scale up a CNN architecture had been along one of the three dimensions: depth, width, or image resolution. In contrast, EfficientNets perform compound scaling, which scales all three dimensions while preserving a balance between them; compound scaling has also been applied to ResNet architectures. In practice, the dimensions are usually tuned arbitrarily to improve model accuracy. The authors adopted a more principled approach, scaling the model in every dimension using fixed scaling coefficients. Once the fixed scaling coefficients are determined, the baseline network is scaled up to reach higher accuracy. In this study, EfficientNet is characterized by three dimensions: (1) depth; (2) width; and (3) resolution; each dimension is scaled by a fixed coefficient.
Starting from the baseline EfficientNet-B0, the authors in the paper [59] apply the compound scaling method to scale it up in two steps: first, fixing the compound coefficient φ = 1, they perform a small grid search for the scaling constants α (depth), β (width), and γ (resolution); second, they fix α, β, and γ and scale up the baseline network with larger φ to obtain EfficientNet-B1 through B7. We compare AlexNet [63], Xception [64], ResNet50 [33], VGG [34], BayarNet [35], HeNet [36], YeNet [37], LFNet [38], SRNet [39], and EfficientNet [59] in our experiment. These models are slightly modified to output three predicted probabilities in the last layer. Each model was trained for 50 epochs, until it had adequately converged in terms of training accuracy and loss. Figure 4 shows the training accuracy (A) and training loss (B) of each network over the 50 epochs.
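The compound scaling rule can be sketched numerically. The constants below (α = 1.2, β = 1.1, γ = 1.15) are the grid-searched values reported in the EfficientNet paper [59]; the function name and constraint-check are ours.

```python
# Compound scaling from EfficientNet [59]: depth, width, and resolution
# are scaled together as d = alpha**phi, w = beta**phi, r = gamma**phi.
# The grid search constrains alpha * beta**2 * gamma**2 ~= 2, so that the
# total FLOPS grow by roughly 2**phi for each unit increase of phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # values reported in [59]

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# The constraint the grid search approximately enforces:
assert abs(ALPHA * BETA ** 2 * GAMMA ** 2 - 2.0) < 0.1
```

For example, `compound_scale(0)` gives the unscaled B0 baseline `(1.0, 1.0, 1.0)`, while larger φ yields the deeper, wider, higher-resolution B1-B7 variants.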
The learning rate scheduler ReduceLROnPlateau [58] is used to allow dynamic learning rate reduction based on validation measurements; the scheduler had a factor of 0.5, a patience value of 1, and a threshold of 0.0001. The cross-entropy loss is adopted. Cross-entropy for n classes is defined as L = −Σ_{i=1}^{n} t_i log(p_i), where t_i is the truth label and p_i is the softmax probability for the i-th class. The experiment shows that only trivial progress appeared for each model after 50 epochs. We utilized multiple classifiers in the experiments.
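A minimal sketch of the scheduler's plateau logic and the cross-entropy loss, using the settings above (factor 0.5, patience 1, threshold 0.0001). This mirrors the behavior described here rather than the library implementation, which also supports relative thresholds, cooldown, and other modes.

```python
import math

def cross_entropy(truth, probs):
    """L = -sum_i t_i * log(p_i) over the n classes (one-hot truth labels)."""
    return -sum(t * math.log(p) for t, p in zip(truth, probs) if t > 0)

class PlateauScheduler:
    """Halve the learning rate when the validation loss stops improving
    by more than `threshold` for more than `patience` consecutive epochs."""
    def __init__(self, lr, factor=0.5, patience=1, threshold=1e-4):
        self.lr, self.factor = lr, factor
        self.patience, self.threshold = patience, threshold
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best - self.threshold:   # meaningful improvement
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:     # plateau: reduce the lr
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

After each validation pass, `step(val_loss)` is called; once the loss plateaus for two epochs in a row (patience 1), the learning rate is multiplied by 0.5.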

Evaluation metrics
In this research, classification accuracy is measured as (nt/ts) × 100 (%), where nt represents the number of correctly predicted samples and ts represents the total number of testing samples. When classifying original, seam-insertion, and seam-removal images (the latter two are produced from the original untouched images), the ratio of the classes was kept at 1:1:1 in the testing set. In addition, receiver operating characteristic (ROC) curves were calculated to assess the performance of all the deep learning models. The area under the ROC curve (AUC) is used as a performance measure; the ROC curve represents the relation between the true positive rate and the false positive rate. An AUC value close to 1 indicates that the model has outstanding performance, i.e., a good measure of separability.
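Both metrics can be sketched in pure Python. Accuracy follows the definition above; AUC is computed here via the rank (Mann-Whitney) statistic, which equals the area under the ROC curve for a binary problem. For the three-class case, one such AUC is computed per class in a one-vs-rest fashion. The function names are ours.

```python
def accuracy(y_true, y_pred):
    """(nt / ts) * 100, as defined in the text."""
    nt = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * nt / len(y_true)

def auc_binary(y_true, scores):
    """AUC as the probability that a randomly chosen positive sample is
    scored above a randomly chosen negative one (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For class k in the three-class setting, `y_true` is 1 where the true label is k and 0 otherwise, and `scores` are the model's softmax probabilities for class k.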

Performance evaluation of networks
In the performance evaluation, we first evaluated EfficientNet and compared it with the other CNNs. Table 1 shows the results over the retargeting ratio parameters. An ROC plot visualizes a model's trade-off between sensitivity and specificity. Sensitivity refers to the ability to correctly identify entries that fall into the positive class; specificity refers to the ability to correctly identify entries that fall into the negative class. Hence, an ROC plot with its AUC shows how well the model can distinguish between classes. In our experiment, we performed three-class classification over original, seam-inserted, and seam-removed images. To distinguish the images among the three classes, original (class 0), seam inserted (class 1), and seam removed (class 2), the ROC curve for each class was generated, and the AUC value for each ROC curve was calculated. For every CNN model, we instantiate a visualizer object, fit it to the training data, and then generate the score by feeding in the test data. Figure 5 illustrates the ROC curves computed to evaluate the performance of the CNN models in detail. As illustrated in Figure 5, the ROC curves of EfficientNet, LFNet, and SRNet are close to each other, and we can conclude that EfficientNet, LFNet, and SRNet perform better than the other CNN models.
The AUC values with EfficientNet are 0.991, 0.996, and 0.994 for the three classes, which are higher than those with LFNet and SRNet. The thoroughly analyzed results of Table 1 and Figure 5 reveal that EfficientNet performs the best.
We also compared three popular optimizers in the binary classification between untouched JPEG images and seam-carving JPEG images: Stochastic Gradient Descent (SGD) [55], Adaptive Moment Estimation (Adam) [56], and AdamW [57], which decouples the optimal choice of the weight decay factor from the setting of the learning rate.
Adaptive optimizers like Adam have become a default choice for training neural networks. However, when aiming for state-of-the-art results, researchers often prefer stochastic gradient descent (SGD) with momentum because models trained with Adam have been observed not to generalize as well.
In the Adam optimizer, weight decay is performed only after controlling the parameter-wise step size. The weight decay or regularization term does not end up in the moving averages and is thus only proportional to the weight itself. The authors of [57] show experimentally that AdamW yields better training loss and that the models generalize much better than models trained with Adam.

Figure 5. The ROC curves for the three-class classification of original (class 0), seam-inserted (class 1), and seam-removed (class 2) images; the AUC value for each ROC curve was calculated.
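The difference between L2-regularized Adam and AdamW's decoupled weight decay can be sketched as a single scalar parameter update. This is an illustrative pure-Python sketch, not the optimizers' library implementations; the default hyperparameters are the common Adam settings, and the function name is ours.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update of scalar parameter w at step t (t >= 1).
    With decoupled=False the decay is folded into the gradient
    (L2-regularized Adam), so it passes through the moving averages;
    with decoupled=True it is applied directly to the weight, as in
    AdamW [57]."""
    if weight_decay and not decoupled:
        grad = grad + weight_decay * w          # decay enters m and v
    m = b1 * m + (1 - b1) * grad                # first-moment moving average
    v = b2 * v + (1 - b2) * grad * grad         # second-moment moving average
    m_hat = m / (1 - b1 ** t)                   # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        w = w - lr * weight_decay * w           # AdamW: decay outside the averages
    return w, m, v
```

With a zero gradient, the decoupled variant shrinks the weight by exactly `lr * weight_decay * w`, whereas the L2-regularized variant routes the decay through the adaptive step size, so the two updates differ.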
We compare the Adam, AdamW, and SGD optimizers in this study. We ran ten experiments under each learning algorithm, with randomly selected training, validation, and testing datasets in each experiment; the boxplots of the ten testing results under the three optimizers are given in Figure 6. The X-axis shows the three optimizers, and the Y-axis shows the testing accuracy (%). Our results show that AdamW and Adam are better than SGD in this study.
We note that AdamW and Adam perform better than SGD with EfficientNet-B5 in this study, whereas across different deep learning architectures and models in a recent study on detecting COVID-19 in chest X-ray images, SGD was generally superior to Adam and AdamW [26].

RESULTS
Deep learning is very successful in computer vision tasks such as object detection and image classification, but those tasks can be easily perceived by human beings; in contrast, human eyes are incapable of distinguishing forged images from untouched ones. To investigate the performance of popular deep learning models used in image forensics, including image steganalysis and image forgery detection, on seam-carving-based image forgery, we compared several deep learning models; the study shows that EfficientNet performs the best, followed by SRNet and LFNet. The current study also demonstrates that different optimizers lead to different testing accuracy with EfficientNet-B5; in the experiments, AdamW is generally superior to Adam and SGD. The current study also indicates that deep learning is very promising for image forensics tasks that are hardly discernible to human perception.

Authors' contributions
Performed the comparison study of the nine deep learning models and drafted the paper: Celebi NH
Helped in producing a part of the data and assisted in the draft: Hsu TL
Supervised the study, assisted in the experiments, and the draft: Liu QZ

Availability of data and materials
The dataset will be available at https://github.com/Frank-SHSU/seam_carving-image-dataset

Financial support and sponsorship
None.

Conflicts of interest
All authors declared that there are no conflicts of interest.

Ethical approval and consent to participate
Not applicable.