Hindawi Journal of Robotics, Volume 2021, Article ID 3986497, 13 pages. https://doi.org/10.1155/2021/3986497

Research Article
Hand Gesture Recognition Algorithm Using SVM and HOG Model for Control of Robotic System

Phat Nguyen Huu and Tan Phung Ngoc
School of Electronics and Telecommunications, Hanoi University of Science and Technology, Hanoi, Vietnam
Correspondence should be addressed to Phat Nguyen Huu; phat.nguyenhuu@hust.edu.vn

Received 7 May 2021; Accepted 3 June 2021; Published 17 June 2021
Academic Editor: L. Fortuna

Copyright © 2021 Phat Nguyen Huu and Tan Phung Ngoc. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract: In this study, we propose a gesture recognition algorithm using support vector machines (SVM) and the histogram of oriented gradients (HOG). We also use a CNN model to classify gestures, and we approach and select techniques suited to the problem of controlling a robotic system. The goal of the algorithm is to detect gestures at real-time processing speed, minimize interference, and reduce the chance of capturing unintentional gestures. The static gesture controls used in this study are on, off, increase, and decrease; the motion gestures are toggling the status switch and increasing or decreasing the volume. Results show that the algorithm reaches up to 99% accuracy with a 70-millisecond execution time per frame, which is suitable for industrial applications.

1. Introduction

Today, science and technology develop very quickly, making new technologies and ideas easy to apply in industry to increase productivity and work efficiency. As a result, industrial robots have become faster, smarter, and cheaper. More and more companies are beginning to integrate the technology with their workforce. This does not mean that robots are replacing humans; while it is true that some of the more undesirable jobs are being filled by machines, the trend has several more positive outcomes for the manufacturing industry.

The actions of a robot are directed by a combination of programming software and controls. Typically, industrial robots are preprogrammed to perform repetitive tasks. However, there are still jobs that require human interaction. Human-robot interaction aims at controlling robots that perform jobs humans cannot do directly. Today, the common control interfaces are mainly screen and keyboard, either directly on the robot or on a remote control. However, they are neither convenient nor user-friendly in some cases.

Currently, a new research direction towards more usable industrial robot control is gesture control. The robot observes human gestures through sensors mounted on the body or through images from a camera, and performs the corresponding preset actions. The basic advantages of this approach are flexibility and speed for the operator, which raises safety for users of heavy robots. Image processing today is no longer prohibitively complicated, achieving speeds equivalent to real time or even faster, so control methods based on image analysis are handy for the user and highly efficient.

In this study, we therefore propose a gesture recognition algorithm using support vector machines (SVM) and the histogram of oriented gradients (HOG), based on our previous work [1]. We also use a CNN model to classify gestures. The goal of the algorithm is to detect gestures at real-time processing speed, minimize interference, and reduce the chance of capturing unintentional gestures. The static gesture controls in this study are on, off, up, and down. The dynamic gestures in this study include the following:

(i) Toggle state switch: the hand goes from the spread state, pointing upwards, into the grip state
(ii) Up order: the hand goes from the outstretched upward state to the left
(iii) Down order: the hand goes from the outstretched upward state to the right

The rest of the study is presented as follows. In Section 2, we present related work. In Sections 3 and 4, we present and evaluate the effectiveness of the proposed model, respectively. Finally, we give a conclusion in Section 5.

2. Related Work

The problem of visual hand recognition and tracking is quite challenging. Many approaches used position markers or colored bands to make the problem of hand recognition easier. However, they cannot be considered a natural interface for robot control due to their inconvenience. The motion recognition problem can be solved by combining basic image processing problems, namely, object detection, recognition, and tracking. Many image processing algorithms have been developed for target detection and recognition. They are divided into two main groups, namely, classical machine learning (ML) and deep learning (DL) techniques [2–14].

ML techniques generally combine basic feature extraction methods on the original data with classifiers such as SVM, decision trees, and nearest neighbors to train identification models. There are several extraction techniques typical for object detection:

(i) Viola–Jones target detection technique [2]: it was the first technique for real-time target detection and is based on Haar feature extraction.
This technique is commonly used in face detection.

(ii) Scale-invariant feature transform (SIFT) [3]: the special feature of SIFT is scale invariance, since it gives stable results at different aspect ratios of the image. The algorithm is also rotation-invariant, which ensures consistent results for different rotations of the object.

(iii) HOG [4]: it is calculated on a dense grid of cells, normalizing the contrast between blocks to improve accuracy. It is mainly used to describe the shape and appearance of an object in the image.

Advanced DL techniques often train multilayered convolutional neural networks on labeled datasets. Several techniques commonly applied in object detection and recognition include the following:

(i) Region proposals (R-CNN, Fast R-CNN, Faster R-CNN, and Cascade R-CNN) [5]: the method proposes regions capable of containing the object and performs identification only there to save computational capacity.

(ii) Single-shot multibox detector (SSD) [6] and related detectors such as YOLO and RefineDet: the main idea of SSD comes from using bounding boxes preinitialized at each location on the image. The SSD computes and evaluates information at each location to see whether there is an object or not; if there is an object at that site, it determines which one it is. Based on the results of close proximity, SSD computes an amalgamation box covering the object.

Since the detection and recognition algorithms require a large amount of computation and their accuracy cannot reach 100%, object tracking techniques are also widely applied in gesture recognition to ensure continuous real-time recording of the subject location and to avoid interference in multisubject environments. There are many tracking algorithms for image processing, such as BOOSTING [7], MIL, KCF [8], TLD, MEDIANFLOW [9], GOTURN [10], MOSSE [11], and CSRT [12, 13]. Depending on whether a problem leans towards accuracy or processing speed, we can select the right algorithm.

Besides, the problem requires processing image recognition in real time. CPUs, GPUs, and FPGAs have their own advantages depending on the specific image processing application. Image processing algorithms usually consume a lot of computing resources. In many cases, the continuously growing performance of CPUs is sufficient to handle such tasks within a specified time; nevertheless, GPU and FPGA processors are widely used to replace CPUs in image processing applications. Besides, CNN (cellular neural network) technology is an analog parallel computing paradigm defined in space and characterized by the locality of connections between processing elements (cells or neurons). It has been introduced as a special high-speed parallel neural structure for image processing and recognition [14].

3. Proposed Algorithm

The goal of the algorithm is to detect gestures at real-time processing speed, minimize interference, and reduce the chance of capturing unintentional gestures. The gesture datasets are depicted in Figure 1.

Figure 1: Gesture dataset for controlling (toggle state switch, up, and down).

The proposed algorithm performs as shown in Figure 2. In the model, the image is processed in real time. We then detect the hand-holding area based on the previous model. The system then extracts the region of interest (ROI) of the frame. The object tracking module gets the coordinates of the ROI and locks the tracked object for the next frames. Next, the identification module is activated to evaluate whether a gesture has started in the ROI. Accordingly, it decides whether to continue processing the ROI of the following frames to find the ending gesture and draw a conclusion. If the algorithm does not detect any gesture, we reinitiate the process. If an operation has been repeated for too long, we start the program again.

Figure 2: Proposed gesture recognition model (data input, calculating gradient, dividing data into blocks, calculating characteristics of each block, and combining to make features of the image).

3.1. Overview of Techniques

3.1.1. Detecting Object. The technique has two requirements. The first is to detect whether the image contains the object or not. The second is to find the position of the subject in the image. As introduced in the previous section, there are many algorithms that perform this task. In this study, the requirements are accuracy of the results as well as speed sufficient for real-time operation. In the system, the first action is the object detection operation, so it is necessary to select techniques with relatively fast calculation speed. Our implementation idea is therefore to use the multiscale and sliding window techniques to separate the image into ROIs. We then crop the image from those ROI regions and extract their HOG features. The SVM technique is used to classify whether an image contains an object or not, and we conclude which areas are likely to contain objects. Finally, we use the nonmaximum suppression technique to find the most suitable ROI.

3.1.2. Tracking Object. When the algorithm has detected the frame containing the object as well as the ROI area, the next task is to lock and track the target while it is moving or partially deformed in the following frames.

Using an object tracker is necessary since a gesture takes place over a few seconds. If we continued to use classification and detection techniques alone to draw conclusions, it would be difficult to achieve the desired processing speed, and it might lead to false conclusions, for example, a gesture whose beginning and ending come from two completely different subjects. During the intermediate stage between the start and end of a gesture, the subject may also be partially deformed, which makes detection techniques difficult to use. The requirement for the object tracking technique in this problem is fast adhesion to the target with acceptable accuracy. In this study, we select the kernelized correlation filter (KCF) algorithm to track the object. The algorithm has good speed and consistent accuracy, and it does not restore the tracking target after losing it, which reduces noise for the gesture control system.
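The paper relies on an existing KCF implementation rather than specifying one. A minimal sketch of how the tracking stage could be wired up with the KCF tracker shipped in opencv-contrib-python; the camera index and the seed ROI are illustrative assumptions, with the ROI normally coming from the detection stage:

```python
import cv2

cap = cv2.VideoCapture(0)          # live camera stream
ok, frame = cap.read()

# Hypothetical ROI from the detection stage: (x, y, width, height).
roi = (100, 100, 190, 190)

# In newer OpenCV builds, this lives under cv2.legacy.TrackerKCF_create().
tracker = cv2.TrackerKCF_create()
tracker.init(frame, roi)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)
    if not found:
        # KCF does not recover a lost target; fall back to detection.
        break
    x, y, w, h = map(int, box)
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```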
3.1.3. Object Classification Technique. Object classification techniques are widely applied in image processing. In this application, the classifier is required to distinguish the beginning gestures from each other. The required accuracy is high, since it determines the achievable control accuracy. Processing speed is not required to be very high, because the location of the object is already known and classification does not occur continuously in every frame.

When the detected object contains the target gesture, the next task is to recognize it. Once gesture recognition begins, we continue to track the target and find the ending gesture. To perform this task, we choose the convolutional neural network (CNN) model. CNN is used in many problems such as image recognition, video analysis, MRI images, and natural language processing.

3.2. Object Detection and Partitioning Techniques. As analyzed in Section 3.1, we choose the HOG feature extraction technique combined with the SVM classification algorithm for the proposed model in Figure 2. The HOG feature was proposed by Lee and Chung [4]. The typical HOG idea comes from the form and state of the object: it can be characterized by the intensity and direction distribution of pixel values, represented as gradient vectors. A gradient is a vector whose elements represent how fast the pixel value changes. The gradient vector carries a lot of useful information: it represents the change in the luminance value of pixels, and it changes when the pixel is in corner and edge areas of the object. Therefore, the HOG feature is effective in representing posture.

Figure 3: Calculating blocks of gradient vector.

The essence of the HOG method is to use information about the distribution of intensity gradients or edge directions to describe local objects in an image. The HOG operators are implemented by dividing the image into subregions called cells. For each cell, we compute a histogram of gradient directions over its points. Combining the histograms gives a representation of the original image. To enhance recognition performance, the local histograms can be contrast-normalized by calculating an intensity threshold over an area larger than a cell, called a block, and using that value to normalize all cells in the block. The result after the normalization step is a feature vector that is more invariant to changes in lighting conditions. The following are the steps to extract the HOG features:

(1) Step 1: calculating the gradient vector for each pixel. For a grayscale image, the pixel values range from 0 to 255. Each pixel has neighboring values on the left, right, above, and below, and its gradient vector is represented by the corresponding difference pairs. Let $I_x$ and $I_y$ be the difference values of the two pairs of left-right and up-down pixels. The gradient vector is calculated using the following formula:

$G = \sqrt{I_x^2 + I_y^2}, \qquad \theta = \arctan\left(\dfrac{I_y}{I_x}\right).$  (1)

(2) Step 2: creating blocks. We divide the output image of the previous step into equal blocks. Each block is divided into 4 cells, where each cell has an equal number of pixels. The blocks overlap each other as shown in Figure 3. The number of blocks is calculated using the following formula:

$n_{block} = \left(\dfrac{w_i - w_b}{w_c} + 1\right) \times \left(\dfrac{h_i - h_b}{h_c} + 1\right),$  (2)

where $w_i$, $h_i$, $w_b$, $h_b$, $w_c$, and $h_c$ are the width and height of the image, block, and cell, respectively.

(3) Step 3: calculating the characteristic vector of each cell in a block. We divide the directional space into $p$ bins (the number of typical vector dimensions of a cell), and the angle of inclination $\alpha(x, y)$ at pixel coordinate $(x, y)$ is discretized into those $p$ bins. For unsigned HOG ($p = 9$), the discretization follows the expression

$B(x, y) = \operatorname{round}\left(\dfrac{p \times \alpha(x, y)}{\pi}\right) \bmod p.$  (3)

For signed HOG ($p = 18$), we have

$B(x, y) = \operatorname{round}\left(\dfrac{p \times \alpha(x, y)}{2\pi}\right) \bmod p,$  (4)

where the bin value is determined by the total gradient intensity of the pixels. A block consists of 4 cells; joining the four cells, we get the feature vector of a block. The characteristic vector dimension of a block is $4 \times p$ bins with $p = 9$ (unsigned HOG) or $p = 18$ (signed HOG).

(4) Step 4: calculating the characteristic vector of the image. We normalize the feature vector of each block by dividing by its magnitude. Combining the feature vectors of all blocks making up the image, we obtain the HOG feature. The number of characteristic vector dimensions of the image is calculated by

$size_{feature/image} = n_{block/image} \times size_{feature/block},$  (5)

where $n_{block/image}$ is the number of blocks per image and $size_{feature/block}$ is the number of characteristic vector dimensions per block.

SVM is a machine learning algorithm belonging to the supervised learning group. It is used in classification or regression problems and is fundamentally a binary classification algorithm: the SVM takes the input and classifies it into two different classes, and SVM training builds a model that performs this classification.

The idea of SVM is to find a hyperplane that separates the data points. This hyperplane divides the space into different domains, each containing one type of data. For example, given a dataset of blue and red points placed on the same plane, we can find a line separating the sets of red and blue points, as shown in Figure 4 [15].

Figure 4: Example of classifying dataset.

However, complex datasets need more than a straight line to be divided. We use an algorithm to map them to a higher-dimensional space ($n$ dimensions) and find the hyperplane there. The example in Figure 5 converts data from two-dimensional space to three-dimensional space [15].

Figure 5: Mapping data from two-dimensional space to three-dimensional space.

There are many hyperplanes that divide a dataset. However, for the best optimization, we need to adhere to the following principles:

(i) First, the hyperplane must definitely be able to divide the dataset.
(ii) Second, the distance from the nearest point of each class to the hyperplane must be as large as possible. This distance is known as the margin.

The margin is the distance from the hyperplane to the two nearest data points of the two classes. SVM optimizes the algorithm by maximizing the margin value, thereby finding the best hyperplane to divide the two data classes. The problem is thus to find two boundaries of the two data classes such that the distance between them is greatest. The boundary of the green class passes through one or several green points, and the boundary of the red class passes through one or several red points. The points lying on the two borders are called the support vectors, since they are responsible for finding the hyperplane, as shown in Figure 6 [15].

Figure 6: Example of separating hyperplanes for two-dimensional space (showing the margin, $d_+$, and $d_-$).

The hyperplane is represented by the function $W \cdot X = b$, where $W$ and $X$ are vectors and $\langle W \cdot X \rangle$ is their scalar product. We label the positive (blue) class dataset as 1 and the negative (red) class dataset as $-1$.

The hyperplane separating the two data classes, $H_0$, satisfies $W \cdot X + b = 0$. This hyperplane creates two half-spaces of data: the space of negative-class data $X_i$ satisfies $W \cdot X_i \le -1$, and the space of positive-class data $X_i$ satisfies $W \cdot X_i \ge 1$.
We next choose two support hyperplanes: $H_1$ passing through the points of the negative class and $H_2$ passing through the points of the positive class, both parallel to $H_0$, where the distance from $H_1$ to $H_0$ is $d_-$, the distance from $H_2$ to $H_0$ is $d_+$, and $m = d_- + d_+$ is the margin.

The optimal hyperplane is the separating hyperplane with the largest margin; the theory of machine learning shows that such a hyperplane minimizes the limit of error. To calculate the margin $m$, we have the following:

The distance from a point $X_k$ to the hyperplane $H_0$ is $|W \cdot X_k + b| / \lVert W \rVert$, where $\lVert W \rVert$ is the length of the vector $W$, calculated as

$\lVert W \rVert = \sqrt{W \cdot W} = \sqrt{w_1^2 + w_2^2 + w_3^2 + \cdots + w_n^2}.$  (6)

The distance from a point $X_i$ on $H_1$ to $H_0$ is

$d_- = \dfrac{|W \cdot X_i + b|}{\lVert W \rVert} = \dfrac{1}{\lVert W \rVert}.$  (7)

The distance from a point $X_j$ on $H_2$ to $H_0$ is

$d_+ = \dfrac{|W \cdot X_j + b|}{\lVert W \rVert} = \dfrac{1}{\lVert W \rVert}.$  (8)

Therefore, the margin is

$m = d_- + d_+ = \dfrac{2}{\lVert W \rVert}.$  (9)

Training the SVM model therefore corresponds to maximizing the margin $2 / \lVert W \rVert$ (equivalently, minimizing $\lVert W \rVert$) under the conditions

$W \cdot X_i + b \le -1 \;\text{ if } y_i = -1, \qquad W \cdot X_i + b \ge 1 \;\text{ if } y_i = 1.$  (10)

This is the condition of the hard-margin SVM problem. The determination of the hyperplanes is assumed under ideal conditions: the dataset can be linearly separated, and the two marginal hyperplanes $H_1$ and $H_2$ can be found with no data points between them. If the data points do not satisfy this condition, the problem has no solution.
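To make the HOG and SVM combination concrete before moving on, here is a minimal training sketch. It uses scikit-image's hog() and scikit-learn's LinearSVC as stand-ins (the study itself trains its detector through dlib, Section 3.5.3), and the random patches are placeholders for real hand and background crops:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Stand-in data: ten 64x64 grayscale patches, labeled +1 (hand) / -1 (background).
patches = rng.random((10, 64, 64))
labels = np.array([1] * 5 + [-1] * 5)

def hog_vector(gray64):
    """64x64 grayscale patch -> HOG feature vector (unsigned, p = 9 bins)."""
    return hog(gray64,
               orientations=9,          # p = 9 bins, as in equation (3)
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),  # 4 cells per block, as in Step 3
               block_norm="L2")         # block normalization of Step 4

X = np.array([hog_vector(p) for p in patches])
clf = LinearSVC(C=2.0)  # C controls the soft-margin tolerance (see Section 3.5.3)
clf.fit(X, labels)

print(clf.predict(X[:2]))  # +1 / -1, per the labeling convention above
```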
A correlation can be found between the version algorithm to determine the position and size of the object with high accuracy. of root (ROI area contains the tracked target) and ROI Multiscale is an image representing many ratios. Using region at the same location in the next frame. *is indicates image pyramids allows us to find objects of the image at the direction the tracked subject has moved in. different scales. We have the original size in terms of width In the standard correlation filter, the following object and height based on the pyramid. *e image is resized model is not updated. If the object image changes signifi- (subsampled) and optionally smoothed (usually through cantly, the tracker performance decreases. In the KCF Gaussian blurring) at each subsequent layer. *ey are tracker, the model of the object being monitored is updated subsampled gradually until several stop criteria are met or directly and continuously using the linear ridge regression model. when the minimum size is reached and subsampling is no longer required. *e process of adhering to targets using the basic KCF method consists of the following steps: *e second important component is the sliding window. A sliding window is a fixed-sized rectangle that slides from (i) Determination of grip area: it can be the initial user- left to right and top to bottom in an image. We will extract defined area or an area detected by the system from the ROI, start the classifier, and get predictive results at each the previous frame step. (ii) Description of features: define the characteristics of Combining with the image pyramid, the sliding window the image area allows localizing objects in different positions and various proportions of the input image. *e results after processing (iii) Regression training: the detected ROI features will can be multiple in one image. *ere are multiple outcomes be added to form a dataset including past and for an object that are at different scale levels at one location present features as a basis for rapid training or neighboring locations. (iv) *e results after regression training are a new *e result shows that our classifier is returning the model, and the model is the basis for the next target growing probability of the object. However, there is only one detection step object and we need to collapse and delete the excess results. *e characteristics of the KCF method are relatively high To solve the problem, we apply the nonmaximum sup- accuracy, medium speed, and especially inability to recover pression (NMS) method that will reduce the overlapping when losing targets in a short time. *erefore, we choose the regions. method for the paper. *e idea of the approach is as follows: (i) We have a set of ROI regions that are called R with corresponding S confidence points and an overlap 3.5. Proposed Algorithm. *e proposed algorithm uses a threshold N. At the same time, initialize an empty dataset from live-stream images of the camera for the list D. purpose of testing and solving fundamental problems. Details of the proposed algorithm are as shown in Figure 7. (ii) Select the ROI area with the highest confidence point and remove from R, and add D. (iii) Compare the newly added ROI region to D in R 3.5.1. Preprocessing. Preprocessing the input image before through the intersection over union index (IoU). 
If recognition is the step that improves image quality to the threshold value is greater than the originally eliminate noise and increase the ability to recognize the initialized overlap threshold N, then remove those correct type of gesture. During the preprocessing step, we ROI regions from R. use adaptive histogram equalization techniques and an (iv) Continue to select ROI region with the highest averaging filter to eliminate noise that enhances the quality of the input image. Besides, we also standardize the image confidence point currently available in R and add to D. size before processing. In the preprocessing step, we use the following (v) Compare the IoU value of the area just added to D techniques: with the rest of regions of R; if it is greater than the overlap threshold; then remove from R. (i) Image resizing: the image is resized to a new size in order to synchronize processing steps, reduce image (vi) Continue performing until there are no more el- size, and save the number of calculations. ements of R. (ii) Contrast limited adaptive histogram equalization (vii) *e ending results are the elements of set D. (CLAHE):it is a method to help balance histogram charts with limiting contrast levels. *e image is divided into small blocks called cells (8 × 8). Each of 3.4. Real-Time Object Tracking Technique. Due to real-time applications, modern object trackers try to reconcile as many these blocks is then charted as normal. *erefore, samples as possible and keep the computation low. Ker- the histogram will be limited to one small area. If nelized correlation filter (KCF) is a variant of correlation any cell exceeds a specified contrast level, those Journal of Robotics 7 multiplier area of 5 × 5 pixels. *e result is shown in Figure 9. Video input 3.5.2. Building Training Sample. *e construction of the training sample will be based on our actual images to bring the most realistic training results for smart indoor appli- Resize cation. *e training sample is taken from the user of actual CLAHE Preprocessing images. median blurring (1) Develop a Detection Training Dataset. *e detection training sample is images of a human hand at the beginning HOG and SVM of motion. Our training sampling method is performed No detect multiscale Hand detection using a python application as the following idea: NMS (i) Continuously receiving frames from the webcam of Yes the laptop and displaying them on the screen. (ii) Drawing on the frames as follows: ROI areas are No fixed as 190 × 190 pixels and started from the top Tracking KCF left corner of the first frame. We then move from the next frames using the sliding window method Yes with a horizontal of 6 pixels and a vertical of 80 pixels. (iii) Executing the command to save a frame within a No Gesture CNN period of about 50 milliseconds, the file name is recognition stored in a text file with coordinates and size of ROI area corresponding to (x, y, x + 190, y + 190). Yes We run the application and move our hands within ROI on each frame. We get a folder containing image templates No Hand tracking and a text file containing hand position information in the respective image. We perform one more time to build the training dataset as shown in Figure 10. 
3.5.2. Building Training Samples. The construction of the training samples is based on our actual images, to bring the most realistic training results for the smart indoor application. The training samples are taken from images of the actual user.

(1) Developing a Detection Training Dataset. The detection training samples are images of a human hand at the beginning of a motion. Our training sampling method is implemented as a Python application with the following idea (a sketch is given after this list):

(i) Continuously receive frames from the laptop webcam and display them on the screen.
(ii) Draw on the frames as follows: the ROI area is fixed at 190 x 190 pixels and starts from the top left corner of the first frame. It then moves across the following frames using the sliding window method with a horizontal stride of 6 pixels and a vertical stride of 80 pixels.
(iii) Save a frame about every 50 milliseconds; the file name is stored in a text file together with the coordinates and size of the corresponding ROI area as (x, y, x + 190, y + 190).

We run the application and move our hand within the ROI in each frame. We obtain a folder containing image templates and a text file containing the hand position in the respective images. We perform this once more to build the training dataset shown in Figure 10.

Figure 10: Dataset contains hand patterns in different positions.

For example, the content of the corresponding ROI zone information file is as follows: "0: (6,0,196,190), 1: (12,0,202,190), 2: (18,0,208,190), 3: (24,0,214,190), 4: (30,0,220,190), 5: (36,0,226,190), 6: (42,0,232,190), 7: (48,0,238,190), 8: (54,0,244,190), 9: (60,0,250,190), 10: (66,0,256,190)."

We then randomly select images and check their locations in the file to evaluate whether the sample quality is consistent with the ROI. We can repeat the procedure with different backgrounds and lighting to extend the dataset.

(2) Developing an Identification Training Dataset. For the identification training samples, we follow a similar idea to the detection training patterns. In this case, the output is the ROI region image cut out from the frames, as shown in Figure 11.

Figure 11: Several hand-up datasets for the training model: (a) 1239.jpg, (b) 1237.jpg, (c) 1235.jpg, (d) 1219.jpg, (e) 1217.jpg, (f) 1215.jpg, (g) 1199.jpg, (h) 1197.jpg, (i) 1195.jpg.

We increase the number of runs and change the execution environment for more diverse background samples, and we perform the procedure for each starting and ending position. The result is four datasets corresponding to 1 starting pose and 3 ending poses. We used the program to build 4 datasets corresponding to the following poses: hand spread up, left, right, and hand held closed. Each dataset contains 1000 feature images.
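A sketch of the sampling tool described in items (i)-(iii), assuming OpenCV and a laptop webcam; the file names, output directory, and stop condition are illustrative, not taken from the study:

```python
import os
import time
import cv2

os.makedirs("samples", exist_ok=True)
cap = cv2.VideoCapture(0)
x = y = idx = 0

with open("rois.txt", "w") as log:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        if y + 190 > h:              # ROI has swept the whole frame
            break
        cv2.imwrite(f"samples/{idx}.jpg", frame)
        log.write(f"{idx}: ({x},{y},{x + 190},{y + 190})\n")
        idx += 1
        x += 6                       # horizontal stride of 6 px
        if x + 190 > w:              # wrap to the next row
            x, y = 0, y + 80         # vertical stride of 80 px
        time.sleep(0.05)             # ~50 ms between saved frames

cap.release()
```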
Testing evaluated accuracy and execution time for three scenarios the training results with module “dlib.test” (sim- including hand zone recognition, static gesture recognition, ple_object_detector) produces the following: training met- and motion gesture recognition. In our study, the resolution rics: precision is 1, recall is 0.995825, and average precision is of the input video is able to change depending on appli- 0.995825. cations. *e ROI regions of detecting objects are resized into Results will be stored as a “∗.svm” file for detection. 64 × 64 or 128 × 128. 3.5.4. Training Gesture Recognition Model. Firstly, we will 4.2. Result build a dataset from a previously created set of 4 folders for 4 poses. Data are read from folders one by one and labeled. *e 4.2.1. Hand Detection Results. We performed reevaluation results of the dataset are listed in the list with the number of of target detection results by HOG and SVM using images elements equal to the number of sample images. Each ele- with many different backgrounds. *e results are shown in ment has a structure consisting of a matrix of image de- Table 3 and Figure 13. Figure 13 shows a number of cases scriptions and a label of the sample. where the hand is determined to be faulty due to the *e training uses “tensorflow.keras” to build the model. background change. We found that image brightness is an *rough the training process with many types of structures, important factor to improve the accuracy of the algorithm. the model with the number and size of layers is selected as follows: 4.2.2. Static Posture Identification Results. According to the dense_layers �[0, 1, 2] CNN model, the output of the classifier will be a 4-element sequence. Each element represents a classifier label and has a layer_sizes �[32, 64, 128] value between 0 and 1. When the representative value of the conv_layers �[1, 2, 3] label is close to 1, the result of the classifier is similar to the We have selected the optimal model with three layers of label. We choose a limit of 0.85. *e label will be selected Conv2D with sizes 32, 64, and 128, respectively. ReLU ac- when the corresponding value is greater than 0.85. If there is tivation function is three-layer MaxPooling2D and one no label with a corresponding value greater than 0.85, the Flatten layer, respectively. *e output is a dense layer of size result will be counted as unrecognizable. If there is a label 128 and a layer of dense of size 4. An input image will have with the corresponding value greater than 85% but not the four outputs. Table 1 is a description of a neural network. In correct label identified before the inspection, the result is Table 1, total parameters are 240,772; trainable parameters also counted as false identification. are 240,772; and nontrainable parameters are 0. *e results are shown in Table 4 and Figure 14. In *e results after training with parameter epochs � 300 Figure 14(a), the actual state is the first one (upward state). are shown in Table 2. However, the results show the third state directing to right 10 Journal of Robotics (a) (b) (c) Figure 12: Result of the training process with C parameter based on [16] (a) underfitting, (b) fitting, and (c) overfitting. Table 1: Sequential parameter model. 
4. Simulation and Result

4.1. Setup. We perform the simulation on a computer with a Core i5 4310 CPU configured at 2 GHz, without a GPU. We evaluate accuracy and execution time for three scenarios: hand zone recognition, static gesture recognition, and motion gesture recognition. In our study, the resolution of the input video can change depending on the application. The ROI regions of detected objects are resized to 64 x 64 or 128 x 128.

4.2. Result

4.2.1. Hand Detection Results. We reevaluated the target detection results of HOG and SVM using images with many different backgrounds. The results are shown in Table 3 and Figure 13. Figure 13 shows a number of cases where the hand is determined incorrectly due to background changes. We found that image brightness is an important factor in improving the accuracy of the algorithm.

Table 3: Results of object detection.

Posture            Number of tests   Number of false identifications   Identification time (milliseconds/image)   Error rate (%)
Spreading arm up   1000              90                                63.47                                       9

Figure 13: Results of detecting hand error for several cases: (a) face, (b) hair, (c) background, and (d) wrist.

4.2.2. Static Posture Identification Results. In the CNN model, the output of the classifier is a 4-element sequence. Each element represents a classifier label and has a value between 0 and 1. When the representative value of a label is close to 1, the result of the classifier is similar to that label. We choose a limit of 0.85: a label is selected when its corresponding value is greater than 0.85. If no label has a corresponding value greater than 0.85, the result is counted as unrecognizable. If a label has a corresponding value greater than 0.85 but is not the correct label identified before the inspection, the result is counted as a false identification. (A sketch of this decision rule follows Figure 14.)

The results are shown in Table 4 and Figure 14. In Figure 14(a), the actual state is the first one (upward state); however, the result shows the third state (directed to the right) with the highest reliability. In Figure 14(b), the actual state is the fourth state (toggle state); however, the result shows the first state (upward) with the highest reliability.

Table 4: Result of static posture recognition for 1000 images.

Posture               False identification rate (per mille)   Identification time per image (milliseconds)   Accuracy
Holding hands         4                                       69.41                                          0.99
Spread posture up     5                                       72.15                                          0.99
Spread left posture   11                                      66.52                                          0.98
Opening right hand    7                                       67.32                                          0.99

Figure 14: Result of false identification for (a) up and (b) toggle gestures.
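The 0.85 decision rule described above can be stated compactly; the label names here are illustrative:

```python
import numpy as np

LABELS = ["up", "left", "right", "hold"]
LIMIT = 0.85

def decide(scores):
    """scores: the 4-element CNN output for one ROI; None = unrecognizable."""
    best = int(np.argmax(scores))
    return LABELS[best] if scores[best] > LIMIT else None

print(decide(np.array([0.95, 0.02, 0.02, 0.01])))  # 'up'
print(decide(np.array([0.60, 0.30, 0.05, 0.05])))  # None
```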
4.2.3. Dynamic Posture Identification Results. We performed the evaluation with a real-time webcam. Image processing achieved real-time speed with the selected computer configuration. The results are shown in Table 5.

Table 5: Result of dynamic posture recognition for 30 videos.

Posture                 Number of false identifications   Accuracy
Switch state (on/off)   2                                 0.93
Increasing              4                                 0.87
Decreasing              3                                 0.90

Due to the limited number of gestures, we have not fully evaluated the effectiveness of the proposed method. However, we can judge it by the accuracy of the hand position detection step as well as the steps recognizing the beginning and ending postures. The results in Table 5 indicate that the algorithm achieves an accuracy over 86%.

We compared our proposal with other methods; the results are shown in Table 6. In Table 6, we can see that detection on datasets from different environments still gives occasional errors when the background changes fast. However, the result is acceptable since detection can take place continuously at a high speed (0.06 seconds). The CNN model has very high accuracy (over 96%) for all postures. This result is suitable for real-time applications.

The proposed method outperforms methods such as hand gesture recognition and detection using boosted classifiers and active learning [19], which reaches approximately 70% accuracy. Another method also has relatively high accuracy [20]; however, it is based on face recognition along with motion detection and motion history, and it has the disadvantage of being difficult to apply in an environment with a lot of noise relating to colors and gestures.

In the proposed method, we aim to take advantage of the rapid detection of HOG and SVM in coordination with identification using the CNN model. The advantage of the CNN model is high accuracy, but it requires a relatively strong GPU for real-time applications. Therefore, with an average-speed processing system we still obtain acceptable results (90%).

Table 6: Results of comparison with other hand gesture detection systems.

Method                                         Training data (frames)   Platform   Accuracy (%)   Detection frame (seconds)   Hardware
Combination edge detection [17]                3154                     CPU        82             From 10 to 15               CPU (i5, 2.3 GHz, 16 GB RAM)
HOG characters and SVM [18]                    1000                     CPU        91             N/A                         N/A
Boosted classifiers and active learning [19]   300                      CPU        70             0.089                       Pentium 4, 3.2 GHz, 1 GB RAM
Our proposal                                   1000                     CPU        90             0.15                        CPU (i5, 2 GHz, 4 GB RAM)

4.2.4. Discussion. The authors of [21–27] targeted CNN processors with low hardware configurations for image processing. In [21], the real-time requirements of video processing applications are fully satisfied, which allows early segmentation and efficient preprocessing techniques to be used to perform sophisticated routines on the chip configuration. The results show the feasibility of real-time image processing to support the gesture recognition process of robot control applications. In [26], the authors proposed a novel algorithm for local binary pattern (LBP) feature extraction using CNN. Using the dynamic parallelism of CNN, feature extraction can be performed effectively in terms of power consumption and speed.

When using the hardware described in Section 4.1, we find that the computer simulation uses 10 to 15% of the CPU in detecting mode and 60 to 70% in classifying mode. RAM usage (including the emulation software) is less than 510 MB. The hardware is therefore just sufficient for the image processing, and we will investigate hardware optimization for image processing as the next step. Besides, the authors of [26] show the feasibility of using CNN for basic image processing, and [26, 27] present an idea of using CNN for deep learning applications. To optimize the system, the required input image size is 128 x 128, with a minimum processing speed of 30 frames per second and 512 MB of RAM. In our study, we use a training algorithm with only 1000 images, a personal computer with a low configuration, and an execution time of less than 0.15 seconds per frame. It is completely applicable to hardware configurations using common CNN chips.

5. Conclusion

In this study, we have built a gesture recognition algorithm based on HOG incorporating SVM that can be applied to robotic systems. The results show that the accuracy of the proposed algorithm reaches up to 99%. However, the gesture dataset is not yet large, which limits the demonstrated effectiveness of the method; the accuracy of the detection and recognition steps for the beginning and ending gestures can therefore still be improved.

In the future, we will take the next steps to increase the frame rate per second, to improve accuracy by increasing the resolution of the input image or using methods from our previous papers [28, 29], and to combine neural networks with other networks to increase the efficiency of the calculations and the performance with any object.

Data Availability

The authors confirm that the data of this study are built by themselves. Other data that are not theirs are fully referenced in this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was carried out in the framework of the project funded by the Ministry of Education and Training (MOET), Vietnam, under Grant no. B2020-BKA-06. The authors would like to thank the MOET for their financial support.

References

[1] P. N. Huu, T. P. Ngoc, and H. T. Manh, "Proposing gesture recognition algorithm using HOG and SVM for smart-home applications," in Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 379, pp. 315–323, Springer International Publishing, New York, NY, USA, 2021.
[2] P. Viola and M. Jones, "Robust real-time face detection," in Proceedings of the Eighth IEEE International Conference on Computer Vision, July 2001.
[3] D. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the Seventh IEEE International Conference on Computer Vision, September 1999.
[4] H.-J. Lee and J.-H. Chung, "Hand gesture recognition using orientation histogram," in Proceedings of the IEEE Region 10 Conference (TENCON 99), "Multimedia Technology for Asia-Pacific Information Infrastructure" (Cat. No. 99CH37030), September 1999.
[5] R. Girshick, "Fast R-CNN," in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), December 2015.
[6] C. Ning, H. Zhou, Y. Song, and J. Tang, "Inception single shot multibox detector for object detection," in Proceedings of the 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), July 2017.
[7] L. Breiman, "Bias, variance, and arcing classifiers," Technical Report 460, University of California, Los Angeles, CA, USA, 1996.
[8] L. Dalei, L. Ruitao, and Y. Xiaogang, "Object tracking based on kernel correlation filter and multi-feature fusion," in Proceedings of the 2019 Chinese Automation Congress (CAC), November 2019.
[9] T. Datta, S. Han, M.-J. Kim, V. Maik, and J. Paik, "Keypoint-based object tracking using modified median flow," in Proceedings of the 2016 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), October 2016.
[10] C. Wang, H. K. Galoogahi, C.-H. Lin, and S. Lucey, "Deep-LK for efficient adaptive object tracking," in Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018.
[11] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2010.
[12] F. Feng, X.-J. Wu, and T. Xu, "Object tracking with kernel correlation filters based on mean shift," in Proceedings of the 2017 International Smart Cities Conference (ISC2), September 2017.
[13] J. A. T. Olivero, C. M. B. Anillo, J. P. G. Barrios, E. M. Morales, E. J. Gachancipa, and C. A. Z. de la Torre, "Comparing state-of-the-art methods of detection and tracking people on security cameras video," in Proceedings of the 2019 XXII Symposium on Image, Signal Processing and Artificial Vision (STSIVA), April 2019.
[14] P. Arena, M. Bucolo, S. Fazzino, and M. Frasca, "The CNN paradigm: shapes and complexity," International Journal of Bifurcation and Chaos, vol. 15, no. 7, pp. 2063–2090, 2005.
[15] S. Kandukuri, A. Klausen, H. V. Khang, and K. Robbersmyr, "Fault diagnostics of wind turbine electric pitch systems using sensor fusion approach," Journal of Physics: Conference Series, vol. 1037, no. 3, p. 032036, 2018.
[16] A. Tharwat, "Parameter investigation of support vector machine classifier with kernel functions," Knowledge and Information Systems, vol. 61, no. 3, p. 12, 2019.
[17] M. Kounavis, "Fingertip detection without the use of depth data, color information, or large training data sets," in Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), October 2017.
[18] K.-P. Feng and F. Yuan, "Static hand gesture recognition based on HOG characters and support vector machines," in Proceedings of the 2013 2nd International Symposium on Instrumentation and Measurement, Sensor Network and Automation (IMSNA), December 2013.
[19] H. Francke, J. R.-d. Solar, and R. Verschae, "Real-time hand gesture detection and recognition using boosted classifiers and active learning," in Advances in Image and Video Technology, D. Mery and L. Rueda, Eds., pp. 533–547, Springer, Berlin, Germany, 2007.
[20] C.-C. Hsieh, D.-H. Liou, and D. Lee, "A real time hand gesture recognition system using motion history image," in Proceedings of the 2010 2nd International Conference on Signal Processing Systems, July 2010.
[21] P. Arena, A. Basile, M. Bucolo, and L. Fortuna, "An object oriented segmentation on analog CNN chip," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 50, no. 7, pp. 837–846, 2003.
[22] P. Kaluzny and S. Kuklinski, "Properties of cellular neural networks in selected image processing applications," in Proceedings of the IEEE International Workshop on Cellular Neural Networks and Their Applications, December 1990.
[23] C.-C. Lee and J. P. de Gyvez, "Color image processing in a cellular neural-network environment," IEEE Transactions on Neural Networks, vol. 7, no. 5, pp. 1086–1098, 1996.
[24] K. R. Crounse and L. O. Chua, "Methods for image processing and pattern formation in cellular neural networks: a tutorial," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 42, no. 10, pp. 583–601, 1995.
[25] S.-A. Chen, J.-F. Chung, S.-F. Liang, and C.-T. Lin, "Cellular neural network (CNN) circuit design for modeling of early-stage human visual system," in Proceedings of the IEEE International Workshop on Biomedical Circuits and Systems, December 2004.
[26] O. Lahdenoja, M. Laiho, and A. Paasio, "Local binary pattern feature vector extraction with CNN," in Proceedings of the 2005 9th International Workshop on Cellular Neural Networks and Their Applications, May 2005.
[27] A. Horvath, M. Hillmer, Q. Lou, X. S. Hu, and M. Niemier, "Cellular neural network friendly convolutional neural networks - CNNs with CNNs," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), March 2017.
[28] N. H. Phat, T. Q. Vinh, and T. Miyoshi, "Video compression schemes using edge feature on wireless video sensor networks," Journal of Electrical and Computer Engineering, vol. 2012, Article ID 421307, 20 pages, 2012.
[29] P. N. Huu, V. Tran-Quang, and T. Miyoshi, "Image compression algorithm considering energy balance on wireless sensor networks," in Proceedings of the 8th IEEE International Conference on Industrial Informatics (INDIN 2010), July 2010.