Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

A review of methodologies for natural-language-facilitated human–robot cooperation:

A review of methodologies for natural-language-facilitated human–robot cooperation: Natural-language-facilitated human–robot cooperation refers to using natural language to facilitate interactive information sharing and task executions with a common goal constraint between robots and humans. Recently, natural-language- facilitated human–robot cooperation research has received increasing attention. Typical natural-language-facilitated human–robot cooperation scenarios include robotic daily assistance, robotic health caregiving, intelligent manufactur- ing, autonomous navigation, and robot social accompany. However, a thorough review, which can reveal latest meth- odologies of using natural language to facilitate human–robot cooperation, is missing. In this review, we comprehensively investigated natural-language-facilitated human–robot cooperation methodologies, by summarizing natural-language- facilitated human–robot cooperation research as three aspects (natural language instruction understanding, natural language-based execution plan generation, knowledge-world mapping). We also made in-depth analysis on theoretical methods, applications, and model advantages and disadvantages. Based on our paper review and perspective, future directions of natural-language-facilitated human–robot cooperation research were discussed. Keywords Natural language, human–robot cooperation, NL instruction understanding, NL-based execution plan generation, knowledge-world mapping Date received: 22 December 2017; accepted: 23 April 2019 Topic: AI in Robotics; Human Robot/Machine Interaction Topic Editor: Chrystopher L Nehaniv Associate Editor: Hagen Lehmann of assembly sequence. By giving robots NL instructions, Introduction such as “drill holes, then clean surface, last install screws,” Attracted by the naturalness of natural language (NL) com- a human’s high-level plan was combined with robots’ low- munications among humans, intelligent robots start to level executions, such as “grasping drillers and brush,” and understand NL to develop intuitive human–robot coopera- “motion planning in drilling, cleaning, and installing,” and 1,2 tion in various tasks. Natural-language-facilitated finally conducting natural cooperation. human–robot cooperation (NLC) has received increasing attention in human-involved robotics research over the recent decade. By using NL, human intelligence at high- Robotics Institute (RI), Carnegie Mellon University, Pittsburgh, PA, USA level task planning and robot physical capability—such as Intelligent Robotics and Systems Lab (IRSL), Colorado School of Mines, 3 4 2 Golden, CO, USA force, precision, and speed —at low-level task execu- 5,6 tions are combined to perform intuitive cooperation. For Corresponding author: example, in furniture assembly, it is challenging to perform Xiaoli Zhang, Intelligent Robotics and Systems Lab (IRSL), Colorado natural cooperation, for that a human has limited precision School of Mines, Golden, CO 80401, USA. and speed in hold driller, while a robot lacks understanding Email: xlzhang@mines.edu Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/ open-access-at-sage). 2 International Journal of Advanced Robotic Systems Figure 1. Promising areas using NLC. (a) daily robotic assistance using NL. A robot categorized daily objects with human NL instructions. (b) Autonomous manufacturing using NL. An industrial robot welded parts under human’s oral instructions. (c) Robotic navigation using NL. A quadcopter navigated in indoor Figure 2. The annual amount of NLC-related publications since environments with human’s oral guidance. (d) Social accompany. the year 2000 according to our paper review. In the past 18 years, A pet dog is playing balls with a human with socialized verbal the number of NLC publications are steadily increasing and communications. NLC: natural-language-facilitated human–robot reaching a history-high level in current time, revealing that NLC cooperation; NL: natural language. research is encouraged by other researches such as robotics and NLP. NLC: natural-language-facilitated human–robot coopera- Currently, typical manners in human–robot cooperation tion; NL: natural language. include tactile indications, such as contact location, force 8 9 strength, and visual indications, such as body pose, and papers were retrieved from Google Scholar, then with motion. Compared with these methods, using NL to con- a focus of NL-facilitated human–robot cooperation, duct an intuitive NLC has several advantages. First, NL about 570 papers were related. The publication trend makes human–robot cooperation natural. For traditional is shown in Figure 2, where the increasing significance methods mentioned above, humans need to be trained to use of NLC is reflected by steadily increasing publication certain actions/poses for making themselves understandable numbers. 11,12 by a robot. While in NLC, even nonexpert users without Compared with existing review papers about human– prior training can use verbal conversations to instruct robot cooperation using communication manners such as 28,29 30 31 robots. Second, NL transfers human commands efficiently. gesture and pose, action and motion, and tactile, The traditional communication methods using visual/motion a review paper about human–robot cooperation using indications require the design of informative patterns, “‘lift NL communication is lacking. Therefore, given the hand’ means ‘stop’, ‘horizontal hand movements’ means huge potentials of facilitating human–robot cooperation ‘follow,’” for delivering human commands. Existing lan- and increasingly received attention in NLC, in this guages, such as “English, Chinese, and German,” already review paper, we aim to summarize the state-of-the-art have standard linguistic structures, which contain abun- NLC methodologies in wide-range domains, revealing dant informative expressions to serve as patterns. current research progress and signposting future NLC NL-based methods do not need to design specific informa- research. Our novelty is that we summarized the NLC tive patterns for various NL commands, making human– research as three aspects: NL instruction understanding, robot cooperation efficient. Lastly, since NL instructions NL-based execution plan generation, and knowledge- are delivered orally instead of being physically involved, world mapping. Each aspect was comprehensively human hands are set free to perform more important execu- analyzed with research progress, method advantages, and limitations. The organization of this article is shown tions. Typical areas using NLC are shown in Figure 1. Advancements of NLP support an accurate understand- in Figure 3. ing of the task in NLC. Advancement of a robot’s physical capability support increasingly improved task execution in Framework of NLC realization NLC. With supporting technique from both natural lan- guage processing (NLP) and robot execution, NLC has Realization of NLC is challenging due to the following been developed from low-cognition-level symbol matching aspects. First, human NL is abstract and ambiguous. It is control, such as using “yes/no” to control robotic arms, to hard to understand humans accurately during task assign- high-cognition-level task understanding, such as identify- ments, impeding natural communications between a robot ing a plan from the description “go straight and turn left at and a human. Second, NL-instructed plans are implicit. It is the second cross.” difficult to reason appropriate execution plans from human NLC research is regularly published in international NL instructions for effective human–robot cooperation. 20 21 22 23 journals, such as IJRR, TRO, AI, ]and KBS, and Third, NL-instructed knowledge is information- 24 25 international conferences such as ICRA, IROS, and incomplete and real-world inconsistent. It is difficult to AAAI. By using keywords “‘NLP, human, robot, coop- map enough theoretical knowledge into the real world for eration, speech, dialog, natural language,” about 1400 supporting successful NLC. To solve these problems for Liu and Zhang 3 Figure 4. Typical literal models for NL instruction understanding. (a) House et al. is a grammar model. The robotic arm’s motion was controlled by predefined vowels, such as “aw, ee, ch,” in Figure 3. Organization of this review paper. This review sys- human speech. (b) Dominey et al. is an association model. NL tematically summarized methodologies for using NL to facilitate expressions, such as “OpenLeft,” was interpreted as specific parameter “open left hand for 1 DOF” for robotic arms. NL: human–robot cooperation. Three main researches are intro- duced as NL instruction understanding, NL-based execution plan natural language. DOF: degrees of freedom. generation, and knowledge-world mapping. In each research, typical models, application scenarios, model comparison, and implicitly extracted indicated by humans. The difference open problems are summarized. between them, however, is the information source. The literal models only extract information from human NL effective and natural NLC, mainly three types of research instructions, while the interpreted models will also extract have been done. information from human’s surrounding environment. With literal models, the robot understands tasks merely by fol- NL instruction understanding: To accurately under- lowing human NL instructions, while with interpreted mod- stand assignments during NLC, the research of NL els, robots understand tasks by critically thinking about instruction understanding has been done to build cooperation-related practical environment conditions, semantic models for extracting cooperation-related becoming situation aware. knowledge from human NL instructions. From the model construction perspective, to analyze NL-based execution plan generation: To reason a meanings of human NL instructions in NLC, literal models robot’s execution plans from human NL instruc- mainly use literal linguistic features, such as words, Part- tions, the research of NL-based execution plan gen- of-Speech (PoS) tags, word dependencies, word references, eration has been done to create various reasoning and sentence syntax structures, shown in Figure 4; inter- mechanisms for identifying human requests and for- preted models mainly use interpreted linguistic features, mulize robot execution strategies. such as temporal and spatial relations, object categories, Knowledge-world mapping: To map NL-instructed object physical properties, object functional roles, action theoretical knowledge to real-world situations for usages, and task execution methods, as shown in Figure practical cooperation, the research of knowledge- 5. Literal linguistic features were directly extracted from world mapping research has been done to recom- human NL instructions, while interpreted linguistic fea- mend the missing knowledge and correct the tures were indirectly inferred from common sense based real-world inconsistent knowledge for realizing on NL expressions. NLC in various real-world environment. Literal models NL instruction understanding With regard to involvement manners of literal linguistic NL instruction understanding enables a robot to receive features, literal models are categorized into the following human-assigned tasks, identify human-preferred execution types. (1) Grammar model: Literal linguistic feature pat- procedures, and understand the surrounding environment terns such as “action þ destination” are manually defined. from abstract and ambiguous human NL instructions dur- (2) Association model: Literal linguistic features are ing NLC. By improving the robot’s understanding toward mutually associated with commonsense knowledge. the human, the accuracy and naturalness during NLC are improved. To intuitively understand human NL expres- Grammar models. To initially identify key cooperation- sions with an environment awareness, two types of seman- related information, such as goal, tool usage, and action tic analysis models were developed: literal models and sequences, from human NL instructions, grammar patterns interpreted models. For both literal models and interpreted are defined to build grammar models. Grammar patterns models, cooperation-related information is explicitly or refer to keyword combinations, PoS tag combinations, and 4 International Journal of Advanced Robotic Systems relations ; general terms such as “beverage” are specified to “juice” according to cooperation types and task–object probabilistic relations. With this probabilistic association model, the uncer- tainty in NL expressions was modeled, disambiguating NL instructions and improving a robot’s adaptation toward different human users with various NL expressions. Another typical association model is an empirical associa- tion model. High-level abstract literal linguistic features, such as ambiguous words and uncertain NL phrases, are empirically specified by low-level detailed literal features such as action usage, sensor values, and tool usages. The rationale is that general knowledge could be recommended for disambiguating ambiguous NL instructions in specific situations. Compared with probabilistic association mod- Figure 5. A typical interpreted model for NL instruction els, which use objective probabilistic calculation, empirical understanding. Robot memory, real-world states and human NL association models use subjective empirical association. instructions were integrated to instruct robot executions. NL: Typical usages include the following types. natural language. By defining sensor value ranges as ambiguous NL 35,36 keyword-PoS tag combinations. By using these gram- descriptions, such as “slowly, frequently, heavy,” mar models, robot behaviors will be triggered by the gram- ambiguous execution-related NL expressions were mars mentioned in human NL instructions. Some grammar quantitatively interpreted, making ambiguous NL 41,47 patterns explored execution logics. For example, verbs and expressions sensor-perceivable. nouns were combined to describe a type of actions such as By integrating key aspects, such as execution pre- 37–39 V(go) þ NN(Hallway) and V(grasp) þ NN(cup). conditions, action sequences, human preferences, Some grammar patterns explored temporal relations, such tool usages, and location information, into abstract as the if–then relation “if door open, then turn right” and the NL expressions—such as “drill a hole”—human 40,41 step 1 to step 2 relation “go—grasp.” Some grammar instructed high-level plans were specified into patterns explored spatial relations, such as the IN relation detailed robot-executable plans—such as “clean the 4,36,39,42,48 “cup IN room” and the CloseTo relation “cup CloseTo surface,” or “install a screw”. 42,43 plate.” The rationale of the grammar model is that By using discrete fuzzy statuses—such as “close, sentences with a similar meaning have similar syntax struc- far, cold, warm”—to divide continuous sensor data tures. Similarity of NL meanings was calculated by evalu- ranges, unlimited objective sensor values were ating the syntax structure similarity. “translated” into limited subjective human feelings, such as “close to the robot, day is hot,” supporting a 49,50 Association models. To understand abstract and implicit NL human-centered task understanding. execution commands during cooperation, association models By combining human factors, such as “human’s were developed by associating different literal linguistic fea- visual scope,” with linguistic features, such as a key- tures together to extract new semantic meanings. Essentially, word “wrench” in human NL instructions, empirical the association model exploited existed knowledge by creat- association model became environmental-context- ing high-level abstract knowledge from low-level detailed sensitive, making a robot to understand a human knowledge. One typical association model is a probabilistic NL instructions such as “deliver him a wrench” from association model. Informative literal linguistic features in the human perspective “human desired wrench is 51–53 NL instructions were correlated with other informative key- actually the human-visible wrench.” The words by using probability likelihoods computed from human advantage of using association models in NLC is communications. Typical works are as follows. that the robot cognition level is improved by means of mutual knowledge compensation. With this asso- Learning from previous human execution experi- ciation model, a robot can explore unfamiliar envir- ences: Cooperation-needed actions are inferred onments by exploiting its existing knowledge. based on mentioned tasks, locations, and their prob- abilistic associations. Learning from daily common sense: Quantitative Interpreted models dynamic spatial relations such as “away from, between, ... ” have been associated with its corre- Human requests are usually situated, which means human sponding NL expressions based on their probabilistic NL expressions are with default environmental Liu and Zhang 5 preconditions, such as “cup is dirty, a driller is missing,  By exploring multimodality-information sources, robot is far from a human.” Human NL instructions are rich information can be extracted for an accurate closely correlated with situation-related information, such NL instruction understanding. Information in one modality can be compensated by as human tactile indication (tactile modality), human hand/ information learned from other modalities for better body pose (vision modality and motion dynamics modal- NL disambiguation. ity), and environmental conditions (environment sensor Consistency of multimodality information enables modality). mutual confirmations among knowledge from mul- To accurately understand human NL instructions, tiple modalities. A reliable NL command under- interpreted models are developed to integrate informa- standing could be conducted. tion from multimodalities, instead of merely from NL modality. The rationale behind interpreted models is that Supported by these advantages, multimodality models a human is dependent with their surrounding environ- have the potential to understand complex plans and various ment and better understanding of human needs to be users and to perform practical NL instruction understand- environmentally context aware. With multimodality ing in real-world NLC situations. models, information from different modalities related to human, robot, and their surrounding environment was 54–56 aligned to establish semantic corrections. Using NL Model comparison instructions and human-related features to understand Literal models, which use basic linguistic features directly human NL instructions, typical features beyond linguis- from human NL instructions, are shallow literature-level tic features considered in single-modality models also understanding. Interpreted models, which use multimodal- include the following: individual identity detected by ity features interpreted from human NL instructions, are radio-frequency identification (RFID) sensor, touch comprehensive connotation-level understanding. Each of events detected by tactile sensors, facial expressions them has unique advantages, therefore suitable for different (joy, sad), hand poses detected by computer vision application scenarios. For literal models, they are good at systems, and human head orientations detected by scenarios with simple procedures and clear work assign- motion tracking systems. Supported by rich informa- ments, such as robot arm control and robot pose control. tion from multimodality information, typical problems For interpreted models, they are good at scenarios with tackled for NLC include complex-instruction under- involvements of daily common sense, human cognitive 62 61 standing, human-like cooperation, human social logics, rich domain information, such as object physical behavior understanding, and mimicking. For multi- property-assisted object searching, intuitive machine- modality models using environment and robot-related executable plan generation, as well as vision–verbal– features to understand human NL instructions in NLC, motion-supported object delivery. From literal models to typical features also include the following: spatial interpreted models, robots have been more closely inte- object-robot relations indicated by human hand direc- grated with humans both physically and mentally. This tions, temporal robot-speech-and-head-orientation integration enables a robot to accurately understand both dependencies measured by computer vision systems, human requests and practical environments, improving the 63,64 object visual cues detected by cameras, robot sen- effectiveness and naturalness of NLC. sorimotorbehaviors monitoredby bothmotionsystems and computer vision systems. Supportedbyrichinfor- Open problems mation from these features, typical problems tackled in NLC include real-time communication, context-sensitive Although robots using grammar models have an initial cooperation (sensor-speech alignment), machine- capability of understanding human NL instructions during executable task plan generation, and implicit human cooperation, the drawback is that feature correlations request interpretation. Typical algorithms used for con- needed for understanding have been exhaustively listed. structing multimodality models include hidden Markov It is difficult to summarize all the likely encountered gram- model (HMM) for modeling hidden probabilistic rela- mar rules. Compared with grammar models, association 63,65 tions among interpreted linguistic features, Bayesian models give more cooperation-related knowledge to a robot network (BN) for modeling probabilistic transitions by exploiting associations among literal features. Even 66–68 among task-execution steps, and first-order logic for though the association model could interpret abstract lin- modeling semantic constraints among interpreted lin- guistic features into detailed execution plan, it still suffers 69,70 guistic features. These algorithms integrate different from incorrect association problems. These open problems modalities with appropriate contribution distributions are decreasing NL instruction understanding accuracy and and extract contributive feature patterns among modal- further decreasing robot adaptability. ities. Multimodality models have three potential advan- Although interpreted models are capable of comprehen- tages in understanding human NL instructions. sively understanding human NL instructions by 6 International Journal of Advanced Robotic Systems Table 1. Summary of NL instruction understanding methods. Literal models Grammar Association Interpreted models Knowledge Linguistic structures Meaningful concepts Semantic correlations format Algorithms First-order logic Ontology tree Typical classification algorithms (NB, SVM), first-order logic User Low Low High adaptability Tackled Initially understand logic relations, Specify abstract executions Complex task instruction understanding, problems temporal and spatial relations in into machine-executable human-like human–robot cooperation, execution processes executions context-sensitive cooperation Advantages Performance is good and steady in Model human cognitive Rich cooperation-related information is trained situations process, scaling up robot involved. information is more reliable. knowledge Disadvantages Exhaustive listing of NL instructions, Lacking standards for concept Difficult to combine different-modality time-consuming and labor- interpretation and features, difficult to extract important NL intensive interpretation evaluation features 35,36,40,41,43 44,45,47,48,50 63,65,66,67,70 Typical references NL: natural language. considering practical environment conditions, it is difficult logic relations are defined. Regarding reasoning mechan- to combine different types of modalities such as motion, isms, generation models have three main types. speech, and visual cues with an appropriate manner to Probabilistic models: To enable robots with cooper- reveal practical contribution distributions for different ation associative capability, in which a likely plan is modalities. Second, it is difficult to extract contributive inferred, and appropriate tools and actions are rec- features for describing both distinctive and common ommended, probabilistic models were developed aspects of one modality in understanding NL instructions. based on probabilistic dependencies, shown in Fig- Third, the overfitting problem still exists when using multi- ure 6. modality information to understand NL instructions. NL Logic models: To enable robots with logical reason- instruction understanding based on different modalities ing capability, in which internal logics among exe- could be mutually conflicting, thereby preventing the prac- cution procedures are followed, logic models were tical implementation of multimodality models. Model developed based on ontology and first-order logics, details are presented in Table 1. shown in Figure 7. Cognitive models: To enable robots with cognitive thinking capability, in which plans are intuitively NL-based execution plan generation made and adjusted, cognitive models were devel- With task knowledge extracted in NL instruction under- oped based on weighted logics, shown in Figure 8. standing, it is critical to use the task knowledge to plan robot executions in NLC. Models for NL-based execution plan generation (“generation model” for short) are devel- Probabilistic model oped for formulizing robot execution plans, theoretically supporting robots to cooperate with humans in appropriate Joint-probabilistic BN methods. To enable a robot with manners. In these models, previously learned piecemeal cooperation planning based on various observations, knowledge is organized with different algorithm structures. joint-probabilistic BN methods are developed. By using Different algorithms enable the models with different coop- single-joint probability p(x, y), a robot could use the prob- eration manners under various human–robot cooperation abilistic association p between a human NL instruction y, scenarios. For example, dynamic models supported by such as “move,” and one execution parameter x, such as HMM enable real-time NL understanding and execution, object “ball,” to plan simple cooperation such as object while static models supported by naive Bayesian (NB) placement “move ball.” Typical joint-probability asso- enable spatial human–robot relation exploration. During a ciations in NLC include activity-object associations, such plan generation, correlations among NLC-related knowl- as “drink-cup,” activity-environment associations such as 74 75 edge, such as execution steps, step transitions, and actions, “drink—hot day,” and action-sensor associations. Dur- tools, or locations—as well as their temporal, spatial, and ing the generation of cooperation strategies, a single joint- Liu and Zhang 7 Figure 8. In the cognitive model, human’s cognitive process in decision-making was simulated by execution logics with different influence weights, based on which important logics with larger weights could be emphasized and trivial logics with smaller weights could be ignored. With this soft logic manner, the flexible cooperation between a human and a robot could be conducted. probabilistic BN association is used as independent evi- dence to describe one semantic aspect of a task. For using Figure 6. Typical probabilistic models. (a) Takano’s is an HMM multiple joint-probabilistic associations (,), interpreted model, in which NLC task’s potential execution sequences are linguistic features of NLC task are collected from various modeled by hidden Markov statuses. (b) Salvi et al.’s is a naive NL descriptions and sensor data, describing relatively com- Bayesian model, in which observations “object size, object shape” and their conditional correlations such as “size-big, shape- plex plans. Typical methods using multiple joint- ball, ... ” are combined to form joint-probability correlations such probability associations include Viterbi algorithm, NB 74 76 as “object-size-shape, ... .” NLC: natural-language-facilitated algorithm, and Markov random field. With these algo- human–robot cooperation; NL: natural language. rithms, the most complete plan described in human NL instructions are selected as a human-desired plan. With multi-joint-probabilistic BN models, tackled problems are as follows. Modeling plans by extracting linguistic features, 76,77 such as NL instruction patterns ; Enriching cooperation details by aligning multiple types of sensor data, such as speech meaning, task execution statuses, and robot or human motion status ; Making flexible plans by specifying verbally- described tasks with appropriate execution details, 78,79 such as execution actions and effects ; Intuitively cooperating with a human by integrating current NL descriptions with previous execution experiences ; Accurate tool searching by associating theoretical knowledge, such as tool identities with practical real-world evidences, such as tools’ colors and pla- cement locations. One common characteristic of probabilistic models, Figure 7. In the study by Dantam and Stilman, hard logic rela- such as NB, is that dependencies among task features are tions, such as “move ¼ (grasp, place), ... ,” were defined to simplified to be fully or partially independent. In control robot motion in playing chess with a human. 8 International Journal of Advanced Robotic Systems practical situations, when a set of observations are made, decomposed into sequential logic formulas by satisfying evidence, such as speech, object, context, and action which specific NLC task could be accomplished. In a logic involved in cooperation, is usually not mutually indepen- model, logics are equally important without contribution dent. As for task plan representation, this simplification differences toward execution success. Logic relations, brings both negative effects, such as undermining the plan including tool usages, action sequences, and locations, are representation accuracy, and positive effects, such as pre- defined in the structure. Typical tackled problems include venting overfitting problems in plan-representation pro- the following: cess. The common problem of multi-joint-probabilistic Autonomous robot navigation by using logic navi- BN models is that temporal associations are ignored, lim- gation sequences, such as going to a location iting the implementations of real-time NLC. “hallway” then going to a new location “rest 66,70 DBN methods. To enable a temporal knowledge association room.” for real-time cooperation planning, dynamic Bayesian net-  Environment uncertainty modeling by summarizing work (DBN) was developed. With DBN, temporal depen- potential executions, such as “ground atoms (Boo- dencies p(| ) were propagated among NLC-related lean random variables) eats (Dominik, Cereals), uses requests and object usages . Given that the final format (Dominik, Bowl), eats (Michael, Cereals) and uses of DBN is the joint probabilistic form p(y,, , ), DBN is (Michael, Bowl).” still a joint-probabilistic model. A widely used DBN algo-  Robot action control by defining action-usage logics rithm in NLC is HMM algorithm, which uses a Markov such as “move (grasp piece(location, grip), place 66,90–92 chain assumption to explore the hidden influence of previ- piece(location, ungrip)).” ous task-related features on the current NLC status. The  Autonomous failure analysis by looking up first- rationale of HMM in NLC is that human-desired execu- order logic representations to detect the missing 4,92 tions, such as going to a position, grasping a tool, and knowledge, such as “tool brush, action: sweep.” lifting a robot hand, are decided by the previous coopera-  NL-based robot programming by using the grammar tion ( ), such as action sequence, and current cooperation. language, such as point(object, arm-side), lookA- These statuses include environmental conditions, task exe- t(object), and rotate(rot-dir, arm-side). cution progress, and human NL instructions, as well as The drawback of logic models in modeling NLC tasks is working statuses for the human and robot. HMM uses both that logic relations defined in the model are hard con- observation probabilities (absolute probability p(x)) and straints. If one logic formula was violated in practical exe- transitions abilities (conditional probability p(Y/X)) for cution processes, the whole logic structure would be modeling associations P(x, y) among NLC-related knowl- 71,84 inapplicable, and the task execution would fail. This draw- edge. With HMMs, tackled problems mainly include back limits models’ implementation scopes and reduces a real-time task assignments, dynamic human-centered 71,86 robot’s environment adaptability. Moreover, hard con- cooperation adjustment, accurate tool delivering by straints were defined indifferently, ignoring the relative simultaneously fusing multi-view data such as NL instruc- importance of executions. The execution flexibility is tion, shoulder coordinates, shoulders-elbows’ 3-D angle 84,87 undermined due to critical executions not being focused data, and hand poses. Limited by Markov assumptions, and trivial executions not being ignored when the NLC plan HMM is only capable of modeling shallow-level hidden modifications are necessary. correlations among NLC-related knowledge. Moreover, given that hidden statuses need to be explored for HMM modeling, a large amount of training data is needed, limit- Cognitive model ing HMM implementations in unstructured scenarios with 93 94 limited training data availability. Neural science research and psychology research proved that a cognitive human planning is not a sensorimo- tor transforming, instead a goal-based cognitive thinking. Logic model This reasoning is reflected on that cognitive thinking of To support a robot with rational logical reasoning of coop- cooperation is not relying on specific objects and specific eration strategies, rather than merely conducting exhaust- executions, instead it is merely relying on goal realization. ing probabilistic inferences from various NL-indicated Based on this theory, another generation model category is evidences, logic models were developed. Logic models summarized as a cognitive model. Human-like robot cog- teach robots using unviolated logic formulas to describe nitive planning in NLC is reflected in flexibly changing complex execution procedures which include multiple execution plans (different procedures), adjusting execution actions and statuses. Unviolated logics usually are first- orders (same procedures, different orders), removing some order logic formulas, such as “in possible worlds a kitchen less-important execution steps (same procedures, less is a region (8w8x(kitchen(w, x) ! region(w, x))).” The steps), and adding more critical executions procedures rationale behind logic models in NLC is that an NLC task is (similar procedures, similar orders). Liu and Zhang 9 Cognitive models. To develop human-like robot cognitive selection. For the logic model, it is good for scenarios with planning for robust NLC, cognitive models are developed either poor evidence or multiple objective goals, such as by using soft logic, which is defined by both logic formulas assembly planning and cup grasping planning. For the cog- and their weighted importance. A typical cognitive model nitive model, it is good for rich or poor evidence and mul- is Markov logic network (MLN) model. MLN represents tiple subjective goals, such as human emotion-guided NLC task in a way such as “0.3Drill(1) ^ 0.3TransitionFea- social interaction, and human preference-based object sible(1, 2) ^ 0.3Clean(2) ) 0.9Task(1, 2),” imitating the assembly. human cognition process in task planning. In this model, single execution steps and step transitions Open problems were defined by logic formulas, which could be grounded Probabilistic models lack explorations of indirect human into different logic formulas by substituting real-world con- cognitive processes in NLC, limiting naturalness of ditions. With this cognitive model, a flexible execution robotic executions. Logic models are inflexible and incap- plan can be generated by omitting non-contributing and able of simulating a human’s intuitive planning in real- weak-contributing logic formulas and involving strong- world environments. The cognitive model is close to a contributing logic formulas. Different from hard con- human’s cognitive process in simulating flexible straints in logic models, constraints (logic formula) in MLN decision-making processes. However, cognitive models are soft. These soft constraints mean when human NL are still suffering from two types of shortcomings. One instructions are partially obeyed by a robot, the task could shortcoming is that cognitive process simulation is still still be successfully executed. Typical tackled problems not a cognitive process because the fundamental theory include using MLN to generate a flexible machine- of cognitive process modeling is lacking insufficient sup- executable plan from human NL instructions for autono- port for a human-like task execution. The second prob- mous industrial task execution, NL-based cooperation in lem is the difficulties of cognitive model learning. uncertain environments by using MLN to meet constraints Different individuals have different cognitive processes, from both robots’ knowledge availability (human-NL- thus making it difficult to learn a general reasoning model. instructed knowledge) and real-world’s knowledge require- 89,95 Model details are presented in Table 2. ments (practical situation conditions). The advantage of using cognitive models in NLC is that soft logic is rel- atively like a human’s cognitive process reflected in human Knowledge-world mapping NL instructions during cooperation. It helps a robot with intuitive cooperation in unfamiliar situations by modifying, With understanding of NL language and execution plans, it or replacing, and executing plan details, such as tool or is critical for a robot to use this knowledge in practical action usages, improving robots’ cognition levels and cooperation scenarios. Knowledge-world mapping meth- enhancing its environment adaptability. The major draw- ods are developed to enable intuitive human–robot cooper- back is that MLN is still different from human cognitive ation in real-world situations. The general process of processes to consider logic conditions at a deep level to knowledge-world mapping is shown in Figure 9. Consid- enable plan modification, new plan making, and failure ering the different implementation problems, knowledge- analysis. Logic parameters for analyzing real-world condi- world mapping methods include two main types: tions are still insufficient to imitate logic relations in the theoretical knowledge grounding and knowledge gap fill- human mind, thereby limiting robots’ performances in ing. Theoretical knowledge grounding methods accurately adapting to users and environments. mapped learned knowledge items, such as objects, spatial/ temporal logic relations, into corresponding objects and rela- tions in real-world scenarios. Gap filling methods detect and Model comparison recommend both the missing knowledge, which is needed in Usually the probabilistic model is conducted in an end-to- real-world situations, but has not been covered by theoretical end manner, which directly reasons cooperation strategies execution plans, as well as real-world inconsistent knowl- from observations, ignoring internal correlations among edge, which is provided by a human, but could not find execution procedures. A logic model uses a step-by-step corresponding things in practical real-world scenarios. manner, with which ontology correlations and temporal or spatial correlations among execution procedures are Theoretical knowledge grounding explored, enabling process reasoning for intuitive planning. The cognitive model also uses a step-by-step model. To accurately map theoretical knowledge to practical Including logic correlations, the cognitive model also things, knowledge grounding methods are developed. In explores relative influences of execution procedures, these methods, a knowledge item is defined by properties, enabling a flexibly plan adjustment. For the probabilistic such as visual properties “object color and shape” captured model, it is good for scenarios with rich evidence and single by RGB cameras, motion properties “action speed” cap- objective goal, such as tool delivery and navigation path tured by motion tracking systems and execution properties 10 International Journal of Advanced Robotic Systems Table 2. Method summary of NL-based execution plan generation. Probabilistic model Joint-probabilistic BN Dynamic Bayesian network Logic model Cognitive model Knowledge Joint probabilistic Conditional probabilistic Logic formulas Logic formulas, their format correlations correlations weighted influences Algorithms Joint BN, NB, MRF Conditional BN, Viterbi First-order Logic, Ontology MLN, fuzzy logic Algorithm, HMM Tree User Moderate Moderate Low High adaptability Tackled Modeling meaning Meaning disambiguation, Autonomous robot Support a flexible problems distributions on NL entity-sensor data mapping, navigation, environment machine-executable instructions, aligning human-attended object uncertainty modeling, plan implementation, multi-view sensor data, identification, real-time autonomous execution task execution in action and tool uncertainty assessment failure diagnosis, NL-based unknown recommendation robot programming environments Advantages Good at representing a Good at distinguishing plans Strong logic correlations Flexible task plan, human complete plan among execution steps cognitive process imitating, strong environment/user adaptability Disadvantages Weak capability in modeling Weak capability in Inflexible task execution, Parameters are difficult the mutual distinctiveness representing a complete weak environment to learn, the current among tasks. task; rely on large amount adaptation soft logic is still far of training data from a human cognitive process 72,75,77,78,79 83–87 66,70,88,90,91 73,89,95 Typical references NL: natural language; NB: naı¨ve Bayesian; BN: Bayesian network; HMM: hidden Markov model; MRF: Markov random field; MLN: Markov logic network. “tool usage and location” captured by RFID. Different from  Executing NL-instructed motion plans, such as “pick the direct symbol mapping method, which has an element- up the tire pallet” by focusing on realizing actions “drive, insert, raise, drive, set.” mapping manner, the general property mapping method has Identifying human-desired cooperation places, such a structural mapping manner. The rationale behind these as “lounge, lab, conference room,” by checking methods is that a knowledge item can be successfully spatial-semantic distributions of landmarks, such grounded into the real world by mapping its properties. The as “hallway, gym, ... ” properties were collected by using methods such as “semantic similarity measurement,” which can establish With mapping methods, knowledge could be mapped correlations between an object and their corresponding into real world in a flexible manner, in which only parts properties. One typical mapping method by using general of properties need to be mapped for grounding a theoretical 96,98 property mapping is semantic map. Theoretical indoor item into a real-world thing. This manner could improve a entities such as rooms and objects are identified by mean- robot’s adaptability toward users and environments. The ingful real-world properties, such as location, color, point limitation is that these mapping methods still use predefini- cloud, spatial relation “parallel,” neighbor entities, con- tions to give a robot knowledge, reducing the intuitiveness structing a semantic map with both objective locations and of human–robot cooperation. semantic interpretations “wall, ceiling, wall, floor.” For detecting visual properties in the real world, RGBD cameras are usually used. For spatial relations, it is detected by laser Knowledge gap filling sensors and motion tracking systems. By identifying these properties in the real world, an indoor entity is identified, A theoretical execution plan defines an ideal real-world enabling an accurate robotic navigation in real-world NLC. situation. Given unpredicted aspects in a practical situation, Other typical mapping method also include the following. even if all defined knowledge has been accurately mapped into the real world, it is still challenging to ensure the Object searching by using NL instructions (detected success of NLC by providing all knowledge needed in a by microphones) as well as visual properties such as practical situation. Especially in real-world situations, object color, size, and shape (detected by motion human users and environment conditions vary, causing the tracking systems and cameras). occurrences of knowledge gaps, which are knowledge Liu and Zhang 11 Hierarchical knowledge structure checking, which detects knowledge gaps by checking real-world- available knowledge from top-level goals to low- level NLC execution parameters defined in a hier- 41,103 archical knowledge structure. Knowledge-applicability assessment, which detects knowledge gaps by checking the similarities between theoretical scenarios and real-world 34,103 scenarios. Performance-triggered knowledge gap estimation, which detects knowledge gaps by considering the 105,106 final execution performances. Hierarchical knowledge structure checking has the ratio- nale that if desired knowledge defined in a knowledge structure is missing in real-world situations, then knowl- edge gaps exist. Knowledge applicability assessment has a rationale that if the NLC situation is not similar with the previously trained situations, then knowledge gaps exist. Performance-triggered knowledge gap estimation has a rationale that if the final NLC performances of a robot is not acceptable, then knowledge gaps exist. In this detection stage, execution plan provides reasoning mechanisms. While real world provides practical things such as objects, locations, human identities, and relations such as spatial relations and temporal relations, which are detected by perceiving systems. The second step of gap filling is gap filling. Gap filling methods mainly include the following. Figure 9. Typical methods for theoretical knowledge grounding. In (a), Takano and Nakamura predefined motions, such as Using existing alternative knowledge such as “avoiding, bowing, carring, ... ,” were directly associated with their corresponding symbolic words. In (b), Hemachandra “brush” in the robot knowledge base to replace inap- et al.’s special features, such as “kitchen location, lab propriate knowledge such as “vacuum cleaner” in 4,106 locations, .. . ,” were considered to identify human-desired paths. NLC tasks such as “clean a surface” ; Using general commonsense knowledge “drilling action needs driller” in a robot database to satisfy required by real-world situations but are missing from a the need for a specific type of knowledge such as robot’s knowledge database. “tool for drilling a hole in the install a screw To ensure the success of a robot’s execution, knowledge 106,103 task” ; gap filling methods are developed to fill in these knowledge Asking knowledge input from human users by gaps. There are three main types of knowledge gaps: (1) proactively asking questions such as “where is the environment gaps, which are constraints such as tool avail- 105,107,108 table leg” ; ability and space or location limitations imposed by unfa- Autonomously learning from the Internet for recog- miliar environments ; nizing human daily intentions, such as “drink water, (2) robot gaps, which are constraints such as a robot’s 107,108 wash dishware.” physical structure strength, capable actions, and operation precision ; (3) user gaps, which are missing information In gap filling stage, execution plan describes the needed caused by abstract, ambiguous, or incomplete human NL knowledge items. Real world provides practical objects as 104,105 instructions. Filling up these knowledge gaps well as robot performance monitoring. enhances robot capability in adapting dynamic environ- ments and various tasks or users. Knowledge gap filling Model comparison is challenging in that it is difficult to make a robot aware of its knowledge shortage in specific situations, and it is Knowledge grounding model and knowledge gap filling difficult to make a robot understand how missing knowl- model are two critical steps for a successful mapping edge should be compensated for successful task executions. between NL-instructed theoretical knowledge and real- The first step of gap filling is gap detection. Gap detec- world cooperation situations. For knowledge grounding tion methods mainly include the following. models, the objective is strictly mapping NL-instructed 12 International Journal of Advanced Robotic Systems Table 3. Summary of knowledge-world mapping methods. Knowledge gap filling Hierarchical knowledge Performance-triggered knowledge Theoretical knowledge grounding structure checking gap estimation Knowledge Real-world objects, spatial/temporal/logic Real-world objects, spatial/ Real-world objects, spatial/ format correlations temporal/logic temporal/logic correlations correlations Algorithms Typical classification algorithms First-order logic, ontology Typical classification algorithms tree User Low Middle Middle adaptability Tackled Indoor routine identifying, accurate object Detecting gaps among Detecting gaps among robots, problems searching, scene understanding robots, users, and robots users, and robots Advantages Flexible knowledge usage, improving robots’ Improving smoothness of Improving the success rate of task environment adaptability. task executions executions Disadvantages Predefined manner limits robots’ environment Difficult to decide which Difficult to decide which gaps lead adaptability, execution naturalness and knowledge can miss to the failure of task executions intuitiveness 96,99–101 80,103,106 103,107,108 Typical references objects and logic relations into real-world conditions. It is a availability, location, and distances to a robot or human was necessary step for all the NLC application scenarios, such ignored, it is difficult for a robot to infer which object a as human-like action learning, indoor, and outdoor coop- human user needs. erative navigation. For knowledge gap filling models, the For knowledge gap filling, when a robot queries knowl- objective is to detect and repair missing or incorrect knowl- edge from either a human or open knowledge databases edge in human NL instructions. It is only necessary when such as openCYC, the scalability is limited. For a spe- human NL instruction cannot ensure successful NLC under cific user or a specific open knowledge database, available given real-world conditions. Typical scenarios include contents are insufficient to satisfy general knowledge needs daily assistance such as serving drink, where information in various NLC executions. The time and labor cost are such as correct types of “drink,” “vessel,” and default high, further limiting knowledge supports for NLC. Model places for drink delivery is missing; cooperative surface details are presented in Table 3. processing where execution procedures are incorrect and tools are missing. Discussion DL for better command understanding Open problems Nowadays, NLP is undergoing a deep learning (DL) revo- A typical problem of theoretical knowledge grounding is lution to create sophisticated models for semantic analysis. the non-executable-instruction problem. Human NL- Potential benefits of using advanced DL models for NLC instructed knowledge is usually ambiguous that NL- include the following. mentioned objects are too ambiguous to be identified in real world; abstract that high-level cooperation strategies  The word embedding methods, add semantic corre- are difficult to be interpreted into low-level execution lations, such as “cats and dogs are animals,” into details; information-incomplete that important cooperation irrelevant words, such as “cat, dog.” In future information such as tool usages, action selections, and NLC research, embedding methods could be used working locations are partially ignored; real-world incon- to introduce extra task-specific meanings, such as sistent that human NL-instructed knowledge is not avail- general common sense “drilling needs the tool able in real world. These non-executable problems limit driller” and safety rules, “stay away from hot surface practical executions of human NL-instructed plans. One and sharp tools,” endowing robots with better com- type of cause of non-executable-instruction problems mand understanding with awareness of environment include intrinsic NL characteristics, such as omitting, refer- limitations and human requirements. ring, and simplifying, as well as human speaking habits,  Sequence-to-sequence language model, such as such as different sentence organizations and phrase usages. long- and short-term memory, sequentially out- Another type of cause is the lack of environment under- putted meaning based on the continuously inputted 109,111 standing. For example, if object-related information such as text. In future NLC research, sequence-to- Liu and Zhang 13 sequence models will enable robots to follow real- safety and psychological comfortableness. Given the time instructions for timely executions and modifi- unique aspects of NLC systems in communication and cations, by aligning temporal verbal instructions interaction, the safety considerations are typically three with task-related knowledge, such as “action types, summarized as follows. First, verbally instructed sequence and location assignments.” actions that endanger human being’s safety should be Attention-based NL understanding models, such as rejected or be carefully assessed by the robotic systems. “Recurrent neural networks (RNN) encoder– This requires NLC systems to have safety-related common decoder,” emphasize relatively important words by sense for safety issue understanding and prediction. Sec- increasing the weights of the important expressions. ond, the verbal abuse from either a human to a robot or For example, in translating sentence “that does not from a robot to a human should be avoided, because verbal mean that we want to bring an end to subsidization,” abuse will make a human feel uncomfortable psychologi- keyword “subsidization” which with key information cally and finally influence the performance of NLC sys- 113 121,122 is emphasized. In future NLC research, attention tems. The psychology comfortableness of NLC models could help a robot to focus on human-desired systems requires the research of human behaviors and executions by analyzing verbal attention. human–machine trust. Third, it is necessary to enforce rigid risk assessment and controlled safety verification before releasing the NLC systems into the market, to make sure NLC systems are safe and helpful without causing safety Cost reduction for knowledge learning issues for humans. Cost reduction in knowledge collection is critical for intui- tive NLC. On the one hand, to understand human NL instructions, represent tasks, or fill in knowledge gaps, a Conclusion large scale of reliable knowledge is needed. On the other This article reviewed state-of-the-art methodologies for hand, time, economic cost, and labor investments need to realizing NLC. With in-depth analysis of application sce- be reduced. To solve this problem, two trends in developing narios, method rationales, method formulizations, and cur- knowledge-scaling-up methods appear recently: existing- rent challenges, and research of using NL to push forward knowledge exploitation and new-knowledge exploration. the limits of human–robot cooperation were summarized In existing-knowledge exploitation, existing knowledge is from a high-level perspective. This review article mainly interpreted and extracted into general knowledge, thereby categorized a typical NLC process into three steps: NL increasing knowledge interchangeability. Knowledge for instruction understanding, NL-based execution plan gener- specific situations, such as “use cup and spoon for prepar- ation, and knowledge-world mapping. With these three ing coffee,” could be used for general situations, such as steps, a robot can communicate with a human, reason about “preparing drink.” In new-knowledge exploration, new human NL instruction, and practically provide human- knowledge is collected by proactively asking human and desired cooperation according to human NL instructions. autonomously retrieving from the Word Wide Web, 115 116 117 books, operation logs, and videos. Declaration of conflicting interests The author(s) declared no potential conflicts of interest with NLC system personalization respect to the research, authorship, and/or publication of this article. When a robot cooperates with a specific human for a long time, the personalization of a robot becomes critical. For Funding personalization, it does not only mean defining individua- The author(s) received no financial support for the research, lized knowledge for a robot to adapt to a specific user, but it authorship, and/or publication of this article. also means designing a knowledge-individualization method, for a robot to autonomously adapt to variable ORCID iD users. Therefore, a future research in NLC would be Rui Liu https://orcid.org/0000-0002-8932-1147 developing knowledge-personalization methods to con- Xiaoli Zhang https://orcid.org/0000-0002-2949-4644 sider both execution preferences and social norms, support- ing a long-term NLC personalization. References 1. Winograd T. Understanding natural language. Cogn Psychol Safety consideration in NLC system design 1972; 3: 1–91. When robots are deployed in human-involved environ- 2. Baraglia J, Cakmak M, Nagai Y, et al. Initiative in robot 118,119 ments, it is necessary to follow the safety standards assistance during collaborative task execution. In: ACM/ to minimize and detail safety hazards before actual imple- IEEE HRI, New Zealand, March 2016, pp. 67–74. mentations. The objective of enforcing the safety standards 3. Tellex S A, Kollar T, Dickerson S, et al. Understanding nat- in NLC system design is for protecting humans’ individual ural language commands for robotic navigation and mobile 14 International Journal of Advanced Robotic Systems manipulation. In: AAAI, San Francisco, California, 7–11 20. Hollerbach J. The international journal of robotics research. August 2011, pp. 1507–1514. SAGE. [Online]. http://journals.sagepub.com/home/ijr. 4. Liu R, Webb J, and Zhang X. Natural-language-instructed (accessed 27 January 2017). industrial task execution. In: ASME IDETC/CIE,North 21. Park F. IEEE Xplore: IEEE transactions on robotics [Online]. Carolina, USA, 21–24 August 2016, pp. v01bt02a043– http://ieeexplore.ieee.org/xpl/RecentIssue.jsp? v01bt02a04. punumber¼8860. (accessed 27 January 2017). 5. Tenorth M, Nyga D, and Beetz M. Understanding and execut- 22. Dechter R. Artificial intelligence. [Online]. https://www.jour ing instructions for everyday manipulation tasks from the nals.elsevier.com/artificial-intelligence/. (accessed 27 Janu- world wide web. In: IEEE ICRA, 2010, pp. 1486–1491. ary 2017). 6. Hemachandra S, Duvallet F, and Howard TM. Learning mod- 23. Fujita H and Lu J. Knowledge-based systems. [Online]. els for following natural language directions in unknown https://www.journals.elsevier.com/knowledge-based-sys environments. In: IEEE ICRA, 2015, pp. 5608–5615. tems. (accessed 27 January 2017). 7. Iwata H and Sugano S. Human-robot-contact-state identifica- 24. ICRA2015. in ICRA2015. [Online]. http://ieeexplore.ieee. tion based on tactile recognition. IEEE Trans Ind Electron org/xpl/mostRecentIssue.jsp? punumber¼71 28761. 2005; 52(6): 1468–1477. (accessed 27 January 2017). 8. Kruger J and Surdilovic D. Hand force adjustment: robust 25. IROS2015.in IROS2015. [Online]. https://www.ieee.org/con control of force-coupled human–robot-interaction in assem- ferences_events/conferences/conferenc edetails/index.html? bly processes. CIRP Ann Manuf Technol 2008; 57(1): 41–44. Conf_ID¼33365. (accessed 27 January 2017). 9. Kim S, Jung J, Kavuri S, et al. Intention estimation and rec- 26. AAAI15: Twenty-Ninth conference on artificial intelligence ommendation system based on attention sharing. In: Interna- (AAAI15), 1995. [Online]. http://www.aaai.org/Conferences/ tional conference on neural information processing, 2013, AAAI/aaai15.php. (accessed 27 January 2017). pp. 395–402. 27. GoogleScholar. [Online]. https://scholar.google.com. 10. Liu R and Zhang X. Understanding human behaviors with an 28. Argall BD, Chernova S, and Veloso M. A survey of robot object functional role perspective for robotics. IEEE Trans learning from demonstration. Robot Auton Syst 2009; 57: Cogn Develop Syst 2016; 8(2): 115–127. 469–483. 11. Takano W and Nakamura Y. Action database for categorizing 29. Rautaray SS and Agrawal A. Vision based hand gesture rec- and inferring human poses from video sequences. Robot ognition for human computer interaction: a survey. Artif Auton Syst 2015; 70: 116–125. Intell Rev 2015; 43: 1–54. 12. Raman V, Lignos C, Finucane C, et al. Sorry Dave, I’m afraid 30. Bethel C, Salomon K, and Murphy RR. Survey of psycho- I can’t do that: explaining unachievable robot tasks using physiology measurements applied to human-robot interac- natural language. In: Robotics: science and systems, Berlin, tion. In: IEEE RO-MAN, 2007, pp. 732–737. Germany, 24–28 June 2013. 31. Argall BD and Billard AG. A survey of tactile human–robot 13. Matuszek C, Herbst E, Zettlemoyer L, et al.. Learning to interactions. Robot Auton Syst 2010; 58: 1159–1176. parse natural language commands to a robot control system. 32. House B, Malkin, and Bilmes J. The VoiceBot: a voice con- In: Desai J, Dudek G, Khatib O, and Kumar V (eds) Experi- trolled robot arm. In: ACM SIGCHI, 2009, pp. 183–192. mental robotics. Springer tracts in advanced robotics, Vol. 33. Dominey PF, Mallet A, and Yoshida E. Progress in program- 88. Heidelberg: Springer, 2016, pp. 403–415. ming the hrp-2 humanoid using spoken language. In: IEEE 14. Waldherr S, Romero R, and Thrun S. A gesture based inter- ICRA, 2007, pp. 2169–2174. face for human-robot interaction. Auton Robot 2000; 9(2): 34. Ovchinnikova E and Wachter M. Multi-purpose natural lan- 151–173. guage understanding linked to sensorimotor experience in 15. Hunston S and Francis G. Pattern grammar: a corpus-driven humanoid robots. In: IEEE-RAS Humanoids, 2015, pp. approach to the lexical grammar of English. Comput Linguist 365–372. 2000; 27(2): 318–320. 35. Lee KW, Kim HR, Yoon WC, et al. Designing a human-robot 16. Hsiao K, Vosoughi S, Tellex S, et al. Object schemas for interaction framework for home service robot. In: IEEE RO- responsive robotic language use. In: Proceedings of the third MAN, 2005, pp. 286–293. ACM/IEEE international conference on Human robot inter- 36. Lee S, Kim C, Lee J, et al. Affective effects of speech- action, Amsterdam, The Netherlands, 12–15 March 2008, pp. enabled robots for language learning. In: IEEE SLT, 2010, 233–240. ACM: New York, USA. pp. 145–150. 17. Guerin K, Lea C, and Paxton C. A framework for end-user 37. Motallebipour H and Bering A. A spoken dialogue system to instruction of a robot assistant for manufacturing. In: IEEE control robots. 2002. ICRA, Seattle, May 2015, pp. 6167–6174. 38. Dantam N and Stilman M. The motion grammar: analysis of a 18. Steels L and Kaplan F. AIBO’s first words: the social learning linguistic method for robot control. IEEE Trans Robot 2013; of language and meaning. Evol Commun 2000; 4: 3–32. 29: 704–718. 19. Huang A, Tellex S, and Bachrach A. Natural language com- 39. Bicho E, Louro L, and Erlhagen W. Integrating verbal and mand of an autonomous micro-air vehicle. In: IEEE/RSJ nonverbal communication in a dynamic neural field IROS, 2010, pp. 2663–2669. Liu and Zhang 15 architecture for human–robot interaction. Front Neurorobot 56. Jesse T, Shiqi Z, Raymond M, et al. Learning to interpret natural language commands through human-robot dialog. 2010; 4: 1–13. In: IJCAI, 2015. 40. Mcguire P, Fritsch J, Steil J, et al. Multi-modal human- 57. Foster ME, By T, and Richert M, et al. Human-robot dialogue machine communication for instructing robot grasping tasks. for joint construction tasks. In: ICMI Proceedings of the In: IEEE/RSJ IROS, Vol. 2, 2005, pp. 1082–1088. eighth international conference on multimodal interface, 41. Zender H, Jensfelt P, Mozos O, et al. An Integrated robotic Banff, Alberta, Canada, 2–04 November 2006, pp. 68–71. system for spatial understanding and situated interaction in ACM: USA, New York. indoor environments. In: AAAI, 2007, pp. 1584–1589. 58. Bischoff R and Graefe V. Dependable multimodal commu- 42. Guadarrama S, Riano L, and Golland D. Grounding spatial nication and interaction with robotic assistants. In: IEEE RO- relations for human-robot interaction. In: IEEE/RSJ IROS, MAN, 2002, pp. 300–305. 2013, pp. 1640–1647. 59. Breazeal C and Aryananda L. Recognition of affective com- 43. Scheutz M, Cantrell R, and Schermerhorn P. Toward human- municative intent in robot-directed speech. Auton Robot like task-based dialogue processing for human robot interac- 2002; 12: 83–104. tion. AI Mag 2011; 32: 77–84. 60. Connell J, Marcheret, Pankanti S, et al. An extensible language 44. Kollar T, Perera T, and Nardi D. Learning environmental interface for robot manipulation. In: ICAGI, 2012, pp. 21–30. knowledge from task-based human-robot dialog. In: IEEE 61. Stiefelhagen R, Fugen C, and Gieselmann R. Natural human- ICRA, 2013, pp. 4304–4309. robot interaction using speech. head pose and gestures. In: 45. Fasola J and Mataric M. Using semantic fields to model IEEE/RSJ IROS, 2004, pp. 2422–2427. dynamic spatial relations in a robot architecture for natural 62. Ghidary S, Nakata Y, and Saito H. Multi-modal human robot language instruction of service robots. In: IEEE/RSJ IROS, interaction for map generation. In: IEEE/RSJ IROS, 2001, pp. 2013, pp. 143–150. 2246–2251. 46. Lu D, Wu F, and Chen X. Understanding user instructions by 63. Levinson S, Zhu W, Li D, et al. Automatic language acqui- utilizing open knowledge for service robots. arXiv: 1606. sition by an autonomous robot. In: IJCNN, Vol. 4, 2003, pp. 02877v1, 2016. 2716–2721. 47. Brenner M, Hawes N, Kelleher J, et al. Mediating between 64. Kruijff GM, Zender H, and Jensfelt P. Situated dialogue and qualitative and quantitative representations for task- spatial organization: what, where and why. IJARS 2007; 4: orientated human-robot interaction. In: Proceedings of the 20th international joint conference on Artificial intelligence, 65. Oliveira JL, Ince G, Nakamura K, et al. An active audition IJCAI 2007, Hyderabad, India, 6–12 January 2007 pp. framework for auditory-driven HRI: application to interactive 2072–2077. San Francisco, CA, USA: Morgan Kaufmann robot dancing. In: IEEE RO-MAN, 2012, pp. 1078–1085. Publishers Inc. 66. Shimizu N and Haas AR. Learning to follow navigational 48. Cantrell R, Schermerhorn P, and Scheutz M. Learning actions route instructions. In: IJCAI 2009, pp. 1488–1493. from human-robot dialogues. In: IEEE RO-MAN, 2011, pp. 67. Gemignani G, Veloso M, and Nardi D. Language-based sen- 125–130. sing descriptors for robot object grounding. In: Robot Soccer 49. Jayawardena C, Watanabe K, and Lzumi K. Posture control World Cup, 2015, pp. 3–15. of robot manipulators with fuzzy voice commands using a 68. Dongcai L, Shiqi Z, Peter S, et al. Leveraging commonsense fuzzy coach–player system. Adv Robot 2007; 21: 293–328. reasoning and multimodal perception for robot spoken dialog 50. Zhang J and Knoll A. A two-arm situated artificial commu- systems. In: IEEE IROS, 2017, pp. 3855–3863. nicator for human–robot cooperative assembly. IEEE Trans 69. Allen J, Duong Q, and Thompson C. Natural language service Ind Electron 2003; 50: 651–658. for controlling robots and other agents. In: IEEE KIMAS 51. Misra DK, Sung J, and Lee K. Tell me dave: context-sensitive 2005, pp. 592–595. grounding of natural language to manipulation instructions. 70. Bos J. Applying automated deduction to natural language Int J Robot Res 2016; 35: 281–300. understanding. J Appl Logic 2009; 7: 100–112. 52. Fong T, Nourbakhsh I, Kunz C, et al. The peer-to-peer 71. Takano W. Learning motion primitives and annotative texts human-robot interaction project. In: AAAI Space Forum, from crowd-sourcing. Robomech J 2015; 2: 1–9. Long Beach, California, August 2005, pp. 6750. 72. Salvi G, Montesano L, and Bernardino A. Language boot- 53. Rybski PE, Stolarz J, Yoon K, et al. Using dialog and human strapping: learning word meanings from perception–action observations to dictate tasks to a learning robot assistant. Intel association. IEEE Trans Syst Man Cybern B 2011; 42: Serv Robot 2008; 1: 159–167. 660–671. 54. Walter M, Hemachandra S, Homberg B, et al. Learning 73. Liu R and Zhang X. Generating machine-executable plans semantic maps from natural language descriptions. In: RSS, from end-user’s natural-language instructions. Knowl Based Syst 2017; 140: 15–26. 55. Bastianelli E, Croce D, Vanzo A, et al. A discriminative 74. Liu R, Zhang X, and Webb J. Context-specific intention approach to grounded spoken language understanding in awareness through web query in robotic caregiving. In: IEEE interactive robotics. In: IJCAI, 2016, pp. 9–15. ICRA, 2015, pp. 1962–1967. 16 International Journal of Advanced Robotic Systems 75. Sattar J and Dudek G. Towards quantitative modeling of task 93. Johnson SHF. What’s so special about human tool use? Neu- confirmations in human-robot dialog. In: IEEE ICRA, 2011, ron 2003; 39: 201–204. pp. 1957–1963. 94. Gray J and Breazeal C. Manipulating mental states through 76. Liu R and Zhang X. Context-specific grounding of web nat- physical action. Int J Soc Robot 2014; 6: 315–327. ural descriptions to human-centered situations. Knowl Based 95. Tenorth M. Knowledge processing for autonomous robots. Dissertation. Universita¨t Mu¨nchen, 2011. Syst 2016; 111: 1–16. 77. Dindo H and Zambuto D. A probabilistic approach to learn- 96. Hemachandra S, Walter M, and Tellex S. Learning spatial- semantic representations from natural language descriptions ing a visually grounded language model through human-robot interaction. In: IEEE/RSJ IROS, 2010, pp. 790–796. and scene classifications. In: IEEE ICRA, 2015, pp. 78. Oates JT. Grounding knowledge in sensors: unsupervised 2623–2630. learning for language and planning. 2001. 97. Bastianelli E, Croce D, Basili R, et al. Using semantic models 79. Krunic V, Salvi G, and Bernardino A. Affordance based for robust natural language human robot interaction. In: AIIA, word-to-meaning association. In: IEEE ICRA, 2009, pp. 2015 pp. 343–356. 4138–4143. 98. Nuchter A and Hertzberg J. Towards semantic maps for 80. Deits R, Tellex S, and Thaker P. Clarifying commands with mobile robots. Robot Auton Syst 2008; 56: 915–926. information-theoretic human-robot dialog. JHRI 2012; 1: 99. Steels L, and Baillie J. Shared grounding of event descrip- 78–95. tions by autonomous robots. Robot Auton Syst 2003; 43: 81. Matuszek C, Fitzgerald N, Zettlemoyer L, et al. A joint model 163–173. 100. Tellex S, Thaker P, Joseph J, et al. Learning perceptually of language and perception for grounded attribute learning. arXiv preprint arXiv:1206.6423. 2012. grounded word meanings from unaligned parallel data. 82. Bustamante C, Garrido L, and Soto R. Fuzzy naive Bayesian Mach Learn 2014; 94: 151–167. classification in RoboSoccer 3D: a hybrid approach to deci- 101. Spexard T, Li S, Wrede B, et al. BIRON. Where are you? sion making. In: Robot Soccer World Cup, 2006; pp. Enabling a robot to learn new places in a real home envi- ronment by integrating spoken dialog and visual localiza- 507–515. 83. Tahboub K. Intelligent human-machine interaction based on tion. In: IEEE/RSJ IROS, 2006, pp. 934–940. dynamic Bayesian networks probabilistic intention recogni- 102. Mason M and Lopes M. Robot self-initiative and persona- tion. J Intell Robot Syst 2006; 45: 31–52. lization by learning through repeated interactions. In: ACM/ 84. Burger B, Ferrane I, Lerasle F, et al. Two-handed gesture IEEE HRI, 2011, pp. 433–440. recognition and fusion with speech to command a robot. 103. Chen X, Xie J, Ji J, et al. Toward open knowledge enabling for human-robot interaction. JHRI 2012; 1: 100–117. Auton Robot 2012; 32: 129–147. 104. Thomas BJ, and Jenkins OC. RoboFrameNet: verb-centric 85. Doshi F and Roy N. Spoken language interaction with model uncertainty: an adaptive human–robot interaction system. semantics for actions in robot middleware. In: IEEE ICRA, Connect Sci 2008; 20: 299–318. 2012, pp. 4750–4755. 86. Takano W and Nakamura Y. Bigram-based natural language 105. Knepper RA, Tellex S, Li A, et al. Recovering from failure by asking for help. Auton Robot 2012; 39: 347–362. model and statistical motion symbol model for scalable lan- 106. Scioni E, Borghesan G, Bruyninckx H, et al. Bridging the guage of humanoid robots. In: IEEE ICRA, 2012, pp. gap between discrete symbolic planning and optimization- 1232–1237. 87. Rossi S, Leone E, Fiore M, et al. An extensible architecture based robot control. In: 2015 IEEE international conference for robust multimodal human-robot communication. In: on robotics and automation (ICRA), 26–30 May 2015, pp. 5075–5081. USA: IEEE. IEEE/RSJ IROS, 2013, pp. 2208–2213. 107. Toris R, Kent D, and Chernova S. Unsupervised learning of 88. Bos J and Oka T. A spoken language interface with a mobile robot. Artif Life Robot 2017; 11: 42–47. multi-hypothesized pick-and-place task templates via 89. Jain D, Mosenlechner L, and Beetz M. Equipping robot con- crowdsourcing. In: IEEE ICRA, 2015, pp. 4504–4510. trol programs with first-order probabilistic reasoning capabil- 108. Tenorth M, Klank U, and Pangercic D. Web-enabled robots. ities. In: IEEE ICRA, 2009, pp. 3626–3631. IEEE Robot Autom Mag 2011; 18: 58–68. 90. Dzifcak J, Scheutz M, and Baral C. What to do and how to do 109. Akshat A, Swaminathan G, Vasu S, et al. Community reg- ularization of visually-grounded dialog. 2018. arXiv:1808. it: translating natural language directives into temporal and dynamic logic representation for goal management and action 04359. execution. In: IEEE ICRA, 2009, pp. 4163–4168. 110. Weston J, Ratle F, and Collobert R. Deep learning via semi- 91. Finucane C and Jing G. LTLMoP: experimenting with lan- supervised embedding. In: Proceedings of the 25th interna- guage. Temporal logic and robot control. In: IEEE/RSJ IROS, tional conference on Machine learning (ICML), pp. 639–655, 2008. 2010, pp. 1988–1993. 111. Sutskever I, Vinyals O, and Le QV. Sequence to sequence 92. Cheng Y, Jia Y, Fang R, et al. Modelling and analysis of natural language controlled robotic systems. IFAC Proc Vol learning with neural networks. Adv Neural Inf Process Syst 2014; 47: 11767–11772. 2014; 4: 3104–3112. Liu and Zhang 17 112. Bahdanau D, Cho K, and Bengio Y. Neural machine 119. Jacobs T and Virk GS. ISO 13482—The new safety standard translation by jointly learning to align and translate. In: for personal care robots, ISR/Robotik 2014. In: 41st inter- International conference on learning representation national symposium on robotics, Munich, Germany, 2014, (ICLR),2015. pp. 1–6. 113. Ling W, Trancoso I, Dyer C, et al. Character-based neural 120. Singh S, Payne SR, and Jennings PA. Toward a methodol- machine translation. In: ICLR, 2016. ogy for assessing electric vehicle exterior sounds. IEEE 114. Samadi M, Kollar T, and Veloso M. Using the web to inter- Trans Intell Transp Syst 2014; 15(4): 1790–1800. actively learn to find objects. In: Proceedings of the 26th 121. Han M, Lin C, and Song K. Robotic emotional expression AAAI conference on artificial intelligence, 2012. generation based on mood transition and personality model. 115. Gordon G and Breazeal C. Bayesian active learning-based IEEE Trans Cybern 2013; 43(4) 1290–1303. robot tutor for children’s word-reading skills. In: Proceedings 122. Li S, Wrede B, and Sagerer G. A dialog system for com- of the 29th AAAI conference on artificial intelligence,Austin, parative user studies on robot verbal behavior. In: IEEE Texas, 25–30 January 2015. pp. 1343–1349. international symposium on robot and human interactive 116. Zeng Q, Sun S, Duan H, et al. Cross-organizational colla- communication, 6–8 September 2006, pp. 129–134. Hat- borative workflow mining from a multi-source log. Decis field, UK: IEEE. Support Syst 2013; 54: 1280–1301. 123. Guiochet J, Martin-Guillerez D, and Powell D. Experience 117. Liu R, Zhang X, and Zhang H. Web-video-mining- with model-based user-centered risk assessment for service supported workflow modeling for laparoscopic surgeries. robots. In: IEEE international symposium on high- Artif Intell Med 2016; 74: 9–20. assurance systems engineering (HASE), 3–4 November 118. https://www.iso.org/standard/53820.html. 2010, pp. 104–113, CA, USA: IEEE. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png International Journal of Advanced Robotic Systems SAGE

A review of methodologies for natural-language-facilitated human–robot cooperation:

Loading next page...
 
/lp/sage/a-review-of-methodologies-for-natural-language-facilitated-human-robot-StgXQJdn4B

References (134)

Publisher
SAGE
Copyright
Copyright © 2022 by SAGE Publications Ltd, unless otherwise noted. Manuscript content on this site is licensed under Creative Commons Licenses.
ISSN
1729-8814
eISSN
1729-8814
DOI
10.1177/1729881419851402
Publisher site
See Article on Publisher Site

Abstract

Natural-language-facilitated human–robot cooperation refers to using natural language to facilitate interactive information sharing and task executions with a common goal constraint between robots and humans. Recently, natural-language- facilitated human–robot cooperation research has received increasing attention. Typical natural-language-facilitated human–robot cooperation scenarios include robotic daily assistance, robotic health caregiving, intelligent manufactur- ing, autonomous navigation, and robot social accompany. However, a thorough review, which can reveal latest meth- odologies of using natural language to facilitate human–robot cooperation, is missing. In this review, we comprehensively investigated natural-language-facilitated human–robot cooperation methodologies, by summarizing natural-language- facilitated human–robot cooperation research as three aspects (natural language instruction understanding, natural language-based execution plan generation, knowledge-world mapping). We also made in-depth analysis on theoretical methods, applications, and model advantages and disadvantages. Based on our paper review and perspective, future directions of natural-language-facilitated human–robot cooperation research were discussed. Keywords Natural language, human–robot cooperation, NL instruction understanding, NL-based execution plan generation, knowledge-world mapping Date received: 22 December 2017; accepted: 23 April 2019 Topic: AI in Robotics; Human Robot/Machine Interaction Topic Editor: Chrystopher L Nehaniv Associate Editor: Hagen Lehmann of assembly sequence. By giving robots NL instructions, Introduction such as “drill holes, then clean surface, last install screws,” Attracted by the naturalness of natural language (NL) com- a human’s high-level plan was combined with robots’ low- munications among humans, intelligent robots start to level executions, such as “grasping drillers and brush,” and understand NL to develop intuitive human–robot coopera- “motion planning in drilling, cleaning, and installing,” and 1,2 tion in various tasks. Natural-language-facilitated finally conducting natural cooperation. human–robot cooperation (NLC) has received increasing attention in human-involved robotics research over the recent decade. By using NL, human intelligence at high- Robotics Institute (RI), Carnegie Mellon University, Pittsburgh, PA, USA level task planning and robot physical capability—such as Intelligent Robotics and Systems Lab (IRSL), Colorado School of Mines, 3 4 2 Golden, CO, USA force, precision, and speed —at low-level task execu- 5,6 tions are combined to perform intuitive cooperation. For Corresponding author: example, in furniture assembly, it is challenging to perform Xiaoli Zhang, Intelligent Robotics and Systems Lab (IRSL), Colorado natural cooperation, for that a human has limited precision School of Mines, Golden, CO 80401, USA. and speed in hold driller, while a robot lacks understanding Email: xlzhang@mines.edu Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/ open-access-at-sage). 2 International Journal of Advanced Robotic Systems Figure 1. Promising areas using NLC. (a) daily robotic assistance using NL. A robot categorized daily objects with human NL instructions. (b) Autonomous manufacturing using NL. An industrial robot welded parts under human’s oral instructions. (c) Robotic navigation using NL. A quadcopter navigated in indoor Figure 2. The annual amount of NLC-related publications since environments with human’s oral guidance. (d) Social accompany. the year 2000 according to our paper review. In the past 18 years, A pet dog is playing balls with a human with socialized verbal the number of NLC publications are steadily increasing and communications. NLC: natural-language-facilitated human–robot reaching a history-high level in current time, revealing that NLC cooperation; NL: natural language. research is encouraged by other researches such as robotics and NLP. NLC: natural-language-facilitated human–robot coopera- Currently, typical manners in human–robot cooperation tion; NL: natural language. include tactile indications, such as contact location, force 8 9 strength, and visual indications, such as body pose, and papers were retrieved from Google Scholar, then with motion. Compared with these methods, using NL to con- a focus of NL-facilitated human–robot cooperation, duct an intuitive NLC has several advantages. First, NL about 570 papers were related. The publication trend makes human–robot cooperation natural. For traditional is shown in Figure 2, where the increasing significance methods mentioned above, humans need to be trained to use of NLC is reflected by steadily increasing publication certain actions/poses for making themselves understandable numbers. 11,12 by a robot. While in NLC, even nonexpert users without Compared with existing review papers about human– prior training can use verbal conversations to instruct robot cooperation using communication manners such as 28,29 30 31 robots. Second, NL transfers human commands efficiently. gesture and pose, action and motion, and tactile, The traditional communication methods using visual/motion a review paper about human–robot cooperation using indications require the design of informative patterns, “‘lift NL communication is lacking. Therefore, given the hand’ means ‘stop’, ‘horizontal hand movements’ means huge potentials of facilitating human–robot cooperation ‘follow,’” for delivering human commands. Existing lan- and increasingly received attention in NLC, in this guages, such as “English, Chinese, and German,” already review paper, we aim to summarize the state-of-the-art have standard linguistic structures, which contain abun- NLC methodologies in wide-range domains, revealing dant informative expressions to serve as patterns. current research progress and signposting future NLC NL-based methods do not need to design specific informa- research. Our novelty is that we summarized the NLC tive patterns for various NL commands, making human– research as three aspects: NL instruction understanding, robot cooperation efficient. Lastly, since NL instructions NL-based execution plan generation, and knowledge- are delivered orally instead of being physically involved, world mapping. Each aspect was comprehensively human hands are set free to perform more important execu- analyzed with research progress, method advantages, and limitations. The organization of this article is shown tions. Typical areas using NLC are shown in Figure 1. Advancements of NLP support an accurate understand- in Figure 3. ing of the task in NLC. Advancement of a robot’s physical capability support increasingly improved task execution in Framework of NLC realization NLC. With supporting technique from both natural lan- guage processing (NLP) and robot execution, NLC has Realization of NLC is challenging due to the following been developed from low-cognition-level symbol matching aspects. First, human NL is abstract and ambiguous. It is control, such as using “yes/no” to control robotic arms, to hard to understand humans accurately during task assign- high-cognition-level task understanding, such as identify- ments, impeding natural communications between a robot ing a plan from the description “go straight and turn left at and a human. Second, NL-instructed plans are implicit. It is the second cross.” difficult to reason appropriate execution plans from human NLC research is regularly published in international NL instructions for effective human–robot cooperation. 20 21 22 23 journals, such as IJRR, TRO, AI, ]and KBS, and Third, NL-instructed knowledge is information- 24 25 international conferences such as ICRA, IROS, and incomplete and real-world inconsistent. It is difficult to AAAI. By using keywords “‘NLP, human, robot, coop- map enough theoretical knowledge into the real world for eration, speech, dialog, natural language,” about 1400 supporting successful NLC. To solve these problems for Liu and Zhang 3 Figure 4. Typical literal models for NL instruction understanding. (a) House et al. is a grammar model. The robotic arm’s motion was controlled by predefined vowels, such as “aw, ee, ch,” in Figure 3. Organization of this review paper. This review sys- human speech. (b) Dominey et al. is an association model. NL tematically summarized methodologies for using NL to facilitate expressions, such as “OpenLeft,” was interpreted as specific parameter “open left hand for 1 DOF” for robotic arms. NL: human–robot cooperation. Three main researches are intro- duced as NL instruction understanding, NL-based execution plan natural language. DOF: degrees of freedom. generation, and knowledge-world mapping. In each research, typical models, application scenarios, model comparison, and implicitly extracted indicated by humans. The difference open problems are summarized. between them, however, is the information source. The literal models only extract information from human NL effective and natural NLC, mainly three types of research instructions, while the interpreted models will also extract have been done. information from human’s surrounding environment. With literal models, the robot understands tasks merely by fol- NL instruction understanding: To accurately under- lowing human NL instructions, while with interpreted mod- stand assignments during NLC, the research of NL els, robots understand tasks by critically thinking about instruction understanding has been done to build cooperation-related practical environment conditions, semantic models for extracting cooperation-related becoming situation aware. knowledge from human NL instructions. From the model construction perspective, to analyze NL-based execution plan generation: To reason a meanings of human NL instructions in NLC, literal models robot’s execution plans from human NL instruc- mainly use literal linguistic features, such as words, Part- tions, the research of NL-based execution plan gen- of-Speech (PoS) tags, word dependencies, word references, eration has been done to create various reasoning and sentence syntax structures, shown in Figure 4; inter- mechanisms for identifying human requests and for- preted models mainly use interpreted linguistic features, mulize robot execution strategies. such as temporal and spatial relations, object categories, Knowledge-world mapping: To map NL-instructed object physical properties, object functional roles, action theoretical knowledge to real-world situations for usages, and task execution methods, as shown in Figure practical cooperation, the research of knowledge- 5. Literal linguistic features were directly extracted from world mapping research has been done to recom- human NL instructions, while interpreted linguistic fea- mend the missing knowledge and correct the tures were indirectly inferred from common sense based real-world inconsistent knowledge for realizing on NL expressions. NLC in various real-world environment. Literal models NL instruction understanding With regard to involvement manners of literal linguistic NL instruction understanding enables a robot to receive features, literal models are categorized into the following human-assigned tasks, identify human-preferred execution types. (1) Grammar model: Literal linguistic feature pat- procedures, and understand the surrounding environment terns such as “action þ destination” are manually defined. from abstract and ambiguous human NL instructions dur- (2) Association model: Literal linguistic features are ing NLC. By improving the robot’s understanding toward mutually associated with commonsense knowledge. the human, the accuracy and naturalness during NLC are improved. To intuitively understand human NL expres- Grammar models. To initially identify key cooperation- sions with an environment awareness, two types of seman- related information, such as goal, tool usage, and action tic analysis models were developed: literal models and sequences, from human NL instructions, grammar patterns interpreted models. For both literal models and interpreted are defined to build grammar models. Grammar patterns models, cooperation-related information is explicitly or refer to keyword combinations, PoS tag combinations, and 4 International Journal of Advanced Robotic Systems relations ; general terms such as “beverage” are specified to “juice” according to cooperation types and task–object probabilistic relations. With this probabilistic association model, the uncer- tainty in NL expressions was modeled, disambiguating NL instructions and improving a robot’s adaptation toward different human users with various NL expressions. Another typical association model is an empirical associa- tion model. High-level abstract literal linguistic features, such as ambiguous words and uncertain NL phrases, are empirically specified by low-level detailed literal features such as action usage, sensor values, and tool usages. The rationale is that general knowledge could be recommended for disambiguating ambiguous NL instructions in specific situations. Compared with probabilistic association mod- Figure 5. A typical interpreted model for NL instruction els, which use objective probabilistic calculation, empirical understanding. Robot memory, real-world states and human NL association models use subjective empirical association. instructions were integrated to instruct robot executions. NL: Typical usages include the following types. natural language. By defining sensor value ranges as ambiguous NL 35,36 keyword-PoS tag combinations. By using these gram- descriptions, such as “slowly, frequently, heavy,” mar models, robot behaviors will be triggered by the gram- ambiguous execution-related NL expressions were mars mentioned in human NL instructions. Some grammar quantitatively interpreted, making ambiguous NL 41,47 patterns explored execution logics. For example, verbs and expressions sensor-perceivable. nouns were combined to describe a type of actions such as By integrating key aspects, such as execution pre- 37–39 V(go) þ NN(Hallway) and V(grasp) þ NN(cup). conditions, action sequences, human preferences, Some grammar patterns explored temporal relations, such tool usages, and location information, into abstract as the if–then relation “if door open, then turn right” and the NL expressions—such as “drill a hole”—human 40,41 step 1 to step 2 relation “go—grasp.” Some grammar instructed high-level plans were specified into patterns explored spatial relations, such as the IN relation detailed robot-executable plans—such as “clean the 4,36,39,42,48 “cup IN room” and the CloseTo relation “cup CloseTo surface,” or “install a screw”. 42,43 plate.” The rationale of the grammar model is that By using discrete fuzzy statuses—such as “close, sentences with a similar meaning have similar syntax struc- far, cold, warm”—to divide continuous sensor data tures. Similarity of NL meanings was calculated by evalu- ranges, unlimited objective sensor values were ating the syntax structure similarity. “translated” into limited subjective human feelings, such as “close to the robot, day is hot,” supporting a 49,50 Association models. To understand abstract and implicit NL human-centered task understanding. execution commands during cooperation, association models By combining human factors, such as “human’s were developed by associating different literal linguistic fea- visual scope,” with linguistic features, such as a key- tures together to extract new semantic meanings. Essentially, word “wrench” in human NL instructions, empirical the association model exploited existed knowledge by creat- association model became environmental-context- ing high-level abstract knowledge from low-level detailed sensitive, making a robot to understand a human knowledge. One typical association model is a probabilistic NL instructions such as “deliver him a wrench” from association model. Informative literal linguistic features in the human perspective “human desired wrench is 51–53 NL instructions were correlated with other informative key- actually the human-visible wrench.” The words by using probability likelihoods computed from human advantage of using association models in NLC is communications. Typical works are as follows. that the robot cognition level is improved by means of mutual knowledge compensation. With this asso- Learning from previous human execution experi- ciation model, a robot can explore unfamiliar envir- ences: Cooperation-needed actions are inferred onments by exploiting its existing knowledge. based on mentioned tasks, locations, and their prob- abilistic associations. Learning from daily common sense: Quantitative Interpreted models dynamic spatial relations such as “away from, between, ... ” have been associated with its corre- Human requests are usually situated, which means human sponding NL expressions based on their probabilistic NL expressions are with default environmental Liu and Zhang 5 preconditions, such as “cup is dirty, a driller is missing,  By exploring multimodality-information sources, robot is far from a human.” Human NL instructions are rich information can be extracted for an accurate closely correlated with situation-related information, such NL instruction understanding. Information in one modality can be compensated by as human tactile indication (tactile modality), human hand/ information learned from other modalities for better body pose (vision modality and motion dynamics modal- NL disambiguation. ity), and environmental conditions (environment sensor Consistency of multimodality information enables modality). mutual confirmations among knowledge from mul- To accurately understand human NL instructions, tiple modalities. A reliable NL command under- interpreted models are developed to integrate informa- standing could be conducted. tion from multimodalities, instead of merely from NL modality. The rationale behind interpreted models is that Supported by these advantages, multimodality models a human is dependent with their surrounding environ- have the potential to understand complex plans and various ment and better understanding of human needs to be users and to perform practical NL instruction understand- environmentally context aware. With multimodality ing in real-world NLC situations. models, information from different modalities related to human, robot, and their surrounding environment was 54–56 aligned to establish semantic corrections. Using NL Model comparison instructions and human-related features to understand Literal models, which use basic linguistic features directly human NL instructions, typical features beyond linguis- from human NL instructions, are shallow literature-level tic features considered in single-modality models also understanding. Interpreted models, which use multimodal- include the following: individual identity detected by ity features interpreted from human NL instructions, are radio-frequency identification (RFID) sensor, touch comprehensive connotation-level understanding. Each of events detected by tactile sensors, facial expressions them has unique advantages, therefore suitable for different (joy, sad), hand poses detected by computer vision application scenarios. For literal models, they are good at systems, and human head orientations detected by scenarios with simple procedures and clear work assign- motion tracking systems. Supported by rich informa- ments, such as robot arm control and robot pose control. tion from multimodality information, typical problems For interpreted models, they are good at scenarios with tackled for NLC include complex-instruction under- involvements of daily common sense, human cognitive 62 61 standing, human-like cooperation, human social logics, rich domain information, such as object physical behavior understanding, and mimicking. For multi- property-assisted object searching, intuitive machine- modality models using environment and robot-related executable plan generation, as well as vision–verbal– features to understand human NL instructions in NLC, motion-supported object delivery. From literal models to typical features also include the following: spatial interpreted models, robots have been more closely inte- object-robot relations indicated by human hand direc- grated with humans both physically and mentally. This tions, temporal robot-speech-and-head-orientation integration enables a robot to accurately understand both dependencies measured by computer vision systems, human requests and practical environments, improving the 63,64 object visual cues detected by cameras, robot sen- effectiveness and naturalness of NLC. sorimotorbehaviors monitoredby bothmotionsystems and computer vision systems. Supportedbyrichinfor- Open problems mation from these features, typical problems tackled in NLC include real-time communication, context-sensitive Although robots using grammar models have an initial cooperation (sensor-speech alignment), machine- capability of understanding human NL instructions during executable task plan generation, and implicit human cooperation, the drawback is that feature correlations request interpretation. Typical algorithms used for con- needed for understanding have been exhaustively listed. structing multimodality models include hidden Markov It is difficult to summarize all the likely encountered gram- model (HMM) for modeling hidden probabilistic rela- mar rules. Compared with grammar models, association 63,65 tions among interpreted linguistic features, Bayesian models give more cooperation-related knowledge to a robot network (BN) for modeling probabilistic transitions by exploiting associations among literal features. Even 66–68 among task-execution steps, and first-order logic for though the association model could interpret abstract lin- modeling semantic constraints among interpreted lin- guistic features into detailed execution plan, it still suffers 69,70 guistic features. These algorithms integrate different from incorrect association problems. These open problems modalities with appropriate contribution distributions are decreasing NL instruction understanding accuracy and and extract contributive feature patterns among modal- further decreasing robot adaptability. ities. Multimodality models have three potential advan- Although interpreted models are capable of comprehen- tages in understanding human NL instructions. sively understanding human NL instructions by 6 International Journal of Advanced Robotic Systems Table 1. Summary of NL instruction understanding methods. Literal models Grammar Association Interpreted models Knowledge Linguistic structures Meaningful concepts Semantic correlations format Algorithms First-order logic Ontology tree Typical classification algorithms (NB, SVM), first-order logic User Low Low High adaptability Tackled Initially understand logic relations, Specify abstract executions Complex task instruction understanding, problems temporal and spatial relations in into machine-executable human-like human–robot cooperation, execution processes executions context-sensitive cooperation Advantages Performance is good and steady in Model human cognitive Rich cooperation-related information is trained situations process, scaling up robot involved. information is more reliable. knowledge Disadvantages Exhaustive listing of NL instructions, Lacking standards for concept Difficult to combine different-modality time-consuming and labor- interpretation and features, difficult to extract important NL intensive interpretation evaluation features 35,36,40,41,43 44,45,47,48,50 63,65,66,67,70 Typical references NL: natural language. considering practical environment conditions, it is difficult logic relations are defined. Regarding reasoning mechan- to combine different types of modalities such as motion, isms, generation models have three main types. speech, and visual cues with an appropriate manner to Probabilistic models: To enable robots with cooper- reveal practical contribution distributions for different ation associative capability, in which a likely plan is modalities. Second, it is difficult to extract contributive inferred, and appropriate tools and actions are rec- features for describing both distinctive and common ommended, probabilistic models were developed aspects of one modality in understanding NL instructions. based on probabilistic dependencies, shown in Fig- Third, the overfitting problem still exists when using multi- ure 6. modality information to understand NL instructions. NL Logic models: To enable robots with logical reason- instruction understanding based on different modalities ing capability, in which internal logics among exe- could be mutually conflicting, thereby preventing the prac- cution procedures are followed, logic models were tical implementation of multimodality models. Model developed based on ontology and first-order logics, details are presented in Table 1. shown in Figure 7. Cognitive models: To enable robots with cognitive thinking capability, in which plans are intuitively NL-based execution plan generation made and adjusted, cognitive models were devel- With task knowledge extracted in NL instruction under- oped based on weighted logics, shown in Figure 8. standing, it is critical to use the task knowledge to plan robot executions in NLC. Models for NL-based execution plan generation (“generation model” for short) are devel- Probabilistic model oped for formulizing robot execution plans, theoretically supporting robots to cooperate with humans in appropriate Joint-probabilistic BN methods. To enable a robot with manners. In these models, previously learned piecemeal cooperation planning based on various observations, knowledge is organized with different algorithm structures. joint-probabilistic BN methods are developed. By using Different algorithms enable the models with different coop- single-joint probability p(x, y), a robot could use the prob- eration manners under various human–robot cooperation abilistic association p between a human NL instruction y, scenarios. For example, dynamic models supported by such as “move,” and one execution parameter x, such as HMM enable real-time NL understanding and execution, object “ball,” to plan simple cooperation such as object while static models supported by naive Bayesian (NB) placement “move ball.” Typical joint-probability asso- enable spatial human–robot relation exploration. During a ciations in NLC include activity-object associations, such plan generation, correlations among NLC-related knowl- as “drink-cup,” activity-environment associations such as 74 75 edge, such as execution steps, step transitions, and actions, “drink—hot day,” and action-sensor associations. Dur- tools, or locations—as well as their temporal, spatial, and ing the generation of cooperation strategies, a single joint- Liu and Zhang 7 Figure 8. In the cognitive model, human’s cognitive process in decision-making was simulated by execution logics with different influence weights, based on which important logics with larger weights could be emphasized and trivial logics with smaller weights could be ignored. With this soft logic manner, the flexible cooperation between a human and a robot could be conducted. probabilistic BN association is used as independent evi- dence to describe one semantic aspect of a task. For using Figure 6. Typical probabilistic models. (a) Takano’s is an HMM multiple joint-probabilistic associations (,), interpreted model, in which NLC task’s potential execution sequences are linguistic features of NLC task are collected from various modeled by hidden Markov statuses. (b) Salvi et al.’s is a naive NL descriptions and sensor data, describing relatively com- Bayesian model, in which observations “object size, object shape” and their conditional correlations such as “size-big, shape- plex plans. Typical methods using multiple joint- ball, ... ” are combined to form joint-probability correlations such probability associations include Viterbi algorithm, NB 74 76 as “object-size-shape, ... .” NLC: natural-language-facilitated algorithm, and Markov random field. With these algo- human–robot cooperation; NL: natural language. rithms, the most complete plan described in human NL instructions are selected as a human-desired plan. With multi-joint-probabilistic BN models, tackled problems are as follows. Modeling plans by extracting linguistic features, 76,77 such as NL instruction patterns ; Enriching cooperation details by aligning multiple types of sensor data, such as speech meaning, task execution statuses, and robot or human motion status ; Making flexible plans by specifying verbally- described tasks with appropriate execution details, 78,79 such as execution actions and effects ; Intuitively cooperating with a human by integrating current NL descriptions with previous execution experiences ; Accurate tool searching by associating theoretical knowledge, such as tool identities with practical real-world evidences, such as tools’ colors and pla- cement locations. One common characteristic of probabilistic models, Figure 7. In the study by Dantam and Stilman, hard logic rela- such as NB, is that dependencies among task features are tions, such as “move ¼ (grasp, place), ... ,” were defined to simplified to be fully or partially independent. In control robot motion in playing chess with a human. 8 International Journal of Advanced Robotic Systems practical situations, when a set of observations are made, decomposed into sequential logic formulas by satisfying evidence, such as speech, object, context, and action which specific NLC task could be accomplished. In a logic involved in cooperation, is usually not mutually indepen- model, logics are equally important without contribution dent. As for task plan representation, this simplification differences toward execution success. Logic relations, brings both negative effects, such as undermining the plan including tool usages, action sequences, and locations, are representation accuracy, and positive effects, such as pre- defined in the structure. Typical tackled problems include venting overfitting problems in plan-representation pro- the following: cess. The common problem of multi-joint-probabilistic Autonomous robot navigation by using logic navi- BN models is that temporal associations are ignored, lim- gation sequences, such as going to a location iting the implementations of real-time NLC. “hallway” then going to a new location “rest 66,70 DBN methods. To enable a temporal knowledge association room.” for real-time cooperation planning, dynamic Bayesian net-  Environment uncertainty modeling by summarizing work (DBN) was developed. With DBN, temporal depen- potential executions, such as “ground atoms (Boo- dencies p(| ) were propagated among NLC-related lean random variables) eats (Dominik, Cereals), uses requests and object usages . Given that the final format (Dominik, Bowl), eats (Michael, Cereals) and uses of DBN is the joint probabilistic form p(y,, , ), DBN is (Michael, Bowl).” still a joint-probabilistic model. A widely used DBN algo-  Robot action control by defining action-usage logics rithm in NLC is HMM algorithm, which uses a Markov such as “move (grasp piece(location, grip), place 66,90–92 chain assumption to explore the hidden influence of previ- piece(location, ungrip)).” ous task-related features on the current NLC status. The  Autonomous failure analysis by looking up first- rationale of HMM in NLC is that human-desired execu- order logic representations to detect the missing 4,92 tions, such as going to a position, grasping a tool, and knowledge, such as “tool brush, action: sweep.” lifting a robot hand, are decided by the previous coopera-  NL-based robot programming by using the grammar tion ( ), such as action sequence, and current cooperation. language, such as point(object, arm-side), lookA- These statuses include environmental conditions, task exe- t(object), and rotate(rot-dir, arm-side). cution progress, and human NL instructions, as well as The drawback of logic models in modeling NLC tasks is working statuses for the human and robot. HMM uses both that logic relations defined in the model are hard con- observation probabilities (absolute probability p(x)) and straints. If one logic formula was violated in practical exe- transitions abilities (conditional probability p(Y/X)) for cution processes, the whole logic structure would be modeling associations P(x, y) among NLC-related knowl- 71,84 inapplicable, and the task execution would fail. This draw- edge. With HMMs, tackled problems mainly include back limits models’ implementation scopes and reduces a real-time task assignments, dynamic human-centered 71,86 robot’s environment adaptability. Moreover, hard con- cooperation adjustment, accurate tool delivering by straints were defined indifferently, ignoring the relative simultaneously fusing multi-view data such as NL instruc- importance of executions. The execution flexibility is tion, shoulder coordinates, shoulders-elbows’ 3-D angle 84,87 undermined due to critical executions not being focused data, and hand poses. Limited by Markov assumptions, and trivial executions not being ignored when the NLC plan HMM is only capable of modeling shallow-level hidden modifications are necessary. correlations among NLC-related knowledge. Moreover, given that hidden statuses need to be explored for HMM modeling, a large amount of training data is needed, limit- Cognitive model ing HMM implementations in unstructured scenarios with 93 94 limited training data availability. Neural science research and psychology research proved that a cognitive human planning is not a sensorimo- tor transforming, instead a goal-based cognitive thinking. Logic model This reasoning is reflected on that cognitive thinking of To support a robot with rational logical reasoning of coop- cooperation is not relying on specific objects and specific eration strategies, rather than merely conducting exhaust- executions, instead it is merely relying on goal realization. ing probabilistic inferences from various NL-indicated Based on this theory, another generation model category is evidences, logic models were developed. Logic models summarized as a cognitive model. Human-like robot cog- teach robots using unviolated logic formulas to describe nitive planning in NLC is reflected in flexibly changing complex execution procedures which include multiple execution plans (different procedures), adjusting execution actions and statuses. Unviolated logics usually are first- orders (same procedures, different orders), removing some order logic formulas, such as “in possible worlds a kitchen less-important execution steps (same procedures, less is a region (8w8x(kitchen(w, x) ! region(w, x))).” The steps), and adding more critical executions procedures rationale behind logic models in NLC is that an NLC task is (similar procedures, similar orders). Liu and Zhang 9 Cognitive models. To develop human-like robot cognitive selection. For the logic model, it is good for scenarios with planning for robust NLC, cognitive models are developed either poor evidence or multiple objective goals, such as by using soft logic, which is defined by both logic formulas assembly planning and cup grasping planning. For the cog- and their weighted importance. A typical cognitive model nitive model, it is good for rich or poor evidence and mul- is Markov logic network (MLN) model. MLN represents tiple subjective goals, such as human emotion-guided NLC task in a way such as “0.3Drill(1) ^ 0.3TransitionFea- social interaction, and human preference-based object sible(1, 2) ^ 0.3Clean(2) ) 0.9Task(1, 2),” imitating the assembly. human cognition process in task planning. In this model, single execution steps and step transitions Open problems were defined by logic formulas, which could be grounded Probabilistic models lack explorations of indirect human into different logic formulas by substituting real-world con- cognitive processes in NLC, limiting naturalness of ditions. With this cognitive model, a flexible execution robotic executions. Logic models are inflexible and incap- plan can be generated by omitting non-contributing and able of simulating a human’s intuitive planning in real- weak-contributing logic formulas and involving strong- world environments. The cognitive model is close to a contributing logic formulas. Different from hard con- human’s cognitive process in simulating flexible straints in logic models, constraints (logic formula) in MLN decision-making processes. However, cognitive models are soft. These soft constraints mean when human NL are still suffering from two types of shortcomings. One instructions are partially obeyed by a robot, the task could shortcoming is that cognitive process simulation is still still be successfully executed. Typical tackled problems not a cognitive process because the fundamental theory include using MLN to generate a flexible machine- of cognitive process modeling is lacking insufficient sup- executable plan from human NL instructions for autono- port for a human-like task execution. The second prob- mous industrial task execution, NL-based cooperation in lem is the difficulties of cognitive model learning. uncertain environments by using MLN to meet constraints Different individuals have different cognitive processes, from both robots’ knowledge availability (human-NL- thus making it difficult to learn a general reasoning model. instructed knowledge) and real-world’s knowledge require- 89,95 Model details are presented in Table 2. ments (practical situation conditions). The advantage of using cognitive models in NLC is that soft logic is rel- atively like a human’s cognitive process reflected in human Knowledge-world mapping NL instructions during cooperation. It helps a robot with intuitive cooperation in unfamiliar situations by modifying, With understanding of NL language and execution plans, it or replacing, and executing plan details, such as tool or is critical for a robot to use this knowledge in practical action usages, improving robots’ cognition levels and cooperation scenarios. Knowledge-world mapping meth- enhancing its environment adaptability. The major draw- ods are developed to enable intuitive human–robot cooper- back is that MLN is still different from human cognitive ation in real-world situations. The general process of processes to consider logic conditions at a deep level to knowledge-world mapping is shown in Figure 9. Consid- enable plan modification, new plan making, and failure ering the different implementation problems, knowledge- analysis. Logic parameters for analyzing real-world condi- world mapping methods include two main types: tions are still insufficient to imitate logic relations in the theoretical knowledge grounding and knowledge gap fill- human mind, thereby limiting robots’ performances in ing. Theoretical knowledge grounding methods accurately adapting to users and environments. mapped learned knowledge items, such as objects, spatial/ temporal logic relations, into corresponding objects and rela- tions in real-world scenarios. Gap filling methods detect and Model comparison recommend both the missing knowledge, which is needed in Usually the probabilistic model is conducted in an end-to- real-world situations, but has not been covered by theoretical end manner, which directly reasons cooperation strategies execution plans, as well as real-world inconsistent knowl- from observations, ignoring internal correlations among edge, which is provided by a human, but could not find execution procedures. A logic model uses a step-by-step corresponding things in practical real-world scenarios. manner, with which ontology correlations and temporal or spatial correlations among execution procedures are Theoretical knowledge grounding explored, enabling process reasoning for intuitive planning. The cognitive model also uses a step-by-step model. To accurately map theoretical knowledge to practical Including logic correlations, the cognitive model also things, knowledge grounding methods are developed. In explores relative influences of execution procedures, these methods, a knowledge item is defined by properties, enabling a flexibly plan adjustment. For the probabilistic such as visual properties “object color and shape” captured model, it is good for scenarios with rich evidence and single by RGB cameras, motion properties “action speed” cap- objective goal, such as tool delivery and navigation path tured by motion tracking systems and execution properties 10 International Journal of Advanced Robotic Systems Table 2. Method summary of NL-based execution plan generation. Probabilistic model Joint-probabilistic BN Dynamic Bayesian network Logic model Cognitive model Knowledge Joint probabilistic Conditional probabilistic Logic formulas Logic formulas, their format correlations correlations weighted influences Algorithms Joint BN, NB, MRF Conditional BN, Viterbi First-order Logic, Ontology MLN, fuzzy logic Algorithm, HMM Tree User Moderate Moderate Low High adaptability Tackled Modeling meaning Meaning disambiguation, Autonomous robot Support a flexible problems distributions on NL entity-sensor data mapping, navigation, environment machine-executable instructions, aligning human-attended object uncertainty modeling, plan implementation, multi-view sensor data, identification, real-time autonomous execution task execution in action and tool uncertainty assessment failure diagnosis, NL-based unknown recommendation robot programming environments Advantages Good at representing a Good at distinguishing plans Strong logic correlations Flexible task plan, human complete plan among execution steps cognitive process imitating, strong environment/user adaptability Disadvantages Weak capability in modeling Weak capability in Inflexible task execution, Parameters are difficult the mutual distinctiveness representing a complete weak environment to learn, the current among tasks. task; rely on large amount adaptation soft logic is still far of training data from a human cognitive process 72,75,77,78,79 83–87 66,70,88,90,91 73,89,95 Typical references NL: natural language; NB: naı¨ve Bayesian; BN: Bayesian network; HMM: hidden Markov model; MRF: Markov random field; MLN: Markov logic network. “tool usage and location” captured by RFID. Different from  Executing NL-instructed motion plans, such as “pick the direct symbol mapping method, which has an element- up the tire pallet” by focusing on realizing actions “drive, insert, raise, drive, set.” mapping manner, the general property mapping method has Identifying human-desired cooperation places, such a structural mapping manner. The rationale behind these as “lounge, lab, conference room,” by checking methods is that a knowledge item can be successfully spatial-semantic distributions of landmarks, such grounded into the real world by mapping its properties. The as “hallway, gym, ... ” properties were collected by using methods such as “semantic similarity measurement,” which can establish With mapping methods, knowledge could be mapped correlations between an object and their corresponding into real world in a flexible manner, in which only parts properties. One typical mapping method by using general of properties need to be mapped for grounding a theoretical 96,98 property mapping is semantic map. Theoretical indoor item into a real-world thing. This manner could improve a entities such as rooms and objects are identified by mean- robot’s adaptability toward users and environments. The ingful real-world properties, such as location, color, point limitation is that these mapping methods still use predefini- cloud, spatial relation “parallel,” neighbor entities, con- tions to give a robot knowledge, reducing the intuitiveness structing a semantic map with both objective locations and of human–robot cooperation. semantic interpretations “wall, ceiling, wall, floor.” For detecting visual properties in the real world, RGBD cameras are usually used. For spatial relations, it is detected by laser Knowledge gap filling sensors and motion tracking systems. By identifying these properties in the real world, an indoor entity is identified, A theoretical execution plan defines an ideal real-world enabling an accurate robotic navigation in real-world NLC. situation. Given unpredicted aspects in a practical situation, Other typical mapping method also include the following. even if all defined knowledge has been accurately mapped into the real world, it is still challenging to ensure the Object searching by using NL instructions (detected success of NLC by providing all knowledge needed in a by microphones) as well as visual properties such as practical situation. Especially in real-world situations, object color, size, and shape (detected by motion human users and environment conditions vary, causing the tracking systems and cameras). occurrences of knowledge gaps, which are knowledge Liu and Zhang 11 Hierarchical knowledge structure checking, which detects knowledge gaps by checking real-world- available knowledge from top-level goals to low- level NLC execution parameters defined in a hier- 41,103 archical knowledge structure. Knowledge-applicability assessment, which detects knowledge gaps by checking the similarities between theoretical scenarios and real-world 34,103 scenarios. Performance-triggered knowledge gap estimation, which detects knowledge gaps by considering the 105,106 final execution performances. Hierarchical knowledge structure checking has the ratio- nale that if desired knowledge defined in a knowledge structure is missing in real-world situations, then knowl- edge gaps exist. Knowledge applicability assessment has a rationale that if the NLC situation is not similar with the previously trained situations, then knowledge gaps exist. Performance-triggered knowledge gap estimation has a rationale that if the final NLC performances of a robot is not acceptable, then knowledge gaps exist. In this detection stage, execution plan provides reasoning mechanisms. While real world provides practical things such as objects, locations, human identities, and relations such as spatial relations and temporal relations, which are detected by perceiving systems. The second step of gap filling is gap filling. Gap filling methods mainly include the following. Figure 9. Typical methods for theoretical knowledge grounding. In (a), Takano and Nakamura predefined motions, such as Using existing alternative knowledge such as “avoiding, bowing, carring, ... ,” were directly associated with their corresponding symbolic words. In (b), Hemachandra “brush” in the robot knowledge base to replace inap- et al.’s special features, such as “kitchen location, lab propriate knowledge such as “vacuum cleaner” in 4,106 locations, .. . ,” were considered to identify human-desired paths. NLC tasks such as “clean a surface” ; Using general commonsense knowledge “drilling action needs driller” in a robot database to satisfy required by real-world situations but are missing from a the need for a specific type of knowledge such as robot’s knowledge database. “tool for drilling a hole in the install a screw To ensure the success of a robot’s execution, knowledge 106,103 task” ; gap filling methods are developed to fill in these knowledge Asking knowledge input from human users by gaps. There are three main types of knowledge gaps: (1) proactively asking questions such as “where is the environment gaps, which are constraints such as tool avail- 105,107,108 table leg” ; ability and space or location limitations imposed by unfa- Autonomously learning from the Internet for recog- miliar environments ; nizing human daily intentions, such as “drink water, (2) robot gaps, which are constraints such as a robot’s 107,108 wash dishware.” physical structure strength, capable actions, and operation precision ; (3) user gaps, which are missing information In gap filling stage, execution plan describes the needed caused by abstract, ambiguous, or incomplete human NL knowledge items. Real world provides practical objects as 104,105 instructions. Filling up these knowledge gaps well as robot performance monitoring. enhances robot capability in adapting dynamic environ- ments and various tasks or users. Knowledge gap filling Model comparison is challenging in that it is difficult to make a robot aware of its knowledge shortage in specific situations, and it is Knowledge grounding model and knowledge gap filling difficult to make a robot understand how missing knowl- model are two critical steps for a successful mapping edge should be compensated for successful task executions. between NL-instructed theoretical knowledge and real- The first step of gap filling is gap detection. Gap detec- world cooperation situations. For knowledge grounding tion methods mainly include the following. models, the objective is strictly mapping NL-instructed 12 International Journal of Advanced Robotic Systems Table 3. Summary of knowledge-world mapping methods. Knowledge gap filling Hierarchical knowledge Performance-triggered knowledge Theoretical knowledge grounding structure checking gap estimation Knowledge Real-world objects, spatial/temporal/logic Real-world objects, spatial/ Real-world objects, spatial/ format correlations temporal/logic temporal/logic correlations correlations Algorithms Typical classification algorithms First-order logic, ontology Typical classification algorithms tree User Low Middle Middle adaptability Tackled Indoor routine identifying, accurate object Detecting gaps among Detecting gaps among robots, problems searching, scene understanding robots, users, and robots users, and robots Advantages Flexible knowledge usage, improving robots’ Improving smoothness of Improving the success rate of task environment adaptability. task executions executions Disadvantages Predefined manner limits robots’ environment Difficult to decide which Difficult to decide which gaps lead adaptability, execution naturalness and knowledge can miss to the failure of task executions intuitiveness 96,99–101 80,103,106 103,107,108 Typical references objects and logic relations into real-world conditions. It is a availability, location, and distances to a robot or human was necessary step for all the NLC application scenarios, such ignored, it is difficult for a robot to infer which object a as human-like action learning, indoor, and outdoor coop- human user needs. erative navigation. For knowledge gap filling models, the For knowledge gap filling, when a robot queries knowl- objective is to detect and repair missing or incorrect knowl- edge from either a human or open knowledge databases edge in human NL instructions. It is only necessary when such as openCYC, the scalability is limited. For a spe- human NL instruction cannot ensure successful NLC under cific user or a specific open knowledge database, available given real-world conditions. Typical scenarios include contents are insufficient to satisfy general knowledge needs daily assistance such as serving drink, where information in various NLC executions. The time and labor cost are such as correct types of “drink,” “vessel,” and default high, further limiting knowledge supports for NLC. Model places for drink delivery is missing; cooperative surface details are presented in Table 3. processing where execution procedures are incorrect and tools are missing. Discussion DL for better command understanding Open problems Nowadays, NLP is undergoing a deep learning (DL) revo- A typical problem of theoretical knowledge grounding is lution to create sophisticated models for semantic analysis. the non-executable-instruction problem. Human NL- Potential benefits of using advanced DL models for NLC instructed knowledge is usually ambiguous that NL- include the following. mentioned objects are too ambiguous to be identified in real world; abstract that high-level cooperation strategies  The word embedding methods, add semantic corre- are difficult to be interpreted into low-level execution lations, such as “cats and dogs are animals,” into details; information-incomplete that important cooperation irrelevant words, such as “cat, dog.” In future information such as tool usages, action selections, and NLC research, embedding methods could be used working locations are partially ignored; real-world incon- to introduce extra task-specific meanings, such as sistent that human NL-instructed knowledge is not avail- general common sense “drilling needs the tool able in real world. These non-executable problems limit driller” and safety rules, “stay away from hot surface practical executions of human NL-instructed plans. One and sharp tools,” endowing robots with better com- type of cause of non-executable-instruction problems mand understanding with awareness of environment include intrinsic NL characteristics, such as omitting, refer- limitations and human requirements. ring, and simplifying, as well as human speaking habits,  Sequence-to-sequence language model, such as such as different sentence organizations and phrase usages. long- and short-term memory, sequentially out- Another type of cause is the lack of environment under- putted meaning based on the continuously inputted 109,111 standing. For example, if object-related information such as text. In future NLC research, sequence-to- Liu and Zhang 13 sequence models will enable robots to follow real- safety and psychological comfortableness. Given the time instructions for timely executions and modifi- unique aspects of NLC systems in communication and cations, by aligning temporal verbal instructions interaction, the safety considerations are typically three with task-related knowledge, such as “action types, summarized as follows. First, verbally instructed sequence and location assignments.” actions that endanger human being’s safety should be Attention-based NL understanding models, such as rejected or be carefully assessed by the robotic systems. “Recurrent neural networks (RNN) encoder– This requires NLC systems to have safety-related common decoder,” emphasize relatively important words by sense for safety issue understanding and prediction. Sec- increasing the weights of the important expressions. ond, the verbal abuse from either a human to a robot or For example, in translating sentence “that does not from a robot to a human should be avoided, because verbal mean that we want to bring an end to subsidization,” abuse will make a human feel uncomfortable psychologi- keyword “subsidization” which with key information cally and finally influence the performance of NLC sys- 113 121,122 is emphasized. In future NLC research, attention tems. The psychology comfortableness of NLC models could help a robot to focus on human-desired systems requires the research of human behaviors and executions by analyzing verbal attention. human–machine trust. Third, it is necessary to enforce rigid risk assessment and controlled safety verification before releasing the NLC systems into the market, to make sure NLC systems are safe and helpful without causing safety Cost reduction for knowledge learning issues for humans. Cost reduction in knowledge collection is critical for intui- tive NLC. On the one hand, to understand human NL instructions, represent tasks, or fill in knowledge gaps, a Conclusion large scale of reliable knowledge is needed. On the other This article reviewed state-of-the-art methodologies for hand, time, economic cost, and labor investments need to realizing NLC. With in-depth analysis of application sce- be reduced. To solve this problem, two trends in developing narios, method rationales, method formulizations, and cur- knowledge-scaling-up methods appear recently: existing- rent challenges, and research of using NL to push forward knowledge exploitation and new-knowledge exploration. the limits of human–robot cooperation were summarized In existing-knowledge exploitation, existing knowledge is from a high-level perspective. This review article mainly interpreted and extracted into general knowledge, thereby categorized a typical NLC process into three steps: NL increasing knowledge interchangeability. Knowledge for instruction understanding, NL-based execution plan gener- specific situations, such as “use cup and spoon for prepar- ation, and knowledge-world mapping. With these three ing coffee,” could be used for general situations, such as steps, a robot can communicate with a human, reason about “preparing drink.” In new-knowledge exploration, new human NL instruction, and practically provide human- knowledge is collected by proactively asking human and desired cooperation according to human NL instructions. autonomously retrieving from the Word Wide Web, 115 116 117 books, operation logs, and videos. Declaration of conflicting interests The author(s) declared no potential conflicts of interest with NLC system personalization respect to the research, authorship, and/or publication of this article. When a robot cooperates with a specific human for a long time, the personalization of a robot becomes critical. For Funding personalization, it does not only mean defining individua- The author(s) received no financial support for the research, lized knowledge for a robot to adapt to a specific user, but it authorship, and/or publication of this article. also means designing a knowledge-individualization method, for a robot to autonomously adapt to variable ORCID iD users. Therefore, a future research in NLC would be Rui Liu https://orcid.org/0000-0002-8932-1147 developing knowledge-personalization methods to con- Xiaoli Zhang https://orcid.org/0000-0002-2949-4644 sider both execution preferences and social norms, support- ing a long-term NLC personalization. References 1. Winograd T. Understanding natural language. Cogn Psychol Safety consideration in NLC system design 1972; 3: 1–91. When robots are deployed in human-involved environ- 2. Baraglia J, Cakmak M, Nagai Y, et al. Initiative in robot 118,119 ments, it is necessary to follow the safety standards assistance during collaborative task execution. In: ACM/ to minimize and detail safety hazards before actual imple- IEEE HRI, New Zealand, March 2016, pp. 67–74. mentations. The objective of enforcing the safety standards 3. Tellex S A, Kollar T, Dickerson S, et al. Understanding nat- in NLC system design is for protecting humans’ individual ural language commands for robotic navigation and mobile 14 International Journal of Advanced Robotic Systems manipulation. In: AAAI, San Francisco, California, 7–11 20. Hollerbach J. The international journal of robotics research. August 2011, pp. 1507–1514. SAGE. [Online]. http://journals.sagepub.com/home/ijr. 4. Liu R, Webb J, and Zhang X. Natural-language-instructed (accessed 27 January 2017). industrial task execution. In: ASME IDETC/CIE,North 21. Park F. IEEE Xplore: IEEE transactions on robotics [Online]. Carolina, USA, 21–24 August 2016, pp. v01bt02a043– http://ieeexplore.ieee.org/xpl/RecentIssue.jsp? v01bt02a04. punumber¼8860. (accessed 27 January 2017). 5. Tenorth M, Nyga D, and Beetz M. Understanding and execut- 22. Dechter R. Artificial intelligence. [Online]. https://www.jour ing instructions for everyday manipulation tasks from the nals.elsevier.com/artificial-intelligence/. (accessed 27 Janu- world wide web. In: IEEE ICRA, 2010, pp. 1486–1491. ary 2017). 6. Hemachandra S, Duvallet F, and Howard TM. Learning mod- 23. Fujita H and Lu J. Knowledge-based systems. [Online]. els for following natural language directions in unknown https://www.journals.elsevier.com/knowledge-based-sys environments. In: IEEE ICRA, 2015, pp. 5608–5615. tems. (accessed 27 January 2017). 7. Iwata H and Sugano S. Human-robot-contact-state identifica- 24. ICRA2015. in ICRA2015. [Online]. http://ieeexplore.ieee. tion based on tactile recognition. IEEE Trans Ind Electron org/xpl/mostRecentIssue.jsp? punumber¼71 28761. 2005; 52(6): 1468–1477. (accessed 27 January 2017). 8. Kruger J and Surdilovic D. Hand force adjustment: robust 25. IROS2015.in IROS2015. [Online]. https://www.ieee.org/con control of force-coupled human–robot-interaction in assem- ferences_events/conferences/conferenc edetails/index.html? bly processes. CIRP Ann Manuf Technol 2008; 57(1): 41–44. Conf_ID¼33365. (accessed 27 January 2017). 9. Kim S, Jung J, Kavuri S, et al. Intention estimation and rec- 26. AAAI15: Twenty-Ninth conference on artificial intelligence ommendation system based on attention sharing. In: Interna- (AAAI15), 1995. [Online]. http://www.aaai.org/Conferences/ tional conference on neural information processing, 2013, AAAI/aaai15.php. (accessed 27 January 2017). pp. 395–402. 27. GoogleScholar. [Online]. https://scholar.google.com. 10. Liu R and Zhang X. Understanding human behaviors with an 28. Argall BD, Chernova S, and Veloso M. A survey of robot object functional role perspective for robotics. IEEE Trans learning from demonstration. Robot Auton Syst 2009; 57: Cogn Develop Syst 2016; 8(2): 115–127. 469–483. 11. Takano W and Nakamura Y. Action database for categorizing 29. Rautaray SS and Agrawal A. Vision based hand gesture rec- and inferring human poses from video sequences. Robot ognition for human computer interaction: a survey. Artif Auton Syst 2015; 70: 116–125. Intell Rev 2015; 43: 1–54. 12. Raman V, Lignos C, Finucane C, et al. Sorry Dave, I’m afraid 30. Bethel C, Salomon K, and Murphy RR. Survey of psycho- I can’t do that: explaining unachievable robot tasks using physiology measurements applied to human-robot interac- natural language. In: Robotics: science and systems, Berlin, tion. In: IEEE RO-MAN, 2007, pp. 732–737. Germany, 24–28 June 2013. 31. Argall BD and Billard AG. A survey of tactile human–robot 13. Matuszek C, Herbst E, Zettlemoyer L, et al.. Learning to interactions. Robot Auton Syst 2010; 58: 1159–1176. parse natural language commands to a robot control system. 32. House B, Malkin, and Bilmes J. The VoiceBot: a voice con- In: Desai J, Dudek G, Khatib O, and Kumar V (eds) Experi- trolled robot arm. In: ACM SIGCHI, 2009, pp. 183–192. mental robotics. Springer tracts in advanced robotics, Vol. 33. Dominey PF, Mallet A, and Yoshida E. Progress in program- 88. Heidelberg: Springer, 2016, pp. 403–415. ming the hrp-2 humanoid using spoken language. In: IEEE 14. Waldherr S, Romero R, and Thrun S. A gesture based inter- ICRA, 2007, pp. 2169–2174. face for human-robot interaction. Auton Robot 2000; 9(2): 34. Ovchinnikova E and Wachter M. Multi-purpose natural lan- 151–173. guage understanding linked to sensorimotor experience in 15. Hunston S and Francis G. Pattern grammar: a corpus-driven humanoid robots. In: IEEE-RAS Humanoids, 2015, pp. approach to the lexical grammar of English. Comput Linguist 365–372. 2000; 27(2): 318–320. 35. Lee KW, Kim HR, Yoon WC, et al. Designing a human-robot 16. Hsiao K, Vosoughi S, Tellex S, et al. Object schemas for interaction framework for home service robot. In: IEEE RO- responsive robotic language use. In: Proceedings of the third MAN, 2005, pp. 286–293. ACM/IEEE international conference on Human robot inter- 36. Lee S, Kim C, Lee J, et al. Affective effects of speech- action, Amsterdam, The Netherlands, 12–15 March 2008, pp. enabled robots for language learning. In: IEEE SLT, 2010, 233–240. ACM: New York, USA. pp. 145–150. 17. Guerin K, Lea C, and Paxton C. A framework for end-user 37. Motallebipour H and Bering A. A spoken dialogue system to instruction of a robot assistant for manufacturing. In: IEEE control robots. 2002. ICRA, Seattle, May 2015, pp. 6167–6174. 38. Dantam N and Stilman M. The motion grammar: analysis of a 18. Steels L and Kaplan F. AIBO’s first words: the social learning linguistic method for robot control. IEEE Trans Robot 2013; of language and meaning. Evol Commun 2000; 4: 3–32. 29: 704–718. 19. Huang A, Tellex S, and Bachrach A. Natural language com- 39. Bicho E, Louro L, and Erlhagen W. Integrating verbal and mand of an autonomous micro-air vehicle. In: IEEE/RSJ nonverbal communication in a dynamic neural field IROS, 2010, pp. 2663–2669. Liu and Zhang 15 architecture for human–robot interaction. Front Neurorobot 56. Jesse T, Shiqi Z, Raymond M, et al. Learning to interpret natural language commands through human-robot dialog. 2010; 4: 1–13. In: IJCAI, 2015. 40. Mcguire P, Fritsch J, Steil J, et al. Multi-modal human- 57. Foster ME, By T, and Richert M, et al. Human-robot dialogue machine communication for instructing robot grasping tasks. for joint construction tasks. In: ICMI Proceedings of the In: IEEE/RSJ IROS, Vol. 2, 2005, pp. 1082–1088. eighth international conference on multimodal interface, 41. Zender H, Jensfelt P, Mozos O, et al. An Integrated robotic Banff, Alberta, Canada, 2–04 November 2006, pp. 68–71. system for spatial understanding and situated interaction in ACM: USA, New York. indoor environments. In: AAAI, 2007, pp. 1584–1589. 58. Bischoff R and Graefe V. Dependable multimodal commu- 42. Guadarrama S, Riano L, and Golland D. Grounding spatial nication and interaction with robotic assistants. In: IEEE RO- relations for human-robot interaction. In: IEEE/RSJ IROS, MAN, 2002, pp. 300–305. 2013, pp. 1640–1647. 59. Breazeal C and Aryananda L. Recognition of affective com- 43. Scheutz M, Cantrell R, and Schermerhorn P. Toward human- municative intent in robot-directed speech. Auton Robot like task-based dialogue processing for human robot interac- 2002; 12: 83–104. tion. AI Mag 2011; 32: 77–84. 60. Connell J, Marcheret, Pankanti S, et al. An extensible language 44. Kollar T, Perera T, and Nardi D. Learning environmental interface for robot manipulation. In: ICAGI, 2012, pp. 21–30. knowledge from task-based human-robot dialog. In: IEEE 61. Stiefelhagen R, Fugen C, and Gieselmann R. Natural human- ICRA, 2013, pp. 4304–4309. robot interaction using speech. head pose and gestures. In: 45. Fasola J and Mataric M. Using semantic fields to model IEEE/RSJ IROS, 2004, pp. 2422–2427. dynamic spatial relations in a robot architecture for natural 62. Ghidary S, Nakata Y, and Saito H. Multi-modal human robot language instruction of service robots. In: IEEE/RSJ IROS, interaction for map generation. In: IEEE/RSJ IROS, 2001, pp. 2013, pp. 143–150. 2246–2251. 46. Lu D, Wu F, and Chen X. Understanding user instructions by 63. Levinson S, Zhu W, Li D, et al. Automatic language acqui- utilizing open knowledge for service robots. arXiv: 1606. sition by an autonomous robot. In: IJCNN, Vol. 4, 2003, pp. 02877v1, 2016. 2716–2721. 47. Brenner M, Hawes N, Kelleher J, et al. Mediating between 64. Kruijff GM, Zender H, and Jensfelt P. Situated dialogue and qualitative and quantitative representations for task- spatial organization: what, where and why. IJARS 2007; 4: orientated human-robot interaction. In: Proceedings of the 20th international joint conference on Artificial intelligence, 65. Oliveira JL, Ince G, Nakamura K, et al. An active audition IJCAI 2007, Hyderabad, India, 6–12 January 2007 pp. framework for auditory-driven HRI: application to interactive 2072–2077. San Francisco, CA, USA: Morgan Kaufmann robot dancing. In: IEEE RO-MAN, 2012, pp. 1078–1085. Publishers Inc. 66. Shimizu N and Haas AR. Learning to follow navigational 48. Cantrell R, Schermerhorn P, and Scheutz M. Learning actions route instructions. In: IJCAI 2009, pp. 1488–1493. from human-robot dialogues. In: IEEE RO-MAN, 2011, pp. 67. Gemignani G, Veloso M, and Nardi D. Language-based sen- 125–130. sing descriptors for robot object grounding. In: Robot Soccer 49. Jayawardena C, Watanabe K, and Lzumi K. Posture control World Cup, 2015, pp. 3–15. of robot manipulators with fuzzy voice commands using a 68. Dongcai L, Shiqi Z, Peter S, et al. Leveraging commonsense fuzzy coach–player system. Adv Robot 2007; 21: 293–328. reasoning and multimodal perception for robot spoken dialog 50. Zhang J and Knoll A. A two-arm situated artificial commu- systems. In: IEEE IROS, 2017, pp. 3855–3863. nicator for human–robot cooperative assembly. IEEE Trans 69. Allen J, Duong Q, and Thompson C. Natural language service Ind Electron 2003; 50: 651–658. for controlling robots and other agents. In: IEEE KIMAS 51. Misra DK, Sung J, and Lee K. Tell me dave: context-sensitive 2005, pp. 592–595. grounding of natural language to manipulation instructions. 70. Bos J. Applying automated deduction to natural language Int J Robot Res 2016; 35: 281–300. understanding. J Appl Logic 2009; 7: 100–112. 52. Fong T, Nourbakhsh I, Kunz C, et al. The peer-to-peer 71. Takano W. Learning motion primitives and annotative texts human-robot interaction project. In: AAAI Space Forum, from crowd-sourcing. Robomech J 2015; 2: 1–9. Long Beach, California, August 2005, pp. 6750. 72. Salvi G, Montesano L, and Bernardino A. Language boot- 53. Rybski PE, Stolarz J, Yoon K, et al. Using dialog and human strapping: learning word meanings from perception–action observations to dictate tasks to a learning robot assistant. Intel association. IEEE Trans Syst Man Cybern B 2011; 42: Serv Robot 2008; 1: 159–167. 660–671. 54. Walter M, Hemachandra S, Homberg B, et al. Learning 73. Liu R and Zhang X. Generating machine-executable plans semantic maps from natural language descriptions. In: RSS, from end-user’s natural-language instructions. Knowl Based Syst 2017; 140: 15–26. 55. Bastianelli E, Croce D, Vanzo A, et al. A discriminative 74. Liu R, Zhang X, and Webb J. Context-specific intention approach to grounded spoken language understanding in awareness through web query in robotic caregiving. In: IEEE interactive robotics. In: IJCAI, 2016, pp. 9–15. ICRA, 2015, pp. 1962–1967. 16 International Journal of Advanced Robotic Systems 75. Sattar J and Dudek G. Towards quantitative modeling of task 93. Johnson SHF. What’s so special about human tool use? Neu- confirmations in human-robot dialog. In: IEEE ICRA, 2011, ron 2003; 39: 201–204. pp. 1957–1963. 94. Gray J and Breazeal C. Manipulating mental states through 76. Liu R and Zhang X. Context-specific grounding of web nat- physical action. Int J Soc Robot 2014; 6: 315–327. ural descriptions to human-centered situations. Knowl Based 95. Tenorth M. Knowledge processing for autonomous robots. Dissertation. Universita¨t Mu¨nchen, 2011. Syst 2016; 111: 1–16. 77. Dindo H and Zambuto D. A probabilistic approach to learn- 96. Hemachandra S, Walter M, and Tellex S. Learning spatial- semantic representations from natural language descriptions ing a visually grounded language model through human-robot interaction. In: IEEE/RSJ IROS, 2010, pp. 790–796. and scene classifications. In: IEEE ICRA, 2015, pp. 78. Oates JT. Grounding knowledge in sensors: unsupervised 2623–2630. learning for language and planning. 2001. 97. Bastianelli E, Croce D, Basili R, et al. Using semantic models 79. Krunic V, Salvi G, and Bernardino A. Affordance based for robust natural language human robot interaction. In: AIIA, word-to-meaning association. In: IEEE ICRA, 2009, pp. 2015 pp. 343–356. 4138–4143. 98. Nuchter A and Hertzberg J. Towards semantic maps for 80. Deits R, Tellex S, and Thaker P. Clarifying commands with mobile robots. Robot Auton Syst 2008; 56: 915–926. information-theoretic human-robot dialog. JHRI 2012; 1: 99. Steels L, and Baillie J. Shared grounding of event descrip- 78–95. tions by autonomous robots. Robot Auton Syst 2003; 43: 81. Matuszek C, Fitzgerald N, Zettlemoyer L, et al. A joint model 163–173. 100. Tellex S, Thaker P, Joseph J, et al. Learning perceptually of language and perception for grounded attribute learning. arXiv preprint arXiv:1206.6423. 2012. grounded word meanings from unaligned parallel data. 82. Bustamante C, Garrido L, and Soto R. Fuzzy naive Bayesian Mach Learn 2014; 94: 151–167. classification in RoboSoccer 3D: a hybrid approach to deci- 101. Spexard T, Li S, Wrede B, et al. BIRON. Where are you? sion making. In: Robot Soccer World Cup, 2006; pp. Enabling a robot to learn new places in a real home envi- ronment by integrating spoken dialog and visual localiza- 507–515. 83. Tahboub K. Intelligent human-machine interaction based on tion. In: IEEE/RSJ IROS, 2006, pp. 934–940. dynamic Bayesian networks probabilistic intention recogni- 102. Mason M and Lopes M. Robot self-initiative and persona- tion. J Intell Robot Syst 2006; 45: 31–52. lization by learning through repeated interactions. In: ACM/ 84. Burger B, Ferrane I, Lerasle F, et al. Two-handed gesture IEEE HRI, 2011, pp. 433–440. recognition and fusion with speech to command a robot. 103. Chen X, Xie J, Ji J, et al. Toward open knowledge enabling for human-robot interaction. JHRI 2012; 1: 100–117. Auton Robot 2012; 32: 129–147. 104. Thomas BJ, and Jenkins OC. RoboFrameNet: verb-centric 85. Doshi F and Roy N. Spoken language interaction with model uncertainty: an adaptive human–robot interaction system. semantics for actions in robot middleware. In: IEEE ICRA, Connect Sci 2008; 20: 299–318. 2012, pp. 4750–4755. 86. Takano W and Nakamura Y. Bigram-based natural language 105. Knepper RA, Tellex S, Li A, et al. Recovering from failure by asking for help. Auton Robot 2012; 39: 347–362. model and statistical motion symbol model for scalable lan- 106. Scioni E, Borghesan G, Bruyninckx H, et al. Bridging the guage of humanoid robots. In: IEEE ICRA, 2012, pp. gap between discrete symbolic planning and optimization- 1232–1237. 87. Rossi S, Leone E, Fiore M, et al. An extensible architecture based robot control. In: 2015 IEEE international conference for robust multimodal human-robot communication. In: on robotics and automation (ICRA), 26–30 May 2015, pp. 5075–5081. USA: IEEE. IEEE/RSJ IROS, 2013, pp. 2208–2213. 107. Toris R, Kent D, and Chernova S. Unsupervised learning of 88. Bos J and Oka T. A spoken language interface with a mobile robot. Artif Life Robot 2017; 11: 42–47. multi-hypothesized pick-and-place task templates via 89. Jain D, Mosenlechner L, and Beetz M. Equipping robot con- crowdsourcing. In: IEEE ICRA, 2015, pp. 4504–4510. trol programs with first-order probabilistic reasoning capabil- 108. Tenorth M, Klank U, and Pangercic D. Web-enabled robots. ities. In: IEEE ICRA, 2009, pp. 3626–3631. IEEE Robot Autom Mag 2011; 18: 58–68. 90. Dzifcak J, Scheutz M, and Baral C. What to do and how to do 109. Akshat A, Swaminathan G, Vasu S, et al. Community reg- ularization of visually-grounded dialog. 2018. arXiv:1808. it: translating natural language directives into temporal and dynamic logic representation for goal management and action 04359. execution. In: IEEE ICRA, 2009, pp. 4163–4168. 110. Weston J, Ratle F, and Collobert R. Deep learning via semi- 91. Finucane C and Jing G. LTLMoP: experimenting with lan- supervised embedding. In: Proceedings of the 25th interna- guage. Temporal logic and robot control. In: IEEE/RSJ IROS, tional conference on Machine learning (ICML), pp. 639–655, 2008. 2010, pp. 1988–1993. 111. Sutskever I, Vinyals O, and Le QV. Sequence to sequence 92. Cheng Y, Jia Y, Fang R, et al. Modelling and analysis of natural language controlled robotic systems. IFAC Proc Vol learning with neural networks. Adv Neural Inf Process Syst 2014; 47: 11767–11772. 2014; 4: 3104–3112. Liu and Zhang 17 112. Bahdanau D, Cho K, and Bengio Y. Neural machine 119. Jacobs T and Virk GS. ISO 13482—The new safety standard translation by jointly learning to align and translate. In: for personal care robots, ISR/Robotik 2014. In: 41st inter- International conference on learning representation national symposium on robotics, Munich, Germany, 2014, (ICLR),2015. pp. 1–6. 113. Ling W, Trancoso I, Dyer C, et al. Character-based neural 120. Singh S, Payne SR, and Jennings PA. Toward a methodol- machine translation. In: ICLR, 2016. ogy for assessing electric vehicle exterior sounds. IEEE 114. Samadi M, Kollar T, and Veloso M. Using the web to inter- Trans Intell Transp Syst 2014; 15(4): 1790–1800. actively learn to find objects. In: Proceedings of the 26th 121. Han M, Lin C, and Song K. Robotic emotional expression AAAI conference on artificial intelligence, 2012. generation based on mood transition and personality model. 115. Gordon G and Breazeal C. Bayesian active learning-based IEEE Trans Cybern 2013; 43(4) 1290–1303. robot tutor for children’s word-reading skills. In: Proceedings 122. Li S, Wrede B, and Sagerer G. A dialog system for com- of the 29th AAAI conference on artificial intelligence,Austin, parative user studies on robot verbal behavior. In: IEEE Texas, 25–30 January 2015. pp. 1343–1349. international symposium on robot and human interactive 116. Zeng Q, Sun S, Duan H, et al. Cross-organizational colla- communication, 6–8 September 2006, pp. 129–134. Hat- borative workflow mining from a multi-source log. Decis field, UK: IEEE. Support Syst 2013; 54: 1280–1301. 123. Guiochet J, Martin-Guillerez D, and Powell D. Experience 117. Liu R, Zhang X, and Zhang H. Web-video-mining- with model-based user-centered risk assessment for service supported workflow modeling for laparoscopic surgeries. robots. In: IEEE international symposium on high- Artif Intell Med 2016; 74: 9–20. assurance systems engineering (HASE), 3–4 November 118. https://www.iso.org/standard/53820.html. 2010, pp. 104–113, CA, USA: IEEE.

Journal

International Journal of Advanced Robotic SystemsSAGE

Published: Jun 16, 2019

Keywords: Natural language; human–robot cooperation; NL instruction understanding; NL-based execution plan generation; knowledge-world mapping

There are no references for this article.