Tuesday, August 12, 2025

Talk the Walk

In an innovative synthesis of urban visual perception, discrete spatial actions, and interactive linguistic exchange, this work sets up a cooperative scenario: a guide, who holds a map but cannot see the tourist's exact coordinates, must deduce the tourist's position from landmark descriptions and reported moves, while the tourist, immersed in panoramic street views but ignorant of both the map and the goal location, navigates under verbal instruction. The environment, structured as 2D grids extracted from real New York City neighborhoods, requires integrating perceptual cues with spatial reasoning to achieve the shared objective. The central methodological contribution is the Masked Attention for Spatial Convolutions (MASC) mechanism, which translates the tourist's reported action sequences into spatial shifts on the guide's map, grounding language in movement. Empirical evaluation shows that emergent communication protocols, whether continuous or discrete, attain high localization accuracy; natural language remains more challenging but benefits from landmark-focused utterances. The dataset, exceeding 10,000 crowd-sourced dialogues and released within the ParlAI framework, provides a robust benchmark for grounded dialogue systems, embodied AI, and situated navigation research in realistic, perception-rich settings.
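To make the MASC idea concrete, the sketch below shows the core operation in isolation: a predicted 3×3 mask is applied as a local convolution over the guide's 2D belief map, so that a one-hot mask implements an exact one-cell shift corresponding to a reported move. This is a minimal illustration, not the authors' implementation; in the paper the mask is predicted by a learned network from the tourist's communicated actions and is applied over landmark embedding channels, whereas here the map is a single-channel numpy array and the mask is set by hand.

```python
import numpy as np

def masc_step(belief: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Shift a 2D belief map by sliding a 3x3 mask over it (cross-correlation).

    belief: (H, W) map of location scores on the guide's grid.
    mask:   (3, 3) weights; in MASC these are predicted from an action,
            and a one-hot mask realizes a discrete grid move.
    """
    H, W = belief.shape
    padded = np.pad(belief, 1)  # zero-pad so shifts off the edge vanish
    out = np.zeros_like(belief)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * mask)
    return out

# Hand-set masks for illustration (in MASC they would be predicted):
identity = np.zeros((3, 3)); identity[1, 1] = 1.0   # "stay" keeps the map fixed
shift_down = np.zeros((3, 3)); shift_down[0, 1] = 1.0  # moves mass one row down
```

With `identity`, the belief map is unchanged; with `shift_down`, all probability mass moves one row, mirroring how a reported "move" action deterministically relocates the guide's estimate of the tourist.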



de Vries, H., Shuster, K., Batra, D., Parikh, D., Weston, J. and Kiela, D., 2018. Talk the Walk: Navigating Grids in New York City through Grounded Dialogue. arXiv preprint arXiv:1807.03367.