SCALING LAWS AND ARCHITECTURAL ADVANCES OF HIERARCHICAL JEPA (H-JEPA) MODEL FOR PLANNING, CONTROL AND ROBOTICS IN PHYSICAL SYSTEMS
Hierarchical Joint-Embedding Predictive Architecture (H-JEPA) is increasingly viewed as a promising family of world models for embodied intelligence because it learns to predict abstract future representations rather than reconstructing raw sensory inputs. This distinction is especially important in robotics, where successful control depends less on recovering exact pixels and more on learning compact state abstractions that are stable, semantically meaningful, and useful for planning. In this paper, we present an extended student-level study of H-JEPA from three complementary angles: architectural principles, scaling behaviour, and practical deployment for robotic planning and control. We first review the conceptual line from predictive coding and world models to JEPA, I-JEPA, V-JEPA, and recent action-conditioned variants. We then formalize a two-level H-JEPA suitable for physical systems, in which low-level predictors model short-horizon action-conditioned transitions and higher-level predictors produce temporally coarse sub-goals for long-horizon planning. Next, we analyze scaling trends with respect to encoder width, predictor depth, temporal hierarchy, and dataset size, arguing that downstream planning performance follows a weak power-law regime but saturates earlier than language-model loss scaling because control success is bottlenecked by representation utility, action coverage, and model-planner mismatch. We also describe a practical pipeline that maps raw observations to latent state estimation, hierarchical rollout, cross-entropy method planning, task-conditioned evaluation, and iterative refinement. To ground the discussion, we compare a hand-tuned model-predictive controller against an H-JEPA-driven planner on simulated reaching and pushing tasks.
The results suggest that hierarchy provides larger gains for long-horizon contact-rich behaviour than simply increasing parameter count, while the main engineering difficulties remain representation collapse, prompt or context sensitivity, latent oversmoothing, and the absence of a universally trustworthy proxy loss. In addition to quantitative comparisons, we include ablations, failure analysis, and workflow observations that highlight when hierarchical latent prediction genuinely helps and when human intervention remains indispensable. The goal of this work is not to claim state-of-the-art performance, but to provide a more detailed and structured foundation for future student projects on JEPA-style world models for robotics.
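The cross-entropy method planning step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function `cem_plan`, the stand-in dynamics `toy_predict`, the squared-distance cost, and all hyperparameters are assumptions chosen for illustration, and a learned low-level H-JEPA predictor would take the place of `toy_predict`.

```python
import numpy as np

def cem_plan(z0, z_goal, predict, horizon=10, action_dim=2,
             pop=256, n_elite=32, iters=8, seed=0):
    """Cross-entropy method over action sequences in a latent space.

    `predict(z, a)` is assumed to map a batch of latents and a batch of
    actions to the next batch of latents (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        noise = rng.standard_normal((pop, horizon, action_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        # Roll every candidate forward through the latent predictor.
        z = np.repeat(z0[None, :], pop, axis=0)
        for t in range(horizon):
            z = predict(z, actions[:, t])
        # Cost: squared distance between the final latent and the goal latent.
        cost = np.sum((z - z_goal) ** 2, axis=1)
        # Refit the Gaussian to the lowest-cost (elite) sequences.
        elite = actions[np.argsort(cost)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean

def toy_predict(z, a):
    # Hypothetical stand-in for a learned predictor: z' = z + 0.1 * a.
    return z + 0.1 * a

z0 = np.zeros(2)
z_goal = np.array([0.5, -0.3])
plan = cem_plan(z0, z_goal, toy_predict)

# Execute the planned sequence on the same toy dynamics.
z = z0.copy()
for a in plan:
    z = toy_predict(z[None, :], a[None, :])[0]
final_error = float(np.linalg.norm(z - z_goal))
```

In a hierarchical setting, the goal latent `z_goal` would plausibly be a sub-goal proposed by the higher-level predictor, and only the first action of `plan` would be executed before replanning, in the usual model-predictive fashion.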
Lal, M. (2026). Scaling Laws and Architectural Advances of Hierarchical JEPA (H-JEPA) Model for Planning, Control and Robotics in Physical Systems. International Journal of Science, Strategic Management and Technology, 02(05). https://doi.org/10.55041/ijsmt.v2i5.169