Voyager – An interactive video generation model with realtime 3D reconstruction

A summary of the themes from the Hacker News discussion, with direct quotes:

Human Perception vs. Computational Models of Reality

A large part of the discussion asks whether world models need explicitly 3D inputs, as human perception arguably has, or whether a purely data-driven approach from 2D inputs is sufficient. The initial premise that human perception is strictly 2D is challenged and largely dismissed.

Human Perception is Multi-Dimensional: Users argue that human perception is not limited to 2D, citing senses beyond vision.
- "Human perception is not 2D, touch and proprioception[1] are three-dimensional senses." - AIPedant
- "And of course it really makes more sense to say human perception is 3+1-dimensional since we perceive the passage of time." - AIPedant
- "The brain does sensor fusion to build a 3d model that we perceive. We don't perceive in 2d" - 2OEH8eoCRo0
- "The inner ear is a great example! I mentioned in another comment that if you want to be reductive the sensors in the inner ear - the hairs themselves - are one dimensional, but the overall sense is directly three dimensional." - AIPedant
- "The point is that knowing where your hand is in space relative to the rest of your body is a distinct sense which is directly three-dimensional. This information is not inferred, it is measured with receptors in your joints and ligaments." - AIPedant
- "It is simply wrong to describe touch and proprioception receptors as 2D." - AIPedant
The "Bitter Lesson" and Data Richness: The discussion invokes Rich Sutton's "Bitter Lesson" (general methods that scale with data and compute tend to outperform hand-engineered structure) and asks whether feeding a model richer inputs, such as stereo imagery, conflicts with it. One side argues that higher-fidelity training data is consistent with the lesson; the other counters that requiring stereo would exclude the vast majority of existing video.
- "Increasing the fidelity and richness of training data does not go against the bitter lesson." - soulofmischief
- "You're needlessly making things harder by forcing the model to also learn to estimate depth from monocular images, and robbing it of a channel for error-correction in the case of faulty real-world data." - soulofmischief
- "Stereo images have no explicit 3D information and are just 2D sensor data. But even if you wanted to use stereo data, you would restrict yourself to stereo datasets and wouldn't be able to use 99.9% of video data out there to train on which wasn't captured in stereo, that's the part that's against the Bitter Lesson." - WithinReason
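To make the stereo point concrete: neither image of a stereo pair stores depth, but depth can be recovered from the horizontal shift (disparity) of a point between the two views. The following is a minimal sketch, assuming an idealized, rectified pinhole stereo pair. The function name and the numeric values are illustrative and do not come from the discussion or from the Voyager model.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in metres for a rectified stereo pair: Z = f * B / d.

    focal_px     -- focal length in pixels (assumed known from calibration)
    baseline_m   -- distance between the two cameras in metres
    disparity_px -- horizontal pixel shift of the same point between views
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


# Illustrative values only: a larger disparity means a closer point.
near = depth_from_disparity(focal_px=700.0, baseline_m=0.12, disparity_px=42.0)
far = depth_from_disparity(focal_px=700.0, baseline_m=0.12, disparity_px=4.2)
print(near, far)
```

The sketch is consistent with both sides of the thread: the pixels themselves are 2D, yet two calibrated views constrain depth in a way a single view does not.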
Depth Cues in Monocular Vision: The capability of 2D (monocular) vision to infer depth is also discussed, suggesting that even single-image models can capture 3D information, albeit through learned cues rather than direct measurement.
- "We can estimate distance purely from your eyes focal point." - __alexs
- "with one eye you have temporal parallax, depth cues (ordering of objects in your vision), lighting cues, relative size of objects (things further away are smaller) together with your learned comparison size etc." - yeoyeo42
- "There are a number of monocular depth cues:" - supermatt (linking to Wikipedia)
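The relative-size cue mentioned above can be written down directly. This is a rough sketch, assuming a simple pinhole camera and an object whose real height is already known (the "learned comparison size" from the quote). The function name and numbers are hypothetical.

```python
def distance_from_known_size(focal_px: float, real_height_m: float, image_height_px: float) -> float:
    """Estimate distance from apparent size under a pinhole model.

    An object of real height H that spans h pixels in the image lies at
    roughly distance = f * H / h, where f is the focal length in pixels.
    """
    if image_height_px <= 0:
        raise ValueError("object must span at least a fraction of a pixel")
    return focal_px * real_height_m / image_height_px


# Illustrative values: a 1.7 m person spanning 170 px, then 85 px.
closer = distance_from_known_size(focal_px=800.0, real_height_m=1.7, image_height_px=170.0)
farther = distance_from_known_size(focal_px=800.0, real_height_m=1.7, image_height_px=85.0)
print(closer, farther)
```

Unlike stereo, this estimate is only as good as the assumed real-world size, which is why commenters describe it as a learned cue rather than a measurement.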

The Nature and Purpose of "World Models"

The utility and definition of "world models" in AI are debated, with some questioning their ability to represent persistent, consistent environments.

Persistence and Object Identity: A key concern is whether 2D models can maintain consistent representations of the world over time and across different views.
- "2D models don't have object persistence, because they store information in the viewport. Back when OpenAI released their Sora teasers, they had some scenes where they did a 360° rotation and it produced a completely different backdrop." - imtringued
The Illusion of 3D: The discussion considers whether current 2D-based "world models" are merely creating an illusion of 3D rather than truly understanding or reconstructing the underlying 3D structure.
- "So a lot of text to "world" engines have been basically 2d, in that they create a static background and add sprites in to create the illusion of 3D." - KaiserPro

Licensing, Regulation, and Geopolitical Factors of AI Development

Much of the conversation shifts to the license of Tencent's HunyuanWorld-Voyager model, its implications for open source, and the role of regulation (particularly the EU AI Act) in shaping AI development and deployment.

Open Source vs. Weights-Available: A debate arises over whether a model with available weights but restrictive licensing qualifies as truly open source.
- "This is not open source. It is weights-available." - ambitiousslab
- "This is not open source because the license is not open source." - NitpickLawyer
- "I think at this point, open source is practically shorthand for weights available" - htrp
The EU AI Act and Jurisdictional Exclusions: The exclusion of the EU, UK, and South Korea from the model's license is a major point of discussion, with many attributing it to the complexity and risk associated with the EU AI Act.
- "The exclusion of EU, UK and South Korea suggests to me they've trained on data those countries would be mad they trained on/would demand money for training on." - vintermann
- "Or, those countries are trying to regulate AI." - heod749
- "The license used for this is quite a read. Available to the world except the European Union, the UK, and South Korea" - Ragnarork
- "It's the EU AI act. I've tried their cute little app a week ago, designed to let you know if you comply, what you need to report and so on... It was a mess when they proposed it, it was said to be better while they were working on it, turns out to be as unclear and as bureaucratic now that it's out." - NullCascade
- "They don't want to invest labor of complying with them. So, they reduced their liability by prohibiting usage of the model to show those jurisdictions' decision makers they were complying." - nickpsecurity
Data Privacy and Regulation Philosophies: The discussion broadens to the fundamental differences in approaches to data privacy and AI regulation between different blocs, with comparisons drawn between the EU's cautious stance and the more permissive approaches elsewhere.
- "Peak American thinking: megacorps and dictatorships stealing data with no respect whatsoever for privacy and not giving anything back is good. Any attempt to defend oneself from that is foolish and should be mocked. I wish you people could realize you're getting fucked over as much as the rest of us." - thrance
- "We didn't regulate adtech and now we're stuck with pervasive tracking that's hurting society and consumer privacy. Better to be more cautious with AI too so we can prevent negative societal effects rather than trying to roll them back when billions of euros are already at play, and thus the corporate lobby and interests in keeping things as they are." - wkat4242
- "I'd rather be free and my data safe than be an economic world leader." - Cthulhu_
- "This is a false dichotomy, you can have privacy and still be militarily and economically relevant. But say that you were right, and you have to choose between privacy and relevance, if you choose privacy, then once you are entirely economically dependent on Russia (Europe is still paying more in energy money to Russia than in aid to Ukraine) and China — when Europe is a vassal — it won't be able to make its own laws anymore." - flanked-evergl
Geopolitical Tensions and AI: Some users inject broader geopolitical concerns, linking AI development and regulation to the ongoing conflict in Ukraine and potential influences from Russia and China.
- "If I was Russia and/or China and I wanted to eliminate EU as a potential rival economically and militarily, then I don't think I could have come up with a better way to do it than EU regulations." - flanked-evergl
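Homogeneity of Generated Content: One commenter raises a separate worry, that AI-generated output will all look alike.
- "My concern is that it's all going to be sameish slop. Read ten AI generated stories and you've read them all." - Cthulhu_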

Technical Aspects and Capabilities of AI World Models

The discussion also delves into the technical merits and limitations of the demonstrated AI model, including its memory requirements and potential applications.

Hardware Requirements: The high GPU memory requirements are noted, leading to discussions about accessibility and cost.
- "The minimum GPU memory required is 60GB for 540p. Cool, I guess… If you have tens of thousands of $ to drop on a GPU for output that’s definitely not usable in any 3D project out-of-the-box." - SirHackalot
Future Applications (VR/Gaming): Users express excitement about the potential for these models in areas like VR and game development, envisioning dynamically generated worlds.
- "I can see that being more of an engineering problem than a research one at this point." - jimmySixDOF (referring to latency)
- "I think its a matter of time when we will have photorealistic playable computer games generated by these engines." - krystofee
Limitations and "Cheating" in Demos: Critiques are raised about the short duration of example videos and the perceived "cheating" by not demonstrating full 360-degree rotations, questioning if the models truly capture a consistent world.
- "These clips are very short and don’t rotate the camera more than like 45 degrees. Genie3 also cheats and only rotate the camera 90 degrees. It’s always important to pay attention to what models don’t do. And in this case it’s turn the bloody camera around. I refuse to accept any model to be a “world model” if it can’t pass a simple “spin in place” test." - forrestthewoods
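One way to picture the "spin in place" test is as a check that the view after a full 360° turn matches the view at the start. The sketch below is a crude illustration under that assumption, not the commenter's method or anything the model provides. Frames are plain nested lists of grayscale values in [0, 1], and the `tolerance` threshold is a hypothetical choice.

```python
from typing import Sequence

Frame = Sequence[Sequence[float]]  # grayscale image as rows of intensities in [0, 1]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two same-sized frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    if count == 0:
        raise ValueError("frames must contain at least one pixel")
    return total / count


def passes_spin_test(first: Frame, after_full_turn: Frame, tolerance: float = 0.05) -> bool:
    """True if the frame after a 360-degree turn is close to the starting frame."""
    return mean_abs_diff(first, after_full_turn) <= tolerance


start = [[0.0, 1.0], [1.0, 0.0]]
different_backdrop = [[1.0, 0.0], [0.0, 1.0]]
print(passes_spin_test(start, start), passes_spin_test(start, different_backdrop))
```

Raw pixel difference is a blunt proxy; a real evaluation would likely compare features or reconstructed geometry. The point is only that the test is cheap to state, which is why its absence from demos draws criticism.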
Comparison with Other Models and LiDAR: The model is compared to previous image-to-3D efforts and the role of LiDAR technology is discussed, with questions raised about whether AI can replace direct measurement.
- "Isn't it picture to 3D model? You'd generate the environment/model ahead of time and then "dive in" to the photo" - geokon
- "Can I use this to replace a LiDAR?" - amelius
- "Lidar is direct measurement" - ENGNR
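The "direct measurement" point can be illustrated with the basic time-of-flight relation. This is a minimal sketch, assuming a pulsed time-of-flight LiDAR (other designs, such as phase-based or FMCW units, work differently). The function name and the sample timing are illustrative.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact by definition of the metre


def lidar_range_m(round_trip_s: float) -> float:
    """Range from a pulse's round-trip time: distance = c * t / 2.

    The division by two accounts for the pulse travelling out and back.
    """
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


# A 100-nanosecond round trip corresponds to a target roughly 15 m away.
print(lidar_range_m(100e-9))
```

No learned prior enters this calculation, which is the contrast with depth inferred by a generative model from image cues.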