In a blog post today, OpenAI says they've "trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, while using only a small amount of labeled contractor data." The model can reportedly learn to craft diamond tools, "a task that usually takes proficient humans over 20 minutes (24,000 actions)," they note. From the post: In order to utilize the wealth of unlabeled video data available on the internet, we introduce a novel, yet simple, semi-supervised imitation learning method: Video PreTraining (VPT). We start by gathering a small dataset from contractors where we record not only their video, but also the actions they took, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use past and future information to guess the action at each step. This task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning. We chose to validate our method in Minecraft because it (1) is one of the most actively played video games in the world and thus has a wealth of freely available video data and (2) is open-ended with a wide variety of things to do, similar to real-world applications such as computer usage. Unlike prior works in Minecraft that use simplified action spaces aimed at easing exploration, our AI uses the much more generally applicable, though also much more difficult, native human interface: 20Hz framerate with the mouse and keyboard. Trained on 70,000 hours of IDM-labeled online video, our behavioral cloning model (the âoeVPT foundation modelâ) accomplishes tasks in Minecraft that are nearly impossible to achieve with reinforcement learning from scratch. It learns to chop down trees to collect logs, craft those logs into planks, and then craft those planks into a crafting table; this sequence takes a human proficient in Minecraft approximately 50 seconds or 1,000 consecutive game actions. Additionally, the model performs other complex skills humans often do in the game, such as swimming, hunting animals for food, and eating that food. It also learned the skill of "pillar jumping," a common behavior in Minecraft of elevating yourself by repeatedly jumping and placing a block underneath yourself. For more information, OpenAI has a paper (PDF) about the project.
Read more of this story at Slashdot.