A person goes from a standing position to a bended knee while gesturing.
His left elbow and his left knee are at right angle. From this stance, the left knee extends, and not long after, it bends. Just moments before, both hands are further down than his hips, near to the left ankle. With that pose, his right hand spreads significantly apart from the left ankle and right after, gets closer to the right foot. In the second right before, he moves upwards and in the meantime, moves towards the front. A second later, he shifts far to the left. Simultaneously, both hands spread away from the left foot. Meanwhile, his left hand spreads significantly apart from his left ankle. Shortly after, he shifts downwards and a moment later, he shifts downwards briskly. The right knee is unbent and from that pose, the right knee bends significantly.
We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. Unlike existing motion datasets that rely on broad action labels or generic captions, MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement, including expressive actions (e.g., emotions, stylistic walking) and interactions that go beyond those found in standard motion capture datasets. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models, enabling the synthesis of highly realistic and diverse human motions from text. By augmenting motion datasets with MotionScript captions, we demonstrate significant improvements in out-of-distribution motion generation, allowing large language models (LLMs) to generate motions that extend beyond existing data. Additionally, MotionScript opens new applications in animation, virtual human simulation, and robotics, providing an interpretable bridge between intuitive descriptions and motion synthesis. To the best of our knowledge, this is the first attempt to systematically translate 3D motion into structured natural language without requiring training data.
The MotionScript framework generates textual representations of human motion directly from 3D skeleton sequences. First, posecodes, a quantifiable representation of static pose attributes, are extracted. Next, temporal changes in the posecodes are analyzed using Algorithm 1 to segment dynamic motion over joints; the resulting segments are represented as motioncodes, a novel representation of movement patterns. Finally, a selection process filters out redundant motioncodes, and the remaining ones are aggregated and converted into concise, coherent natural language sentences.
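The three stages can be illustrated with a minimal sketch for a single joint angle. This is not the authors' Algorithm 1: the angle thresholds, the names `angle_posecode`, `extract_motioncodes`, and `caption`, and the fixed connector phrase are all assumptions made for illustration, and the selection step is reduced to merging consecutive changes in the same direction.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical upper bounds (degrees) for joint-angle categories.
# 0 = completely bent, 1 = right angle, 2 = slightly bent, 3 = straight.
# The actual thresholds used by MotionScript are not given in this text.
ANGLE_BOUNDS = [45.0, 100.0, 150.0]


def angle_posecode(angle_deg: float) -> int:
    """Map a joint angle to a discrete posecode category."""
    for category, upper in enumerate(ANGLE_BOUNDS):
        if angle_deg < upper:
            return category
    return len(ANGLE_BOUNDS)


@dataclass
class Motioncode:
    joint: str
    start: int  # first frame of the segment
    end: int    # last frame of the segment
    delta: int  # signed change in posecode category over the segment


def extract_motioncodes(joint: str, angles: List[float]) -> List[Motioncode]:
    """Segment a per-frame angle sequence into monotonic category changes."""
    codes = [angle_posecode(a) for a in angles]
    result: List[Motioncode] = []
    start, last_change, direction = 0, 0, 0
    for t in range(1, len(codes)):
        step = codes[t] - codes[t - 1]
        if step == 0:
            continue
        sign = 1 if step > 0 else -1
        if direction != 0 and sign != direction:
            # Direction reversed: close the previous segment.
            result.append(
                Motioncode(joint, start, last_change, codes[last_change] - codes[start])
            )
            start = t - 1
        elif direction == 0:
            start = t - 1
        direction = sign
        last_change = t
    if direction != 0:
        result.append(
            Motioncode(joint, start, last_change, codes[last_change] - codes[start])
        )
    return result


def verb_phrase(mc: Motioncode) -> str:
    """Turn one motioncode into a verb phrase (assumed wording)."""
    verb = "extends" if mc.delta > 0 else "bends"
    return verb + (" significantly" if abs(mc.delta) >= 3 else "")


def caption(motioncodes: List[Motioncode]) -> str:
    """Join the motioncodes of one joint into a single sentence."""
    if not motioncodes:
        return ""
    parts = [f"The {motioncodes[0].joint} {verb_phrase(motioncodes[0])}"]
    parts += [f"it {verb_phrase(mc)}" for mc in motioncodes[1:]]
    return ", and not long after, ".join(parts) + "."


knee_angles = [90, 120, 160, 170, 120, 80, 40]  # made-up angles in degrees
print(caption(extract_motioncodes("left knee", knee_angles)))
```

A fuller version would handle many posecode types (distances, relative positions, orientations) across all joints, attach velocity terms, and choose temporal connectors from the gaps between segments; the sketch only shows how static categories can become motion segments and then text.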
We tested our method on out-of-distribution (OOD) captions, created from miming and improv exercises, to evaluate our models' generalization ability.
Three models were tested:
The animations are generated from both short human descriptions and more detailed versions produced by prompted LLMs. The columns correspond to the model variations, while the rows show animations generated from the short and the detailed captions. For the detailed captions, T2M(MS) included MotionScript examples in the prompt, while T2M(LLM) used simple prompts to describe motion (which we do not show here).
Here, we present examples of dancing and exercise motions from the HumanML3D dataset, along with both the original human-annotated captions and those generated by MotionScript. These examples demonstrate MotionScript's ability to convert raw 3D motion sequences into meaningful and structured natural language descriptions.
The body is in a dancing action while doing a performance.
The left elbow advances from behind the right one to in front of the right one. At the same time, she moves a great distance to the right. Meanwhile, she moves a bit downwards. At the same time, the right hand moves from behind the left one to a position in front of the left one and comes significantly closer to the left knee. She moves slightly downwards speedily. Simultaneously, she shifts forward.
The person is performing while in a dancing action.
She shifts a great distance backward rapidly. The left hand lifts from below the neck ascending to above the neck. She shifts slightly downwards and a second later, she shifts to the right just a little bit. In the second right before, both knees are bent slightly and from this position, her right knee bends and speedily. Meanwhile, her left elbow is almost completely bent and from this stance, her left elbow extends.
This person is going backwards and is dancing.
The right hand moves nearer to the right foot. Immediately after, he moves a great distance backward quickly. Right after, the left hand raises from below his neck to above his neck.
The subject is performing while making a dance pose.
He moves forward and not long after, moves backward.
Someone is in a dancing action while doing a performance.
He shifts backward just a little bit.
A person is doing hand exercises while making a dance pose.
The right hand moves from the left side of the right shoulder to the right side of the right shoulder, and not long after, it moves from the right side of the right shoulder to the left side of the right shoulder. In the second right before, the right elbow is ahead of his left elbow and is partly bent, and from this stance, his right elbow extends.
The subject is lowering a body part and is making hand exercises.
The elbows are a bit bent, and from that pose, both elbows bend.