World Models & Video Reasoning
- Pandora: Towards General World Model with Natural Language Actions and Video States, 2024 | Code
- Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
- Insightful and comprehensive. The concept of a spatial-temporal scene graph (STSG) is something new to me.
- Robotic Control via Embodied Chain-of-Thought Reasoning
Video Position Encoding & Temporal/Spatial Attention
- Rotary Position Embedding for Vision Transformer, 2024
- Applies RoPE to video transformers; a different approach from spatial/temporal (S/T) attention.
- Video Transformers: A Survey, 2022
- Space or time for video classification transformers, 2023
- The concept of Space Attention and Temporal Attention is interesting.
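The divided space-time attention idea can be sketched minimally. This is my own NumPy illustration of the concept, not code from any of the papers above; `divided_space_time_attention` and the single-head `attention` helper are assumed names:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Single-head scaled dot-product self-attention over the second-to-last axis,
    # batched over all leading axes.
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def divided_space_time_attention(x):
    """x: (T, S, D) video tokens: T frames, S patches per frame, D dims."""
    # Spatial attention: patches within each frame attend to each other.
    x = attention(x)                      # batched over T
    # Temporal attention: each patch position attends across frames.
    x = attention(x.transpose(1, 0, 2))   # (S, T, D), batched over S
    return x.transpose(1, 0, 2)           # back to (T, S, D)

out = divided_space_time_attention(np.random.randn(4, 16, 8))
```

Factorizing this way reduces per-layer cost from O((T·S)^2) for joint space-time attention to roughly O(T·S^2 + S·T^2).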
Datasets
- (hdvila) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | Code, 2021
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | Code, 2024
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions, 2024
- The differential data annotation method is interesting; we want fine-grained annotations for video data.
Video Generation
Benchmark
- VBench: Comprehensive Benchmark Suite for Video Generative Models | Code, 2023 (CVPR 2024)
- Comprehensive benchmark for video generation models.
Long Context LLM
- RoFormer: Enhanced Transformer with Rotary Position Embedding, 2020
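The core RoPE operation can be sketched in a few lines (my own NumPy illustration; `apply_rope` is an assumed name, not from the paper): each consecutive pair of dimensions is rotated by an angle proportional to the token's position, so attention dot products depend only on relative position.

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate a query/key vector x (shape (d,)) by its position pos.
    Pair i of dims is rotated by angle pos * base**(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Key property: q·k after rotation depends only on relative position.
q, k = np.random.randn(8), np.random.randn(8)
a = apply_rope(q, 5) @ apply_rope(k, 3)    # relative distance -2
b = apply_rope(q, 12) @ apply_rope(k, 10)  # relative distance -2
assert np.allclose(a, b)
```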
- Base of RoPE Bounds Context Length, 2024
- Explores how the base of RoPE influences the model's context length.
- 3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding, 2024
- A creative idea: introduces a new dimension into RoPE, called a chunk, giving a kind of attention over attention.
- LongEmbed: Extending Embedding Models for Long Context Retrieval, 2024 | Code
- A comprehensive experimental study of methods for extending the context window (e.g., Parallel Context Window, NTK, Self-Extend, Grouped Position & Recurrent Position). It also introduces a benchmark called LongEmbed.
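The grouped-position idea behind Self-Extend can be sketched as a position remapping (my own simplified illustration; the function name and default parameters are assumptions, and the paper's exact formula may differ): nearby tokens keep exact relative positions, while distant tokens are floor-divided into groups so they never exceed the pretrained position range.

```python
def self_extend_rel_pos(rel, neighbor_window=512, group_size=4):
    # Relative positions within the neighbor window stay exact;
    # farther positions are compressed by floor division into groups.
    if rel < neighbor_window:
        return rel
    return neighbor_window + (rel - neighbor_window) // group_size

# Nearby tokens keep precise positions; distant ones are compressed.
assert self_extend_rel_pos(100) == 100   # inside the window: unchanged
assert self_extend_rel_pos(515) == 512   # grouped with position 512
assert self_extend_rel_pos(516) == 513   # next group
```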
- CAPE: Context-Adaptive Positional Encoding for Length Extrapolation, 2024
- Explores using neural networks to further enhance additive PE methods.
Diffusion Models
MLSys
- vLLM (PagedAttention):