This script is a powerful dataset-building helper for LoRAs, checkpoints, and more!
Provide any YouTube URL (many other sites also work!) or a local video file, and it will automatically extract the highest-available-quality screenshots, auto-filter them for aesthetics, and more!
Full details: https://github.com/EnragedAntelope/youtube-screenshot-extractor
Features:
Simple startup .bat files/scripts for Windows, macOS, and Linux!
Easy-to-use GUI front-end!
Download YouTube and other videos using yt-dlp
Process local video files
Multiple frame extraction methods:
Interval-based extraction
All frames extraction
Keyframe extraction
Scene detection
Quality assessment of frames
Blur detection
Automatic removal of black bars
Basic watermark detection
Parallel processing for faster execution
GPU acceleration (if available)
Resume interrupted extractions
Generate thumbnail montages
Customizable output options (JPG or PNG)
Detailed logging and dry-run option
Load settings from a configuration file
Post-processing filters:
Gradfun (reduce color banding, less aggressive)
Deblock (reduce compression artifacts)
Deband (reduce color banding, more aggressive)
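The blur detection feature above is commonly implemented as a variance-of-Laplacian check: sharp frames have high-variance edge responses, blurry frames do not. Here is a dependency-free sketch of that idea (the actual script likely uses OpenCV; the threshold value is an illustrative assumption, not the script's real default):

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over a grayscale image,
    given as a list of rows of pixel values (image must be at least 3x3).
    Low variance suggests a blurry frame."""
    h, w = len(gray), len(gray[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def is_blurry(gray, threshold=100.0):
    # Threshold is scene-dependent; 100 is only a common starting point.
    return laplacian_variance(gray) < threshold
```

A perfectly flat frame scores zero variance (maximally "blurry"), while high-contrast detail scores very high, which is why this check is cheap and robust enough to run on every extracted frame.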
Description
Changed PyCuda to be an optional requirement that is only imported when GPU acceleration is actually used.
This prevents errors on systems that cannot support PyCuda. Thank you for the feedback!
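The optional-dependency pattern described in this change generally looks like the following in Python. This is a sketch of the general approach, not the script's actual module layout; the function name is hypothetical:

```python
# Guarded import: PyCuda is only required if GPU acceleration is requested,
# so systems without CUDA fall back to CPU instead of crashing at startup.
try:
    import pycuda.driver as cuda  # noqa: F401  (optional dependency)
    HAS_PYCUDA = True
except ImportError:
    cuda = None
    HAS_PYCUDA = False

def pick_backend(use_gpu: bool) -> str:
    """Return which backend to run on, degrading gracefully to CPU."""
    if use_gpu and HAS_PYCUDA:
        return "gpu"
    return "cpu"
```

The key point is that the `import` sits inside a `try`/`except ImportError`, so merely installing and launching the script never touches PyCuda.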
Comments (12)
Cool, lots of options! But really consider adding post-processing filters like gradfun, deblock, deband, and some others to lessen the impact of YouTube's compression. Grabbing screenshots from anything below 2K video on YouTube will give you pretty big compression artifacts (due to bitrate + resolution). The frames won't be blurry or compositionally bad, but the artifacts still make them bad images, teaching the LoRA/checkpoint to replicate the artifacts.
Thank you for the suggestions! I will look into those for a potential future version. Do you think those sorts of filters are safe enough to set as default ON, or should they be left off and then only activated via command argument?
I'd keep them off by default (or possibly check video resolution and, when it's lower than 1080p60, turn them on by default); they do have an impact and may blur the video a bit more than someone wants. But blur is generally not as bad as artifacts (since Flux is really, sometimes overly, sharp by itself). I'd even give people the option of which ones to choose. Some videos benefit more from only deblock, others may benefit more from only gradfun.
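The resolution-based default suggested here could be sketched as a small helper that builds an ffmpeg `-vf` filter chain. `gradfun` and `deblock` are real ffmpeg filters, but the 1080p60 cutoff and the parameter values below are only the commenter's suggestion, not the script's actual defaults:

```python
def choose_filters(height: int, fps: float) -> str:
    """Return an ffmpeg -vf chain for frames from a video of the given
    height/fps. Per the heuristic above: no filtering at 1080p60 and up,
    deblock then gradfun below that. Values are illustrative only."""
    if height >= 1080 and fps >= 60:
        return ""  # high-quality source: leave frames untouched
    # deblock reduces blocking artifacts; gradfun=strength:radius debands.
    return "deblock,gradfun=1.2:16"
```

The returned string would be passed as-is to ffmpeg's `-vf` option when extracting or post-processing frames.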
@Zavy Thank you again! Will be looking into this.
In the meantime, you can take the screenshots and just batch process them in a graphic utility to minimize the artifacts.
Hi, I added your ideas for optional post-processing. Check out new version. Enjoy!
You can also pass this to a vision model: once the grabs are done and processed, you can point it at the output folder for tagging. I use CLIP/LLaVA in 4-bit, very fast. Good way to prep datasets.
Thank you. It's a great idea for sure. I use something similar (the Load Images From Directory node in Comfy) for upscaling, and also for tagging.
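The grab-then-tag workflow discussed in this thread typically ends with one sidecar .txt caption per image, the layout most LoRA trainers consume. A minimal sketch of that last step (`caption_fn` stands in for whatever vision model you run, such as a CLIP interrogator or LLaVA; it and `tag_folder` are hypothetical names, not part of the script):

```python
from pathlib import Path
from typing import Callable

def tag_folder(folder: str, caption_fn: Callable[[Path], str]) -> int:
    """Write a sidecar .txt caption next to each image in `folder`.
    Returns the number of images tagged."""
    count = 0
    for img in sorted(Path(folder).iterdir()):
        if img.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue  # skip non-image files
        img.with_suffix(".txt").write_text(caption_fn(img), encoding="utf-8")
        count += 1
    return count
```

Pointing this at the extractor's output folder with a real captioning function gives you image/caption pairs ready for training.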
@EnragedAntelope Nice one. If you're using it in Comfy, you should be able to add a fine-tuned detection model, like SegFormer, for additional guidance, especially on clothing. For people/clothing it will work very well. I didn't look too much into the best workflow for it, though; I got distracted with other things, but it's worth looking into. I was basically testing it to automate cropping objects out of images.
I spent a few hours using this recently, and I love it. Do not miss the fact that it works with multiple online sites and not just YouTube.

