    AnimateDiff Rotoscoping Workflow - v1.2

    UPDATE v1.2: I have replaced custom nodes with default Comfy nodes wherever possible. The example animation now has 100 frames to verify that the workflow can handle videos of that length.

    UPDATE v1.1: Same workflow, but now with an example that includes inputs and outputs. Hopefully this will help people get started.

    Sharing this workflow to get feedback and ideas from the community. I've uploaded the workflow JSON, the custom nodes I use, and some samples.

    I find that the key to good rotoscoping is to:

    • Choose a source video with reasonably high quality where the figure is easy to distinguish from the background and in a pose that you can clearly describe in a prompt.

    • Use models and loras that have clean, simple art styles.

    • Isolate the figure with a mask whenever possible. It is easier for the AI if the source video has a simple or empty background.

    • Use prompt travelling to change the prompt as the video changes. This is especially key for capturing details that make the animation lifelike: expressions, blinking, etc.

    Details below.

    Custom Nodes

    I built two small utility nodes to help with prompt travelling. They are designed to edit booru-style, comma-delimited tags generated by automated tagging nodes like WD14Tagger. Suggestions, improvements, and bugfixes are welcome.

    Install these nodes by placing the e7ai-animate.py file from the ZIP into your ComfyUI's custom_nodes directory.

    TagSwap

    TagSwap is designed to edit an autogenerated prompt by doing find/replace on booru tags. It takes a list of tag sets as input, along with find/replace commands in YAML format. The node has two modes: select and replace.


    Select Mode

    Select is probably the most useful mode for animation. In select mode, the node looks for each rule's tag in the input: if the tag is present, it can swap it for another tag or set of tags; if it is absent, it can inject a tag instead. Only the rule outputs are kept, so input tags that no rule matches are dropped.

    An example of select mode YAML (p is the output when the tag is present, a is the output when it is absent):

    select:
      closed_eyes:
        p: closed_eyes, eyeshadow
        a: (pink_eyes:1.1)
      open_mouth:
        p: open_mouth

    Example inputs and outputs:

    Input : 1girl, solo, brown_eyes, open_mouth
    Output: (pink_eyes:1.1), open_mouth
    Input : 1girl, solo, closed_eyes
    Output: closed_eyes, eyeshadow

    Replace Mode

    Replace mode operates the same way, but it passes through input tags that do not match any rule. Replace is useful when you want your travel prompt to closely follow the input. When using replace, it's often useful to exclude tags using the WD14Tagger node.

    An example of replace mode:

    replace:
      closed_eyes:
        p: closed_eyes, eyeshadow
        a: (pink_eyes:1.1)
      open_mouth:
        p: open_mouth

    Example inputs and outputs (you would typically exclude brown_eyes here):

    Input : 1girl, solo, brown_eyes, open_mouth
    Output: 1girl, solo, brown_eyes, (pink_eyes:1.1), open_mouth
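
    To make the two modes concrete, here is a minimal standalone Python sketch of the find/replace behaviour described above. It is only an illustration, not the actual node code, and the function name tag_swap is made up:

    import yaml  # PyYAML

    def tag_swap(tag_string, rules_yaml, mode="select"):
        # For each rule, 'p' is emitted when the rule's tag is present in the
        # input and 'a' is emitted when it is absent. Select mode keeps only
        # the rule outputs; replace mode also passes unmatched tags through.
        rules = yaml.safe_load(rules_yaml)[mode]
        tags = [t.strip() for t in tag_string.split(",") if t.strip()]

        rule_output = []
        for tag, actions in rules.items():
            if tag in tags:
                if "p" in actions:
                    rule_output.append(actions["p"])
            elif "a" in actions:
                rule_output.append(actions["a"])

        if mode == "replace":
            rule_output = [t for t in tags if t not in rules] + rule_output

        return ", ".join(rule_output)

    Fed the YAML and inputs from the examples above, this sketch reproduces the listed outputs.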

    PromptMerge

    The prompt merge node is designed to work in conjunction with an automatic tagger. Given a list of input prompts, it coalesces consecutive identical prompts into the format expected by Fizz BatchPromptSchedule.

    Example inputs and outputs:

    Input (as a list):
     0: 1girl, solo
     1: 1girl, solo
     2: 1girl, solo, brown_eyes
     3: 1girl, solo, brown_eyes, black_hair
     4: 1girl, solo
    Output (as Text):
     "0": "1girl, solo",
     "2": "1girl, solo, brown_eyes",
     "3": "1girl, solo, brown_eyes, black_hair",
     "4": "1girl, solo"

    Workflow

    This workflow is a starting point and almost always requires adjustment for the specific video you are trying to rotoscope. Here are some notes on how I generally use it.

    Models and Loras

    Models with clean, simple lines that generate flat art, like those from misstoon, seem to work better than models like SeekYou that add lots of detail or flair. Loras help a lot, especially if they impose a clean art style with thick lines. Pretty much every lora from Envy works really well. Character Loras can help with consistency, but they don't seem to be as necessary as they are with pure img2img animation.

    Detail Loras are also very useful, in this case for removing detail. Too much detail can mess with AnimateDiff's consistency (details tend to appear and disappear between frames). I often run the adetailer lora at 0.2 or 0.3.

    Prompting and Prompt Travel

    Use the WD14Tagger, TagSwap, and PromptMerge nodes early on to build your core prompt and travels. WD14Tagger doesn't cache very well (it often reruns even when its input hasn't changed), so I typically disable all of these nodes (Mode -> Never) once I have the prompt travel ironed out. If you print the merged prompt to the console, you can replace all of these nodes with a text constant.

    You will also typically wind up manually editing the prompt to add things the automatic tagging misses. The auto tagging is just to get started.

    Use TagSwap to select the tags that are key to the animation and that change over time. If a tag is used for the entire animation, put it in the prefix or suffix input on the batch prompt scheduler. Remember that if you use the prefix input, you need to end your prefix with a "," to separate it from the other tags. See the example below.
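
    For example (the tags here are illustrative, not taken from the sample video), with a prefix and a merged travel prompt like:

    Prefix: masterpiece, flat color, 1girl, solo,
    Travel: "0": "closed_eyes, eyeshadow",
            "30": "open_mouth, smile"

    the scheduler prepends the prefix to every entry, so frame 0 effectively becomes "masterpiece, flat color, 1girl, solo, closed_eyes, eyeshadow".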

    AnimateDiff Settings

    I am still learning the best models and settings to use.

    ip2p ControlNet

    For anime video, ip2p is the 'secret sauce'. It does a fantastic job of converting the input to an anime style. I always use this CN for my videos. Keys to success with ip2p:

    • ALWAYS start your prompt with the phrase "Make [subject] a [descriptive] anime [subject]." Examples: "Make her a cute anime girl," "Make it a fantasy anime dragon," "Make him a stylish anime boy," etc.

    • It seems to work best to run ip2p with a 0% start percent and a 100% end percent.

    • The strength is a key variable. As far as I can tell, lower strength lets the output deviate farther from the input image, while higher strength applies less style transfer. Values around 0.2 or 0.3 seem to work well.

    Other ControlNets

    Other control nets will depend heavily on the input video, how much motion it has, etc. Getting a good video will require fiddling with CNs and their settings. Key CNs are:

    • Depth: The depth CN is fantastic at capturing motion but can be overbearing if you want to change the silhouette of the subject. I typically use depth with high strength (0.8-1) but low end percent (0.20-0.50). This seems to "stamp" the motion on the output during the early steps and then "get out of the way" for your prompt to take over the details.

    • Tile: Tile is good at transferring details (like hands) without destroying the ip2p style transfer. I use this CN in the same way as depth: high strength but a low end percent (sometimes as low as 0.1). Tile is also useful for suppressing background noise if you've used a mask.

    • OpenPose: I find OpenPose to be pretty poor at capturing motion, especially if the motion is subtle. Depth works much better for capturing the actual motion. However, OpenPose can complement depth by providing the orientation and Z-placement of limbs. For example, if depth causes an arm to render behind your figure instead of in front, then add OpenPose. It is typically safe to run OpenPose at high strength (around 0.8) from 0% to 100%.

    • LineArt: Line art is rarely used, but can be helpful for fine details like fingers/hands. Settings are extremely dependent on the video.

    Sampling

    I typically use euler_ancestral with 23-27 steps and CFG from 4-7. It will depend on your input. Keep in mind that since AnimateDiff works by rendering all the frames at once, your render time will depend heavily on the number of steps. If you go too high, your video will take forever to render. If you go too low, any control nets that have lower end percents will probably have too much influence.

    Misc

    It is often useful to use nodes that remove the background of images to isolate your subject. Backgrounds can confuse the AI and depth CN. It is typically better to isolate your subject and overlay it on a different background in post processing. If you have trouble masking your subject, try the green screen tool at RunwayML. You can either use it to generate an isolated video or you can use it to generate a mask by adding the Fill Effect and choosing White as your fill color.

    You can use FaceDetailer as part of the main workflow, but I recommend moving it to a post-processing workflow. It can be very slow, and it only works on individual frames: I cannot find a way to reuse the AnimateDiff-modified model with it, which makes sense.

    A decent post-processing workflow is to take the output from AnimateDiff and run it through img2img + FaceDetailer at a lowish denoise (less than 0.6). This can help clean up and add detail. Then run those images through img2img AnimateDiff again with a very low denoise (less than 0.3).

    I typically sample the input video down to 5-16fps (depending on frame counts) and then use FlowFrames to up the framerate. Keep in mind that anime is typically around 16 fps, so having too many frames can actually be detrimental.

    Tools like ffmpeg and ChaiNNer are also extremely useful for quickly splitting, upscaling, etc. your video or video frames.
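
    For example, a typical ffmpeg call to dump a clip to numbered frames at a reduced framerate looks like this (the paths and fps value are just placeholders):

    ffmpeg -i input.mp4 -vf fps=12 frames/%05d.png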

    Workflows
    SD 1.5
    by e7ai

    Details

    Downloads: 3,657
    Platform: CivitAI
    Platform Status: Deleted
    Created: 4/24/2025
    Updated: 5/6/2025
    Deleted: 4/24/2025

    Files

    animatediff_v12.zip

    Mirrors

    CivitAI (1 mirror)