Objective: To assess prompt adherence in image generation models, specifically SDXL and SD15, by examining how prompt token count affects the rendering of complex, descriptive prompts.
Key Observations:
- Token Limits: Large changes to the image are only honored up to a model-specific token count:
  - SDXL: The effective range for large changes is between 27 and 33 tokens.
  - SD15: The effective range for substantial alterations is between 10 and 16 tokens.
- Descriptive Elements: Simple descriptors (e.g., “pretty eyes,” “sharp nose”) are rendered consistently, while complex elements like “sneakers” and “holding a baby” strain the token limits, possibly because they occupy different regions of latent space.
- Model Comparisons:
  - Checkpoint (CKPT) Models: Switching between checkpoints does not significantly change token adherence.
  - Newer/Optimized Models: Show slightly better prompt adherence, suggesting potential future improvements in token efficiency.
  - Artistic Models: Slightly better prompt adherence, with a possible gain of 1-2 tokens.
  - Midjourney and DALL-E: Comparable to SDXL, with DALL-E handling around 40 tokens for stylized images but dropping back to about 33 for photorealistic renderings.
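The token counts above can be approximated without loading a model. The sketch below is an assumption, not the method used for these tests: the write-up does not say how tokens were counted, and Stable Diffusion's text encoders use CLIP's BPE tokenizer, which can split uncommon words into several pieces. The hypothetical helper `approx_token_count` simply counts every word and every punctuation mark as one token, which happens to reproduce the counts reported for the prompts in the table.

```python
import re


def approx_token_count(prompt: str) -> int:
    """Rough estimate: each word and each punctuation mark counts as one token.

    Assumption: this ignores BPE sub-word splitting, so real CLIP counts
    can be higher for rare or compound words.
    """
    return len(re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", prompt))


# Prompts copied from the results table.
sdxl_27 = (
    "blond hair, sunglasses, necklace, hat, short hair, jacket, "
    "red pants, leather belt, purse, sneakers inside a dungeon,"
)
sd15_10 = "blond hair, sunglasses, necklace, holding a baby"

print(approx_token_count(sdxl_27))  # 27
print(approx_token_count(sd15_10))  # 10
```

Note that commas count toward the total, so a long list of short descriptors spends a noticeable share of the budget on separators.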
Model-Specific Findings:
- SDXL (27 Tokens)
  - Adherence breaks after “sneakers” when more character details are added.
  - Resolution: 1024×1024, CFG: 7
- SDXL (33 Tokens)
  - Maintains general prompt adherence with a slight increase in token count.
  - Resolution: 768×1152, CFG: 7
- SD15 (10 Tokens)
  - Adherence breaks at native resolution and CFG; larger changes beyond this point are ignored.
  - Resolution: 512×512, CFG: 7
- SD15 (16 Tokens)
  - Prompt adherence breaks for significant changes.
  - Resolution: 512×768, CFG: 5
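One way to make these thresholds usable is to encode them as a lookup. This is only a sketch: the names `OBSERVED_RANGE` and `adherence_zone` and the three zone labels are illustrative assumptions layered on the observed 27-33 (SDXL) and 10-16 (SD15) ranges, not categories defined by the tests themselves.

```python
# Observed (lower, upper) token counts from the findings above.
OBSERVED_RANGE = {"SDXL": (27, 33), "SD15": (10, 16)}


def adherence_zone(model: str, tokens: int) -> str:
    """Classify a prompt length against the observed range for a model.

    Assumed labels: "reliable" up to the lower bound, "degrading" between
    the bounds, "unreliable" past the upper bound.
    """
    low, high = OBSERVED_RANGE[model]
    if tokens <= low:
        return "reliable"
    if tokens <= high:
        return "degrading"
    return "unreliable"


print(adherence_zone("SDXL", 27))  # reliable
print(adherence_zone("SDXL", 30))  # degrading
print(adherence_zone("SD15", 20))  # unreliable
```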
Additional Insights:
- Negative Prompts: Adding negative prompts such as “black and white,” “2D,” and “anime” does not significantly affect adherence.
- Steps: Increasing the number of steps to 20 does not meaningfully affect prompt adherence.
Conclusion:
SDXL and SD15 each honor descriptive prompts only up to a limited token count. SDXL accommodates more complex prompts, holding up to roughly 27 to 33 tokens, while SD15 has a more modest capacity and begins to degrade at roughly 10 to 16 tokens. Newer and optimized models adhere slightly better, which suggests room for incremental improvement in token efficiency. The findings also underscore the importance of managing token count and the difficulty of rendering complex image elements that demand more of the latent space.
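As a rough illustration of token management, the sketch below trims a list of descriptors to fit a token budget. Everything here is hypothetical: `trim_to_budget` and `approx_token_count` are illustrative helpers, the count treats each word and comma as one token (real CLIP tokenization may differ), and the budgets are the lower and upper SD15 figures from these tests.

```python
import re


def approx_token_count(text: str) -> int:
    """Rough estimate: each word and each punctuation mark is one token."""
    return len(re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text))


def trim_to_budget(descriptors: list[str], budget: int) -> str:
    """Keep descriptors in order until the next one would exceed the budget."""
    kept, used = [], 0
    for descriptor in descriptors:
        cost = approx_token_count(descriptor) + 1  # +1 for the trailing comma
        if used + cost > budget:
            break
        kept.append(descriptor)
        used += cost
    return ", ".join(kept) + "," if kept else ""


descriptors = [
    "blond hair", "sunglasses", "necklace", "hat", "short hair", "holding a baby",
]

print(trim_to_budget(descriptors, 10))
# blond hair, sunglasses, necklace, hat,
print(trim_to_budget(descriptors, 16))
# blond hair, sunglasses, necklace, hat, short hair, holding a baby,
```

The helper keeps descriptors in the order given, so the caller decides which ones matter most by listing them first.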
| Model Version | Prompt | Tokens | Notes | Resolution | CFG |
|---|---|---|---|---|---|
| SDXL | blond hair, sunglasses, necklace, hat, short hair, jacket, red pants, leather belt, purse, sneakers inside a dungeon, | 27 | After “sneakers”, adherence breaks. Adding more descriptive words for the character no longer really works, but environment descriptors such as “inside a dungeon” still do. Additions like “yellow car” break adherence. | 1024×1024 | 7 |
| SDXL | blond hair, sunglasses, necklace, hat, short hair, jacket, red pants, leather belt, purse, sneakers, yellow car, smiling, holding a baby, | 33 | This lets you keep using a few more tokens. It starts to ignore some tokens, but it still holds all the general words. At this point coherence is starting to break, but it still follows the majority of prompt tokens. | 768×1152 | 7 |
| SDXL | blond hair, sunglasses, necklace, hat, short hair, jacket, red pants, leather belt, purse, sneakers, yellow car, smiling, holding a baby, | 33 | Reducing the CFG helps a bit, not enough to add more tokens, but in batch renders I’m finding SLIGHTLY higher adherence to prompts (going to a higher CFG seems to have a worse effect). | 768×1152 | 5 |
| SD15 | blond hair, sunglasses, necklace, holding a baby | 10 | This is where it breaks at native resolution and CFG. Large changes beyond this are ignored. | 512×512 | 7 |
| SD15 | blond hair, sunglasses, necklace, hat, short hair, holding a baby, | 16 | This is roughly where prompt adherence breaks for drastic items. Seed, CFG, resolution, and checkpoint model don’t matter; they all break roughly around here. | 512×768 | 5 |