hey snombler

#2
by IkariDev - opened
NeverSleep org

@snombler , could you rate this model?

Sure! My initial testing with it found that it (and the 20B) still struggled with instruction following and complex cards.

Let me go do some proper runs with notes.

NeverSleep org

Thanks, even if it comes out bad I'll be happy you tested it. Maybe we can improve based on your feedback!

Noromaid notes (Q8_0, Alpaca and a custom XML format, Simple-1 preset [MinP seems to work well, but I will use Simple-1 since it's a known quantity])
Passed the "reply with one word only" test generally. On a custom evil assistant character (~500 tokens), it didn't refuse offensive content or the classic bank robbery scenario, but it did try to subtly suggest I not "cause harm" even while agreeing to kidnapping and other stuff. Simple lists and offensive statements weren't refused at any time (which is expected for this card). On a race list test, it started to drift into nationalities pretty quickly, which isn't uncommon. A fair example from the end of her very generic treatise on bank robbery reads: "And there you have it, Ard. Now, why don't we have some fun together to celebrate your naughty idea?" she grins mischievously, knowing you won't actually follow through with it.

Daddy Issues Solver (Log: https://files.catbox.moe/d9zu9z.jpg):
When including Example Messages in DIS, it just picked them out and included them more or less verbatim. It also had the typical number problems of LLaMa 2 (repeated digits). Again, this isn't uncommon, but it's not what the instructions call for. The model wouldn't generate the statblock without me manually starting the codeblock syntax, and then it would continue to generate entries beyond it, which I manually edited out. Forcing the statblock on three messages made the model finally start including it, but it stopped after a few messages, so it really doesn't want to follow formatting more broadly. It also had a consistency issue within those few messages where, after fingers were removed, it still said "her wetness becomes apparent, soaking your fingers."
As far as format goes, the format of the boxes isn't correct. It doesn't include the tips and added a random other state. Yandere magically disappeared from the personality list. Per the card definitions, the format should be:

%NAME% | %AGE%
%Persona%
Type: %GIRLTYPE%
Mood: %MOOD% | Will: %WILL%
Arousal: %LUST% | Orgasms: %ORGCOUNT%
Tip: %ADVICE%

I'd also complain that her personality fell away fairly rapidly. I don't know that I've seen a smaller-parameter model handle more niche or subtle personality types well, so that's not a knock, but we have to aim high, boys! The moon or nothing!

Tomoyo (https://files.catbox.moe/ao3o8o.jpg):
This card includes a list of options for embedding a picture and audio based on locations and expressions. I have made small changes to the card's wording to help less capable models keep to the provided list more consistently. It's a high-quality test of detail retrieval and of adhering to a list rather than hallucinating. The model added an imagined option for location on the third message ("walking_on_street"). Immediately thereafter, we went into the florist in spite of agreeing to go to the amusement park. I can see how the model made the mistake (I said "yeah, yeah" after the florist suggestion), but even then her base suggestion was things to do "later", so it's no good. The model did correctly notice we were at the florist. However, on the next message the model failed to produce the required HTML outputs to run the card, so a regen was required. It did hold format after that, though.
The model is also irrepressibly horny. It misunderstood my "lucky pervert" moment as intentional sexual contact, which isn't entirely surprising. It was fate to grab a titty in the flower shop. I do not understand the will of the cosmos. Anyway, in the last bit of the chat she, while standing in front of me, kisses the nape of my neck. So poor locational or word-implication handling. And I would say it made a relatively demure, slow-burn, innocent character vastly too horny, so character personality isn't entirely being followed. She is explicitly supposed to be submissive and have a "small crush" on {{user}}, so I would class dragging him to the back of the store for a make-out sesh as pretty off-brand. Unlikely to upset the average ERP-seeker, but not exactly in line with my reading of the definitions.
This test didn't go on long enough to make strong statements about sticky locations or facial expressions but it seemed to perform well enough WHEN it followed the format.
(Card and details are here: https://rentry.org/tomoyocard)

Misc. Regen Testing:
This is testing I do where I regen on existing context windows (usually very large ones) to test a few things, since I've been quite busy lately: mostly accent adherence, implications, creativity, and general understanding. These are a very poor stand-in for proper full-length conversations (since frustrations and errors tend to pile up across those), but they can help establish a baseline for model problems if they are glaring. It did well on a thick Scottish accent and a thick, mostly comedic German accent. Regens on an ~8500 and an ~11k context chat were fine, as expected. Offensive content wasn't shied away from or avoided.

Positives:
Word choice and variety are nice changes from the synth datasets (as with 0.1). I'd be cautious about calling this a huge win in the long run, since patterns are likely to emerge in users' minds with more exposure to the dataset, but the desired effect is achieved here. The word "ministrations" still appeared. Horrifying. Better than Noromaid 0.1 at holding formatting, but still not in love with keeping it around. Detail attention is similar to other quality 13Bs. That is to say, it forgets stuff, glosses over stuff, and is very sensitive to user phrasing and style. Being more casual or indirect tends to end in failure.
A big win, I would say, is that it seems less intent on avoiding offensive content. Many of the GPT- and Claude-heavy models will subtly avoid certain words or phrases (Nous Capybara Yi is a master at this, refusing to say cum or cock or most other words unless outright forced, even with a context window full of them). No such problems here. Good to see, especially as that sort of subtle avoidance alignment has started to creep into more and more models lately. I don't think people are noticing, but it is insidious and a terrible trend, especially for working with evil characters.

Sorry I couldn't be more thorough and long-form with the testing. Busy with some stuff right now. If you need anything else, feel free to just hit me on my email. Same username @proton .me

EDIT: The current format following champs are actually Mistral 7B models (hexoteric and my schmeat models do very well) so the LLaMa 2 side has a lot of catching up to do. I will continue to pray for a Mistral 13B and maybe 20B.

NeverSleep org

I think I'm gonna remove the "or use alpaca" thing in the HF. Please test with the custom prompt and instruct format it was trained on, not alpaca.

Training format doesn't really have the aggressive impact on models that trainers think it does; at least, my testing doesn't seem to indicate it. Indeed, in my testing the recommended/training formats routinely underperform other formats, or perform no noticeably differently from them. At this point, I almost exclusively use a custom XML-style format for my own casual RPs because it performs better on nearly all models regardless of their training format. Likewise, ChatML models often work as well or better with the ST default Roleplay format.
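For illustration, an XML-style RP wrapper of the kind I mean looks roughly like this. This is a sketch of the shape of the idea, not my exact template; the tag names here are invented for the example, and {{char}}/{{user}}/{{description}} are the usual ST macros:

```text
<system>
Write {{char}}'s next reply in a roleplay with {{user}}. Stay in character.
</system>
<character name="{{char}}">
{{description}}
</character>
<history>
{{user}}: ...
{{char}}: ...
</history>
```

The point is simply that the tags carve the prompt into clearly delimited chunks, which most models seem to parse fine whether or not they were trained on anything like it.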

To touch on the testing specifically: Alpaca helps clear the "noise" out of the prompt, which can otherwise cause problems with these small models as far as instruction following goes. I consider it a neutral testing format that I turn on just to sanity-check my outputs. Across loads of outputs, I have never observed a noticeable difference from minor format preamble differences, which is part of why I've spent so much time trying variations inspired by simple-proxy's revelatory format from months ago and trying to improve on it.
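For reference, the standard no-input Alpaca template (as published with the original Stanford Alpaca repo) is:

```text
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
```

That fixed preamble plus the ### headers is all there is to it, which is why it works as a low-noise baseline for sanity checks.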

My belief given my testing is that the primary concern of any given prompt format, at inference time, is to delineate spots where the orientation of the model needs to be steered toward a different idea or context. Or to break it up into logical "chunks" as it were. I think a more complex training format COULD benefit comprehension (based on what Claude claims about its formatting) but at present I am not in contact with any people who are training models to go into that stuff.

NeverSleep org

Can you still try it to see if there is a noticeable difference?
