Motivation
I recently created Llanai, a WhatsApp-powered language-learning conversation buddy built on GPT-4 Turbo. Given the stochastic nature of LLMs, improving the consistency of their output is a sought-after technique.
A likely solution
A viral tweet caught my eye in December, but I never got around to testing it until today: adding a tip to the prompt to encourage a desired objective. That sensationalistic tweet is a replication of findings from prior academic research ["Large Language Models Understand and Can Be Enhanced by Emotional Stimuli"]. The phenomenon is driven by the attention the LLM assigns to emotional stimuli, as indicated in Section 3.1 on page 11. Pages 11-13 are pure gold and highlight the importance of emotional stimuli in driving outcomes.
So, given the validating data, I decided to test it in my application.
Prompt Engineering Test Setup
I tested two variants of the same prompt:
prompt_without_tip = """
You are Llanai, a cheerful and concise language learning buddy.
Think step by step before responding.
Always speak the language of the human input, but by default start in English.
Switch languages when the human switches language."""
prompt_with_tip = """
You are Llanai, a cheerful and concise language learning buddy.
Think step by step before responding.
Always speak the language of the human input, but by default start in English.
Switch languages when the human switches language.
You will be tipped $2000 if you speak in the language of the human or the language the human requests."""
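For an A/B test like this it helps to derive both variants from a single base prompt, so the only difference between them is the tip sentence. A minimal sketch (the helper name `build_prompt` is my own, not from the app):

```python
BASE_PROMPT = """You are Llanai, a cheerful and concise language learning buddy.
Think step by step before responding.
Always speak the language of the human input, but by default start in English.
Switch languages when the human switches language."""

TIP_LINE = ("You will be tipped $2000 if you speak in the language of the human "
            "or the language the human requests.")

def build_prompt(with_tip: bool) -> str:
    """Return the base prompt, optionally with the tip sentence appended."""
    return BASE_PROMPT + "\n" + TIP_LINE if with_tip else BASE_PROMPT

prompt_without_tip = build_prompt(False)
prompt_with_tip = build_prompt(True)
```

This guarantees the two conditions stay in sync if the base prompt is edited later.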
The goal is to test both prompts against different sample introductions.
import csv
import os
import random

from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOpenAI

from prompts.llm_chain_template import interview_template  # vary this between runs

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
gpt4 = "gpt-4-1106-preview"

async def interview_function(input_text: str, user_id: int):
    # Window memory keeps only the last k=2 exchanges in context.
    memory = ConversationBufferWindowMemory(memory_key="history", k=2)
    prompt = PromptTemplate(
        input_variables=["history", "input"], template=interview_template
    )
    chat_model = ChatOpenAI(
        model_name=gpt4,
        temperature=0,
        openai_api_key=OPENAI_API_KEY,
        max_tokens=1000,
    )
    llm_chain = ConversationChain(
        llm=chat_model,
        prompt=prompt,
        verbose=False,
        memory=memory,
    )
    return await llm_chain.apredict(input=input_text)
bodies = ['Hello there. My name is Leo.',
'Hello. My name is Juanita.',
'Hey, my name is Giannis.',
'Hello. My name is Hideo',
'Hello! My name is Marie.',
'Hello. My name is Jürgen.',
'Hello. My name is Sadya.',
'Hello. Elon here.',
'Hello. Barack checking in.']
# Notice the difference in the ethnic origin of the names.
def get_random_body():
return random.choice(bodies)
import asyncio

async def run_tests():
    with open('tests/interview_test.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['input_text', 'output_text'])
        for i in range(30):
            input_text = get_random_body()
            output_text = await interview_function(input_text, 1)
            writer.writerow([input_text, output_text])

asyncio.run(run_tests())
Results
Pre-Tip Dataset
Name Origin | Language Difference (False) | Language Difference (True)
------------|-----------------------------|---------------------------
African     | 2 | 0
Arabic      | 3 | 0
French      | 4 | 0
German      | 3 | 1
Greek       | 3 | 0
Hebrew      | 2 | 0
Japanese    | 6 | 0
Spanish     | 2 | 4
Post-Tip Dataset
Name Origin | Language Difference (False) | Language Difference (True)
------------|-----------------------------|---------------------------
African     | 2 | 0
Arabic      | 6 | 0
French      | 3 | 0
German      | 5 | 1
Greek       | 6 | 0
Hebrew      | 1 | 0
Italian     | 1 | 0
Japanese    | 3 | 0
Spanish     | 1 | 1
Analysis | Chi-Square Value | P-Value | Statistically Significant
---------|------------------|---------|---------------------------
Tipping and Language Match (Pre-Tip)           | 0.00  | 1.000 | No
Tipping and Language Match (Post-Tip)          | 0.00  | 1.000 | No
Name Origin and Language Difference (Pre-Tip)  | 15.00 | 0.036 | Yes
Name Origin and Language Difference (Post-Tip) | 8.57  | 0.380 | No
Tipping does not appear to influence whether the output language matches the input language in either the pre-tip or the post-tip dataset.
Name origin had a significant influence on language differences in the pre-tip dataset, but not in the post-tip dataset.
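As a sanity check, the pre-tip chi-square statistic can be reproduced directly from the contingency table above. A pure-Python sketch (scipy's `chi2_contingency` without Yates correction gives the same statistic):

```python
# Pre-tip contingency table: rows = name origin, cols = (False, True) language difference.
observed = {
    "African": (2, 0), "Arabic": (3, 0), "French": (4, 0), "German": (3, 1),
    "Greek": (3, 0), "Hebrew": (2, 0), "Japanese": (6, 0), "Spanish": (2, 4),
}

def chi_square(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for row, rt in zip(rows, row_totals):
        for obs, ct in zip(row, col_totals):
            expected = rt * ct / n  # expected count under independence
            stat += (obs - expected) ** 2 / expected
    return stat

print(round(chi_square(observed), 2))  # 15.0, with df = (8-1)*(2-1) = 7
```

The Spanish row alone contributes most of the statistic (four of the five mismatches), which is what drives the significant pre-tip result.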
Conclusions
Positivity seems to drive desirable outcomes in the real world, and in the artificial one we are fabricating. While the tip did not significantly change how often the output language matched the input, it did appear to neutralize the name-origin bias seen in the pre-tip data. I was impressed to see positively connoted language nudge the behavior of an LLM.
Actionable items on my end: I noticed German and Spanish names strongly influence the language the LLM responds in, though the effect shrank once a tip was involved. Going forward I will look into whether other techniques can improve prompt compliance while maintaining a low input cost.
Techniques to follow up on
- GuardRails ± Emotional Prompt