How to use Writer/palmyra-20b-chat with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="Writer/palmyra-20b-chat")
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Writer/palmyra-20b-chat")
model = AutoModelForCausalLM.from_pretrained("Writer/palmyra-20b-chat")
How to use Writer/palmyra-20b-chat with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Writer/palmyra-20b-chat"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Writer/palmyra-20b-chat",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Or run the model with Docker:
docker model run hf.co/Writer/palmyra-20b-chat
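The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000` and returns an OpenAI-style response body. The helper names `build_completion_request` and `complete` are illustrative and are not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="Writer/palmyra-20b-chat",
                             max_tokens=512, temperature=0.5):
    """Build a POST request mirroring the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumes the server is reachable and that the response contains
    an OpenAI-style "choices" list.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example call (requires a running server):
# print(complete("Once upon a time,"))
```

The same helper should work against the SGLang server by passing `base_url="http://localhost:30000"`, since both commands expose the same `/v1/completions` route.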
How to use Writer/palmyra-20b-chat with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "Writer/palmyra-20b-chat" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Writer/palmyra-20b-chat",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Or start the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "Writer/palmyra-20b-chat" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Writer/palmyra-20b-chat",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
How to use Writer/palmyra-20b-chat with Docker Model Runner:
docker model run hf.co/Writer/palmyra-20b-chat
Please note that this model is no longer maintained or supported by our team. We strongly advise against using it in production or for any critical applications.
Instead, we recommend using our latest and greatest models, which can be found at:
https://huggingface.co/collections/Writer/palmyra-writer-license-66476fa8156169f8720a2c89
==========================
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_name = "Writer/palmyra-20b-chat"

# Load the tokenizer and the model in half precision, placed automatically on available devices.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "What is the meaning of life?"

# Prompt template used in this example: a system preamble followed by one USER/ASSISTANT turn.
input_text = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: {prompt} "
    "ASSISTANT:"
)

model_inputs = tokenizer(input_text.format(prompt=prompt), return_tensors="pt").to(
    "cuda"
)

# Sampling settings passed to generate().
gen_conf = {
    "top_k": 20,
    "max_new_tokens": 2048,
    "temperature": 0.6,
    "do_sample": True,
    "eos_token_id": tokenizer.eos_token_id,
}

# Print tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer)

# Drop token_type_ids if the tokenizer produced them, so they are not forwarded to generate().
if "token_type_ids" in model_inputs:
    del model_inputs["token_type_ids"]

all_inputs = {**model_inputs, **gen_conf}
output = model.generate(**all_inputs, streamer=streamer)

print("-" * 20)
print(output)  # tensor of token IDs (prompt plus generated tokens)
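The prompt template in the script above can be factored into small helpers. This is a sketch only: `build_prompt` and `extract_reply` are hypothetical names, and the template is copied from the example rather than taken from any separate specification.

```python
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
)

ASSISTANT_MARKER = "ASSISTANT:"


def build_prompt(user_message):
    """Wrap a single user message in the template used by the example script."""
    return f"{SYSTEM_PROMPT}USER: {user_message} {ASSISTANT_MARKER}"


def extract_reply(decoded_text):
    """Return the text after the first ASSISTANT: marker.

    Falls back to the whole text if the marker is absent. Assumes the
    user message itself does not contain the marker.
    """
    if ASSISTANT_MARKER not in decoded_text:
        return decoded_text.strip()
    _, _, reply = decoded_text.partition(ASSISTANT_MARKER)
    return reply.strip()
```

With these helpers, `build_prompt(prompt)` could replace `input_text.format(prompt=prompt)`, and `extract_reply` could be applied to the output of `tokenizer.decode(output[0], skip_special_tokens=True)`. To keep the streamed output free of the prompt, `TextStreamer(tokenizer, skip_prompt=True)` is another option.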
Detailed results can be found here
| Metric | Value |
|---|---|
| Avg. | 38.97 |
| ARC (25-shot) | 43.52 |
| HellaSwag (10-shot) | 72.83 |
| MMLU (5-shot) | 35.18 |
| TruthfulQA (0-shot) | 43.17 |
| Winogrande (5-shot) | 66.46 |
| GSM8K (5-shot) | 3.94 |
| DROP (3-shot) | 7.7 |
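As a quick sanity check, the reported average appears to be the unweighted mean of the seven benchmark scores in the table. The snippet below recomputes it; the dictionary simply restates the table values.

```python
# Scores copied from the table above.
scores = {
    "ARC (25-shot)": 43.52,
    "HellaSwag (10-shot)": 72.83,
    "MMLU (5-shot)": 35.18,
    "TruthfulQA (0-shot)": 43.17,
    "Winogrande (5-shot)": 66.46,
    "GSM8K (5-shot)": 3.94,
    "DROP (3-shot)": 7.7,
}

# Unweighted mean across all seven benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # prints 38.97
```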