How Is the ChatGPT API Charged? Understanding Models, Tokens, and Limits with Python Code
How the OpenAI API works, plus a basic chatbot in Python.
Index
- OpenAI API (Endpoints, Models and Limits)
- Tokens
- Costs (input and output)
- ChatBot
(This post was written in August 2023.)
If you landed on this post, you are probably here to understand how the ChatGPT API is charged, so I'll be pretty straightforward, with examples in Python.
- For the code in this post, check my GitHub repository.
- Visual Studio Code and Anaconda's JupyterLab will be used for this analysis.
For all my posts, please click here.
OpenAI API
The OpenAI API offers multiple endpoints; one of them is ChatGPT, which will be the focus of this post.
Currently there are 2 completion endpoints, v1/completions and v1/chat/completions. Since v1/completions will be deprecated, we'll use v1/chat/completions in this post.
The /v1/completions (Legacy) endpoint will be deprecated on 4 January 2024, and OpenAI recommends gpt-3.5-turbo-instruct as a replacement (check this link: Deprecations — OpenAI API).
Endpoints
List of OpenAI endpoints:
- /v1/completions
- /v1/chat/completions
- /v1/edits
- /v1/images/generations
- /v1/images/edits
- /v1/images/variations
- /v1/embeddings
- /v1/audio/transcriptions
- /v1/audio/translations
- /v1/files
- /v1/fine-tunes
- /v1/moderations
Pricing
The OpenAI API doesn't charge you per request; it charges you based on the number of tokens sent to the API, which we will explain in detail. OpenAI also differentiates input and output, i.e., it has different prices for input and output tokens (this is important), according to its pricing page.
GPT-4 (8K context): US$ 0.03 per 1K input tokens, US$ 0.06 per 1K output tokens.
GPT-3.5 Turbo (4K context): US$ 0.0015 per 1K input tokens, US$ 0.002 per 1K output tokens.
In this post we'll focus exclusively on v1/chat/completions, which is the ChatGPT endpoint.
According to OpenAI, we have the following endpoints and their models:
!pip install openai # installing openai lib
import openai # importing the lib
openai.ChatCompletion.create() # /v1/chat/completions endpoint
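For illustration, here is a minimal call to that endpoint; this is a sketch assuming the pre-v1.0 openai library used throughout this post, and the prompt is just a placeholder:
# a minimal request to the /v1/chat/completions endpoint
# (pre-v1.0 openai library; the message content is a placeholder)
r = openai.ChatCompletion.create(
    model='gpt-3.5-turbo-0613',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(r['choices'][0]['message']['content'])  # the model's reply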
The endpoint is already chosen; time to choose the model. According to Image 3, we have the following options:
- ChatGPT 3.5
- ChatGPT 4
Understanding the Models
Max Tokens and Context are the same thing: they represent the total number of tokens of input and output combined. If you select the model gpt-4, you have a limit of 8,192 tokens shared between input and output. An example: an input with 7,000 tokens can only have an output of up to 1,192 tokens using gpt-4.
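The arithmetic of that shared budget, as a quick sketch using the numbers from the example above:
# the context window is shared between input and output tokens
context_limit = 8192      # gpt-4's limit
input_tokens = 7000       # tokens in our prompt
max_output_tokens = context_limit - input_tokens
print(max_output_tokens)  # 1192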
GPT-4 models: gpt-4 and gpt-4-0613 (8,192 tokens), gpt-4-32k and gpt-4-32k-0613 (32,768 tokens).
GPT-3.5 models: gpt-3.5-turbo and gpt-3.5-turbo-0613 (4,096 tokens), gpt-3.5-turbo-16k and gpt-3.5-turbo-16k-0613 (16,384 tokens).
Tokens
The prompt you write needs to be tokenized; to do this, OpenAI uses a library called tiktoken. Basically speaking, tokenizing is the process of breaking a sentence into smaller chunks.
Generally speaking, each token can be understood as roughly a word (or part of one). OpenAI charges for each token sent to its API.
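To see what those chunks look like, here is a quick sketch with tiktoken; the exact splits depend on the encoding, and the 9-token count for this sentence is the one reported later in this post:
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by the gpt-3.5/gpt-4 chat models
enc = tiktoken.get_encoding('cl100k_base')
tokens = enc.encode('whats the escape velocity of mars orbit?')
print(len(tokens))                        # 9
print([enc.decode([t]) for t in tokens])  # the individual chunks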
Codes
For the sake of simplicity, we chose gpt-3.5-turbo-0613. Let's code it:
# create a file named "psw.py", copy and paste this content and substitute
# the variables with your own values
class MyClass:
    def __init__(self):
        self.organization = 'aaaaaaaaa'
        self.org_id = 'bbbbbbbbb'
        self.key = 'ccccccccc'
        self.name = 'ddddddddd'
        self.model = {
            'model': {'name': 'gpt-3.5-turbo-0613',
                      'limit_tokens': 4096,
                      'input': 0.0015,
                      'output': 0.002}
        }
# create another file, in this case is a .ipynb file
from psw import MyClass # access information
import openai # openai lib
import tiktoken # lib used by openai to tokenize words
import pandas as pd
access = MyClass()
openai.api_key = access.key # your key
model = access.model['model']['name'] # or 'gpt-3.5-turbo-0613'
input_cost = access.model['model']['input'] / 1000   # OpenAI charges per 1k tokens; dividing by 1000 gives the cost per token
output_cost = access.model['model']['output'] / 1000  # same thing
API Response
Before understanding the costs themselves, it's necessary to understand OpenAI's response.
This response is returned to us as JSON.
It contains 6 main fields:
- id: a unique id
- object: the object type, in our case chat.completion, i.e., the ChatGPT endpoint
- created: the date of creation, as a Unix timestamp
- model: the model passed to chat.completion
- choices: the response itself, with index, message and finish_reason
- usage: the token counts that will be used to calculate the cost
<OpenAIObject chat.completion id=chatcmpl-7qVnZ8CEkYL9ZyPuAg0GPmC48GM9z at 0x167a6191620> JSON: {
  "id": "chatcmpl-7qVnZ8CEkYL9ZyPuAg0GPmC48GM9z",
  "object": "chat.completion",
  "created": 1692749645,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The escape velocity of Mars orbit is about 5.03 kilometers per second (km/s), or approximately 11,223 miles per hour (mph)."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 31,
    "total_tokens": 47
  }
}
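Assuming the response above is stored in a variable r (as it will be in the chatbot later), the object supports dict-style access:
# the response object behaves like a dict
print(r['model'])                             # gpt-3.5-turbo-0613
print(r['choices'][0]['message']['content'])  # the assistant's answer
print(r['usage']['prompt_tokens'], r['usage']['completion_tokens'])  # 16 31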
The usage field will be used to calculate the output price.
Costs
As said before, the OpenAI API charges per 1,000 tokens; to calculate the cost per token, we divide the listed price by 1,000. We must also consider the number of input tokens and output tokens separately, because OpenAI charges them differently.
Cost per input token: US$ 0.0000015
print(f'{input_cost:.7f}')
# 0.0000015
Cost per output token: US$ 0.0000020
print(f'{output_cost:.7f}')
# 0.0000020
Input Cost
To calculate the input cost, we first need a text to tokenize. The following function returns an input message typed by the user:
def input_text():
    text = input("user: ")
    return text
To test it, we'll store it in 2 different variables:
text = input_text()
text_2 = [{'content': text}] # this is the input format for our endpoint
print(text)
print(text_2)
## whats the escape velocity of mars orbit?
## [{'content': 'whats the escape velocity of mars orbit?'}]
To count the number of tokens in a text, we have to instantiate the encoding:
# the encoder for our model 'gpt-3.5-turbo-0613'
encoding = tiktoken.encoding_for_model(model)
To count the number of tokens:
text_example = text # extract the text
n_tokens_example = len(encoding.encode(text)) # count the number of tokens
print(f'The text: \n{text_example} \ncontains {n_tokens_example} tokens')
# The text:
# whats the escape velocity of mars orbit?
# contains 9 tokens
But OpenAI doesn't consider only the number of tokens in the input; it also takes other variables, such as the model, into account. The following example is part of a function from OpenAI that will be shown in full later.
In this case, we are considering the model gpt-3.5-turbo-0613, which has some constants, and we will consider the following text: whats the escape velocity of mars orbit?
import tiktoken

encoding = tiktoken.encoding_for_model(model)  # model: gpt-3.5-turbo-0613
tokens_per_message = 3  # constant for gpt-3.5-turbo-0613
tokens_per_name = 1  # constant for gpt-3.5-turbo-0613
num_tokens = 0  # token counter
for message in text_2:  # iterates over the messages (dicts) inside the list
    num_tokens += tokens_per_message  # each message adds 3 tokens (for gpt-3.5-turbo-0613)
    for key, value in message.items():  # key is e.g. 'content'; value is the text inside it
        num_tokens += len(encoding.encode(value))  # adds the encoded token count to the total
        if key == "name":
            num_tokens += tokens_per_name  # if the key is 'name', it adds 1
num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
num_tokens
Example 1:
[
    {"content": "whats the escape velocity of mars orbit?"},
]
In Example 1 we have 1 message (each dict inside the list corresponds to one message), and according to OpenAI, each message adds 3 tokens to num_tokens (num_tokens += tokens_per_message):
num_tokens = 3
Our text contains 9 tokens, according to OpenAI's function, so we add 9 to our variable num_tokens.
num_tokens += len(encoding.encode(value))
num_tokens
# 12
And it does not contain a name, so nothing is added here:
if key == "name":
    num_tokens += tokens_per_name  # if the key is 'name', it adds 1
At the end of the loop, 3 more tokens are added to num_tokens, resulting in 3 + 9 + 3 = 15.
num_tokens += 3 # every reply is primed with <|start|>assistant<|message|>
3 = tokens_per_message
9 = number of tokens in our input
3 = tokens added to prime the reply
Example 2:
[
    {"content": "whats the escape velocity of mars orbit?"},  # 3 + 9
    {"content": "whats the escape velocity of mars orbit?"},  # 3 + 9
]
# + 3
# result 27
In the case of multiple messages, like Example 2, where we have 2 messages of the same length, each message adds 3 tokens and each text adds 9 tokens, plus 3 tokens to prime the reply:
(3 + 9) + (3 + 9) + 3 = 27.
or:
First message: (3 + 9)
Second message: (3 + 9)
Reply: 3
Total: 27
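We can verify this with the num_tokens_from_messages function shown further below (assuming model is defined as above):
# verifying Example 2: two identical messages of 9 tokens each
msgs = [{"content": "whats the escape velocity of mars orbit?"}] * 2
print(num_tokens_from_messages(msgs, model))  # 27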
Example 3:
[{
    "name": "Cthulhu",  # + 1
    "content": "whats the escape velocity of mars orbit?"  # (3 + 9)
}]
# + 3
# result 16
Example 3 brings a name, which adds 1 more token (tokens_per_name) to the variable num_tokens. In Example 3, the result will be 3 + 9 + 1 + 3 = 16. (Strictly speaking, the full function below also encodes the value of name itself, so a name like Cthulhu would contribute its own tokens on top of the +1 constant.)
To calculate the input cost, count the tokens using the function from OpenAI and multiply the result by input_cost:
input_text_cost = num_tokens_from_messages(text_2, model) * input_cost
print(f'{input_text_cost:.7f}')
# 0.0000225
For a more in-depth understanding of the calculation for other models, follow this link, where you’ll find this function:
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
    }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
So, let’s try it in practice:
num_tokens_from_messages(text_2, model)
# 15
To calculate the total input cost of the query:
total_example = 15 * input_cost
print(f'Total Input Cost: {total_example:.8f}')
# Total Input Cost: 0.00002250
Output Cost
Well, you cannot calculate the output cost in advance, because you don't have the output yet. The only way to obtain it is from the API response, accessing the usage field:
r['usage']['completion_tokens']
To obtain the cost, just multiply it by the output cost:
output_text_cost = r['usage']['completion_tokens'] * output_cost
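With the sample response shown earlier (completion_tokens = 31), the arithmetic works out to:
# worked example using the sample response above: 31 completion tokens
output_text_cost = 31 * output_cost  # 31 * 0.000002
print(f'{output_text_cost:.8f}')     # 0.00006200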
Chatbot using ChatGPT
Let's build a simple chatbot using Python and check that the explanation is correct:
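One note before the code: the chatbot calls a chat() helper that isn't defined in this snippet. A minimal sketch, assuming it simply wraps openai.ChatCompletion.create():
# a minimal sketch of the chat() helper used by the chatbot below
# (assumption: it just wraps openai.ChatCompletion.create)
def chat(model, text, max_tokens=100):
    messages = [{'role': 'user', 'content': text}]
    return openai.ChatCompletion.create(model=model,
                                        messages=messages,
                                        max_tokens=max_tokens)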
in_cost = []
out_cost = []

def chatbot():
    while True:
        text = input_text()
        if text == 'EXIT':  # condition to break the loop
            break
        print(f'Input Text: {text}')  # show the input text
        # input cost (the counting function expects a list of message dicts)
        input_text_cost = num_tokens_from_messages([{'content': text}], model) * input_cost
        print(f'Input Cost: {input_text_cost:.8f}')
        # appending the cost to a list
        in_cost.append(input_text_cost)
        print('-' * 30)
        # queries the API
        r = chat(model, text, max_tokens=100)
        # output text
        output_text = r['choices'][0]['message']['content'].strip()
        print(f'Output: {output_text}')
        # calculate the cost of the output and append it
        output_text_cost = r['usage']['completion_tokens'] * output_cost
        out_cost.append(output_text_cost)
        print(f'Output_cost: {output_text_cost:.8f}')
        # displays the cost of this query and of all queries in this runtime;
        # as soon as this code stops running, the lists will be erased
        print('-' * 30)
        print(f'Total cost of this query: {output_text_cost + input_text_cost}')
        print(f'Total cost of this Runtime: {sum(in_cost) + sum(out_cost)}')
        print('-' * 30)

# calls the function chatbot
chatbot()
I typed the following text:
whats the escape velocity of mars orbit?
This is the result: the input cost, the model's answer, the output cost, and the running totals printed by the loop.
Of course, the hardest part is understanding the input cost. There are two ways to calculate it: before sending the query (with tiktoken, as shown above) or using the prompt_tokens field of the response.
Understanding how the cost is calculated will ease the implementation of an LLM, either for personal use or for a company.
I hope my explanation was useful for you. If you liked it, please applaud, save, comment and share it. ❤