How Is the ChatGPT API Charged? Understanding Models, Tokens, and Limits with Python Code
How the OpenAI API works, plus a basic chatbot in Python.
Index
- OpenAI API (Endpoints, Models and Limits)
- Tokens
- Costs (input and output)
- ChatBot
(This post was written in August 2023.)
If you landed on this post, you are probably here to understand how the ChatGPT API is charged, so I'll be pretty straightforward, with examples in Python.
- For the code in this post, check my GitHub repository.
- Visual Studio Code and Anaconda's JupyterLab will be used for this analysis.
For all my posts, please click here.
OpenAI API
The OpenAI API offers multiple endpoints; one of them is ChatGPT, which will be the focus of this post.
Currently there are 2 completion endpoints, v1/completions and v1/chat/completions. Since v1/completions will be deprecated, we'll use v1/chat/completions in this post.
The /v1/completions (Legacy) endpoint will be deprecated on 4 January 2024, and OpenAI recommends gpt-3.5-turbo-instruct as a replacement (check this link: Deprecations — OpenAI API).
Endpoints
List of OpenAI endpoints:
- /v1/completions
- /v1/chat/completions
- /v1/edits
- /v1/images/generations
- /v1/images/edits
- /v1/images/variations
- /v1/embeddings
- /v1/audio/transcriptions
- /v1/audio/translations
- /v1/files
- /v1/fine-tunes
- /v1/moderations
Pricing
The OpenAI API doesn't charge you per request; it charges you based on the number of tokens sent to the API, which we will explain in detail. OpenAI also differentiates input and output, i.e., it has different prices for input and output tokens (this is important), according to its pricing page.
GPT-4 (8K context): US$ 0.03 per 1K input tokens, US$ 0.06 per 1K output tokens.
GPT-3.5 Turbo (4K context): US$ 0.0015 per 1K input tokens, US$ 0.002 per 1K output tokens.
In this post we'll focus exclusively on v1/chat/completions, which is the ChatGPT endpoint.
According to OpenAI, we have the following endpoints and their models:
!pip install openai # installing openai lib
import openai # importing the lib
openai.ChatCompletion.create() # /v1/chat/completions endpoint
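For illustration, here is a minimal call to that endpoint; this is a sketch assuming the pre-v1.0 openai library used throughout this post, and the prompt is just a placeholder:
# a minimal request to the /v1/chat/completions endpoint
# (pre-v1.0 openai library; the message content is a placeholder)
r = openai.ChatCompletion.create(
    model='gpt-3.5-turbo-0613',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(r['choices'][0]['message']['content'])  # the model's reply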
The endpoint is already chosen; time to choose the model. According to Image 3, we have the following options:
- ChatGPT 3.5
- ChatGPT 4
Understanding the Models
Max Tokens and Context are the same thing: they represent the total number of tokens of input and output combined. If you select the model gpt-4, you have a limit of 8,192 tokens shared between input and output. An example: an input with 7,000 tokens can only have an output of up to 1,192 tokens using gpt-4.
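The arithmetic of that shared budget, as a quick sketch using the numbers from the example above:
# the context window is shared between input and output tokens
context_limit = 8192      # gpt-4's limit
input_tokens = 7000       # tokens in our prompt
max_output_tokens = context_limit - input_tokens
print(max_output_tokens)  # 1192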
GPT-4 models: gpt-4 and gpt-4-0613 (8,192 tokens), gpt-4-32k and gpt-4-32k-0613 (32,768 tokens).
GPT-3.5 models: gpt-3.5-turbo and gpt-3.5-turbo-0613 (4,096 tokens), gpt-3.5-turbo-16k and gpt-3.5-turbo-16k-0613 (16,384 tokens).
Tokens
The prompt you write needs to be tokenized; to do this, OpenAI uses a library called tiktoken. Basically speaking, tokenizing is the process of breaking a sentence into smaller chunks.
Generally speaking, each token can be understood as roughly a word (or part of one). OpenAI charges for each token sent to its API.
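To see what those chunks look like, here is a quick sketch with tiktoken; the exact splits depend on the encoding, and the 9-token count for this sentence is the one reported later in this post:
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by the gpt-3.5/gpt-4 chat models
enc = tiktoken.get_encoding('cl100k_base')
tokens = enc.encode('whats the escape velocity of mars orbit?')
print(len(tokens))                        # 9
print([enc.decode([t]) for t in tokens])  # the individual chunks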
Codes
For the sake of simplicity, we chose gpt-3.5-turbo-0613. Let's code it:
# create a file named "psw.py", copy and paste this content and substitute
# the variables with your own values
class MyClass:
    def __init__(self):
        self.organization = 'aaaaaaaaa'
        self.org_id = 'bbbbbbbbb'
        self.key = 'ccccccccc'
        self.name = 'ddddddddd'
        self.model = {
            'model': {'name': 'gpt-3.5-turbo-0613',
                      'limit_tokens': 4096,
                      'input': 0.0015,
                      'output': 0.002}
        }
# create another file, in this case is a .ipynb file
from psw import MyClass # access information
import openai # openai lib
import tiktoken # lib used by openai to tokenize words
import pandas as pd
access = MyClass()
openai.api_key = access.key # your key
model = access.model['model']['name'] # or 'gpt-3.5-turbo-0613'
input_cost = access.model['model']['input'] / 1000   # OpenAI charges per 1k tokens; dividing by 1000 gives the cost per token
output_cost = access.model['model']['output'] / 1000  # same thing
API Response
Before understanding the costs themselves, it's necessary to understand OpenAI's response.
This response is returned to us as JSON.
It contains 6 main fields:
- id: a unique id
- object: the object type, in our case chat.completion, i.e., the ChatGPT endpoint
- created: the date of creation, as a Unix timestamp
- model: the model passed to chat.completion
- choices: the response itself, with index, message and finish_reason
- usage: the token counts that will be used to calculate the cost
<OpenAIObject chat.completion id=chatcmpl-7qVnZ8CEkYL9ZyPuAg0GPmC48GM9z at 0x167a6191620> JSON: {
  "id": "chatcmpl-7qVnZ8CEkYL9ZyPuAg0GPmC48GM9z",
  "object": "chat.completion",
  "created": 1692749645,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The escape velocity of Mars orbit is about 5.03 kilometers per second (km/s), or approximately 11,223 miles per hour (mph)."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 31,
    "total_tokens": 47
  }
}
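Assuming the response above is stored in a variable r (as it will be in the chatbot later), the object supports dict-style access:
# the response object behaves like a dict
print(r['model'])                             # gpt-3.5-turbo-0613
print(r['choices'][0]['message']['content'])  # the assistant's answer
print(r['usage']['prompt_tokens'], r['usage']['completion_tokens'])  # 16 31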
The usage field will be used to calculate the output price.
Costs
As said before, the OpenAI API charges per 1,000 tokens; to calculate the cost per token, we divide the listed price by 1,000. We must also consider the number of input tokens and output tokens separately, because OpenAI charges them differently.
Cost per input token: US$ 0.0000015
print(f'{input_cost:.7f}')
# 0.0000015
Cost per output token: US$ 0.0000020
print(f'{output_cost:.7f}')
# 0.0000020
Input Cost
To calculate the input cost, we first need a text to tokenize. The following function returns an input message typed by the user:
def input_text():
    text = input("user: ")
    return text
To test it, we'll store it in 2 different variables:
text = input_text()
text_2 = [{'content': text}] # this is the input format for our endpoint
print(text)
print(text_2)
## whats the escape velocity of mars orbit?
## [{'content': 'whats the escape velocity of mars orbit?'}]
To count the number of tokens in a text, we have to instantiate the encoding:
# the encoder for our model 'gpt-3.5-turbo-0613'
encoding = tiktoken.encoding_for_model(model)
To count the number of tokens:
text_example = text # extract the text
n_tokens_example = len(encoding.encode(text)) # count the number of tokens
print(f'The text: \n{text_example} \ncontains {n_tokens_example} tokens')
# The text:
# whats the escape velocity of mars orbit?
# contains 9 tokens
But OpenAI doesn't consider only the number of tokens in the input; it also takes other variables, such as the model, into account. The following example is part of a function from OpenAI that will be shown in full later.
In this case, we are considering the model gpt-3.5-turbo-0613, which has some constants, and we will consider the following text: whats the escape velocity of mars orbit?
import tiktoken

encoding = tiktoken.encoding_for_model(model)  # model: gpt-3.5-turbo-0613
tokens_per_message = 3  # constant for gpt-3.5-turbo-0613
tokens_per_name = 1  # constant for gpt-3.5-turbo-0613
num_tokens = 0  # token counter
for message in text_2:  # iterates over the messages (dicts) inside the list
    num_tokens += tokens_per_message  # each message adds 3 tokens (for gpt-3.5-turbo-0613)
    for key, value in message.items():  # key is e.g. 'content'; value is the text inside it
        num_tokens += len(encoding.encode(value))  # adds the encoded token count to the total
        if key == "name":
            num_tokens += tokens_per_name  # if the key is 'name', it adds 1
num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
num_tokens
Example 1:
[
    {"content": "whats the escape velocity of mars orbit?"},
]
In Example 1 we have 1 message (each dict inside the list corresponds to one message), and according to OpenAI, each message adds 3 tokens to num_tokens (num_tokens += tokens_per_message):
num_tokens = 3
Our text contains 9 tokens, according to OpenAI's function, so we add 9 to our variable num_tokens.
num_tokens += len(encoding.encode(value))
num_tokens
# 12
And it does not contain a name, so nothing is added here:
if key == "name":
    num_tokens += tokens_per_name  # if the key is 'name', it adds 1
At the end of the loop, 3 more tokens are added to num_tokens, resulting in 3 + 9 + 3 = 15.
num_tokens += 3 # every reply is primed with <|start|>assistant<|message|>
3 = tokens_per_message
9 = number of tokens in our input
3 = tokens added to prime the reply
Example 2:
[
    {"content": "whats the escape velocity of mars orbit?"},  # 3 + 9
    {"content": "whats the escape velocity of mars orbit?"},  # 3 + 9
]
# + 3
# result 27
In the case of multiple messages, like Example 2, where we have 2 messages of the same length, each message adds 3 tokens and each text adds 9 tokens, plus 3 tokens to prime the reply:
(3 + 9) + (3 + 9) + 3 = 27.
or:
First message: (3 + 9)
Second message: (3 + 9)
Reply: 3
Total: 27
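We can verify this with the num_tokens_from_messages function shown further below (assuming model is defined as above):
# verifying Example 2: two identical messages of 9 tokens each
msgs = [{"content": "whats the escape velocity of mars orbit?"}] * 2
print(num_tokens_from_messages(msgs, model))  # 27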
Example 3:
[{
    "name": "Cthulhu",  # + 1
    "content": "whats the escape velocity of mars orbit?"  # (3 + 9)
}]
# + 3
# result 16
Example 3 brings a name, which adds 1 more token (tokens_per_name) to the variable num_tokens. In Example 3, the result will be 3 + 9 + 1 + 3 = 16. (Strictly speaking, the full function below also encodes the value of name itself, so a name like Cthulhu would contribute its own tokens on top of the +1 constant.)
To calculate the input cost, count the tokens using the function from OpenAI and multiply the result by input_cost:
input_text_cost = num_tokens_from_messages(text_2, model) * input_cost
print(f'{input_text_cost:.7f}')
# 0.0000225
For a more in-depth understanding of the calculation for other models, follow this link, where you’ll find this function:
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
    }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
So, let’s try it in practice:
num_tokens_from_messages(text_2, model)
# 15
To calculate the total input cost of the query:
total_example = 15 * input_cost
print(f'Total Input Cost: {total_example:.8f}')
# Total Input Cost: 0.00002250
Output Cost
Well, you cannot calculate the output cost in advance, because you don't have the output yet. The only way to obtain it is from the API response, accessing the usage field:
r['usage']['completion_tokens']
To obtain the cost, just multiply it by the output cost:
output_text_cost = r['usage']['completion_tokens'] * output_cost
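With the sample response shown earlier (completion_tokens = 31), the arithmetic works out to:
# worked example using the sample response above: 31 completion tokens
output_text_cost = 31 * output_cost  # 31 * 0.000002
print(f'{output_text_cost:.8f}')     # 0.00006200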
Chatbot using ChatGPT
Let's build a simple chatbot using Python and check that the explanation is correct:
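One note before the code: the chatbot calls a chat() helper that isn't defined in this snippet. A minimal sketch, assuming it simply wraps openai.ChatCompletion.create():
# a minimal sketch of the chat() helper used by the chatbot below
# (assumption: it just wraps openai.ChatCompletion.create)
def chat(model, text, max_tokens=100):
    messages = [{'role': 'user', 'content': text}]
    return openai.ChatCompletion.create(model=model,
                                        messages=messages,
                                        max_tokens=max_tokens)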
in_cost = []
out_cost = []

def chatbot():
    while True:
        text = input_text()
        if text == 'EXIT':  # condition to break the loop
            break
        print(f'Input Text: {text}')  # show the input text
        # input cost (the counting function expects a list of message dicts)
        input_text_cost = num_tokens_from_messages([{'content': text}], model) * input_cost
        print(f'Input Cost: {input_text_cost:.8f}')
        # appending the cost to a list
        in_cost.append(input_text_cost)
        print('-' * 30)
        # queries the API
        r = chat(model, text, max_tokens=100)
        # output text
        output_text = r['choices'][0]['message']['content'].strip()
        print(f'Output: {output_text}')
        # calculate the cost of the output and append it
        output_text_cost = r['usage']['completion_tokens'] * output_cost
        out_cost.append(output_text_cost)
        print(f'Output_cost: {output_text_cost:.8f}')
        # displays the cost of this query and of all queries in this runtime;
        # as soon as this code stops running, the lists will be erased
        print('-' * 30)
        print(f'Total cost of this query: {output_text_cost + input_text_cost}')
        print(f'Total cost of this Runtime: {sum(in_cost) + sum(out_cost)}')
        print('-' * 30)

# calls the function chatbot
chatbot()
I typed the following text:
whats the escape velocity of mars orbit?
This is the result: the input cost, the model's answer, the output cost, and the running totals printed by the loop.
Of course, the hardest part is understanding the input cost. There are two ways to calculate it: before sending the query (with tiktoken, as shown above) or using the prompt_tokens field of the response.
Understanding how the cost is calculated will ease the implementation of an LLM, either for personal use or for a company.
I hope my explanation was useful for you. If you liked it, please applaud, save, comment and share it. ❤