topicgpt package

Submodules

topicgpt.Clustering module

class topicgpt.Clustering.Clustering_and_DimRed(n_dims_umap: int = 5, n_neighbors_umap: int = 15, min_dist_umap: float = 0, metric_umap: str = 'cosine', min_cluster_size_hdbscan: int = 30, metric_hdbscan: str = 'euclidean', cluster_selection_method_hdbscan: str = 'eom', number_clusters_hdbscan: int = None, random_state: int = 42, verbose: bool = True, UMAP_hyperparams: dict = {}, HDBSCAN_hyperparams: dict = {})

Bases: object

Class to perform dimensionality reduction with UMAP followed by clustering with HDBSCAN.

cluster_and_reduce(embeddings: ndarray) Tuple[ndarray, ndarray, UMAP]

Cluster embeddings using HDBSCAN and reduce dimensions with UMAP.

Parameters:

embeddings (np.ndarray) – Embeddings to cluster and reduce.

Returns:

A tuple containing three items:
  • reduced_embeddings (np.ndarray): Reduced embeddings.

  • cluster_labels (np.ndarray): Cluster labels.

  • umap_mapper (umap.UMAP): UMAP mapper for transforming new embeddings, especially embeddings of the vocabulary. (MAKE SURE TO NORMALIZE EMBEDDINGS AFTER USING THE MAPPER)

Return type:

tuple

cluster_hdbscan(embeddings: ndarray) ndarray

Cluster embeddings using HDBSCAN.

If self.number_clusters_hdbscan is not None, further clusters the data with AgglomerativeClustering to achieve a fixed number of clusters.

Parameters:

embeddings (np.ndarray) – Embeddings to cluster.

Returns:

Cluster labels.

Return type:

np.ndarray

reduce_dimensions_umap(embeddings: ndarray) Tuple[ndarray, UMAP]

Reduces dimensions of embeddings using UMAP.

Parameters:

embeddings (np.ndarray) – Embeddings to reduce.

Returns:

A tuple containing two items:
  • reduced_embeddings (np.ndarray): Reduced embeddings.

  • umap_mapper (umap.UMAP): UMAP mapper for transforming new embeddings, especially embeddings of the vocabulary. (MAKE SURE TO NORMALIZE EMBEDDINGS AFTER USING THE MAPPER)

Return type:

tuple

umap_diagnostics(embeddings, hammer_edges=False)

Fit UMAP on the provided embeddings and generate diagnostic plots.

Params:

embeddingsarray-like

The high-dimensional data for UMAP to reduce and visualize.

hammer_edges : bool, default False. Is computationally expensive.

visualize_clusters_dynamic(embeddings: ndarray, labels: ndarray, texts: list[str], class_names: list[str] = None)

Visualize clusters using Plotly and enable hovering over clusters to see the beginning of the texts of the documents.

Parameters:
  • embeddings (np.ndarray) – Embeddings for which to visualize clustering.

  • labels (np.ndarray) – Cluster labels.

  • texts (list[str]) – Texts of the documents.

  • class_names (list[str], optional) – Names of the classes.

visualize_clusters_static(embeddings: ndarray, labels: ndarray)

Reduce dimensionality with UMAP to two dimensions and plot the clusters.

Parameters:
  • embeddings (np.ndarray) – Embeddings for which to plot clustering.

  • labels (np.ndarray) – Cluster labels.

topicgpt.ExtractTopWords module

class topicgpt.ExtractTopWords.ExtractTopWords

Bases: object

compute_bow_representation(document: str, vocab: list[str], vocab_set: set[str]) ndarray

Compute the bag-of-words representation of a document.

Parameters:
  • document (str) – Document to compute the bag-of-words representation of.

  • vocab (list[str]) – List of words in the corpus sorted alphabetically.

  • vocab_set (set[str]) – Set of words in the corpus sorted alphabetically.

Returns:

Bag-of-words representation of the document.

Return type:

np.ndarray

compute_centroid_similarity(embeddings: ndarray, centroid_dict: dict, cluster_label: int) ndarray

Compute the similarity of the document embeddings to the centroid of the cluster via cosine similarity.

Parameters:
  • embeddings (np.ndarray) – Embeddings to cluster and reduce.

  • centroid_dict (dict) – Dictionary of cluster labels and their centroids.

  • cluster_label (int) – Cluster label for which to compute the similarity.

Returns:

Cosine similarity of the document embeddings to the centroid of the cluster.

Return type:

np.ndarray

compute_corpus_vocab(corpus: list[str], remove_stopwords: bool = True, remove_punction: bool = True, min_word_length: int = 3, max_word_length: int = 20, remove_short_words: bool = True, remove_numbers: bool = True, verbose: bool = True, min_doc_frequency: int = 3, min_freq: float = 0.1, max_freq: float = 0.9) list[str]

Compute the vocabulary of the corpus and perform preprocessing of the corpus.

Parameters:
  • corpus (list[str]) – List of documents.

  • remove_stopwords (bool, optional) – Whether to remove stopwords.

  • remove_punction (bool, optional) – Whether to remove punctuation.

  • min_word_length (int, optional) – Minimum word length to retain.

  • max_word_length (int, optional) – Maximum word length to retain.

  • remove_short_words (bool, optional) – Whether to remove short words.

  • remove_numbers (bool, optional) – Whether to remove numbers.

  • verbose (bool, optional) – Whether to print progress and describe what is happening.

  • min_doc_frequency (int, optional) – Minimum number of documents a word should appear in to be considered in the vocabulary.

  • min_freq (float, optional) – Minimum frequency percentile of words to be considered in the vocabulary.

  • max_freq (float, optional) – Maximum frequency percentile of words to be considered in the vocabulary.

Returns:

List of words in the corpus sorted alphabetically.

Return type:

list[str]

compute_embedding_similarity_centroids(vocab: list[str], vocab_embedding_dict: dict, umap_mapper: UMAP, centroid_dict: dict, reduce_vocab_embeddings: bool = False, reduce_centroid_embeddings: bool = False) ndarray

Compute the cosine similarity of each word in the vocabulary to each centroid.

Parameters:
  • vocab (list[str]) – List of words in the corpus sorted alphabetically.

  • vocab_embedding_dict (dict) – Dictionary of words and their embeddings.

  • umap_mapper (umap.UMAP) – UMAP mapper to transform new embeddings in the same way as the document embeddings.

  • centroid_dict (dict) – Dictionary of cluster labels and their centroids. -1 means outlier.

  • reduce_vocab_embeddings (bool, optional) – Whether to reduce the vocab embeddings with the UMAP mapper.

  • reduce_centroid_embeddings (bool, optional) – Whether to reduce the centroid embeddings with the UMAP mapper.

Returns:

Cosine similarity of each word in the vocab to each centroid. Has shape (len(vocab), len(centroid_dict) - 1).

Return type:

np.ndarray

compute_word_topic_mat(corpus: list[str], vocab: list[str], labels: ndarray, consider_outliers=False) ndarray

Compute the word-topic matrix efficiently.

Parameters:
  • corpus (list[str]) – List of documents.

  • vocab (list[str]) – List of words in the corpus, sorted alphabetically.

  • labels (np.ndarray) – Cluster labels. -1 indicates outliers.

  • consider_outliers (bool, optional) – Whether to consider outliers when computing the top words. Defaults to False.

Returns:

Word-topic matrix.

Return type:

np.ndarray

compute_word_topic_mat_old(corpus: list[str], vocab: list[str], labels: ndarray, consider_outliers: bool = False) ndarray

Compute the word-topic matrix.

Parameters:
  • corpus (list[str]) – List of documents.

  • vocab (list[str]) – List of words in the corpus sorted alphabetically.

  • labels (np.ndarray) – Cluster labels. -1 means outlier.

  • consider_outliers (bool, optional) – Whether to consider outliers when computing the top words. I.e. whether the labels contain -1 to indicate outliers.

Returns:

Word-topic matrix.

Return type:

np.ndarray

compute_words_topics(corpus: list[str], vocab: list[str], labels: ndarray) dict

Compute the words per topic.

Parameters:
  • corpus (list[str]) – List of documents.

  • vocab (list[str]) – List of words in the corpus sorted alphabetically.

  • labels (np.ndarray) – Cluster labels. -1 means outlier.

Returns:

Dictionary of topics and their words.

Return type:

dict

embed_vocab_openAI(api_key: str, vocab: list[str], embedder: GetEmbeddingsOpenAI = None) dict[str, ndarray]

Embed the vocabulary using the OpenAI embedding API.

Parameters:
  • api_key (str) – OpenAI API key.

  • vocab (list[str]) – List of words in the corpus sorted alphabetically.

  • embedder (GetEmbeddingsOpenAI, optional) – Embedding object.

Returns:

Dictionary of words and their embeddings.

Return type:

dict[str, np.ndarray]

extract_centroid(embeddings: ndarray) ndarray

Extract the single centroid of a cluster.

Parameters:

embeddings (np.ndarray) – Embeddings to extract the centroid from.

Returns:

The centroid of the cluster.

Return type:

np.ndarray

extract_centroids(embeddings: ndarray, labels: ndarray) dict

Extract centroids of clusters.

Parameters:
  • embeddings (np.ndarray) – Embeddings to cluster and reduce.

  • labels (np.ndarray) – Cluster labels. -1 means outlier.

Returns:

Dictionary of cluster labels and their centroids.

Return type:

dict

extract_topwords_centroid_similarity(word_topic_mat: ~numpy.ndarray, vocab: list[str], vocab_embedding_dict: dict, centroid_dict: dict, umap_mapper: ~umap.umap_.UMAP, top_n_words: int = 10, reduce_vocab_embeddings: bool = True, reduce_centroid_embeddings: bool = False, consider_outliers: bool = False) -> (<class 'dict'>, <class 'numpy.ndarray'>)

Extract the top words for each cluster by computing the cosine similarity of the words that occur in the corpus to the centroid of the cluster.

Parameters:
  • word_topic_mat (np.ndarray) – Word-topic matrix.

  • vocab (list[str]) – List of words in the corpus sorted alphabetically.

  • vocab_embedding_dict (dict) – Dictionary of words and their embeddings.

  • centroid_dict (dict) – Dictionary of cluster labels and their centroids. -1 means outlier.

  • umap_mapper (umap.UMAP) – UMAP mapper to transform new embeddings in the same way as the document embeddings.

  • top_n_words (int, optional) – Number of top words to extract per topic.

  • reduce_vocab_embeddings (bool, optional) – Whether to reduce the vocab embeddings with the UMAP mapper.

  • reduce_centroid_embeddings (bool, optional) – Whether to reduce the centroid embeddings with the UMAP mapper.

  • consider_outliers (bool, optional) – Whether to consider outliers when computing the top words. I.e., whether the labels contain -1 to indicate outliers.

Returns:

Dictionary of topics and their top words. np.ndarray: Cosine similarity of each word in the vocab to each centroid. Has shape (len(vocab), len(centroid_dict) - 1).

Return type:

dict

extract_topwords_tfidf(word_topic_mat: ndarray, vocab: list[str], labels: ndarray, top_n_words: int = 10) dict

Extract the top words for each topic using a class-based tf-idf score.

Parameters:
  • word_topic_mat (np.ndarray) – Word-topic matrix.

  • vocab (list[str]) – List of words in the corpus sorted alphabetically.

  • labels (np.ndarray) – Cluster labels. -1 means outlier.

  • top_n_words (int, optional) – Number of top words to extract per topic.

Returns:

Dictionary of topics and their top words.

Return type:

dict

get_most_similar_docs(corpus: list[str], embeddings: ndarray, labels: ndarray, centroid_dict: dict, cluster_label: int, top_n: int = 10) List[str]

Get the most similar documents to the centroid of a cluster.

Parameters:
  • corpus (list[str]) – List of documents.

  • embeddings (np.ndarray) – Embeddings to cluster and reduce.

  • labels (np.ndarray) – Cluster labels. -1 means outlier.

  • centroid_dict (dict) – Dictionary of cluster labels and their centroids.

  • cluster_label (int) – Cluster label for which to compute the similarity.

  • top_n (int, optional) – Number of top documents to extract.

Returns:

List of the most similar documents to the centroid of a cluster.

Return type:

List[str]

topicgpt.GetEmbeddingsOpenAI module

class topicgpt.GetEmbeddingsOpenAI.GetEmbeddingsOpenAI(api_key: str, embedding_model: str = 'text-embedding-ada-002', tokenizer: str = None, max_tokens: int = 8191)

Bases: object

This class allows to compute embeddings of text using the OpenAI API.

compute_number_of_tokens(corpus: list[str]) int

Computes the total number of tokens needed to embed the corpus.

Parameters:

corpus (list[str]) – List of strings to embed, where each element in the list is a document.

Returns:

Total number of tokens needed to embed the corpus.

Return type:

int

convert_api_res_list(api_res_list: list[dict]) dict

Converts the api_res list into a dictionary containing the embeddings as a matrix and the corpus as a list of strings.

Parameters:
  • self – The instance of the class.

  • api_res_list (list[dict]) – List of dictionaries, where each dictionary contains the embedding of the document, the text of the document, and a list of errors that occurred during the embedding process.

Returns:

A dictionary containing the embeddings as a matrix and the corpus as a list of strings.

Return type:

dict

get_embeddings(corpus: list[str]) dict

Computes the embeddings of a corpus.

Parameters:
  • self – The instance of the class.

  • corpus (list[str]) – List of strings to embed, where each element in the list is a document.

Returns:

A dictionary containing the embeddings as a matrix and the corpus as a list of strings.

Return type:

dict

get_embeddings_doc_split(corpus: list[list[str]], n_tries=3) list[dict]

Computes the embeddings of a corpus for split documents.

Parameters:
  • self – The instance of the class.

  • corpus (list[list[str]]) – List of strings to embed, where each element is a document represented by a list of its chunks.

  • n_tries (int, optional) – Number of tries to make an API call (default is 3).

Returns:

A list of dictionaries, where each dictionary contains the embedding of the document, the text of the document, and a list of errors that occurred during the embedding process.

Return type:

List[dict]

make_api_call(text: str)

Makes an API call to the OpenAI API to embed a text string.

Parameters:
  • self – The instance of the class.

  • text (str) – The string to embed.

Returns:

The response from the API.

Return type:

API response

static num_tokens_from_string(string: str, encoding) int

Returns the number of tokens in a text string.

Parameters:
  • string (str) – Text string to compute the number of tokens.

  • encoding – A function to encode the string into tokens.

Returns:

Number of tokens in the text string.

Return type:

int

split_doc(text)

Splits a single document that is longer than the maximum number of tokens into a list of smaller documents.

Parameters:
  • self – The instance of the class.

  • text (str) – The string to be split.

Returns:

A list of strings to embed, where each element in the list is a list of chunks comprising the document.

Return type:

List[str]

split_long_docs(text: list[str]) list[list[str]]

Splits all documents that are longer than the maximum number of tokens into a list of smaller documents.

Parameters:
  • self – The instance of the class.

  • text (list[str]) – List of strings to embed, where each element in the list is a document.

Returns:

A list of lists of strings to embed, where each element in the outer list is a list of chunks comprising the document.

Return type:

List[list[str]]

topicgpt.TopicGPT module

class topicgpt.TopicGPT.TopicGPT(openai_api_key: str, n_topics: int = None, openai_prompting_model: str = 'gpt-3.5-turbo-16k', max_number_of_tokens: int = 16384, corpus_instruction: str = '', document_embeddings: ndarray = None, vocab_embeddings: dict[str, ndarray] = None, embedding_model: str = 'text-embedding-ada-002', max_number_of_tokens_embedding: int = 8191, use_saved_embeddings: bool = True, clusterer: Clustering_and_DimRed = None, n_topwords: int = 2000, n_topwords_description: int = 500, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], compute_vocab_hyperparams: dict = {}, enhancer: TopwordEnhancement = None, topic_prompting: TopicPrompting = None, verbose: bool = True)

Bases: object

This is the main class for doing topic modelling with TopicGPT.

compute_embeddings(corpus: list[str]) -> (<class 'numpy.ndarray'>, dict[str, numpy.ndarray])

Computes document and vocabulary embeddings for the given corpus.

Parameters:

corpus (list[str]) – List of strings to embed, where each element is a document.

Returns:

A tuple containing two items:
  • document_embeddings (np.ndarray): Document embeddings for the corpus, with shape (len(corpus), n_embedding_dimensions).

  • vocab_embeddings (dict[str, np.ndarray]): Vocabulary embeddings for the corpus, provided as a dictionary where keys are words and values are embeddings.

Return type:

tuple

describe_topics(topics: list[Topic]) list[Topic]

Names and describes the provided topics using the OpenAI API.

Parameters:

topics (list[Topic]) – List of Topic objects to be named and described.

Returns:

A list of Topic objects with names and descriptions.

Return type:

list[Topic]

extract_topics(corpus: list[str]) list[Topic]

Extracts topics from the given corpus.

Parameters:

corpus (list[str]) – List of strings to process, where each element represents a document.

Returns:

A list of Topic objects representing the extracted topics.

Return type:

list[Topic]

fit(corpus: list[str], verbose: bool = True)

Compute embeddings if necessary, extract topics, and describe them.

Parameters:
  • corpus (list[str]) – List of strings to embed, where each element represents a document.

  • verbose (bool, optional) – Whether to print the progress and details of the process.

pprompt(query: str, return_function_result: bool = True) object

Prompts the model with the given query and prints the answer.

Parameters:
  • query (str) – The query to prompt the model with.

  • return_function_result (bool, optional) – Whether to return the result of the function call by the Language Model (LLM).

Returns:

The result of the function call if return_function_result is True, otherwise None.

Return type:

object

print_topics()

Prints a string explanation of the topics.

prompt(query: str) -> (<class 'str'>, <class 'object'>)

Prompts the model with the given query.

Parameters:

query (str) – The query to prompt the model with.

Returns:

A tuple containing two items:
  • answer (str): The answer from the model.

  • function_result (object): The result of the function call.

Return type:

tuple

Note

Please refer to the TopicPrompting class for more details on available functions for prompting the model.

repr_topics() str

Returns a string explanation of the topics.

save_embeddings(path: str = 'SavedEmbeddings/embeddings.pkl') None

Saves the document and vocabulary embeddings to a pickle file for later re-use.

Parameters:

path (str, optional) – The path to save the embeddings to. Defaults to embeddings_path.

visualize_clusters()

Visualizes the identified clusters representing the topics in a scatterplot.

topicgpt.TopicPrompting module

class topicgpt.TopicPrompting.TopicPrompting(topic_lis: list[Topic], openai_key: str, openai_prompting_model: str = 'gpt-3.5-turbo-16k', max_context_length_promting: int = 16000, openai_model_temperature_prompting: float = 0.5, openai_embedding_model: str = 'text-embedding-ada-002', max_context_length_embedding: int = 8191, basic_model_instruction: str = "You are a helpful assistant. \nYou are excellent at inferring information about topics discovered via topic modelling using information retrieval. \nYou summarize information intelligently. \nYou use the functions you are provided with if applicable.\nYou make sure that everything you output is strictly based on the provided text. If you cite documents, give their indices. \nYou always explicitly say if you don't find any useful information!\nYou only say that something is contained in the corpus if you are very sure about it!", corpus_instruction: str = '', enhancer: TopwordEnhancement = None, vocab: list = None, vocab_embeddings: dict = None, random_state: int = 42)

Bases: object

This class allows to formulate prompts and queries against the identified topics to get more information about them

add_new_topic_keyword(keyword: str, inplace: bool = False, rename_new_topic: bool = False) list[Topic]

Create a new topic based on a keyword and recompute topic topwords.

This method removes all documents belonging to other topics from them and adds them to the new topic. It computes new topwords using both the tf-idf and the cosine-similarity method.

Parameters:
  • keyword (str) – Keyword to create the new topic from.

  • inplace (bool, optional) – If True, the topic is updated in place. Otherwise, a new list of topics is created and returned (default is False).

  • rename_new_topic (bool, optional) – If True, the new topic is renamed to the keyword (default is False).

Returns:

A list of new topics, including the newly created topic and the modified old ones.

Return type:

list of Topic

combine_topics(topic_idx_lis: list[int], inplace: bool = False) list[Topic]

Combines several topics into one topic.

This method combines the specified topics into a single topic. Note that no new topwords are computed in this step, and the topwords of the old topics are just combined. Additionally, only the cosine-similarity method for topwords extraction is used.

Parameters:
  • topic_idx_list (list[int]) – List of topic indices to combine.

  • inplace (bool, optional) – If True, the topics are combined in place. Otherwise, a new list of topics is created and returned (default is False).

Returns:

A list of new topics resulting from the combination.

Return type:

list of Topic

delete_topic(topic_idx: int, inplace: bool = False) list[Topic]

Deletes a topic with the given index from the list of topics and recomputes topwords and representations of the remaining topics.

This method assigns the documents of the deleted topic to the remaining topics.

Parameters:
  • topic_idx (int) – Index of the topic to delete.

  • inplace (bool, optional) – If True, the topic is deleted in place. Otherwise, a new list of topics is created and returned (default is False).

Returns:

A list of new topics resulting from the deletion.

Return type:

list of Topic

general_prompt(prompt: str, n_tries: int = 2) -> (list[str], <class 'object'>)

Prompt the Language Model (LLM) with a general prompt and return the response. Allow the LLM to call any function defined in the class.

Use n_tries in case the LLM does not provide a valid response.

Parameters:
  • prompt (str) – Prompt string.

  • n_tries (int, optional) – Number of tries to get a valid response from the LLM (default is 2).

Returns:

Response messages from the LLM. object: Response of the invoked function.

Return type:

list of str

get_topic_information(topic_idx_lis: list[int], max_number_topwords: int = 500) dict

Get detailed information on topics by their indices.

This function returns a dictionary where the keys are the topic indices, and the values are strings describing the topics. The description includes a maximum of max_number_topwords topwords.

Parameters:
  • topic_idx_list (list[int]) – List of topic indices to compare.

  • max_number_topwords (int, optional) – Maximum number of topwords to include in the description of the topics (default is 500).

Returns:

A dictionary with topic indices as keys and their descriptions as values.

Return type:

dict

get_topic_lis() list[Topic]

Returns the list of topics stored in the instance.

This method retrieves and returns the list of topics associated with the instance.

Returns:

The list of Topic objects.

Return type:

list[Topic]

identify_topic_idx(query: str, n_tries: int = 3) int

Identifies the index of the topic that the query is most likely about.

This method uses a Language Model (LLM) to determine which topic best fits the query description. If the LLM does not find any topic that fits the query, None is returned.

Parameters:
  • query (str) – Query string.

  • n_tries (int, optional) – Number of tries to get a valid response from the LLM (default is 3).

Returns:

The index of the topic that the query is most likely about. If no suitable topic is found, None is returned.

Return type:

int

Finds the k nearest neighbors of the query in the given topic based on cosine similarity in the original embedding space.

Parameters:
  • topic_index (int) – Index of the topic to search within.

  • query (str) – Query string.

  • k (int, optional) – Number of neighbors to return (default is 20).

  • doc_cutoff_threshold (int, optional) – Maximum number of tokens per document. Afterwards, the document is cut off (default is 1000).

Returns:

A tuple containing two lists -
  • A list of top k documents (as strings).

  • A list of indices corresponding to the top k documents in the topic.

Return type:

tuple

Uses the Language Model (LLM) to answer the llm_query based on the documents belonging to the topic.

Parameters:
  • llm_query (str) – Query string for the Language Model (LLM).

  • topic_index (int, optional) – Index of the topic object. If None, the topic is inferred from the query.

  • n_tries (int, optional) – Number of tries to get a valid response from the LLM (default is 3).

Returns:

A tuple containing two elements -
  • A string representing the answer from the LLM.

  • A tuple containing two lists -
    • A list of top k documents (as strings).

    • A list of indices corresponding to the top k documents in the topic.

Return type:

tuple

reindex_topic_lis(topic_list: list[Topic]) list[Topic]

Reindexes the topics in the provided topic list to assign correct new indices.

This method updates the indices of topics within the given topic list to ensure they are correctly ordered.

Parameters:

topic_list (list[Topic]) – The list of Topic objects to reindex.

Returns:

The reindexed list of Topic objects.

Return type:

list[Topic]

reindex_topics() None

Reindexes the topics in self.topic_list to assign correct new indices.

This method updates the indices of topics within the instance’s topic list to ensure they are correctly ordered.

Returns:

None

set_topic_lis(topic_list: list[Topic]) None

Sets the list of topics for the instance.

This method updates the list of topics associated with the instance to the provided list.

Parameters:

topic_list (list[Topic]) – The list of Topic objects to set.

Returns:

None

show_topic_lis() str

Returns a string representation of the list of topics.

This method generates a human-readable string representation of the topics in the instance’s topic list.

Returns:

A string containing the representation of the list of topics.

Return type:

str

split_topic_hdbscan(topic_idx: int, min_cluster_size: int = 100, inplace: bool = False) list[Topic]

Splits an existing topic into several subtopics using HDBSCAN clustering on the document embeddings of the topic.

This method does not require specifying the number of clusters to split. Note that no new topwords are computed in this step, and the topwords of the old topic are just split among the new ones. Additionally, only the cosine-similarity method for topwords extraction is used.

Parameters:
  • topic_idx (int) – Index of the topic to split.

  • min_cluster_size (int, optional) – Minimum cluster size to split the topic into (default is 100).

  • inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).

Returns:

A list of new topics resulting from the split.

Return type:

list of Topic

split_topic_keywords(topic_idx: int, keywords: str, inplace: bool = False) list[Topic]

Splits the topic into subtopics according to the provided keywords.

This is achieved by computing the cosine similarity between the keywords and the documents in the topic. Note that no new topwords are computed in this step, and the topwords of the old topic are just split among the new ones. Additionally, only the cosine-similarity method for topwords extraction is used.

Parameters:
  • topic_idx (int) – Index of the topic to split.

  • keywords (str) – Keywords to split the topic into. Needs to be a list of at least two keywords.

  • inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).

Returns:

A list of new topics resulting from the split.

Return type:

list of Topic

split_topic_kmeans(topic_idx: int, n_clusters: int = 2, inplace: bool = False) list[Topic]

Splits an existing topic into several subtopics using k-means clustering on the document embeddings of the topic.

Note that no new topwords are computed in this step, and the topwords of the old topic are just split among the new ones. Additionally, only the cosine-similarity method for topwords extraction is used.

Parameters:
  • topic_idx (int) – Index of the topic to split.

  • n_clusters (int, optional) – Number of clusters to split the topic into (default is 2).

  • inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).

Returns:

A list of new topics resulting from the split.

Return type:

list of Topic

split_topic_new_assignments(topic_idx: int, new_topic_assignments: ndarray, inplace: bool = False) list[Topic]

Splits a topic into new topics based on new topic assignments.

Note that this method only computes topwords based on the cosine-similarity method because tf-idf topwords need expensive computation on the entire corpus. The topwords of the old topic are also just split among the new ones. No new topwords are computed in this step.

Parameters:
  • topic_idx (int) – Index of the topic to split.

  • new_topic_assignments (np.ndarray) – New topic assignments for the documents in the topic.

  • inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).

Returns:

A list of new topics resulting from the split.

Return type:

list of Topic

split_topic_single_keyword(topic_idx: int, keyword: str, inplace: bool = False) list[Topic]

Splits the topic with a single keyword.

This method splits the topic such that all documents closer to the original topic name stay in the old topic, while all documents closer to the keyword are moved to the new topic. Note that no new topwords are computed in this step, and the topwords of the old topic are just split among the new ones. Additionally, only the cosine-similarity method for topwords extraction is used.

Parameters:
  • topic_idx (int) – Index of the topic to split.

  • keyword (str) – Keyword to split the topic into.

  • inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).

Returns:

A list of new topics resulting from the split.

Return type:

list of Topic

topicgpt.TopicRepresentation module

class topicgpt.TopicRepresentation.Topic(topic_idx: str, documents: list[str], words: dict[str, int], centroid_hd: ndarray = None, centroid_ld: ndarray = None, document_embeddings_hd: ndarray = None, document_embeddings_ld: ndarray = None, document_embedding_similarity: ndarray = None, umap_mapper: UMAP = None, top_words: dict[str, list[str]] = None, top_word_scores: dict[str, list[float]] = None)

Bases: object

class to represent a topic and all its attributes

set_topic_description(text: str)

add a text description to the topic params:

text: text description of the topic

set_topic_name(name: str)

add a name to the topic params:

name: name of the topic

to_dict() dict

return a dict representation of the topic

to_json() str

return a json representation of the topic

topicgpt.TopicRepresentation.describe_and_name_topics(topics: list[Topic], enhancer: TopwordEnhancement, topword_method='tfidf', n_words=500) list[Topic]

Describe and name the topics using the OpenAI API with the given enhancer object.

Parameters:
  • topics (list[Topic]) – List of Topic objects.

  • enhancer (TopwordEnhancement) – Enhancer object to enhance the top-words and generate the description.

  • topword_method (str, optional) – Method to use for top-word extraction. Can be “tfidf” or “cosine_similarity” (default is “tfidf”).

  • n_words (int, optional) – Number of topwords to extract for the description and the name (default is 500).

Returns:

List of Topic objects with the description and name added.

Return type:

list[Topic]

topicgpt.TopicRepresentation.extract_and_describe_topic_cos_sim(documents_topic: list[str], document_embeddings_topic: ndarray, words_topic: list[str], vocab_embeddings: dict, umap_mapper: UMAP, enhancer: TopwordEnhancement, n_topwords: int = 2000, n_topwords_description=500) Topic

Create a Topic object from the given documents and embeddings by computing the centroid and the top-words. Only use cosine-similarity for top-word extraction. Describe and name the topic with the given enhancer object.

Parameters:
  • documents_topic (list[str]) – List of documents in the topic.

  • document_embeddings_topic (np.ndarray) – High-dimensional embeddings of the documents in the topic.

  • words_topic (list[str]) – List of words in the topic.

  • vocab_embeddings (dict) – Embeddings of the vocabulary.

  • umap_mapper (umap.UMAP) – UMAP mapper object to map from high-dimensional space to low-dimensional space.

  • enhancer (TopwordEnhancement) – Enhancer object to enhance the top-words and generate the description.

  • n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).

  • n_topwords_description (int, optional) – Number of top-words to use from the extracted topics for the description and the name (default is 500).

Returns:

Topic object representing the extracted and described topic.

Return type:

Topic

topicgpt.TopicRepresentation.extract_and_describe_topics(corpus: list[str], document_embeddings: ndarray, clusterer: Clustering_and_DimRed, vocab_embeddings: ndarray, enhancer: TopwordEnhancement, n_topwords: int = 2000, n_topwords_description: int = 500, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], compute_vocab_hyperparams: dict = {}, topword_description_method: str = 'cosine_similarity') list[Topic]

Extracts topics from the given corpus using the provided clusterer object on the document embeddings and describes/names them using the given enhancer object.

Parameters:
  • corpus (list[str]) – List of documents.

  • document_embeddings (np.ndarray) – Embeddings of the documents.

  • clusterer (Clustering_and_DimRed) – Clustering and dimensionality reduction object to cluster the documents.

  • vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.

  • enhancer (TopwordEnhancement) – Enhancer object for enhancing top-words and generating descriptions/names for topics.

  • n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).

  • n_topwords_description (int, optional) – Number of top-words to use from the extracted topics for description and naming (default is 500).

  • topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).

  • compute_vocab_hyperparams (dict, optional) – Hyperparameters for the top-word extraction methods.

  • topword_description_method (str, optional) – Method to use for top-word extraction for description/naming. Can be “tfidf” or “cosine_similarity” (default is “cosine_similarity”).

Returns:

List of Topic objects representing the extracted and described topics.

Return type:

list[Topic]

topicgpt.TopicRepresentation.extract_describe_topics_labels_vocab(corpus: list[str], document_embeddings_hd: ndarray, document_embeddings_ld: ndarray, labels: ndarray, umap_mapper: UMAP, vocab_embeddings: ndarray, enhancer: TopwordEnhancement, vocab: list[str] = None, n_topwords: int = 2000, n_topwords_description: int = 500, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], topword_description_method: str = 'cosine_similarity') list[Topic]

Extracts topics from the given corpus using the provided labels that indicate the topics (no -1 for outliers). Vocabulary is already computed. Describe and name the topics with the given enhancer object.

Parameters:
  • corpus (list[str]) – List of documents.

  • document_embeddings_hd (np.ndarray) – Embeddings of the documents in high-dimensional space.

  • document_embeddings_ld (np.ndarray) – Embeddings of the documents in low-dimensional space.

  • labels (np.ndarray) – Labels indicating the topics.

  • umap_mapper (umap.UMAP) – UMAP mapper object to map from high-dimensional space to low-dimensional space.

  • vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.

  • enhancer (TopwordEnhancement) – Enhancer object to enhance the top-words and generate the description.

  • vocab (list[str], optional) – Vocabulary of the corpus (default is None).

  • n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).

  • n_topwords_description (int, optional) – Number of top-words to use from the extracted topics for the description and the name (default is 500).

  • topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).

  • topword_description_method (str, optional) – Method to use for top-word extraction. Can be “tfidf” or “cosine_similarity” (default is “cosine_similarity”).

Returns:

List of Topic objects representing the extracted topics.

Return type:

list[Topic]

topicgpt.TopicRepresentation.extract_topic_cos_sim(documents_topic: list[str], document_embeddings_topic: ndarray, words_topic: list[str], vocab_embeddings: dict, umap_mapper: UMAP, n_topwords: int = 2000) Topic

Create a Topic object from the given documents and embeddings by computing the centroid and the top-words. Only uses cosine-similarity for top-word extraction.

Parameters:
  • documents_topic (list[str]) – List of documents in the topic.

  • document_embeddings_topic (np.ndarray) – High-dimensional embeddings of the documents in the topic.

  • words_topic (list[str]) – List of words in the topic.

  • vocab_embeddings (dict) – Embeddings of the vocabulary.

  • umap_mapper (umap.UMAP) – UMAP mapper object to map from high-dimensional space to low-dimensional space.

  • n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).

Returns:

Topic object representing the extracted topic.

Return type:

Topic

topicgpt.TopicRepresentation.extract_topics(corpus: list[str], document_embeddings: ndarray, clusterer: Clustering_and_DimRed, vocab_embeddings: ndarray, n_topwords: int = 2000, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], compute_vocab_hyperparams: dict = {}) list[Topic]

Extracts topics from the given corpus using the provided clusterer object on the document embeddings.

Parameters:
  • corpus (list[str]) – List of documents.

  • document_embeddings (np.ndarray) – Embeddings of the documents.

  • clusterer (Clustering_and_DimRed) – Clustering and dimensionality reduction object to cluster the documents.

  • vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.

  • n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).

  • topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).

  • compute_vocab_hyperparams (dict, optional) – Hyperparameters for the top-word extraction methods.

Returns:

List of Topic objects representing the extracted topics.

Return type:

list[Topic]

topicgpt.TopicRepresentation.extract_topics_labels_vocab(corpus: list[str], document_embeddings_hd: ndarray, document_embeddings_ld: ndarray, labels: ndarray, umap_mapper: UMAP, vocab_embeddings: ndarray, vocab: list[str] = None, n_topwords: int = 2000, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity']) list[Topic]

Extracts topics from the given corpus using the provided labels that indicate the topics (no -1 for outliers). Vocabulary is already computed.

Parameters:
  • corpus (list[str]) – List of documents.

  • document_embeddings_hd (np.ndarray) – Embeddings of the documents in high-dimensional space.

  • document_embeddings_ld (np.ndarray) – Embeddings of the documents in low-dimensional space.

  • labels (np.ndarray) – Labels indicating the topics.

  • umap_mapper (umap.UMAP) – UMAP mapper object to map from high-dimensional space to low-dimensional space.

  • vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.

  • vocab (list[str], optional) – Vocabulary of the corpus (default is None).

  • n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).

  • topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).

Returns:

List of Topic objects representing the extracted topics.

Return type:

list[Topic]

topicgpt.TopicRepresentation.extract_topics_no_new_vocab_computation(corpus: list[str], vocab: list[str], document_embeddings: ndarray, clusterer: Clustering_and_DimRed, vocab_embeddings: ndarray, n_topwords: int = 2000, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], consider_outliers: bool = False) list[Topic]

Extracts topics from the given corpus using the provided clusterer object on the document embeddings. This version does not compute the vocabulary of the corpus and instead uses the provided vocabulary.

Parameters:
  • corpus (list[str]) – List of documents.

  • vocab (list[str]) – Vocabulary of the corpus.

  • document_embeddings (np.ndarray) – Embeddings of the documents.

  • clusterer (Clustering_and_DimRed) – Clustering and dimensionality reduction object to cluster the documents.

  • vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.

  • n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).

  • topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).

  • consider_outliers (bool, optional) – Whether to consider outliers during topic extraction (default is False).

Returns:

List of Topic objects representing the extracted topics.

Return type:

list[Topic]

topicgpt.TopicRepresentation.topic_lis_to_json(topics: list[Topic]) str

Return a JSON representation of a list of topics.

Parameters:

topics (list[Topic]) – The list of topic objects to convert to JSON.

Returns:

A JSON string representing the list of topics.

Return type:

str

topicgpt.TopicRepresentation.topic_to_json(topic: Topic) str

Return a JSON representation of the topic.

Parameters:

topic (Topic) – The topic object to convert to JSON.

Returns:

A JSON string representing the topic.

Return type:

str

topicgpt.TopwordEnhancement module

class topicgpt.TopwordEnhancement.TopwordEnhancement(openai_key: str, openai_model: str = 'gpt-3.5-turbo', max_context_length: int = 4000, openai_model_temperature: float = 0.5, basic_model_instruction: str = 'You are a helpful assistant. You are excellent at inferring topics from top-words extracted via topic-modelling. You make sure that everything you output is strictly based on the provided text.', corpus_instruction: str = '')

Bases: object

count_tokens_api_message(messages: list[dict[str]]) int

Count the number of tokens in the API messages.

Parameters:

messages (list[dict[str]]) – List of messages from the API.

Returns:

Number of tokens in the messages.

Return type:

int

describe_topic_document_sampling_str(documents: list[str], truncate_doc_thresh=100, n_documents: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>, sampling_strategy: str = None) str

Describe a topic based on a sample of its documents by using the openai model.

Parameters:
  • documents (list[str]) – List of documents ordered by similarity to the topic’s centroid.

  • truncate_doc_thresh (int, optional) – Threshold for the number of words in a document. If a document exceeds this threshold, it is truncated. Defaults to 100.

  • n_documents (int, optional) – Number of documents to use for the query. If None, all documents are used. Defaults to None.

  • query_function (Callable, optional) – Function to query the model. Defaults to a lambda function generating a query based on the provided documents.

  • sampling_strategy (Union[Callable, str], optional) – Strategy to sample the documents. If None, the first provided documents are used. If it’s a string, it’s interpreted as a method of the class (e.g., “sample_uniform” is interpreted as self.sample_uniform). It can also be a custom sampling function. Defaults to None.

Returns:

A description of the topic by the model in the form of a string.

Return type:

str

describe_topic_documents_completion_object(documents: list[str], truncate_doc_thresh=100, n_documents: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>) ChatCompletion

Describe the given topic based on its documents using the OpenAI model.

Parameters:
  • documents (list[str]) – List of documents.

  • truncate_doc_thresh (int, optional) – Threshold for the number of words in a document. If a document has more words than this threshold, it is pruned to this threshold.

  • n_documents (int, optional) – Number of documents to use for the query. If None, all documents are used.

  • query_function (Callable, optional) – Function to query the model. The function should take a list of documents and return a string.

Returns:

A description of the topics by the model in the form of an openai.ChatCompletion object.

Return type:

openai.ChatCompletion

describe_topic_documents_sampling_completion_object(documents: list[str], truncate_doc_thresh=100, n_documents: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>, sampling_strategy: str = None) ChatCompletion

Describe a topic based on a sample of its documents by using the openai model.

Parameters:
  • documents (list[str]) – List of documents ordered by similarity to the topic’s centroid.

  • truncate_doc_thresh (int, optional) – Threshold for the number of words in a document. If a document exceeds this threshold, it is truncated. Defaults to 100.

  • n_documents (int, optional) – Number of documents to use for the query. If None, all documents are used. Defaults to None.

  • query_function (Callable, optional) – Function to query the model. Defaults to a lambda function generating a query based on the provided documents.

  • sampling_strategy (Union[Callable, str], optional) – Strategy to sample the documents. If None, the first provided documents are used. If it’s a string, it’s interpreted as a method of the class (e.g., “sample_uniform” is interpreted as self.sample_uniform). It can also be a custom sampling function. Defaults to None.

Returns:

A description of the topic by the model in the form of an openai.ChatCompletion object.

Return type:

openai.ChatCompletion

describe_topic_topwords_completion_object(topwords: list[str], n_words: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>) ChatCompletion

Describe the given topic based on its topwords using the OpenAI model.

Parameters:
  • topwords (list[str]) – List of topwords.

  • n_words (int, optional) – Number of words to use for the query. If None, all words are used.

  • query_function (Callable, optional) – Function to query the model. The function should take a list of topwords and return a string.

Returns:

A description of the topics by the model in the form of an OpenAI ChatCompletion object.

Return type:

openai.ChatCompletion

describe_topic_topwords_str(topwords: list[str], n_words: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>) str

Describe the given topic based on its topwords using the OpenAI model.

Parameters:
  • topwords (list[str]) – List of topwords.

  • n_words (int, optional) – Number of words to use for the query. If None, all words are used.

  • query_function (Callable, optional) – Function to query the model. The function should take a list of topwords and return a string.

Returns:

A description of the topics by the model in the form of a string.

Return type:

str

generate_topic_name_str(topwords: list[str], n_words: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>) str

Generate a topic name based on the given topwords using the OpenAI model.

Parameters:
  • topwords (list[str]) – List of topwords.

  • n_words (int, optional) – Number of words to use for the query. If None, all words are used.

  • query_function (Callable, optional) – Function to query the model. The function should take a list of topwords and return a string.

Returns:

A topic name generated by the model in the form of a string.

Return type:

str

static sample_identity(n_docs: int) ndarray

Generate an identity array of document indices without changing their order.

Parameters:

n_docs (int) – Number of documents.

Returns:

An array containing document indices from 0 to (n_docs - 1).

Return type:

np.ndarray

static sample_poisson(n_docs: int) ndarray

Randomly sample document indices according to a Poisson distribution, favoring documents from the beginning of the list.

Parameters:

n_docs (int) – Number of documents.

Returns:

An array containing randomly permuted document indices, with more documents drawn from the beginning of the list.

Return type:

np.ndarray

static sample_uniform(n_docs: int) ndarray

Randomly sample document indices without replacement.

Parameters:

n_docs (int) – Number of documents.

Returns:

An array containing randomly permuted document indices from 0 to (n_docs - 1).

Return type:

np.ndarray

Module contents