topicgpt package
Submodules
topicgpt.Clustering module
- class topicgpt.Clustering.Clustering_and_DimRed(n_dims_umap: int = 5, n_neighbors_umap: int = 15, min_dist_umap: float = 0, metric_umap: str = 'cosine', min_cluster_size_hdbscan: int = 30, metric_hdbscan: str = 'euclidean', cluster_selection_method_hdbscan: str = 'eom', number_clusters_hdbscan: int = None, random_state: int = 42, verbose: bool = True, UMAP_hyperparams: dict = {}, HDBSCAN_hyperparams: dict = {})
Bases:
object
Class to perform dimensionality reduction with UMAP followed by clustering with HDBSCAN.
- cluster_and_reduce(embeddings: ndarray) Tuple[ndarray, ndarray, UMAP]
Cluster embeddings using HDBSCAN and reduce dimensions with UMAP.
- Parameters:
embeddings (np.ndarray) – Embeddings to cluster and reduce.
- Returns:
- A tuple containing three items:
reduced_embeddings (np.ndarray): Reduced embeddings.
cluster_labels (np.ndarray): Cluster labels.
umap_mapper (umap.UMAP): UMAP mapper for transforming new embeddings, especially embeddings of the vocabulary. (MAKE SURE TO NORMALIZE EMBEDDINGS AFTER USING THE MAPPER)
- Return type:
tuple
- cluster_hdbscan(embeddings: ndarray) ndarray
Cluster embeddings using HDBSCAN.
If self.number_clusters_hdbscan is not None, further clusters the data with AgglomerativeClustering to achieve a fixed number of clusters.
- Parameters:
embeddings (np.ndarray) – Embeddings to cluster.
- Returns:
Cluster labels.
- Return type:
np.ndarray
- reduce_dimensions_umap(embeddings: ndarray) Tuple[ndarray, UMAP]
Reduces dimensions of embeddings using UMAP.
- Parameters:
embeddings (np.ndarray) – Embeddings to reduce.
- Returns:
- A tuple containing two items:
reduced_embeddings (np.ndarray): Reduced embeddings.
umap_mapper (umap.UMAP): UMAP mapper for transforming new embeddings, especially embeddings of the vocabulary. (MAKE SURE TO NORMALIZE EMBEDDINGS AFTER USING THE MAPPER)
- Return type:
tuple
- umap_diagnostics(embeddings, hammer_edges=False)
Fit UMAP on the provided embeddings and generate diagnostic plots.
Params:
- embeddingsarray-like
The high-dimensional data for UMAP to reduce and visualize.
hammer_edges : bool, default False. Is computationally expensive.
- visualize_clusters_dynamic(embeddings: ndarray, labels: ndarray, texts: list[str], class_names: list[str] = None)
Visualize clusters using Plotly and enable hovering over clusters to see the beginning of the texts of the documents.
- Parameters:
embeddings (np.ndarray) – Embeddings for which to visualize clustering.
labels (np.ndarray) – Cluster labels.
texts (list[str]) – Texts of the documents.
class_names (list[str], optional) – Names of the classes.
- visualize_clusters_static(embeddings: ndarray, labels: ndarray)
Reduce dimensionality with UMAP to two dimensions and plot the clusters.
- Parameters:
embeddings (np.ndarray) – Embeddings for which to plot clustering.
labels (np.ndarray) – Cluster labels.
topicgpt.ExtractTopWords module
- class topicgpt.ExtractTopWords.ExtractTopWords
Bases:
object
- compute_bow_representation(document: str, vocab: list[str], vocab_set: set[str]) ndarray
Compute the bag-of-words representation of a document.
- Parameters:
document (str) – Document to compute the bag-of-words representation of.
vocab (list[str]) – List of words in the corpus sorted alphabetically.
vocab_set (set[str]) – Set of words in the corpus sorted alphabetically.
- Returns:
Bag-of-words representation of the document.
- Return type:
np.ndarray
- compute_centroid_similarity(embeddings: ndarray, centroid_dict: dict, cluster_label: int) ndarray
Compute the similarity of the document embeddings to the centroid of the cluster via cosine similarity.
- Parameters:
embeddings (np.ndarray) – Embeddings to cluster and reduce.
centroid_dict (dict) – Dictionary of cluster labels and their centroids.
cluster_label (int) – Cluster label for which to compute the similarity.
- Returns:
Cosine similarity of the document embeddings to the centroid of the cluster.
- Return type:
np.ndarray
- compute_corpus_vocab(corpus: list[str], remove_stopwords: bool = True, remove_punction: bool = True, min_word_length: int = 3, max_word_length: int = 20, remove_short_words: bool = True, remove_numbers: bool = True, verbose: bool = True, min_doc_frequency: int = 3, min_freq: float = 0.1, max_freq: float = 0.9) list[str]
Compute the vocabulary of the corpus and perform preprocessing of the corpus.
- Parameters:
corpus (list[str]) – List of documents.
remove_stopwords (bool, optional) – Whether to remove stopwords.
remove_punction (bool, optional) – Whether to remove punctuation.
min_word_length (int, optional) – Minimum word length to retain.
max_word_length (int, optional) – Maximum word length to retain.
remove_short_words (bool, optional) – Whether to remove short words.
remove_numbers (bool, optional) – Whether to remove numbers.
verbose (bool, optional) – Whether to print progress and describe what is happening.
min_doc_frequency (int, optional) – Minimum number of documents a word should appear in to be considered in the vocabulary.
min_freq (float, optional) – Minimum frequency percentile of words to be considered in the vocabulary.
max_freq (float, optional) – Maximum frequency percentile of words to be considered in the vocabulary.
- Returns:
List of words in the corpus sorted alphabetically.
- Return type:
list[str]
- compute_embedding_similarity_centroids(vocab: list[str], vocab_embedding_dict: dict, umap_mapper: UMAP, centroid_dict: dict, reduce_vocab_embeddings: bool = False, reduce_centroid_embeddings: bool = False) ndarray
Compute the cosine similarity of each word in the vocabulary to each centroid.
- Parameters:
vocab (list[str]) – List of words in the corpus sorted alphabetically.
vocab_embedding_dict (dict) – Dictionary of words and their embeddings.
umap_mapper (umap.UMAP) – UMAP mapper to transform new embeddings in the same way as the document embeddings.
centroid_dict (dict) – Dictionary of cluster labels and their centroids. -1 means outlier.
reduce_vocab_embeddings (bool, optional) – Whether to reduce the vocab embeddings with the UMAP mapper.
reduce_centroid_embeddings (bool, optional) – Whether to reduce the centroid embeddings with the UMAP mapper.
- Returns:
Cosine similarity of each word in the vocab to each centroid. Has shape (len(vocab), len(centroid_dict) - 1).
- Return type:
np.ndarray
- compute_word_topic_mat(corpus: list[str], vocab: list[str], labels: ndarray, consider_outliers=False) ndarray
Compute the word-topic matrix efficiently.
- Parameters:
corpus (list[str]) – List of documents.
vocab (list[str]) – List of words in the corpus, sorted alphabetically.
labels (np.ndarray) – Cluster labels. -1 indicates outliers.
consider_outliers (bool, optional) – Whether to consider outliers when computing the top words. Defaults to False.
- Returns:
Word-topic matrix.
- Return type:
np.ndarray
- compute_word_topic_mat_old(corpus: list[str], vocab: list[str], labels: ndarray, consider_outliers: bool = False) ndarray
Compute the word-topic matrix.
- Parameters:
corpus (list[str]) – List of documents.
vocab (list[str]) – List of words in the corpus sorted alphabetically.
labels (np.ndarray) – Cluster labels. -1 means outlier.
consider_outliers (bool, optional) – Whether to consider outliers when computing the top words. I.e. whether the labels contain -1 to indicate outliers.
- Returns:
Word-topic matrix.
- Return type:
np.ndarray
- compute_words_topics(corpus: list[str], vocab: list[str], labels: ndarray) dict
Compute the words per topic.
- Parameters:
corpus (list[str]) – List of documents.
vocab (list[str]) – List of words in the corpus sorted alphabetically.
labels (np.ndarray) – Cluster labels. -1 means outlier.
- Returns:
Dictionary of topics and their words.
- Return type:
dict
- embed_vocab_openAI(api_key: str, vocab: list[str], embedder: GetEmbeddingsOpenAI = None) dict[str, ndarray]
Embed the vocabulary using the OpenAI embedding API.
- Parameters:
api_key (str) – OpenAI API key.
vocab (list[str]) – List of words in the corpus sorted alphabetically.
embedder (GetEmbeddingsOpenAI, optional) – Embedding object.
- Returns:
Dictionary of words and their embeddings.
- Return type:
dict[str, np.ndarray]
- extract_centroid(embeddings: ndarray) ndarray
Extract the single centroid of a cluster.
- Parameters:
embeddings (np.ndarray) – Embeddings to extract the centroid from.
- Returns:
The centroid of the cluster.
- Return type:
np.ndarray
- extract_centroids(embeddings: ndarray, labels: ndarray) dict
Extract centroids of clusters.
- Parameters:
embeddings (np.ndarray) – Embeddings to cluster and reduce.
labels (np.ndarray) – Cluster labels. -1 means outlier.
- Returns:
Dictionary of cluster labels and their centroids.
- Return type:
dict
- extract_topwords_centroid_similarity(word_topic_mat: ~numpy.ndarray, vocab: list[str], vocab_embedding_dict: dict, centroid_dict: dict, umap_mapper: ~umap.umap_.UMAP, top_n_words: int = 10, reduce_vocab_embeddings: bool = True, reduce_centroid_embeddings: bool = False, consider_outliers: bool = False) -> (<class 'dict'>, <class 'numpy.ndarray'>)
Extract the top words for each cluster by computing the cosine similarity of the words that occur in the corpus to the centroid of the cluster.
- Parameters:
word_topic_mat (np.ndarray) – Word-topic matrix.
vocab (list[str]) – List of words in the corpus sorted alphabetically.
vocab_embedding_dict (dict) – Dictionary of words and their embeddings.
centroid_dict (dict) – Dictionary of cluster labels and their centroids. -1 means outlier.
umap_mapper (umap.UMAP) – UMAP mapper to transform new embeddings in the same way as the document embeddings.
top_n_words (int, optional) – Number of top words to extract per topic.
reduce_vocab_embeddings (bool, optional) – Whether to reduce the vocab embeddings with the UMAP mapper.
reduce_centroid_embeddings (bool, optional) – Whether to reduce the centroid embeddings with the UMAP mapper.
consider_outliers (bool, optional) – Whether to consider outliers when computing the top words. I.e., whether the labels contain -1 to indicate outliers.
- Returns:
Dictionary of topics and their top words. np.ndarray: Cosine similarity of each word in the vocab to each centroid. Has shape (len(vocab), len(centroid_dict) - 1).
- Return type:
dict
- extract_topwords_tfidf(word_topic_mat: ndarray, vocab: list[str], labels: ndarray, top_n_words: int = 10) dict
Extract the top words for each topic using a class-based tf-idf score.
- Parameters:
word_topic_mat (np.ndarray) – Word-topic matrix.
vocab (list[str]) – List of words in the corpus sorted alphabetically.
labels (np.ndarray) – Cluster labels. -1 means outlier.
top_n_words (int, optional) – Number of top words to extract per topic.
- Returns:
Dictionary of topics and their top words.
- Return type:
dict
- get_most_similar_docs(corpus: list[str], embeddings: ndarray, labels: ndarray, centroid_dict: dict, cluster_label: int, top_n: int = 10) List[str]
Get the most similar documents to the centroid of a cluster.
- Parameters:
corpus (list[str]) – List of documents.
embeddings (np.ndarray) – Embeddings to cluster and reduce.
labels (np.ndarray) – Cluster labels. -1 means outlier.
centroid_dict (dict) – Dictionary of cluster labels and their centroids.
cluster_label (int) – Cluster label for which to compute the similarity.
top_n (int, optional) – Number of top documents to extract.
- Returns:
List of the most similar documents to the centroid of a cluster.
- Return type:
List[str]
topicgpt.GetEmbeddingsOpenAI module
- class topicgpt.GetEmbeddingsOpenAI.GetEmbeddingsOpenAI(api_key: str, embedding_model: str = 'text-embedding-ada-002', tokenizer: str = None, max_tokens: int = 8191)
Bases:
object
This class allows to compute embeddings of text using the OpenAI API.
- compute_number_of_tokens(corpus: list[str]) int
Computes the total number of tokens needed to embed the corpus.
- Parameters:
corpus (list[str]) – List of strings to embed, where each element in the list is a document.
- Returns:
Total number of tokens needed to embed the corpus.
- Return type:
int
- convert_api_res_list(api_res_list: list[dict]) dict
Converts the api_res list into a dictionary containing the embeddings as a matrix and the corpus as a list of strings.
- Parameters:
self – The instance of the class.
api_res_list (list[dict]) – List of dictionaries, where each dictionary contains the embedding of the document, the text of the document, and a list of errors that occurred during the embedding process.
- Returns:
A dictionary containing the embeddings as a matrix and the corpus as a list of strings.
- Return type:
dict
- get_embeddings(corpus: list[str]) dict
Computes the embeddings of a corpus.
- Parameters:
self – The instance of the class.
corpus (list[str]) – List of strings to embed, where each element in the list is a document.
- Returns:
A dictionary containing the embeddings as a matrix and the corpus as a list of strings.
- Return type:
dict
- get_embeddings_doc_split(corpus: list[list[str]], n_tries=3) list[dict]
Computes the embeddings of a corpus for split documents.
- Parameters:
self – The instance of the class.
corpus (list[list[str]]) – List of strings to embed, where each element is a document represented by a list of its chunks.
n_tries (int, optional) – Number of tries to make an API call (default is 3).
- Returns:
A list of dictionaries, where each dictionary contains the embedding of the document, the text of the document, and a list of errors that occurred during the embedding process.
- Return type:
List[dict]
- make_api_call(text: str)
Makes an API call to the OpenAI API to embed a text string.
- Parameters:
self – The instance of the class.
text (str) – The string to embed.
- Returns:
The response from the API.
- Return type:
API response
- static num_tokens_from_string(string: str, encoding) int
Returns the number of tokens in a text string.
- Parameters:
string (str) – Text string to compute the number of tokens.
encoding – A function to encode the string into tokens.
- Returns:
Number of tokens in the text string.
- Return type:
int
- split_doc(text)
Splits a single document that is longer than the maximum number of tokens into a list of smaller documents.
- Parameters:
self – The instance of the class.
text (str) – The string to be split.
- Returns:
A list of strings to embed, where each element in the list is a list of chunks comprising the document.
- Return type:
List[str]
- split_long_docs(text: list[str]) list[list[str]]
Splits all documents that are longer than the maximum number of tokens into a list of smaller documents.
- Parameters:
self – The instance of the class.
text (list[str]) – List of strings to embed, where each element in the list is a document.
- Returns:
A list of lists of strings to embed, where each element in the outer list is a list of chunks comprising the document.
- Return type:
List[list[str]]
topicgpt.TopicGPT module
- class topicgpt.TopicGPT.TopicGPT(openai_api_key: str, n_topics: int = None, openai_prompting_model: str = 'gpt-3.5-turbo-16k', max_number_of_tokens: int = 16384, corpus_instruction: str = '', document_embeddings: ndarray = None, vocab_embeddings: dict[str, ndarray] = None, embedding_model: str = 'text-embedding-ada-002', max_number_of_tokens_embedding: int = 8191, use_saved_embeddings: bool = True, clusterer: Clustering_and_DimRed = None, n_topwords: int = 2000, n_topwords_description: int = 500, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], compute_vocab_hyperparams: dict = {}, enhancer: TopwordEnhancement = None, topic_prompting: TopicPrompting = None, verbose: bool = True)
Bases:
object
This is the main class for doing topic modelling with TopicGPT.
- compute_embeddings(corpus: list[str]) -> (<class 'numpy.ndarray'>, dict[str, numpy.ndarray])
Computes document and vocabulary embeddings for the given corpus.
- Parameters:
corpus (list[str]) – List of strings to embed, where each element is a document.
- Returns:
- A tuple containing two items:
document_embeddings (np.ndarray): Document embeddings for the corpus, with shape (len(corpus), n_embedding_dimensions).
vocab_embeddings (dict[str, np.ndarray]): Vocabulary embeddings for the corpus, provided as a dictionary where keys are words and values are embeddings.
- Return type:
tuple
- describe_topics(topics: list[Topic]) list[Topic]
Names and describes the provided topics using the OpenAI API.
- extract_topics(corpus: list[str]) list[Topic]
Extracts topics from the given corpus.
- Parameters:
corpus (list[str]) – List of strings to process, where each element represents a document.
- Returns:
A list of Topic objects representing the extracted topics.
- Return type:
list[Topic]
- fit(corpus: list[str], verbose: bool = True)
Compute embeddings if necessary, extract topics, and describe them.
- Parameters:
corpus (list[str]) – List of strings to embed, where each element represents a document.
verbose (bool, optional) – Whether to print the progress and details of the process.
- pprompt(query: str, return_function_result: bool = True) object
Prompts the model with the given query and prints the answer.
- Parameters:
query (str) – The query to prompt the model with.
return_function_result (bool, optional) – Whether to return the result of the function call by the Language Model (LLM).
- Returns:
The result of the function call if return_function_result is True, otherwise None.
- Return type:
object
- print_topics()
Prints a string explanation of the topics.
- prompt(query: str) -> (<class 'str'>, <class 'object'>)
Prompts the model with the given query.
- Parameters:
query (str) – The query to prompt the model with.
- Returns:
- A tuple containing two items:
answer (str): The answer from the model.
function_result (object): The result of the function call.
- Return type:
tuple
Note
Please refer to the TopicPrompting class for more details on available functions for prompting the model.
- repr_topics() str
Returns a string explanation of the topics.
- save_embeddings(path: str = 'SavedEmbeddings/embeddings.pkl') None
Saves the document and vocabulary embeddings to a pickle file for later re-use.
- Parameters:
path (str, optional) – The path to save the embeddings to. Defaults to embeddings_path.
- visualize_clusters()
Visualizes the identified clusters representing the topics in a scatterplot.
topicgpt.TopicPrompting module
- class topicgpt.TopicPrompting.TopicPrompting(topic_lis: list[Topic], openai_key: str, openai_prompting_model: str = 'gpt-3.5-turbo-16k', max_context_length_promting: int = 16000, openai_model_temperature_prompting: float = 0.5, openai_embedding_model: str = 'text-embedding-ada-002', max_context_length_embedding: int = 8191, basic_model_instruction: str = "You are a helpful assistant. \nYou are excellent at inferring information about topics discovered via topic modelling using information retrieval. \nYou summarize information intelligently. \nYou use the functions you are provided with if applicable.\nYou make sure that everything you output is strictly based on the provided text. If you cite documents, give their indices. \nYou always explicitly say if you don't find any useful information!\nYou only say that something is contained in the corpus if you are very sure about it!", corpus_instruction: str = '', enhancer: TopwordEnhancement = None, vocab: list = None, vocab_embeddings: dict = None, random_state: int = 42)
Bases:
object
This class allows to formulate prompts and queries against the identified topics to get more information about them
- add_new_topic_keyword(keyword: str, inplace: bool = False, rename_new_topic: bool = False) list[Topic]
Create a new topic based on a keyword and recompute topic topwords.
This method removes all documents belonging to other topics from them and adds them to the new topic. It computes new topwords using both the tf-idf and the cosine-similarity method.
- Parameters:
keyword (str) – Keyword to create the new topic from.
inplace (bool, optional) – If True, the topic is updated in place. Otherwise, a new list of topics is created and returned (default is False).
rename_new_topic (bool, optional) – If True, the new topic is renamed to the keyword (default is False).
- Returns:
A list of new topics, including the newly created topic and the modified old ones.
- Return type:
list of Topic
- combine_topics(topic_idx_lis: list[int], inplace: bool = False) list[Topic]
Combines several topics into one topic.
This method combines the specified topics into a single topic. Note that no new topwords are computed in this step, and the topwords of the old topics are just combined. Additionally, only the cosine-similarity method for topwords extraction is used.
- Parameters:
topic_idx_list (list[int]) – List of topic indices to combine.
inplace (bool, optional) – If True, the topics are combined in place. Otherwise, a new list of topics is created and returned (default is False).
- Returns:
A list of new topics resulting from the combination.
- Return type:
list of Topic
- delete_topic(topic_idx: int, inplace: bool = False) list[Topic]
Deletes a topic with the given index from the list of topics and recomputes topwords and representations of the remaining topics.
This method assigns the documents of the deleted topic to the remaining topics.
- Parameters:
topic_idx (int) – Index of the topic to delete.
inplace (bool, optional) – If True, the topic is deleted in place. Otherwise, a new list of topics is created and returned (default is False).
- Returns:
A list of new topics resulting from the deletion.
- Return type:
list of Topic
- general_prompt(prompt: str, n_tries: int = 2) -> (list[str], <class 'object'>)
Prompt the Language Model (LLM) with a general prompt and return the response. Allow the LLM to call any function defined in the class.
Use n_tries in case the LLM does not provide a valid response.
- Parameters:
prompt (str) – Prompt string.
n_tries (int, optional) – Number of tries to get a valid response from the LLM (default is 2).
- Returns:
Response messages from the LLM. object: Response of the invoked function.
- Return type:
list of str
- get_topic_information(topic_idx_lis: list[int], max_number_topwords: int = 500) dict
Get detailed information on topics by their indices.
This function returns a dictionary where the keys are the topic indices, and the values are strings describing the topics. The description includes a maximum of max_number_topwords topwords.
- Parameters:
topic_idx_list (list[int]) – List of topic indices to compare.
max_number_topwords (int, optional) – Maximum number of topwords to include in the description of the topics (default is 500).
- Returns:
A dictionary with topic indices as keys and their descriptions as values.
- Return type:
dict
- get_topic_lis() list[Topic]
Returns the list of topics stored in the instance.
This method retrieves and returns the list of topics associated with the instance.
- Returns:
The list of Topic objects.
- Return type:
list[Topic]
- identify_topic_idx(query: str, n_tries: int = 3) int
Identifies the index of the topic that the query is most likely about.
This method uses a Language Model (LLM) to determine which topic best fits the query description. If the LLM does not find any topic that fits the query, None is returned.
- Parameters:
query (str) – Query string.
n_tries (int, optional) – Number of tries to get a valid response from the LLM (default is 3).
- Returns:
The index of the topic that the query is most likely about. If no suitable topic is found, None is returned.
- Return type:
int
- knn_search(topic_index: int, query: str, k: int = 20, doc_cutoff_threshold: int = 1000)
Finds the k nearest neighbors of the query in the given topic based on cosine similarity in the original embedding space.
- Parameters:
topic_index (int) – Index of the topic to search within.
query (str) – Query string.
k (int, optional) – Number of neighbors to return (default is 20).
doc_cutoff_threshold (int, optional) – Maximum number of tokens per document. Afterwards, the document is cut off (default is 1000).
- Returns:
- A tuple containing two lists -
A list of top k documents (as strings).
A list of indices corresponding to the top k documents in the topic.
- Return type:
tuple
- prompt_knn_search(llm_query: str, topic_index: int = None, n_tries: int = 3) -> (<class 'str'>, tuple[list[str], list[int]])
Uses the Language Model (LLM) to answer the llm_query based on the documents belonging to the topic.
- Parameters:
llm_query (str) – Query string for the Language Model (LLM).
topic_index (int, optional) – Index of the topic object. If None, the topic is inferred from the query.
n_tries (int, optional) – Number of tries to get a valid response from the LLM (default is 3).
- Returns:
- A tuple containing two elements -
A string representing the answer from the LLM.
- A tuple containing two lists -
A list of top k documents (as strings).
A list of indices corresponding to the top k documents in the topic.
- Return type:
tuple
- reindex_topic_lis(topic_list: list[Topic]) list[Topic]
Reindexes the topics in the provided topic list to assign correct new indices.
This method updates the indices of topics within the given topic list to ensure they are correctly ordered.
- reindex_topics() None
Reindexes the topics in self.topic_list to assign correct new indices.
This method updates the indices of topics within the instance’s topic list to ensure they are correctly ordered.
- Returns:
None
- set_topic_lis(topic_list: list[Topic]) None
Sets the list of topics for the instance.
This method updates the list of topics associated with the instance to the provided list.
- Parameters:
topic_list (list[Topic]) – The list of Topic objects to set.
- Returns:
None
- show_topic_lis() str
Returns a string representation of the list of topics.
This method generates a human-readable string representation of the topics in the instance’s topic list.
- Returns:
A string containing the representation of the list of topics.
- Return type:
str
- split_topic_hdbscan(topic_idx: int, min_cluster_size: int = 100, inplace: bool = False) list[Topic]
Splits an existing topic into several subtopics using HDBSCAN clustering on the document embeddings of the topic.
This method does not require specifying the number of clusters to split. Note that no new topwords are computed in this step, and the topwords of the old topic are just split among the new ones. Additionally, only the cosine-similarity method for topwords extraction is used.
- Parameters:
topic_idx (int) – Index of the topic to split.
min_cluster_size (int, optional) – Minimum cluster size to split the topic into (default is 100).
inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).
- Returns:
A list of new topics resulting from the split.
- Return type:
list of Topic
- split_topic_keywords(topic_idx: int, keywords: str, inplace: bool = False) list[Topic]
Splits the topic into subtopics according to the provided keywords.
This is achieved by computing the cosine similarity between the keywords and the documents in the topic. Note that no new topwords are computed in this step, and the topwords of the old topic are just split among the new ones. Additionally, only the cosine-similarity method for topwords extraction is used.
- Parameters:
topic_idx (int) – Index of the topic to split.
keywords (str) – Keywords to split the topic into. Needs to be a list of at least two keywords.
inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).
- Returns:
A list of new topics resulting from the split.
- Return type:
list of Topic
- split_topic_kmeans(topic_idx: int, n_clusters: int = 2, inplace: bool = False) list[Topic]
Splits an existing topic into several subtopics using k-means clustering on the document embeddings of the topic.
Note that no new topwords are computed in this step, and the topwords of the old topic are just split among the new ones. Additionally, only the cosine-similarity method for topwords extraction is used.
- Parameters:
topic_idx (int) – Index of the topic to split.
n_clusters (int, optional) – Number of clusters to split the topic into (default is 2).
inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).
- Returns:
A list of new topics resulting from the split.
- Return type:
list of Topic
- split_topic_new_assignments(topic_idx: int, new_topic_assignments: ndarray, inplace: bool = False) list[Topic]
Splits a topic into new topics based on new topic assignments.
Note that this method only computes topwords based on the cosine-similarity method because tf-idf topwords need expensive computation on the entire corpus. The topwords of the old topic are also just split among the new ones. No new topwords are computed in this step.
- Parameters:
topic_idx (int) – Index of the topic to split.
new_topic_assignments (np.ndarray) – New topic assignments for the documents in the topic.
inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).
- Returns:
A list of new topics resulting from the split.
- Return type:
list of Topic
- split_topic_single_keyword(topic_idx: int, keyword: str, inplace: bool = False) list[Topic]
Splits the topic with a single keyword.
This method splits the topic such that all documents closer to the original topic name stay in the old topic, while all documents closer to the keyword are moved to the new topic. Note that no new topwords are computed in this step, and the topwords of the old topic are just split among the new ones. Additionally, only the cosine-similarity method for topwords extraction is used.
- Parameters:
topic_idx (int) – Index of the topic to split.
keyword (str) – Keyword to split the topic into.
inplace (bool, optional) – If True, the topic is split in place. Otherwise, a new list of topics is created and returned (default is False).
- Returns:
A list of new topics resulting from the split.
- Return type:
list of Topic
topicgpt.TopicRepresentation module
- class topicgpt.TopicRepresentation.Topic(topic_idx: str, documents: list[str], words: dict[str, int], centroid_hd: ndarray = None, centroid_ld: ndarray = None, document_embeddings_hd: ndarray = None, document_embeddings_ld: ndarray = None, document_embedding_similarity: ndarray = None, umap_mapper: UMAP = None, top_words: dict[str, list[str]] = None, top_word_scores: dict[str, list[float]] = None)
Bases:
object
class to represent a topic and all its attributes
- set_topic_description(text: str)
add a text description to the topic params:
text: text description of the topic
- set_topic_name(name: str)
add a name to the topic params:
name: name of the topic
- to_dict() dict
return a dict representation of the topic
- to_json() str
return a json representation of the topic
- topicgpt.TopicRepresentation.describe_and_name_topics(topics: list[Topic], enhancer: TopwordEnhancement, topword_method='tfidf', n_words=500) list[Topic]
Describe and name the topics using the OpenAI API with the given enhancer object.
- Parameters:
topics (list[Topic]) – List of Topic objects.
enhancer (TopwordEnhancement) – Enhancer object to enhance the top-words and generate the description.
topword_method (str, optional) – Method to use for top-word extraction. Can be “tfidf” or “cosine_similarity” (default is “tfidf”).
n_words (int, optional) – Number of topwords to extract for the description and the name (default is 500).
- Returns:
List of Topic objects with the description and name added.
- Return type:
list[Topic]
- topicgpt.TopicRepresentation.extract_and_describe_topic_cos_sim(documents_topic: list[str], document_embeddings_topic: ndarray, words_topic: list[str], vocab_embeddings: dict, umap_mapper: UMAP, enhancer: TopwordEnhancement, n_topwords: int = 2000, n_topwords_description=500) Topic
Create a Topic object from the given documents and embeddings by computing the centroid and the top-words. Only use cosine-similarity for top-word extraction. Describe and name the topic with the given enhancer object.
- Parameters:
documents_topic (list[str]) – List of documents in the topic.
document_embeddings_topic (np.ndarray) – High-dimensional embeddings of the documents in the topic.
words_topic (list[str]) – List of words in the topic.
vocab_embeddings (dict) – Embeddings of the vocabulary.
umap_mapper (umap.UMAP) – UMAP mapper object to map from high-dimensional space to low-dimensional space.
enhancer (TopwordEnhancement) – Enhancer object to enhance the top-words and generate the description.
n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).
n_topwords_description (int, optional) – Number of top-words to use from the extracted topics for the description and the name (default is 500).
- Returns:
Topic object representing the extracted and described topic.
- Return type:
- topicgpt.TopicRepresentation.extract_and_describe_topics(corpus: list[str], document_embeddings: ndarray, clusterer: Clustering_and_DimRed, vocab_embeddings: ndarray, enhancer: TopwordEnhancement, n_topwords: int = 2000, n_topwords_description: int = 500, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], compute_vocab_hyperparams: dict = {}, topword_description_method: str = 'cosine_similarity') list[Topic]
Extracts topics from the given corpus using the provided clusterer object on the document embeddings and describes/names them using the given enhancer object.
- Parameters:
corpus (list[str]) – List of documents.
document_embeddings (np.ndarray) – Embeddings of the documents.
clusterer (Clustering_and_DimRed) – Clustering and dimensionality reduction object to cluster the documents.
vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.
enhancer (TopwordEnhancement) – Enhancer object for enhancing top-words and generating descriptions/names for topics.
n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).
n_topwords_description (int, optional) – Number of top-words to use from the extracted topics for description and naming (default is 500).
topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).
compute_vocab_hyperparams (dict, optional) – Hyperparameters for the top-word extraction methods.
topword_description_method (str, optional) – Method to use for top-word extraction for description/naming. Can be “tfidf” or “cosine_similarity” (default is “cosine_similarity”).
- Returns:
List of Topic objects representing the extracted and described topics.
- Return type:
list[Topic]
- topicgpt.TopicRepresentation.extract_describe_topics_labels_vocab(corpus: list[str], document_embeddings_hd: ndarray, document_embeddings_ld: ndarray, labels: ndarray, umap_mapper: UMAP, vocab_embeddings: ndarray, enhancer: TopwordEnhancement, vocab: list[str] = None, n_topwords: int = 2000, n_topwords_description: int = 500, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], topword_description_method: str = 'cosine_similarity') list[Topic]
Extracts topics from the given corpus using the provided labels that indicate the topics (no -1 for outliers). Vocabulary is already computed. Describe and name the topics with the given enhancer object.
- Parameters:
corpus (list[str]) – List of documents.
document_embeddings_hd (np.ndarray) – Embeddings of the documents in high-dimensional space.
document_embeddings_ld (np.ndarray) – Embeddings of the documents in low-dimensional space.
labels (np.ndarray) – Labels indicating the topics.
umap_mapper (umap.UMAP) – UMAP mapper object to map from high-dimensional space to low-dimensional space.
vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.
enhancer (TopwordEnhancement) – Enhancer object to enhance the top-words and generate the description.
vocab (list[str], optional) – Vocabulary of the corpus (default is None).
n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).
n_topwords_description (int, optional) – Number of top-words to use from the extracted topics for the description and the name (default is 500).
topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).
topword_description_method (str, optional) – Method to use for top-word extraction. Can be “tfidf” or “cosine_similarity” (default is “cosine_similarity”).
- Returns:
List of Topic objects representing the extracted topics.
- Return type:
list[Topic]
- topicgpt.TopicRepresentation.extract_topic_cos_sim(documents_topic: list[str], document_embeddings_topic: ndarray, words_topic: list[str], vocab_embeddings: dict, umap_mapper: UMAP, n_topwords: int = 2000) Topic
Create a Topic object from the given documents and embeddings by computing the centroid and the top-words. Only uses cosine-similarity for top-word extraction.
- Parameters:
documents_topic (list[str]) – List of documents in the topic.
document_embeddings_topic (np.ndarray) – High-dimensional embeddings of the documents in the topic.
words_topic (list[str]) – List of words in the topic.
vocab_embeddings (dict) – Embeddings of the vocabulary.
umap_mapper (umap.UMAP) – UMAP mapper object to map from high-dimensional space to low-dimensional space.
n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).
- Returns:
Topic object representing the extracted topic.
- Return type:
- topicgpt.TopicRepresentation.extract_topics(corpus: list[str], document_embeddings: ndarray, clusterer: Clustering_and_DimRed, vocab_embeddings: ndarray, n_topwords: int = 2000, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], compute_vocab_hyperparams: dict = {}) list[Topic]
Extracts topics from the given corpus using the provided clusterer object on the document embeddings.
- Parameters:
corpus (list[str]) – List of documents.
document_embeddings (np.ndarray) – Embeddings of the documents.
clusterer (Clustering_and_DimRed) – Clustering and dimensionality reduction object to cluster the documents.
vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.
n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).
topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).
compute_vocab_hyperparams (dict, optional) – Hyperparameters for the top-word extraction methods.
- Returns:
List of Topic objects representing the extracted topics.
- Return type:
list[Topic]
- topicgpt.TopicRepresentation.extract_topics_labels_vocab(corpus: list[str], document_embeddings_hd: ndarray, document_embeddings_ld: ndarray, labels: ndarray, umap_mapper: UMAP, vocab_embeddings: ndarray, vocab: list[str] = None, n_topwords: int = 2000, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity']) list[Topic]
Extracts topics from the given corpus using the provided labels that indicate the topics (no -1 for outliers). Vocabulary is already computed.
- Parameters:
corpus (list[str]) – List of documents.
document_embeddings_hd (np.ndarray) – Embeddings of the documents in high-dimensional space.
document_embeddings_ld (np.ndarray) – Embeddings of the documents in low-dimensional space.
labels (np.ndarray) – Labels indicating the topics.
umap_mapper (umap.UMAP) – UMAP mapper object to map from high-dimensional space to low-dimensional space.
vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.
vocab (list[str], optional) – Vocabulary of the corpus (default is None).
n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).
topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).
- Returns:
List of Topic objects representing the extracted topics.
- Return type:
list[Topic]
- topicgpt.TopicRepresentation.extract_topics_no_new_vocab_computation(corpus: list[str], vocab: list[str], document_embeddings: ndarray, clusterer: Clustering_and_DimRed, vocab_embeddings: ndarray, n_topwords: int = 2000, topword_extraction_methods: list[str] = ['tfidf', 'cosine_similarity'], consider_outliers: bool = False) list[Topic]
Extracts topics from the given corpus using the provided clusterer object on the document embeddings. This version does not compute the vocabulary of the corpus and instead uses the provided vocabulary.
- Parameters:
corpus (list[str]) – List of documents.
vocab (list[str]) – Vocabulary of the corpus.
document_embeddings (np.ndarray) – Embeddings of the documents.
clusterer (Clustering_and_DimRed) – Clustering and dimensionality reduction object to cluster the documents.
vocab_embeddings (np.ndarray) – Embeddings of the vocabulary.
n_topwords (int, optional) – Number of top-words to extract from the topics (default is 2000).
topword_extraction_methods (list[str], optional) – List of methods to extract top-words from the topics. Can contain “tfidf” and “cosine_similarity” (default is [“tfidf”, “cosine_similarity”]).
consider_outliers (bool, optional) – Whether to consider outliers during topic extraction (default is False).
- Returns:
List of Topic objects representing the extracted topics.
- Return type:
list[Topic]
topicgpt.TopwordEnhancement module
- class topicgpt.TopwordEnhancement.TopwordEnhancement(openai_key: str, openai_model: str = 'gpt-3.5-turbo', max_context_length: int = 4000, openai_model_temperature: float = 0.5, basic_model_instruction: str = 'You are a helpful assistant. You are excellent at inferring topics from top-words extracted via topic-modelling. You make sure that everything you output is strictly based on the provided text.', corpus_instruction: str = '')
Bases:
object
- count_tokens_api_message(messages: list[dict[str]]) int
Count the number of tokens in the API messages.
- Parameters:
messages (list[dict[str]]) – List of messages from the API.
- Returns:
Number of tokens in the messages.
- Return type:
int
- describe_topic_document_sampling_str(documents: list[str], truncate_doc_thresh=100, n_documents: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>, sampling_strategy: str = None) str
Describe a topic based on a sample of its documents by using the openai model.
- Parameters:
documents (list[str]) – List of documents ordered by similarity to the topic’s centroid.
truncate_doc_thresh (int, optional) – Threshold for the number of words in a document. If a document exceeds this threshold, it is truncated. Defaults to 100.
n_documents (int, optional) – Number of documents to use for the query. If None, all documents are used. Defaults to None.
query_function (Callable, optional) – Function to query the model. Defaults to a lambda function generating a query based on the provided documents.
sampling_strategy (Union[Callable, str], optional) – Strategy to sample the documents. If None, the first provided documents are used. If it’s a string, it’s interpreted as a method of the class (e.g., “sample_uniform” is interpreted as self.sample_uniform). It can also be a custom sampling function. Defaults to None.
- Returns:
A description of the topic by the model in the form of a string.
- Return type:
str
- describe_topic_documents_completion_object(documents: list[str], truncate_doc_thresh=100, n_documents: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>) ChatCompletion
Describe the given topic based on its documents using the OpenAI model.
- Parameters:
documents (list[str]) – List of documents.
truncate_doc_thresh (int, optional) – Threshold for the number of words in a document. If a document has more words than this threshold, it is pruned to this threshold.
n_documents (int, optional) – Number of documents to use for the query. If None, all documents are used.
query_function (Callable, optional) – Function to query the model. The function should take a list of documents and return a string.
- Returns:
A description of the topics by the model in the form of an openai.ChatCompletion object.
- Return type:
openai.ChatCompletion
- describe_topic_documents_sampling_completion_object(documents: list[str], truncate_doc_thresh=100, n_documents: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>, sampling_strategy: str = None) ChatCompletion
Describe a topic based on a sample of its documents by using the openai model.
- Parameters:
documents (list[str]) – List of documents ordered by similarity to the topic’s centroid.
truncate_doc_thresh (int, optional) – Threshold for the number of words in a document. If a document exceeds this threshold, it is truncated. Defaults to 100.
n_documents (int, optional) – Number of documents to use for the query. If None, all documents are used. Defaults to None.
query_function (Callable, optional) – Function to query the model. Defaults to a lambda function generating a query based on the provided documents.
sampling_strategy (Union[Callable, str], optional) – Strategy to sample the documents. If None, the first provided documents are used. If it’s a string, it’s interpreted as a method of the class (e.g., “sample_uniform” is interpreted as self.sample_uniform). It can also be a custom sampling function. Defaults to None.
- Returns:
A description of the topic by the model in the form of an openai.ChatCompletion object.
- Return type:
openai.ChatCompletion
- describe_topic_topwords_completion_object(topwords: list[str], n_words: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>) ChatCompletion
Describe the given topic based on its topwords using the OpenAI model.
- Parameters:
topwords (list[str]) – List of topwords.
n_words (int, optional) – Number of words to use for the query. If None, all words are used.
query_function (Callable, optional) – Function to query the model. The function should take a list of topwords and return a string.
- Returns:
A description of the topics by the model in the form of an OpenAI ChatCompletion object.
- Return type:
openai.ChatCompletion
- describe_topic_topwords_str(topwords: list[str], n_words: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>) str
Describe the given topic based on its topwords using the OpenAI model.
- Parameters:
topwords (list[str]) – List of topwords.
n_words (int, optional) – Number of words to use for the query. If None, all words are used.
query_function (Callable, optional) – Function to query the model. The function should take a list of topwords and return a string.
- Returns:
A description of the topics by the model in the form of a string.
- Return type:
str
- generate_topic_name_str(topwords: list[str], n_words: int = None, query_function: ~typing.Callable = <function TopwordEnhancement.<lambda>>) str
Generate a topic name based on the given topwords using the OpenAI model.
- Parameters:
topwords (list[str]) – List of topwords.
n_words (int, optional) – Number of words to use for the query. If None, all words are used.
query_function (Callable, optional) – Function to query the model. The function should take a list of topwords and return a string.
- Returns:
A topic name generated by the model in the form of a string.
- Return type:
str
- static sample_identity(n_docs: int) ndarray
Generate an identity array of document indices without changing their order.
- Parameters:
n_docs (int) – Number of documents.
- Returns:
An array containing document indices from 0 to (n_docs - 1).
- Return type:
np.ndarray
- static sample_poisson(n_docs: int) ndarray
Randomly sample document indices according to a Poisson distribution, favoring documents from the beginning of the list.
- Parameters:
n_docs (int) – Number of documents.
- Returns:
An array containing randomly permuted document indices, with more documents drawn from the beginning of the list.
- Return type:
np.ndarray
- static sample_uniform(n_docs: int) ndarray
Randomly sample document indices without replacement.
- Parameters:
n_docs (int) – Number of documents.
- Returns:
An array containing randomly permuted document indices from 0 to (n_docs - 1).
- Return type:
np.ndarray