KatzBot is a chatbot designed to enhance communication within university communities. It is built on the Katz generative pre-trained transformer (KatzGPT), a custom large language model (LLM) that addresses the accuracy gaps often observed in existing academic chatbot systems. KatzGPT is trained on two curated datasets, one of sentence-completion pairs and one of question-answer pairs, to expand its knowledge base and improve its accuracy. The system offers a concise and effective interface for communication between users and the KatzGPT model, improving on the typical academic chatbot experience.
| Data Type | Description | Count |
|---|---|---|
| Sentence Completion | Training for knowledge integration | 6,280 |
| Train QA Pairs | Enhancing detailed understanding | 7,334 |
| Test QA Pairs | Assessing model consistency | 2,081 |
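To make the two record types concrete, the training data might be stored as follows. This is a hypothetical sketch: the paper does not specify the schema, and the field names (`prompt`, `completion`, `question`, `answer`) and the example text are assumptions for illustration only.

```python
# Hypothetical record formats for the two KatzGPT training datasets.
# Field names and contents are illustrative assumptions, not the paper's schema.

# Sentence-completion pair: the model learns to continue a partial sentence.
sentence_completion_example = {
    "prompt": "The university's graduate admissions deadline for the fall term is",
    "completion": "posted on the admissions page and updated each academic year.",
}

# Question-answer pair: the model learns to answer a user question directly.
qa_pair_example = {
    "question": "How do I contact the admissions office?",
    "answer": "You can reach the admissions office by email or phone during business hours.",
}

def is_valid_record(record: dict, required_keys: tuple) -> bool:
    """Check that a record has exactly the required non-empty string fields."""
    return (
        set(record) == set(required_keys)
        and all(isinstance(record[k], str) and record[k] for k in required_keys)
    )
```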
In the comparative analysis detailed in the table below, we assess the performance of several large language models (LLMs), including our in-house KatzGPT model. The evaluation benchmarks Rouge scores, focusing particularly on Rouge-L, which measures the long-form coherence of generated text.
| Model | Rouge-1 | Rouge-2 | Rouge-L |
|---|---|---|---|
| Llama2 3B | 0.23 | 0.07 | 0.20 |
| Microsoft Phi 1.5 | 0.26 | 0.10 | 0.24 |
| Llama2 7B | 0.28 | 0.12 | 0.25 |
| KatzGPT | 0.29 | 0.16 | 0.25 |
| Microsoft Phi2 | 0.34 | 0.15 | 0.31 |
| Mistral 7B Instruct | 0.43 | 0.20 | 0.33 |
| GPT-2 | 0.45 | 0.32 | 0.43 |
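Rouge-L scores a candidate against a reference by the length of their longest common subsequence (LCS) of tokens. The following is a minimal sketch of that computation, not the paper's actual evaluation code, which may use a standard library such as `rouge-score` with stemming and other normalization:

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """Rouge-L F1: harmonic mean of LCS-based precision and recall.

    Uses naive whitespace tokenization; real implementations typically
    apply stemming and case normalization as well.
    """
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, "the cat sat on the mat" vs. "the cat lay on the mat" share the 5-token subsequence "the cat on the mat" out of 6 tokens each, giving an F1 of 5/6.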
KatzGPT, which builds on the foundational GPT architecture, performs notably well on Rouge-L, matching Llama2 7B (0.25) and indicating a robust ability to reproduce the context and structure of source texts. Despite being a newcomer, KatzGPT surpasses several predecessors, including Microsoft Phi 1.5 and both Llama2 models, underscoring its grasp of nuanced textual relationships.

Notably, GPT-2, the base for KatzGPT's development, is the leading model in this evaluation; its top scores across all Rouge metrics affirm its effectiveness as a foundation and make it a pivotal benchmark in this comparison. Mistral 7B Instruct leads among the remaining models in extractive and instructive capabilities but still trails GPT-2, highlighting how these models compete on different aspects of language comprehension and generation. Llama2 3B underperforms across all metrics, suggesting it requires further refinement, particularly in understanding and response generation.

Overall, the evaluation highlights KatzGPT's ability to preserve coherence and context in its outputs and establishes it as a credible rival to more established models. The comparative study illuminates the varying capabilities of, and potential areas for improvement among, current LLMs, positioning KatzGPT as an impactful newcomer in text generation and understanding. With additional training and adjustments to its architecture, KatzGPT has the potential to achieve even greater performance.