
VidTutorAssistant Replication Package

Welcome to the replication package for VidTutorAssistant, an approach that leverages GPT-4 to automate responses to questions posted by viewers of programming video tutorials on YouTube.

Web Tool

You can access the web tool at https://vidtutorassistant.com.

Replication Package Link

You can download the replication package from here.

Replication Package Contents

  • config
    • api_keys.ini: Store GoogleDeveloperKeys and OpenAIKeys in this file.
  • requirements.txt: Contains the required Python packages to run the code.
  • utils: Contains utility functions used across different scripts in src.
  • src
    • Main
      • main.py: Implements end-to-end interaction with VidTutorAssistant for generating responses to user comments on YouTube videos. Prompts the user to enter a YouTube video URL, their name, and a comment or question about the video, then generates and displays a response using our approach.
    • Calculate_Inter_Reliability
      • Classification_Evaluation.py: Calculates Kappa between the two coders who classified comments as question or non-question (a minimal sketch of this computation appears after this contents list).
      • Evaluators_Evaluation.py: Calculates Kappa between the two evaluators who assessed the correctness and completeness of the generated answers.
      • Open_Coding_Evaluation.py: Calculates Krippendorff's Alpha between the annotators in the open coding process.
    • Classify_Comments
      • 1_Classify_Comments.py: Automates the process of classifying comments into predefined categories (question or non-question) using GPT models.
      • 2_Evaluate_Accuracy.py: Processes comment classification results to evaluate the performance of the model that classified comments as question or non-question. Computes standard classification metrics, including accuracy, precision, recall, and F1-score (see the sketch after this contents list).
    • Dataset_Collection
      • 1_Fetch_Video_IDs.py: Fetches and stores YouTube video data based on specific topics.
      • 2_Remove_Duplicates_Video.py: Reads JSON files containing YouTube video data, processes the data, and outputs it into a CSV file.
      • 3_Fetch_Video_Info.py: Fetches detailed information about YouTube videos based on a list of video IDs.
      • 4_Filter_Non_English_Videos.py: Filters a given CSV file to retain only rows where either the 'defaultLanguage' or 'defaultAudioLanguage' columns contain 'en' or any variation starting with 'en-' (representing English). Discards rows where both columns are empty or contain non-English languages.
      • 5_Fetch_Video_Transcript.py: Fetches YouTube video transcripts using the YouTube Transcript API. Reads a CSV file containing video data, fetches transcripts, and processes the transcript data.
      • 6_Process_Video_Transcript_Into_Text_Files.py: Processes JSON files containing video transcripts.
      • 7_Filter_Videos_with_No_Transcript.py: Filters a given Videos CSV file to retain only rows where the 'transcript_words_count' column has more than zero words, indicating the presence of a transcript.
      • 8_Fetch_Comments.py: Fetches and stores comments from YouTube videos using the YouTube Data API. Manages API keys to handle pagination and rate limits.
      • 9_Create_Comments_CSV_File.py: Processes YouTube comments from JSON files, extracting relevant information and organizing it into a structured format. Saves the data into a CSV file for further processing.
      • 10_Create_Comment_Reply_Pairs.py: Processes video comments and author replies from CSV files. Generates a CSV file of comment-reply pairs.
      • 11_Group_Videos.py: Processes video data from CSV files for a specified category.
    • Detect_Programming_Language
      • 1_Prepare_Videos_Dataset.py: Processes video datasets and filters them based on selected comments. Combines the filtered video data into a single CSV file.
      • 2_Detect_Programming_Language.py: Detects the programming language from video metadata and transcripts using an AI model (e.g., GPT-4).
      • 3_Evaluate_Accuracy.py: Evaluates the performance of the model that identified each video's programming language using standard classification metrics.
    • Generate_Answers
      • 1_Prepare_Subset_Data.py: Processes comment datasets to prepare a subset of data for a specified programming language.
      • 2_Embed_Video_Transcripts.py: Processes video transcripts by embedding text and encoding tokens.
      • 3_Generate_Responses.py: Generates automated replies to video comments. Reformulates comments, searches for relevant video transcript segments, prepares a context, and generates a reply using a specified AI model (a sketch of this retrieval-and-generation flow appears after this contents list).
  • data
    • Calculate_Inter_Reliability: Includes input files for the Calculate_Inter_Reliability code.
    • Classify_Comments: Includes output files of the Classify_Comments code.
    • Dataset_Collection: Includes output files of the Dataset_Collection code.
    • Detect_Programming_Language: Includes output files of the Detect_Programming_Language code.
    • Generate_Answers: Includes output files of the Generate_Answers code (generated responses to comments in the subset).
    • Sample_Dataset: Includes the randomly selected subset.
  • User_Study
    • Open_Coding_Process.json: The open coding process between the two annotators.
    • Open_Coding_Results.docx: The results (list of reasons for preferring different answer sources) of the open coding analysis as a Word document.
    • Open_Coding_Results.pdf: The results (list of reasons for preferring different answer sources) of the open coding analysis as a PDF file.
    • Participant_Demographics.xlsx: Participant demographics data.
    • Participant_Preferences.xlsx: Raw data of participant responses.
  • Figures
    • comments_classification_tool.png: A screenshot of the comments classification tool.
    • comments_evaluation_tool.png: A screenshot of the comments evaluation tool.
    • user_study_page0.png: A screenshot of the user study page 0 (instructions).
    • user_study_page1.png: A screenshot of the user study page 1 (demographics).
    • user_study_page2-11_example1.png: A screenshot of the user study pages 2-11 (example 1).
    • user_study_page2-11_example2.png: A screenshot of the user study pages 2-11 (example 2).
    • user_study_page2-11_example3.png: A screenshot of the user study pages 2-11 (example 3).
    • VidTutorAssistant_tool_example1.png: A screenshot of the VidTutorAssistant tool (home page).
    • VidTutorAssistant_tool_example2.png: A screenshot of the VidTutorAssistant tool (example 1 of generating responses).
    • VidTutorAssistant_tool_example3.png: A screenshot of the VidTutorAssistant tool (example 2 of generating responses).
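
The scripts above are the authoritative implementations; the sketches below only illustrate the underlying computations. First, a minimal sketch of the agreement and classification-accuracy calculations performed by the Calculate_Inter_Reliability and Classify_Comments scripts, assuming scikit-learn and using made-up labels; the actual column names and label values live in the scripts and data files.

```python
# Hedged sketch of the kappa / accuracy computations, with illustrative labels.
from sklearn.metrics import accuracy_score, cohen_kappa_score, precision_recall_fscore_support

# Two coders labelling the same comments (illustrative values).
coder_1 = ["question", "non-question", "question", "question"]
coder_2 = ["question", "non-question", "non-question", "question"]
print("Kappa:", cohen_kappa_score(coder_1, coder_2))

# Ground-truth labels vs. the model's predictions (illustrative values).
truth = ["question", "non-question", "question", "question"]
predicted = ["question", "non-question", "question", "non-question"]
precision, recall, f1, _ = precision_recall_fscore_support(
    truth, predicted, average="binary", pos_label="question"
)
print("Accuracy:", accuracy_score(truth, predicted))
print("Precision:", precision, "Recall:", recall, "F1:", f1)
```

Second, a minimal sketch of the retrieval-and-generation flow behind 3_Generate_Responses.py and main.py, assuming the openai (v1+) and numpy packages. The model names, prompt wording, and number of retrieved segments are illustrative assumptions, not the exact configuration used in the package.

```python
# Hedged sketch: embed transcript segments, retrieve the most relevant ones for a
# comment, and ask a GPT model to draft a reply.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts):
    """Embed a list of strings with an OpenAI embedding model (assumed model name)."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])


def answer_comment(comment, transcript_segments):
    """Return a GPT-generated reply grounded in the most relevant transcript segments."""
    segment_vectors = embed(transcript_segments)
    comment_vector = embed([comment])[0]
    # Cosine similarity between the comment and every transcript segment.
    scores = segment_vectors @ comment_vector / (
        np.linalg.norm(segment_vectors, axis=1) * np.linalg.norm(comment_vector)
    )
    context = "\n".join(transcript_segments[i] for i in scores.argsort()[-3:][::-1])
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You answer viewer questions about a programming video tutorial."},
            {"role": "user",
             "content": f"Transcript excerpts:\n{context}\n\nViewer comment:\n{comment}"},
        ],
    )
    return completion.choices[0].message.content
```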

How to Use

  1. Set Up API Keys:
    • Store your Google Developer and OpenAI API keys in config/api_keys.ini (a sketch of reading this file from Python appears after these steps).
  2. Install Required Packages:
    • Run pip install -r requirements.txt to install the required Python packages.
  3. Run Main Scripts:
    • Use src/Main/main.py to interact with VidTutorAssistant and generate responses to user comments on YouTube videos.
    • Run python src/Main/main.py to start the interaction.
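
As a convenience, the following minimal sketch shows one way the scripts could read config/api_keys.ini with Python's configparser; the section and option names below are assumptions, so match them to the actual layout of the file.

```python
# Hedged sketch of loading API keys from config/api_keys.ini.
import configparser

config = configparser.ConfigParser()
config.read("config/api_keys.ini")

# "GoogleDeveloperKeys" / "OpenAIKeys" as section names and "key" as the option
# name are hypothetical; use whatever names api_keys.ini actually defines.
google_developer_key = config["GoogleDeveloperKeys"]["key"]
openai_key = config["OpenAIKeys"]["key"]
```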

Replicate Experiments

If you want to replicate the experiments and reproduce the data, run the scripts in the src directory to collect data, classify comments, detect programming languages, and generate answers.

Acknowledgments

This replication package contains the datasets and code needed to facilitate further research and the reproducibility of our findings. For more details, visit VidTutorAssistant.