AUTOMATED GENERATION OF DETAILED PROGRAMMING ASSIGNMENT FEEDBACK

When teaching students computer programming, instructors often teach specific techniques that students should follow. Students are told to program in these ways, but instructors rarely know whether the techniques are actually used, and if they are, how effective they are. This project produced a Programming Analysis Plug-In (PAPI) to analyze student coursework in academic computer programming courses and measure when and how students work on programming assignments. These measurements include examining the final assignment submitted by a student as well as the steps the student took to reach the final product. To ensure that this data capture is performed in the most user-friendly way, potential users, both instructors and students, were interviewed for their opinions on how the software should work. It was determined that both students and instructors prefer auto-grading software, but that it currently lacks formative feedback. It was also postulated that if a teacher can access and easily understand how a student reaches a final result, they can help support struggling students, find class pain points, and discover bad practices on projects. Having an instructor sit down with every student to ask how they programmed an assignment is not feasible in the large classes found in computer science; by automating this process, students can get the feedback they need to excel. PAPI delivers this feedback by analyzing assignment creation date, last edit date, number of saves, number of character insertions and deletions, and number of comments. This thesis describes the PAPI software, its testing in a computer science course, and the results, which indicate that starting an assignment early, commenting code, and having fewer text insertions and deletions correlate with higher assignment grades. PAPI will have a broad impact because of its compatibility with current technologies and its intuitive ease of use.

I would like to thank Jake Fonseca for his guidance in the early stages of this research project. His mentorship gave me a thought-provoking forum to discuss thesis topics and find one I am truly interested in. Beyond this, I was able to explore a topic with real-world implications that can directly aid others upon deployment.

When teaching students new programming skills, an instructor must have various ways to evaluate the student. A student's progress throughout a class must be recorded and evaluated in order for the student to know where they need to improve. In typical computer science courses, students hand in assignments without the instructor knowing how they reached the end product. If student progress on an assignment can be recorded and analyzed, then formative feedback can be provided to students and instructors.
With the ability to access and understand how a student progressed to their final result, the teacher can help support struggling students, find class-wide patterns, and discover discouraged programming practices on projects. Traditionally, this data can only be obtained by having an instructor sit down with every student to ask how they coded a program, which is unrealistic for large class sizes with limited staff instruction time. Class sizes are growing, and at some universities they are expected to reach a point where the instructor cannot grade every assignment manually [1]. Coding is becoming an important and necessary skill for employability, and the current numbers of computer science students will only increase [2]. Auto-grading software that provides summative feedback already exists; however, with these tools, students do not receive the traditional formative feedback that helps improve their grades [3]. Previous research has been done in the areas of proper programming techniques and the efficiency of auto-grading assignments. If formative feedback can be provided in an automated fashion, as summative grading is now, this support can be added without increasing the workload of the teaching staff.
In this project, I followed standard software development practice: I researched the needs and wants of potential users of a tool that records student approaches to programming and built to those requirements. This process included one-on-one interviews with the target users.

Motivation
In my computer science undergraduate career, I noticed an increase in auto-graded assignments. While I saw many benefits to this technology, I felt inhibited by its shortcomings. In particular, I wanted more feedback on how I work as a programmer and whether there were specific patterns in my style that could be improved. From this clear need that I experienced in the computer science education process, I decided to base my project on analyzing students' programming processes to identify patterns that could be useful for instructors.

Research Goals
In this project, research and development were performed to establish the implementation of an integrated development environment (IDE) plugin that can generate auto-graded formative feedback. The research focused on four main points:
• Determine what data can and should be collected from a student assignment;
• Develop software to generate and parse the data;
• Find patterns in the student data;
• Provide the ability for others to analyze their own data captured from the tool.
The goals of this work are explained in more detail in the following subsections.

Formative Feedback
My project included developing software to analyze student work and automatically provide formative feedback to both the student and instructors, along with a plagiarism indicator. The purpose of PAPI is to identify patterns in a student's programming technique. This feedback can be delivered to the professor, who can use it to help students struggling on assignments, review the cumulative patterns of students at the end of the semester, or see how different teaching styles affect student work. The software also generates a PDF report that can be sent to the student so they can reflect on their own patterns. When analyzing a student's assignment, PAPI can help identify plagiarized work by checking the amount of text entered in a specific amount of time. The software detects not only plagiarized work but all copy-and-pasted text. PAPI checks for this by looking only at the keystroke information provided by the student: any text, including the student's own written code, will be flagged if inserted at an accelerated speed.
The analysis of this tool, presented in Chapter 4, shows that copied (assumed to be cheated) code is identified with at least 80% accuracy with no more than 10% false positives and 10% false negatives.

Workload
Another important criterion is the amount of extra effort that students and instructors must expend to use a feedback tool. After prototyping the tool, I found that it requires minimal effort from the teaching staff. PAPI can generate a student's report in fewer than seven clicks, and report generation is effectively immediate, taking less than one second per student. Automated grading does not yet have the same emphasis abilities as traditional grading; however, its use is still growing. This is shown by the constant increase in the use of automatic grading tools, as well as by the one-on-one interviews performed in this study, in which the surveyed instructors rated speed and efficiency as most important.

Online Only Courses
Massive Open Online Courses (MOOCs) are gaining popularity in higher education [2]. These courses overcome boundaries such as language and culture through the online setting.
Grading infrastructure must scale accordingly. When students take an online course, they need to receive feedback quickly so they can progress through the curriculum. To keep up with this demand, new ways of grading will be needed.

Figure 1. A screenshot of the Gradescope cloud auto-grading software. Gradescope is a common LMS used for digital grading and automatic grading.

Auto-graders
Automatically graded assignments in the computer science classroom are becoming increasingly common at all stages of the learning process. Currently, auto-grading software provides a grade but does not give specific feedback on the nuances of programming. Such nuanced feedback includes, but is not limited to, when the student starts working on an assignment, how much and when the student comments their code, and how the student divides the work of a large assignment [3,4,5]. To better learn the skill of programming through personalized feedback, more detail needs to be supplied to the teacher for proper observation. For a student to master their skill faster and have a more enjoyable programming experience, a teacher needs more information from the IDE to properly assess and assist the student [6,7]. The ways auto-graders currently grade student work are limited to the strategies explained below.

Algorithmic Strategy: Teacher-Provided Responses
Assignments can be written in many ways, but when working on a programming assignment, especially with introductory-level material, there are only a few options for implementing an idea [8]. Current approaches to project grading with higher levels of feedback involve creating an algorithm that software can check across students' work. MISTAKEBROWSER [9] is a current deployment of this style of software. While this level of grading is very helpful for students, it does not take work away from the instructor: writing out every possible option for solving an assignment is arduous and repetitive, and any possibilities the instructor fails to anticipate must still be graded manually. The purpose of our software is to provide feedback on an assignment without prior knowledge of its internals.

Previous Submission Analysis
With the growth of artificial intelligence (AI) and neural networks, the work of enumerating solutions can shift away from the teacher. When looking at student submissions from previous semesters, variations exist, but only to a point [10]. The research by Huang et al. examined 32,876 submissions and found 423 correct ways to solve the problem along with close to 3,000 incorrect ways. This supports the idea that manual teacher generation of these approaches could never reach the coverage needed to support the work. It also shows that with a large enough data set, patterns do arise in student programming techniques. Grading future work based on the past works well, but if an assignment changes at all from one semester to the next, the current technology cannot be retargeted to the new assignment.

Grammars
A similar approach to an algorithmic strategy is to check a student's assignment at the level of grammars [11]. Instead of providing all options with set responses, this program looks through the student's assignment using a syntax tree, catching both programming style and structure. This works to the teacher's advantage when teaching very specific skills for programming constructs such as logic, loops, and functions. While a compiler can arguably provide debugging information as well, teacher-provided comments can exceed that level of detail and explain issues to the student using beginner-level language and terminology.

Figure 2. A screenshot of possible feedback options for applying previous responses to a current assignment.

Reuse of Feedback
The possibility of reusing feedback from previous iterations of a class is another way teachers are increasing the speed of their grading [9]. Just as there are only a limited number of correct answers to a question, there are also only so many ways a student may do an assignment incorrectly [2]. Writing a detailed response once for a common mistake raises the level of feedback for every student who makes the same mistake in the future. This can be combined with the AI work explained previously: by grouping similar students' work before providing feedback, an instructor can find all students who made the same mistake at once. Glassman et al. [12] handle this process by producing a stack of variables and comparing the similarity of students' actions to those variables. This normalizes solutions down to their important base elements, both grouping similar implementations together and simplifying the readability of student code. The teacher can then apply feedback simultaneously to multiple students who all attempted a part of an assignment in a related way.

Auto-grading Accuracy
Feedback provided by automatic systems can cover many students even in simple deployments. Singh et al. [2] determined that, because of the similarity between student work, software grading of summative feedback can reach the average student even in simple implementations. Their technique provides accurate feedback to 64% of over 1,000 submitted assignments in under 10 seconds, showing that even a rough deployment of this technology can reach a majority of students. Either the level of detail must be decreased or the accuracy of the program increased to reach the desired efficiency level; our study looks for that point of efficiency so teachers and students can get the feedback they need with a higher level of reliability.

Plagiarism

Plagiarism and Learning
When a student copies and pastes code, it constitutes plagiarism. While cheating of all kinds is discouraged and against most university policies, copy and paste has been shown to hinder students' learning more than other cheating methods [13]. The cited study determined that a student does not spend time thinking about a concept when copying and pasting, leading to a higher rate of forgetting the concept in the future. Students who pasted fewer words were found to have a deeper understanding of material compared to those who pasted more [14]; students who were not limited in how much text they could paste appeared to process the material less. In coding, text can be copied and pasted from websites and other students' files, but also from a student's own files and even within the same document. By not typing out even their own code, students can lose this memorization [15,16]. This can lead to what is referred to as copy-paste driven development or cargo cult programming [17,18], where a student thinks they understand how a program works while not really understanding the underlying concepts.

Plagiarism Software

Statistical Analysis
One way to evaluate plagiarism is to look at programming style using markers and statistical analysis [19], finding themes in indentation, variable naming, and spacing. This technique works well, but it assumes the offender is not trying to hide the plagiarism: restyling copied code to match the individual's usual style removes the opportunity to detect the difference.

Hashing
Fingerprinting assignments with URL hashing is another way to check for plagiarism incidents [20]. Gao created an algorithm for identifying copied code between HTML pages, built to crawl the web for identical pages and find duplicate websites. While current hashing algorithms, like MD5, work well for detecting exact copies, the fingerprinting method was created to look for similarities. The algorithm was a success and identified many copied pages across the internet; when applied to web news, duplicated pages ranged from 33.4% to 63.7%. Applied to the classroom, this would correlate directly with classes that teach web programming. PAPI is also assumed to be modifiable to analyze other programming languages.

MOSS

Measure Of Software Similarity (MOSS) is the most widely used tool at the University of Rhode Island for detecting plagiarism in computer science classes. The software is cloud-based, running most of the process on Stanford-owned servers [21]. The service has about 300,000 accounts and receives 1,000 to 10,000 submissions per day, each containing between twenty and 2,000 assignments [22]. The algorithm of choice for this software is called winnowing. MOSS slides small windows across each assignment and assigns a hash to locations throughout the provided documents. Once these small fingerprints have been made, MOSS looks up each piece of text across the documents; if a fingerprint appears in two documents, a case of plagiarism is noted and accumulated to present to the user [23].
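MOSS's production algorithm is described in detail in [23]; the following is only a minimal Python sketch of the winnowing idea, with arbitrary k-gram and window sizes rather than MOSS's actual parameters.

```python
import hashlib

def kgram_hashes(text, k=5):
    """Hash every k-gram (k consecutive characters) of a document."""
    digest = lambda s: int(hashlib.md5(s.encode()).hexdigest(), 16)
    return [digest(text[i:i + k]) for i in range(len(text) - k + 1)]

def fingerprint(text, k=5, w=4):
    """Winnowing: slide a window of w consecutive k-gram hashes over
    the document and keep the minimum hash of each window."""
    hashes = kgram_hashes(text, k)
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def shared(doc_a, doc_b):
    """Fingerprints appearing in both documents mark possible copying."""
    return fingerprint(doc_a) & fingerprint(doc_b)

# Identical passages produce overlapping fingerprints.
print(len(shared("int main(void) { return 0; }",
                 "int main(void) { return 1; }")) > 0)  # True
```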
Kaya and Özel have taken the MOSS source code plagiarism detection tool [24] and integrated it with the learning management system (LMS) Moodle. Integrating the two makes it easier for instructors to check for students who plagiarize code, with the goal of increasing the number of classes that use the technology. In turn, this decreases cheating instances, as students know they will be caught [25]. Other plagiarism tools are built into the LMS, but none lend themselves well to the programming classroom. The study found that professors can get plagiarism feedback with little user effort, showing the success of the product. It also determined that students who have been caught plagiarizing tend to have lower grades in the course. With the tool successfully implemented in the majority of computer science classrooms, one would expect an increase in caught cheating cases; this was not the result. Instead, the study determined that plagiarism detection software made computer science students less likely to cheat for fear of getting caught.

JPlag
JPlag is a code plagiarism detection tool created by Prechelt et al. to analyze whether Java, Scheme, C, or C++ programs contain similar code across multiple assignment submissions [26]. The software builds on the earlier tool YAP3 [27] but uses new optimizations to improve JPlag's speed. JPlag hosts a website with separate user accounts to handle queries: a set of assignment submissions is uploaded, and the submissions are then compared pairwise. JPlag splits each program into token strings and uses the "Greedy String Tiling" algorithm to find similar code.
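A much-simplified sketch of Greedy String Tiling over token lists follows; JPlag's real implementation adds the speed optimizations mentioned above, and this quadratic toy version only illustrates the matching idea.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Simplified Greedy String Tiling over two token lists: repeatedly
    take the longest common run of unmarked tokens until every
    remaining run is shorter than min_match."""
    marked_a, marked_b = set(), set()
    tiles = []
    while True:
        best_len, best = min_match - 1, []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and i + k not in marked_a and j + k not in marked_b):
                    k += 1
                if k > best_len:
                    best_len, best = k, [(i, j)]
                elif k == best_len and k >= min_match:
                    best.append((i, j))
        if best_len < min_match:
            break
        for i, j in best:
            # Skip candidates that now overlap a tile marked this pass.
            if any(i + x in marked_a or j + x in marked_b
                   for x in range(best_len)):
                continue
            marked_a.update(range(i, i + best_len))
            marked_b.update(range(j, j + best_len))
            tiles.append((i, j, best_len))
    return tiles

def similarity(a, b):
    """Fraction of tokens covered by shared tiles (0.0 to 1.0)."""
    covered = sum(length for _, _, length in greedy_string_tiling(a, b))
    return 2 * covered / (len(a) + len(b))

print(similarity(["int", "x", "=", "0", ";", "return", "x", ";"],
                 ["int", "y", "=", "0", ";", "return", "y", ";"]))  # 0.5
```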
Our implementation differs from other plagiarism software in that it does not compare structural similarity between programs.

Programming Techniques
When it is time to start programming, there is consensus in the computer science community that there are good and bad ways to write code [28,29,30,31].
The goal of adopting a programming method is to avoid becoming a "spaghetti code programmer" [30]: a programmer who throws everything at a program and sees what sticks, producing code that is difficult to follow and lacks consistency between sections of the same program. Palomba et al. created a "smell" detector to sense malpractices like this that can cause problems later on. Such technologies already existed, but the authors set out to build a better system while analyzing what already exists. By examining change history in combination with the final file, more smells could be identified than with previous technologies [29]; change history was the key addition that enhanced the software's performance. This smell test is similar in spirit to our software, which identifies students who need help with coursework.

Programming Order of Operations
When teaching and learning how to program, the work can be divided into stages such as "what to do," "how to do," and "what to show" [32]. The "what to do" step can be worked out using pseudocode to plan the program's function; this preparatory step works from what the assignment instructs and starts the process of converting pseudocode to real code. The "how to do" step focuses on the back end of the software that performs operations. Finally, the output of the code is managed and displayed to the user in the "what to show" step.
Commenting code during the programming process helps software engineers understand their own code and aids others who need comments for reference in figuring out what the code is doing [33,34]. Currently, this commenting step is frequently skipped or forgotten. Commenting is something students need to start practicing in assignments today, as it is desirable in the software engineering industry [35]; for maintaining software with many contributors, it is even more crucial [36].
Moving on to the "how to do" step, it is important to add program logic in a tactful, limited way. While over-commenting can be a problem, overusing logic affects the readability of code more severely: if the flow is disrupted frequently, the code becomes less understandable. This practice can lead to writing a program that is "too clever" [31], where it takes too much time for someone to understand what the program is doing, with little improvement in efficiency.
Programming is a team effort and others need to be able to understand what you have written, especially in the classroom setting [37].
Oman and Cook look for ways to use a taxonomy to discuss different programming styles [38]. In our research, we look for which programming methods lead to different grade results. Their study acknowledges the importance of programming style but argues that the vocabulary needs improvement before the topic can be discussed properly.

Programming Method
When applying an "extreme apprenticeship method," a study found that specific practices provided a better experience for the participating students. An extreme apprenticeship method of teaching, also referred to as work-based learning, pairs education directly with the workplace: a student learns a concept and then quickly uses that skill in a company or real-world setting. Practices such as starting early, setting small goals, and giving assignments with real-world examples were stressed when holding exercise sessions [28]. The study concluded that teaching students the practices learned from mentors, combined with continuous feedback and strong scaffolding, leads to higher pass rates in introductory programming courses.

Design and Methodology

Student Metrics
Instructors providing formative feedback on programming techniques have traditionally required one-on-one sessions in which a student explains how they wrote their code. Critiquing a student's efforts and programming style this way takes a lot of time and does not occur in the vast majority of computer science courses because the time burden is prohibitive, all the more so in the large computer science courses that are now common. One of the primary contributions of the PAPI project is to reduce assignment review time drastically; this time reduction was analyzed and compared in this project.

Plagiarism
Plagiarism detection is another feature of PAPI. When a student finds and uses code from the internet or from another student, we assume the rate of text entry will be significantly higher than in sections of original coding by the student. PAPI's accuracy in identifying copy-and-pasted code is measured as the percentage of correct identifications, false positives, and false negatives. PAPI can flag students by identifying the pasted text to help in cheating cases.

Approach Metrics
PAPI works by recording student keystrokes while they program, helping the instructor and the student identify patterns in the student's programming style. PAPI collects many different forms of metadata and metrics.
One form of metadata collected is the start time of the assignment in relation to the due date. This data lets an instructor see when the student started the assignment, for example, whether they started when it was first assigned or on the day it was due. This could identify and stress the importance of starting an assignment early.
Another form of metadata is the length of programming sessions. The amount of time a student puts into a piece of work is assumed to correlate with how well they perform, and a student who puts less time into an assignment should be flagged so their effort can be checked. Programming speed can also be extracted from these measurements when generating a student's total programming time.
The number of programming sessions can help indicate a student's programming method: a student may program in many short sessions or a few long ones, and can use this information to identify which work strategy leads to a higher grade on an assignment. If a large gap of time is found between text insertions, PAPI counts these as two separate work sessions.
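As an illustration of this session rule, a minimal sketch follows; the 30-minute gap threshold and the (timestamp, characters) shape of the edit log are assumptions for the example, not PAPI's documented values.

```python
from datetime import timedelta

# Assumed gap threshold: edits separated by more than this start a new session.
SESSION_GAP = timedelta(minutes=30)

def split_sessions(edits):
    """Group (timestamp, characters_inserted) edits into work sessions,
    starting a new session whenever the gap since the previous edit
    exceeds SESSION_GAP."""
    sessions = []
    for timestamp, chars in sorted(edits):
        if sessions and timestamp - sessions[-1][-1][0] <= SESSION_GAP:
            sessions[-1].append((timestamp, chars))
        else:
            sessions.append([(timestamp, chars)])
    return sessions

def total_time(sessions):
    """Total programming time as the sum of each session's span."""
    return sum((s[-1][0] - s[0][0] for s in sessions), timedelta())
```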
Commenting code is traditionally a graded item in early computer science courses, and it is metadata PAPI can capture. Teaching staff normally open student files to check and grade whether the student commented their program; automatically detecting whether and how much a student comments can relieve graders of an item they already check. PAPI uses regular expressions to scan for comment frequency in the last version of a student file.
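A minimal sketch of such a scan, assuming C-style source (CS50 courses teach C); the exact expressions PAPI uses are not reproduced here, and these simplified patterns ignore corner cases such as comment markers inside string literals.

```python
import re

# Simplified C-style comment patterns (illustrative, not PAPI's own).
LINE_COMMENT = re.compile(r"//[^\n]*")
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

def count_comments(source):
    """Count line and block comments in the final version of a file."""
    return (len(LINE_COMMENT.findall(source))
            + len(BLOCK_COMMENT.findall(source)))

print(count_comments("/* header */\nint main(void) {\n    // TODO\n}\n"))  # 2
```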
The number of saves is also metadata recorded by PAPI to provide more insight into the programming process. A teacher may recommend that students compile and run their code often. While PAPI cannot check when code is compiled, the number of save points can be counted, which helps indicate which students may not be compiling their code often.
The number of character deletions and insertions is metadata that can help indicate how a student is working. A large number of deletions could signal that a student struggled on a section. With the University of Rhode Island endorsing the idea of a growth mindset, this struggle and failure were measured to look for a possible relationship.
The order in which code elements are written can help indicate when a student is working on different stages of programming. A student may start by writing pseudocode for an assignment; a student might also not produce output until the last step. Both of these can reduce the efficiency of working on a programming assignment. PAPI looks at common strings used at each of these steps to identify where different types of work are being performed.
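The specific strings PAPI matches are not listed in this chapter; the sketch below illustrates the idea with made-up marker lists.

```python
# Illustrative marker strings for each stage of work; PAPI's actual
# string lists are not reproduced here.
STAGE_MARKERS = {
    "what to do": ["TODO", "pseudocode"],              # planning comments
    "how to do": ["if ", "for ", "while ", "return"],  # program logic
    "what to show": ["printf", "print("],              # output code
}

def stages_touched(edit_text):
    """Return which stages of work a single inserted chunk touches."""
    return [stage for stage, markers in STAGE_MARKERS.items()
            if any(marker in edit_text for marker in markers)]

print(stages_touched("// TODO: loop over input\nfor (int i = 0; i < n; i++)"))
# ['what to do', 'how to do']
```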

Implementation
This section discusses the process of collecting example student work to test the PAPI software.

How to Measure Students' Work
My data collection was done in URI's CSC211 course in the Spring of 2020, instructed by Michael Conti. Students do most of their programming in the CS50IDE in this course. The CS50IDE keeps a history of a student's document progress so the student can go back into the save history to undo edits, similar to the version control found in Microsoft Word, Apple Pages, and Google Docs.

Getting Students' History
To make the history accessible, the CS50IDE Docker image was downloaded and installed to locate the history file, which I found in an SQLite database. To access students' files remotely, I used a built-in operation called "sharing your workspace." With this action, a user can share their CS50 instance using the cloud version of CS50, authenticated by GitHub.

Figure 3. A screenshot of the database file provided by the CS50IDE software.

Testing the Scale
While accessing the history of one account was straightforward, a larger trial group was needed to test the practicality of implementing "sharing your workspace" across an entire classroom. I gathered a group of five computer science students and instructed them to share their accounts with my research account. The main account successfully received every request, emailing the corresponding address with information on how to connect to the new student environment. Opening every account confirmed successful access for each student; each environment shared the student's files as well as the back-end CS50IDE hidden files. None of the test group members participated in the course pilot.

Deploying
After completing these tests, the project's focus transitioned to the target audience. CSC 211, a beginner object-oriented programming course, was used to implement this logging. This group was chosen because the course uses the desired IDE and is an introductory 200-level course; by selecting a less experienced class, we hoped to find a good range of student programming techniques. The class received a form to provide consent and voluntarily sign up for the research. Students were not given any incentive to participate, and their data was to be anonymized after being attached to assignment grades. Course instructors and TAs could not access the data until after course grades were finalized. Students in CSC 212, a data structures and abstractions course, were also invited to participate, although that course did not support the CS50IDE.
Twenty-nine students signed up to participate in the pilot study.

Analyzing the Data
The data from the student assignments was analyzed using Python scripts I created. Using Python's built-in SQLite package, the raw database was converted into readable data. I converted the database into a simpler dictionary data structure for easier manipulation in Python; this also prevented changes to the original file. Personally identifiable information was removed at this stage, including private hidden files, other shared accounts, and chat messages.
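The conversion described above might look like the following sketch using Python's built-in sqlite3 module; the table name `revisions` is illustrative, as the actual CS50IDE schema is not reproduced in this chapter.

```python
import sqlite3

def load_history(db_path):
    """Read the raw history database into plain dictionaries so the
    original file is never modified downstream."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows become dict-like
    try:
        # 'revisions' is an illustrative table name, not the real schema.
        rows = conn.execute("SELECT * FROM revisions").fetchall()
        return [dict(row) for row in rows]
    finally:
        conn.close()
```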

Assignment Downloads
After each assignment was due, I performed another download of the students' history files. This allowed for more specific analysis, so that calculations could be made over a project's assigned time span. The files were named with the students' GitHub usernames for identification, and this information was removed once the grades for that assignment had been received. Downloads occurred three days after an assignment was due to capture possible late submissions.

Building PAPI

Study Design
I conducted interviews with 15 students and instructors with programming experience and knowledge of computer science concepts.

Interview Style
Semi-structured interviews were held so that a conversation-style survey could be conducted with participants. Removing the formality of other surveying methods allowed for a discussion of PAPI's implementation in which users could express their ideas freely. Questions could be answered in the way each participant best understood, and opinions could be provided accurately.

Interview Members
To create the PAPI software, I first held one-on-one interviews with members of the University of Rhode Island community. This group consisted of current, past, and future instructors, including lecturers and teaching assistants, the potential target users of PAPI when deployed. Each interview took about twenty minutes, with participants answering questions about who they are and what they would expect PAPI to be capable of. Participants were not told specifics until they had brought forward their own ideas about what types of student data they might be interested in receiving. PAPI's front end and operations were then built to match their specifications.

Participant Recruitment
Students and teaching assistants were recruited via university channels, and more individuals were added from this original group by snowball sampling. To be eligible, participants had to be at least 19 years old and have some programming experience in any programming language. Teaching assistant (TA) experience was preferred but not required. There was no compensation for participating. To sign up for an interview, an individual would email the student investigator to schedule a time free in their weekly schedule.

Interview Design
Interviews were held using the video conferencing software Zoom, which allows participants to join by traditional or internet calls. To keep all interviews in the same format, the video portion of the software was disabled for the duration of each interview. The interviews were originally to take place in person but were moved online due to social-distancing rules. Each interview was divided into four parts and took about 30 minutes (21 to 52 minutes). Participants were allowed to ask questions throughout the process, and each interview concluded after any follow-up questions were addressed by the interviewer.
The questioning started vaguely and became progressively more specific in each section. The opening questions were meant to collect participants' own ideas: the goal was to give minimal instruction so that participants might generate ideas we had not previously considered, broadening the scope of the PAPI software for a diverse audience. While brainstorming, I was aware of limits in the CS50 software that might rule out some user-desired features. Because participants had limited understanding of the CS50IDE and PAPI, they were able to think more abstractly.
The interview process continued by educating participants on most concepts of PAPI and the features we wanted to implement. This step allowed them to think of related ideas and to rate our proposals.

Participant Demographics
The survey totaled 15 individuals (8 male and 7 female). Ages ranged from 20 to 58 years, with an average of 25.4. Majors were 60% Computer Science, 33% Engineering, and 7% Writing and Rhetoric. The largest group of participants (6/15) were TAs, with an average of four years of experience; there were also four research assistants, three students, and two instructors.

Figure 4. A screenshot of PAPI's homepage. From this page the user can choose to enter either student or instructor mode.

Creating the Software
After learning what the user base was interested in, we began writing the PAPI software. The software can be found online at https://github.com/DanielGauthier8/PAPI and was written by Benjamin Dahrooge and me. The PAPI back end was written in Python3, with HTML as the front end and Flask as the web server gateway interface (WSGI). The project uses GitHub for version control and for managing collaboration. The interface was first written to allow the upload of a single database file, then extended to extract specific measurements and metadata to provide to the instructor; a mode for processing multiple files was added after single-student analysis was implemented. More screenshots of the software can be found in Appendices A and B.

Data Parsing
The students' assignment database files were uploaded into PAPI on a per-student, per-assignment basis. I recorded the data from the PAPI software and replaced student usernames with the grades received on the assignment; names were stripped from the recorded data once grade information was added, in accordance with the participant consent form. After removing empty student submissions, 56 assignments remained and were used going forward. Comparisons were then made between grades and programming patterns, with student data averaged within letter-grade ranges; this analysis led to the results that follow. Because each downloaded database reports cumulative totals, the data were corrected by subtracting prior data points from each cumulative recording, so that every data point is associated with a single assignment.
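A sketch of that cumulative correction, with illustrative field names:

```python
def per_assignment(snapshots):
    """Convert a chronological list of cumulative snapshots, e.g.
    {'insertions': 9120, 'deletions': 3044}, into per-assignment deltas
    by subtracting each snapshot's totals from the next snapshot's."""
    deltas, previous = [], {}
    for snapshot in snapshots:
        deltas.append({key: value - previous.get(key, 0)
                       for key, value in snapshot.items()})
        previous = snapshot
    return deltas
```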
To validate PAPI's measurements, the data to be measured were first recorded manually and then compared to the software's output. The date metric was checked first for basic data retrieval. The software does not have to do any data parsing here; it only performs a simple lookup in the file's database, and all start and end dates were retrieved successfully (Table 2).
Next, data that must be derived from other data were checked. The number of comments, time worked, number of sessions, and number of saves were manually recorded by a researcher in Table 3 by journaling the number of sessions, their timing, and the number of saves while programming. The results from PAPI matched most of these recordings with minor error; this data can be found in Table 4. The number of sessions and number of saves matched the manually recorded data exactly. Total time worked was slightly lower on average for the PAPI software, about 4.5% less than the time manually recorded. I assume this pattern arises because someone working on an assignment spends time at the beginning and end of each work session not making any file edits.

Plagiarism Indicator
Plagiarism indication is built into student mode as well as the class export function to catch large copy-and-paste instances. The indicator provides the file name, locates the pasted text, and displays the text identified as pasted. PAPI starts searching for pasted text after the first text insertion, which is assumed to be the assignment instructions or starter code, and continues through the most recent edit. Files written outside the IDE and uploaded are automatically considered an approved outside resource; PAPI starts looking for pasted code in all edits after such an upload. If a student pastes in another student's work and submits it as their own, existing cheating-identification software would catch it, so the PAPI software ignores this form of plagiarism. To identify work as copied, PAPI calculates the average amount of text inserted per database entry; because the CS50IDE saves at a consistent rate, outliers can be identified. If a student has a piece of text entered at over 8 times their usual cadence, they are flagged for possible cheating and the text is shown on PAPI's dashboard. PAPI also flags entries with text insertions above 400 characters, which catches files with too few text insertions to establish a cadence, or files whose text was pasted entirely at high speed.
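A sketch of the flagging rule using the thresholds just described (8x the average insertion size, or 400 characters outright); the list-of-strings edit log is an assumption for the example.

```python
def flag_pastes(insertions, cadence_factor=8, hard_limit=400):
    """insertions: inserted-text strings, one per database entry, in
    chronological order. The first entry (assumed to be instructions or
    starter code) is skipped; an entry is flagged if it exceeds
    cadence_factor times the average insertion size, or hard_limit
    characters outright."""
    body = insertions[1:]  # skip the first insertion
    if not body:
        return []
    average = sum(len(text) for text in body) / len(body)
    return [text for text in body
            if len(text) > cadence_factor * average or len(text) > hard_limit]
```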
PAPI was tested with multiple example student files containing copy-and-pastes of assorted lengths; the results can be viewed in Table 5. In our tests, all pastes were detected, ranging in length from 50 to 3,260 characters. PAPI was also run on assignments without any copying and pasting: as seen in Table 6, none of the assignments written out by hand had large text insertions detected. Pasted first comments in one test run were also properly ignored, as the PAPI software skips the first text insertion. False positives and false negatives were not observed in either of these tests, although a larger test group will be needed to find edge cases.

No Workload Increase
The generation of student feedback is immediate in the majority of situations using PAPI. In student mode, generation always takes under one second, typically around 400 milliseconds; see Table 7 for all time measurements. To represent the largest a database might get, these values were recorded with the 10 largest databases of the semester. Both student and class mode can be operated in under seven clicks: student mode takes six clicks in total and class mode five. PAPI's checks take place in one all-in-one piece of software without the need to look over any specific code, so it can act as a tool for new TAs who are unfamiliar with helping other students as well as a center of knowledge for those with more experience. By comparison, sitting down with a single student during office hours can take anywhere from about five minutes to an entire session of around three hours, during which insight into the student's code can be gathered and specific comments provided. Even just asking a student their total time spent on an assignment would take an estimated five minutes at minimum, with a large loss in accuracy. The PAPI software provides many added benefits without the time cost traditionally required.

Have a Wide Impact
The possible impact of PAPI is large at both the university scale and the national level. The number of students PAPI could reach was estimated by looking at the number of students and classrooms that use the CS50IDE; to have a wide impact, PAPI was written to work with the Cloud9-based CS50IDE. The CS50IDE is free and requires only a GitHub account to sign up. It is currently used in the University of Rhode Island's sophomore curriculum for the object-oriented programming and data structures classes, which traditionally have at least 100 students registered. By teaching students this IDE early in their schooling, they can continue using it in later classes such as Computer Organization, Operating Systems and Networks, and Design and Analysis of Algorithms. At a minimum, assuming students do not continue using the IDE after the two required courses, PAPI can reach the goal of 500 students in only a year and a half. The CS50IDE is similarly popular nationwide: according to its systems administrator, the most recent version of the software has 150,000+ users [2], and the tool can be provided to these users with no changes to their workflow. With this number of current users, I would also be able to meet my goal of 10,000 users nationally in about a month; the last version came out July 5, 2019, a year before this reading was taken, which works out to roughly 15,000 new users a month [3].

Software Results
Results were first obtained from the Google Form filled out by the interviewer, found in Appendix C. Upon completion of all interviews, the results were compiled in a Google Sheet in order to code each participant response. The coding process worked by looking through the responses to each question and extracting the main ideas. Responses were not limited in the number of codes they could receive, getting one to five codes each; if several participants mentioned a subtopic of a code, the response received both the overarching code and the subtopic code. Any unrelated information a participant provided was placed in its own category as a list of notes to keep in mind while developing PAPI. For the purposes of this paper, the participants are referred to as P1 through P15.

Previous Experience
To understand each participant's background with software grading, questions were asked about previous interactions with auto-graders and grade-book programs.
The two groups that participated were students and instructors. Of the individuals interviewed, 86.7% had experience with software grading, and 40% had experience as a grader in these systems. This is desirable, as the target audience should generally have prior experience, and having both grading and student experience helps diversify the answers. Opinions related to auto-grading that four or more people mentioned can be viewed in Table 8.

Current Opinions
As shown in Table 9, the most-liked aspect of auto-graders was the speed of receiving feedback, mentioned by 60% of participants. The next most-cited positive of software grading was that it can provide better feedback for students than human graders can. Students mentioned that while software-grading feedback is traditionally vague, it is typically more itemized than some professors' grading schemes, letting students see exactly what was wrong, item by item. TAs mentioned a similar improvement: they can grade more assignments in less time when a large portion is auto-graded, allowing them to spend more time critiquing those who scored poorly.
Issues with software grading consist mostly of assignments being marked wrong for poor reasons, a problem mentioned in different forms by 53% of participants. Auto-graders tend to require very specific program output and cannot provide partial credit when it is not met. Sometimes instructors write auto-graders incorrectly, and students are then told they did something wrong even when their work was correct. These issues are compounded when vague feedback, or no feedback at all, is provided as to why an answer is incorrect. The key weakness is when an issue occurs and the student is not advised on how to fix it: while itemized points are considered a positive, when things go wrong and a TA is not there to provide further information, students are simply left frustrated with the current implementation. It should also be noted that the student receives no overall feedback on how they worked on the assignment as a whole with the current systems.
It is useful to understand where the PAPI software fits into these gaps. The main purpose of our software is to provide formative feedback, which those who use auto-graders consider a largely missing element.
A common theme across these conversations was the relationship between the grader and the student. Approval or disapproval of software grading depended heavily on how important participants considered this relationship; this was not a separate question but came up in many of the discussions. Dislike of software grading seemed to stem from the idea that when an instructor grades an assignment, a necessary relationship is built between teacher and student, through which the teacher learns where students are struggling and succeeding. Participants who liked software grading attributed their approval to the absence of this relationship, allowing for a more even and fair grading system. P9 mentioned that "I've had TA's make mistakes in the past," going on to explain how TAs can bias grading without even realizing it. Software grading grades every student with the same reasoning every time, giving each student a completely equal opportunity to succeed.
In regard to current software interfaces, the pattern of poor feedback came up again: 40% said that feedback, when supplied, is displayed poorly, with current systems making it difficult to find when looking through grades. With current changes to the University of Rhode Island's LMS, it will be interesting to see whether this opinion persists. The second complaint, raised in 27% of the interviews, was poor interoperability between grading software.
While the first iteration of PAPI is stand-alone, future work involves integrating it directly inside an IDE; I kept this in mind by giving PAPI an export function. While the front end may change over time, the back end is written in Python for easy re-implementation in the future.

Programming Skills

Participant Ideas
When analyzing the students' key-logging information, I wanted to make sure I was capturing statistics that graders and students would find useful. I started by asking participants what students should do to ensure positive results on an assignment, to see whether I could automate checks for each of their ideas. All responses are summarized in Table 10. Below I mention some responses that could be integrated into PAPI, and how.
The idea of planning out an assignment before starting was brought up by 80% of participants as an item to make a top priority. PAPI can implement a check like this by measuring when the student is writing out their pseudocode and comments: comments written at the beginning of the process can be weighted more heavily than those written at the end. I also put a focus on how much pseudocode students write, assuming that more pseudocode is better for retaining concepts.

It was also important to analyze "incrementally building" and "starting early and often," two measurements that can be detected in similar ways. Noting the number of times a student works on an assignment and how much they add to the program each time indicates whether they are dividing their work well; a student who starts as early as possible and has many separate sessions with equally divided work would receive a perfect score for this measurement.

Another overarching and measurable concept is the idea that a "good" programmer does not cheat. Cheating can be identified by looking for the copying and pasting of large amounts of code, detected when the keylogger records a row with a very large amount of text. The logging software records input roughly once every three seconds, while the fastest English typist types 216 words per minute, about 3.6 words per second, or roughly 10.8 words per three-second collection. If more than 10 words appear in one data collection, the student is flagged for cheating. This threshold will be adjusted as necessary with future research, but it illustrates a rate of text input not achievable by a human using a keyboard.
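As a worked check of that arithmetic, a minimal sketch; the 3-second sampling rate and 216-wpm figure come from the paragraph above, and the function name is illustrative.

```python
# Fastest recorded English typing is 216 words per minute, i.e. 3.6
# words per second; the logger samples roughly every 3 seconds, so no
# human should produce more than ~10.8 words per collected row.
FASTEST_WPM = 216
SAMPLE_SECONDS = 3
HUMAN_LIMIT = FASTEST_WPM / 60 * SAMPLE_SECONDS  # 10.8 words

def flag_row(words_in_row, threshold=10):
    """Flag a collected row whose word count exceeds the threshold
    (10 in the text above, just under the 10.8-word human limit)."""
    return words_in_row > threshold

print(HUMAN_LIMIT)   # 10.8
print(flag_row(12))  # True: faster than any human typist
```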
One interesting topic brought up by some participants, and an idea I had not previously considered, was how students handle problems they encounter while programming. With ideas of growth mindset stressed at the University of Rhode Island, it is important to understand how students handle failure; a burst of deletions followed by fresh insertions, for example, would show the student taking action to fix a mistake.

Rating Ideas
After getting participants' ideas on ways student work can be measured, I had them rate hypothetical programming practices. Of the concepts mentioned above, participants most liked the idea of detecting cheating, with 60% viewing it as the most important identifier. While this is not the primary intention of PAPI, many saw it as its most desirable feature. P15 mentioned that "[it's] easy to defeat current cheating technology, even the good ones." Previously this was not a highly prioritized item, but after hearing this feedback the feature was given more attention. As acknowledged earlier, assignments can only be written in so many ways, and detecting copying and pasting can help differentiate false positives from other deployed systems. Red-flagging students suspected of cheating and then having a conversation can stop the bad habit before it becomes an issue.
Participants rated plain metrics on a Likert scale for collection and presentation to the grader. Options that received an average rating of 3.5 or higher were the first to be implemented in PAPI; these ratings can be found in Table 11.
These data points make sense as elements that are useful on their own: metrics such as time spent on an assignment and start date can indicate to a professor the workload they are assigning to their students. As mentioned previously, it was not predicted how highly copy-and-paste detection would rate in the Likert-scale question.
The "number of saves" metric was not deemed important, receiving an average of 2.2. This contrasts with seeing the same data in the context of other measurements, such as "what the student does when they encounter a difficulty." We rationalized that when rating simple metrics such as the number of saves, it is difficult to appreciate their more abstract uses; if an item proved to have worth in the "participant ideas" portion, the data point was still implemented in PAPI, so the number of saves was added to the list of implemented ideas. Items like the number of file creations, the number of grammar mistakes, and the naming of each file were expected to score poorly: they do not provide much useful information on their own or within the more abstract ideas from the "participant ideas" portion, and they do not relate to the student's process in a way we deem important for performing well on an assignment. A student can be a bad speller, name an assignment anything they want, and create many files without greatly impacting how the program runs. At this time, these items do not relate to participant ideas or our own and will not be developed for PAPI.
In general, taking these data measurements can be seen as intrusive to students. When asked what information should not be shared with an instructor even if it helps the student, the prevailing opinion was that this data collection is something a student should consent to, not something automatically implemented across a class. For the current study this is not an issue, since students already had to consent to participate. When implementing PAPI in the classroom, the same opt-in style may be recommended and should be looked into further; similar issues are occurring with turnitin.com [4]. For the time being, students have to share their files to allow a teacher to see the data, so consent is assumed to have been provided. Professors should treat this as an opt-in and not a class requirement.
Another topic, mentioned by 47% of participants, was limiting recording to the assignment itself and not allowing the viewing of other windows. This is not an issue, as only text entered in the IDE is logged.

Feedback Action Item
One participant articulated our goals well, saying "teach, do not monitor" (P14). This aligns closely with what PAPI should be used for: a teaching aid, not a 'tattler' system. If a student does not perform well on an assignment and missed an element mentioned above, the teacher can recommend a different method; if a student performs well while missing an element, it is assumed the instructor will let the student continue their current routine.

Software Design
Software design was subdivided into three sections: the application format, the presentation of files, and the presentation of file analysis.

Application Format
One quick but important decision was the application's deployment format. Flexibility was important, as one of our goals was to reach as many people as possible. The grading/TA position is also unique because it is not typically a long-term job: while students graduate, the same professor can teach a class for many years, so there is a lot of TA turnover to keep in mind when creating the PAPI software. I wanted PAPI to be in a familiar format that can be quickly taught to a new grader. In reference to Table 12, 74% thought a website is the best way to achieve this. While there are many deployment options, these were the key deciding factors; with our expectation of using this mechanism matched by the majority of participants marking it as the top option, I decided to build a website. A website is not bound to an operating system and can be accessed anywhere with an internet connection; by running a web service on the grader's computer, similar to Jupyter Notebooks, PAPI can also run locally. By not requiring specific software, the setup process is removed for the average computer user.

Presentation of File(s)
The interview became more abstract when discussing how to display student files to the instructor. The database has a document ID for each file and records it with each user action; looking through this database visually is similar to looking through someone else's computer folders. Because this interface was created from scratch, any recommendation would help with implementation. Although mandating specific formatting of the IDE folders is one way to help, I still wanted to see whether there were additional approaches; the reality will be a hybrid of a good file viewer and some required structure for naming files. For file sorting, participants were told to imagine using another person's computer to find a specific file and were asked which steps they would take and why, as a parallel to traversing the students' database without referring directly to our specific software. The most popular option was sorting by last edit date, followed by a visual directory tree. Given such a clear consensus, PAPI aims to show files in a directory tree with the option to filter by creation date, allowing a swift process when looking at the most recent assignments. Table 13 shows which data source was most desired and at what granularity.

File Analysis
Table 13 shows which data source was most desired and at what granularity. Here, the data source is the number of students analyzed at one time, the duration is the number of selected assignments, and the average rating reflects how many participants voted for the corresponding data source and duration. The participants were given options along two axes: all students to one student, and all assignments to one assignment. Choices clustered at either side of this matrix. Participants thought both the class as a whole and individual students should be selectable, meaning there was no preference for deploying only class data or only student data. When analyzing the class as a whole, participants deemed looking over every assignment more important than one assignment at a time. When selecting one student, they favored both all assignments and a single assignment. Options for choosing a small group of students were not rated as highly.
In terms of implementation, these results are the best case. Having to select each student and then their specific assignment would have been challenging to implement in an easy-to-understand way, since the grader would have to select each student's assignment one at a time. By analyzing the entire class as a whole, the teacher can look at the class's overall trendline across all assignments. This means they can see when students start working on new assignments and when they are putting in the most effort, which can help emphasize to students to start sooner or prompt the teacher to change assignment deadlines. The same view is useful for one student over a semester: if a student performed poorly on all assignments, there might be a clear pattern between projects, while if a student performed poorly on one assignment, they might have used an unproductive technique. With this new data, I know PAPI should support both specific student patterns and general class patterns.
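As one way these class and student views could be computed, the sketch below aggregates logged edit events into a daily activity trendline, optionally filtered to one student or one assignment. The DataFrame column names are illustrative assumptions, not PAPI's actual data model.

```python
# A sketch of the class-wide versus per-student trendline idea, assuming
# edit events with hypothetical columns (student, assignment, timestamp),
# where timestamp is a pandas datetime column.
import pandas as pd

def activity_trend(events: pd.DataFrame, student=None, assignment=None):
    """Count edit events per day, optionally filtered to one student
    and/or one assignment; unfiltered gives the class-wide trendline."""
    view = events
    if student is not None:
        view = view[view["student"] == student]
    if assignment is not None:
        view = view[view["assignment"] == assignment]
    return view.groupby(view["timestamp"].dt.date).size()

# Class-wide view across all assignments:
#   activity_trend(events)
# One student over the semester:
#   activity_trend(events, student="student_07")
```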

Implementation
The next step is to implement the ideas I learned from the target audience.
Development time for PAPI before the end product must be delivered is limited, so these discoveries will help create a tiered system of implementation priorities.
Instead of assuming which action items should be deployed in the first launch, I now know what people want implemented and why. The research will also continue the process of analyzing student work, which included documenting previous work in computer science as well as making our own discoveries in the parallel student study. This process will have to repeat for the lifetime of PAPI to keep a pleased user base; staying up to date with current opinions and patterns is clearly a requirement for good software.

In the parallel student study, students with higher grades tended to have fewer text insertions and deletions, though this pattern was not found in students who received a B on the assignment. I assume this occurred because of the low sample size: only two students received a B, making the results inconclusive for that grade range. A pattern was not identified when looking at the difference between deletions and insertions.

Comments and grades were examined to find the impact of commenting code on the project grade. It should be noted that commenting was not an individually graded item on assignments. On average, students who received an A on an assignment commented more than those in any other letter grade. When comparing students who passed and failed an assignment (≥ 65 considered passing), students who passed made, on average, more than double the number of comments of those who failed (passing average: 13.55 comments; failing average: 6.21 comments). This shows that more comments typically accompany a higher assignment grade.

The time at which a student finished working on an assignment, relative to its deadline, also influenced the grade received. On average, students who received a grade over 100 created all necessary files for a project early in their timeline, while those who scored lower created files closer to the submission deadline. The trend continues for the last edit time: working on an assignment closer to the deadline led to a lower grade. On average, students who received over 81 points made their last edit earlier than lower-scoring students, who had not yet created all of the files they needed at that same time. One interesting finding was that students who scored over 100 tended to have all files for an assignment created at least 9 days before it was due. This does not mean the assignment was completed that early, as editing continued to about 3.5 days before the deadline. This matches our assumption that students who start an assignment early and plan its layout most often receive a higher grade than those who do not.

How a student divides their time on an assignment is another topic of interest. Time spent on an assignment did not vary drastically except among students who scored over 101: on average, these students worked on the assignment at least an hour and a half less than any other grade range while still earning extra credit points. This leads us to believe that students who score very high on assignments either use an IDE other than the one recommended for the class or already understand the material and can complete assignments and their extra credit very quickly. Grades of 100 and lower showed no statistically significant pattern. The same observation holds for the number of work sessions and the number of days worked. Overall, there is little to no pattern between grade and time spent working on assignments.
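As a concrete example of the comment counting behind these numbers, the sketch below counts '#' comment tokens in a Python source file using the standard tokenize module. This is one plausible approach, not necessarily how PAPI implements its counter.

```python
# A minimal sketch of a comment counter for Python source files, using the
# standard-library tokenize module; PAPI's actual counter may differ.
import tokenize

def count_comments(path):
    """Count '#' comment tokens in one Python file."""
    with open(path, "rb") as f:
        tokens = tokenize.tokenize(f.readline)
        return sum(1 for tok in tokens if tok.type == tokenize.COMMENT)

# Example on a hypothetical submission file.
print(count_comments("assignment1.py"))
```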

CHAPTER 5
Conclusion
This project developed the PAPI software to provide data on how students approach programming and to support analysis of that data for formative feedback to both students and instructors. PAPI achieves this by measuring time spent on an assignment and providing a copy-and-paste indicator, a comment counter, and a graphic of what students are working on during each programming session. The results from the course pilot found patterns showing that the collected data has probable correlation between grades received on an assignment and several assignment details. In the pilot classroom, the PAPI software indicated that a low number of text insertions and deletions, starting an assignment early, and commenting code pattern toward higher assignment grades. The data also showed that students who receive grades over 100 make much lower use of the course-assigned IDE; these students had noticeably less work time and fewer edits, days worked, and saves. With both community demand for this product and promising patterns in the data, the PAPI software has a clear opportunity for deployment.

CHAPTER 6
Future Work

Software
Like all software, PAPI will need continuous updates to stay operational and relevant to current standards. These updates are required to maintain its security standards and keep pace with its dependencies. Beyond this upkeep, more elements could be added to PAPI. When writing PAPI, elements were implemented in the priority order our interviews indicated, and some lower-priority items are not yet implemented. One example is the option of entering student mode from class mode. Another addition could be instructor settings that change how the website operates on a per-user basis. The PAPI software could also save class patterns for comparison with students in future courses. While PAPI is written as an individual program that runs alongside an IDE, the long-term goal is to have it built directly into an IDE. The environment would then handle generating the database, retrieving the history for the instructor, and parsing the data, removing the requirement to load the history file into separate software.

Research
PAPI is a proof of concept that surfaces different programming techniques and compares them with assignment grades. More time and attention are needed to perform an education and psychology study on which programming methods lead to higher course grades. Our proof of concept needs to be taken to a larger participant group, testing large classes at different course levels to see how programming trends differ throughout the learning process.