Code Overview:
The system starts by loading three key datasets: training logs (containing keystroke data from essays with known scores), training scores (the actual scores for those essays), and test logs (keystroke data from new essays that need scoring).
For each essay, the system extracts behavioral features that describe the writing process. These features include the total number of keystrokes, how long the student spent writing, the average time between actions, and counts of different writing behaviors, such as typing new text, deleting, pasting, or replacing content.
It also computes derived metrics: the ratio of revision activities (deletions, pastes, and replacements) to total actions, the number of long pauses, the final word count, and the typing speed in characters per second. Once these features are extracted, they are used to train a Random Forest machine learning model.
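The feature extraction described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the event field names (`down_time`, `up_time`, `activity`, `word_count`), the activity labels, and the 2-second long-pause threshold are all assumptions, since the overview does not specify them. Typing speed is approximated here as the number of input events per second, on the assumption that each input event adds one character.

```python
from collections import Counter

# Hypothetical event schema (not confirmed by the overview). Each event is a dict with:
#   "down_time" / "up_time": start and end of the action, in milliseconds
#   "activity": "Input", "Remove/Cut", "Paste", or "Replace"
#   "word_count": running word count after the event
REVISION_ACTIVITIES = ("Remove/Cut", "Paste", "Replace")
LONG_PAUSE_MS = 2000  # assumed threshold; the overview does not state one


def extract_features(events):
    """Summarize one essay's (non-empty) list of events as a flat feature dict."""
    events = sorted(events, key=lambda e: e["down_time"])
    n = len(events)
    total_ms = events[-1]["up_time"] - events[0]["down_time"]
    # Gap between the end of one action and the start of the next.
    gaps = [max(0, b["down_time"] - a["up_time"]) for a, b in zip(events, events[1:])]
    counts = Counter(e["activity"] for e in events)
    revisions = sum(counts[a] for a in REVISION_ACTIVITIES)
    return {
        "total_keystrokes": n,
        "total_time_s": total_ms / 1000,
        "mean_gap_ms": sum(gaps) / len(gaps) if gaps else 0.0,
        "n_input": counts["Input"],
        "n_delete": counts["Remove/Cut"],
        "n_paste": counts["Paste"],
        "n_replace": counts["Replace"],
        "revision_ratio": revisions / n,
        "n_long_pauses": sum(g > LONG_PAUSE_MS for g in gaps),
        "final_word_count": events[-1]["word_count"],
        # Approximation: one input event is treated as one character typed.
        "chars_per_second": counts["Input"] / (total_ms / 1000) if total_ms > 0 else 0.0,
    }


# A tiny made-up log: three inputs and one deletion after a long pause.
sample = [
    {"down_time": 0, "up_time": 100, "activity": "Input", "word_count": 0},
    {"down_time": 300, "up_time": 400, "activity": "Input", "word_count": 1},
    {"down_time": 3000, "up_time": 3100, "activity": "Remove/Cut", "word_count": 1},
    {"down_time": 3200, "up_time": 4000, "activity": "Input", "word_count": 2},
]
features = extract_features(sample)
```

Applying `extract_features` to each essay's events would give one row per essay, which is the tabular shape a Random Forest expects.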
This model learns patterns between these writing behaviors and essay scores from the training data. After training, the model can predict scores for new essays based solely on how they were written, without ever reading the actual content. A Random Forest builds many decision trees and combines their outputs to produce a final score prediction. The approach rests on the premise that the way a student writes might be just as indicative of essay quality as the content itself.
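The training and prediction steps can be sketched with scikit-learn. This example makes several assumptions: it uses `RandomForestRegressor` (the overview does not say whether scoring is treated as regression or classification), the feature matrices are synthetic random data standing in for the extracted features, and the hyperparameters are illustrative rather than the project's own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the extracted features: one row per essay, one column per feature.
X_train = rng.random((200, 5))
# Synthetic scores on an arbitrary scale, loosely tied to the first feature.
y_train = 1.0 + 4.0 * X_train[:, 0] + rng.normal(0, 0.3, 200)
X_test = rng.random((10, 5))

# Illustrative hyperparameters; the overview does not specify any.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# For a regression forest, the prediction is the mean of the individual trees' predictions.
per_tree = np.stack([tree.predict(X_test) for tree in model.estimators_])
tree_average = per_tree.mean(axis=0)
```

Here `tree_average` matches `predictions`, which is the "combining their outputs" step: each tree votes with its own estimate and the forest reports the average.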