Middle Data Format Protocol

EduStudio adopts a flexible CSV (Comma-Separated Values) file format, following RecBole. This format is defined in the middata stage of a dataset (see Dataset Stage Protocol for details).

The Middle Data Format Protocol has two parts: the Column Name Format and the Filename Format.

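Column Name Format

Each column header pairs a feature name with its feature type, written as name:feat_type (for example, stu_id:token). The supported feature types are:

feat_type    Explanation                    Examples
token        single discrete feature        exer_id, stu_id
token_seq    discrete feature sequence      knowledge concept sequence of an exercise
float        single continuous feature      label, start_timestamp
float_seq    continuous feature sequence    word2vec embedding of an exercise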
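
As a minimal sketch of how such headers could be interpreted, the snippet below splits a column name into its feature name and feature type and rejects unknown types. The helper name split_column_name and the FEAT_TYPES set are hypothetical and written only for illustration; they are not part of the EduStudio API.

```python
# The four feature types listed in the table above.
FEAT_TYPES = {"token", "token_seq", "float", "float_seq"}


def split_column_name(column: str) -> tuple[str, str]:
    """Split a header such as 'exer_id:token' into ('exer_id', 'token').

    Hypothetical helper for illustration; not an EduStudio function.
    """
    # Split on the last colon so the feature type is always the final part.
    name, sep, feat_type = column.rpartition(":")
    if not sep or not name or feat_type not in FEAT_TYPES:
        raise ValueError(f"malformed column name: {column!r}")
    return name, feat_type


header = ["exer_id:token", "cpt_seq:token_seq", "w2v_emb:float_seq"]
parsed = [split_column_name(c) for c in header]
```

Splitting on the last colon keeps the check simple: a header without a type suffix, or with a type outside the table, raises ValueError instead of being silently accepted.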

Filename Format

EduStudio currently defines the atomic files listed below.

Note: Users can also load other types of data in addition to the atomic files listed below. {dt} is the dataset name.

Filename format         Description
{dt}.inter.csv          Student-exercise interaction data
{dt}.train.inter.csv    Student-exercise interaction data for the training set
{dt}.valid.inter.csv    Student-exercise interaction data for the validation set
{dt}.test.inter.csv     Student-exercise interaction data for the test set
{dt}.stu.csv            Features of students
{dt}.exer.csv           Features of exercises
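
The naming scheme can be sketched as a simple substitution of the dataset name into each pattern. The function atomic_filenames below is a hypothetical helper, not an EduStudio function; it only reproduces the filename patterns listed in the table.

```python
def atomic_filenames(dt: str) -> dict[str, str]:
    """Map each atomic-file kind to its filename for dataset `dt`.

    Hypothetical helper for illustration; the kinds mirror the table above.
    """
    kinds = ["inter", "train.inter", "valid.inter", "test.inter", "stu", "exer"]
    return {kind: f"{dt}.{kind}.csv" for kind in kinds}


names = atomic_filenames("example_dt")
```

For the dataset name example_dt, names["stu"] is "example_dt.stu.csv" and names["train.inter"] is "example_dt.train.inter.csv".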

Example

example_dt.inter.csv

stu_id:token    exer_id:token    label:float
0               1                0.0
1               0                1.0

example_dt.stu.csv

stu_id:token    gender:token    occupation:token
0               1               11
1               0               7

example_dt.exer.csv

exer_id:token    cpt_seq:token_seq    w2v_emb:float_seq
0                [0, 1]               [0.121, 0.123, 0.761]
1                [1, 2, 3]            [0.229, -0.113, 0.138]
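
To show how the type suffixes could drive parsing, here is a rough sketch that reads the exercise example using only the Python standard library. It rests on several assumptions that this document does not confirm: that fields are comma-separated on disk, that sequence fields are quoted and written as bracketed lists, and that token values are integer ids. The names convert and read_middata are hypothetical, and this is not how EduStudio itself loads data.

```python
import csv
import io
import json

# Assumed on-disk form of example_dt.exer.csv (delimiter and quoting are assumptions).
EXER_CSV = '''exer_id:token,cpt_seq:token_seq,w2v_emb:float_seq
0,"[0, 1]","[0.121, 0.123, 0.761]"
1,"[1, 2, 3]","[0.229, -0.113, 0.138]"
'''


def convert(value: str, feat_type: str):
    """Convert one raw cell according to its feature type."""
    if feat_type == "token":
        return int(value)  # assumption: tokens are integer ids
    if feat_type == "float":
        return float(value)
    if feat_type == "token_seq":
        return [int(v) for v in json.loads(value)]
    if feat_type == "float_seq":
        return [float(v) for v in json.loads(value)]
    raise ValueError(f"unknown feat_type: {feat_type!r}")


def read_middata(text: str) -> list[dict]:
    """Parse CSV text whose header cells have the form name:feat_type."""
    reader = csv.reader(io.StringIO(text))
    # Each header cell becomes a [name, feat_type] pair.
    header = [col.rsplit(":", 1) for col in next(reader)]
    return [
        {name: convert(cell, ftype) for (name, ftype), cell in zip(header, row)}
        for row in reader
    ]


rows = read_middata(EXER_CSV)
```

Under these assumptions, each row becomes a dictionary keyed by the bare feature name, with scalars and sequences already converted to Python numbers and lists.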