EduStudio

In EduStudio, we view the dataset as three statuses: rawdata, middata, cachedata.

inconsistent rawdata: the original data format provided by the dataset publisher.
standardized middata: the standardized middle data format defined by EduStudio.
model-friendly cachedata: the data format that is convenient for model usage.

Dataset Folder Format Example

All datasets are required to store in a unified folder. The example below illustrated the FrcSub dataset folder format.

data/
├── FrcSub
│   ├── cachedata
│   │   ├── FrcSub_five_fold
│   │   │   ├── datatpl_cfg.json
│   │   │   ├── df_exer.pkl
│   │   │   ├── df_stu.pkl
│   │   │   ├── dict_test_folds.pkl
│   │   │   ├── dict_train_folds.pkl
│   │   │   ├── dict_valid_folds.pkl
│   │   │   └── final_kwargs.pkl
│   │   └── FrcSub_one_fold
│   │       ├── datatpl_cfg.json
│   │       ├── df_exer.pkl
│   │       ├── df_stu.pkl
│   │       ├── dict_test_folds.pkl
│   │       ├── dict_train_folds.pkl
│   │       ├── dict_valid_folds.pkl
│   │       └── final_kwargs.pkl
│   ├── middata
│   │   ├── FrcSub.exer.csv
│   │   └── FrcSub.inter.csv
│   └── rawdata
│       ├── data.txt
│       ├── problemdesc.txt
│       ├── qnames.txt
│       └── q.txt

Dataset Stage protocol

The rawdata folder stores raw data files. For different datasets, they have different raw data file format.

The middata folder stores middle data files. For existing datasets, we provide the atomic operation inheriting the protocol class BaseRaw2Mid, which process raw data to middle data. The middle is required in atomic file protocol.

With the middata of following atomic file protocol, we can implement some other atomic operations inheriting the protocol class BaseMid2Cache to build cachedata from middata.

Example：Atomic Operation Sequence of Data Processing

R2M_FrcSub: process the Frcsub dataset from rawdata to midata
M2C_FilterRecords4CD：Filter students or exercises whose number of interaction records is less than a threshold
M2C_ReMapId: ReMap feature Id
M2C_RandomDataSplit4CD: Split Datasets
M2C_GenQMat: Generate Q-matrix