Input format
Input to attelo consists of three files, two of which (the pairings and features files) are aligned line by line:
- an EDU input file with one line per discourse unit
- a pairings file with one line per EDU pair
- a features file also with one line per EDU pair
EDU inputs
Each line of the EDU input file has the following fields:
- global id: an identifier used by your application, presumably an arbitrary string (NB: ROOT is a special name: no EDU should be named that, but all EDUs can have ROOT as a potential parent)
- text: mainly for debugging purposes; used by attelo graph to provide a visualisation of parses
- grouping (e.g. file name, dialogue id): EDUs are only ever connected with EDUs in the same group. Folds are also built on the basis of EDU groupings
- subgrouping (e.g. sentence id): any common subunit that can hold multiple EDUs (use the EDU id itself if there is no useful notion of subgrouping). Some decoders may try to treat links between EDUs in the same subgrouping differently from the general case
- span start (int): used by decoders to order EDUs and determine their adjacency
- span end (int): see span start
d1_492 sheep for wood? dialogue_1 sent1 0 15
d1_493 nope, not me dialogue_1 sent2 16 28
d1_494 not me either dialogue_1 sent2 29 42
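The field list above can be sketched as a small reader. This is only an illustration, not attelo's own loader: it assumes the six fields are separated by tabs (the text field contains spaces, so some such delimiter is needed), and the `Edu` and `read_edus` names are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Edu(NamedTuple):
    """One row of the EDU input file (field names are illustrative)."""
    global_id: str
    text: str
    grouping: str
    subgrouping: str
    start: int
    end: int


def read_edus(stream):
    """Parse EDU rows, assuming six tab-separated columns per line."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    edus = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        edu = Edu(row[0], row[1], row[2], row[3], int(row[4]), int(row[5]))
        if edu.global_id == "ROOT":
            # ROOT is reserved: no EDU may use it as its id
            raise ValueError("ROOT is a reserved name")
        edus.append(edu)
    return edus


sample = (
    "d1_492\tsheep for wood?\tdialogue_1\tsent1\t0\t15\n"
    "d1_493\tnope, not me\tdialogue_1\tsent2\t16\t28\n"
    "d1_494\tnot me either\tdialogue_1\tsent2\t29\t42\n"
)
edus = read_edus(io.StringIO(sample))
# Order EDUs by span start, as a decoder might when working out adjacency
ordered_ids = [e.global_id for e in sorted(edus, key=lambda e: e.start)]
```

Converting the span columns to integers up front makes it easy to sort EDUs by position within a grouping.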
Pairings
The pairings file is a tab-delimited list of (parent, child) pairs, with each element being either an EDU global id (from the EDU inputs) or the distinguished label ROOT. Each row in this file corresponds to a row in the features file
ROOT d1_492
d1_493 d1_492
d1_494 d1_492
ROOT d1_493
d1_492 d1_493
d1_494 d1_493
ROOT d1_494
d1_492 d1_494
d1_493 d1_494
Note that attelo can also accept pairings files with a third column (which it ignores).
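A minimal sketch of how rows like the ones above could be generated for the EDUs of one grouping. The `candidate_pairs` helper is hypothetical and simply reproduces the order of the sample (ROOT first, then every other EDU as a potential parent); whether attelo expects every such pair, or this particular order, is an assumption here.

```python
def candidate_pairs(edu_ids):
    """Yield (parent, child) pairs for the EDUs of a single grouping."""
    for child in edu_ids:
        # Every EDU can have ROOT as a potential parent
        yield ("ROOT", child)
        for parent in edu_ids:
            if parent != child:
                yield (parent, child)


group = ["d1_492", "d1_493", "d1_494"]
pairs = list(candidate_pairs(group))
# One tab-delimited line per pair, as in the pairings file
lines = ["\t".join(pair) for pair in pairs]
```

Since EDUs are only ever connected within the same grouping, the helper would be called once per grouping rather than on the whole corpus.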
Features
Features and labels are supplied in (multiclass) libsvm/svmlight format.
Relation labels
You should supply a single comment at the very beginning of the file, which attelo can use to associate relation labels with string values:
# labels: <space delimited list of labels>
The label ‘UNRELATED’ must exist and be used for any EDU pairs which are not related/attached. For example, in the sample below, the second and fourth EDU pairs are not considered to be related
# labels: elaboration narration continuation UNRELATED ROOT
1 1:1 2:1
4 1:2
2 1:3 3:1
4 1:1
3 1:2
Also, if intersentential learning/decoding is used, the label ‘ROOT’ must exist and be used for links from the ROOT EDU.
Note that labels are assumed to start from 1.
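To make the numbering concrete, the sketch below reads the labels comment and maps each row's numeric label back to its name, counting from 1. It uses plain string handling rather than any attelo function, and `parse_labels` is a hypothetical helper.

```python
def parse_labels(header):
    """Map 1-based label numbers to the names in the '# labels:' comment."""
    prefix = "# labels:"
    if not header.startswith(prefix):
        raise ValueError("expected a '# labels:' comment on the first line")
    names = header[len(prefix):].split()
    # Labels are assumed to start from 1, not 0
    return {number: name for number, name in enumerate(names, start=1)}


sample = """# labels: elaboration narration continuation UNRELATED ROOT
1 1:1 2:1
4 1:2
2 1:3 3:1
4 1:1
3 1:2
"""
lines = sample.splitlines()
labels = parse_labels(lines[0])
# The first token of each data row is its numeric label
row_labels = [labels[int(line.split()[0])] for line in lines[1:]]
```

Here label 4 resolves to UNRELATED, which is why the second and fourth rows count as unattached pairs.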
Categorical features
Attelo no longer provides direct support for categorical features, that is, features whose possible values are members of a set (e.g. POS tag). You should perform one-hot encoding on any categorical features you have. Luckily, with the svmlight sparse format, this can be done at no additional cost in space, and it also opens the door to more straightforward filtering on your part.
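One way to do the encoding by hand is sketched below: each distinct (feature, value) combination gets its own column, and only the columns that are switched on are written out, which is why the sparse format costs no extra space. The feature names (`pos_first`, `same_speaker`) and the `one_hot_line` helper are made up for illustration.

```python
def one_hot_line(label, categorical, index_of):
    """Build one svmlight line from a dict of categorical features."""
    indices = []
    for name, value in categorical.items():
        key = "{}={}".format(name, value)
        # Allocate the next free 1-based column the first time a key is seen
        if key not in index_of:
            index_of[key] = len(index_of) + 1
        indices.append(index_of[key])
    # svmlight expects feature indices in increasing order
    feats = " ".join("{}:1".format(i) for i in sorted(indices))
    return "{} {}".format(label, feats)


index_of = {}  # shared across rows so columns stay consistent
rows = [
    (1, {"pos_first": "NN", "same_speaker": "yes"}),
    (4, {"pos_first": "VB", "same_speaker": "yes"}),
]
lines = [one_hot_line(label, feats, index_of) for label, feats in rows]
```

Keeping the `index_of` mapping around also gives a readable name for every column, which helps if you later want to filter features.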
Other notes on features
Don’t forget that the order in which rows appear in the features file must correspond to the order in which pairings appear in the pairings file
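A cheap sanity check before running attelo is to confirm that both files have the same number of data rows. The sketch below assumes that comment lines such as the labels header do not count as rows; `count_aligned_rows` is a hypothetical helper, and a matching count does not by itself prove the rows are in the same order.

```python
def count_aligned_rows(pairing_lines, feature_lines):
    """Return the shared row count, or raise if the two files differ."""
    n_pairs = sum(1 for line in pairing_lines if line.strip())
    # Comment lines (such as the '# labels:' header) are not data rows
    n_feats = sum(1 for line in feature_lines
                  if line.strip() and not line.lstrip().startswith("#"))
    if n_pairs != n_feats:
        raise ValueError(
            "{} pairings but {} feature rows".format(n_pairs, n_feats))
    return n_pairs


pairings = ["ROOT\td1_492", "d1_493\td1_492", "d1_494\td1_492"]
features = [
    "# labels: elaboration narration continuation UNRELATED ROOT",
    "1 1:1 2:1",
    "4 1:2",
    "2 1:3 3:1",
]
n_rows = count_aligned_rows(pairings, features)
```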