attelo.parser package

Attelo is essentially a toolkit for producing parsers: parsers are black boxes that take EDUs as inputs and produce graphs as output.

Parsers follow the scikit fit/transform idiom. They are learned from some training data via the fit() function (this usually results in some model that the parser remembers; but a hypothetical purely rule-based parser might have a no-op fit function). Once fitted to the training data, they can be set loose on anything you might want to parse: the transform function will produce graphs from the EDUs.
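The idiom can be illustrated with a minimal sketch. This is not attelo code: the ToyParser class is hypothetical, and plain lists and tuples stand in for datapacks and graphs.

```python
class ToyParser:
    """Hypothetical parser: learns the most frequent training label, then
    attaches each EDU to the next one using that label."""

    def fit(self, edu_lists, target_lists):
        # Count the labels seen in training and remember the most common one
        counts = {}
        for targets in target_lists:
            for label in targets:
                counts[label] = counts.get(label, 0) + 1
        self.best_label_ = max(counts, key=counts.get)
        return self

    def transform(self, edus):
        # Produce a graph as a list of (source, target, label) edges
        return [(edus[i], edus[i + 1], self.best_label_)
                for i in range(len(edus) - 1)]


parser = ToyParser().fit([["e1", "e2", "e3"]],
                         [["elaboration", "elaboration"]])
graph = parser.transform(["a1", "a2", "a3"])
```

A purely rule-based parser would keep the same two methods but leave fit() as a no-op that returns self.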

Submodules

attelo.parser.attach module

A parser that only decides on the attachment task (whether this is directed or not depends on the underlying datapack and decoder). You could also combine this with the label parser.

class attelo.parser.attach.AttachClassifierWrapper(learner_attach)

Bases: attelo.parser.interface.Parser

Parser that extracts attachment weights from an attachment classifier.

This parser is really meant to be used in conjunction with other parsers downstream that make use of these weights.

If you use it in standalone mode, it will just provide the standard unknown prediction everywhere.

Notes

Cache keys

  • attach: attachment model path
fit(dpacks, targets, nonfixed_pairs=None, cache=None)

Extract whatever models or other information from the multipack that is necessary to make the parser operational

Parameters:mpack (MultiPack) –
transform(dpack, nonfixed_pairs=None)
class attelo.parser.attach.AttachPipeline(learner, decoder)

Bases: attelo.parser.pipeline.Pipeline

Parser that performs the attachment task.

Attachments may be directed or undirected depending on the datapack and models.

For the moment, this assumes AD models, but perhaps over time could be generalised to A.D models too.

This can work as a standalone parser: if the datapack is unweighted it will initialise it from the classifier. Also, if there are pre-existing weights, they will be multiplied with the new weights.

Notes

fit() and transform() have a ‘cache’ parameter that is a dict with expected keys:

  • attach: attachment model path

attelo.parser.full module

A ‘full’ parser does the attach, direction, and labelling tasks

class attelo.parser.full.AttachTimesBestLabel

Bases: attelo.parser.interface.Parser

Intermediary parser that adjusts the attachment weight by multiplying the best label weight with it.

This is most useful in the middle of a parsing pipeline: we need something upstream to assign initial attachment and label weights (otherwise we get the default 1.0 everywhere), and something downstream to make predictions (otherwise it’s UNKNOWN everywhere).
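The adjustment itself can be sketched as follows, assuming one attachment weight and one row of label weights per edge. The function name and the plain-list representation are hypothetical and only illustrate the arithmetic, not attelo's implementation.

```python
def attach_times_best_label(attach_weights, label_weights):
    """For each edge, scale the attachment weight by its highest label weight."""
    return [attach * max(labels)
            for attach, labels in zip(attach_weights, label_weights)]


attach = [0.5, 1.0]                    # one attachment weight per edge
label = [[0.25, 0.5], [0.5, 0.25]]     # one row of label weights per edge
adjusted = attach_times_best_label(attach, label)
```

With the default weights of 1.0 everywhere, the result would also be 1.0 everywhere, which is why an upstream step is needed.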

fit(dpacks, targets, nonfixed_pairs=None, cache=None)
transform(dpack, nonfixed_pairs=None)
class attelo.parser.full.JointPipeline(learner_attach, learner_label, decoder)

Bases: attelo.parser.pipeline.Pipeline

Parser that performs attach, direction, and labelling tasks.

For the moment, this assumes AD.L models, but we hope to explore possible generalisations of this idea over time.

In our working shorthand, this would be an AD.L:adl parser, i.e. one that has a separate attach-direct model and label model (AD.L), but which treats decoding as a joint-prediction task.

Notes

fit() and transform() have a ‘cache’ parameter that is a dict with expected keys:

  • ‘attach’: attach model path
  • ‘label’: label model path

class attelo.parser.full.PostlabelPipeline(learner_attach, learner_label, decoder)

Bases: attelo.parser.pipeline.Pipeline

Parser that performs the attachment task (may be directed or undirected depending on datapack and models), and then the labelling task in a second step.

For the moment, this assumes AD models, but perhaps over time could be generalised to A.D models too.

This can work as a standalone parser: if the datapack is unweighted it will initialise it from the classifier. Also, if there are pre-existing weights, they will be multiplied with the new weights.

Notes

fit() and transform() have a ‘cache’ parameter that is a dict with expected keys:

  • ‘attach’: attach model path
  • ‘label’: label model path

attelo.parser.interface module

Basic interface that all parsers should respect

class attelo.parser.interface.Parser

Bases: object

Parsers follow the scikit fit/transform idiom. They are learned from some training data via the fit() function. Once fitted to the training data, they can be set loose on anything you might want to parse: the transform function will produce graphs from the EDUs.

If the learning process is expensive, it would make sense to offer the ability to initialise a parser from a cached model.

static deselect(dpack, idxes)

Common parsing pattern: mark all edges at the given indices as unrelated with attachment score of 0. This should normally exclude them from attachment by a decoder.

Warning: assumes a weighted datapack

This is often a better bet than using something like DataPack.selected because it keeps the unwanted edges in the datapack.
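The pattern can be sketched on a bare list of attachment scores. The toy_deselect function below is hypothetical and is not attelo's implementation; it only shows that the edges stay in place while their scores drop to 0.

```python
def toy_deselect(attach_weights, idxes):
    """Return a copy of the scores with the edges at idxes set to 0.0."""
    excluded = set(idxes)
    return [0.0 if i in excluded else weight
            for i, weight in enumerate(attach_weights)]


weights = [0.5, 0.75, 0.25, 1.0]
pruned = toy_deselect(weights, [1, 3])
```

The result has the same length as the input, so indices into the datapack remain valid for later steps.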

static dzip(fun, dpacks, targets)

Apply a function on each datapack and the corresponding target block

Parameters:
  • fun ((a, b) -> (a, b)) –
  • dpacks ([a]) –
  • targets ([b]) –
Return type:[a], [b]
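One way to read this signature is "zip, apply, unzip". The sketch below is a standalone function written under that assumption (toy_dzip and drop_first are hypothetical names), with lists standing in for datapacks and target blocks.

```python
def toy_dzip(fun, dpacks, targets):
    """Apply fun to each (dpack, target) pair, then unzip the results."""
    pairs = [fun(dpack, target) for dpack, target in zip(dpacks, targets)]
    if not pairs:
        return [], []
    new_dpacks, new_targets = zip(*pairs)
    return list(new_dpacks), list(new_targets)


def drop_first(dpack, target):
    # Hypothetical transformation: discard the first pairing and its label
    return dpack[1:], target[1:]


dpacks, targets = toy_dzip(drop_first,
                           [["p1", "p2"], ["p3", "p4", "p5"]],
                           [[0, 1], [1, 0, 1]])
```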

fit(dpacks, targets, cache=None)

Extract whatever models or other information from the multipack that is necessary to make the parser operational

Parameters:
  • dpacks ([DataPack]) –
  • targets ([array(int)]) – A block of labels for each datapack. Each block should have the same length as its corresponding datapack
  • cache (dict(string, object), optional) –

    Paths to submodels. If set, this dictionary associates submodel names with some sort of cache, typically a filename. The submodel names are arbitrary strings like “attach” or “label” (check the documentation for the parser itself to see what submodels it recognises).

    This usage is necessarily loose. The parser should be prepared to ignore a key if it does not exist in the cache. The typical cache value is a filepath containing a pickle to load or dump, but other objects may sometimes be used depending on the parser (e.g. other caches if it’s a parser that somehow combines other parsers together).
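A parser might treat such a cache along these lines. This is a sketch of one possible reading ("load the pickle if it is there, otherwise fit and dump"); the load_or_fit helper is hypothetical and not part of attelo.

```python
import os
import pickle
import tempfile


def load_or_fit(cache, key, fit_fun):
    """Load a pickled model from cache[key] if it exists; otherwise fit it,
    and save it when the cache gives a path for this key."""
    path = cache.get(key) if cache else None  # missing keys are ignored
    if path is not None and os.path.exists(path):
        with open(path, "rb") as stream:
            return pickle.load(stream)
    model = fit_fun()
    if path is not None:
        with open(path, "wb") as stream:
            pickle.dump(model, stream)
    return model


tmpdir = tempfile.mkdtemp()
cache = {"attach": os.path.join(tmpdir, "attach.pkl")}
first = load_or_fit(cache, "attach", lambda: {"weights": [0.5, 0.25]})
second = load_or_fit(cache, "attach", lambda: {"weights": []})  # loaded from disk
third = load_or_fit(cache, "label", lambda: "fresh")  # no 'label' key: just fit
```

The second call never runs its fit function because the pickle written by the first call is found at the cached path.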

static multiply(dpack, attach=None, label=None)

If the datapack is weighted, multiply its existing probabilities by the given ones; otherwise set them.

Parameters:
  • attach (array(float), optional) – If unset will default to ones
  • label (2D array(float), optional) – If unset will default to ones
Returns:The modified datapack
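The set-or-multiply behaviour can be sketched for attachment weights alone. Here a datapack is modelled as a dict with a 'size' and an 'attach' entry (None when unweighted); this representation and the toy_multiply name are assumptions made for the example.

```python
def toy_multiply(dpack, attach=None):
    """Set the attachment weights of an unweighted pack, or multiply
    the existing ones elementwise by the new ones."""
    if attach is None:
        attach = [1.0] * dpack["size"]  # unset defaults to ones
    if dpack["attach"] is None:
        new_attach = list(attach)
    else:
        new_attach = [old * new for old, new in zip(dpack["attach"], attach)]
    return {"size": dpack["size"], "attach": new_attach}


unweighted = {"size": 2, "attach": None}
once = toy_multiply(unweighted, attach=[0.5, 0.25])   # weights are set
twice = toy_multiply(once, attach=[0.5, 2.0])         # weights are multiplied
```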

static select(dpack, idxes)

Mark any pairs except the ones indicated as unrelated.

See also

Parser.deselect

transform(dpack)

Refine the parse for a single document: given a document and a graph (for the same document), add or remove edges from the graph (mostly remove).

A standalone parser should be able to start from an unweighted datapack (a fully connected graph with all labels equally likely) and pare it down to a much more useful graph with one best label per edge.

Standalone parsers ought to also do something sensible with weighted datapacks (partially instantiated graphs), but in practice they may ignore them.

Not all parsers are necessarily standalone. Some may only be designed to refine already existing parses, or may require further processing.

Parameters:dpack (DataPack) – the graph to refine (can be unweighted for standalone parsers, MUST be weighted for other parsers)
Returns:predictions – the best graph/prediction for this document

(TODO: support n-best)

Return type:DataPack

attelo.parser.intra module

Document-level parsers that first do sentence-level parsing.

An IntraInterParser applies separate parsers on edges within a sentence and then on edges across sentences.

class attelo.parser.intra.FrontierToHeadParser(parsers, sel_inter='inter', verbose=False)

Bases: attelo.parser.intra.IntraInterParser

Intra/inter parser in which sentence recombination consists of parsing with edges from the frontier of sentential subtree to sentence head.

[ ] write and integrate an oracle that replaces lost gold edges (from non-head to head) with the closest alternative; here this probably happens on leaky sentences and I still have to figure out what an oracle should look like.

class attelo.parser.intra.HeadToHeadParser(parsers, sel_inter='inter', verbose=False)

Bases: attelo.parser.intra.IntraInterParser

Intra/inter parser in which sentence recombination consists of parsing with only sentence heads.

[ ] write and integrate an oracle that replaces lost gold edges (from non-head to head) with the closest alternative, here moving edges up the intra subtrees so they link the (recursive) heads of their original nodes.

class attelo.parser.intra.IntraInterPair

Bases: attelo.parser.intra.IntraInterPair

Any pair of the same sort of thing, but with one meant for intra-sentential decoding, and the other meant for inter-sentential decoding.

fmap(fun)

Return the result of applying a function on both intra/inter

Parameters:fun (a -> b) –
Returns:
Return type:IntraInterPair(b)
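A pair with this kind of fmap can be sketched as a namedtuple subclass. The class name ToyIntraInterPair and the field names intra and inter are assumptions made for this example.

```python
from collections import namedtuple


class ToyIntraInterPair(namedtuple("ToyIntraInterPair", "intra inter")):
    """Hypothetical stand-in: a pair of like things, one per decoding level."""

    def fmap(self, fun):
        # Apply fun to both members and wrap the results in a new pair
        return ToyIntraInterPair(intra=fun(self.intra), inter=fun(self.inter))


paths = ToyIntraInterPair(intra="intra.model", inter="inter.model")
upper = paths.fmap(str.upper)
```

The same call works for any pair of like values, for instance a pair of parsers or a pair of cache dictionaries.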
class attelo.parser.intra.IntraInterParser(parsers, sel_inter='inter', verbose=False)

Bases: attelo.parser.interface.Parser

Parser that performs attach, direction, and labelling tasks; but in two phases:

  1. by separately parsing edges within the same sentence
  2. and then combining the results to form a document

This is an abstract class

Notes

Cache keys: same as whatever the included parsers would use. This parser will divide the dictionary into keys that have an ‘intra:’ prefix and keys that do not. The intra-prefixed keys will be passed on to the intrasentential parser (with the prefix stripped). The other keys will be passed on to the intersentential parser.
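The key-splitting rule can be sketched as below. The split_cache helper is hypothetical and only illustrates the prefix handling described in the note.

```python
def split_cache(cache, prefix="intra:"):
    """Split a cache dict into (intra, inter) dicts; intra keys lose the prefix."""
    intra, inter = {}, {}
    for key, value in cache.items():
        if key.startswith(prefix):
            intra[key[len(prefix):]] = value
        else:
            inter[key] = value
    return intra, inter


cache = {
    "intra:attach": "intra-attach.pkl",
    "intra:label": "intra-label.pkl",
    "attach": "attach.pkl",
}
intra_cache, inter_cache = split_cache(cache)
```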

fit(dpacks, targets, cache=None)
transform(dpack)
class attelo.parser.intra.SentOnlyParser(parsers, sel_inter='inter', verbose=False)

Bases: attelo.parser.intra.IntraInterParser

Intra/inter parser with no sentence recombination. We also chop off any fakeroot connections.

class attelo.parser.intra.SoftParser(parsers, sel_inter='inter', verbose=False)

Bases: attelo.parser.intra.IntraInterParser

Intra/inter parser in which sentence recombination consists of

  1. passing intra-sentential edges through but
  2. marking 1.0 attachment probabilities if they are attached and 1.0 label probabilities on the resulting edge

Notes

In its current implementation, this parser needs a global model, i.e. one fit on the whole dataset, so that it can correctly score intra-sentential edges. Different, alternative implementations could probably solve or work around this.

attelo.parser.intra.edu_id2num(edu_id)

Get the number of an EDU

attelo.parser.intra.for_intra(dpack, target)

Adapt a datapack to intrasentential decoding.

An intrasentential datapack is almost identical to its original, except that we set the label for each (‘ROOT’, edu) pairing to ‘ROOT’ if that edu is a subgrouping head (if it has no parents other than ‘ROOT’ within its subgrouping).

This should be done before either for_labelling or for_attachment

Returns:
  • dpack (DataPack)
  • target (array(int))
attelo.parser.intra.partition_subgroupings(dpack)

Partition the pairings of a datapack along (grouping, subgrouping).

Parameters:dpack (DataPack) – Datapack to partition
Returns:groups – Map each (grouping, subgrouping) to the list of indices of pairings within the same subgrouping.
Return type:dict from (string, string) to list of integers

Notes

  • (FAKE_ROOT, x) pairings are included in the group defined by

    (grouping(x), subgrouping(x)).

  • This function is a tiny wrapper around

    attelo.table.grouped_intra_pairings.
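The partitioning can be sketched without a real DataPack. In the example below each EDU is modelled as an (id, grouping, subgrouping) tuple and FAKE_ROOT is a stand-in constant; both are assumptions, and the function does not call attelo.table.grouped_intra_pairings.

```python
from collections import defaultdict

FAKE_ROOT = ("ROOT", None, None)  # hypothetical stand-in for the fake root EDU


def toy_partition_subgroupings(pairings):
    """Map each (grouping, subgrouping) to the indices of the pairings that
    lie within it; (FAKE_ROOT, x) pairings go to the group of x."""
    groups = defaultdict(list)
    for idx, (edu1, edu2) in enumerate(pairings):
        key = (edu2[1], edu2[2])
        if edu1 == FAKE_ROOT or (edu1[1], edu1[2]) == key:
            groups[key].append(idx)
    return dict(groups)


a = ("a", "doc1", "s1")
b = ("b", "doc1", "s1")
c = ("c", "doc1", "s2")
pairings = [(FAKE_ROOT, a), (a, b), (b, c), (FAKE_ROOT, c)]
groups = toy_partition_subgroupings(pairings)
```

The (b, c) pairing crosses a sentence boundary, so in this sketch it belongs to no group.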

attelo.parser.label module

Labelling

class attelo.parser.label.LabelClassifierWrapper(learner)

Bases: attelo.parser.interface.Parser

Parser that extracts label weights from a label classifier.

This parser is really meant to be used in conjunction with other parsers downstream that make use of these weights.

If you use it in standalone mode, it will just provide the standard unknown prediction everywhere.

Notes

fit() and transform() have a ‘cache’ argument that is a dict with expected keys:

  • ‘label’: label model path

fit(dpacks, targets, nonfixed_pairs=None, cache=None)

Extract whatever models or other information from the multipack that is necessary to make the labeller operational.

Returns:self
Return type:object
transform(dpack, nonfixed_pairs=None)
class attelo.parser.label.SimpleLabeller(learner)

Bases: attelo.parser.label.LabelClassifierWrapper

A simple parser that assigns the best label to any edges with unknown labels.

This can be used as a standalone parser if the underlying classifier predicts UNRELATED.

Notes

fit() and transform() have a ‘cache’ parameter that is a dict with expected keys:

  • ‘label’: label model path

transform(dpack, nonfixed_pairs=None)

attelo.parser.pipeline module

Parser made by sequencing other parsers.

Ideally, we’d like to use sklearn.pipeline.Pipeline but our previous attempts have failed. The current trend is to try and slowly converge.

class attelo.parser.pipeline.Pipeline(steps)

Bases: attelo.parser.interface.Parser

Apply a sequence of parsers.

NB. For now we assume that these parsers can be fitted independently of each other.

Steps should be a tuple of names and parsers, just like in sklearn.

Parameters:steps (list) – List of (name, parser) tuples that are chained.
named_steps

dict

Read-only attribute to access any step parameter by user given name. Keys are step names and values are step parameters.
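The sequencing idea can be sketched as follows. ToyPipeline and Scale are hypothetical classes using lists of weights in place of datapacks; the sketch fits every step on the same data, in line with the independence assumption noted above.

```python
class Scale:
    """Hypothetical step: multiplies every weight by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, dpacks, targets):
        return self  # nothing to learn

    def transform(self, dpack):
        return [weight * self.factor for weight in dpack]


class ToyPipeline:
    """Apply a sequence of (name, parser) steps in order."""

    def __init__(self, steps):
        self.steps = steps
        self.named_steps = dict(steps)

    def fit(self, dpacks, targets):
        # Each step is fitted independently on the same training data
        for _, parser in self.steps:
            parser.fit(dpacks, targets)
        return self

    def transform(self, dpack):
        # The output of each step is the input of the next
        for _, parser in self.steps:
            dpack = parser.transform(dpack)
        return dpack


pipeline = ToyPipeline([("half", Scale(0.5)), ("quadruple", Scale(4.0))])
result = pipeline.fit([], []).transform([1.0, 2.0])
```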

fit(dpacks, targets, nonfixed_pairs=None, cache=None)

Fit.

transform(dpack, nonfixed_pairs=None)

Transform.