attelo package

Attelo is a statistical discourse parser. The API provides

  • decoders which you should be able to call in a standalone way
  • machine learning infrastructure wrapping around a library like scikit-learn
  • support for building experimental harnesses around the parser

Submodules

attelo.args module

Managing command line arguments

attelo.args.add_common_args(psr)

add usual attelo args to subcommand parser

attelo.args.add_fold_choice_args(psr)

ability to select a subset of the data according to a fold

attelo.args.add_model_read_args(psr, help_)

models files we can read in

Parameters:help_ (string) – Python format string for the help text; {} will have a word (eg. ‘attachment’) plugged in
attelo.args.add_report_args(psr)

add args to scoring/evaluation

attelo.args.validate_fold_choice_args(wrapped)

Given a function that accepts an argparsed object, check the fold arguments before carrying on.

The idea here is that --fold and --fold-file are meant to be used together (xnor): either both are given or neither is.

This is meant to be used as a decorator, eg.:

@validate_fold_choice_args
def main(args):
    blah
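
As a rough illustration of the decorator pattern described above, the following sketch checks that the two options are either both present or both absent before calling the wrapped function. The name validate_fold_choice_args_sketch, the choice of ValueError and the attribute names fold and fold_file are assumptions made for this example; attelo's own implementation may report the error differently.

```python
import argparse
from functools import wraps


def validate_fold_choice_args_sketch(wrapped):
    """Hypothetical stand-in: require --fold and --fold-file together."""
    @wraps(wrapped)
    def inner(args):
        has_fold = args.fold is not None
        has_file = args.fold_file is not None
        # xnor: both options given, or neither
        if has_fold != has_file:
            raise ValueError("--fold and --fold-file must be used together")
        return wrapped(args)
    return inner


@validate_fold_choice_args_sketch
def main(args):
    if args.fold is None:
        return "all data"
    return "fold %d" % args.fold


psr = argparse.ArgumentParser()
psr.add_argument("--fold", type=int)
psr.add_argument("--fold-file")
```

Calling main(psr.parse_args(["--fold", "1"])) would raise the error here, because --fold-file is missing.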

attelo.edu module

Uniquely identifying information for an EDU

class attelo.edu.EDU

Bases: attelo.edu.EDU

a class representing the EDU (id, span start and end, grouping, subgrouping)

span()

Starting and ending position of the EDU as an integer pair

attelo.edu.FAKE_ROOT = EDU(id='ROOT', text='', start=0, end=0, grouping=None, subgrouping=None)

a distinguished fake root EDU which simultaneously appears in all groupings
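
The field names shown for FAKE_ROOT suggest a named tuple with a small helper method on top. The sketch below reproduces that shape for illustration only; the base tuple name _EDUBase and the sample EDU are assumptions, not attelo code.

```python
from collections import namedtuple

# Fields taken from the FAKE_ROOT representation above
_EDUBase = namedtuple("EDU", "id text start end grouping subgrouping")


class EDU(_EDUBase):
    """Illustrative EDU: an immutable record plus a span helper."""

    def span(self):
        """Starting and ending position as an integer pair."""
        return (self.start, self.end)


FAKE_ROOT = EDU(id="ROOT", text="", start=0, end=0,
                grouping=None, subgrouping=None)

# A made-up EDU for demonstration
edu = EDU("d1_e1", "Hello there", 0, 11, "d1", "d1_t1")
```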

attelo.fold module

Group-aware n-fold evaluation.

Attelo uses a variant of n-fold evaluation, where we (still) randomly partition the dataset into a set of folds of roughly even size, but respecting the additional constraint that any two data entries belonging to the same “group” (determined by a single distinguished feature, eg. the document id, the dialogue id, etc) are always in the same fold. Note that this makes it a bit harder to have perfectly evenly sized folds.

Created on Jun 20, 2012

@author: stergos

contribs: phil

attelo.fold.fold_groupings(fold_dict, fold)

Return the set of groupings that belong in a fold. Raise an exception if the fold is not in the fold dictionary

Return type:frozenset(int)

attelo.fold.make_n_fold(groupings, folds, rng)

Given a set of groupings and a desired number of folds, return a fold selection dictionary assigning a fold number to each grouping (see attelo.edu.EDU).

Parameters:rng (random.Random) – random number generator (hint: the random module will be just fine if you don’t mind shared state)

Return type:dict(string, int)
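
One simple way to build such a fold dictionary is to shuffle the grouping names and deal them out round-robin. The sketch below is written under that assumption; make_n_fold_sketch is a hypothetical name and the real function may distribute groupings differently. Because whole groupings are assigned, every entry of a grouping lands in the same fold.

```python
import random


def make_n_fold_sketch(groupings, folds, rng):
    """Assign a fold number to each grouping name (illustrative only)."""
    names = sorted(groupings)  # fixed order so a seeded rng is reproducible
    rng.shuffle(names)
    # deal the shuffled names out round-robin across the folds
    return {name: i % folds for i, name in enumerate(names)}


fold_dict = make_n_fold_sketch({"d1", "d2", "d3", "d4", "d5"}, 2,
                               random.Random(0))
```

With five groupings and two folds, one fold receives three groupings and the other two, which is the "roughly even" case mentioned in the module description.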

attelo.fold.select_testing(mpack, fold_dict, fold)

Given a division into folds and a fold number, return only the test items for that fold

Return type:Multipack
attelo.fold.select_training(mpack, fold_dict, fold)

Given a division into folds and a fold number, return only the training items for that fold

Return type:Multipack
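
Since a multipack is described as a mapping from groupings to datapacks, the two selections can be pictured as dictionary filters. This is a sketch under that assumption; the _sketch function names are hypothetical and strings stand in for real datapacks.

```python
def select_testing_sketch(mpack, fold_dict, fold):
    """Keep only the groupings assigned to the given fold."""
    return {k: v for k, v in mpack.items() if fold_dict[k] == fold}


def select_training_sketch(mpack, fold_dict, fold):
    """Keep every grouping that is NOT in the given fold."""
    return {k: v for k, v in mpack.items() if fold_dict[k] != fold}


# Placeholder data: strings instead of DataPack objects
mpack = {"d1": "pack1", "d2": "pack2", "d3": "pack3"}
fold_dict = {"d1": 0, "d2": 1, "d3": 0}

test_items = select_testing_sketch(mpack, fold_dict, 1)
train_items = select_training_sketch(mpack, fold_dict, 1)
```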

attelo.graph module

graph visualisation

exception attelo.graph.Alarm

Bases: exceptions.Exception

Exception to raise on signal timeout

class attelo.graph.GraphSettings

Bases: attelo.graph.GraphSettings

Parameters:
  • hide (string or None) – ‘intra’ to hide links between EDUs in the same subgrouping; ‘inter’ to hide links across subgroupings; None to show all links
  • select ([string] or None) – EDU groupings to graph (if None, all groupings will be graphed unless)
  • unrelated (bool) – show unrelated links
  • timeout (int) – number of seconds to allow graphviz to run before it times out
  • quiet (bool) – suppress informational messages
attelo.graph.alarm_handler(_, frame)

Raise Alarm on signal
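
The Alarm exception and its handler follow a common POSIX idiom for bounding the run time of a call. The sketch below shows that idiom in isolation; run_with_timeout is a hypothetical helper, the code only works on POSIX systems and in the main thread, and it is not a copy of how attelo invokes graphviz.

```python
import signal
import time


class Alarm(Exception):
    """Raised when the timer expires."""


def alarm_handler(_, frame):
    """Signal handler: turn SIGALRM into a Python exception."""
    raise Alarm


def run_with_timeout(func, seconds):
    """Run func(); return None if it takes longer than `seconds`."""
    old_handler = signal.signal(signal.SIGALRM, alarm_handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func()
    except Alarm:
        return None
    finally:
        # cancel any pending timer and restore the previous handler
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
```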

attelo.graph.diff_all(edus, src_predictions, tgt_predictions, settings, output_dir)

Generate graphs for all the given predictions. Each grouping will have its own graph, saved in the output directory

attelo.graph.graph_all(edus, predictions, settings, output_dir)

Generate graphs for all the given predictions. Each grouping will have its own graph, saved in the output directory

attelo.graph.mk_diff_graph(title, edus, src_links, tgt_links, settings)

Convert attelo predictions to a graphviz graph displaying differences between two predictions

Predictions here consist of an EDU followed by a list of (parent name, relation label) tuples

Parameters:tgt_links – if present, we generate a graph that represents a difference between src_links and tgt_links (by highlighting links that only occur in one or the other)
attelo.graph.mk_single_graph(title, edus, links, settings)

Convert single set of attelo predictions to a graphviz graph

Given a set of edus and of edu id pairs, return only the pairs whose ids appear in the edu list

Parameters:
  • intra – if True, in addition to the constraints above, only return links that are in the same subgrouping
  • inter – if True, only return links between subgroupings
attelo.graph.write_dot_graph(filename, dot_graph, run_graphviz=True, quiet=False, timeout=30)

Write a dot graph and possibly run graphviz on it

attelo.io module

attelo.report module

attelo.score module

attelo.table module

Manipulating data tables (taking slices, etc)

class attelo.table.DataPack

Bases: attelo.table.DataPack

A set of data that can be said to belong together.

A typical use of the datapack would be to group together data for a single document/grouping. But in cases where this distinction does not matter, it can also be convenient to combine data from multiple documents into a single pack.

Notes

A datapack is said to be

  • single document (the usual case) if it corresponds to a single document, or “stacked” if it is made by joining multiple datapacks together. Some functions may only behave correctly on single-document datapacks
  • weighted if the graphs tuple is set. You should never see weighted datapacks outside of a learner or decoder
Parameters:
  • edus (EDU) – effectively a set of edus
  • pairings ([(EDU, EDU)]) – edu pairs
  • data (2D array(float)) – sparse matrix of features, each row corresponding to a pairing
  • target (1D array (should be int, really)) – array of predictions for each pairing
  • ctarget (dict from string to objects) – Mapping from grouping name to structured target
  • labels ([string]) – list of relation labels (NB: by convention label zero is always the unknown label)
  • vocab ([string]) – feature names (corresponds to the feature indices in data)
  • graph (None or Graph) – if set, arrays representing the probabilities (or confidence scores) of attachment and labelling
get_label(i)

Return the class label for the given target value.

Parameters:i (int, less than len(self.labels)) – a target value

See also

label_number

label_number(label)

Return the numerical label that corresponds to the given string label

Useful idiom: unrelated = dpack.label_number(UNRELATED)

Parameters:label (string in self.labels) – a label string

See also

get_label
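
The two lookups are inverses of each other. The sketch below illustrates this with a plain list of labels, assuming that target values index directly into the list and that the unknown label sits at position zero as noted above; the _sketch function names and the sample relation labels are made up for the example.

```python
UNKNOWN = "__UNK__"
UNRELATED = "UNRELATED"

# Hypothetical label list; label zero is the unknown label by convention
labels = [UNKNOWN, UNRELATED, "elaboration", "narration"]


def get_label_sketch(labels, i):
    """Target value -> string label (assumes direct indexing)."""
    return labels[int(i)]


def label_number_sketch(labels, label):
    """String label -> target value."""
    return labels.index(label)


unrelated = label_number_sketch(labels, UNRELATED)
```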

classmethod load(edus, pairings, data, target, ctarget, labels, vocab)

Build a data pack and run some sanity checks (see sanity_check) (recommended if reading from disk)

Return type:DataPack
sanity_check()

Raise DataPackException if anything about this datapack seems wrong, for example if the number of rows in one table is not the same as in another

selected(indices)

Return only the items in the specified rows

set_graph(graph)

Return a copy of the datapack with weights set

classmethod vstack(dpacks)

Combine several datapacks into one.

The labels and vocabulary for all packs must be the same

exception attelo.table.DataPackException(msg)

Bases: exceptions.Exception

An exception which arises when working with an attelo data pack

class attelo.table.Graph

Bases: attelo.table.Graph

A graph can only be interpreted in light of a datapack.

It has predictions and attach/label weights. Predictions work like DataPack.target. The weights are useful within parsing pipelines, where it is sometimes useful for an intermediary parser to manipulate the weight vectors that a parser may calculate downstream.

See the parser interface for more details.

Parameters:
  • prediction (array(int)) – label for each edge (each cell corresponds to edge)
  • attach (array(float)) – attachment weights (each cell corresponds to an edge)
  • label (2D array(float)) – label attachment weights (edge by label)

Notes

Predictions are always labels; however, datapack targets may also be -1/0/1 when adapted to binary attachment task

selected(indices)

Return a subset of the links indicated by the list/array of indices

tweak(prediction=None, attach=None, label=None)

Return a variant of the current graph with some values changed.

Parameters:
  • prediction (1D array of int16) – Predicted label for each pair of EDUs
  • attach (1D array of float) – Attachment scores for each pair of EDUs
  • label (2D array of float) – Score of each label for each pair of EDUs
Returns:

g_copy – Copy of self with prediction, attach or label overridden with the values passed as arguments.

Return type:

Graph

Notes

This returns a copy of self with graph changed, because “[EYK] superstitiously believes that datapacks and graphs should be immutable as much as possible, and that mutability in the parsing pipeline would lead to confusion; hence this and namedtuples instead of simple getting and setting”.
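
The copy-instead-of-mutate behaviour described in the note maps naturally onto the _replace method of named tuples. The following is a minimal sketch of that idea; GraphSketch and tweak_sketch are hypothetical names, and plain lists stand in for the arrays a real graph would hold.

```python
from collections import namedtuple

GraphSketch = namedtuple("GraphSketch", "prediction attach label")


def tweak_sketch(graph, prediction=None, attach=None, label=None):
    """Return a copy of graph with the given fields overridden."""
    changes = {}
    if prediction is not None:
        changes["prediction"] = prediction
    if attach is not None:
        changes["attach"] = attach
    if label is not None:
        changes["label"] = label
    # _replace builds a new tuple; the original is left untouched
    return graph._replace(**changes)


g1 = GraphSketch(prediction=[1, 2], attach=[0.5, 0.5],
                 label=[[0.1, 0.9], [0.8, 0.2]])
g2 = tweak_sketch(g1, attach=[0.9, 0.1])
```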

classmethod vstack(graphs)

Combine several graphs into one.

class attelo.table.Multipack

Bases: dict

A multipack is a mapping from groupings to datapacks

This class exists purely for documentation purposes; in practice, a dictionary of string to DataPack will do just fine

attelo.table.UNKNOWN = '__UNK__'

distinguished internal value for post-labelling mode

attelo.table.UNRELATED = 'UNRELATED'

distinguished value for unrelated relation labels

attelo.table.attached_only(dpack, target)

Return only the instances which are labelled as attached (ie. this would presumably return an empty pack on completely unseen data)

Parameters:
  • dpack (DataPack) – Original datapack
  • target (array(int)) – Original targets
Returns:

  • dpack (DataPack) – Transformed datapack, with binary labels
  • target (array(int)) – Transformed targets, with binary labels

attelo.table.for_attachment(dpack, target)

Adapt a datapack to the attachment task.

This could involve:

  • selecting some of the features (all for now, but may change in the future)
  • modifying the features/labels in some way: we currently binarise labels to {-1 ; 1} for UNRELATED and not-UNRELATED respectively

Parameters:
  • dpack (DataPack) – Original datapack
  • target (array(int)) – Original targets
Returns:

  • dpack (DataPack) – Transformed datapack, with binary labels
  • target (array(int)) – Transformed targets, with binary labels
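
The binarisation step can be pictured as follows. This is a sketch over a plain Python list rather than an array, and binarise_targets_sketch is a hypothetical helper; it only illustrates the mapping of UNRELATED to -1 and everything else to 1.

```python
def binarise_targets_sketch(target, unrelated):
    """Map the UNRELATED label number to -1 and any other label to 1."""
    return [-1 if t == unrelated else 1 for t in target]


# Here label number 1 plays the role of UNRELATED
binary = binarise_targets_sketch([1, 2, 1, 3], unrelated=1)
```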

attelo.table.for_labelling(dpack, target)

Adapt a datapack to the relation labelling task (currently a no-op).

This could involve:

  • selecting some of the features (all for now, but may change in the future)
  • modifying the features/labels in some way (in practice no change)

Parameters:
  • dpack (DataPack) – Original datapack
  • target (array(int)) – Original targets
Returns:

  • dpack (DataPack) – Transformed datapack, with binary labels
  • target (array(int)) – Transformed targets, with binary labels

attelo.table.get_label_string(labels, i)

Return the class label for the given target value.

attelo.table.grouped_intra_pairings(dpack, include_fake_root=False)

Retrieve intra pairings from a datapack, grouped by subgrouping.

Parameters:
  • dpack (DataPack) – The datapack under scrutiny.
  • include_fake_root (boolean, optional) – If True, (FAKE_ROOT_ID, x) pairings are included in the group defined by (grouping(x), subgrouping(x)).
Returns:

groups – Map each (grouping, subgrouping) to the list of pairing indices within the same subgrouping.

Return type:

dict from (string, string) to list of integers

Notes

The result roughly corresponds to a hypothetical dpack.pairings[‘intra’].groupby([‘grouping’, ‘subgrouping’]).groups.
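
To make the grouping concrete, here is a sketch that works on a bare list of EDU pairs instead of a datapack. The function name, the assumption that the fake root only appears as the first member of a pair, and the sample EDUs are all illustrative.

```python
from collections import defaultdict, namedtuple

EDU = namedtuple("EDU", "id text start end grouping subgrouping")
FAKE_ROOT_ID = "ROOT"


def grouped_intra_pairings_sketch(pairings, include_fake_root=False):
    """Map (grouping, subgrouping) to indices of same-subgrouping pairs."""
    groups = defaultdict(list)
    for i, (edu1, edu2) in enumerate(pairings):
        if edu1.id == FAKE_ROOT_ID:
            # root pairs are filed under the group of the other EDU
            if include_fake_root:
                groups[(edu2.grouping, edu2.subgrouping)].append(i)
        elif (edu1.grouping == edu2.grouping and
              edu1.subgrouping == edu2.subgrouping):
            groups[(edu1.grouping, edu1.subgrouping)].append(i)
    return dict(groups)


root = EDU(FAKE_ROOT_ID, "", 0, 0, None, None)
a = EDU("a", "", 0, 5, "d1", "s1")
b = EDU("b", "", 6, 9, "d1", "s1")
c = EDU("c", "", 10, 15, "d1", "s2")
pairings = [(a, b), (a, c), (root, b), (b, a)]
```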

attelo.table.groupings(pairings)

Given a list of EDU pairings, return a dictionary mapping grouping names to list of rows within the pairings.

Return type:dict(string, [int])
attelo.table.idxes_attached(dpack, target)

Indices of attached pairings from dpack, according to target.

Parameters:
  • dpack (DataPack) – Datapack
  • target (list of integers) – Label for each pairings of dpack
Returns:

indices – Indices of attached pairings.

Return type:

array of integers

Notes

TODO: try and apply widely, especially for parser.intra; search for e.g. “target != unrelated” and “target[i] != unrelated”.

attelo.table.idxes_fakeroot(dpack)

Return the datapack indices of only the pairings which involve the fake root node

attelo.table.idxes_inter(dpack, include_fake_root=False)

Return indices of pairings from different subgroupings.

Parameters:
  • dpack (DataPack) – Datapack under scrutiny
  • include_fake_root (boolean, optional) – If True, pairings of the form (FAKE_ROOT_ID, x) are included.
Returns:

idxes – Indices of the inter pairings.

Return type:

list of int

attelo.table.idxes_intra(dpack, include_fake_root=False)

Return indices of pairings from same subgrouping, inside a datapack.

Parameters:
  • dpack (DataPack) – Datapack under scrutiny
  • include_fake_root (boolean, optional) – If True, pairings of the form (FAKE_ROOT_ID, x) are included.
Returns:

idxes – Indices of the intra pairings.

Return type:

list of int

attelo.table.locate_in_subpacks(dpack, subpacks)

Given a datapack and some of its subpacks, return a list of tuples identifying for each pair, its subpack and index in that subpack.

If a pair is not found in the list of subpacks, we return None instead of tuple

Return type:[None or (DataPack, float)]
attelo.table.mpack_pairing_distances(mpack)

Return for each target value (label) in the multipack, the left and right maximum distances of EDU pairings. See pairing_distances() for details

Return type:dict(int, (int, int))

attelo.table.pairing_distances(dpack)

Return for each target value (label) in the datapack, the left and right maximum distances of edu pairings (in number of EDUs, so adjacent EDUs have distance of 0)

Note that we assume a single-document datapack. If you give this a stacked datapack, you may get very large distances to the fake root

Return type:dict(int, (int, int))

attelo.table.select_window(dpack, window)

Select only EDU pairs that are at most window EDUs apart from each other (adjacent EDUs would be considered 0 apart)

Note that if the window is None, we simply return the original datapack

Note that this will only work correctly on single-document datapacks
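
The windowing rule (adjacent EDUs count as 0 apart) can be sketched on EDU positions alone. In the example below, positions is a hypothetical mapping from EDU id to its ordinal position in the document, and the function returns the indices of the pairings to keep rather than a new datapack; both choices are assumptions made for illustration.

```python
def select_window_sketch(positions, pairings, window):
    """Indices of pairs whose EDUs are at most `window` EDUs apart."""
    if window is None:
        # no window: keep everything
        return list(range(len(pairings)))
    keep = []
    for i, (id1, id2) in enumerate(pairings):
        # adjacent EDUs (positions differing by 1) are 0 apart
        gap = abs(positions[id1] - positions[id2]) - 1
        if gap <= window:
            keep.append(i)
    return keep


positions = {"e1": 0, "e2": 1, "e3": 2, "e4": 3}
pairings = [("e1", "e2"), ("e1", "e3"), ("e1", "e4"), ("e4", "e2")]
```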

attelo.util module

General-purpose classes and functions

class attelo.util.ArgparserEnum

Bases: enum.Enum

An enumeration whose values we spit out as choices to argparser

classmethod choices_str()

available choices in this enumeration

classmethod from_string(string)

from command line arg

classmethod help_suffix(default)

help text suffix showing choices and default
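
An enumeration of this kind can be plugged into argparse through the type argument. The sketch below shows one possible shape; ArgparserEnumSketch, the HideMode enumeration and the exact wording of the help suffix are assumptions, not the attelo implementation.

```python
import argparse
from enum import Enum


class ArgparserEnumSketch(Enum):
    """Illustrative base class for enums used as command line choices."""

    @classmethod
    def choices_str(cls):
        """Available choices, comma separated."""
        return ",".join(member.name for member in cls)

    @classmethod
    def from_string(cls, string):
        """Convert a command line string to an enum member."""
        names = [member.name for member in cls]
        if string not in names:
            raise argparse.ArgumentTypeError(
                "invalid choice: %s (choose from %s)"
                % (string, cls.choices_str()))
        return cls[string]

    @classmethod
    def help_suffix(cls, default):
        """Help text suffix showing choices and default."""
        default_name = "none" if default is None else default.name
        return " (choices: {%s}, default: %s)" % (cls.choices_str(),
                                                  default_name)


class HideMode(ArgparserEnumSketch):
    """Hypothetical enumeration for this example."""
    intra = 1
    inter = 2


psr = argparse.ArgumentParser()
psr.add_argument("--hide", type=HideMode.from_string,
                 default=HideMode.intra,
                 help="links to hide" + HideMode.help_suffix(HideMode.intra))
args = psr.parse_args(["--hide", "inter"])
```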

class attelo.util.Team

Bases: attelo.util.Team

Any collection where we have the same thing but duplicated for each attelo subtask (eg. models, learners)

fmap(func)

Apply a function to each member of the collection
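
A small named tuple is enough to illustrate the idea. The field names attach and label are assumed here (after the attachment and labelling subtasks mentioned elsewhere in this documentation), and TeamSketch is a hypothetical name.

```python
from collections import namedtuple


class TeamSketch(namedtuple("TeamSketch", "attach label")):
    """One value per subtask (field names are an assumption)."""

    def fmap(self, func):
        """Apply func to each member, returning a new team."""
        return TeamSketch(attach=func(self.attach),
                          label=func(self.label))


paths = TeamSketch(attach="attach.model", label="label.model")
upper = paths.fmap(str.upper)
```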

attelo.util.concat_i(iters)

Merge an iterable of iterables into a single iterable

attelo.util.concat_l(iters)

Merge an iterable of iterables into a list
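
Both helpers can plausibly be expressed with itertools.chain.from_iterable from the standard library; the sketch below assumes exactly that, and the _sketch names are hypothetical. The lazy variant yields items on demand, while the list variant materialises them all at once.

```python
import itertools


def concat_i_sketch(iters):
    """Lazily merge an iterable of iterables."""
    return itertools.chain.from_iterable(iters)


def concat_l_sketch(iters):
    """Merge an iterable of iterables into a list."""
    return list(concat_i_sketch(iters))


merged = concat_l_sketch([[1, 2], (3,), [], [4, 5]])
```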

attelo.util.mk_rng(shuffle=False, default_seed=None)

Return a random number generator instance, hard-seeded unless we ask for shuffling to be enabled

(note: if shuffle mode is enabled, the rng in question will just be the system generator)
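
The behaviour described above might look roughly like the following. This is a sketch only: the fallback seed string is invented, and "system generator" is taken here to mean the shared generator of the random module, which is an assumption.

```python
import random


def mk_rng_sketch(shuffle=False, default_seed=None):
    """Return a seeded generator unless shuffling is requested."""
    if shuffle:
        # shared, unseeded state: results differ from run to run
        return random
    # hypothetical fixed seed when none is supplied
    seed = "illustrative seed" if default_seed is None else default_seed
    return random.Random(seed)


first = mk_rng_sketch().random()
second = mk_rng_sketch().random()
```

Because both generators are built from the same fixed seed, first and second are equal, which is what makes fold assignments reproducible across runs.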

attelo.util.truncate(text, width)

Truncate a string and append an ellipsis if truncated