attelo package¶
Attelo is a statistical discourse parser. The API provides
- decoders which you should be able to call in a standalone way
- machine learning infrastructure wrapping around a library like scikit-learn
- support for building experimental harnesses around the parser
Subpackages¶
Submodules¶
attelo.args module¶
Managing command line arguments
-
attelo.args.
add_common_args
(psr)¶ add usual attelo args to subcommand parser
-
attelo.args.
add_fold_choice_args
(psr)¶ ability to select a subset of the data according to a fold
-
attelo.args.
add_model_read_args
(psr, help_)¶ models files we can read in
Parameters: help_ (string) – Python format string for the help text; {} will have a word (eg. ‘attachment’) plugged in
-
attelo.args.
add_report_args
(psr)¶ add args to scoring/evaluation
-
attelo.args.
validate_fold_choice_args
(wrapped)¶ Given a function that accepts an argparsed object, check the fold arguments before carrying on.
The idea here is that --fold and --fold-file are meant to be used together (xnor): either both are given or neither is.
This is meant to be used as a decorator, eg.:
    @validate_fold_choice_args
    def main(args):
        blah
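The decorator idea can be sketched as follows. This is not attelo's implementation: the attribute names `fold` and `fold_file` are assumed from the option names, and exiting with an error message is an illustrative choice.

```python
import argparse
import functools
import sys


def validate_fold_choice_args(wrapped):
    """Hypothetical re-implementation, for illustration only.

    Refuse to run `wrapped` unless --fold and --fold-file are either
    both present or both absent (xnor).
    """
    @functools.wraps(wrapped)
    def inner(args):
        has_fold = getattr(args, "fold", None) is not None
        has_file = getattr(args, "fold_file", None) is not None
        if has_fold != has_file:
            sys.exit("--fold and --fold-file must be used together")
        return wrapped(args)
    return inner


@validate_fold_choice_args
def main(args):
    return "running on fold {}".format(args.fold)


# Assumed option names; the real parser is built by add_fold_choice_args
psr = argparse.ArgumentParser()
psr.add_argument("--fold", type=int)
psr.add_argument("--fold-file")
result = main(psr.parse_args(["--fold", "1", "--fold-file", "folds.json"]))
```

Calling `main` with only one of the two options exits before the wrapped function runs.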
attelo.edu module¶
Uniquely identifying information for an EDU
-
class
attelo.edu.
EDU
¶ Bases:
attelo.edu.EDU
a class representing the EDU (id, span start and end, grouping, subgrouping)
-
span
()¶ Starting and ending position of the EDU as an integer pair
-
-
attelo.edu.
FAKE_ROOT
= EDU(id='ROOT', text='', start=0, end=0, grouping=None, subgrouping=None)¶ a distinguished fake root EDU which simultaneously appears in all groupings
attelo.fold module¶
Group-aware n-fold evaluation.
Attelo uses a variant of n-fold evaluation, where we (still) randomly partition the dataset into a set of folds of roughly even size, but respecting the additional constraint that any two data entries belonging in the same “group” (determined by a single distinguished feature, eg. the document id, the dialogue id, etc) are always in the same fold. Note that this makes it a bit harder to have perfectly evenly sized folds.
Created on Jun 20, 2012
@author: stergos
contribs: phil
-
attelo.fold.
fold_groupings
(fold_dict, fold)¶ Return the set of groupings that belong in a fold. Raise an exception if the fold is not in the fold dictionary
Return type: frozenset(int)
-
attelo.fold.
make_n_fold
(groupings, folds, rng)¶ Given a set of groupings and a desired number of folds, return a fold selection dictionary assigning a fold number to each grouping (see
attelo.edu.EDU
).
Parameters: rng (random.Random) – random number generator (hint: the random module will be just fine if you don’t mind shared state)
Return type: dict(string, int)
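A minimal sketch of group-aware fold assignment, under the assumption that groupings are shuffled and then dealt out round-robin. The function names ending in `_sketch` are hypothetical and the actual attelo code may distribute groupings differently.

```python
import random


def make_n_fold_sketch(groupings, folds, rng):
    """Assign a fold number to each grouping (illustrative only)."""
    shuffled = sorted(groupings)  # sort first so the result depends only on rng
    rng.shuffle(shuffled)
    return {grp: i % folds for i, grp in enumerate(shuffled)}


def fold_groupings_sketch(fold_dict, fold):
    """Return the groupings assigned to a fold, or raise if it is unknown."""
    if fold not in fold_dict.values():
        raise ValueError("no such fold: {}".format(fold))
    return frozenset(grp for grp, num in fold_dict.items() if num == fold)


docs = {"d1", "d2", "d3", "d4", "d5"}
fold_dict = make_n_fold_sketch(docs, 2, random.Random(0))
```

Because whole groupings are assigned, every data entry from a given document lands in the same fold, at the cost of folds that may differ in size.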
-
attelo.fold.
select_testing
(mpack, fold_dict, fold)¶ Given a division into folds and a fold number, return only the test items for that fold
Return type: Multipack
-
attelo.fold.
select_training
(mpack, fold_dict, fold)¶ Given a division into folds and a fold number, return only the training items for that fold
Return type: Multipack
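Since a multipack is a mapping from groupings to datapacks, the two selection functions can be pictured as dictionary filters. The sketch below uses plain strings in place of datapacks, and the `_sketch` names are hypothetical.

```python
def select_training_sketch(mpack, fold_dict, fold):
    """Keep the groupings that are NOT in the given fold."""
    return {grp: dpack for grp, dpack in mpack.items()
            if fold_dict[grp] != fold}


def select_testing_sketch(mpack, fold_dict, fold):
    """Keep only the groupings that are in the given fold."""
    return {grp: dpack for grp, dpack in mpack.items()
            if fold_dict[grp] == fold}


# Stand-in data: real multipacks map grouping names to DataPack objects
mpack = {"d1": "pack-1", "d2": "pack-2", "d3": "pack-3"}
fold_dict = {"d1": 0, "d2": 1, "d3": 0}
train = select_training_sketch(mpack, fold_dict, 1)
test = select_testing_sketch(mpack, fold_dict, 1)
```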
attelo.graph module¶
graph visualisation
-
exception
attelo.graph.
Alarm
¶ Bases:
exceptions.Exception
Exception to raise on signal timeout
-
class
attelo.graph.
GraphSettings
¶ Bases:
attelo.graph.GraphSettings
Parameters: - hide (string or None) – ‘intra’ to hide links between EDUs in the same subgrouping; ‘inter’ to hide links across subgroupings; None to show all links
- select ([string] or None) – EDU groupings to graph (if None, all groupings will be graphed unless)
- unrelated (bool) – show unrelated links
- timeout (int) – number of seconds to allow graphviz to run before it times out
- quiet (bool) – suppress informational messages
-
attelo.graph.
alarm_handler
(_, frame)¶ Raise Alarm on signal
-
attelo.graph.
diff_all
(edus, src_predictions, tgt_predictions, settings, output_dir)¶ Generate graphs for all the given predictions. Each grouping will have its own graph, saved in the output directory
-
attelo.graph.
graph_all
(edus, predictions, settings, output_dir)¶ Generate graphs for all the given predictions. Each grouping will have its own graph, saved in the output directory
-
attelo.graph.
mk_diff_graph
(title, edus, src_links, tgt_links, settings)¶ Convert attelo predictions to a graphviz graph displaying differences between two predictions
Predictions here consist of an EDU followed by a list of (parent name, relation label) tuples
Parameters: tgt_links – if present, we generate a graph that represents a difference between the links and tgt_links (by highlighting links that only occur in one or the other)
-
attelo.graph.
mk_single_graph
(title, edus, links, settings)¶ Convert single set of attelo predictions to a graphviz graph
-
attelo.graph.
select_links
(edus, links, settings)¶ Given a set of edus and of edu id pairs, return only the pairs whose ids appear in the edu list
Parameters: - intra – if True, in addition to the constraints above, only return links that are in the same subgrouping
- inter – if True, only return links between subgroupings
-
attelo.graph.
write_dot_graph
(filename, dot_graph, run_graphviz=True, quiet=False, timeout=30)¶ Write a dot graph and possibly run graphviz on it
attelo.io module¶
attelo.report module¶
attelo.score module¶
attelo.table module¶
Manipulating data tables (taking slices, etc)
-
class
attelo.table.
DataPack
¶ Bases:
attelo.table.DataPack
A set of data that can be said to belong together.
A typical use of the datapack would be to group together data for a single document/grouping. But in cases where this distinction does not matter, it can also be convenient to combine data from multiple documents into a single pack.
Notes
A datapack is said to be
- single document (the usual case) if it corresponds to a single document, or “stacked” if it is made by joining multiple datapacks together. Some functions may only behave correctly on single-document datapacks
- weighted if the graphs tuple is set. You should never see weighted datapacks outside of a learner or decoder
Parameters: - edus (EDU) – effectively a set of edus
- pairings ([(EDU, EDU)]) – edu pairs
- data (2D array(float)) – sparse matrix of features, each row corresponding to a pairing
- target (1D array (should be int, really)) – array of predictions for each pairing
- ctarget (dict from string to objects) – Mapping from grouping name to structured target
- labels ([string]) – list of relation labels (NB: by convention label zero is always the unknown label)
- vocab ([string]) – feature names (corresponds to the feature indices) in data
- graph (None or Graph) – if set, arrays representing the probabilities (or confidence scores) of attachment and labelling
-
get_label
(i)¶ Return the class label for the given target value.
Parameters: i (int, less than len(self.labels)) – a target value
See also: label_number
-
label_number
(label)¶ Return the numerical label that corresponds to the given string label
Useful idiom: unrelated = dpack.label_number(UNRELATED)
Parameters: label (string in self.labels) – a label string
See also: get_label
-
classmethod
load
(edus, pairings, data, target, ctarget, labels, vocab)¶ Build a data pack and run some sanity checks (see sanity_check) (recommended if reading from disk)
Return type: DataPack
-
sanity_check
()¶ Raise
DataPackException
if anything about this datapack seems wrong, for example if the number of rows in one table is not the same as in another
-
selected
(indices)¶ Return only the items in the specified rows
-
set_graph
(graph)¶ Return a copy of the datapack with weights set
-
classmethod
vstack
(dpacks)¶ Combine several datapacks into one.
The labels and vocabulary for all packs must be the same
-
exception
attelo.table.
DataPackException
(msg)¶ Bases:
exceptions.Exception
An exception which arises when working with an attelo data pack
-
class
attelo.table.
Graph
¶ Bases:
attelo.table.Graph
A graph can only be interpreted in light of a datapack.
It has predictions and attach/label weights. Predictions work like DataPack.target. The weights are useful within parsing pipelines, where it is sometimes useful for an intermediary parser to manipulate the weight vectors that a parser may calculate downstream.
See the parser interface for more details.
Parameters: - prediction (array(int)) – label for each edge (each cell corresponds to edge)
- attach (array(float)) – attachment weights (each cell corresponds to an edge)
- label (2D array(float)) – label attachment weights (edge by label)
Notes
Predictions are always labels; however, datapack targets may also be -1/0/1 when adapted to binary attachment task
-
selected
(indices)¶ Return a subset of the links indicated by the list/array of indices
-
tweak
(prediction=None, attach=None, label=None)¶ Return a variant of the current graph with some values changed.
Parameters: - prediction (1D array of int16) – Predicted label for each pair of EDUs
- attach (1D array of float) – Attachment scores for each pair of EDUs
- label (2D array of float) – Score of each label for each pair of EDUs
Returns: g_copy – Copy of self with prediction, attach or label overridden with the values passed as arguments.
Return type: Graph
Notes
This returns a copy of self with graph changed, because “[EYK] superstitiously believes that datapacks and graphs should be immutable as much as possible, and that mutability in the parsing pipeline would lead to confusion; hence this and namedtuples instead of simple getting and setting”.
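The immutable-update style described in this note can be sketched with a namedtuple and its `_replace` method. `GraphSketch` is a hypothetical stand-in that uses plain lists instead of arrays; it is not the attelo class.

```python
from collections import namedtuple


class GraphSketch(namedtuple("GraphSketch", "prediction attach label")):
    """Hypothetical stand-in for attelo.table.Graph."""

    def tweak(self, prediction=None, attach=None, label=None):
        """Return a copy with only the supplied fields overridden."""
        changes = {}
        if prediction is not None:
            changes["prediction"] = prediction
        if attach is not None:
            changes["attach"] = attach
        if label is not None:
            changes["label"] = label
        return self._replace(**changes)


graph = GraphSketch(prediction=[1, 2],
                    attach=[0.5, 0.9],
                    label=[[0.1, 0.9], [0.8, 0.2]])
graph2 = graph.tweak(attach=[0.0, 1.0])
```

The original `graph` is left untouched, so earlier stages of a pipeline can keep a reference to it without worrying about later stages changing it.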
-
classmethod
vstack
(graphs)¶ Combine several graphs into one.
-
class
attelo.table.
Multipack
¶ Bases:
dict
A multipack is a mapping from groupings to datapacks
This class exists purely for documentation purposes; in practice, a dictionary of string to Datapack will do just fine
-
attelo.table.
UNKNOWN
= '__UNK__'¶ distinguished internal value for post-labelling mode
-
attelo.table.
UNRELATED
= 'UNRELATED'¶ distinguished value for unrelated relation labels
-
attelo.table.
attached_only
(dpack, target)¶ Return only the instances which are labelled as attached (ie. this would presumably return an empty pack on completely unseen data)
Parameters: - dpack (DataPack) – Original datapack
- target (array(int)) – Original targets
Returns: - dpack (DataPack) – Transformed datapack, with binary labels
- target (array(int)) – Transformed targets, with binary labels
-
attelo.table.
for_attachment
(dpack, target)¶ Adapt a datapack to the attachment task.
This could involve:
- selecting some of the features (all for now, but may change in the future)
- modifying the features/labels in some way: we currently binarise labels to {-1 ; 1} for UNRELATED and not-UNRELATED respectively.
Parameters: - dpack (DataPack) – Original datapack
- target (array(int)) – Original targets
Returns: - dpack (DataPack) – Transformed datapack, with binary labels
- target (array(int)) – Transformed targets, with binary labels
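The binarisation step can be sketched in plain Python. This assumes that label numbers are indices into the label list, which the documentation of get_label suggests but does not state outright; the label list and the helper names are hypothetical.

```python
UNRELATED = "UNRELATED"


def binarise_targets(target, unrelated):
    """Map UNRELATED to -1 and every other label to 1."""
    return [-1 if t == unrelated else 1 for t in target]


def attached_indices(target, unrelated):
    """Indices of the pairings whose label is not UNRELATED."""
    return [i for i, t in enumerate(target) if t != unrelated]


# Hypothetical label list; label zero is the unknown label by convention
labels = ["__UNK__", "elaboration", "UNRELATED"]
unrelated = labels.index(UNRELATED)

target = [2, 1, 2, 1]
binary = binarise_targets(target, unrelated)
attached = attached_indices(target, unrelated)
```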
-
attelo.table.
for_labelling
(dpack, target)¶ Adapt a datapack to the relation labelling task (currently a no-op).
This could involve:
- selecting some of the features (all for now, but may change in the future)
- modifying the features/labels in some way (in practice no change)
Parameters: - dpack (DataPack) – Original datapack
- target (array(int)) – Original targets
Returns: - dpack (DataPack) – Transformed datapack, with binary labels
- target (array(int)) – Transformed targets, with binary labels
-
attelo.table.
get_label_string
(labels, i)¶ Return the class label for the given target value.
-
attelo.table.
grouped_intra_pairings
(dpack, include_fake_root=False)¶ Retrieve intra pairings from a datapack, grouped by subgrouping.
Parameters: - dpack (DataPack) – The datapack under scrutiny.
- include_fake_root (boolean, optional) – If True, (FAKE_ROOT_ID, x) pairings are included in the group defined by (grouping(x), subgrouping(x)).
Returns: groups – Map each (grouping, subgrouping) to the list of pairing indices within the same subgrouping.
Return type: dict from (string, string) to list of integers
Notes
The result roughly corresponds to a hypothetical dpack.pairings[‘intra’].groupby([‘grouping’, ‘subgrouping’]).groups.
-
attelo.table.
groupings
(pairings)¶ Given a list of EDU pairings, return a dictionary mapping grouping names to list of rows within the pairings.
Return type: dict(string, [int])
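A rough sketch of the grouping-to-rows mapping. `EduSketch` is a hypothetical cut-down EDU, and the rule of falling back on the second EDU when the first has no grouping (as for the fake root) is an assumption rather than documented behaviour.

```python
from collections import defaultdict, namedtuple

# Hypothetical, reduced version of attelo.edu.EDU
EduSketch = namedtuple("EduSketch", "id grouping")
FAKE_ROOT = EduSketch("ROOT", None)


def groupings_sketch(pairings):
    """Map each grouping name to the row indices of its pairings."""
    res = defaultdict(list)
    for i, (edu1, edu2) in enumerate(pairings):
        # Assumption: the fake root has no grouping, so use the other EDU's
        grp = edu1.grouping if edu1.grouping is not None else edu2.grouping
        res[grp].append(i)
    return dict(res)


a1 = EduSketch("a1", "doc-a")
a2 = EduSketch("a2", "doc-a")
b1 = EduSketch("b1", "doc-b")
rows = groupings_sketch([(FAKE_ROOT, a1), (a1, a2), (FAKE_ROOT, b1)])
```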
-
attelo.table.
idxes_attached
(dpack, target)¶ Indices of attached pairings from dpack, according to target.
Parameters: - dpack (DataPack) – Datapack
- target (list of integers) – Label for each pairings of dpack
Returns: indices (array of integers) – Indices of attached pairings.
TODO: Try and apply widely, especially for parser.intra; search for e.g. “target != unrelated” and “target[i] != unrelated”.
-
attelo.table.
idxes_fakeroot
(dpack)¶ Return the datapack indices of only those pairings which involve the fake root node
-
attelo.table.
idxes_inter
(dpack, include_fake_root=False)¶ Return indices of pairings from different subgroupings.
Parameters: - dpack (DataPack) – Datapack under scrutiny
- include_fake_root (boolean, optional) – If True, pairings of the form (FAKE_ROOT_ID, x) are included.
Returns: idxes – Indices of the inter pairings.
Return type: list of int
-
attelo.table.
idxes_intra
(dpack, include_fake_root=False)¶ Return indices of pairings from same subgrouping, inside a datapack.
Parameters: - dpack (DataPack) – Datapack under scrutiny
- include_fake_root (boolean, optional) – If True, pairings of the form (FAKE_ROOT_ID, x) are included.
Returns: idxes – Indices of the intra pairings.
Return type: list of int
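The intra/inter split can be sketched as follows. For brevity the sketch works on a bare list of pairings rather than a datapack, uses a hypothetical reduced EDU type, and assumes that the fake root is always the first member of a pairing.

```python
from collections import namedtuple

# Hypothetical, reduced version of attelo.edu.EDU
EduSketch = namedtuple("EduSketch", "id grouping subgrouping")
FAKE_ROOT_ID = "ROOT"


def _idxes(pairings, want_same, include_fake_root):
    """Shared helper: filter pairings by subgrouping agreement."""
    idxes = []
    for i, (edu1, edu2) in enumerate(pairings):
        if edu1.id == FAKE_ROOT_ID:
            if include_fake_root:
                idxes.append(i)
            continue
        same = ((edu1.grouping, edu1.subgrouping) ==
                (edu2.grouping, edu2.subgrouping))
        if same == want_same:
            idxes.append(i)
    return idxes


def idxes_intra_sketch(pairings, include_fake_root=False):
    return _idxes(pairings, True, include_fake_root)


def idxes_inter_sketch(pairings, include_fake_root=False):
    return _idxes(pairings, False, include_fake_root)


root = EduSketch(FAKE_ROOT_ID, None, None)
a = EduSketch("a", "d1", "s1")
b = EduSketch("b", "d1", "s1")
c = EduSketch("c", "d1", "s2")
pairings = [(root, a), (a, b), (a, c), (c, b)]
```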
-
attelo.table.
locate_in_subpacks
(dpack, subpacks)¶ Given a datapack and some of its subpacks, return a list of tuples identifying for each pair, its subpack and index in that subpack.
If a pair is not found in the list of subpacks, we return None instead of tuple
Return type: [None or (DataPack, float)]
-
attelo.table.
mpack_pairing_distances
(mpack)¶ Return, for each target value (label) in the multipack, the left and right maximum distances of EDU pairings. See
pairing_distances()
for details.
Return type: dict(int, (int, int))
-
attelo.table.
pairing_distances
(dpack)¶ Return for each target value (label) in the datapack, the left and right maximum distances of edu pairings (in number of EDUs, so adjacent EDUs have distance of 0)
Note that we assume a single-document datapack. If you give this a stacked datapack, you may get very large distances to the fake root
Return type: dict(int, (int, int))
-
attelo.table.
select_window
(dpack, window)¶ Select only EDU pairs that are at most window EDUs apart from each other (adjacent EDUs would be considered 0 apart)
Note that if the window is None, we simply return the original datapack
Note that this will only work correctly on single-document datapacks
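The window rule can be sketched on bare EDU identifiers. Here `positions` is a hypothetical mapping from EDU id to its ordinal position in the document; how attelo actually derives positions, and how it treats the fake root, is not shown.

```python
def select_window_sketch(positions, pairings, window):
    """Keep pairs at most `window` EDUs apart (adjacent EDUs are 0 apart)."""
    if window is None:
        return list(pairings)
    return [(e1, e2) for e1, e2 in pairings
            if abs(positions[e1] - positions[e2]) - 1 <= window]


# Hypothetical single-document data
positions = {"a": 0, "b": 1, "c": 2, "d": 3}
pairings = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "b")]

adjacent_only = select_window_sketch(positions, pairings, 0)
one_apart = select_window_sketch(positions, pairings, 1)
```

On a stacked datapack, positions from different documents would be compared with each other, which is why the function is only meaningful for single-document datapacks.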
attelo.util module¶
General-purpose classes and functions
-
class
attelo.util.
ArgparserEnum
¶ Bases:
enum.Enum
An enumeration whose values we spit out as choices to argparser
-
classmethod
choices_str
()¶ available choices in this enumeration
-
classmethod
from_string
(string)¶ from command line arg
-
classmethod
help_suffix
(default)¶ help text suffix showing choices and default
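One way such an enumeration could plug into argparse is sketched below. The return formats of `choices_str` and `help_suffix`, the `Mode` enumeration and the `--mode` option are all assumptions made for illustration.

```python
import argparse
import enum


class ArgparserEnumSketch(enum.Enum):
    """Hypothetical stand-in for attelo.util.ArgparserEnum."""

    @classmethod
    def choices_str(cls):
        # Assumed format: comma-separated member names
        return ", ".join(x.name for x in cls)

    @classmethod
    def from_string(cls, string):
        names = {x.name: x for x in cls}
        if string not in names:
            raise argparse.ArgumentTypeError(
                "must be one of: " + cls.choices_str())
        return names[string]

    @classmethod
    def help_suffix(cls, default):
        return " (choices: {}; default: {})".format(
            cls.choices_str(), default.name)


class Mode(ArgparserEnumSketch):
    """Hypothetical enumeration of modes."""
    joint = 1
    post_label = 2


psr = argparse.ArgumentParser()
psr.add_argument("--mode", type=Mode.from_string, default=Mode.joint,
                 help="decoding mode" + Mode.help_suffix(Mode.joint))
args = psr.parse_args(["--mode", "post_label"])
```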
-
class
attelo.util.
Team
¶ Bases:
attelo.util.Team
Any collection where we have the same thing but duplicated for each attelo subtask (eg. models, learners)
-
fmap
(func)¶ Apply a function to each member of the collection
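A small sketch of the idea, assuming the two subtasks are attachment and labelling. The field names `attach` and `label` are a guess modelled on the Graph parameters, not taken from the Team class itself.

```python
from collections import namedtuple


class TeamSketch(namedtuple("TeamSketch", "attach label")):
    """Hypothetical stand-in for attelo.util.Team."""

    def fmap(self, func):
        """Apply func to each member and return a new team."""
        return TeamSketch(attach=func(self.attach),
                          label=func(self.label))


paths = TeamSketch(attach="attach.model", label="label.model")
upper = paths.fmap(str.upper)
```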
-
-
attelo.util.
concat_i
(iters)¶ Merge an iterable of iterables into a single iterable
-
attelo.util.
concat_l
(iters)¶ Merge an iterable of iterables into a list
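Both helpers behave like thin wrappers over `itertools.chain.from_iterable`. The sketch below shows that reading; the `_sketch` names are hypothetical and the attelo source may be written differently.

```python
import itertools


def concat_i_sketch(iters):
    """Lazily merge an iterable of iterables into one iterable."""
    return itertools.chain.from_iterable(iters)


def concat_l_sketch(iters):
    """Merge an iterable of iterables into a list."""
    return list(concat_i_sketch(iters))


merged = concat_l_sketch([[1, 2], (3,), [], "ab"])
```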
-
attelo.util.
mk_rng
(shuffle=False, default_seed=None)¶ Return a random number generator instance, hard-seeded unless we ask for shuffling to be enabled
(note: if shuffle mode is enabled, the rng in question will just be the system generator)
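A sketch of the seeding behaviour, reading “the system generator” as the shared `random` module. That reading and the fallback seed string are assumptions.

```python
import random


def mk_rng_sketch(shuffle=False, default_seed=None):
    """Return a fixed-seed generator unless shuffling is requested."""
    if shuffle:
        # Assumption: "system generator" means the shared random module
        return random
    # Hypothetical fallback seed, used when no default seed is supplied
    seed = default_seed if default_seed is not None else "attelo-sketch"
    return random.Random(seed)


first = mk_rng_sketch().random()
second = mk_rng_sketch().random()
```

With shuffling off, two fresh generators produce the same sequence, which keeps fold assignments reproducible between runs.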
-
attelo.util.
truncate
(text, width)¶ Truncate a string and append an ellipsis if truncated
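A minimal sketch of truncation with an ellipsis. Whether the ellipsis counts towards the width, and which ellipsis characters are used, is not stated above; this version appends three dots after `width` characters.

```python
def truncate_sketch(text, width):
    """Cut text to width characters, marking the cut with an ellipsis."""
    if len(text) <= width:
        return text
    return text[:width] + "..."


short = truncate_sketch("edu", 4)
cut = truncate_sketch("discourse", 4)
```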