File Coverage

blib/lib/Data/Pareto.pm

Criterion	Covered	Total	%
statement	85	86	98.8
branch	21	24	87.5
condition	6	8	75.0
subroutine	18	18	100.0
pod	6	6	100.0
total	136	142	95.7

line	stmt	bran	cond	sub	pod	time	code
1							package Data::Pareto;
2
3	2			2		52792	use warnings;
	2					4
	2					60
4	2			2		11	use strict;
	2					4
	2					68
5
6	2			2		10	use Scalar::Util qw( reftype );
	2					7
	2					242
7	2			2		9	use Carp;
	2					3
	2					2244
8
9							=head1 NAME
10
11							Data::Pareto - Computing Pareto sets in Perl
12
13							=head1 VERSION
14
15							Version 0.05
16
17							=cut
18
19							our $VERSION = '0.05';
20
21							=head1 SYNOPSIS
22
23							use Data::Pareto;
24
25							# only first and third columns are used in comparison
26							# the others are simply descriptive
27							my $set = new Data::Pareto( { columns => [0, 2] } );
28							$set->add(
29							[ 5, "pareto", 10, 11 ],
30							[ 5, "dominated", 11, 9 ],
31							[ 4, "pareto2", 12, 12 ]
32							);
33
34							# this returns [ [ 5, "pareto", 10, 11 ], [ 4, "pareto2", 12, 12 ] ],
35							# the other one is dominated on selected columns
36							$set->get_pareto_ref;
37
38							=head1 DESCRIPTION
39
40							This module makes calculation of Pareto set. Given a set of vectors
41							(i.e. arrays of simple scalars), Pareto set is all the vectors from the given
42							set which are not dominated by any other vector of the set. A vector C is
43							said to be dominated by C, iff C<< X[i] >= Y[i] >> for all C and
44							C<< X[i] > Y[i] >> for at least one C.
45
46							Pareto sets play an important role in multiobjective optimization, where
47							each non-dominated (i.e. Pareto) vector describes objectives value of
48							"optimal" solution to the given problem.
49
50							This module allows occurrence of duplicates in the set - this makes it
51							rather a bag than a set, but is useful in practice (e.g. when we want to
52							preserve two solutions giving the same objectives value, but structurally
53							different). This assumption influences dominance definition given above:
54							two duplicates never dominate each other and hence can be present in the Pareto
55							set. This is controlled by C option passed to L: if set
56							to C value, duplicates are allowed in Pareto set; otherwise, only the
57							first found element of the subset of duplicated vectors is preserved in Pareto
58							set.
59
60							The values are allowed to be invalid. The meaning of 'invalid' is 'the worst
61							possible'. It's different concept than 'unknown'; unknown value make the
62							definition of domination less clear.
63
64							By default, the comparison of column values is numerical and the smaller
65							value dominates the larger one. If you want to override this behaviour, pass
66							your own dominator sub in arguments to L.
67
68							=head1 FUNCTIONS
69
70							By default, a vector is passed around as a ref to array of consecutive column
71							values. This means you shouldn't mess with it after passing to C method.
72
73							=cut
74
75
76							=head2 new
77
78							Creates a new object for calculating Pareto set.
79
80							The first argument passed is a hashref with options; the recognized options are:
81
82							=over
83
84							=item * C
85
86							Arrayref containing column numbers which should be used for determining
87							domination and duplication. Column numbers are C<0>-based array indexes to
88							data vectors.
89
90							Only values at those positions will be ever compared between vectors.
91							Any other data in the vectors may be present and is not used in any way.
92
93							At least one column number should be passed, for obvious reasons.
94
95							=item * C
96
97							If set to C value, duplicated vectors are all put in Pareto set (if they
98							are Pareto, of course). If set to C, duplicates of vectors already
99							in the Pareto set are discarded.
100
101							=item * C
102
103							The value considered invalid in pareto set. Such value is dominated by
104							any value and dominates only invalid value.
105
106							However, computations of domination in presence of invalid values can be
107							considerably slower, as much as 5 times. So it probably will be faster to first
108							parse the data and replace invalid markers with some huge-and-surely-dominated
109							values.
110
111							=item * C
112
113							The sub(s) used to compare specific column values and determining domination
114							between them. Scalar, sub ref or hash ref. If not set, the default is that
115							the numerically smaller value dominates the other one.
116
117							When the scalar is passed, it is assumed to be the name of a predefined
118							dominator. This is a much faster option to specifying the sub of your own.
119							Recognized dominators are:
120
121							=over
122
123							=item * C numerically smaller value dominates
124
125							=item * C numerically greater value dominates
126
127							=item * C earlier in collation order value dominates (lexicographical
128							order)
129
130							=item * C later in collation order value dominates (reversed
131							lexicographical order)
132
133							=item * C standard, i.e. C dominator
134
135							=back
136
137							During creation of Pareto set, the dominator sub is called with three arguments:
138							column number, first vector's value, second vector's value, and should return
139							C, when the second value dominates the first one, assuming they appeared
140							in the specified column.
141
142							Make sure that your sub returns C when two passed values are the same.
143							This is necessary to obey the whole Pareto set domination contract.
144
145							There are two approaches possible when the values in different columns are of
146							different types, in the sense of domination. First, you can use passed column
147							number to decide the domination check function. Alternatively, you can pass a
148							hash ref with mapping from the column number to the sub ref used to compare the
149							given column:
150
151							my $lexi_dominator = sub {
152							my ($col, $dominated, $by) = @_;
153							return ($dominated ge $by);
154							};
155							my $min_dominator = sub {
156							my ($col, $dominated, $by) = @_;
157							return ($dominated >= $by);
158							}
159
160							my $set = new Data::Pareto({
161							columns => [0, 2],
162							column_dominator => {
163							0 => $lexi_dominator,
164							2 => $min_dominator
165							}
166							});
167							$set->add(['a', 'label 1', 12], ['b', 'label 2', 9]);
168
169							=back
170
171							The rest of arguments are assumed to be vectors, and passed to L
172							method.
173
174							=cut
175
176
177							sub new {
178	32			32	1	3769	my ($class, $attrs) = (shift, shift);
179	32					158	my $self = bless {
180							pareto => [ ],
181							vectorStatus => { },
182							%$attrs
183							}, $class;
184	32					72	$self->_construct_subs;
185
186	32	50				82	$self->add(@_) if @_;
187	32					65	return $self;
188							}
189
190							=head2 add
191
192							Tests vectors passed as arguments and adds the non-dominated ones to the
193							Pareto set.
194
195							=cut
196
197							sub add {
198	19			19	1	85	my $self = shift;
199	19					51	$self->_update_pareto($_) for @_;
200							}
201
202							=head2 get_pareto
203
204							Returns the current content of Pareto set as a list of vectors.
205
206							=cut
207
208							sub get_pareto {
209	2			2	1	8	my ($self) = @_;
210	2					3	return (@{$self->{pareto}});
	2					6
211							}
212
213							=head2 get_pareto_ref
214
215							Returns the current content of Pareto set as a ref to array with vectors.
216							The return value references the original array, so treat it as read-only!
217
218							=cut
219
220							sub get_pareto_ref {
221	17			17	1	61	my ($self) = @_;
222	17					114	return $self->{pareto};
223							}
224
225							# update (potentially) the set with a new vector:
226							# check if it is Pareto, if so, remove dominated vectors
227							sub _update_pareto {
228	43			43		52	my ($self, $NV) = @_;
229
230							# check if we already have a duplicate?
231							# if so, handle it gently, so there are no mind-cracking
232							# algorithm variations after that
233
234	43	100				72	if ($self->_has_duplicates($NV)) {
235							# ...then it depends on the policy
236	4	100				8	if ($self->{duplicates}) {
237							# add the duplicated vector to the pareto set
238	3					4	push @{$self->{pareto}}, $NV;
	3					5
239							} else {
240							# simply disgard the new vector
241							}
242	4					12	return;
243							}
244
245	39					45	my @newP = ( );
246	39					62	my $surePareto = 0;
247
248							# check with every vector considered pareto so far
249	39					39	for my $o (@{$self->{pareto}}) {
	39					66
250	27	100				40	if ($surePareto) {
251							# preserve the current vector only if it is not dominated by new (now Pareto) vector
252	1	50				20	if ($self->{_sub_is_dominated}($self, $o, $NV)) {
253	1					3	$self->_ban_vector($o);
254							} else {
255	0					0	push @newP, $o;
256							}
257							} else {
258							# stop processing with unchanged Pareto set if the new vector is dominated by the current one
259	26	100				524	return if $self->{_sub_is_dominated}($self, $NV, $o);
260
261							# mark new vector as "sure Pareto" only if it dominates the current vector
262	21	100				426	if ($self->{_sub_is_dominated}($self, $o, $NV)) {
263	5					7	$surePareto = 1;
264							# ...and hence we don't preserve the dominated current vector
265	5					10	$self->_ban_vector($o);
266	5					9	next;
267							}
268
269							# otherwise, the current vector is for sure Pareto still, so preserve it
270	16					33	push @newP, $o;
271							}
272							}
273
274	34					46	push @newP, $NV;
275	34					65	$self->_mark_vector($NV);
276	34					134	$self->{pareto} = \@newP;
277							}
278
279							=head2 is_dominated
280
281							Checks if the first vector passed is dominated by the second one.
282							The comparison is made based on the values in vectors' columns, which
283							were passed to L.
284
285							The vectors passed are never duplicates of each other when this method is
286							called from inside this module.
287
288							Returns C, when the first vector from arguments list
289							is dominated by the other one, and C otherwise.
290
291							=cut
292
293	22			22	1	745	sub is_dominated { $_[0]->{_sub_is_dominated}(@_); } # pass the whole @_, as the sub thinks it is a method
294
295							# these are is_dominated() parts which will be composed into the function,
296							# depending on the constructor options.
297							my %_is_dominated_parts = (
298							invalid => <<'_EOT_',
299							next if $self->{_sub_is_invalid}($dominated->[$col]); # invalid dominated by anything
300							return 0 if $self->{_sub_is_invalid}($by->[$col]); # invalid can't dominate valid
301							_EOT_
302							dominator_min => '($dominated->[$col] >= $by->[$col])',
303							dominator_max => '($dominated->[$col] <= $by->[$col])',
304							dominator_lexi => '($dominated->[$col] ge $by->[$col])',
305							dominator_lexi_rev => '($dominated->[$col] le $by->[$col])',
306
307							_dominator_custom => <<'_EOT_',
308							$self->{column_dominator}($col, $dominated->[$col], $by->[$col])
309							_EOT_
310							_dominator_custom_hash => <<'_EOT_',
311							$self->{column_dominator}{$col}($col, $dominated->[$col], $by->[$col])
312							_EOT_
313
314							);
315							$_is_dominated_parts{dominator_std} = $_is_dominated_parts{dominator_min};
316
317							sub _construct_subs {
318	32			32		39	my ($self) = @_;
319
320	32					29	my $invalid_part;
321	32	100				66	if (exists $self->{invalid}) {
322	4					5	my $inv = $self->{invalid};
323	4			23		14	$self->{_sub_is_invalid} = sub { $_[0] eq $inv };
	23					378
324	4					9	$invalid_part = $_is_dominated_parts{invalid};
325							} else {
326	28			2		91	$self->{_sub_is_invalid} = sub { 0 };
	2					11
327	28					36	$invalid_part = '';
328							}
329
330	32					32	my $cmp_part;
331	32	100				69	if (exists $self->{column_dominator}) {
332	10		50			26	my $dom = $self->{column_dominator} \|\| '';
333	10					25	my $type = reftype $dom;
334	10	100	66			25	if (!defined $type) {
		100
335							# builtin
336	8					18	$cmp_part = $_is_dominated_parts{"dominator_$dom"};
337	8	50				18	croak "Unrecognized dominator builtin '$dom'" unless $cmp_part;
338							} elsif ($type && $type eq 'HASH') {
339	1					2	$cmp_part = $_is_dominated_parts{_dominator_custom_hash};
340							} else {
341	1					3	$cmp_part = $_is_dominated_parts{_dominator_custom};
342							}
343							} else {
344	22					36	$cmp_part = $_is_dominated_parts{dominator_std};
345							}
346
347	32					77	my $sub_str = <<'_EOT_'
348							sub {
349							my ($self, $dominated, $by) = @_;
350							for my $col (@{$self->{columns}}) {
351							_EOT_
352							. <<_EOT_
353							$invalid_part
354							return 0 unless $cmp_part;
355							}
356							1;
357							}
358							_EOT_
359							;
360	32					3282	$self->{_sub_is_dominated} = eval $sub_str;
361							}
362
363							=head2 is_invalid
364
365							Checks if the given value is considered invalid for the current object.
366							Every value is valid by default.
367
368							=cut
369
370	6			6	1	51	sub is_invalid { return $_[0]->{_sub_is_invalid}($_[1]); }
371
372							# calculate the string repr. of a vector; to be used as a hash key
373							sub _vector_key {
374	83			83		86	my ($self, $v) = @_;
375	83					94	my @cols = ( );
376	83					71	for my $c (@{$self->{columns}}) {
	83					146
377	191					273	push @cols, $v->[$c];
378							}
379
380	83					267	return join ';', @cols;
381							}
382
383							# checks if the given vector has duplicates in Pareto
384							sub _has_duplicates {
385	43			43		44	my ($self, $v) = @_;
386	43					73	my $key = $self->_vector_key($v);
387	43		100			169	return (exists $self->{vectorStatus}{$key} && $self->{vectorStatus}{$key} > 0);
388							}
389
390							# mark the vector as not present in Pareto.
391							# In the future it can be used to ban the vector from trying to return
392							# to the Pareto set.
393							sub _ban_vector {
394	6			6		8	my ($self, $v) = @_;
395	6					10	my $key = $self->_vector_key($v);
396	6					11	$self->{vectorStatus}{$key} = 0;
397							}
398
399							# mark vector as present in the Pareto set.
400							sub _mark_vector {
401	34			34		38	my ($self, $v) = @_;
402	34					56	my $key = $self->_vector_key($v);
403	34					84	$self->{vectorStatus}{$key} = 1;
404							}
405
406							=head1 TODO
407
408							Allow specifying built-in dominators inside dominator hash.
409
410							For large data sets calculations become time-intensive. There are a couple
411							of techniques which might be applied to improve the performance:
412
413							=over
414
415							=item * defer the phase of removing vectors dominated by newly added vectors
416							to L call; this results in smaller number of arrays
417							rewritings.
418
419							=item * split the set of vectors being added into smaller subsets, calculate
420							Pareto sets for such subsets, and then apply insertion of resulting Pareto
421							subsets to the main set; this results in smaller number of useless tries of
422							adding dominated vectors into the set.
423
424							=back
425
426							=head1 AUTHOR
427
428							Przemyslaw Wesolek, C<< >>
429
430							=head1 BUGS
431
432							Please report any bugs or feature requests to C, or through
433							the web interface at L. I will be notified, and then you'll
434							automatically be notified of progress on your bug as I make changes.
435
436							=head1 SUPPORT
437
438							You can find documentation for this module with the perldoc command.
439
440							perldoc Data::Pareto
441
442
443							You can also look for information at:
444
445							=over 4
446
447							=item * RT: CPAN's request tracker
448
449							L
450
451							=item * AnnoCPAN: Annotated CPAN documentation
452
453							L
454
455							=item * CPAN Ratings
456
457							L
458
459							=item * Search CPAN
460
461							L
462
463							=back
464
465							=head1 COPYRIGHT & LICENSE
466
467
468							Copyright 2009 Przemyslaw Wesolek
469
470							This program is free software; you can redistribute it and/or modify it
471							under the terms of the Artistic License 2.0. For details, see the full
472							text of the license in the file LICENSE.
473
474							=cut
475
476							1; # End of Data::Pareto