File Coverage

blib/lib/Data/Pareto.pm
Criterion Covered Total %
statement 85 86 98.8
branch 21 24 87.5
condition 6 8 75.0
subroutine 18 18 100.0
pod 6 6 100.0
total 136 142 95.7


line stmt bran cond sub pod time code
1             package Data::Pareto;
2              
3 2     2   52792 use warnings;
  2         4  
  2         60  
4 2     2   11 use strict;
  2         4  
  2         68  
5              
6 2     2   10 use Scalar::Util qw( reftype );
  2         7  
  2         242  
7 2     2   9 use Carp;
  2         3  
  2         2244  
8              
9             =head1 NAME
10              
11             Data::Pareto - Computing Pareto sets in Perl
12              
13             =head1 VERSION
14              
15             Version 0.05
16              
17             =cut
18              
19             our $VERSION = '0.05';
20              
21             =head1 SYNOPSIS
22              
23             use Data::Pareto;
24            
25             # only first and third columns are used in comparison
26             # the others are simply descriptive
27             my $set = new Data::Pareto( { columns => [0, 2] } );
28             $set->add(
29             [ 5, "pareto", 10, 11 ],
30             [ 5, "dominated", 11, 9 ],
31             [ 4, "pareto2", 12, 12 ]
32             );
33              
34             # this returns [ [ 5, "pareto", 10, 11 ], [ 4, "pareto2", 12, 12 ] ],
35             # the other one is dominated on selected columns
36             $set->get_pareto_ref;
37              
38             =head1 DESCRIPTION
39              
40             This module makes calculation of Pareto set. Given a set of vectors
41             (i.e. arrays of simple scalars), Pareto set is all the vectors from the given
42             set which are not dominated by any other vector of the set. A vector C is
43             said to be dominated by C, iff C<< X[i] >= Y[i] >> for all C and
44             C<< X[i] > Y[i] >> for at least one C.
45              
46             Pareto sets play an important role in multiobjective optimization, where
47             each non-dominated (i.e. Pareto) vector describes objectives value of
48             "optimal" solution to the given problem.
49              
50             This module allows occurrence of duplicates in the set - this makes it
51             rather a bag than a set, but is useful in practice (e.g. when we want to
52             preserve two solutions giving the same objectives value, but structurally
53             different). This assumption influences dominance definition given above:
54             two duplicates never dominate each other and hence can be present in the Pareto
55             set. This is controlled by C option passed to L: if set
56             to C value, duplicates are allowed in Pareto set; otherwise, only the
57             first found element of the subset of duplicated vectors is preserved in Pareto
58             set.
59              
60             The values are allowed to be invalid. The meaning of 'invalid' is 'the worst
61             possible'. It's different concept than 'unknown'; unknown value make the
62             definition of domination less clear.
63              
64             By default, the comparison of column values is numerical and the smaller
65             value dominates the larger one. If you want to override this behaviour, pass
66             your own dominator sub in arguments to L.
67              
68             =head1 FUNCTIONS
69              
70             By default, a vector is passed around as a ref to array of consecutive column
71             values. This means you shouldn't mess with it after passing to C method.
72              
73             =cut
74              
75              
76             =head2 new
77              
78             Creates a new object for calculating Pareto set.
79              
80             The first argument passed is a hashref with options; the recognized options are:
81              
82             =over
83              
84             =item * C
85              
86             Arrayref containing column numbers which should be used for determining
87             domination and duplication. Column numbers are C<0>-based array indexes to
88             data vectors.
89              
90             Only values at those positions will be ever compared between vectors.
91             Any other data in the vectors may be present and is not used in any way.
92              
93             At least one column number should be passed, for obvious reasons.
94              
95             =item * C
96              
97             If set to C value, duplicated vectors are all put in Pareto set (if they
98             are Pareto, of course). If set to C, duplicates of vectors already
99             in the Pareto set are discarded.
100              
101             =item * C
102              
103             The value considered invalid in pareto set. Such value is dominated by
104             any value and dominates only invalid value.
105              
106             However, computations of domination in presence of invalid values can be
107             considerably slower, as much as 5 times. So it probably will be faster to first
108             parse the data and replace invalid markers with some huge-and-surely-dominated
109             values.
110              
111             =item * C
112              
113             The sub(s) used to compare specific column values and determining domination
114             between them. Scalar, sub ref or hash ref. If not set, the default is that
115             the numerically smaller value dominates the other one.
116              
117             When the scalar is passed, it is assumed to be the name of a predefined
118             dominator. This is a much faster option to specifying the sub of your own.
119             Recognized dominators are:
120              
121             =over
122              
123             =item * C numerically smaller value dominates
124              
125             =item * C numerically greater value dominates
126              
127             =item * C earlier in collation order value dominates (lexicographical
128             order)
129              
130             =item * C later in collation order value dominates (reversed
131             lexicographical order)
132              
133             =item * C standard, i.e. C dominator
134              
135             =back
136              
137             During creation of Pareto set, the dominator sub is called with three arguments:
138             column number, first vector's value, second vector's value, and should return
139             C, when the second value dominates the first one, assuming they appeared
140             in the specified column.
141              
142             Make sure that your sub returns C when two passed values are the same.
143             This is necessary to obey the whole Pareto set domination contract.
144            
145             There are two approaches possible when the values in different columns are of
146             different types, in the sense of domination. First, you can use passed column
147             number to decide the domination check function. Alternatively, you can pass a
148             hash ref with mapping from the column number to the sub ref used to compare the
149             given column:
150              
151             my $lexi_dominator = sub {
152             my ($col, $dominated, $by) = @_;
153             return ($dominated ge $by);
154             };
155             my $min_dominator = sub {
156             my ($col, $dominated, $by) = @_;
157             return ($dominated >= $by);
158             }
159            
160             my $set = new Data::Pareto({
161             columns => [0, 2],
162             column_dominator => {
163             0 => $lexi_dominator,
164             2 => $min_dominator
165             }
166             });
167             $set->add(['a', 'label 1', 12], ['b', 'label 2', 9]);
168              
169             =back
170              
171             The rest of arguments are assumed to be vectors, and passed to L
172             method.
173              
174             =cut
175              
176              
177             sub new {
178 32     32 1 3769 my ($class, $attrs) = (shift, shift);
179 32         158 my $self = bless {
180             pareto => [ ],
181             vectorStatus => { },
182             %$attrs
183             }, $class;
184 32         72 $self->_construct_subs;
185            
186 32 50       82 $self->add(@_) if @_;
187 32         65 return $self;
188             }
189              
190             =head2 add
191              
192             Tests vectors passed as arguments and adds the non-dominated ones to the
193             Pareto set.
194              
195             =cut
196              
197             sub add {
198 19     19 1 85 my $self = shift;
199 19         51 $self->_update_pareto($_) for @_;
200             }
201              
202             =head2 get_pareto
203              
204             Returns the current content of Pareto set as a list of vectors.
205              
206             =cut
207              
208             sub get_pareto {
209 2     2 1 8 my ($self) = @_;
210 2         3 return (@{$self->{pareto}});
  2         6  
211             }
212              
213             =head2 get_pareto_ref
214              
215             Returns the current content of Pareto set as a ref to array with vectors.
216             The return value references the original array, so treat it as read-only!
217              
218             =cut
219              
220             sub get_pareto_ref {
221 17     17 1 61 my ($self) = @_;
222 17         114 return $self->{pareto};
223             }
224              
225             # update (potentially) the set with a new vector:
226             # check if it is Pareto, if so, remove dominated vectors
227             sub _update_pareto {
228 43     43   52 my ($self, $NV) = @_;
229            
230             # check if we already have a duplicate?
231             # if so, handle it gently, so there are no mind-cracking
232             # algorithm variations after that
233            
234 43 100       72 if ($self->_has_duplicates($NV)) {
235             # ...then it depends on the policy
236 4 100       8 if ($self->{duplicates}) {
237             # add the duplicated vector to the pareto set
238 3         4 push @{$self->{pareto}}, $NV;
  3         5  
239             } else {
240             # simply disgard the new vector
241             }
242 4         12 return;
243             }
244            
245 39         45 my @newP = ( );
246 39         62 my $surePareto = 0;
247            
248             # check with every vector considered pareto so far
249 39         39 for my $o (@{$self->{pareto}}) {
  39         66  
250 27 100       40 if ($surePareto) {
251             # preserve the current vector only if it is not dominated by new (now Pareto) vector
252 1 50       20 if ($self->{_sub_is_dominated}($self, $o, $NV)) {
253 1         3 $self->_ban_vector($o);
254             } else {
255 0         0 push @newP, $o;
256             }
257             } else {
258             # stop processing with unchanged Pareto set if the new vector is dominated by the current one
259 26 100       524 return if $self->{_sub_is_dominated}($self, $NV, $o);
260            
261             # mark new vector as "sure Pareto" only if it dominates the current vector
262 21 100       426 if ($self->{_sub_is_dominated}($self, $o, $NV)) {
263 5         7 $surePareto = 1;
264             # ...and hence we don't preserve the dominated current vector
265 5         10 $self->_ban_vector($o);
266 5         9 next;
267             }
268            
269             # otherwise, the current vector is for sure Pareto still, so preserve it
270 16         33 push @newP, $o;
271             }
272             }
273              
274 34         46 push @newP, $NV;
275 34         65 $self->_mark_vector($NV);
276 34         134 $self->{pareto} = \@newP;
277             }
278              
279             =head2 is_dominated
280              
281             Checks if the first vector passed is dominated by the second one.
282             The comparison is made based on the values in vectors' columns, which
283             were passed to L.
284              
285             The vectors passed are never duplicates of each other when this method is
286             called from inside this module.
287              
288             Returns C, when the first vector from arguments list
289             is dominated by the other one, and C otherwise.
290              
291             =cut
292              
293 22     22 1 745 sub is_dominated { $_[0]->{_sub_is_dominated}(@_); } # pass the whole @_, as the sub thinks it is a method
294              
295             # these are is_dominated() parts which will be composed into the function,
296             # depending on the constructor options.
297             my %_is_dominated_parts = (
298             invalid => <<'_EOT_',
299             next if $self->{_sub_is_invalid}($dominated->[$col]); # invalid dominated by anything
300             return 0 if $self->{_sub_is_invalid}($by->[$col]); # invalid can't dominate valid
301             _EOT_
302             dominator_min => '($dominated->[$col] >= $by->[$col])',
303             dominator_max => '($dominated->[$col] <= $by->[$col])',
304             dominator_lexi => '($dominated->[$col] ge $by->[$col])',
305             dominator_lexi_rev => '($dominated->[$col] le $by->[$col])',
306            
307             _dominator_custom => <<'_EOT_',
308             $self->{column_dominator}($col, $dominated->[$col], $by->[$col])
309             _EOT_
310             _dominator_custom_hash => <<'_EOT_',
311             $self->{column_dominator}{$col}($col, $dominated->[$col], $by->[$col])
312             _EOT_
313              
314             );
315             $_is_dominated_parts{dominator_std} = $_is_dominated_parts{dominator_min};
316            
317             sub _construct_subs {
318 32     32   39 my ($self) = @_;
319            
320 32         29 my $invalid_part;
321 32 100       66 if (exists $self->{invalid}) {
322 4         5 my $inv = $self->{invalid};
323 4     23   14 $self->{_sub_is_invalid} = sub { $_[0] eq $inv };
  23         378  
324 4         9 $invalid_part = $_is_dominated_parts{invalid};
325             } else {
326 28     2   91 $self->{_sub_is_invalid} = sub { 0 };
  2         11  
327 28         36 $invalid_part = '';
328             }
329            
330 32         32 my $cmp_part;
331 32 100       69 if (exists $self->{column_dominator}) {
332 10   50     26 my $dom = $self->{column_dominator} || '';
333 10         25 my $type = reftype $dom;
334 10 100 66     25 if (!defined $type) {
    100          
335             # builtin
336 8         18 $cmp_part = $_is_dominated_parts{"dominator_$dom"};
337 8 50       18 croak "Unrecognized dominator builtin '$dom'" unless $cmp_part;
338             } elsif ($type && $type eq 'HASH') {
339 1         2 $cmp_part = $_is_dominated_parts{_dominator_custom_hash};
340             } else {
341 1         3 $cmp_part = $_is_dominated_parts{_dominator_custom};
342             }
343             } else {
344 22         36 $cmp_part = $_is_dominated_parts{dominator_std};
345             }
346            
347 32         77 my $sub_str = <<'_EOT_'
348             sub {
349             my ($self, $dominated, $by) = @_;
350             for my $col (@{$self->{columns}}) {
351             _EOT_
352             . <<_EOT_
353             $invalid_part
354             return 0 unless $cmp_part;
355             }
356             1;
357             }
358             _EOT_
359             ;
360 32         3282 $self->{_sub_is_dominated} = eval $sub_str;
361             }
362              
363             =head2 is_invalid
364              
365             Checks if the given value is considered invalid for the current object.
366             Every value is valid by default.
367              
368             =cut
369              
370 6     6 1 51 sub is_invalid { return $_[0]->{_sub_is_invalid}($_[1]); }
371              
372             # calculate the string repr. of a vector; to be used as a hash key
373             sub _vector_key {
374 83     83   86 my ($self, $v) = @_;
375 83         94 my @cols = ( );
376 83         71 for my $c (@{$self->{columns}}) {
  83         146  
377 191         273 push @cols, $v->[$c];
378             }
379            
380 83         267 return join ';', @cols;
381             }
382              
383             # checks if the given vector has duplicates in Pareto
384             sub _has_duplicates {
385 43     43   44 my ($self, $v) = @_;
386 43         73 my $key = $self->_vector_key($v);
387 43   100     169 return (exists $self->{vectorStatus}{$key} && $self->{vectorStatus}{$key} > 0);
388             }
389              
390             # mark the vector as not present in Pareto.
391             # In the future it can be used to ban the vector from trying to return
392             # to the Pareto set.
393             sub _ban_vector {
394 6     6   8 my ($self, $v) = @_;
395 6         10 my $key = $self->_vector_key($v);
396 6         11 $self->{vectorStatus}{$key} = 0;
397             }
398              
399             # mark vector as present in the Pareto set.
400             sub _mark_vector {
401 34     34   38 my ($self, $v) = @_;
402 34         56 my $key = $self->_vector_key($v);
403 34         84 $self->{vectorStatus}{$key} = 1;
404             }
405              
406             =head1 TODO
407              
408             Allow specifying built-in dominators inside dominator hash.
409              
410             For large data sets calculations become time-intensive. There are a couple
411             of techniques which might be applied to improve the performance:
412              
413             =over
414              
415             =item * defer the phase of removing vectors dominated by newly added vectors
416             to L call; this results in smaller number of arrays
417             rewritings.
418              
419             =item * split the set of vectors being added into smaller subsets, calculate
420             Pareto sets for such subsets, and then apply insertion of resulting Pareto
421             subsets to the main set; this results in smaller number of useless tries of
422             adding dominated vectors into the set.
423              
424             =back
425              
426             =head1 AUTHOR
427              
428             Przemyslaw Wesolek, C<< >>
429              
430             =head1 BUGS
431              
432             Please report any bugs or feature requests to C, or through
433             the web interface at L. I will be notified, and then you'll
434             automatically be notified of progress on your bug as I make changes.
435              
436             =head1 SUPPORT
437              
438             You can find documentation for this module with the perldoc command.
439              
440             perldoc Data::Pareto
441              
442              
443             You can also look for information at:
444              
445             =over 4
446              
447             =item * RT: CPAN's request tracker
448              
449             L
450              
451             =item * AnnoCPAN: Annotated CPAN documentation
452              
453             L
454              
455             =item * CPAN Ratings
456              
457             L
458              
459             =item * Search CPAN
460              
461             L
462              
463             =back
464              
465             =head1 COPYRIGHT & LICENSE
466              
467              
468             Copyright 2009 Przemyslaw Wesolek
469              
470             This program is free software; you can redistribute it and/or modify it
471             under the terms of the Artistic License 2.0. For details, see the full
472             text of the license in the file LICENSE.
473              
474             =cut
475              
476             1; # End of Data::Pareto