File Coverage

blib/lib/Statistics/Data/Dichotomize.pm
Criterion Covered Total %
statement 149 151 98.6
branch 80 102 78.4
condition 9 17 52.9
subroutine 14 14 100.0
pod 7 7 100.0
total 259 291 89.0


line stmt bran cond sub pod time code
1             package Statistics::Data::Dichotomize;
2 7     7   1033940 use strict;
  7         11  
  7         272  
3 7     7   35 use warnings FATAL => 'all';
  7         12  
  7         349  
4 7     7   30 use base qw(Statistics::Data);
  7         11  
  7         7401  
5 7     7   164314 use Carp qw(croak);
  7         13  
  7         454  
6 7     7   39 use Number::Misc qw(is_numeric);
  7         11  
  7         445  
7 7     7   4746 use Statistics::Lite qw(mean median mode);
  7         10437  
  7         12626  
8            
9             $Statistics::Data::Dichotomize::VERSION = '0.05';
10            
11             =head1 NAME
12            
13             Statistics::Data::Dichotomize - Dichotomize one or more numerical or categorical sequences into a single two-valued sequence
14            
15             =head1 VERSION
16            
17             This is documentation for B of Statistics-Data-Dichotomize.
18            
19             =head1 SYNOPSIS
20            
21             use Statistics::Data::Dichotomize 0.05;
22             my $ddat = Statistics::Data::Dichotomize->new();
23             my $aref;
24            
25             $ddat->load(23, 24, 7, 55); # numerical data
26             $aref = $ddat->cut(value => 'median'); # - or by precise value or function
27             $aref = $ddat->swing(); # by successive rises and falls of value
28             $aref = $ddat->shrink(rule => sub { return $_->[0] >= 20 ? : 1 : 0 }, winlen => 1); # like "cut" if winlen only 1
29             $aref = $ddat->binate(oneis => 7); # returns (0, 0, 1, 0)
30            
31             # - alternatively, call any method giving data directly, without prior load():
32             $aref = $ddat->cut(data => [23, 24, 7, 55], value => 20);
33             $aref = $ddat->pool(data => [$aref1, $aref2]);
34            
35             # or by a multi-sequence load: - by named arefs:
36             $ddat->load(foodat =>[qw/c b c a a/], bardat => [qw/b b b c a/]); # arbitrary names
37             $aref = $ddat->binate(data => 'foodat', oneis => 'c',); # returns (1, 0, 1, 0, 0)
38            
39             # - or by anonymous arefs:
40             $ddat->load([qw/c b c a a/], [qw/b b b c a/]); # categorical (stringy) data
41             $aref = $ddat->match(); # returns [0, 1, 0, 0, 1]
42            
43             =head1 DESCRIPTION
44            
45             A module for transforming one or more sequences of numerical or categorical data (array/s of numbers or strings) into a single binary-valued sequence.
46            
47             Several methods, more or less applicable to numerical and categorical sequences of data, are implemented. These have been (to date) largely derived from the statistical study of sequential effects (as in Swed & Eisenhart, 1943; Wolfowitz, 1943), particularly as applied within the behavioural sciences (as in Siegal, 1956), including parapsychology (as in Burdick & Kelly, 1977). They are particularly relevant for statistical description and analysis of data by the L modules.
48            
49             Each method returns a binary-valued sequence as a reference to an array of 1s and 0s -- by default. However, most methods support the argument B that controls the binary values of which to construct the dichotomous sequence; otherwise, the binary values are intrinsically user-controlled. Where applicable, this argument should key a 2-element array, where the first element (index = 0) replaces what would, by default, be returned as 0, and the second element (index = 1) replaces what would, by default, be returned as 1. So the dichotomous sequence might be comprised of, say, the values -1 and 1, "s" and "f" (success and failure), or "female" and "male", etc., rather than 1s and 0s.
50            
51             There are methods to dichotomise data for:
52            
53             =over 4
54            
55             =item 1. I
56            
57             that can be either (a) dichotomized ("L") about a specified or function-returned value, or a central statistic (mean, median or mode), or (b) dichtomotized according to successive rises and falls in value ("L");
58            
59             =item 2. I
60            
61             which can be collapsed ("Led") into a single dichotomous sequence according to the rank order of their values;
62            
63             =item 3. I
64            
65             where one value is set to equal 1 and all others equal 0 ("L");
66            
67             =item 4. I
68            
69             which can be collapsed into a single dichotomous sequence according to their pairwise "L"; and
70            
71             =item 5. a I
72            
73             which can be dichotomized according to whether or not independent slices of the data meet a specified Boolean rule ("L").
74            
75             =back
76            
77             All arguments are given as an anonymous hash of key => value pairs, or as a reference to such a hash (not shown in examples).
78            
79             =head1 SUBROUTINES/METHODS
80            
81             =head2 new
82            
83             Returns the class object for this module, inheriting all the methods of L, which it uses as a L.
84            
85             =head2 load, add, access, unload
86            
87             Methods for loading, updating and retrieving data are inherited from L. See that manpage for details of these and other inherently supported methods.
88            
89             =cut
90            
91             =head2 Numerical data: Single sequence dichotomization
92            
93             =head3 cut
94            
95             ($aref, $val) = $ddat->cut(data => \@data, value => \&Statistics::Lite::median); # cut the given data at is median, getting back median too
96             $aref = $ddat->cut(value => 'median', equal => 'gt'); # cut the last previously loaded data at its median
97             $aref = $ddat->cut(value => 23); # cut anonymously cached data at a specific value
98             $aref = $ddat->cut(value => 'mean', data => 'blues'); # cut named data (previously loaded as such) at its mean (or whatever)
99             $aref = $ddat->cut(value => CODE); # cut by a user-defined function returning a data-descriptive value
100             $aref = $ddat->cut(value => 23, set => [-1, 1]); # cut as above, but not into 0s and 1s, but -1s and 1s
101            
102             Returns a reference to an array of dichotomously transformed values of a given array of numbers by categorizing its values as to whether they're numerically higher or lower than a particular value, e.g., their median, mean, mode or some given number, or some other function that returns a single value. Called in list context, returns a reference to the transformed values, and then the cut-value itself.
103            
104             So the following data, when cut over values greater than or equal to 5, yield the binary-valued sequence:
105            
106             @orig_data = (4, 3, 3, 5, 3, 4, 5, 6, 3, 5, 3, 3, 6, 4, 4, 7, 6, 4, 7, 3);
107             @cut_data = (0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0);
108            
109             The order of the original values is reflected in the returned "cut data", but their order is not taken into account in making up the dichotomy - in contrast to the L method.
110            
111             I, as follow, specify what value or measure to cut by (default is the median), and how to handle ties with the cut-value (default is to skip them).
112            
113             =over 4
114            
115             =item value => 'mean|median|mode' - or a specific numerical value, or code reference
116            
117             Specifies the value at which the data will be cut. This could be the mean, median or mode (as calculated by L), or a numerical value within the range of the data, or some appropriate subroutine -- one that takes an array (not a reference to one) and returns a single value (presumably a descriptive of the values in the array). The default is the I. The cut-value, as specified by B, can be retrieved as the second element returned if calling for an array.
118            
119             =item equal => 'I|I|I|I'
120            
121             Specifies how to cut the data if the cut-value (as specified by B) is present in the data. The logic applied takes on the following conventions:
122            
123             =over 8
124            
125             =item B 'gt'> [default]
126            
127             All values I the cut-value take on one code (by default, 1), and all values I the cut-value take on another (by default, 0). This is (by convention) the default operation, preventing a given sequence that is fully composed of the cut-value returning an empty-list. So, e.g., given the data (5, 5, 5), and specifying that the cut-value is 5, the list (1, 1, 1) is returned, just as if the given data were, say, (8, 5, 212).
128            
129             =item B 'lt'>
130            
131             All values I the cut-value take on one code (by default, 0), and all higher values take on another code (by default, 1). For the prior example, the given data (5, 5, 5) now becomes (0, 0, 0).
132            
133             =item B 'rpt'>
134            
135             The dichotomous sequence takes on the value that was taken in the immediately prior "cut". So now, given the data (5, 5, 5), the list (1, 1, 1) would be returned--as the first value is given the value of 1 (following the default operation to treat values greater than or equal to the cut-value as 1), and all subsequent values take on the same value. But if the given data were (4, 5, 6, 5), or (-400, 5, 600, 5), the returned list is (0, 0, 1, 1). This value/operation was introduced in Version 0.05.
136            
137             =item B 'skip'|0>
138            
139             Values equal to the cut-value are skipped. So if the cut-value appears as the first value, it is simply skipped (it takes on no value), and an empty list is returned.
140            
141             =back
142            
143             Note that the operational logic here is different, in its default operation, from that following the same argument in the C() method. There, logically, and by convention, the default = 0, i.e., to skip neigbouring values with zero difference. The default operation for equality described here, for the C() method, perhaps should match the latter (and might well, following usage, feedback) for sake of consistency, but it seems most appropriate for now (as since Version 0.00) to make the default operation within the C() method follow convention/expectation, i.e., by its own logic, rather than to exact cross-method consistency for its own sake. In practice, it is advisable to compare results for a test based on the dichotomous sequence from different criteria for equality. If all results of a test are equal, there is no problem; otherwise, the average of the results from different methods can be taken (see Siegal, 1956, pp. 143-144, in discussion of "ties" in dichotomizing data for the two-sample L.)
144            
145             =item set
146            
147             The optional argument B, keying a two-element array, controls the binary-values to return; instead of the default set of 0s and 1s, the set might be, say, -1s and 1s, or "male" and "female". The first (index = 0) element in the set array replaces what, by default, would be returned as 0, and the second (index = 1) element in the set array replaces what, by default, would be returned as 1.
148            
149             =back
150            
151             =cut
152            
153             sub cut {
154 16     16 1 10819 my ( $self, @args ) = @_;
155 16 50       57 my $args = ref $args[0] ? $args[0] : {@args};
156 16 100       56 my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
157 16 50       272 croak __PACKAGE__,
158             '::cut All data must be numeric for dichotomizing about a cut-value'
159             if !$self->all_numeric($dat);
160 16 50       1872 $args->{'value'} = 'median' if !defined $args->{'value'};
161            
162             #$args->{'equal'} = 0 if !defined $args->{'equal'}; #- no default??
163 16 100       38 $args->{'equal'} = 'gt' if !defined $args->{'equal'};
164 16         21 my ( $val, @seqs ) = ();
165            
166             # Get a cut-value:
167 16 100       34 if ( !is_numeric( $args->{'value'} ) ) {
168 3         40 my $code = \&{ delete $args->{'value'} };
  3         8  
169 3         4 $val = $code->( @{$dat} );
  3         7  
170             }
171             else {
172 13         160 $val = $args->{'value'};
173             }
174            
175             # Categorize by number of observations above, below or equal to the cut_value:
176             push @seqs,
177             $_ > $val ? 1
178             : $_ < $val ? 0
179             : $args->{'equal'} eq 'gt' ? 1
180             : $args->{'equal'} eq 'lt' ? 0
181             : $args->{'equal'} eq 'rpt' ? ( exists $seqs[-1] ? $seqs[-1] : 1 )
182 16 100       134 : next foreach @{$dat};
  16 100       295  
    100          
    100          
    100          
    100          
183 16         42 _set( \@seqs, $args->{'set'} );
184 16 100       71 return wantarray ? ( \@seqs, $val ) : \@seqs;
185             }
186            
187             =head3 swing
188            
189             $aref = $ddat->swing(data => [3, 4, 7, 6, 5, 1, 2, 3, 2]); # "swing" these data
190             $aref = $ddat->swing(label => 'reds'); # name a pre-loaded dataset for "swinging"
191             $aref = $ddat->swing(); # use the last-loaded dataset
192             $aref = $ddat->swing(set => [qw/male female/]); # for any of the above, optionally specify the dichotomous values
193            
194             Returns a reference to an array of dichotomously transformed values of a single sequence of numerical values according to their consecutive rises and falls. Each value is subtracted from its successor, and the result is replaced with a 1 if the difference represents an increase, or 0 if it represents a decrease. For example (from Wolfowitz, 1943, p. 283), the following numerical sequence produces the subsequent dichotomous sequence.
195            
196             @values = (qw/3 4 7 6 5 1 2 3 2/);
197             @dichot = (qw/1 1 0 0 0 1 1 0/);
198            
199             Dichotomously, the data commence with an ascending run of length 2 (from 3 to 4, and from 4 to 7), followed by a descending run of length 3 (from 7 to 6, 6 to 5, and 5 to 1), followed by an ascent of length 2 (from 1 to 2, from 2 to 3), and so on. The number of resulting dichotomous observations is 1 less than the original sample-size (elements in the given array).
200            
201             I are as follow.
202            
203             =over 4
204            
205             =item equal => 'I|I|I|I'
206            
207             The default result when the difference between two successive values is zero is to skip the observation, and move onto the next succession. Alternatively, specify B 'rpt'> to repeat the result for the previous succession; skipping only a difference of zero should it occur as the first result. Or, a difference greater than or equal to zero is counted as an increase (B 'gt'>), or a difference less than or equal to zero is counted as a decrease. For example,
208            
209             @values = (qw/3 3 7 6 5 2 2/);
210             @dicho_skip = (qw/1 0 0 0/); # First and final results (of 3 - 3, and 2 - 2) are skipped
211             @dicho_rpt = (qw/1 0 0 0 0/); # First result (of 3 - 3) is skipped, and final result repeats the former
212             @dicho_gt = (qw/1 1 0 0 0 1/); # Greater than or equal to zero is an increase
213             @dicho_lt = (qw/0 1 0 0 0 0/); # Less than or equal to zero is a decrease
214            
215             See description of the same argument in the L for more details (but for which the default value is 'gt').
216            
217             =item set
218            
219             The optional argument B, keying a two-element array, controls the binary-values to return; instead of the default set of 0s and 1s, the set might be, say, -1s and 1s, or "male" and "female". The first (zero-indexed) element in the set array replaces what, by default, would be returned as 0, and the second (index = 1) element in the set array replaces what, by default, would be returned as 1.
220            
221             =back
222            
223             =cut
224            
225             sub swing {
226 8     8 1 6352 my ( $self, @args ) = @_;
227 8 50       26 my $args = ref $args[0] ? $args[0] : {@args};
228 8 100       21 my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
229 8 50       48 croak __PACKAGE__, '::swing All data must be numeric for dichotomizing'
230             if !$self->all_numeric($dat);
231 8 100       741 $args->{'equal'} = 0 if !defined $args->{'equal'}; #- no default??
232 8         9 my ( $i, $res, @seqs ) = ();
233            
234             # Replace observations with the succession of rises and falls:
235 8         9 for ( $i = 0 ; $i < ( scalar @{$dat} - 1 ) ; $i++ ) {
  95         122  
236 87         66 $res = $dat->[ ( $i + 1 ) ] - $dat->[$i];
237 87 100       98 if ( $res > 0 ) {
    100          
238 35         29 push @seqs, 1;
239             }
240             elsif ( $res < 0 ) {
241 44         35 push @seqs, 0;
242             }
243             else {
244 8         9 for ( $args->{'equal'} ) {
245 8 100       24 if (/^rpt/xsm) {
    100          
    100          
246 2 100       7 push @seqs, $seqs[-1] if scalar @seqs;
247             }
248             elsif (/^gt/xsm) {
249 2         5 push @seqs, 1;
250             }
251             elsif (/^lt/xsm) {
252 2         5 push @seqs, 0;
253             }
254             else {
255 2         3 next;
256             }
257             }
258             }
259             }
260 8         19 _set( \@seqs, $args->{'set'} );
261 8         24 return \@seqs;
262             }
263            
264             =head2 Numerical data: Two sequence dichotomization
265            
266             See also the methods for categorical data where it is ok to ignore any order and intervals in numerical data.
267            
268             =head3 pool
269            
270             $aref = $ddat->pool(data => [$aref1, $aref2]); # give data directly to function
271             $aref = $ddat->pool(data => [$ddat->access(index => 0), $ddat->access(index => 1)]); # after $ddat->load(\@aref1, $aref2);
272             $aref = $ddat->pool(data => [$ddat->access(label => '1'), $ddat->access(label => '2')]); # after $ddat->load(1 => $aref1, 2 => $aref2);
273             $aref = $ddat->pool(data => [$aref1, $aref2], set => [-1, 1]); # for any of the above, optionally specify the binary set
274            
275             Returns a reference to an array of dichotomously transformed values of two sequences of I data as a ranked pool, i.e., by pooling the data from each sequence according to the magnitude of their values at each trial, from lowest to heighest. Specifically, the values from both sequences are pooled and ordered from lowest to highest, and then dichotomized into runs according to the sequence from which neighbouring values come from. Another run occurs wherever there is a change in the source of the values. A non-random effect of, say, higher or lower values consistently coming from one sequence rather than another would be reflected in fewer runs than expected by chance.
276            
277             This is typically used for a Wald-Walfowitz test of difference between two samples -- ranking by median; as per Siegal (1956), and Swed and Eisenhart (1943).
278            
279             The I B, keying a two-element array, controls the binary-values to return; instead of the default set of 0s and 1s, the set might be, say, -1s and 1s, or "male" and "female". The first (zero-indexed) element in the set array replaces what, by default, would be returned as 0, and the second (index = 1) element in the set array replaces what, by default, would be returned as 1.
280            
281             =cut
282            
283             sub pool {
284 3     3 1 5382 my ( $self, @args ) = @_;
285 3 50       12 my $args = ref $args[0] ? $args[0] : {@args};
286 3 50       10 my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
287 3         5 $self->all_numeric($_) foreach @{$dat};
  3         20  
288 3         361 my ( $dat1, $dat2 ) = @{$dat};
  3         5  
289 3         4 my $sum = scalar @{$dat1} + scalar @{$dat2};
  3         4  
  3         4  
290             my @dat =
291 3         4 ( [ sort { $a <=> $b } @{$dat1} ], [ sort { $a <=> $b } @{$dat2} ] );
  45         42  
  3         14  
  51         38  
  3         6  
292            
293 3         6 my ( $i, $x, $y, @seqs ) = (0);
294 3         9 while ( scalar(@seqs) < $sum ) {
295 48         35 $x = $dat[0]->[0];
296 48         26 $y = $dat[1]->[0];
297 48 100 66     135 $i = defined $x && defined $y ? $x < $y ? 0 : 1 : defined $x ? 0 : 1;
    50          
    100          
298 48         25 shift @{ $dat[$i] };
  48         39  
299 48         66 push @seqs, $i;
300             }
301 3         8 _set( \@seqs, $args->{'set'} );
302 3         11 return \@seqs;
303             }
304             ## DEV: consider: List::AllUtils::pairwise:
305             # @x = pairwise { $a + $b } @a, @b; # returns index-by-index sums
306            
307             =head2 Categorical data: Single sequence dichotomization
308            
309             =head3 binate
310            
311             $aref = $ddat->binate(oneis => 'E'); # optionally specify a state in the sequence to be set as "1"
312             $aref = $ddat->binate(oneis => 'E', set => [qw/a b/]); # optionally specify that Es be transformed to 'b', other events as 'a'
313             $aref = $ddat->binate(data => \@ari, oneis => 'E'); # same but using pre-loaded data
314            
315             Returns a reference to an array of dichotomously transformed values of an array by setting the first element in the list to 1 (by default, or whatever is specified as B) on all its occurrences in the array, and all other values in the array as zero.
316            
317             The I B, keying a referenced array, specifies that, in fact, the first element (or what might be specified as B) should be transformed into what is given as the index 1 element in this array, and that all other elements should be transformed into what is given as its index 0 element.
318            
319             =cut
320            
321             sub binate {
322 3     3 1 5162 my ( $self, @args ) = @_;
323 3 50       14 my $args = ref $args[0] ? $args[0] : {@args};
324 3 50       20 my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
325             my $oneis =
326             defined $args->{'oneis'}
327 3 100       87 ? delete $args->{'oneis'}
328             : $dat->[0]; # What value set to 1 and others to zero?
329 3 100       4 my @seqs = map { $_ eq $oneis ? 1 : 0 } @{$dat};
  15         30  
  3         6  
330             ; # replace observations with 1s and 0s
331 3         10 _set( \@seqs, $args->{'set'} );
332 3         11 return \@seqs;
333             }
334            
335             =head2 Categorical data: Two-sequence dichotomization
336            
337             =head3 match
338            
339             $aref = $ddat->match(data => [\@aref1, \@aref2], lag => signed integer, loop => 0|1); # with optional crosslag of the two sequences
340             $aref = $ddat->match(data => [$ddat->access(index => 0), $ddat->access(index => 1)]); # after $ddat->load(\@aref1, \@aref2);
341             $aref = $ddat->match(data => [$ddat->access(label => '1'), $ddat->access(label => '2')]); # after $ddat->load(1 => \@aref1, 2 => \@aref2);
342            
343             Returns a reference to an array of dichotomously transformed values of two paired arrays according to the match between the elements at each of their indices. Where the data-values are equal at a certain index, they are represented with a 1; otherwise a 0 (by default, but see the B argument). Numerical or stringy data can be equated. For example, the following two arrays would be reduced to the third, where a 1 indicates a match (i.e., the values are "indexically equal").
344            
345             @foo_dat = (qw/1 3 3 2 1 5 1 2 4/);
346             @bar_dat = (qw/4 3 1 2 1 4 2 2 4/);
347             @bin_dat = (qw/0 1 0 1 1 0 0 1 1/);
348            
349             I are as follow.
350            
351             =over 4
352            
353             =item lag => I (where I < number of observations I I > -1 (number of observations) )
354            
355             Match the two data-sets by shifting the first named set ahead or behind the other data-set by B observations. The default is zero. For example, one data-set might be targets, and another responses to the targets:
356            
357             targets = cbbbdacdbd
358             responses = daadbadcce
359            
360             Matched as a single sequence of hits (1) and misses (0) where B = B<0> yields (for the match on "a" in the 6th index of both arrays):
361            
362             0000010000
363            
364             With B => 1, however, each response is associated with the target one ahead of the trial for which it was observed; i.e., each target is shifted to its +1 index. So the first element in the above responses (I) would be associated with the second element of the targets (I), and so on. Now, matching the two data-sets with a B<+1> lag gives two hits, of the 4th and 7th elements of the responses to the 5th and 8th elements of the targets, respectively:
365            
366             000100100
367            
368             making 5 runs. With B => 0, there are 3 runs. Lag values can be negative, so that B => -2 will give:
369            
370             00101010
371            
372             Here, responses necessarily start at the third element (I), the first hits occurring when the fifth response-element corresponds to the the third target element (I). The last response (I) could not be used, and the number of elements in the hit/miss sequence became n-B less the original target sequence. This means that the maximum value of lag must be one less the size of the data-sets, or there will be no data.
373            
374             =item loop => 0|1
375            
376             Implements circularized lagging if B => 1, where all lagged data are preserved by looping any excess to the start or end of the criterion data. The number of observations will then always be the same, regardless of the lag; i.e., the size of the returned array is the same as that of the given data. For example, matching the data in the example above with a lag of +1, with looping, creates an additional match between the final response and the first target (I); i.e., the last element in the "response" array is matched to the first element of the "target" array:
377            
378             1000100100
379            
380             =item set
381            
382             The optional argument B, keying a two-element array, controls the binary-values to return; instead of the default set of 0s and 1s, the set might be, say, -1s and 1s, or "male" and "female". The first (zero-indexed) element in the set array replaces what, by default, would be returned as 0, and the second (index = 1) element in the set array replaces what, by default, would be returned as 1.
383            
384             =back
385            
386             =cut
387            
388             sub match {
389 7     7 1 12024 my ( $self, @args ) = @_;
390 7 50       39 my $args = ref $args[0] ? $args[0] : {@args};
391 7 50       24 my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
392             $dat = $self->crosslag(
393             lag => $args->{'lag'},
394             data => [ $dat->[0], $dat->[1] ],
395             loop => $args->{'loop'}
396 7 100       38 ) if $args->{'lag'};
397             my $lim =
398 7         14 scalar @{ $dat->[0] } <= scalar @{ $dat->[1] }
  7         18  
399 7         12 ? scalar @{ $dat->[0] }
400 7 50       13 : scalar @{ $dat->[1] }; # ensure criterion data-set is smallest
  0         0  
401 7         13 my (@seqs) = ();
402 7         21 for my $i ( 0 .. $lim ) {
403 72 100 66     201 next if !defined $dat->[0]->[$i] || !defined $dat->[1]->[$i];
404 65 100       130 $seqs[$i] = $dat->[0]->[$i] eq $dat->[1]->[$i] ? 1 : 0;
405             }
406 7         33 _set( \@seqs, $args->{'set'} );
407 7         36 return \@seqs;
408             }
409            
410             =head2 Numerical or categorical data: Single sequence dichotimisation
411            
412             =head3 shrink
413            
414             $aref = $ddat->shrink(winlen => INT, rule => CODE)
415            
416             Returns a reference to an array of dichotomously transformed values of a numerical or categorical sequence by taking I slices, or windows, as given in the argument B, and making a true/false sequence out of them according to whether or not each slice passes a B. The B is a code reference that gets the data as a reference to an array, and so might be something like this:
417            
418             sub { return Statistics::Lite::mean(@{$_}) > 2 ? 1 : 0; }
419            
420             If B is set to 3, this means-wise rule would make the following numerical sequence of 9 elements shrink into the following dichotomous sequence of 3 elements:
421            
422             @data = (1, 2, 3, 3, 3, 3, 4, 2, 1);
423             @means = (2, 3, 2.5 );
424             @dico = (0, 1, 1 );
425            
426             For categorical data, a completely "stringy" rule might be specified in the following ways. If B => 1, and the given data are (A, B, c, d), then the rule
427            
428             sub { my $aref = shift; $aref->[0] =~ /[A-Z]/ ? 1 : 0; }
429            
430             would yield the sequence be (1, 1, 0, 0) -- because the elements A and B satisfy the regular expression (being within the set {A .. Z}), while the remainder (elements c and e) do not.
431            
432             Yet if B => 2 for the same given data, the same case-wise rule might be specified as
433            
434             sub { my $aref = shift; my $str = join q{}, @{$aref}; $str =~ /[A-Z]{2,}/ ? 1 : 0; }
435            
436             and the returned sequence is (1, 0), given that (again) the first two elements (A, B) satisfy the rule (returning 1), and the second pair of elements (c, e) do not (returning 0).
437            
438             The B must, of course, return dichotomous values to dichotomize the data, and B should make up equally sized segments (no error is thrown if this isn't the case, the remainder just gets figured in the same way).
439            
440             Unlike other methods, this method does not respect a B argument -- because the given transformation rule has control of what the set is (1s and 0s, or 1s and -1s, etc.).
441            
442             =cut
443            
444             sub shrink {
445 3     3 1 4904 my ( $self, @args ) = @_;
446 3 50       12 my $args = ref $args[0] ? $args[0] : {@args};
447 3 100       13 my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
448 3         28 my $lim = scalar @{$dat};
  3         5  
449 3         3 my $len = int $args->{'winlen'};
450 3   50     7 $len ||= 1;
451 3         5 my $code = delete $args->{'rule'};
452 3 50 33     17 croak __PACKAGE__, '::shrink Need a code to Boolean shrink'
453             if not $code
454             or ref $code ne 'CODE';
455 3         3 my ( $i, @seqs );
456            
457 3         7 for ( $i = 0 ; $i < $lim ; $i += $len )
458             { # C-style for clear greater-than 1 increments per loop
459 9         82 push @seqs, $code->( [ @{$dat}[ $i .. ( $i + $len - 1 ) ] ] );
  9         19  
460             }
461 3         31 return \@seqs;
462             }
463             *boolwin = \&shrink;
464            
465             =head2 Utilities
466            
467             =head3 crosslag
468            
469             @lagged_arefs = $ddat->crosslag(data => [\@ari1, \@ari2], lag => signed integer, loop => 0|1);
470             $aref_of_arefs = $ddat->crosslag(data => [\@ari1, \@ari2], lag => signed integer, loop => 0|1); # same but not "wanting array"
471            
472             Takes two arrays and returns them cross-lagged against each other, shifting and popping values according to the number of "lags". Typically used when wanting to L the two arrays against each other.
473            
474             =over 4
475            
476             =item lag => signed integer up to the number of elements
477            
478             Takes the first array sent as "data" as the reference or "target" array for the second "response" array to be shifted so many lags before or behind it. With no looping of the lags, this means the returned arrays are "lag"-elements smaller than the original arrays. For example, with lag => +1 (and loop => 0, the default), and with data => [ [qw/c p w p s/], [qw/p s s w r/] ],
479            
480             (c p w p s) becomes (p w p s)
481             (p s s w r) becomes (p s s w)
482            
483             So, whereas the original data gave no matches across the two arrays, now, with the second of the two arrays shifted forward by one index, it has a match (of "p") at the first index with the first of the two arrays.
484            
485             =item loop => 0|1
486            
487             For circularized lagging, B => 1, and the size of the returned array is the same as those for the given data. For example, with a lag of +1, the last element in the "response" array is matched to the first element of the "target" array:
488            
489             (c p w p s) becomes (p w p s c) (looped with +1)
490             (p s s w r) becomes (p s s w r) (no effect)
491            
492             In this case, it might be more efficient to simply autolag the "target" sequence against itself.
493            
494             =back
495            
496             =cut
497            
498             sub crosslag {
499 6     6 1 2158 my ( $self, @args ) = @_;
500 6 50       28 my $args = ref $args[0] ? $args[0] : {@args};
501 6         11 my $lag = $args->{'lag'};
502 6         10 my $dat1 = $args->{'data'}->[0];
503 6         9 my $dat2 = $args->{'data'}->[1];
504 6         10 my $loop = $args->{'loop'};
505             return ( wantarray ? ( $dat1, $dat2 ) : [ $dat1, $dat2 ] )
506             if not $lag
507 6 0 33     26 or abs $lag >= scalar @{$dat1};
  6 50       26  
508            
509 6         8 my @dat1_lagged = @{$dat1};
  6         23  
510 6         8 my @dat2_lagged = @{$dat2};
  6         19  
511            
512 6 100       16 if ( $lag > 0 ) {
    50          
513 5         17 foreach ( 1 .. abs $lag ) {
514 5 100       12 if ($loop) {
515 3         45 unshift @dat1_lagged, pop @dat1_lagged;
516             }
517             else {
518 2         4 shift @dat1_lagged;
519 2         6 pop @dat2_lagged;
520             }
521             }
522             }
523             elsif ( $lag < 0 ) {
524 1         5 foreach ( 1 .. abs $lag ) {
525 2 50       4 if ($loop) {
526 0         0 push @dat1_lagged, shift @dat1_lagged;
527             }
528             else {
529 2         3 pop @dat1_lagged;
530 2         5 shift @dat2_lagged;
531             }
532             }
533             }
534             return wantarray
535 6 50       32 ? ( \@dat1_lagged, \@dat2_lagged )
536             : [ \@dat1_lagged, \@dat2_lagged ];
537             }
538            
539             sub _set {
540 37     37   64 my ( $aref, $set ) = @_;
541 37 100 66     113 return if not ref $set or scalar @{$set} != 2;
  5         26  
542 5         12 for my $i ( 0 .. scalar @{$aref} - 1 ) {
  5         23  
543 59 100       75 if ( $aref->[$i] == 0 ) {
544 34         44 $aref->[$i] = $set->[0];
545             }
546             else {
547 25         28 $aref->[$i] = $set->[1];
548             }
549             }
550 5         12 return;
551             }
552            
553             =head1 AUTHOR
554            
555             Roderick Garton, C<< >>
556            
557             =head1 REFERENCES
558            
559             B (1977). Statistical methods in parapsychological research. In B. B. Wolman (Ed.), I (pp. 81-130). New York, NY, US: Van Nostrand Reinhold. L [Describes the L method of windowed Boolean dichotomization.]
560            
561             B (1956). I. New York, NY, US: McGraw-Hill. L [Re dichotomization for the two-sample L.]
562            
563             B (1943). Tables for testing randomness of grouping in a sequence of alternatives. I, I<14>, 66-87. doi: L<10.1214/aoms/1177731494|http://dx.doi.org/10.1214/aoms/1177731494> [Describes the L method and test example.]
564            
565             B (1943). On the theory of runs with some applications to quality control. I, I<14>, 280-288. doi: L<10.1214/aoms/1177731421|http://dx.doi.org/10.1214/aoms/1177731421> [Describes the L method ("runs up and down") and test example.]
566            
567             =head1 BUGS
568            
569             Please report any bugs or feature requests to C, or through
570             the web interface at L. I will be notified, and then you'll
571             automatically be notified of progress on your bug as I make changes.
572            
573             =head1 SUPPORT
574            
575             You can find documentation for this module with the perldoc command.
576            
577             perldoc Statistics::Data::Dichotomize
578            
579             You can also look for information at:
580            
581             =over 4
582            
583             =item * RT: CPAN's request tracker (report bugs here)
584            
585             L
586            
587             =item * AnnoCPAN: Annotated CPAN documentation
588            
589             L
590            
591             =item * CPAN Ratings
592            
593             L
594            
595             =item * Search CPAN
596            
597             L
598            
599             =back
600            
601             =head1 LICENSE AND COPYRIGHT
602            
603             =over 4
604            
605             =item Copyright (c) 2012-2016 Roderick Garton
606            
607             This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see L).
608            
609             =item Disclaimer
610            
611             To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.
612            
613             =back
614            
615             =cut
616            
617             1; # End of Statistics::Data::Dichotomize