File Coverage

blib/lib/Statistics/Data/Dichotomize.pm

Criterion	Covered	Total	%
statement	149	151	98.6
branch	80	102	78.4
condition	9	17	52.9
subroutine	14	14	100.0
pod	7	7	100.0
total	259	291	89.0

line	stmt	bran	cond	sub	pod	time	code
1							package Statistics::Data::Dichotomize;
2	7			7		1033940	use strict;
	7					11
	7					272
3	7			7		35	use warnings FATAL => 'all';
	7					12
	7					349
4	7			7		30	use base qw(Statistics::Data);
	7					11
	7					7401
5	7			7		164314	use Carp qw(croak);
	7					13
	7					454
6	7			7		39	use Number::Misc qw(is_numeric);
	7					11
	7					445
7	7			7		4746	use Statistics::Lite qw(mean median mode);
	7					10437
	7					12626
8
9							$Statistics::Data::Dichotomize::VERSION = '0.05';
10
11							=head1 NAME
12
13							Statistics::Data::Dichotomize - Dichotomize one or more numerical or categorical sequences into a single two-valued sequence
14
15							=head1 VERSION
16
17							This is documentation for B of Statistics-Data-Dichotomize.
18
19							=head1 SYNOPSIS
20
21							use Statistics::Data::Dichotomize 0.05;
22							my $ddat = Statistics::Data::Dichotomize->new();
23							my $aref;
24
25							$ddat->load(23, 24, 7, 55); # numerical data
26							$aref = $ddat->cut(value => 'median'); # - or by precise value or function
27							$aref = $ddat->swing(); # by successive rises and falls of value
28							$aref = $ddat->shrink(rule => sub { return $_->[0] >= 20 ? : 1 : 0 }, winlen => 1); # like "cut" if winlen only 1
29							$aref = $ddat->binate(oneis => 7); # returns (0, 0, 1, 0)
30
31							# - alternatively, call any method giving data directly, without prior load():
32							$aref = $ddat->cut(data => [23, 24, 7, 55], value => 20);
33							$aref = $ddat->pool(data => [$aref1, $aref2]);
34
35							# or by a multi-sequence load: - by named arefs:
36							$ddat->load(foodat =>[qw/c b c a a/], bardat => [qw/b b b c a/]); # arbitrary names
37							$aref = $ddat->binate(data => 'foodat', oneis => 'c',); # returns (1, 0, 1, 0, 0)
38
39							# - or by anonymous arefs:
40							$ddat->load([qw/c b c a a/], [qw/b b b c a/]); # categorical (stringy) data
41							$aref = $ddat->match(); # returns [0, 1, 0, 0, 1]
42
43							=head1 DESCRIPTION
44
45							A module for transforming one or more sequences of numerical or categorical data (array/s of numbers or strings) into a single binary-valued sequence.
46
47							Several methods, more or less applicable to numerical and categorical sequences of data, are implemented. These have been (to date) largely derived from the statistical study of sequential effects (as in Swed & Eisenhart, 1943; Wolfowitz, 1943), particularly as applied within the behavioural sciences (as in Siegal, 1956), including parapsychology (as in Burdick & Kelly, 1977). They are particularly relevant for statistical description and analysis of data by the L modules.
48
49							Each method returns a binary-valued sequence as a reference to an array of 1s and 0s -- by default. However, most methods support the argument B that controls the binary values of which to construct the dichotomous sequence; otherwise, the binary values are intrinsically user-controlled. Where applicable, this argument should key a 2-element array, where the first element (index = 0) replaces what would, by default, be returned as 0, and the second element (index = 1) replaces what would, by default, be returned as 1. So the dichotomous sequence might be comprised of, say, the values -1 and 1, "s" and "f" (success and failure), or "female" and "male", etc., rather than 1s and 0s.
50
51							There are methods to dichotomise data for:
52
53							=over 4
54
55							=item 1. I
56
57							that can be either (a) dichotomized ("L") about a specified or function-returned value, or a central statistic (mean, median or mode), or (b) dichtomotized according to successive rises and falls in value ("L");
58
59							=item 2. I
60
61							which can be collapsed ("Led") into a single dichotomous sequence according to the rank order of their values;
62
63							=item 3. I
64
65							where one value is set to equal 1 and all others equal 0 ("L");
66
67							=item 4. I
68
69							which can be collapsed into a single dichotomous sequence according to their pairwise "L"; and
70
71							=item 5. a I
72
73							which can be dichotomized according to whether or not independent slices of the data meet a specified Boolean rule ("L").
74
75							=back
76
77							All arguments are given as an anonymous hash of key => value pairs, or as a reference to such a hash (not shown in examples).
78
79							=head1 SUBROUTINES/METHODS
80
81							=head2 new
82
83							Returns the class object for this module, inheriting all the methods of L, which it uses as a L.
84
85							=head2 load, add, access, unload
86
87							Methods for loading, updating and retrieving data are inherited from L. See that manpage for details of these and other inherently supported methods.
88
89							=cut
90
91							=head2 Numerical data: Single sequence dichotomization
92
93							=head3 cut
94
95							($aref, $val) = $ddat->cut(data => \@data, value => \&Statistics::Lite::median); # cut the given data at is median, getting back median too
96							$aref = $ddat->cut(value => 'median', equal => 'gt'); # cut the last previously loaded data at its median
97							$aref = $ddat->cut(value => 23); # cut anonymously cached data at a specific value
98							$aref = $ddat->cut(value => 'mean', data => 'blues'); # cut named data (previously loaded as such) at its mean (or whatever)
99							$aref = $ddat->cut(value => CODE); # cut by a user-defined function returning a data-descriptive value
100							$aref = $ddat->cut(value => 23, set => [-1, 1]); # cut as above, but not into 0s and 1s, but -1s and 1s
101
102							Returns a reference to an array of dichotomously transformed values of a given array of numbers by categorizing its values as to whether they're numerically higher or lower than a particular value, e.g., their median, mean, mode or some given number, or some other function that returns a single value. Called in list context, returns a reference to the transformed values, and then the cut-value itself.
103
104							So the following data, when cut over values greater than or equal to 5, yield the binary-valued sequence:
105
106							@orig_data = (4, 3, 3, 5, 3, 4, 5, 6, 3, 5, 3, 3, 6, 4, 4, 7, 6, 4, 7, 3);
107							@cut_data = (0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0);
108
109							The order of the original values is reflected in the returned "cut data", but their order is not taken into account in making up the dichotomy - in contrast to the L method.
110
111							I, as follow, specify what value or measure to cut by (default is the median), and how to handle ties with the cut-value (default is to skip them).
112
113							=over 4
114
115							=item value => 'mean\|median\|mode' - or a specific numerical value, or code reference
116
117							Specifies the value at which the data will be cut. This could be the mean, median or mode (as calculated by L), or a numerical value within the range of the data, or some appropriate subroutine -- one that takes an array (not a reference to one) and returns a single value (presumably a descriptive of the values in the array). The default is the I. The cut-value, as specified by B, can be retrieved as the second element returned if calling for an array.
118
119							=item equal => 'I\|I\|I\|I'
120
121							Specifies how to cut the data if the cut-value (as specified by B) is present in the data. The logic applied takes on the following conventions:
122
123							=over 8
124
125							=item B 'gt'> [default]
126
127							All values I the cut-value take on one code (by default, 1), and all values I the cut-value take on another (by default, 0). This is (by convention) the default operation, preventing a given sequence that is fully composed of the cut-value returning an empty-list. So, e.g., given the data (5, 5, 5), and specifying that the cut-value is 5, the list (1, 1, 1) is returned, just as if the given data were, say, (8, 5, 212).
128
129							=item B 'lt'>
130
131							All values I the cut-value take on one code (by default, 0), and all higher values take on another code (by default, 1). For the prior example, the given data (5, 5, 5) now becomes (0, 0, 0).
132
133							=item B 'rpt'>
134
135							The dichotomous sequence takes on the value that was taken in the immediately prior "cut". So now, given the data (5, 5, 5), the list (1, 1, 1) would be returned--as the first value is given the value of 1 (following the default operation to treat values greater than or equal to the cut-value as 1), and all subsequent values take on the same value. But if the given data were (4, 5, 6, 5), or (-400, 5, 600, 5), the returned list is (0, 0, 1, 1). This value/operation was introduced in Version 0.05.
136
137							=item B 'skip'\|0>
138
139							Values equal to the cut-value are skipped. So if the cut-value appears as the first value, it is simply skipped (it takes on no value), and an empty list is returned.
140
141							=back
142
143							Note that the operational logic here is different, in its default operation, from that following the same argument in the C() method. There, logically, and by convention, the default = 0, i.e., to skip neigbouring values with zero difference. The default operation for equality described here, for the C() method, perhaps should match the latter (and might well, following usage, feedback) for sake of consistency, but it seems most appropriate for now (as since Version 0.00) to make the default operation within the C() method follow convention/expectation, i.e., by its own logic, rather than to exact cross-method consistency for its own sake. In practice, it is advisable to compare results for a test based on the dichotomous sequence from different criteria for equality. If all results of a test are equal, there is no problem; otherwise, the average of the results from different methods can be taken (see Siegal, 1956, pp. 143-144, in discussion of "ties" in dichotomizing data for the two-sample L.)
144
145							=item set
146
147							The optional argument B, keying a two-element array, controls the binary-values to return; instead of the default set of 0s and 1s, the set might be, say, -1s and 1s, or "male" and "female". The first (index = 0) element in the set array replaces what, by default, would be returned as 0, and the second (index = 1) element in the set array replaces what, by default, would be returned as 1.
148
149							=back
150
151							=cut
152
153							sub cut {
154	16			16	1	10819	my ( $self, @args ) = @_;
155	16	50				57	my $args = ref $args[0] ? $args[0] : {@args};
156	16	100				56	my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
157	16	50				272	croak __PACKAGE__,
158							'::cut All data must be numeric for dichotomizing about a cut-value'
159							if !$self->all_numeric($dat);
160	16	50				1872	$args->{'value'} = 'median' if !defined $args->{'value'};
161
162							#$args->{'equal'} = 0 if !defined $args->{'equal'}; #- no default??
163	16	100				38	$args->{'equal'} = 'gt' if !defined $args->{'equal'};
164	16					21	my ( $val, @seqs ) = ();
165
166							# Get a cut-value:
167	16	100				34	if ( !is_numeric( $args->{'value'} ) ) {
168	3					40	my $code = \&{ delete $args->{'value'} };
	3					8
169	3					4	$val = $code->( @{$dat} );
	3					7
170							}
171							else {
172	13					160	$val = $args->{'value'};
173							}
174
175							# Categorize by number of observations above, below or equal to the cut_value:
176							push @seqs,
177							$_ > $val ? 1
178							: $_ < $val ? 0
179							: $args->{'equal'} eq 'gt' ? 1
180							: $args->{'equal'} eq 'lt' ? 0
181							: $args->{'equal'} eq 'rpt' ? ( exists $seqs[-1] ? $seqs[-1] : 1 )
182	16	100				134	: next foreach @{$dat};
	16	100				295
		100
		100
		100
		100
183	16					42	_set( \@seqs, $args->{'set'} );
184	16	100				71	return wantarray ? ( \@seqs, $val ) : \@seqs;
185							}
186
187							=head3 swing
188
189							$aref = $ddat->swing(data => [3, 4, 7, 6, 5, 1, 2, 3, 2]); # "swing" these data
190							$aref = $ddat->swing(label => 'reds'); # name a pre-loaded dataset for "swinging"
191							$aref = $ddat->swing(); # use the last-loaded dataset
192							$aref = $ddat->swing(set => [qw/male female/]); # for any of the above, optionally specify the dichotomous values
193
194							Returns a reference to an array of dichotomously transformed values of a single sequence of numerical values according to their consecutive rises and falls. Each value is subtracted from its successor, and the result is replaced with a 1 if the difference represents an increase, or 0 if it represents a decrease. For example (from Wolfowitz, 1943, p. 283), the following numerical sequence produces the subsequent dichotomous sequence.
195
196							@values = (qw/3 4 7 6 5 1 2 3 2/);
197							@dichot = (qw/1 1 0 0 0 1 1 0/);
198
199							Dichotomously, the data commence with an ascending run of length 2 (from 3 to 4, and from 4 to 7), followed by a descending run of length 3 (from 7 to 6, 6 to 5, and 5 to 1), followed by an ascent of length 2 (from 1 to 2, from 2 to 3), and so on. The number of resulting dichotomous observations is 1 less than the original sample-size (elements in the given array).
200
201							I are as follow.
202
203							=over 4
204
205							=item equal => 'I\|I\|I\|I'
206
207							The default result when the difference between two successive values is zero is to skip the observation, and move onto the next succession. Alternatively, specify B 'rpt'> to repeat the result for the previous succession; skipping only a difference of zero should it occur as the first result. Or, a difference greater than or equal to zero is counted as an increase (B 'gt'>), or a difference less than or equal to zero is counted as a decrease. For example,
208
209							@values = (qw/3 3 7 6 5 2 2/);
210							@dicho_skip = (qw/1 0 0 0/); # First and final results (of 3 - 3, and 2 - 2) are skipped
211							@dicho_rpt = (qw/1 0 0 0 0/); # First result (of 3 - 3) is skipped, and final result repeats the former
212							@dicho_gt = (qw/1 1 0 0 0 1/); # Greater than or equal to zero is an increase
213							@dicho_lt = (qw/0 1 0 0 0 0/); # Less than or equal to zero is a decrease
214
215							See description of the same argument in the L for more details (but for which the default value is 'gt').
216
217							=item set
218
219							The optional argument B, keying a two-element array, controls the binary-values to return; instead of the default set of 0s and 1s, the set might be, say, -1s and 1s, or "male" and "female". The first (zero-indexed) element in the set array replaces what, by default, would be returned as 0, and the second (index = 1) element in the set array replaces what, by default, would be returned as 1.
220
221							=back
222
223							=cut
224
225							sub swing {
226	8			8	1	6352	my ( $self, @args ) = @_;
227	8	50				26	my $args = ref $args[0] ? $args[0] : {@args};
228	8	100				21	my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
229	8	50				48	croak __PACKAGE__, '::swing All data must be numeric for dichotomizing'
230							if !$self->all_numeric($dat);
231	8	100				741	$args->{'equal'} = 0 if !defined $args->{'equal'}; #- no default??
232	8					9	my ( $i, $res, @seqs ) = ();
233
234							# Replace observations with the succession of rises and falls:
235	8					9	for ( $i = 0 ; $i < ( scalar @{$dat} - 1 ) ; $i++ ) {
	95					122
236	87					66	$res = $dat->[ ( $i + 1 ) ] - $dat->[$i];
237	87	100				98	if ( $res > 0 ) {
		100
238	35					29	push @seqs, 1;
239							}
240							elsif ( $res < 0 ) {
241	44					35	push @seqs, 0;
242							}
243							else {
244	8					9	for ( $args->{'equal'} ) {
245	8	100				24	if (/^rpt/xsm) {
		100
		100
246	2	100				7	push @seqs, $seqs[-1] if scalar @seqs;
247							}
248							elsif (/^gt/xsm) {
249	2					5	push @seqs, 1;
250							}
251							elsif (/^lt/xsm) {
252	2					5	push @seqs, 0;
253							}
254							else {
255	2					3	next;
256							}
257							}
258							}
259							}
260	8					19	_set( \@seqs, $args->{'set'} );
261	8					24	return \@seqs;
262							}
263
264							=head2 Numerical data: Two sequence dichotomization
265
266							See also the methods for categorical data where it is ok to ignore any order and intervals in numerical data.
267
268							=head3 pool
269
270							$aref = $ddat->pool(data => [$aref1, $aref2]); # give data directly to function
271							$aref = $ddat->pool(data => [$ddat->access(index => 0), $ddat->access(index => 1)]); # after $ddat->load(\@aref1, $aref2);
272							$aref = $ddat->pool(data => [$ddat->access(label => '1'), $ddat->access(label => '2')]); # after $ddat->load(1 => $aref1, 2 => $aref2);
273							$aref = $ddat->pool(data => [$aref1, $aref2], set => [-1, 1]); # for any of the above, optionally specify the binary set
274
275							Returns a reference to an array of dichotomously transformed values of two sequences of I data as a ranked pool, i.e., by pooling the data from each sequence according to the magnitude of their values at each trial, from lowest to heighest. Specifically, the values from both sequences are pooled and ordered from lowest to highest, and then dichotomized into runs according to the sequence from which neighbouring values come from. Another run occurs wherever there is a change in the source of the values. A non-random effect of, say, higher or lower values consistently coming from one sequence rather than another would be reflected in fewer runs than expected by chance.
276
277							This is typically used for a Wald-Walfowitz test of difference between two samples -- ranking by median; as per Siegal (1956), and Swed and Eisenhart (1943).
278
279							The I B, keying a two-element array, controls the binary-values to return; instead of the default set of 0s and 1s, the set might be, say, -1s and 1s, or "male" and "female". The first (zero-indexed) element in the set array replaces what, by default, would be returned as 0, and the second (index = 1) element in the set array replaces what, by default, would be returned as 1.
280
281							=cut
282
283							sub pool {
284	3			3	1	5382	my ( $self, @args ) = @_;
285	3	50				12	my $args = ref $args[0] ? $args[0] : {@args};
286	3	50				10	my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
287	3					5	$self->all_numeric($_) foreach @{$dat};
	3					20
288	3					361	my ( $dat1, $dat2 ) = @{$dat};
	3					5
289	3					4	my $sum = scalar @{$dat1} + scalar @{$dat2};
	3					4
	3					4
290							my @dat =
291	3					4	( [ sort { $a <=> $b } @{$dat1} ], [ sort { $a <=> $b } @{$dat2} ] );
	45					42
	3					14
	51					38
	3					6
292
293	3					6	my ( $i, $x, $y, @seqs ) = (0);
294	3					9	while ( scalar(@seqs) < $sum ) {
295	48					35	$x = $dat[0]->[0];
296	48					26	$y = $dat[1]->[0];
297	48	100	66			135	$i = defined $x && defined $y ? $x < $y ? 0 : 1 : defined $x ? 0 : 1;
		50
		100
298	48					25	shift @{ $dat[$i] };
	48					39
299	48					66	push @seqs, $i;
300							}
301	3					8	_set( \@seqs, $args->{'set'} );
302	3					11	return \@seqs;
303							}
304							## DEV: consider: List::AllUtils::pairwise:
305							# @x = pairwise { $a + $b } @a, @b; # returns index-by-index sums
306
307							=head2 Categorical data: Single sequence dichotomization
308
309							=head3 binate
310
311							$aref = $ddat->binate(oneis => 'E'); # optionally specify a state in the sequence to be set as "1"
312							$aref = $ddat->binate(oneis => 'E', set => [qw/a b/]); # optionally specify that Es be transformed to 'b', other events as 'a'
313							$aref = $ddat->binate(data => \@ari, oneis => 'E'); # same but using pre-loaded data
314
315							Returns a reference to an array of dichotomously transformed values of an array by setting the first element in the list to 1 (by default, or whatever is specified as B) on all its occurrences in the array, and all other values in the array as zero.
316
317							The I B, keying a referenced array, specifies that, in fact, the first element (or what might be specified as B) should be transformed into what is given as the index 1 element in this array, and that all other elements should be transformed into what is given as its index 0 element.
318
319							=cut
320
321							sub binate {
322	3			3	1	5162	my ( $self, @args ) = @_;
323	3	50				14	my $args = ref $args[0] ? $args[0] : {@args};
324	3	50				20	my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
325							my $oneis =
326							defined $args->{'oneis'}
327	3	100				87	? delete $args->{'oneis'}
328							: $dat->[0]; # What value set to 1 and others to zero?
329	3	100				4	my @seqs = map { $_ eq $oneis ? 1 : 0 } @{$dat};
	15					30
	3					6
330							; # replace observations with 1s and 0s
331	3					10	_set( \@seqs, $args->{'set'} );
332	3					11	return \@seqs;
333							}
334
335							=head2 Categorical data: Two-sequence dichotomization
336
337							=head3 match
338
339							$aref = $ddat->match(data => [\@aref1, \@aref2], lag => signed integer, loop => 0\|1); # with optional crosslag of the two sequences
340							$aref = $ddat->match(data => [$ddat->access(index => 0), $ddat->access(index => 1)]); # after $ddat->load(\@aref1, \@aref2);
341							$aref = $ddat->match(data => [$ddat->access(label => '1'), $ddat->access(label => '2')]); # after $ddat->load(1 => \@aref1, 2 => \@aref2);
342
343							Returns a reference to an array of dichotomously transformed values of two paired arrays according to the match between the elements at each of their indices. Where the data-values are equal at a certain index, they are represented with a 1; otherwise a 0 (by default, but see the B argument). Numerical or stringy data can be equated. For example, the following two arrays would be reduced to the third, where a 1 indicates a match (i.e., the values are "indexically equal").
344
345							@foo_dat = (qw/1 3 3 2 1 5 1 2 4/);
346							@bar_dat = (qw/4 3 1 2 1 4 2 2 4/);
347							@bin_dat = (qw/0 1 0 1 1 0 0 1 1/);
348
349							I are as follow.
350
351							=over 4
352
353							=item lag => I (where I < number of observations I I > -1 (number of observations) )
354
355							Match the two data-sets by shifting the first named set ahead or behind the other data-set by B observations. The default is zero. For example, one data-set might be targets, and another responses to the targets:
356
357							targets = cbbbdacdbd
358							responses = daadbadcce
359
360							Matched as a single sequence of hits (1) and misses (0) where B = B<0> yields (for the match on "a" in the 6th index of both arrays):
361
362							0000010000
363
364							With B => 1, however, each response is associated with the target one ahead of the trial for which it was observed; i.e., each target is shifted to its +1 index. So the first element in the above responses (I) would be associated with the second element of the targets (I), and so on. Now, matching the two data-sets with a B<+1> lag gives two hits, of the 4th and 7th elements of the responses to the 5th and 8th elements of the targets, respectively:
365
366							000100100
367
368							making 5 runs. With B => 0, there are 3 runs. Lag values can be negative, so that B => -2 will give:
369
370							00101010
371
372							Here, responses necessarily start at the third element (I), the first hits occurring when the fifth response-element corresponds to the the third target element (I). The last response (I) could not be used, and the number of elements in the hit/miss sequence became n-B less the original target sequence. This means that the maximum value of lag must be one less the size of the data-sets, or there will be no data.
373
374							=item loop => 0\|1
375
376							Implements circularized lagging if B => 1, where all lagged data are preserved by looping any excess to the start or end of the criterion data. The number of observations will then always be the same, regardless of the lag; i.e., the size of the returned array is the same as that of the given data. For example, matching the data in the example above with a lag of +1, with looping, creates an additional match between the final response and the first target (I); i.e., the last element in the "response" array is matched to the first element of the "target" array:
377
378							1000100100
379
380							=item set
381
382							The optional argument B, keying a two-element array, controls the binary-values to return; instead of the default set of 0s and 1s, the set might be, say, -1s and 1s, or "male" and "female". The first (zero-indexed) element in the set array replaces what, by default, would be returned as 0, and the second (index = 1) element in the set array replaces what, by default, would be returned as 1.
383
384							=back
385
386							=cut
387
388							sub match {
389	7			7	1	12024	my ( $self, @args ) = @_;
390	7	50				39	my $args = ref $args[0] ? $args[0] : {@args};
391	7	50				24	my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
392							$dat = $self->crosslag(
393							lag => $args->{'lag'},
394							data => [ $dat->[0], $dat->[1] ],
395							loop => $args->{'loop'}
396	7	100				38	) if $args->{'lag'};
397							my $lim =
398	7					14	scalar @{ $dat->[0] } <= scalar @{ $dat->[1] }
	7					18
399	7					12	? scalar @{ $dat->[0] }
400	7	50				13	: scalar @{ $dat->[1] }; # ensure criterion data-set is smallest
	0					0
401	7					13	my (@seqs) = ();
402	7					21	for my $i ( 0 .. $lim ) {
403	72	100	66			201	next if !defined $dat->[0]->[$i] \|\| !defined $dat->[1]->[$i];
404	65	100				130	$seqs[$i] = $dat->[0]->[$i] eq $dat->[1]->[$i] ? 1 : 0;
405							}
406	7					33	_set( \@seqs, $args->{'set'} );
407	7					36	return \@seqs;
408							}
409
410							=head2 Numerical or categorical data: Single sequence dichotimisation
411
412							=head3 shrink
413
414							$aref = $ddat->shrink(winlen => INT, rule => CODE)
415
416							Returns a reference to an array of dichotomously transformed values of a numerical or categorical sequence by taking I slices, or windows, as given in the argument B, and making a true/false sequence out of them according to whether or not each slice passes a B. The B is a code reference that gets the data as a reference to an array, and so might be something like this:
417
418							sub { return Statistics::Lite::mean(@{$_}) > 2 ? 1 : 0; }
419
420							If B is set to 3, this means-wise rule would make the following numerical sequence of 9 elements shrink into the following dichotomous sequence of 3 elements:
421
422							@data = (1, 2, 3, 3, 3, 3, 4, 2, 1);
423							@means = (2, 3, 2.5 );
424							@dico = (0, 1, 1 );
425
426							For categorical data, a completely "stringy" rule might be specified in the following ways. If B => 1, and the given data are (A, B, c, d), then the rule
427
428							sub { my $aref = shift; $aref->[0] =~ /[A-Z]/ ? 1 : 0; }
429
430							would yield the sequence be (1, 1, 0, 0) -- because the elements A and B satisfy the regular expression (being within the set {A .. Z}), while the remainder (elements c and e) do not.
431
432							Yet if B => 2 for the same given data, the same case-wise rule might be specified as
433
434							sub { my $aref = shift; my $str = join q{}, @{$aref}; $str =~ /[A-Z]{2,}/ ? 1 : 0; }
435
436							and the returned sequence is (1, 0), given that (again) the first two elements (A, B) satisfy the rule (returning 1), and the second pair of elements (c, e) do not (returning 0).
437
438							The B must, of course, return dichotomous values to dichotomize the data, and B should make up equally sized segments (no error is thrown if this isn't the case, the remainder just gets figured in the same way).
439
440							Unlike other methods, this method does not respect a B argument -- because the given transformation rule has control of what the set is (1s and 0s, or 1s and -1s, etc.).
441
442							=cut
443
444							sub shrink {
445	3			3	1	4904	my ( $self, @args ) = @_;
446	3	50				12	my $args = ref $args[0] ? $args[0] : {@args};
447	3	100				13	my $dat = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
448	3					28	my $lim = scalar @{$dat};
	3					5
449	3					3	my $len = int $args->{'winlen'};
450	3		50			7	$len \|\|= 1;
451	3					5	my $code = delete $args->{'rule'};
452	3	50	33			17	croak __PACKAGE__, '::shrink Need a code to Boolean shrink'
453							if not $code
454							or ref $code ne 'CODE';
455	3					3	my ( $i, @seqs );
456
457	3					7	for ( $i = 0 ; $i < $lim ; $i += $len )
458							{ # C-style for clear greater-than 1 increments per loop
459	9					82	push @seqs, $code->( [ @{$dat}[ $i .. ( $i + $len - 1 ) ] ] );
	9					19
460							}
461	3					31	return \@seqs;
462							}
463							*boolwin = \&shrink;
464
465							=head2 Utilities
466
467							=head3 crosslag
468
469							@lagged_arefs = $ddat->crosslag(data => [\@ari1, \@ari2], lag => signed integer, loop => 0\|1);
470							$aref_of_arefs = $ddat->crosslag(data => [\@ari1, \@ari2], lag => signed integer, loop => 0\|1); # same but not "wanting array"
471
472							Takes two arrays and returns them cross-lagged against each other, shifting and popping values according to the number of "lags". Typically used when wanting to L the two arrays against each other.
473
474							=over 4
475
476							=item lag => signed integer up to the number of elements
477
478							Takes the first array sent as "data" as the reference or "target" array for the second "response" array to be shifted so many lags before or behind it. With no looping of the lags, this means the returned arrays are "lag"-elements smaller than the original arrays. For example, with lag => +1 (and loop => 0, the default), and with data => [ [qw/c p w p s/], [qw/p s s w r/] ],
479
480							(c p w p s) becomes (p w p s)
481							(p s s w r) becomes (p s s w)
482
483							So, whereas the original data gave no matches across the two arrays, now, with the second of the two arrays shifted forward by one index, it has a match (of "p") at the first index with the first of the two arrays.
484
485							=item loop => 0\|1
486
487							For circularized lagging, B => 1, and the size of the returned array is the same as those for the given data. For example, with a lag of +1, the last element in the "response" array is matched to the first element of the "target" array:
488
489							(c p w p s) becomes (p w p s c) (looped with +1)
490							(p s s w r) becomes (p s s w r) (no effect)
491
492							In this case, it might be more efficient to simply autolag the "target" sequence against itself.
493
494							=back
495
496							=cut
497
498							sub crosslag {
499	6			6	1	2158	my ( $self, @args ) = @_;
500	6	50				28	my $args = ref $args[0] ? $args[0] : {@args};
501	6					11	my $lag = $args->{'lag'};
502	6					10	my $dat1 = $args->{'data'}->[0];
503	6					9	my $dat2 = $args->{'data'}->[1];
504	6					10	my $loop = $args->{'loop'};
505							return ( wantarray ? ( $dat1, $dat2 ) : [ $dat1, $dat2 ] )
506							if not $lag
507	6	0	33			26	or abs $lag >= scalar @{$dat1};
	6	50				26
508
509	6					8	my @dat1_lagged = @{$dat1};
	6					23
510	6					8	my @dat2_lagged = @{$dat2};
	6					19
511
512	6	100				16	if ( $lag > 0 ) {
		50
513	5					17	foreach ( 1 .. abs $lag ) {
514	5	100				12	if ($loop) {
515	3					45	unshift @dat1_lagged, pop @dat1_lagged;
516							}
517							else {
518	2					4	shift @dat1_lagged;
519	2					6	pop @dat2_lagged;
520							}
521							}
522							}
523							elsif ( $lag < 0 ) {
524	1					5	foreach ( 1 .. abs $lag ) {
525	2	50				4	if ($loop) {
526	0					0	push @dat1_lagged, shift @dat1_lagged;
527							}
528							else {
529	2					3	pop @dat1_lagged;
530	2					5	shift @dat2_lagged;
531							}
532							}
533							}
534							return wantarray
535	6	50				32	? ( \@dat1_lagged, \@dat2_lagged )
536							: [ \@dat1_lagged, \@dat2_lagged ];
537							}
538
539							sub _set {
540	37			37		64	my ( $aref, $set ) = @_;
541	37	100	66			113	return if not ref $set or scalar @{$set} != 2;
	5					26
542	5					12	for my $i ( 0 .. scalar @{$aref} - 1 ) {
	5					23
543	59	100				75	if ( $aref->[$i] == 0 ) {
544	34					44	$aref->[$i] = $set->[0];
545							}
546							else {
547	25					28	$aref->[$i] = $set->[1];
548							}
549							}
550	5					12	return;
551							}
552
553							=head1 AUTHOR
554
555							Roderick Garton, C<< >>
556
557							=head1 REFERENCES
558
559							B (1977). Statistical methods in parapsychological research. In B. B. Wolman (Ed.), I (pp. 81-130). New York, NY, US: Van Nostrand Reinhold. L [Describes the L method of windowed Boolean dichotomization.]
560
561							B (1956). I. New York, NY, US: McGraw-Hill. L [Re dichotomization for the two-sample L.]
562
563							B (1943). Tables for testing randomness of grouping in a sequence of alternatives. I, I<14>, 66-87. doi: L<10.1214/aoms/1177731494\|http://dx.doi.org/10.1214/aoms/1177731494> [Describes the L method and test example.]
564
565							B (1943). On the theory of runs with some applications to quality control. I, I<14>, 280-288. doi: L<10.1214/aoms/1177731421\|http://dx.doi.org/10.1214/aoms/1177731421> [Describes the L method ("runs up and down") and test example.]
566
567							=head1 BUGS
568
569							Please report any bugs or feature requests to C, or through
570							the web interface at L. I will be notified, and then you'll
571							automatically be notified of progress on your bug as I make changes.
572
573							=head1 SUPPORT
574
575							You can find documentation for this module with the perldoc command.
576
577							perldoc Statistics::Data::Dichotomize
578
579							You can also look for information at:
580
581							=over 4
582
583							=item * RT: CPAN's request tracker (report bugs here)
584
585							L
586
587							=item * AnnoCPAN: Annotated CPAN documentation
588
589							L
590
591							=item * CPAN Ratings
592
593							L
594
595							=item * Search CPAN
596
597							L
598
599							=back
600
601							=head1 LICENSE AND COPYRIGHT
602
603							=over 4
604
605							=item Copyright (c) 2012-2016 Roderick Garton
606
607							This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see L).
608
609							=item Disclaimer
610
611							To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.
612
613							=back
614
615							=cut
616
617							1; # End of Statistics::Data::Dichotomize