File Coverage

blib/lib/DBSchema/Normalizer.pm

Criterion    Covered   Total       %
statement         10      12    83.3
branch                          n/a
condition                       n/a
subroutine         4       4   100.0
pod                             n/a
total             14      16    87.5


line stmt bran cond sub pod time code
1             ############################################################
2             #
3             # DBSchema::Normalizer - a MySQL database table normalizer
4             #
5             # Copyright (c) 2001 by Giuseppe Maxia
6             # Produced under the GPL (Golden Perl Laziness)
7             # Distributed under the GPL (GNU General Public License)
8             #
9             ############################################################
10              
11             require 5.004;
12              
13             =head1 NAME
14              
15             DBSchema::Normalizer - database normalization - convert a table from 1st to 2nd normal form
16              
17             =head1 SYNOPSIS
18              
19             # the easy way is to give all parameters to the constructor
20             # and then call do()
21             #
22             use DBSchema::Normalizer;
23             my $norm = DBSchema::Normalizer->new (
24             {
25             DSN => $DSN,
26             username => $username,
27             password => $password,
28             src_table => $sourcetable,
29             index_field => $indexfield,
30             lookup_fields => $lookupfields, # comma separated list
31             lookup_table => $lookuptable,
32             dest_table => $dest_table,
33             copy_indexes => "yes",
34             });
35             $norm->do(); # Just Do It!
36              
37             # Alternatively, you can have some more control, by
38             # creating the lookup table and normalized table separately,
39             # especially useful if one of them is an intermediate step.
40             #
41             use DBSchema::Normalizer qw(create_lookup_table create_normalized_table);
42             my $norm = DBSchema::Normalizer->new(
43             {
44             DSN => $DSN,
45             username => $username,
46             password => $password
47             });
48             $norm->create_lookup_table (
49             {
50             src_table => $tablename,
51             index_field => $indexfield,
52             lookup_fields => $lookupfields,
53             lookup_table => $lookuptable
54             });
55             $norm->create_normalized_table (
56             {
57             src_table => $tablename,
58             index_field => $indexfield,
59             lookup_fields => $lookupfields,
60             lookup_table => $lookuptable,
61             dest_table => $dest_table,
62             copy_indexes => "yes",
63             });
64              
65             =head1 DESCRIPTION
66              
67             B<DBSchema::Normalizer> is a module that helps transform MySQL database tables from 1st to 2nd normal form.
68             Simply put, it will create a lookup table out of a set of repeating fields from a source table, and replace such fields by a foreign key that points to the corresponding fields in the newly created table.
69             All information is taken from the database itself. There is no need to specify the existing table structure.
70             The module is capable of re-creating existing indexes, and should deal with complex cases where the replaced fields are part of a primary key.
71              
72             =head2 Algorithm
73              
74             The concept behind B<DBSchema::Normalizer> is based upon some DBMS properties. To replace repeating fields with a foreign key pointing to a lookup table, you must be sure that for each distinct set of values you have a distinct foreign key. You might be tempted to solve the problem with something like this:
75              
76             I. Read all records into memory
77             II. For each record, identify the unique value for the fields to be
78             moved into a lookup table and store it in a hash
79             III. Again, for each record, find the corresponding value in the
80             previously saved hash and save the non-lookup fields plus the
81             unique key into a new table
82             IV. For each key in the hash, write the values to a lookup table
83              
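For illustration, here is how such a naive approach might look in Perl/DBI. This is only a sketch of the approach being criticized, not part of the module; the connection parameters, the MP3/songs/tmp_albums tables and the lookup fields are made-up examples, and $user and $password are assumed to be defined:

    use DBI;
    my $dbh = DBI->connect("DBI:mysql:music", $user, $password,
        { RaiseError => 1 });
    my %lookup;      # distinct (artist, album, genre) => new key
    my $next_id = 0;
    my $sth = $dbh->prepare(
        "SELECT ID, title, artist, album, genre FROM MP3");
    $sth->execute();
    while (my ($id, $title, @lkp) = $sth->fetchrow_array()) {
        my $key = join("\0", @lkp);
        $lookup{$key} = ++$next_id unless exists $lookup{$key};
        $dbh->do("INSERT INTO songs VALUES (?, ?, ?)",
                 undef, $id, $title, $lookup{$key});
    }
    # finally, dump the hash into the lookup table
    while (my ($key, $id) = each %lookup) {
        $dbh->do("INSERT INTO tmp_albums VALUES (?, ?, ?, ?)",
                 undef, $id, split "\0", $key);
    }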
84             I can find four objections against such an attempt:
85              
86             1. Memory problems. The source table can be very large (and some of the tables I had to normalize were indeed huge; this kind of solution would have crashed any program trying to load them into memory). Instead of reading the source table into memory, we could just read the records twice from the database and deal with them one at a time. However, even the size of the hash could prove to be too much for our computer memory. A hash of 2_000_000 items is unlikely to be handled efficiently on most of today's desktops.
87              
88             2. Data specific solution. To implement this algorithm, we need to include details specific to our particular records in our code. It is not a good first step toward re-utilization.
89              
90             3. Data conversion. We need to fetch data from the database, possibly transform it into suitable formats for our calculation, and then send it back, re-packed in database format. Not always an issue, but think about the handling of floating point fields and timestamp fields with reduced view. Everything can be solved, but it could be a heavy overhead for your sub.
91              
92             4. Finally, I would say that this kind of task is not your job. Nor is Perl's. It belongs in the database engine, which can easily, within its boundaries, identify unique values and make a lookup table out of them. And it can also easily make a join between source and lookup table.
93              
94             That said, the solution springs to mind. Let the database engine do its job, and have Perl drive it towards the solution we need to achieve. The algorithm is based upon the fact that a table created from a SELECT DISTINCT statement is guaranteed to have a direct relationship with each record of the source table, when compared using the same elements we considered in the SELECT DISTINCT.
95              
96             The algorithm takes four steps:
97              
98             I. create the lookup table
99              
100             CREATE TABLE lookup ({lookupID} INT NOT NULL auto_increment
101             primary key, {LOOKUP FIELDS});
102              
103             #(and add a key for each {LOOKUP FIELDS})
104              
105             II. fill in the lookup table
106            
107             INSERT INTO lookup
108             SELECT DISTINCT NULL, {LOOKUP FIELDS} FROM source_table;
109             #(the {lookupID} is automatically created, being auto_increment)
110              
111             III. create the normalized table
112              
113             CREATE TABLE norm_table ({source_table FIELDS} -
114             {LOOKUP FIELDS} + {lookupID})
115              
116             IV. fill in the normalized table
117              
118             INSERT INTO norm_table
119             SELECT {source_table FIELDS} - {LOOKUP FIELDS} + {lookupID}
120             FROM source_table
121             INNER JOIN lookup
122             on (source_table.{LOOKUP FIELDS}= lookup.{LOOKUP FIELDS})
123              
124             As you can see, the entire operation is run in the server workspace, thus avoiding problems of (a) fetching records (less network traffic), (b) handling data conversion, (c) memory occupation and (d) efficiency.
125              
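In Perl/DBI terms, the four steps boil down to a handful of do() calls. The sketch below hard-codes the field lists that the module actually extracts from the database; the table, column and connection names are only examples:

    use DBI;
    my $dbh = DBI->connect("DBI:mysql:music", $user, $password,
        { RaiseError => 1 });

    # I. create the lookup table
    $dbh->do(qq{CREATE TABLE lookup (album_id INT NOT NULL AUTO_INCREMENT
        PRIMARY KEY, artist VARCHAR(20), album VARCHAR(30),
        genre VARCHAR(10))});

    # II. fill it with the distinct combinations of the lookup fields
    #     (NULL lets AUTO_INCREMENT generate album_id)
    $dbh->do(qq{INSERT INTO lookup
        SELECT DISTINCT NULL, artist, album, genre FROM MP3});

    # III. create the normalized table
    $dbh->do(qq{CREATE TABLE norm_table (ID INT, title VARCHAR(40),
        duration TIME, size INT, album_id INT)});

    # IV. fill it through a join on the lookup fields
    $dbh->do(qq{INSERT INTO norm_table
        SELECT src.ID, src.title, src.duration, src.size, lkp.album_id
        FROM MP3 src INNER JOIN lookup lkp
        ON (src.artist = lkp.artist AND src.album = lkp.album
            AND src.genre = lkp.genre)});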
126             Let's see an example.
127              
128             Having a table MP3 with these fields
129              
130             mysql> describe MP3;
131             +----------+-------------+------+-----+----------+----------------+
132             | Field | Type | Null | Key | Default | Extra |
133             +----------+-------------+------+-----+----------+----------------+
134             | ID | int(11) | | PRI | NULL | auto_increment |
135             | title | varchar(40) | | MUL | | |
136             | artist | varchar(20) | | MUL | | |
137             | album | varchar(30) | | MUL | | |
138             | duration | time | | | 00:00:00 | |
139             | size | int(11) | | | 0 | |
140             | genre | varchar(10) | | MUL | | |
141             +----------+-------------+------+-----+----------+----------------+
142             7 rows in set (0.00 sec)
143              
144             We want to produce two tables, the first one having only [ID, title, duration, size], while the second one should get [artist, album, genre]. (The second one will also need to be further split into [artist] and [album, genre], but we can deal with that later.)
145              
146             Here are the instructions to normalize this table:
147              
148             DROP TABLE IF EXISTS tmp_albums;
149             CREATE TABLE tmp_albums (album_id INT NOT NULL AUTO_INCREMENT
150             PRIMARY KEY,
151             artist varchar(20) not null,
152             album varchar(30) not null,
153             genre varchar(10) not null,
154             KEY artist (artist), KEY album (album), KEY genre (genre));
155              
156             INSERT INTO tmp_albums
157             SELECT DISTINCT NULL, artist,album,genre FROM MP3;
158              
159             DROP TABLE IF EXISTS songs;
160             CREATE TABLE songs (ID int(11) not null auto_increment,
161             title varchar(40) not null,
162             duration time not null default '00:00:00',
163             size int(11) not null,
164             album_id INT(11) NOT NULL,
165             PRIMARY KEY (ID), KEY title (title), KEY album_id (album_id));
166            
167             INSERT INTO songs SELECT src.ID, src.title, src.duration,
168             src.size, album_id
169             FROM MP3 src INNER JOIN tmp_albums lkp
170             ON (src.artist =lkp.artist and src.album =lkp.album
171             and src.genre =lkp.genre);
172              
173             We can then use the same technique to normalize the albums into a proper table.
174              
175             DROP TABLE IF EXISTS artists;
176             CREATE TABLE artists (artist_id INT NOT NULL AUTO_INCREMENT
177             PRIMARY KEY,
178             artist varchar(20) not null,
179             KEY artist (artist)) ;
180            
181             INSERT INTO artists
182             SELECT DISTINCT NULL, artist FROM tmp_albums;
183              
184             DROP TABLE IF EXISTS albums;
185            
186             CREATE TABLE albums (album_id int(11) not null auto_increment,
187             album varchar(30) not null,
188             genre varchar(10) not null,
189             artist_id INT(11) NOT NULL,
190             PRIMARY KEY (album_id),
191             KEY genre (genre), KEY album (album), KEY artist_id (artist_id));
192            
193             INSERT INTO albums
194             SELECT src.album_id, src.album, src.genre, artist_id
195             FROM tmp_albums src
196             INNER JOIN artists lkp ON (src.artist =lkp.artist);
197              
198             mysql> describe artists;
199             +-----------+-------------+------+-----+---------+----------------+
200             | Field | Type | Null | Key | Default | Extra |
201             +-----------+-------------+------+-----+---------+----------------+
202             | artist_id | int(11) | | PRI | NULL | auto_increment |
203             | artist | varchar(20) | | MUL | | |
204             +-----------+-------------+------+-----+---------+----------------+
205             2 rows in set (0.00 sec)
206            
207             mysql> describe albums;
208             +-----------+-------------+------+-----+---------+----------------+
209             | Field | Type | Null | Key | Default | Extra |
210             +-----------+-------------+------+-----+---------+----------------+
211             | album_id | int(11) | | PRI | NULL | auto_increment |
212             | album | varchar(30) | | MUL | | |
213             | genre | varchar(10) | | MUL | | |
214             | artist_id | int(11) | | MUL | 0 | |
215             +-----------+-------------+------+-----+---------+----------------+
216             4 rows in set (0.00 sec)
217            
218             mysql> describe songs;
219             +----------+-------------+------+-----+----------+----------------+
220             | Field | Type | Null | Key | Default | Extra |
221             +----------+-------------+------+-----+----------+----------------+
222             | ID | int(11) | | PRI | NULL | auto_increment |
223             | title | varchar(40) | | MUL | | |
224             | duration | time | | | 00:00:00 | |
225             | size | int(11) | | | 0 | |
226             | album_id | int(11) | | MUL | 0 | |
227             +----------+-------------+------+-----+----------+----------------+
228             5 rows in set (0.00 sec)
229            
230             It should be clear now WHAT we have to do. Less clear is HOW. The above instructions seem to imply that we manually copy the field structure from the source table to the lookup and normalized tables.
231              
232             Actually, that SQL code (except the DESCRIBEs) was produced by this very module and printed to STDOUT, so that all I had to do was some cut-and-paste.
233             And then we are back to the question of the algorithm. If this is all SQL, where is Perl involved?
234             The answer is that Perl will reduce the amount of information we need to give to the database engine.
235             The information about the field structure and indexes is already in the database. Our Perl module (with a [not so] little help from the DBI) can extract the structure from the database and create the appropriate SQL statements.
236             On the practical side, this means that, before producing SQL code, this module will gather information about the source table. It will issue "SHOW FIELDS FROM tablename" and "SHOW INDEX FROM tablename" statements, and parse their results to prepare the operational code.
237              
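As a sketch of how such metadata is available through the DBI (SHOW FIELDS returns the columns Field, Type, Null, Key, Default and Extra; table and variable names here are examples):

    use DBI;
    my $dbh = DBI->connect("DBI:mysql:music", $user, $password,
        { RaiseError => 1 });

    # collect name and type of every column of the source table
    my %fields;
    my $sth = $dbh->prepare("SHOW FIELDS FROM MP3");
    $sth->execute();
    while (my $row = $sth->fetchrow_hashref()) {
        $fields{ $row->{Field} } = $row->{Type};
    }

    # building a CREATE TABLE for a subset of columns is then
    # a matter of string assembly
    my @lookup_cols = qw(artist album genre);
    my $create = "CREATE TABLE tmp_albums (album_id INT NOT NULL "
               . "AUTO_INCREMENT PRIMARY KEY, "
               . join(", ", map { "$_ $fields{$_} NOT NULL" } @lookup_cols)
               . ")";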
238             That's it. It seems a rather small contribution to your average workload, but if you ever have to deal with a project involving several large tables, with many fields, to be transformed into many normalized tables, I am sure you will appreciate the GPL (Golden Perl Laziness) behind this module.
239              
240             BTW, this is the code used to produce the above SQL statements:
241              
242             #!/usr/bin/perl -w
243             use strict;
244              
245             use DBSchema::Normalizer;
246              
247             my $norm = DBSchema::Normalizer->new ({
248             DSN => "DBI:mysql:music;host=localhost;"
249             . "mysql_read_default_file=$ENV{HOME}/.my.cnf",
250             src_table => "MP3",
251             index_field => "album_id",
252             lookup_fields => "artist,album,genre",
253             lookup_table => "tmp_albums",
254             dest_table => "songs",
255             copy_indexes => 1,
256             simulate => 1
257             });
258              
259             $norm->do();
260              
261             $norm->create_lookup_table ({
262             src_table => "tmp_albums",
263             index_field => "artist_id",
264             lookup_fields => "artist",
265             lookup_table => "artists"
266             });
267              
268             $norm->create_normalized_table ({
269             src_table => "tmp_albums",
270             lookup_table => "artists",
271             index_field => "artist_id",
272             lookup_fields => "artist",
273             dest_table => "albums"
274             });
275              
276             Twenty-five lines of code. Not bad for such a complicated task. But even that could have been replaced by these two one-liners:
277              
278             perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music MP3 \
279             album_id album,artist,genre tmp_albums songs 1 1 1))->do()'
280            
281             perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music \
282             tmp_albums artist_id artist artists albums 1 1 1))->do()'
283              
284             (See below item "snew" for more details.)
285            
286             One thing that this module won't do for you, though, is to decide which columns should stay with the source table and which ones should go to the lookup table. This is something for which you need to apply some database theory, and I don't expect you to know it unless you have studied it either at school or independently (unless you happen to be E.F. Codd).
287             I am planning (in a very idle and distant way) another module that will analyze a database table and decide if it is a good design or not. The examples from commercial software I have seen so far did not impress me a lot. I am still convinced that humans are better than machines at this task. But, hey! Ten years ago I was convinced that humans were much better than machines at chess, and instead, not long ago, I had to see an IBM box doing very nasty things to Garry Kasparov. So maybe I'll change my mind. In the meantime, I am enjoying my present intellectual superiority and I keep analyzing databases with the same pleasure that I once felt when solving a chess problem.
288              
289             =head2 Simulation mode
290              
291             This module can do the data transfer for you, or you may want to run it in "simulation mode", by adding simulate => "1" to the constructor parameters. When in simulation mode, the Normalizer will just print the necessary SQL statements to STDOUT, without passing them to the database engine. You can thus check the queries, possibly change them, and use them within some other application.
292              
293             =head2 EXPORT
294              
295             new, snew, create_lookup_table, create_normalized_table
296              
297             =head2 DEPENDENCIES
298              
299             DBI, DBD::mysql
300              
301             =head2 Architecture
302              
303             The Normalizer doesn't enforce private data protection. You are only supposed to call the methods which are documented here as public. In the spirit of Perl OO philosophy, nobody will prevent you from calling private methods (the ones beginning with "_") or fiddling with internal hash fields. However, be aware that such behaviour is highly reprehensible, could lead to unpredictable side effects, of which B<you> are entirely, utterly and irremediably responsible (not to mention that your reputation will be perpetually tarnished, your nephews will call you "Cheating Joe" and you won't ever - EVER - be invited as a dinner speaker to any Perl OO conference, and even if you manage to sneak in you will find a hair in your soup.)
304              
305             =head2 PORTABILITY
306              
307             The algorithm used here is general. It was initially developed in C for an embedded RDBMS and there is no reason to assume that it won't work with any other database engine. However, the information about field types and indexes is, in this module, MySQL dependent. At the moment, I haven't found a clean method to get such information in a database-independent way.
308             To adapt this module to a different database, SQL statements corresponding to the MySQL-specific SHOW INDEX and SHOW FIELDS should be provided. Also, the syntax for INNER JOIN might not be portable across databases.
309              
310             =head2 CAVEAT
311              
312             As always, when dealing with databases, some caution is needed.
313             The create_lookup_table() method will B<drop the lookup table>, if it exists. Be careful about the name you supply for this purpose. If you want to use an existing lookup table (whose data integrity and relational correspondence you can swear upon), then skip create_lookup_table() and ask only for create_normalized_table(). For this one also, a B<DROP TABLE> statement is issued before the creation.
314             Exercise double care on the names you pass to the module.
315              
316             Be also aware that the tables created by this module are of the default table type. You may either choose to convert them after the data transfer or run the Normalizer in "simulation mode" and then manually modify the SQL statements.
317              
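For example, to turn the newly created tables into transactional ones after the transfer (MySQL 3.23/4.0 syntax; the table names are those of the example above):

    ALTER TABLE tmp_albums TYPE=InnoDB;
    ALTER TABLE songs TYPE=InnoDB;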
318             The Normalizer will usually warn you (and exit with flags and bells) if one or more designated lookup fields in the source table are not indexed. This could result in VERY SLOW PERFORMANCE, even for a reasonably low number of records involved.
319             You can choose to ignore this warning, by setting the appropriate parameter, but it is not advisable.
320              
321             If the source table does not have a primary key (don't laugh, I have seen some of them) then a fatal error is issued, without any possible remedy. The reason is simple: if there is no primary key, the engine has no way of identifying which rows should go in a JOIN, and your result may contain duplicates (and in addition you will be waiting a long time to get it.)
322              
323             =head2 TO DO
324              
325             1. Parametrizing the statements for fields and indexes information should improve the chances for
326             portability.
327              
328             # e.g.: MySQL index information comes in this flavor
329             $mysql_index_info = {
330             function_name => "SHOW INDEX FROM $table",
331             info_names => "Table,Non_unique,Key_name,Index_seq,Col_name" #etc
332             };
333             # but it is hard to generalize, especially if another database
334             # engine defines
335             # as positive (Unique) what here is negative (Non_unique)
336              
337             Maybe a more effective way would be to have a polymorphic version of DBSchema::Normalizer, with the base class calling abstract subs, which the descendant classes are supposed to implement.
338             Sounds interesting, even though I feel that I might have some clashes with the DBI.
339              
340             2. Adding support for intermediate steps in converting should also speed up some ugly cases with nested normalization problems, when the lookup table needs to be further normalized.
341              
342             3. Adding support for conversion from Zero normal form to First is not straightforward. Some cases are easy to identify and deal with (e.g. columns Student1, Student2, Student3, StudentN can be converted to a column Student_id pointing at a Students table), but others are more subtle and difficult to generalize (e.g. having two columns for Male and Female, with yes/no content).
343              
344             =head2 DISCLAIMER
345              
346             This software can alter data in your database. If used improperly, it can also damage existing data.
347             (And so can almost any powerful software on your machine, such as Perl itself. Sorry to scare you, but I have to warn users about potential misuse.)
348             There is B<NO WARRANTY> of any sort on this software. This software is provided "AS IS".
349             Please refer to the GPL, GNU General Public License, Version 2, paragraphs 11 and 12, for more details.
350              
351             =head1 SEE ALSO
352              
353             DBI, DBD::mysql
354              
355             =cut
356              
357             package DBSchema::Normalizer;
358 1     1   904 use strict;
  1         2  
  1         34  
359 1     1   5 use warnings;
  1         2  
  1         35  
360              
361 1     1   2276 use DBI;
  1         19825  
  1         58  
362 1     1   1385 use DBD::mysql; # This version is MySQL dependent. It could be changed later
  0            
  0            
363             use Carp;
364             require Exporter;
365              
366             our @ISA = qw(Exporter);
367             our @EXPORT = qw(new snew do );
368             our @EXPORT_OK = qw(create_lookup_table create_normalized_table);
369              
370             our $VERSION = '0.08'; # 09-Nov-2002
371              
372             my @_accepted_parameters = qw(dbh DSN username password src_table index_field
373             lookup_fields lookup_table dest_table copy_indexes verbose
374             simulate ignore_warning);
375              
376             my %_accepted_params = map {$_, 1} @_accepted_parameters;
377              
378             =head1 Class methods
379              
380             =over
381              
382             =item new
383              
384             DBSchema::Normalizer->new ({
385             DSN => $DSN,
386             username => $username,
387             password => $password
388             });
389              
390             new() - object constructor. Requires a hash reference with at least the following keys
391              
392             DSN
393             username
394             password
395              
396             Alternatively, you may pass one already initialized database handler
397              
398             dbh => $dbh
399              
400             Optional fields (in the sense that if you omit them here, you must declare them when calling I<create_lookup_table()> or I<create_normalized_table()>)
401              
402             src_table The table in 1st normal form
403             index_field the index field that we need to create
404             will become foreign key in the source table
405             and primary key in the lookup table
406             lookup_fields the fields depending on the index,
407             in a comma-separated list
408             lookup_table the lookup table
409             dest_table the Normalized (2nd form) table
410              
411             Really optional fields. You may not mention them at all. Their default is 0.
412              
413             copy_indexes three values:
414             "no" or "0" : no indexes are copied
415             "yes" or "1" : indexes from the source table will
416             be immediately replicated to the
417             destination table
418             "later" or "2" : indexes will be created after the
419             data transfer,
420             as an ALTER TABLE statement.
421             It may speed up the insertion
422             for large tables.
423             verbose if "1", messages indicating what is going on
424             will be sent to STDOUT.
425             Using "2", even more verbose information is
426             given (all queries printed before execution);
427             Level "3" will also show details about src_table
428             fields and indexes;
429             ignore_warning if "1", warning on missing indexes on lookup fields
430             are ignored, and the requested operation carried
431             out even at a price of long waiting. Default "0"
432             simulate if "1", no operation is carried out
433             but the queries are printed to STDOUT (as in
434             verbose => 2)
435              
436             B<Note>: src_table, dest_table and lookup_table B<cannot> be called I<src> or I<lkp>, which are names used internally by the Normalizer. If such names are given, a fatal error is issued.
437              
438             If the keys for src_table, index_field, lookup_table and lookup_fields are missing, they can (actually they MUST) be later provided by calls to create_lookup_table() and create_normalized_table().
439              
440             =cut
441              
442             sub new {
443             my $class = shift;
444             my $params = shift;
445             my $_dbh = undef;
446             if (exists $params->{dbh} && defined $params->{dbh}) {
447             $_dbh = $params->{dbh}
448             }
449             else {
450             return undef unless defined $params->{DSN};
451             $_dbh= DBI->connect($params->{DSN}, $params->{username},
452             $params->{password}, { RaiseError => 1});
453             }
454             my $self = bless {
455             verbose => 0,
456             copy_indexes => 0,
457             simulate => 0,
458             ignore_warning => 0,
459             _dbh => $_dbh # Being an object, $_dbh is already
460             # a reference. Doesn't need the "\"
461             # before it.
462             }, $class;
463             foreach my $key (keys (%$params)) {
464             croak "invalid parameter $key \n" unless exists $_accepted_params{$key};
465             $self->{$key} = $params->{$key};
466             }
467             if ($self->{simulate} eq "1") {
468             $self->{verbose} = "2";
469             }
470             elsif ($self->{simulate} ne "0") {
471             croak "invalid value for simulate\n";
472             }
473             return ($self);
474             }
475              
476             =item snew
477              
478             snew is a shortcut for new(). It is called with parameters passed by position instead of using a hash reference.
479             It is a "quick-and-dirty" ("dirty" being the operational word) method intended for the impatient who do not want to write a script.
480             B<Use it at your own risk>.
481             Parameters are passed in this order:
482              
483             host
484             database
485             source table
486             index field
487             lookup fields
488             lookup table
489             destination table
490             copy indexes
491             verbose
492             simulate
493            
494             Here is an example of one-liner normalization call:
495            
496             perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music MP3 \
497             album_id album,artist,genre tmp_albums songs 1 1 1))->do()'
498              
499             Note: ALL 11 parameters must be passed, or a "use of uninitialized value" error is issued.
500              
501             This one-liner is equivalent to the following script:
502            
503             #!/usr/bin/perl
504             no warnings; # Yeah. No warnings. I said it is equivalent,
505             # not recommended.
506             no strict; # Yup. No strict either.
507             use DBSchema::Normalizer;
508             $norm = DBSchema::Normalizer->new (
509             {
510             DSN => "DBI:mysql:music;host=localhost;"
511             . "mysql_read_default_file=$ENV{HOME}/.my.cnf",
512             src_table => "MP3",
513             index_field => "album_id",
514             lookup_fields => "artist,album,genre",
515             lookup_table => "tmp_albums",
516             dest_table => "songs",
517             copy_indexes => 1,
518             verbose => 1,
519             simulate => 1,
520             });
521             $norm->do();
522              
523             It is definitely not as safe as the normal call. However, TMTOWTDI, and it's your call. I am using it, but I don't recommend it. Read my lips: I DON'T RECOMMEND IT.
524              
525             =cut
526              
527             sub snew { # shortcut new (parameters called by position)
528             my ($class, $host, $db, $src_table, $index_field,
529             $lookup_fields, $lookup_table, $dest_table,
530             $copy_indexes, $verbose, $simulate) = @_;
531             my $DSN= "DBI:mysql:$db;host=$host;"
532             . "mysql_read_default_file=$ENV{HOME}/.my.cnf";
533             return new ($class, {
534             DSN => $DSN,
535             src_table => $src_table,
536             index_field => $index_field,
537             lookup_fields => $lookup_fields,
538             lookup_table => $lookup_table,
539             dest_table => $dest_table,
540             copy_indexes => $copy_indexes,
541             verbose => $verbose,
542             simulate => $simulate,
543             });
544             }
545              
546             =for internal use
547             (The Destroyer will clean up DBI objects.)
548              
549             =cut
550              
551             # use Data::Dumper;
552             sub DESTROY {
553             my $self = shift;
554             # print STDERR Data::Dumper->Dump([$self],["InDestroyer"]);
555             $self->{_dbh}->disconnect();
556             undef $self->{_dbh};
557             }
558              
559             =item do
560              
561             do();
562              
563             do() performs the normalization according to the parameters already received. Internally, it calls I<create_lookup_table()> and I<create_normalized_table()>.
564              
565             It fails (returning 0) if not enough parameters have been supplied to new().
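 
             do() returns a false value when required parameters are missing, so
             its result is worth checking:
 
             $norm->do() or die "normalization failed: missing parameters\n";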
566              
567             =cut
568              
569             sub do {
570             my $self = shift;
571             return 0 unless $self->_init();
572             $self->create_lookup_table();
573             $self->create_normalized_table();
574             return 1;
575             }
576              
577             =for internal use
578             (Checks that given keys in internal blessed hash are defined)
579              
580             =cut
581              
582             sub _init_field {
583             my $self = shift;
584             my @fields = @_;
586             foreach (@fields) {
587             if (!defined $self->{$_}) {
588             $self->_verbose("0", "missing $_\n");
589             return 0;
590             }
591             }
592             return 1;
593             }
594              
595             =for internal use
596             (_verbose() will print a message, depending on the currently set verbose level)
597              
598             =cut
599              
600             sub _verbose {
601             my $self = shift;
602             my $level = shift;
603             my $msg = shift;
604             if ($self->{verbose} >= $level) {
605             $msg =~ s/\s+/ /g;
606             print STDOUT "$msg\n";
607             }
608             }
609              
610             =for internal use
611             (_get_indexes() will find the indexes from src_table and set the internal values _primary_key_fields and _normal_fields_indexes with DML instructions to re-create the indexes within a SQL statement - It will identify multiple and unique keys)
612              
613             =cut
614              
615             sub _get_indexes {
616             my $self = shift;
617             # gets indexes description from the DB engine
618             my $DML = "SHOW INDEX FROM $self->{src_table}";
619             $self->_verbose("2","#$DML;");
620             my $sth = $self->{_dbh}->prepare($DML);
621             $sth->execute();
622             my %indexes = (); # list of indexes with associated columns
623             my @unique = (); # list of unique indexes
624             my $new_index_added =0;
625             my @lu_fields = split /,/, $self->{lookup_fields};
626             my %lu_indexes = map { $_ , 0 } @lu_fields;
627             while (my $hash_ref = $sth->fetchrow_hashref()) {
628             $self->_verbose("3",
629             "# $hash_ref->{Key_name}\t$hash_ref->{Column_name} ");
630             # check that lookup fields have an associated index
631             if (exists $lu_indexes{$hash_ref->{Column_name}}) {
632             $lu_indexes{$hash_ref->{Column_name}} = 1;
633             }
634             # first, we collect primary key columns
635             if ($hash_ref->{Key_name} eq "PRIMARY") {
636             # check if primary key column is among the lookup fields
637             # and if so, replace any lookup field reference with
638             # the new index (foreign key)
639             if (grep {$hash_ref->{Column_name} eq $_} @lu_fields) {
640             if (!$new_index_added) {
641             $self->{_primary_key_fields} .= ","
642             if $self->{_primary_key_fields} ;
643             $self->{_primary_key_fields} .=$self->{index_field};
644             $new_index_added =1;
645             }
646             }
647             else {
648             $self->{_primary_key_fields} .= ","
649             if $self->{_primary_key_fields} ;
650             $self->{_primary_key_fields} .=$hash_ref->{Column_name};
651             }
652             }
653             else {
654             # collects normal columns indexes, skipping lookup fields
655             next if (grep {$hash_ref->{Column_name} eq $_} @lu_fields);
656             $indexes{$hash_ref->{Key_name}} .= ","
657             if $indexes{$hash_ref->{Key_name}};
658             $indexes{$hash_ref->{Key_name}} .=$hash_ref->{Column_name};
659             push @unique, $hash_ref->{Key_name}
660             if $hash_ref->{Non_unique} eq "0";
661             }
662             }
663             $self->{_primary_key_fields} =
664             ", PRIMARY KEY (" . $self->{_primary_key_fields} . ")"
665             if $self->{_primary_key_fields};
666             foreach my $key (keys %indexes) {
667             # create the indexes description for SQL
668             $self->{_normal_fields_indexes} .= ", ";
669             $self->{_normal_fields_indexes} .= " UNIQUE "
670             if grep { $key eq $_ } @unique;
671             $self->{_normal_fields_indexes} .=
672             " KEY " . $key ." ($indexes{$key})";
673             }
674             # check for primary key and keys associated with lookup fields.
675             croak "missing primary key in $self->{src_table}\n"
676             . " A primary key is needed for this operation\n"
677             unless ($self->{_primary_key_fields});
678             if (grep {$_ == 0} values %lu_indexes) {
679             print STDERR "*" x 70, "\n";
680             print STDERR
681             "WARNING: the following columns, identified as lookup fields,\n"
682             . "ARE NOT INDEXED. This will have a huge impact on "
683             . "performance.\n"
684             . "Therefore it is advisable to set these indexes before "
685             . "continuing.\n";
686             print STDERR "missing indexes: ",
687             map {"<$_> "} grep {$lu_indexes{$_} == 0} keys %lu_indexes;
688             print STDERR "\n";
689             print STDERR "*" x 70, "\n";
690             if ($self->{ignore_warning}) {
691             print STDERR
692             "you chose to ignore this warning and the operation is "
693             . "carried out anyway, as you wish\n";
694             }
695             else {
696             croak
697             "missing indexes in lookup fields - operation aborted\n";
698             }
699             }
700             }
701              
702             =for internal use
703             (_get_field_descriptions() will extract the data definition from src_table and prepare appropriate statements to re-create the needed fields in dest_table and lookup_table)
704              
705             =cut
706              
707             sub _get_field_descriptions {
708             my $self = shift;
709             # gets table description from DB engine
710             my $DML = "SHOW FIELDS FROM $self->{src_table}"; # DESCRIBE $self->{src_table} would have the same effect
711             $self->_verbose("2","#$DML;");
712             my $sth = $self->{_dbh}->prepare($DML);
713             $sth->execute();
714             my @lu_fields = split /,/, $self->{lookup_fields};
715             # divide description between normal fields (which will go to the
716             # destination table) and lookup fields (for the lookup table)
717             while (my $hash_ref = $sth->fetchrow_hashref()) {
718             $self->_verbose("3", "#$hash_ref->{Field}\t$hash_ref->{Type}");
719             if (grep {$hash_ref->{Field} eq $_} @lu_fields) {
720             $self->{_lookup_fields_description} .= ","
721             if $self->{_lookup_fields_description} ;
722             $self->{_lookup_fields_description} .= $hash_ref->{Field}
723             . " " . $hash_ref->{Type};
724             $self->{_lookup_fields_description} .= " not null "
725             unless $hash_ref->{Null};
726             $self->{_lookup_fields_description} .=
727             " default " . $self->{_dbh}->quote($hash_ref->{Default})
728             if $hash_ref->{Default};
729             }
730             else {
731             $self->{_normal_fields_description} .= ","
732             if $self->{_normal_fields_description} ;
733             $self->{_normal_fields_description} .=
734             $hash_ref->{Field} . " " . $hash_ref->{Type};
735             $self->{_normal_fields_description} .= " not null "
736             unless $hash_ref->{Null};
737             $self->{_normal_fields_description} .= " default "
738             . $self->{_dbh}->quote($hash_ref->{Default})
739             if $hash_ref->{Default};
740             if (lc $hash_ref->{Extra} eq "auto_increment"
741             and $self->{copy_indexes})
742             {
743             $self->{_normal_fields_description} .=
744             " auto_increment ";
745             }
746             $self->{_non_lookup_fields} .= ","
747             if $self->{_non_lookup_fields} ;
748             $self->{_non_lookup_fields} .= $hash_ref->{Field};
749             }
750             }
751             }
752              
753             =for internal use
754             (_init() will clean the description fields and fill them with appropriate calls to _get_field_descriptions() and _get_indexes())
755             Uncommenting the lines mentioning Data::Dumper will produce useful debug information.
756              
757             =cut
758              
759             #use Data::Dumper;
760             sub _init {
761             my $self = shift;
762             return 0 unless $self->_init_field(qw(src_table lookup_table
763             dest_table index_field lookup_fields));
764             $self->{lookup_fields} =~ tr/ //d;
765             my @lookup_fields = split /,/, $self->{lookup_fields};
766             croak "invalid index field"
767             if grep {$self->{index_field} eq $_} @lookup_fields;
768             # "src" and "lkp" are the aliases for the source and lookup
769             # tables used in the final query.
770             # Therefore they can't be accepted as normal table names
771             croak "\"src\" and \"lkp\" are reserved words for this module. "
772             . "Please choose a different name\n"
773             if grep { /^(?:src|lkp)$/ } ($self->{src_table},
774             $self->{dest_table}, $self->{lookup_table});
775             #print STDERR Data::Dumper->Dump([$self], [ ref $self ]),"\n"; <>;
776             $self->{$_} ="" foreach (qw(_normal_fields_indexes
777             _lookup_fields_description _non_lookup_fields
778             _normal_fields_description _primary_key_fields));
779             $self->_get_field_descriptions();
780             $self->_get_indexes
781             if lc $self->{copy_indexes} =~ /^(?:1|2|yes|later)$/;
782             #print STDERR Data::Dumper->Dump([$self],["AfterInit"]),"\n"; <>;
783             return 1;
784             }
785              
786             =for internal use
787             (gets additional parameters into internal hash)
788              
789             =cut
790              
791             sub _get_params {
792             my $self = shift;
793             # if parameters are provided, they are merged with the internal hash
794             # and _init() is called
795             if (scalar @_) {
796             my $params = shift;
797             foreach my $key (keys %$params) {
798             croak "invalid parameter $key \n"
799             unless exists $_accepted_params{$key};
800             $self->{$key} = $params->{$key};
801             }
802             if ($self->{simulate} eq "1") {
803             $self->{verbose} = "2";
804             }
805             elsif ($self->{simulate} ne "0") {
806             croak "invalid value for <simulate>\n";
807             }
808             $self->_init();
809             }
810             }
811              
812             =item create_normalized_table()
813              
814             create_normalized_table() will create a 2nd normal form table, getting data from a source table and a lookup table.
815             Lookup fields in the source table will be replaced by the corresponding index field in the lookup table.
816              
817             If called without parameters, assumes the parameters passed to the object constructor.
818              
819             Parameters are passed as a hash reference, and are the same given to new() except I<DSN>, I<username> and I<password>. None are compulsory here. The missing ones are taken from the constructor. However, a check is done to ensure that all parameters are available from one call or the other.
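 
             For instance (table and field names are illustrative), the two-step
             form lets you override parameters at call time:
 
             $norm->create_lookup_table();
             $norm->create_normalized_table({
                 src_table     => "MP3",
                 index_field   => "album_id",
                 lookup_fields => "artist,album,genre",
                 lookup_table  => "tmp_albums",
                 dest_table    => "songs",
                 copy_indexes  => "yes",
             });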
820              
821             =cut
822              
823             sub create_normalized_table {
824             my $self = shift;
825             $self->_get_params(@_);
826             my $join_clause = "";
827             my $good_fields = ""; # fields that will be moved to the
828             # destination table
829             # ensure that the fields are called with an appropriate table alias
830             foreach (split /,/, $self->{_non_lookup_fields}) {
831             $good_fields .= ", " if $good_fields;
832             $good_fields .= "src." . $_;
833             }
834             # create the JOIN clause, using the lookup fields as foreign keys
835             foreach (split /,/, $self->{lookup_fields}) {
836             $join_clause .= " and " if $join_clause;
837             $join_clause .= "src.$_ =lkp.$_";
838             }
839             # removes any existing table with the same name as dest_table.
840             my $DML = "DROP TABLE IF EXISTS $self->{dest_table}";
841             $self->_verbose("2", "$DML;");
842             $self->{_dbh}->do($DML) unless ($self->{simulate});
843             # creates the destination table.
844             $DML =qq[CREATE TABLE $self->{dest_table}
845             ($self->{_normal_fields_description},
846             $self->{index_field} INT(11) NOT NULL];
847             if (defined $self->{copy_indexes}
848             and (lc $self->{copy_indexes} =~ /^(?:1|yes)$/))
849             {
850             $DML .= $self->{_primary_key_fields} ;
851             $DML .= $self->{_normal_fields_indexes};
852             }
853             $DML .= qq[, KEY $self->{index_field} ($self->{index_field}))];
854             if (defined $self->{copy_indexes}
855             and (lc $self->{copy_indexes} =~ /^(?:2|later)$/))
856             {
857             $DML =~ s/ auto_increment / /;
858             print "# auto_increment for $self->{dest_table} needs to "
859             . "be set manually\n";
860             }
861             $self->_verbose("2", "$DML;");
862             $self->{_dbh}->do($DML) unless ($self->{simulate});
863             # inserts values into the destination table, from the source table
864             # JOINed with the lookup table
865             $DML = qq[INSERT INTO $self->{dest_table}
866             SELECT $good_fields, $self->{index_field}
867             FROM $self->{src_table} src
868             INNER JOIN $self->{lookup_table} lkp ON ($join_clause)];
869             $self->_verbose("2", "$DML;");
870             $self->{_dbh}->do($DML) unless ($self->{simulate});
871             # if copy_indexes was "2" or "later", then an ALTER TABLE statement
872             # is issued.
873             if (defined $self->{copy_indexes}
874             and (lc $self->{copy_indexes} =~ /^(?:2|later)$/))
875             {
876             $DML = qq[ALTER TABLE $self->{dest_table} ];
877             if ($self->{_primary_key_fields}) {
878             $self->{_primary_key_fields} =~ s/^\s?,/ADD/;
879             $DML .= $self->{_primary_key_fields};
880             }
881             if ($self->{_normal_fields_indexes}) {
882             $self->{_normal_fields_indexes} =~ s/^\s?,//
883             unless $self->{_primary_key_fields};
884             $self->{_normal_fields_indexes} =~ s/,/, ADD /g;
885             $DML .= $self->{_normal_fields_indexes};
886             }
887             $self->_verbose("2", "$DML;");
888             $self->{_dbh}->do($DML) unless ($self->{simulate});
889             }
890             $self->_verbose("1", "# $self->{dest_table} created and filled");
891             }
892              
893             =item create_lookup_table
894              
895             create_lookup_table() will create a lookup table, extracting repeating fields from a 1st normal form table. A numeric primary key is created.
896              
897             When called without parameters, assumes the values passed to the object constructor.
898              
899             Parameters are passed as a hash reference, and should include the following
900              
901             src_table table where to take the values from
902             lookup_fields which fields to take
903             lookup_table table to create
904             index_field primary key (will be foreign key in src_table)
905             to be created
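 
             As a sketch (table and field names are illustrative, and the column
             types are elided), a call such as
 
             $norm->create_lookup_table({
                 src_table     => "MP3",
                 lookup_fields => "artist,album,genre",
                 lookup_table  => "tmp_albums",
                 index_field   => "album_id",
             });
 
             issues statements equivalent to:
 
             DROP TABLE IF EXISTS tmp_albums;
             CREATE TABLE tmp_albums
                 (album_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
                  artist ..., album ..., genre ...,
                  KEY artist (artist), KEY album (album), KEY genre (genre));
             INSERT INTO tmp_albums
                 SELECT DISTINCT NULL, artist, album, genre
                 FROM MP3;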
906              
907             =cut
908              
909             sub create_lookup_table {
910             my $self = shift;
911             $self->_get_params(@_);
912             my $table_keys ="";
913             foreach (split /,/, $self->{lookup_fields}) {
914             $table_keys .= ", KEY $_ ($_)";
915             }
916             # removes any existing table with the same name as Lookup_table
917             my $DML = qq[DROP TABLE IF EXISTS $self->{lookup_table}];
918             $self->_verbose("2", "$DML;");
919             $self->{_dbh}->do($DML) unless ($self->{simulate});
920             # create the new table
921             $DML = qq[CREATE TABLE $self->{lookup_table}
922             ($self->{index_field} INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
923             $self->{_lookup_fields_description} $table_keys) ];
924             $self->_verbose("2", "$DML;");
925             $self->{_dbh}->do($DML) unless ($self->{simulate});
926             # gets fields from the source table
927             $DML = qq[INSERT INTO $self->{lookup_table}
928             SELECT DISTINCT NULL, $self->{lookup_fields}
929             FROM $self->{src_table}];
930             $self->_verbose("2", "$DML;");
931             $self->{_dbh}->do($DML) unless ($self->{simulate});
932             $self->_verbose("1", "# $self->{lookup_table} created and filled");
933             }
934              
935             =back
936              
937             =head1 AUTHOR
938              
939             Giuseppe Maxia, giuseppe@gmaxia.it
940              
941             =head1 COPYRIGHT
942              
943             The DBSchema::Normalizer module is Copyright (c) 2001 Giuseppe Maxia,
944             Sardinia, Italy. All rights reserved.
945            
946             You may distribute this software under the terms of either the GNU
947             General Public License version 2 or the Artistic License, as
948             specified in the Perl README file.
949            
950             The embedded and enclosed documentation is released under
951             the GNU FDL Free Documentation License 1.1
952              
953             =cut
954              
955             1;