![]() |
|
|
Goto the Tip of the Month Archive Other interesting pages ...
LinkedIn Profile |
SAS Tip
of the Month I received an email from Rebecca in Boston asking how some values were being overwritten when two datasets were being joined. A few minutes of discussion and the problem was solved -- a common variable was in both datasets but not in the BY statement. Lets look at an example: data _sales (label='Sales Hours');
label client='Client'
value='Invoiced Amount'
number='Number of Hours';
infile cards;
input client $ value number;
cards;
Huey 1200 120
Dewey 1500 110
Louie 800 5
;
run;
data _budgeted (label='Budgeted Hours');
label client='Client'
number='Budgeted Hours'
manager='Manager';
infile cards;
input client $ number manager $;
cards;
Huey 120 Mickey
Dewey 80 Donald
Louie 50 Goofy
;
run;
Here we have two datasets which we are going to merge by client -- it is easy to see in this example that there are two common variables, CLIENT and NUMBER, the former which we are using to merge the data by. Now lets look at what happens after we merge the two datasets: proc sort data=_sales;
by client;
proc sort data=_budgeted;
by client;
data _merged;
merge _sales _budgeted;
by client;
proc print data=_merged;
title1 "Merged Sales and Budgeted Numbers";
run;
*=== Output ===;
Merged Sales and Budgeted Numbers
Invoiced Number
Obs Client Amount of Hours Manager
1 Dewey 1500 80 Donald
2 Huey 1200 120 Mickey
3 Louie 800 50 Goofy
*=== LOG ===;
106 data _merged;
107 merge _sales _budgeted;
108 by client;
NOTE: There were 3 observations read from the data set
WORK._SALES.
NOTE: There were 3 observations read from the data set
WORK._BUDGETED.
NOTE: The data set WORK._MERGED has 3 observations and 3
variables.
109 proc print data=_merged;
110 title1 "Merged Sales and Budgeted Numbers";
111 run;
NOTE: There were 3 observations read from the data set
WORK._MERGED.
112 proc print data=_merged label;
113 title1 "Merged Sales and Budgeted Numbers";
114 run;
NOTE: There were 3 observations read from the data set
WORK._MERGED.
Looking at the LOG and the OUTPUT everything looks reasonable, particularly if you are unfamiliar with what the expected output is. But look at the Number of Hours which the data, according to the label, comes from the _SALES dataset -- these are not the values from _SALES, but the values from _BUDGETED! No one would detect the issue and the report could go out to the public. Preventing the issue looks easy, just make sure each dataset has unique variable names except those that are being used on the MERGE statement. But it is not so easy if you are merging two datasets where one or both have several hundred variables. So what can we do to detect this issue in our programs? SAS/WPS does provide an option called MSGLEVEL that controls the level of detail in messages that are written to the SAS LOG. This has two values: N - prints notes, warnings, and error messages (default)
I - prints additional notes that refer to index usage, merge processing,
sort utilities, and those using option N
Lets use the MSGLEVEL option now with a value of I and see what comes up in the SAS LOG: *=== Output ===;
Merged Sales and Budgeted Numbers
Invoiced Number
Obs Client Amount of Hours Manager
1 Dewey 1500 80 Donald
2 Huey 1200 120 Mickey
3 Louie 800 50 Goofy
*=== LOG ===;
115 options msglevel=I;
116 proc sort data=_sales;
117 by client;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds
118 proc sort data=_budgeted;
119 by client;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
120 data _merged;
121 merge _sales _budgeted;
122 by client;
INFO: The variable number on data set WORK._SALES will be
overwritten by data set WORK._BUDGETED.
NOTE: There were 3 observations read from the data set
WORK._SALES.
NOTE: There were 3 observations read from the data set
WORK._BUDGETED.
NOTE: The data set WORK._MERGED has 3 observations and 3
variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
123 proc print data=_merged label;
124 title1 "Merged Sales and Budgeted Numbers";
125 run;
NOTE: There were 3 observations read from the data set
WORK._MERGED.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
The output is the same, but an INFO message now appears in the SAS LOG with the following text: INFO: The variable number on data set WORK._SALES will be
overwritten by data set WORK._BUDGETED.
Please note that the message may not be exactly the same depending on whether what version of SAS or WPS you are using, but the meaning is the same. This is our only indication in the SAS LOG, without knowing the data or what to expect, that numbers are being overwritten from _BUDGETED to _SALES when the merge occurs -- this is not an ERROR or WARNING message in the SAS LOG. With this information we can amend our programming accordingly so the report is correct either by dropping the NUMBER variable in _SALES or _BUDGETED, or renaming one of the two NUMBER variables and adding an additional column: Merged Sales and Budgeted Numbers
(with NUMBER dropped from _BUDGETED dataset)
Invoiced Number
Obs Client Amount of Hours Manager
1 Dewey 1500 110 Donald
2 Huey 1200 120 Mickey
3 Louie 800 5 Goofy
Merged Sales and Budgeted Numbers
(with NUMBER in _BUDGETED renamed to another variable name during MERGE)
Invoiced Number Budgeted
Obs Client Amount of Hours Hours Manager
1 Dewey 1500 110 80 Donald
2 Huey 1200 120 120 Mickey
3 Louie 800 5 50 Goofy
When you are writing your program, it is useful to use the MSGLEVEL option with the value of I, and always do little checks of the output at each step to make sure that what is being output is what you are expecting! Have a safe than happy holiday season. See you all next year. |
| ________________________________ Updated December 8, 2011 |
|