• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!


Comments and Questions here

Page history last edited by bob pruzek 12 years, 8 months ago

Bob (or anyone):


 data<-glm(formula = smoke ~ low + ht + bwt, family ="binomial", data = birthwt)

> histbackback(fitted(data)[birthwt$smoke==0],fitted(data)[birthwt$smoke==1])

Error in c(x, y) : 'y' is missing


Above is the code and the error I consistently get when trying to run the histbackback histgram.  This same code runs using my variables for both Cathy and Dima.  Why won't it run on my computer?  I even had Dima start a new R session on my machine and that error still came up.  HELP!!



The URL for the Dependent Sample paper is http://www.amstat.org/publications/jse/v17n1/helmreich.html . Note that you once you open R you can copy and paste the data from our appendix into R, which is easily done. If the package 'psych' has been loaded, then just type read.clipboard( ), and as long as what you copied includes the header, you should get the whole enchilada


Hi, Where do we find the 'psych' package? I am looking in the Appendix. It is probably obvious.


  Tony, Catherine,

       Please use the Comment box, always at the bottom here, for your questions instead of the Edit tab. But I'll answer as well as I can.

Tony, it appears you have solved this problem recently, so I'll leave it alone. The code appears (on the surface ok, but I

would have to see more details re: 'data' to have a good answer.

Also, I want to suggest that you NOT name OBJECTS created by functions, here glm, as 'data'. Names given to ... whatever... in

R are important mnemonic devices; please remember that.


install.packages('psych', dep=T), should get you the package; then library(psych) puts it in the R environment for use.

This can be used for any package; I thought we covered that.


Comments (Show all 114)

bob pruzek said

at 12:45 am on Mar 22, 2011

Dima, Some tests to try: 1. head(rhc) ... Do you get what you saw on the web for these data, or what was in the file using a text editor to look at it? 2. class(rhc) ... is it a dataframe? If not and 1 was ok, then do rhc = dataframe(rhc). Then names(rhc) should work (I'm a bit puzzled by your col.names=NA specification in write.table. This from documentation? 3. If you're using only a subset of variables (seems that you are) then, define, say, rhcsub = rhc[, c( pluck the variables)], like c(1,5, 7, 9). Then attach that rhcsub after ensuring that it is a dataframe itself. Then proceed, being sure to redefine the data in the glm call to rhcsub. I'm not sure about your use of the label function, but these other tests seem likely to be helpful. (And now I'm going to get some zzz's! :)) ) bob

bob pruzek said

at 12:48 am on Mar 22, 2011

PS. From Kaltenbachi's pdf you should get some sense of which variables, and how many of them, are likely to be most helpful for your runs. Don't cut back too drastically on what she identifies as useful covariates, if you can educe that info from her work. b

Dima Kassab said

at 12:21 am on Mar 23, 2011

Hi Bob,
I'm having some slow progress here! I should have changed the class type of one variable. It was "labelled" while it should be numeric!
Using head(hrc), as you suggested, I figured out that I have a problem with cat2 column. It has <NA> instead of NA. I used replace() to change the values but I got "Warning message:
In `[<-.factor`(`*tmp*`, 1, value = "NA") : invalid factor level, NAs generated"

I'm trying to figure out how to solve this!

When I run the glm using the same variables and covariates used by Kaltenbachi (except cat2 as i didn't solve the <NA> problem yet)!, I get the following error:"Error in data.frame(response, treatment, propensity) :
arguments imply differing number of rows: 5735, 634"
Depending on which variables I'm using I sometimes get an error similar to above or other times "Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") :
contrasts can be applied only to factors with 2 or more levels"

I'm confused because based on the combination of variables I'm using it might work.
For example, this will work:
bwt.lrr=glm(formula =swang1~ age+sex+edu+ income +ninsclas+factor(cat1)+ca,family="binomial",data=rhc)
loess.bwt2 = loess.psa(as.numeric(t3d30), treatment=swang1,fitted(bwt.lrr),int=7)
I'll email you the graph I got running the above!

I think the main reason for the problems I'm getting is that the dataset is not clean! What do you think?

Also, the dataset is big. It has about 5700 rows. Should i use a subset? If yes, how should I pick a subset?

I'm sorry I keep posting late at night! I'm a night person!


bob pruzek said

at 9:31 am on Mar 23, 2011

Dima, Yes, the data set is really too large for our purposes here. That's why I've asked for DETAILS when students start working with something else beyond the birthwt set.
I would not labor much longer w/ the rhc dataset simply due to the complications you mention, including NAs, which nearly always cause lots of trouble. Clean data are
essential when one is at an early stage of learning a method that is new to you. Short of sitting with you and working w/ each problem as it comes up, I don't think I can
help much here. And I don't have the time today or tomorrow to do those things. I'm sorry I did not remember to save the history when I wrapped things up last week too;
but since I did not, I cannot recover that after the fact. There's nothing bad, weak, or wrong about going back to the birthwt data and predicting smoke using a different
glm model than I used, then following that through. We shall stay with this topic through one more week to give some degree of closure on the topic. Best, bob

Dima Kassab said

at 10:15 am on Mar 23, 2011

Thanks Bob! I was going to ask you if I can do that (go back to the birthwt). It was good experience working on this dataset!

I'm not sure if you have mentioned that before, does R have a debugger?


bob pruzek said

at 10:59 am on Mar 23, 2011

Dima, Two functions you might study (and try?) are: traceback and debug Let us know if you find them useful. bob

alex said

at 2:27 am on Mar 24, 2011

Hi Bob,
I'm trying to get a box.psa() graph, but I get an error, can you please provide advice?
continuous = birthwt.psa$response
treatment = birthwt.psa$treatment
strata = birthwt.psa$ps
box.psa(continuous, treatment, strata)
Error in bxp(list(stats = c(138.280701754386, 138.280701754386, 138.280701754386, :
'at' must have same length as 'z$n', i.e. 112

bob pruzek said

at 10:58 am on Mar 24, 2011

Try setting this up w/ the three column input (see ?box.psa). If the input matrix is as specified you should have no trouble. You might also check the lengths of the input vectors initially, especially strata.
You may find that the cut2 command that I use in my post of today will do the trick. Use it as I did, and let me know how that works. More often than not people have trouble w/ this vector, and your
use of strata = birthwt.psa@ps will NOT do the trick w/out action using cut2 (most simply).

alex said

at 11:22 am on Mar 24, 2011

Thank you, Bob.
#paste here Jason's code all the way to birthwt.psa, then:
box.psa(birthwt.psa$response, birthwt.psa$treatment, cut.fit.bwt)

Output (image): http://screencast.com/t/mbkFG2yORZhw

Liz Fisk said

at 11:36 am on Mar 24, 2011

Hi all,
I had questions about identifying strata, and was mistakenly referring to the histbackback graph to help. Here's Bob's response:

There is NO information in any such histogram that will identify strata.
Remember, you could just as well have generated overlapping density plots
instead of histograms. The information of interest in such plots concerns the
region of 'mutual support'
for the groups being compared. Overlap is essential to be able to proceed w/
conventional PSA, the more the better; but ironically, LR and related methods
seek to separate groups (in some sense maximally), and if there are really
notable differences between the groups on even one of the covariates being used
in the LR model that may be sufficient to make for little overlap. We want
overlap, but the method can deny it; this means that if the covariates that
'really make the separation' are ones that seem, on substantive grounds,
essential, then such data are telling us that the data deny effective use of
PSA. In this sense the method is self-critical -- which I deem as a virtue, but
some may not (especially if you 'really want to know' the answer that PSA could
well provide).

Hope this helps clarify any questions you had as well!

Catherine Dumas said

at 1:14 pm on Mar 24, 2011

Hey guys,
Does anyone know of a good free tool that can be installed in the browser to store URLs and potentially thumbnails of pages.

alex said

at 8:19 pm on Mar 25, 2011

A friend of mine likes to use "Zotero."

Catherine Dumas said

at 9:29 pm on Mar 25, 2011

Thanks Alex.

Jason Bryer said

at 8:38 pm on Mar 25, 2011

Zotero is fantastic for collecting sources. It can catalog your PDFs too. It will also export to bi Tex which works with swerve and latex. I would also look into instapaper for more general use.

Catherine Dumas said

at 9:28 pm on Mar 25, 2011

I use Zotero for exactly that. It is great. I will check out instapaper.

alex said

at 10:57 pm on Mar 25, 2011

Jason, Delicious bookmarks are very useful. Do you use delicious?

bob pruzek said

at 12:20 pm on Mar 25, 2011

Try Evernote ...should be easy to find. b

Catherine Dumas said

at 9:31 pm on Mar 25, 2011

Thanks Bob, I found it & I will try it...:)

alex said

at 4:55 pm on Mar 25, 2011

Bob or anyone,
We are treating PSA as a means for data analysis in this course, i.e., we find data that fits into a PSA analysis we can run. Is PSA predominantly a method for data analysis? Or, and is this for Bob's next course: you plan a research study knowing before hand your will use PSA? Does this question make sense?
Will Dr. Stuart's paper help answer this?

bob pruzek said

at 8:21 pm on Mar 25, 2011

Alex, Yes, PSA is a method of data analysis, better: a class of methods for data analysis. You put your finger on one of the topics I will discuss briefly next week, namely
the difference between retrospective approaches to PSA and prospective approaches. Archival data, data collected (usually by others than the analyst) in the past, are more
often than not the starting points for data analysis. The prospective approach, which entails looking forward with a plan for data collection, is preferred for the major reason
that it helps ensure that covariates that are deemed most needed to account for selection bias are named (sometimes this involves constructing tests or instruments) and
measured for all individuals who will then select their 'treatment' -- according to their behavior or actions. More soon. bob

alex said

at 10:51 pm on Mar 25, 2011

Thank you, Bob. Earlier in the semester you told us that when individuals sometimes write books, they do so concurrently with a research endeavor. I am sure people do this with R as well, i.e., conduct a research endeavorer and along the way produce an R package. For me, this is a fun and different way to look at R packages. For example, the survey package (and possibly Jason's new PSA package?) can be used to step through a method from a starting point to and an ending (or moving on) point.

bob pruzek said

at 5:38 pm on Mar 26, 2011

Alex, Am not sure there is a question here, but I can say that nearly any approach to writing, books and packages and more, has been tried before, and if the ideas are sound and useful (and the work looks good in relation to would-be competition) good things can happen. b

Catherine Dumas said

at 1:28 pm on Mar 28, 2011

Hey guys,
How do we define pt.1 as binary? And is it the same as the ptl variable in the birthwt data?

Catherine Dumas said

at 2:10 pm on Mar 28, 2011

Catherine Dumas said

at 2:11 pm on Mar 28, 2011

Hi Bob,
I am trying to run this code & I am getting these error messages. Do you know what I am doing wrong?

Error in Matchby(Y = bwt/(28.35 * 16), Tr = smoke, X = bwt.sub5[, c(1, :
(list) object cannot be coerced to type 'double'
In addition: Warning messages:
1: In Ops.factor(left, right) : / not meaningful for factors
2: In Ops.factor(left, right) : / not meaningful for factors
3: In Ops.factor(left, right) : / not meaningful for factors
4: In Ops.factor(left, right) : / not meaningful for factors
5: In Ops.factor(left, right) : / not meaningful for factors
> str(bwt.sub5)
'data.frame': 189 obs. of 5 variables:
$ age : int 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
$ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
$ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
$ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
> bwt.match5=Matchby(Y=bwt/(28.35*16),Tr=smoke,X=bwt.sub5[,c(1,2)],by=list(race))
Error in Matchby(Y = bwt/(28.35 * 16), Tr = smoke, X = bwt.sub5[, c(1, :
(list) object cannot be coerced to type 'double'
In addition: Warning messages:
1: In Ops.factor(left, right) : / not meaningful for factors
2: In Ops.factor(left, right) : / not meaningful for factors
3: In Ops.factor(left, right) : / not meaningful for factors
4: In Ops.factor(left, right) : / not meaningful for factors
5: In Ops.factor(left, right) : / not meaningful for factors
> bwt.match5=Matchby(Y=bwt/(28.35*16),Tr=smoke,X=bwt.sub5[,c(1,2)],by=list(race))


Catherine Dumas said

at 2:32 pm on Mar 28, 2011

I am ok now. I started a new session.

bob pruzek said

at 3:30 pm on Mar 28, 2011

Catherine, I presume you also have pt.1 working. I defined that in my handout (or in class). You could also use the factor( ) function for this: pt.1 = factor(ptl > 0) ... which will give a 'character' variable (TRUE, FALSE); and were to need it, could do as.numeric(pt.1) to make that numeric. b

alex said

at 1:48 am on Mar 30, 2011

Bob, Are there any plans to post the video recorded last week?

alex said

at 12:59 am on Mar 31, 2011

I am going to spend the next few hours doing a pass through the Sarkar lattice piece. I am a night owl.
I am sharing my document with you through Google Docs: http://goo.gl/XUXEG
I made the document accessible to anyone with the above link and editable.
There won't be much real time-ness to this since most of you will be sleeping,
but what I hope to test is sharing a work in progress with the rest of the Wiki-users.

Young Do said

at 1:28 am on Mar 31, 2011

I'll check it out. Thanks, Alex.

alex said

at 1:33 pm on Mar 31, 2011

Yeah, Sarkar is great. I'll have to check out some of his other work.

Young Do said

at 5:44 am on Mar 31, 2011


I am working on the last portion of the assignment, in assessing for covariate balance. When I try to run the following code:

cv.bal.bwt6=cv.bal.psa(cv.trans.bwt$covariates.transf,smoke,fitted(bwt.lrr), strata=cut.fit.bwt)

I get this error message:

Error in if (xlim[1] > 0) xlim[1] <- 0 :
missing value where TRUE/FALSE needed

Help. Anyone else get this??


Young Do said

at 5:49 am on Mar 31, 2011

Here is the full code that I used:


ui <- factor(ui)
ht <- factor(ht)
low <- factor(low)
race <- factor(race, labels = c("white", "black", "other"))
bwt1 <-data.frame(ht, age, race, smoke, ui)

bwt.lrr=glm(formula=smoke ~ ht + age + race + ui, family="binomial", data=bwt1)






granova.ds((cbind(bwt[bwt.match6$index.treated],bwt[bwt.match6$index.control])/(28.35*16))[,2:1],xlab="Birthwts in lbs. for babies born to non-smokers",ylab='Birthwts in lbs for babies born to smokers')

loess.bwt2=loess.psa(birthwt$bwt/(28.35*16), factor(birthwt$smoke),fitted(bwt.lrr),int=10)




cv.bal.bwt6=cv.bal.psa(cv.trans.bwt$covariates.transf,smoke,fitted(bwt.lrr), strata=cut.fit.bwt)

bob pruzek said

at 10:46 am on Mar 31, 2011

Young, We'll have to address this after 3 pm today. Could you come early to class, say before 4? Thanks bob

alex said

at 2:59 pm on Mar 31, 2011

Dima,Catherine, and Liz.
Thank you for being part of NTIR2011. Due to inclement weather, the conference has been postponed until April 15, 2011.

Dima Kassab said

at 7:47 pm on Apr 6, 2011

Hi Bob,
On Sarkar paper on Lattice package, Q-Q plot is discussed. What are Q-Q plots exactly used for? My understanding is in the example Sarkar used, the plot is telling us gcsescore is a good predictor for score. Is that right, or does the plot is giving only descriptive information related to distribution and skewness?
If my statement about gcsescore is true, don't we need other evidence such as confidence interval?


Tony Leonardi said

at 8:17 pm on Apr 6, 2011

Bob (or others in the class):
In the Scatter Plots and Extensions section of the Lattice paper, an asymptotic scatter plot is converted using logs and presented as giving us more information than the original plot. I know this is done a lot in all fields, and I have even done things like this in my "previous life." I am, however, uncomfortable using these transformations as I half consider this to be "lying" about the data. It is difficult, sometimes, to correctly interpret the graphs after the conversions because now you're interpreting a log function, so I was wondering about the use of these transformations on a regular basis. How is one to be sure the interpretation is correct after such transformations are done?
Thanks for your input!

Catherine Dumas said

at 9:42 pm on Apr 6, 2011

Hi Bob,
Our group is having a discussion about observational studies and covariates. Can race, nationality, & ESL be used as covariates in an educational study? We know that race is used, but what about nationality? Does nationality give us any additional information that race doesn't give us ie. cultural differences?

bob pruzek said

at 10:17 pm on Apr 6, 2011

Dima, Tony, Catherine,
Re: q-q plots, these are used chiefly to compare shapes of distributions, but also 'levels' and 'spreads'. AFAIK, they have only incidental value in predictive work... as, for example, when predictions of Y from one or more X's are limited because shapes of X and Y distributions are quite different. You should always think of two kinds of purposes of any analysis: descriptive and inferential. q-q plots pertain mostly to description; CIs concern different issues (almost entirely). More in class tomorrow, as needed.
Tony, the term 'asymptotic scatter plot' has no meaning for me, and a search did bring it up in the Sarkar08LatticeLab pdf. As for use of log transformations -- say in predictive work -- think about this q: why should we believe that any given X and Y will have metrics that make them, w/out change, predictable, as in Y ~ X? Ask instead, what (non-linear) transformations might improve the degree of linearity when predicting, say Y from f(X), i.e. f(X) represents possibly a log of X, the square root, or the reciprocal of X? (We know gallons per mile is (far) better to use than miles per gallon in linear prediction (as an outcome) when predicting from features of a car, such as weight, no. of cylinders, displacement, etc. -- these being cases where the outcome Y is non-linearly transformed, not the X's.) And Tukey said, "It is generally better to unbend data that are bent, and then use linear methods, than it is to create a new non-linear method for each new X, Y system."
Catherine, These are largely empirical questions. If nationality adds value to race when predicting ??, or ESL adds value, that is usually sufficient to argue that such variables are worthy of inclusion in prediction frameworks. To say more would require details about application/situations.
Hope this helps, bob

bob pruzek said

at 10:18 pm on Apr 6, 2011

My phrase, line 3 above, "and a search did bring it up in the Sarkar08LatticeLab pdf" should have said "and a search did NOT bring it up in the Sarkar08LatticeLab pdf". b

You don't have permission to comment on this page.