Comments and Questions here


Bob (or anyone):

 

 data<-glm(formula = smoke ~ low + ht + bwt, family ="binomial", data = birthwt)

> histbackback(fitted(data)[birthwt$smoke==0],fitted(data)[birthwt$smoke==1])

Error in c(x, y) : 'y' is missing

 

Above is the code and the error I consistently get when trying to run the histbackback histogram. This same code runs using my variables for both Cathy and Dima. Why won't it run on my computer? I even had Dima start a new R session on my machine and that error still came up. HELP!!

 

 

The URL for the Dependent Sample paper is http://www.amstat.org/publications/jse/v17n1/helmreich.html . Note that once you open R you can copy and paste the data from our appendix into R, which is easily done. If the package 'psych' has been loaded, then just type read.clipboard( ), and as long as what you copied includes the header, you should get the whole enchilada.
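A minimal sketch of that workflow (assuming the appendix table, header row included, has just been copied to the clipboard):

install.packages("psych")     # only needed once
library(psych)
dat <- read.clipboard()       # reads the copied block; a header row is expected by default
head(dat); str(dat)           # quick check that the whole table came across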

 

Hi, Where do we find the 'psych' package? I am looking in the Appendix. It is probably obvious.

Catherine

  Tony, Catherine,

       Please use the Comment box, always at the bottom here, for your questions instead of the Edit tab. But I'll answer as well as I can.

Tony, it appears you have solved this problem recently, so I'll leave it alone. The code appears OK on the surface, but I would have to see more details re: 'data' to give a good answer.

Also, I want to suggest that you NOT name OBJECTS created by functions, here glm, as 'data'. Names given to ... whatever... in R are important mnemonic devices; please remember that.

   Catherine,

install.packages('psych', dep=T) should get you the package; then library(psych) puts it in the R environment for use.

This can be used for any package; I thought we covered that.

    bob

Comments (114)

Liz Fisk said

at 11:36 pm on Jan 26, 2011

Hi Catherine,
If you copy/paste the syntax on the front page to install packages, it should be included in that batch of zipped folders. Once these are installed you should be able to load it by using Load package under the Packages menu. However, I still wasn't able to copy/paste, as it seems like the html made the formatting funky. I copied the data into a txt file, and then used the read.table() function.
-Liz

bob pruzek said

at 12:03 am on Jan 30, 2011

Post your comments on THIS page re: syllabus ideas. thanks bob

Liz Fisk said

at 3:14 pm on Jan 30, 2011

Hi Bob,
I would be interested in learning about the graphics packages that would be useful for quasi-experimental studies, as this tends to be the design methodology most frequently used by the school districts with whom I work. This could include more information about ranking and use of propensity scores, and conversion of discrete, non-normal data for use with the granova package. Additionally, a few of us in class took the Structural Equation Modeling course last semester, so if there are graphics options for the R SEM package it would be helpful for us to explore those as well.
Outside of learning more about which packages to use under different research conditions, I would like to focus on the interpretation aspect of the graphics. Specifically, I might need guidance on formatting conventions, using the graphics dynamically, and using them in conjunction with (numeric) statistical output to assist us in understanding data and in being able to provide audiences with a more thorough interpretation of results.
Thanks,
Liz

Fei Chen said

at 4:32 pm on Jan 30, 2011

Hi Professor Pruzek,

My syllabus ideas are mostly the same as those of Liz. In addition, I would also like to know which functions can be used for bootstrapping, and which graphics options present results well with big samples. Thank you!

Best,
Fei

bob pruzek said

at 1:27 am on Jan 31, 2011

Before the Syllabus is published, you may want to go to the wiki, epym08learnr.pbworks.com. There you will find reference to OpenMx, a relatively new site aimed at helping do structural equation modeling (with Michael Neale's excellent program Mx). This is especially for those who took the SEM course last semester. It is under active development, so rechecking from time to time may repay your efforts. All the other items that Liz mentions will, in due course, be covered -- as will items from a major reading list I'll be providing. bp

Catherine Dumas said

at 4:38 pm on Feb 2, 2011

Hi Bob,
Where would you like us to upload our assignment #1.
Catherine

bob pruzek said

at 10:10 pm on Feb 2, 2011

Please upload your doc or pdf files on the Upload Files page; I'll put them all in a folder, say Feb3.hmwk
b

Wan-Hsiang Hsu said

at 4:36 pm on Feb 3, 2011

Hi Bob,
I don't know why I can't click "Upload files".

Wan-Hsiang

Catherine Dumas said

at 6:46 pm on Feb 9, 2011

Hello,
I tried to use the library(RODBC) to import my data from an Excel file on my Mac. I spent a while on it and finally just put the data in manually. Did anyone else have problems with mac & importing data from Excel?
Catherine

Jason Bryer said

at 7:58 pm on Feb 9, 2011

Catherine,

Look at the read.xls function in the gdata package. I tend to use this for reading in Excel files. In order for RODBC to work you will need to install an ODBC driver for Excel. I've not read Excel files this way, but I have some code that uses RODBC with Microsoft Access databases if anyone is interested.

-Jason
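A minimal sketch of Jason's suggestion (the file name here is hypothetical; gdata's read.xls needs Perl installed on the system):

library(gdata)                            # install.packages("gdata") if needed
dat <- read.xls("mydata.xls", sheet = 1)  # read the first worksheet into a data frame
head(dat)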

Jason Bryer said

at 9:02 am on Feb 10, 2011

I should have also mentioned that those on Windows should also install Rtools. It is available here: http://www.murdoch-sutherland.com/Rtools/

Catherine Dumas said

at 2:59 pm on Feb 10, 2011

Thanks, Jason.

BTW, I just uploaded a better version of my Assignment #2. (Got rid of all the annoying whitespace, sorry about that.)

Catherine

alex said

at 6:07 pm on Feb 11, 2011

Wanted to share news with my class regarding a conference put together by myself and fellow first-year PhD students in the informatics department. Also, I want to give you a personal invitation to present a poster.

Registration is now open for the 6th New Trends in Informatics Research conference. NTIR 2011 is scheduled for Friday, April 1, 2011, from 9am to 3pm at the Assembly Hall and Fireplace Lounge on the second floor of the Campus Center. For more information, please visit http://www.albany.edu/cci/ntir/ntir2011/index.html

It is a conference where technology is a big focus, but posters on R, elemental graphics, etc. would be fine; for example, Liz could do a poster on her presentation from this past week.

We are very open with the kind of research we accept. Let me know if interested.

Jason, I'd like you to school CCI on R. Please present a poster. :)

Dima Kassab said

at 10:39 pm on Feb 12, 2011

Hi Bob,
Could you please explain what this statement means "a t-test can be either directional or non-directional, whereas the analysis of variance is (like chi-square) intrinsically non-directional?" It is from Lowry's chapter 13 used in your one way ANOVA notes.

Thanks,
Dima

bob pruzek said

at 11:12 pm on Feb 12, 2011

Dima, I went to Chapter 7 of the same Lowry book, http://faculty.vassar.edu/lowry/ch7pt1.html , and not surprisingly, found a fulsome discussion of the issues.
Read that, then recall that when I spoke (Thurs) of the F-statistic as the ratio of two variances: since no variance can be negative, F is always non-negative.
The upshot is that F operates in a different metric than t; indeed if there are only two groups, then the square of t equals F when testing the null hypothesis (two mu's equal).
After you've read all this, write back to say whether this does the trick for you. Bob
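A small worked check of the t-squared = F point, with simulated two-group data (names and numbers here are invented purely for illustration):

set.seed(1)
g <- factor(rep(c("A", "B"), each = 30))
y <- rnorm(60, mean = ifelse(g == "A", 10, 11), sd = 2)
tt <- t.test(y ~ g, var.equal = TRUE)          # pooled-variance two-sample t
fv <- anova(lm(y ~ g))$"F value"[1]            # one-way ANOVA F for the same data
c(t.squared = unname(tt$statistic)^2, F = fv)  # the two values agree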

Dima Kassab said

at 12:53 am on Feb 13, 2011

Bob, Thanks for the reference and your comments! that makes it clear to me. I have another question though. We can use F-test to examine the means of various groups to see if there are at least two means that are significantly different. So let's say we are testing the effects of three types of a drug, we can tell if at least two out of these three types have significantly different effects. In practice what does that mean if it's not telling us which one has the greatest effect? How can we tell which two pairs are significantly different?

Thanks,
Dima

bob pruzek said

at 3:10 am on Feb 13, 2011

Dima, This topic is huge, but Wikipedia has an authoritative summary: http://en.wikipedia.org/wiki/Multiple_comparisons Let me know what questions
you have after reading this. b

alex said

at 5:15 am on Feb 13, 2011

I just approved your poster, Catherine.

Catherine Dumas said

at 5:33 pm on Feb 13, 2011

Thank you Alex.

Catherine Dumas said

at 5:37 pm on Feb 13, 2011

Hi Bob,
In class on Thursday, we talked about my not being able to produce the granova.2w graphs on my mac. You suggested that I find a download online. What am I looking for again?
C

alex said

at 5:51 pm on Feb 13, 2011

I okayed a person with 3 posters. If you want to submit a second one on this course or another topic, that's fine.
At these conferences one gets a lot of good feedback on research.

bob pruzek said

at 9:45 pm on Feb 13, 2011

Catherine, Much depends on your operating system. What is that for you and your mac? If you are wholly up to date you may (?) not need to do much, but it is hard for me to answer w/out some details. In general, it has been true (am not sure if it still is... Jason??) that to use X11 as support for rgl, which is needed for the .2w function, you would have to load R code in the version for your OS. Call me Monday if you cannot get this working, or write to Jason, OR call tech support at the Apple Store (Crossgates). Sorry I cannot be definitive. Let me know what happens. bob

Jason Bryer said

at 10:12 pm on Feb 13, 2011

Catherine,

Try installing the three tools listed on this page: http://cran.r-project.org/bin/macosx/tools/ Let me know if that helps. If not, then I would try installing Xcode (can download from developer.apple.com).

-Jason

Catherine Dumas said

at 2:44 pm on Feb 14, 2011

Hi Bob & Jason,
I downloaded the tools from Jason's link & my graphs work!!!!! Thank you so much.
Catherine

alex said

at 3:42 am on Feb 15, 2011

data sets for Feb17Homework, Bootstrap-methodsChapt18HesterbergPBS18:
http://bcs.whfreeman.com/pbs/pages/bcs-main.asp?v=&s=18000&n=00020&i=18020.01&o=

alex said

at 5:33 am on Feb 15, 2011

Anyone else having trouble running bootstrap()?

library(psych)
y = read.clipboard() #copy of rows 1 to 1665 (column A) of Excel file ca18_001 (first file in Hesterberg datasets)
means4 = function(x, tr1=.1, tr2=.2)
{xm1 <- mean(x)
xmt.1 <- mean(x, trim =tr1) # 10% trimmed mean
xmt.2 <- mean(x,trim=tr2) # 20% trimmed mean
xm.5 <- median(x) # 50% trimmed mean = median
xms <- c(xm1, xmt.1, xmt.2, xm.5)
xms }
mns4.y <- bootstrap(y,nboots=1000,means4)

alex said

at 5:41 am on Feb 15, 2011

Error in n * nboot : non-numeric argument to binary operator

bob pruzek said

at 1:16 am on Feb 16, 2011

Alex,
When I run xt3b = bootstrap(xt3,1000,means4), all works as it should. xt3 is just a vector of scores that I got as a rounded version of xt3 = rt(100,3)*2+20.
Let me know if this helps or not. b

alex said

at 12:23 am on Feb 17, 2011

Bob,
Thanks for replying at 1am...
I want to spice my assignment up for this week, but I couldn't get my own imported data to work.
Alex

bob pruzek said

at 2:19 am on Feb 17, 2011

Alex, Send me the following about your data: the first few rows, w/ head( ), also tail( ), also dim( ), also summary( ), also class( ), also describe( ) [Hmisc lib.], also a datadensity( ) plot, same lib.
I strongly suspect there are anomalies in the data (so easy to have), and that you are unable to find them by 'standard means'. Can you come to class a bit early? b

alex said

at 1:23 pm on Feb 17, 2011

Yes, will be there early. 3:30. Doing the recommendations now.

alex said

at 1:49 pm on Feb 17, 2011

#####reply to Bob starts after the big ---- line
library(psych); library(bootstrap); library(Hmisc)
y = read.clipboard() #[data: http://www.albany.edu/~at937848/pubfiles/bootstrap.txt This is the ILEC data from the very first example in the bootstrapping book.]
means4 = function(x, tr1=.1, tr2=.2)
{xm1 <- mean(x)
xmt.1 <- mean(x, trim =tr1) #[10% trimmed mean]
xmt.2 <- mean(x, trim=tr2) #[20% trimmed mean]
xm.5 <- median(x) #[50% trimmed mean = median]
xms <- c(xm1, xmt.1, xmt.2, xm.5)
xms }
mns4.y <- bootstrap(y,nboots=1000,means4) #Error in n * nboot : non-numeric argument to binary operator

is.vector(y) #FALSE. Making it a vector doesn't work either.
hist(y) #for fun
boxplot(y) #for fun
mean(y);sd(y);quantile(y)
--------------------------------------------------------------------------------------
head(y);dim(y);summary(y);class(y);describe(y);datadensity(y)
#see screen shot for results: http://screencast.com/t/DPO1c6VFA (click full size)
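For what it's worth, two things in the call above look suspicious; a hedged guess at a fix (object names taken from this thread): the bootstrap() argument is spelled nboot, not nboots, so in the original call means4 ends up matched to nboot positionally, which would produce exactly the "n * nboot : non-numeric argument" error; and read.clipboard() returns a data frame, while bootstrap() wants a numeric vector.

library(bootstrap)
yv <- as.numeric(y[[1]])                                # pull the single column out as a numeric vector
mns4.y <- bootstrap(yv, nboot = 1000, theta = means4)   # note: nboot, not nboots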

Young Do said

at 10:52 pm on Feb 22, 2011

I was wondering, what are some examples of when one might use a one-way ANOVA for two samples (F-test) instead of a t-test? For example, the same dataset was used for granova.ds and granova.1w and the resulting statistics were different, as was the graphing output. When would we use one test or the other for the same dataset?

bob pruzek said

at 11:56 pm on Feb 22, 2011

Young,
The pdfs (two of them) on comparing two groups are basically concerned w/ comparing independent groups. The main categories are description and inference. Think of two groups of singers (singer data, lattice package), with respect to heights. One-way ANOVA could be used to compare the groups, particularly to do so graphically. The test statistic, either t or F (which is t-squared here), is of NO interest whatsoever, and should not only be ignored, but the reader of a report regarding such an application should be reminded that the method is intended ONLY for description of group differences with respect to the outcome, i.e., response. Inferential uses of one-way ANOVA are most relevant when the data are from a true experiment, where groups had initially been defined using randomization. The t and F methods for comparing two groups inferentially are EXACTLY the same, but for the sign of t being of interest, and F being a unidimensional statistic. Graphics can be used for either, but the granova.1w graphic would not have been 'invented' had there only ever been a two-independent-group t-test; so when comparing two groups inferentially AND descriptively, I find the latter graphic to add special value. Hope this helps. bob

alex said

at 2:11 am on Feb 23, 2011

Hi Bob,
We're trying to replicate some things in BootstrppngIntro11.pdf. Where did you get my.summary?

Our code:

library(bootstrap)

means4 <- function(x,tr1=.1,tr2=.2)
{xm1 <- mean(x)
xmt.1 <- mean(x, trim =tr1) # 10% trimmed mean
xmt.2 <- mean(x,trim=tr2) # 20% trimmed mean
xm.5 <- median(x) # 50% trimmed mean = median
xms <- c(xm1, xmt.1,xmt.2, xm.5)
xms }

xt3 = rt(100,3)*2+20
xt3b = bootstrap(xt3,1000,means4)

dist0=xt3
dist1=t(xt3b$thet)[,1]
dist2=t(xt3b$thet)[,2]
dist3=t(xt3b$thet)[,3]
dist4=t(xt3b$thet)[,4]

means = cbind(mean(dist0),mean(dist1),mean(dist2),mean(dist3),mean(dist4))
s.d.s = cbind(sd(dist0),sd(dist1),sd(dist2),sd(dist3),sd(dist4))
skewns =
table = rbind(means,s.d.s)

cbind(my.summary(xt3),my.summary(xt3b.500))

#Super basic function:
my.summary <- function(x)
{
mean(x)
}

bob pruzek said

at 3:00 pm on Feb 23, 2011

Here it is: bob
my.summary <- function(xxx,dig=2)
{#routine provides means/s.d.s/ skewness's and kurtosis for each col of x
xxx <- as.matrix(xxx)
xm <- apply(xxx, 2, mean)
s.d <- sqrt(apply(xxx, 2, var))
xs <- scale(xxx)
sk <- apply(xs^3, 2, mean)
kr <- apply(xs^4, 2, mean) - 3
rg <- apply(xxx, 2, range)
sumry <- round(rbind(xm, s.d, sk, kr, rg), 3)
dimnames(sumry)[1] <- list(c("means", "s.d.s", "skewns", "krtsis", "low", "high"))
sumry <- round(sumry, dig)
sumry }
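For context, a hedged usage sketch of my.summary with the objects from Alex's earlier code (xt3 and xt3b as defined there):

my.summary(xt3)                   # six-row summary of the original scores
my.summary(t(xt3b$thetastar))     # one column per bootstrapped statistic (four here)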

Catherine Dumas said

at 6:24 pm on Feb 25, 2011

Bob: Are these statements correct?
Confidence Interval
If I am using a 95% confidence interval, is it true that I can be 95% certain that the mean lies within that interval?
If I took multiple samples of my population, 95 out of 100 of the CI’s calculated would contain the mean.
Dependent Sample vs. Independent Sample
One of the assumptions, when you run any (most?) parametric statistic(s), is that the samples are independent.
Dependent samples use different types of statistics. This is based on the probability: if the samples are dependent, probabilities are different than if the samples are independent, so the statistical tests we use have to be different.

We’ve been throwing these ideas around during our meetings and thought we’d better check to see if our thinking was correct. We thank you for your feedback.
Tony, Dima and Catherine

bob pruzek said

at 1:40 pm on Feb 26, 2011

T, D and C,
I'm putting up another paper coauthored by Geoff Cumming, this one from 2006, that will expand on what I say here about confidence intervals. But I should add that I see bootstrapping as
having more promise than most authors, including Cumming, seem to realize, so I want to devote a portion of next week's class to CIs as well as contrasts, with illustrations. I shall be posting more soon,
but in the meantime, some direct answers to your questions follow: 1. The proper language is to say that one can be 95% CONFIDENT that one's POPULATION mean (or mean difference, etc.) is
included within the limits of a properly computed 95% interval. Not 95% 'certain.'
2. The statement here is not terrible, but proper language would say that 95% of CIs, over multiple samples, should contain the POPULATION MEAN (or other parameter under estimation).
3. The key is not that SAMPLES be independent, but that observations or scores within and between samples be independent (not predictable, by any method, from others in the sampling frame).
Both parametric and non-parametric methods routinely make this same assumption. (Parametric methods generally make stronger assumptions about shapes, and spreads, of parent populations,
as compared w/ their non-parametric counterparts.)
4. Dependent samples DO NOT use different types of statistics. (Where did this come from?) All statistical methods are based on some kinds of probability considerations, or assumptions.
Probabilities do not enter into dependent sample methodology any differently than for independent sample methodology. Think about the fact that a t-statistic or CI for dependent sample
inference is based MERELY on the mean and s.d. of the sample difference score distribution.
More soon. Bob
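A tiny illustration of point 4, with simulated pre/post scores (all names and numbers invented): the dependent-sample t is just a one-sample t on the difference scores.

set.seed(2)
pre  <- rnorm(25, mean = 50, sd = 10)
post <- pre + rnorm(25, mean = 3, sd = 5)   # correlated 'after' scores
t.test(post, pre, paired = TRUE)            # paired t-test
t.test(post - pre)                          # same t, df, p, and CI: only the differences matter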

Dima Kassab said

at 4:51 pm on Mar 9, 2011

Hey Bob,
Could you please tell us what nint is here: "histogram(xx[,j], nint = 10)" [intervals]?

Also, concerning question 3 part c (Now try using two predictors w/ the loess function): we did loessPrediction<-loess(cereal9$calories~cereal9$carbo*cereal9$fat, cereal9, span = 1). Is this how we should do it?
We're still working on the 3d... we'll let you know if we need help.

Thanks,
D,C and T

bob pruzek said

at 8:13 pm on Mar 9, 2011

D, C and T, think... nint = number of intervals... that's it. (I thought the help file made it clear.)
Use plus (+) between the two predictors instead of the product in your loess call. Keep me informed on progress... same for ALL of you! bob

Dima Kassab said

at 12:03 am on Mar 10, 2011

Hi Bob,
We were using the "hist" function and this is why nint was working properly! All set now. Thanks for your help.
"scatter3d" seems to be working fine.
We have a question concerning "Plot the loess curve on top of the point plot and compare this non parametric result with the linear version above." We did:
To plot the linear version:
plot(cereal9$fat,cereal9$calories)
fit<-lm(cereal9$calories~cereal9$fat)
abline(fit)
To plot the loess curve:
loessObject<-loess(cereal9$calories~cereal9$fat,cereal9, span =0.65)
loessPrediction<-predict(loessObject,cereal9$fat)
scatter.smooth(cereal9$fat,loessPrediction, span=0.65)
Are we doing it right? We used the smallest possible span; if we use a smaller value we get warnings and the plot seems corrupted. The larger the value of span, the closer the plot is to the linear one. How should we decide which value of "span" to use?

Thanks,
D, C and T

Dima Kassab said

at 12:07 am on Mar 10, 2011

Sorry missed "not" in the first line! I meant to say this was why nint wasn't working properly!
Also, we tried to plot both the linear regression and the loess curve on the same panel, but we don't seem to get it to work! Do you have suggestions on how to do that?

Thanks

bob pruzek said

at 12:29 am on Mar 10, 2011

Dima,
Which of the two fitted lines/curves does work in the above? Did abline(fit) work or not? Did scatter.smooth work or not? I'd write loessPrediction = predict(loessObject) and not have that second argument in parens. As for span, the usual default is about .7, plus or minus a little. .65 should be reasonable, but .75 might be ok too.
Let me know. I'll check soon. If you'd like to talk about details, I'm at 793-4395. will be up till at least 12:15.
bob
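Since the one-panel overlay never quite came together above, here is a minimal sketch of one way to do it (variable names and span value taken from the group's code; not the only possible approach):

plot(cereal9$fat, cereal9$calories)
fit <- lm(calories ~ fat, data = cereal9)
abline(fit)                                          # linear fit
lo  <- loess(calories ~ fat, data = cereal9, span = 0.65)
ord <- order(cereal9$fat)                            # sort so the curve draws left to right
lines(cereal9$fat[ord], predict(lo)[ord], lty = 2)   # loess curve on the same panel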

Dima Kassab said

at 12:38 am on Mar 10, 2011

Both work fine. We weren't sure whether we should use scatter.smooth() or something else, mainly because we couldn't get them plotted on the same panel...

Thanks Bob!

Fei Chen said

at 3:40 am on Mar 10, 2011

Professor Pruzek,
There are some mistakes in the Word file "887Fei.03.10.HW4.doc" and I just uploaded a pdf version with the mistakes corrected. Could you delete the Word one? Sorry for any confusion or inconvenience caused. Thank you!
Fei

bob pruzek said

at 11:20 am on Mar 10, 2011

Yes, I'll delete; but in the future you could too. Check the upload page. bob

alex said

at 8:59 pm on Mar 10, 2011

I started using a free TechSmith tool (Jing) to copy and paste R output, but I wonder if there is a better way to export things like ANOVA tables, e.g., an SPSS-like ANOVA table.

bob pruzek said

at 3:44 pm on Mar 11, 2011

Alex,
The last thing I want or would recommend is an SPSS-like ANOVA table... FWIW. I would aim to liberalize the presentations of both numbers and graphics, doing new and unusual things when it seems needed. See Edward Tufte's website on visualization (I think my wiki epym08learnr.pbworks.com gives the URL) as being the sort of thing people should be using to guide their approach to such matters. Tufte is gifted and has been hugely influential (as he should be) in improving visualization of data. (And you all might want to hear Jason Bryer's comments on Tufte... Jason, I hope you speak on this point.) Pasting is never a problem anyway: for graphics just do control-save while the cursor is on the graphic, to save graphics as pdfs; and for number displays, copy and paste works fine (but you will want to use Courier New fonts to align columns of tables). There is more to be said on this and I'll ask Jason to provide his favorite functions/approaches rather than give mine here. HTH, bob

bob pruzek said

at 3:46 pm on Mar 11, 2011

Added: see item 4a on the first page of this wiki for the Tufte URL... there's a lot there, so plan to have an hour or two free for explorations. b

alex said

at 3:52 pm on Mar 11, 2011

It is important not to pigeonhole; in this case, creativity related to course topics. Thank you.
As a consultant, my fear would be client expectations of data output -- and this is where I think the question stems from most of all.
I'll refer to Jason Bryer's comments re: Tufte and check Tufte on this. I imagine there is an answer.
Thanks,
alex

Tony Leonardi said

at 4:36 pm on Mar 19, 2011

Bob (and class):

I have a study with data containing 8 predictor variables and one criterion variable. Two of us are thinking of using some of the predictors for our analysis. Is this what is meant by "use different models", each of us using different predictors?

Thanks for your input!

Tony

bob pruzek said

at 9:20 pm on Mar 19, 2011

Tony,
Please see the new item on the first page: March 24 Assignment. I trust this will clarify things for everyone.
re: your question, the answer is no, that is not what I meant by 'different models.' bob

alex said

at 3:48 pm on Mar 20, 2011

Jason,
Thank you for sharing the LaTeX and Sweave items.
Jason, would something like this make a nice table of R summary output?
https://stat.ethz.ch/pipermail/r-help/2004-May/051242.html

alex

Jason Bryer said

at 4:43 pm on Mar 20, 2011

Alex,

Yes, the xtable package is what you want to use to create tables from R. I can show some examples Thursday, but it is generally straightforward. See also this document: http://cran.r-project.org/web/packages/xtable/vignettes/xtableGallery.pdf

-Jason
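A minimal sketch of the kind of thing xtable produces (toy data; assumes the output will end up in a LaTeX or Sweave document, or HTML if that is the target):

library(xtable)
set.seed(3)
g <- factor(rep(c("A", "B", "C"), each = 20))
y <- rnorm(60, mean = as.numeric(g))
fit <- aov(y ~ g)
xtable(fit)                                      # LaTeX source for the ANOVA table
print(xtable(anova(lm(y ~ g))), type = "html")   # HTML version of the same table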

Catherine Dumas said

at 10:56 am on Mar 21, 2011

bwt.lrr=glm(formula=ptl~lwt+smoke+factor(race),data=birthwt)
> bwt.lrr

Call: glm(formula = ptl ~ lwt + smoke + factor(race), data = birthwt)

Coefficients:
(Intercept) lwt smoke factor(race)2 factor(race)3
0.342197 -0.001926 0.203275 0.016279 0.061550

Degrees of Freedom: 188 Total (i.e. Null); 184 Residual
Null Deviance: 45.76
Residual Deviance: 43.22 AIC: 269.5

Hi Bob,
Above is the code I used for my version of the LR. I want to use ptl as my DV (or y) and lwt, smoke, & race as my IVs. Am I on the right track?

Also, can you post your code from last class?

Thanks,
Catherine

bob pruzek said

at 11:11 am on Mar 21, 2011

Catherine,
The glm setup ignores the first key question: what variables predict the binary variable 'smoke'? You all need to have the same y in glm: smoke. And by all means don't use any other variable except bwt for your DV in subsequent Phase II analyses. Keep asking questions!! bob

Catherine Dumas said

at 11:22 am on Mar 21, 2011

Hey Bob,
That was what I thought originally, about keeping y as smoke. I should have stuck with my gut feeling. Thanks, I will get back to you with my results & with questions...
C

Catherine Dumas said

at 2:29 pm on Mar 21, 2011

bwt.lrr=glm(formula=smoke~lwt+ptl+age, data=birthwt)
> bwt.lrr

Call: glm(formula = smoke ~ lwt + ptl + age, data = birthwt)

Coefficients:
(Intercept) lwt ptl age
0.4913089 -0.0001166 0.1890692 -0.0052352

Degrees of Freedom: 188 Total (i.e. Null); 185 Residual
Null Deviance: 45.03
Residual Deviance: 43.29 AIC: 267.8
> bwt.lrr=glm(formula=smoke~lwt+ptl+age,family="binomial", data=birthwt)
> bwt.lrr

Call: glm(formula = smoke ~ lwt + ptl + age, family = "binomial", data = birthwt)

Coefficients:
(Intercept) lwt ptl age
0.0025167 -0.0005318 0.8110113 -0.0231624

Degrees of Freedom: 188 Total (i.e. Null); 185 Residual
Null Deviance: 253
Residual Deviance: 245.8 AIC: 253.8

Hi Bob,
I ran these 2 functions. The first one without the family="binomial". The results are significantly different. What is happening here? What is the "binomial" argument doing?
C

Fei Chen said

at 4:16 pm on Mar 21, 2011

Hi Dumas,

I don't know if this will be misleading or helpful, but I think by specifying family="binomial", you are requesting glm() to use the logistic model, which is what Bob's assignment wants us to do.

Catherine Dumas said

at 3:59 pm on Mar 21, 2011

Hi Bob (or anyone),
I read the documentation on the histbackback histogram but I am struggling trying to generate the graph. I need help with the code.
Thanks,
C

Fei Chen said

at 4:16 pm on Mar 21, 2011

I need help on the histbackback histogram too.

Dima Kassab said

at 4:35 pm on Mar 21, 2011

this is what I did and it worked for me:
data<-glm(formula = smoke ~ low + ht + bwt, family = "binomial", data = birthwt)
histbackback(fitted(data)[birthwt$smoke==0],fitted(data)[birthwt$smoke==1])

Thanks,
Dima

Dima Kassab said

at 4:36 pm on Mar 21, 2011

Oh..don't forget you need library(Hmisc) and library(MASS)

Dima

Catherine Dumas said

at 4:49 pm on Mar 21, 2011

Thanks Fei,
I read through the documentation on "binomial". I see that it is one of the family objects for models. The family function accesses the family objects which are stored within objects created by modelling functions (e.g., glm).
C

Did you try Dima's code for the Histbackback?

Fei Chen said

at 5:48 pm on Mar 21, 2011

Dima's code works, Thank you~

Catherine Dumas said

at 5:00 pm on Mar 21, 2011

Hey guys,
I am good with the histbackback...:)
C

Fei Chen said

at 5:08 pm on Mar 21, 2011

How did you do that? I tried the code, but it keeps on telling me: Error in hist.default(c(x, y), plot = FALSE) : 'x' must be numeric

Tony Leonardi said

at 5:13 pm on Mar 21, 2011

circ.bwt=circ.psa(loess.bwt$summary.strata[,1:4],summary=T)
This gives me 7 circles in a straight line (on the identity line). Any guesses as to why the circles are in a straight line? Are they supposed to be arranged that way?

Catherine Dumas said

at 5:36 pm on Mar 21, 2011

Fei,
Did it work?
C

Fei Chen said

at 6:04 pm on Mar 21, 2011

Yes, it works, thank you!

Fei Chen said

at 6:06 pm on Mar 21, 2011

But how can I round the propensity scores on the y axis to 3 decimals?

Catherine Dumas said

at 6:15 pm on Mar 21, 2011

options(digits=3)

bob pruzek said

at 9:09 pm on Mar 21, 2011

Tony, It seems yours is the only question still hanging...but let me know, anyone, if there are still issues here.
The first thing I'd try re: your circ plot is WHAT IS YOUR CODE for LOWESS.BWT? And since the summary has only a few lines, then go ahead and paste that here.
If you look at the summary you should see the coordinates of all circles (means for count.0 and count.1) and the counts for each stratum. You could even post a
pdf or doc of the details. But before you do, also look at the descriptive stats for the INPUTS to loess.bwt, that is to the bwt and especially fitted( ) values. You might
also run cor(smoke, fitted( ) ) to see how closely these two correlate (about .43 in my case, but you have fewer predictors so it is likely to be smaller). On second
thought, do these latter things first, and then let us know what you get before going further. bob

Dima Kassab said

at 12:08 am on Mar 22, 2011

Hi Bob,
After checking Lisa Kaltenbachi http://www.mc.vanderbilt.edu/gcrc/workshop_files/2008-04-11.pdf work, I have been working on the dataset we talked about: Here is what I did:
load(url('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/rhc.sav'))
write.table(rhc, file='rhc.csv', sep=',', col.names=NA)
attach(rhc)
bwt.lrr=glm(formula = swang1 ~ age + sex + edu + factor(death),family = "binomial", data = rhc)
histbackback(fitted(bwt.lrr)[swang1=="RHC"],fitted(xx)[swang1=="No RHC"])
library(PSAgraphics)

When I run this
loess.bwt2 = loess.psa(t3d30, treatment=swang1,fitted(bwt.lrr),int=7)
I got the error: Error in as.data.frame.default(response) :
cannot coerce class '"labelled"' into a data.frame
I fixed that by issuing: label(t3d30) <- 't3d30'
and this time I got the error:
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(c("1", "2", "3", :
invalid 'dimnames' given for data frame

Do you think the problem is with the format of the dataset, or should I pick different variables for my analysis?

Thanks,
Dima


bob pruzek said

at 12:45 am on Mar 22, 2011

Dima, Some tests to try: 1. head(rhc) ... Do you get what you saw on the web for these data, or what was in the file when you use a text editor to look at it? 2. class(rhc) ... is it a dataframe? If not and 1 was ok, then do rhc = data.frame(rhc). Then names(rhc) should work. (I'm a bit puzzled by your col.names=NA specification in write.table. Is this from the documentation?) 3. If you're using only a subset of variables (it seems that you are) then define, say, rhcsub = rhc[, c( pluck the variables)], like c(1, 5, 7, 9). Then attach that rhcsub after ensuring that it is a dataframe itself. Then proceed, being sure to redefine the data in the glm call to rhcsub. I'm not sure about your use of the label function, but these other tests seem likely to be helpful. (And now I'm going to get some zzz's! :) ) bob
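A minimal sketch of the checks and subsetting workflow described above (column names taken from Dima's glm call; whether these are the best covariates is a separate question):

head(rhc)                              # test 1: does it look like the data on the web?
class(rhc)                             # test 2: is it a data frame?
rhc <- as.data.frame(rhc)              # if not, coerce it
rhcsub <- rhc[, c("swang1", "age", "sex", "edu", "death")]   # keep only the variables actually used
names(rhcsub)
bwt.lrr <- glm(swang1 ~ age + sex + edu + factor(death),
               family = "binomial", data = rhcsub)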

bob pruzek said

at 12:48 am on Mar 22, 2011

PS. From Kaltenbachi's pdf you should get some sense of which variables, and how many of them, are likely to be most helpful for your runs. Don't cut back too drastically on what she identifies as useful covariates, if you can educe that info from her work. b

Dima Kassab said

at 12:21 am on Mar 23, 2011

Hi Bob,
I'm making slow progress here! I had to change the class of one variable: it was "labelled" when it should be numeric!
Using head(rhc), as you suggested, I figured out that I have a problem with the cat2 column. It has <NA> instead of NA. I used replace() to change the values, but I got "Warning message:
In `[<-.factor`(`*tmp*`, 1, value = "NA") : invalid factor level, NAs generated"

I'm trying to figure out how to solve this!

When I run the glm using the same variables and covariates used by Kaltenbachi (except cat2, as I didn't solve the <NA> problem yet), I get the following error: "Error in data.frame(response, treatment, propensity) :
arguments imply differing number of rows: 5735, 634"
Depending on which variables I'm using I sometimes get an error similar to above or other times "Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") :
contrasts can be applied only to factors with 2 or more levels"

I'm confused because, depending on the combination of variables I'm using, it might or might not work.
For example, this will work:
bwt.lrr=glm(formula =swang1~ age+sex+edu+ income +ninsclas+factor(cat1)+ca,family="binomial",data=rhc)
loess.bwt2 = loess.psa(as.numeric(t3d30), treatment=swang1,fitted(bwt.lrr),int=7)
I'll email you the graph I got running the above!

I think the main reason for the problems I'm getting is that the dataset is not clean! What do you think?

Also, the dataset is big. It has about 5700 rows. Should I use a subset? If yes, how should I pick a subset?

I'm sorry I keep posting late at night! I'm a night person!

Thanks,
Dima

bob pruzek said

at 9:31 am on Mar 23, 2011

Dima, Yes, the data set is really too large for our purposes here. That's why I've asked for DETAILS when students start working with something else beyond the birthwt set.
I would not labor much longer w/ the rhc dataset simply due to the complications you mention, including NAs, which nearly always cause lots of trouble. Clean data are
essential when one is at an early stage of learning a method that is new to you. Short of sitting with you and working w/ each problem as it comes up, I don't think I can
help much here. And I don't have the time today or tomorrow to do those things. I'm sorry I did not remember to save the history when I wrapped things up last week too;
but since I did not, I cannot recover that after the fact. There's nothing bad, weak, or wrong about going back to the birthwt data and predicting smoke using a different
glm model than I used, then following that through. We shall stay with this topic through one more week to give some degree of closure on the topic. Best, bob

Dima Kassab said

at 10:15 am on Mar 23, 2011

Thanks Bob! I was going to ask you if I can do that (go back to the birthwt). It was good experience working on this dataset!

I'm not sure if you have mentioned this before, but does R have a debugger?

Thanks,
Dima

bob pruzek said

at 10:59 am on Mar 23, 2011

Dima, Two functions you might study (and try?) are traceback and debug. Let us know if you find them useful. bob

alex said

at 2:27 am on Mar 24, 2011

Hi Bob,
I'm trying to get a box.psa() graph, but I get an error, can you please provide advice?
continuous = birthwt.psa$response
treatment = birthwt.psa$treatment
strata = birthwt.psa$ps
box.psa(continuous, treatment, strata)
Error in bxp(list(stats = c(138.280701754386, 138.280701754386, 138.280701754386, :
'at' must have same length as 'z$n', i.e. 112

bob pruzek said

at 10:58 am on Mar 24, 2011

Alex,
Try setting this up w/ the three column input (see ?box.psa). If the input matrix is as specified you should have no trouble. You might also check the lengths of the input vectors initially, especially strata.
You may find that the cut2 command that I use in my post of today will do the trick. Use it as I did, and let me know how that works. More often than not people have trouble w/ this vector, and your
use of strata = birthwt.psa$ps will NOT do the trick w/out action using cut2 (most simply).
bob

alex said

at 11:22 am on Mar 24, 2011

Thank you, Bob.
#paste here Jason's code all the way to birthwt.psa, then:
library(Hmisc)
cut.fit.bwt=cut2(fitted(lr),quantile(fitted(lr),prob=seq(.01,.99,len=7)))
box.psa(birthwt.psa$response, birthwt.psa$treatment, cut.fit.bwt)

Output (image): http://screencast.com/t/mbkFG2yORZhw

Liz Fisk said

at 11:36 am on Mar 24, 2011

Hi all,
I had questions about identifying strata, and was mistakenly referring to the histbackback graph to help. Here's Bob's response:

There is NO information in any such histogram that will identify strata.
Remember, you could just as well have generated overlapping density plots
instead of histograms. The information of interest in such plots concerns the
region of 'mutual support'
for the groups being compared. Overlap is essential to be able to proceed w/
conventional PSA, the more the better; but ironically, LR and related methods
seek to separate groups (in some sense maximally), and if there are really
notable differences between the groups on even one of the covariates being used
in the LR model that may be sufficient to make for little overlap. We want
overlap, but the method can deny it; this means that if the covariates that
'really make the separation' are ones that seem, on substantive grounds,
essential, then such data are telling us that the data deny effective use of
PSA. In this sense the method is self-critical -- which I deem as a virtue, but
some may not (especially if you 'really want to know' the answer that PSA could
well provide).

Hope this helps clarify any questions you had as well!

Catherine Dumas said

at 1:14 pm on Mar 24, 2011

Hey guys,
Does anyone know of a good free tool that can be installed in the browser to store URLs and potentially thumbnails of pages?
C

alex said

at 8:19 pm on Mar 25, 2011

A friend of mine likes to use "Zotero."

Catherine Dumas said

at 9:29 pm on Mar 25, 2011

Thanks Alex.

Jason Bryer said

at 8:38 pm on Mar 25, 2011

Zotero is fantastic for collecting sources. It can catalog your PDFs too. It will also export to BibTeX, which works with Sweave and LaTeX. I would also look into Instapaper for more general use.

Catherine Dumas said

at 9:28 pm on Mar 25, 2011

I use Zotero for exactly that. It is great. I will check out instapaper.

alex said

at 10:57 pm on Mar 25, 2011

Jason, Delicious bookmarks are very useful. Do you use delicious?
Alex

bob pruzek said

at 12:20 pm on Mar 25, 2011

Try Evernote ...should be easy to find. b

Catherine Dumas said

at 9:31 pm on Mar 25, 2011

Thanks Bob, I found it & I will try it...:)

alex said

at 4:55 pm on Mar 25, 2011

Bob or anyone,
We are treating PSA as a means for data analysis in this course, i.e., we find data that fit into a PSA analysis we can run. Is PSA predominantly a method for data analysis? Or, and is this for Bob's next course: do you plan a research study knowing beforehand you will use PSA? Does this question make sense?
Will Dr. Stuart's paper help answer this?

bob pruzek said

at 8:21 pm on Mar 25, 2011

Alex, Yes, PSA is a method of data analysis, better: a class of methods for data analysis. You put your finger on one of the topics I will discuss briefly next week, namely
the difference between retrospective approaches to PSA and prospective approaches. Archival data, data collected (usually by others than the analyst) in the past, are more
often than not the starting points for data analysis. The prospective approach, which entails looking forward with a plan for data collection, is preferred for the major reason
that it helps ensure that covariates that are deemed most needed to account for selection bias are named (sometimes this involves constructing tests or instruments) and
measured for all individuals who will then select their 'treatment' -- according to their behavior or actions. More soon. bob

alex said

at 10:51 pm on Mar 25, 2011

Thank you, Bob. Earlier in the semester you told us that individuals sometimes write books concurrently with a research endeavor. I am sure people do this with R as well, i.e., conduct a research endeavor and along the way produce an R package. For me, this is a fun and different way to look at R packages. For example, the survey package (and possibly Jason's new PSA package?) can be used to step through a method from a starting point to an ending (or moving-on) point.

bob pruzek said

at 5:38 pm on Mar 26, 2011

Alex, Am not sure there is a question here, but I can say that nearly any approach to writing, books and packages and more, has been tried before, and if the ideas are sound and useful (and the work looks good in relation to would-be competition) good things can happen. b

Catherine Dumas said

at 1:28 pm on Mar 28, 2011

Hey guys,
How do we define pt.1 as binary? And is it the same as the ptl variable in the birthwt data?
Catherine

Catherine Dumas said

at 2:11 pm on Mar 28, 2011

Hi Bob,
I am trying to run this code & I am getting these error messages. Do you know what I am doing wrong?


bwt.match5=Matchby(Y=bwt/(28.35*16),Tr=smoke,X=bwt.sub5[,c(1,3)],by=list(race))
Error in Matchby(Y = bwt/(28.35 * 16), Tr = smoke, X = bwt.sub5[, c(1, :
(list) object cannot be coerced to type 'double'
In addition: Warning messages:
1: In Ops.factor(left, right) : / not meaningful for factors
2: In Ops.factor(left, right) : / not meaningful for factors
3: In Ops.factor(left, right) : / not meaningful for factors
4: In Ops.factor(left, right) : / not meaningful for factors
5: In Ops.factor(left, right) : / not meaningful for factors
> str(bwt.sub5)
'data.frame': 189 obs. of 5 variables:
$ age : int 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
$ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
$ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
$ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
> bwt.match5=Matchby(Y=bwt/(28.35*16),Tr=smoke,X=bwt.sub5[,c(1,2)],by=list(race))
Error in Matchby(Y = bwt/(28.35 * 16), Tr = smoke, X = bwt.sub5[, c(1, :
(list) object cannot be coerced to type 'double'
In addition: Warning messages:
1: In Ops.factor(left, right) : / not meaningful for factors
2: In Ops.factor(left, right) : / not meaningful for factors
3: In Ops.factor(left, right) : / not meaningful for factors
4: In Ops.factor(left, right) : / not meaningful for factors
5: In Ops.factor(left, right) : / not meaningful for factors
> bwt.match5=Matchby(Y=bwt/(28.35*16),Tr=smoke,X=bwt.sub5[,c(1,2)],by=list(race))

Catherine

Catherine Dumas said

at 2:32 pm on Mar 28, 2011

Bob,
I am ok now. I started a new session.
C

bob pruzek said

at 3:30 pm on Mar 28, 2011

Catherine, I presume you also have pt.1 working. I defined that in my handout (or in class). You could also use the factor( ) function for this: pt.1 = factor(ptl > 0) ... which will give a 'character' variable (TRUE, FALSE); and were you to need it, you could do as.numeric(pt.1) to make that numeric. b

alex said

at 1:48 am on Mar 30, 2011

Bob, Are there any plans to post the video recorded last week?

alex said

at 12:59 am on Mar 31, 2011

I am going to spend the next few hours doing a pass through the Sarkar lattice piece. I am a night owl.
I am sharing my document with you through Google Docs: http://goo.gl/XUXEG
I made the document accessible to anyone with the above link and editable.
There won't be much real time-ness to this since most of you will be sleeping,
but what I hope to test is sharing a work in progress with the rest of the Wiki-users.

Young Do said

at 1:28 am on Mar 31, 2011

I'll check it out. Thanks, Alex.

alex said

at 1:33 pm on Mar 31, 2011

Yeah, Sarkar is great. I'll have to check out some of his other work.

Young Do said

at 5:44 am on Mar 31, 2011

Bob,

I am working on the last portion of the assignment, assessing covariate balance. When I try to run the following code:

cv.bal.bwt6=cv.bal.psa(cv.trans.bwt$covariates.transf,smoke,fitted(bwt.lrr), strata=cut.fit.bwt)

I get this error message:

Error in if (xlim[1] > 0) xlim[1] <- 0 :
missing value where TRUE/FALSE needed

Help. Anyone else get this??
Thanks!

-Young

Young Do said

at 5:49 am on Mar 31, 2011

Here is the full code that I used:

library(MASS)
library(ggplot2)
library(Hmisc)
data(birthwt)
attach(birthwt)


ui <- factor(ui)
ht <- factor(ht)
low <- factor(low)
race <- factor(race, labels = c("white", "black", "other"))
bwt1 <-data.frame(ht, age, race, smoke, ui)


bwt.lrr=glm(formula=smoke ~ ht + age + race + ui, family="binomial", data=bwt1)
AIC(bwt.lrr)

library(Matching)

bwt.sub6<-cbind(ht,age,race,smoke,ui,bwt)

bwt.match6=Matchby(Y=bwt/(28.35*16),Tr=smoke,X=bwt.sub6[,c(1,4,5)],by=list(race))

summary(bwt.match6)

bwt.match6

library(granova)
granova.ds((cbind(bwt[bwt.match6$index.treated],bwt[bwt.match6$index.control])/(28.35*16))[,2:1],xlab="Birthwts in lbs. for babies born to non-smokers",ylab='Birthwts in lbs for babies born to smokers')

library(PSAgraphics)
loess.bwt2=loess.psa(birthwt$bwt/(28.35*16), factor(birthwt$smoke),fitted(bwt.lrr),int=10)
loess.bwt2

cv.trans.bwt=cv.trans.psa(bwt.sub6[,-6],fcol=3)

head(cv.trans.bwt$covariates.transf)

cut.fit.bwt=cut2(fitted(bwt.lrr),quantile(fitted(bwt.lrr),prob=seq(.01,.99,len=7)))

cv.bal.bwt6=cv.bal.psa(cv.trans.bwt$covariates.transf,smoke,fitted(bwt.lrr), strata=cut.fit.bwt)


bob pruzek said

at 10:46 am on Mar 31, 2011

Young, We'll have to address this after 3 pm today. Could you come early to class, say before 4? Thanks bob

alex said

at 2:59 pm on Mar 31, 2011

Dima, Catherine, and Liz,
Thank you for being part of NTIR2011. Due to inclement weather, the conference has been postponed until April 15, 2011.
Alex

Dima Kassab said

at 7:47 pm on Apr 6, 2011

Hi Bob,
In Sarkar's paper on the lattice package, Q-Q plots are discussed. What exactly are Q-Q plots used for? My understanding is that in the example Sarkar used, the plot is telling us gcsescore is a good predictor for score. Is that right, or is the plot giving only descriptive information related to distribution and skewness?
If my statement about gcsescore is true, don't we need other evidence, such as a confidence interval?

Thanks,
Dima

Tony Leonardi said

at 8:17 pm on Apr 6, 2011

Bob (or others in the class):
In the Scatter Plots and Extensions section of the Lattice paper, an asymptotic scatter plot is converted using logs and presented as giving us more information than the original plot. I know this is done a lot in all fields, and I have even done things like this in my "previous life." I am, however, uncomfortable using these transformations as I half consider this to be "lying" about the data. It is difficult, sometimes, to correctly interpret the graphs after the conversions because now you're interpreting a log function, so I was wondering about the use of these transformations on a regular basis. How is one to be sure the interpretation is correct after such transformations are done?
Thanks for your input!
Tony

Catherine Dumas said

at 9:42 pm on Apr 6, 2011

Hi Bob,
Our group is having a discussion about observational studies and covariates. Can race, nationality, & ESL be used as covariates in an educational study? We know that race is used, but what about nationality? Does nationality give us any additional information that race doesn't give us, i.e., cultural differences?
Thanks,
Catherine

bob pruzek said

at 10:17 pm on Apr 6, 2011

Dima, Tony, Catherine,
Re: q-q plots, these are used chiefly to compare shapes of distributions, but also 'levels' and 'spreads'. AFAIK, they have only incidental value in predictive work... as, for example, when predictions of Y from one or more X's are limited because shapes of X and Y distributions are quite different. You should always think of two kinds of purposes of any analysis: descriptive and inferential. q-q plots pertain mostly to description; CIs concern different issues (almost entirely). More in class tomorrow, as needed.
Tony, the term 'asymptotic scatter plot' has no meaning for me, and a search did not bring it up in the Sarkar08LatticeLab pdf. As for use of log transformations -- say in predictive work -- think about this question: why should we believe that any given X and Y will have metrics that make them, w/out change, predictable, as in Y ~ X? Ask instead, what (non-linear) transformations might improve the degree of linearity when predicting, say, Y from f(X), where f(X) might be the log of X, the square root, or the reciprocal of X? (We know gallons per mile is (far) better to use than miles per gallon as the outcome in linear prediction when predicting from features of a car, such as weight, no. of cylinders, displacement, etc. -- these being cases where the outcome Y is non-linearly transformed, not the X's.) And Tukey said, "It is generally better to unbend data that are bent, and then use linear methods, than it is to create a new non-linear method for each new X, Y system."
Catherine, These are largely empirical questions. If nationality adds value to race when predicting ??, or ESL adds value, that is usually sufficient to argue that such variables are worthy of inclusion in prediction frameworks. To say more would require details about application/situations.
Hope this helps, bob
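A small illustration of the 'unbend the data, then use linear methods' idea, with simulated data (everything here is invented for the example):

set.seed(4)
x <- runif(100, 1, 50)
y <- 3 * sqrt(x) * exp(rnorm(100, sd = 0.2))       # bent relationship, multiplicative noise
par(mfrow = c(1, 2))
plot(x, y, main = "raw scale"); abline(lm(y ~ x))                     # straight line misses the bend
plot(log(x), log(y), main = "log-log"); abline(lm(log(y) ~ log(x)))   # nearly linear after the transform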
