<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.0">Jekyll</generator><link href="https://gbridges34.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://gbridges34.github.io/" rel="alternate" type="text/html" /><updated>2021-07-29T16:37:34+00:00</updated><id>https://gbridges34.github.io/feed.xml</id><title type="html">George Bridges</title><subtitle>Grad student in Statistics</subtitle><entry><title type="html">Reflection</title><link href="https://gbridges34.github.io/reflection/" rel="alternate" type="text/html" title="Reflection" /><published>2021-07-29T00:00:00+00:00</published><updated>2021-07-29T00:00:00+00:00</updated><id>https://gbridges34.github.io/reflection</id><content type="html" xml:base="https://gbridges34.github.io/reflection/">&lt;p&gt;My understanding of what a data scientist does has mostly stayed the same; certainly more has stayed the same than has changed over the course of this class. One thing that has certainly changed, however, is my sense of how large a role cleaning and manipulating the data plays between “obtaining it one way or another” and “analyzing” it. At first I was completely ignorant of how data was read into R, so naturally reading data into R, which is typically a single line of code (once the file you are trying to read is in the current working directory!), seemed to be a huge part of the process. I would have said that reading the data in was one quarter of the whole process, whereas now I would say it constitutes maybe 5 percent. Cleaning, readying, and manipulating the data once it is read in can sometimes be a quarter of the work of a project; at the very least, that is far more likely to be a quarter of a project than simply reading the data in is.
I think R is fantastic for data science! It truly is a remarkably powerful language. With only a few lines of code you can create a graph or fit a model, and do much, much more than I would have thought could be done with so little code. It can read in the data, clean it, analyze it, model it, and finally create presentations and applications to share one’s project with others. I will use R going forward at every opportunity, which I hope is often, because I want to continue in this program as well as work directly in the field once I have my degree. One thing I am very interested to see going forward is how R compares to Python for data science. I know they are the two dominant languages, and I have much less experience with Python.
In practice after this course I will put more of an emphasis on writing elegant code. By this I mean, not copying and pasting more than once but rather creating a variable and using that for instance. This course has taught me that oftentimes there is a really good way to do exactly what you have to do, you just have to find the right package and the right function. In practice I will hold out for exactly the right tool, or function, for the job because oftentimes there is some package that has a function that does just what you need it to without much modification. Prior to this class I would often try to brute force my way to the solution of a problem, like by manually mapping a bunch of variables rather than referencing a table, or by trying to modify a function to do something it isn’t meant to rather than searching for just the right function. In practice going forward I will try to get the most bang out of every line of code that I write.&lt;/p&gt;</content><author><name></name></author><summary type="html">My understanding of what a data scientist does has mostly stayed the same- certainly more has stayed the same than has changed over the course of this class. One thing I am certain that has changed, however, is the size of the role of cleaning and manipulating the data between “obtaining it one way or another” and “analyzing” it. At first I was completely ignorant of how data was read in in r, so naturally what is literally typically a single line of code (once the file you are trying to read in is in the current working directory!) - reading data into R - seemed to be a huge part of the process. I would say I thought one quarter of the whole process was reading the data in whereas now I would say that constitutes maybe 5 percent. 
Cleaning, readying, and manipulating the data once it is read in can sometimes be a quarter of the work of a project; at the very least, that is far more likely to be a quarter of a project than simply reading the data in is. I think R is fantastic for data science! It truly is a remarkably powerful language. With only a few lines of code you can create a graph or fit a model, and do much, much more than I would have thought could be done with so little code. It can read in the data, clean it, analyze it, model it, and finally create presentations and applications to share one’s project with others. I will use R going forward at every opportunity, which I hope is often, because I want to continue in this program as well as work directly in the field once I have my degree. One thing I am very interested to see going forward is how R compares to Python for data science. I know they are the two dominant languages, and I have much less experience with Python. In practice after this course I will put more of an emphasis on writing elegant code. By this I mean, for instance, not copying and pasting a value more than once but instead creating a variable and reusing it. This course has taught me that oftentimes there is a really good way to do exactly what you have to do; you just have to find the right package and the right function. In practice I will hold out for exactly the right tool, or function, for the job, because oftentimes there is some package with a function that does just what you need without much modification. Prior to this class I would often try to brute-force my way to the solution of a problem, for example by manually mapping a bunch of variables rather than referencing a table, or by trying to modify a function to do something it isn’t meant to do rather than searching for just the right function.
In practice going forward I will try to get the most bang out of every line of code that I write.</summary></entry><entry><title type="html">Project2</title><link href="https://gbridges34.github.io/Project2/" rel="alternate" type="text/html" title="Project2" /><published>2021-07-10T00:00:00+00:00</published><updated>2021-07-10T00:00:00+00:00</updated><id>https://gbridges34.github.io/Project2</id><content type="html" xml:base="https://gbridges34.github.io/Project2/">&lt;p&gt;For this project, we read in data containing information about bike rental behavior for different days, along with various weather and day information (for example, whether it was a weekday or a holiday). Marcus and I then removed columns from the data we did not use in our analysis, split the data into training and test sets, and performed some exploratory data analysis. This involved creating some summary statistics and some plots. We then fit four different models, two linear and two ensemble models, to the training portion of the bicycle data and compared them on the test set based on RMSE.
If I were to do the project again I would include interaction and quadratic terms in the linear models. It would have been cool to make the models more complex. I also wish we had coded the project up in a way where we were working off the same data objects throughout. As it was, because Marcus removed some of the columns that I wanted to use, I made a replica of the data with just those columns retained.
The most difficult part for me was remembering exactly how to predict on the test data and then compare the models via a common metric. Frankly, another student asked a question on Slack that biased me towards an incorrect understanding, which didn’t help. They referred to two of the algorithms as classification algorithms, and their question kind of assumed the comparison would be hard. I should have ignored that, because once I had done it the comparisons based on RMSE made perfect sense to me.
My big takeaway from the project is that R is a very powerful language. It was fairly straightforward to create summary statistics, plots, and models, and to successfully test those models. Also, there is meaningful, usable data available on such a large variety of phenomena and topics. This rental bike data allows for prediction of rental levels from values mostly relating to the weather. I could imagine similar methods being used to predict restaurant or retail store patronage, demand for a product being sold online, or attendance at concerts. Data science is fascinating, and statistics as “the grammar of science” is so flexible and powerful!
This is the link to the repository: &lt;a href=&quot;https://github.com/gbridges34/Project2Bridges-Lee&quot;&gt;here&lt;/a&gt;
This is the link to the pages page: &lt;a href=&quot;https://gbridges34.github.io/Project2Bridges-Lee/&quot;&gt;here&lt;/a&gt;&lt;/p&gt;</content><author><name></name></author><summary type="html">For this project, we read in data containing information about bike rental behavior for different days, along with various weather and day information (for example, whether it was a weekday or a holiday). Marcus and I then removed columns from the data we did not use in our analysis, split the data into training and test sets, and performed some exploratory data analysis. This involved creating some summary statistics and some plots. We then fit four different models, two linear and two ensemble models, to the training portion of the bicycle data and compared them on the test set based on RMSE. If I were to do the project again I would include interaction and quadratic terms in the linear models. It would have been cool to make the models more complex. I also wish we had coded the project up in a way where we were working off the same data objects throughout. As it was, because Marcus removed some of the columns that I wanted to use, I made a replica of the data with just those columns retained. The most difficult part for me was remembering exactly how to predict on the test data and then compare the models via a common metric. Frankly, another student asked a question on Slack that biased me towards an incorrect understanding, which didn’t help. They referred to two of the algorithms as classification algorithms, and their question kind of assumed the comparison would be hard. I should have ignored that, because once I had done it the comparisons based on RMSE made perfect sense to me. My big takeaway from the project is that R is a very powerful language. It was fairly straightforward to create summary statistics, plots, and models, and to successfully test those models.
Also, there is meaningful, usable data available on such a large variety of phenomena and topics. This rental bike data allows for prediction of rental levels from values mostly relating to the weather. I could imagine similar methods being used to predict restaurant or retail store patronage, demand for a product being sold online, or attendance at concerts. Data science is fascinating, and statistics as “the grammar of science” is so flexible and powerful! This is the link to the repository: here This is the link to the pages page: here</summary></entry><entry><title type="html">Project1</title><link href="https://gbridges34.github.io/Project1/" rel="alternate" type="text/html" title="Project1" /><published>2021-06-20T00:00:00+00:00</published><updated>2021-06-20T00:00:00+00:00</updated><id>https://gbridges34.github.io/Project1</id><content type="html" xml:base="https://gbridges34.github.io/Project1/">&lt;p&gt;Here is the link to my pages site: https://gbridges34.github.io/Project1/
Here is a link to my repo: https://github.com/gbridges34/Project1
I had a lot of trouble figuring out how to map the names of NHL teams to their respective ID numbers. I ended up taking a very, very brute-force approach that worked as far as it went but was very inelegant. I also found it very difficult to figure out how to actually get to the useful data that came back from the stats API. It originally came back in a doubly nested data frame, and I had to discover the “unnest” function in order to access it. I do not believe we covered this function in class, and I do not know how else I would have gotten that job done. One thing I would do differently is start even sooner than I did. I started with a full week left before the deadline and had cleared out lots of time every day to work on the project, but next time I will start as soon as I have access to the second project. Also, I learned the value of having other people to ask during this project; I made heavy use of the general channel, and it did a great job of getting me unstuck. I am going to make heavy, heavy use of it going forward for homework and future projects. I was not able to get my pictures to render on my site for the longest time until finally, bless him, Adeyemi walked me through deleting my repo and starting again. It finally worked after he was on the phone with me for an hour!&lt;/p&gt;</content><author><name></name></author><summary type="html">Here is the link to my pages site: https://gbridges34.github.io/Project1/ Here is a link to my repo: https://github.com/gbridges34/Project1 I had a lot of trouble figuring out how to map the names of NHL teams to their respective ID numbers. I ended up taking a very, very brute-force approach that worked as far as it went but was very inelegant. I also found it very difficult to figure out how to actually get to the useful data that came back from the stats API.
It originally came back in a doubly nested data frame, and I had to discover the “unnest” function in order to access it. I do not believe we covered this function in class, and I do not know how else I would have gotten that job done. One thing I would do differently is start even sooner than I did. I started with a full week left before the deadline and had cleared out lots of time every day to work on the project, but next time I will start as soon as I have access to the second project. Also, I learned the value of having other people to ask during this project; I made heavy use of the general channel, and it did a great job of getting me unstuck. I am going to make heavy, heavy use of it going forward for homework and future projects. I was not able to get my pictures to render on my site for the longest time until finally, bless him, Adeyemi walked me through deleting my repo and starting again. It finally worked after he was on the phone with me for an hour!</summary></entry><entry><title type="html">Blog Post</title><link href="https://gbridges34.github.io/RversusOtherLanguages/" rel="alternate" type="text/html" title="Blog Post" /><published>2021-06-08T00:00:00+00:00</published><updated>2021-06-08T00:00:00+00:00</updated><id>https://gbridges34.github.io/RversusOtherLanguages</id><content type="html" xml:base="https://gbridges34.github.io/RversusOtherLanguages/">&lt;p&gt;I may have nearly the least programming experience of anyone taking this course, because I lobbied to take it before taking the prerequisite course or courses. I simply found that the classes I had taken in the program so far utilized R so much that I needed a solid foundation in it. That being said, when I was an undergrad I took some classes in Java and some classes in HTML. I see similarities and differences between R and both of those languages. I am glad that R has some of the object-oriented attributes that Java has.
The idea that everything that exists in R is an object and everything that happens in R is a function is very familiar to me from my time studying Java. I see the biggest overlap between R and HTML in the way R Markdown works. The notion of a header and then subheaders being the foundation of the hierarchical organization of the text, with additional markup designating various aspects of formatting, is very familiar from HTML. The respect in which R is totally different (and I see this as a blessing and a curse at the same time) is that in R you can do everything at least two different ways. That is good from the standpoint of less rigidity and less memorization of syntax, because if you type what you want to do in an organized and consistent way you can almost guess one of the valid ways of coding up what you are trying to do. The downside, obviously, is that there is no one to tell you which particular way of doing a thing one should actually learn. Also, packages are not consistent in terms of which syntax they use, sometimes even within one package! I’d say anyone who is being honest, even the authors of some of these packages, would readily admit that a single syntax is enough for a particular package. Finally, there is one aspect in which R is straightforwardly left wanting: by all accounts it is simply a slow language. There is no tradeoff here: it would be nothing but positive if R were a little (or a lot) faster. Yes, there are ways of making R faster than it would otherwise be by coding things up thoughtfully, but this does not change the fact that it is sluggish. Overall, I think R is very fair to the user from a learning standpoint, probably falling for me between HTML and Java in terms of ease of learning. That is, I think R is a little tougher than HTML and not as hard as Java.
I will admit, though, that the course I took in Java is kind of the “weed out” course for CS majors at Stanford, so it is made intentionally hard.&lt;/p&gt;</content><author><name></name></author><summary type="html">I may have nearly the least programming experience of anyone taking this course, because I lobbied to take it before taking the prerequisite course or courses. I simply found that the classes I had taken in the program so far utilized R so much that I needed a solid foundation in it. That being said, when I was an undergrad I took some classes in Java and some classes in HTML. I see similarities and differences between R and both of those languages. I am glad that R has some of the object-oriented attributes that Java has. The idea that everything that exists in R is an object and everything that happens in R is a function is very familiar to me from my time studying Java. I see the biggest overlap between R and HTML in the way R Markdown works. The notion of a header and then subheaders being the foundation of the hierarchical organization of the text, with additional markup designating various aspects of formatting, is very familiar from HTML. The respect in which R is totally different (and I see this as a blessing and a curse at the same time) is that in R you can do everything at least two different ways. That is good from the standpoint of less rigidity and less memorization of syntax, because if you type what you want to do in an organized and consistent way you can almost guess one of the valid ways of coding up what you are trying to do. The downside, obviously, is that there is no one to tell you which particular way of doing a thing one should actually learn. Also, packages are not consistent in terms of which syntax they use, sometimes even within one package!
I’d say anyone who is being honest, even the authors of some of these packages, would readily admit that a single syntax is enough for a particular package. Finally, there is one aspect in which R is straightforwardly left wanting: by all accounts it is simply a slow language. There is no tradeoff here: it would be nothing but positive if R were a little (or a lot) faster. Yes, there are ways of making R faster than it would otherwise be by coding things up thoughtfully, but this does not change the fact that it is sluggish. Overall, I think R is very fair to the user from a learning standpoint, probably falling for me between HTML and Java in terms of ease of learning. That is, I think R is a little tougher than HTML and not as hard as Java. I will admit, though, that the course I took in Java is kind of the “weed out” course for CS majors at Stanford, so it is made intentionally hard.</summary></entry><entry><title type="html">Blog Post</title><link href="https://gbridges34.github.io/statsvdatascience/" rel="alternate" type="text/html" title="Blog Post" /><published>2021-05-22T00:00:00+00:00</published><updated>2021-05-22T00:00:00+00:00</updated><id>https://gbridges34.github.io/statsvdatascience</id><content type="html" xml:base="https://gbridges34.github.io/statsvdatascience/">&lt;p&gt;My dream job using this master’s degree in statistics is to work in analytics for a sports team (preferably basketball or football). That being said, the number of professional and high-level NCAA teams is not growing, whereas the demand for data scientists does seem to be consistently growing. Therefore, I see myself very likely getting into data science, because that is where the jobs are likely to be. I would then try to do some volunteer or consulting work for a sports team to perhaps eventually break into that area. I just do not think it is realistic to think I could get a job as a Director of Analytics, or even an analyst, for some NBA team right off the bat.
I like the description of data scientists offered in the assigned article “Data Scientists vs Statisticians,” which is that they are practitioners who follow a particular process pretty closely. Specifically this is the “data science process,” which consists of several steps that are pretty straightforward to understand. These steps are, again per the same article: “data ingest, data transformation, exploratory data analysis, model selection, model evaluation, and data storytelling.” I interpret this process as meaning that data scientists ingest data into their chosen analytical framework, transform it if necessary, do some initial exploratory analysis, select what appears to be the most appropriate model based on that analysis, evaluate the predictive performance of that model, and then finally wrap the analysis into a narrative that can be presented to key business leaders in a way that is understandable even for those who do not have a strong technical background. One obvious difference between data science and statistics is that data science tends to deal with much larger data sets that are imported for analysis, whereas a statistician might acquire data from an experiment or survey. Another difference is that statisticians are much more focused on quantifying the uncertainty associated with their calculations and estimations. Data scientists, rather, are primarily focused on prediction. I think the biggest area of similarity between data science and statistics is simply that the “science” in data science is in fact statistics; statistics is in the background of virtually everything a data scientist does. I think this fact is what leads critics such as Nate Silver to claim that data science is nothing but a fancy word for applied statistics.
I do not buy this argument, though; I think that the usefulness and impact of data science is sufficient to warrant its own term.&lt;/p&gt;</content><author><name></name></author><summary type="html">My dream job using this master’s degree in statistics is to work in analytics for a sports team (preferably basketball or football). That being said, the number of professional and high-level NCAA teams is not growing, whereas the demand for data scientists does seem to be consistently growing. Therefore, I see myself very likely getting into data science, because that is where the jobs are likely to be. I would then try to do some volunteer or consulting work for a sports team to perhaps eventually break into that area. I just do not think it is realistic to think I could get a job as a Director of Analytics, or even an analyst, for some NBA team right off the bat. I like the description of data scientists offered in the assigned article “Data Scientists vs Statisticians,” which is that they are practitioners who follow a particular process pretty closely. Specifically this is the “data science process,” which consists of several steps that are pretty straightforward to understand. These steps are, again per the same article: “data ingest, data transformation, exploratory data analysis, model selection, model evaluation, and data storytelling.” I interpret this process as meaning that data scientists ingest data into their chosen analytical framework, transform it if necessary, do some initial exploratory analysis, select what appears to be the most appropriate model based on that analysis, evaluate the predictive performance of that model, and then finally wrap the analysis into a narrative that can be presented to key business leaders in a way that is understandable even for those who do not have a strong technical background.
One obvious difference between data science and statistics is that data science tends to deal with much larger data sets that are imported for analysis, whereas a statistician might acquire data from an experiment or survey. Another difference is that statisticians are much more focused on quantifying the uncertainty associated with their calculations and estimations. Data scientists, rather, are primarily focused on prediction. I think the biggest area of similarity between data science and statistics is simply that the “science” in data science is in fact statistics; statistics is in the background of virtually everything a data scientist does. I think this fact is what leads critics such as Nate Silver to claim that data science is nothing but a fancy word for applied statistics. I do not buy this argument, though; I think that the usefulness and impact of data science is sufficient to warrant its own term.</summary></entry></feed>