The collateral damage of my interest in gardening is a head full of half-remembered Latin plant names. I’ve put it to use by scanning Reddit’s r/WhatsThisPlant to see how many requests I can answer. (My hobbies are riveting. Would it surprise you that, in addition to data viz and gardening, I am also interested in cats and reading?)
At any rate, I’ve started to notice some strong patterns in what people tend to ask about when. It would be fun to quantify that with some graphs, and it would also be fun to learn the Pushshift API!
First up, overall posting activity. Super cyclical:
Next up, what kinds of plants do people want identified? For this, I gathered up a list of about 1000 common landscaping plants, trees, houseplants, and garden weeds. I tried my best to create a complete list that included both Latin and common names.
It can be incredibly difficult to manage the association of Latin and common names: for example, what we call morning glories could actually be flowers in two separate genera, Convolvulus and Ipomoea. I tried my best to manage all of this information, with varying success.
I then counted up the number of posts that have at least one comment mentioning the plant. This was just a simple search. I didn’t check:
- negation (e.g. “This is clearly not a morning glory”)
- similar words (e.g. finding “ivy” when the plant identified was “poison ivy”)
- specific varieties (e.g. Ficus can refer to a huge range of plants)
- whether the identifications were correct
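A minimal Python sketch of this kind of simple search, not the code behind the counts in this post. The alias table, the sample comments, and the function name `count_posts_mentioning` are all made up for illustration. It also shows two of the caveats in action: the negated morning glory comment still counts, and the poison ivy post is counted under plain “ivy” as well.

```python
from collections import defaultdict

# Hypothetical alias table: each plant maps to the lowercase search terms
# (common and Latin names) that count as a mention.
PLANT_TERMS = {
    "morning glory": ["morning glory", "ipomoea", "convolvulus"],
    "ivy": ["ivy", "hedera"],
    "poison ivy": ["poison ivy", "toxicodendron"],
}


def count_posts_mentioning(comments, plant_terms):
    """Count posts that have at least one comment containing any term for a plant.

    comments: iterable of (post_id, comment_text) pairs.
    Plain substring matching only: no negation handling, no word boundaries.
    """
    posts_by_plant = defaultdict(set)
    for post_id, text in comments:
        lowered = text.lower()
        for plant, terms in plant_terms.items():
            if any(term in lowered for term in terms):
                # A set means a post is counted once, however many comments match.
                posts_by_plant[plant].add(post_id)
    return {plant: len(posts_by_plant[plant]) for plant in plant_terms}


# Made-up sample data, not real Reddit comments.
sample = [
    ("p1", "This is clearly not a morning glory."),
    ("p1", "Looks like Ipomoea to me."),
    ("p2", "That's poison ivy, don't touch it!"),
    ("p3", "English ivy (Hedera helix)."),
]
counts = count_posts_mentioning(sample, PLANT_TERMS)
print(counts)
```

Matching on whole words and skipping comments that contain “not” would address the first two caveats, at the cost of missing some genuine identifications.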
With those caveats, folks were just itching to know about these plants:
But let’s put these two together. Which popular plants are the most cyclical in nature? Which are the least? Common sense tells me that interest in showy flowering and fruiting plants like crape myrtle (Lagerstroemia spp.) and kousa dogwood (Cornus kousa) would be highly seasonal, whereas houseplants would be asked about year-round.
For this, I grabbed all plants that were mentioned in a comment in at least 100 posts. I then calculated what percentage of posts occurred during each month. By my definition, the plant with the most seasonal interest is the one that has the highest percentage of posts in one particular month. The one with the least seasonal interest would have about an equal percentage in each month. (Of course, the seasonality of the subreddit as a whole could affect post counts, but I did not attempt to correct for this.)
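The seasonality measure described above can be sketched in Python as follows. This is an illustration under assumptions, not the original calculation: the function names and the sample timestamps are invented, and months are taken in UTC. The score is simply the largest monthly share, so a perfectly even plant would score about 1/12 (roughly 0.083).

```python
from collections import Counter
from datetime import datetime, timezone


def monthly_shares(timestamps):
    """Return the fraction of posts in each calendar month (1-12).

    timestamps: non-empty list of Unix epoch seconds, one per post.
    """
    months = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).month for ts in timestamps
    )
    total = sum(months.values())
    return {m: months.get(m, 0) / total for m in range(1, 13)}


def seasonality_score(timestamps):
    """Return (peak_month, share_of_posts_in_that_month)."""
    shares = monthly_shares(timestamps)
    peak = max(shares, key=shares.get)
    return peak, shares[peak]


def _epoch(year, month, day):
    """Helper for building made-up sample timestamps."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Made-up example: 6 posts in May, 2 in June, 2 in January.
sample_posts = (
    [_epoch(2018, 5, d) for d in range(1, 7)]
    + [_epoch(2018, 6, 10), _epoch(2018, 6, 20)]
    + [_epoch(2018, 1, 5), _epoch(2018, 1, 25)]
)
peak_month, peak_share = seasonality_score(sample_posts)
print(peak_month, round(peak_share, 2))
```

Dividing each plant's monthly count by the subreddit's total for that month before taking shares would be one way to correct for overall seasonality, which this sketch, like the analysis, leaves out.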
And the winners are…
Welp. Guess I was largely right, though I missed on the exact genera. One neat thing I found upon further Googling is that Lily of the Valley is also known as May Bells. Guess it’s super obvious why! (They tend to bloom for me in April, so ¯\_(ツ)_/¯)
Also, the spike of interest in Equisetum is pretty interesting. It’s a pernicious weed that happens to be native to my part of the world… I’ve never had to deal with it myself, and I never want to.
That wraps that up, though I do have further plans to continue playing with this data.
Very neat! Can I get a little more info: how did you get the data from Pushshift parsed into a usable format to feed into Excel?
Sorry, I just looked into Excel’s “Data” → “New Query” → “From Other Sources” → “From Web” and I’ll give that a whirl.
Huh, I never knew about that! I’m using VBA’s XMLHTTP object to get the data from the API (and iterating through a list of plant names as well as epoch values to collect relevant comments):
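Dim xmlhttp As Object, slink$
Set xmlhttp = CreateObject("MSXML2.serverXMLHTTP")
' Build the query URL: comments mentioning either name, up to 500 per request
slink = "https://api.pushshift.io/reddit/comment/search/?" & _
"q=(hollyhock|alcea)" & _
"&size=500"
' Open a synchronous GET request (False = not asynchronous)
xmlhttp.Open "GET", slink, False
' Send it; the JSON reply is then available in xmlhttp.responseText
xmlhttp.send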