A simple Golang scraper
When it comes to scraping data with Golang, the go-to framework for most developers is Colly, popular for its efficiency and ease of use. If you're only interested in scraping data from a single page, GoQuery is also a great tool to consider.
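For comparison, here is a minimal GoQuery sketch that fetches a single page and prints the story titles from the Hacker News front page. The span.titleline > a selector is an assumption about HN's current markup:

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Fetch one page; GoQuery only parses HTML, it doesn't crawl.
	res, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// jQuery-style selection: each story title is a link inside span.titleline.
	doc.Find("span.titleline > a").Each(func(_ int, s *goquery.Selection) {
		fmt.Println(s.Text())
	})
}

For a multi-page crawl, though, Colly is the better fit, so that's what we'll use here.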
Requirements:
- Golang installed
- IDE of your choice
Create a new project in your root directory
mkdir hacker-news-scraper
Now let's initialize the Go module for our scraper:
go mod init hacker-news-scraper
Install Colly:
go get -u github.com/gocolly/colly/v2
Next, let's create main.go:
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

type Comment struct {
	User    string
	Comment string
}

func main() {
	// Restrict the collector to Hacker News so it won't wander off-domain.
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
	)

	comments := make([]Comment, 0)

	// Every comment row on an item page is a tr with the "athing" class;
	// pull the username and comment text out of its child elements.
	c.OnHTML("tr.athing", func(e *colly.HTMLElement) {
		comment := Comment{
			User:    e.ChildText("a.hnuser"),
			Comment: e.ChildText("span.commtext"),
		}
		comments = append(comments, comment)
	})

	err := c.Visit("https://news.ycombinator.com/item?id=40396005")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(comments)
}
What we did here is create a new Colly collector (the scraper itself), restricted to news.ycombinator.com. The c.OnHTML callback fires for every element matching the tr.athing selector, reads the username and comment text from its child elements, and appends them to the comments slice.
To make this a bit more flexible, let's allow the URL to be provided on the command line. I will add a flag parser for the URL parameter:
var url string
flag.StringVar(&url, "url", "https://news.ycombinator.com/item?id=40396005", "URL to scrape")
flag.Parse()
And now our code looks like this:
package main

import (
	"flag"
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

type Comment struct {
	User    string
	Comment string
}

func main() {
	// Read the target URL from the command line, defaulting to the example thread.
	var url string
	flag.StringVar(&url, "url", "https://news.ycombinator.com/item?id=40396005", "URL to scrape")
	flag.Parse()

	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
	)

	comments := make([]Comment, 0)

	c.OnHTML("tr.athing", func(e *colly.HTMLElement) {
		comment := Comment{
			User:    e.ChildText("a.hnuser"),
			Comment: e.ChildText("span.commtext"),
		}
		comments = append(comments, comment)
	})

	err := c.Visit(url)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(comments)
}
You can run it with: go run ./main.go -url=https://news.ycombinator.com/item?id=40396005
That's it; a simple scraper is working.
Of course, you can still trim the comment text to remove redundant whitespace, newlines, and so on, as in the sketch below.
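A minimal sketch of that cleanup, assuming plain whitespace normalization is all you need, using only the standard library:

import "strings"

// cleanText collapses runs of spaces, tabs, and newlines into single spaces.
func cleanText(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

Inside the OnHTML callback you would then write Comment: cleanText(e.ChildText("span.commtext")) and add "strings" to the import block.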