Goquery Installation and Usage

Installation

Run:

go get github.com/PuerkitoBio/goquery

Import

import "github.com/PuerkitoBio/goquery"

Loading a Page

Let's use the IMDb popular movies page as an example.

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != 200 {
        log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
    }

Getting the Document Object

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }
    // Other ways to create a document:
    // doc, err := goquery.NewDocumentFromReader(strings.NewReader("<p>Example content</p>"))
    // doc := goquery.NewDocumentFromNode(root) // root is an already parsed *html.Node
    // doc, err := goquery.NewDocument(url)     // deprecated; fetch with net/http instead

Selecting Elements

Element Selector

Selects elements by their HTML element name. For example, dom.Find("p") matches all p tags. Calls can be chained.

ele.Find("h2").Find("a")

Attribute Selector

Filters elements by attribute name and value, with several matching modes.

Find("div[my]")        // div elements that have a my attribute
Find("div[my=zh]")     // div elements whose my attribute equals zh
Find("div[my!=zh]")    // div elements whose my attribute is not equal to zh
Find("div[my|=zh]")    // div elements whose my attribute is zh or starts with zh-
Find("div[my*=zh]")    // div elements whose my attribute contains the string zh
Find("div[my~=zh]")    // div elements whose my attribute contains the word zh
Find("div[my$=zh]")    // div elements whose my attribute ends with zh
Find("div[my^=zh]")    // div elements whose my attribute starts with zh

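`parent > child` Selector

Filters the direct children of an element. For example, dom.Find("div>p") filters p tags that are direct children of a div tag.

`element + next` Adjacent Selector

Use this when the elements you want have no pattern of their own but the element before them does. For example, dom.Find("p[my=a]+p") filters the p tag that immediately follows a p tag whose my attribute is a.

`element~next` Sibling Selector

Filters all following siblings under the same parent, whether or not they are adjacent. For example, dom.Find("p[my=a]~p") filters every p tag that follows, as a sibling, a p tag whose my attribute is a.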

ID Selector

Starts with # and matches a single element exactly. For example, dom.Find("#title") matches the element with id=title. You can also specify the tag, as in dom.Find("p#title").

ele.Find("#title")

Class Selector

Starts with . and filters elements by class name. For example, dom.Find(".content1"). You can also specify the tag, as in dom.Find("div.content1").

ele.Find(".title")

Selector OR (|) Operation

Combines multiple selectors separated by commas. An element is selected if it matches any one of them. For example, Find("div,span").

func main() {
    html := `<body>
                <div lang="zh">DIV1</div>
                <span>
                    <div>DIV5</div>
                </span>
            </body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
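        // Html returns the inner HTML of the first matched element and an error.
        inner, err := selection.Html()
        if err != nil {
            log.Fatalln(err)
        }
        fmt.Println(inner)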
    })
}

Filters

`:contains` Filter

Filters elements that contain the given text. For example, dom.Find("p:contains(a)") filters p tags whose text contains a.

dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})

`:has(selector)`

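Filters elements that contain a descendant matching the given selector. For example, dom.Find("div:has(p)") filters div tags that contain a p tag.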

`:empty`

Filters elements that have no children, meaning no child elements and no text.

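`:first-child` and `:first-of-type` Filters

Find("p:first-child") filters p tags that are the first child of their parent. Find("p:first-of-type") filters p tags that are the first p among their siblings, even when a different kind of element comes before them.

`:last-child` and `:last-of-type` Filters

The counterparts of :first-child and :first-of-type: they match the last child and the last element of its type.

`:nth-child(n)` and `:nth-of-type(n)` Filters

:nth-child(n) filters elements that are the nth child of their parent. :nth-of-type(n) filters elements that are the nth element of their type among their siblings. Counting starts at 1.

`:nth-last-child(n)` and `:nth-last-of-type(n)` Filters

These count from the end, so the last element counts as the first.

`:only-child` and `:only-of-type` Filters

Find(":only-child") filters elements that are the only child of their parent. Find(":only-of-type") filters elements that are the only element of their type among their siblings.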

Getting Content

ele.Html()
ele.Text()

Traversal

Use the Each method to iterate over the selected elements.

ele.Find(".item").Each(func(index int, elA *goquery.Selection) {
    href, _ := elA.Attr("href")
    fmt.Println(href)
})

Built-in Functions

Array Positioning Functions

Eq(index int) *Selection
First() *Selection
Get(index int) *html.Node
Index...() int
Last() *Selection
Slice(start, end int) *Selection

Expansion Functions

Add...()
AddBack() (AndSelf() is a deprecated alias)
Union()

Filtering Functions

End()
Filter...()
Has...()
Intersection()
Not...()

Iteration Functions

Each(f func(int, *Selection)) *Selection
EachWithBreak(f func(int, *Selection) bool) *Selection
Map(f func(int, *Selection) string) (result []string)

Document Modification Functions

After...()
Append...()
Before...()
Clone()
Empty()
Prepend...()
Remove...()
ReplaceWith...()
Unwrap()
Wrap...()
WrapAll...()
WrapInner...()

Property Manipulation Functions

Attr*(), RemoveAttr(), SetAttr()
AttrOr(e string, d string)
AddClass(), HasClass(), RemoveClass(), ToggleClass()
Html()
Length()
Size()
Text()

Node Query Functions

Contains()
Is...()

Document Tree Traversal Functions

Children...()
Contents()
Find...()
Next...() *Selection
NextAll() *Selection
Parent[s]...()
Prev...() *Selection
Siblings...()

Type Definitions

Document
Selection
Matcher

Helper Functions

NodeName
OuterHtml

Examples

Getting Started Example

func main() {
    html := `<html>
            <body>
                <h1 id="title">O Captain! My Captain!</h1>
                <p class="content1">
                O Captain! my Captain! our fearful trip is done,
                The ship has weather’d every rack, the prize we sought is won,
                The port is near, the bells I hear, the people all exulting,
                While follow eyes the steady keel, the vessel grim and daring;
                </p>
            </body>
            </html>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}

Example: Crawling IMDb Popular Movie Information

package main

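import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // goquery.NewDocument(url) is deprecated, so fetch the page with net/http
    // and pass the response body to NewDocumentFromReader.
    res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != http.StatusOK {
        log.Fatalf("status code error: %s", res.Status)
    }
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }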
    doc.Find(".titleColumn a").Each(func(i int, selection *goquery.Selection) {
        title := selection.Text()
        href, _ := selection.Attr("href")
        fmt.Printf("Movie title: %s, link: https://www.imdb.com%s\n", title, href)
    })
}

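The example above extracts movie titles and links from the IMDb popular movies page. The .titleColumn a selector depends on the page's markup, so it may need updating if IMDb changes its layout. In practice, adjust the selectors and processing logic to your needs.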
