NLP, custom modeling with Spacy

디리릭 2022. 3. 10. 02:16

I need logic that classifies attributes, so I have been looking into whether NLP can be applied.

Among the options, I tried customizing a model with the spaCy package.

1. Install spaCy 3.

python -m pip install spacy
# verify the installation
python -m spacy info

2. Create a config file. You can download one from the spaCy website, but it is also easy to generate from the terminal with the command below.

python -m spacy init config config.cfg --lang en --pipeline ner

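3. Create a JSON file for the training data.
https://tecoholic.github.io/ner-annotator/

The link above makes this easy, although it is inconvenient that every label has to be clicked by hand. Create the JSON file there and move it into the same directory as the code (the code in step 4 expects it to be named training_data.json).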
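
For reference, the conversion code in step 4 reads the file as an object with an `annotations` list whose items are `[text, {"entities": [[start, end, label], ...]}]` pairs. Below is a minimal sketch of a sanity check for that shape before converting. The helper name `check_annotations` and the sample record are made up for illustration, and a real export may contain extra keys.

```python
import json


def check_annotations(data):
    """Return a list of problems found in annotator-style data (hypothetical helper)."""
    problems = []
    for i, item in enumerate(data.get("annotations", [])):
        # Each item is expected to be [text, {"entities": [[start, end, label], ...]}]
        if not (isinstance(item, (list, tuple)) and len(item) == 2):
            problems.append(f"item {i}: expected a [text, annotation] pair")
            continue
        text, annot = item
        for start, end, label in annot.get("entities", []):
            # Offsets must describe a non-empty range inside the text
            if not (0 <= start < end <= len(text)):
                problems.append(f"item {i}: bad offsets ({start}, {end}) for {label}")
    return problems


# Made-up sample in the same shape the step 4 code iterates over
sample = json.loads(
    '{"annotations": [["red cotton shirt", '
    '{"entities": [[0, 3, "COLOR"], [4, 10, "MATERIAL"]]}]]}'
)
print(check_annotations(sample))
```

An empty list means no structural problems were found. It does not say anything about whether the labels themselves are right.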

4. Create the .spacy file. Running the code below creates training_data.spacy in the same directory. This is the file I will use to train my own model.

import json

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank('en')  # blank English pipeline, used only for tokenization
db = DocBin()  # container that serializes the annotated Doc objects

# Load the JSON exported from the NER annotator
with open('training_data.json', encoding='utf-8') as f:
    TRAIN_DATA = json.load(f)

for text, annot in tqdm(TRAIN_DATA['annotations']):
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot['entities']:
        # "contract" keeps only tokens that lie fully inside the character offsets;
        # char_span returns None when no token fits
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents
    db.add(doc)

# Write the binary training file next to this script
db.to_disk("./training_data.spacy")
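
One thing to watch for: `doc.ents` does not accept overlapping spans, so if two annotations cover the same characters, the assignment in the loop above raises a `ValueError`. Below is a minimal sketch of a pre-check on the raw offsets, using only the standard library. `find_overlaps` is a hypothetical helper name, not part of spaCy, and the sample offsets are invented.

```python
def find_overlaps(entities):
    """Return (earlier, later) pairs where a span starts before an earlier span has ended."""
    ordered = sorted(entities, key=lambda e: (e[0], e[1]))
    overlaps = []
    widest = None  # the span seen so far that reaches furthest to the right
    for ent in ordered:
        if widest is not None and ent[0] < widest[1]:
            overlaps.append((tuple(widest), tuple(ent)))
        if widest is None or ent[1] > widest[1]:
            widest = ent
    return overlaps


# Invented offsets: "B" and "C" both sit inside "A"
entities = [[0, 10, "A"], [1, 2, "B"], [5, 6, "C"], [12, 15, "D"]]
for earlier, later in find_overlaps(entities):
    print(f"{later} overlaps {earlier}")
```

spaCy also ships `spacy.util.filter_spans`, which takes a list of spans and keeps the longest non-overlapping ones; that may be simpler if dropping the shorter span is acceptable.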

5. Run the following command in the terminal.

python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy
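
Note that the command above passes the same file as both training and dev data, so the reported scores are measured on examples the model has already seen. One common alternative is to hold out part of the annotations. Below is a minimal sketch, assuming the annotations are already loaded as a list; the function name `split_annotations`, the 80/20 ratio, and the seed are arbitrary choices for illustration.

```python
import random


def split_annotations(annotations, dev_ratio=0.2, seed=0):
    """Shuffle a copy of the annotations and split it into (train, dev) lists."""
    items = list(annotations)
    random.Random(seed).shuffle(items)  # fixed seed so the split is repeatable
    # Hold out at least one item when there is more than one to split
    n_dev = max(1, int(len(items) * dev_ratio)) if len(items) > 1 else 0
    return items[n_dev:], items[:n_dev]


# Stand-in data; in practice this would be TRAIN_DATA['annotations']
train, dev = split_annotations(list(range(10)))
print(len(train), len(dev))
```

Each part would then be converted to its own .spacy file with the step 4 code and passed to `--paths.train` and `--paths.dev` respectively.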

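The training progress is shown in the terminal while it runs.

When training finishes, the directory layout changes slightly.

The 'model-best' and 'model-last' folders are created; of these, I will use model-best.

6. Load the trained model into spaCy and run the code below.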

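import spacy

# Load the best checkpoint written by the train command
nlp_ner = spacy.load("model-best")
doc = nlp_ner('''(put your text here)''')

# Print each predicted entity with its label
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")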
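
Since the goal here is attribute classification, it can help to collect the predicted entities per label. Below is a small sketch that works on `(text, label)` pairs such as `[(e.text, e.label_) for e in doc.ents]`. The helper name `group_by_label` and the sample pairs are invented for illustration.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group entity texts by label, keeping the order in which they appeared."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Invented pairs standing in for the model's predictions
pairs = [("red", "COLOR"), ("cotton", "MATERIAL"), ("blue", "COLOR")]
print(group_by_label(pairs))
```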

That's it for today.

I'll try more tomorrow.
