thecloud

Search Engine: Splitting Text Files in Python, Building an Index with PyLucene, and Searching the Index

 

Host platform: Ubuntu 13.04

Python version: 2.7.4
PyLucene version: 4.4.0
Original work; when reposting, please credit: http://blog.yanming8.cn/archives/108

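Recently I have wanted to build a small search engine in Python, partly to strengthen my Python skills and partly to learn how search engines work.
After a long search online, I found that nearly all the material and books use Lucene on the Java platform. PyLucene is a Python extension that exposes Java Lucene to Python. The official site describes it as follows:
PyLucene is a Python extension for accessing Java Lucene™. Its goal is to allow you to use Lucene's text indexing and searching capabilities from Python. It is API compatible with the latest version of Java Lucene, version 4.4.0.
In short, PyLucene lets Python code use Lucene's indexing and searching, and its API matches that of the latest Java Lucene.

Although the official site says the API is compatible, I am still not very familiar with Python and find it a little daunting, so I will feel my way forward step by step.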

splitFiles.py
It splits a large text file into small text files of 50 lines each.

#!/usr/bin/env python
import os
import sys

def split(file):
    """Split a text file into smaller files of at most 50 lines each."""
    if not os.path.isfile(file):
        print("%s is not a file" % file)
        sys.exit(1)
    txtfile = open(file, "r")

    # Note: chunks are written next to the input as output_N.txt, so several
    # input files in the same directory overwrite each other's chunks.
    dirname = os.path.dirname(file)

    file_index = 0

    line_cnt = 0
    outfile = open(os.path.join(dirname, "output_%d.txt" % file_index), 'w')
    for line in txtfile:
        if line_cnt >= 50:
            # The current chunk is full: close it and start the next one.
            outfile.close()
            file_index += 1
            outfile = open(os.path.join(dirname, "output_%d.txt" % file_index), 'w')
            line_cnt = 0
        # Write every line, including the one that triggered the rollover.
        outfile.write(line)
        line_cnt += 1

    outfile.close()
    txtfile.close()

if __name__ == "__main__":
    base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    root = os.path.join(base_dir, "txtfiles")
    # Walk the txtfiles directory next to this script and split every .txt file.
    for rootdir, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith('.txt'):
                continue
            txtname = os.path.join(rootdir, filename)
            split(txtname)
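
The 50-line chunking idea can also be expressed as a small generator, which keeps the grouping logic separate from the file handling. This is only a sketch: the name chunk_lines and its size parameter are made up for illustration and are not part of splitFiles.py.

```python
from itertools import islice

def chunk_lines(lines, size=50):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Example: 120 lines are grouped into chunks of 50, 50 and 20 lines.
sample = ["line %d\n" % i for i in range(120)]
print([len(chunk) for chunk in chunk_lines(sample)])  # [50, 50, 20]
```

Each yielded chunk could then be written to its own output file, so no line is lost at a chunk boundary.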

IndexFiles.py
It indexes the .txt files under the given directory and saves the index to a specified directory for the search script to use.
#!/usr/bin/env python
INDEX_DIR = "IndexFiles.index"

import sys, os, lucene, threading, time
from datetime import datetime

from java.io import File
from org.apache.lucene.analysis.miscellaneous import LimitTokenCountAnalyzer
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import FieldInfo, IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version

"""
This class is loosely based on the Lucene (java implementation) demo class
org.apache.lucene.demo.IndexFiles. It will take a directory as an argument
and will index all of the .txt files in that directory and downward
recursively. It will index on the file path, the file name and the file
contents. The resulting Lucene index will be placed next to this script in a
directory called 'IndexFiles.index'.
"""

class Ticker(object):
    """Prints a dot every second while the index is being committed."""

    def __init__(self):
        self.tick = True

    def run(self):
        while self.tick:
            sys.stdout.write('.')
            sys.stdout.flush()
            time.sleep(1.0)

class IndexFiles(object):
    """Usage: python IndexFiles.py [doc_directory]"""

    def __init__(self, root, storeDir, analyzer):
        if not os.path.exists(storeDir):
            os.mkdir(storeDir)
        store = SimpleFSDirectory(File(storeDir))
        # Index at most the first 1000 tokens of each field.
        analyzer = LimitTokenCountAnalyzer(analyzer, 1000)  # 1048576
        config = IndexWriterConfig(Version.LUCENE_CURRENT, analyzer)
        # CREATE overwrites any index that already exists in storeDir.
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
        writer = IndexWriter(store, config)
        self.indexDocs(root, writer)
        ticker = Ticker()
        sys.stdout.write('commit index')
        threading.Thread(target=ticker.run).start()
        try:
            writer.commit()
            writer.close()
        finally:
            # Stop the ticker thread even if the commit fails.
            ticker.tick = False
        print('done')

    def indexDocs(self, root, writer):

        # t1: stored and indexed as a single exact term (not analyzed).
        # Used for the file name and the directory path.
        t1 = FieldType()
        t1.setIndexed(True)
        t1.setStored(True)
        t1.setTokenized(False)
        t1.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS)

        # t2: stored and analyzed by the Analyzer, with positions and offsets.
        # Used for the file contents.
        t2 = FieldType()
        t2.setIndexed(True)
        t2.setStored(True)
        t2.setTokenized(True)
        t2.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)

        # dirpath is kept separate from root so the argument is not shadowed.
        for dirpath, dirnames, filenames in os.walk(root):
            for filename in filenames:
                if not filename.endswith('.txt'):
                    continue
                print("adding %s" % filename)
                try:
                    path = os.path.join(dirpath, filename)
                    f = open(path)
                    contents = f.read()
                    f.close()
                    doc = Document()
                    doc.add(Field("name", filename, t1))
                    doc.add(Field("path", dirpath, t1))
                    if len(contents) > 0:
                        doc.add(Field("contents", contents, t2))
                        print("length of content is %d" % len(contents))
                    else:
                        print("warning: no content in %s" % filename)
                    writer.addDocument(doc)
                except Exception as e:
                    print("Failed in indexDocs: %s" % e)

if __name__ == '__main__':
    lucene.initVM(vmargs=['-Djava.awt.headless=true'])
    print("lucene %s" % lucene.VERSION)
    start = datetime.now()
    try:
        base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
        print(base_dir)
        print(os.path.abspath(sys.argv[0]))

        # Optional first argument: directory to index (defaults to ./txtfiles).
        root = sys.argv[1] if len(sys.argv) > 1 else "./txtfiles"
        IndexFiles(root, os.path.join(base_dir, INDEX_DIR),
                   StandardAnalyzer(Version.LUCENE_CURRENT))
        end = datetime.now()
        print(end - start)
    except Exception as e:
        print("Failed: %s" % e)
        raise
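
To see what the two field types mean in practice, here is a toy inverted index in plain Python. It is only an illustration, not how Lucene is implemented: tokenize is a crude stand-in for an analyzer, and build_index and its tokenized_fields parameter are made-up names.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def build_index(docs, tokenized_fields):
    """Map (field, term) -> set of document ids.

    Fields named in tokenized_fields are split into terms; every other
    field is indexed as one exact term, like an untokenized field.
    """
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            terms = tokenize(value) if field in tokenized_fields else [value]
            for term in terms:
                index[(field, term)].add(doc_id)
    return index

docs = [
    {"name": "output_0.txt", "contents": "Lucene builds an inverted index"},
    {"name": "output_1.txt", "contents": "PyLucene wraps Lucene for Python"},
]
index = build_index(docs, tokenized_fields={"contents"})
print(sorted(index[("contents", "lucene")]))    # [0, 1]
print(sorted(index[("name", "output_0.txt")]))  # [0]
print(sorted(index[("name", "output_0")]))      # []
```

The last lookup finds nothing because the untokenized name field only matches its whole value, which is the behaviour to expect from the "name" and "path" fields in the script.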

SearchFile.py
It searches the index generated above and prints the search results.
#!/usr/bin/env python
INDEX_DIR = "IndexFiles.index"

import sys, os, lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.index import Term
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.search import TermQuery
from org.apache.lucene.util import Version

"""
This script is loosely based on the Lucene (java implementation) demo class
org.apache.lucene.demo.SearchFiles. It will prompt for a search query, then it
will search the Lucene index in the 'IndexFiles.index' directory next to this
script for the search query entered against the 'contents' field. It will then
display the 'path' and 'name' fields for each of the hits it finds in the index.
"""

def run(searcher, analyzer):
    while True:
        print("")
        print("Hit enter with no input to quit.")
        command = raw_input("Query:")
        if command == '':
            return
        print("")
        print("Searching for: %s" % command)
        # Alternative (disabled): parse the input with the analyzer instead
        # of treating it as one exact term.
        # query = QueryParser(Version.LUCENE_CURRENT, "contents",
        #                     analyzer).parse(command)
        # TermQuery uses the input as a single unanalyzed term, so it must
        # match an indexed term exactly.
        query = TermQuery(Term("contents", command))
        hits = searcher.search(query, 10000)
        print("%s total matching documents." % hits.totalHits)
        print("Max score: %s" % hits.getMaxScore())
        for hit in hits.scoreDocs:
            doc = searcher.doc(hit.doc)
            print("URI: %s" % doc.getField("path").stringValue())
            print("File: %s" % doc.getField("name").stringValue())
            # print("Digest: %s" % doc.getField("contents").stringValue())
            # hit.score is the relevance score of this hit.
            print("Health: %s" % hit.score)

if __name__ == '__main__':
    lucene.initVM(vmargs=['-Djava.awt.headless=true'])
    print("lucene %s" % lucene.VERSION)
    base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    directory = SimpleFSDirectory(File(os.path.join(base_dir, INDEX_DIR)))
    searcher = IndexSearcher(DirectoryReader.open(directory))
    analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
    run(searcher, analyzer)
    del searcher
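
The difference between looking up the raw input as one exact term and analyzing the input first can be shown with a toy index in plain Python. This is a sketch under simplifying assumptions, not Lucene code: analyze only lowercases and splits on non-alphanumerics, and term_lookup and parsed_lookup are made-up names that mimic the two query styles.

```python
import re

def analyze(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def term_lookup(index, raw):
    """Exact-term style: the input is used as one term, unanalyzed."""
    return sorted(index.get(raw, set()))

def parsed_lookup(index, raw):
    """Parsed style: analyze the input, then match any of its terms."""
    hits = set()
    for term in analyze(raw):
        hits |= index.get(term, set())
    return sorted(hits)

index = build_index(["Python search engine", "Lucene index files"])
print(term_lookup(index, "Python"))           # []
print(term_lookup(index, "python"))           # [0]
print(parsed_lookup(index, "Python Lucene"))  # [0, 1]
```

In this toy model an exact-term lookup misses "Python" because the indexed terms were lowercased, which suggests entering lowercase single words when querying with the exact-term style.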

Partial output from building the index (image not included).
Partial output from searching (image not included).