This is an open-source project for a paper search engine, which includes a Scrapy-Redis distributed crawler, an Elasticsearch search engine, and a Django frontend. The project was designed to provide a platform for users to easily search and access research papers.
- Scrapy-Redis distributed crawler using CSS Selectors.
- Centralized deduplication with Redis for distribution.
- Text search engine implemented with ElasticSearch.
- Full-stack web application built using Django.
The main technology stack used in this project includes:
- Scrapy-Redis
- Elasticsearch
- Django
👉👉👉 More technical details that help in understanding the project are given below. 中文版本 (Chinese version)
- Both requests and BeautifulSoup are libraries, while Scrapy is a framework;
- requests and BeautifulSoup can be used inside the Scrapy framework;
- Scrapy is built on Twisted, and performance is its biggest advantage;
- Scrapy is easy to extend and provides many built-in features;
- Scrapy's built-in CSS and XPath selectors are very convenient; BeautifulSoup's biggest disadvantage is that it is slow.
Depth first (recursive implementation)
def depth_tree(tree_node):
    # Pre-order DFS: visit the node, then the left subtree, then the right subtree.
    if tree_node is not None:
        print(tree_node._data)
        if tree_node._left is not None:
            depth_tree(tree_node._left)   # no `return` here, otherwise the right subtree is skipped
        if tree_node._right is not None:
            depth_tree(tree_node._right)
Breadth first (queue implementation)
def level_queue(root):
    # Level-order (BFS) traversal using a FIFO queue.
    if root is None:
        return
    my_queue = []
    node = root
    my_queue.append(node)
    while my_queue:
        node = my_queue.pop(0)   # dequeue from the front
        print(node.elem)         # visit the node
        if node.lchild is not None:
            my_queue.append(node.lchild)
        if node.rchild is not None:
            my_queue.append(node.rchild)
- Save visited URLs in a database;
- Save visited URLs in a set, so that checking whether a URL has been seen costs only O(1);
- Save URLs in the set after hashing them with md5 or a similar function, which shortens each entry;
- Use a bitmap: map each visited URL to a bit position through a hash function;
- Use a Bloom filter, which improves on the bitmap by using multiple hash functions to reduce collisions (see the sketch below).
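A minimal Bloom-filter-style sketch in pure Python, just to illustrate the last two points (the class and hash choices here are illustrative only; this is not the filter used by scrapy-redis):

```python
import hashlib

class SimpleBloomFilter:
    """Minimal Bloom filter: k salted hashes map each URL onto a bit array."""

    def __init__(self, bit_size=1 << 20, hash_count=3):
        self.bit_size = bit_size
        self.hash_count = hash_count
        self.bits = bytearray(bit_size // 8)

    def _positions(self, url):
        # Derive several bit positions from salted md5 digests of the URL.
        for seed in range(self.hash_count):
            digest = hashlib.md5(f"{seed}:{url}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.bit_size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def seen(self, url):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

bf = SimpleBloomFilter()
bf.add("https://example.com/paper/1")
print(bf.seen("https://example.com/paper/1"))  # True
print(bf.seen("https://example.com/paper/2"))  # almost certainly False
```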
- Computers can only process numbers, so text must be converted to numbers before it can be processed. Eight bits form one byte, so the largest number a single byte can represent is 255;
- ASCII (one byte per character) became the standard encoding in the United States;
- ASCII cannot represent Chinese, so China created the GB2312 encoding, which uses two bytes per Chinese character;
- Unicode unifies all languages into a single character set;
- This solves the garbled-text problem, but if the content is all English, Unicode needs twice the storage space of ASCII, and transmission likewise takes twice as long;
- The variable-length encoding UTF-8 stores English characters in one byte and Chinese characters in three bytes (rare characters take 4-6 bytes). When transmitting mostly English text, the savings from UTF-8 are obvious (see the comparison below).
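A quick way to see these size differences with Python's built-in codecs:

```python
# Compare the encoded sizes of English text and Chinese text.
english = "hello"
chinese = "你好"

print(len(english.encode("ascii")))    # 5 bytes: one byte per character
print(len(english.encode("utf-8")))    # 5 bytes: ASCII characters stay one byte in UTF-8
print(len(english.encode("utf-16")))   # 12 bytes: 2 bytes per character plus a 2-byte BOM
print(len(chinese.encode("utf-8")))    # 6 bytes: common Chinese characters take 3 bytes each
print(len(chinese.encode("gb2312")))   # 4 bytes: 2 bytes per character in GB2312
```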
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python for crawling web sites and extracting structured data from their pages. Its main advantage is high concurrency (the underlying layer is asynchronous IO: an event loop plus callbacks). Official documentation
- Install:
pip install Scrapy
- Create a new project:
scrapy startproject namexxx
- XPath uses path expressions to navigate XML and HTML documents;
- XPath includes a standard function library;
- XPath is a W3C standard (see the example below).
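A small example of both selector styles on a made-up HTML snippet, using scrapy.Selector directly:

```python
from scrapy import Selector

html = '<div class="paper"><a href="/p/42">Attention Is All You Need</a></div>'
sel = Selector(text=html)

# XPath: navigate by path expressions.
print(sel.xpath('//div[@class="paper"]/a/@href').get())   # /p/42
# CSS selector: the equivalent query in CSS syntax.
print(sel.css('div.paper a::text').get())                 # Attention Is All You Need
```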
- Make full use of the bandwidth of multiple machines to accelerate crawling;
- Make full use of the IPs of multiple machines to accelerate crawling.
- Centralized management of the request queue: Scrapy's scheduler keeps the queue in local memory, so other servers cannot access the contents of one server's memory;
- Centralized deduplication. Solution: move the request queue and the duplicate filter into a third-party component, using Redis (an in-memory database with very fast reads).
Redis is a key-value storage system that keeps data in memory.
Its core data types are: string, hash, list, set, and sorted set (see the example below).
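The five core data types exercised through redis-py (a minimal sketch that assumes a local Redis instance on the default port and redis-py 3.x):

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

r.set("greeting", "hello")                        # string
r.hset("paper:1", "title", "Deep Learning")       # hash
r.lpush("pending_urls", "https://example.com")    # list
r.sadd("seen_urls", "https://example.com")        # set
r.zadd("hot_papers", {"paper:1": 42})             # sorted set (score = citation count, redis-py 3.x style)

print(r.get("greeting"), r.zscore("hot_papers", "paper:1"))
```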
- Inherit from RedisSpider;
- Requests are no longer scheduled by the local scheduler but by the Scrapy-Redis scheduler;
- The starting URL has to be pushed into Redis (a minimal spider sketch follows).
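A minimal scrapy-redis spider sketch (the spider name, Redis key, and parsing logic are placeholders, not the project's actual spider):

```python
from scrapy_redis.spiders import RedisSpider

class PaperSpider(RedisSpider):
    """Requests are scheduled through Redis instead of the local scheduler."""
    name = "paper"
    # The spider waits until a start URL is pushed to this Redis list.
    redis_key = "paper:start_urls"

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```

The starting URL is then pushed from the Redis side, for example `redis-cli lpush paper:start_urls https://example.com/papers`.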
- Cookies are stored in the form of key-value pairs
pip install wheel
pip install -r requirements.txt
- How to discover new data quickly
  - While the full crawl is still running:
    - Start another crawler: one is responsible for the full crawl, the other for incremental crawling;
    - Use a priority queue (easier to maintain).
  - When the crawler has finished and been shut down:
    - How to detect that new URLs need to be crawled; once a URL appears, a script is needed to push it and start the crawler (see the sketch after this list);
    - Or keep the crawler waiting and simply continue to push URLs.
  - When the crawler is shut down while the full crawl is still running.
- How to handle data that has already been crawled (Scrapy has a built-in deduplication mechanism)
  - After the list pages have been crawled, continue crawling;
  - Whether items that have already been crawled should be crawled again (an update problem). Optimal solution: modify the scrapy-redis source code to achieve this.
- Field that will be updated: citation count.
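One simple way to wake a waiting scrapy-redis spider when new list pages are discovered is to push their URLs into its start-URL key from a separate script (a sketch; the key name matches the hypothetical spider above):

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

def push_new_urls(urls):
    """Feed newly discovered list-page URLs to the waiting incremental spider."""
    for url in urls:
        r.lpush("paper:start_urls", url)

push_new_urls(["https://example.com/papers?page=1"])
```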
- Efficient
- Zero configuration and completely free
- Interacts with the search engine simply through JSON over HTTP
- The search server is stable
- Easy to scale from one server to hundreds
- A search server based on Lucene
- Provides a distributed, multi-user full-text search engine
- Exposed through a RESTful web interface
- Developed in Java and released as open source under the terms of the Apache license
- No scoring, so results cannot be ranked
- Not distributed
- Cannot parse search requests
- Low efficiency
- No word segmentation
- Install elasticsearch-rtf
- Install the head plugin and Kibana
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, X-User"
- Cluster: one or more nodes organized together
- Node: a node is a single server in the cluster, identified by a name (by default a random comic character name)
- Shard: an index can be split into multiple shards, allowing horizontal partitioning and scaling; multiple shards can serve requests in parallel, improving performance and throughput
- Replica: one or more copies of a shard can be created, so that the remaining nodes can take over when a node fails
- index => database
- type => table
- document => line
- fields => columns (see the example below)
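The analogy in code, using the official Python client with the older doc_type-style API that matches the rest of this README's snippets (index, type, and field names are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# index => database, type => table, document => row, fields => columns
es.index(
    index="papers",          # "database"
    doc_type="paper",        # "table"
    id=1,                    # row id
    body={"title": "Attention Is All You Need", "cited": 100000},  # fields
)
print(es.get(index="papers", doc_type="paper", id=1)["_source"])
```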
The inverted index comes from the practical need to find records based on the values of their attributes. Each entry in such an index contains an attribute value and the addresses of all records that have that value. Because the records are located through the attribute values rather than the other way around, it is called an inverted index, and a file carrying such an index is called an inverted file (a toy example follows).
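A toy inverted index over two tiny documents, just to make the idea concrete (pure Python; this is not how Elasticsearch stores it internally):

```python
from collections import defaultdict

docs = {
    1: "deep learning for search",
    2: "distributed search with elasticsearch",
}

# term -> set of document ids containing that term
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

print(inverted["search"])    # {1, 2}
print(inverted["learning"])  # {1}
```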
- Case normalization: for example, python and PYTHON should be treated as the same word
- Stemming: looking and look should be treated as one word
- Word segmentation
- The inverted index file can get very large and needs compression encoding
Elasticsearch handles all of the problems above.
Mapping: when creating an index, you can predefine field types and related attributes.
Without a mapping, ES guesses the field types from the basic types of the JSON source data and turns the input into searchable index terms. A mapping is the data type you define yourself for each field; it also tells ES how to index the data and whether it is searchable.
Role: it makes index creation more detailed and complete (a hedged sketch follows).
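A minimal sketch of defining such a mapping explicitly when the index is created (field names are illustrative, the string type matches the older ES versions used in this README's snippets, and the ik analyzers assume the ik plugin is installed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.indices.create(
    index="papers",
    body={
        "mappings": {
            "paper": {  # type ("table")
                "properties": {
                    "title": {
                        "type": "string",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_smart",
                    },
                    "cited": {"type": "integer"},
                    "publish_date": {"type": "date"},
                }
            }
        }
    },
)
```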
- Basic queries: query with ES's built-in query conditions
- Combined queries: combine several queries into a compound query
- Filtering: the query filters data through filter conditions without affecting scoring (an example combining these follows this list)
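A small example combining a scored match query with a non-scoring filter, sent through the Python client (index and field names are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "deep learning"}}],   # basic query, contributes to the score
            "filter": [{"range": {"cited": {"gte": 100}}}],     # filter, does not affect the score
        }
    }
}
results = es.search(index="papers", body=query)
print(results["hits"]["total"])
```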
Edit distance is a way of measuring the similarity between strings: the edit distance between two strings is the minimum number of insert, delete, replace, or adjacent-swap operations needed to turn one string into the other.
Edit distance is commonly computed with dynamic programming, as in the sketch below.
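A standard dynamic-programming sketch covering the insert/delete/replace operations (the adjacent-swap operation mentioned above would need the Damerau extension):

```python
def edit_distance(a, b):
    """Minimum number of insert/delete/replace operations turning a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace (or match)
    return dp[m][n]

print(edit_distance("elasticsearch", "elastic search"))  # 1
```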
- pip freeze > requirements.txt
- pip install -r requirements.txt
The difference between ik_max_word and ik_smart in Elasticsearch
- Search for "葡萄糖" (glucose): the results should contain only glucose, not "葡萄" (grapes); search for "葡萄" (grapes): results containing "葡萄糖" should still match.
- Searching for "人民币" will only match content containing the keyword "人民币". In fact, "人民币" and "RMB" are synonyms, and we want searches for "人民币" and "RMB" to match each other. How are synonyms configured in ES?
- Users search by pinyin, such as "baidu", or by pinyin initials, such as "bd": how do we match the keyword "百度"? And if the user types "摆渡" (same pronunciation), it should also match "百度". How is Chinese pinyin matching done?
- How do we make sure that search keywords are segmented correctly? Usually a custom dictionary is used, so how is that custom dictionary obtained?
- ik_max_word: splits the text at the finest granularity; for example, "中华人民共和国人民大会堂" (the Great Hall of the People of the People's Republic of China) is split into fine-grained tokens such as 中华人民共和国, 中华人民, 中华, 人民共和国, 人民, 共和国, 大会堂, 大会, and 会堂.
- ik_smart: performs the coarsest-grained split, splitting the same text into 中华人民共和国 and 人民大会堂.
The best practice for the two tokenizers is: use ik_max_word when indexing and ik_smart when searching.
That is, the content of an article is segmented as finely as possible at index time, while searches aim for more precise results. At index time, the ik_max_word analyzer is usually used to improve index coverage with the finest-grained segmentation; at search time, the ik_smart analyzer is used for coarse-grained segmentation to improve precision.
- character filter: processes the string before tokenization, for example removing HTML tags;
- tokenizer: English text can be split on spaces, while Chinese word segmentation is more complicated and may use machine-learning algorithms;
- token filters: modify case, remove stop words, add synonyms, add words, etc.;
- The ES analysis pipeline: character filter -->> tokenizer -->> token filters (it can be inspected with the _analyze API, as shown below).
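The pipeline can be inspected directly with the _analyze API (a sketch with the Python client; the second call assumes the ik plugin is installed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Built-in pipeline: strip HTML, tokenize, lowercase.
print(es.indices.analyze(body={
    "char_filter": ["html_strip"],
    "tokenizer": "standard",
    "filter": ["lowercase"],
    "text": "<p>Hello Elasticsearch</p>",
}))

# Chinese tokenization with the ik analyzer.
print(es.indices.analyze(body={"analyzer": "ik_max_word", "text": "中华人民共和国"}))
```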
- Custom analyzer
- Word segmentation mapping settings
"content": {
"type": "string",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
Suggestion terms need to match pinyin initials, full pinyin, and Chinese prefixes. For example, for "百度", the inputs "baidu", "bd", and "百" must all match, so multiple analyzers are needed when indexing: the Chinese text is indexed with single-character segmentation, while the pinyin initials and full pinyin require a custom analyzer (a hedged sketch follows).
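A hedged sketch of index settings for such a suggestion analyzer, assuming the elasticsearch-analysis-pinyin plugin is installed (the parameter names follow that plugin's documentation and should be checked against the installed version):

```python
# Illustrative index settings: a custom analyzer built on the pinyin tokenizer.
suggest_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "pinyin_analyzer": {
                    "tokenizer": "my_pinyin"
                }
            },
            "tokenizer": {
                "my_pinyin": {
                    "type": "pinyin",            # provided by the pinyin plugin (assumption)
                    "keep_first_letter": True,   # "百度" -> "bd"
                    "keep_full_pinyin": True,    # "百度" -> "bai", "du"
                    "keep_original": True,       # also keep "百度" itself
                    "lowercase": True,
                }
            },
        }
    }
}
```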