Overview of the Endeca Content Acquisition System
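CAS is a system for adding, configuring, and crawling data sources for an Endeca application. Data sources include file systems, content management systems (CMS), Web servers, and custom data sources. CAS converts the documents and files it crawls into Endeca records and stores them, ready for use in a Forge pipeline.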
The Endeca Content Acquisition System is made up of the following components:
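1 CAS Service: a servlet container that runs on the CAS server. It contains the Component Instance Manager, any number of Record Store instances, and the Dval Id Manager.
2 CAS Server: the component that manages all file system and CMS crawl operations.
3 CAS Console: a web-based application in Endeca Workbench used to crawl data sources, including file systems and CMS. The CAS Console is installed as an extension during the CAS installation.
4 CAS Server API: lets you write code that communicates with the CAS Server. It has a WSDL interface and a command-line utility.
5 Dimension Value Id Manager: a CAS component that creates, stores, and retrieves dimension value identifiers.
6 Endeca Web Crawler: manages all Web crawling operations.
7 Endeca CMS Connectors: provide a way to access and crawl data sources in various types of CMS.
8 Component Instance Manager: creates, lists, and deletes Record Store instances. It has a WSDL interface and the CIM command-line utility.
9 Endeca Record Store: provides persistent storage for the generated records. It also has a WSDL interface and the Record Store command-line utility. The CAS Server writes the crawl output from each data source to a unique Record Store instance.
10 CAS Extension API: provides a set of interfaces and classes for building extensions such as custom data sources and other custom extensions.
Note: A data source can have multiple Record Store instances, and an application can have multiple Dimension Value Id Managers.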
Starting the CAS Service:
Windows: <install path>/CAS/<version>/bin/cas-service.bat (or cas-service-wrapper.exe)
Linux: <install path>/CAS/<version>/bin/cas-service.sh
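CAS Server features:
1 Includes the CAS Document Conversion Module, which lets the CAS Server convert binary files to text.
2 Uses include and exclude filters to specify which files and folders are crawled and which are skipped.
3 Supports incremental crawls.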
Record Store:
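A web service that provides persistent storage for the generated records. The records can later be read by Forge or queried through CAS, which replaces writing crawl output to files.
1 Provides an efficient repository for records (previously, data from different sources lived in different directories; now it is in one place, which removes the need to copy and move files between directories).
2 Can supply data for both full indexing and incremental indexing.
3 Supports asynchronous operation: CAS can write data to a Record Store while Forge reads from it.
4 Creates a separate Record Store for each data source.
5 Automatically cleans up old data.
6 Is easy to configure and manage through its command-line utility.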
Dimension value ID Managers
1 There is a command line interface (cas-cmd) to the component to manually perform the following operations:
• Create a Dimension Value Id Manager.
• Generate dimension value Ids.
• Export and import dimension value Ids.
• Get dimension value Ids.
• Delete a Dimension Value Id Manager.
The initialize_services script creates a new instance of a Dimension Value Id Manager. CAS generates dimension value IDs as part of writing MDEX output. Before removing an Endeca application, you manually delete its Dimension Value Id Manager with the cas-cmd utility.
2 Backing up and restoring dimension value IDs
Back up: the exportDimensionValueIdMappings task of cas-cmd.
Restore: the importDimensionValueIdMappings task of cas-cmd.
3 Propagating dimension value IDs across environments
You often have to move the dimension value ID mapping file between environments, such as dev, UAT, and prod.
You can coordinate this work in your Deployment Template script by calling exportDimensionValueIdMappings() on the ContentAcquisitionServerComponent, copying the file to the necessary machine, and calling importDimensionValueIdMappings() to load the file into another instance of a Dimension Value Id Manager. (In other words, an export followed by an import.)
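To see why the mapping file has to travel with the application, here is a minimal Python sketch of the idea. It is a toy model and not the CAS implementation: the class ToyDimvalIdManager, its method names, and the CSV layout are all hypothetical and exist only for illustration.

```python
import csv
import io


class ToyDimvalIdManager:
    """Toy stand-in for a Dimension Value Id Manager (hypothetical, not the CAS API)."""

    def __init__(self):
        self._ids = {}      # (dimension name, dimval spec) -> numeric id
        self._next_id = 1

    def get_or_create_id(self, dimension, spec):
        # Reuse an existing id so the same dimension value keeps the same id.
        key = (dimension, spec)
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]

    def export_mappings(self):
        # Serialize all mappings to CSV text (made-up layout).
        buf = io.StringIO()
        writer = csv.writer(buf)
        for (dimension, spec), dimval_id in sorted(self._ids.items()):
            writer.writerow([dimension, spec, dimval_id])
        return buf.getvalue()

    def import_mappings(self, text):
        # Load mappings that were exported from another instance.
        for dimension, spec, dimval_id in csv.reader(io.StringIO(text)):
            self._ids[(dimension, spec)] = int(dimval_id)
        self._next_id = max(self._ids.values(), default=0) + 1


dev = ToyDimvalIdManager()
dev.get_or_create_id("Color", "red")
dev.get_or_create_id("Color", "blue")

# "Copy the file to the other machine" is just passing the text along here.
prod = ToyDimvalIdManager()
prod.import_mappings(dev.export_mappings())
```

After the import, prod hands out the same ID for ("Color", "red") that dev did, and new dimension values continue from the next free ID. Without the import, the two environments could assign different IDs to the same dimension value.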
Overview of the default CAS data sources and manipulators
Chapter 2: Create A Crawl
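You can use the CAS Console, the CAS Server Command-line Utility, or the CAS Server API to create and configure any number of crawls for an application. If you use the CAS Console, note that a crawl is equivalent to a data source.
Specify the following configuration options:
1 The name of the crawl
2 The location of the source data to crawl
3 Filters for the files or folders to include or exclude
4 Repository properties for CMS data sources
5 Manipulators that modify Endeca records as part of the crawl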
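The include and exclude filters in option 3 can be pictured with a short Python sketch. This is only an illustration built on the standard fnmatch module; the function name should_crawl is hypothetical, and the rules used here (an empty include list accepts everything, an exclude match always wins) are assumptions rather than documented CAS filter semantics.

```python
from fnmatch import fnmatch


def should_crawl(path, include_patterns, exclude_patterns):
    """Return True if path passes the include filters and matches no exclude filter."""
    # Assumption: no include patterns means "include everything".
    included = not include_patterns or any(fnmatch(path, p) for p in include_patterns)
    # Assumption: an exclude match overrides an include match.
    excluded = any(fnmatch(path, p) for p in exclude_patterns)
    return included and not excluded


paths = ["docs/guide.pdf", "docs/tmp/draft.pdf", "docs/logo.png"]
selected = [p for p in paths if should_crawl(p, ["*.pdf"], ["*/tmp/*"])]
print(selected)
```

Only docs/guide.pdf survives: the PNG fails the include filter and the draft sits under an excluded folder.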
Chapter 3 Load data into an MDEX Engine
1 Creating a Forge pipeline to read from or write to a Record Store
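This section describes how to build a Forge pipeline that reads Endeca records from one or more Record Store instances.
To read records into a Forge pipeline, add an input record adapter. If the record adapter reads from a CAS output file, specify whether the file format is XML or binary.
The URL specifies the location of the file.
If the record adapter reads from a Record Store instance, configure it as a custom adapter.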
1. Create a record adapter to read the Endeca records that CAS produced (required).
2. Map the record properties to Endeca properties and dimensions (required, but not documented in this guide; see Endeca Developer Studio Help).
Creating a record adapter to read from one or more Record Store instances:
1 New Record Adapter
2 From format list, choose Custom Adapter
3 Specify the JAVA_HOME, Class, and Class Path, for example:
Class:
com.endeca.itl.recordstore.forge.RecordStoreSource
Class Path:
<install path>/CAS/<version>/lib/recordstore-forge-adapter/recordstore-forge-adapter-<version>.jar
4 Select the Pass Throughs tab of the Record Adapter editor.
5 On the Pass Throughs tab, create the following name/value pairs:
5.1 Set a HOST pass-through to the fully qualified host name of the machine running the Endeca CAS Service. For example, HOST = hostname.endeca.com.
5.2 Set a PORT pass-through to the port number that the Endeca CAS Service is listening on. For example, PORT = 8500.
5.3 If reading from one Record Store instance, set an INSTANCE_NAME pass-through to the name of the
Record Store instance that you want Forge to read from. For example, INSTANCE_NAME = crawlID.
This pass-through is not required if the adapter is reading from multiple Record Store instances.
5.4 For a baseline pipeline, set a READ_TYPE pass-through to BASELINE. The BASELINE setting instructs Forge to read the latest version of all records in the Record Store. For example, READ_TYPE = BASELINE.
For a partial-update pipeline, set a READ_TYPE pass-through to DELTA. The DELTA setting instructs Forge to read records that have been modified or added between the last committed generation in the Record Store and the last generation read by the same client as identified by CLIENT_ID setting. For example, READ_TYPE = DELTA.
5.5 Set a CLIENT_ID pass-through to a string that distinguishes this client from others that may also be reading from the Record Store instances. For example, CLIENT_ID = FORGE. The CLIENT_ID pass-through specifies the client ID to be set for the generation that is being read in. In effect, this pass-through is performing the set-last-read-generation task that can be performed with the CAS Server Command-line Utility (i.e., state is being set for the client, which is Forge in this case). This pass-through can be used only for READ_TYPE operations.
5.6 Optionally, set a RECORDS_PER_TRANSFER pass-through to the number of records to transfer at a time for each Record Store instance. The default is 500. Click OK to add the new record adapter to the project.
5.7 Optionally, to enable SSL with server-only authentication, add pass-through options for the truststore location (SSL_TRUSTSTORE), type (SSL_TRUSTSTORE_TYPE), password (SSL_TRUSTSTORE_PASSWORD), and CAS port usage (IS_PORT_SSL).
5.8 Optionally, to enable SSL with mutual authentication, add pass-through options for the keystore location (SSL_KEYSTORE), type (SSL_KEYSTORE_TYPE), and password (SSL_KEYSTORE_PASSWORD). For example: SSL_KEYSTORE = C:\Endeca\CAS\workspace\conf\keystore.ks, SSL_KEYSTORE_TYPE = JKS, SSL_KEYSTORE_PASSWORD = endeca, IS_PORT_SSL = false.
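The pass-through settings in step 5 are easy to get wrong by hand, so a small checker can help before opening Developer Studio. The sketch below is a plain Python illustration, not part of any Endeca tool: the function names are hypothetical, and treating HOST, PORT, READ_TYPE, and CLIENT_ID as mandatory is my reading of the steps above, not a documented rule.

```python
VALID_READ_TYPES = {"BASELINE", "DELTA"}
DEFAULT_RECORDS_PER_TRANSFER = 500  # default stated in step 5.6


def check_pass_throughs(pass_throughs):
    """Return a list of problems found in a dict of pass-through name/value pairs."""
    problems = []
    # Assumption: these four are treated as mandatory in this sketch.
    for name in ("HOST", "PORT", "READ_TYPE", "CLIENT_ID"):
        if not pass_throughs.get(name):
            problems.append(f"missing {name}")
    port = pass_throughs.get("PORT", "")
    if port and not str(port).isdigit():
        problems.append("PORT must be a number")
    read_type = pass_throughs.get("READ_TYPE")
    if read_type and read_type not in VALID_READ_TYPES:
        problems.append("READ_TYPE must be BASELINE or DELTA")
    return problems


def records_per_transfer(pass_throughs):
    # Fall back to the stated default when the pass-through is not set.
    return int(pass_throughs.get("RECORDS_PER_TRANSFER", DEFAULT_RECORDS_PER_TRANSFER))


baseline = {
    "HOST": "hostname.endeca.com",
    "PORT": "8500",
    "INSTANCE_NAME": "crawlID",
    "READ_TYPE": "BASELINE",
    "CLIENT_ID": "FORGE",
}
print(check_pass_throughs(baseline))
print(records_per_transfer(baseline))
```

INSTANCE_NAME is deliberately not checked, because step 5.3 says it is not required when the adapter reads from multiple Record Store instances.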
In some cases, you may get an Out of Memory error if Forge is reading or writing records from a Record Store instance. To work around this error, you can increase the amount of memory allocated to the JVM running
Forge. To increase the memory, run Forge with --javaArgument flag and the -Xmx argument, for example --javaArgument -Xmx512m.
Record properties for all dimension values
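For crawled data, each record produces one dimension value. Each record may have the record properties listed below.
dimval.spec: the dimension value ID; must be unique
dimval.dimension_name: the dimension name, not the dimension value name
dimval.display_order (optional): a numeric value that sets the display order of the dimension value. Lower values appear first; dimension values without this property appear after those that have it.
dimval.parent_spec: the ID of the parent dimension value; for a root dimension value, this is /
dimval.display_name: the dimension value name
dimval.match.use_spec (optional): whether dimval.spec is used to match property values. For range dimension values the default is false; otherwise the default is true.
dimval.search_synonym: a synonym for the dimension value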
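The ordering rule for dimval.display_order can be sketched in a few lines of Python. The records below are invented sample data written as plain dictionaries, and display_sort_key is a hypothetical helper; this only illustrates the rule described above and says nothing about how the MDEX Engine implements it.

```python
def display_sort_key(record):
    """Records with dimval.display_order sort first (ascending); the rest follow."""
    order = record.get("dimval.display_order")
    if order is None:
        return (1, 0.0)
    return (0, float(order))


# Invented sample dimension value records for a "Color" dimension.
dimvals = [
    {"dimval.spec": "red", "dimval.dimension_name": "Color",
     "dimval.parent_spec": "/", "dimval.display_name": "Red"},
    {"dimval.spec": "blue", "dimval.dimension_name": "Color",
     "dimval.parent_spec": "/", "dimval.display_name": "Blue",
     "dimval.display_order": "2"},
    {"dimval.spec": "green", "dimval.dimension_name": "Color",
     "dimval.parent_spec": "/", "dimval.display_name": "Green",
     "dimval.display_order": "1"},
]

ordered = [r["dimval.spec"] for r in sorted(dimvals, key=display_sort_key)]
print(ordered)
```

Green and blue come first in display_order sequence, and red, which has no display_order, is placed last.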
Record properties for range dimension values
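dimval.range.lower_bound (optional): the minimum value of the range
dimval.range.lower_bound_inclusive (optional): whether the lower_bound value itself is included in the range
dimval.range.upper_bound (optional): the maximum value of the range
dimval.range.upper_bound_inclusive (optional): whether the upper_bound value itself is included in the range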
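How the four range properties combine can be shown with a small Python check. This is a sketch under stated assumptions: in_range is a hypothetical helper, the sample record is invented, and treating a missing inclusive flag as "true" is a choice made for this example, not a documented default.

```python
def in_range(value, record):
    """Check a numeric value against the bounds of a range dimension value record."""
    lower = record.get("dimval.range.lower_bound")
    upper = record.get("dimval.range.upper_bound")
    # Assumption: a missing *_inclusive flag is treated as "true" in this sketch.
    lower_incl = record.get("dimval.range.lower_bound_inclusive", "true") == "true"
    upper_incl = record.get("dimval.range.upper_bound_inclusive", "true") == "true"
    if lower is not None:
        low = float(lower)
        if value < low or (value == low and not lower_incl):
            return False
    if upper is not None:
        high = float(upper)
        if value > high or (value == high and not upper_incl):
            return False
    return True


# Invented sample: prices from 10 (included) up to 20 (excluded).
ten_to_twenty = {
    "dimval.spec": "10-20",
    "dimval.dimension_name": "Price",
    "dimval.range.lower_bound": "10",
    "dimval.range.lower_bound_inclusive": "true",
    "dimval.range.upper_bound": "20",
    "dimval.range.upper_bound_inclusive": "false",
}
print(in_range(10, ten_to_twenty), in_range(20, ten_to_twenty))
```

Because both bounds are optional, a record with only a lower bound describes an open-ended range such as "20 and above".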
About automatically generating dimension values
CAS can automatically generate dimension values from the property values of data records. To enable this, set the dimension's isAutoGen to true, then run a full crawl that produces MDEX-compatible output. The Dimension Value Id Manager then generates the dimension value IDs.
2 Creating a CAS crawl to write MDEX-compatible output
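Any crawl can be configured to write MDEX-compatible output, but the most common approach is:
Create a Record Store Merger crawl to write the output. When it runs in full-crawl mode, the following happens:
1 Index configuration from multiple owners is merged
2 Dimensions, properties, precedence rules, and dimension value records are processed
3 Data records are processed
4 The configuration and records are written as MDEX-compatible output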
Chapter 4 CAS Command Line Utilities
The command syntax for executing the tasks is:
cas-cmd task-name [options]
You get the capabilities for a data source or manipulator by running the listModules task or the getModuleSpec task of cas-cmd.
cas-cmd.bat listModules -h localhost -p 8500
The getAllCrawlMetrics task retrieves a list of crawl IDs and their associated metrics:
cas-cmd getAllCrawlMetrics [-h HostName] [-p PortNumber] [-l true|false]
Getting the status of a crawl
cas-cmd getCrawlStatus -id CrawlName [-h HostName] [-p PortNumber] [-l true|false]
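When these tasks are called from a script, it helps to assemble the argument list in one place. The following Python sketch builds a cas-cmd command line using only the options shown above (-id, -h, -p, -l). The helper build_cas_cmd and the crawl name MyCrawl are hypothetical, and the -l value is passed through as true or false without interpreting it.

```python
def build_cas_cmd(task, crawl_id=None, host=None, port=None, l_flag=None):
    """Build an argument list for cas-cmd using only the options shown above."""
    args = ["cas-cmd", task]
    if crawl_id is not None:
        args += ["-id", crawl_id]
    if host is not None:
        args += ["-h", host]
    if port is not None:
        args += ["-p", str(port)]
    if l_flag is not None:
        # Passed through as-is; see the utility's own help for its meaning.
        args += ["-l", "true" if l_flag else "false"]
    return args


status_cmd = build_cas_cmd("getCrawlStatus", crawl_id="MyCrawl", host="localhost", port=8500)
print(" ".join(status_cmd))
```

On a machine where cas-cmd is on the PATH, the list could be handed to subprocess.run(status_cmd, check=True). Keeping it as a list rather than a single string avoids quoting problems with crawl names that contain spaces.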
Component Instance Manager Command-line Utility
Command-line options
The command syntax for executing the tasks is:
component-manager-cmd task-name [options]
The create-component task creates a Record Store instance:
component-manager-cmd create-component -n RecordStoreName -t RecordStore [-h HostName] [-p PortNumber] [-l true|false]
The delete-component task deletes a Record Store:
component-manager-cmd delete-component -n RecordStoreName
[-h HostName] [-p PortNumber] [-l true|false]
Listing components:
The list-components task lists all component instances that are managed by the Component Instance Manager.
component-manager-cmd list-components [-h HostName] [-p PortNumber] [-l true|false]
Listing types:
The list-types task lists all component types that are managed by the Component Instance Manager. Executing the task returns a list of all managed component types in the CAS Service. In this release, the only supported component type is RecordStore.
The syntax for this task is:
component-manager-cmd list-types [-h HostName] [-p PortNumber]
[-l true|false]
Record Store Command-line Utility
Command-line options
With one exception, the command syntax for executing the tasks is:
recordstore-cmd task-name [options]
Writing tasks:
The write task writes a list of records into a specified Record Store instance.
The syntax for this task is:
recordstore-cmd write -a RecordStoreInstanceName [-b] -f InputFile [-h HostName] [-l true|false] [-p PortNumber] [-r Type] [-x Id]
Reading tasks:
The read-baseline task reads the baseline records from a Record Store instance.
The syntax for this task is:
recordstore-cmd read-baseline -a RecordStoreInstanceName
[-c] [-f FileName.xml] [-g GenId] [-h HostName] [-l true|false]
[-p PortNumber] [-n NumRecs] [-x id]
Cleaning a Record Store instance:
recordstore-cmd clean -a RecordStoreInstanceName [-h HostName]
[-l true|false] [-p PortNumber]
Clearing the last read generation:
recordstore-cmd clear-last-read-generation -a RecordStoreInstanceName
-c ClientId [-h HostName] [-l true|false] [-p PortNumber] [-x Id]
Committing transactions:
recordstore-cmd commit-transaction -a RecordStoreInstanceName -x Id
[-h HostName] [-l true|false] [-p PortNumber]
Getting the configuration of a Record Store instance:
recordstore-cmd get-configuration -a RecordStoreInstanceName
-f FileName.xml [-h HostName] [-l true|false] [-n] [-p PortNumber]
Getting the ID of the last-committed generation:
recordstore-cmd get-last-committed-generation -a RecordStoreInstanceName [-h HostName] [-l true|false] [-p PortNumber] [-x Id]
Getting the last-read generation:
recordstore-cmd get-last-read-generation -a RecordStoreInstanceName
-c ClientId [-h HostName] [-l true|false] [-p PortNumber] [-x Id]
Setting the configuration of a Record Store instance:
recordstore-cmd set-configuration -a RecordStoreInstanceName
-f FileName.xml [-h HostName] [-l true|false] [-p PortNumber]
Listing generations:
recordstore-cmd list-generations -a RecordStoreInstanceName
[-h HostName] [-l true|false] [-p PortNumber]
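Several of the tasks above revolve around generations: each committed write produces a new generation, and each client remembers the last generation it read. The toy Python model below may help to picture how baseline reads, delta reads, and the last-read generation relate. It is not the Record Store implementation: the class and its methods are hypothetical, and transactions, deletes, and cleanup of old generations are left out.

```python
class ToyRecordStore:
    """Toy model of generations and per-client read state (not the CAS implementation)."""

    def __init__(self):
        self.generations = []   # each entry: {record id: record} written in that generation
        self.last_read = {}     # client id -> last generation number read

    def write_and_commit(self, records):
        # Each committed write becomes a new generation, numbered from 1.
        self.generations.append({r["Endeca.Id"]: r for r in records})
        return len(self.generations)

    def get_last_committed_generation(self):
        return len(self.generations)

    def get_last_read_generation(self, client_id):
        return self.last_read.get(client_id)

    def clear_last_read_generation(self, client_id):
        self.last_read.pop(client_id, None)

    def read_baseline(self, client_id):
        # Latest version of every record across all generations.
        latest = {}
        for gen in self.generations:
            latest.update(gen)
        self.last_read[client_id] = len(self.generations)
        return latest

    def read_delta(self, client_id):
        # Records written after the generation this client last read.
        start = self.last_read.get(client_id, 0)
        changed = {}
        for gen in self.generations[start:]:
            changed.update(gen)
        self.last_read[client_id] = len(self.generations)
        return changed


store = ToyRecordStore()
store.write_and_commit([{"Endeca.Id": "a", "v": 1}, {"Endeca.Id": "b", "v": 1}])
baseline = store.read_baseline("FORGE")
store.write_and_commit([{"Endeca.Id": "b", "v": 2}])
delta = store.read_delta("FORGE")
```

In this model the baseline read returns both records, while the delta read returns only the updated record b. Clearing the last-read generation for a client makes its next delta read start from the beginning again.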
Record properties generated by crawling
Common record properties
Endeca.Action:[UPSERT|DELETE]
Endeca.SourceType:[FILESYSTEM|WEB|CMS|EXTENSION]
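Endeca.Id: RECORD_IDENTIFIER; for a file system crawl this may be the path, and for a Web server it may be the URL
Endeca.SourceId: the data source name; it should match the crawlId in the crawl configuration file
Endeca.File.IsArchive: whether the file is an archive
Endeca.File.IsInArchive: whether the file was extracted from an archive
Endeca.File.Size: the file size in bytes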
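As a quick illustration of how these common properties might be used downstream, the Python sketch below groups records by Endeca.Action. The sample records are invented (including the paths and the source name MyCrawl), records are modeled as plain dictionaries, and split_by_action is a hypothetical helper and not part of any Endeca API.

```python
def split_by_action(records):
    """Group record IDs by their Endeca.Action value (UPSERT or DELETE)."""
    groups = {"UPSERT": [], "DELETE": []}
    for record in records:
        groups[record["Endeca.Action"]].append(record["Endeca.Id"])
    return groups


# Invented sample records from a file system crawl.
sample = [
    {"Endeca.Action": "UPSERT", "Endeca.SourceType": "FILESYSTEM",
     "Endeca.Id": "/data/a.pdf", "Endeca.SourceId": "MyCrawl",
     "Endeca.File.IsArchive": "false", "Endeca.File.Size": "2048"},
    {"Endeca.Action": "DELETE", "Endeca.SourceType": "FILESYSTEM",
     "Endeca.Id": "/data/old.doc", "Endeca.SourceId": "MyCrawl"},
]

groups = split_by_action(sample)
print(groups)
```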