- High speed streaming: software RAID of a few iSER disks (using tgt, may ...)
- Cluster of multiple standard nodes (no big storage nodes): Gluster
- Single storage node and several computational nodes: iSER + OCFS2
- A few big storage nodes: Lustre/Gluster/FhGFS ?

- According to the benchmarks on the net, iSCSI delivers performance on par
  with NFS. A few of my tests reported about 350 MB/s on storage capable of
  3.5 GB/s, though I have not done any tuning. So, we are only interested in
  iSER solutions for high performance systems.

Target: the server part of the protocol
Initiator: the iSCSI client

- According to the standard, multiple initiators may share a single target.
  This is supported by most of the stacks. However, the synchronization issues
  are up to the user. So, it is only usable with some higher-level clustering
  filesystem taking care of the locking (for example GFS2).

ietd - no iSER support (as of 28.11.2012)
scst - re-implementation of ietd (still no iSER according to the documentation)
stgt - iSER support, but known to be not that fast
lio/tcm - iSER support is expected in kernel 3.9

openiscsi - iSER is supported

- OpenSuSE does not include support for iSER. The package should be recompiled
  with ISCSI_RDMA=1 added to the make command.
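
  A sketch of the rebuild, assuming the tgt sources are already unpacked
  (the source directory name is hypothetical; ISCSI_RDMA=1 is the flag
  mentioned above):

      cd tgt
      make ISCSI_RDMA=1
      make install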

List the current configuration:
    tgtadm --lld iscsi --op show --mode target

When properly configured, you are expected to see:
    Target 1: iqn.ipepdvcompute2.ssd3500
        LUN: 0 (this is the virtual adapter LUN)
        LUN: 1 (this is the first published disk LUN)

Configuring (create the interface, allow access, and create the first disk):
    tgtadm --lld iser --mode target --op new --tid 1 --targetname "iqn.$(hostname).ssd3500"
    tgtadm --lld iser --mode target --op bind --tid 1 --initiator-address ALL
    tgtadm --lld iser --mode logicalunit --op new --tid 1 --lun 1 --backing-store /dev/disk/by-uuid/6eeffa6d-d61e-4157-8732-e1da39368325 --bstype aio

Save the running configuration:
    tgt-admin --dump > /etc/tgt/targets.conf
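
The dumped file can presumably be replayed later (e.g. at boot) with:
    tgt-admin --execute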

lio/tcm:
- The most recent and actively developed solution; selected for inclusion into
  the kernel. rDMA support is scheduled for integration in kernel 3.9.
- I checked their git tree; there is no iSER as of 11.2012.

Discovery (from the initiator):
    iscsi_discovery 192.168.11.5 -t iser -f
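
After discovery, a login sketch (target name and portal taken from the
examples above; the transport may already have been set by iscsi_discovery):
    iscsiadm -m node -T iqn.ipepdvcompute2.ssd3500 -p 192.168.11.5 \
        --op update -n iface.transport_name -v iser
    iscsiadm -m node -T iqn.ipepdvcompute2.ssd3500 -p 192.168.11.5 --login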

- There are multiple options affecting the performance of the system.
- LUN types: direct-store (raw devices) or backing-store (files, etc.)
- Backend (bs-type): rdwr (cached file access), aio (kernel AIO), sg (direct access)
- Write cache (write-cache): disabling is a good idea for high speed streaming
- Block size (block-size): 512 - 4096 (the target supports bigger sizes, but
  the initiator does not). With a non-standard size (not 512), O_DIRECT access
  on the client fails with ...
- Packet sizes on the target (MaxRecvDataSegmentLength, MaxXmitDataSegmentLength,
  FirstBurstLength, MaxBurstLength) and on the client
  (node.conn[0].iscsi.MaxRecvDataSegmentLength, node.session.iscsi.FirstBurstLength,
  node.session.iscsi.MaxBurstLength). If the segment lengths are set below the
  RAID block size (strip * n_disks), there are problems connecting to the
  target in aio mode.
- For high performance streaming we need: direct-store, the aio backend (sg is
  not working for me, but aio gives really good speed), and big buffers.
- On the target side, the read-ahead should be set to the strip size (in
  512-byte blocks). It also affects the writing speed:
    blockdev --setra 65536 /dev/sdc
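
  (Here 65536 is counted in 512-byte sectors, i.e. 65536 * 512 = 32 MiB of
  read-ahead.)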

/etc/tgt/targets.conf:

    <target iqn.ipepdvcompute2.ssd3500>
        <direct-store /dev/disk/by-uuid/6eeffa6d-d61e-4157-8732-e1da39368325>
        </direct-store>
        MaxRecvDataSegmentLength 2097152
        MaxXmitDataSegmentLength 2097152
        FirstBurstLength 8388608
        MaxBurstLength 8388608
    </target>
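
A sketch of the matching client-side settings (the parameter names are the
ones listed above; they go to /etc/iscsi/iscsid.conf, or can be set per node
with iscsiadm --op update):

    node.conn[0].iscsi.MaxRecvDataSegmentLength = 2097152
    node.session.iscsi.FirstBurstLength = 8388608
    node.session.iscsi.MaxBurstLength = 8388608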

GFS2 - Provides shared access to a single network block device (like iSCSI,
       NBD, DRBD) for multiple clients, providing access synchronization (with
       the in-kernel DLM manager). In the kernel since 2.6.19. However, the
       userland utilities are packaged only for RH.
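       A format/mount sketch (cluster name, lock table, journal count, and
       device are assumed):
           mkfs.gfs2 -p lock_dlm -t mycluster:shared0 -j 2 /dev/sdc
           mount -t gfs2 /dev/sdc /mnt/shared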
OCFS2 - Oracle's solution to handle network block devices in a cluster. The
        installation procedure sounds simpler than for GFS. SuSE is supported.
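        A format/mount sketch (label and device assumed; the o2cb cluster
        service must be configured and online first):
            mkfs.ocfs2 -L shared0 /dev/sdc
            mount -t ocfs2 /dev/sdc /mnt/shared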
Lustre - Consists of a Management Server (MGS) storing the configuration of
         the Lustre file system, Metadata Servers (MDS), and Object Storage
         Servers (OSS). Clients communicate directly with all these servers.
         The file system is reported to be very fast and scalable. It was
         developed by Sun and is currently controlled by Oracle. Seems to have
         native rDMA support. It seems to be the fastest system out there and
         it is especially ...
         The main disadvantage: it is not in the main-line kernel. A patched
         kernel and e2fsprogs are required (it is not possible to have just
         additional modules). Official patches are made only against RHEL and
         SLES. As of 11.2012 (3.6 is long out) the latest patches are against
         kernel 2.6.38. Also, it seems to be lacking integrated fault
         tolerance; 3rd party solutions are to be used below the OSS.
         Details: http://wiki.whamcloud.com/
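         A client mount sketch (MGS address and filesystem name assumed;
         @o2ib selects the InfiniBand/RDMA transport):
             mount -t lustre 192.168.11.5@o2ib:/lustre0 /mnt/lustre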
HDFS - File system for Apache Hadoop. In many respects similar to Lustre, but
       optimized for the use case where the data nodes are compute nodes as
       well. It is not POSIX compatible. Mounting is possible through a FUSE
       module, but its performance is suboptimal. There is 3rd party support
       for rDMA implemented over the JNI interface.
Gluster - Easy to set up clustering file system. It has only storage servers;
          all metadata is just rsynced between the nodes. The location of a
          file (or chunk) is determined by the file name (chunk number?). A
          storage node may declare any directory in its file system as part of
          the file system (i.e. it does not have to be a standalone mounted
          partition). RDMA support is included out of the box. The current
          version has performance problems with fast storage (i.e. it is fine
          interfacing multiple nodes with just a pair of hard drives, but slow
          if a single node with an attached RAID is used). However, nothing in
          the architecture prevents good performance. I guess it is only a
          matter of developing a good AIO-based storage module.
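          A minimal setup sketch (hostnames, brick paths, and volume name
          assumed):
              gluster peer probe node2
              gluster volume create vol0 transport rdma node1:/data/brick node2:/data/brick
              gluster volume start vol0
              mount -t glusterfs node1:/vol0 /mnt/vol0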
pNFS - Parallel NFS (part of the NFS 4.1 specification). There is a client
       implementation ready, but no server. Actually, pNFS is best considered
       as an access method for a real distributed filesystem, not as a
       complete solution in and of itself. There are ideas to wrap GFS2, etc.
       More details: http://linux-nfs.org
           modprobe.d: alias nfs-layouttype4-1 nfs_layout_nfsv41_files
           mount -t nfs4 -o minorversion=1 server:/filesystem /mountpoint
       No additional steps are needed to enable pNFS; if the server supports
       it, the client will automatically use it.
Ceph - Lustre-style. Has integrated fault tolerance and automated data
       migration to avoid hotspots. Merged in 2.6.34. RDMA seems to be
       unavailable at the moment. After a heavy fight with the authorization,
       I got about 100 MB/s (quite expected without rDMA).
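       A mount sketch (monitor address and cephx secret file assumed):
           mount -t ceph 192.168.11.5:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret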
PohmelFS - Consists of 3 components:
           * Elliptics - a p2p-based storage manager distributing data chunks
             to the nodes according to DHT tables (hashes). Unlike Gluster, it
             is fully dynamic; the nodes may come and go.
           * Multiple storage backends optimized for different types of data.
           * The PohmelFS kernel module (since 2.6.30) providing a file system
             on top of Elliptics. The core component here is the cache
             coherency management. It supports weak synchronization between
             mounted nodes, in the sense that data read/written into the local
             page cache is not synced with the storage ...
           - Elliptics is a Yandex development. However, the PohmelFS in the
             kernel is older and based on something else. It is not clear when
             the new one will hit it. Also, there are no known users of either
             the old or the new one. Why it ended up in the ...
MogileFS, WebDFS - Data is distributed through HTTP; metadata is stored in
          PostgreSQL. Standard drives serve as the storage nodes.
fhgfs - Fraunhofer file system. Quite unixish and easy to install. Management,
        meta, and data services; a client-side kernel module. Uses directories
        as data stores. RDMA support. Max sequential read/write of 500 MB/s
        per node (on an SSD RAID capable of 3.5 GB/s). No complete source is
        available, only the kernel module and a few libraries. There are
        builds for RHEL, SLES, and Debian. The SLES build works with
        OpenSuSE 12.2.
GPFS - A proprietary file system from IBM, in many ways similar to Lustre.
       Unlike Lustre, it has distributed metadata. There are also extra
       features like snapshots, etc.