OpenStack Summit LiveBlog – Swift Disk design session

Note: this is a live blog, so expect rough notes and loose formatting.

Disintermediated hardware: remove the middle layers between the application and the disk, and optimize the disk for object storage.

“Why did you go down this particular path?” – Other approaches only got us part of the way. Let me ask you: “Why does Swift use filesystems?” The world has used filesystems for a long time, but storage is delivered as block devices, and most applications don't use block directly.

What you gain by going this route is alleviating CPU load at scale by offloading work to the device.

The filesystem is a standard abstraction layer.

Seagate wants to move the world from sector addressing to something better matched to how storage is actually consumed.

Disaggregation – Allow drives to move out from behind a single storage server, so that any drive can serve any application.

“A full fabric of access.” I missed the part on striping, but I think Swift striping would need coordination between the Swift storage servers; this would allow the disks to be accessed from /all/ the things?

“Does this expose every volume on every server in the case of a break-out?” – There are some mitigations; we're getting there.

SAS allows two atomic operations, read and write on sectors; Kinetic gives you Get, Put, and Delete. SAS data is count × 512 bytes; a Kinetic key is 1 byte to 4 KiB and a value is 0 bytes to 1 MiB. Kinetic allows HMAC and SHA hashing to extend data-corruption detection and recovery, which also helps with dedupe.
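
To make the contrast with sector addressing concrete, here is a minimal in-memory Python sketch of those Get/Put/Delete semantics with the key/value size limits mentioned above. It is only an illustration of the semantics as I understood them in the session, not the Kinetic wire protocol or any real client library, and all the names in it are made up.

    # In-memory sketch of Kinetic-style key/value semantics (illustrative only).
    MAX_KEY_BYTES = 4 * 1024          # keys: 1 byte to 4 KiB
    MAX_VALUE_BYTES = 1024 * 1024     # values: 0 bytes to 1 MiB

    class KVDevice:
        def __init__(self):
            self._store = {}

        def put(self, key: bytes, value: bytes) -> None:
            if not 1 <= len(key) <= MAX_KEY_BYTES:
                raise ValueError("key must be 1 byte to 4 KiB")
            if len(value) > MAX_VALUE_BYTES:
                raise ValueError("value must be 0 bytes to 1 MiB")
            self._store[key] = value

        def get(self, key: bytes) -> bytes:
            return self._store[key]          # KeyError if the pair does not exist

        def delete(self, key: bytes) -> None:
            self._store.pop(key, None)

    dev = KVDevice()
    dev.put(b"account/container/object", b"hello")
    print(dev.get(b"account/container/object"))   # b'hello'
    dev.delete(b"account/container/object")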

If you plug these into the wrong interface, it won't make blue smoke. If you lose power during a Put and it has not completed, the data does not exist.

Interface port > key/value > authenticity & integrity > authorization > KV semantics > storage. The ancillary port only has access to drive management; you can poll and control fans, etc., without having to come down the main channels.

Keys have a flat namespace with lexicographic ordering: 1, 10, 11, 2, 21, 22, etc. The key schema is opaque to the device. Design pattern: high-order info bytes first, low-order last.
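
A quick Python illustration of that ordering and the "high-order info first" key-schema pattern; the Swift-style account/container/object key below is my own guess at how you might lay out a key, not anything shown in the session.

    # Byte-wise lexicographic ordering: "10" sorts before "2", as in the notes.
    keys = [b"2", b"11", b"1", b"22", b"10", b"21"]
    print(sorted(keys))   # [b'1', b'10', b'11', b'2', b'21', b'22']

    # Hypothetical key schema with the high-order (most significant) info first,
    # so related objects cluster together in the flat keyspace.
    def make_key(account: str, container: str, obj: str) -> bytes:
        return f"{account}/{container}/{obj}".encode()

    print(make_key("AUTH_test", "photos", "cat.jpg"))   # b'AUTH_test/photos/cat.jpg'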

K/V Access:

  • k/v object = {key, value, version, EDC} (version == an opaque blob of bits specific to this key)
  • key specifiers = {this key, or the key following it, or the one preceding it}
  • key range = [start, end], with inclusive or exclusive endpoints (see the sketch after this list)
  • P2P data copying: the drives can talk directly to each other. The application says “replicate”, but you don't have to draw the data up into a server and push it back down. This relieves bandwidth for replication.
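
Here is a small sketch of the key-specifier and key-range lookups from the list above, using a sorted in-memory key list and Python's bisect module. Function names like get_next and get_key_range are mine, chosen to mirror the notes, not the actual API.

    from bisect import bisect_left, bisect_right

    keys = sorted([b"a/1", b"a/2", b"a/3", b"b/1", b"b/2"])

    def get_next(key: bytes):
        """The key strictly following `key` in lexicographic order, if any."""
        i = bisect_right(keys, key)
        return keys[i] if i < len(keys) else None

    def get_previous(key: bytes):
        """The key strictly preceding `key`, if any."""
        i = bisect_left(keys, key)
        return keys[i - 1] if i > 0 else None

    def get_key_range(start: bytes, end: bytes, inclusive_end: bool = True):
        """All keys in [start, end], or [start, end) when inclusive_end is False."""
        lo = bisect_left(keys, start)
        hi = bisect_right(keys, end) if inclusive_end else bisect_left(keys, end)
        return keys[lo:hi]

    print(get_next(b"a/2"))               # b'a/3'
    print(get_previous(b"b/1"))           # b'a/3'
    print(get_key_range(b"a/1", b"b/1"))  # [b'a/1', b'a/2', b'a/3', b'b/1']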

Overwriting a key is one thing; deleting it is another. Whether or not the data actually goes away right away on a delete depends. Also, either the K/V pair is there or it is not; there is nothing in between.


Get Options

  • Do a media health scan: reduce the error correction applied by the media so we can decide whether it's imperfect and failing.

Put / Delete options

  • Can we cache this? – Per command, you can ask whether a thing has to be done the hard way or whether it can be cached.
  • Version requirement. Allows performance with consistency.

Put Only:

  • If version == old version, then the put will succeed: if it's version 10, replace it with the data from 11 and update to v11 – compare-and-swap (see the sketch after this list)
  • Create / Replace controls – Must / Always put
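
Here is my reading of the compare-and-swap put as a small in-memory Python sketch: the put only succeeds when the caller's expected version matches what the device currently holds, and a force flag stands in for the “always put” control. Again, illustrative only; the names and signatures are invented.

    class VersionMismatch(Exception):
        pass

    class VersionedStore:
        def __init__(self):
            self._store = {}   # key -> (version, value)

        def put(self, key, value, new_version, expected_version=None, force=False):
            current = self._store.get(key)
            if not force:
                current_version = current[0] if current else None
                if current_version != expected_version:
                    raise VersionMismatch(
                        f"expected {expected_version!r}, device has {current_version!r}")
            self._store[key] = (new_version, value)

    s = VersionedStore()
    s.put(b"obj", b"data v10", new_version=b"10")                          # create (no prior version)
    s.put(b"obj", b"data v11", new_version=b"11", expected_version=b"10")  # compare-and-swap replace
    try:
        s.put(b"obj", b"stale", new_version=b"12", expected_version=b"10") # loses the race, rejected
    except VersionMismatch as e:
        print("put rejected:", e)
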
Security

  • At-rest encryption
  • Instant secure erase: no more data comes off; rotate the key == data gone, at the device / Kinetic-device level
  • Message auth = HMAC: given the user's key, is it legit? Everything goes through the layers (see the sketch after this list)
  • Client authz, restricted ops, such as RO clients. Restricted keyspace: this tenant only gets to see these keys.
  • TLS – optionally
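
A quick sketch of what HMAC message authentication buys you, using Python's standard hmac and hashlib modules. The shared key and command layout here are made up for illustration; real Kinetic messages obviously look nothing like this string.

    import hashlib
    import hmac

    shared_key = b"per-client-secret"

    def sign(command: bytes) -> bytes:
        return hmac.new(shared_key, command, hashlib.sha256).digest()

    def verify(command: bytes, tag: bytes) -> bool:
        expected = hmac.new(shared_key, command, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)

    cmd = b"PUT some-key some-value"
    tag = sign(cmd)
    print(verify(cmd, tag))               # True: the command is authentic
    print(verify(b"PUT tampered", tag))   # False: rejected
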
Chassis Enclosure Svcs

  • Drive discovery, vibration, health, etc.

Cluster Mgmt

  • Get key list, logs, profiling
  • Cluster versions, failed keys, and state
