Tuesday 16 November 2021, 9:30 – 10:30
STFC SOC Technical Meeting Minutes
[David Crooks - DC] [Jonathan Churchill - JC]
[Greg Corbett - GC] [Alastair Dewhurst - AD]
[Anish Mudaraddi - AM] [Ian Collier - IC]
[Olivier Restuccia - OR] [James Adams - JA]
Weekly meeting with a rotating set of people depending on needs
Agenda for today: Updates on: physical aspect & network layout, procurement, deployment and AOB
DC: still in the same position as far as rack 245 is concerned. Cristina has requested help clearing it; will talk to AD and Martin Bly
Logical and physical diagrams are completed, will be circulated
Opensearch:
AM: the IAM issue is resolved (the Aquilon config was not taking effect: the key was changed but then overwritten). Test admin group is set up; need to talk about how to structure the grouping and what permissions people should have
DC: suggest that we want to put in a grouping now, but is something we need to give extended thought to as we proceed. Should define security group in IAM and have sub groups for different things (though this is out of scope for current status of deployment)
AM: now finishing the aquilon config for it, have set up cluster and will be done soon
GC: proceeding on assumption that SOC stuff will be its own archetype and we will move this config to this shared archetype when the moment comes
DC: cluster deployment and virtual cluster: still need to combine the existing cloud projects, then GC, OR and I will set up. Task will be to work out what we need and how to deploy the virtual cluster. Could OR set up a 30-minute meeting with myself and GC to discuss this?
After discussion with James Adams, will use a new archetype: secops. Will have a restricted set of admins, firewalls, and selinux turned off by default (with option to turn on). Goal is for this to be useful for SOC and other groups; things tested in the SOC deployment may find their way into other archetypes over time
Zeek development:
DC: looking at kafka after cloud consolidation
OR: perf monitoring dashboard setup, still need to add a few custom metrics, started looking at zeek logs to get familiar and see what data is available
DC: during Zeek Week, ESnet reported that they are testing a new network driver, DPDK, to break up traffic. It is much quicker than af_packet, splits network traffic across all CPU cores, and requires less config
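Note on splitting traffic across cores: as background for the DPDK item, a minimal Python sketch of the general idea of hashing each flow to a worker so that both directions of a connection reach the same core. This is an illustration only and does not use the DPDK, af_packet or Zeek APIs; the function names and the choice of CRC32 as the hash are assumptions made for the example.

```python
import zlib

def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Build a direction-independent key for a flow (hypothetical helper)."""
    # Sort the two endpoints so A->B and B->A produce the same key.
    lo, hi = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{lo[0]}:{lo[1]}-{hi[0]}:{hi[1]}-{proto}".encode()

def worker_for(src_ip, src_port, dst_ip, dst_port, proto, n_workers):
    """Pick a worker index in [0, n_workers) for a packet (hypothetical helper)."""
    # CRC32 is an assumption for the sketch; real drivers use their own hashes.
    return zlib.crc32(flow_key(src_ip, src_port, dst_ip, dst_port, proto)) % n_workers

# Both directions of one TCP connection map to the same worker.
forward = worker_for("10.0.0.1", 40000, "10.0.0.2", 443, "tcp", 8)
reverse = worker_for("10.0.0.2", 443, "10.0.0.1", 40000, "tcp", 8)
print(forward == reverse)
```

Keeping both directions of a flow on one worker matters for connection-level analysis, since each worker then sees whole connections rather than one side of them.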
DC: from Friday, AD has checked in with budget controller with estimate of 254K, which is on budget
Plan was dual 100Gbit for data ingest, 25Gbit for internal rack network and 1Gbit for firewall and outputs. 25Gbit cards have a longer lead-time so will get 4 100Gbit ports, which may be overkill but should mean we never have network speed issues
AD: For the ConnectX-6 cards from Dell, does the letter at the end matter?
JA: Yes, we want dx (the newest one), en (the standard one), will send a list of acceptable hardware
AD: lots of progress, we have a full part list, with reasonably accurate cost list, for the 25Gig cards there was a more than 65 day lead-time, some of the 100Gig cards are in stock. In terms of memory, may end up buying 2 machines with lots of memory and take half out and put into the old machines.
So now Zeek nodes will have 2 100Gbit cards instead of 25Gbit and 100Gbit, and will use a 100Gbit port instead of the splitter
DC: sounds good, could use the ports or also use splitters anyway to have the ports available
AD: do we have any information on PGs costs?
DC: not yet, will get in touch, need to setup another project meeting with AD, Paul and others.
The 254K, is that based on previous calculations or including recent discussions with DELL?
AD: includes recent discussions with DELL. Have multiple conversations ongoing with DELL, not just SOC orders; would be good to try to combine as much as possible
AD: Big Tier1 order is due in beginning of December so if need be could borrow some nodes from that
DC: need to put in a date to set up the first set of nodes in the rack, and dispose of existing hardware
AD: need to discuss this urgently with Martin Bly to get it added to the list. Need a hand-off from JC.
Rack currently doesn’t have a UPS power feed. About £1000 to get a UPS power feed to it; this would protect the rack from a 30s power glitch, but the rack will still turn off after 10-15 minutes (after the room gets too hot) if there were a power outage
DC: set up task list to work through
DC: combine existing cloud projects
DC: set up general project meeting with AD, Paul and others
DC: send email to Cristina, Martin Bly, JC and AD about rack 245 handover and addition of UPS power
OR: set up meeting with GC and DC to discuss the SOC virtual cloud setup
OR: dig into dpdk for splitting packets across multiple cores (also to talk to Jouker about)