Channel: StackStorm

Meetup: ChatOps San Francisco


Thursday, July 30, 2015
PagerDuty Office
501 2nd St., Suite 100
San Francisco, CA

Stormer James Fryman presented on “Managing IT outages with Icinga and StackStorm” at this month’s ChatOps San Francisco Meetup; check out his full presentation below.

EVENT WEBSITE

The post Meetup: ChatOps San Francisco appeared first on StackStorm.


DCD Internet 2015


July 30-31, 2015
San Francisco, CA

StackStorm CEO Evan Powell joined Eric Wells of Fidelity Investments and Grant Richard of Goldman Sachs as part of a panel addressing how to build successful data-driven data centers.

EVENT WEBSITE

The post DCD Internet 2015 appeared first on StackStorm.

StackStorm “Automation Happy Hour” (July 31, 2015)


Friday, July 31, 2015

The bi-weekly “Automation Happy Hour” is our way of connecting directly with the community to help solve automation challenges together.

Check out the July 31st discussion below, and as always, please feel free to follow us on Twitter at @Stack_Storm and tweet specific questions using #AskAnAutomator. We’ll do our best to answer your question during the next Happy Hour.

EVENT WEBSITE

The post StackStorm “Automation Happy Hour” (July 31, 2015) appeared first on StackStorm.

Auto-Remediation Defined


August 7, 2015
by Evan Powell

One thing I tried to do when helping kick off the “software-defined storage” craze some years ago was to define what we at Nexenta meant by that term. A number of analysts in the space were positive about our clarity, as were, more importantly, many users and partners.

I realized that while we’ve blogged here and there about what we mean at StackStorm by auto-remediation, we have not directly posited a canonical definition of it. People seem to grok that auto-remediation is a subset of event-driven automation; however, it is high time for us to have a single spot for our take on the definition. Without further ado, please read on and comment back here or via Twitter.

Auto-remediation is an approach to automation that responds to events with automations able to fix, or remediate, the underlying conditions. Remediation means more than simply clearing an alert; for example, it can mean ascertaining the scope of a problem through automated validation and investigation, recording the diagnosis in a ticketing system (and very often in a chat system and a logging system as well), and then taking a series of steps where each step’s completion or failure can be a prerequisite for the next step.

Components needed by auto-remediation software include the ability to listen to events, some notion of a rules engine to respond appropriately to those events, and a workflow engine to transparently execute often long-running automations composed of multiple discrete tasks tied together with conditional logic. Additionally, as discussed below, the human factors of auto-remediation are crucial as we build, and increasingly trust, autonomous systems to run ever more complex environments.
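To make those three components concrete, here is a minimal Python sketch (illustrative names only; this is not StackStorm’s actual API): an event arrives, a rules engine matches it, and a workflow engine runs tasks where each task’s success gates the next.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    matches: Callable[[dict], bool]   # criteria evaluated against the event
    workflow: List[Callable]          # ordered tasks to run when it matches

def run_workflow(steps, event):
    """Run tasks in order; a task's failure short-circuits the rest."""
    results = []
    for step in steps:
        ok, output = step(event)
        results.append(output)
        if not ok:
            break
    return results

def handle_event(event, rules):
    """The 'rules engine': fire every matching rule's workflow."""
    return [run_workflow(r.workflow, event) for r in rules if r.matches(event)]

# Example: a "service down" event triggers a two-task remediation.
rules = [Rule(matches=lambda e: e["type"] == "service.down",
              workflow=[lambda e: (True, "restarted " + e["host"]),
                        lambda e: (True, "verified " + e["host"])])]
result = handle_event({"type": "service.down", "host": "db1"}, rules)
# result == [["restarted db1", "verified db1"]]
```

The point of the sketch is only the shape: events in, rules deciding, workflows chaining conditional steps.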

Attempts at auto-remediation should recognize the challenges and limitations of prior attempts at closed-loop automation, most of which were at the time called “run book automation,” with leading solutions including Opalis, Tidal Software, RealOps, and others, most of which were purchased by large systems vendors. These limitations have included:

    • Challenges in authoring and maintaining both the necessary integrations and the automations themselves. Modern systems support infrastructure as code, so these artifacts are treated as code and hence can be authored and maintained far more easily; additionally, systems such as StackStorm can incorporate existing scripts, tie into the four leading configuration management systems, and draw on a large open source community with thousands of integrations already available.
    • A loss of context on the part of the human operators leading to a loss of trust; modern systems are radically transparent and proactively keep humans in the loop, for example by the automation system interacting with operators via chat as a peer to these operators or through advanced visualization techniques.
    • The risk of runaway automations or flapping. Any control system has to be able to control itself; auto-remediation systems must be able to limit responses to given sources of events, for example, both to ensure human error does not spawn a cycle of remediations remediating remediations and as part of defense in depth.
    • Last but not least, the ability to scale to today’s environments.  Prior systems automated much less dynamic environments that were orders of magnitude smaller than today’s; modern auto remediation needs to scale horizontally and typically incorporates a message queue and other techniques to achieve this scale.
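As a sketch of the “control itself” point above, here is a hypothetical throttle that caps how many times a given event source may be auto-remediated inside a time window before handing off to a human (the class name, limits, and window are invented for illustration):

```python
import time
from collections import defaultdict, deque

class RemediationThrottle:
    """Allow at most max_runs remediations per event source per time window."""

    def __init__(self, max_runs=3, window_seconds=600):
        self.max_runs = max_runs
        self.window = window_seconds
        self.history = defaultdict(deque)   # source -> recent run timestamps

    def allow(self, source, now=None):
        now = time.time() if now is None else now
        runs = self.history[source]
        while runs and now - runs[0] > self.window:
            runs.popleft()                   # forget runs outside the window
        if len(runs) >= self.max_runs:
            return False                     # stop looping; escalate to a human
        runs.append(now)
        return True
```

A guard like this sits in front of the rules engine, so a flapping check, or a remediation that itself emits events, cannot trigger an unbounded cascade.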

Successful auto-remediation systems include Facebook’s Auto Remediation, or FBAR, and WebEx Spark’s Bootstrap 2.0. More information about these systems is available here for Facebook (although you would have learned more from the recent event-driven automation meetup) and here for WebEx’s Spark (disclosure: it leverages StackStorm; from a later talk at the same meetup).

You can read much more about example uses of event-driven automation, and specifically auto-remediation, on the StackStorm site. For now, suffice it to say that use cases for auto-remediation range from providing resilient environments for your Cassandra cluster and other key components (more on that at the upcoming Cassandra Summit) to responding to a broad and ever-changing set of cyber intrusions at banks and other large targets. A good resource for the latter use case, including a demo, is a talk given at BSides this spring by our own Tomaz Muraus.


Please help us solidify this definition.  Any and all feedback is welcome.

 

The post Auto-Remediation Defined appeared first on StackStorm.

User story: StackStorm, Workflows, and ChatOps


August 14, 2015
by Joe Topjian

Introduction

For the past nine months or so, some of us at Cybera have been using a system called StackStorm. StackStorm is a very powerful tool that provides a hub for building automated workflows. That’s a pretty vague description, but StackStorm’s power comes from its amorphous character.

Initial Steps

A core feature of StackStorm is the ability to store a library of “commands”. These commands can be anything: creating a ticket in Jira, executing an action on a remote server, doing a Google search — anything. We already had our own library of everyday commands, so our first task was to port this library into StackStorm. This process felt awkward. It quickly became obvious that most of our commands were focused on single-phase information reconnaissance. StackStorm seemed to work best with multi-phase workflows. The StackStorm team was very receptive to this feedback and worked with us on some simple changes that made our library a bit less awkward to use.

Once we had our original library stored inside StackStorm, we then began exploring how we could change our library to take advantage of StackStorm’s other features. It was around this time that I came across their original ChatOps instructions.

ChatOps

A quick Google search shows that ChatOps is a term that came out of GitHub. It’s a methodology that enables collaborative development and troubleshooting through a “chat” medium such as IRC or Slack.

This sounded like an interesting feature to explore. Instead of having each team member running the same command in one window, getting the same output, and then discussing the interpretation of the results in another window, we could just do everything in one window. It sounded like such an obvious thing to do.

The above-mentioned instructions describe how to integrate StackStorm with Slack. Fortunately, Slack is what we use at Cybera, so the process of integration was quite easy.

Once it was set up, the benefits were immediately obvious. On the same day that integration was in place, we held our weekly team meeting for the Rapid Access Cloud. Our meetings usually involve everyone sitting around a table with their laptops. Whenever the topic of a certain project came up (how many virtual machines the project was using, if a new project had begun using the Rapid Access Cloud, etc), someone would run a command in Slack that would print the report for all of us to see.

Instead of:

“Can someone lookup how many instances that project is using?”

pause

“It looks like they’re using three.”

“How big are the instances?”

pause

wash, rinse, repeat.

There was now:

“So as everyone can see in Slack, that project is using three instances.”


ChatOps as a Catalyst

ChatOps integration was the key to our library modifications. It allowed us to see how our original monolithic reports could be broken down into smaller atomic pieces. These pieces are then mixed and matched like LEGOs, building multi-phase workflows that either help us collaborate in Slack or do some behind-the-scenes automations.

LEGO-building may be the best way to describe how we’re currently using StackStorm. StackStorm provides a repository of community-contributed packs. By using these packs in conjunction with our own in-house Cybera-specific pack, we can build different workflows and actions for our different projects.

Workflows

“Workflows” has been mentioned several times already and it deserves more explanation. A great, in-depth article on workflows can be found here. Basically, a workflow can be thought of as a multi-step process. We’ve only scratched the surface of using workflows, but can already see their power.

As an example, we have a command that will generate a report of a project’s usage in the Rapid Access Cloud. This command accepts a project in its unique ID form.

Projects are internally referenced by a unique ID such as “9aa5f9f66b4b417d84d778a23acdf45b”, as well as a common name like “jtopjian”. When referring to projects in conversation, it’s easier to use the latter form. However, for automated processes, the unique ID is used. This is because a project can sometimes have special characters in the common name, or even the same common name as another project. The unique ID will always have alpha-numeric characters and be guaranteed to be unique.

So if I want to run a report of the “jtopjian” project, I first need to look up the unique ID and use that to run the report command. Why not just combine the two steps into a workflow?

Step 1: Take a project’s common name as an input and output the unique ID.

Step 2: Take the unique ID from step 1 as input, run the reporting command, and print the result as output.
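The two steps can be sketched in Python, with hypothetical stand-ins for the real commands (the project ID is the example one from above; the usage number is invented):

```python
# Hypothetical lookup tables standing in for the real OpenStack-backed commands.
PROJECTS = {"jtopjian": "9aa5f9f66b4b417d84d778a23acdf45b"}  # common name -> unique ID
USAGE = {"9aa5f9f66b4b417d84d778a23acdf45b": 3}              # unique ID -> instance count

def lookup_project_id(common_name):
    """Step 1: common name in, unique ID out."""
    return PROJECTS[common_name]

def usage_report(project_id):
    """Step 2: unique ID in, human-readable report out."""
    return f"project {project_id} is using {USAGE[project_id]} instances"

def report_workflow(common_name):
    """The workflow: step 2 consumes step 1's output."""
    return usage_report(lookup_project_id(common_name))
```

In StackStorm the same chaining happens in workflow definitions rather than in one script, which is what lets the two steps live in different languages.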

Even more beneficial is that Step 1 is written in Ruby, our common language for internal tools. The reporting tool is written in Golang (as an exercise to further explore this language). So pieces that make up workflows can be completely unrelated.

Conclusion

StackStorm has been a very exciting tool for us. It’s enabled us to discover new ways of collaborating as well as building automated workflows for our environments. As we continue to use it every day, we look forward to the new discoveries we’ll make.

References

The Return Of Workflows
So, What is ChatOps? And How do I Get Started?
ChatOps at GitHub
StackStorm And “ChatOps For Dummies”
Enhanced ChatOps From StackStorm
Integrating ChatOps With StackStorm
Ansible and ChatOps. Get started

The post User story: StackStorm, Workflows, and ChatOps appeared first on StackStorm.

0.13 released! Time to upgrade!


August 26, 2015
by Lakshmi Kannan

We are excited to announce another release of StackStorm. 0.13 comes with some great features, user contributions, and many bug fixes. It’s definitely worth upgrading, and the upgrade should be uneventful. If you are trying us out for the first time, use the shiny GUI installer!

You can bring your own box, or use an AWS AMI, a VMware VMDK, or Vagrant as the base box, and kick off the (beta) installer after provisioning.

Please ask for support if you face issues!

Speaking of which, if you need help, a great place to get it is our slack community. If you haven’t registered yet, sign up here.

If you are going into production with StackStorm, we do have support and professional services options that most of our known production users are leveraging. Sorry for the sales pitch; read more here: http://stackstorm.com/services/


User contributions

Itxaka added support for an OpenStack authentication backend to StackStorm auth. You can now use OpenStack authentication if you already have OpenStack installed alongside StackStorm. This gives you one more opportunity to try out StackStorm’s OpenStack pack. Thanks, Itxaka!

Highlights

0.13 brings you support for tracing. You can now add a trace tag to every manual execution request, or add a trace tag to a trigger dispatched from a sensor. You can then use the trace tag to see which rules were fired and which executions ran. This is a wonderful debugging tool and also improves visibility into what’s happening inside StackStorm. It was a popular request from some of our advanced customers, who want the ability to correlate external events to StackStorm events. Please try it out and let us know how you like it! The docs should get you started. We are open to feedback and comments.
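As an illustration of what trace tags enable, here is a hypothetical sketch (the record fields are invented, not StackStorm’s actual schema): given execution records that carry a trace tag, you can pull out everything a single external event caused.

```python
def executions_for_trace(executions, trace_tag):
    """Collect every execution that carries the given trace tag."""
    return [e["action"] for e in executions if e.get("trace_tag") == trace_tag]

# Invented execution history: two executions caused by one external event,
# plus one unrelated execution.
history = [
    {"action": "cassandra.replace_host", "trace_tag": "node-down-42"},
    {"action": "slack.post_message",     "trace_tag": "node-down-42"},
    {"action": "core.local",             "trace_tag": "deploy-7"},
]
# executions_for_trace(history, "node-down-42")
# == ["cassandra.replace_host", "slack.post_message"]
```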

We are moving away from the Fabric-based SSH runner to a home-grown SSH runner built on top of Paramiko. Thanks to the libcloud project and our own Tomaz Muraus for the initial implementation. This should address a lot of the issues we were seeing with parallel execution of actions on multiple hosts. The new SSH runner also lets you run SSH actions as a different user-and-credentials combination than the system user. The new runner is enabled by default and should be fully compatible with the previous-generation SSH runner. Try it out and let us know how it works for you! You can also revert to the old runner by setting use_paramiko_ssh_runner to false in the config.
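The parallel-execution idea can be sketched like this (a simplified illustration: the real runner speaks SSH via Paramiko, while here the “command” is just any callable applied per host):

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_hosts(hosts, command):
    """Run one command callable against many hosts concurrently,
    returning a per-host result map (map() preserves host order)."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return dict(zip(hosts, pool.map(command, hosts)))
```

The per-host fan-out, rather than a serial loop, is what the new runner improves for actions targeting many hosts.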

Support for clustered RabbitMQ is now available! This was also a popular request from our advanced customers who already have a clustered RabbitMQ setup and want to use it with StackStorm. As a StackStorm user, you won’t have to make any configuration changes.

We now have X-Request-ID tagging of API requests as an HTTP header. This will help you debug API errors and track them in the logs using the ID (grep {ID} /var/log/st2/st2api.log). Look for that header in API responses.

We also added support for re-spawning sensors on failures or exceptions. Sensors may hit the APIs of external services, process the results, and inject triggers into StackStorm. Any sensor interacting with an external service or system could crash for a multitude of reasons; if that happens, the sensor process might die. Those sensor processes will now be re-spawned by the sensor container on crash or failure, up to a maximum of two times.
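The re-spawn policy can be sketched as a small supervisor loop (illustrative only; real sensors run as separate processes managed by the sensor container):

```python
def run_with_respawn(sensor, max_respawns=2):
    """Run a sensor callable; on crash, re-spawn it at most max_respawns times."""
    respawns = 0
    while True:
        try:
            return sensor()
        except Exception:
            if respawns >= max_respawns:
                raise            # out of re-spawns; let the failure surface
            respawns += 1        # re-spawn and try again

# Example: a sensor whose external API is down for the first two attempts.
calls = {"n": 0}
def flaky_sensor():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("external API down")
    return "trigger injected"
```

With two re-spawns allowed, the flaky sensor above eventually succeeds on its third run; a sensor that keeps crashing surfaces its error after the budget is spent.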

Bug fixes

A lot of bug fixes went into the 0.13 release. Some critical bugs were fixed around API responses, the sandboxed Python environment path, etc. Also, a potential security issue involving injection of arbitrary code when supplying positional parameters to scripts was fixed. An upgrade is highly recommended.

We also have several additional features and plans coming up as we gear up for our 1.0 release later this year, including Role Based Access Control. Be sure to take a look at our Roadmap and stay subscribed to our newsletter.

The post 0.13 released! Time to upgrade! appeared first on StackStorm.

Turning Java App Into StackStorm Action


September 11, 2015
by Dmitri Zimine

A StackStorm user with a large investment in Java asked us: “Can I turn my Java code into StackStorm actions, and how?”

The answer is “Yes you can, in three basic steps”:


  1. Wrap the Java code in a Java console application;
  2. Take the input as command-line arguments;
  3. For best results, output formatted JSON to stdout/stderr – this way StackStorm will auto-parse it, so you can reference the results with dotted.notation in workflows.

But before diving into the details: how does StackStorm leverage Java assets? They become part of the automation library, with a unified API, CLI, and UI. You combine them via workflows and rules with other actions and sensors – thousands on st2community and through integration with Chef, Puppet, Git, monitoring, and others. They become part of your auto-remediation, continuous deployment, security response, or other solutions. Last but not least, StackStorm’s native ChatOps integration makes your Java actions runnable from Slack, HipChat, or IRC with a few lines of configuration. And if you are not using Java, check out Actions of All Flavors – an excellent tutorial on turning “any” script into an action.

Now let’s get technical and dive into the step-by-step details.

I am using the default pack in the example below; adjust accordingly if you are creating your Java action in a new pack.

1. Sample Java app

This is the simplest Java app that fits our bill: it uses a third-party Java library, takes CLI arguments, and spits out JSON:

View this code snippet on GitHub.

I place MyApp.java into /opt/stackstorm/packs/default/actions/, and the dependency json-simple-1.1.1.jar under /opt/stackstorm/packs/default/actions/lib. You can keep them wherever you like, and adjust the paths on the next step, when defining action metadata.

Compile and make sure it runs:

$ javac -cp lib/*:. MyApp.java
$ java -cp lib/*:. MyApp foo bar
{"args":["foo","bar"],"description":"array of arguments"}

2. Create action

View this code snippet on GitHub.

As you can see, this is ordinary action metadata. A few notes on the “secret sauce” that makes it all work:

  1. I use the `local-shell-cmd` runner. It runs an arbitrary command, specified by the `cmd` parameter; in our case, the command invokes the Java application. Making the `default` value of `cmd` `immutable` effectively hardcodes it.
  2. `p1` and `p2` are the parameters I define for the action; they are mapped to the input of MyApp in the `cmd` parameter.
  3. `cwd`, “current working directory”, is where the command will be executed.
  4. `env` lets me add environment variables to the execution context. That’s exactly what I need to pass `CLASSPATH`. Given that I already set up the `cwd` as the directory where MyApp is located, classpath is relative to it: `lib/*:.`.
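Putting those notes together, the jaction.yaml metadata might look roughly like the following. This is a hedged reconstruction from the notes above, since the actual snippet is embedded from GitHub; the exact field values are illustrative:

```yaml
---
name: jaction
description: Run the MyApp Java console application.
runner_type: local-shell-cmd
enabled: true
parameters:
  cmd:
    type: string
    default: "java MyApp {{ p1 }} {{ p2 }}"
    immutable: true          # hardcodes the command (note 1)
  p1:
    type: string
    required: true           # mapped into cmd (note 2)
  p2:
    type: string
    required: true
  cwd:
    type: string
    default: "/opt/stackstorm/packs/default/actions"   # where MyApp lives (note 3)
    immutable: true
  env:
    type: object
    default:
      CLASSPATH: "lib/*:."   # relative to cwd (note 4)
    immutable: true
```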

Create the action with the following command:

st2 action create /opt/stackstorm/packs/default/actions/jaction.yaml

3. Run action

View this code snippet on GitHub.

Yes, that’s it!

And note: the output is parsed as JSON – to confirm, run st2 execution --json.

Arbitrarily complex Java code will follow this exact pattern: wrap it in a console app, pass the parameters, reference its location in the action metadata, supply CLASSPATH using env, and, optionally, emit JSON so you can conveniently reference the output in workflows.

Dev note

Creating a Java action in StackStorm is not hard once you know how. But let’s admit it: it is not intuitive. The right way to do it is to create a Java runner that takes care of all the mechanics for you, and to contribute this runner to st2.

If someone in the community has the cycles to do it before we get our hands on this task, please let us know; we will guide, help, review, and gladly accept. Check out the PR where our HP friends contributed a CloudSlang runner for hints and directions.

What’s next

Explore the other tutorials. Check out “Actions” documentation, and help improve where we miss out.

Come see us in IRC – #stackstorm on freenode.org, or join the stackstorm-community on Slack for live discussion on this and other topics (register here).

Enjoy using StackStorm!

The post Turning Java App Into StackStorm Action appeared first on StackStorm.


Auto-remediating bad hosts in Cassandra cluster with StackStorm


September 23, 2015

by Lakshmi Kannan

If “SLAs,” “five-nines uptime,” “pager fatigue,” and “customer support” are phrases you use every day in your work, you know by now that auto-remediation is a serious use case. If you are running critical infrastructure of any kind, you may already be looking into auto-remediation, or even using it, like Facebook, LinkedIn, and Netflix (more on that later). The idea is that if you are running critical systems of any kind, you need to see when events happen and act on them as fast as humanly possible. Actually, no: to improve mean time to recovery, you need to respond FASTER than humanly possible.

If I start to sound like “automate everything,” that’s because I am saying just that in less direct terms. As a developer and an operations person, I enjoy automation as much as anyone. StackStorm is the machine you want! I wish a system like StackStorm had existed in my previous gigs, so I could have focused more on automating than on inventing a platform to perform the automation. I also wish I had slept more, instead of being woken up by pagers and phones telling me about a problem that could have been remediated by a piece of code!

What are you going to show me this time?

This post – and this automation – is inspired by one of our largest users of StackStorm for Cassandra auto-remediation: Netflix. They are speaking about how they manage Cassandra at the Cassandra Summit, so you can learn more about their overall approach there, and you can spend quality time with them at an upcoming meetup at their HQ as well.

In this blog post, I’ll walk you through a really common problem – a bad host in a Cassandra cluster – and how you can leverage StackStorm to auto-remediate it. I have been a happy Cassandra user for a long time, and I definitely enjoy the solid performance it provides applications. We also use it at StackStorm in our yet-to-be-released analytics platform. StackStorm complements Cassandra well by handling the operations.

Cassandra host replacement – manual style!

A common problem is that a host in the Cassandra ring dies and you are now running in low-availability mode. Though Cassandra can deal with single-node outages (depending on how you set up the cluster), you ideally want to replace the dead host with a new one ASAP. Typically, a monitoring system watches the cluster for dead nodes (nodetool status) and sends an event upstream. An alerting system then decides whether the alert warrants a page. If so, a system like PagerDuty is used to wake up some poor soul. I say “wake up” with confidence because I honestly can’t remember a failure that happened during regular business hours. Maybe the computers feel lonely after work hours, but I digress ;). A DevOps engineer then validates that the alert is indeed a true positive, picks a replacement node, spins up a VM, deploys Cassandra on the box (depending on whether you use Chef, Puppet, or StackStorm, Cassandra may be deployed automatically), and then invokes a set of procedures to replace the node. Let’s walk through the actual steps involved in replacing a node; they are listed in this runbook from our friends at DataStax. As you can see, it’s a six-step process, and it requires an engineer to monitor the system carefully to make sure things are going according to plan. That steals a few hours of an engineer’s time that could be spent on something more productive.

Cassandra host replacement – StackStorm style!

StackStorm has got your back! In the StackStorm world, the node-down event from the monitoring system goes to StackStorm; StackStorm then acts on the event, i.e., it spins up a new VM with Cassandra deployed and replaces the dead host with the new host, following the runbook codified as a workflow (infrastructure as code!). If StackStorm fails for any reason, it relays the alert to a human who can decide what to do. A tool with this power will save you a great deal of pain and effort.
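Wiring the monitoring event to the workflow is done with a StackStorm rule. The sketch below is hypothetical: the trigger type, criteria, and payload fields depend entirely on your monitoring pack; only the action ref and its parameters come from the workflow shown later in this post.

```yaml
---
name: "cassandra_auto_replace_dead_node"
description: "Kick off host replacement when monitoring reports a dead Cassandra node."
enabled: true
trigger:
  type: "monitoring.node_down"          # hypothetical trigger from your monitoring pack
criteria:
  trigger.service:
    type: "equals"
    pattern: "cassandra"
action:
  ref: "cassandra.replace_host"
  parameters:
    dead_node: "{{ trigger.host }}"            # payload fields are assumptions
    replacement_node: "{{ trigger.spare_host }}"
    healthy_node: "{{ trigger.healthy_host }}"
```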

I am in! Can I see the code now?

Yes, you can see the code. We support infrastructure as code 100%. However – you can also see the automation via our brand new FLOW.

FLOW in action

Here’s that code (actually YAML – this syntax is emerging as a de facto standard, but that’s another discussion for another time):

version: '2.0'

cassandra.replace_host:
    description: A basic workflow that replaces a dead cassandra node with a spare.
    type: direct
    input:
        - dead_node
        - replacement_node
        - healthy_node
    output:
        just_output_the_whole_workflow_context: "<% $ %>"
    tasks:
        is_seed_node:
            action: cassandra.is_seed_node
            input:
                hosts: "<% $.healthy_node %>"
                node_id: "<% $.dead_node %>"
            publish:
                seed_node: "<% $.is_seed_node.get($.healthy_node).stdout %>"
            on-success:
                - abort_replace: "<% $.seed_node = 'True' %>"
                - create_vm: "<% $.seed_node = 'False' %>"
                - error_seed_node_determination: "<% not $.seed_node in list('False', 'True') %>"
            on-error:
                - error_seed_node_determination
        abort_replace:
            action: slack.post_message
            input:
                channel: "#dsedemo"
                message: "```[CASS-REPLACE-HOST] [<% $.dead_node %>] STATUS: FAILED REASON: SEED NODE DEAD. NOT HANDLED. ABORTED.```"
            on-complete:
                - fail
        error_seed_node_determination:
            action: slack.post_message
            input:
                channel: "#dsedemo"
                message: "```[CASS-REPLACE-HOST] [<% $.dead_node %>] STATUS: FAILED REASON: SEED NODE DETERMINATION FAILED.```"
            on-complete:
                - fail
        create_vm:
            action: core.local  # You can call your create_vm workflow here!
            input:
                cmd: "echo Replacing <% $.dead_node%> with <% $.replacement_node %>"
            on-success:
                - stop_cassandra_service
            on-error:
                - notify_create_vm_failed
        notify_create_vm_failed:
            action: slack.post_message
            input:
                channel: "#dsedemo"
                message: "```[CASS-REPLACE-HOST] STATUS: FAILED REASON: create_vm_with_role failed.```"
            on-complete:
                - fail
        stop_cassandra_service:
            action: cassandra.stop_dse
            input:
                hosts: "<% $.replacement_node %>"
            on-success:
                - remove_cass_data
        remove_cass_data:
            action: cassandra.clear_cass_data
            input:
                data_dir: "/var/lib/cassandra"
                hosts: "<% $.replacement_node %>"
            on-success:
                - remove_replace_address_jvm_opt_if_exists
        remove_replace_address_jvm_opt_if_exists:
            action: cassandra.remove_replace_address_env_file
            input:
                hosts: "<% $.replacement_node %>"
            on-success:
                - set_jvm_opts_with_replace_address
        set_jvm_opts_with_replace_address:
            action: cassandra.append_replace_address_env_file
            input:
                dead_node: <% $.dead_node %>
                hosts: <% $.replacement_node %>
            on-success:
                - start_cassandra_service
        start_cassandra_service:
            action: cassandra.start_dse
            input:
                hosts: "<% $.replacement_node %>"
            on-success:
                - notify_replace_host_started
        notify_replace_host_started:
            action: slack.post_message
            input:
                channel: "#dsedemo"
                message: "```[CASS-REPLACE-HOST] [<% $.dead_node %>] STATUS: STARTED```"
            on-success:
                - wait_for_read_ports_to_open
            on-error:
                - wait_for_read_ports_to_open
        wait_for_read_ports_to_open:
            action: cassandra.wait_for_port_open
            input:
                check_port: 9042
                server: "<% $.replacement_node %>"
                hosts: "<% $.replacement_node %>"
                timeout: 1800
            on-success:
                - remove_replace_address_env_file
            on-error:
                - notify_replace_host_failed
        remove_replace_address_env_file:
            action: cassandra.remove_replace_address_env_file
            input:
                hosts: "<% $.replacement_node %>"
            on-success:
                - notify_replace_host_success
        notify_replace_host_success:
            action: slack.post_message
            input:
                channel: "#dsedemo"
                message: "```[CASS-REPLACE-HOST] [<% $.dead_node %>] STATUS: SUCCEEDED```"
        notify_replace_host_failed:
            action: slack.post_message
            input:
                channel: "#dsedemo"
                message: "```[CASS-REPLACE-HOST] [<% $.dead_node %>] STATUS: FAILED REASON: BOOTSTRAP TIMED OUT.```"

Runbook as code

We picked a simple case where a dead node is first checked to see whether it is a seed node. Seed nodes are special in Cassandra, and for the sake of simplicity we do not handle seed node replacements. Instead, as you can see, we abort the workflow if the node is a seed node. If it is not, then we follow the steps listed in the runbook one at a time. Each step in the runbook is codified as a simple action in StackStorm and placed in the appropriate pack. For a brief intro to packs, read our documentation and check out the community packs available. All the actions listed in the Mistral workflow are part of the Cassandra pack.

As you might guess, some of the actions need to happen on remote boxes. To be precise, they need to happen on the new box being spun up. It goes without saying that StackStorm needs passwordless SSH access (via keys) to this new box. You might also notice the notify tasks in the workflow. They post notifications to a specified channel in Slack. This gives you visibility into the status of the workflow as it’s being run! Alternatively, instead of adding a notify task after each step, you can configure the whole workflow to notify Slack. This is one of many ways we’ve made ChatOps a first-class citizen, but I digress. There is a great blog about simplifying and extending ChatOps with StackStorm here.

Hey, so how do I run the codified runbook?

StackStorm ships with a CLI and, in the community version, a lovely UI. The Cassandra pack comes with a nodetool action that you can use to run nodetool commands on the cluster. For example:

st2 run cassandra.nodetool hosts=10.0.2.247 command='status'

You can also kick off the replace host workflow manually using the CLI.

st2 run cassandra.replace_host \
    dead_node='10.0.2.246' \
    healthy_node='10.0.2.247' \
    replacement_node='st2-dse-demo-replacement001' -a

Here’s a juicy screenshot of using the UI to run the nodetool action.

nodetool status via st2

Whoa! That’s cool! How do I fully automate it?

StackStorm has a concept of rules. Rules connect triggers (external events) to actions or workflows registered with StackStorm. Triggers can simply be webhooks. For active polling of an external system and other use cases, you might want to look at sensors. You could set up your external monitoring system, such as Sensu or New Relic, to post a webhook to StackStorm. A webhook is registered by creating a rule.

See the sample rule definition below.

---
    name: "replace_dead_host"
    pack: "cassandra"
    description: "Rule to handle Cassandra node down event."
    enabled: true

    trigger:
        type: "core.st2.webhook"
        parameters:
            url: "cassandra/events"

    criteria:
        trigger.body.event_type:
            type: "iequals"
            pattern: "cass_node_down"

    action:
        ref: "cassandra.replace_host"
        parameters:
            dead_node: "{{trigger.body.node_ip}}"
            healthy_node: "10.0.2.247"  # usually you'd get this from consul or etcd.
            replacement_node: "st2-dse-demo-replacement001"  # usually you'd get this from consul or etcd.

See how we dropped straight to the code there? Pretty cool, huh. Of course, you can do the same thing via the UI, which spells out the IF and THEN relationship on the rules page.

Rules UI

The complete webhook URL is https://stackstorm_host.com/v1/webhooks/cassandra/events. Whenever a webhook is received, the rule engine in StackStorm checks whether the event_type is cass_node_down. If it is, it invokes the replace host workflow with the dead node’s IP address (obtained from the webhook payload) and a replacement node. An example payload that you would POST to the URL looks like this:

{
  "event_type": "cass_node_down",
  "node_ip": "10.0.2.247"
}
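If you want to sanity-check a payload against the rule before wiring up your monitoring system, note that the iequals criteria comparison is just a case-insensitive string match. Here is a quick sketch in Python (the helper function is ours for illustration, not part of StackStorm):

```python
import json

def matches_replace_rule(payload):
    """Mirror the rule's criteria: trigger.body.event_type iequals 'cass_node_down'."""
    return payload.get("event_type", "").lower() == "cass_node_down"

# Same shape as the example webhook body above; iequals ignores case
event = json.loads('{"event_type": "CASS_NODE_DOWN", "node_ip": "10.0.2.247"}')
print(matches_replace_rule(event))  # -> True
```

Anything that fails the criteria check is simply ignored by the rule, so an unexpected event_type will never fire the workflow.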

In the real world, you probably have spares available with Cassandra pre-installed. If so, then you might include an action that picks a replacement from the set of spares (perhaps via a Consul integration) and uses it in the workflow. If you want to spin up a node as part of the workflow, you can do so as shown above; as you'd expect, this will be slower. So now you have things wired up and good to go!

Pretty neat! How do I write my own workflows?

If you think this is interesting, imagine remediations for all the important pieces of your environment that could benefit from a few steps being run in response to an event. Take a look at the StackStorm community – specifically the many packs now appearing here – and you can see, for example, hooks to security systems emerging. See where I’m going with that? There is a huge push now by some incredibly harassed companies and agencies to automate much more of their remediations – this time in response to security events such as intrusion attempts.

Holy cow! I want StackStorm for my org!

We’re glad to hear that. Take a look at StackStorm and maybe ask some questions in the StackStorm community. Use our free trial of the Enterprise Edition or our all-in-one installer to get started.

We do support StackStorm – including the Enterprise Edition features – 24×7 in some mission-critical environments. And we are also keen to get more and more users sharing their remediations as code. So jump in, the water’s perfect.

Can we be friends? Not like Facebook friends – real friends?

Sure can! Check out our Slack community, where Stormers and the larger StackStorm community hang out and help each other! Feel free to register. You can also take a look at interesting contributions like packs, bug fixes and even new features! Perhaps you could even be motivated to open a pull request! We are also on IRC, or you can email us at support@stackstorm.com. Feel free to subscribe to our newsletter to get interesting updates on StackStorm!


The post Auto-remediating bad hosts in Cassandra cluster with StackStorm appeared first on StackStorm.

StackStorm 1.0 Enterprise Edition launched: w/ Netflix as user


September 23, 2015

by Evan Powell

Today we announce StackStorm 1.0 – and release our Enterprise Edition 1.0 release candidate.

Maybe more noteworthy, Netflix is announcing – at the Cassandra Summit which they are helping to keynote as one of the world’s largest Cassandra users – that they use StackStorm to auto-remediate their Cassandra environments.  

Netflix, StackStorm, Cassandra

It has been more than two years since we got StackStorm going.  And last November – we open-sourced StackStorm.

Rereading the announcement of that open sourcing, you can see that much is still the same.  Perhaps my favorite line – one that I think now many thousands of StackStorm users have embraced – is that:

“StackStorm ties together your existing infrastructure and application environment so you can more easily automate that environment — with a particular focus on taking actions in response to events.”  

These days we call that approach Event-Driven Automation and Remediation.  An entire community, with its own meetup, has sprung up around this approach, with speakers from LinkedIn and Facebook and WebEx and of course StackStorm and Lithium and many others as well, talking about how they remediate common situations and use such systems to tie together today’s incredibly complex, large and ever-changing environments.  Maybe more importantly, these environments convey competitive advantage in a world in which tech billionaires proclaim that “software is eating everything.”  And maybe it is.

Today we are excited to share what we have added to StackStorm over the last year – and in particular the usage of StackStorm by some of the world’s best operators of information technology including – again – Netflix.  

I’m particularly excited to share a utility we have developed that we call Flow.  Flow integrates with StackStorm and enables Enterprise Edition users to visually design and control their entire automation. We know, first hand, how older automation solutions were built, and they had GUIs that, thus far, no DevOps automation solution could touch.  We think Flow blows right past them with the first interface to treat infrastructure as code – updating your automation as code, in real time, as you drag and drop.

Take a look here:

st2flow_take1

And learn more about Flow here.

Other additions to StackStorm that together comprise the Enterprise Edition are detailed farther down the product page.

For example – role based access controls, 24 x 7 (or 9 x 5) support, and more.

Want to learn more?  

Take a look at our site to see many testimonials and best practices and more about how StackStorm is built.  

And if you are interested in Cassandra auto-remediation – and want to see, in action, a version of what Netflix is using to keep their environment running with a minimum of risky and annoying human firefighting – please take a look at the blog we are posting today that explains our approach and (you guessed it!) shares this operational pattern as code.

There is so much that has been added in the last year that you’ll just need to dive in yourself to discover it.  Don’t worry, we have a Top 10 Additions blog to help.  Somehow making all Salt, Ansible, Chef and Puppet actions available via StackStorm – and hence via Flow – only made #9 on my top 10 list.  I might put it much higher were I to do it again as we hear again and again how important these integrations are to users.  

Grab StackStorm now – a free trial of the Enterprise Edition – officially our 1.0 release candidate (RC) – is available via a few clicks from here on our home page.  Or – begin automating now with our Community Edition.  It continues to progress quickly with the help of community members like Netflix and many others.  With every contribution – many of which are new integrations such as to other bots or monitoring systems or firewalls or other security appliances – StackStorm gets smarter and more valuable.

Join us.  Let’s make auto-remediation a reality.  Let’s tie our environments together in a much easier to manage way – let’s make infrastructure as code fully realized by sharing operational patterns as code.  

Here we go!

The post StackStorm 1.0 Enterprise Edition launched: w/ Netflix as user appeared first on StackStorm.


OctopusDeploy Integration with StackStorm


October 1, 2015
Guest post by Anthony Shaw, Head of Innovation, ITaaS at Dimension Data

This blog post will take you through the integration pack for OctopusDeploy and give you some example actions and rules to integrate with other packs.

What is OctopusDeploy?

Octopus Deploy is an automated deployment tool for .NET and Windows environments. It has gained significant popularity in the .NET development community for its ease of use and integration into the Microsoft development ecosystem. OctopusDeploy enables users to automate deployment of applications, packages and tools to Windows environments.

Why integrate OctopusDeploy into StackStorm?

Octopus Deploy provides a rich system for Windows application deployments, but deployment is typically just one part of a wider DevOps process. Unlike StackStorm, it does not support closed-loop monitoring, remediation, or infrastructure configuration and building, and it does not integrate with configuration management tools (nor claim to). If you want to drive OctopusDeploy from another tool as part of a DevOps toolchain, you could write custom integrations from each tool to the Octopus API – or you could simply use StackStorm as the go-between to join your systems together. Imagine the possible integration scenarios:

  • Configuring the Octopus Deploy agent as part of a new environment creation
  • Creating a new release when a git commit is detected
  • Calling a 3rd party system when a release or deployment is created in Octopus

The Octopus Deploy pack for StackStorm enables you to drive those scenarios with no additional development. StackStorm packs consist of two components:

  • Actions – Tasks that can be called from a trigger, e.g. “Create Release”
  • Sensors – Processes that run and detect events as triggers, e.g. “New Release”

Supported Actions

The following actions are supported:

  • Create a new release – create_release
  • Deploy a release to an environment – deploy_release
  • Get a list of releases for a project – get_releases
  • Add a new machine to an environment(s) – add_machine

Sensors

Actions or workflows can be initiated automatically from these sensors:

  • Detect a new release being created – new_release_sensor
  • Detect a new deployment being created – new_deployment_sensor

Installing the OctopusDeploy integration pack

From the StackStorm console, use the packs.install task to download and install the pack:

st2 run packs.install packs=octopusdeploy repo_url=https://github.com/StackStorm/st2contrib.git

StackStorm packs by convention each have a configuration file for server-wide properties, so the Octopus pack needs to be configured first: update /opt/stackstorm/packs/octopusdeploy/config.yaml to set up the connection to Octopus. You will need to issue an Octopus Deploy API key to integrate the pack; the documentation for this is available on the Octopus website. Within config.yaml, populate the example properties for your test instance:

  • api_key – an API key generated in Octopus for your user
  • host – the hostname of your Octopus server e.g. octopus.mydomain.com
  • port – the port your API service is running on, 443 by default

Now, restart the StackStorm services to reload the configuration:

    st2ctl restart
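If the pack reports authentication failures, you can also sanity-check the key directly against the Octopus REST API, which authenticates via the X-Octopus-ApiKey request header. A minimal stdlib sketch (hostname and key below are placeholders):

```python
from urllib.request import Request

def octopus_request(host, port, api_key, path="/api"):
    """Build an authenticated request for the Octopus REST API."""
    req = Request("https://%s:%d%s" % (host, port, path))
    # Octopus uses an API-key header rather than basic auth
    req.add_header("X-Octopus-ApiKey", api_key)
    return req

req = octopus_request("octopus.mydomain.com", 443, "API-XXXXXXXXXXXX")
# urllib.request.urlopen(req) would return the API root document if the key is valid
print(req.full_url)  # -> https://octopus.mydomain.com:443/api
```

A valid key returns the API root document; a 401 response means the key or host configuration needs another look.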

Now you can test in the UI to get releases and versions. Select the actions pane and choose get_releases. Enter the project ID of one of your existing projects and execute the action.

Screenshot of Octopus Integration

A real life example

Got it? OK, let’s look at a more concrete example. Let’s create a basic rule that detects when a new release or deployment is raised and posts a message in Slack with some details. I’m going to assume you already have a Slack account; if you haven’t already installed the Slack pack, you can do so now:

st2 run packs.install packs=slack repo_url=https://github.com/StackStorm/st2contrib.git

Then update /opt/stackstorm/packs/slack/config.yaml with an authentication token from Slack using the instructions here.

From the console, create a new rule definition using this format:

{
   "name": "octopus_releases",
   "tags": [],
   "enabled": true,
   "trigger": {
       "type": "octopusdeploy.new_release",
       "parameters": {},
       "pack": "octopusdeploy"
    },
   "criteria": {},
   "action": {
       "ref": "slack.post_message",
       "parameters": {
           "username": "anthonypjshaw",
           "message": "{{trigger.author}} created a release {{trigger.version}} in octopus project {{trigger.project_id}} with notes {{trigger.release_notes}}",
           "channel": "#releases"
       } 
   }, 
   "pack": "octopusdeploy"
}
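Under the hood, StackStorm renders the {{trigger.*}} parameters with Jinja templates against the trigger payload before invoking the action. A rough stdlib-only approximation of that substitution (real Jinja is far more capable, and the payload below is made up for illustration):

```python
import re

def render(template, context):
    """Replace each {{dotted.path}} with the value found in a nested dict."""
    def lookup(match):
        value = context
        for part in match.group(1).strip().split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", lookup, template)

payload = {"trigger": {"author": "anthonypjshaw", "version": "1.0.3",
                       "project_id": "Projects-1"}}
print(render("{{trigger.author}} created a release {{trigger.version}} "
             "in octopus project {{trigger.project_id}}", payload))
# -> anthonypjshaw created a release 1.0.3 in octopus project Projects-1
```

This is why the message parameter in the rule above can reference fields like trigger.author and trigger.version: the sensor puts them in the trigger payload, and the rule engine fills them in at dispatch time.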

Import the rule into StackStorm by using the rule create command.

st2 rule create release_rule.json

Back in the UI you will see your new rule:

Completed Rule

When someone has created a new release, you will see the rule action in the History pane:

Run history

To detect new deployments, create a new rule:

{
    "name": "octopus_deployments",
    "tags": [],
    "enabled": true,
    "trigger": {
         "type": "octopusdeploy.new_deployment",
         "parameters": {},
         "pack": "octopusdeploy"
    },
    "criteria": {},
    "action": {
          "ref": "slack.post_message",
          "parameters": {
                  "username": "anthonypjshaw",
                  "message": "{{trigger.author}} created a deployment {{trigger.version}} in octopus project {{trigger.project_id}}",
                  "channel": "#releases"
          }
    },
    "pack": "octopusdeploy"
}

And then in Slack you should see a live feed of new releases and deployments in the #releases channel.

slack view

What’s next

Now that you’re familiar with the pack and the potential for other integration scenarios, check out some of the other actions you could run in packs like:

  • Ansible, Salt, Chef, Puppet – Combining your Octopus workflows with other devops tools
  • Git, GitHub, Jenkins – Triggering new releases or deployments from build or source control events
  • AWS, Azure, LibCloud – Deploying Octopus tentacles as part of your infrastructure standup
  • Twitter – Want to automatically announce new releases of your product to the world?!

The post OctopusDeploy Integration with StackStorm appeared first on StackStorm.

Tutorial of the Week: Cassandra Auto-Remediation


October 2, 2015
by Evan Powell

Let’s get right to it.

This week we feature a tutorial that we published last week.  It has to do with auto-remediating your environment.  The tutorial focuses on using StackStorm to auto-remediate Cassandra; it was published at the Cassandra Summit, after all – building on Netflix’s use of StackStorm for that use case.

remediation cassandra

However – and here is the kicker – keep in mind StackStorm is basically a giant Lego set.  The pattern outlined in the tutorial works.  Not Cassandra?  OK, have a different source of events and a different target for your actions.  MySQL?  MongoDB?  Heck, even Oracle?  It can take you just a matter of minutes to adapt this tutorial to your needs.

And if you do contribute back some remediation patterns – your name can appear in lights as a guest blogger.  Ping us on our community channel on Slack and we can help you adapt the pattern this blog highlights to your particular circumstances.  

Here’s that tutorial:  https://stackstorm.com/2015/09/22/auto-remediating-bad-hosts-in-cassandra-cluster-with-stackstorm/

And the GIF from the tutorial.  Have at it!

The post Tutorial of the Week: Cassandra Auto-Remediation appeared first on StackStorm.

Auto-remediation by example: handling out-of-disk-space.


October 5, 2015
by Dmitri Zimine, Patrick Hoolboom

A host is running out of disk space. What follows is a routine pager panic and a rush to clean things up, at best. At worst, downtime. It is silly, but it happens much more often than most of us care to admit.

This, and many other annoying events like it, can, and shall, be auto-remediated. The “classic” pattern of wiring monitoring to paging is simply not good enough – and you know it when you’re paged at 3am to clean the disk on a production server.

And to those of you who hard-wire your remediation scripts into Nagios/Sensu event handlers, Splunk alert scripts and New Relic webhooks: it is plain wrong; there’s a better way.

In this blog, we show how the StackStorm auto-remediation platform helps you handle the out-of-disk case, with a step-by-step walk-through and a working automation sample to kick-start your auto-remediation.

If you’re in a hurry, grab the automation code in st2-demos and run it on your StackStorm instance:

st2 run packs.install packs=st2-demos repo_url=StackStorm/st2incubator

For the rest of us, let’s walk through the three steps of setting up auto-remediation with StackStorm. First, configure the integrations with your monitoring and paging systems. Second, define your auto-remediation workflow – your “runbook as code”. Third, create a rule mapping the event to the auto-remediation.

1. Set up the integrations.

Install the sensu and victorops packs from st2contrib:

st2 run packs.install packs=sensu,victorops

Edit sensu/config.yaml and victorops/config.yaml under /opt/stackstorm/packs to point to your Sensu and VictorOps instances. Follow the detailed instructions in the sensu pack to send Sensu events to StackStorm. If you are on Nagios, New Relic, Splunk, or other monitoring, pick the integration for your tool. You’ll find many on st2contrib or st2incubator. New integrations are easy to build; we welcome and support your contributions. Likewise, if you happen to use PagerDuty, grab and configure the pagerduty pack.

In this example, ChatOps with Slack is used to post updates and fire commands. If you’re on HipChat or IRC or some other chat, or still prefer email, SMS or JIRA for notifications, adjust accordingly.

2. Design your auto-remediation action.

This one is yours to define. Our example here is: “if the disk is filled up with log files, just prune them; if it’s something else, wake me up”. Read the workflow code in diskspace_remediation.yaml; it’s self-explanatory. Note that action results are passed down the flow.

---
version: '2.0'
name: st2-demos.diskspace_remediation

workflows:
  main:
    input:
      - hostname
      - directory
      - file_extension
      - threshold
      - event_id
      - check_name
      - alert_message
      - raw_payload
    tasks:
      silence_check:
        # [215, 26]
        action: sensu.silence
        input:
          client: <% $.hostname %>
          check: <% $.check_name %>
        on-success:
          - check_dir_size
        on-error:
          - victorops_escalation
      check_dir_size:
        # [285, 128]
        action: st2-demos.check_dir_size
        input:
          hosts: <% $.hostname %>
          directory: <% $.directory %>
          threshold: <% $.threshold %>
        on-error:
          - remove_files
        on-success:
          - victorops_escalation
      remove_files:
        # [355, 230]
        action: core.remote_sudo
        input:
          hosts: <% $.hostname %>
          cmd: "rm -Rfv <% $.directory %>/*<% $.file_extension %>"
        on-error:
          - victorops_escalation
        on-success:
          - validate_dir_size
      victorops_escalation:
        # [105, 434]
        action: victorops.open_incident
        input:
          severity: "critical"
          entity: "<% $.hostname %>"
          message: "DemoBot could not autoremediate disk space event on <% $.hostname %>. Alert: <% $.alert_message %>"
      validate_dir_size:
        # [425, 332]
        action: st2-demos.check_dir_size
        input:
          hosts: <% $.hostname %>
          directory: <% $.directory %>
          threshold: <% $.threshold %>
        on-success:
          - post_success_to_slack
        on-error:
          - victorops_escalation
      post_success_to_slack:
        # [435, 434]
        action: slack.post_message
        input:
          channel: "#demos"
          message: "DemoBot has pruned <% $.directory %> on <% $.hostname %> due to a monitoring event.  ID: <% $.event_id %>\nhttp://st2demo002:8080/#/history/<% $.__env.st2_execution_id %>/general"

Define your workflow your way. Your environment, tools and runbooks are special. You may want to move files to S3 instead of deleting them. Or provision and attach an extra volume. Or check a few other suspects before paging. And your logic will differ depending on the server’s role. Suit yourself: mix and match your scripts with the existing building blocks into the workflow that works for your case.
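Judging by the workflow wiring above, the st2-demos.check_dir_size action exits non-zero when the directory is over the threshold, which is why on-error leads to remove_files while on-success escalates. A minimal Python sketch of that semantic (ours, not the pack’s actual implementation):

```python
import os
import tempfile

def dir_size_mb(path):
    """Total size in MB of the regular files directly under path."""
    total = sum(
        os.path.getsize(os.path.join(path, name))
        for name in os.listdir(path)
        if os.path.isfile(os.path.join(path, name))
    )
    return total / (1024.0 * 1024.0)

def check_dir_size(path, threshold_mb):
    """Return a non-zero 'exit code' when the directory exceeds the threshold,
    mirroring how the workflow treats an over-threshold directory as on-error."""
    return 1 if dir_size_mb(path) > threshold_mb else 0

# Self-contained demo against a throwaway directory
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "app.log"), "wb") as f:
    f.write(b"\0" * (2 * 1024 * 1024))  # 2 MB of "logs"
print(check_dir_size(demo, 1))   # over a 1 MB threshold -> 1
print(check_dir_size(demo, 10))  # under a 10 MB threshold -> 0
```

Running the same check again after remove_files is exactly what the validate_dir_size task does: success there means the cleanup actually brought the directory back under the threshold.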

If you are on StackStorm Enterprise, the workflow graphical editor, Flow, will help you create and visualize the workflows. Here is how our sample diskspace remediation looks in Flow:

flow_diskspace_remediation

Editing diskspace_remediation.yaml in Flow

3. Create a rule.

Create a rule: if the monitoring event fires, run the remediation action. The rule definition code is shown below. Note that the trigger payload is parsed, used in the criteria, and passed as action input.

---
# rules/diskspace_remediation.yaml
    name: "diskspace_remediation"
    pack: "st2-demos"
    description: "Clean up disk space on critical monitoring event."
    enabled: true
    trigger:
        type: "sensu.event_handler"
    criteria:
        trigger.check.status:
            pattern: 2
            type: "equals"
        trigger.check.name:
            pattern: "demo_diskspace"
            type: "equals"
    action:
        ref: "st2-demos.diskspace_remediation"
        parameters:
            hostname: "{{trigger.client.name}}"
            directory: "{{system.logs_dir}}"
            threshold: "{{system.logs_dir_threshold}}"
            event_id: "{{trigger.id}}"
            check_name: "{{trigger.check.name}}"
            alert_message: "{{trigger.check.output}}"
            raw_payload: "{{trigger}}"

If you like StackStorm’s slick UI, you can use it to create the rule. Or use the CLI:

st2 rule create rules/diskspace_remediation.yaml

Profit!

Create a large file (an action in our sample pack does this for you), and see how StackStorm fires the actions. If you create the large file somewhere else, check that you get an incident in VictorOps. Now that you know it works, enjoy: these irritating “out-of-disk-space” problems will be auto-fixed before the page would even reach you.

And one more key thing: manage your automation as code. Create an automation pack – just as we did with st2-demos – commit it to git, review it, and deploy it with packs.install. Or share it on GitHub and exchange auto-remediation patterns and runbooks with your fellow devops – as code!

Hope this gives you a good start on the path to auto-remediation. At first, setting up StackStorm just for the sake of one simple integration may seem overkill. But once you have it all set up, adding automations is easy, almost addictive. Soon you’ll enjoy the compound value of a rich action library, an automation control plane under source control, and auto-remediations that keep your pager from going off at night.

And did you drink the ChatOps Kool-Aid? Check out what it can do for your operations, like these guys did, and stay tuned for our next blogs!

The post Auto-remediation by example: handling out-of-disk-space. appeared first on StackStorm.

Hello World – StackStorm is GA (1.1 shipping)


October 28, 2015
by Evan Powell

Time flies.

Over two years ago we got StackStorm going.  And today we announce the general availability of StackStorm, both the Enterprise Edition and the Community Edition.  

We have made StackStorm generally available because it is now ready, having proven itself at Netflix, WebEx and with thousands of other users.  Maybe more importantly, we are announcing general availability because we are ready, with commercial license subscriptions, 24×7 support, and more.  

We’ve learned a lot over the last couple of years, thanks to countless conversations with automators and operators and thanks to discussions amongst what I strongly believe is the best core technical team in the overall DevOps market.  All that learning shows up in StackStorm – a solution that is different from earlier automation in a number of ways:

  • Event-driven automation:  Let’s start with the fundamentals.  StackStorm is built from the ground up to wire together heterogeneous environments and to then allow you to take actions based on what is occurring.  Do that between two systems – with middling reliability – and, well, meh.  The chewing-gum scripts between your monitoring and your configuration management work well enough.  But tie together many systems reliably so that you can, for example, serve the world streaming video (thanks Netflix!) – that’s hard to do.  StackStorm has helped create the event-driven automation category – learning from the likes of Facebook, LinkedIn and others.
  • Rapid time to value:  We did not want to fall into the trap of the old approaches to autonomic computing, including runbook automation, that could only deliver closed loop computing after months and months of bespoke integrations and coding.  We put a lot of work into making the authoring of integrations – and of course automations – simple.  And for Enterprise Edition users, that means making it as easy as drag and drop via Flow. Also, there are lots of integrations included with over 1500 total sensors and actions available in the StackStorm community.  Actually there are even more as you can snap in your Ansible, Chef, Puppet or Salt, and start leveraging all the actions you’ve got there.

  • The reliability and scalability that a control plane requires:  We have a different background than many DevOps tools.  Yes, we have uber sysadmins on our team – and they are invaluable.  And we have folks that have helped build services like AWS.  But we add to that folks that have actually built and shipped products used by thousands of enterprises – including Dmitri, who led engineering at one of the leaders of the old runbook automation space and who ran a big chunk of vSphere engineering as well.  We built StackStorm with a team and an architecture that enable it to scale horizontally and that leverage what is now widely recognized as the most powerful and reliable open source workflow engine available.  Cool, fast utilities that are hard to scale and nearly impossible to run reliably are great in support roles, where building for scale could even be overkill.  For the maestro – for event-driven automation – you need the kind of experienced team we’ve built, dedicating years of effort to get to where we are now: reliably running environments like parts of WebEx and Netflix.
  • Workflow is the transmission:  We anticipated a trend that is now widely acknowledged (again) – workflow is a useful component of solutions that tie together operational environments.  Over two years ago, Dmitri, my co-founder, and Renat Akhmerov, a senior engineer at Mirantis, met and kicked off a collaboration that is yet another example of the power of open source based development.  Mistral – the workflow engine they and fellow contributors designed and built – is now upstream as a core OpenStack project and is thriving with contributors from Alcatel, Huawei and many others.  StackStorm builds upon the highly reliable and flexible Mistral, making it easier to use thanks to the rest of StackStorm.
  • IFTTT for Ops:  At some point in the last couple of years, a user called us “If this then that for Ops” – and the tagline stuck.  We even tweaked the open source GUI so the rules engine literally reads IF and THEN.  With StackStorm you use rules to interpret events your sensors have noticed; for example, you see that an application is throwing errors, and your rules see that and fire off a troubleshooting workflow to pinpoint why that may be, while updating the humans; based on the results of that workflow you may decide to run another that fixes or remediates the issue.
  • ChatOps:   We embrace ChatOps because, for humans to accept powerful automation, that automation must be transparent.  Today we believe we are the only product that truly productizes ChatOps.  In fact, if you want ChatOps, grab StackStorm and you’ll get it, plus all the power of the underlying StackStorm platform.
  • Open Source: Enterprises and other operators are simply tired of being locked into control planes from either proprietary vendors or from the last hipster engineer who built it using some cool stuff.
  • Power plus ease of use: Our first users, like WebEx, grabbed StackStorm for its ability to control and leverage their existing automation well before we ever developed and then open sourced our GUI. Since that time we have improved ease of use both via the GUI and via the Flow automation authoring utility mentioned above. If you have not checked out Flow – you really should; here is a GIF of this utility, which allows you to visually compose workflows while keeping infrastructure as code.

remediation cassandra

  • And there is more, much more, including features that are hard to appreciate until you start using StackStorm.  For example, with StackStorm the result of every action can be an event that itself can easily be used to trigger another action. And those actions can be workflows.  So the Lego analogy of building ever more powerful automations over time, by snapping them together, actually holds.
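The Lego point above – every action result is itself an event that can trigger further rules – can be sketched in a few lines of plain Python. This is a toy model to show the shape of the IF/THEN loop, not StackStorm’s actual engine; all event types and action names are made up:

```python
# Toy "IFTTT for Ops" loop: rules match events (IF) and fire actions (THEN);
# each action's result is re-emitted as a new event, so automations snap together.
from collections import deque

class RuleEngine:
    def __init__(self):
        self.rules = []          # list of (criteria, action) pairs
        self.queue = deque()     # pending events
        self.log = []            # names of actions that fired

    def add_rule(self, criteria, action):
        """IF criteria(event) THEN run action(event)."""
        self.rules.append((criteria, action))

    def emit(self, event):
        self.queue.append(event)

    def run(self):
        while self.queue:
            event = self.queue.popleft()
            for criteria, action in self.rules:
                if criteria(event):
                    result = action(event)
                    self.log.append(result["action"])
                    # The action result is itself an event and may trigger
                    # further rules (e.g. remediation after troubleshooting).
                    self.emit(result)

engine = RuleEngine()
# IF an app throws errors THEN run a troubleshooting workflow.
engine.add_rule(
    lambda e: e.get("type") == "app_error",
    lambda e: {"action": "troubleshoot", "type": "diagnosis", "cause": "disk_full"},
)
# IF troubleshooting pinpoints a full disk THEN remediate it.
engine.add_rule(
    lambda e: e.get("type") == "diagnosis" and e.get("cause") == "disk_full",
    lambda e: {"action": "purge_logs", "type": "remediated"},
)
engine.emit({"type": "app_error", "app": "billing"})
engine.run()
print(engine.log)  # ['troubleshoot', 'purge_logs']
```

The troubleshooting action’s result re-enters the queue and triggers the remediation rule – the same chaining the bullet above describes.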

Well, those are at least some of the reasons that StackStorm has emerged as a leader of event-driven automation.  Take it out for a drive – with StackStorm 1.1 we are releasing an improved all in one installer that includes a GUI if you are so inclined and that will install either the Community Edition or the Enterprise Edition.     

Just go to http://docs.stackstorm.com/install/all_in_one.html. Please note that to grab Flow – and other Enterprise Edition capabilities – you’ll need to register on our front page; you’ll get an automated email with a link to this installer and, importantly, a license key to unlock those capabilities.

Lastly – what’s next? We will be accelerating our development now, and next year we will be releasing a piece of software that we believe changes the game for operations – again. Maybe more important than that big bang will be the day to day work to make every user successful. Find us in our community, ask us questions, and help us incrementally improve StackStorm.

Finally – I’m now talking to dozens of users and would-be partners or competitors, all of whom are struggling with build vs. partner decisions. You have a tough decision to make: do you struggle on with your existing solution, or do you “pull a Netflix” and bet on StackStorm? As they put it in their talk at the Cassandra Summit, with StackStorm you get both a vendor committed to your success and a community adding capabilities to StackStorm and removing the specter of vendor lock-in.

We hope you’ll join us. Together we are making the sort of transformative efficiencies and agility delivered by event driven automation like Facebook’s FBAR available for all of us.

PS – a shout out to our friends at PagerDuty and VictorOps – how do you make getting paged at 2am suck less?  You make sure you don’t get paged in the first place!

The post Hello World – StackStorm is GA (1.1 shipping) appeared first on StackStorm.


StackStorm v1 is out!


November 02, 2015
by Dmitri Zimine

stackstorm-v1-rules

A new release of StackStorm is out…. and (…drums…) it is version 1.1!

Yes, this is a major release. The product has really come together, so we decided to name it “version 1”. In his recent Hello World blog, Evan Powell shared the learnings over two years that became the foundation of StackStorm and made it a distinct product. Here I will go over specific feature highlights of version 1, touch on the migration path from earlier versions, and point toward StackStorm’s future directions.

Highlights

Version 1 comes in two editions – Community and Enterprise. They share a common codebase; “Community” is full-featured, production-ready, Apache 2.0, and free forever. “Enterprise” brings commercial support and additional tools to improve productivity at scale.

StackStorm v1 introduces a few new exciting features as well as accumulated improvements based on your feedback and extensive field usage. For the complete list, see Changelog.

  • Installation and deployment: The new All-in-One installer brings a secure, reliable, best-practice reference deployment on a single box. Interactive graphical setup (or answer file for unattended installation) to configure users, SSL certificates, wire up a chat system for ChatOps and so on. If an Enterprise license is supplied, the installer deploys Enterprise additions, too.

    Behind the All-In-One installer, there are puppet modules to use directly for your custom deployments, and st2workroom to build StackStorm into a variety of form-factors. In addition to Ubuntu 14, we introduce support for RHEL 6 and 7.

  • Flow v1 – an innovative visual workflow designer. Flow is unique: unlike every other workflow designer, Flow doesn’t hide the code: it highlights it as part of our support for infrastructure as code. It helps you navigate, understand, and learn the workflow definition YAML with an appealing visual representation, and makes you more productive building workflow structure with drag-and-drop WYSIWYG functionality. It is worth giving a try on your side; it is also worth a dedicated blog from us (coming up).

  • Security: RBAC and StackStorm-supported LDAP integration are essential enterprise features. There is more: pluggable auth backends with solid PAM, Keystone and other auth providers; ‘Secrets’ in metadata, which prevent flashing parameters in logs and API; and API keys, especially handy for webhooks.

  • ChatOps: with StackStorm, ChatOps is turn-key, much improved and maturing based on massive feedback from the community. Even those still cautious about closed-loop automation and auto-remediation find it very appealing to be able to take their existing scripts, plus StackStorm community actions including Ansible, Salt, Chef or Puppet integrations, and turn them at will into bot-friendly ChatOps commands with a few lines of metadata. These ChatOps users then get the workflows, APIs, execution history and everything else of StackStorm as a bonus – we see them growing over time into more powerful users.
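As a rough illustration of those “few lines of metadata”, a ChatOps action alias might look something like this – the pack, action name and format string here are made-up examples, not shipped content:

```yaml
---
name: "restart_service"
action_ref: "my_pack.restart_service"   # hypothetical existing action/script
description: "Restart a service from chat"
formats:
  - "restart {{ service }} on {{ host }}"
```

With an alias like that in place, asking the bot to “restart nginx on web01” invokes the underlying action with those parameters – and the execution shows up in StackStorm’s history and APIs like any other run.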

Migrating to v1

The recommended way to migrate to v1 is to provision a new StackStorm instance with the All-In-One installer, and to roll over the content. Copy your content from /opt/stackstorm/packs to the new v1 server. If you’re doing it right, your content should already be under source control. Adjust content according to the upgrade notes. Test, and make sure everything works. To keep your previous history for audit purposes, save the /var/log/st2/*.audit.* files.
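The roll-over can be sketched in shell. This sketch uses local stand-in directories so it is self-contained; on a real migration the old packs directory lives on the old server and the copy happens over SSH (e.g. `rsync -a old-host:/opt/stackstorm/packs/ /opt/stackstorm/packs/`). The pack name and paths are illustrative:

```shell
# Stand-ins for the old and new /opt/stackstorm/packs directories.
OLD=$(mktemp -d)/packs
NEW=$(mktemp -d)/packs
mkdir -p "$OLD/my_pack/actions" "$NEW"
echo "name: my_pack" > "$OLD/my_pack/pack.yaml"

# Copy custom packs over; core packs ship with the v1 install itself.
cp -R "$OLD/." "$NEW/"

# Keep previous audit history too (path illustrative):
# cp /var/log/st2/*.audit.* /some/backup/location/

test -f "$NEW/my_pack/pack.yaml" && echo "content rolled over"
```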

The old scripted installer, aka st2_deploy.sh, is still supported, and it will likely get you upgraded. However, we strongly encourage you to switch to the All-In-One installer. Or, for serious production, puppet, chef, or ansible yourself a custom deployment of v1.

What’s next:

StackStorm continues to rapidly evolve. Our next focus is around productizing techniques to run at scale, refining content management, improving the debuggability of sensors and triggers, and completing the mission of making StackStorm easy to deploy. We are thinking of introducing StackStorm Forge to bring together even more of the hundreds of integration packs spread all over GitHub. We want to help the community share and exchange operational patterns as code blueprints. The detailed roadmap is here, and your input is welcome.

Get v1, install, use, enjoy. Take the Flow for a ride. Give us feedback. And stay engaged on stackstorm-community.slack.com (if you’re not there yet, join), or IRC #stackstorm on freenode.org.

The post StackStorm v1 is out! appeared first on StackStorm.

AHH #07 – Unpacking the Gloriousness of StackStorm 1.1 & StackStorm Enterprise

StackStorm v1.1.1 has been released


November 16, 2015
by Tomaz Muraus

Slightly more than 2 weeks after the StackStorm v1.1.0 release we are happy to announce that we have just released StackStorm v1.1.1.

As you can guess from the version identifier (since the v1.1.0 release we follow semantic versioning), this is a minor release, which means there are no breaking or backward-incompatible changes; the release mostly includes smaller improvements and bug fixes.

StackStorm v1.1.0 recap

Before we dive into v1.1.1, here is a quick recap of new features which were released in StackStorm v1.1.0.

New (graphical) installer

StackStorm v1.1 introduced a new graphical installer which allows you to easily install and configure StackStorm on a single server.

st2installer_step_1

First page of the installer, where the hostname, SSL certificate and enterprise license key are configured.

In addition to the graphical mode, the installer also allows the user to provide an “answers” file (a YAML file with configuration options) and run in unattended mode.

The goal of the installer is to reduce the barrier to entry and make it easier and faster to get up and running with StackStorm.

Enterprise Edition with RBAC, Flow, LDAP authentication backend and more

In addition to many other new features and improvements, StackStorm v1.1 was also the first release which brings our Enterprise Edition.

The StackStorm v1.1 Enterprise Edition builds on top of the Community Edition, which is fully free and open source, and adds some additional features which come in handy especially in large enterprise environments.

Flow

Flow is a one of a kind graphical workflow editor which fully embraces and integrates with the infrastructure as code approach.

remediation-cassandra

Flow with an opened disk auto remediation workflow.

The goal of Flow is to make it easier for users to build, visualize and share workflows. This is especially handy for complex workflows with many tasks and transitions.

Flow also differentiates itself from legacy workflow editors by running in the browser (no need to install resource hungry Java applications) and by being built on open-source technologies such as d3 and react.

In addition to that, Flow fully embraces an infrastructure as code approach – all the changes you make in the graphical editor are immediately visible in the right pane which contains “source” code (easy to read YAML) for that workflow. Embracing infrastructure as code means that workflows are the same as any other source code or configuration files – you can version control them, review them, etc.

The right pane with source code is also editable which means you can quickly switch between “drag and drop” and text based editing.

Role Based Access Control (RBAC)

The Enterprise Edition also comes with Role Based Access Control (RBAC) which allows you to restrict user access to particular operations.

Selection_299

The error displayed when a user doesn’t have permission to run (execute) an action (in this case the “core.local” action).

RBAC is an essential feature for large teams and organizations where you have many people working on different projects. RBAC allows you to organize permissions into roles and assign those roles to the StackStorm users.

My favorite example of this is a user who has a powerful automation called “bootstrap datacenter” – for obvious reasons, they’d rather not have everyone who has access to StackStorm able to run this automation.

Another example is limiting which actions StackStorm users can view and run. You can also lock parameters for a particular action (e.g. if you have a “create_vm” action, you could limit the “region” parameter to a particular set of approved regions).
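To make the “bootstrap datacenter” example concrete, a role definition might look roughly like this. The pack and action names are hypothetical; the shape follows the v1.1 RBAC role-file format (a role grants permission types on resources by their UID):

```yaml
---
name: "dc_admin"
description: "Only this role may bootstrap a datacenter"
permission_grants:
  - resource_uid: "action:dc_pack:bootstrap_datacenter"   # hypothetical pack/action
    permission_types:
      - "action_execute"
```

A user whose roles include no grant of action_execute on that action gets a permission error instead of kicking off the automation.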

That’s a quick recap. For a deeper dive and more information about v1.1.0 and the Enterprise Edition, please check the following post – StackStorm v1 is out.

StackStorm v1.1.1

And now back to the shiny new v1.1.1.

Improved CLI experience

We have optimized the speed of the CLI; operations such as listing executions (st2 execution list) and retrieving particular execution details (st2 execution get) are now much faster. This is especially noticeable for users with a lot of executions which contain large results.

In our build server case where we have many executions with large results, running time of “st2 execution list” went from 8 seconds down to 0.5 seconds.

Another improvement we made to the st2 execution list is displaying elapsed / running time for all the executions which are currently in the “running” state.

Output of “st2 execution list -n5” command.

This makes it easier to see, at a glance, how long a particular action has been running. This is also useful for helping you identify outliers and actions which are potentially stuck and will result in a timeout.

Improved action-chain workflow validation

We’ve made some improvements to action-chain workflows so that some validation, such as checking task existence, is done immediately when the workflow starts. Previously, some of that validation happened only when the task was about to run, which meant that in workflows with many long-running tasks it could take many minutes to surface common errors such as a typo in a referenced task name.

Detecting common validation errors as early as possible is very important since it speeds up the whole “develop-test/run” feedback loop. Imagine if you need to wait for 10 minutes for tasks to finish to notice that you have made a typo in one of the referenced tasks – that’s not very pleasant and you lose motivation and context.

It’s also worth pointing out that because of the dynamic nature of the workflows (e.g. using jinja expressions in the task names, etc.) some validation can only be performed during actual run-time.
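As a sketch, consider an action-chain like the one below (task names and commands are invented). A typo in the on-success reference used to surface only after the first task finished; it is now caught when the workflow starts:

```yaml
---
chain:
  - name: "check_disk"
    ref: "core.remote"
    params:
      hosts: "{{ hostname }}"
      cmd: "df -h"
    # A typo here (e.g. "clean_logz") is now reported at workflow start,
    # not after check_disk completes.
    on-success: "clean_logs"
  - name: "clean_logs"
    ref: "core.remote"
    params:
      hosts: "{{ hostname }}"
      cmd: "find /var/log -name '*.gz' -delete"
default: "check_disk"
```

The exception is the dynamic cases noted above – a task name built from a Jinja expression can only be validated once it is rendered at run-time.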

Conclusion

That’s it for the highlights. You can find the whole list of changes here. We encourage you to go try it out and join us at our Slack community channel (or #stackstorm on freenode if you are an IRC person) where you can leave your feedback and chat with other stormers and StackStorm users.

The post StackStorm v1.1.1 has been released appeared first on StackStorm.

Happy Happy Hour Hour (2 happy hours) – stump the engineers


November 20, 2015
by Evan Powell

Tuesday, December 1st at 10am we’ll be talking event-driven automation, and specifically auto-remediation, with our friends at Netflix.

We’re really happy Sayli Karmarkar and Jean-Sebastien Jeannotte are joining in, willing to take all manner of automation, StackStorm, Netflix and specifically Cassandra (DataStax) questions. As in “why don’t they talk auto-remediation in House of Cards?” haha.

Sayli and JS are directly responsible for Cassandra (DataStax) operations at Netflix as well as building and running what they call Winston, their StackStorm based auto-remediation as a service offering at Netflix.  So come armed with Cassandra (DataStax) questions too.

To register for the Happy Hour, please go to www.stackstorm.com/register

As you’ll see, the format is a Google Hangout. Our own DoriftoShoes (aka “Patrick”) will take questions as well – you can share them via the Hangout once it starts, or via Twitter through #AskAnAutomator. Feel free to bring your #badauto scenarios, as in “a friend of mine said one time their automation pulled all their servers out of the queue on Cyber Monday.” Or, “my ChatBot keeps telling jokes that are not humorous.”

We will also take questions via the StackStorm community on Slack. If you have not already joined that community, please do so. To request an invite, go to stackstorm.com/community-signup.

And then, December 15th at 11am Pacific, “Patrick” will be joined by Anthony Shaw, innovation lead at Dimension Data ITaaS. Anthony is an active StackStorm community member helping StackStorm and Dimension Data offer joint solutions including ChatOps enabled self-management of complex clouds with much more to come.

Dimension Data is one of the largest integrators in the world.  And they are also perhaps the leading provider of hosted Microsoft software – think Office365+ functionality run in your private cloud.  Anthony’s contributions include improved StackStorm integrations with Windows PowerShell and related pieces of the ecosystem such as OctopusDeploy and Yammer.  I’m really looking forward to trying to stump Anthony; he knows StackStorm very well AND he works day to day with national governments and other Dimension Data customers on their transition to the cloud.  Fascinating perspective IMO.

In short – please register for our upcoming Happy Hours and come armed with questions about how and why to leverage StackStorm to address your automation challenges as well as any other questions you might have about the remediation and automation requirements and approaches of Netflix and DimensionData.

The post Happy Happy Hour Hour (2 happy hours) – stump the engineers appeared first on StackStorm.

Netflix: StackStorm-based Auto-Remediation – Why, How, and So What


Lessons from this week’s Event Driven Automation Meet-up

November 21, 2015
by Evan Powell

This week two excellent engineers at Netflix spoke at the Event Driven Automation meet-up which Netflix hosted.  It was great to see old friends and thought leaders from Cisco, Facebook, LinkedIn and elsewhere. This blog summarizes Netflix’s presentation.

My quick summary is that it was the best presentation I’ve seen that combines both solid reasoning about why to move towards auto-remediation as well as information about how to do so.

Before we get to all that substance, however, I should admit that my favorite moment of the evening was probably when they explained why Netflix calls auto-remediation based on StackStorm “Winston.” Remember Mr Wolf?

hk-pulp-fiction

The entire video (of the meet-up that is, not Pulp Fiction) will be posted on the Event Driven Automation meet-up soon.  I highly recommend it and I’ll try to remember to cross link here when it is posted.  As mentioned below, an upcoming Automation Happy Hour will dig into Netflix’s Winston as well.  

After introductions of our speakers Sayli and JS we got down to business

JS kicked off the discussion by talking about the great AWS reboot of last year; you may remember that Amazon discovered a Xen vulnerability and over a particular weekend they rebooted basically the entire fleet.  

While this caused considerable stress at Netflix it did not cause downtime to their Cassandra fleet thanks in large part to the existing remediation plus of course the resiliency of Cassandra.  

However, JS explained that their experience – and the massive scaling they are undertaking at Netflix – helped motivate the teams at Netflix to pay attention to what worked and what was not working so well with the existing remediation.

In short, the pre-existing remediation still far too often left the engineer on call and dealing with 2am pages.  

Before

Incidentally – we’ve seen this pattern before. While Jenkins has some workflow capabilities, one is really stretching them to use it in this way.

As they explained, the result is team burn-out and more:

Pain points

Looking at the human workflow – how alerts are handled absent an effective auto-remediation flow – makes it clear why this is.

Manual flow

As you can see – it takes at least 30 minutes to solve the problem and during that time a number of sometimes intricate manual tasks are performed under duress, at 2am.  

Netflix dwelled on a photo of a physical runbook at the event. It was a binder maybe 7-10 inches thick. Imagine trying to search through that at 2am. And yet that – or the digital equivalent – is often what occurs without automated remediation.

Their experience led them towards a handful of seemingly simple requirements:

Requirements (1)

When I first saw this slide, my gut sort of clenched as I thought: “oh noos, we have become PaaS!” It turns out that they meant more that they wanted to emulate the approach taken by Facebook, LinkedIn and elsewhere – the remediation itself should be extensible and run as a service, so that other groups could consume it.

The automation using building blocks is itself worthy of a blog.  Actually, if you scan our blog you’ll see that theme woven into a number of blogs already.  

I’m leaving out a number of slides of course just to give a summary.  

At this point the talk turned to a brief view of the technologies and then to outcomes.  

Once they started using StackStorm, they were able to change the process. Note that they call their remediation solution Winston, which is excellent both because a) of the Mr. Wolf reference mentioned above and b) because by naming their StackStorm-based remediation something other than StackStorm they recognize the work they and other users do to adapt StackStorm to their environment. Suffice it to say we perceive and appreciate the real engineering Netflix has done to help StackStorm mature and deliver value.

These days, instead of waiting for pages, they use their Winston to solve an ever increasing percentage of issues before they distract and disrupt the humans.  They used the following illustration to show the event flow these days (note that the doggie has a pager by its head – so paging is still possible :)).  

Winston after

Sayli emphasized that when pages do happen they happen with “assisted diagnosis” already performed.  So when you get paged you already have what we tend to call facilitated troubleshooting performed.  Hence the stuff you do every time a condition of type X is reported is already done and you hopefully can use your pattern matching skills to take those results and quickly identify and then fix (maybe again with Winston / StackStorm’s help) the issue.   

Being engineers they didn’t stop at that level, of course.  They went into the underlying architecture a bit.  As you can see they leverage StackStorm to pull events out of the SQS based queue via which their monitoring, called Atlas, announces events. StackStorm then matches on those events, determines what course of action to take, and then executes on that course of action. And all components scale horizontally and vertically.

Conceptual arch w StackStorm
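That event flow – monitoring publishes alerts onto a queue, a consumer matches each alert to a remediation, and alerts about the automation itself jump the line – can be caricatured in a few lines of plain Python. The alert types, workflow names and priority rule are illustrative, not Netflix’s or StackStorm’s actual code:

```python
# Toy model of the queue-based flow: drain alerts, prioritize the
# automation's own health events, map each alert to a remediation.
import queue

REMEDIATIONS = {
    "cassandra_node_down": "replace_node",
    "disk_full": "clean_disk",
}

def consume(alerts):
    """Drain the alert queue; alerts about the remediation system itself
    are handled first (a leaking dam beats one wet floor)."""
    pending, executed = [], []
    while not alerts.empty():
        pending.append(alerts.get())
    # "stackstorm"-sourced alerts sort to the front of the line.
    pending.sort(key=lambda a: a.get("source") != "stackstorm")
    for alert in pending:
        # Unmatched alerts fall through to paging the humans.
        workflow = REMEDIATIONS.get(alert["type"], "page_oncall")
        executed.append((alert["type"], workflow))
    return executed

q = queue.Queue()
q.put({"type": "disk_full", "source": "atlas"})
q.put({"type": "high_cpu", "source": "stackstorm"})
print(consume(q))
# [('high_cpu', 'page_oncall'), ('disk_full', 'clean_disk')]
```

Note how the unmatched alert still ends in a page – mirroring the point above that the doggie keeps its pager even after Winston takes over the routine cases.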

This point caused some real excitement in the audience.  There were a number of questions about “who watches the watcher” and “why remediate when you can just do it right the first time.”  

Regarding the first question, JS claimed that so far they’ve been unable to overrun StackStorm with events.  Even so, they err on the side of flagging StackStorm events as high priority (such as StackStorm starting to peg a CPU) since conceivably that event could be the sign that the dam is leaking water.  They have thought about using StackStorm itself to remediate StackStorm – a pattern we’ve seen elsewhere – however have not yet implemented it.

Regarding the question about “why not do it right the first time” they said yes, sure.  Of course, as Vinay Shah, an engineering leader said at the Meet-up: 

“if remediation was not needed then neither would be monitoring.”  

And as I’ve pointed out before, we have 158 and counting monitoring projects.  Shit happens people, deal with it!

Having said that, one benefit of auto remediation is you can start to enable developers to themselves think not just about how to test their systems (aka test driven development) but how to remediate them.  Why would they take on this perspective – well, at Netflix and many other places these days the developers have pagers.  This dynamic is a huge motivation for developers to embrace StackStorm and remediation platforms more generally.  

This summary is just the skeleton of what was to me the best overall presentation I’ve seen of the why, how and so what of auto-remediation. As is always the case in these meet-ups, the conversations in the aisle over burritos and beers were fascinating and invaluable. It was great to catch up with folks like Shane Gibson and to meet face to face some of the team at Plexxi, for example.

Hopefully this has whetted your appetite. Good news: in addition to the upcoming posting of the video, you can also join StackStorm’s Happy Hour on December 1st. Sayli and JS will join StackStorm’s Patrick Hoolboom and James Fryman to dig deeper into Winston and StackStorm. Please come armed with questions.

Register here for that Happy Hour:   www.stackstorm.com/register/

Last but not least, if you want to take a look at StackStorm, head to StackStorm.com and grab either the Community Edition or the Enterprise Edition. Both editions are based on the same code – the Enterprise Edition has some capabilities, including Flow, that especially help enterprises get value out of StackStorm. You can also see an example of Cassandra auto-remediation in a tutorial blog format – complete with a snazzy Flow gif – here.

remediation-cassandra

Finally – many thanks to JS and Sayli and Vinay and Nir and of course Christos and the rest of the team at Netflix.  Giving back to the overall community by hosting the meet-up was truly good of you :).

The post Netflix: StackStorm-based Auto-Remediation – Why, How, and So What appeared first on StackStorm.
