Show HN: s3-lambda – Lambda functions over S3 objects: each, map, reduce, filter

  • It's weird how S3 seems to be the unwanted stepchild of AWS.

    So many obvious innovations just aren't turning up.

    For example, AWS introduced tagging for S3 resources, but strangely you can't search or filter by tag, and tags aren't even returned when you list objects; you can only retrieve a tag with a per-object request. The word "pointless" springs to mind.

    In fact it's strange that there is NO useful filtering at all apart from the (admittedly very useful) folder/hierarchy/prefix filtering. Beyond that, you can't do wildcard searches, date filters, or tag filters.

    I'm building an application right now that needs to get a list of all the .jpg files - the only way to do that is to fetch every single object key in the bucket and manually filter out the unwanted ones. Feels like it's 1988 again.
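
    Since list calls only filter server-side by prefix, the usual workaround is to page through every key and filter client-side. A minimal sketch of that pattern (the key list here is a hypothetical stand-in; a real implementation would page through the AWS SDK's listObjectsV2):

```javascript
// Hypothetical key listing standing in for an S3 list response;
// a real implementation would page through listObjectsV2 to collect these.
const keys = [
  'photos/2017/cat.jpg',
  'photos/2017/notes.txt',
  'photos/2016/dog.JPG',
  'backups/db.sql'
];

// S3 only filters server-side by prefix, so suffix matching (e.g. *.jpg)
// has to happen client-side after every key has been fetched.
const jpgs = keys.filter(key => /\.jpe?g$/i.test(key));

console.log(jpgs); // ['photos/2017/cat.jpg', 'photos/2016/dog.JPG']
```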

    It also seems like it would be valuable to have alternate interfaces to S3, such as the ability to send data via FTP, SFTP, SMTP, or whatever, but there are no such interfaces.

    Hopefully Google will goad AWS into action on S3 innovation by implementing such features.

  • Might make sense to rename this to avoid confusion with AWS Lambda (I immediately thought it was related). Otherwise, looks like an awesome library!

  • First impression: this is a brilliant piece of software design.

    The ability to compose a map/filter chain and execute it in parallel against every object in an S3 bucket that matches a specific prefix - wow.

    The set of problems that can be quickly and cheaply solved with this thing is enormous. My biggest problem with lambda functions is that they are a bit of a pain to actually write - for transforming data in S3 this looks like my ideal abstraction.
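
    The appeal is that the chain reads like an ordinary array pipeline. A rough sketch of the shape of such a chain, using a plain in-memory array in place of real S3 objects (illustrative only, not s3-lambda's actual API):

```javascript
// Stand-in for objects under a prefix; in reality each entry's body would
// be fetched from S3, potentially in parallel.
const objects = [
  { key: 'logs/a.json', body: '{"level":"error","msg":"boom"}' },
  { key: 'logs/b.json', body: '{"level":"info","msg":"ok"}' },
  { key: 'logs/c.json', body: '{"level":"error","msg":"bang"}' }
];

// Compose a map/filter chain exactly as you would over a local array.
const errorMessages = objects
  .map(obj => JSON.parse(obj.body))        // deserialize each object body
  .filter(entry => entry.level === 'error') // keep only error entries
  .map(entry => entry.msg);                 // project out the message

console.log(errorMessages); // ['boom', 'bang']
```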

  • See also AWS Athena: https://aws.amazon.com/athena/

  • So... the client-side code iterates over S3 objects matching a certain filter, and then schedules a lambda for each of those objects. Is that right? Or is the iteration procedure itself a lambda? Also, when you chain several operators together, where does the chaining happen?

    I'd like to understand where different parts of the code are being executed.

  • This is a nice project. For real-world use cases, we have good alternatives:

    1. Migrate S3 ==> GCS and use BigQuery, which does support UDFs

    2. Sign up for Databricks (I'm not affiliated)

    3. (for the brave) Poke AWS support to implement UDFs on Athena

  • If anyone is interested in this same kind of architecture for multi-cloud file-system providers (no cloud lock-in), please check out this project: https://github.com/bigcompany/hook.io-vfs

    Used in production, but it could use some contributors.

  • Getting an index of (millions of) files on S3 is very slow for us - like, days. Is there anything you do to work around this? It seems that since this isn't an AWS Lambda project, the client first has to acquire an index from S3 before the concurrency benefits kick in?
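
    Part of why listing is slow: ListObjectsV2 returns at most 1,000 keys per page, and pages must be walked sequentially via continuation tokens. A common workaround is to shard the listing across known prefixes and walk the shards in parallel. A sketch with a stubbed lister standing in for real SDK calls (the stub and shard names are hypothetical):

```javascript
// Stub that mimics paged listing: up to `pageSize` keys per call plus a
// continuation token, like S3's ListObjectsV2 (which caps pages at 1,000).
function makeLister(allKeys, pageSize) {
  return async (token = 0) => ({
    keys: allKeys.slice(token, token + pageSize),
    next: token + pageSize < allKeys.length ? token + pageSize : null
  });
}

// Sequential walk: each page must wait for the previous page's token.
async function listAll(listPage) {
  const keys = [];
  let token = 0;
  do {
    const page = await listPage(token);
    keys.push(...page.keys);
    token = page.next;
  } while (token !== null);
  return keys;
}

// Sharding by prefix lets the sequential walks run concurrently.
async function listSharded(listersByPrefix) {
  const results = await Promise.all(Object.values(listersByPrefix).map(listAll));
  return results.flat();
}

const shards = {
  'a/': makeLister(['a/1', 'a/2', 'a/3'], 2),
  'b/': makeLister(['b/1', 'b/2'], 2)
};

listSharded(shards).then(keys => console.log(keys.sort()));
// ['a/1', 'a/2', 'a/3', 'b/1', 'b/2']
```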

  • Is this susceptible to any of S3's eventual consistency constraints?

  • I think having the default be destructive for mapping is a strange design decision. That is going to bite someone one day soon.

  • Really nice to have a generic functional interface to S3. Thanks.

  • Where can you actually use it? In which cases? Can you provide examples?