MrPowers

✨ Data blog: https://mungingdata.com/

Personal blog: https://neapowers.com/

Github: https://github.com/mrpowers

From New York, spend most of my time in Colombia & Brazil.

Speak Spanish fluently, learning Portuguese.

📅 Joined in 2012

🔼 1,636 Karma

✍️ 312 posts

🌀 15 latest posts


(Replying to PARENT post)

Rust is a good language for performant computing in general, but especially for data projects because there are so many great OSS data libraries like DataFusion and Arrow.

SedonaDB currently supports SQL, Python, R, and Rust APIs. We can support APIs for other languages in the future. That's another nice part about Rust: there are lots of libraries for exposing Rust projects through bindings in other languages.

👤 MrPowers 🕑 1mo 🔼 0 🗨️ 0

(Replying to PARENT post)

You can generate the dataset with the instructions in this readme: https://github.com/apache/sedona-spatialbench/tree/main

Here are the queries: https://github.com/apache/sedona-spatialbench/blob/main/prin...

They should be fairly easy to replicate!

👤 MrPowers 🕑 1mo 🔼 0 🗨️ 0

(Replying to PARENT post)

The "DuckDB is probably the most important geospatial software of the last decade" post has a nice related discussion: https://news.ycombinator.com/item?id=43881468

👤 MrPowers 🕑 1mo 🔼 0 🗨️ 0

(Replying to PARENT post)

There is a project called GeoPolars: https://github.com/geopolars/geopolars

From the README:

> Update (August 2024): GeoPolars is blocked on Polars supporting Arrow extension types, which would allow GeoPolars to persist geometry type information and coordinate reference system (CRS) metadata. It's not feasible to create a geopolars.GeoDataFrame as a subclass of a polars.DataFrame (similar to how the geopandas.GeoDataFrame is a subclass of pandas.DataFrame) because polars explicitly does not support subclassing of core data types.

👤 MrPowers 🕑 1mo 🔼 0 🗨️ 0

(Replying to PARENT post)

SedonaDB builds on libraries in the Rust ecosystem, like Apache DataFusion, to provide users with a nice geospatial DataFrame experience. It has functions like ST_Intersects that are common in spatial libraries, but not standard in most DataFrame implementations.
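
To make the ST_Intersects idea concrete, here is a minimal pure-Python sketch of what an "intersects" predicate and a spatial join do conceptually. This is not SedonaDB code and says nothing about how SedonaDB implements it: the `Box` type, the `intersects` and `spatial_join` helpers, and the sample data are all hypothetical, and real engines work on arbitrary geometries with spatial indexes rather than a nested loop over rectangles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; a hypothetical stand-in for a real geometry."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Box, b: Box) -> bool:
    """Return True when the boxes share at least one point (touching counts)."""
    return (
        a.xmin <= b.xmax and b.xmin <= a.xmax
        and a.ymin <= b.ymax and b.ymin <= a.ymax
    )


def spatial_join(left: dict[str, Box], right: dict[str, Box]) -> list[tuple[str, str]]:
    """Naive nested-loop join that keeps the name pairs whose boxes intersect."""
    return [
        (left_name, right_name)
        for left_name, left_box in left.items()
        for right_name, right_box in right.items()
        if intersects(left_box, right_box)
    ]


# Made-up sample data, purely for illustration.
parks = {"park_a": Box(0, 0, 2, 2), "park_b": Box(5, 5, 6, 6)}
zones = {"zone_1": Box(1, 1, 3, 3), "zone_2": Box(10, 10, 11, 11)}

pairs = spatial_join(parks, zones)
print(pairs)
```

Only `park_a` and `zone_1` overlap in this sample, so that is the only pair the join keeps. A plain DataFrame library has no built-in notion of this kind of geometric predicate, which is the gap spatial functions fill.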

There are other good alternatives, such as GeoPandas and DuckDB Spatial. SedonaDB has Python/SQL APIs and is very fast. New features like full raster support and compatibility with lakehouse formats are coming soon!

👤 MrPowers 🕑 1mo 🔼 0 🗨️ 0
👤 MrPowers 🕑 1mo 🔼 197 🗨️ 49
👤 MrPowers 🕑 6mo 🔼 16 🗨️ 1

(Replying to PARENT post)

IMO, it would have been better to donate the repos to a shared org and motivate the community to continue maintaining them.

But it's pretty awesome that this individual is retiring from programming / taking a sabbatical. There is nothing wrong with taking some time off and pursuing other interests when you lose your passion.

👤 MrPowers 🕑 1y 🔼 0 🗨️ 0

(Replying to PARENT post)

> A Data Lakehouse is fine but what benefit does it give you over a much more simple solution of ETL/ELTing the data in batches (weekly, daily, hourly, etc) and letting it sit in some kind of DB.

Lots of engines like Polars, PyTorch, Spark, and Ray can read structured data from databases, but Lakehouses are more efficient.

Databases aren't as good for storing unstructured data.

Databases can also be much more expensive than a Data Lakehouse.

Databases are awesome and have lots of amazing use cases of course. Like you mentioned, data lakehouses are great for high data volume and throughput, but there are other use cases as well IMO.

👤 MrPowers 🕑 1y 🔼 0 🗨️ 0
👤 MrPowers 🕑 1y 🔼 1 🗨️ 2

(Replying to PARENT post)

Lots of Spark workloads are executed with the C++ Photon engine on the Databricks platform, so, ironically, we have partially moved back to C++. Disclosure: I work for Databricks.

👤 MrPowers 🕑 1y 🔼 0 🗨️ 0

(Replying to PARENT post)

The OP is the original creator of Ballista, so he's well aware of the project.

Ballista is much less mature than Spark and needs a lot of work. It's awesome they're making Spark faster with Comet.

👤 MrPowers 🕑 1y 🔼 0 🗨️ 0
👤 MrPowers 🕑 1y 🔼 1 🗨️ 0
👤 MrPowers 🕑 1y 🔼 1 🗨️ 0

(Replying to PARENT post)

I love Medellin and lived there for many years, but the air quality is terrible and getting worse. You can talk with any locals and they say that the climate is noticeably different than it was in the past.

Medellin is surrounded by mountains, so the contaminated air cannot escape. There didn't use to be a lot of cars, but now that financing is available, the number of cars is growing significantly.

The hills are steep and old buses spew black smoke.

Here is some more info on pollution in Medellin: https://medellinguru.com/medellin-pollution/

Saying Medellin's temperature decreased by 2 degrees Celsius based on "Mejorar el microclima hasta 2°C" ("Improve the microclimate by up to 2°C") is a misinterpretation. I think this article is quite misleading.

👤 MrPowers 🕑 1y 🔼 0 🗨️ 0